linux-kernel.vger.kernel.org archive mirror
* [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
@ 2015-01-15 18:49 Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting Konstantin Khebnikov
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

This is a resurrection of my old RFC patch for a dirty-set accounting cgroup [1].
Now it is merged into the memory cgroup and has gained a bandwidth controller as a bonus.

This shows an alternative solution: less accurate, but much less monstrous than
the accurate page-based dirty-set controller from Tejun Heo.

Memory overhead: +1 pointer in struct address_space.
Performance overhead is almost zero; no new locks are added.

The idea is straightforward: link each inode to a memory cgroup when its first
dirty page appears and account all of its dirty pages to that cgroup. Writeback
is implemented as a single per-bdi writeback work item which writes only inodes
that belong to memory cgroups whose amount of dirty memory is beyond the
thresholds.
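
Roughly, the ownership rule looks like this (a simplified illustration only:
the two helpers marked "made-up" do not exist, and the real code in the first
patch also handles locking, RCU and css reference counting):

/* illustrative pseudo-C, not the actual patch code */
void account_dirty_page(struct address_space *mapping)
{
	struct mem_cgroup *memcg;

	/* first dirty/writeback page: the dirtying task's memcg becomes the owner */
	if (!mapping_has_dirty_or_writeback_pages(mapping))	/* made-up helper */
		mapping->i_memcg = current_task_memcg();	/* made-up helper */

	/* charge the page to the owner and all of its ancestors */
	for (memcg = mapping->i_memcg; memcg; memcg = parent_mem_cgroup(memcg))
		percpu_counter_inc(&memcg->nr_dirty);
}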

The third patch adds a trick for handling shared inodes which have dirty pages
from several cgroups: it marks the whole inode as shared and alters the
writeback filter for it.

The rest is an example of a bandwidth and iops controller built on top of that.
The design is completely original; I bet nobody has ever used task works for that =)

[1] [PATCH RFC] fsio: filesystem io accounting cgroup
http://marc.info/?l=linux-kernel&m=137331569501655&w=2

Patches also available here:
https://github.com/koct9i/linux.git branch memcg_dirty_control

---

Konstantin Khebnikov (6):
      memcg: inode-based dirty and writeback pages accounting
      memcg: dirty-set limiting and filtered writeback
      memcg: track shared inodes with dirty pages
      percpu_ratelimit: high-performance ratelimiting counter
      delay-injection: resource management via procrastination
      memcg: filesystem bandwidth controller


 block/blk-core.c                 |    2 
 fs/direct-io.c                   |    2 
 fs/fs-writeback.c                |   22 ++
 fs/inode.c                       |    1 
 include/linux/backing-dev.h      |    1 
 include/linux/fs.h               |   14 +
 include/linux/memcontrol.h       |   27 +++
 include/linux/percpu_ratelimit.h |   45 ++++
 include/linux/sched.h            |    7 +
 include/linux/writeback.h        |    1 
 include/trace/events/sched.h     |    7 +
 include/trace/events/writeback.h |    1 
 kernel/sched/core.c              |   66 +++++++
 kernel/sched/fair.c              |   12 +
 lib/Makefile                     |    1 
 lib/percpu_ratelimit.c           |  168 +++++++++++++++++
 mm/memcontrol.c                  |  381 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   32 +++
 mm/readahead.c                   |    2 
 mm/truncate.c                    |    1 
 mm/vmscan.c                      |    4 
 21 files changed, 787 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/percpu_ratelimit.h
 create mode 100644 lib/percpu_ratelimit.c

--
Signature


* [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 2/6] memcg: dirty-set limiting and filtered writeback Konstantin Khebnikov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This patch links the memory cgroup into the VFS layer and assigns an owner
memcg to each inode which has dirty or writeback pages within it.
The main goal of this is controlling the size of the dirty memory.

Accounting dirty memory per inode is much easier (we get the locking for
free) and more effective, because we can use this information during
writeback to write out only inodes which belong to cgroups whose amount
of dirty memory is beyond the thresholds.

Interface: the fs_dirty and fs_writeback counters in the memory.stat attribute.
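
After this patch memory.stat gains two lines, for example (byte values, i.e.
the page counters multiplied by PAGE_SIZE; the numbers here are made up):

fs_dirty 4096000
fs_writeback 819200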

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/inode.c                 |    1 
 include/linux/fs.h         |   11 ++++
 include/linux/memcontrol.h |   13 +++++
 mm/memcontrol.c            |  118 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c        |    7 ++-
 mm/truncate.c              |    1 
 6 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index aa149e7..979a548 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -559,6 +559,7 @@ static void evict(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	mem_cgroup_forget_mapping(&inode->i_data);
 
 	remove_inode_hash(inode);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42efe13..ee2e3c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -413,6 +413,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup __rcu	*i_memcg;	/* protected by ->tree_lock */
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -489,6 +492,14 @@ static inline void i_mmap_unlock_read(struct address_space *mapping)
 }
 
 /*
+ * Returns bitmap with all page-cache radix-tree tags
+ */
+static inline unsigned mapping_tags(struct address_space *mapping)
+{
+	return (__force unsigned)mapping->page_tree.gfp_mask >> __GFP_BITS_SHIFT;
+}
+
+/*
  * Might pages of this file be mapped into userspace?
  */
 static inline int mapping_mapped(struct address_space *mapping)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7c95af8..b281333 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -173,6 +173,12 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+void mem_cgroup_inc_page_dirty(struct address_space *mapping);
+void mem_cgroup_dec_page_dirty(struct address_space *mapping);
+void mem_cgroup_inc_page_writeback(struct address_space *mapping);
+void mem_cgroup_dec_page_writeback(struct address_space *mapping);
+void mem_cgroup_forget_mapping(struct address_space *mapping);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -340,6 +346,13 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void mem_cgroup_inc_page_dirty(struct address_space *mapping) {}
+static inline void mem_cgroup_dec_page_dirty(struct address_space *mapping) {}
+static inline void mem_cgroup_inc_page_writeback(struct address_space *mapping) {}
+static inline void mem_cgroup_dec_page_writeback(struct address_space *mapping) {}
+static inline void mem_cgroup_forget_mapping(struct address_space *mapping) {}
+
 #endif /* CONFIG_MEMCG */
 
 enum {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 851924f..c5655f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -361,6 +361,9 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	struct percpu_counter nr_dirty;
+	struct percpu_counter nr_writeback;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -3743,6 +3746,11 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i], val);
 	}
 
+	seq_printf(m, "fs_dirty %llu\n", PAGE_SIZE *
+			percpu_counter_sum_positive(&memcg->nr_dirty));
+	seq_printf(m, "fs_writeback %llu\n", PAGE_SIZE *
+			percpu_counter_sum_positive(&memcg->nr_writeback));
+
 #ifdef CONFIG_DEBUG_VM
 	{
 		int nid, zid;
@@ -4577,6 +4585,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg)
 		return NULL;
 
+	if (percpu_counter_init(&memcg->nr_dirty, 0, GFP_KERNEL) ||
+	    percpu_counter_init(&memcg->nr_writeback, 0, GFP_KERNEL))
+		goto out_free;
+
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg->stat)
 		goto out_free;
@@ -4584,6 +4596,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	return memcg;
 
 out_free:
+	percpu_counter_destroy(&memcg->nr_dirty);
+	percpu_counter_destroy(&memcg->nr_writeback);
 	kfree(memcg);
 	return NULL;
 }
@@ -4608,6 +4622,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	for_each_node(node)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
+	percpu_counter_destroy(&memcg->nr_dirty);
+	percpu_counter_destroy(&memcg->nr_writeback);
 	free_percpu(memcg->stat);
 
 	disarm_static_keys(memcg);
@@ -4750,6 +4766,31 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void mem_cgroup_switch_one_sb(struct super_block *sb, void *_memcg)
+{
+	struct mem_cgroup *memcg = _memcg;
+	struct mem_cgroup *target = parent_mem_cgroup(memcg);
+	struct address_space *mapping;
+	struct inode *inode;
+	extern spinlock_t inode_sb_list_lock;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		mapping = inode->i_mapping;
+		if (likely(rcu_access_pointer(mapping->i_memcg) != memcg))
+			continue;
+		spin_lock_irq(&mapping->tree_lock);
+		if (rcu_access_pointer(mapping->i_memcg) == memcg) {
+			rcu_assign_pointer(mapping->i_memcg, target);
+			if (target)
+				css_get(&target->css);
+			css_put(&memcg->css);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+}
+
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -4767,6 +4808,9 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
+	/* Switch all ->i_memcg references to the parent cgroup */
+	iterate_supers(mem_cgroup_switch_one_sb, memcg);
+
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
@@ -5821,6 +5865,80 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 	commit_charge(newpage, memcg, lrucare);
 }
 
+static inline struct mem_cgroup *
+mem_cgroup_from_mapping(struct address_space *mapping)
+{
+	return rcu_dereference_check(mapping->i_memcg,
+			lockdep_is_held(&mapping->tree_lock));
+}
+
+void mem_cgroup_inc_page_dirty(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_mapping(mapping);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/* Remember context at dirtying first page in the mapping */
+	if (unlikely(!(mapping_tags(mapping) &
+	    (BIT(PAGECACHE_TAG_DIRTY) | BIT(PAGECACHE_TAG_WRITEBACK))))) {
+		struct mem_cgroup *task_memcg;
+
+		rcu_read_lock();
+		task_memcg = mem_cgroup_from_task(current);
+		if (task_memcg != memcg) {
+			if (memcg)
+				css_put(&memcg->css);
+			css_get(&task_memcg->css);
+			memcg = task_memcg;
+			lockdep_assert_held(&mapping->tree_lock);
+			rcu_assign_pointer(mapping->i_memcg, memcg);
+		}
+		rcu_read_unlock();
+	}
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_inc(&memcg->nr_dirty);
+}
+
+void mem_cgroup_dec_page_dirty(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_mapping(mapping);
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_dec(&memcg->nr_dirty);
+	rcu_read_unlock();
+}
+
+void mem_cgroup_inc_page_writeback(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_mapping(mapping);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_inc(&memcg->nr_writeback);
+}
+
+void mem_cgroup_dec_page_writeback(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_mapping(mapping);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_dec(&memcg->nr_writeback);
+}
+
+void mem_cgroup_forget_mapping(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = rcu_dereference_protected(mapping->i_memcg, 1);
+	if (memcg) {
+		css_put(&memcg->css);
+		RCU_INIT_POINTER(mapping->i_memcg, NULL);
+	}
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6f43352..afaf263 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2098,6 +2098,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+		mem_cgroup_inc_page_dirty(mapping);
 		task_io_account_write(PAGE_CACHE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
@@ -2297,6 +2298,7 @@ int clear_page_dirty_for_io(struct page *page)
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
+			mem_cgroup_dec_page_dirty(mapping);
 			return 1;
 		}
 		return 0;
@@ -2327,6 +2329,7 @@ int test_clear_page_writeback(struct page *page)
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
+				mem_cgroup_dec_page_writeback(mapping);
 			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2361,8 +2364,10 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+				mem_cgroup_inc_page_writeback(mapping);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index f1e4d60..37915fe 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -114,6 +114,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
+			mem_cgroup_dec_page_dirty(mapping);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}



* [PATCH 2/6] memcg: dirty-set limiting and filtered writeback
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 3/6] memcg: track shared inodes with dirty pages Konstantin Khebnikov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

mem_cgroup_dirty_limits() checks the thresholds and schedules a per-bdi
writeback work item (with ->for_memcg set) which writes only inodes whose
owner memcg, or the whole bdi, has exceeded the dirty limit.

Interface: memory.dirty_ratio, the percentage of the memory limit used as
the threshold (0 = unlimited, default 50). The background threshold is half
of that. The fs_dirty_threshold line in memory.stat shows the current
threshold.
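
For example, with a 1 GiB memory limit and the default ratio the thresholds
work out like this (an illustration in bytes; mem_cgroup_update_dirty_thresh()
below does the same arithmetic in pages):

unsigned long long limit = 1ULL << 30;			/* memory limit: 1 GiB      */
unsigned long long thresh = limit * 50 / 100;		/* dirty threshold: 512 MiB */
unsigned long long background = thresh / 2;		/* background: 256 MiB      */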

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/fs-writeback.c                |   18 ++++-
 include/linux/backing-dev.h      |    1 
 include/linux/memcontrol.h       |    6 ++
 include/linux/writeback.h        |    1 
 include/trace/events/writeback.h |    1 
 mm/memcontrol.c                  |  145 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   25 ++++++-
 7 files changed, 190 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..9034768 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/memcontrol.h>
 #include <linux/pagemap.h>
 #include <linux/kthread.h>
 #include <linux/writeback.h>
@@ -47,6 +48,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int for_memcg:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -137,6 +139,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
+	work->for_memcg = reason == WB_REASON_FOR_MEMCG;
 
 	bdi_queue_work(bdi, work);
 }
@@ -258,15 +261,16 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
-	struct inode *inode;
+	struct inode *inode, *next;
 	int do_sb_sort = 0;
 	int moved = 0;
 
-	while (!list_empty(delaying_queue)) {
-		inode = wb_inode(delaying_queue->prev);
+	list_for_each_entry_safe(inode, next, delaying_queue, i_wb_list) {
 		if (work->older_than_this &&
 		    inode_dirtied_after(inode, *work->older_than_this))
 			break;
+		if (work->for_memcg && !mem_cgroup_dirty_exceeded(inode))
+			continue;
 		list_move(&inode->i_wb_list, &tmp);
 		moved++;
 		if (sb_is_blkdev_sb(inode->i_sb))
@@ -650,6 +654,11 @@ static long writeback_sb_inodes(struct super_block *sb,
 			break;
 		}
 
+		if (work->for_memcg && !mem_cgroup_dirty_exceeded(inode)) {
+			redirty_tail(inode, wb);
+			continue;
+		}
+
 		/*
 		 * Don't bother with new inodes or inodes being freed, first
 		 * kind does not need periodic writeout yet, and for the latter
@@ -1014,6 +1023,9 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 		wrote += wb_writeback(wb, work);
 
+		if (work->for_memcg)
+			clear_bit(BDI_memcg_writeback_running, &bdi->state);
+
 		/*
 		 * Notify the caller of completion if this is a synchronous
 		 * work item, otherwise just free it.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da6012..91b55d8 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -32,6 +32,7 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_memcg_writeback_running,
 };
 
 typedef int (congested_fn)(void *, int);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b281333..ae05563 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,9 @@ void mem_cgroup_dec_page_dirty(struct address_space *mapping);
 void mem_cgroup_inc_page_writeback(struct address_space *mapping);
 void mem_cgroup_dec_page_writeback(struct address_space *mapping);
 void mem_cgroup_forget_mapping(struct address_space *mapping);
+bool mem_cgroup_dirty_limits(struct address_space *mapping, unsigned long *dirty,
+			     unsigned long *thresh, unsigned long *bg_thresh);
+bool mem_cgroup_dirty_exceeded(struct inode *inode);
 
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
@@ -352,6 +355,9 @@ static inline void mem_cgroup_dec_page_dirty(struct address_space *mapping) {}
 static inline void mem_cgroup_inc_page_writeback(struct address_space *mapping) {}
 static inline void mem_cgroup_dec_page_writeback(struct address_space *mapping) {}
 static inline void mem_cgroup_forget_mapping(struct address_space *mapping) {}
+static inline bool mem_cgroup_dirty_limits(struct address_space *mapping, unsigned long *dirty,
+			     unsigned long *thresh, unsigned long *bg_thresh) { return false; }
+static inline bool mem_cgroup_dirty_exceeded(struct inode *inode) { return false; }
 
 #endif /* CONFIG_MEMCG */
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0004833..1239fa6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -47,6 +47,7 @@ enum wb_reason {
 	WB_REASON_LAPTOP_TIMER,
 	WB_REASON_FREE_MORE_MEM,
 	WB_REASON_FS_FREE_SPACE,
+	WB_REASON_FOR_MEMCG,
 	/*
 	 * There is no bdi forker thread any more and works are done
 	 * by emergency worker, however, this is TPs userland visible
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index cee02d6..106a8d7 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -29,6 +29,7 @@
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
 		{WB_REASON_FREE_MORE_MEM,	"free_more_memory"},	\
 		{WB_REASON_FS_FREE_SPACE,	"fs_free_space"},	\
+		{WB_REASON_FOR_MEMCG,		"for_memcg"},		\
 		{WB_REASON_FORKER_THREAD,	"forker_thread"}
 
 struct wb_writeback_work;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5655f1..17d966a3b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -363,6 +363,10 @@ struct mem_cgroup {
 
 	struct percpu_counter nr_dirty;
 	struct percpu_counter nr_writeback;
+	unsigned long dirty_threshold;
+	unsigned long dirty_background;
+	unsigned int dirty_exceeded;
+	unsigned int dirty_ratio;
 
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
@@ -3060,6 +3064,8 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 
 static DEFINE_MUTEX(memcg_limit_mutex);
 
+static void mem_cgroup_update_dirty_thresh(struct mem_cgroup *memcg);
+
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 				   unsigned long limit)
 {
@@ -3112,6 +3118,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
+	if (!ret)
+		mem_cgroup_update_dirty_thresh(memcg);
+
 	return ret;
 }
 
@@ -3750,6 +3759,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 			percpu_counter_sum_positive(&memcg->nr_dirty));
 	seq_printf(m, "fs_writeback %llu\n", PAGE_SIZE *
 			percpu_counter_sum_positive(&memcg->nr_writeback));
+	seq_printf(m, "fs_dirty_threshold %llu\n", (u64)PAGE_SIZE *
+			memcg->dirty_threshold);
 
 #ifdef CONFIG_DEBUG_VM
 	{
@@ -3803,6 +3814,25 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+static u64 mem_cgroup_dirty_ratio_read(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return memcg->dirty_ratio;
+}
+
+static int mem_cgroup_dirty_ratio_write(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	memcg->dirty_ratio = val;
+	mem_cgroup_update_dirty_thresh(memcg);
+
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4454,6 +4484,11 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_ratio_read,
+		.write_u64 = mem_cgroup_dirty_ratio_write,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -4686,6 +4721,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		memcg->dirty_ratio = 50; /* default value for cgroups */
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -4750,6 +4786,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		if (parent != root_mem_cgroup)
 			memory_cgrp_subsys.broken_hierarchy = true;
 	}
+
+	memcg->dirty_ratio = parent->dirty_ratio;
+	mem_cgroup_update_dirty_thresh(memcg);
+
 	mutex_unlock(&memcg_create_mutex);
 
 	ret = memcg_init_kmem(memcg, &memory_cgrp_subsys);
@@ -5939,6 +5979,111 @@ void mem_cgroup_forget_mapping(struct address_space *mapping)
 	}
 }
 
+static void mem_cgroup_update_dirty_thresh(struct mem_cgroup *memcg)
+{
+	struct cgroup_subsys_state *pos;
+
+	if (memcg->memory.limit > totalram_pages || !memcg->dirty_ratio) {
+		memcg->dirty_threshold = 0; /* 0 means no limit at all*/
+		memcg->dirty_background = ULONG_MAX;
+	} else {
+		memcg->dirty_threshold = memcg->memory.limit *
+					 memcg->dirty_ratio / 100;
+		memcg->dirty_background = memcg->dirty_threshold / 2;
+	}
+
+	/* Propagate the threshold into the children */
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, &memcg->css) {
+		struct mem_cgroup *memcg = mem_cgroup_from_css(pos);
+		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+
+		if (!(pos->flags & CSS_ONLINE))
+			continue;
+
+		if (memcg->dirty_threshold == 0 ||
+		    memcg->dirty_threshold == ULONG_MAX) {
+			if (parent && parent->use_hierarchy &&
+				      parent->dirty_threshold)
+				memcg->dirty_threshold = ULONG_MAX;
+			else
+				memcg->dirty_threshold = 0;
+		}
+	}
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_dirty_limits(struct address_space *mapping,
+			     unsigned long *pdirty,
+			     unsigned long *pthresh,
+			     unsigned long *pbg_thresh)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long dirty, threshold, background;
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		/* No limit at all */
+		if (memcg->dirty_threshold == 0)
+			break;
+		/* No limit here, but must check parent */
+		if (memcg->dirty_threshold == ULONG_MAX)
+			continue;
+		dirty = percpu_counter_read_positive(&memcg->nr_dirty) +
+			percpu_counter_read_positive(&memcg->nr_writeback);
+		threshold = memcg->dirty_threshold;
+		background = memcg->dirty_background;
+		if (dirty > background) {
+			if (!memcg->dirty_exceeded)
+				memcg->dirty_exceeded = 1;
+			rcu_read_unlock();
+			if (dirty > (background + threshold) / 2 &&
+			    !test_and_set_bit(BDI_memcg_writeback_running,
+					      &bdi->state))
+				bdi_start_writeback(bdi, dirty - background,
+						    WB_REASON_FOR_MEMCG);
+			*pdirty = dirty;
+			*pthresh = threshold;
+			*pbg_thresh = background;
+			return true;
+		}
+	}
+	rcu_read_unlock();
+
+	return false;
+}
+
+bool mem_cgroup_dirty_exceeded(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct mem_cgroup *memcg;
+	unsigned long dirty;
+
+	if (mapping->backing_dev_info->dirty_exceeded)
+		return true;
+
+	rcu_read_lock();
+	memcg = rcu_dereference(mapping->i_memcg);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		if (!memcg->dirty_threshold) {
+			memcg = NULL;
+			break;
+		}
+		if (!memcg->dirty_exceeded)
+			continue;
+		dirty = percpu_counter_read_positive(&memcg->nr_dirty) +
+			percpu_counter_read_positive(&memcg->nr_writeback);
+		if (dirty > memcg->dirty_background)
+			break;
+		memcg->dirty_exceeded = 0;
+	}
+	rcu_read_unlock();
+
+	return memcg != NULL;
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index afaf263..325510f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1328,6 +1328,17 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
 	}
 }
 
+static unsigned long mem_cgroup_position_ratio(unsigned long dirty,
+		unsigned long thresh, unsigned long bg_thresh)
+{
+	unsigned long setpoint = dirty_freerun_ceiling(thresh, bg_thresh);
+
+	if (dirty > thresh)
+		return 0;
+
+	return pos_ratio_polynom(setpoint, dirty, thresh);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -1362,6 +1373,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		unsigned long uninitialized_var(bdi_dirty);
 		unsigned long dirty;
 		unsigned long bg_thresh;
+		bool memcg;
 
 		/*
 		 * Unstable writes are a feature of certain networked
@@ -1387,6 +1399,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 			bg_thresh = background_thresh;
 		}
 
+		memcg = mem_cgroup_dirty_limits(mapping, &dirty, &thresh, &bg_thresh);
+
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
@@ -1404,7 +1418,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
+		if (unlikely(!writeback_in_progress(bdi) && !memcg))
 			bdi_start_background_writeback(bdi);
 
 		if (!strictlimit)
@@ -1421,9 +1435,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 				     start_time);
 
 		dirty_ratelimit = bdi->dirty_ratelimit;
-		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
-					       background_thresh, nr_dirty,
-					       bdi_thresh, bdi_dirty);
+		if (memcg)
+			pos_ratio = mem_cgroup_position_ratio(dirty, thresh, bg_thresh);
+		else
+			pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					background_thresh, nr_dirty,
+					bdi_thresh, bdi_dirty);
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		max_pause = bdi_max_pause(bdi, bdi_dirty);



* [PATCH 3/6] memcg: track shared inodes with dirty pages
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 2/6] memcg: dirty-set limiting and filtered writeback Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-15 18:55   ` Tejun Heo
  2015-01-15 18:49 ` [PATCH 4/6] percpu_ratelimit: high-performance ratelimiting counter Konstantin Khebnikov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

An inode is owned by only one memory cgroup, but if it is shared it might
contain pages from multiple cgroups. This patch detects that situation in
the memory reclaimer and marks the dirty inode with the flag I_DIRTY_SHARED,
which is cleared only when the data is completely written. Memcg writeback
always writes such inodes.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/fs-writeback.c          |    4 ++--
 include/linux/fs.h         |    3 +++
 include/linux/memcontrol.h |    4 ++++
 mm/memcontrol.c            |   20 ++++++++++++++++++++
 mm/vmscan.c                |    4 ++++
 5 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9034768..fda6a64 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -484,7 +484,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 */
 	spin_lock(&inode->i_lock);
 
-	dirty = inode->i_state & I_DIRTY;
+	dirty = inode->i_state & (I_DIRTY | I_DIRTY_SHARED);
 	inode->i_state &= ~I_DIRTY;
 
 	/*
@@ -501,7 +501,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	smp_mb();
 
 	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
-		inode->i_state |= I_DIRTY_PAGES;
+		inode->i_state |= I_DIRTY_PAGES | (dirty & I_DIRTY_SHARED);
 
 	spin_unlock(&inode->i_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ee2e3c0..303f0ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1741,6 +1741,8 @@ struct super_operations {
  *
  * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
  *
+ * I_DIRTY_SHARED	Dirty pages belong to multiple memory cgroups.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1757,6 +1759,7 @@ struct super_operations {
 #define __I_DIO_WAKEUP		9
 #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
+#define I_DIRTY_SHARED		(1 << 11)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ae05563..3f89e9b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -181,6 +181,8 @@ void mem_cgroup_forget_mapping(struct address_space *mapping);
 bool mem_cgroup_dirty_limits(struct address_space *mapping, unsigned long *dirty,
 			     unsigned long *thresh, unsigned long *bg_thresh);
 bool mem_cgroup_dirty_exceeded(struct inode *inode);
+void mem_cgroup_poke_writeback(struct address_space *mapping,
+			       struct mem_cgroup *memcg);
 
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
@@ -358,6 +360,8 @@ static inline void mem_cgroup_forget_mapping(struct address_space *mapping) {}
 static inline bool mem_cgroup_dirty_limits(struct address_space *mapping, unsigned long *dirty,
 			     unsigned long *thresh, unsigned long *bg_thresh) { return false; }
 static inline bool mem_cgroup_dirty_exceeded(struct inode *inode) { return false; }
+static inline void mem_cgroup_poke_writeback(struct address_space *mapping,
+					     struct mem_cgroup *memcg) { }
 
 #endif /* CONFIG_MEMCG */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17d966a3b..d9d345c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6064,6 +6064,9 @@ bool mem_cgroup_dirty_exceeded(struct inode *inode)
 	if (mapping->backing_dev_info->dirty_exceeded)
 		return true;
 
+	if (inode->i_state & I_DIRTY_SHARED)
+		return true;
+
 	rcu_read_lock();
 	memcg = rcu_dereference(mapping->i_memcg);
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
@@ -6084,6 +6087,23 @@ bool mem_cgroup_dirty_exceeded(struct inode *inode)
 	return memcg != NULL;
 }
 
+void mem_cgroup_poke_writeback(struct address_space *mapping,
+			       struct mem_cgroup *memcg)
+{
+	struct inode *inode = mapping->host;
+
+	if (rcu_access_pointer(mapping->i_memcg) == memcg ||
+	    !memcg->dirty_exceeded)
+		return;
+
+	if ((inode->i_state & (I_DIRTY_PAGES|I_DIRTY_SHARED)) == I_DIRTY_PAGES) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_DIRTY_PAGES)
+			inode->i_state |= I_DIRTY_SHARED;
+		spin_unlock(&inode->i_lock);
+	}
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab2505c..75165fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1013,6 +1013,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
+				if (!global_reclaim(sc))
+					mem_cgroup_poke_writeback(mapping,
+							sc->target_mem_cgroup);
+
 				goto keep_locked;
 			}
 



* [PATCH 4/6] percpu_ratelimit: high-performance ratelimiting counter
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
                   ` (2 preceding siblings ...)
  2015-01-15 18:49 ` [PATCH 3/6] memcg: track shared inodes with dirty pages Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 5/6] delay-injection: resource management via procrastination Konstantin Khebnikov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Parameters:
period   - interval between refills (100ms should be fine)
quota    - events refilled per period
deadline - interval for utilizing unused past quota (1s by default)
latency  - maximum injected delay (10s by default)

The quota accumulates into a 'budget' and is spread across CPUs.
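
A minimal usage sketch of the API (error handling omitted; the bandwidth
controller in the last patch pairs percpu_ratelimit_charge() with
inject_delay() from the next patch instead of percpu_ratelimit_wait()):

struct percpu_ratelimit rl;

percpu_ratelimit_init(&rl, GFP_KERNEL);
percpu_ratelimit_setup(&rl, 1000, NSEC_PER_SEC);	/* up to 1000 events per second */
...
if (percpu_ratelimit_charge(&rl, nr_events))		/* true: quota exhausted */
	percpu_ratelimit_wait(&rl);			/* sleep until the next refill */
...
percpu_ratelimit_destroy(&rl);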

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/percpu_ratelimit.h |   45 ++++++++++
 lib/Makefile                     |    1 
 lib/percpu_ratelimit.c           |  168 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 214 insertions(+)
 create mode 100644 include/linux/percpu_ratelimit.h
 create mode 100644 lib/percpu_ratelimit.c

diff --git a/include/linux/percpu_ratelimit.h b/include/linux/percpu_ratelimit.h
new file mode 100644
index 0000000..42c45d4
--- /dev/null
+++ b/include/linux/percpu_ratelimit.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_PERCPU_RATELIMIT_H
+#define _LINUX_PERCPU_RATELIMIT_H
+
+#include <linux/hrtimer.h>
+
+struct percpu_ratelimit {
+	struct hrtimer  timer;
+	ktime_t		target;		/* time of next refill */
+	ktime_t		deadline;	/* interval to utilize past budget */
+	ktime_t		latency;	/* maximum injected delay */
+	ktime_t		period;		/* interval between refills */
+	u64		quota;		/* events refill per period */
+	u64		budget;		/* amount of available events */
+	u64		total;		/* consumed and pre-charged events */
+	raw_spinlock_t	lock;		/* protect the state */
+	u32		cpu_batch;	/* events in per-cpu precharge */
+	u32 __percpu	*cpu_budget;	/* per-cpu precharge */
+};
+
+static inline bool percpu_ratelimit_blocked(struct percpu_ratelimit *rl)
+{
+       return hrtimer_active(&rl->timer);
+}
+
+static inline ktime_t percpu_ratelimit_target(struct percpu_ratelimit *rl)
+{
+	return rl->target;
+}
+
+static inline int percpu_ratelimit_wait(struct percpu_ratelimit *rl)
+{
+	ktime_t target = rl->target;
+
+	return schedule_hrtimeout_range(&target, ktime_to_ns(rl->period),
+					HRTIMER_MODE_ABS);
+}
+
+int percpu_ratelimit_init(struct percpu_ratelimit *rl, gfp_t gfp);
+void percpu_ratelimit_destroy(struct percpu_ratelimit *rl);
+void percpu_ratelimit_setup(struct percpu_ratelimit *rl, u64 quota, u64 period);
+u64 percpu_ratelimit_quota(struct percpu_ratelimit *rl, u64 period);
+bool percpu_ratelimit_charge(struct percpu_ratelimit *rl, u64 events);
+u64 percpu_ratelimit_sum(struct percpu_ratelimit *rl);
+
+#endif /* _LINUX_PERCPU_RATELIMIT_H */
diff --git a/lib/Makefile b/lib/Makefile
index 3c3b30b..b20ab47 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -21,6 +21,7 @@ lib-$(CONFIG_SMP) += cpumask.o
 
 lib-y	+= kobject.o klist.o
 obj-y	+= lockref.o
+obj-y   += percpu_ratelimit.o
 
 obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
diff --git a/lib/percpu_ratelimit.c b/lib/percpu_ratelimit.c
new file mode 100644
index 0000000..8254683
--- /dev/null
+++ b/lib/percpu_ratelimit.c
@@ -0,0 +1,168 @@
+#include <linux/percpu_ratelimit.h>
+
+static void __percpu_ratelimit_setup(struct percpu_ratelimit *rl,
+				     u64 period, u64 quota)
+{
+	rl->period = ns_to_ktime(period);
+	rl->quota = quota;
+	rl->total += quota - rl->budget;
+	rl->budget = quota;
+	if (do_div(quota, num_possible_cpus() * 2))
+		quota++;
+	rl->cpu_batch = min_t(u64, UINT_MAX, quota);
+	rl->target = ktime_get();
+}
+
+static enum hrtimer_restart ratelimit_unblock(struct hrtimer *t)
+{
+	struct percpu_ratelimit *rl = container_of(t, struct percpu_ratelimit, timer);
+	enum hrtimer_restart ret = HRTIMER_NORESTART;
+	ktime_t now = t->base->get_time();
+
+	raw_spin_lock(&rl->lock);
+	if (ktime_after(rl->target, now)) {
+		hrtimer_set_expires_range(t, rl->target, rl->period);
+		ret = HRTIMER_RESTART;
+	}
+	raw_spin_unlock(&rl->lock);
+
+	return ret;
+}
+
+int percpu_ratelimit_init(struct percpu_ratelimit *rl, gfp_t gfp)
+{
+	memset(rl, 0, sizeof(*rl));
+	rl->cpu_budget = alloc_percpu_gfp(typeof(*rl->cpu_budget), gfp);
+	if (!rl->cpu_budget)
+		return -ENOMEM;
+	raw_spin_lock_init(&rl->lock);
+	hrtimer_init(&rl->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	rl->timer.function = ratelimit_unblock;
+	rl->deadline = ns_to_ktime(NSEC_PER_SEC);
+	rl->latency  = ns_to_ktime(NSEC_PER_SEC * 10);
+	__percpu_ratelimit_setup(rl, NSEC_PER_SEC, ULLONG_MAX);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_init);
+
+void percpu_ratelimit_destroy(struct percpu_ratelimit *rl)
+{
+	free_percpu(rl->cpu_budget);
+	hrtimer_cancel(&rl->timer);
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_destroy);
+
+static void percpu_ratelimit_drain(void *info)
+{
+	struct percpu_ratelimit *rl = info;
+
+	__this_cpu_write(*rl->cpu_budget, 0);
+}
+
+void percpu_ratelimit_setup(struct percpu_ratelimit *rl, u64 quota, u64 period)
+{
+	unsigned long flags;
+
+	if (!quota || !period) {
+		quota = ULLONG_MAX;
+		period = NSEC_PER_SEC;
+	} else if (period > NSEC_PER_SEC / 10) {
+		u64 quant = div_u64(quota * NSEC_PER_SEC / 10, period);
+
+		if (quant > 20) {
+			quota = quant;
+			period = NSEC_PER_SEC / 10;
+		}
+	}
+
+	raw_spin_lock_irqsave(&rl->lock, flags);
+	__percpu_ratelimit_setup(rl, period, quota);
+	raw_spin_unlock_irqrestore(&rl->lock, flags);
+	on_each_cpu(percpu_ratelimit_drain, rl, 1);
+	hrtimer_cancel(&rl->timer);
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_setup);
+
+u64 percpu_ratelimit_quota(struct percpu_ratelimit *rl, u64 period)
+{
+	unsigned long flags;
+	u64 quota;
+
+	raw_spin_lock_irqsave(&rl->lock, flags);
+	if (rl->quota == ULLONG_MAX)
+		quota = 0;
+	else
+		quota = div64_u64(rl->quota * period, ktime_to_ns(rl->period));
+	raw_spin_unlock_irqrestore(&rl->lock, flags);
+
+	return quota;
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_quota);
+
+/*
+ * Charges events, returns true if ratelimit is blocked and caller should sleep.
+ */
+bool percpu_ratelimit_charge(struct percpu_ratelimit *rl, u64 events)
+{
+	unsigned long flags;
+	u64 budget, delta;
+	ktime_t now, deadline;
+
+	preempt_disable();
+	budget = __this_cpu_read(*rl->cpu_budget);
+	if (likely(budget >= events)) {
+		__this_cpu_sub(*rl->cpu_budget, events);
+	} else {
+		now = ktime_get();
+		raw_spin_lock_irqsave(&rl->lock, flags);
+		deadline = ktime_sub(now, rl->deadline);
+		if (ktime_after(deadline, rl->target))
+			rl->target = deadline;
+		budget += rl->budget;
+		if (budget >= events + rl->cpu_batch) {
+			budget -= events;
+		} else {
+			delta = events + rl->cpu_batch - budget;
+			if (do_div(delta, rl->quota))
+				delta++;
+			rl->target = ktime_add_ns(rl->target,
+					ktime_to_ns(rl->period) * delta);
+			deadline = ktime_add(now, rl->latency);
+			if (ktime_after(rl->target, deadline))
+				rl->target = deadline;
+			delta *= rl->quota;
+			rl->total += delta;
+			budget += delta - events;
+		}
+		rl->budget = budget - rl->cpu_batch;
+		__this_cpu_write(*rl->cpu_budget, rl->cpu_batch);
+		if (!hrtimer_active(&rl->timer) && ktime_after(rl->target, now))
+			hrtimer_start_range_ns(&rl->timer, rl->target,
+					ktime_to_ns(rl->period),
+					HRTIMER_MODE_ABS);
+		raw_spin_unlock_irqrestore(&rl->lock, flags);
+	}
+	preempt_enable();
+
+	return percpu_ratelimit_blocked(rl);
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_charge);
+
+/*
+ * Returns count of consumed events.
+ */
+u64 percpu_ratelimit_sum(struct percpu_ratelimit *rl)
+{
+	unsigned long flags;
+	int cpu;
+	s64 ret;
+
+	raw_spin_lock_irqsave(&rl->lock, flags);
+	ret = rl->total - rl->budget;
+	for_each_online_cpu(cpu)
+		ret -= per_cpu(*rl->cpu_budget, cpu);
+	raw_spin_unlock_irqrestore(&rl->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(percpu_ratelimit_sum);



* [PATCH 5/6] delay-injection: resource management via procrastination
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
                   ` (3 preceding siblings ...)
  2015-01-15 18:49 ` [PATCH 4/6] percpu_ratelimit: high-performance ratelimiting counter Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-15 18:49 ` [PATCH 6/6] memcg: filesystem bandwidth controller Konstantin Khebnikov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

inject_delay() allows pausing the current task before it returns to
userspace, at a point where the kernel does not hold any locks, so the
wait does not introduce any priority-inversion problems.

This code abuses the existing task-work mechanism and the 'TASK_PARKED'
state. Parked tasks are killable and do not contribute to CPU load.

Together with percpu_ratelimit this can be used like this:

if (percpu_ratelimit_charge(&ratelimit, events))
        inject_delay(percpu_ratelimit_target(&ratelimit));

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/sched.h        |    7 ++++
 include/trace/events/sched.h |    7 ++++
 kernel/sched/core.c          |   66 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |   12 ++++++++
 4 files changed, 92 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..2363918 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1132,6 +1132,7 @@ struct sched_statistics {
 	u64			iowait_sum;
 
 	u64			sleep_start;
+	u64			delay_start;
 	u64			sleep_max;
 	s64			sum_sleep_runtime;
 
@@ -1662,6 +1663,10 @@ struct task_struct {
 	unsigned long timer_slack_ns;
 	unsigned long default_timer_slack_ns;
 
+	/* Pause task till this time before returning into userspace */
+	ktime_t delay_injection_target;
+	struct callback_head delay_injection_work;
+
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	/* Index of current stored address in ret_stack */
 	int curr_ret_stack;
@@ -2277,6 +2282,8 @@ extern void set_curr_task(int cpu, struct task_struct *p);
 
 void yield(void);
 
+extern void inject_delay(ktime_t target);
+
 /*
  * The default (Linux) execution domain.
  */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 30fedaf..d35154e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -365,6 +365,13 @@ DEFINE_EVENT(sched_stat_template, sched_stat_blocked,
 	     TP_ARGS(tsk, delay));
 
 /*
+ * Tracepoint for accounting delay-injection
+ */
+DEFINE_EVENT(sched_stat_template, sched_stat_delayed,
+	     TP_PROTO(struct task_struct *tsk, u64 delay),
+	     TP_ARGS(tsk, delay));
+
+/*
  * Tracepoint for accounting runtime (time the task is executing
  * on a CPU).
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c0accc0..7a9d6a1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -65,6 +65,7 @@
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
 #include <linux/hrtimer.h>
+#include <linux/task_work.h>
 #include <linux/tick.h>
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
@@ -8377,3 +8378,68 @@ void dump_cpu_task(int cpu)
 	pr_info("Task dump for CPU %d:\n", cpu);
 	sched_show_task(cpu_curr(cpu));
 }
+
+#define DELAY_INJECTION_SLACK_NS	(NSEC_PER_SEC / 50)
+
+static enum hrtimer_restart delay_injection_wakeup(struct hrtimer *timer)
+{
+	struct hrtimer_sleeper *t =
+		container_of(timer, struct hrtimer_sleeper, timer);
+	struct task_struct *task = t->task;
+
+	t->task = NULL;
+	if (task)
+		wake_up_state(task, TASK_PARKED);
+
+	return HRTIMER_NORESTART;
+}
+
+/*
+ * Here delayed task sleeps in 'P'arked state.
+ */
+static void delay_injection_sleep(struct callback_head *head)
+{
+	struct task_struct *task = current;
+	struct hrtimer_sleeper t;
+
+	head->func = NULL;
+	__set_task_state(task, TASK_WAKEKILL | TASK_PARKED);
+	hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_set_expires_range_ns(&t.timer, current->delay_injection_target,
+				     DELAY_INJECTION_SLACK_NS);
+
+	t.timer.function = delay_injection_wakeup;
+	t.task = task;
+
+	hrtimer_start_expires(&t.timer, HRTIMER_MODE_ABS);
+	if (!hrtimer_active(&t.timer))
+		t.task = NULL;
+
+	if (likely(t.task))
+		schedule();
+
+	hrtimer_cancel(&t.timer);
+	destroy_hrtimer_on_stack(&t.timer);
+
+	__set_task_state(task, TASK_RUNNING);
+}
+
+/*
+ * inject_delay - injects delay before returning into userspace
+ * @target: absolute monotonic timestamp to sleep until,
+ *	    task will not return into userspace before this time
+ */
+void inject_delay(ktime_t target)
+{
+	struct task_struct *task = current;
+
+	if (ktime_after(target, task->delay_injection_target)) {
+		task->delay_injection_target = target;
+		if (!task->delay_injection_work.func) {
+			init_task_work(&task->delay_injection_work,
+					delay_injection_sleep);
+			task_work_add(task, &task->delay_injection_work, true);
+		}
+	}
+}
+EXPORT_SYMBOL(inject_delay);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40667cb..2e3269b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2944,6 +2944,15 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 			account_scheduler_latency(tsk, delta >> 10, 0);
 		}
 	}
+	if (se->statistics.delay_start) {
+		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.delay_start;
+
+		if ((s64)delta < 0)
+			delta = 0;
+
+		se->statistics.delay_start = 0;
+		trace_sched_stat_delayed(tsk, delta);
+	}
 #endif
 }
 
@@ -3095,6 +3104,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
 			if (tsk->state & TASK_UNINTERRUPTIBLE)
 				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+			if ((tsk->state & TASK_PARKED) &&
+			    tsk->delay_injection_target.tv64)
+				se->statistics.delay_start = rq_clock(rq_of(cfs_rq));
 		}
 #endif
 	}



* [PATCH 6/6] memcg: filesystem bandwidth controller
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
                   ` (4 preceding siblings ...)
  2015-01-15 18:49 ` [PATCH 5/6] delay-injection: resource management via procrastination Konstantin Khebnikov
@ 2015-01-15 18:49 ` Konstantin Khebnikov
  2015-01-16  9:37 ` [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Jan Kara
  2015-01-29  1:21 ` Tejun Heo
  7 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khebnikov @ 2015-01-15 18:49 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: Roman Gushchin, Jan Kara, Dave Chinner, linux-kernel, Tejun Heo,
	linux-fsdevel, koct9i

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This is an example of a filesystem bandwidth controller built on top of
dirty memory accounting, percpu_ratelimit and delay injection.

The cgroup charges read/write requests against its rate limiters and
injects delays, which controls the overall speed.

Interface:
memory.fs_bps_limit     bytes per second, 0 == unlimited
memory.fs_iops_limit    iops limit, 0 == unlimited
Statistics: fs_io_bytes and fs_io_operations in memory.stat

For small bandwidth limits the memory limit must also be set to a
corresponding value, otherwise the delay injected after writing out the
dirty set might be enormous.
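
For example, a management daemon could configure a group from userspace
like this (an illustration only: it assumes the legacy memory controller
is mounted at /sys/fs/cgroup/memory and a cgroup named "box" already exists):

#include <stdio.h>

static int set_knob(const char *file, unsigned long long val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/box/%s", file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%llu\n", val);
	return fclose(f);
}

...
set_knob("memory.limit_in_bytes", 256ULL << 20);	/* keep the dirty set small */
set_knob("memory.fs_bps_limit",    10ULL << 20);	/* ~10 MB/s                 */
set_knob("memory.fs_iops_limit",   200);		/* 200 requests per second  */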

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 block/blk-core.c           |    2 +
 fs/direct-io.c             |    2 +
 include/linux/memcontrol.h |    4 ++
 mm/memcontrol.c            |  102 +++++++++++++++++++++++++++++++++++++++++++-
 mm/readahead.c             |    2 +
 5 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3ad4055..799f5f5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1966,6 +1966,7 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_iter.bi_size);
+			mem_cgroup_account_bandwidth(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
 
@@ -2208,6 +2209,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
 		}
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		mem_cgroup_account_ioop();
 		rq->part = part;
 	}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..9c60a82 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -24,6 +24,7 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/memcontrol.h>
 #include <linux/slab.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
@@ -775,6 +776,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		mem_cgroup_account_bandwidth(len);
 	}
 
 	/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3f89e9b..633310e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -183,6 +183,8 @@ bool mem_cgroup_dirty_limits(struct address_space *mapping, unsigned long *dirty
 bool mem_cgroup_dirty_exceeded(struct inode *inode);
 void mem_cgroup_poke_writeback(struct address_space *mapping,
 			       struct mem_cgroup *memcg);
+void mem_cgroup_account_bandwidth(unsigned long bytes);
+void mem_cgroup_account_ioop(void);
 
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
@@ -362,6 +364,8 @@ static inline bool mem_cgroup_dirty_limits(struct address_space *mapping, unsign
 static inline bool mem_cgroup_dirty_exceeded(struct inode *inode) { return false; }
 static inline void mem_cgroup_poke_writeback(struct address_space *mapping,
 					     struct mem_cgroup *memcg) { }
+static inline void mem_cgroup_account_bandwidth(unsigned long bytes) {}
+static inline void mem_cgroup_account_ioop(void) {}
 
 #endif /* CONFIG_MEMCG */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d9d345c..f49fbbf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -27,6 +27,7 @@
 
 #include <linux/page_counter.h>
 #include <linux/memcontrol.h>
+#include <linux/percpu_ratelimit.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
@@ -368,6 +369,9 @@ struct mem_cgroup {
 	unsigned int dirty_exceeded;
 	unsigned int dirty_ratio;
 
+	struct percpu_ratelimit iobw;
+	struct percpu_ratelimit ioop;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -3762,6 +3766,12 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 	seq_printf(m, "fs_dirty_threshold %llu\n", (u64)PAGE_SIZE *
 			memcg->dirty_threshold);
 
+	seq_printf(m, "fs_io_bytes %llu\n",
+			percpu_ratelimit_sum(&memcg->iobw));
+	seq_printf(m, "fs_io_operations %llu\n",
+			percpu_ratelimit_sum(&memcg->ioop));
+
+
 #ifdef CONFIG_DEBUG_VM
 	{
 		int nid, zid;
@@ -3833,6 +3843,40 @@ static int mem_cgroup_dirty_ratio_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+static u64 mem_cgroup_get_bps_limit(
+		struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return percpu_ratelimit_quota(&memcg->iobw, NSEC_PER_SEC);
+}
+
+static int mem_cgroup_set_bps_limit(
+		struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	percpu_ratelimit_setup(&memcg->iobw, val, NSEC_PER_SEC);
+	return 0;
+}
+
+static u64 mem_cgroup_get_iops_limit(
+		struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return percpu_ratelimit_quota(&memcg->ioop, NSEC_PER_SEC);
+}
+
+static int mem_cgroup_set_iops_limit(
+		struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	percpu_ratelimit_setup(&memcg->ioop, val, NSEC_PER_SEC);
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4489,6 +4533,16 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_dirty_ratio_write,
 	},
 	{
+		.name = "fs_bps_limit",
+		.read_u64 = mem_cgroup_get_bps_limit,
+		.write_u64 = mem_cgroup_set_bps_limit,
+	},
+	{
+		.name = "fs_iops_limit",
+		.read_u64 = mem_cgroup_get_iops_limit,
+		.write_u64 = mem_cgroup_set_iops_limit,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -4621,7 +4675,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		return NULL;
 
 	if (percpu_counter_init(&memcg->nr_dirty, 0, GFP_KERNEL) ||
-	    percpu_counter_init(&memcg->nr_writeback, 0, GFP_KERNEL))
+	    percpu_counter_init(&memcg->nr_writeback, 0, GFP_KERNEL) ||
+	    percpu_ratelimit_init(&memcg->iobw, GFP_KERNEL) ||
+	    percpu_ratelimit_init(&memcg->ioop, GFP_KERNEL))
 		goto out_free;
 
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
@@ -4633,6 +4689,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 out_free:
 	percpu_counter_destroy(&memcg->nr_dirty);
 	percpu_counter_destroy(&memcg->nr_writeback);
+	percpu_ratelimit_destroy(&memcg->iobw);
+	percpu_ratelimit_destroy(&memcg->ioop);
 	kfree(memcg);
 	return NULL;
 }
@@ -4659,6 +4717,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	percpu_counter_destroy(&memcg->nr_dirty);
 	percpu_counter_destroy(&memcg->nr_writeback);
+	percpu_ratelimit_destroy(&memcg->iobw);
+	percpu_ratelimit_destroy(&memcg->ioop);
 	free_percpu(memcg->stat);
 
 	disarm_static_keys(memcg);
@@ -5956,8 +6016,44 @@ void mem_cgroup_inc_page_writeback(struct address_space *mapping)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_mapping(mapping);
 
-	for (; memcg; memcg = parent_mem_cgroup(memcg))
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		percpu_counter_inc(&memcg->nr_writeback);
+		percpu_ratelimit_charge(&memcg->iobw, PAGE_CACHE_SIZE);
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		if (percpu_ratelimit_blocked(&memcg->iobw))
+			inject_delay(percpu_ratelimit_target(&memcg->iobw));
+	}
+	rcu_read_unlock();
+}
+
+void mem_cgroup_account_bandwidth(unsigned long bytes)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		if (percpu_ratelimit_charge(&memcg->iobw, bytes))
+			inject_delay(percpu_ratelimit_target(&memcg->iobw));
+	}
+	rcu_read_unlock();
+}
+
+void mem_cgroup_account_ioop(void)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		if (percpu_ratelimit_charge(&memcg->ioop, 1))
+			inject_delay(percpu_ratelimit_target(&memcg->ioop));
+	}
+	rcu_read_unlock();
 }
 
 void mem_cgroup_dec_page_writeback(struct address_space *mapping)
@@ -6038,6 +6134,8 @@ bool mem_cgroup_dirty_limits(struct address_space *mapping,
 		if (dirty > background) {
 			if (!memcg->dirty_exceeded)
 				memcg->dirty_exceeded = 1;
+			if (percpu_ratelimit_blocked(&memcg->iobw))
+				inject_delay(percpu_ratelimit_target(&memcg->iobw));
 			rcu_read_unlock();
 			if (dirty > (background + threshold) / 2 &&
 			    !test_and_set_bit(BDI_memcg_writeback_running,
diff --git a/mm/readahead.c b/mm/readahead.c
index 17b9172..7c7ec23 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -16,6 +16,7 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
+#include <linux/memcontrol.h>
 #include <linux/file.h>
 
 #include "internal.h"
@@ -102,6 +103,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		mem_cgroup_account_bandwidth(PAGE_CACHE_SIZE);
 	}
 	return ret;
 }
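
For readers without the rest of the series: the calling convention these
hooks assume from percpu_ratelimit can be pictured as a simple deadline-based
token bucket. The single-CPU toy below only illustrates that convention
(charge() says whether the caller ran ahead of its quota and should sleep
until target()); it is not the per-CPU code from lib/percpu_ratelimit.c, the
toy_* names are made up, and the real helpers take no explicit clock
argument:

#include <stdbool.h>
#include <stdint.h>

struct toy_ratelimit {
	uint64_t quota;		/* events allowed per period, 0 == unlimited */
	uint64_t period_ns;	/* e.g. NSEC_PER_SEC */
	uint64_t deadline_ns;	/* virtual time when charged work is paid for */
	uint64_t total;		/* total events charged (fs_io_* statistics) */
};

static void toy_ratelimit_setup(struct toy_ratelimit *rl,
				uint64_t quota, uint64_t period_ns)
{
	rl->quota = quota;
	rl->period_ns = period_ns;
}

/* Charge "events" (bytes or operations) at time now_ns.  Returns true
 * when the caller is more than one period ahead of its budget. */
static bool toy_ratelimit_charge(struct toy_ratelimit *rl,
				 uint64_t events, uint64_t now_ns)
{
	rl->total += events;
	if (!rl->quota)
		return false;
	if (rl->deadline_ns < now_ns)
		rl->deadline_ns = now_ns;
	rl->deadline_ns += events * rl->period_ns / rl->quota;
	return rl->deadline_ns > now_ns + rl->period_ns;
}

static bool toy_ratelimit_blocked(const struct toy_ratelimit *rl,
				  uint64_t now_ns)
{
	return rl->quota && rl->deadline_ns > now_ns + rl->period_ns;
}

/* Absolute time the throttled task should sleep until. */
static uint64_t toy_ratelimit_target(const struct toy_ratelimit *rl)
{
	return rl->deadline_ns;
}

With the cftype entries above, the quotas themselves would be configured from
userspace through the new memory.fs_bps_limit and memory.fs_iops_limit files
(bytes and operations per second, respectively), and the charged totals show
up in memory.stat as fs_io_bytes and fs_io_operations.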


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/6] memcg: track shared inodes with dirty pages
  2015-01-15 18:49 ` [PATCH 3/6] memcg: track shared inodes with dirty pages Konstantin Khebnikov
@ 2015-01-15 18:55   ` Tejun Heo
  2015-01-15 19:04     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-01-15 18:55 UTC (permalink / raw)
  To: Konstantin Khebnikov
  Cc: linux-mm, cgroups, Roman Gushchin, Jan Kara, Dave Chinner,
	linux-kernel, linux-fsdevel, koct9i

On Thu, Jan 15, 2015 at 09:49:14PM +0300, Konstantin Khebnikov wrote:
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> Inode is owned only by one memory cgroup, but if it's shared it might
> contain pages from multiple cgroups. This patch detects this situation
> in memory reclaiemer and marks dirty inode with flag I_DIRTY_SHARED
> which is cleared only when data is completely written. Memcg writeback
> always writes such inodes.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This conflicts with the writeback cgroup support patchset which will
solve the writeback and memcg problem a lot more comprehensively.

 http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@kernel.org

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/6] memcg: track shared inodes with dirty pages
  2015-01-15 18:55   ` Tejun Heo
@ 2015-01-15 19:04     ` Konstantin Khlebnikov
  2015-01-15 19:08       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2015-01-15 19:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khebnikov, linux-mm, cgroups, Roman Gushchin,
	Jan Kara, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 15, 2015 at 9:55 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Jan 15, 2015 at 09:49:14PM +0300, Konstantin Khebnikov wrote:
>> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>
>> Inode is owned only by one memory cgroup, but if it's shared it might
>> contain pages from multiple cgroups. This patch detects this situation
>> in memory reclaiemer and marks dirty inode with flag I_DIRTY_SHARED
>> which is cleared only when data is completely written. Memcg writeback
>> always writes such inodes.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>
> This conflicts with the writeback cgroup support patchset which will
> solve the writeback and memcg problem a lot more comprehensively.
>
>  http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@kernel.org
>
> Thanks.

I know. An absolutely accurate per-page solution looks too complicated to me.
Is there any real demand for accurately handling the dirty set of shared
inodes? Doing the whole accounting on a per-inode basis makes life so much
easier.

>
> --
> tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/6] memcg: track shared inodes with dirty pages
  2015-01-15 19:04     ` Konstantin Khlebnikov
@ 2015-01-15 19:08       ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-01-15 19:08 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khebnikov, linux-mm, cgroups, Roman Gushchin,
	Jan Kara, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

Hello,

On Thu, Jan 15, 2015 at 11:04:49PM +0400, Konstantin Khlebnikov wrote:
> I know. Absolutely accurate per-page solution looks too complicated for me.
> Is there any real demand for accurate handling dirty set in shared inodes?
> Doing whole accounting in per-inode basis makes life so much easier.

Ah, yeah, patch #3 arrived in isolation, so I thought it was part of
something completely different.  I definitely thought about doing it
per-inode too (and also requiring memcg to attribute pages according
to their inode rather than individually).  I'll look into the
patchset and try to identify the pros and cons of our approaches.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
                   ` (5 preceding siblings ...)
  2015-01-15 18:49 ` [PATCH 6/6] memcg: filesystem bandwidth controller Konstantin Khebnikov
@ 2015-01-16  9:37 ` Jan Kara
  2015-01-16 12:33   ` Konstantin Khlebnikov
  2015-01-29  1:21 ` Tejun Heo
  7 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-01-16  9:37 UTC (permalink / raw)
  To: Konstantin Khebnikov
  Cc: linux-mm, cgroups, Roman Gushchin, Jan Kara, Dave Chinner,
	linux-kernel, Tejun Heo, linux-fsdevel, koct9i

  Hello,

On Thu 15-01-15 21:49:10, Konstantin Khebnikov wrote:
> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
> 
> That shows alternative solution: less accurate but much less monstrous than
> accurate page-based dirty-set controller from Tejun Heo.
> 
> Memory overhead: +1 pointer into struct address_space.
> Perfomance overhead is almost zero, no new locks added.
> 
> Idea is stright forward: link each inode to some cgroup when first dirty
> page appers and account all dirty pages to it. Writeback is implemented
> as single per-bdi writeback work which writes only inodes which belong
> to memory cgroups where amount of dirty memory is beyond thresholds.
> 
> Third patch adds trick for handling shared inodes which have dirty pages
> from several cgroups: it marks whole inode as shared and alters writeback
> filter for it.
> 
> The rest is an example of bandwith and iops controller build on top of that.
> Design is completely original, I bet nobody ever used task-works for that =)
  So I like the simplicity of your code but there are a few downsides too
(please correct me if I've got something wrong - btw, documentation of the
high-level design would be welcome so that one doesn't have to understand
it from the patches):
1) The bandwidth controller simply accounts the number of bytes submitted
for IO in submit_bio().  This doesn't reflect HW capabilities in any way.
There is a huge difference between a process submitting single-block random
IO and a process doing the same amount of sequential IO. This could be
somewhat dealt with by accounting not the number of bytes but rather the
time it took to complete a bio (that somewhat complicates the code, but the
block layer already does similar counting, so it would be good if you used
that); a sketch of such time-based accounting follows below.

2) The controller accounts a bio to the current task - that makes the
limiting useless for background writeback. You need to somehow propagate
i_memcg into submit_bio() so that IO is properly accounted.

3) The controller doesn't seem to guarantee any quality of service at the
IO level like blkcg does. The controller limits the amount of IO userspace
is able to submit to the kernel, but only after we decide to submit the IO
to disk. So by that time the cgroup may have generated lots of IO - e.g. by
dirtying lots of pages - and there's nothing to protect other cgroups from
starvation caused by the writeback of these pages.

Especially the last point seems to be essential to your approach (although
you could somewhat mitigate the issue by accounting the IO already when
it is entering the kernel) and I'm not sure whether that's really
acceptable for potential users of this feature.
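
To make the suggestion in 1) concrete, a purely illustrative sketch of
time-based accounting (this is not blkcg or block-layer code; every name
below is invented, and real code would hang the timestamp off struct bio or
struct request and resolve the owning cgroup at completion):

#include <stdint.h>

struct io_sample {
	uint64_t submit_ns;		/* clock value when the IO was issued */
};

struct io_time_account {
	uint64_t busy_ns;		/* device time charged to this group */
};

static void account_io_submit(struct io_sample *s, uint64_t now_ns)
{
	s->submit_ns = now_ns;
}

static void account_io_complete(struct io_time_account *acct,
				const struct io_sample *s, uint64_t now_ns)
{
	/* Charge elapsed service time instead of byte count, so a slow
	 * random 4k read costs more than a fast sequential 4k read. */
	acct->busy_ns += now_ns - s->submit_ns;
}

As noted later in the thread, this still has the NCQ problem: when several
requests are in flight their service times overlap, so the charged time is
only an approximation of the device time each group actually consumed.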

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
  2015-01-16  9:37 ` [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Jan Kara
@ 2015-01-16 12:33   ` Konstantin Khlebnikov
  2015-01-16 14:25     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2015-01-16 12:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Konstantin Khebnikov, linux-mm, cgroups, Roman Gushchin,
	Dave Chinner, Linux Kernel Mailing List, Tejun Heo,
	linux-fsdevel

On Fri, Jan 16, 2015 at 12:37 PM, Jan Kara <jack@suse.cz> wrote:
>   Hello,
>
> On Thu 15-01-15 21:49:10, Konstantin Khebnikov wrote:
>> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
>> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
>>
>> That shows alternative solution: less accurate but much less monstrous than
>> accurate page-based dirty-set controller from Tejun Heo.
>>
>> Memory overhead: +1 pointer into struct address_space.
>> Perfomance overhead is almost zero, no new locks added.
>>
>> Idea is stright forward: link each inode to some cgroup when first dirty
>> page appers and account all dirty pages to it. Writeback is implemented
>> as single per-bdi writeback work which writes only inodes which belong
>> to memory cgroups where amount of dirty memory is beyond thresholds.
>>
>> Third patch adds trick for handling shared inodes which have dirty pages
>> from several cgroups: it marks whole inode as shared and alters writeback
>> filter for it.
>>
>> The rest is an example of bandwith and iops controller build on top of that.
>> Design is completely original, I bet nobody ever used task-works for that =)
>   So I like the simplicity of your code but there are a few downsides too
> (please correct me if I've got something wrong - btw the documentation of
> high-level design would be welcome so that one doesn't have to understand
> that from the patches):

The rate-limiting design uses per-task delay injection: when the controller
sees that a task or its cgroup has done too much IO, it schedules a penalty
for that task. This is similar to balance_dirty_pages(), but the approach
extends that logic to any kind of IO and doesn't require a special point
where the task checks its balance, because the delay is injected via a
task-work which runs when the task returns to userspace.
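
A rough, self-contained sketch of that mechanism (this is not the actual
kernel/sched code from patch 5/6: the real series keeps the pending delay
in task_struct, while the sketch below allocates it separately, and the
inject_delay() signature here is an assumption):

#include <linux/hrtimer.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

struct delay_work {
	struct callback_head cb;
	ktime_t target;			/* absolute deadline to sleep until */
};

/* Runs from task_work, i.e. on the task's way back to userspace. */
static void delay_work_fn(struct callback_head *head)
{
	struct delay_work *dw = container_of(head, struct delay_work, cb);
	ktime_t target = dw->target;

	kfree(dw);
	set_current_state(TASK_KILLABLE);
	schedule_hrtimeout(&target, HRTIMER_MODE_ABS);
}

/* Called from the accounting hooks: defer the penalty instead of
 * sleeping right in the middle of filesystem or MM code.
 * GFP_ATOMIC because callers may hold rcu_read_lock(). */
static int inject_delay(ktime_t target)
{
	struct delay_work *dw = kmalloc(sizeof(*dw), GFP_ATOMIC);
	int ret;

	if (!dw)
		return -ENOMEM;
	dw->target = target;
	init_task_work(&dw->cb, delay_work_fn);
	ret = task_work_add(current, &dw->cb, true);	/* notify so it runs promptly */
	if (ret)
		kfree(dw);	/* task is exiting, drop the penalty */
	return ret;
}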

> 1) The bandwidth controller simply accounts number of bytes submitted for
> IO in submit_bio().  This doesn't reflect HW capabilities in any way. There
> a huge difference between a process submitting single block random IO and a
> process doing the same amount of sequential IO. This could be somewhat
> dealt with by not accounting number of bytes but rather time it took to
> complete a bio (but that somewhat complicates the code and block layer
> already does similar counting so it would be good you used that).

Yes, it is. But completion time works accurately only for simple disks with
a single-depth queue. For disks with NCQ, completion time often has no
relation to the actual complexity of the requests.

We could use it as a third metric in addition to bandwidth and iops, or
combine all of them into some abstract disk utilization; either way,
splitting the accounting and scheduling phases gives more flexibility.

>
> 2) The controller accounts bio to current task - that makes the limiting
> useless for background writeback. You need to somehow propagate i_memcg
> into submit_bio() so that IO is properly accounted.

It would be nice to get information about the randomness of the issued
writeback, but I think propagating the disk bandwidth limit, and especially
the iops limit, into writeback is almost useless: we must ratelimit the task
which generates the data flow before it generates the next bunch of dirty
memory.

In some cases it's possible to slow down writeback, but journalled
filesystems often require writing everything out to close a transaction.

>
> 3) The controller doesn't seem to guarantee any quality of service at the
> IO level like blkcg does. The controller limits amount of IO userspace is
> able to submit to kernel but only after we decide to submit the IO to disk.
> So at that time cgroup may have generated lots of IO - e.g. by dirtying lots
> of pages - and there's nothing to protect other cgroups from starvation
> because of writeback of these pages.
>
> Especially the last point seems to be essential to your approach (although
> you could somewhat mitigate the issue by accounting the IO already when
> it is entering the kernel) and I'm not sure whether that's really
> acceptable for potential users of this feature.

I try to limit the amount of dirty memory and the speed of generating new
dirty pages after crossing the threshold. Probably it's also possible to
limit the speed at which pages switch from the "dirty" to the "towrite"
state. Thus a memcg could have a lot of dirty pages but couldn't trigger
immediate writeback for all of them.

I think it's possible to build a solid IO scheduler using that approach:
this controller provides only single static limits for bandwidth and iops,
but they might be balanced automatically depending on disk speed and
estimated load.

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
  2015-01-16 12:33   ` Konstantin Khlebnikov
@ 2015-01-16 14:25     ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2015-01-16 14:25 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Jan Kara, Konstantin Khebnikov, linux-mm, cgroups,
	Roman Gushchin, Dave Chinner, Linux Kernel Mailing List,
	Tejun Heo, linux-fsdevel

On Fri 16-01-15 15:33:48, Konstantin Khlebnikov wrote:
> On Fri, Jan 16, 2015 at 12:37 PM, Jan Kara <jack@suse.cz> wrote:
> >   Hello,
> >
> > On Thu 15-01-15 21:49:10, Konstantin Khebnikov wrote:
> >> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
> >> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
> >>
> >> That shows alternative solution: less accurate but much less monstrous than
> >> accurate page-based dirty-set controller from Tejun Heo.
> >>
> >> Memory overhead: +1 pointer into struct address_space.
> >> Perfomance overhead is almost zero, no new locks added.
> >>
> >> Idea is stright forward: link each inode to some cgroup when first dirty
> >> page appers and account all dirty pages to it. Writeback is implemented
> >> as single per-bdi writeback work which writes only inodes which belong
> >> to memory cgroups where amount of dirty memory is beyond thresholds.
> >>
> >> Third patch adds trick for handling shared inodes which have dirty pages
> >> from several cgroups: it marks whole inode as shared and alters writeback
> >> filter for it.
> >>
> >> The rest is an example of bandwith and iops controller build on top of that.
> >> Design is completely original, I bet nobody ever used task-works for that =)
> >   So I like the simplicity of your code but there are a few downsides too
> > (please correct me if I've got something wrong - btw the documentation of
> > high-level design would be welcome so that one doesn't have to understand
> > that from the patches):
> 
> Rate-limiting design uses per-task delay injection when controller sees
> that this task or cgroup have done too much IO. This is similar to
> balance_dirty_pages but this approach extends this logic to any kind of
> IO and doesn't require special point where task checks balance because
> delay ejected in task-work which runs when task returns into userspace.
> 
> > 1) The bandwidth controller simply accounts number of bytes submitted for
> > IO in submit_bio().  This doesn't reflect HW capabilities in any way. There
> > a huge difference between a process submitting single block random IO and a
> > process doing the same amount of sequential IO. This could be somewhat
> > dealt with by not accounting number of bytes but rather time it took to
> > complete a bio (but that somewhat complicates the code and block layer
> > already does similar counting so it would be good you used that).
> 
> Yes, it is. But completion time works accurately only for simple disks
> with single depth queue. For disk with NCQ completion time often have no
> relation to actual complexity of requests.
  Well, in the same way I could say that the number of bytes submitted
often has no relation to the actual complexity of requests. :) But the fact
that there are more IO requests running in parallel is the reason why the
blkcg code allows only requests from a single blkcg to run in the HW, so
that we know to whom the time should be accounted. Of course, this means a
non-trivial overhead when switching between time slices for different
blkcgs, so total system throughput is affected. So either approach has its
pros and cons.

> We could use it as third metric in addition to bandwidth and iops or
> combine all of them into some abstract disk utilization, anyway splitting
> accounting and scheduling phases gives more flexibility.
  Yes, I agree that splitting accounting and scheduling of pauses gives
more flexibility.

> > 2) The controller accounts bio to current task - that makes the limiting
> > useless for background writeback. You need to somehow propagate i_memcg
> > into submit_bio() so that IO is properly accounted.
> 
> It would be nice to get information about randomness of issued writeback
> but I think propagation disk bandwidth limit and especially iops limit
> into writeback is almost useless, we must ratelimit task which generates
> data flow before it generations next bunch of dirty memory.
  I think we misunderstood each other here. The point I was trying to make
was that I didn't see how the accounting of writes works with your patches.
But after reading your patches again I have noticed that you end up
inserting the delay from balance_dirty_pages(), which is the point I
originally missed.

> In some cases it's possible to slowdown writeback but journalled
> filesystems often require write everything to close transaction.
  Well, yes, that's a concern mostly for metadata (which e.g. Tejun keeps
unaccounted) but ext4 can have this problem for data in some cases as well.

> > 3) The controller doesn't seem to guarantee any quality of service at the
> > IO level like blkcg does. The controller limits amount of IO userspace is
> > able to submit to kernel but only after we decide to submit the IO to disk.
> > So at that time cgroup may have generated lots of IO - e.g. by dirtying lots
> > of pages - and there's nothing to protect other cgroups from starvation
> > because of writeback of these pages.
> >
> > Especially the last point seems to be essential to your approach (although
> > you could somewhat mitigate the issue by accounting the IO already when
> > it is entering the kernel) and I'm not sure whether that's really
> > acceptable for potential users of this feature.
> 
> I try to limit amount of dirty memory and speed of generation new dirty
> pages after crossing threshold. Probably it's also possible to limit
> speed of switching pages from "dirty" to "towrite" state. Thus memcg
> could have a lot of dirty pages but couldn't trigger immediate writeback
> for all of them.
  Well, with your way of throttling it's also possible to submit e.g. a
large batch of AIO reads or AIO writes, and the process will get blocked
only after all the IO has been submitted and the system is hogged. So IMHO
it's hard to guarantee anything on a small scale with your patches.

> I think it's possible to build solid io scheduler using that approach:
> this controller proves only single static limits for bandwidth and iops,
> but they might be balanced automatically depending on disk speed and
> estimated load.
  
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
  2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
                   ` (6 preceding siblings ...)
  2015-01-16  9:37 ` [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Jan Kara
@ 2015-01-29  1:21 ` Tejun Heo
  7 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-01-29  1:21 UTC (permalink / raw)
  To: Konstantin Khebnikov
  Cc: linux-mm, cgroups, Roman Gushchin, Jan Kara, Dave Chinner,
	linux-kernel, linux-fsdevel, koct9i

Hello, Konstantin.

Sorry about the delay.

On Thu, Jan 15, 2015 at 09:49:10PM +0300, Konstantin Khebnikov wrote:
> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
> 
> That shows alternative solution: less accurate but much less monstrous than
> accurate page-based dirty-set controller from Tejun Heo.

I went over the whole patchset and ISTR having reviewed this a while
ago, and the conclusion is the same.  This appears to be simpler on the
surface, but, to put it nicely, it is a hackjob of a design.  You're
working around the complexity of pressure propagation from the lower
layer by building a separate pressure layer at the topmost layer.  In
doing so, it duplicates what already exists below in degenerate forms,
but at the cost of fundamentally crippling the whole thing.

This, even in its current simplistic form, is already a dead end.
E.g. iops or bw aren't even the proper resources to distribute for
rotating disks - IO time is, which is what a large proportion of cfq is
trying to estimate and distribute.  What if there are multiple
filesystems on a single device?  Or if a cgroup accesses multiple
backing devices?  How would you guarantee low-latency access to a
high-priority cgroup while a huge inode from a low-pri cgroup is being
written out, when none of the lower layers have any idea what they're
doing?

Sure, these issues can partially be dealt with through various
workarounds and additions, and I'm sure we'll be doing that if we go
down this path, but the only thing that will lead to is duplicating more
of what's already in the lower layers, with an ever-growing list of
behavioral and interface oddities which are inherent to the design.

Even in the absence of alternatives, I'd be strongly against this
direction.  I think this sort of ad-hoc "let's solve this one
immediate issue in the easiest way possible" is often worse than not
doing anything.  In the longer term, things like this paint us into a
corner we can't easily get out of, and memcg happens to be an area
where that sort of thing took place quite a bit in the past and people
have been desperately trying to right the course, so, no, I don't
think this is happening.

I agree that propagating backpressure from the lower layer involves
more complexity, but it is a full and conceptually and design-wise
straightforward solution which doesn't need to get constantly papered
over.  This is the right thing to do.  It can be argued that the
amount of complexity can be reduced by tracking dirty pages per-inode,
but, if we're gonna do that, we should convert memcg itself to be
per-address-space too.  The arguments would be exactly the same for
memcg, and memcg and writeback must be on the same foundation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-01-29  1:21 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 2/6] memcg: dirty-set limiting and filtered writeback Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 3/6] memcg: track shared inodes with dirty pages Konstantin Khebnikov
2015-01-15 18:55   ` Tejun Heo
2015-01-15 19:04     ` Konstantin Khlebnikov
2015-01-15 19:08       ` Tejun Heo
2015-01-15 18:49 ` [PATCH 4/6] percpu_ratelimit: high-performance ratelimiting counter Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 5/6] delay-injection: resource management via procrastination Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 6/6] memcg: filesystem bandwidth controller Konstantin Khebnikov
2015-01-16  9:37 ` [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Jan Kara
2015-01-16 12:33   ` Konstantin Khlebnikov
2015-01-16 14:25     ` Jan Kara
2015-01-29  1:21 ` Tejun Heo
