linux-mm.kvack.org archive mirror
* [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
@ 2019-10-04 22:11 Roman Gushchin
  2019-10-07 14:57 ` Vlastimil Babka
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Roman Gushchin @ 2019-10-04 22:11 UTC
  To: linux-mm, linux-fsdevel
  Cc: linux-kernel, kernel-team, tj, Jan Kara, Roman Gushchin

This is an RFC patch that is not intended to be merged as is,
but will hopefully start a discussion that leads to a good
solution for the described problem.

--

We've noticed that the number of dying cgroups on our production hosts
tends to grow with uptime. This time it's caused by the writeback
code.

An inode that is dirtied for the first time is associated
with a wb structure (see __inode_attach_wb()). It can later
be switched to another wb under some conditions (e.g. some other
cgroup is writing a lot of data to the same inode), but it generally
stays associated until the end of the inode structure's life.

The problem is that the wb structure holds a reference to the original
memory cgroup. So if the inode was dirty once, it has a good chance
to pin down the original memory cgroup.
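
The reference chain described above can be sketched as a small userspace
model. This is only an illustration: the struct and function names below
(memcg_model, wb_model, inode_model and the helpers) are hypothetical
stand-ins, not kernel types, and the refcounting is reduced to plain
integers under the assumption that only inodes keep a dying wb alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins, not the kernel structures. */
struct memcg_model { int refs; bool offline; };
struct wb_model    { int refs; struct memcg_model *memcg; };
struct inode_model { struct wb_model *wb; };

/* Creating a wb takes a reference on the memcg it belongs to. */
static void wb_init(struct wb_model *wb, struct memcg_model *memcg)
{
	wb->refs = 0;
	wb->memcg = memcg;
	memcg->refs++;
}

/* First dirtying: the inode is attached to a wb and holds a reference to it. */
static void inode_attach_wb(struct inode_model *inode, struct wb_model *wb)
{
	inode->wb = wb;
	wb->refs++;
}

/* Detaching drops the wb reference; the last drop releases the memcg too. */
static void inode_detach_wb(struct inode_model *inode)
{
	struct wb_model *wb = inode->wb;

	assert(wb == NULL || wb->refs > 0);
	inode->wb = NULL;
	if (wb && --wb->refs == 0)
		wb->memcg->refs--;
}

/* In this model an offline memcg can be freed only with no references left. */
static bool memcg_can_free(const struct memcg_model *memcg)
{
	return memcg->offline && memcg->refs == 0;
}
```

In this model, a memcg that has gone offline stays unfreeable for as long
as a single once-dirtied inode remains attached to its wb, which is the
pinning effect described above.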

A real-life example: a service runs periodically and updates rpm
packages, each time in a new memory cgroup. The installed .so files
are heavily used by other cgroups, so the corresponding inodes tend
to stay alive for a long time, and so do the pinned memory cgroups.
In production I've seen many hosts with 1-2 thousand dying
cgroups.

This is not the first problem with dying memory cgroups. As
always, the problem is their relative size: memory cgroups
are large objects, easily 100x-1000x larger than inodes. So keeping
a couple of thousand dying cgroups in memory without a good reason
(which we easily do with inodes) is quite costly: the overhead is
measured in tens to hundreds of MB.

One possible approach to this problem is to switch inodes associated
with dying wbs to the root wb. Switching is a best-effort operation
that can fail silently, so unfortunately we can't just make a single
pass over a list of associated inodes (even if we had such a list).
So we really have to scan all inodes.
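
Why a best-effort switch forces repeated scans can be illustrated with a
userspace model. Everything here is an assumption made for illustration:
the types, the `busy` counter that makes a switch attempt fail, and the
boolean return value of try_switch_wb() (the real inode_switch_wbs()
returns void, so a caller cannot see the failure at all).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins, not the kernel structures. */
struct wb_model { bool dying; };

struct inode_model {
	struct wb_model *wb;
	int busy;	/* hypothetical: number of upcoming switch attempts that fail */
};

/* Best-effort switch: changes nothing and returns false while the inode is busy. */
static bool try_switch_wb(struct inode_model *inode, struct wb_model *new_wb)
{
	if (inode->busy > 0) {
		inode->busy--;
		return false;
	}
	inode->wb = new_wb;
	return true;
}

/* One full scan; returns how many inodes are still attached to dying wbs. */
static size_t scan_and_reparent(struct inode_model *inodes, size_t n,
				struct wb_model *root_wb)
{
	size_t left = 0;

	assert(!root_wb->dying);
	for (size_t i = 0; i < n; i++) {
		struct inode_model *inode = &inodes[i];

		if (!inode->wb || !inode->wb->dying)
			continue;
		if (!try_switch_wb(inode, root_wb))
			left++;
	}
	return left;
}
```

An inode whose switch fails during one scan stays on the dying wb and is
only picked up by a later scan, so a one-shot walk over a list would leave
such inodes behind.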

In the proposed patch I schedule a work item on each memory cgroup
deletion, which is probably too often. Alternatively, we can do it
periodically under some conditions (e.g. when the number of dying memory
cgroups is larger than X). So it's basically a gc run.
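
The threshold-based alternative could look roughly like the following
userspace sketch. It is not part of the patch: the gc_state structure, its
fields, and both functions are hypothetical, and the `threshold` field
simply stands for the "X" mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: hypothetical bookkeeping for a gc-style trigger. */
struct gc_state {
	unsigned int nr_dying;		/* dying memcgs not yet released */
	unsigned int threshold;		/* hypothetical tunable, the "X" above */
	bool scan_pending;		/* a scan is already queued */
	unsigned int scans_queued;	/* how many scans were queued in total */
};

/* Called on each memcg deletion; returns true if this call queued a scan. */
static bool memcg_offline_maybe_gc(struct gc_state *gc)
{
	gc->nr_dying++;
	if (gc->nr_dying <= gc->threshold || gc->scan_pending)
		return false;
	gc->scan_pending = true;
	gc->scans_queued++;
	return true;
}

/* Called when a scan finishes; 'released' is how many dying memcgs it freed. */
static void gc_scan_done(struct gc_state *gc, unsigned int released)
{
	assert(released <= gc->nr_dying);
	gc->nr_dying -= released;
	gc->scan_pending = false;
}
```

With a threshold of 2, for example, the first two deletions queue nothing,
the third queues one scan, and further deletions are absorbed until that
scan completes.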

I wonder if there are any better ideas?

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 fs/fs-writeback.c | 29 +++++++++++++++++++++++++++++
 mm/memcontrol.c   |  5 +++++
 2 files changed, 34 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 542b02d170f8..4bbc9a200b2c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -545,6 +545,35 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	up_read(&bdi->wb_switch_rwsem);
 }
 
+static void reparent_dirty_inodes_one_sb(struct super_block *sb, void *arg)
+{
+	struct inode *inode, *next;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		if (inode->i_wb && wb_dying(inode->i_wb)) {
+			spin_unlock(&inode->i_lock);
+			inode_switch_wbs(inode, root_mem_cgroup->css.id);
+			continue;
+		}
+
+		spin_unlock(&inode->i_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+
+}
+
+void reparent_dirty_inodes(struct work_struct *work)
+{
+	iterate_supers(reparent_dirty_inodes_one_sb, NULL);
+}
+
 /**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ec5e12486a7..ea8bc8d1403b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4911,6 +4911,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+extern void reparent_dirty_inodes(struct work_struct *w);
+static DECLARE_WORK(dirty_work, reparent_dirty_inodes);
+
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -4934,6 +4937,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
 
+	schedule_work(&dirty_work);
+
 	drain_all_stock(memcg);
 
 	mem_cgroup_id_put(memcg);
-- 
2.21.0



* Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
@ 2019-10-07  6:01 Hillf Danton
  2019-10-07 22:02 ` Roman Gushchin
  0 siblings, 1 reply; 11+ messages in thread
From: Hillf Danton @ 2019-10-07  6:01 UTC
  To: Roman Gushchin
  Cc: linux-mm, linux-fsdevel, linux-kernel, kernel-team, tj, Jan Kara


On Fri, 4 Oct 2019 15:11:04 -0700 Roman Gushchin wrote:
> 
> This is an RFC patch that is not intended to be merged as is,
> but will hopefully start a discussion that leads to a good
> solution for the described problem.
> --
> We've noticed that the number of dying cgroups on our production hosts
> tends to grow with the uptime. This time it's caused by the writeback
> code.
> 
> An inode which is getting dirty for the first time is associated
> with the wb structure (look at __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but generally
> stays associated up to the end of life of the inode structure.
> 
> The problem is that the wb structure holds a reference to the original
> memory cgroup. So if the inode was dirty once, it has a good chance
> to pin down the original memory cgroup.
> 
> A real-life example: a service runs periodically and updates rpm
> packages, each time in a new memory cgroup. The installed .so files
> are heavily used by other cgroups, so the corresponding inodes tend
> to stay alive for a long time, and so do the pinned memory cgroups.
> In production I've seen many hosts with 1-2 thousand dying
> cgroups.

The diff below fixes e8a7abf5a5bd ("writeback: disassociate inodes
from dying bdi_writebacks") by selecting a new memcg_css id for an
inode on a dying bdi_writeback to switch to.
It also adds a check for an offline memcg, which is perhaps what your
case needs. Let us know whether it helps you cut down the number of
dying cgroups.
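
The id selection in the diff below can be read as a pure function. The
following userspace sketch mirrors its order of preference. The function
name pick_new_wb_id() and its parameters are hypothetical: `cur_id` stands
for the css id of the current (dying or offline) wb, and `root_id` for the
id taken from the bdi's embedded wb, assumed here to be positive.

```c
#include <assert.h>

/*
 * Userspace model only. Pick a css id to switch to, in order of preference:
 * the recorded wb id, the last-round candidate, the this-round candidate,
 * and finally the bdi's root wb id. A candidate is skipped when it is
 * unset (0) or equal to the id of the wb being left.
 */
static int pick_new_wb_id(int cur_id, int wb_id, int lcand_id, int tcand_id,
			  int root_id)
{
	/* Model assumption: 0 means "unset", so a usable fallback is positive. */
	assert(root_id > 0);

	if (wb_id && wb_id != cur_id)
		return wb_id;
	if (lcand_id && lcand_id != cur_id)
		return lcand_id;
	if (tcand_id && tcand_id != cur_id)
		return tcand_id;
	return root_id;
}
```

For instance, pick_new_wb_id(5, 5, 9, 0, 1) skips the recorded id because
it equals the current one and returns the last-round candidate, 9.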

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -552,6 +552,8 @@ out_free:
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	int new_id = 0;
+
 	if (!inode_cgwb_enabled(inode)) {
 		spin_unlock(&inode->i_lock);
 		return;
@@ -560,6 +562,22 @@ void wbc_attach_and_unlock_inode(struct
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
+	if (unlikely(wb_dying(wbc->wb)) ||
+	    !mem_cgroup_from_css(wbc->wb->memcg_css)->cgwb_list.next) {
+		int id = wbc->wb->memcg_css->id;
+		/*
+		 * any css id is fine in order to let dying/offline
+		 * memcg reap
+		 */
+		if (id != wbc->wb_id && wbc->wb_id)
+			new_id = wbc->wb_id;
+		else if (id != wbc->wb_lcand_id && wbc->wb_lcand_id)
+			new_id = wbc->wb_lcand_id;
+		else if (id != wbc->wb_tcand_id && wbc->wb_tcand_id)
+			new_id = wbc->wb_tcand_id;
+		else
+			new_id = inode_to_bdi(inode)->wb.memcg_css->id;
+	}
 	wbc->wb_id = wbc->wb->memcg_css->id;
 	wbc->wb_lcand_id = inode->i_wb_frn_winner;
 	wbc->wb_tcand_id = 0;
@@ -574,8 +592,8 @@ void wbc_attach_and_unlock_inode(struct
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg.  Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
-		inode_switch_wbs(inode, wbc->wb_id);
+	if (new_id)
+		inode_switch_wbs(inode, new_id);
 }
 EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
 
--




Thread overview (11+ messages):
2019-10-04 22:11 [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups Roman Gushchin
2019-10-07 14:57 ` Vlastimil Babka
2019-10-07 23:35   ` Roman Gushchin
2019-10-07 16:19 ` Michal Koutný
2019-10-07 23:24   ` Roman Gushchin
2019-10-08  4:06 ` Dave Chinner
     [not found]   ` <20191008053854.GA14951@castle.dhcp.thefacebook.com>
2019-10-08  8:20     ` Jan Kara
2019-10-09  5:19       ` Roman Gushchin
2019-10-09 21:48       ` Roman Gushchin
2019-10-07  6:01 Hillf Danton
2019-10-07 22:02 ` Roman Gushchin
