From: Hillf Danton <hdanton@sina.com>
To: Roman Gushchin
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, tj@kernel.org, Jan Kara
Subject: Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
Date: Mon, 7 Oct 2019 14:01:44 +0800
Message-Id: <20191007060144.12416-1-hdanton@sina.com>

On Fri, 4 Oct 2019 15:11:04 -0700 Roman Gushchin wrote:
>
> This is an RFC patch, which is not intended to be merged as is,
> but hopefully will start a discussion which can result in a good
> solution for the described problem.
> --
> We've noticed that the number of dying cgroups on our production hosts
> tends to grow with the uptime. This time it's caused by the writeback
> code.
>
> An inode which is getting dirty for the first time is associated
> with the wb structure (see __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but it generally
> stays associated up to the end of the inode structure's life.
>
> The problem is that the wb structure holds a reference to the original
> memory cgroup. So if the inode was dirty once, it has a good chance
> of pinning down the original memory cgroup.
>
> An example from real life: some service runs periodically and
> updates rpm packages, each time in a new memory cgroup. The installed
> .so files are heavily used by other cgroups, so the corresponding inodes
> tend to stay alive for a long time, and so do the pinned memory cgroups.
> In production I've seen many hosts with 1-2 thousand dying cgroups.
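To make the pinning chain described above concrete, here is a rough,
simplified model; the types and names below are invented for illustration
and are not the real kernel structures.

#include <stdbool.h>

/*
 * Simplified stand-ins for the kernel structures involved; the names
 * and fields are made up for illustration and are not the real types.
 */
struct memcg_css_model {
	int refcnt;			/* css reference count */
	bool online;			/* false once the cgroup is removed */
};

struct wb_model {
	struct memcg_css_model *memcg_css;	/* reference held for the wb's whole lifetime */
};

struct inode_model {
	struct wb_model *i_wb;		/* set on first dirtying, rarely cleared */
};

/* A cgroup writeback structure pins its memcg css when it is created... */
void wb_create_model(struct wb_model *wb, struct memcg_css_model *css)
{
	css->refcnt++;
	wb->memcg_css = css;
}

/*
 * ...and a rough analogue of __inode_attach_wb(): once a dirtied inode
 * points at the wb, the wb (and hence the css reference taken above)
 * stays alive until the inode is destroyed or switched to another wb.
 */
void inode_attach_wb_model(struct inode_model *inode, struct wb_model *wb)
{
	inode->i_wb = wb;
}

Once the cgroup is removed, its css is offlined but cannot be released
while such references remain, which is what leaves the "dying" cgroups
around.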
The diff below fixes e8a7abf5a5bd ("writeback: disassociate inodes from
dying bdi_writebacks") by selecting a new memcg_css id for the dying
bdi_writeback to switch to. A check for an offline memcg is also added,
which is perhaps what is needed in your case.

Let us know if it makes sense and helps you cut the number of dying
cgroups down a bit.

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -552,6 +552,8 @@ out_free:
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	int new_id = 0;
+
 	if (!inode_cgwb_enabled(inode)) {
 		spin_unlock(&inode->i_lock);
 		return;
@@ -560,6 +562,22 @@ void wbc_attach_and_unlock_inode(struct
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
+	if (unlikely(wb_dying(wbc->wb)) ||
+	    !mem_cgroup_from_css(wbc->wb->memcg_css)->cgwb_list.next) {
+		int id = wbc->wb->memcg_css->id;
+		/*
+		 * any css id is fine in order to let the dying/offline
+		 * memcg be reaped
+		 */
+		if (id != wbc->wb_id && wbc->wb_id)
+			new_id = wbc->wb_id;
+		else if (id != wbc->wb_lcand_id && wbc->wb_lcand_id)
+			new_id = wbc->wb_lcand_id;
+		else if (id != wbc->wb_tcand_id && wbc->wb_tcand_id)
+			new_id = wbc->wb_tcand_id;
+		else
+			new_id = inode_to_bdi(inode)->wb.memcg_css->id;
+	}
 	wbc->wb_id = wbc->wb->memcg_css->id;
 	wbc->wb_lcand_id = inode->i_wb_frn_winner;
 	wbc->wb_tcand_id = 0;
@@ -574,8 +592,8 @@ void wbc_attach_and_unlock_inode(struct
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg. Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
-		inode_switch_wbs(inode, wbc->wb_id);
+	if (new_id)
+		inode_switch_wbs(inode, new_id);
 }
 EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
 
--
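For reference, the candidate selection in the second hunk boils down to
the fallback chain below; this is only a standalone restatement with
simplified parameter names, not the kernel code itself.

/*
 * Illustrative restatement of the id selection above: prefer a
 * previously recorded candidate whose id differs from the dying wb's
 * own memcg_css id, and fall back to the bdi's root wb as a last
 * resort.  Parameter names loosely mirror the wbc fields.
 */
int pick_switch_target(int cur_id, int wb_id, int lcand_id,
		       int tcand_id, int bdi_root_id)
{
	if (wb_id && wb_id != cur_id)
		return wb_id;
	if (lcand_id && lcand_id != cur_id)
		return lcand_id;
	if (tcand_id && tcand_id != cur_id)
		return tcand_id;
	return bdi_root_id;
}

The resulting new_id is then what gets passed to inode_switch_wbs() at
the end of wbc_attach_and_unlock_inode() instead of wbc->wb_id.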