From: Hillf Danton <hdanton@sina.com>
To: Roman Gushchin
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, tj@kernel.org, Jan Kara
Subject: Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
Date: Mon, 7 Oct 2019 14:01:44 +0800
Message-Id: <20191007060144.12416-1-hdanton@sina.com>

On Fri, 4 Oct 2019 15:11:04 -0700 Roman Gushchin wrote:
>
> This is an RFC patch, which is not intended to be merged as is,
> but hopefully will start a discussion which can result in a good
> solution for the described problem.
> --
> We've noticed that the number of dying cgroups on our production hosts
> tends to grow with the uptime. This time it's caused by the writeback
> code.
>
> An inode which is getting dirty for the first time is associated
> with the wb structure (see __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but it generally
> stays associated up to the end of the inode structure's life.
>
> The problem is that the wb structure holds a reference to the original
> memory cgroup. So if the inode was dirty once, it has a good chance
> of pinning down the original memory cgroup.
>
> An example from real life: some service runs periodically and
> updates rpm packages, each time in a new memory cgroup. The installed
> .so files are heavily used by other cgroups, so the corresponding inodes
> tend to stay alive for a long time, and so do the pinned memory cgroups.
> In production I've seen many hosts with 1-2 thousand dying cgroups.
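To make the pinning chain described above concrete, here is a rough,
simplified model; the types and names below are invented for illustration
and are not the real kernel structures.

#include <stdbool.h>

/*
 * Simplified stand-ins for the kernel structures involved; the names
 * and fields are made up for illustration and are not the real types.
 */
struct memcg_css_model {
	int refcnt;			/* css reference count */
	bool online;			/* false once the cgroup is removed */
};

struct wb_model {
	struct memcg_css_model *memcg_css;	/* reference held for the wb's whole lifetime */
};

struct inode_model {
	struct wb_model *i_wb;		/* set on first dirtying, rarely cleared */
};

/* A cgroup writeback structure pins its memcg css when it is created... */
void wb_create_model(struct wb_model *wb, struct memcg_css_model *css)
{
	css->refcnt++;
	wb->memcg_css = css;
}

/*
 * ...and a rough analogue of __inode_attach_wb(): once a dirtied inode
 * points at the wb, the wb (and hence the css reference taken above)
 * stays alive until the inode is destroyed or switched to another wb.
 */
void inode_attach_wb_model(struct inode_model *inode, struct wb_model *wb)
{
	inode->i_wb = wb;
}

Once the cgroup is removed, its css is offlined but cannot be released
while such references remain, which is what leaves the "dying" cgroups
around.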
The diff below fixes e8a7abf5a5bd ("writeback: disassociate inodes from
dying bdi_writebacks") by selecting a new memcg_css id for the dying
bdi_writeback to switch to. A check for an offline memcg is also added,
which is perhaps what is needed in your case.

Let us know if it makes sense and helps you cut the number of dying
cgroups down a bit.

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -552,6 +552,8 @@ out_free:
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	int new_id = 0;
+
 	if (!inode_cgwb_enabled(inode)) {
 		spin_unlock(&inode->i_lock);
 		return;
@@ -560,6 +562,22 @@ void wbc_attach_and_unlock_inode(struct
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
+	if (unlikely(wb_dying(wbc->wb)) ||
+	    !mem_cgroup_from_css(wbc->wb->memcg_css)->cgwb_list.next) {
+		int id = wbc->wb->memcg_css->id;
+		/*
+		 * any css id is fine in order to let the dying/offline
+		 * memcg be reaped
+		 */
+		if (id != wbc->wb_id && wbc->wb_id)
+			new_id = wbc->wb_id;
+		else if (id != wbc->wb_lcand_id && wbc->wb_lcand_id)
+			new_id = wbc->wb_lcand_id;
+		else if (id != wbc->wb_tcand_id && wbc->wb_tcand_id)
+			new_id = wbc->wb_tcand_id;
+		else
+			new_id = inode_to_bdi(inode)->wb.memcg_css->id;
+	}
 	wbc->wb_id = wbc->wb->memcg_css->id;
 	wbc->wb_lcand_id = inode->i_wb_frn_winner;
 	wbc->wb_tcand_id = 0;
@@ -574,8 +592,8 @@ void wbc_attach_and_unlock_inode(struct
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg. Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
-		inode_switch_wbs(inode, wbc->wb_id);
+	if (new_id)
+		inode_switch_wbs(inode, new_id);
 }
 EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
 
--
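For reference, the candidate selection in the second hunk boils down to
the fallback chain below; this is only a standalone restatement with
simplified parameter names, not the kernel code itself.

/*
 * Illustrative restatement of the id selection above: prefer a
 * previously recorded candidate whose id differs from the dying wb's
 * own memcg_css id, and fall back to the bdi's root wb as a last
 * resort.  Parameter names loosely mirror the wbc fields.
 */
int pick_switch_target(int cur_id, int wb_id, int lcand_id,
		       int tcand_id, int bdi_root_id)
{
	if (wb_id && wb_id != cur_id)
		return wb_id;
	if (lcand_id && lcand_id != cur_id)
		return lcand_id;
	if (tcand_id && tcand_id != cur_id)
		return tcand_id;
	return bdi_root_id;
}

The resulting new_id is then what gets passed to inode_switch_wbs() at
the end of wbc_attach_and_unlock_inode() instead of wbc->wb_id.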