From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song
To: guro@fb.com, hannes@cmpxchg.org, mhocko@kernel.org, akpm@linux-foundation.org, shakeelb@google.com, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, fam.zheng@bytedance.com, bsingharora@gmail.com, shy828301@gmail.com, alexs@kernel.org, smuchun@gmail.com, zhengqi.arch@bytedance.com, Muchun Song
Subject: [PATCH v3 09/12] mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
Date: Wed, 16 Feb 2022 19:51:29 +0800
Message-Id: <20220216115132.52602-10-songmuchun@bytedance.com>
X-Mailer: git-send-email 2.32.0 (Apple Git-132)
In-Reply-To: <20220216115132.52602-1-songmuchun@bytedance.com>
References: <20220216115132.52602-1-songmuchun@bytedance.com>
MIME-Version: 1.0
Sender: owner-linux-mm@kvack.org
Precedence: bulk

We will reuse the obj_cgroup APIs to charge the LRU pages. After that,
page->memcg_data will have two different meanings:

  - For the slab pages, page->memcg_data points to an object cgroups
    vector.

  - For the kmem pages (excluding the slab pages) and the LRU pages,
    page->memcg_data points to an object cgroup.

In this patch, we reuse the obj_cgroup APIs to charge the LRU pages. As
a result, long-living page cache pages can no longer pin the original
memory cgroup in memory.

At the same time, the rules for page and objcg or memcg binding
stability change. The new rules are as follows.

For a page, any of the following ensures page and objcg binding
stability:

  - the page lock
  - LRU isolation
  - lock_page_memcg()
  - exclusive reference

Based on the stable binding of page and objcg, any of the following
ensures page and memcg binding stability:

  - css_set_lock
  - cgroup_mutex
  - the lruvec lock
  - the split queue lock (THP pages only)

If the caller only wants to ensure that the page counters of the memcg
are updated correctly, the binding stability of page and objcg is
sufficient.

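To illustrate the indirection described above, here is a minimal
userspace sketch. It is not kernel code: every toy_* name and TOY_*
constant is hypothetical and exists only for this illustration, and the
flag layout merely assumes that the low two bits of the stored pointer
are free, in the spirit of MEMCG_DATA_FLAGS_MASK. It shows that a page
which stores an objcg pointer follows that objcg to the parent memcg on
reparenting, without the page itself being touched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel's real definitions. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root */
};

struct toy_objcg {
	struct toy_memcg *memcg;	/* swapped to the parent on reparenting */
	long refcnt;
};

struct toy_page {
	uintptr_t memcg_data;		/* objcg pointer plus low flag bits */
};

/* Assumed flag layout: two low bits reserved for flags. */
#define TOY_DATA_KMEM		((uintptr_t)1 << 1)
#define TOY_DATA_FLAGS_MASK	((uintptr_t)3)

/* Strip the flag bits to recover the objcg pointer (NULL if uncharged). */
static inline struct toy_objcg *toy_page_objcg(const struct toy_page *page)
{
	return (struct toy_objcg *)(page->memcg_data & ~TOY_DATA_FLAGS_MASK);
}

/* The memcg is always reached through the objcg, never stored directly. */
static inline struct toy_memcg *toy_page_memcg(const struct toy_page *page)
{
	struct toy_objcg *objcg = toy_page_objcg(page);

	return objcg ? objcg->memcg : NULL;
}

/* Bind a page to an objcg; the page holds one objcg reference. */
static inline void toy_charge(struct toy_page *page, struct toy_objcg *objcg,
			      uintptr_t flags)
{
	/* The pointer must leave the flag bits free. */
	assert(((uintptr_t)objcg & TOY_DATA_FLAGS_MASK) == 0);
	objcg->refcnt++;
	page->memcg_data = (uintptr_t)objcg | flags;
}

/* Drop the binding and the reference taken by toy_charge(). */
static inline void toy_uncharge(struct toy_page *page)
{
	struct toy_objcg *objcg = toy_page_objcg(page);

	if (!objcg)
		return;
	page->memcg_data = 0;
	objcg->refcnt--;
}

/* Reparenting only redirects the objcg; no page is visited. */
static inline void toy_reparent(struct toy_objcg *objcg)
{
	if (objcg->memcg->parent)
		objcg->memcg = objcg->memcg->parent;
}
```

In this model, charging a page to a child's objcg and then calling
toy_reparent() makes toy_page_memcg() return the parent while
toy_page_objcg() is unchanged. The real code additionally needs the
locking and RCU rules listed above, which the sketch leaves out.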
Signed-off-by: Muchun Song
---
 include/linux/memcontrol.h |  94 ++++++--------
 mm/huge_memory.c           |  42 +++++++
 mm/memcontrol.c            | 307 ++++++++++++++++++++++++++++++++-------------
 3 files changed, 300 insertions(+), 143 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3c841c155f0d..551fd8b76f9d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -372,8 +372,6 @@ enum page_memcg_data_flags {
 
 #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
 
-static inline bool folio_memcg_kmem(struct folio *folio);
-
 /*
  * After the initialization objcg->memcg is always pointing at
  * a valid memcg, but can be atomically swapped to the parent memcg.
@@ -387,43 +385,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
 }
 
 /*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
-	unsigned long memcg_data = folio->memcg_data;
-
-	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
  * @folio: Pointer to the folio.
  *
  * Returns a pointer to the object cgroup associated with the folio,
  * or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
  */
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
 {
 	unsigned long memcg_data = folio->memcg_data;
 
 	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
 	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
-	VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
 
 	return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
@@ -437,7 +411,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * proper memory cgroup pointer. It's not safe to call this function
  * against some type of folios, e.g. slab folios or ex-slab folios.
  *
- * For a non-kmem folio any of the following ensures folio and memcg binding
+ * For a folio any of the following ensures folio and memcg binding
  * stability:
  *
  * - the folio lock
@@ -445,14 +419,28 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * - lock_page_memcg()
  * - exclusive reference
  *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - css_set_lock
+ * - cgroup_mutex
+ * - the lruvec lock
+ * - the split queue lock (only THP page)
+ *
+ * If the caller only wants to ensure that the page counters of memcg are
+ * updated correctly, the binding stability of folio and objcg is
+ * sufficient.
+ *
+ * A caller should hold an rcu read lock (In addition, regions of code across
+ * which interrupts, preemption, or softirqs have been disabled also serve as
+ * RCU read-side critical sections) to protect memcg associated with a folio
+ * from being released.
  */
 static inline struct mem_cgroup *folio_memcg(struct folio *folio)
 {
-	if (folio_memcg_kmem(folio))
-		return obj_cgroup_memcg(__folio_objcg(folio));
-	return __folio_memcg(folio);
+	struct obj_cgroup *objcg = folio_objcg(folio);
+
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 static inline struct mem_cgroup *page_memcg(struct page *page)
@@ -470,6 +458,8 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
  * folio is known to have a proper memory cgroup pointer. It's not safe
  * to call this function against some type of pages, e.g. slab pages or
  * ex-slab pages.
+ *
+ * For the page and objcg or memcg binding rules, refer to folio_memcg().
  */
 static inline struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
 {
@@ -500,22 +490,20 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
  *
  * Return: A pointer to the memory cgroup associated with the folio,
  * or NULL.
+ *
+ * For the folio and objcg or memcg binding rules, refer to folio_memcg().
  */
 static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
 {
 	unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
-
-		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
+	objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 /*
@@ -528,16 +516,10 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
  * has an associated memory cgroup pointer or an object cgroups vector or
  * an object cgroup.
  *
- * For a non-kmem page any of the following ensures page and memcg binding
- * stability:
+ * For the page and objcg or memcg binding rules, refer to page_memcg().
  *
- * - the page lock
- * - LRU isolation
- * - lock_page_memcg()
- * - exclusive reference
- *
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
  */
 static inline struct mem_cgroup *page_memcg_check(struct page *page)
 {
@@ -546,18 +528,14 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
 	 * for slab pages, READ_ONCE() should be used here.
 	 */
 	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	struct obj_cgroup *objcg;
 
 	if (memcg_data & MEMCG_DATA_OBJCGS)
 		return NULL;
 
-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
-
-		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
+	objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b8c6e766c91c..d80afc5f14da 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -499,6 +499,8 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_MEMCG
+static struct shrinker deferred_split_shrinker;
+
 static inline struct mem_cgroup *split_queue_memcg(struct deferred_split *queue)
 {
 	if (mem_cgroup_disabled())
@@ -512,6 +514,46 @@ static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio
 
 	return memcg ? &memcg->deferred_split_queue : NULL;
 }
+
+static void memcg_reparent_split_queue_lock(struct mem_cgroup *memcg,
+					    struct mem_cgroup *parent)
+{
+	spin_lock(&memcg->deferred_split_queue.split_queue_lock);
+	spin_lock(&parent->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue_unlock(struct mem_cgroup *memcg,
+					      struct mem_cgroup *parent)
+{
+	spin_unlock(&parent->deferred_split_queue.split_queue_lock);
+	spin_unlock(&memcg->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent)
+{
+	int nid;
+	struct deferred_split *src, *dst;
+
+	src = &memcg->deferred_split_queue;
+	dst = &parent->deferred_split_queue;
+
+	if (!src->split_queue_len)
+		return;
+
+	list_splice_tail_init(&src->split_queue, &dst->split_queue);
+	dst->split_queue_len += src->split_queue_len;
+	src->split_queue_len = 0;
+
+	for_each_node(nid)
+		set_shrinker_bit(parent, nid, deferred_split_shrinker.id);
+}
+
+const struct memcg_reparent_ops split_queue_reparent_ops = {
+	.lock		= memcg_reparent_split_queue_lock,
+	.unlock		= memcg_reparent_split_queue_unlock,
+	.reparent	= memcg_reparent_split_queue,
+};
 #else
 static inline struct mem_cgroup *split_queue_memcg(struct deferred_split *queue)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6a393fe8e589..e4e490690e33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -75,6 +75,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
+static struct obj_cgroup *root_obj_cgroup __read_mostly;
 
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
@@ -240,6 +241,11 @@ static inline bool task_is_dying(void)
 		(current->flags & PF_EXITING);
 }
 
+static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
+{
+	return objcg == root_obj_cgroup;
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -336,7 +342,81 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
 	return objcg;
 }
 
-static const struct memcg_reparent_ops *memcg_reparent_ops[] = {};
+static void memcg_reparent_lruvec_lock(struct mem_cgroup *memcg,
+				       struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		spin_lock(&mem_cgroup_lruvec(memcg, NODE_DATA(i))->lru_lock);
+		spin_lock(&mem_cgroup_lruvec(parent, NODE_DATA(i))->lru_lock);
+	}
+}
+
+static void memcg_reparent_lruvec_unlock(struct mem_cgroup *memcg,
+					 struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(i))->lru_lock);
+		spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(i))->lru_lock);
+	}
+}
+
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+				enum lru_list lru)
+{
+	int zid;
+	struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+	list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+		mz_src->lru_zone_size[zid][lru] = 0;
+	}
+}
+
+static void memcg_reparent_lruvec(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	int i;
+
+	for_each_node(i) {
+		enum lru_list lru;
+		struct lruvec *src, *dst;
+
+		src = mem_cgroup_lruvec(memcg, NODE_DATA(i));
+		dst = mem_cgroup_lruvec(parent, NODE_DATA(i));
+
+		dst->anon_cost += src->anon_cost;
+		dst->file_cost += src->file_cost;
+
+		for_each_lru(lru)
+			lruvec_reparent_lru(src, dst, lru);
+	}
+}
+
+static const struct memcg_reparent_ops lruvec_reparent_ops = {
+	.lock		= memcg_reparent_lruvec_lock,
+	.unlock		= memcg_reparent_lruvec_unlock,
+	.reparent	= memcg_reparent_lruvec,
+};
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct memcg_reparent_ops split_queue_reparent_ops;
+#endif
+
+static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
+	&lruvec_reparent_ops,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	&split_queue_reparent_ops,
+#endif
+};
 
 static void memcg_reparent_lock(struct mem_cgroup *memcg,
 				struct mem_cgroup *parent)
@@ -2806,18 +2886,18 @@ static inline void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
 {
-	VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
+	VM_BUG_ON_FOLIO(folio_objcg(folio), folio);
 	/*
-	 * Any of the following ensures page's memcg stability:
+	 * Any of the following ensures page's objcg stability:
 	 *
 	 * - the page lock
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
 	 */
-	folio->memcg_data = (unsigned long)memcg;
+	folio->memcg_data = (unsigned long)objcg;
 }
 
 static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -2834,6 +2914,21 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
 	return memcg;
 }
 
+static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg = NULL;
+
+	rcu_read_lock();
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * The allocated objcg pointers array is not accounted directly.
@@ -2997,12 +3092,15 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 	else
 		memcg = mem_cgroup_from_task(current);
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
-		objcg = rcu_dereference(memcg->objcg);
-		if (objcg && obj_cgroup_tryget(objcg))
-			break;
+	if (mem_cgroup_is_root(memcg))
+		goto out;
+
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	if (obj_cgroup_is_root(objcg)) {
+		obj_cgroup_put(objcg);
 		objcg = NULL;
 	}
+out:
 	rcu_read_unlock();
 
 	return objcg;
@@ -3132,13 +3230,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 	struct folio *folio = page_folio(page);
-	struct obj_cgroup *objcg;
+	struct obj_cgroup *objcg = folio_objcg(folio);
 	unsigned int nr_pages = 1 << order;
 
-	if (!folio_memcg_kmem(folio))
+	if (!objcg)
 		return;
 
-	objcg = __folio_objcg(folio);
+	VM_BUG_ON_FOLIO(!folio_memcg_kmem(folio), folio);
 	obj_cgroup_uncharge_pages(objcg, nr_pages);
 	folio->memcg_data = 0;
 	obj_cgroup_put(objcg);
@@ -3370,26 +3468,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
- * Because page_memcg(head) is not set on tails, set it now.
+ * Because page_objcg(head) is not set on tails, set it now.
  */
 void split_page_memcg(struct page *head, unsigned int nr)
 {
 	struct folio *folio = page_folio(head);
-	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
+	struct obj_cgroup *objcg = folio_objcg(folio);
 	int i;
 
-	if (mem_cgroup_disabled() || !memcg)
+	if (mem_cgroup_disabled() || !objcg)
 		return;
 
 	for (i = 1; i < nr; i++)
 		folio_page(folio, i)->memcg_data = folio->memcg_data;
 
-	if (folio_memcg_kmem(folio))
-		obj_cgroup_get_many(__folio_objcg(folio), nr - 1);
-	else
-		css_get_many(&memcg->css, nr - 1);
-
-	css_put(&memcg->css);
+	obj_cgroup_get_many(objcg, nr - 1);
 }
 
 #ifdef CONFIG_MEMCG_SWAP
@@ -5320,6 +5413,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	objcg->memcg = memcg;
 	rcu_assign_pointer(memcg->objcg, objcg);
 
+	if (unlikely(mem_cgroup_is_root(memcg)))
+		root_obj_cgroup = objcg;
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -5629,6 +5725,8 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 			linear_page_index(vma, addr));
 }
 
+extern struct mutex cgroup_mutex;
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -5731,10 +5829,12 @@ static int mem_cgroup_move_account(struct page *page,
 	 */
 	smp_mb();
 
-	css_get(&to->css);
-	css_put(&from->css);
+	rcu_read_lock();
+	obj_cgroup_get(rcu_dereference(to->objcg));
+	obj_cgroup_put(rcu_dereference(from->objcg));
+	rcu_read_unlock();
 
-	folio->memcg_data = (unsigned long)to;
+	folio->memcg_data = (unsigned long)rcu_access_pointer(to->objcg);
 
 	__folio_memcg_unlock(from);
 
@@ -6207,6 +6307,42 @@ static void mem_cgroup_move_charge(void)
 
 	mmap_read_unlock(mc.mm);
 	atomic_dec(&mc.from->moving_account);
+
+	/*
+	 * Moving its pages to another memcg is finished. Wait for already
+	 * started RCU-only updates to finish to make sure that the caller
+	 * of lock_page_memcg() can unlock the correct move_lock. The
+	 * possible bad scenario would look like:
+	 *
+	 * CPU0:				CPU1:
+	 * mem_cgroup_move_charge()
+	 *	walk_page_range()
+	 *
+	 *					lock_page_memcg(page)
+	 *					  memcg = folio_memcg()
+	 *					  spin_lock_irqsave(&memcg->move_lock)
+	 *					  memcg->move_lock_task = current
+	 *
+	 *	atomic_dec(&mc.from->moving_account)
+	 *
+	 * mem_cgroup_css_offline()
+	 *   memcg_offline_kmem()
+	 *     memcg_reparent_objcgs()  <== reparented
+	 *
+	 *					unlock_page_memcg(page)
+	 *					  memcg = folio_memcg()  <== memcg has been changed
+	 *					  if (memcg->move_lock_task == current)  <== false
+	 *					      spin_unlock_irqrestore(&memcg->move_lock)
+	 *
+	 * Once mem_cgroup_move_charge() returns (it means that the cgroup_mutex
+	 * would be released soon), the page can be reparented to its parent
+	 * memcg. When the unlock_page_memcg() is called for the page, we will
+	 * fail to unlock the move_lock. So use synchronize_rcu() to wait for
+	 * already started RCU-only updates to finish before this function
+	 * returns (mem_cgroup_move_charge() and mem_cgroup_css_offline() are
+	 * serialized by cgroup_mutex).
+	 */
+	synchronize_rcu();
 }
 
 /*
@@ -6766,21 +6902,27 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 			gfp_t gfp)
 {
+	struct obj_cgroup *objcg;
 	long nr_pages = folio_nr_pages(folio);
-	int ret;
+	int ret = 0;
 
-	ret = try_charge(memcg, gfp, nr_pages);
-	if (ret)
-		goto out;
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	/* Do not account at the root objcg level. */
+	if (!obj_cgroup_is_root(objcg)) {
+		ret = try_charge(memcg, gfp, nr_pages);
+		if (ret)
+			goto out;
+	}
 
-	css_get(&memcg->css);
-	commit_charge(folio, memcg);
+	obj_cgroup_get(objcg);
+	commit_charge(folio, objcg);
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(folio));
 	local_irq_enable();
 out:
+	obj_cgroup_put(objcg);
 	return ret;
 }
 
@@ -6866,7 +7008,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
 }
 
 struct uncharge_gather {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	unsigned long nr_memory;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
@@ -6881,84 +7023,73 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(ug->objcg);
 
 	if (ug->nr_memory) {
-		page_counter_uncharge(&ug->memcg->memory, ug->nr_memory);
+		page_counter_uncharge(&memcg->memory, ug->nr_memory);
 		if (do_memsw_account())
-			page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
+			page_counter_uncharge(&memcg->memsw, ug->nr_memory);
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
+			page_counter_uncharge(&memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(memcg);
 	}
 
 	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
+	__count_memcg_events(memcg, PGPGOUT, ug->pgpgout);
+	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+	memcg_check_events(memcg, ug->nid);
 	local_irq_restore(flags);
+	rcu_read_unlock();
 
 	/* drop reference from uncharge_folio */
-	css_put(&ug->memcg->css);
+	obj_cgroup_put(ug->objcg);
 }
 
 static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 {
 	long nr_pages;
-	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
-	bool use_objcg = folio_memcg_kmem(folio);
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * folio memcg or objcg at this point, we have fully
-	 * exclusive access to the folio.
+	 * folio objcg at this point, we have fully exclusive
+	 * access to the folio.
 	 */
-	if (use_objcg) {
-		objcg = __folio_objcg(folio);
-		/*
-		 * This get matches the put at the end of the function and
-		 * kmem pages do not hold memcg references anymore.
-		 */
-		memcg = get_mem_cgroup_from_objcg(objcg);
-	} else {
-		memcg = __folio_memcg(folio);
-	}
-
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	if (!objcg)
 		return;
 
-	if (ug->memcg != memcg) {
-		if (ug->memcg) {
+	if (ug->objcg != objcg) {
+		if (ug->objcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = memcg;
+		ug->objcg = objcg;
 		ug->nid = folio_nid(folio);
 
-		/* pairs with css_put in uncharge_batch */
-		css_get(&memcg->css);
+		/* pairs with obj_cgroup_put in uncharge_batch */
+		obj_cgroup_get(objcg);
 	}
 
 	nr_pages = folio_nr_pages(folio);
 
-	if (use_objcg) {
+	if (folio_memcg_kmem(folio)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
-
-		folio->memcg_data = 0;
-		obj_cgroup_put(objcg);
 	} else {
 		/* LRU pages aren't accounted at the root level */
-		if (!mem_cgroup_is_root(memcg))
+		if (!obj_cgroup_is_root(objcg))
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
-
-		folio->memcg_data = 0;
 	}
 
-	css_put(&memcg->css);
+	folio->memcg_data = 0;
+	obj_cgroup_put(objcg);
 }
 
 void __mem_cgroup_uncharge(struct folio *folio)
@@ -6966,7 +7097,7 @@ void __mem_cgroup_uncharge(struct folio *folio)
 	struct uncharge_gather ug;
 
 	/* Don't touch folio->lru of any random page, pre-check: */
-	if (!folio_memcg(folio))
+	if (!folio_objcg(folio))
 		return;
 
 	uncharge_gather_clear(&ug);
@@ -6989,7 +7120,7 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 	uncharge_gather_clear(&ug);
 	list_for_each_entry(folio, page_list, lru)
 		uncharge_folio(folio, &ug);
-	if (ug.memcg)
+	if (ug.objcg)
 		uncharge_batch(&ug);
 }
 
@@ -7006,6 +7137,7 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 void mem_cgroup_migrate(struct folio *old, struct folio *new)
 {
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	long nr_pages = folio_nr_pages(new);
 	unsigned long flags;
 
@@ -7018,30 +7150,33 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 		return;
 
 	/* Page cache replacement: new folio already charged? */
-	if (folio_memcg(new))
+	if (folio_objcg(new))
 		return;
 
-	memcg = get_mem_cgroup_from_folio(old);
-	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
-	if (!memcg)
+	objcg = folio_objcg(old);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, old);
+	if (!objcg)
 		return;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+
 	/* Force-charge the new page. The old one will be freed soon */
-	if (!mem_cgroup_is_root(memcg)) {
+	if (!obj_cgroup_is_root(objcg)) {
 		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_memsw_account())
 			page_counter_charge(&memcg->memsw, nr_pages);
 	}
 
-	css_get(&memcg->css);
-	commit_charge(new, memcg);
+	obj_cgroup_get(objcg);
+	commit_charge(new, objcg);
 
 	local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(new));
 	local_irq_restore(flags);
 
-	css_put(&memcg->css);
+	rcu_read_unlock();
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7216,6 +7351,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_entries;
 	unsigned short oldid;
 
@@ -7228,15 +7364,16 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
+	objcg = folio_objcg(page_folio(page));
+	VM_WARN_ON_ONCE_PAGE(!objcg, page);
+	if (!objcg)
+		return;
+
 	/*
 	 * Interrupts should be disabled by the caller (see the comments below),
 	 * which can serve as RCU read-side critical sections.
 	 */
-	memcg = page_memcg(page);
-
-	VM_WARN_ON_ONCE_PAGE(!memcg, page);
-	if (!memcg)
-		return;
+	memcg = obj_cgroup_memcg(objcg);
 
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
@@ -7255,7 +7392,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	page->memcg_data = 0;
 
-	if (!mem_cgroup_is_root(memcg))
+	if (!obj_cgroup_is_root(objcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (!cgroup_memory_noswap && memcg != swap_memcg) {
@@ -7274,7 +7411,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
 	memcg_check_events(memcg, page_to_nid(page));
 
-	css_put(&memcg->css);
+	obj_cgroup_put(objcg);
 }
 
 /**
-- 
2.11.0