Subject: Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management
From: Shakeel Butt <shakeelb@google.com>
Date: Tue, 14 May 2019 17:06:55 -0700
To: Roman Gushchin
Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
 Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov, Cgroups
In-Reply-To: <20190514213940.2405198-6-guro@fb.com>
References: <20190514213940.2405198-1-guro@fb.com> <20190514213940.2405198-6-guro@fb.com>

From: Roman Gushchin
Date: Tue, May 14, 2019 at 2:55 PM
To: Andrew Morton, Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Rik van Riel, Christoph Lameter,
 Vladimir Davydov, Roman Gushchin

> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the memcg and are released with it.
> It means that no kmem_cache is released unless at least one
> reference to the memcg exists, which is not optimal.
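
To restate the ownership change in code form (an illustrative sketch only --
the stub types below are stand-ins, not the real kernel struct layouts; the
arrow notation in the commit message below says the same thing):

/* Illustrative stubs only -- showing who pins whom, not real definitions. */
struct mem_cgroup { int css_refcount; };

/* Current scheme: every charged slab page pins the memcg directly. */
struct page_current { struct mem_cgroup *mem_cgroup; };

/* Proposed scheme: the page pins its kmem_cache through the new percpu
 * refcounter, and the kmem_cache holds the single memcg reference. */
struct kmem_cache_proposed { struct mem_cgroup *memcg; long refcnt; };
struct page_proposed { struct kmem_cache_proposed *slab_cache; };
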
>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches hold a single
> reference to the memory cgroup.
>
> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized in percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shut down an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.
>
> To shut down non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since we can now release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes the deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way memcg-level stats are handled for slab pages. We can't use
> the mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
> time find / -name fname-no-exist
> echo 2 > /proc/sys/vm/drop_caches
> repeat 10 times
>
> Results (best of several runs):
>
>         orig                    patched
>
> real    0m0.648s        real    0m0.593s
> user    0m0.148s        user    0m0.162s
> sys     0m0.295s        sys     0m0.253s
>
> real    0m0.581s        real    0m0.649s
> user    0m0.119s        user    0m0.136s
> sys     0m0.254s        sys     0m0.250s
>
> real    0m0.645s        real    0m0.705s
> user    0m0.138s        user    0m0.138s
> sys     0m0.263s        sys     0m0.250s
>
> real    0m0.691s        real    0m0.718s
> user    0m0.139s        user    0m0.134s
> sys     0m0.262s        sys     0m0.253s
>
> real    0m0.654s        real    0m0.715s
> user    0m0.146s        user    0m0.128s
> sys     0m0.247s        sys     0m0.261s
>
> real    0m0.675s        real    0m0.717s
> user    0m0.129s        user    0m0.137s
> sys     0m0.277s        sys     0m0.248s
>
> real    0m0.631s        real    0m0.719s
> user    0m0.137s        user    0m0.134s
> sys     0m0.255s        sys     0m0.251s
>
> real    0m0.622s        real    0m0.715s
> user    0m0.108s        user    0m0.124s
> sys     0m0.279s        sys     0m0.264s
>
> real    0m0.651s        real    0m0.669s
> user    0m0.139s        user    0m0.139s
> sys     0m0.252s        sys     0m0.247s
>
> real    0m0.671s        real    0m0.632s
> user    0m0.130s        user    0m0.139s
> sys     0m0.263s        sys     0m0.245s
>
> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  include/linux/slab.h |  3 +-
>  mm/memcontrol.c      | 57 +++++++++++++++++++++---------
>  mm/slab.h            | 82 +++++++++++++++++++++++++-------------------
>  mm/slab_common.c     | 74 +++++++++++++++++++++++----------------
>  mm/slub.c            | 12 +------
>  5 files changed, 135 insertions(+), 93 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..1b54e5f83342 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -16,6 +16,7 @@
>  #include <linux/overflow.h>
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
> +#include <linux/percpu-refcount.h>
>
>
>  /*
> @@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +641,7 @@ struct memcg_cache_params {
>                  struct mem_cgroup *memcg;
>                  struct list_head children_node;
>                  struct list_head kmem_caches_node;
> +                struct percpu_ref refcnt;
>
>                  void (*work_fn)(struct kmem_cache *);
>                  union {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b2c39f187cbb..413cef3d8369 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2610,12 +2610,13 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
>  {
>          struct memcg_kmem_cache_create_work *cw;
>
> +        if (!css_tryget_online(&memcg->css))
> +                return;
> +
>          cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
>          if (!cw)
>                  return;
>
> -        css_get(&memcg->css);
> -
>          cw->memcg = memcg;
>          cw->cachep = cachep;
>          INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
> @@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>          struct mem_cgroup *memcg;
>          struct kmem_cache *memcg_cachep;
>          int kmemcg_id;
> +        struct memcg_cache_array *arr;
>
>          VM_BUG_ON(!is_root_cache(cachep));
>
>          if (memcg_kmem_bypass())
>                  return cachep;
>
> -        memcg = get_mem_cgroup_from_current();
> +        rcu_read_lock();
> +
> +        if (unlikely(current->active_memcg))
> +                memcg = current->active_memcg;
> +        else
> +                memcg = mem_cgroup_from_task(current);
> +
> +        if (!memcg || memcg == root_mem_cgroup)
> +                goto out_unlock;
> +
>          kmemcg_id = READ_ONCE(memcg->kmemcg_id);
>          if (kmemcg_id < 0)
> -                goto out;
> +                goto out_unlock;
>
> -        memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
> -        if (likely(memcg_cachep))
> -                return memcg_cachep;
> +        arr = rcu_dereference(cachep->memcg_params.memcg_caches);
> +
> +        /*
> +         * Make sure we will access the up-to-date value. The code updating
> +         * memcg_caches issues a write barrier to match this (see
> +         * memcg_create_kmem_cache()).
> +         */
> +        memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
>
>          /*
>           * If we are in a safe context (can wait, and not in interrupt
> @@ -2677,10 +2693,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>           * memcg_create_kmem_cache, this means no further allocation
>           * could happen with the slab_mutex held. So it's better to
>           * defer everything.
> +         *
> +         * If the memcg is dying or memcg_cache is about to be released,
> +         * don't bother creating new kmem_caches. Because memcg_cachep
> +         * is ZEROed as the first step of kmem offlining, we don't need
> +         * percpu_ref_tryget() here. css_tryget_online() check in

*percpu_ref_tryget_live()

> +         * memcg_schedule_kmem_cache_create() will prevent us from
> +         * creating a new kmem_cache.
>           */
> -        memcg_schedule_kmem_cache_create(memcg, cachep);
> -out:
> -        css_put(&memcg->css);
> +        if (unlikely(!memcg_cachep))
> +                memcg_schedule_kmem_cache_create(memcg, cachep);
> +        else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
> +                cachep = memcg_cachep;
> +out_unlock:
> +        rcu_read_unlock();
> +        return cachep;
>  }
>
> @@ -2691,7 +2717,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>  void memcg_kmem_put_cache(struct kmem_cache *cachep)
>  {
>          if (!is_root_cache(cachep))
> -                css_put(&cachep->memcg_params.memcg->css);
> +                percpu_ref_put(&cachep->memcg_params.refcnt);
>  }
>
>  /**
> @@ -2719,9 +2745,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>                  cancel_charge(memcg, nr_pages);
>                  return -ENOMEM;
>          }
> -
> -        page->mem_cgroup = memcg;
> -
>          return 0;
>  }
>
> @@ -2744,8 +2767,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
>          memcg = get_mem_cgroup_from_current();
>          if (!mem_cgroup_is_root(memcg)) {
>                  ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> -                if (!ret)
> +                if (!ret) {
> +                        page->mem_cgroup = memcg;
>                          __SetPageKmemcg(page);
> +                }
>          }
>          css_put(&memcg->css);
>          return ret;
> @@ -3238,7 +3263,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>                  memcg_offline_kmem(memcg);
>
>          if (memcg->kmem_state == KMEM_ALLOCATED) {
> -                memcg_destroy_kmem_caches(memcg);
> +                WARN_ON(!list_empty(&memcg->kmem_caches));
>                  static_branch_dec(&memcg_kmem_enabled_key);
>                  WARN_ON(page_counter_read(&memcg->kmem));
>          }
> diff --git a/mm/slab.h b/mm/slab.h
> index c9a31120fa1d..b86744c58702 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -173,6 +173,7 @@ void __kmem_cache_release(struct kmem_cache *);
>  int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> +void kmemcg_cache_shutdown(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
>
>  struct seq_file;
> @@ -248,31 +249,6 @@ static inline const char *cache_name(struct kmem_cache *s)
>          return s->name;
>  }
>
> -/*
> - * Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
> - * That said the caller must assure the memcg's cache won't go away by either
> - * taking a css reference to the owner cgroup, or holding the slab_mutex.
> - */
> -static inline struct kmem_cache *
> -cache_from_memcg_idx(struct kmem_cache *s, int idx)
> -{
> -        struct kmem_cache *cachep;
> -        struct memcg_cache_array *arr;
> -
> -        rcu_read_lock();
> -        arr = rcu_dereference(s->memcg_params.memcg_caches);
> -
> -        /*
> -         * Make sure we will access the up-to-date value. The code updating
> -         * memcg_caches issues a write barrier to match this (see
> -         * memcg_create_kmem_cache()).
> -         */
> -        cachep = READ_ONCE(arr->entries[idx]);
> -        rcu_read_unlock();
> -
> -        return cachep;
> -}
> -
>  static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>  {
>          if (is_root_cache(s))
> @@ -280,19 +256,49 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>          return s->memcg_params.root_cache;
>  }
>
> +/*
> + * Charge the slab page belonging to the non-root kmem_cache.
> + * Can be called for non-root kmem_caches only.
> + */
>  static __always_inline int memcg_charge_slab(struct page *page,
>                                               gfp_t gfp, int order,
>                                               struct kmem_cache *s)
>  {
> -        if (is_root_cache(s))
> -                return 0;
> -        return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
> +        struct mem_cgroup *memcg;
> +        struct lruvec *lruvec;
> +        int ret;
> +
> +        memcg = s->memcg_params.memcg;
> +        ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
> +        if (ret)
> +                return ret;
> +
> +        lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +        mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
> +
> +        /* transfer try_charge() page references to kmem_cache */
> +        percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
> +        css_put_many(&memcg->css, 1 << order);
> +
> +        return 0;
>  }
>
> +/*
> + * Uncharge a slab page belonging to a non-root kmem_cache.
> + * Can be called for non-root kmem_caches only.
> + */
>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>                                                  struct kmem_cache *s)
>  {
> -        memcg_kmem_uncharge(page, order);
> +        struct mem_cgroup *memcg;
> +        struct lruvec *lruvec;
> +
> +        memcg = s->memcg_params.memcg;
> +        lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +        mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> +        memcg_kmem_uncharge_memcg(page, order, memcg);
> +
> +        percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
>  }
>
>  extern void slab_init_memcg_params(struct kmem_cache *);
> @@ -362,18 +368,24 @@ static __always_inline int charge_slab_page(struct page *page,
>                                              gfp_t gfp, int order,
>                                              struct kmem_cache *s)
>  {
> -        int ret = memcg_charge_slab(page, gfp, order, s);
> -
> -        if (!ret)
> -                mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
> +        if (is_root_cache(s)) {
> +                mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +                                    1 << order);
> +                return 0;
> +        }
>
> -        return ret;
> +        return memcg_charge_slab(page, gfp, order, s);
>  }
>
>  static __always_inline void uncharge_slab_page(struct page *page, int order,
>                                                 struct kmem_cache *s)
>  {
> -        mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
> +        if (is_root_cache(s)) {
> +                mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +                                    -(1 << order));
> +                return;
> +        }
> +
>          memcg_uncharge_slab(page, order, s);
>  }
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..1ee967b4805e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -45,6 +45,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
>  static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>                      slab_caches_to_rcu_destroy_workfn);
>
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
> +
>  /*
>   * Set of flags that will prevent slab merging
>   */
> @@ -130,6 +132,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
>  #ifdef CONFIG_MEMCG_KMEM
>
>  LIST_HEAD(slab_root_caches);
> +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
>
>  void slab_init_memcg_params(struct kmem_cache *s)
>  {
> @@ -145,6 +148,12 @@ static int init_memcg_params(struct kmem_cache *s,
>          struct memcg_cache_array *arr;
>
>          if (root_cache) {
> +                int ret = percpu_ref_init(&s->memcg_params.refcnt,
> +                                          kmemcg_queue_cache_shutdown,
> +                                          0, GFP_KERNEL);
> +                if (ret)
> +                        return ret;
> +
>                  s->memcg_params.root_cache = root_cache;
>                  INIT_LIST_HEAD(&s->memcg_params.children_node);
>                  INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
> @@ -170,6 +179,8 @@ static void destroy_memcg_params(struct kmem_cache *s)
>  {
>          if (is_root_cache(s))
>                  kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> +        else
> +                percpu_ref_exit(&s->memcg_params.refcnt);
>  }
>
>  static void free_memcg_params(struct rcu_head *rcu)
> @@ -225,6 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>          if (is_root_cache(s)) {
>                  list_add(&s->root_caches_node, &slab_root_caches);
>          } else {
> +                css_get(&memcg->css);
>                  s->memcg_params.memcg = memcg;
>                  list_add(&s->memcg_params.children_node,
>                           &s->memcg_params.root_cache->memcg_params.children);
> @@ -240,6 +252,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>          } else {
>                  list_del(&s->memcg_params.children_node);
>                  list_del(&s->memcg_params.kmem_caches_node);
> +                css_put(&s->memcg_params.memcg->css);
>          }
>  }
>  #else
> @@ -708,16 +721,13 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
>
>          put_online_mems();
>          put_online_cpus();
> -
> -        /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> -        css_put(&s->memcg_params.memcg->css);
>  }
>
>  /*
>   * We need to grab blocking locks. Bounce to ->work. The
>   * work item shares the space with the RCU head and can't be
> - * initialized eariler.
> -*/
> + * initialized earlier.
> + */
>  static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  {
>          struct kmem_cache *s = container_of(head, struct kmem_cache,
> @@ -727,9 +737,31 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>          queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> +        WARN_ON(shutdown_cache(s));
> +}
> +
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
> +{
> +        struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
> +                                            memcg_params.refcnt);
> +
> +        spin_lock(&memcg_kmem_wq_lock);
> +        if (s->memcg_params.root_cache->memcg_params.dying)
> +                goto unlock;
> +
> +        WARN_ON(s->memcg_params.work_fn);
> +        s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> +        call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> +unlock:
> +        spin_unlock(&memcg_kmem_wq_lock);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>          __kmemcg_cache_deactivate_after_rcu(s);
> +        percpu_ref_kill(&s->memcg_params.refcnt);
>  }
>
>  static void kmemcg_cache_deactivate(struct kmem_cache *s)
> @@ -739,9 +771,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>          if (s->memcg_params.root_cache->memcg_params.dying)
>                  return;
>
> -        /* pin memcg so that @s doesn't get destroyed in the middle */
> -        css_get(&s->memcg_params.memcg->css);
> -
>          WARN_ON_ONCE(s->memcg_params.work_fn);
>          s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
>          call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> @@ -775,28 +804,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>          put_online_cpus();
>  }
>
> -void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> -{
> -        struct kmem_cache *s, *s2;
> -
> -        get_online_cpus();
> -        get_online_mems();
> -
> -        mutex_lock(&slab_mutex);
> -        list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
> -                                 memcg_params.kmem_caches_node) {
> -                /*
> -                 * The cgroup is about to be freed and therefore has no charges
> -                 * left. Hence, all its caches must be empty by now.
> -                 */
> -                BUG_ON(shutdown_cache(s));
> -        }
> -        mutex_unlock(&slab_mutex);
> -
> -        put_online_mems();
> -        put_online_cpus();
> -}
> -
>  static int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>          struct memcg_cache_array *arr;
> @@ -854,8 +861,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> +        /*
> +         * memcg_params.dying is synchronized using slab_mutex AND
> +         * memcg_kmem_wq_lock spinlock, because it's not always
> +         * possible to grab slab_mutex.
> +         */
>          mutex_lock(&slab_mutex);
> +        spin_lock(&memcg_kmem_wq_lock);
>          s->memcg_params.dying = true;
> +        spin_unlock(&memcg_kmem_wq_lock);
>          mutex_unlock(&slab_mutex);
>
>          /*
> diff --git a/mm/slub.c b/mm/slub.c
> index 13e415cc71b7..0a4ddbeb5ca6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4018,18 +4018,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>          /*
>           * Called with all the locks held after a sched RCU grace period.
> -         * Even if @s becomes empty after shrinking, we can't know that @s
> -         * doesn't have allocations already in-flight and thus can't
> -         * destroy @s until the associated memcg is released.
> -         *
> -         * However, let's remove the sysfs files for empty caches here.
> -         * Each cache has a lot of interface files which aren't
> -         * particularly useful for empty draining caches; otherwise, we can
> -         * easily end up with millions of unnecessary sysfs files on
> -         * systems which have a lot of memory and transient cgroups.
>           */
> -        if (!__kmem_cache_shrink(s))
> -                sysfs_slab_remove(s);
> +        __kmem_cache_shrink(s);
>  }
>
>  void __kmemcg_cache_deactivate(struct kmem_cache *s)
> --
> 2.20.1
>
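
A note for anyone else following the new lifecycle: the invariant the refcnt
provides can be boiled down to the toy userspace model below. It is only an
analogue (a plain atomic counter and hypothetical toy_* names, not the
kernel's percpu_ref), but it shows the intended behaviour: every charged page
and every in-flight allocation holds a reference, deactivation drops the base
reference, and the shutdown/release callback can only run once everything has
been uncharged.

/*
 * Toy userspace analogue of the refcounting scheme above -- NOT the kernel's
 * percpu_ref, just an atomic counter with a release callback, to show why
 * the cache cannot be released while charged pages are still around.
 */
#include <stdatomic.h>
#include <stdio.h>

struct toy_cache {
        atomic_long count;              /* starts at 1: the "base" reference */
        void (*release)(struct toy_cache *);
};

static void toy_ref_init(struct toy_cache *c,
                         void (*release)(struct toy_cache *))
{
        atomic_init(&c->count, 1);
        c->release = release;
}

static void toy_ref_get_many(struct toy_cache *c, long n)
{
        atomic_fetch_add(&c->count, n);
}

static void toy_ref_put_many(struct toy_cache *c, long n)
{
        /* The last put invokes the release callback, like ->release. */
        if (atomic_fetch_sub(&c->count, n) == n)
                c->release(c);
}

/* "Deactivation": drop the base reference, as percpu_ref_kill() does. */
static void toy_ref_kill(struct toy_cache *c)
{
        toy_ref_put_many(c, 1);
}

static void toy_release(struct toy_cache *c)
{
        printf("cache released\n");
}

int main(void)
{
        struct toy_cache cache;

        toy_ref_init(&cache, toy_release);
        toy_ref_get_many(&cache, 2);    /* two charged slab pages pin it     */
        toy_ref_kill(&cache);           /* deactivate: nothing printed yet   */
        toy_ref_put_many(&cache, 1);    /* first page uncharged              */
        toy_ref_put_many(&cache, 1);    /* last page uncharged -> release    */
        return 0;
}

The real counter additionally starts in percpu mode and is only switched to
atomic mode at deactivation time, which this sketch doesn't try to model.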