From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: [PATCH] mm: Make count list_lru_one::nr_items lockless
From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: vdavydov.dev@gmail.com, apolyakov@beget.ru, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, aryabinin@virtuozzo.com, akpm@linux-foundation.org
Date: Tue, 19 Sep 2017 18:06:33 +0300
Message-ID: <150583358557.26700.8490036563698102569.stgit@localhost.localdomain>
User-Agent: StGit/0.18
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

During slab reclaim on behalf of a memcg, shrink_slab() iterates over all
shrinkers registered in the system and tries to count and consume the
objects related to the cgroup. Under memory pressure this behaves badly:
I observe high system time, much of it spent in list_lru_count_one(), for
many processes on a RHEL7 kernel (collected via
$perf record --call-graph fp -j k -a):

0,50%  nixstatsagent  [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
0,26%  nixstatsagent  [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
0,23%  nixstatsagent  [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
0,15%  nixstatsagent  [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
0,15%  nixstatsagent  [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

0,94%  mysqld         [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
0,57%  mysqld         [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
0,51%  mysqld         [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
0,32%  mysqld         [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
0,32%  mysqld         [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

0,73%  sshd           [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
0,35%  sshd           [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
0,32%  sshd           [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
0,21%  sshd           [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
0,21%  sshd           [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

This patch aims to make super_cache_count() (and the other functions that
count LRU nr_items) more efficient. It allows list_lru_node::memcg_lrus
to be RCU-accessed, and makes __list_lru_count_one() count nr_items
locklessly, which minimizes the overhead of taking the lock and makes
parallel reclaims more scalable.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
---
 include/linux/list_lru.h |    3 ++
 mm/list_lru.c            |   59 +++++++++++++++++++++++++++++-----------------
 2 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index fa7fd03cb5f9..a55258100e40 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,7 @@ struct list_lru_one {
 };
 
 struct list_lru_memcg {
+	struct rcu_head		rcu;
 	/* array of per cgroup lists, indexed by memcg_cache_id */
 	struct list_lru_one	*lru[0];
 };
@@ -42,7 +43,7 @@ struct list_lru_node {
 	struct list_lru_one	lru;
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 	/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
-	struct list_lru_memcg	*memcg_lrus;
+	struct list_lru_memcg	__rcu *memcg_lrus;
 #endif
 	long nr_items;
 } ____cacheline_aligned_in_smp;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7a40fa2be858..9fdb24818dae 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -52,14 +52,15 @@ static inline bool list_lru_memcg_aware(struct list_lru *lru)
 static inline struct list_lru_one *
 list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 {
+	struct list_lru_memcg *memcg_lrus;
 	/*
-	 * The lock protects the array of per cgroup lists from relocation
-	 * (see memcg_update_list_lru_node).
+	 * Either lock or RCU protects the array of per cgroup lists
+	 * from relocation (see memcg_update_list_lru_node).
 	 */
-	lockdep_assert_held(&nlru->lock);
-	if (nlru->memcg_lrus && idx >= 0)
-		return nlru->memcg_lrus->lru[idx];
-
+	memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
+					   lockdep_is_held(&nlru->lock));
+	if (memcg_lrus && idx >= 0)
+		return memcg_lrus->lru[idx];
 	return &nlru->lru;
 }
 
@@ -168,10 +169,10 @@ static unsigned long __list_lru_count_one(struct list_lru *lru,
 	struct list_lru_one *l;
 	unsigned long count;
 
-	spin_lock(&nlru->lock);
+	rcu_read_lock();
 	l = list_lru_from_memcg_idx(nlru, memcg_idx);
 	count = l->nr_items;
-	spin_unlock(&nlru->lock);
+	rcu_read_unlock();
 
 	return count;
 }
@@ -323,24 +324,33 @@ static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
 
 static int memcg_init_list_lru_node(struct list_lru_node *nlru)
 {
+	struct list_lru_memcg *memcg_lrus;
 	int size = memcg_nr_cache_ids;
 
-	nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL);
-	if (!nlru->memcg_lrus)
+	memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
+			     size * sizeof(void *), GFP_KERNEL);
+	if (!memcg_lrus)
 		return -ENOMEM;
 
-	if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) {
-		kfree(nlru->memcg_lrus);
+	if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
+		kfree(memcg_lrus);
 		return -ENOMEM;
 	}
+	RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);
 
 	return 0;
 }
 
 static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
 {
-	__memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids);
-	kfree(nlru->memcg_lrus);
+	struct list_lru_memcg *memcg_lrus;
+	/*
+	 * This is called when shrinker has already been unregistered,
+	 * and nobody can use it. So, there is no need to use kfree_rcu().
+	 */
+	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
+	__memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
+	kfree(memcg_lrus);
 }
 
 static int memcg_update_list_lru_node(struct list_lru_node *nlru,
@@ -350,8 +360,9 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
 
 	BUG_ON(old_size > new_size);
 
-	old = nlru->memcg_lrus;
-	new = kmalloc(new_size * sizeof(void *), GFP_KERNEL);
+	old = rcu_dereference_protected(nlru->memcg_lrus,
+					lockdep_is_held(&list_lrus_mutex));
+	new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
 	if (!new)
 		return -ENOMEM;
 
@@ -360,29 +371,33 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
 		return -ENOMEM;
 	}
 
-	memcpy(new, old, old_size * sizeof(void *));
+	memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
 
 	/*
-	 * The lock guarantees that we won't race with a reader
-	 * (see list_lru_from_memcg_idx).
+	 * The locking below allows readers that hold nlru->lock avoid taking
+	 * rcu_read_lock (see list_lru_from_memcg_idx).
 	 *
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
 	 * we have to use IRQ-safe primitives here to avoid deadlock.
 	 */
 	spin_lock_irq(&nlru->lock);
-	nlru->memcg_lrus = new;
+	rcu_assign_pointer(nlru->memcg_lrus, new);
 	spin_unlock_irq(&nlru->lock);
 
-	kfree(old);
+	kfree_rcu(old, rcu);
 	return 0;
 }
 
 static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
 					      int old_size, int new_size)
 {
+	struct list_lru_memcg *memcg_lrus;
+
+	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
+					       lockdep_is_held(&list_lrus_mutex));
 	/* do not bother shrinking the array back to the old size, because we
 	 * cannot handle allocation failures here */
-	__memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size);
+	__memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size);
 }
 
 static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
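
For illustration, below is a minimal, self-contained user-space sketch of the
publish/lockless-read pattern the patch relies on. It is not part of the
patch: plain C11 atomics stand in for the kernel's rcu_assign_pointer() /
rcu_dereference() pair, and the names lru_one, lru_memcg, count_one() and
grow() are hypothetical. The updater builds a larger array and publishes it
with a release store, while the counting path loads the current array and
reads nr_items without taking any lock; the kernel additionally defers
freeing the old array with kfree_rcu() until all lockless readers are done,
which this sketch simply skips (it leaks the old array).

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct lru_one {
	long nr_items;
};

struct lru_memcg {
	size_t nr;
	struct lru_one *lru[];	/* per-cgroup lists, like list_lru_memcg::lru[] */
};

/* models list_lru_node::memcg_lrus (an __rcu pointer in the patch) */
static _Atomic(struct lru_memcg *) memcg_lrus;

/* reader side: models __list_lru_count_one() after the patch - no lock taken */
static long count_one(size_t idx)
{
	/* acquire load stands in for rcu_read_lock() + rcu_dereference() */
	struct lru_memcg *p =
		atomic_load_explicit(&memcg_lrus, memory_order_acquire);

	return (p && idx < p->nr) ? p->lru[idx]->nr_items : 0;
}

/* updater side: models memcg_update_list_lru_node() relocating the array */
static int grow(size_t new_nr)
{
	struct lru_memcg *old =
		atomic_load_explicit(&memcg_lrus, memory_order_relaxed);
	size_t old_nr = old ? old->nr : 0;
	struct lru_memcg *new =
		malloc(sizeof(*new) + new_nr * sizeof(struct lru_one *));

	if (!new)
		return -1;
	new->nr = new_nr;
	if (old_nr)
		memcpy(new->lru, old->lru, old_nr * sizeof(struct lru_one *));
	for (size_t i = old_nr; i < new_nr; i++) {
		new->lru[i] = calloc(1, sizeof(struct lru_one));
		if (!new->lru[i])
			return -1;
	}

	/* release store stands in for rcu_assign_pointer() */
	atomic_store_explicit(&memcg_lrus, new, memory_order_release);

	/*
	 * The kernel frees the old array with kfree_rcu(old, rcu), i.e. only
	 * after every lockless reader has finished with it; this sketch
	 * leaks it to stay short.
	 */
	return 0;
}

int main(void)
{
	if (grow(4))
		return 1;
	atomic_load_explicit(&memcg_lrus, memory_order_acquire)->lru[2]->nr_items = 5;
	printf("idx 2: %ld items\n", count_one(2));	/* 5 */
	if (grow(8))	/* relocation does not block or disturb readers */
		return 1;
	printf("idx 2: %ld items\n", count_one(2));	/* still 5 */
	return 0;
}

Built with e.g. gcc -std=c11, the program prints the same count before and
after the array is relocated, which is the property the lockless
__list_lru_count_one() depends on.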