From: Yafang Shao
To: mhocko@kernel.org,
	hannes@cmpxchg.org, vdavydov.dev@gmail.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Yafang Shao
Subject: [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection
Date: Thu, 7 Nov 2019 01:08:08 -0500
Message-Id: <1573106889-4939-1-git-send-email-laoar.shao@gmail.com>

This patch introduces a new memory controller file, memory.low.level,
which assigns one of several protection levels to a memcg's memory.low
protection. Valid values are [0..3], so four levels are supported for
now. The file takes effect only when memory.low is set.

With this file we can apply page reclaim QoS across memcgs: under
memory pressure, pages are reclaimed from lower-priority memcgs first
and from higher-priority memcgs only afterwards.

- What is the problem with the current memory.low protection?

Currently we can set a larger memory.low on a higher-priority memcg and
a smaller memory.low on a lower-priority memcg. But once there is no
unprotected memory left to reclaim, the reclaimers reclaim protected
memory from all memcgs alike. What we really want is for the reclaimers
to take protected memory from the lower-priority memcgs first, and to
touch the protected memory of higher-priority memcgs only if the page
allocation still cannot be satisfied.

The intended logic is shown below:

	under_memory_pressure
		reclaim_unprotected_memory
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_lowest_priority_memcgs
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_higher_priority_memcgs
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_highest_priority_memcgs

- Why does it support four-level memory low protection?
Low, medium and high priority are the most common use cases in
practice, so four levels of memory.low protection should be enough.
The more levels there are, the more overhead page reclaim incurs, so
four levels is a trade-off.

- How does it work?

Here is an example of how the multi-level controller works:

	target memcg (root or non-root)
	   /    \
	  B      C
	 / \
	B1  B2

	B/memory.low.level=2	effective low level is 2
	B1/memory.low.level=3	effective low level is 2
	B2/memory.low.level=0	effective low level is 0
	C/memory.low.level=1	effective low level is 1

The effective low level is min(low_level, parent_low_level).
memory.low is set in all of these memcgs.
The reclaimer then reclaims from them in this order:
B2->C->B/B1

Signed-off-by: Yafang Shao
---
 Documentation/admin-guide/cgroup-v2.rst | 11 ++++++++
 include/linux/page_counter.h            |  3 +++
 mm/memcontrol.c                         | 45 +++++++++++++++++++++++++++++++++
 mm/vmscan.c                             | 31 ++++++++++++++++++-----
 4 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index ed912315..cdacc9c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1137,6 +1137,17 @@ PAGE_SIZE multiple when read back.
 	Putting more memory than generally available under this
 	protection is discouraged.
 
+  memory.low.level
+	A read-write single value file which exists on non-root
+	cgroups. The default is "0". The valid value is [0..3].
+
+	The controller file takes effect only after memory.low is set.
+	If both memory.low and memory.low.level are set on multiple
+	memcgs, then under memory pressure the reclaimer reclaims
+	unprotected memory first, then protected memory with a lower
+	memory.low.level, and finally protected memory with the
+	highest memory.low.level.
+
   memory.high
 	A read-write single value file which exists on non-root
 	cgroups. The default is "max".
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57..19bc589 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -6,6 +6,7 @@
 #include
 #include
 
+#define MEMCG_LOW_LEVEL_MAX 4
 struct page_counter {
 	atomic_long_t usage;
 	unsigned long min;
@@ -22,6 +23,8 @@ struct page_counter {
 	unsigned long elow;
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
+	unsigned long low_level;
+	unsigned long elow_level;
 
 	/* legacy */
 	unsigned long watermark;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50f5bc5..9da4ef9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5962,6 +5962,37 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_low_level_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%lu\n", memcg->memory.low_level);
+
+	return 0;
+}
+
+static ssize_t memory_low_level_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret, low_level;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &low_level);
+	if (ret)
+		return ret;
+
+	if (low_level < 0 || low_level >= MEMCG_LOW_LEVEL_MAX)
+		return -EINVAL;
+
+	memcg->memory.low_level = low_level;
+
+	return nbytes;
+}
+
 static int memory_high_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m, READ_ONCE(mem_cgroup_from_seq(m)->high));
@@ -6151,6 +6182,12 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 		.write = memory_low_write,
 	},
 	{
+		.name = "low.level",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_low_level_show,
+		.write = memory_low_level_write,
+	},
+	{
 		.name = "high",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_high_show,
@@ -6280,6 +6317,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	struct mem_cgroup *parent;
 	unsigned long emin, parent_emin;
 	unsigned long elow, parent_elow;
+	unsigned long elow_level, parent_elow_level;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6296,6 +6334,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	emin = memcg->memory.min;
 	elow = memcg->memory.low;
+	elow_level = memcg->memory.low_level;
 
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
@@ -6331,11 +6370,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		if (low_usage && siblings_low_usage)
 			elow = min(elow, parent_elow * low_usage /
 				   siblings_low_usage);
+
+		parent_elow_level = READ_ONCE(parent->memory.elow_level);
+		elow_level = min(elow_level, parent_elow_level);
+	} else {
+		elow_level = 0;
 	}
 
 exit:
 	memcg->memory.emin = emin;
 	memcg->memory.elow = elow;
+	memcg->memory.elow_level = elow_level;
 
 	if (usage <= emin)
 		return MEMCG_PROT_MIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d979852..3b08e85 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,15 +88,16 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	unsigned int hibernation_mode:1;
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
 	 * memory.low and nothing was reclaimed, go back for memory.low.
 	 */
-	unsigned int memcg_low_reclaim:1;
+	unsigned int memcg_low_level:3;
 	unsigned int memcg_low_skipped:1;
+	unsigned int memcg_low_step:2;
 
-	unsigned int hibernation_mode:1;
 
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
@@ -2403,10 +2404,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long lruvec_size;
 		unsigned long scan;
 		unsigned long protection;
+		bool memcg_low_reclaim = (sc->memcg_low_level >
+					  memcg->memory.elow_level);
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
 		protection = mem_cgroup_protection(memcg,
-						   sc->memcg_low_reclaim);
+						   memcg_low_reclaim);
 
 		if (protection) {
 			/*
@@ -2691,6 +2694,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
 		unsigned long scanned;
+		unsigned long step;
 
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
@@ -2706,7 +2710,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
-			if (!sc->memcg_low_reclaim) {
+			if (sc->memcg_low_level <= memcg->memory.elow_level) {
+				step = (memcg->memory.elow_level -
+					sc->memcg_low_level);
+				if (step < sc->memcg_low_step)
+					sc->memcg_low_step = step;
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
@@ -3007,6 +3015,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	pg_data_t *last_pgdat;
 	struct zoneref *z;
 	struct zone *zone;
+
+	sc->memcg_low_step = MEMCG_LOW_LEVEL_MAX - 1;
+
 retry:
 	delayacct_freepages_start();
 
@@ -3061,9 +3072,15 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		return 1;
 
 	/* Untapped cgroup reserves? Don't OOM, retry. */
-	if (sc->memcg_low_skipped) {
-		sc->priority = initial_priority;
-		sc->memcg_low_reclaim = 1;
+	if (sc->memcg_low_skipped &&
+	    sc->memcg_low_level < MEMCG_LOW_LEVEL_MAX) {
+		/*
+		 * If it is hard to reclaim page caches, we'd better use a
+		 * lower priority to avoid taking too much time.
+		 */
+		sc->priority = initial_priority > sc->memcg_low_level ?
+			(initial_priority - sc->memcg_low_level) : 0;
+		sc->memcg_low_level += sc->memcg_low_step + 1;
 		sc->memcg_low_skipped = 0;
 		goto retry;
 	}
-- 
1.8.3.1
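
For readers who want to check the reclaim order claimed in the changelog, the effective-level rule and the retry loop can be modelled in user space. This is only a rough sketch and not kernel code: struct toy_memcg, toy_elow_level() and toy_reclaim_passes() are hypothetical names introduced here. The model assumes that memory.low is set on every memcg, that a memcg with no parent in the model is a direct child of the reclaim target, and that no pass ever reclaims enough memory, so every level is eventually reached.

```c
#include <assert.h>
#include <stddef.h>

#define MEMCG_LOW_LEVEL_MAX 4	/* same value the patch defines */

/* Hypothetical user-space stand-in for a memcg; not a kernel structure. */
struct toy_memcg {
	const char *name;
	const struct toy_memcg *parent;	/* NULL: direct child of the reclaim target */
	unsigned int low_level;		/* value written to memory.low.level */
};

/* Effective level = min(own level, parent's effective level). */
static unsigned int toy_elow_level(const struct toy_memcg *cg)
{
	unsigned int level = cg->low_level;

	assert(level < MEMCG_LOW_LEVEL_MAX);
	if (cg->parent) {
		unsigned int parent_level = toy_elow_level(cg->parent);

		if (parent_level < level)
			level = parent_level;
	}
	return level;
}

/*
 * Model of the retry loop, assuming no pass ever reclaims enough memory.
 * pass_of[i] receives the pass in which cgs[i]'s protected memory first
 * becomes reclaimable (pass 0 only touches unprotected memory).
 * Returns the number of the last pass.
 */
static unsigned int toy_reclaim_passes(const struct toy_memcg *cgs, size_t n,
				       unsigned int *pass_of)
{
	unsigned int sc_level = 0;			/* models sc->memcg_low_level */
	unsigned int step = MEMCG_LOW_LEVEL_MAX - 1;	/* models sc->memcg_low_step */
	unsigned int pass = 0;
	size_t i;

	for (i = 0; i < n; i++)
		pass_of[i] = 0;

	for (;;) {
		int skipped = 0;

		for (i = 0; i < n; i++) {
			unsigned int elow = toy_elow_level(&cgs[i]);

			if (sc_level <= elow) {
				/* Still protected: remember the smallest gap seen. */
				if (elow - sc_level < step)
					step = elow - sc_level;
				skipped = 1;
				continue;
			}
			if (!pass_of[i])
				pass_of[i] = pass;
		}
		if (!skipped || sc_level >= MEMCG_LOW_LEVEL_MAX)
			break;
		sc_level += step + 1;	/* advance past the closest skipped level */
		pass++;
	}
	return pass;
}
```

Fed the changelog's example (B=2, B1=3 under B, B2=0 under B, C=1), this model makes B2 reclaimable in pass 1, C in pass 2, and B and B1 in pass 3, which agrees with the B2->C->B/B1 order stated above. As in the patch, the step is initialised once and never reset between passes.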