From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hasan Al Maruf
To: dave.hansen@linux.intel.com, ying.huang@intel.com,
    yang.shi@linux.alibaba.com, mgorman@techsingularity.net,
    riel@surriel.com, hannes@cmpxchg.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 2/5] NUMA balancing for tiered-memory system
Date: Wed, 24 Nov 2021 13:58:27 -0500
Message-Id: <06f961992a2c119ed0904825d8ab3f2b2a2c682b.1637778851.git.hasanalmaruf@fb.com>
X-Mailer: git-send-email 2.30.1 (Apple Git-130)
In-Reply-To: 
References: 
MIME-Version: 1.0

With the advent of new memory types and technologies, a server may have
different types of memory, e.g. DRAM, PMEM, CXL-enabled memory, etc. As
different types of memory usually have different levels of performance,
such a system can be called a tiered-memory system.

In a tiered-memory system, NUMA memory nodes can be CPU-less nodes. For
such a system, memory nodes with CPUs are considered toptier nodes,
while memory nodes without CPUs are non-toptier nodes.

In default NUMA balancing, NUMA hint faults are generated on both
toptier and non-toptier nodes. However, in a tiered-memory system, hot
pages in toptier memory nodes may not need to be migrated around. In
such cases, it is unnecessary to scan the pages in the toptier memory
nodes, so this patch disables that unnecessary scanning in the toptier
nodes of a tiered-memory system.

To support NUMA balancing for a tiered-memory system, the existing
sysctl user-space interface for enabling numa_balancing is extended in
a backward-compatible way, so that users can enable/disable these
functionalities individually. The sysctl is converted from a Boolean
value to a bit field. The current definition for
'/proc/sys/kernel/numa_balancing' is as follows:

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_TIERED_MEMORY

If a system has only a single toptier node online, default NUMA
balancing is automatically downgraded to the tiered-memory mode to
avoid the unnecessary scanning in the toptier node mentioned above.
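To illustrate the extended interface, here is a minimal userspace
sketch (illustration only, not part of this patch). It assumes the
mode macros below mirror the NUMA_BALANCING_* definitions this series
adds to include/linux/sched/sysctl.h, and that the caller has
CAP_SYS_ADMIN:

    #include <stdio.h>
    #include <stdlib.h>

    /* assumed to match include/linux/sched/sysctl.h in this series */
    #define NUMA_BALANCING_DISABLED       0x0
    #define NUMA_BALANCING_NORMAL         0x1
    #define NUMA_BALANCING_TIERED_MEMORY  0x2

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r+");
            int mode;

            if (!f || fscanf(f, "%d", &mode) != 1) {
                    perror("numa_balancing");
                    return EXIT_FAILURE;
            }
            printf("current mode: %#x\n", mode);

            /* request tiered-memory mode only (0x2) */
            rewind(f);
            fprintf(f, "%d", NUMA_BALANCING_TIERED_MEMORY);
            fclose(f);
            return EXIT_SUCCESS;
    }

Writing 0x3 (NORMAL | TIERED_MEMORY) is also accepted, since the
sysctl's upper bound is raised to three in kernel/sysctl.c below.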
Signed-off-by: Hasan Al Maruf
---
 Documentation/admin-guide/sysctl/kernel.rst | 18 +++++++++++
 include/linux/mempolicy.h                   |  2 ++
 include/linux/node.h                        |  7 ++++
 include/linux/sched/sysctl.h                |  6 ++++
 kernel/sched/core.c                         | 36 +++++++++++++++++----
 kernel/sched/fair.c                         | 10 +++++-
 kernel/sched/sched.h                        |  1 +
 kernel/sysctl.c                             |  7 ++--
 mm/huge_memory.c                            | 27 ++++++++++------
 mm/mprotect.c                               |  8 ++++-
 10 files changed, 101 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 24ab20d7a50a..1abab69dd5b6 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -608,6 +608,24 @@ numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
 numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
 
 
+By default, NUMA hinting faults are generated on both toptier and non-toptier
+nodes. However, in a tiered-memory system, hot pages in toptier memory nodes
+may not need to be migrated around. In such cases, it is unnecessary to scan
+the pages in the toptier memory nodes. For a tiered-memory system, unnecessary
+scanning and hinting faults in the toptier nodes are disabled.
+
+This interface takes a bit field as input. Supported values and corresponding
+modes are as follows:
+
+- 0x0: NUMA_BALANCING_DISABLED
+- 0x1: NUMA_BALANCING_NORMAL
+- 0x2: NUMA_BALANCING_TIERED_MEMORY
+
+If a system has only a single toptier node online, default NUMA balancing is
+automatically downgraded to the tiered-memory mode to avoid the unnecessary
+scanning and hinting faults.
+
+
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 ===============================================================================================================================
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index c7637cfa1be2..ab57b6a82e0a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -188,6 +188,7 @@ extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long,
 extern void mpol_put_task_policy(struct task_struct *);
 
 extern bool numa_demotion_enabled;
+extern bool numa_promotion_tiered_enabled;
 
 #else
 
@@ -299,5 +300,6 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
 }
 
 #define numa_demotion_enabled	false
+#define numa_promotion_tiered_enabled	false
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/include/linux/node.h b/include/linux/node.h
index 8e5a29897936..9a69b31cae74 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -181,4 +181,11 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
+static inline bool node_is_toptier(int node)
+{
+	// ideally, toptier nodes should be the memory nodes with CPUs;
+	// for now, just assume node 0 is the toptier memory
+	// return node_state(node, N_CPU);
+	return (node == 0);
+}
 #endif /* _LINUX_NODE_H_ */
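Note that the node_is_toptier() stub above hard-codes node 0 as the
only toptier node; its own comment points at the intended CPU-based
check. A sketch of that eventual variant, for illustration only (not
what this patch applies):

    static inline bool node_is_toptier(int node)
    {
            /* treat any node with local CPUs as toptier */
            return node_state(node, N_CPU);
    }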
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..249e00c42246 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,12 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#define NUMA_BALANCING_DISABLED		0x0
+#define NUMA_BALANCING_NORMAL		0x1
+#define NUMA_BALANCING_TIERED_MEMORY	0x2
+
+extern int sysctl_numa_balancing_mode;
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 790c573f7ed4..3d65e601b973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3596,9 +3596,29 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
+int sysctl_numa_balancing_mode;
+bool numa_promotion_tiered_enabled;
 
 #ifdef CONFIG_NUMA_BALANCING
 
+/*
+ * If there is only one toptier node available, pages on that
+ * node cannot be promoted to anywhere else. In that case,
+ * downgrade to the numa_promotion_tiered_enabled mode.
+ */
+static void check_numa_promotion_mode(void)
+{
+	int node, toptier_node_count = 0;
+
+	for_each_online_node(node) {
+		if (node_is_toptier(node))
+			++toptier_node_count;
+	}
+	if (toptier_node_count == 1) {
+		numa_promotion_tiered_enabled = true;
+	}
+}
+
 void set_numabalancing_state(bool enabled)
 {
 	if (enabled)
@@ -3611,20 +3631,22 @@ void set_numabalancing_state(bool enabled)
 int sysctl_numa_balancing(struct ctl_table *table, int write,
 			  void *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ctl_table t;
 	int err;
-	int state = static_branch_likely(&sched_numa_balancing);
 
 	if (write && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	t = *table;
-	t.data = &state;
-	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (err < 0)
 		return err;
-	if (write)
-		set_numabalancing_state(state);
+	if (write) {
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL)
+			check_numa_promotion_mode();
+		else if (sysctl_numa_balancing_mode & NUMA_BALANCING_TIERED_MEMORY)
+			numa_promotion_tiered_enabled = true;
+
+		set_numabalancing_state(*(int *)table->data);
+	}
 	return err;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 210612c9d1e9..45e39832a2b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1424,7 +1424,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 
 	count_vm_numa_event(PGPROMOTE_CANDIDATE);
 
-	if (flags & TNF_DEMOTED)
+	if (numa_demotion_enabled && (flags & TNF_DEMOTED))
 		count_vm_numa_event(PGPROMOTE_CANDIDATE_DEMOTED);
 
 	if (page_is_file_lru(page))
@@ -1435,6 +1435,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 
+	/*
+	 * Pages in a non-toptier memory node should be migrated
+	 * according to hot/cold status instead of the accessing
+	 * CPU's node.
+	 */
+	if (numa_promotion_tiered_enabled && !node_is_toptier(src_nid))
+		return true;
+
 	/*
 	 * Allow first faults or private faults to migrate immediately early in
 	 * the lifetime of a task. The magic number 4 is based on waiting for
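The fair.c hunk above changes the ordering of checks: with tiered mode
enabled, a fault on a page residing in a non-toptier node is
immediately treated as a promotion candidate, before the usual cpupid
locality filters run. A condensed restatement of that decision order,
as a stand-alone sketch (simplified names, not the literal kernel
code):

    /*
     * Illustrative restatement of the decision order after this
     * patch; not the literal should_numa_migrate_memory().
     */
    static bool tiered_migrate_decision(bool tiered_enabled,
                                        bool src_is_toptier,
                                        bool passes_cpupid_filter)
    {
            /* non-toptier pages are promoted on hotness alone */
            if (tiered_enabled && !src_is_toptier)
                    return true;

            /* otherwise fall back to the default locality checks */
            return passes_cpupid_filter;
    }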
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6057ad67d223..379f3b6f1a3f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -51,6 +51,7 @@
 #include
 #include
 #include
+#include <linux/node.h>
 #include
 #include
 #include
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6b6653529d92..751b52062eb4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -113,6 +113,7 @@ static int sixty = 60;
 
 static int __maybe_unused neg_one = -1;
 static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
@@ -1840,12 +1841,12 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "numa_balancing",
-		.data		= NULL, /* filled in by handler */
-		.maxlen		= sizeof(unsigned int),
+		.data		= &sysctl_numa_balancing_mode,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= sysctl_numa_balancing,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= &three,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9d7b9125c5e..b76a0990c5f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/node.h>
 #include
 #include
 #include
@@ -1849,16 +1850,24 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 #endif
 
-	/*
-	 * Avoid trapping faults against the zero page. The read-only
-	 * data is likely to be read-cached on the local CPU and
-	 * local/remote hits to the zero page are not interesting.
-	 */
-	if (prot_numa && is_huge_zero_pmd(*pmd))
-		goto unlock;
+	if (prot_numa) {
+		struct page *page;
+		/*
+		 * Avoid trapping faults against the zero page. The read-only
+		 * data is likely to be read-cached on the local CPU and
+		 * local/remote hits to the zero page are not interesting.
+		 */
+		if (is_huge_zero_pmd(*pmd))
+			goto unlock;
 
-	if (prot_numa && pmd_protnone(*pmd))
-		goto unlock;
+		if (pmd_protnone(*pmd))
+			goto unlock;
+
+		/* skip scanning toptier node */
+		page = pmd_page(*pmd);
+		if (numa_promotion_tiered_enabled && node_is_toptier(page_to_nid(page)))
+			goto unlock;
+	}
 
 	/*
	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..3171f435925b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,6 +83,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (prot_numa) {
 			struct page *page;
+			int nid;
 
 			/* Avoid TLB flush if possible */
 			if (pte_protnone(oldpte))
@@ -109,7 +110,12 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			 * Don't mess with PTEs if page is already on the node
 			 * a single-threaded process is running on.
 			 */
-			if (target_node == page_to_nid(page))
+			nid = page_to_nid(page);
+			if (target_node == nid)
+				continue;
+
+			/* skip scanning toptier node */
+			if (numa_promotion_tiered_enabled && node_is_toptier(nid))
 				continue;
 		}
 
-- 
2.30.2