From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Ying Huang <ying.huang@intel.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>,
	Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
Date: Wed, 08 Jun 2022 20:07:52 +0530
Message-ID: <87sfoffcfz.fsf@linux.ibm.com>
In-Reply-To: <cc9566421dedf10b5b7149d093992797540c31e2.camel@intel.com>

Ying Huang <ying.huang@intel.com> writes:

....

> > > 
>> > > is this good (not tested)?
>> > > /*
>> > >    * build the allowed promotion mask. Promotion is allowed
>> > >    * from higher memory tier to lower memory tier only if
>> > >    * lower memory tier doesn't include compute. We want to
>> > >    * skip promotion from a memory tier, if any node which is
>> > >    * part of that memory tier have CPUs. Once we detect such
>> > >    * a memory tier, we consider that tier as top tier from
>> > >    * which promotion is not allowed.
>> > >    */
>> > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> > > 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
>> > > 	if (nodes_empty(allowed))
>> > > 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
>> > > 	else
>> > > 		break;
>> > > }
>> > > 
>> > > and then
>> > > 
>> > > static inline bool node_is_toptier(int node)
>> > > {
>> > > 
>> > > 	return !node_isset(node, promotion_mask);
>> > > }
>> > > 
>> > 
>> > This should work.  But it appears unnatural.  So, I think we
>> > should avoid adding more and more node masks to mitigate the design
>> > decision that we cannot access memory tier information directly.  All
>> > of this becomes simple and natural if we can access memory tier
>> > information directly.
>> > 
>> 
>> How do you derive whether a node is toptier if we have memtier
>> details in pgdat?
>
> pgdat -> memory tier -> rank
>
> Then we can compare this rank with the fast memory rank.  The fast
> memory rank can be calculated dynamically at appropriate places.

This is what I am testing now. We still need to closely audit the
lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
separate patch, and once we all agree that it is safe, I will fold it
back.
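
For illustration only (not part of the patch), here is a minimal
user-space sketch of the "pgdat -> memory tier -> rank" lookup discussed
above. The names node_memtier[], compute_top_tier_rank(),
node_is_toptier_sketch(), MAX_NODES and the "fallback" parameter are
hypothetical. The sketch assumes the tiers are passed in ascending rank
order, which is meant to mirror walking memory_tiers in reverse, and it
uses a plain bitmask in place of nodemask_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical limit for this sketch */

struct memory_tier {
	int rank;		/* higher rank means faster memory */
	unsigned int nodemask;	/* bit n set: node n is in this tier */
};

/* Stand-in for NODE_DATA(node)->memtier. */
static struct memory_tier *node_memtier[MAX_NODES];

/*
 * Walk the tiers from the lowest rank upward and return the rank of
 * the first tier that contains a node with CPUs. If no tier has CPUs,
 * return the caller-supplied fallback (an assumption of this sketch).
 */
static int compute_top_tier_rank(const struct memory_tier *tiers,
				 size_t ntiers, unsigned int cpu_nodes,
				 int fallback)
{
	for (size_t i = 0; i < ntiers; i++) {
		if (tiers[i].nodemask & cpu_nodes)
			return tiers[i].rank;
	}
	return fallback;
}

/* A node without a tier is treated as top tier, as in the patch. */
static bool node_is_toptier_sketch(int node, int top_tier_rank)
{
	const struct memory_tier *memtier;

	assert(node >= 0 && node < MAX_NODES);
	memtier = node_memtier[node];
	if (!memtier)
		return true;
	return memtier->rank >= top_tier_rank;
}
```

With a DRAM tier (rank 200, nodes 0-1, both with CPUs) and a slower
CPU-less tier (rank 100, node 2), the computed top tier rank is 200, so
nodes 0 and 1 are top tier and node 2 is not.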

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a388a806b61a..3e733de1a8a0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,7 +17,6 @@
 #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
 
 extern bool numa_demotion_enabled;
-extern nodemask_t promotion_mask;
 int node_create_and_set_memory_tier(int node, int tier);
 int next_demotion_node(int node);
 int node_set_memory_tier(int node, int tier);
@@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
 int node_reset_memory_tier(int node, int tier);
 void node_remove_from_memory_tier(int node);
 void node_get_allowed_targets(int node, nodemask_t *targets);
-
-/*
- * By default all nodes are top tiper. As we create new memory tiers
- * we below top tiers we add them to NON_TOP_TIER state.
- */
-static inline bool node_is_toptier(int node)
-{
-	return !node_isset(node, promotion_mask);
-}
+bool node_is_toptier(int node);
 
 #else
 #define numa_demotion_enabled	false
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..c4fcfd2b9980 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_TIERED_MEMORY
+	struct memory_tier *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 29a038bb38b0..31ef0fab5f19 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
 #include <linux/random.h>
 #include <linux/memory.h>
 #include <linux/idr.h>
+#include <linux/rcupdate.h>
 
 #include "internal.h"
 
@@ -26,7 +27,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
-nodemask_t promotion_mask;
+static int top_tier_rank;
 /*
  * node_demotion[] examples:
  *
@@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
 	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
 		ida_free(&memtier_dev_id, tier->dev.id);
 
-	kfree(tier);
+	kfree_rcu(tier);
 }
 
 /*
@@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock held. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_clear(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	/*
+	 * Make sure the read side sees the NULL value before we clear the
+	 * node from the nodelist.
+	 */
+	synchronize_rcu();
+	node_clear(node, memtier->nodelist);
+}
+
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
+	 * finds either a NULL memtier or one which is still valid.
+	 */
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->rank >= top_tier_rank)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_create_and_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
 	 * empty then unregister it to make it invisible
 	 * in sysfs.
 	 */
-	node_clear(node, memtier->nodelist);
+	memtier_node_clear(node, memtier);
 	if (nodes_empty(memtier->nodelist))
 		unregister_memory_tier(memtier);
 
@@ -570,15 +635,13 @@ static void establish_migration_targets(void)
 	 * a memory tier, we consider that tier as top tiper from
 	 * which promotion is not allowed.
 	 */
-	promotion_mask = NODE_MASK_NONE;
 	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
 		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
-		if (nodes_empty(allowed))
-			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
-		else
+		if (!nodes_empty(allowed)) {
+			top_tier_rank = memtier->rank;
 			break;
+		}
 	}
-
 	pr_emerg("top tier rank is %d\n", top_tier_rank);
 	allowed = NODE_MASK_NONE;
 	/*
@@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 
 static int __init memory_tier_init(void)
 {
-	int ret;
+	int ret, node;
 	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
@@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		/*
+		 * Should be safe to do this early in the boot.
+		 */
+		NODE_DATA(node)->memtier = memtier;
+		node_set(node, memtier->nodelist);
+	}
 	migrate_on_reclaim_init();
 
 	return 0;
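
For illustration only (not part of the patch), the reader-side rule in
node_is_toptier(), "a NULL memtier means top tier, otherwise compare
ranks", can be modelled in user space. This sketch swaps in C11
release/acquire atomics where the patch uses rcu_assign_pointer() and
rcu_dereference(). It does not model the grace period that
synchronize_rcu() and kfree_rcu() provide, so it says nothing about
when a tier may be freed. The names node_tier, tier_publish(),
tier_unpublish(), node_toptier() and the value of top_rank are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tier {
	int rank;	/* higher rank means faster memory */
};

/* Stand-in for a single node's pgdat->memtier pointer. */
static _Atomic(struct tier *) node_tier;

/* Stand-in for top_tier_rank; the value is arbitrary. */
static int top_rank = 100;

/* Writer: hide the tier from readers before changing membership. */
static void tier_unpublish(void)
{
	atomic_store_explicit(&node_tier, NULL, memory_order_release);
}

/* Writer: make a fully initialised tier visible to readers. */
static void tier_publish(struct tier *t)
{
	atomic_store_explicit(&node_tier, t, memory_order_release);
}

/* Reader: sees either NULL or a tier that was fully initialised. */
static bool node_toptier(void)
{
	struct tier *t;

	t = atomic_load_explicit(&node_tier, memory_order_acquire);
	if (!t)
		return true;
	return t->rank >= top_rank;
}
```

A reader that runs between tier_unpublish() and tier_publish() observes
NULL and reports the node as top tier, which matches the default taken
in the patch when pgdat->memtier is NULL.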
