* [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework
@ 2018-06-24 21:20 James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 01/26] staging: lustre: libcfs: remove useless CPU partition code James Simmons
                   ` (26 more replies)
  0 siblings, 27 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

Recently Lustre support has been expanded to extreme machines with as
many as 1000+ cores. On the other end, Lustre has also been ported to
platforms like ARM and KNL which have unique NUMA and core layouts.
For example, some devices exist that have NUMA nodes with no cores.
With these new platforms the limitations of Lustre's SMP code came to
light, so a lot of work was needed. This resulted in this patch set,
which has been tested on these platforms.

This is the 3rd version of this patch set; the first two were submitted
to the staging list. This latest patch set is identical to the 2nd one
except that the UMP support has been moved to the last patches in the
collection. The approach to supporting UMP has also changed: it now uses
static initialization, which greatly simplifies the code.

Amir Shehata (8):
  staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids
  staging: lustre: libcfs: remove excess space
  staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids
  staging: lustre: libcfs: NUMA support
  staging: lustre: libcfs: add cpu distance handling
  staging: lustre: libcfs: use distance in cpu and node handling
  staging: lustre: libcfs: provide debugfs files for distance handling
  staging: lustre: libcfs: invert error handling for cfs_cpt_table_print

Dmitry Eremin (14):
  staging: lustre: libcfs: remove useless CPU partition code
  staging: lustre: libcfs: rename variable i to cpu
  staging: lustre: libcfs: fix libcfs_cpu coding style
  staging: lustre: libcfs: use int type for CPT identification.
  staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask
  staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind
  staging: lustre: libcfs: rename cpumask_var_t variables to *_mask
  staging: lustre: libcfs: update debug messages
  staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes
  staging: lustre: libcfs: report NUMA node instead of just node
  staging: lustre: libcfs: update debug messages in CPT code
  staging: lustre: libcfs: rework CPU pattern parsing code
  staging: lustre: libcfs: change CPT estimate algorithm
  staging: lustre: ptlrpc: use current CPU instead of hardcoded 0

James Simmons (4):
  staging: lustre: libcfs: properly handle failure cases in SMP code
  staging: lustre: libcfs: restore debugfs table reporting for UMP
  staging: lustre: libcfs: make cfs_cpt_tab a static structure
  staging: lustre: libcfs: restore UMP support

 .../lustre/include/linux/libcfs/libcfs_cpu.h       |  203 ++--
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 1020 +++++++++++---------
 drivers/staging/lustre/lnet/libcfs/module.c        |   52 +-
 drivers/staging/lustre/lnet/lnet/api-ni.c          |    4 +-
 drivers/staging/lustre/lnet/lnet/lib-msg.c         |    2 +
 drivers/staging/lustre/lnet/selftest/framework.c   |    2 +-
 drivers/staging/lustre/lustre/ptlrpc/client.c      |    4 +-
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     |   10 +-
 drivers/staging/lustre/lustre/ptlrpc/service.c     |   15 +-
 9 files changed, 750 insertions(+), 562 deletions(-)

-- 
1.8.3.1


* [lustre-devel] [PATCH v3 01/26] staging: lustre: libcfs: remove useless CPU partition code
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 02/26] staging: lustre: libcfs: rename variable i to cpu James Simmons
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

* remove the scratch buffer and the mutex that guards it.
* remove the global cpumask and the spinlock that guards it.
* remove cpt_version, which was used to detect CPU state changes during
  setup; instead, CPU state changes are simply disabled while the table
  is being set up (see the sketch after this list).
* remove the whole global struct cfs_cpt_data cpt_data.
* remove a few unused APIs.
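For illustration only (not part of this patch): with the version counter
gone, hotplug races are avoided by holding the CPU hotplug read lock
while the partition table is built, roughly like the sketch below. The
names follow the existing libcfs code (cfs_cpt_table_create(),
cpu_npartitions); error handling is trimmed for brevity.

static int example_cpu_init(void)
{
	struct cfs_cpt_table *cptab;

	/* block CPU hotplug while the partition table is built */
	get_online_cpus();
	cptab = cfs_cpt_table_create(cpu_npartitions);
	put_online_cpus();

	return cptab ? 0 : -EINVAL;
}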

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23303
Reviewed-on: https://review.whamcloud.com/25048
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h       |  32 ++----
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 115 +++------------------
 2 files changed, 22 insertions(+), 125 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 61641c4..1b4333d 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -93,8 +93,6 @@ struct cfs_cpu_partition {
 
 /** descriptor for CPU partitions */
 struct cfs_cpt_table {
-	/* version, reserved for hotplug */
-	unsigned int			ctb_version;
 	/* spread rotor for NUMA allocator */
 	unsigned int			ctb_spread_rotor;
 	/* # of CPU partitions */
@@ -162,12 +160,12 @@ struct cfs_cpt_table {
  * return 1 if successfully set all CPUs, otherwise return 0
  */
 int cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab,
-			int cpt, cpumask_t *mask);
+			int cpt, const cpumask_t *mask);
 /**
  * remove all cpus in \a mask from CPU partition \a cpt
  */
 void cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab,
-			   int cpt, cpumask_t *mask);
+			   int cpt, const cpumask_t *mask);
 /**
  * add all cpus in NUMA node \a node to CPU partition \a cpt
  * return 1 if successfully set all CPUs, otherwise return 0
@@ -190,20 +188,11 @@ int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab,
 void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 			    int cpt, nodemask_t *mask);
 /**
- * unset all cpus for CPU partition \a cpt
- */
-void cfs_cpt_clear(struct cfs_cpt_table *cptab, int cpt);
-/**
  * convert partition id \a cpt to numa node id, if there are more than one
  * nodes in this partition, it might return a different node id each time.
  */
 int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt);
 
-/**
- * return number of HTs in the same core of \a cpu
- */
-int cfs_cpu_ht_nsiblings(int cpu);
-
 int  cfs_cpu_init(void);
 void cfs_cpu_fini(void);
 
@@ -258,13 +247,15 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 }
 
 static inline int
-cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt, cpumask_t *mask)
+cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
+		    const cpumask_t *mask)
 {
 	return 1;
 }
 
 static inline void
-cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt, cpumask_t *mask)
+cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
+		      const cpumask_t *mask)
 {
 }
 
@@ -290,11 +281,6 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 {
 }
 
-static inline void
-cfs_cpt_clear(struct cfs_cpt_table *cptab, int cpt)
-{
-}
-
 static inline int
 cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 {
@@ -302,12 +288,6 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 }
 
 static inline int
-cfs_cpu_ht_nsiblings(int cpu)
-{
-	return 1;
-}
-
-static inline int
 cfs_cpt_current(struct cfs_cpt_table *cptab, int remap)
 {
 	return 0;
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 3d1cf45..b363a3d 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -73,19 +73,6 @@
 module_param(cpu_pattern, charp, 0444);
 MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
 
-static struct cfs_cpt_data {
-	/* serialize hotplug etc */
-	spinlock_t		cpt_lock;
-	/* reserved for hotplug */
-	unsigned long		cpt_version;
-	/* mutex to protect cpt_cpumask */
-	struct mutex		cpt_mutex;
-	/* scratch buffer for set/unset_node */
-	cpumask_var_t		cpt_cpumask;
-} cpt_data;
-
-#define CFS_CPU_VERSION_MAGIC	   0xbabecafe
-
 struct cfs_cpt_table *
 cfs_cpt_table_alloc(unsigned int ncpt)
 {
@@ -128,11 +115,6 @@ struct cfs_cpt_table *
 			goto failed;
 	}
 
-	spin_lock(&cpt_data.cpt_lock);
-	/* Reserved for hotplug */
-	cptab->ctb_version = cpt_data.cpt_version;
-	spin_unlock(&cpt_data.cpt_lock);
-
 	return cptab;
 
  failed:
@@ -207,17 +189,6 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_table_print);
 
-static void
-cfs_node_to_cpumask(int node, cpumask_t *mask)
-{
-	const cpumask_t *tmp = cpumask_of_node(node);
-
-	if (tmp)
-		cpumask_copy(mask, tmp);
-	else
-		cpumask_clear(mask);
-}
-
 int
 cfs_cpt_number(struct cfs_cpt_table *cptab)
 {
@@ -370,7 +341,8 @@ struct cfs_cpt_table *
 EXPORT_SYMBOL(cfs_cpt_unset_cpu);
 
 int
-cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt, cpumask_t *mask)
+cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
+		    const cpumask_t *mask)
 {
 	int i;
 
@@ -391,7 +363,8 @@ struct cfs_cpt_table *
 EXPORT_SYMBOL(cfs_cpt_set_cpumask);
 
 void
-cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt, cpumask_t *mask)
+cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
+		      const cpumask_t *mask)
 {
 	int i;
 
@@ -403,7 +376,7 @@ struct cfs_cpt_table *
 int
 cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
-	int rc;
+	const cpumask_t *mask;
 
 	if (node < 0 || node >= MAX_NUMNODES) {
 		CDEBUG(D_INFO,
@@ -411,34 +384,26 @@ struct cfs_cpt_table *
 		return 0;
 	}
 
-	mutex_lock(&cpt_data.cpt_mutex);
-
-	cfs_node_to_cpumask(node, cpt_data.cpt_cpumask);
-
-	rc = cfs_cpt_set_cpumask(cptab, cpt, cpt_data.cpt_cpumask);
+	mask = cpumask_of_node(node);
 
-	mutex_unlock(&cpt_data.cpt_mutex);
-
-	return rc;
+	return cfs_cpt_set_cpumask(cptab, cpt, mask);
 }
 EXPORT_SYMBOL(cfs_cpt_set_node);
 
 void
 cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
+	const cpumask_t *mask;
+
 	if (node < 0 || node >= MAX_NUMNODES) {
 		CDEBUG(D_INFO,
 		       "Invalid NUMA id %d for CPU partition %d\n", node, cpt);
 		return;
 	}
 
-	mutex_lock(&cpt_data.cpt_mutex);
-
-	cfs_node_to_cpumask(node, cpt_data.cpt_cpumask);
-
-	cfs_cpt_unset_cpumask(cptab, cpt, cpt_data.cpt_cpumask);
+	mask = cpumask_of_node(node);
 
-	mutex_unlock(&cpt_data.cpt_mutex);
+	cfs_cpt_unset_cpumask(cptab, cpt, mask);
 }
 EXPORT_SYMBOL(cfs_cpt_unset_node);
 
@@ -466,26 +431,6 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_unset_nodemask);
 
-void
-cfs_cpt_clear(struct cfs_cpt_table *cptab, int cpt)
-{
-	int last;
-	int i;
-
-	if (cpt == CFS_CPT_ANY) {
-		last = cptab->ctb_nparts - 1;
-		cpt = 0;
-	} else {
-		last = cpt;
-	}
-
-	for (; cpt <= last; cpt++) {
-		for_each_cpu(i, cptab->ctb_parts[cpt].cpt_cpumask)
-			cfs_cpt_unset_cpu(cptab, cpt, i);
-	}
-}
-EXPORT_SYMBOL(cfs_cpt_clear);
-
 int
 cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 {
@@ -758,7 +703,7 @@ struct cfs_cpt_table *
 	}
 
 	for_each_online_node(i) {
-		cfs_node_to_cpumask(i, mask);
+		cpumask_copy(mask, cpumask_of_node(i));
 
 		while (!cpumask_empty(mask)) {
 			struct cfs_cpu_partition *part;
@@ -964,16 +909,8 @@ struct cfs_cpt_table *
 #ifdef CONFIG_HOTPLUG_CPU
 static enum cpuhp_state lustre_cpu_online;
 
-static void cfs_cpu_incr_cpt_version(void)
-{
-	spin_lock(&cpt_data.cpt_lock);
-	cpt_data.cpt_version++;
-	spin_unlock(&cpt_data.cpt_lock);
-}
-
 static int cfs_cpu_online(unsigned int cpu)
 {
-	cfs_cpu_incr_cpt_version();
 	return 0;
 }
 
@@ -981,14 +918,9 @@ static int cfs_cpu_dead(unsigned int cpu)
 {
 	bool warn;
 
-	cfs_cpu_incr_cpt_version();
-
-	mutex_lock(&cpt_data.cpt_mutex);
 	/* if all HTs in a core are offline, it may break affinity */
-	cpumask_copy(cpt_data.cpt_cpumask, topology_sibling_cpumask(cpu));
-	warn = cpumask_any_and(cpt_data.cpt_cpumask,
+	warn = cpumask_any_and(topology_sibling_cpumask(cpu),
 			       cpu_online_mask) >= nr_cpu_ids;
-	mutex_unlock(&cpt_data.cpt_mutex);
 	CDEBUG(warn ? D_WARNING : D_INFO,
 	       "Lustre: can't support CPU plug-out well now, performance and stability could be impacted [CPU %u]\n",
 	       cpu);
@@ -1007,7 +939,6 @@ static int cfs_cpu_dead(unsigned int cpu)
 		cpuhp_remove_state_nocalls(lustre_cpu_online);
 	cpuhp_remove_state_nocalls(CPUHP_LUSTRE_CFS_DEAD);
 #endif
-	free_cpumask_var(cpt_data.cpt_cpumask);
 }
 
 int
@@ -1017,16 +948,6 @@ static int cfs_cpu_dead(unsigned int cpu)
 
 	LASSERT(!cfs_cpt_tab);
 
-	memset(&cpt_data, 0, sizeof(cpt_data));
-
-	if (!zalloc_cpumask_var(&cpt_data.cpt_cpumask, GFP_NOFS)) {
-		CERROR("Failed to allocate scratch buffer\n");
-		return -1;
-	}
-
-	spin_lock_init(&cpt_data.cpt_lock);
-	mutex_init(&cpt_data.cpt_mutex);
-
 #ifdef CONFIG_HOTPLUG_CPU
 	ret = cpuhp_setup_state_nocalls(CPUHP_LUSTRE_CFS_DEAD,
 					"staging/lustre/cfe:dead", NULL,
@@ -1042,6 +963,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 #endif
 	ret = -EINVAL;
 
+	get_online_cpus();
 	if (*cpu_pattern) {
 		char *cpu_pattern_dup = kstrdup(cpu_pattern, GFP_KERNEL);
 
@@ -1067,13 +989,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 		}
 	}
 
-	spin_lock(&cpt_data.cpt_lock);
-	if (cfs_cpt_tab->ctb_version != cpt_data.cpt_version) {
-		spin_unlock(&cpt_data.cpt_lock);
-		CERROR("CPU hotplug/unplug during setup\n");
-		goto failed;
-	}
-	spin_unlock(&cpt_data.cpt_lock);
+	put_online_cpus();
 
 	LCONSOLE(0, "HW nodes: %d, HW CPU cores: %d, npartitions: %d\n",
 		 num_online_nodes(), num_online_cpus(),
@@ -1081,6 +997,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 	return 0;
 
  failed:
+	put_online_cpus();
 	cfs_cpu_fini();
 	return ret;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 02/26] staging: lustre: libcfs: rename variable i to cpu
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 01/26] staging: lustre: libcfs: remove useless CPU partition code James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code James Simmons
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Change the name of the variable i used for for_each_cpu() to cpu
for code readability.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23303
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index b363a3d..46d3530 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -344,7 +344,7 @@ struct cfs_cpt_table *
 cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
 		    const cpumask_t *mask)
 {
-	int i;
+	int cpu;
 
 	if (!cpumask_weight(mask) ||
 	    cpumask_any_and(mask, cpu_online_mask) >= nr_cpu_ids) {
@@ -353,8 +353,8 @@ struct cfs_cpt_table *
 		return 0;
 	}
 
-	for_each_cpu(i, mask) {
-		if (!cfs_cpt_set_cpu(cptab, cpt, i))
+	for_each_cpu(cpu, mask) {
+		if (!cfs_cpt_set_cpu(cptab, cpt, cpu))
 			return 0;
 	}
 
@@ -366,10 +366,10 @@ struct cfs_cpt_table *
 cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
 		      const cpumask_t *mask)
 {
-	int i;
+	int cpu;
 
-	for_each_cpu(i, mask)
-		cfs_cpt_unset_cpu(cptab, cpt, i);
+	for_each_cpu(cpu, mask)
+		cfs_cpt_unset_cpu(cptab, cpt, cpu);
 }
 EXPORT_SYMBOL(cfs_cpt_unset_cpumask);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 01/26] staging: lustre: libcfs: remove useless CPU partition code James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 02/26] staging: lustre: libcfs: rename variable i to cpu James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  0:20   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 04/26] staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids James Simmons
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

While pushing the SMP work some bugs in the code were pointed out by
Dan Carpenter. Due to the single error label in cfs_cpu_init() and
cfs_cpt_table_alloc(), a few items that were never initialized were
being cleaned up, which can lead to crashes and other problems. In
those initialization functions introduce individual labels to jump to,
so that only the things that were actually initialized get freed on
failure.
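
A rough sketch of the idiom (hypothetical names, not the exact libcfs
code): every allocation gets its own label, and the unwind path frees
only what had already been set up at the point of failure.

struct foo {
	int *a;
	int *b;
};

static struct foo *foo_alloc(void)
{
	struct foo *f;

	f = kzalloc(sizeof(*f), GFP_KERNEL);
	if (!f)
		return NULL;

	f->a = kzalloc(sizeof(*f->a), GFP_KERNEL);
	if (!f->a)
		goto failed_alloc_a;

	f->b = kzalloc(sizeof(*f->b), GFP_KERNEL);
	if (!f->b)
		goto failed_alloc_b;

	return f;

failed_alloc_b:
	kfree(f->a);	/* undo only the allocations that succeeded */
failed_alloc_a:
	kfree(f);
	return NULL;
}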

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10932
Reviewed-on: https://review.whamcloud.com/32085
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 72 ++++++++++++++++++-------
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 46d3530..bdd71a3 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -85,17 +85,19 @@ struct cfs_cpt_table *
 
 	cptab->ctb_nparts = ncpt;
 
+	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS))
+		goto failed_alloc_cpumask;
+
 	cptab->ctb_nodemask = kzalloc(sizeof(*cptab->ctb_nodemask),
 				      GFP_NOFS);
-	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS) ||
-	    !cptab->ctb_nodemask)
-		goto failed;
+	if (!cptab->ctb_nodemask)
+		goto failed_alloc_nodemask;
 
 	cptab->ctb_cpu2cpt = kvmalloc_array(num_possible_cpus(),
 					    sizeof(cptab->ctb_cpu2cpt[0]),
 					    GFP_KERNEL);
 	if (!cptab->ctb_cpu2cpt)
-		goto failed;
+		goto failed_alloc_cpu2cpt;
 
 	memset(cptab->ctb_cpu2cpt, -1,
 	       num_possible_cpus() * sizeof(cptab->ctb_cpu2cpt[0]));
@@ -103,22 +105,41 @@ struct cfs_cpt_table *
 	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
 					  GFP_KERNEL);
 	if (!cptab->ctb_parts)
-		goto failed;
+		goto failed_alloc_ctb_parts;
+
+	memset(cptab->ctb_parts, -1, ncpt * sizeof(cptab->ctb_parts[0]));
 
 	for (i = 0; i < ncpt; i++) {
 		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
 
+		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS))
+			goto failed_setting_ctb_parts;
+
 		part->cpt_nodemask = kzalloc(sizeof(*part->cpt_nodemask),
 					     GFP_NOFS);
-		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS) ||
-		    !part->cpt_nodemask)
-			goto failed;
+		if (!part->cpt_nodemask)
+			goto failed_setting_ctb_parts;
 	}
 
 	return cptab;
 
- failed:
-	cfs_cpt_table_free(cptab);
+failed_setting_ctb_parts:
+	while (i-- >= 0) {
+		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
+
+		kfree(part->cpt_nodemask);
+		free_cpumask_var(part->cpt_cpumask);
+	}
+
+	kvfree(cptab->ctb_parts);
+failed_alloc_ctb_parts:
+	kvfree(cptab->ctb_cpu2cpt);
+failed_alloc_cpu2cpt:
+	kfree(cptab->ctb_nodemask);
+failed_alloc_nodemask:
+	free_cpumask_var(cptab->ctb_cpumask);
+failed_alloc_cpumask:
+	kfree(cptab);
 	return NULL;
 }
 EXPORT_SYMBOL(cfs_cpt_table_alloc);
@@ -944,7 +965,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 int
 cfs_cpu_init(void)
 {
-	int ret = 0;
+	int ret;
 
 	LASSERT(!cfs_cpt_tab);
 
@@ -953,23 +974,23 @@ static int cfs_cpu_dead(unsigned int cpu)
 					"staging/lustre/cfe:dead", NULL,
 					cfs_cpu_dead);
 	if (ret < 0)
-		goto failed;
+		goto failed_cpu_dead;
+
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
 					"staging/lustre/cfe:online",
 					cfs_cpu_online, NULL);
 	if (ret < 0)
-		goto failed;
+		goto failed_cpu_online;
+
 	lustre_cpu_online = ret;
 #endif
-	ret = -EINVAL;
-
 	get_online_cpus();
 	if (*cpu_pattern) {
 		char *cpu_pattern_dup = kstrdup(cpu_pattern, GFP_KERNEL);
 
 		if (!cpu_pattern_dup) {
 			CERROR("Failed to duplicate cpu_pattern\n");
-			goto failed;
+			goto failed_alloc_table;
 		}
 
 		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern_dup);
@@ -977,7 +998,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 		if (!cfs_cpt_tab) {
 			CERROR("Failed to create cptab from pattern %s\n",
 			       cpu_pattern);
-			goto failed;
+			goto failed_alloc_table;
 		}
 
 	} else {
@@ -985,7 +1006,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 		if (!cfs_cpt_tab) {
 			CERROR("Failed to create ptable with npartitions %d\n",
 			       cpu_npartitions);
-			goto failed;
+			goto failed_alloc_table;
 		}
 	}
 
@@ -996,8 +1017,19 @@ static int cfs_cpu_dead(unsigned int cpu)
 		 cfs_cpt_number(cfs_cpt_tab));
 	return 0;
 
- failed:
+failed_alloc_table:
 	put_online_cpus();
-	cfs_cpu_fini();
+
+	if (cfs_cpt_tab)
+		cfs_cpt_table_free(cfs_cpt_tab);
+
+	ret = -EINVAL;
+#ifdef CONFIG_HOTPLUG_CPU
+	if (lustre_cpu_online > 0)
+		cpuhp_remove_state_nocalls(lustre_cpu_online);
+failed_cpu_online:
+	cpuhp_remove_state_nocalls(CPUHP_LUSTRE_CFS_DEAD);
+failed_cpu_dead:
+#endif
 	return ret;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 04/26] staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (2 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space James Simmons
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Replace MAX_NUMNODES, which is considered deprecated, with
nr_node_ids. Looking at page_alloc.c you will see that nr_node_ids
is initialized to MAX_NUMNODES. MAX_NUMNODES itself is set up via
Kconfig.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index bdd71a3..ea8d55c 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -399,7 +399,7 @@ struct cfs_cpt_table *
 {
 	const cpumask_t *mask;
 
-	if (node < 0 || node >= MAX_NUMNODES) {
+	if (node < 0 || node >= nr_node_ids) {
 		CDEBUG(D_INFO,
 		       "Invalid NUMA id %d for CPU partition %d\n", node, cpt);
 		return 0;
@@ -416,7 +416,7 @@ struct cfs_cpt_table *
 {
 	const cpumask_t *mask;
 
-	if (node < 0 || node >= MAX_NUMNODES) {
+	if (node < 0 || node >= nr_node_ids) {
 		CDEBUG(D_INFO,
 		       "Invalid NUMA id %d for CPU partition %d\n", node, cpt);
 		return;
@@ -840,7 +840,7 @@ struct cfs_cpt_table *
 		return cptab;
 	}
 
-	high = node ? MAX_NUMNODES - 1 : nr_cpu_ids - 1;
+	high = node ? nr_node_ids - 1 : nr_cpu_ids - 1;
 
 	for (str = strim(pattern), c = 0;; c++) {
 		struct cfs_range_expr *range;
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (3 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 04/26] staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  0:35   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 06/26] staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids James Simmons
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

The function cfs_cpt_table_print() was adding two spaces
to the string buffer. Just add one space.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index ea8d55c..680a2b1 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -177,7 +177,7 @@ struct cfs_cpt_table *
 
 	for (i = 0; i < cptab->ctb_nparts; i++) {
 		if (len > 0) {
-			rc = snprintf(tmp, len, "%d\t: ", i);
+			rc = snprintf(tmp, len, "%d\t:", i);
 			len -= rc;
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 06/26] staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (4 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support James Simmons
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Move from num_possible_cpus() to nr_cpu_ids.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 680a2b1..33294da 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -93,14 +93,14 @@ struct cfs_cpt_table *
 	if (!cptab->ctb_nodemask)
 		goto failed_alloc_nodemask;
 
-	cptab->ctb_cpu2cpt = kvmalloc_array(num_possible_cpus(),
+	cptab->ctb_cpu2cpt = kvmalloc_array(nr_cpu_ids,
 					    sizeof(cptab->ctb_cpu2cpt[0]),
 					    GFP_KERNEL);
 	if (!cptab->ctb_cpu2cpt)
 		goto failed_alloc_cpu2cpt;
 
 	memset(cptab->ctb_cpu2cpt, -1,
-	       num_possible_cpus() * sizeof(cptab->ctb_cpu2cpt[0]));
+	       nr_cpu_ids * sizeof(cptab->ctb_cpu2cpt[0]));
 
 	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
 					  GFP_KERNEL);
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (5 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 06/26] staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  0:39   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling James Simmons
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

This patch adds NUMA node support. NUMA node information is stored
in the CPT table. A NUMA node mask is maintained for the entire
table as well as for each CPT, to track the NUMA nodes related to
each of the CPTs. Add a new function, cfs_cpt_of_node(), which
returns the CPT of a particular NUMA node.
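
For example (illustrative only; cptab and dev are assumed to exist in
the caller), code that already knows a NUMA node can look up the
matching partition and bind to it:

	/* map a device's NUMA node to its CPT and bind the current
	 * thread there; CFS_CPT_ANY means no specific partition */
	int cpt = cfs_cpt_of_node(cptab, dev_to_node(dev));

	if (cpt != CFS_CPT_ANY)
		cfs_cpt_bind(cptab, cpt);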

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h        | 11 +++++++++++
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c     | 21 +++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 1b4333d..ff3ecf5 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -103,6 +103,8 @@ struct cfs_cpt_table {
 	int				*ctb_cpu2cpt;
 	/* all cpus in this partition table */
 	cpumask_var_t			ctb_cpumask;
+	/* shadow HW node to CPU partition ID */
+	int				*ctb_node2cpt;
 	/* all nodes in this partition table */
 	nodemask_t			*ctb_nodemask;
 };
@@ -143,6 +145,10 @@ struct cfs_cpt_table {
  */
 int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu);
 /**
+ * shadow HW node ID \a NODE to CPU-partition ID by \a cptab
+ */
+int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
+/**
  * bind current thread on a CPU-partition \a cpt of \a cptab
  */
 int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
@@ -299,6 +305,11 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 	return 0;
 }
 
+static inline int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
+{
+	return 0;
+}
+
 static inline int
 cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 {
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 33294da..8c5cf7b 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -102,6 +102,15 @@ struct cfs_cpt_table *
 	memset(cptab->ctb_cpu2cpt, -1,
 	       nr_cpu_ids * sizeof(cptab->ctb_cpu2cpt[0]));
 
+	cptab->ctb_node2cpt = kvmalloc_array(nr_node_ids,
+					     sizeof(cptab->ctb_node2cpt[0]),
+					     GFP_KERNEL);
+	if (!cptab->ctb_node2cpt)
+		goto failed_alloc_node2cpt;
+
+	memset(cptab->ctb_node2cpt, -1,
+	       nr_node_ids * sizeof(cptab->ctb_node2cpt[0]));
+
 	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
 					  GFP_KERNEL);
 	if (!cptab->ctb_parts)
@@ -133,6 +142,8 @@ struct cfs_cpt_table *
 
 	kvfree(cptab->ctb_parts);
 failed_alloc_ctb_parts:
+	kvfree(cptab->ctb_node2cpt);
+failed_alloc_node2cpt:
 	kvfree(cptab->ctb_cpu2cpt);
 failed_alloc_cpu2cpt:
 	kfree(cptab->ctb_nodemask);
@@ -150,6 +161,7 @@ struct cfs_cpt_table *
 	int i;
 
 	kvfree(cptab->ctb_cpu2cpt);
+	kvfree(cptab->ctb_node2cpt);
 
 	for (i = 0; cptab->ctb_parts && i < cptab->ctb_nparts; i++) {
 		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
@@ -515,6 +527,15 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_of_cpu);
 
+int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
+{
+	if (node < 0 || node > nr_node_ids)
+		return CFS_CPT_ANY;
+
+	return cptab->ctb_node2cpt[node];
+}
+EXPORT_SYMBOL(cfs_cpt_of_node);
+
 int
 cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (6 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  0:48   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 09/26] staging: lustre: libcfs: use distance in cpu and node handling James Simmons
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Add functionality to calculate the distance between two CPTs.
Expose those distances in debugfs so that people deploying a setup
can debug what is being created for the CPTs.
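
The distance between two CPTs is intended to be the largest
node_distance() over every pair of NUMA nodes drawn from the two
partitions' node masks. A minimal sketch of that calculation (the full
implementation arrives later in this series) looks like:

static unsigned int max_node_distance(nodemask_t *from_mask,
				      nodemask_t *to_mask)
{
	unsigned int max = 0;
	int from;
	int to;

	for_each_node_mask(from, *from_mask) {
		for_each_node_mask(to, *to_mask) {
			/* node_distance() is the firmware-reported
			 * NUMA distance between the two nodes */
			if (node_distance(from, to) > max)
				max = node_distance(from, to);
		}
	}
	return max;
}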

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h       | 31 +++++++++++
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 61 ++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index ff3ecf5..a015ac1 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -86,6 +86,8 @@ struct cfs_cpu_partition {
 	cpumask_var_t			cpt_cpumask;
 	/* nodes mask for this partition */
 	nodemask_t			*cpt_nodemask;
+	/* NUMA distance between CPTs */
+	unsigned int			*cpt_distance;
 	/* spread rotor for NUMA allocator */
 	unsigned int			cpt_spread_rotor;
 };
@@ -95,6 +97,8 @@ struct cfs_cpu_partition {
 struct cfs_cpt_table {
 	/* spread rotor for NUMA allocator */
 	unsigned int			ctb_spread_rotor;
+	/* maximum NUMA distance between all nodes in table */
+	unsigned int			ctb_distance;
 	/* # of CPU partitions */
 	unsigned int			ctb_nparts;
 	/* partitions tables */
@@ -120,6 +124,10 @@ struct cfs_cpt_table {
  */
 int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len);
 /**
+ * print distance information of cpt-table
+ */
+int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len);
+/**
  * return total number of CPU partitions in \a cptab
  */
 int
@@ -149,6 +157,10 @@ struct cfs_cpt_table {
  */
 int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
 /**
+ * NUMA distance between \a cpt1 and \a cpt2 in \a cptab
+ */
+unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2);
+/**
  * bind current thread on a CPU-partition \a cpt of \a cptab
  */
 int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
@@ -206,6 +218,19 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 struct cfs_cpt_table;
 #define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
 
+static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
+					 char *buf, int len)
+{
+	int rc;
+
+	rc = snprintf(buf, len, "0\t: 0:1\n");
+	len -= rc;
+	if (len <= 0)
+		return -EFBIG;
+
+	return rc;
+}
+
 static inline cpumask_var_t *
 cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
 {
@@ -241,6 +266,12 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 	return NULL;
 }
 
+static inline unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab,
+					    int cpt1, int cpt2)
+{
+	return 1;
+}
+
 static inline int
 cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 8c5cf7b..b315fb2 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -128,6 +128,15 @@ struct cfs_cpt_table *
 					     GFP_NOFS);
 		if (!part->cpt_nodemask)
 			goto failed_setting_ctb_parts;
+
+		part->cpt_distance = kvmalloc_array(cptab->ctb_nparts,
+						    sizeof(part->cpt_distance[0]),
+						    GFP_KERNEL);
+		if (!part->cpt_distance)
+			goto failed_setting_ctb_parts;
+
+		memset(part->cpt_distance, -1,
+		       cptab->ctb_nparts * sizeof(part->cpt_distance[0]));
 	}
 
 	return cptab;
@@ -138,6 +147,7 @@ struct cfs_cpt_table *
 
 		kfree(part->cpt_nodemask);
 		free_cpumask_var(part->cpt_cpumask);
+		kvfree(part->cpt_distance);
 	}
 
 	kvfree(cptab->ctb_parts);
@@ -168,6 +178,7 @@ struct cfs_cpt_table *
 
 		kfree(part->cpt_nodemask);
 		free_cpumask_var(part->cpt_cpumask);
+		kvfree(part->cpt_distance);
 	}
 
 	kvfree(cptab->ctb_parts);
@@ -222,6 +233,44 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_table_print);
 
+int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
+{
+	char *tmp = buf;
+	int rc;
+	int i;
+	int j;
+
+	for (i = 0; i < cptab->ctb_nparts; i++) {
+		if (len <= 0)
+			goto err;
+
+		rc = snprintf(tmp, len, "%d\t:", i);
+		len -= rc;
+
+		if (len <= 0)
+			goto err;
+
+		tmp += rc;
+		for (j = 0; j < cptab->ctb_nparts; j++) {
+			rc = snprintf(tmp, len, " %d:%d", j,
+				      cptab->ctb_parts[i].cpt_distance[j]);
+			len -= rc;
+			if (len <= 0)
+				goto err;
+			tmp += rc;
+		}
+
+		*tmp = '\n';
+		tmp++;
+		len--;
+	}
+
+	return tmp - buf;
+err:
+	return -E2BIG;
+}
+EXPORT_SYMBOL(cfs_cpt_distance_print);
+
 int
 cfs_cpt_number(struct cfs_cpt_table *cptab)
 {
@@ -273,6 +322,18 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_nodemask);
 
+unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
+{
+	LASSERT(cpt1 == CFS_CPT_ANY || (cpt1 >= 0 && cpt1 < cptab->ctb_nparts));
+	LASSERT(cpt2 == CFS_CPT_ANY || (cpt2 >= 0 && cpt2 < cptab->ctb_nparts));
+
+	if (cpt1 == CFS_CPT_ANY || cpt2 == CFS_CPT_ANY)
+		return cptab->ctb_distance;
+
+	return cptab->ctb_parts[cpt1].cpt_distance[cpt2];
+}
+EXPORT_SYMBOL(cfs_cpt_distance);
+
 int
 cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 09/26] staging: lustre: libcfs: use distance in cpu and node handling
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (7 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 10/26] staging: lustre: libcfs: provide debugfs files for distance handling James Simmons
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Take into consideration the location of NUMA nodes and cores when
calling cfs_cpt_[un]set_cpu() and cfs_cpt_[un]set_node(). This
enables proper functioning on platforms with hundreds of cores and
NUMA nodes.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 192 ++++++++++++++++++------
 1 file changed, 143 insertions(+), 49 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index b315fb2..3b4a9dc 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -334,11 +334,134 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 }
 EXPORT_SYMBOL(cfs_cpt_distance);
 
+/*
+ * Calculate the maximum NUMA distance between all nodes in the
+ * from_mask and all nodes in the to_mask.
+ */
+static unsigned int cfs_cpt_distance_calculate(nodemask_t *from_mask,
+					       nodemask_t *to_mask)
+{
+	unsigned int maximum;
+	unsigned int distance;
+	int from;
+	int to;
+
+	maximum = 0;
+	for_each_node_mask(from, *from_mask) {
+		for_each_node_mask(to, *to_mask) {
+			distance = node_distance(from, to);
+			if (maximum < distance)
+				maximum = distance;
+		}
+	}
+	return maximum;
+}
+
+static void cfs_cpt_add_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+{
+	cptab->ctb_cpu2cpt[cpu] = cpt;
+
+	cpumask_set_cpu(cpu, cptab->ctb_cpumask);
+	cpumask_set_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask);
+}
+
+static void cfs_cpt_del_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+{
+	cpumask_clear_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask);
+	cpumask_clear_cpu(cpu, cptab->ctb_cpumask);
+
+	cptab->ctb_cpu2cpt[cpu] = -1;
+}
+
+static void cfs_cpt_add_node(struct cfs_cpt_table *cptab, int cpt, int node)
+{
+	struct cfs_cpu_partition *part;
+
+	if (!node_isset(node, *cptab->ctb_nodemask)) {
+		unsigned int dist;
+
+		/* first time node is added to the CPT table */
+		node_set(node, *cptab->ctb_nodemask);
+		cptab->ctb_node2cpt[node] = cpt;
+
+		dist = cfs_cpt_distance_calculate(cptab->ctb_nodemask,
+						  cptab->ctb_nodemask);
+		cptab->ctb_distance = dist;
+	}
+
+	part = &cptab->ctb_parts[cpt];
+	if (!node_isset(node, *part->cpt_nodemask)) {
+		int cpt2;
+
+		/* first time node is added to this CPT */
+		node_set(node, *part->cpt_nodemask);
+		for (cpt2 = 0; cpt2 < cptab->ctb_nparts; cpt2++) {
+			struct cfs_cpu_partition *part2;
+			unsigned int dist;
+
+			part2 = &cptab->ctb_parts[cpt2];
+			dist = cfs_cpt_distance_calculate(part->cpt_nodemask,
+							  part2->cpt_nodemask);
+			part->cpt_distance[cpt2] = dist;
+			dist = cfs_cpt_distance_calculate(part2->cpt_nodemask,
+							  part->cpt_nodemask);
+			part2->cpt_distance[cpt] = dist;
+		}
+	}
+}
+
+static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
+{
+	struct cfs_cpu_partition *part = &cptab->ctb_parts[cpt];
+	int cpu;
+
+	for_each_cpu(cpu, part->cpt_cpumask) {
+		/* this CPT has other CPU belonging to this node? */
+		if (cpu_to_node(cpu) == node)
+			break;
+	}
+
+	if (cpu >= nr_cpu_ids && node_isset(node,  *part->cpt_nodemask)) {
+		int cpt2;
+
+		/* No more CPUs in the node for this CPT. */
+		node_clear(node, *part->cpt_nodemask);
+		for (cpt2 = 0; cpt2 < cptab->ctb_nparts; cpt2++) {
+			struct cfs_cpu_partition *part2;
+			unsigned int dist;
+
+			part2 = &cptab->ctb_parts[cpt2];
+			if (node_isset(node, *part2->cpt_nodemask))
+				cptab->ctb_node2cpt[node] = cpt2;
+
+			dist = cfs_cpt_distance_calculate(part->cpt_nodemask,
+							  part2->cpt_nodemask);
+			part->cpt_distance[cpt2] = dist;
+			dist = cfs_cpt_distance_calculate(part2->cpt_nodemask,
+							  part->cpt_nodemask);
+			part2->cpt_distance[cpt] = dist;
+		}
+	}
+
+	for_each_cpu(cpu, cptab->ctb_cpumask) {
+		/* this CPT-table has other CPUs belonging to this node? */
+		if (cpu_to_node(cpu) == node)
+			break;
+	}
+
+	if (cpu >= nr_cpu_ids && node_isset(node, *cptab->ctb_nodemask)) {
+		/* No more CPUs in the table for this node. */
+		node_clear(node, *cptab->ctb_nodemask);
+		cptab->ctb_node2cpt[node] = -1;
+		cptab->ctb_distance =
+			cfs_cpt_distance_calculate(cptab->ctb_nodemask,
+						   cptab->ctb_nodemask);
+	}
+}
+
 int
 cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
-	int node;
-
 	LASSERT(cpt >= 0 && cpt < cptab->ctb_nparts);
 
 	if (cpu < 0 || cpu >= nr_cpu_ids || !cpu_online(cpu)) {
@@ -352,23 +475,11 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 		return 0;
 	}
 
-	cptab->ctb_cpu2cpt[cpu] = cpt;
-
 	LASSERT(!cpumask_test_cpu(cpu, cptab->ctb_cpumask));
 	LASSERT(!cpumask_test_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask));
 
-	cpumask_set_cpu(cpu, cptab->ctb_cpumask);
-	cpumask_set_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask);
-
-	node = cpu_to_node(cpu);
-
-	/* first CPU of @node in this CPT table */
-	if (!node_isset(node, *cptab->ctb_nodemask))
-		node_set(node, *cptab->ctb_nodemask);
-
-	/* first CPU of @node in this partition */
-	if (!node_isset(node, *cptab->ctb_parts[cpt].cpt_nodemask))
-		node_set(node, *cptab->ctb_parts[cpt].cpt_nodemask);
+	cfs_cpt_add_cpu(cptab, cpt, cpu);
+	cfs_cpt_add_node(cptab, cpt, cpu_to_node(cpu));
 
 	return 1;
 }
@@ -377,9 +488,6 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 void
 cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
-	int node;
-	int i;
-
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
 	if (cpu < 0 || cpu >= nr_cpu_ids) {
@@ -405,32 +513,8 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 	LASSERT(cpumask_test_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask));
 	LASSERT(cpumask_test_cpu(cpu, cptab->ctb_cpumask));
 
-	cpumask_clear_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask);
-	cpumask_clear_cpu(cpu, cptab->ctb_cpumask);
-	cptab->ctb_cpu2cpt[cpu] = -1;
-
-	node = cpu_to_node(cpu);
-
-	LASSERT(node_isset(node, *cptab->ctb_parts[cpt].cpt_nodemask));
-	LASSERT(node_isset(node, *cptab->ctb_nodemask));
-
-	for_each_cpu(i, cptab->ctb_parts[cpt].cpt_cpumask) {
-		/* this CPT has other CPU belonging to this node? */
-		if (cpu_to_node(i) == node)
-			break;
-	}
-
-	if (i >= nr_cpu_ids)
-		node_clear(node, *cptab->ctb_parts[cpt].cpt_nodemask);
-
-	for_each_cpu(i, cptab->ctb_cpumask) {
-		/* this CPT-table has other CPU belonging to this node? */
-		if (cpu_to_node(i) == node)
-			break;
-	}
-
-	if (i >= nr_cpu_ids)
-		node_clear(node, *cptab->ctb_nodemask);
+	cfs_cpt_del_cpu(cptab, cpt, cpu);
+	cfs_cpt_del_node(cptab, cpt, cpu_to_node(cpu));
 }
 EXPORT_SYMBOL(cfs_cpt_unset_cpu);
 
@@ -448,8 +532,8 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 	}
 
 	for_each_cpu(cpu, mask) {
-		if (!cfs_cpt_set_cpu(cptab, cpt, cpu))
-			return 0;
+		cfs_cpt_add_cpu(cptab, cpt, cpu);
+		cfs_cpt_add_node(cptab, cpt, cpu_to_node(cpu));
 	}
 
 	return 1;
@@ -471,6 +555,7 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
 	const cpumask_t *mask;
+	int cpu;
 
 	if (node < 0 || node >= nr_node_ids) {
 		CDEBUG(D_INFO,
@@ -480,7 +565,12 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 
 	mask = cpumask_of_node(node);
 
-	return cfs_cpt_set_cpumask(cptab, cpt, mask);
+	for_each_cpu(cpu, mask)
+		cfs_cpt_add_cpu(cptab, cpt, cpu);
+
+	cfs_cpt_add_node(cptab, cpt, node);
+
+	return 1;
 }
 EXPORT_SYMBOL(cfs_cpt_set_node);
 
@@ -488,6 +578,7 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
 	const cpumask_t *mask;
+	int cpu;
 
 	if (node < 0 || node >= nr_node_ids) {
 		CDEBUG(D_INFO,
@@ -497,7 +588,10 @@ unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
 
 	mask = cpumask_of_node(node);
 
-	cfs_cpt_unset_cpumask(cptab, cpt, mask);
+	for_each_cpu(cpu, mask)
+		cfs_cpt_del_cpu(cptab, cpt, cpu);
+
+	cfs_cpt_del_node(cptab, cpt, node);
 }
 EXPORT_SYMBOL(cfs_cpt_unset_node);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 10/26] staging: lustre: libcfs: provide debugfs files for distance handling
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (8 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 09/26] staging: lustre: libcfs: use distance in cpu and node handling James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 11/26] staging: lustre: libcfs: invert error handling for cfs_cpt_table_print James Simmons
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

On systems with a large number of NUMA nodes and cores it is easy
to incorrectly configure their use with Lustre. Provide debugfs
files which can help track down any issues.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/module.c | 48 +++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/drivers/staging/lustre/lnet/libcfs/module.c b/drivers/staging/lustre/lnet/libcfs/module.c
index 02c404c..2281f08 100644
--- a/drivers/staging/lustre/lnet/libcfs/module.c
+++ b/drivers/staging/lustre/lnet/libcfs/module.c
@@ -425,6 +425,48 @@ static int proc_cpt_table(struct ctl_table *table, int write,
 	return rc;
 }
 
+static int proc_cpt_distance(struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	size_t nob = *lenp;
+	loff_t pos = *ppos;
+	char *buf = NULL;
+	int len = 4096;
+	int rc = 0;
+
+	if (write)
+		return -EPERM;
+
+	LASSERT(cfs_cpt_tab);
+
+	while (1) {
+		buf = kzalloc(len, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+
+		rc = cfs_cpt_distance_print(cfs_cpt_tab, buf, len);
+		if (rc >= 0)
+			break;
+
+		if (rc == -EFBIG) {
+			kfree(buf);
+			len <<= 1;
+			continue;
+		}
+		goto out;
+	}
+
+	if (pos >= rc) {
+		rc = 0;
+		goto out;
+	}
+
+	rc = cfs_trace_copyout_string(buffer, nob, buf + pos, NULL);
+out:
+	kfree(buf);
+	return rc;
+}
+
 static struct ctl_table lnet_table[] = {
 	{
 		.procname = "debug",
@@ -454,6 +496,12 @@ static int proc_cpt_table(struct ctl_table *table, int write,
 		.proc_handler = &proc_cpt_table,
 	},
 	{
+		.procname = "cpu_partition_distance",
+		.maxlen	  = 128,
+		.mode	  = 0444,
+		.proc_handler = &proc_cpt_distance,
+	},
+	{
 		.procname = "debug_log_upcall",
 		.data     = lnet_debug_log_upcall,
 		.maxlen   = sizeof(lnet_debug_log_upcall),
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 11/26] staging: lustre: libcfs: invert error handling for cfs_cpt_table_print
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (9 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 10/26] staging: lustre: libcfs: provide debugfs files for distance handling James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 12/26] staging: lustre: libcfs: fix libcfs_cpu coding style James Simmons
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <amir.shehata@intel.com>

Instead of setting rc to -EFBIG for several cases in the loop, let's
just jump to an error label on failure, which returns -E2BIG directly.

Signed-off-by: Amir Shehata <amir.shehata@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 29 ++++++++++---------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 3b4a9dc..cdfa77b 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -194,29 +194,26 @@ struct cfs_cpt_table *
 cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len)
 {
 	char *tmp = buf;
-	int rc = 0;
+	int rc;
 	int i;
 	int j;
 
 	for (i = 0; i < cptab->ctb_nparts; i++) {
-		if (len > 0) {
-			rc = snprintf(tmp, len, "%d\t:", i);
-			len -= rc;
-		}
+		if (len <= 0)
+			goto err;
 
-		if (len <= 0) {
-			rc = -EFBIG;
-			goto out;
-		}
+		rc = snprintf(tmp, len, "%d\t:", i);
+		len -= rc;
+
+		if (len <= 0)
+			goto err;
 
 		tmp += rc;
 		for_each_cpu(j, cptab->ctb_parts[i].cpt_cpumask) {
 			rc = snprintf(tmp, len, "%d ", j);
 			len -= rc;
-			if (len <= 0) {
-				rc = -EFBIG;
-				goto out;
-			}
+			if (len <= 0)
+				goto err;
 			tmp += rc;
 		}
 
@@ -225,11 +222,9 @@ struct cfs_cpt_table *
 		len--;
 	}
 
- out:
-	if (rc < 0)
-		return rc;
-
 	return tmp - buf;
+err:
+	return -E2BIG;
 }
 EXPORT_SYMBOL(cfs_cpt_table_print);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH v3 12/26] staging: lustre: libcfs: fix libcfs_cpu coding style
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (10 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 11/26] staging: lustre: libcfs: invert error handling for cfs_cpt_table_print James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification James Simmons
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

This patch brings the Lustre CPT code into alignment with the
Linux kernel coding style.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23304
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h       | 81 ++++++++-----------
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 92 ++++++++--------------
 2 files changed, 69 insertions(+), 104 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index a015ac1..9dbb0b1 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -130,8 +130,7 @@ struct cfs_cpt_table {
 /**
  * return total number of CPU partitions in \a cptab
  */
-int
-cfs_cpt_number(struct cfs_cpt_table *cptab);
+int cfs_cpt_number(struct cfs_cpt_table *cptab);
 /**
  * return number of HW cores or hyper-threadings in a CPU partition \a cpt
  */
@@ -193,25 +192,24 @@ void cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab,
  * remove all cpus in NUMA node \a node from CPU partition \a cpt
  */
 void cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node);
-
 /**
  * add all cpus in node mask \a mask to CPU partition \a cpt
  * return 1 if successfully set all CPUs, otherwise return 0
  */
 int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab,
-			 int cpt, nodemask_t *mask);
+			 int cpt, const nodemask_t *mask);
 /**
  * remove all cpus in node mask \a mask from CPU partition \a cpt
  */
 void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
-			    int cpt, nodemask_t *mask);
+			    int cpt, const nodemask_t *mask);
 /**
  * convert partition id \a cpt to numa node id, if there are more than one
  * nodes in this partition, it might return a different node id each time.
  */
 int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt);
 
-int  cfs_cpu_init(void);
+int cfs_cpu_init(void);
 void cfs_cpu_fini(void);
 
 #else /* !CONFIG_SMP */
@@ -231,37 +229,35 @@ static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
 	return rc;
 }
 
-static inline cpumask_var_t *
-cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
+static inline cpumask_var_t *cfs_cpt_cpumask(struct cfs_cpt_table *cptab,
+					     int cpt)
 {
 	return NULL;
 }
 
-static inline int
-cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len)
+static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf,
+				      int len)
 {
 	return 0;
 }
-static inline int
-cfs_cpt_number(struct cfs_cpt_table *cptab)
+
+static inline int cfs_cpt_number(struct cfs_cpt_table *cptab)
 {
 	return 1;
 }
 
-static inline int
-cfs_cpt_weight(struct cfs_cpt_table *cptab, int cpt)
+static inline int cfs_cpt_weight(struct cfs_cpt_table *cptab, int cpt)
 {
 	return 1;
 }
 
-static inline int
-cfs_cpt_online(struct cfs_cpt_table *cptab, int cpt)
+static inline int cfs_cpt_online(struct cfs_cpt_table *cptab, int cpt)
 {
 	return 1;
 }
 
-static inline nodemask_t *
-cfs_cpt_nodemask(struct cfs_cpt_table *cptab, int cpt)
+static inline nodemask_t *cfs_cpt_nodemask(struct cfs_cpt_table *cptab,
+					   int cpt)
 {
 	return NULL;
 }
@@ -272,66 +268,61 @@ static inline unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab,
 	return 1;
 }
 
-static inline int
-cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+static inline int cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt,
+				  int cpu)
 {
 	return 1;
 }
 
-static inline void
-cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+static inline void cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt,
+				     int cpu)
 {
 }
 
-static inline int
-cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
-		    const cpumask_t *mask)
+static inline int cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
+				      const cpumask_t *mask)
 {
 	return 1;
 }
 
-static inline void
-cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
-		      const cpumask_t *mask)
+static inline void cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
+					 const cpumask_t *mask)
 {
 }
 
-static inline int
-cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt, int node)
+static inline int cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt,
+				   int node)
 {
 	return 1;
 }
 
-static inline void
-cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
+static inline void cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt,
+				      int node)
 {
 }
 
-static inline int
-cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt, nodemask_t *mask)
+static inline int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt,
+				       const nodemask_t *mask)
 {
 	return 1;
 }
 
-static inline void
-cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab, int cpt, nodemask_t *mask)
+static inline void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
+					  int cpt, const nodemask_t *mask)
 {
 }
 
-static inline int
-cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
+static inline int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 {
 	return 0;
 }
 
-static inline int
-cfs_cpt_current(struct cfs_cpt_table *cptab, int remap)
+static inline int cfs_cpt_current(struct cfs_cpt_table *cptab, int remap)
 {
 	return 0;
 }
 
-static inline int
-cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu)
+static inline int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu)
 {
 	return 0;
 }
@@ -341,14 +332,12 @@ static inline int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 	return 0;
 }
 
-static inline int
-cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
+static inline int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 {
 	return 0;
 }
 
-static inline int
-cfs_cpu_init(void)
+static inline int cfs_cpu_init(void)
 {
 	return 0;
 }
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index cdfa77b..aaab7cb 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -73,8 +73,7 @@
 module_param(cpu_pattern, charp, 0444);
 MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
 
-struct cfs_cpt_table *
-cfs_cpt_table_alloc(unsigned int ncpt)
+struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt)
 {
 	struct cfs_cpt_table *cptab;
 	int i;
@@ -165,8 +164,7 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_table_alloc);
 
-void
-cfs_cpt_table_free(struct cfs_cpt_table *cptab)
+void cfs_cpt_table_free(struct cfs_cpt_table *cptab)
 {
 	int i;
 
@@ -190,8 +188,7 @@ struct cfs_cpt_table *
 }
 EXPORT_SYMBOL(cfs_cpt_table_free);
 
-int
-cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len)
+int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len)
 {
 	char *tmp = buf;
 	int rc;
@@ -266,15 +263,13 @@ int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
 }
 EXPORT_SYMBOL(cfs_cpt_distance_print);
 
-int
-cfs_cpt_number(struct cfs_cpt_table *cptab)
+int cfs_cpt_number(struct cfs_cpt_table *cptab)
 {
 	return cptab->ctb_nparts;
 }
 EXPORT_SYMBOL(cfs_cpt_number);
 
-int
-cfs_cpt_weight(struct cfs_cpt_table *cptab, int cpt)
+int cfs_cpt_weight(struct cfs_cpt_table *cptab, int cpt)
 {
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -284,8 +279,7 @@ int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
 }
 EXPORT_SYMBOL(cfs_cpt_weight);
 
-int
-cfs_cpt_online(struct cfs_cpt_table *cptab, int cpt)
+int cfs_cpt_online(struct cfs_cpt_table *cptab, int cpt)
 {
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -297,8 +291,7 @@ int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
 }
 EXPORT_SYMBOL(cfs_cpt_online);
 
-cpumask_var_t *
-cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
+cpumask_var_t *cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
 {
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -307,8 +300,7 @@ int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
 }
 EXPORT_SYMBOL(cfs_cpt_cpumask);
 
-nodemask_t *
-cfs_cpt_nodemask(struct cfs_cpt_table *cptab, int cpt)
+nodemask_t *cfs_cpt_nodemask(struct cfs_cpt_table *cptab, int cpt)
 {
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -454,8 +446,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 	}
 }
 
-int
-cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+int cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
 	LASSERT(cpt >= 0 && cpt < cptab->ctb_nparts);
 
@@ -480,8 +471,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_set_cpu);
 
-void
-cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
+void cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 {
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -513,9 +503,8 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_unset_cpu);
 
-int
-cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
-		    const cpumask_t *mask)
+int cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
+			const cpumask_t *mask)
 {
 	int cpu;
 
@@ -535,9 +524,8 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_set_cpumask);
 
-void
-cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
-		      const cpumask_t *mask)
+void cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
+			   const cpumask_t *mask)
 {
 	int cpu;
 
@@ -546,8 +534,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_unset_cpumask);
 
-int
-cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt, int node)
+int cfs_cpt_set_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
 	const cpumask_t *mask;
 	int cpu;
@@ -569,8 +556,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_set_node);
 
-void
-cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
+void cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
 {
 	const cpumask_t *mask;
 	int cpu;
@@ -590,8 +576,8 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_unset_node);
 
-int
-cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt, nodemask_t *mask)
+int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt,
+			 const nodemask_t *mask)
 {
 	int i;
 
@@ -604,8 +590,8 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_set_nodemask);
 
-void
-cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab, int cpt, nodemask_t *mask)
+void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab, int cpt,
+			    const nodemask_t *mask)
 {
 	int i;
 
@@ -614,8 +600,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_unset_nodemask);
 
-int
-cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
+int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 {
 	nodemask_t *mask;
 	int weight;
@@ -647,8 +632,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_spread_node);
 
-int
-cfs_cpt_current(struct cfs_cpt_table *cptab, int remap)
+int cfs_cpt_current(struct cfs_cpt_table *cptab, int remap)
 {
 	int cpu;
 	int cpt;
@@ -668,8 +652,7 @@ static void cfs_cpt_del_node(struct cfs_cpt_table *cptab, int cpt, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_current);
 
-int
-cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu)
+int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu)
 {
 	LASSERT(cpu >= 0 && cpu < nr_cpu_ids);
 
@@ -686,8 +669,7 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 }
 EXPORT_SYMBOL(cfs_cpt_of_node);
 
-int
-cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
+int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 {
 	cpumask_var_t *cpumask;
 	nodemask_t *nodemask;
@@ -731,9 +713,8 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
  * Choose max to \a number CPUs from \a node and set them in \a cpt.
  * We always prefer to choose CPU in the same core/socket.
  */
-static int
-cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
-		     cpumask_t *node, int number)
+static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
+				cpumask_t *node, int number)
 {
 	cpumask_var_t socket;
 	cpumask_var_t core;
@@ -809,8 +790,7 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 
 #define CPT_WEIGHT_MIN  4u
 
-static unsigned int
-cfs_cpt_num_estimate(void)
+static unsigned int cfs_cpt_num_estimate(void)
 {
 	unsigned int nnode = num_online_nodes();
 	unsigned int ncpu = num_online_cpus();
@@ -852,8 +832,7 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 	return ncpt;
 }
 
-static struct cfs_cpt_table *
-cfs_cpt_table_create(int ncpt)
+static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 {
 	struct cfs_cpt_table *cptab = NULL;
 	cpumask_var_t mask;
@@ -936,9 +915,9 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 
 	return cptab;
 
- failed_mask:
+failed_mask:
 	free_cpumask_var(mask);
- failed:
+failed:
 	CERROR("Failed to setup CPU-partition-table with %d CPU-partitions, online HW nodes: %d, HW cpus: %d.\n",
 	       ncpt, num_online_nodes(), num_online_cpus());
 
@@ -948,8 +927,7 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 	return NULL;
 }
 
-static struct cfs_cpt_table *
-cfs_cpt_table_create_pattern(char *pattern)
+static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 {
 	struct cfs_cpt_table *cptab;
 	char *str;
@@ -1093,7 +1071,7 @@ int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
 
 	return cptab;
 
- failed:
+failed:
 	cfs_cpt_table_free(cptab);
 	return NULL;
 }
@@ -1120,8 +1098,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 }
 #endif
 
-void
-cfs_cpu_fini(void)
+void cfs_cpu_fini(void)
 {
 	if (cfs_cpt_tab)
 		cfs_cpt_table_free(cfs_cpt_tab);
@@ -1133,8 +1110,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 #endif
 }
 
-int
-cfs_cpu_init(void)
+int cfs_cpu_init(void)
 {
 	int ret;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification.
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (11 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 12/26] staging: lustre: libcfs: fix libcfs_cpu coding style James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  0:57   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 14/26] staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask James Simmons
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Use the int type for CPT identification to match the Linux kernel
CPU identification.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23304
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h |  8 ++++----
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 14 +++++++-------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 9dbb0b1..2bb2140 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -89,18 +89,18 @@ struct cfs_cpu_partition {
 	/* NUMA distance between CPTs */
 	unsigned int			*cpt_distance;
 	/* spread rotor for NUMA allocator */
-	unsigned int			cpt_spread_rotor;
+	int				cpt_spread_rotor;
 };
 
 
 /** descriptor for CPU partitions */
 struct cfs_cpt_table {
 	/* spread rotor for NUMA allocator */
-	unsigned int			ctb_spread_rotor;
+	int				ctb_spread_rotor;
 	/* maximum NUMA distance between all nodes in table */
 	unsigned int			ctb_distance;
 	/* # of CPU partitions */
-	unsigned int			ctb_nparts;
+	int				ctb_nparts;
 	/* partitions tables */
 	struct cfs_cpu_partition	*ctb_parts;
 	/* shadow HW CPU to CPU partition ID */
@@ -355,7 +355,7 @@ static inline void cfs_cpu_fini(void)
 /**
  * create a cfs_cpt_table with \a ncpt number of partitions
  */
-struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt);
+struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt);
 
 /*
  * allocate per-cpu-partition data, returned value is an array of pointers,
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index aaab7cb..8f7de59 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -73,7 +73,7 @@
 module_param(cpu_pattern, charp, 0444);
 MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
 
-struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt)
+struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
 {
 	struct cfs_cpt_table *cptab;
 	int i;
@@ -788,13 +788,13 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 	return rc;
 }
 
-#define CPT_WEIGHT_MIN  4u
+#define CPT_WEIGHT_MIN 4
 
-static unsigned int cfs_cpt_num_estimate(void)
+static int cfs_cpt_num_estimate(void)
 {
-	unsigned int nnode = num_online_nodes();
-	unsigned int ncpu = num_online_cpus();
-	unsigned int ncpt;
+	int nnode = num_online_nodes();
+	int ncpu = num_online_cpus();
+	int ncpt;
 
 	if (ncpu <= CPT_WEIGHT_MIN) {
 		ncpt = 1;
@@ -824,7 +824,7 @@ static unsigned int cfs_cpt_num_estimate(void)
 	/* config many CPU partitions on 32-bit system could consume
 	 * too much memory
 	 */
-	ncpt = min(2U, ncpt);
+	ncpt = min(2, ncpt);
 #endif
 	while (ncpu % ncpt)
 		ncpt--; /* worst case is 1 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 14/26] staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (12 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 15/26] staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind James Simmons
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Rename variable i to node to make code easier to understand.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23222
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 8f7de59..a2c1068 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -579,10 +579,10 @@ void cfs_cpt_unset_node(struct cfs_cpt_table *cptab, int cpt, int node)
 int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt,
 			 const nodemask_t *mask)
 {
-	int i;
+	int node;
 
-	for_each_node_mask(i, *mask) {
-		if (!cfs_cpt_set_node(cptab, cpt, i))
+	for_each_node_mask(node, *mask) {
+		if (!cfs_cpt_set_node(cptab, cpt, node))
 			return 0;
 	}
 
@@ -593,10 +593,10 @@ int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt,
 void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab, int cpt,
 			    const nodemask_t *mask)
 {
-	int i;
+	int node;
 
-	for_each_node_mask(i, *mask)
-		cfs_cpt_unset_node(cptab, cpt, i);
+	for_each_node_mask(node, *mask)
+		cfs_cpt_unset_node(cptab, cpt, node);
 }
 EXPORT_SYMBOL(cfs_cpt_unset_nodemask);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 15/26] staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (13 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 14/26] staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 16/26] staging: lustre: libcfs: rename cpumask_var_t variables to *_mask James Simmons
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Rename variable i to cpu to make code easier to understand.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23222
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index a2c1068..5654fbe 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -673,8 +673,8 @@ int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 {
 	cpumask_var_t *cpumask;
 	nodemask_t *nodemask;
+	int cpu;
 	int rc;
-	int i;
 
 	LASSERT(cpt == CFS_CPT_ANY || (cpt >= 0 && cpt < cptab->ctb_nparts));
 
@@ -692,8 +692,8 @@ int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 		return -EINVAL;
 	}
 
-	for_each_online_cpu(i) {
-		if (cpumask_test_cpu(i, *cpumask))
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, *cpumask))
 			continue;
 
 		rc = set_cpus_allowed_ptr(current, *cpumask);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 16/26] staging: lustre: libcfs: rename cpumask_var_t variables to *_mask
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (14 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 15/26] staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 17/26] staging: lustre: libcfs: update debug messages James Simmons
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Because we handle both CPU masks and core identifiers, the two can
easily be confused. To avoid this, rename the various cpumask_var_t
variables so that their names end in *_mask.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23222
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 62 ++++++++++++-------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 5654fbe..fcb068a 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -714,23 +714,23 @@ int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
  * We always prefer to choose CPU in the same core/socket.
  */
 static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
-				cpumask_t *node, int number)
+				cpumask_t *node_mask, int number)
 {
-	cpumask_var_t socket;
-	cpumask_var_t core;
+	cpumask_var_t socket_mask;
+	cpumask_var_t core_mask;
 	int rc = 0;
 	int cpu;
 
 	LASSERT(number > 0);
 
-	if (number >= cpumask_weight(node)) {
-		while (!cpumask_empty(node)) {
-			cpu = cpumask_first(node);
+	if (number >= cpumask_weight(node_mask)) {
+		while (!cpumask_empty(node_mask)) {
+			cpu = cpumask_first(node_mask);
 
 			rc = cfs_cpt_set_cpu(cptab, cpt, cpu);
 			if (!rc)
 				return -EINVAL;
-			cpumask_clear_cpu(cpu, node);
+			cpumask_clear_cpu(cpu, node_mask);
 		}
 		return 0;
 	}
@@ -740,34 +740,34 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 	 * As we cannot initialize a cpumask_var_t, we need
 	 * to alloc both before we can risk trying to free either
 	 */
-	if (!zalloc_cpumask_var(&socket, GFP_NOFS))
+	if (!zalloc_cpumask_var(&socket_mask, GFP_NOFS))
 		rc = -ENOMEM;
-	if (!zalloc_cpumask_var(&core, GFP_NOFS))
+	if (!zalloc_cpumask_var(&core_mask, GFP_NOFS))
 		rc = -ENOMEM;
 	if (rc)
 		goto out;
 
-	while (!cpumask_empty(node)) {
-		cpu = cpumask_first(node);
+	while (!cpumask_empty(node_mask)) {
+		cpu = cpumask_first(node_mask);
 
 		/* get cpumask for cores in the same socket */
-		cpumask_copy(socket, topology_core_cpumask(cpu));
-		cpumask_and(socket, socket, node);
+		cpumask_copy(socket_mask, topology_core_cpumask(cpu));
+		cpumask_and(socket_mask, socket_mask, node_mask);
 
-		LASSERT(!cpumask_empty(socket));
+		LASSERT(!cpumask_empty(socket_mask));
 
-		while (!cpumask_empty(socket)) {
+		while (!cpumask_empty(socket_mask)) {
 			int i;
 
 			/* get cpumask for hts in the same core */
-			cpumask_copy(core, topology_sibling_cpumask(cpu));
-			cpumask_and(core, core, node);
+			cpumask_copy(core_mask, topology_sibling_cpumask(cpu));
+			cpumask_and(core_mask, core_mask, node_mask);
 
-			LASSERT(!cpumask_empty(core));
+			LASSERT(!cpumask_empty(core_mask));
 
-			for_each_cpu(i, core) {
-				cpumask_clear_cpu(i, socket);
-				cpumask_clear_cpu(i, node);
+			for_each_cpu(i, core_mask) {
+				cpumask_clear_cpu(i, socket_mask);
+				cpumask_clear_cpu(i, node_mask);
 
 				rc = cfs_cpt_set_cpu(cptab, cpt, i);
 				if (!rc) {
@@ -778,13 +778,13 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 				if (!--number)
 					goto out;
 			}
-			cpu = cpumask_first(socket);
+			cpu = cpumask_first(socket_mask);
 		}
 	}
 
 out:
-	free_cpumask_var(socket);
-	free_cpumask_var(core);
+	free_cpumask_var(socket_mask);
+	free_cpumask_var(core_mask);
 	return rc;
 }
 
@@ -835,7 +835,7 @@ static int cfs_cpt_num_estimate(void)
 static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 {
 	struct cfs_cpt_table *cptab = NULL;
-	cpumask_var_t mask;
+	cpumask_var_t node_mask;
 	int cpt = 0;
 	int num;
 	int rc;
@@ -868,15 +868,15 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 		goto failed;
 	}
 
-	if (!zalloc_cpumask_var(&mask, GFP_NOFS)) {
+	if (!zalloc_cpumask_var(&node_mask, GFP_NOFS)) {
 		CERROR("Failed to allocate scratch cpumask\n");
 		goto failed;
 	}
 
 	for_each_online_node(i) {
-		cpumask_copy(mask, cpumask_of_node(i));
+		cpumask_copy(node_mask, cpumask_of_node(i));
 
-		while (!cpumask_empty(mask)) {
+		while (!cpumask_empty(node_mask)) {
 			struct cfs_cpu_partition *part;
 			int n;
 
@@ -893,7 +893,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 			n = num - cpumask_weight(part->cpt_cpumask);
 			LASSERT(n > 0);
 
-			rc = cfs_cpt_choose_ncpus(cptab, cpt, mask, n);
+			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask, n);
 			if (rc < 0)
 				goto failed_mask;
 
@@ -911,12 +911,12 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 		goto failed_mask;
 	}
 
-	free_cpumask_var(mask);
+	free_cpumask_var(node_mask);
 
 	return cptab;
 
 failed_mask:
-	free_cpumask_var(mask);
+	free_cpumask_var(node_mask);
 failed:
 	CERROR("Failed to setup CPU-partition-table with %d CPU-partitions, online HW nodes: %d, HW cpus: %d.\n",
 	       ncpt, num_online_nodes(), num_online_cpus());
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 17/26] staging: lustre: libcfs: update debug messages
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (15 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 16/26] staging: lustre: libcfs: rename cpumask_var_t variables to *_mask James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 18/26] staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes James Simmons
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

For cfs_cpt_bind() change the CERROR to a CDEBUG. Make the debug
message in cfs_cpt_table_create_pattern() more understandable.
Report the rc value when cfs_cpt_table_create() fails.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23222
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index fcb068a..ebfa4e3 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -484,7 +484,8 @@ void cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 		/* caller doesn't know the partition ID */
 		cpt = cptab->ctb_cpu2cpt[cpu];
 		if (cpt < 0) { /* not set in this CPT-table */
-			CDEBUG(D_INFO, "Try to unset cpu %d which is not in CPT-table %p\n",
+			CDEBUG(D_INFO,
+			       "Try to unset cpu %d which is not in CPT-table %p\n",
 			       cpt, cptab);
 			return;
 		}
@@ -510,7 +511,8 @@ int cfs_cpt_set_cpumask(struct cfs_cpt_table *cptab, int cpt,
 
 	if (!cpumask_weight(mask) ||
 	    cpumask_any_and(mask, cpu_online_mask) >= nr_cpu_ids) {
-		CDEBUG(D_INFO, "No online CPU is found in the CPU mask for CPU partition %d\n",
+		CDEBUG(D_INFO,
+		       "No online CPU is found in the CPU mask for CPU partition %d\n",
 		       cpt);
 		return 0;
 	}
@@ -687,7 +689,8 @@ int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 	}
 
 	if (cpumask_any_and(*cpumask, cpu_online_mask) >= nr_cpu_ids) {
-		CERROR("No online CPU found in CPU partition %d, did someone do CPU hotplug on system? You might need to reload Lustre modules to keep system working well.\n",
+		CDEBUG(D_INFO,
+		       "No online CPU found in CPU partition %d, did someone do CPU hotplug on system? You might need to reload Lustre modules to keep system working well.\n",
 		       cpt);
 		return -EINVAL;
 	}
@@ -918,8 +921,8 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 failed_mask:
 	free_cpumask_var(node_mask);
 failed:
-	CERROR("Failed to setup CPU-partition-table with %d CPU-partitions, online HW nodes: %d, HW cpus: %d.\n",
-	       ncpt, num_online_nodes(), num_online_cpus());
+	CERROR("Failed (rc = %d) to setup CPU partition table with %d partitions, online HW NUMA nodes: %d, HW CPU cores: %d.\n",
+	       rc, ncpt, num_online_nodes(), num_online_cpus());
 
 	if (cptab)
 		cfs_cpt_table_free(cptab);
@@ -1034,7 +1037,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 
 		bracket = strchr(str, ']');
 		if (!bracket) {
-			CERROR("missing right bracket for cpt %d, %s\n",
+			CERROR("Missing right bracket for partition %d, %s\n",
 			       cpt, str);
 			goto failed;
 		}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 18/26] staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (16 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 17/26] staging: lustre: libcfs: update debug messages James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node James Simmons
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Rework the CPU partition code to make it more tolerant of offline
CPUs and empty NUMA nodes.
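
For example, the loop in cfs_cpt_choose_ncpus() that consumes a node's
CPU mask now simply skips CPUs that are not online instead of failing
the whole setup (excerpted and slightly simplified from the hunk below):

	while (!cpumask_empty(node_mask)) {
		cpu = cpumask_first(node_mask);
		cpumask_clear_cpu(cpu, node_mask);

		/* tolerate offline CPUs instead of aborting the setup */
		if (!cpu_online(cpu))
			continue;

		rc = cfs_cpt_set_cpu(cptab, cpt, cpu);
		if (!rc)
			return -EINVAL;
	}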

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23222
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 132 ++++++++++--------------
 1 file changed, 56 insertions(+), 76 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index ebfa4e3..18925c7 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -461,8 +461,16 @@ int cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 		return 0;
 	}
 
-	LASSERT(!cpumask_test_cpu(cpu, cptab->ctb_cpumask));
-	LASSERT(!cpumask_test_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask));
+	if (cpumask_test_cpu(cpu, cptab->ctb_cpumask)) {
+		CDEBUG(D_INFO, "CPU %d is already in cpumask\n", cpu);
+		return 0;
+	}
+
+	if (cpumask_test_cpu(cpu, cptab->ctb_parts[cpt].cpt_cpumask)) {
+		CDEBUG(D_INFO, "CPU %d is already in partition %d cpumask\n",
+		       cpu, cptab->ctb_cpu2cpt[cpu]);
+		return 0;
+	}
 
 	cfs_cpt_add_cpu(cptab, cpt, cpu);
 	cfs_cpt_add_node(cptab, cpt, cpu_to_node(cpu));
@@ -531,8 +539,10 @@ void cfs_cpt_unset_cpumask(struct cfs_cpt_table *cptab, int cpt,
 {
 	int cpu;
 
-	for_each_cpu(cpu, mask)
-		cfs_cpt_unset_cpu(cptab, cpt, cpu);
+	for_each_cpu(cpu, mask) {
+		cfs_cpt_del_cpu(cptab, cpt, cpu);
+		cfs_cpt_del_node(cptab, cpt, cpu_to_node(cpu));
+	}
 }
 EXPORT_SYMBOL(cfs_cpt_unset_cpumask);
 
@@ -583,10 +593,8 @@ int cfs_cpt_set_nodemask(struct cfs_cpt_table *cptab, int cpt,
 {
 	int node;
 
-	for_each_node_mask(node, *mask) {
-		if (!cfs_cpt_set_node(cptab, cpt, node))
-			return 0;
-	}
+	for_each_node_mask(node, *mask)
+		cfs_cpt_set_node(cptab, cpt, node);
 
 	return 1;
 }
@@ -607,7 +615,7 @@ int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 	nodemask_t *mask;
 	int weight;
 	int rotor;
-	int node;
+	int node = 0;
 
 	/* convert CPU partition ID to HW node id */
 
@@ -617,20 +625,20 @@ int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
 	} else {
 		mask = cptab->ctb_parts[cpt].cpt_nodemask;
 		rotor = cptab->ctb_parts[cpt].cpt_spread_rotor++;
+		node  = cptab->ctb_parts[cpt].cpt_node;
 	}
 
 	weight = nodes_weight(*mask);
-	LASSERT(weight > 0);
-
-	rotor %= weight;
+	if (weight > 0) {
+		rotor %= weight;
 
-	for_each_node_mask(node, *mask) {
-		if (!rotor--)
-			return node;
+		for_each_node_mask(node, *mask) {
+			if (!rotor--)
+				return node;
+		}
 	}
 
-	LBUG();
-	return 0;
+	return node;
 }
 EXPORT_SYMBOL(cfs_cpt_spread_node);
 
@@ -723,17 +731,21 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 	cpumask_var_t core_mask;
 	int rc = 0;
 	int cpu;
+	int i;
 
 	LASSERT(number > 0);
 
 	if (number >= cpumask_weight(node_mask)) {
 		while (!cpumask_empty(node_mask)) {
 			cpu = cpumask_first(node_mask);
+			cpumask_clear_cpu(cpu, node_mask);
+
+			if (!cpu_online(cpu))
+				continue;
 
 			rc = cfs_cpt_set_cpu(cptab, cpt, cpu);
 			if (!rc)
 				return -EINVAL;
-			cpumask_clear_cpu(cpu, node_mask);
 		}
 		return 0;
 	}
@@ -754,24 +766,19 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 		cpu = cpumask_first(node_mask);
 
 		/* get cpumask for cores in the same socket */
-		cpumask_copy(socket_mask, topology_core_cpumask(cpu));
-		cpumask_and(socket_mask, socket_mask, node_mask);
-
-		LASSERT(!cpumask_empty(socket_mask));
-
+		cpumask_and(socket_mask, topology_core_cpumask(cpu), node_mask);
 		while (!cpumask_empty(socket_mask)) {
-			int i;
-
 			/* get cpumask for hts in the same core */
-			cpumask_copy(core_mask, topology_sibling_cpumask(cpu));
-			cpumask_and(core_mask, core_mask, node_mask);
-
-			LASSERT(!cpumask_empty(core_mask));
+			cpumask_and(core_mask, topology_sibling_cpumask(cpu),
+				    node_mask);
 
 			for_each_cpu(i, core_mask) {
 				cpumask_clear_cpu(i, socket_mask);
 				cpumask_clear_cpu(i, node_mask);
 
+				if (!cpu_online(i))
+					continue;
+
 				rc = cfs_cpt_set_cpu(cptab, cpt, i);
 				if (!rc) {
 					rc = -EINVAL;
@@ -840,23 +847,18 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 	struct cfs_cpt_table *cptab = NULL;
 	cpumask_var_t node_mask;
 	int cpt = 0;
+	int node;
 	int num;
-	int rc;
-	int i;
+	int rem;
+	int rc = 0;
 
-	rc = cfs_cpt_num_estimate();
+	num = cfs_cpt_num_estimate();
 	if (ncpt <= 0)
-		ncpt = rc;
+		ncpt = num;
 
-	if (ncpt > num_online_cpus() || ncpt > 4 * rc) {
+	if (ncpt > num_online_cpus() || ncpt > 4 * num) {
 		CWARN("CPU partition number %d is larger than suggested value (%d), your system may have performance issue or run out of memory while under pressure\n",
-		      ncpt, rc);
-	}
-
-	if (num_online_cpus() % ncpt) {
-		CERROR("CPU number %d is not multiple of cpu_npartition %d, please try different cpu_npartitions value or set pattern string by cpu_pattern=STRING\n",
-		       (int)num_online_cpus(), ncpt);
-		goto failed;
+		      ncpt, num);
 	}
 
 	cptab = cfs_cpt_table_alloc(ncpt);
@@ -865,55 +867,33 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 		goto failed;
 	}
 
-	num = num_online_cpus() / ncpt;
-	if (!num) {
-		CERROR("CPU changed while setting CPU partition\n");
-		goto failed;
-	}
-
 	if (!zalloc_cpumask_var(&node_mask, GFP_NOFS)) {
 		CERROR("Failed to allocate scratch cpumask\n");
 		goto failed;
 	}
 
-	for_each_online_node(i) {
-		cpumask_copy(node_mask, cpumask_of_node(i));
-
-		while (!cpumask_empty(node_mask)) {
-			struct cfs_cpu_partition *part;
-			int n;
-
-			/*
-			 * Each emulated NUMA node has all allowed CPUs in
-			 * the mask.
-			 * End loop when all partitions have assigned CPUs.
-			 */
-			if (cpt == ncpt)
-				break;
-
-			part = &cptab->ctb_parts[cpt];
+	num = num_online_cpus() / ncpt;
+	rem = num_online_cpus() % ncpt;
+	for_each_online_node(node) {
+		cpumask_copy(node_mask, cpumask_of_node(node));
 
-			n = num - cpumask_weight(part->cpt_cpumask);
-			LASSERT(n > 0);
+		while (cpt < ncpt && !cpumask_empty(node_mask)) {
+			struct cfs_cpu_partition *part = &cptab->ctb_parts[cpt];
+			int ncpu = cpumask_weight(part->cpt_cpumask);
 
-			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask, n);
+			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask,
+						  num - ncpu);
 			if (rc < 0)
 				goto failed_mask;
 
-			LASSERT(num >= cpumask_weight(part->cpt_cpumask));
-			if (num == cpumask_weight(part->cpt_cpumask))
+			ncpu = cpumask_weight(part->cpt_cpumask);
+			if (ncpu == num + !!(rem > 0)) {
 				cpt++;
+				rem--;
+			}
 		}
 	}
 
-	if (cpt != ncpt ||
-	    num != cpumask_weight(cptab->ctb_parts[ncpt - 1].cpt_cpumask)) {
-		CERROR("Expect %d(%d) CPU partitions but got %d(%d), CPU hotplug/unplug while setting?\n",
-		       cptab->ctb_nparts, num, cpt,
-		       cpumask_weight(cptab->ctb_parts[ncpt - 1].cpt_cpumask));
-		goto failed_mask;
-	}
-
 	free_cpumask_var(node_mask);
 
 	return cptab;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (17 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 18/26] staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  1:09   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 20/26] staging: lustre: libcfs: update debug messages in CPT code James Simmons
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Reporting "HW nodes" is too generic. It really is reporting
"HW NUMA nodes". Update the debug message.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23306
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h | 2 ++
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 2 +-
 drivers/staging/lustre/lnet/lnet/lib-msg.c               | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 2bb2140..29c5071 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -90,6 +90,8 @@ struct cfs_cpu_partition {
 	unsigned int			*cpt_distance;
 	/* spread rotor for NUMA allocator */
 	int				cpt_spread_rotor;
+	/* NUMA node if cpt_nodemask is empty */
+	int				cpt_node;
 };
 
 
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 18925c7..86afa31 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -1142,7 +1142,7 @@ int cfs_cpu_init(void)
 
 	put_online_cpus();
 
-	LCONSOLE(0, "HW nodes: %d, HW CPU cores: %d, npartitions: %d\n",
+	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
 		 num_online_nodes(), num_online_cpus(),
 		 cfs_cpt_number(cfs_cpt_tab));
 	return 0;
diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
index 0091273..27bdefa 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
@@ -568,6 +568,8 @@
 
 	/* number of CPUs */
 	container->msc_nfinalizers = cfs_cpt_weight(lnet_cpt_table(), cpt);
+	if (container->msc_nfinalizers == 0)
+		container->msc_nfinalizers = 1;
 
 	container->msc_finalizers = kvzalloc_cpt(container->msc_nfinalizers *
 						 sizeof(*container->msc_finalizers),
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 20/26] staging: lustre: libcfs: update debug messages in CPT code
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (18 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 21/26] staging: lustre: libcfs: rework CPU pattern parsing code James Simmons
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Update the debug messages for the CPT table creation code. Place
the passed-in string in quotes to make it clear what it is.
Capitalize "cpu" in the debug strings.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23306
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 86afa31..c2bad0d 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -500,7 +500,7 @@ void cfs_cpt_unset_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
 
 	} else if (cpt != cptab->ctb_cpu2cpt[cpu]) {
 		CDEBUG(D_INFO,
-		       "CPU %d is not in cpu-partition %d\n", cpu, cpt);
+		       "CPU %d is not in CPU partition %d\n", cpu, cpt);
 		return;
 	}
 
@@ -944,14 +944,14 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 	if (!ncpt ||
 	    (node && ncpt > num_online_nodes()) ||
 	    (!node && ncpt > num_online_cpus())) {
-		CERROR("Invalid pattern %s, or too many partitions %d\n",
+		CERROR("Invalid pattern '%s', or too many partitions %d\n",
 		       pattern, ncpt);
 		return NULL;
 	}
 
 	cptab = cfs_cpt_table_alloc(ncpt);
 	if (!cptab) {
-		CERROR("Failed to allocate cpu partition table\n");
+		CERROR("Failed to allocate CPU partition table\n");
 		return NULL;
 	}
 
@@ -982,11 +982,11 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 
 		if (!bracket) {
 			if (*str) {
-				CERROR("Invalid pattern %s\n", str);
+				CERROR("Invalid pattern '%s'\n", str);
 				goto failed;
 			}
 			if (c != ncpt) {
-				CERROR("expect %d partitions but found %d\n",
+				CERROR("Expect %d partitions but found %d\n",
 				       ncpt, c);
 				goto failed;
 			}
@@ -994,7 +994,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 		}
 
 		if (sscanf(str, "%d%n", &cpt, &n) < 1) {
-			CERROR("Invalid cpu pattern %s\n", str);
+			CERROR("Invalid CPU pattern '%s'\n", str);
 			goto failed;
 		}
 
@@ -1011,20 +1011,20 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 
 		str = strim(str + n);
 		if (str != bracket) {
-			CERROR("Invalid pattern %s\n", str);
+			CERROR("Invalid pattern '%s'\n", str);
 			goto failed;
 		}
 
 		bracket = strchr(str, ']');
 		if (!bracket) {
-			CERROR("Missing right bracket for partition %d, %s\n",
+			CERROR("Missing right bracket for partition %d in '%s'\n",
 			       cpt, str);
 			goto failed;
 		}
 
 		if (cfs_expr_list_parse(str, (bracket - str) + 1,
 					0, high, &el)) {
-			CERROR("Can't parse number range: %s\n", str);
+			CERROR("Can't parse number range in '%s'\n", str);
 			goto failed;
 		}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 21/26] staging: lustre: libcfs: rework CPU pattern parsing code
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (19 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 20/26] staging: lustre: libcfs: update debug messages in CPT code James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 22/26] staging: lustre: libcfs: change CPT estimate algorithm James Simmons
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Currently the CPU pattern module parameter string can be modified,
which is wrong. Rewrite the CPU pattern parsing code so that the
passed-in buffer is no longer changed. This change also enables
proper error propagation to the calling functions.
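
With this change the table-create functions return ERR_PTR() values
instead of NULL, and the parser works on a kstrdup()'d copy of the
pattern so the module parameter itself stays untouched. A simplified
sketch of what the caller in cfs_cpu_init() now does:

	cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern);
	if (IS_ERR(cfs_cpt_tab)) {
		/* e.g. -ENOMEM, -EINVAL or -ERANGE from the parser */
		ret = PTR_ERR(cfs_cpt_tab);
		goto failed_alloc_table;
	}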

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
Signed-off-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/23306
WC-bug-id: https://jira.whamcloud.com/browse/LU-9715
Reviewed-on: https://review.whamcloud.com/27872
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 146 ++++++++++++++----------
 1 file changed, 86 insertions(+), 60 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index c2bad0d..2fdea11 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -696,11 +696,11 @@ int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 		nodemask = cptab->ctb_parts[cpt].cpt_nodemask;
 	}
 
-	if (cpumask_any_and(*cpumask, cpu_online_mask) >= nr_cpu_ids) {
+	if (!cpumask_intersects(*cpumask, cpu_online_mask)) {
 		CDEBUG(D_INFO,
 		       "No online CPU found in CPU partition %d, did someone do CPU hotplug on system? You might need to reload Lustre modules to keep system working well.\n",
 		       cpt);
-		return -EINVAL;
+		return -ENODEV;
 	}
 
 	for_each_online_cpu(cpu) {
@@ -864,11 +864,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 	cptab = cfs_cpt_table_alloc(ncpt);
 	if (!cptab) {
 		CERROR("Failed to allocate CPU map(%d)\n", ncpt);
+		rc = -ENOMEM;
 		goto failed;
 	}
 
 	if (!zalloc_cpumask_var(&node_mask, GFP_NOFS)) {
 		CERROR("Failed to allocate scratch cpumask\n");
+		rc = -ENOMEM;
 		goto failed;
 	}
 
@@ -883,8 +885,10 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 
 			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask,
 						  num - ncpu);
-			if (rc < 0)
+			if (rc < 0) {
+				rc = -EINVAL;
 				goto failed_mask;
+			}
 
 			ncpu = cpumask_weight(part->cpt_cpumask);
 			if (ncpu == num + !!(rem > 0)) {
@@ -907,37 +911,51 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 	if (cptab)
 		cfs_cpt_table_free(cptab);
 
-	return NULL;
+	return ERR_PTR(rc);
 }
 
-static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
+static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 {
 	struct cfs_cpt_table *cptab;
+	char *pattern_dup;
+	char *bracket;
 	char *str;
 	int node = 0;
-	int high;
 	int ncpt = 0;
-	int cpt;
+	int cpt = 0;
+	int high;
 	int rc;
 	int c;
 	int i;
 
-	str = strim(pattern);
+	pattern_dup = kstrdup(pattern, GFP_KERNEL);
+	if (!pattern_dup) {
+		CERROR("Failed to duplicate pattern '%s'\n", pattern);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	str = strim(pattern_dup);
 	if (*str == 'n' || *str == 'N') {
-		pattern = str + 1;
-		if (*pattern != '\0') {
-			node = 1;
-		} else { /* shortcut to create CPT from NUMA & CPU topology */
+		str++; /* skip 'N' char */
+		node = 1; /* NUMA pattern */
+		if (*str == '\0') {
 			node = -1;
-			ncpt = num_online_nodes();
+			for_each_online_node(i) {
+				if (!cpumask_empty(cpumask_of_node(i)))
+					ncpt++;
+			}
+			if (ncpt == 1) { /* single NUMA node */
+				kfree(pattern_dup);
+				return cfs_cpt_table_create(cpu_npartitions);
+			}
 		}
 	}
 
 	if (!ncpt) { /* scanning bracket which is mark of partition */
-		for (str = pattern;; str++, ncpt++) {
-			str = strchr(str, '[');
-			if (!str)
-				break;
+		bracket = str;
+		while ((bracket = strchr(bracket, '['))) {
+			bracket++;
+			ncpt++;
 		}
 	}
 
@@ -945,87 +963,96 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 	    (node && ncpt > num_online_nodes()) ||
 	    (!node && ncpt > num_online_cpus())) {
 		CERROR("Invalid pattern '%s', or too many partitions %d\n",
-		       pattern, ncpt);
-		return NULL;
+		       pattern_dup, ncpt);
+		rc = -EINVAL;
+		goto err_free_str;
 	}
 
 	cptab = cfs_cpt_table_alloc(ncpt);
 	if (!cptab) {
 		CERROR("Failed to allocate CPU partition table\n");
-		return NULL;
+		rc = -ENOMEM;
+		goto err_free_str;
 	}
 
 	if (node < 0) { /* shortcut to create CPT from NUMA & CPU topology */
-		cpt = 0;
-
 		for_each_online_node(i) {
-			if (cpt >= ncpt) {
-				CERROR("CPU changed while setting CPU partition table, %d/%d\n",
-				       cpt, ncpt);
-				goto failed;
-			}
+			if (cpumask_empty(cpumask_of_node(i)))
+				continue;
 
 			rc = cfs_cpt_set_node(cptab, cpt++, i);
-			if (!rc)
-				goto failed;
+			if (!rc) {
+				rc = -EINVAL;
+				goto err_free_table;
+			}
 		}
+		kfree(pattern_dup);
 		return cptab;
 	}
 
 	high = node ? nr_node_ids - 1 : nr_cpu_ids - 1;
 
-	for (str = strim(pattern), c = 0;; c++) {
+	for (str = strim(str), c = 0; /* until break */; c++) {
 		struct cfs_range_expr *range;
 		struct cfs_expr_list *el;
-		char *bracket = strchr(str, '[');
 		int n;
 
+		bracket = strchr(str, '[');
 		if (!bracket) {
 			if (*str) {
 				CERROR("Invalid pattern '%s'\n", str);
-				goto failed;
+				rc = -EINVAL;
+				goto err_free_table;
 			}
 			if (c != ncpt) {
 				CERROR("Expect %d partitions but found %d\n",
 				       ncpt, c);
-				goto failed;
+				rc = -EINVAL;
+				goto err_free_table;
 			}
 			break;
 		}
 
 		if (sscanf(str, "%d%n", &cpt, &n) < 1) {
 			CERROR("Invalid CPU pattern '%s'\n", str);
-			goto failed;
+			rc = -EINVAL;
+			goto err_free_table;
 		}
 
 		if (cpt < 0 || cpt >= ncpt) {
 			CERROR("Invalid partition id %d, total partitions %d\n",
 			       cpt, ncpt);
-			goto failed;
+			rc = -EINVAL;
+			goto err_free_table;
 		}
 
 		if (cfs_cpt_weight(cptab, cpt)) {
 			CERROR("Partition %d has already been set.\n", cpt);
-			goto failed;
+			rc = -EPERM;
+			goto err_free_table;
 		}
 
 		str = strim(str + n);
 		if (str != bracket) {
 			CERROR("Invalid pattern '%s'\n", str);
-			goto failed;
+			rc = -EINVAL;
+			goto err_free_table;
 		}
 
 		bracket = strchr(str, ']');
 		if (!bracket) {
 			CERROR("Missing right bracket for partition %d in '%s'\n",
 			       cpt, str);
-			goto failed;
+			rc = -EINVAL;
+			goto err_free_table;
 		}
 
-		if (cfs_expr_list_parse(str, (bracket - str) + 1,
-					0, high, &el)) {
+		rc = cfs_expr_list_parse(str, (bracket - str) + 1, 0, high,
+					 &el);
+		if (rc) {
 			CERROR("Can't parse number range in '%s'\n", str);
-			goto failed;
+			rc = -ERANGE;
+			goto err_free_table;
 		}
 
 		list_for_each_entry(range, &el->el_exprs, re_link) {
@@ -1037,7 +1064,8 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 					    cfs_cpt_set_cpu(cptab, cpt, i);
 				if (!rc) {
 					cfs_expr_list_free(el);
-					goto failed;
+					rc = -EINVAL;
+					goto err_free_table;
 				}
 			}
 		}
@@ -1046,17 +1074,21 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(char *pattern)
 
 		if (!cfs_cpt_online(cptab, cpt)) {
 			CERROR("No online CPU is found on partition %d\n", cpt);
-			goto failed;
+			rc = -ENODEV;
+			goto err_free_table;
 		}
 
 		str = strim(bracket + 1);
 	}
 
+	kfree(pattern_dup);
 	return cptab;
 
-failed:
+err_free_table:
 	cfs_cpt_table_free(cptab);
-	return NULL;
+err_free_str:
+	kfree(pattern_dup);
+	return ERR_PTR(rc);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1083,7 +1115,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 
 void cfs_cpu_fini(void)
 {
-	if (cfs_cpt_tab)
+	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
 		cfs_cpt_table_free(cfs_cpt_tab);
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1116,26 +1148,20 @@ int cfs_cpu_init(void)
 #endif
 	get_online_cpus();
 	if (*cpu_pattern) {
-		char *cpu_pattern_dup = kstrdup(cpu_pattern, GFP_KERNEL);
-
-		if (!cpu_pattern_dup) {
-			CERROR("Failed to duplicate cpu_pattern\n");
-			goto failed_alloc_table;
-		}
-
-		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern_dup);
-		kfree(cpu_pattern_dup);
-		if (!cfs_cpt_tab) {
-			CERROR("Failed to create cptab from pattern %s\n",
+		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern);
+		if (IS_ERR(cfs_cpt_tab)) {
+			CERROR("Failed to create cptab from pattern '%s'\n",
 			       cpu_pattern);
+			ret = PTR_ERR(cfs_cpt_tab);
 			goto failed_alloc_table;
 		}
 
 	} else {
 		cfs_cpt_tab = cfs_cpt_table_create(cpu_npartitions);
-		if (!cfs_cpt_tab) {
-			CERROR("Failed to create ptable with npartitions %d\n",
+		if (IS_ERR(cfs_cpt_tab)) {
+			CERROR("Failed to create cptab with npartitions %d\n",
 			       cpu_npartitions);
+			ret = PTR_ERR(cfs_cpt_tab);
 			goto failed_alloc_table;
 		}
 	}
@@ -1150,7 +1176,7 @@ int cfs_cpu_init(void)
 failed_alloc_table:
 	put_online_cpus();
 
-	if (cfs_cpt_tab)
+	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
 		cfs_cpt_table_free(cfs_cpt_tab);
 
 	ret = -EINVAL;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 22/26] staging: lustre: libcfs: change CPT estimate algorithm
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (20 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 21/26] staging: lustre: libcfs: rework CPU pattern parsing code James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0 James Simmons
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

The main idea behind having more CPU partitions is based on KNL
experience. When a thread submits I/O for network communication, one
of the threads from the current CPT is used for the network stack.
With high parallelization many threads become involved in network
submission, but with fewer CPU partitions they will wait until a
single thread processes them from the network queue. So the bottleneck
just moves into the network layer when there are few CPU partitions.
My experiments showed that the best performance was achieved when
each I/O thread has one network thread. This condition can be provided
by having 2 real HW cores (not counting hyper-threads) per CPT, which
is exactly what this patch implements.

Change the CPT estimate algorithm from choosing N such that
2 * (N - 1)^2 < NCPUS <= 2 * N^2 to allocating 2 HW cores per CPT.
This is critical for machines whose number of cores is not a power
of two.
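
As a hedged worked example of the new estimate (assuming the KNL below
has 4 SMT siblings per core, which is what nthr evaluates to there):

	int nthr = 4;	/* assumed SMT siblings per core on this KNL */
	int ncpu = 272;	/* online CPUs from the example below */
	int ncpt = 1;

	if (ncpu > 4 /* CPT_WEIGHT_MIN */)
		for (ncpt = 2; ncpu > 2 * nthr * ncpt; ncpt++)
			;	/* stops at ncpt = 34: 272 <= 2 * 4 * 34 */

which matches the 34 partitions shown further down, where the old
power-of-two rule produced 16.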

Current algorithm splits CPTs in KNL:
LNet: HW CPU cores: 272, npartitions: 16
cpu_partition_table=
    0       : 0-4,68-71,136-139,204-207
    1       : 5-9,73-76,141-144,209-212
    2       : 10-14,78-81,146-149,214-217
    3       : 15-17,72,77,83-85,140,145,151-153,208,219-221
    4       : 18-21,82,86-88,150,154-156,213,218,222-224
    5       : 22-26,90-93,158-161,226-229
    6       : 27-31,95-98,163-166,231-234
    7       : 32-35,89,100-103,168-171,236-239
    8       : 36-38,94,99,104-105,157,162,167,172-173,225,230,235,240-241
    9       : 39-43,107-110,175-178,243-246
    10      : 44-48,112-115,180-183,248-251
    11      : 49-51,106,111,117-119,174,179,185-187,242,253-255
    12      : 52-55,116,120-122,184,188-190,247,252,256-258
    13      : 56-60,124-127,192-195,260-263
    14      : 61-65,129-132,197-200,265-268
    15      : 66-67,123,128,133-135,191,196,201-203,259,264,269-271

New algorithm will split CPTs in KNL:
LNet: HW CPU cores: 272, npartitions: 34
cpu_partition_table=
    0       : 0-1,68-69,136-137,204-205
    1       : 2-3,70-71,138-139,206-207
    2       : 4-5,72-73,140-141,208-209
    3       : 6-7,74-75,142-143,210-211
    4       : 8-9,76-77,144-145,212-213
    5       : 10-11,78-79,146-147,214-215
    6       : 12-13,80-81,148-149,216-217
    7       : 14-15,82-83,150-151,218-219
    8       : 16-17,84-85,152-153,220-221
    9       : 18-19,86-87,154-155,222-223
    10      : 20-21,88-89,156-157,224-225
    11      : 22-23,90-91,158-159,226-227
    12      : 24-25,92-93,160-161,228-229
    13      : 26-27,94-95,162-163,230-231
    14      : 28-29,96-97,164-165,232-233
    15      : 30-31,98-99,166-167,234-235
    16      : 32-33,100-101,168-169,236-237
    17      : 34-35,102-103,170-171,238-239
    18      : 36-37,104-105,172-173,240-241
    19      : 38-39,106-107,174-175,242-243
    20      : 40-41,108-109,176-177,244-245
    21      : 42-43,110-111,178-179,246-247
    22      : 44-45,112-113,180-181,248-249
    23      : 46-47,114-115,182-183,250-251
    24      : 48-49,116-117,184-185,252-253
    25      : 50-51,118-119,186-187,254-255
    26      : 52-53,120-121,188-189,256-257
    27      : 54-55,122-123,190-191,258-259
    28      : 56-57,124-125,192-193,260-261
    29      : 58-59,126-127,194-195,262-263
    30      : 60-61,128-129,196-197,264-265
    31      : 62-63,130-131,198-199,266-267
    32      : 64-65,132-133,200-201,268-269
    33      : 66-67,134-135,202-203,270-271

The 'N' pattern (one CPT per NUMA node) is not always good on KNL:
in flat mode it produces a single CPT with all CPUs inside.

In SNC-4 mode:
cpu_partition_table=
    0       : 0-17,68-85,136-153,204-221
    1       : 18-35,86-103,154-171,222-239
    2       : 36-51,104-119,172-187,240-255
    3       : 52-67,120-135,188-203,256-271

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
Reviewed-on: https://review.whamcloud.com/24304
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 30 +++++--------------------
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 2fdea11..3f4a7c7 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -802,34 +802,14 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
 
 static int cfs_cpt_num_estimate(void)
 {
-	int nnode = num_online_nodes();
+	int nthr = cpumask_weight(topology_sibling_cpumask(smp_processor_id()));
 	int ncpu = num_online_cpus();
-	int ncpt;
+	int ncpt = 1;
 
-	if (ncpu <= CPT_WEIGHT_MIN) {
-		ncpt = 1;
-		goto out;
-	}
-
-	/* generate reasonable number of CPU partitions based on total number
-	 * of CPUs, Preferred N should be power2 and match this condition:
-	 * 2 * (N - 1)^2 < NCPUS <= 2 * N^2
-	 */
-	for (ncpt = 2; ncpu > 2 * ncpt * ncpt; ncpt <<= 1)
-		;
-
-	if (ncpt <= nnode) { /* fat numa system */
-		while (nnode > ncpt)
-			nnode >>= 1;
+	if (ncpu > CPT_WEIGHT_MIN)
+		for (ncpt = 2; ncpu > 2 * nthr * ncpt; ncpt++)
+			; /* nothing */
 
-	} else { /* ncpt > nnode */
-		while ((nnode << 1) <= ncpt)
-			nnode <<= 1;
-	}
-
-	ncpt = nnode;
-
-out:
 #if (BITS_PER_LONG == 32)
 	/* config many CPU partitions on 32-bit system could consume
 	 * too much memory
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (21 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 22/26] staging: lustre: libcfs: change CPT estimate algorithm James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  2:38   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP James Simmons
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Fix a crash when CPU 0 is disabled.

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8710
Reviewed-on: https://review.whamcloud.com/23305
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/service.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
index 3fd8c74..8e74a45 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/service.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
@@ -421,7 +421,7 @@ static void ptlrpc_at_timer(struct timer_list *t)
 		 * there are.
 		 */
 		/* weight is # of HTs */
-		if (cpumask_weight(topology_sibling_cpumask(0)) > 1) {
+		if (cpumask_weight(topology_sibling_cpumask(smp_processor_id())) > 1) {
 			/* depress thread factor for hyper-thread */
 			factor = factor - (factor >> 1) + (factor >> 3);
 		}
@@ -2221,15 +2221,16 @@ static int ptlrpc_hr_main(void *arg)
 	struct ptlrpc_hr_thread	*hrt = arg;
 	struct ptlrpc_hr_partition *hrp = hrt->hrt_partition;
 	LIST_HEAD(replies);
-	char threadname[20];
 	int rc;
 
-	snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
-		 hrp->hrp_cpt, hrt->hrt_id);
 	unshare_fs_struct();
 
 	rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
 	if (rc != 0) {
+		char threadname[20];
+
+		snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
+			 hrp->hrp_cpt, hrt->hrt_id);
 		CWARN("Failed to bind %s on CPT %d of CPT table %p: rc = %d\n",
 		      threadname, hrp->hrp_cpt, ptlrpc_hr.hr_cpt_table, rc);
 	}
@@ -2528,7 +2529,7 @@ int ptlrpc_hr_init(void)
 
 	init_waitqueue_head(&ptlrpc_hr.hr_waitq);
 
-	weight = cpumask_weight(topology_sibling_cpumask(0));
+	weight = cpumask_weight(topology_sibling_cpumask(smp_processor_id()));
 
 	cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) {
 		hrp->hrp_cpt = i;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (22 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0 James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  1:27   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure James Simmons
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

With the cleanup of the libcfs SMP handling the function
cfs_cpt_table_print() was turned into an empty function.
This function is called by a debugfs reporting function used for
debugging, which now means that on UMP machines it reports nothing,
breaking behavior previously exposed to users. Restore the
original behavior.
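
As an illustration (a hedged sketch based only on the snprintf format
restored below), reading the CPU partition table through debugfs on a
UMP build should once again produce a single line, roughly:

	0	: 0

i.e. partition 0 containing CPU 0, instead of the empty output that
the stub returning 0 produced.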

WC-bug-id: https://jira.whamcloud.com/browse/LU-9856
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/include/linux/libcfs/libcfs_cpu.h  | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 29c5071..32776d2 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -218,6 +218,19 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 struct cfs_cpt_table;
 #define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
 
+static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab,
+				      char *buf, int len)
+{
+	int rc;
+
+	rc = snprintf(buf, len, "0\t: 0\n");
+	len -= rc;
+	if (len <= 0)
+		return -EFBIG;
+
+	return rc;
+}
+
 static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
 					 char *buf, int len)
 {
@@ -237,12 +250,6 @@ static inline cpumask_var_t *cfs_cpt_cpumask(struct cfs_cpt_table *cptab,
 	return NULL;
 }
 
-static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf,
-				      int len)
-{
-	return 0;
-}
-
 static inline int cfs_cpt_number(struct cfs_cpt_table *cptab)
 {
 	return 1;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (23 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  1:32   ` NeilBrown
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 26/26] staging: lustre: libcfs: restore UMP support James Simmons
  2018-06-25  1:33 ` [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework NeilBrown
  26 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

Only one cfs_cpt_tab exists and it is only created at libcfs module
loading and freed at removal. Instead of dynamically allocating it,
allocate it statically. This will help to re-enable UMP support.
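
A hedged sketch of the resulting split (condensed from the diff below,
not a verbatim copy): the old allocator becomes a thin wrapper, while
the global table is set up in place with no allocation.

	struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
	{
		struct cfs_cpt_table *cptab = kzalloc(sizeof(*cptab), GFP_NOFS);

		if (cptab && cfs_cpt_table_setup(cptab, ncpt)) {
			kfree(cptab);
			cptab = NULL;
		}
		return cptab;
	}

	/* the global table: no kzalloc()/kfree(), just setup/teardown */
	rc = cfs_cpt_table_setup(&cfs_cpt_tab, ncpt);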

WC-bug-id: https://jira.whamcloud.com/browse/LU-9856
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h       |   4 +-
 drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 111 ++++++++++-----------
 drivers/staging/lustre/lnet/libcfs/module.c        |  10 +-
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   4 +-
 drivers/staging/lustre/lnet/selftest/framework.c   |   2 +-
 drivers/staging/lustre/lustre/ptlrpc/client.c      |   4 +-
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     |  10 +-
 drivers/staging/lustre/lustre/ptlrpc/service.c     |   4 +-
 8 files changed, 73 insertions(+), 76 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 32776d2..df7e16b 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -115,7 +115,7 @@ struct cfs_cpt_table {
 	nodemask_t			*ctb_nodemask;
 };
 
-extern struct cfs_cpt_table	*cfs_cpt_tab;
+extern struct cfs_cpt_table	cfs_cpt_tab;
 
 /**
  * return cpumask of CPU partition \a cpt
@@ -215,8 +215,6 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 void cfs_cpu_fini(void);
 
 #else /* !CONFIG_SMP */
-struct cfs_cpt_table;
-#define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
 
 static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab,
 				      char *buf, int len)
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 3f4a7c7..9fd324d 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -41,10 +41,6 @@
 #include <linux/libcfs/libcfs_string.h>
 #include <linux/libcfs/libcfs.h>
 
-/** Global CPU partition table */
-struct cfs_cpt_table   *cfs_cpt_tab __read_mostly;
-EXPORT_SYMBOL(cfs_cpt_tab);
-
 /**
  * modparam for setting number of partitions
  *
@@ -73,15 +69,10 @@
 module_param(cpu_pattern, charp, 0444);
 MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
 
-struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
+static int cfs_cpt_table_setup(struct cfs_cpt_table *cptab, int ncpt)
 {
-	struct cfs_cpt_table *cptab;
 	int i;
 
-	cptab = kzalloc(sizeof(*cptab), GFP_NOFS);
-	if (!cptab)
-		return NULL;
-
 	cptab->ctb_nparts = ncpt;
 
 	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS))
@@ -138,7 +129,7 @@ struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
 		       cptab->ctb_nparts * sizeof(part->cpt_distance[0]));
 	}
 
-	return cptab;
+	return 0;
 
 failed_setting_ctb_parts:
 	while (i-- >= 0) {
@@ -159,8 +150,24 @@ struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
 failed_alloc_nodemask:
 	free_cpumask_var(cptab->ctb_cpumask);
 failed_alloc_cpumask:
-	kfree(cptab);
-	return NULL;
+	return -ENOMEM;
+}
+
+struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
+{
+	struct cfs_cpt_table *cptab;
+	int rc;
+
+	cptab = kzalloc(sizeof(*cptab), GFP_NOFS);
+	if (!cptab)
+		return NULL;
+
+	rc = cfs_cpt_table_setup(cptab, ncpt);
+	if (rc) {
+		kfree(cptab);
+		cptab = NULL;
+	}
+	return cptab;
 }
 EXPORT_SYMBOL(cfs_cpt_table_alloc);
 
@@ -183,8 +190,6 @@ void cfs_cpt_table_free(struct cfs_cpt_table *cptab)
 
 	kfree(cptab->ctb_nodemask);
 	free_cpumask_var(cptab->ctb_cpumask);
-
-	kfree(cptab);
 }
 EXPORT_SYMBOL(cfs_cpt_table_free);
 
@@ -822,9 +827,8 @@ static int cfs_cpt_num_estimate(void)
 	return ncpt;
 }
 
-static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
+static int cfs_cpt_table_create(int ncpt)
 {
-	struct cfs_cpt_table *cptab = NULL;
 	cpumask_var_t node_mask;
 	int cpt = 0;
 	int node;
@@ -841,10 +845,9 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 		      ncpt, num);
 	}
 
-	cptab = cfs_cpt_table_alloc(ncpt);
-	if (!cptab) {
-		CERROR("Failed to allocate CPU map(%d)\n", ncpt);
-		rc = -ENOMEM;
+	rc = cfs_cpt_table_setup(&cfs_cpt_tab, ncpt);
+	if (rc) {
+		CERROR("Failed to setup CPU map(%d)\n", ncpt);
 		goto failed;
 	}
 
@@ -860,10 +863,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 		cpumask_copy(node_mask, cpumask_of_node(node));
 
 		while (cpt < ncpt && !cpumask_empty(node_mask)) {
-			struct cfs_cpu_partition *part = &cptab->ctb_parts[cpt];
-			int ncpu = cpumask_weight(part->cpt_cpumask);
+			struct cfs_cpu_partition *part;
+			int ncpu;
+
+			part = &cfs_cpt_tab.ctb_parts[cpt];
+			ncpu = cpumask_weight(part->cpt_cpumask);
 
-			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask,
+			rc = cfs_cpt_choose_ncpus(&cfs_cpt_tab, cpt, node_mask,
 						  num - ncpu);
 			if (rc < 0) {
 				rc = -EINVAL;
@@ -880,7 +886,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 
 	free_cpumask_var(node_mask);
 
-	return cptab;
+	return 0;
 
 failed_mask:
 	free_cpumask_var(node_mask);
@@ -888,15 +894,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 	CERROR("Failed (rc = %d) to setup CPU partition table with %d partitions, online HW NUMA nodes: %d, HW CPU cores: %d.\n",
 	       rc, ncpt, num_online_nodes(), num_online_cpus());
 
-	if (cptab)
-		cfs_cpt_table_free(cptab);
+	cfs_cpt_table_free(&cfs_cpt_tab);
 
-	return ERR_PTR(rc);
+	return rc;
 }
 
-static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
+static int cfs_cpt_table_create_pattern(const char *pattern)
 {
-	struct cfs_cpt_table *cptab;
 	char *pattern_dup;
 	char *bracket;
 	char *str;
@@ -911,7 +915,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 	pattern_dup = kstrdup(pattern, GFP_KERNEL);
 	if (!pattern_dup) {
 		CERROR("Failed to duplicate pattern '%s'\n", pattern);
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 	}
 
 	str = strim(pattern_dup);
@@ -948,10 +952,9 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 		goto err_free_str;
 	}
 
-	cptab = cfs_cpt_table_alloc(ncpt);
-	if (!cptab) {
-		CERROR("Failed to allocate CPU partition table\n");
-		rc = -ENOMEM;
+	rc = cfs_cpt_table_setup(&cfs_cpt_tab, ncpt);
+	if (rc) {
+		CERROR("Failed to setup CPU partition table\n");
 		goto err_free_str;
 	}
 
@@ -960,14 +963,14 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 			if (cpumask_empty(cpumask_of_node(i)))
 				continue;
 
-			rc = cfs_cpt_set_node(cptab, cpt++, i);
+			rc = cfs_cpt_set_node(&cfs_cpt_tab, cpt++, i);
 			if (!rc) {
 				rc = -EINVAL;
 				goto err_free_table;
 			}
 		}
 		kfree(pattern_dup);
-		return cptab;
+		return 0;
 	}
 
 	high = node ? nr_node_ids - 1 : nr_cpu_ids - 1;
@@ -1006,7 +1009,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 			goto err_free_table;
 		}
 
-		if (cfs_cpt_weight(cptab, cpt)) {
+		if (cfs_cpt_weight(&cfs_cpt_tab, cpt)) {
 			CERROR("Partition %d has already been set.\n", cpt);
 			rc = -EPERM;
 			goto err_free_table;
@@ -1040,8 +1043,8 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 				if ((i - range->re_lo) % range->re_stride)
 					continue;
 
-				rc = node ? cfs_cpt_set_node(cptab, cpt, i) :
-					    cfs_cpt_set_cpu(cptab, cpt, i);
+				rc = node ? cfs_cpt_set_node(&cfs_cpt_tab, cpt, i) :
+					    cfs_cpt_set_cpu(&cfs_cpt_tab, cpt, i);
 				if (!rc) {
 					cfs_expr_list_free(el);
 					rc = -EINVAL;
@@ -1052,7 +1055,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 
 		cfs_expr_list_free(el);
 
-		if (!cfs_cpt_online(cptab, cpt)) {
+		if (!cfs_cpt_online(&cfs_cpt_tab, cpt)) {
 			CERROR("No online CPU is found on partition %d\n", cpt);
 			rc = -ENODEV;
 			goto err_free_table;
@@ -1062,13 +1065,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
 	}
 
 	kfree(pattern_dup);
-	return cptab;
+	return 0;
 
 err_free_table:
-	cfs_cpt_table_free(cptab);
+	cfs_cpt_table_free(&cfs_cpt_tab);
 err_free_str:
 	kfree(pattern_dup);
-	return ERR_PTR(rc);
+	return rc;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1095,8 +1098,7 @@ static int cfs_cpu_dead(unsigned int cpu)
 
 void cfs_cpu_fini(void)
 {
-	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
-		cfs_cpt_table_free(cfs_cpt_tab);
+	cfs_cpt_table_free(&cfs_cpt_tab);
 
 #ifdef CONFIG_HOTPLUG_CPU
 	if (lustre_cpu_online > 0)
@@ -1109,8 +1111,6 @@ int cfs_cpu_init(void)
 {
 	int ret;
 
-	LASSERT(!cfs_cpt_tab);
-
 #ifdef CONFIG_HOTPLUG_CPU
 	ret = cpuhp_setup_state_nocalls(CPUHP_LUSTRE_CFS_DEAD,
 					"staging/lustre/cfe:dead", NULL,
@@ -1128,20 +1128,18 @@ int cfs_cpu_init(void)
 #endif
 	get_online_cpus();
 	if (*cpu_pattern) {
-		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern);
-		if (IS_ERR(cfs_cpt_tab)) {
+		ret = cfs_cpt_table_create_pattern(cpu_pattern);
+		if (ret) {
 			CERROR("Failed to create cptab from pattern '%s'\n",
 			       cpu_pattern);
-			ret = PTR_ERR(cfs_cpt_tab);
 			goto failed_alloc_table;
 		}
 
 	} else {
-		cfs_cpt_tab = cfs_cpt_table_create(cpu_npartitions);
-		if (IS_ERR(cfs_cpt_tab)) {
+		ret = cfs_cpt_table_create(cpu_npartitions);
+		if (ret) {
 			CERROR("Failed to create cptab with npartitions %d\n",
 			       cpu_npartitions);
-			ret = PTR_ERR(cfs_cpt_tab);
 			goto failed_alloc_table;
 		}
 	}
@@ -1150,14 +1148,13 @@ int cfs_cpu_init(void)
 
 	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
 		 num_online_nodes(), num_online_cpus(),
-		 cfs_cpt_number(cfs_cpt_tab));
+		 cfs_cpt_number(&cfs_cpt_tab));
 	return 0;
 
 failed_alloc_table:
 	put_online_cpus();
 
-	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
-		cfs_cpt_table_free(cfs_cpt_tab);
+	cfs_cpt_table_free(&cfs_cpt_tab);
 
 	ret = -EINVAL;
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/drivers/staging/lustre/lnet/libcfs/module.c b/drivers/staging/lustre/lnet/libcfs/module.c
index 2281f08..35c3959 100644
--- a/drivers/staging/lustre/lnet/libcfs/module.c
+++ b/drivers/staging/lustre/lnet/libcfs/module.c
@@ -66,6 +66,10 @@ struct lnet_debugfs_symlink_def {
 
 static struct dentry *lnet_debugfs_root;
 
+/** Global CPU partition table */
+struct cfs_cpt_table cfs_cpt_tab __read_mostly;
+EXPORT_SYMBOL(cfs_cpt_tab);
+
 BLOCKING_NOTIFIER_HEAD(libcfs_ioctl_list);
 EXPORT_SYMBOL(libcfs_ioctl_list);
 
@@ -402,7 +406,7 @@ static int proc_cpt_table(struct ctl_table *table, int write,
 		if (!buf)
 			return -ENOMEM;
 
-		rc = cfs_cpt_table_print(cfs_cpt_tab, buf, len);
+		rc = cfs_cpt_table_print(&cfs_cpt_tab, buf, len);
 		if (rc >= 0)
 			break;
 
@@ -437,14 +441,12 @@ static int proc_cpt_distance(struct ctl_table *table, int write,
 	if (write)
 		return -EPERM;
 
-	LASSERT(cfs_cpt_tab);
-
 	while (1) {
 		buf = kzalloc(len, GFP_KERNEL);
 		if (!buf)
 			return -ENOMEM;
 
-		rc = cfs_cpt_distance_print(cfs_cpt_tab, buf, len);
+		rc = cfs_cpt_distance_print(&cfs_cpt_tab, buf, len);
 		if (rc >= 0)
 			break;
 
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index f9ed697..98a4942 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -1414,8 +1414,8 @@ int lnet_lib_init(void)
 	memset(&the_lnet, 0, sizeof(the_lnet));
 
 	/* refer to global cfs_cpt_tab for now */
-	the_lnet.ln_cpt_table	= cfs_cpt_tab;
-	the_lnet.ln_cpt_number	= cfs_cpt_number(cfs_cpt_tab);
+	the_lnet.ln_cpt_table = &cfs_cpt_tab;
+	the_lnet.ln_cpt_number = cfs_cpt_number(&cfs_cpt_tab);
 
 	LASSERT(the_lnet.ln_cpt_number > 0);
 	if (the_lnet.ln_cpt_number > LNET_CPT_MAX) {
diff --git a/drivers/staging/lustre/lnet/selftest/framework.c b/drivers/staging/lustre/lnet/selftest/framework.c
index 741af10..939b7ec 100644
--- a/drivers/staging/lustre/lnet/selftest/framework.c
+++ b/drivers/staging/lustre/lnet/selftest/framework.c
@@ -588,7 +588,7 @@
 
 	CDEBUG(D_NET, "Reserved %d buffers for test %s\n",
 	       nbuf * (srpc_serv_is_framework(svc) ?
-		       2 : cfs_cpt_number(cfs_cpt_tab)), svc->sv_name);
+		       2 : cfs_cpt_number(&cfs_cpt_tab)), svc->sv_name);
 	return 0;
 }
 
diff --git a/drivers/staging/lustre/lustre/ptlrpc/client.c b/drivers/staging/lustre/lustre/ptlrpc/client.c
index c1b82bf..c569a8b 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/client.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/client.c
@@ -940,9 +940,9 @@ struct ptlrpc_request_set *ptlrpc_prep_set(void)
 	struct ptlrpc_request_set *set;
 	int cpt;
 
-	cpt = cfs_cpt_current(cfs_cpt_tab, 0);
+	cpt = cfs_cpt_current(&cfs_cpt_tab, 0);
 	set = kzalloc_node(sizeof(*set), GFP_NOFS,
-			   cfs_cpt_spread_node(cfs_cpt_tab, cpt));
+			   cfs_cpt_spread_node(&cfs_cpt_tab, cpt));
 	if (!set)
 		return NULL;
 	atomic_set(&set->set_refcount, 1);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
index 5310054..d496521 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
@@ -177,7 +177,7 @@ void ptlrpcd_wake(struct ptlrpc_request *req)
 	if (req && req->rq_send_state != LUSTRE_IMP_FULL)
 		return &ptlrpcd_rcv;
 
-	cpt = cfs_cpt_current(cfs_cpt_tab, 1);
+	cpt = cfs_cpt_current(&cfs_cpt_tab, 1);
 	if (!ptlrpcds_cpt_idx)
 		idx = cpt;
 	else
@@ -389,7 +389,7 @@ static int ptlrpcd(void *arg)
 	int exit = 0;
 
 	unshare_fs_struct();
-	if (cfs_cpt_bind(cfs_cpt_tab, pc->pc_cpt) != 0)
+	if (cfs_cpt_bind(&cfs_cpt_tab, pc->pc_cpt) != 0)
 		CWARN("Failed to bind %s on CPT %d\n", pc->pc_name, pc->pc_cpt);
 
 	/*
@@ -531,7 +531,7 @@ static int ptlrpcd_partners(struct ptlrpcd *pd, int index)
 
 	size = sizeof(struct ptlrpcd_ctl *) * pc->pc_npartners;
 	pc->pc_partners = kzalloc_node(size, GFP_NOFS,
-				       cfs_cpt_spread_node(cfs_cpt_tab,
+				       cfs_cpt_spread_node(&cfs_cpt_tab,
 							   pc->pc_cpt));
 	if (!pc->pc_partners) {
 		pc->pc_npartners = 0;
@@ -677,7 +677,7 @@ static int ptlrpcd_init(void)
 	/*
 	 * Determine the CPTs that ptlrpcd threads will run on.
 	 */
-	cptable = cfs_cpt_tab;
+	cptable = &cfs_cpt_tab;
 	ncpts = cfs_cpt_number(cptable);
 	if (ptlrpcd_cpts) {
 		struct cfs_expr_list *el;
@@ -831,7 +831,7 @@ static int ptlrpcd_init(void)
 
 		size = offsetof(struct ptlrpcd, pd_threads[nthreads]);
 		pd = kzalloc_node(size, GFP_NOFS,
-				  cfs_cpt_spread_node(cfs_cpt_tab, cpt));
+				  cfs_cpt_spread_node(&cfs_cpt_tab, cpt));
 		if (!pd) {
 			rc = -ENOMEM;
 			goto out;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
index 8e74a45..853676f 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/service.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
@@ -565,7 +565,7 @@ struct ptlrpc_service *
 
 	cptable = cconf->cc_cptable;
 	if (!cptable)
-		cptable = cfs_cpt_tab;
+		cptable = &cfs_cpt_tab;
 
 	if (!conf->psc_thr.tc_cpu_affinity) {
 		ncpts = 1;
@@ -2520,7 +2520,7 @@ int ptlrpc_hr_init(void)
 	int weight;
 
 	memset(&ptlrpc_hr, 0, sizeof(ptlrpc_hr));
-	ptlrpc_hr.hr_cpt_table = cfs_cpt_tab;
+	ptlrpc_hr.hr_cpt_table = &cfs_cpt_tab;
 
 	ptlrpc_hr.hr_partitions = cfs_percpt_alloc(ptlrpc_hr.hr_cpt_table,
 						   sizeof(*hrp));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 26/26] staging: lustre: libcfs: restore UMP support
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (24 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure James Simmons
@ 2018-06-24 21:20 ` James Simmons
  2018-06-25  1:33 ` [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework NeilBrown
  26 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-24 21:20 UTC (permalink / raw)
  To: lustre-devel

Pieces needed for UMP platforms were removed, which can currently
crash a node. The first problem is that cfs_cpt_table_alloc() and
cfs_cpt_table_free() were completely missing for UMP platforms.
Since only one valid configuration exists, just report the default
static cfs_cpt_tab. Also, don't return NULL from cfs_cpt_cpumask()
and cfs_cpt_nodemask(); return the real masks instead.
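
A hedged illustration of the failure mode being fixed (generic caller,
not a specific function from this series): before this patch, any UMP
caller that dereferenced the returned pointer would oops.

	cpumask_var_t *mask = cfs_cpt_cpumask(cptab, 0); /* was NULL on UMP */

	cpumask_copy(dest, *mask);	/* dereferences the NULL return -> crash */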

WC-bug-id: https://jira.whamcloud.com/browse/LU-9856
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../lustre/include/linux/libcfs/libcfs_cpu.h       | 41 ++++++++++++++++++++--
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index df7e16b..ff1a24d 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -94,7 +94,6 @@ struct cfs_cpu_partition {
 	int				cpt_node;
 };
 
-
 /** descriptor for CPU partitions */
 struct cfs_cpt_table {
 	/* spread rotor for NUMA allocator */
@@ -114,9 +113,27 @@ struct cfs_cpt_table {
 	/* all nodes in this partition table */
 	nodemask_t			*ctb_nodemask;
 };
+#else /* !CONFIG_SMP */
+
+/** UMP descriptor for CPU partitions */
+struct cfs_cpt_table {
+	cpumask_var_t			ctb_cpumask;
+	nodemask_t			ctb_nodemask;
+};
+
+#endif /* CONFIG_SMP */
 
 extern struct cfs_cpt_table	cfs_cpt_tab;
 
+#ifdef CONFIG_SMP
+/**
+ * create a cfs_cpt_table with \a ncpt number of partitions
+ */
+struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt);
+/**
+ * destroy a CPU partition table
+ */
+void cfs_cpt_table_free(struct cfs_cpt_table *cptab);
 /**
  * return cpumask of CPU partition \a cpt
  */
@@ -216,6 +233,18 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
 
 #else /* !CONFIG_SMP */
 
+static inline struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
+{
+	if (ncpt != 1)
+		return NULL;
+
+	return &cfs_cpt_tab;
+}
+
+static inline void cfs_cpt_table_free(struct cfs_cpt_table *cptab)
+{
+}
+
 static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab,
 				      char *buf, int len)
 {
@@ -245,7 +274,7 @@ static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
 static inline cpumask_var_t *cfs_cpt_cpumask(struct cfs_cpt_table *cptab,
 					     int cpt)
 {
-	return NULL;
+	return &cptab->ctb_cpumask;
 }
 
 static inline int cfs_cpt_number(struct cfs_cpt_table *cptab)
@@ -266,7 +295,7 @@ static inline int cfs_cpt_online(struct cfs_cpt_table *cptab, int cpt)
 static inline nodemask_t *cfs_cpt_nodemask(struct cfs_cpt_table *cptab,
 					   int cpt)
 {
-	return NULL;
+	return &cptab->ctb_nodemask;
 }
 
 static inline unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab,
@@ -346,11 +375,17 @@ static inline int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
 
 static inline int cfs_cpu_init(void)
 {
+	if (!zalloc_cpumask_var(&cfs_cpt_tab.ctb_cpumask, GFP_NOFS))
+		return -ENOMEM;
+
+	cpumask_set_cpu(0, cfs_cpt_tab.ctb_cpumask);
+	node_set(0, cfs_cpt_tab.ctb_nodemask);
 	return 0;
 }
 
 static inline void cfs_cpu_fini(void)
 {
+	free_cpumask_var(cfs_cpt_tab.ctb_cpumask);
 }
 
 #endif /* CONFIG_SMP */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code James Simmons
@ 2018-06-25  0:20   ` NeilBrown
  2018-06-26  0:33     ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  0:20 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> While pushing the SMP work some bugs were pointed out by Dan
> Carpenter in the code. Due to a single err label in cfs_cpu_init()
> and cfs_cpt_table_alloc(), a few items were being cleaned up that
> were never initialized. This can lead to crashes and other problems.
> In those initialization functions introduce individual labels to
> jump to, so that only the things that were initialized get freed
> on failure.
>
> Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-10932
> Reviewed-on: https://review.whamcloud.com/32085
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 72 ++++++++++++++++++-------
>  1 file changed, 52 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index 46d3530..bdd71a3 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -85,17 +85,19 @@ struct cfs_cpt_table *
>  
>  	cptab->ctb_nparts = ncpt;
>  
> +	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS))
> +		goto failed_alloc_cpumask;
> +
>  	cptab->ctb_nodemask = kzalloc(sizeof(*cptab->ctb_nodemask),
>  				      GFP_NOFS);
> -	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS) ||
> -	    !cptab->ctb_nodemask)
> -		goto failed;
> +	if (!cptab->ctb_nodemask)
> +		goto failed_alloc_nodemask;
>  
>  	cptab->ctb_cpu2cpt = kvmalloc_array(num_possible_cpus(),
>  					    sizeof(cptab->ctb_cpu2cpt[0]),
>  					    GFP_KERNEL);
>  	if (!cptab->ctb_cpu2cpt)
> -		goto failed;
> +		goto failed_alloc_cpu2cpt;
>  
>  	memset(cptab->ctb_cpu2cpt, -1,
>  	       num_possible_cpus() * sizeof(cptab->ctb_cpu2cpt[0]));
> @@ -103,22 +105,41 @@ struct cfs_cpt_table *
>  	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
>  					  GFP_KERNEL);
>  	if (!cptab->ctb_parts)
> -		goto failed;
> +		goto failed_alloc_ctb_parts;
> +
> +	memset(cptab->ctb_parts, -1, ncpt * sizeof(cptab->ctb_parts[0]));
>  
>  	for (i = 0; i < ncpt; i++) {
>  		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
>  
> +		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS))
> +			goto failed_setting_ctb_parts;
> +
>  		part->cpt_nodemask = kzalloc(sizeof(*part->cpt_nodemask),
>  					     GFP_NOFS);
> -		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS) ||
> -		    !part->cpt_nodemask)
> -			goto failed;
> +		if (!part->cpt_nodemask)
> +			goto failed_setting_ctb_parts;

If zalloc_cpumask_var() succeeds, but kzalloc() fails (which is almost
impossible, but still) we go to failed_setting_ctb_parts, with
 cptab->ctb_parts[i].cpt_cpumask needing to be freed.

>  	}
>  
>  	return cptab;
>  
> - failed:
> -	cfs_cpt_table_free(cptab);
> +failed_setting_ctb_parts:
> +	while (i-- >= 0) {

but we don't free anything in cptab->ctb_parts[i].
I've fixed this by calling free_cpumask_var() before the goto.

And will propagate the change through future patches in this series.
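
A minimal sketch of that change (using the names from the quoted hunk;
the hunk as applied may differ in detail):

		part->cpt_nodemask = kzalloc(sizeof(*part->cpt_nodemask),
					     GFP_NOFS);
		if (!part->cpt_nodemask) {
			free_cpumask_var(part->cpt_cpumask);
			goto failed_setting_ctb_parts;
		}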

> +		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
> +
> +		kfree(part->cpt_nodemask);
> +		free_cpumask_var(part->cpt_cpumask);
> +	}
> +
> +	kvfree(cptab->ctb_parts);
> +failed_alloc_ctb_parts:
> +	kvfree(cptab->ctb_cpu2cpt);
> +failed_alloc_cpu2cpt:
> +	kfree(cptab->ctb_nodemask);
> +failed_alloc_nodemask:
> +	free_cpumask_var(cptab->ctb_cpumask);
> +failed_alloc_cpumask:
> +	kfree(cptab);
>  	return NULL;
>  }
>  EXPORT_SYMBOL(cfs_cpt_table_alloc);
> @@ -944,7 +965,7 @@ static int cfs_cpu_dead(unsigned int cpu)
>  int
>  cfs_cpu_init(void)
>  {
> -	int ret = 0;
> +	int ret;
>  
>  	LASSERT(!cfs_cpt_tab);
>  
> @@ -953,23 +974,23 @@ static int cfs_cpu_dead(unsigned int cpu)
>  					"staging/lustre/cfe:dead", NULL,
>  					cfs_cpu_dead);
>  	if (ret < 0)
> -		goto failed;
> +		goto failed_cpu_dead;
> +
>  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>  					"staging/lustre/cfe:online",
>  					cfs_cpu_online, NULL);
>  	if (ret < 0)
> -		goto failed;
> +		goto failed_cpu_online;
> +
>  	lustre_cpu_online = ret;
>  #endif
> -	ret = -EINVAL;
> -
>  	get_online_cpus();
>  	if (*cpu_pattern) {
>  		char *cpu_pattern_dup = kstrdup(cpu_pattern, GFP_KERNEL);
>  
>  		if (!cpu_pattern_dup) {
>  			CERROR("Failed to duplicate cpu_pattern\n");
> -			goto failed;
> +			goto failed_alloc_table;
>  		}
>  
>  		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern_dup);
> @@ -977,7 +998,7 @@ static int cfs_cpu_dead(unsigned int cpu)
>  		if (!cfs_cpt_tab) {
>  			CERROR("Failed to create cptab from pattern %s\n",
>  			       cpu_pattern);
> -			goto failed;
> +			goto failed_alloc_table;
>  		}
>  
>  	} else {
> @@ -985,7 +1006,7 @@ static int cfs_cpu_dead(unsigned int cpu)
>  		if (!cfs_cpt_tab) {
>  			CERROR("Failed to create ptable with npartitions %d\n",
>  			       cpu_npartitions);
> -			goto failed;
> +			goto failed_alloc_table;
>  		}
>  	}
>  
> @@ -996,8 +1017,19 @@ static int cfs_cpu_dead(unsigned int cpu)
>  		 cfs_cpt_number(cfs_cpt_tab));
>  	return 0;
>  
> - failed:
> +failed_alloc_table:
>  	put_online_cpus();
> -	cfs_cpu_fini();
> +
> +	if (cfs_cpt_tab)
> +		cfs_cpt_table_free(cfs_cpt_tab);
> +
> +	ret = -EINVAL;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	if (lustre_cpu_online > 0)
> +		cpuhp_remove_state_nocalls(lustre_cpu_online);
> +failed_cpu_online:
> +	cpuhp_remove_state_nocalls(CPUHP_LUSTRE_CFS_DEAD);
> +failed_cpu_dead:
> +#endif
>  	return ret;
>  }
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space James Simmons
@ 2018-06-25  0:35   ` NeilBrown
  2018-06-26  0:55     ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  0:35 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Amir Shehata <amir.shehata@intel.com>
>
> The function cfs_cpt_table_print() was adding two spaces
> to the string buffer. Just add it once.

No it doesn't.  Maybe it did in the out-of-tree code, but the linux code
is different.

The extra space is

                       rc = snprintf(tmp, len, " %d", j);

But in Linux that is

			rc = snprintf(tmp, len, "%d ", j);

Both are wrong, but for different reasons.
I've changed this patch to be:

			rc = snprintf(tmp, len, "%d\t:", i);
and
			rc = snprintf(tmp, len, " %d", j);
and changed the comment to say that we don't need a stray space at the
end of the line.

NeilBrown



>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> Reviewed-on: http://review.whamcloud.com/18916
> Reviewed-by: Olaf Weber <olaf@sgi.com>
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index ea8d55c..680a2b1 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -177,7 +177,7 @@ struct cfs_cpt_table *
>  
>  	for (i = 0; i < cptab->ctb_nparts; i++) {
>  		if (len > 0) {
> -			rc = snprintf(tmp, len, "%d\t: ", i);
> +			rc = snprintf(tmp, len, "%d\t:", i);
>  			len -= rc;
>  		}
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support James Simmons
@ 2018-06-25  0:39   ` NeilBrown
  2018-06-25 18:22     ` Doug Oucharek
  2018-06-26  0:39     ` James Simmons
  0 siblings, 2 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-25  0:39 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Amir Shehata <amir.shehata@intel.com>
>
> This patch adds NUMA node support. NUMA node information is stored
> in the CPT table. A NUMA node mask is maintained for the entire
> table as well as for each CPT to track the NUMA nodes related to
> each of the CPTs. Add new function cfs_cpt_of_node() which returns
> the CPT of a particular NUMA node.

I note that you didn't respond to Greg's questions about this patch.
I'll accept it anyway in the interests of moving forward, but I think
his comments were probably valid, and need to be considered at some
stage.

There is a bug though....
>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> Reviewed-on: http://review.whamcloud.com/18916
> Reviewed-by: Olaf Weber <olaf@sgi.com>
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  .../lustre/include/linux/libcfs/libcfs_cpu.h        | 11 +++++++++++
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c     | 21 +++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index 1b4333d..ff3ecf5 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -103,6 +103,8 @@ struct cfs_cpt_table {
>  	int				*ctb_cpu2cpt;
>  	/* all cpus in this partition table */
>  	cpumask_var_t			ctb_cpumask;
> +	/* shadow HW node to CPU partition ID */
> +	int				*ctb_node2cpt;
>  	/* all nodes in this partition table */
>  	nodemask_t			*ctb_nodemask;
>  };
> @@ -143,6 +145,10 @@ struct cfs_cpt_table {
>   */
>  int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu);
>  /**
> + * shadow HW node ID \a NODE to CPU-partition ID by \a cptab
> + */
> +int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
> +/**
>   * bind current thread on a CPU-partition \a cpt of \a cptab
>   */
>  int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
> @@ -299,6 +305,11 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
>  	return 0;
>  }
>  
> +static inline int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
> +{
> +	return 0;
> +}
> +
>  static inline int
>  cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
>  {
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index 33294da..8c5cf7b 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -102,6 +102,15 @@ struct cfs_cpt_table *
>  	memset(cptab->ctb_cpu2cpt, -1,
>  	       nr_cpu_ids * sizeof(cptab->ctb_cpu2cpt[0]));
>  
> +	cptab->ctb_node2cpt = kvmalloc_array(nr_node_ids,
> +					     sizeof(cptab->ctb_node2cpt[0]),
> +					     GFP_KERNEL);
> +	if (!cptab->ctb_node2cpt)
> +		goto failed_alloc_node2cpt;
> +
> +	memset(cptab->ctb_node2cpt, -1,
> +	       nr_node_ids * sizeof(cptab->ctb_node2cpt[0]));
> +
>  	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
>  					  GFP_KERNEL);
>  	if (!cptab->ctb_parts)
> @@ -133,6 +142,8 @@ struct cfs_cpt_table *
>  
>  	kvfree(cptab->ctb_parts);
>  failed_alloc_ctb_parts:
> +	kvfree(cptab->ctb_node2cpt);
> +failed_alloc_node2cpt:
>  	kvfree(cptab->ctb_cpu2cpt);
>  failed_alloc_cpu2cpt:
>  	kfree(cptab->ctb_nodemask);
> @@ -150,6 +161,7 @@ struct cfs_cpt_table *
>  	int i;
>  
>  	kvfree(cptab->ctb_cpu2cpt);
> +	kvfree(cptab->ctb_node2cpt);
>  
>  	for (i = 0; cptab->ctb_parts && i < cptab->ctb_nparts; i++) {
>  		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
> @@ -515,6 +527,15 @@ struct cfs_cpt_table *
>  }
>  EXPORT_SYMBOL(cfs_cpt_of_cpu);
>  
> +int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
> +{
> +	if (node < 0 || node > nr_node_ids)
> +		return CFS_CPT_ANY;
> +
> +	return cptab->ctb_node2cpt[node];
> +}

So if node == nr_node_ids, we access beyond the end of the ctb_node2cpt array.
Oops.
I've fixed this before applying.
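
A minimal sketch of the corrected bound check (the applied fix may
differ in form):

	if (node < 0 || node >= nr_node_ids)
		return CFS_CPT_ANY;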

Thanks,
NeilBrown


> +EXPORT_SYMBOL(cfs_cpt_of_node);
> +
>  int
>  cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
>  {
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling James Simmons
@ 2018-06-25  0:48   ` NeilBrown
  2018-06-26  1:15     ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  0:48 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Amir Shehata <amir.shehata@intel.com>
>
> Add functionality to calculate the distance between two CPTs.
> Expose those distance in debugfs so people deploying a setup
> can debug what is being created for CPTs.

This patch doesn't expose anything in debugfs - a later patch
does that.
So I've changed the comment to "Prepare to expose those ...."

NeilBrown


>
> Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> Reviewed-on: http://review.whamcloud.com/18916
> Reviewed-by: Olaf Weber <olaf@sgi.com>
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  .../lustre/include/linux/libcfs/libcfs_cpu.h       | 31 +++++++++++
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 61 ++++++++++++++++++++++
>  2 files changed, 92 insertions(+)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index ff3ecf5..a015ac1 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -86,6 +86,8 @@ struct cfs_cpu_partition {
>  	cpumask_var_t			cpt_cpumask;
>  	/* nodes mask for this partition */
>  	nodemask_t			*cpt_nodemask;
> +	/* NUMA distance between CPTs */
> +	unsigned int			*cpt_distance;
>  	/* spread rotor for NUMA allocator */
>  	unsigned int			cpt_spread_rotor;
>  };
> @@ -95,6 +97,8 @@ struct cfs_cpu_partition {
>  struct cfs_cpt_table {
>  	/* spread rotor for NUMA allocator */
>  	unsigned int			ctb_spread_rotor;
> +	/* maximum NUMA distance between all nodes in table */
> +	unsigned int			ctb_distance;
>  	/* # of CPU partitions */
>  	unsigned int			ctb_nparts;
>  	/* partitions tables */
> @@ -120,6 +124,10 @@ struct cfs_cpt_table {
>   */
>  int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len);
>  /**
> + * print distance information of cpt-table
> + */
> +int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len);
> +/**
>   * return total number of CPU partitions in \a cptab
>   */
>  int
> @@ -149,6 +157,10 @@ struct cfs_cpt_table {
>   */
>  int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
>  /**
> + * NUMA distance between \a cpt1 and \a cpt2 in \a cptab
> + */
> +unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2);
> +/**
>   * bind current thread on a CPU-partition \a cpt of \a cptab
>   */
>  int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
> @@ -206,6 +218,19 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
>  struct cfs_cpt_table;
>  #define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
>  
> +static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
> +					 char *buf, int len)
> +{
> +	int rc;
> +
> +	rc = snprintf(buf, len, "0\t: 0:1\n");
> +	len -= rc;
> +	if (len <= 0)
> +		return -EFBIG;
> +
> +	return rc;
> +}
> +
>  static inline cpumask_var_t *
>  cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
>  {
> @@ -241,6 +266,12 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
>  	return NULL;
>  }
>  
> +static inline unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab,
> +					    int cpt1, int cpt2)
> +{
> +	return 1;
> +}
> +
>  static inline int
>  cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
>  {
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index 8c5cf7b..b315fb2 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -128,6 +128,15 @@ struct cfs_cpt_table *
>  					     GFP_NOFS);
>  		if (!part->cpt_nodemask)
>  			goto failed_setting_ctb_parts;
> +
> +		part->cpt_distance = kvmalloc_array(cptab->ctb_nparts,
> +						    sizeof(part->cpt_distance[0]),
> +						    GFP_KERNEL);
> +		if (!part->cpt_distance)
> +			goto failed_setting_ctb_parts;
> +
> +		memset(part->cpt_distance, -1,
> +		       cptab->ctb_nparts * sizeof(part->cpt_distance[0]));
>  	}
>  
>  	return cptab;
> @@ -138,6 +147,7 @@ struct cfs_cpt_table *
>  
>  		kfree(part->cpt_nodemask);
>  		free_cpumask_var(part->cpt_cpumask);
> +		kvfree(part->cpt_distance);
>  	}
>  
>  	kvfree(cptab->ctb_parts);
> @@ -168,6 +178,7 @@ struct cfs_cpt_table *
>  
>  		kfree(part->cpt_nodemask);
>  		free_cpumask_var(part->cpt_cpumask);
> +		kvfree(part->cpt_distance);
>  	}
>  
>  	kvfree(cptab->ctb_parts);
> @@ -222,6 +233,44 @@ struct cfs_cpt_table *
>  }
>  EXPORT_SYMBOL(cfs_cpt_table_print);
>  
> +int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
> +{
> +	char *tmp = buf;
> +	int rc;
> +	int i;
> +	int j;
> +
> +	for (i = 0; i < cptab->ctb_nparts; i++) {
> +		if (len <= 0)
> +			goto err;
> +
> +		rc = snprintf(tmp, len, "%d\t:", i);
> +		len -= rc;
> +
> +		if (len <= 0)
> +			goto err;
> +
> +		tmp += rc;
> +		for (j = 0; j < cptab->ctb_nparts; j++) {
> +			rc = snprintf(tmp, len, " %d:%d", j,
> +				      cptab->ctb_parts[i].cpt_distance[j]);
> +			len -= rc;
> +			if (len <= 0)
> +				goto err;
> +			tmp += rc;
> +		}
> +
> +		*tmp = '\n';
> +		tmp++;
> +		len--;
> +	}
> +
> +	return tmp - buf;
> +err:
> +	return -E2BIG;
> +}
> +EXPORT_SYMBOL(cfs_cpt_distance_print);
> +
>  int
>  cfs_cpt_number(struct cfs_cpt_table *cptab)
>  {
> @@ -273,6 +322,18 @@ struct cfs_cpt_table *
>  }
>  EXPORT_SYMBOL(cfs_cpt_nodemask);
>  
> +unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
> +{
> +	LASSERT(cpt1 == CFS_CPT_ANY || (cpt1 >= 0 && cpt1 < cptab->ctb_nparts));
> +	LASSERT(cpt2 == CFS_CPT_ANY || (cpt2 >= 0 && cpt2 < cptab->ctb_nparts));
> +
> +	if (cpt1 == CFS_CPT_ANY || cpt2 == CFS_CPT_ANY)
> +		return cptab->ctb_distance;
> +
> +	return cptab->ctb_parts[cpt1].cpt_distance[cpt2];
> +}
> +EXPORT_SYMBOL(cfs_cpt_distance);
> +
>  int
>  cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
>  {
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification.
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification James Simmons
@ 2018-06-25  0:57   ` NeilBrown
  2018-06-26  0:42     ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  0:57 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Dmitry Eremin <dmitry.eremin@intel.com>
>
> Use int type for CPT identification to match the linux kernel
> CPU identification.

Can someone cite evidence for "int" being the dominant choice for CPU
identification in the kernel?
I looked in cpumask.h and found plenty of "unsigned int".
I also found

Commit: 9b130ad5bb82 ("treewide: make "nr_cpu_ids" unsigned")

which makes nr_cpu_ids unsigned.

So I'm dropping this patch for now as the justification is not
convincing.

If there is a real case to be made, please resubmit.

Thanks,
NeilBrown



>
> Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
> Reviewed-on: https://review.whamcloud.com/23304
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h |  8 ++++----
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 14 +++++++-------
>  2 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index 9dbb0b1..2bb2140 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -89,18 +89,18 @@ struct cfs_cpu_partition {
>  	/* NUMA distance between CPTs */
>  	unsigned int			*cpt_distance;
>  	/* spread rotor for NUMA allocator */
> -	unsigned int			cpt_spread_rotor;
> +	int				cpt_spread_rotor;
>  };
>  
>  
>  /** descriptor for CPU partitions */
>  struct cfs_cpt_table {
>  	/* spread rotor for NUMA allocator */
> -	unsigned int			ctb_spread_rotor;
> +	int				ctb_spread_rotor;
>  	/* maximum NUMA distance between all nodes in table */
>  	unsigned int			ctb_distance;
>  	/* # of CPU partitions */
> -	unsigned int			ctb_nparts;
> +	int				ctb_nparts;
>  	/* partitions tables */
>  	struct cfs_cpu_partition	*ctb_parts;
>  	/* shadow HW CPU to CPU partition ID */
> @@ -355,7 +355,7 @@ static inline void cfs_cpu_fini(void)
>  /**
>   * create a cfs_cpt_table with \a ncpt number of partitions
>   */
> -struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt);
> +struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt);
>  
>  /*
>   * allocate per-cpu-partition data, returned value is an array of pointers,
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index aaab7cb..8f7de59 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -73,7 +73,7 @@
>  module_param(cpu_pattern, charp, 0444);
>  MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
>  
> -struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt)
> +struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
>  {
>  	struct cfs_cpt_table *cptab;
>  	int i;
> @@ -788,13 +788,13 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
>  	return rc;
>  }
>  
> -#define CPT_WEIGHT_MIN  4u
> +#define CPT_WEIGHT_MIN 4
>  
> -static unsigned int cfs_cpt_num_estimate(void)
> +static int cfs_cpt_num_estimate(void)
>  {
> -	unsigned int nnode = num_online_nodes();
> -	unsigned int ncpu = num_online_cpus();
> -	unsigned int ncpt;
> +	int nnode = num_online_nodes();
> +	int ncpu = num_online_cpus();
> +	int ncpt;
>  
>  	if (ncpu <= CPT_WEIGHT_MIN) {
>  		ncpt = 1;
> @@ -824,7 +824,7 @@ static unsigned int cfs_cpt_num_estimate(void)
>  	/* config many CPU partitions on 32-bit system could consume
>  	 * too much memory
>  	 */
> -	ncpt = min(2U, ncpt);
> +	ncpt = min(2, ncpt);
>  #endif
>  	while (ncpu % ncpt)
>  		ncpt--; /* worst case is 1 */
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node James Simmons
@ 2018-06-25  1:09   ` NeilBrown
  2018-06-25  1:11     ` NeilBrown
  2018-06-26  0:54     ` James Simmons
  0 siblings, 2 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-25  1:09 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Dmitry Eremin <dmitry.eremin@intel.com>
>
> Reporting "HW nodes" is too generic. It really is reporting
> "HW NUMA nodes". Update the debug message.

I'm not happy with this patch description.....

>
> Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
> Reviewed-on: https://review.whamcloud.com/23306
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Patrick Farrell <paf@cray.com>
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h | 2 ++
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 2 +-
>  drivers/staging/lustre/lnet/lnet/lib-msg.c               | 2 ++
>  3 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index 2bb2140..29c5071 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -90,6 +90,8 @@ struct cfs_cpu_partition {
>  	unsigned int			*cpt_distance;
>  	/* spread rotor for NUMA allocator */
>  	int				cpt_spread_rotor;
> +	/* NUMA node if cpt_nodemask is empty */
> +	int				cpt_node;
>  };

It doesn't give any reason why this (unused) field was added.
So I've removed it.


>  
>  
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index 18925c7..86afa31 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -1142,7 +1142,7 @@ int cfs_cpu_init(void)
>  
>  	put_online_cpus();
>  
> -	LCONSOLE(0, "HW nodes: %d, HW CPU cores: %d, npartitions: %d\n",
> +	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
>  		 num_online_nodes(), num_online_cpus(),
>  		 cfs_cpt_number(cfs_cpt_tab));
>  	return 0;

It does explain this hunk, which is fine.


> diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> index 0091273..27bdefa 100644
> --- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
> +++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> @@ -568,6 +568,8 @@
>  
>  	/* number of CPUs */
>  	container->msc_nfinalizers = cfs_cpt_weight(lnet_cpt_table(), cpt);
> +	if (container->msc_nfinalizers == 0)
> +		container->msc_nfinalizers = 1;

It doesn't justify this at all.

I guess this was meant to be in the previous patch, so I've moved it.

Thanks,
NeilBrown


>  
>  	container->msc_finalizers = kvzalloc_cpt(container->msc_nfinalizers *
>  						 sizeof(*container->msc_finalizers),
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-25  1:09   ` NeilBrown
@ 2018-06-25  1:11     ` NeilBrown
  2018-06-25 22:57       ` James Simmons
  2018-06-26  0:54     ` James Simmons
  1 sibling, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  1:11 UTC (permalink / raw)
  To: lustre-devel

On Mon, Jun 25 2018, NeilBrown wrote:

> On Sun, Jun 24 2018, James Simmons wrote:
>
>> From: Dmitry Eremin <dmitry.eremin@intel.com>
>>
>> Reporting "HW nodes" is too generic. It really is reporting
>> "HW NUMA nodes". Update the debug message.
>
> I'm not happy with this patch description.....
>
>>
>> Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
>> Reviewed-on: https://review.whamcloud.com/23306
>> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
>> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> Reviewed-by: Patrick Farrell <paf@cray.com>
>> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
>> Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> ---
>>  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h | 2 ++
>>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 2 +-
>>  drivers/staging/lustre/lnet/lnet/lib-msg.c               | 2 ++
>>  3 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
>> index 2bb2140..29c5071 100644
>> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
>> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
>> @@ -90,6 +90,8 @@ struct cfs_cpu_partition {
>>  	unsigned int			*cpt_distance;
>>  	/* spread rotor for NUMA allocator */
>>  	int				cpt_spread_rotor;
>> +	/* NUMA node if cpt_nodemask is empty */
>> +	int				cpt_node;
>>  };
>
> It doesn't give any reason why this (unused) field was added.
> So I've removed it.

Ahhhh, this was meant to be in the previous patch too.
I've moved it.

Thanks,
NeilBrown


>
>
>>  
>>  
>> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
>> index 18925c7..86afa31 100644
>> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
>> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
>> @@ -1142,7 +1142,7 @@ int cfs_cpu_init(void)
>>  
>>  	put_online_cpus();
>>  
>> -	LCONSOLE(0, "HW nodes: %d, HW CPU cores: %d, npartitions: %d\n",
>> +	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
>>  		 num_online_nodes(), num_online_cpus(),
>>  		 cfs_cpt_number(cfs_cpt_tab));
>>  	return 0;
>
> It does explain this hunk, which is fine.
>
>
>> diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
>> index 0091273..27bdefa 100644
>> --- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
>> +++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
>> @@ -568,6 +568,8 @@
>>  
>>  	/* number of CPUs */
>>  	container->msc_nfinalizers = cfs_cpt_weight(lnet_cpt_table(), cpt);
>> +	if (container->msc_nfinalizers == 0)
>> +		container->msc_nfinalizers = 1;
>
> It doesn't justify this at all.
>
> I guess this was meant to be in the previous patch, so I've moved it.
>
> Thanks,
> NeilBrown
>
>
>>  
>>  	container->msc_finalizers = kvzalloc_cpt(container->msc_nfinalizers *
>>  						 sizeof(*container->msc_finalizers),
>> -- 
>> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP James Simmons
@ 2018-06-25  1:27   ` NeilBrown
  0 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-25  1:27 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> With the cleanup of the libcfs SMP handling the function
> cfs_cpt_table_print() was turned into an empty function.
> This function is called by a debugfs reporting function, which
> now means that on UMP machines it reports nothing, breaking
> behavior previously exposed to users. Restore the original
> behavior.

Which cleanup was this?
cfs_cpt_table_print seems to have been empty on UMP since at least

commit 3867ea5a4bc4d428f8d93557fb0fbc2cac2f2cdf
Author: Peng Tao <bergwolf@gmail.com>
Date:   Mon Jul 15 22:27:10 2013 +0800

    staging/lustre: fix build error when !CONFIG_SMP
    
    Three functions cfs_cpu_ht_nsiblings, cfs_cpt_cpumask and
    cfs_cpt_table_print are missing if !CONFIG_SMP.

5 years ago.

>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9856

This link seems completely irrelevant.

However, the patch looks sensible enough, so I've applied it.

Thanks,
NeilBrown


> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  .../staging/lustre/include/linux/libcfs/libcfs_cpu.h  | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index 29c5071..32776d2 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -218,6 +218,19 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
>  struct cfs_cpt_table;
>  #define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
>  
> +static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab,
> +				      char *buf, int len)
> +{
> +	int rc;
> +
> +	rc = snprintf(buf, len, "0\t: 0\n");
> +	len -= rc;
> +	if (len <= 0)
> +		return -EFBIG;
> +
> +	return rc;
> +}
> +
>  static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
>  					 char *buf, int len)
>  {
> @@ -237,12 +250,6 @@ static inline cpumask_var_t *cfs_cpt_cpumask(struct cfs_cpt_table *cptab,
>  	return NULL;
>  }
>  
> -static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf,
> -				      int len)
> -{
> -	return 0;
> -}
> -
>  static inline int cfs_cpt_number(struct cfs_cpt_table *cptab)
>  {
>  	return 1;
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure James Simmons
@ 2018-06-25  1:32   ` NeilBrown
  0 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-25  1:32 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> Only one cfs_cpt_tab exists, and it is created and destroyed only at
> libcfs module loading and removal. Instead of dynamically allocating
> it, statically allocate it. This will help to re-enable UMP support.
>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9856
> Signed-off-by: James Simmons <jsimmons@infradead.org>

While this patch is quite possibly a good idea, I'm not applying it or
the following one until you explain what is currently broken with "UMP"
support (I assume you mean UP - uni-processor).

Again, the WC-bug-id you linked gives no useful hint.

Thanks,
NeilBrown


> ---
>  .../lustre/include/linux/libcfs/libcfs_cpu.h       |   4 +-
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 111 ++++++++++-----------
>  drivers/staging/lustre/lnet/libcfs/module.c        |  10 +-
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |   4 +-
>  drivers/staging/lustre/lnet/selftest/framework.c   |   2 +-
>  drivers/staging/lustre/lustre/ptlrpc/client.c      |   4 +-
>  drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     |  10 +-
>  drivers/staging/lustre/lustre/ptlrpc/service.c     |   4 +-
>  8 files changed, 73 insertions(+), 76 deletions(-)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> index 32776d2..df7e16b 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> @@ -115,7 +115,7 @@ struct cfs_cpt_table {
>  	nodemask_t			*ctb_nodemask;
>  };
>  
> -extern struct cfs_cpt_table	*cfs_cpt_tab;
> +extern struct cfs_cpt_table	cfs_cpt_tab;
>  
>  /**
>   * return cpumask of CPU partition \a cpt
> @@ -215,8 +215,6 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
>  void cfs_cpu_fini(void);
>  
>  #else /* !CONFIG_SMP */
> -struct cfs_cpt_table;
> -#define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
>  
>  static inline int cfs_cpt_table_print(struct cfs_cpt_table *cptab,
>  				      char *buf, int len)
> diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> index 3f4a7c7..9fd324d 100644
> --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> @@ -41,10 +41,6 @@
>  #include <linux/libcfs/libcfs_string.h>
>  #include <linux/libcfs/libcfs.h>
>  
> -/** Global CPU partition table */
> -struct cfs_cpt_table   *cfs_cpt_tab __read_mostly;
> -EXPORT_SYMBOL(cfs_cpt_tab);
> -
>  /**
>   * modparam for setting number of partitions
>   *
> @@ -73,15 +69,10 @@
>  module_param(cpu_pattern, charp, 0444);
>  MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
>  
> -struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
> +static int cfs_cpt_table_setup(struct cfs_cpt_table *cptab, int ncpt)
>  {
> -	struct cfs_cpt_table *cptab;
>  	int i;
>  
> -	cptab = kzalloc(sizeof(*cptab), GFP_NOFS);
> -	if (!cptab)
> -		return NULL;
> -
>  	cptab->ctb_nparts = ncpt;
>  
>  	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS))
> @@ -138,7 +129,7 @@ struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
>  		       cptab->ctb_nparts * sizeof(part->cpt_distance[0]));
>  	}
>  
> -	return cptab;
> +	return 0;
>  
>  failed_setting_ctb_parts:
>  	while (i-- >= 0) {
> @@ -159,8 +150,24 @@ struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
>  failed_alloc_nodemask:
>  	free_cpumask_var(cptab->ctb_cpumask);
>  failed_alloc_cpumask:
> -	kfree(cptab);
> -	return NULL;
> +	return -ENOMEM;
> +}
> +
> +struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
> +{
> +	struct cfs_cpt_table *cptab;
> +	int rc;
> +
> +	cptab = kzalloc(sizeof(*cptab), GFP_NOFS);
> +	if (!cptab)
> +		return NULL;
> +
> +	rc = cfs_cpt_table_setup(cptab, ncpt);
> +	if (rc) {
> +		kfree(cptab);
> +		cptab = NULL;
> +	}
> +	return cptab;
>  }
>  EXPORT_SYMBOL(cfs_cpt_table_alloc);
>  
> @@ -183,8 +190,6 @@ void cfs_cpt_table_free(struct cfs_cpt_table *cptab)
>  
>  	kfree(cptab->ctb_nodemask);
>  	free_cpumask_var(cptab->ctb_cpumask);
> -
> -	kfree(cptab);
>  }
>  EXPORT_SYMBOL(cfs_cpt_table_free);
>  
> @@ -822,9 +827,8 @@ static int cfs_cpt_num_estimate(void)
>  	return ncpt;
>  }
>  
> -static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
> +static int cfs_cpt_table_create(int ncpt)
>  {
> -	struct cfs_cpt_table *cptab = NULL;
>  	cpumask_var_t node_mask;
>  	int cpt = 0;
>  	int node;
> @@ -841,10 +845,9 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
>  		      ncpt, num);
>  	}
>  
> -	cptab = cfs_cpt_table_alloc(ncpt);
> -	if (!cptab) {
> -		CERROR("Failed to allocate CPU map(%d)\n", ncpt);
> -		rc = -ENOMEM;
> +	rc = cfs_cpt_table_setup(&cfs_cpt_tab, ncpt);
> +	if (rc) {
> +		CERROR("Failed to setup CPU map(%d)\n", ncpt);
>  		goto failed;
>  	}
>  
> @@ -860,10 +863,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
>  		cpumask_copy(node_mask, cpumask_of_node(node));
>  
>  		while (cpt < ncpt && !cpumask_empty(node_mask)) {
> -			struct cfs_cpu_partition *part = &cptab->ctb_parts[cpt];
> -			int ncpu = cpumask_weight(part->cpt_cpumask);
> +			struct cfs_cpu_partition *part;
> +			int ncpu;
> +
> +			part = &cfs_cpt_tab.ctb_parts[cpt];
> +			ncpu = cpumask_weight(part->cpt_cpumask);
>  
> -			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask,
> +			rc = cfs_cpt_choose_ncpus(&cfs_cpt_tab, cpt, node_mask,
>  						  num - ncpu);
>  			if (rc < 0) {
>  				rc = -EINVAL;
> @@ -880,7 +886,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
>  
>  	free_cpumask_var(node_mask);
>  
> -	return cptab;
> +	return 0;
>  
>  failed_mask:
>  	free_cpumask_var(node_mask);
> @@ -888,15 +894,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
>  	CERROR("Failed (rc = %d) to setup CPU partition table with %d partitions, online HW NUMA nodes: %d, HW CPU cores: %d.\n",
>  	       rc, ncpt, num_online_nodes(), num_online_cpus());
>  
> -	if (cptab)
> -		cfs_cpt_table_free(cptab);
> +	cfs_cpt_table_free(&cfs_cpt_tab);
>  
> -	return ERR_PTR(rc);
> +	return rc;
>  }
>  
> -static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
> +static int cfs_cpt_table_create_pattern(const char *pattern)
>  {
> -	struct cfs_cpt_table *cptab;
>  	char *pattern_dup;
>  	char *bracket;
>  	char *str;
> @@ -911,7 +915,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  	pattern_dup = kstrdup(pattern, GFP_KERNEL);
>  	if (!pattern_dup) {
>  		CERROR("Failed to duplicate pattern '%s'\n", pattern);
> -		return ERR_PTR(-ENOMEM);
> +		return -ENOMEM;
>  	}
>  
>  	str = strim(pattern_dup);
> @@ -948,10 +952,9 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  		goto err_free_str;
>  	}
>  
> -	cptab = cfs_cpt_table_alloc(ncpt);
> -	if (!cptab) {
> -		CERROR("Failed to allocate CPU partition table\n");
> -		rc = -ENOMEM;
> +	rc = cfs_cpt_table_setup(&cfs_cpt_tab, ncpt);
> +	if (rc) {
> +		CERROR("Failed to setup CPU partition table\n");
>  		goto err_free_str;
>  	}
>  
> @@ -960,14 +963,14 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  			if (cpumask_empty(cpumask_of_node(i)))
>  				continue;
>  
> -			rc = cfs_cpt_set_node(cptab, cpt++, i);
> +			rc = cfs_cpt_set_node(&cfs_cpt_tab, cpt++, i);
>  			if (!rc) {
>  				rc = -EINVAL;
>  				goto err_free_table;
>  			}
>  		}
>  		kfree(pattern_dup);
> -		return cptab;
> +		return 0;
>  	}
>  
>  	high = node ? nr_node_ids - 1 : nr_cpu_ids - 1;
> @@ -1006,7 +1009,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  			goto err_free_table;
>  		}
>  
> -		if (cfs_cpt_weight(cptab, cpt)) {
> +		if (cfs_cpt_weight(&cfs_cpt_tab, cpt)) {
>  			CERROR("Partition %d has already been set.\n", cpt);
>  			rc = -EPERM;
>  			goto err_free_table;
> @@ -1040,8 +1043,8 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  				if ((i - range->re_lo) % range->re_stride)
>  					continue;
>  
> -				rc = node ? cfs_cpt_set_node(cptab, cpt, i) :
> -					    cfs_cpt_set_cpu(cptab, cpt, i);
> +				rc = node ? cfs_cpt_set_node(&cfs_cpt_tab, cpt, i) :
> +					    cfs_cpt_set_cpu(&cfs_cpt_tab, cpt, i);
>  				if (!rc) {
>  					cfs_expr_list_free(el);
>  					rc = -EINVAL;
> @@ -1052,7 +1055,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  
>  		cfs_expr_list_free(el);
>  
> -		if (!cfs_cpt_online(cptab, cpt)) {
> +		if (!cfs_cpt_online(&cfs_cpt_tab, cpt)) {
>  			CERROR("No online CPU is found on partition %d\n", cpt);
>  			rc = -ENODEV;
>  			goto err_free_table;
> @@ -1062,13 +1065,13 @@ static struct cfs_cpt_table *cfs_cpt_table_create_pattern(const char *pattern)
>  	}
>  
>  	kfree(pattern_dup);
> -	return cptab;
> +	return 0;
>  
>  err_free_table:
> -	cfs_cpt_table_free(cptab);
> +	cfs_cpt_table_free(&cfs_cpt_tab);
>  err_free_str:
>  	kfree(pattern_dup);
> -	return ERR_PTR(rc);
> +	return rc;
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -1095,8 +1098,7 @@ static int cfs_cpu_dead(unsigned int cpu)
>  
>  void cfs_cpu_fini(void)
>  {
> -	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
> -		cfs_cpt_table_free(cfs_cpt_tab);
> +	cfs_cpt_table_free(&cfs_cpt_tab);
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  	if (lustre_cpu_online > 0)
> @@ -1109,8 +1111,6 @@ int cfs_cpu_init(void)
>  {
>  	int ret;
>  
> -	LASSERT(!cfs_cpt_tab);
> -
>  #ifdef CONFIG_HOTPLUG_CPU
>  	ret = cpuhp_setup_state_nocalls(CPUHP_LUSTRE_CFS_DEAD,
>  					"staging/lustre/cfe:dead", NULL,
> @@ -1128,20 +1128,18 @@ int cfs_cpu_init(void)
>  #endif
>  	get_online_cpus();
>  	if (*cpu_pattern) {
> -		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern);
> -		if (IS_ERR(cfs_cpt_tab)) {
> +		ret = cfs_cpt_table_create_pattern(cpu_pattern);
> +		if (ret) {
>  			CERROR("Failed to create cptab from pattern '%s'\n",
>  			       cpu_pattern);
> -			ret = PTR_ERR(cfs_cpt_tab);
>  			goto failed_alloc_table;
>  		}
>  
>  	} else {
> -		cfs_cpt_tab = cfs_cpt_table_create(cpu_npartitions);
> -		if (IS_ERR(cfs_cpt_tab)) {
> +		ret = cfs_cpt_table_create(cpu_npartitions);
> +		if (ret) {
>  			CERROR("Failed to create cptab with npartitions %d\n",
>  			       cpu_npartitions);
> -			ret = PTR_ERR(cfs_cpt_tab);
>  			goto failed_alloc_table;
>  		}
>  	}
> @@ -1150,14 +1148,13 @@ int cfs_cpu_init(void)
>  
>  	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
>  		 num_online_nodes(), num_online_cpus(),
> -		 cfs_cpt_number(cfs_cpt_tab));
> +		 cfs_cpt_number(&cfs_cpt_tab));
>  	return 0;
>  
>  failed_alloc_table:
>  	put_online_cpus();
>  
> -	if (!IS_ERR_OR_NULL(cfs_cpt_tab))
> -		cfs_cpt_table_free(cfs_cpt_tab);
> +	cfs_cpt_table_free(&cfs_cpt_tab);
>  
>  	ret = -EINVAL;
>  #ifdef CONFIG_HOTPLUG_CPU
> diff --git a/drivers/staging/lustre/lnet/libcfs/module.c b/drivers/staging/lustre/lnet/libcfs/module.c
> index 2281f08..35c3959 100644
> --- a/drivers/staging/lustre/lnet/libcfs/module.c
> +++ b/drivers/staging/lustre/lnet/libcfs/module.c
> @@ -66,6 +66,10 @@ struct lnet_debugfs_symlink_def {
>  
>  static struct dentry *lnet_debugfs_root;
>  
> +/** Global CPU partition table */
> +struct cfs_cpt_table cfs_cpt_tab __read_mostly;
> +EXPORT_SYMBOL(cfs_cpt_tab);
> +
>  BLOCKING_NOTIFIER_HEAD(libcfs_ioctl_list);
>  EXPORT_SYMBOL(libcfs_ioctl_list);
>  
> @@ -402,7 +406,7 @@ static int proc_cpt_table(struct ctl_table *table, int write,
>  		if (!buf)
>  			return -ENOMEM;
>  
> -		rc = cfs_cpt_table_print(cfs_cpt_tab, buf, len);
> +		rc = cfs_cpt_table_print(&cfs_cpt_tab, buf, len);
>  		if (rc >= 0)
>  			break;
>  
> @@ -437,14 +441,12 @@ static int proc_cpt_distance(struct ctl_table *table, int write,
>  	if (write)
>  		return -EPERM;
>  
> -	LASSERT(cfs_cpt_tab);
> -
>  	while (1) {
>  		buf = kzalloc(len, GFP_KERNEL);
>  		if (!buf)
>  			return -ENOMEM;
>  
> -		rc = cfs_cpt_distance_print(cfs_cpt_tab, buf, len);
> +		rc = cfs_cpt_distance_print(&cfs_cpt_tab, buf, len);
>  		if (rc >= 0)
>  			break;
>  
> diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
> index f9ed697..98a4942 100644
> --- a/drivers/staging/lustre/lnet/lnet/api-ni.c
> +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
> @@ -1414,8 +1414,8 @@ int lnet_lib_init(void)
>  	memset(&the_lnet, 0, sizeof(the_lnet));
>  
>  	/* refer to global cfs_cpt_tab for now */
> -	the_lnet.ln_cpt_table	= cfs_cpt_tab;
> -	the_lnet.ln_cpt_number	= cfs_cpt_number(cfs_cpt_tab);
> +	the_lnet.ln_cpt_table = &cfs_cpt_tab;
> +	the_lnet.ln_cpt_number = cfs_cpt_number(&cfs_cpt_tab);
>  
>  	LASSERT(the_lnet.ln_cpt_number > 0);
>  	if (the_lnet.ln_cpt_number > LNET_CPT_MAX) {
> diff --git a/drivers/staging/lustre/lnet/selftest/framework.c b/drivers/staging/lustre/lnet/selftest/framework.c
> index 741af10..939b7ec 100644
> --- a/drivers/staging/lustre/lnet/selftest/framework.c
> +++ b/drivers/staging/lustre/lnet/selftest/framework.c
> @@ -588,7 +588,7 @@
>  
>  	CDEBUG(D_NET, "Reserved %d buffers for test %s\n",
>  	       nbuf * (srpc_serv_is_framework(svc) ?
> -		       2 : cfs_cpt_number(cfs_cpt_tab)), svc->sv_name);
> +		       2 : cfs_cpt_number(&cfs_cpt_tab)), svc->sv_name);
>  	return 0;
>  }
>  
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/client.c b/drivers/staging/lustre/lustre/ptlrpc/client.c
> index c1b82bf..c569a8b 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/client.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/client.c
> @@ -940,9 +940,9 @@ struct ptlrpc_request_set *ptlrpc_prep_set(void)
>  	struct ptlrpc_request_set *set;
>  	int cpt;
>  
> -	cpt = cfs_cpt_current(cfs_cpt_tab, 0);
> +	cpt = cfs_cpt_current(&cfs_cpt_tab, 0);
>  	set = kzalloc_node(sizeof(*set), GFP_NOFS,
> -			   cfs_cpt_spread_node(cfs_cpt_tab, cpt));
> +			   cfs_cpt_spread_node(&cfs_cpt_tab, cpt));
>  	if (!set)
>  		return NULL;
>  	atomic_set(&set->set_refcount, 1);
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> index 5310054..d496521 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> @@ -177,7 +177,7 @@ void ptlrpcd_wake(struct ptlrpc_request *req)
>  	if (req && req->rq_send_state != LUSTRE_IMP_FULL)
>  		return &ptlrpcd_rcv;
>  
> -	cpt = cfs_cpt_current(cfs_cpt_tab, 1);
> +	cpt = cfs_cpt_current(&cfs_cpt_tab, 1);
>  	if (!ptlrpcds_cpt_idx)
>  		idx = cpt;
>  	else
> @@ -389,7 +389,7 @@ static int ptlrpcd(void *arg)
>  	int exit = 0;
>  
>  	unshare_fs_struct();
> -	if (cfs_cpt_bind(cfs_cpt_tab, pc->pc_cpt) != 0)
> +	if (cfs_cpt_bind(&cfs_cpt_tab, pc->pc_cpt) != 0)
>  		CWARN("Failed to bind %s on CPT %d\n", pc->pc_name, pc->pc_cpt);
>  
>  	/*
> @@ -531,7 +531,7 @@ static int ptlrpcd_partners(struct ptlrpcd *pd, int index)
>  
>  	size = sizeof(struct ptlrpcd_ctl *) * pc->pc_npartners;
>  	pc->pc_partners = kzalloc_node(size, GFP_NOFS,
> -				       cfs_cpt_spread_node(cfs_cpt_tab,
> +				       cfs_cpt_spread_node(&cfs_cpt_tab,
>  							   pc->pc_cpt));
>  	if (!pc->pc_partners) {
>  		pc->pc_npartners = 0;
> @@ -677,7 +677,7 @@ static int ptlrpcd_init(void)
>  	/*
>  	 * Determine the CPTs that ptlrpcd threads will run on.
>  	 */
> -	cptable = cfs_cpt_tab;
> +	cptable = &cfs_cpt_tab;
>  	ncpts = cfs_cpt_number(cptable);
>  	if (ptlrpcd_cpts) {
>  		struct cfs_expr_list *el;
> @@ -831,7 +831,7 @@ static int ptlrpcd_init(void)
>  
>  		size = offsetof(struct ptlrpcd, pd_threads[nthreads]);
>  		pd = kzalloc_node(size, GFP_NOFS,
> -				  cfs_cpt_spread_node(cfs_cpt_tab, cpt));
> +				  cfs_cpt_spread_node(&cfs_cpt_tab, cpt));
>  		if (!pd) {
>  			rc = -ENOMEM;
>  			goto out;
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
> index 8e74a45..853676f 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/service.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
> @@ -565,7 +565,7 @@ struct ptlrpc_service *
>  
>  	cptable = cconf->cc_cptable;
>  	if (!cptable)
> -		cptable = cfs_cpt_tab;
> +		cptable = &cfs_cpt_tab;
>  
>  	if (!conf->psc_thr.tc_cpu_affinity) {
>  		ncpts = 1;
> @@ -2520,7 +2520,7 @@ int ptlrpc_hr_init(void)
>  	int weight;
>  
>  	memset(&ptlrpc_hr, 0, sizeof(ptlrpc_hr));
> -	ptlrpc_hr.hr_cpt_table = cfs_cpt_tab;
> +	ptlrpc_hr.hr_cpt_table = &cfs_cpt_tab;
>  
>  	ptlrpc_hr.hr_partitions = cfs_percpt_alloc(ptlrpc_hr.hr_cpt_table,
>  						   sizeof(*hrp));
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework
  2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
                   ` (25 preceding siblings ...)
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 26/26] staging: lustre: libcfs: restore UMP support James Simmons
@ 2018-06-25  1:33 ` NeilBrown
  26 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-25  1:33 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> Recently lustre support has been expanded to extreme machines with as
> many as a 1000+ cores. On the other end lustre also has been ported
> to platforms like ARM and KNL which have uniquie NUMA and core setup.
> For example some devices exist that have NUMA nodes with no cores.
> With these new platforms the limitations of the Lustre's SMP code
> came to light so a lot of work was needed. This resulted in this
> patch set which has been tested on these platforms.
>
> This is the 3rd version of this patch set with the first two submitted
> to the staging list. This latest patchset is identical to the 2nd one
> expect that the UMP support has been moved to the last patches in this
> collection. The approach to support UMP also has changed with using
> static initialization to greatly simplify the code.

Thanks for these.
Apart from some patches that I've rejected and some that I've modified,
these should appear in my lustre-testing branch in the next 24 hours.
I expect them to migrate to my lustre branch next Monday if no problems
surface.

Thanks,
NeilBrown


>
> Amir Shehata (8):
>   staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids
>   staging: lustre: libcfs: remove excess space
>   staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids
>   staging: lustre: libcfs: NUMA support
>   staging: lustre: libcfs: add cpu distance handling
>   staging: lustre: libcfs: use distance in cpu and node handling
>   staging: lustre: libcfs: provide debugfs files for distance handling
>   staging: lustre: libcfs: invert error handling for cfs_cpt_table_print
>
> Dmitry Eremin (14):
>   staging: lustre: libcfs: remove useless CPU partition code
>   staging: lustre: libcfs: rename variable i to cpu
>   staging: lustre: libcfs: fix libcfs_cpu coding style
>   staging: lustre: libcfs: use int type for CPT identification.
>   staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask
>   staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind
>   staging: lustre: libcfs: rename cpumask_var_t variables to *_mask
>   staging: lustre: libcfs: update debug messages
>   staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes
>   staging: lustre: libcfs: report NUMA node instead of just node
>   staging: lustre: libcfs: update debug messages in CPT code
>   staging: lustre: libcfs: rework CPU pattern parsing code
>   staging: lustre: libcfs: change CPT estimate algorithm
>   staging: lustre: ptlrpc: use current CPU instead of hardcoded 0
>
> James Simmons (4):
>   staging: lustre: libcfs: properly handle failure cases in SMP code
>   staging: lustre: libcfs: restore debugfs table reporting for UMP
>   staging: lustre: libcfs: make cfs_cpt_tab a static structure
>   staging: lustre: libcfs: restore UMP support
>
>  .../lustre/include/linux/libcfs/libcfs_cpu.h       |  203 ++--
>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 1020 +++++++++++---------
>  drivers/staging/lustre/lnet/libcfs/module.c        |   52 +-
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |    4 +-
>  drivers/staging/lustre/lnet/lnet/lib-msg.c         |    2 +
>  drivers/staging/lustre/lnet/selftest/framework.c   |    2 +-
>  drivers/staging/lustre/lustre/ptlrpc/client.c      |    4 +-
>  drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     |   10 +-
>  drivers/staging/lustre/lustre/ptlrpc/service.c     |   15 +-
>  9 files changed, 750 insertions(+), 562 deletions(-)
>
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0
  2018-06-24 21:20 ` [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0 James Simmons
@ 2018-06-25  2:38   ` NeilBrown
  2018-06-25 22:51     ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-25  2:38 UTC (permalink / raw)
  To: lustre-devel

On Sun, Jun 24 2018, James Simmons wrote:

> From: Dmitry Eremin <dmitry.eremin@intel.com>
>
> fix crash if CPU 0 disabled.
>
> Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8710
> Reviewed-on: https://review.whamcloud.com/23305
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/ptlrpc/service.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
> index 3fd8c74..8e74a45 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/service.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
> @@ -421,7 +421,7 @@ static void ptlrpc_at_timer(struct timer_list *t)
>  		 * there are.
>  		 */
>  		/* weight is # of HTs */
> -		if (cpumask_weight(topology_sibling_cpumask(0)) > 1) {
> +		if (cpumask_weight(topology_sibling_cpumask(smp_processor_id())) > 1) {

This pops a warning for me:
[ 1877.516799] BUG: using smp_processor_id() in preemptible [00000000] code: mount.lustre/14077

I'll change it to disable preemption, both here and below.
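
(For reference, a minimal sketch of the preemption-safe pattern being
described here -- an illustration only, not necessarily the exact change
applied to the tree:)

	/* get_cpu() disables preemption and returns the current CPU id,
	 * so the sibling-mask lookup cannot race with this thread being
	 * migrated to another CPU.
	 */
	int cpu = get_cpu();

	/* weight is # of HTs */
	if (cpumask_weight(topology_sibling_cpumask(cpu)) > 1) {
		/* depress thread factor for hyper-thread */
		factor = factor - (factor >> 1) + (factor >> 3);
	}
	put_cpu();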

Thanks,
NeilBrown


>  			/* depress thread factor for hyper-thread */
>  			factor = factor - (factor >> 1) + (factor >> 3);
>  		}
> @@ -2221,15 +2221,16 @@ static int ptlrpc_hr_main(void *arg)
>  	struct ptlrpc_hr_thread	*hrt = arg;
>  	struct ptlrpc_hr_partition *hrp = hrt->hrt_partition;
>  	LIST_HEAD(replies);
> -	char threadname[20];
>  	int rc;
>  
> -	snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
> -		 hrp->hrp_cpt, hrt->hrt_id);
>  	unshare_fs_struct();
>  
>  	rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
>  	if (rc != 0) {
> +		char threadname[20];
> +
> +		snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
> +			 hrp->hrp_cpt, hrt->hrt_id);
>  		CWARN("Failed to bind %s on CPT %d of CPT table %p: rc = %d\n",
>  		      threadname, hrp->hrp_cpt, ptlrpc_hr.hr_cpt_table, rc);
>  	}
> @@ -2528,7 +2529,7 @@ int ptlrpc_hr_init(void)
>  
>  	init_waitqueue_head(&ptlrpc_hr.hr_waitq);
>  
> -	weight = cpumask_weight(topology_sibling_cpumask(0));
> +	weight = cpumask_weight(topology_sibling_cpumask(smp_processor_id()));
>  
>  	cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) {
>  		hrp->hrp_cpt = i;
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-25  0:39   ` NeilBrown
@ 2018-06-25 18:22     ` Doug Oucharek
  2018-06-27  2:44       ` NeilBrown
  2018-06-26  0:39     ` James Simmons
  1 sibling, 1 reply; 66+ messages in thread
From: Doug Oucharek @ 2018-06-25 18:22 UTC (permalink / raw)
  To: lustre-devel

Some background on this NUMA change:

First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.

This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
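
As a rough illustration only (nothing below is code from this patch set,
and ni_is_local() is a made-up name): once cfs_cpt_of_node() exists, a
caller can map an interface's NUMA node to a CPT and prefer interfaces
local to the CPT it is currently running on, for example:

	/* Hypothetical helper: returns true if the fabric card sitting on
	 * NUMA node 'dev_node' (e.g. obtained via dev_to_node()) maps to
	 * the same CPT that the current thread is running on.
	 */
	static bool ni_is_local(struct cfs_cpt_table *cptab, int dev_node)
	{
		int dev_cpt = cfs_cpt_of_node(cptab, dev_node);
		int cur_cpt = cfs_cpt_current(cptab, 1);

		return dev_cpt == CFS_CPT_ANY || dev_cpt == cur_cpt;
	}

The actual Multi-Rail selection logic is more involved; this just shows
the kind of lookup the new node-to-CPT table enables.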

Doug

On Jun 24, 2018, at 5:39 PM, NeilBrown <neilb at suse.com> wrote:

On Sun, Jun 24 2018, James Simmons wrote:

From: Amir Shehata <amir.shehata at intel.com>

This patch adds NUMA node support. NUMA node information is stored
in the CPT table. A NUMA node mask is maintained for the entire
table as well as for each CPT to track the NUMA nodes related to
each of the CPTs. Add new function cfs_cpt_of_node() which returns
the CPT of a particular NUMA node.

I note that you didn't respond to Greg's questions about this patch.
I'll accept it anyway in the interests of moving forward, but I think
his comments were probably valid, and need to be considered at some
stage.

There is a bug though....

Signed-off-by: Amir Shehata <amir.shehata at intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18916
Reviewed-by: Olaf Weber <olaf at sgi.com>
Reviewed-by: Doug Oucharek <dougso at me.com>
Signed-off-by: James Simmons <jsimmons at infradead.org>
---
.../lustre/include/linux/libcfs/libcfs_cpu.h        | 11 +++++++++++
drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c     | 21 +++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
index 1b4333d..ff3ecf5 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
@@ -103,6 +103,8 @@ struct cfs_cpt_table {
int *ctb_cpu2cpt;
/* all cpus in this partition table */
cpumask_var_t ctb_cpumask;
+ /* shadow HW node to CPU partition ID */
+ int *ctb_node2cpt;
/* all nodes in this partition table */
nodemask_t *ctb_nodemask;
};
@@ -143,6 +145,10 @@ struct cfs_cpt_table {
 */
int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu);
/**
+ * shadow HW node ID \a NODE to CPU-partition ID by \a cptab
+ */
+int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
+/**
 * bind current thread on a CPU-partition \a cpt of \a cptab
 */
int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
@@ -299,6 +305,11 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
return 0;
}

+static inline int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
+{
+ return 0;
+}
+
static inline int
cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
{
diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
index 33294da..8c5cf7b 100644
--- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
+++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
@@ -102,6 +102,15 @@ struct cfs_cpt_table *
memset(cptab->ctb_cpu2cpt, -1,
       nr_cpu_ids * sizeof(cptab->ctb_cpu2cpt[0]));

+ cptab->ctb_node2cpt = kvmalloc_array(nr_node_ids,
+      sizeof(cptab->ctb_node2cpt[0]),
+      GFP_KERNEL);
+ if (!cptab->ctb_node2cpt)
+ goto failed_alloc_node2cpt;
+
+ memset(cptab->ctb_node2cpt, -1,
+        nr_node_ids * sizeof(cptab->ctb_node2cpt[0]));
+
cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
  GFP_KERNEL);
if (!cptab->ctb_parts)
@@ -133,6 +142,8 @@ struct cfs_cpt_table *

kvfree(cptab->ctb_parts);
failed_alloc_ctb_parts:
+ kvfree(cptab->ctb_node2cpt);
+failed_alloc_node2cpt:
kvfree(cptab->ctb_cpu2cpt);
failed_alloc_cpu2cpt:
kfree(cptab->ctb_nodemask);
@@ -150,6 +161,7 @@ struct cfs_cpt_table *
int i;

kvfree(cptab->ctb_cpu2cpt);
+ kvfree(cptab->ctb_node2cpt);

for (i = 0; cptab->ctb_parts && i < cptab->ctb_nparts; i++) {
struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
@@ -515,6 +527,15 @@ struct cfs_cpt_table *
}
EXPORT_SYMBOL(cfs_cpt_of_cpu);

+int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
+{
+ if (node < 0 || node > nr_node_ids)
+ return CFS_CPT_ANY;
+
+ return cptab->ctb_node2cpt[node];
+}

So if node == nr_node_ids, we access beyond the end of the ctb_node2cpt array.
Oops.
I've fixed this before applying.
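
(A sketch of what the corrected bounds check presumably looks like --
valid node ids run from 0 to nr_node_ids - 1:)

	int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
	{
		/* node == nr_node_ids is already out of range */
		if (node < 0 || node >= nr_node_ids)
			return CFS_CPT_ANY;

		return cptab->ctb_node2cpt[node];
	}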

Thanks,
NeilBrown


+EXPORT_SYMBOL(cfs_cpt_of_node);
+
int
cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
{
--
1.8.3.1
_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0
  2018-06-25  2:38   ` NeilBrown
@ 2018-06-25 22:51     ` James Simmons
  2018-06-26  0:34       ` NeilBrown
  0 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-25 22:51 UTC (permalink / raw)
  To: lustre-devel


> > From: Dmitry Eremin <dmitry.eremin@intel.com>
> >
> > fix crash if CPU 0 disabled.
> >
> > Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8710
> > Reviewed-on: https://review.whamcloud.com/23305
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/ptlrpc/service.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
> > index 3fd8c74..8e74a45 100644
> > --- a/drivers/staging/lustre/lustre/ptlrpc/service.c
> > +++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
> > @@ -421,7 +421,7 @@ static void ptlrpc_at_timer(struct timer_list *t)
> >  		 * there are.
> >  		 */
> >  		/* weight is # of HTs */
> > -		if (cpumask_weight(topology_sibling_cpumask(0)) > 1) {
> > +		if (cpumask_weight(topology_sibling_cpumask(smp_processor_id())) > 1) {
> 
> This pops a warning for me:
> [ 1877.516799] BUG: using smp_processor_id() in preemptible [00000000] code: mount.lustre/14077
> 
> I'll change it to disable preemption, both here and below.

For .config I have:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

What does yours look like? Strange that no one has ever reported an
error before. Thanks for finding this!!!

> 
> >  			/* depress thread factor for hyper-thread */
> >  			factor = factor - (factor >> 1) + (factor >> 3);
> >  		}
> > @@ -2221,15 +2221,16 @@ static int ptlrpc_hr_main(void *arg)
> >  	struct ptlrpc_hr_thread	*hrt = arg;
> >  	struct ptlrpc_hr_partition *hrp = hrt->hrt_partition;
> >  	LIST_HEAD(replies);
> > -	char threadname[20];
> >  	int rc;
> >  
> > -	snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
> > -		 hrp->hrp_cpt, hrt->hrt_id);
> >  	unshare_fs_struct();
> >  
> >  	rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
> >  	if (rc != 0) {
> > +		char threadname[20];
> > +
> > +		snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
> > +			 hrp->hrp_cpt, hrt->hrt_id);
> >  		CWARN("Failed to bind %s on CPT %d of CPT table %p: rc = %d\n",
> >  		      threadname, hrp->hrp_cpt, ptlrpc_hr.hr_cpt_table, rc);
> >  	}
> > @@ -2528,7 +2529,7 @@ int ptlrpc_hr_init(void)
> >  
> >  	init_waitqueue_head(&ptlrpc_hr.hr_waitq);
> >  
> > -	weight = cpumask_weight(topology_sibling_cpumask(0));
> > +	weight = cpumask_weight(topology_sibling_cpumask(smp_processor_id()));
> >  
> >  	cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) {
> >  		hrp->hrp_cpt = i;
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-25  1:11     ` NeilBrown
@ 2018-06-25 22:57       ` James Simmons
  0 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-25 22:57 UTC (permalink / raw)
  To: lustre-devel


> > On Sun, Jun 24 2018, James Simmons wrote:
> >
> >> From: Dmitry Eremin <dmitry.eremin@intel.com>
> >>
> >> Reporting "HW nodes" is too generic. It really is reporting
> >> "HW NUMA nodes". Update the debug message.
> >
> > I'm not happy with this patch description.....
> >
> >>
> >> Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> >> WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
> >> Reviewed-on: https://review.whamcloud.com/23306
> >> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> >> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> >> Reviewed-by: Patrick Farrell <paf@cray.com>
> >> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> >> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> >> Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> ---
> >>  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h | 2 ++
> >>  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 2 +-
> >>  drivers/staging/lustre/lnet/lnet/lib-msg.c               | 2 ++
> >>  3 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> >> index 2bb2140..29c5071 100644
> >> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> >> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> >> @@ -90,6 +90,8 @@ struct cfs_cpu_partition {
> >>  	unsigned int			*cpt_distance;
> >>  	/* spread rotor for NUMA allocator */
> >>  	int				cpt_spread_rotor;
> >> +	/* NUMA node if cpt_nodemask is empty */
> >> +	int				cpt_node;
> >>  };
> >
> > It doesn't give any reason why this (unused) field was added.
> > So I've removed it.
> 
> Ahhhh. this was meant to be in the previous patch too.
> I've moved it.

This is stupid NFS. My test nodes are NFS-root and contain the
source tree, but they don't have access to the outside world, so I
do my pulling and pushing from another system. What happens is that
applied patches don't show up right away until I push; by the time
the next patch is applied they appear. Sorry about that.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code
  2018-06-25  0:20   ` NeilBrown
@ 2018-06-26  0:33     ` James Simmons
  0 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-26  0:33 UTC (permalink / raw)
  To: lustre-devel


> > While pushing the SMP work some bugs were pointed out by Dan
> > Carpenter in the code. Due to single err label in cfs_cpu_init()
> > and cfs_cpt_table_alloc() a few items were being cleaned up that
> > were never initialized. This can lead to crashed and other problems.
> > In those initialization function introduce individual labels to
> > jump to only the thing initialized get freed on failure.
> >
> > Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-10932
> > Reviewed-on: https://review.whamcloud.com/32085
> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 72 ++++++++++++++++++-------
> >  1 file changed, 52 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index 46d3530..bdd71a3 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -85,17 +85,19 @@ struct cfs_cpt_table *
> >  
> >  	cptab->ctb_nparts = ncpt;
> >  
> > +	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS))
> > +		goto failed_alloc_cpumask;
> > +
> >  	cptab->ctb_nodemask = kzalloc(sizeof(*cptab->ctb_nodemask),
> >  				      GFP_NOFS);
> > -	if (!zalloc_cpumask_var(&cptab->ctb_cpumask, GFP_NOFS) ||
> > -	    !cptab->ctb_nodemask)
> > -		goto failed;
> > +	if (!cptab->ctb_nodemask)
> > +		goto failed_alloc_nodemask;
> >  
> >  	cptab->ctb_cpu2cpt = kvmalloc_array(num_possible_cpus(),
> >  					    sizeof(cptab->ctb_cpu2cpt[0]),
> >  					    GFP_KERNEL);
> >  	if (!cptab->ctb_cpu2cpt)
> > -		goto failed;
> > +		goto failed_alloc_cpu2cpt;
> >  
> >  	memset(cptab->ctb_cpu2cpt, -1,
> >  	       num_possible_cpus() * sizeof(cptab->ctb_cpu2cpt[0]));
> > @@ -103,22 +105,41 @@ struct cfs_cpt_table *
> >  	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
> >  					  GFP_KERNEL);
> >  	if (!cptab->ctb_parts)
> > -		goto failed;
> > +		goto failed_alloc_ctb_parts;
> > +
> > +	memset(cptab->ctb_parts, -1, ncpt * sizeof(cptab->ctb_parts[0]));
> >  
> >  	for (i = 0; i < ncpt; i++) {
> >  		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
> >  
> > +		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS))
> > +			goto failed_setting_ctb_parts;
> > +
> >  		part->cpt_nodemask = kzalloc(sizeof(*part->cpt_nodemask),
> >  					     GFP_NOFS);
> > -		if (!zalloc_cpumask_var(&part->cpt_cpumask, GFP_NOFS) ||
> > -		    !part->cpt_nodemask)
> > -			goto failed;
> > +		if (!part->cpt_nodemask)
> > +			goto failed_setting_ctb_parts;
> 
> If zalloc_cpumask_var() succeeds, but kzalloc() fails (which is almost
> impossible, but still) we go to failed_setting_ctb_parts, with
>  cptab->ctb_parts[i]->cpt_cpumask needing to be freed.
> 
> >  	}
> >  
> >  	return cptab;
> >  
> > - failed:
> > -	cfs_cpt_table_free(cptab);
> > +failed_setting_ctb_parts:
> > +	while (i-- >= 0) {
> 
> but we don't free anything in cptab->ctb_parts[i].
> I've fixed this by calling free_cpumask_var() before the goto.
> 
> And will propagate the change through future patches in this series.

Thanks. I will grab the updated patches from your testing tree. 
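
(For anyone following along, the fix described above presumably makes
the failure branch look roughly like this -- a sketch, not the exact
applied change:)

		part->cpt_nodemask = kzalloc(sizeof(*part->cpt_nodemask),
					     GFP_NOFS);
		if (!part->cpt_nodemask) {
			/* cpt_cpumask was already allocated for this
			 * partition; free it here before unwinding the
			 * previously initialized partitions.
			 */
			free_cpumask_var(part->cpt_cpumask);
			goto failed_setting_ctb_parts;
		}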

 
> > +		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
> > +
> > +		kfree(part->cpt_nodemask);
> > +		free_cpumask_var(part->cpt_cpumask);
> > +	}
> > +
> > +	kvfree(cptab->ctb_parts);
> > +failed_alloc_ctb_parts:
> > +	kvfree(cptab->ctb_cpu2cpt);
> > +failed_alloc_cpu2cpt:
> > +	kfree(cptab->ctb_nodemask);
> > +failed_alloc_nodemask:
> > +	free_cpumask_var(cptab->ctb_cpumask);
> > +failed_alloc_cpumask:
> > +	kfree(cptab);
> >  	return NULL;
> >  }
> >  EXPORT_SYMBOL(cfs_cpt_table_alloc);
> > @@ -944,7 +965,7 @@ static int cfs_cpu_dead(unsigned int cpu)
> >  int
> >  cfs_cpu_init(void)
> >  {
> > -	int ret = 0;
> > +	int ret;
> >  
> >  	LASSERT(!cfs_cpt_tab);
> >  
> > @@ -953,23 +974,23 @@ static int cfs_cpu_dead(unsigned int cpu)
> >  					"staging/lustre/cfe:dead", NULL,
> >  					cfs_cpu_dead);
> >  	if (ret < 0)
> > -		goto failed;
> > +		goto failed_cpu_dead;
> > +
> >  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> >  					"staging/lustre/cfe:online",
> >  					cfs_cpu_online, NULL);
> >  	if (ret < 0)
> > -		goto failed;
> > +		goto failed_cpu_online;
> > +
> >  	lustre_cpu_online = ret;
> >  #endif
> > -	ret = -EINVAL;
> > -
> >  	get_online_cpus();
> >  	if (*cpu_pattern) {
> >  		char *cpu_pattern_dup = kstrdup(cpu_pattern, GFP_KERNEL);
> >  
> >  		if (!cpu_pattern_dup) {
> >  			CERROR("Failed to duplicate cpu_pattern\n");
> > -			goto failed;
> > +			goto failed_alloc_table;
> >  		}
> >  
> >  		cfs_cpt_tab = cfs_cpt_table_create_pattern(cpu_pattern_dup);
> > @@ -977,7 +998,7 @@ static int cfs_cpu_dead(unsigned int cpu)
> >  		if (!cfs_cpt_tab) {
> >  			CERROR("Failed to create cptab from pattern %s\n",
> >  			       cpu_pattern);
> > -			goto failed;
> > +			goto failed_alloc_table;
> >  		}
> >  
> >  	} else {
> > @@ -985,7 +1006,7 @@ static int cfs_cpu_dead(unsigned int cpu)
> >  		if (!cfs_cpt_tab) {
> >  			CERROR("Failed to create ptable with npartitions %d\n",
> >  			       cpu_npartitions);
> > -			goto failed;
> > +			goto failed_alloc_table;
> >  		}
> >  	}
> >  
> > @@ -996,8 +1017,19 @@ static int cfs_cpu_dead(unsigned int cpu)
> >  		 cfs_cpt_number(cfs_cpt_tab));
> >  	return 0;
> >  
> > - failed:
> > +failed_alloc_table:
> >  	put_online_cpus();
> > -	cfs_cpu_fini();
> > +
> > +	if (cfs_cpt_tab)
> > +		cfs_cpt_table_free(cfs_cpt_tab);
> > +
> > +	ret = -EINVAL;
> > +#ifdef CONFIG_HOTPLUG_CPU
> > +	if (lustre_cpu_online > 0)
> > +		cpuhp_remove_state_nocalls(lustre_cpu_online);
> > +failed_cpu_online:
> > +	cpuhp_remove_state_nocalls(CPUHP_LUSTRE_CFS_DEAD);
> > +failed_cpu_dead:
> > +#endif
> >  	return ret;
> >  }
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0
  2018-06-25 22:51     ` James Simmons
@ 2018-06-26  0:34       ` NeilBrown
  0 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-26  0:34 UTC (permalink / raw)
  To: lustre-devel

On Mon, Jun 25 2018, James Simmons wrote:

>> > From: Dmitry Eremin <dmitry.eremin@intel.com>
>> >
>> > fix crash if CPU 0 disabled.
>> >
>> > Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8710
>> > Reviewed-on: https://review.whamcloud.com/23305
>> > Reviewed-by: Doug Oucharek <dougso@me.com>
>> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/ptlrpc/service.c | 11 ++++++-----
>> >  1 file changed, 6 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
>> > index 3fd8c74..8e74a45 100644
>> > --- a/drivers/staging/lustre/lustre/ptlrpc/service.c
>> > +++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
>> > @@ -421,7 +421,7 @@ static void ptlrpc_at_timer(struct timer_list *t)
>> >  		 * there are.
>> >  		 */
>> >  		/* weight is # of HTs */
>> > -		if (cpumask_weight(topology_sibling_cpumask(0)) > 1) {
>> > +		if (cpumask_weight(topology_sibling_cpumask(smp_processor_id())) > 1) {
>> 
>> This pops a warning for me:
>> [ 1877.516799] BUG: using smp_processor_id() in preemptible [00000000] code: mount.lustre/14077
>> 
>> I'll change it to disable preemption, both here and below.
>
> For .config I have:
>
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
>
> What does yours look like? Strange no one has ever reported an error 
> before. Thanks for finding this!!!

I have
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_PREEMPTIRQ_EVENTS is not set
# CONFIG_PREEMPT_TRACER is not set

CONFIG_PREEMPT=y helps find bugs - not just these but also some races.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-25  0:39   ` NeilBrown
  2018-06-25 18:22     ` Doug Oucharek
@ 2018-06-26  0:39     ` James Simmons
  1 sibling, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-26  0:39 UTC (permalink / raw)
  To: lustre-devel


> > From: Amir Shehata <amir.shehata@intel.com>
> >
> > This patch adds NUMA node support. NUMA node information is stored
> > in the CPT table. A NUMA node mask is maintained for the entire
> > table as well as for each CPT to track the NUMA nodes related to
> > each of the CPTs. Add new function cfs_cpt_of_node() which returns
> > the CPT of a particular NUMA node.
> 
> I note that you didn't respond to Greg's questions about this patch.
> I'll accept it anyway in the interests of moving forward, but I think
> his comments were probably valid, and need to be considered at some
> stage.

I hope Doug's response answers the questions. I can get Olaf from HPE 
involved if need be.
 
> There is a bug though....
> >
> > Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> > Reviewed-on: http://review.whamcloud.com/18916
> > Reviewed-by: Olaf Weber <olaf@sgi.com>
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  .../lustre/include/linux/libcfs/libcfs_cpu.h        | 11 +++++++++++
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c     | 21 +++++++++++++++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > index 1b4333d..ff3ecf5 100644
> > --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > @@ -103,6 +103,8 @@ struct cfs_cpt_table {
> >  	int				*ctb_cpu2cpt;
> >  	/* all cpus in this partition table */
> >  	cpumask_var_t			ctb_cpumask;
> > +	/* shadow HW node to CPU partition ID */
> > +	int				*ctb_node2cpt;
> >  	/* all nodes in this partition table */
> >  	nodemask_t			*ctb_nodemask;
> >  };
> > @@ -143,6 +145,10 @@ struct cfs_cpt_table {
> >   */
> >  int cfs_cpt_of_cpu(struct cfs_cpt_table *cptab, int cpu);
> >  /**
> > + * shadow HW node ID \a NODE to CPU-partition ID by \a cptab
> > + */
> > +int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
> > +/**
> >   * bind current thread on a CPU-partition \a cpt of \a cptab
> >   */
> >  int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
> > @@ -299,6 +305,11 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
> >  	return 0;
> >  }
> >  
> > +static inline int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
> > +{
> > +	return 0;
> > +}
> > +
> >  static inline int
> >  cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
> >  {
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index 33294da..8c5cf7b 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -102,6 +102,15 @@ struct cfs_cpt_table *
> >  	memset(cptab->ctb_cpu2cpt, -1,
> >  	       nr_cpu_ids * sizeof(cptab->ctb_cpu2cpt[0]));
> >  
> > +	cptab->ctb_node2cpt = kvmalloc_array(nr_node_ids,
> > +					     sizeof(cptab->ctb_node2cpt[0]),
> > +					     GFP_KERNEL);
> > +	if (!cptab->ctb_node2cpt)
> > +		goto failed_alloc_node2cpt;
> > +
> > +	memset(cptab->ctb_node2cpt, -1,
> > +	       nr_node_ids * sizeof(cptab->ctb_node2cpt[0]));
> > +
> >  	cptab->ctb_parts = kvmalloc_array(ncpt, sizeof(cptab->ctb_parts[0]),
> >  					  GFP_KERNEL);
> >  	if (!cptab->ctb_parts)
> > @@ -133,6 +142,8 @@ struct cfs_cpt_table *
> >  
> >  	kvfree(cptab->ctb_parts);
> >  failed_alloc_ctb_parts:
> > +	kvfree(cptab->ctb_node2cpt);
> > +failed_alloc_node2cpt:
> >  	kvfree(cptab->ctb_cpu2cpt);
> >  failed_alloc_cpu2cpt:
> >  	kfree(cptab->ctb_nodemask);
> > @@ -150,6 +161,7 @@ struct cfs_cpt_table *
> >  	int i;
> >  
> >  	kvfree(cptab->ctb_cpu2cpt);
> > +	kvfree(cptab->ctb_node2cpt);
> >  
> >  	for (i = 0; cptab->ctb_parts && i < cptab->ctb_nparts; i++) {
> >  		struct cfs_cpu_partition *part = &cptab->ctb_parts[i];
> > @@ -515,6 +527,15 @@ struct cfs_cpt_table *
> >  }
> >  EXPORT_SYMBOL(cfs_cpt_of_cpu);
> >  
> > +int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
> > +{
> > +	if (node < 0 || node > nr_node_ids)
> > +		return CFS_CPT_ANY;
> > +
> > +	return cptab->ctb_node2cpt[node];
> > +}
> 
> So if node == nr_node_ids, we access beyond the end of the ctb_node2cpt array.
> Oops.
> I've fixed this before applying.

Ouch. That bug has been around for a while :-(
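
For reference, the obvious fix is to treat node == nr_node_ids as out of
range as well; a sketch of the corrected check (assuming that is what was
folded in before applying):

int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node)
{
	/* ctb_node2cpt has nr_node_ids entries, so the last valid index
	 * is nr_node_ids - 1; reject anything outside that range.
	 */
	if (node < 0 || node >= nr_node_ids)
		return CFS_CPT_ANY;

	return cptab->ctb_node2cpt[node];
}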

> Thanks,
> NeilBrown
> 
> 
> > +EXPORT_SYMBOL(cfs_cpt_of_node);
> > +
> >  int
> >  cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt)
> >  {
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification.
  2018-06-25  0:57   ` NeilBrown
@ 2018-06-26  0:42     ` James Simmons
  0 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-26  0:42 UTC (permalink / raw)
  To: lustre-devel


> > From: Dmitry Eremin <dmitry.eremin@intel.com>
> >
> > Use int type for CPT identification to match the linux kernel
> > CPU identification.
> 
> Can someone site evidence for "int" being the dominant choice for CPU
> identification in the kernel?
> I looked in cpumask.h and found plenty of "unsigned int".
> I also found
> 
> Commit: 9b130ad5bb82 ("treewide: make "nr_cpu_ids" unsigned")
> 
> which makes nr_cpu_ids unsigned.
> 
> So I'm dropping this patch for now as the justification is not
> convincing.
> 
> If there is a real case to be made, please resubmit.

Honestly, using int doesn't make sense to me, but Dmitry got the
impression that using int was more correct. Dmitry, where did you
get that information about using int from?

> > Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
> > Reviewed-on: https://review.whamcloud.com/23304
> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h |  8 ++++----
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 14 +++++++-------
> >  2 files changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > index 9dbb0b1..2bb2140 100644
> > --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > @@ -89,18 +89,18 @@ struct cfs_cpu_partition {
> >  	/* NUMA distance between CPTs */
> >  	unsigned int			*cpt_distance;
> >  	/* spread rotor for NUMA allocator */
> > -	unsigned int			cpt_spread_rotor;
> > +	int				cpt_spread_rotor;
> >  };
> >  
> >  
> >  /** descriptor for CPU partitions */
> >  struct cfs_cpt_table {
> >  	/* spread rotor for NUMA allocator */
> > -	unsigned int			ctb_spread_rotor;
> > +	int				ctb_spread_rotor;
> >  	/* maximum NUMA distance between all nodes in table */
> >  	unsigned int			ctb_distance;
> >  	/* # of CPU partitions */
> > -	unsigned int			ctb_nparts;
> > +	int				ctb_nparts;
> >  	/* partitions tables */
> >  	struct cfs_cpu_partition	*ctb_parts;
> >  	/* shadow HW CPU to CPU partition ID */
> > @@ -355,7 +355,7 @@ static inline void cfs_cpu_fini(void)
> >  /**
> >   * create a cfs_cpt_table with \a ncpt number of partitions
> >   */
> > -struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt);
> > +struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt);
> >  
> >  /*
> >   * allocate per-cpu-partition data, returned value is an array of pointers,
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index aaab7cb..8f7de59 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -73,7 +73,7 @@
> >  module_param(cpu_pattern, charp, 0444);
> >  MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");
> >  
> > -struct cfs_cpt_table *cfs_cpt_table_alloc(unsigned int ncpt)
> > +struct cfs_cpt_table *cfs_cpt_table_alloc(int ncpt)
> >  {
> >  	struct cfs_cpt_table *cptab;
> >  	int i;
> > @@ -788,13 +788,13 @@ static int cfs_cpt_choose_ncpus(struct cfs_cpt_table *cptab, int cpt,
> >  	return rc;
> >  }
> >  
> > -#define CPT_WEIGHT_MIN  4u
> > +#define CPT_WEIGHT_MIN 4
> >  
> > -static unsigned int cfs_cpt_num_estimate(void)
> > +static int cfs_cpt_num_estimate(void)
> >  {
> > -	unsigned int nnode = num_online_nodes();
> > -	unsigned int ncpu = num_online_cpus();
> > -	unsigned int ncpt;
> > +	int nnode = num_online_nodes();
> > +	int ncpu = num_online_cpus();
> > +	int ncpt;
> >  
> >  	if (ncpu <= CPT_WEIGHT_MIN) {
> >  		ncpt = 1;
> > @@ -824,7 +824,7 @@ static unsigned int cfs_cpt_num_estimate(void)
> >  	/* config many CPU partitions on 32-bit system could consume
> >  	 * too much memory
> >  	 */
> > -	ncpt = min(2U, ncpt);
> > +	ncpt = min(2, ncpt);
> >  #endif
> >  	while (ncpu % ncpt)
> >  		ncpt--; /* worst case is 1 */
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-25  1:09   ` NeilBrown
  2018-06-25  1:11     ` NeilBrown
@ 2018-06-26  0:54     ` James Simmons
  2018-06-27  2:49       ` NeilBrown
  1 sibling, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-26  0:54 UTC (permalink / raw)
  To: lustre-devel

> On Sun, Jun 24 2018, James Simmons wrote:
> 
> > From: Dmitry Eremin <dmitry.eremin@intel.com>
> >
> > Reporting "HW nodes" is too generic. It really is reporting
> > "HW NUMA nodes". Update the debug message.
> 
> I'm not happy with this patch description.....

How about:

In the HPC world a node refers to a whole computer system used in a
cluster. Reporting just "HW nodes" is not clear, so change the debug
report to "HW NUMA nodes" since this reports the number of NUMA nodes
in use by Lustre.

 
> >
> > Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8703
> > Reviewed-on: https://review.whamcloud.com/23306
> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-by: Patrick Farrell <paf@cray.com>
> > Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h | 2 ++
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c          | 2 +-
> >  drivers/staging/lustre/lnet/lnet/lib-msg.c               | 2 ++
> >  3 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > index 2bb2140..29c5071 100644
> > --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > @@ -90,6 +90,8 @@ struct cfs_cpu_partition {
> >  	unsigned int			*cpt_distance;
> >  	/* spread rotor for NUMA allocator */
> >  	int				cpt_spread_rotor;
> > +	/* NUMA node if cpt_nodemask is empty */
> > +	int				cpt_node;
> >  };
> 
> It doesn't give any reason why this (unused) field was added.
> So I've removed it.
> 
> 
> >  
> >  
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index 18925c7..86afa31 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -1142,7 +1142,7 @@ int cfs_cpu_init(void)
> >  
> >  	put_online_cpus();
> >  
> > -	LCONSOLE(0, "HW nodes: %d, HW CPU cores: %d, npartitions: %d\n",
> > +	LCONSOLE(0, "HW NUMA nodes: %d, HW CPU cores: %d, npartitions: %d\n",
> >  		 num_online_nodes(), num_online_cpus(),
> >  		 cfs_cpt_number(cfs_cpt_tab));
> >  	return 0;
> 
> It does explain this hunk, which is fine.
> 
> 
> > diff --git a/drivers/staging/lustre/lnet/lnet/lib-msg.c b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> > index 0091273..27bdefa 100644
> > --- a/drivers/staging/lustre/lnet/lnet/lib-msg.c
> > +++ b/drivers/staging/lustre/lnet/lnet/lib-msg.c
> > @@ -568,6 +568,8 @@
> >  
> >  	/* number of CPUs */
> >  	container->msc_nfinalizers = cfs_cpt_weight(lnet_cpt_table(), cpt);
> > +	if (container->msc_nfinalizers == 0)
> > +		container->msc_nfinalizers = 1;
> 
> It doesn't justify this at all.
> 
> I guess this was meant to be in the previous patch, so I've moved it.
> 
> Thanks,
> NeilBrown
> 
> 
> >  
> >  	container->msc_finalizers = kvzalloc_cpt(container->msc_nfinalizers *
> >  						 sizeof(*container->msc_finalizers),
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space
  2018-06-25  0:35   ` NeilBrown
@ 2018-06-26  0:55     ` James Simmons
  0 siblings, 0 replies; 66+ messages in thread
From: James Simmons @ 2018-06-26  0:55 UTC (permalink / raw)
  To: lustre-devel


> > From: Amir Shehata <amir.shehata@intel.com>
> >
> > The function cfs_cpt_table_print() was adding two spaces
> > to the string buffer. Just add it once.
> 
> No it doesn't.  Maybe it did in the out-of-tree code, but the linux code
> is different.
> 
> The extra space is
> 
>                        rc = snprintf(tmp, len, " %d", j);
> 
> But in Linux that is
> 
> 			rc = snprintf(tmp, len, "%d ", j);
> 
> Both are wrong, but for different reasons.
> I've change this patch to be:
> 
> 			rc = snprintf(tmp, len, "%d\t:", i);
> and
> 			rc = snprintf(tmp, len, " %d", j);
> and changed the comment to say that we don't need a stray space at the
> end of the line.

Thank you.
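
Put together, the printing loop then produces lines like "0\t: 0 1 2 3"
with no stray blank at the end; a rough sketch only (the helper name is
invented here, and the tree code handles the buffer and error paths a
little differently):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/kernel.h>

/* Print one "cpt\t: cpu cpu ..." line into buf, no trailing space. */
static int print_cpt_cpus(char *buf, int len, int cpt,
			  const struct cpumask *mask)
{
	char *tmp = buf;
	int cpu, rc;

	rc = snprintf(tmp, len, "%d\t:", cpt);
	if (rc >= len)
		return -E2BIG;
	tmp += rc;
	len -= rc;

	for_each_cpu(cpu, mask) {
		/* space before each cpu, so the line never ends in one */
		rc = snprintf(tmp, len, " %d", cpu);
		if (rc >= len)
			return -E2BIG;
		tmp += rc;
		len -= rc;
	}

	if (len < 2)
		return -E2BIG;
	*tmp++ = '\n';
	*tmp = '\0';
	return tmp - buf;
}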

> 
> NeilBrown
> 
> 
> 
> >
> > Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> > Reviewed-on: http://review.whamcloud.com/18916
> > Reviewed-by: Olaf Weber <olaf@sgi.com>
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index ea8d55c..680a2b1 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -177,7 +177,7 @@ struct cfs_cpt_table *
> >  
> >  	for (i = 0; i < cptab->ctb_nparts; i++) {
> >  		if (len > 0) {
> > -			rc = snprintf(tmp, len, "%d\t: ", i);
> > +			rc = snprintf(tmp, len, "%d\t:", i);
> >  			len -= rc;
> >  		}
> >  
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling
  2018-06-25  0:48   ` NeilBrown
@ 2018-06-26  1:15     ` James Simmons
  2018-06-27  2:50       ` NeilBrown
  0 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-06-26  1:15 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Jun 24 2018, James Simmons wrote:
> 
> > From: Amir Shehata <amir.shehata@intel.com>
> >
> > Add functionality to calculate the distance between two CPTs.
> > Expose those distance in debugfs so people deploying a setup
> > can debug what is being created for CPTs.
> 
> This patch doesn't expose anything in debugfs - a later patch
> does that.
> So I've changed the comment to "Prepare to expose those ...."

Doug Oucharek recommends the following commit message body:

Add cpu distance routines which will be used by the Multi-Rail feature 
to determine what fabric interface is nearest to the core we are currently 
running on. Configuration of these distances will be provided from user 
space via configuration routines in Lustre.


> > Signed-off-by: Amir Shehata <amir.shehata@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> > Reviewed-on: http://review.whamcloud.com/18916
> > Reviewed-by: Olaf Weber <olaf@sgi.com>
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  .../lustre/include/linux/libcfs/libcfs_cpu.h       | 31 +++++++++++
> >  drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c    | 61 ++++++++++++++++++++++
> >  2 files changed, 92 insertions(+)
> >
> > diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > index ff3ecf5..a015ac1 100644
> > --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_cpu.h
> > @@ -86,6 +86,8 @@ struct cfs_cpu_partition {
> >  	cpumask_var_t			cpt_cpumask;
> >  	/* nodes mask for this partition */
> >  	nodemask_t			*cpt_nodemask;
> > +	/* NUMA distance between CPTs */
> > +	unsigned int			*cpt_distance;
> >  	/* spread rotor for NUMA allocator */
> >  	unsigned int			cpt_spread_rotor;
> >  };
> > @@ -95,6 +97,8 @@ struct cfs_cpu_partition {
> >  struct cfs_cpt_table {
> >  	/* spread rotor for NUMA allocator */
> >  	unsigned int			ctb_spread_rotor;
> > +	/* maximum NUMA distance between all nodes in table */
> > +	unsigned int			ctb_distance;
> >  	/* # of CPU partitions */
> >  	unsigned int			ctb_nparts;
> >  	/* partitions tables */
> > @@ -120,6 +124,10 @@ struct cfs_cpt_table {
> >   */
> >  int cfs_cpt_table_print(struct cfs_cpt_table *cptab, char *buf, int len);
> >  /**
> > + * print distance information of cpt-table
> > + */
> > +int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len);
> > +/**
> >   * return total number of CPU partitions in \a cptab
> >   */
> >  int
> > @@ -149,6 +157,10 @@ struct cfs_cpt_table {
> >   */
> >  int cfs_cpt_of_node(struct cfs_cpt_table *cptab, int node);
> >  /**
> > + * NUMA distance between \a cpt1 and \a cpt2 in \a cptab
> > + */
> > +unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2);
> > +/**
> >   * bind current thread on a CPU-partition \a cpt of \a cptab
> >   */
> >  int cfs_cpt_bind(struct cfs_cpt_table *cptab, int cpt);
> > @@ -206,6 +218,19 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
> >  struct cfs_cpt_table;
> >  #define cfs_cpt_tab ((struct cfs_cpt_table *)NULL)
> >  
> > +static inline int cfs_cpt_distance_print(struct cfs_cpt_table *cptab,
> > +					 char *buf, int len)
> > +{
> > +	int rc;
> > +
> > +	rc = snprintf(buf, len, "0\t: 0:1\n");
> > +	len -= rc;
> > +	if (len <= 0)
> > +		return -EFBIG;
> > +
> > +	return rc;
> > +}
> > +
> >  static inline cpumask_var_t *
> >  cfs_cpt_cpumask(struct cfs_cpt_table *cptab, int cpt)
> >  {
> > @@ -241,6 +266,12 @@ void cfs_cpt_unset_nodemask(struct cfs_cpt_table *cptab,
> >  	return NULL;
> >  }
> >  
> > +static inline unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab,
> > +					    int cpt1, int cpt2)
> > +{
> > +	return 1;
> > +}
> > +
> >  static inline int
> >  cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
> >  {
> > diff --git a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > index 8c5cf7b..b315fb2 100644
> > --- a/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > +++ b/drivers/staging/lustre/lnet/libcfs/libcfs_cpu.c
> > @@ -128,6 +128,15 @@ struct cfs_cpt_table *
> >  					     GFP_NOFS);
> >  		if (!part->cpt_nodemask)
> >  			goto failed_setting_ctb_parts;
> > +
> > +		part->cpt_distance = kvmalloc_array(cptab->ctb_nparts,
> > +						    sizeof(part->cpt_distance[0]),
> > +						    GFP_KERNEL);
> > +		if (!part->cpt_distance)
> > +			goto failed_setting_ctb_parts;
> > +
> > +		memset(part->cpt_distance, -1,
> > +		       cptab->ctb_nparts * sizeof(part->cpt_distance[0]));
> >  	}
> >  
> >  	return cptab;
> > @@ -138,6 +147,7 @@ struct cfs_cpt_table *
> >  
> >  		kfree(part->cpt_nodemask);
> >  		free_cpumask_var(part->cpt_cpumask);
> > +		kvfree(part->cpt_distance);
> >  	}
> >  
> >  	kvfree(cptab->ctb_parts);
> > @@ -168,6 +178,7 @@ struct cfs_cpt_table *
> >  
> >  		kfree(part->cpt_nodemask);
> >  		free_cpumask_var(part->cpt_cpumask);
> > +		kvfree(part->cpt_distance);
> >  	}
> >  
> >  	kvfree(cptab->ctb_parts);
> > @@ -222,6 +233,44 @@ struct cfs_cpt_table *
> >  }
> >  EXPORT_SYMBOL(cfs_cpt_table_print);
> >  
> > +int cfs_cpt_distance_print(struct cfs_cpt_table *cptab, char *buf, int len)
> > +{
> > +	char *tmp = buf;
> > +	int rc;
> > +	int i;
> > +	int j;
> > +
> > +	for (i = 0; i < cptab->ctb_nparts; i++) {
> > +		if (len <= 0)
> > +			goto err;
> > +
> > +		rc = snprintf(tmp, len, "%d\t:", i);
> > +		len -= rc;
> > +
> > +		if (len <= 0)
> > +			goto err;
> > +
> > +		tmp += rc;
> > +		for (j = 0; j < cptab->ctb_nparts; j++) {
> > +			rc = snprintf(tmp, len, " %d:%d", j,
> > +				      cptab->ctb_parts[i].cpt_distance[j]);
> > +			len -= rc;
> > +			if (len <= 0)
> > +				goto err;
> > +			tmp += rc;
> > +		}
> > +
> > +		*tmp = '\n';
> > +		tmp++;
> > +		len--;
> > +	}
> > +
> > +	return tmp - buf;
> > +err:
> > +	return -E2BIG;
> > +}
> > +EXPORT_SYMBOL(cfs_cpt_distance_print);
> > +
> >  int
> >  cfs_cpt_number(struct cfs_cpt_table *cptab)
> >  {
> > @@ -273,6 +322,18 @@ struct cfs_cpt_table *
> >  }
> >  EXPORT_SYMBOL(cfs_cpt_nodemask);
> >  
> > +unsigned int cfs_cpt_distance(struct cfs_cpt_table *cptab, int cpt1, int cpt2)
> > +{
> > +	LASSERT(cpt1 == CFS_CPT_ANY || (cpt1 >= 0 && cpt1 < cptab->ctb_nparts));
> > +	LASSERT(cpt2 == CFS_CPT_ANY || (cpt2 >= 0 && cpt2 < cptab->ctb_nparts));
> > +
> > +	if (cpt1 == CFS_CPT_ANY || cpt2 == CFS_CPT_ANY)
> > +		return cptab->ctb_distance;
> > +
> > +	return cptab->ctb_parts[cpt1].cpt_distance[cpt2];
> > +}
> > +EXPORT_SYMBOL(cfs_cpt_distance);
> > +
> >  int
> >  cfs_cpt_set_cpu(struct cfs_cpt_table *cptab, int cpt, int cpu)
> >  {
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-25 18:22     ` Doug Oucharek
@ 2018-06-27  2:44       ` NeilBrown
  2018-06-27 12:42         ` Patrick Farrell
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-27  2:44 UTC (permalink / raw)
  To: lustre-devel

On Mon, Jun 25 2018, Doug Oucharek wrote:

> Some background on this NUMA change:
>
> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>
> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>

Thanks a lot for the background.

If these NUMA nodes have a 'distance' between them, and if lustre can
benefit from knowing the distance, then it seems likely that other code
might also benefit.  In that case it would be best if the distance were
encoded in some global state information so that lustre and any other
subsystem can extract it.

Do you know if there is any work underway by anyone to make this
information generally available?  If there is, we should make sure that
lustre works in a compatible way so that once that work lands, lustre
can use it directly and not need extra configuration.
If no such work is underway, then it would be really good if something
were done in that direction.  If no-one here is able to work on this, I
can ask around in SUSE and see if anyone here knows anything relevant.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node
  2018-06-26  0:54     ` James Simmons
@ 2018-06-27  2:49       ` NeilBrown
  0 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-27  2:49 UTC (permalink / raw)
  To: lustre-devel

On Tue, Jun 26 2018, James Simmons wrote:

>> On Sun, Jun 24 2018, James Simmons wrote:
>> 
>> > From: Dmitry Eremin <dmitry.eremin@intel.com>
>> >
>> > Reporting "HW nodes" is too generic. It really is reporting
>> > "HW NUMA nodes". Update the debug message.
>> 
>> I'm not happy with this patch description.....
>
> How about:
>
> In the HPC world a node refers to the actual whole computer system
> used in a cluster. Reporting just "HW nodes" is not clear so change
> the debug report to "HW NUMA nodes" since this report the number
> of NUMA nodes in use by Lustre.
>

That's not actually the part I was complaining about.
It was the fact that there were multiple parts of the patch that weren't
mentioned at all.

But I like your revision anyway, so I've updated the patch.
Thanks!

NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling
  2018-06-26  1:15     ` James Simmons
@ 2018-06-27  2:50       ` NeilBrown
  0 siblings, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-06-27  2:50 UTC (permalink / raw)
  To: lustre-devel

On Tue, Jun 26 2018, James Simmons wrote:

>> On Sun, Jun 24 2018, James Simmons wrote:
>> 
>> > From: Amir Shehata <amir.shehata@intel.com>
>> >
>> > Add functionality to calculate the distance between two CPTs.
>> > Expose those distance in debugfs so people deploying a setup
>> > can debug what is being created for CPTs.
>> 
>> This patch doesn't expose anything in debugfs - a later patch
>> does that.
>> So I've changed the comment to "Prepare to expose those ...."
>
> Doug Oucharek recommonds the following commit message body:
>
> Add cpu distance routines which will be used by the Multi-Rail feature 
> to determine what fabric interface is nearest to the core we are currently 
> running on. Configuration of these distances will be provided from user 
> space via configuration routines in Lustre.

Thanks - I like this more.  Updated.

NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-27  2:44       ` NeilBrown
@ 2018-06-27 12:42         ` Patrick Farrell
  2018-06-28  1:17           ` NeilBrown
  0 siblings, 1 reply; 66+ messages in thread
From: Patrick Farrell @ 2018-06-27 12:42 UTC (permalink / raw)
  To: lustre-devel


Neil,

I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE (by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...

- Patrick

________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
Sent: Tuesday, June 26, 2018 9:44:37 PM
To: Doug Oucharek
Cc: Amir Shehata; Lustre Development List
Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support

On Mon, Jun 25 2018, Doug Oucharek wrote:

> Some background on this NUMA change:
>
> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>
> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>

Thanks a lot for the background.

If these NUMA nodes have a 'distance' between them, and if lustre can
benefit from knowing the distance, then is seems likely that other code
might also benefit.  In that case it would be best if the distance were
encoded in some global state information so that lustre and any other
subsystem can extract it.

Do you know if there is any work underway by anyone to make this
information generally available?  If there is, we should make sure that
lustre works in a compatible way so that once that work lands, lustre
can use it directly and not need extra configuration.
If no such work is underway, then it would be really good if something
were done in that direction.  If no-one here is able to work on this, I
can ask around in SUSE and see if anyone here knows anything relevant.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-27 12:42         ` Patrick Farrell
@ 2018-06-28  1:17           ` NeilBrown
  2018-06-29 17:19             ` Doug Oucharek
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-06-28  1:17 UTC (permalink / raw)
  To: lustre-devel


I went digging and found that Linux already has a well defined concept
of distance between NUMA nodes.
On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
describe it in devicetree.
You can view distance information in
  /sys/devices/system/node/node*/distance

or using "numactl --hardware".

Why doesn't lustre simply extract and use this information?  Why does
lustre need to allow it to be configured?

Thanks,
NeilBrown

On Wed, Jun 27 2018, Patrick Farrell wrote:

> Neil,
>
> I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE(by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...
>
> - Patrick
>
> ________________________________
> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
> Sent: Tuesday, June 26, 2018 9:44:37 PM
> To: Doug Oucharek
> Cc: Amir Shehata; Lustre Development List
> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>
> On Mon, Jun 25 2018, Doug Oucharek wrote:
>
>> Some background on this NUMA change:
>>
>> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>>
>> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>>
>
> Thanks a lot for the background.
>
> If these NUMA nodes have a 'distance' between them, and if lustre can
> benefit from knowing the distance, then is seems likely that other code
> might also benefit.  In that case it would be best if the distance were
> encoded in some global state information so that lustre and any other
> subsystem can extract it.
>
> Do you know if there is any work underway by anyone to make this
> information generally available?  If there is, we should make sure that
> lustre works in a compatible way so that once that work lands, lustre
> can use it directly and not need extra configuration.
> If no such work is underway, then it would be really good if something
> were done in that direction.  If no-one here is able to work on this, I
> can ask around in SUSE and see if anyone here knows anything relevant.
>
> Thanks,
> NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-28  1:17           ` NeilBrown
@ 2018-06-29 17:19             ` Doug Oucharek
  2018-06-29 17:27               ` Amir Shehata
  0 siblings, 1 reply; 66+ messages in thread
From: Doug Oucharek @ 2018-06-29 17:19 UTC (permalink / raw)
  To: lustre-devel

I'll leave Olaf of HPE to answer questions about the distance code.  I was only an inspector as it relates to the Multi-Rail feature in the community tree.

Doug

> On Jun 27, 2018, at 6:17 PM, NeilBrown <neilb@suse.com> wrote:
> 
> 
> I went digging and found that Linux already has a well defined concept
> of distance between NUMA nodes.
> On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
> describe it in devicetree.
> You can view distance information in
>  /sys/devices/system/node/node*/distance
> 
> or using "numactl --hardware".
> 
> Why doesn't lustre simple extract and use this information?  Why does
> lustre need to allow it to be configured?
> 
> Thanks,
> NeilBrown
> 
> On Wed, Jun 27 2018, Patrick Farrell wrote:
> 
>> Neil,
>> 
>> I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE(by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...
>> 
>> - Patrick
>> 
>> ________________________________
>> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
>> Sent: Tuesday, June 26, 2018 9:44:37 PM
>> To: Doug Oucharek
>> Cc: Amir Shehata; Lustre Development List
>> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>> 
>> On Mon, Jun 25 2018, Doug Oucharek wrote:
>> 
>>> Some background on this NUMA change:
>>> 
>>> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>>> 
>>> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>>> 
>> 
>> Thanks a lot for the background.
>> 
>> If these NUMA nodes have a 'distance' between them, and if lustre can
>> benefit from knowing the distance, then is seems likely that other code
>> might also benefit.  In that case it would be best if the distance were
>> encoded in some global state information so that lustre and any other
>> subsystem can extract it.
>> 
>> Do you know if there is any work underway by anyone to make this
>> information generally available?  If there is, we should make sure that
>> lustre works in a compatible way so that once that work lands, lustre
>> can use it directly and not need extra configuration.
>> If no such work is underway, then it would be really good if something
>> were done in that direction.  If no-one here is able to work on this, I
>> can ask around in SUSE and see if anyone here knows anything relevant.
>> 
>> Thanks,
>> NeilBrown

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-29 17:19             ` Doug Oucharek
@ 2018-06-29 17:27               ` Amir Shehata
  2018-06-29 17:47                 ` Weber, Olaf
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Shehata @ 2018-06-29 17:27 UTC (permalink / raw)
  To: lustre-devel

Olaf can add more details, but I believe we are using the linux distance
infrastructure. Take a look at cfs_cpt_distance_calculate(). What we're
doing is extracting the NUMA distances provided in the kernel and building
an internal representation of distances between CPU partitions (CPTs) since
that's what's used in the code.
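
As a rough sketch of that idea (not the exact cfs_cpt_distance_calculate()
code; here the CPT-to-CPT distance is simply taken as the worst-case
node_distance() between their NUMA nodes, and the helper name is made up):

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static unsigned int cpt_pair_distance(const nodemask_t *mask1,
				      const nodemask_t *mask2)
{
	unsigned int dist = 1;
	int n1, n2;

	/* worst-case distance between any node of one CPT and any
	 * node of the other
	 */
	for_each_node_mask(n1, *mask1)
		for_each_node_mask(n2, *mask2)
			dist = max(dist,
				   (unsigned int)node_distance(n1, n2));

	return dist;
}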

On 29 June 2018 at 10:19, Doug Oucharek <doucharek@cray.com> wrote:

> I'll leave Olaf of HPE answer questions about the distance code.  I was
> only an inspector as it relates to the Multi-Rail feature in the community
> tree.
>
> Doug
>
> > On Jun 27, 2018, at 6:17 PM, NeilBrown <neilb@suse.com> wrote:
> >
> >
> > I went digging and found that Linux already has a well defined concept
> > of distance between NUMA nodes.
> > On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
> > describe it in devicetree.
> > You can view distance information in
> >  /sys/devices/system/node/node*/distance
> >
> > or using "numactl --hardware".
> >
> > Why doesn't lustre simple extract and use this information?  Why does
> > lustre need to allow it to be configured?
> >
> > Thanks,
> > NeilBrown
> >
> > On Wed, Jun 27 2018, Patrick Farrell wrote:
> >
> >> Neil,
> >>
> >> I am not the person at Cray for this, but if SUSE does take an interest
> in this, Cray would probably be interested in weighing in and contributing
> info if not actually code.  In fact, other HPC vendors like HPE(by which I
> mostly mean the old SGI) or IBM might as well.  NUMA optimization is a
> persistent fascination in our area of the industry...
> >>
> >> - Patrick
> >>
> >> ________________________________
> >> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf
> of NeilBrown <neilb@suse.com>
> >> Sent: Tuesday, June 26, 2018 9:44:37 PM
> >> To: Doug Oucharek
> >> Cc: Amir Shehata; Lustre Development List
> >> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs:
> NUMA support
> >>
> >> On Mon, Jun 25 2018, Doug Oucharek wrote:
> >>
> >>> Some background on this NUMA change:
> >>>
> >>> First off, this is just a first step to a bigger set of changes which
> include changes to the Lustre utilities.  This was done as part of the
> Multi-Rail feature.  One of the systems that feature is meant to support is
> the SGI UV system (now HPE) which has a massive number of NUMA nodes
> connected by a NUMA Link.  There are multiple fabric cards spread
> throughout the system and Multi-Rail needs to know which fabric cards are
> nearest to the NUMA node we are running on.  To do that, the "distance"
> between NUMA nodes needs to be configured.
> >>>
> >>> This patch is preparing the infrastructure for the Multi-Rail feature
> to support configuring NUMA node distances.  Technically, this patch should
> be landing with the Multi-Rail feature (still to be pushed) for it to make
> proper sense.
> >>>
> >>
> >> Thanks a lot for the background.
> >>
> >> If these NUMA nodes have a 'distance' between them, and if lustre can
> >> benefit from knowing the distance, then is seems likely that other code
> >> might also benefit.  In that case it would be best if the distance were
> >> encoded in some global state information so that lustre and any other
> >> subsystem can extract it.
> >>
> >> Do you know if there is any work underway by anyone to make this
> >> information generally available?  If there is, we should make sure that
> >> lustre works in a compatible way so that once that work lands, lustre
> >> can use it directly and not need extra configuration.
> >> If no such work is underway, then it would be really good if something
> >> were done in that direction.  If no-one here is able to work on this, I
> >> can ask around in SUSE and see if anyone here knows anything relevant.
> >>
> >> Thanks,
> >> NeilBrown
>
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-29 17:27               ` Amir Shehata
@ 2018-06-29 17:47                 ` Weber, Olaf
  2018-07-04  5:22                   ` NeilBrown
  0 siblings, 1 reply; 66+ messages in thread
From: Weber, Olaf @ 2018-06-29 17:47 UTC (permalink / raw)
  To: lustre-devel

To add to Amir's point, Lustre's CPTs are a way to partition a machine. The distance mechanism I added is one way to map the ACPI-reported distances onto the Lustre CPT mapping. It tends to assume the worst case applies to the whole partitions. It is there because the rest of the Lustre code (at least in the tree I had to work on) "thinks" in CPTs.

Other CPT-related stuff that came in with the multi-rail code has the same rationale. If I'd been working against the kernel interfaces themselves it would have looked differently, but that was not an option at the time.

We've found it to be useful, so replacing it would be better than just ripping it out.

That's all there is to it.

Olaf

---
From: Amir Shehata [mailto:amir.shehata.whamcloud at gmail.com] 
Sent: Friday, June 29, 2018 19:28
To: Doug Oucharek <doucharek@cray.com>
Cc: NeilBrown <neilb@suse.com>; Weber, Olaf (HPC Data Management & Storage) <olaf.weber@hpe.com>; Amir Shehata <amir.shehata@intel.com>; Lustre Development List <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support

Olaf can add more details, but I believe we are using the linux distance infrastructure. Take a look at cfs_cpt_distance_calculate(). What we're doing is extracting the NUMA distances provided in the kernel and building an internal representation of distances between CPU partitions (CPTs) since that's what's used in the code.

On 29 June 2018 at 10:19, Doug Oucharek <doucharek@cray.com> wrote:
I'll leave Olaf of HPE answer questions about the distance code.  I was only an inspector as it relates to the Multi-Rail feature in the community tree.

Doug

> On Jun 27, 2018, at 6:17 PM, NeilBrown <neilb@suse.com> wrote:
> 
> 
> I went digging and found that Linux already has a well defined concept
> of distance between NUMA nodes.
> On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
> describe it in devicetree.
> You can view distance information in
>  /sys/devices/system/node/node*/distance
> 
> or using "numactl --hardware".
> 
> Why doesn't lustre simple extract and use this information?  Why does
> lustre need to allow it to be configured?
> 
> Thanks,
> NeilBrown
> 
> On Wed, Jun 27 2018, Patrick Farrell wrote:
> 
>> Neil,
>> 
>> I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE(by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...
>> 
>> - Patrick
>> 
>> ________________________________
>> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
>> Sent: Tuesday, June 26, 2018 9:44:37 PM
>> To: Doug Oucharek
>> Cc: Amir Shehata; Lustre Development List
>> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>> 
>> On Mon, Jun 25 2018, Doug Oucharek wrote:
>> 
>>> Some background on this NUMA change:
>>> 
>>> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>>> 
>>> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>>> 
>> 
>> Thanks a lot for the background.
>> 
>> If these NUMA nodes have a 'distance' between them, and if lustre can
>> benefit from knowing the distance, then is seems likely that other code
>> might also benefit.  In that case it would be best if the distance were
>> encoded in some global state information so that lustre and any other
>> subsystem can extract it.
>> 
>> Do you know if there is any work underway by anyone to make this
>> information generally available?  If there is, we should make sure that
>> lustre works in a compatible way so that once that work lands, lustre
>> can use it directly and not need extra configuration.
>> If no such work is underway, then it would be really good if something
>> were done in that direction.  If no-one here is able to work on this, I
>> can ask around in SUSE and see if anyone here knows anything relevant.
>> 
>> Thanks,
>> NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-06-29 17:47                 ` Weber, Olaf
@ 2018-07-04  5:22                   ` NeilBrown
  2018-07-04  8:40                     ` Weber, Olaf
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-07-04  5:22 UTC (permalink / raw)
  To: lustre-devel


Thanks everyone for your patience in explaining things to me.
I'm beginning to understand what to look for and where to find it.

So the answers to Greg's questions:

  Where are you reading the host memory NUMA information from?

  And why would a filesystem care about this type of thing?  Are you
  going to now mirror what the scheduler does with regards to NUMA
  topology issues?  How are you going to handle things when the topology
  changes?  What systems did you test this on?  What performance
  improvements were seen?  What downsides are there with all of this?


Are:
  - NUMA info comes from ACPI or device-tree just like for every one
      else.  Lustre just uses node_distance().

  - The filesystem cares about this because...  It has service
    threads that do part of the work of some filesystem operations
    (handling replies for example) and these are best handled "near"
    the CPU that initiated the request.  Lustre partitions
    all CPUs into "partitions" (cpt) each with a few cores.
    If the request thread and the reply thread are on different
    CPUs but in the same partition, then we get best throughput
    (is that close?)

  - Not really mirroring the scheduler, maybe mirroring parts of the
    network layer(?)

  - We don't handle topology changes yet except in very minimal ways
    (cpts *can* become empty, and that can cause problems).

  - This has been tested on .... great big things.
  - When multi-rails configurations are used (like ethernet-bonding,
    but for RDMA), we get ??? closer to theoretical bandwidth.
    Without these changes it scales poorly (??)

  - The down-sides primarily are that we don't auto-configure
    perfectly.  This particularly affects hot-plug, but without
    hotplug the grouping of cpus and interfaces are focussed
    on .... avoiding worst case rather than achieving best case.
    

I've made up a lot of stuff there.  I'm happy not to pursue this further
at the moment, but if anyone would like to enhance my understanding by
correcting the worst errors in the above, I wouldn't object :-)

Thanks,
NeilBrown




On Fri, Jun 29 2018, Weber, Olaf (HPC Data Management & Storage) wrote:

> To add to Amir's point,  Lustre's CPTs are a way to partition a machine. The distance mechanism I added is one way to map the ACPI-reported distances on the Lustre CPT mapping. It tends to assume the worst case applies to the wholes. It is there because the rest of the Lustre code (at least in the tree I had to work on) "thinks" in CPTs.
>
> Other CPT-related stuff that came in with the multi-rail code has the same rationale. If I'd been working against the kernel interfaces themselves it would have looked differently, but that was not an option at the time.
>
> We've found it to be useful, so replacing it would be better than just ripping it out.
>
> That's all there is to it.
>
> Olaf
>
> ---
> From: Amir Shehata [mailto:amir.shehata.whamcloud at gmail.com] 
> Sent: Friday, June 29, 2018 19:28
> To: Doug Oucharek <doucharek@cray.com>
> Cc: NeilBrown <neilb@suse.com>; Weber, Olaf (HPC Data Management & Storage) <olaf.weber@hpe.com>; Amir Shehata <amir.shehata@intel.com>; Lustre Development List <lustre-devel@lists.lustre.org>
> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>
> Olaf can add more details, but I believe we are using the linux distance infrastructure. Take a look at cfs_cpt_distance_calculate(). What we're doing is extracting the NUMA distances provided in the kernel and building an internal representation of distances between CPU partitions (CPTs) since that's what's used in the code.
>
> On 29 June 2018 at 10:19, Doug Oucharek <doucharek@cray.com> wrote:
> I'll leave Olaf of HPE answer questions about the distance code.  I was only an inspector as it relates to the Multi-Rail feature in the community tree.
>
> Doug
>
>> On Jun 27, 2018, at 6:17 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> 
>> I went digging and found that Linux already has a well defined concept
>> of distance between NUMA nodes.
>> On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
>> describe it in devicetree.
>> You can view distance information in
>>  /sys/devices/system/node/node*/distance
>> 
>> or using "numactl --hardware".
>> 
>> Why doesn't lustre simple extract and use this information?  Why does
>> lustre need to allow it to be configured?
>> 
>> Thanks,
>> NeilBrown
>> 
>> On Wed, Jun 27 2018, Patrick Farrell wrote:
>> 
>>> Neil,
>>> 
>>> I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE(by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...
>>> 
>>> - Patrick
>>> 
>>> ________________________________
>>> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
>>> Sent: Tuesday, June 26, 2018 9:44:37 PM
>>> To: Doug Oucharek
>>> Cc: Amir Shehata; Lustre Development List
>>> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>>> 
>>> On Mon, Jun 25 2018, Doug Oucharek wrote:
>>> 
>>>> Some background on this NUMA change:
>>>> 
>>>> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>>>> 
>>>> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>>>> 
>>> 
>>> Thanks a lot for the background.
>>> 
>>> If these NUMA nodes have a 'distance' between them, and if lustre can
>>> benefit from knowing the distance, then is seems likely that other code
>>> might also benefit.? In that case it would be best if the distance were
>>> encoded in some global state information so that lustre and any other
>>> subsystem can extract it.
>>> 
>>> Do you know if there is any work underway by anyone to make this
>>> information generally available?  If there is, we should make sure that
>>> lustre works in a compatible way so that once that work lands, lustre
>>> can use it directly and not need extra configuration.
>>> If no such work is underway, then it would be really good if something
>>> were done in that direction.  If no-one here is able to work on this, I
>>> can ask around in SUSE and see if anyone here knows anything relevant.
>>> 
>>> Thanks,
>>> NeilBrown
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-04  5:22                   ` NeilBrown
@ 2018-07-04  8:40                     ` Weber, Olaf
  2018-07-05  1:57                       ` NeilBrown
  2018-07-06  0:20                       ` James Simmons
  0 siblings, 2 replies; 66+ messages in thread
From: Weber, Olaf @ 2018-07-04  8:40 UTC (permalink / raw)
  To: lustre-devel

NeilBrown [mailto:neilb at suse.com] wrote:

To help contextualize things: the Lustre code can be decomposed into three parts:

1) The filesystem proper: Lustre.
2) The communication protocol it uses: LNet.
3) Supporting code used by Lustre and LNet: CFS.

Part of the supporting code is the CPT mechanism, which provides a way to
partition the CPUs of a system. These partitions are used to distribute queues,
locks, and threads across the system. It was originally introduced years ago, as
far as I can tell mainly to deal with certain hot locks: these were converted into
read/write locks with one spinlock per CPT.
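
As an illustration of that "one spinlock per CPT" idea (names invented
here; this is not the libcfs implementation), a reader only takes the
lock of its own partition, while a writer takes all of them:

#include <linux/spinlock.h>

struct cpt_lock {
	spinlock_t	*pl_locks;	/* one spinlock per CPT */
	int		pl_nparts;
};

static void cpt_lock_one(struct cpt_lock *lock, int cpt)	/* "read" side */
{
	spin_lock(&lock->pl_locks[cpt]);
}

static void cpt_unlock_one(struct cpt_lock *lock, int cpt)
{
	spin_unlock(&lock->pl_locks[cpt]);
}

static void cpt_lock_all(struct cpt_lock *lock)			/* "write" side */
{
	int i;

	/* a real implementation needs lockdep annotations (per-lock
	 * classes or spin_lock_nested()) to take these in a loop
	 */
	for (i = 0; i < lock->pl_nparts; i++)
		spin_lock(&lock->pl_locks[i]);
}

static void cpt_unlock_all(struct cpt_lock *lock)
{
	int i;

	for (i = lock->pl_nparts - 1; i >= 0; i--)
		spin_unlock(&lock->pl_locks[i]);
}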

As a general rule, CPT boundaries should respect node and socket boundaries,
but at the higher end, where CPUs have 20+ cores, it may make sense to split
a CPU's cores across several CPTs.

> Thanks everyone for your patience in explaining things to me.
> I'm beginning to understand what to look for and where to find it.
> 
> So the answers to Greg's questions:
> 
>   Where are you reading the host memory NUMA information from?
> 
>   And why would a filesystem care about this type of thing?  Are you
>   going to now mirror what the scheduler does with regards to NUMA
>   topology issues?  How are you going to handle things when the topology
>   changes?  What systems did you test this on?  What performance
>   improvements were seen?  What downsides are there with all of this?
> 
> 
> Are:

>   - NUMA info comes from ACPI or device-tree just like for every one
>       else.  Lustre just uses node_distance().

Correct, the standard kernel interfaces for this information are used to
obtain it, so ultimately Lustre/LNet uses the same source of truth as
everyone else.

>   - The filesystem cares about this because...  It has service
>     thread that does part of the work of some filesystem operations
>     (handling replies for example) and these are best handled "near"
>     the CPU the initiated the request.  Lustre partitions
>     all CPUs into "partitions" (cpt) each with a few cores.
>     If the request thread and the reply thread are on different
>     CPUs but in the same partition, then we get best throughput
>     (is that close?)

At the filesystem level, it does indeed seem to help to have the service
threads that do work for requests run on a different core that is close to
the core that originated the request. So preferably on the same CPU, and
on certain multi-core CPUs there are also distance effects between cores.
That too is one of the things the CPT mechanism handles.

>   - Not really mirroring the scheduler, maybe mirroring parts of the
>     network layer(?)

The LNet code, which is derived from Portals 3.x, is mostly an easier-to-use
abstraction of RDMA interfaces provided by Infiniband and other similar
hardware. It can also use TCP/IP, but that's not the primary use case.

As a communication layer that builds on top of RDMA-capable hardware,
LNet cares about such things as whether the CPU driving communication
is close to the memory used, and also whether it is close to the interface
used. Even in a 2-socket machine, there are measurable performance
differences depending on whether the memory and the interface connect
to the same socket or to different sockets. On bigger hardware, like a
32-socket machine, the penalties are much more pronounced. At the
time we found that the QPI links between sockets were a bottleneck
and that performance cratered if they had to handle too much traffic.

UPI, the successor to QPI, is better -- it has more bandwidth -- but with
the CPUs having more and more cores I expect the scaling issues to
remain similar.

>   - We don't handle topology changes yet except in very minimal ways
>     (cpts *can* become empty, and that can cause problems).

Yes, this is a known deficiency.

>   - This has been tested on .... great big things.

The basic CPT mechanism predates my involvement with Lustre. I did
work on making it more NUMA-aware. A 32-socket system was one of
the primary test beds.

>   - When multi-rails configurations are used (like ethernet-bonding,
>     but for RDMA), we get ??? closer to theoretical bandwidth.
>     Without these changes it scales poorly (??)

The basic idea behind multi-rail configurations is that we use several
Infiniband interfaces and LNet presents them as a single logical interface
to Lustre. For each message, LNet picks the IB interface it should go across
using several criteria, including NUMA distance of the interface and how
busy it is.

With these changes we could get pretty much linear scaling of LNet
throughput by adding more interfaces.
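
To make the selection criterion concrete, something along these lines (a
sketch only -- the struct and field names here are invented, not the actual
LNet selection code):

#include <linux/list.h>
#include <linux/topology.h>

struct iface {				/* hypothetical per-interface state */
	struct list_head if_list;
	int		 if_node;	/* NUMA node the HCA hangs off */
	int		 if_credits;	/* rough "how busy" measure */
};

/* prefer the NUMA-closest interface, break ties on available credits */
static struct iface *pick_iface(struct list_head *ifaces)
{
	struct iface *best = NULL, *ni;
	int best_dist = 0;
	int self = numa_node_id();

	list_for_each_entry(ni, ifaces, if_list) {
		int dist = node_distance(self, ni->if_node);

		if (!best || dist < best_dist ||
		    (dist == best_dist && ni->if_credits > best->if_credits)) {
			best = ni;
			best_dist = dist;
		}
	}
	return best;
}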

>   - The down-sides primarily are that we don't auto-configure
>     perfectly.  This particularly affects hot-plug, but without
>     hotplug the grouping of cpus and interfaces are focussed
>     on .... avoiding worst case rather than achieving best case.

Without hotplug the CPT grouping should be tuned to achieve a best
case in a static configuration.

Adding simple-minded hotplug tolerance (let's not call it support) would
focus on avoiding truly pathological behaviour.

> I've made up a lot of stuff there.  I'm happy not to pursue this further at the
> moment, but if anyone would like to enhance my understanding by
> correcting the worst errors in the above, I wouldn't object :-)
> 
> Thanks,
> NeilBrown

PS: the NUMA effects I've mentioned above have been making the news
lately under other names: they are part of the side channels used in various
timing based attacks.

Olaf

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-04  8:40                     ` Weber, Olaf
@ 2018-07-05  1:57                       ` NeilBrown
  2018-07-06  0:20                       ` James Simmons
  1 sibling, 0 replies; 66+ messages in thread
From: NeilBrown @ 2018-07-05  1:57 UTC (permalink / raw)
  To: lustre-devel

On Wed, Jul 04 2018, Weber, Olaf (HPC Data Management & Storage) wrote:

> NeilBrown [mailto:neilb at suse.com] wrote:
>
> To help contextualize things: the Lustre code can be decomposed into three parts:
>
> 1) The filesystem proper: Lustre.
> 2) The communication protocol it uses: LNet.
> 3) Supporting code used by Lustre and LNet: CFS.
>
> Part of the supporting code is the CPT mechanism, which provides a way to
> partition the CPUs of a system. These partitions are used to distribute queues,
> locks, and threads across the system. It was originally introduced years ago, as
> far as I can tell mainly to deal with certain hot locks: these were converted into
> read/write locks with one spinlock per CPT.

Thanks for this context.
Looking in the client code there are 2 per-cpt locks: ln_res_lock and
ln_net_lock.

ln_res_lock protects:
 lnet_res_container -> rec_lh_hash hash chains.

 the_lnet.ln_eq_container.rec_active list of lnet_eq

 lists of memory descriptors (rec_active)

 lists of match entries - one table per cpt.
     Some match entries follow cpu affinity, some are global and hashed
     to choose a table (I think).

  lib-move seems to use the lock to protect the md itself,
  rather than just the list of md.... not sure.

  ptl_mt_maps (??) (rather inefficient ordered-insertion in
  		lnet_ptl_enable_mt())

  proc_lnet_portal_rotor() uses lnet_res_lock(0) to protect
     portal_rotos[]. I wonder why.


ln_net_lock protects:

   ni->ni_refs counter (why not atomic_t I wonder)
   the_lnet.ln_testprotocompat - I guess we don't want that changing
          while a per-cpt lock is held?

   the_lnet.ln_counters - keep them stable while reading.
   the_lnet.ln_nis ??
   the_lnet.ln_nis_cpt list of lnet_ni... list of network interfaces I guess.
     Locking a per-cpt lock stops updates as all locks are needed to
     change.  These days RCU is often used for this sort of thing.

   lnet_ping_info
   ... and maybe lots more.


So ln_net_lock seems to be the "read/write locks with one spinlock per
CPT" that you described.  I wonder how much of that could be converted
to use RCU - with just a single spinlock to protect updates, and
rcu_read_lock() to make reads safe.
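
(The shape of that conversion would be roughly the following -- a sketch
with made-up names, just to show the pattern:)

#include <linux/rculist.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

struct ni_entry {
	struct list_head ni_link;
	struct rcu_head	 ni_rcu;
	int		 ni_id;
	int		 ni_flags;
};

static LIST_HEAD(ni_list);			/* read-mostly list */
static DEFINE_SPINLOCK(ni_update_lock);		/* serializes writers only */

/* reader: no per-cpt lock needed at all */
static int ni_lookup_flags(int id)
{
	struct ni_entry *ni;
	int flags = -1;

	rcu_read_lock();
	list_for_each_entry_rcu(ni, &ni_list, ni_link) {
		if (ni->ni_id == id) {
			flags = ni->ni_flags;
			break;
		}
	}
	rcu_read_unlock();
	return flags;
}

/* writer: one plain spinlock plus RCU-safe list ops */
static void ni_remove(struct ni_entry *ni)
{
	spin_lock(&ni_update_lock);
	list_del_rcu(&ni->ni_link);
	spin_unlock(&ni_update_lock);
	kfree_rcu(ni, ni_rcu);		/* freed once readers are done */
}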

ln_res_lock is quite different - it protects a selection of different
resources that are distributed across multiple cpts.
I wonder why we don't have one lock per resource...
I also wonder how important having these things per-cpt is.
Lots of other code in the kernel has per-CPU lists etc, and
some have per-numa-node, but no other code seems to need an
intermediate granularity.

I might try to drill down into some of this code a bit more and see what
I can find.  There is probably something I'm still missing.

Thanks,
NeilBrown

   

>
> As a general rule, CPT boundaries should respect node and socket boundaries,
> but at the higher end, where CPUs have 20+ cores, it may make sense to split
> a CPUs cores across several CPTs.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180705/d978b412/attachment.sig>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-04  8:40                     ` Weber, Olaf
  2018-07-05  1:57                       ` NeilBrown
@ 2018-07-06  0:20                       ` James Simmons
  2018-07-06  0:40                         ` Patrick Farrell
  2018-07-06  3:11                         ` NeilBrown
  1 sibling, 2 replies; 66+ messages in thread
From: James Simmons @ 2018-07-06  0:20 UTC (permalink / raw)
  To: lustre-devel


> NeilBrown [mailto:neilb at suse.com] wrote:
> 
> To help contextualize things: the Lustre code can be decomposed into three parts:
> 
> 1) The filesystem proper: Lustre.
> 2) The communication protocol it uses: LNet.
> 3) Supporting code used by Lustre and LNet: CFS.
> 
> Part of the supporting code is the CPT mechanism, which provides a way to
> partition the CPUs of a system. These partitions are used to distribute queues,
> locks, and threads across the system. It was originally introduced years ago, as
> far as I can tell mainly to deal with certain hot locks: these were converted into
> read/write locks with one spinlock per CPT.
> 
> As a general rule, CPT boundaries should respect node and socket boundaries,
> but at the higher end, where CPUs have 20+ cores, it may make sense to split
> a CPUs cores across several CPTs.
> 
> > Thanks everyone for your patience in explaining things to me.
> > I'm beginning to understand what to look for and where to find it.
> > 
> > So the answers to Greg's questions:
> > 
> >   Where are you reading the host memory NUMA information from?
> > 
> >   And why would a filesystem care about this type of thing?  Are you
> >   going to now mirror what the scheduler does with regards to NUMA
> >   topology issues?  How are you going to handle things when the topology
> >   changes?  What systems did you test this on?  What performance
> >   improvements were seen?  What downsides are there with all of this?
> > 
> > 
> > Are:
> 
> >   - NUMA info comes from ACPI or device-tree just like for every one
> >       else.  Lustre just uses node_distance().
> 
> Correct, the standard kernel interfaces for this information are used to
> obtain it, so ultimately Lustre/LNet uses the same source of truth as
> everyone else.
> 
> >   - The filesystem cares about this because...  It has service
> >     thread that does part of the work of some filesystem operations
> >     (handling replies for example) and these are best handled "near"
> >     the CPU the initiated the request.  Lustre partitions
> >     all CPUs into "partitions" (cpt) each with a few cores.
> >     If the request thread and the reply thread are on different
> >     CPUs but in the same partition, then we get best throughput
> >     (is that close?)
> 
> At the filesystem level, it does indeed seem to help to have the service
> threads that do work for requests run on a different core that is close to
> the core that originated the request. So preferably on the same CPU, and
> on certain multi-core CPUs there are also distance effects between cores.
> That too is one of the things the CPT mechanism handles.

There is another very important aspect to why Lustre has a CPU partition
layer. At least at the place I work at. While the Linux kernel manages all
the NUMA nodes and CPU cores, Lustre adds the ability for us to specify a
subset of everything on the system. The reason is to limit the impact of
noise on the compute nodes. Noise has a heavy impact on large-scale HPC
workloads that can run days or even weeks at a time. Let's take an
example system:

               |-------------|     |-------------|
   |-------|   | NUMA  0     |     | NUMA  1     |   |-------|
   | eth0  | - |             | --- |             | - | eth1  |
   |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
               |_____________|     |_____________|

In such a system it is possible with the right job scheduler to start a
large parallel application on NUMA 0 (CPU0 and CPU1). Normally such
large parallel applications will communicate between nodes using MPI,
such as openmpi, which can be configured to use eth0 only. Using the
CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
This greatly reduces the noise impact on the application running.
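
Roughly speaking the configuration for that split looks like the following
(a sketch -- double-check the exact cpu_pattern syntax against the libcfs
documentation; the interface names are just the ones from the figure):

# /etc/modprobe.d/lustre.conf
# one CPT made up of NUMA node 1 only ("N" = numbers are NUMA nodes)
options libcfs cpu_pattern="N 0[1]"
# restrict LNet to eth1
options lnet networks="tcp0(eth1)"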

BTW this is one of the reasons ko2iblnd for lustre doesn't use the
generic RDMA API. The core IB layer doesn't support such isolation.
At least to my knowledge.

> >   - Not really mirroring the scheduler, maybe mirroring parts of the
> >     network layer(?)
> 
> The LNet code, which is derived from Portals 3.x, is mostly an easier-to-use
> abstraction of RDMA interfaces provided by Infiniband and other similar
> hardware. It can also use TCP/IP, but that's not the primary use case.
> 
> As a communication layer that builds on top of RDMA-capable hardware,
> LNet cares about such things as whether the CPU driving communication
> is close to the memory used, and also whether it is close to the interface
> used. Even in a 2-socket machine, there are measurable performance
> differences depending on whether the memory an interface connect
> to the same socket or to different sockets. On bigger hardware, like a
> 32-socket machine, the penalties are much more pronounced. At the
> time we found that the QPI links between sockets were a bottleneck
> and that performance cratered if they had to handle too much traffic.
> 
> UPI, the successor to QPI is better -- has more bandwidth -- but with
> the CPUs having more and more cores I expect the scaling issues to
> remain similar.
> 
> >   - We don't handle topology changes yet except in very minimal ways
> >     (cpts *can* become empty, and that can cause problems).
> 
> Yes, this is a known deficiency.
> 
> >   - This has been tested on .... great big things.
> 
> The basic CPT mechanism predates my involvement with Lustre. I did
> work on making it more NUMA-aware. A 32-socket system was one of
> the primary test beds.
> 
> >   - When multi-rails configurations are used (like ethernet-bonding,
> >     but for RDMA), we get ??? closer to theoretical bandwidth.
> >     Without these changes it scales poorly (??)
> 
> The basic idea behind muti-rail configurations is that we use several
> Infiniband interfaces and LNet presents them as a single logical interface
> to Lustre. For each message, LNet picks the IB interface it should go across
> using several criteria, including NUMA distance of the interface and how
> busy it is.
> 
> With these changes we could get pretty much linear scaling of LNet
> throughput by adding more interfaces.
> 
> >   - The down-sides primarily are that we don't auto-configure
> >     perfectly.  This particularly affects hot-plug, but without
> >     hotplug the grouping of cpus and interfaces are focussed
> >     on .... avoiding worst case rather than achieving best case.
> 
> Without hotplug the CPT grouping should be tuned to achieve a best
> case in a static configuration.
> 
> Adding simple-minded hotplug tolerance (let's not call it support) would
> focus on avoiding truly pathological behaviour.
> 
> > I've made up a lot of stuff there.  I'm happy not to pursue this further at the
> > moment, but if anyone would like to enhance my understanding by
> > correcting the worst errors in the above, I wouldn't object :-)
> > 
> > Thanks,
> > NeilBrown
> 
> PS: the NUMA effects I've mentioned above have been making the news
> lately under other names: they are part of the side channels used in various
> timing based attacks.
> 
> Olaf
> 
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06  0:20                       ` James Simmons
@ 2018-07-06  0:40                         ` Patrick Farrell
  2018-07-06  3:11                         ` NeilBrown
  1 sibling, 0 replies; 66+ messages in thread
From: Patrick Farrell @ 2018-07-06  0:40 UTC (permalink / raw)
  To: lustre-devel


A tiny bit more about noise for Neil, since it's a bit subtle and I had never heard of it before working in HPC.  Sorry if this is old news.

Noise here means differences in execution time. A typical HPC job consists of thousands of processes running across a large system.  The basic model is they all run a compute step, then they all communicate part of their results (generally to some neighboring subset of processes, not all-to-all).  The results which are communicated are then used as part of the input to the next compute step.  As you can see, effectively, everyone must finish each step before anyone can continue (or at least, continue very far).

So if everyone finishes every step in the same amount of time, great.  But if there's jitter in the completion time for a step for a particular process - as can be introduced by a scheduler with ideas that don't quite line up with your job priorities - it delays the completion of the step overall.  This is compounded at each step of the job and so can be quite serious.  (Job steps can be quite short - double digit microseconds is not unusual - so relatively small jitter can really add up.)

So HPC users are really fussy about affinity and placement control.  Which isn't to say Lustre gets it all right, but it's why we care so much.
________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of James Simmons <jsimmons@infradead.org>
Sent: Thursday, July 5, 2018 7:20:37 PM
To: Weber, Olaf (HPC Data Management & Storage)
Cc: Lustre Development List
Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support


> NeilBrown [mailto:neilb at suse.com] wrote:
>
> To help contextualize things: the Lustre code can be decomposed into three parts:
>
> 1) The filesystem proper: Lustre.
> 2) The communication protocol it uses: LNet.
> 3) Supporting code used by Lustre and LNet: CFS.
>
> Part of the supporting code is the CPT mechanism, which provides a way to
> partition the CPUs of a system. These partitions are used to distribute queues,
> locks, and threads across the system. It was originally introduced years ago, as
> far as I can tell mainly to deal with certain hot locks: these were converted into
> read/write locks with one spinlock per CPT.
>
> As a general rule, CPT boundaries should respect node and socket boundaries,
> but at the higher end, where CPUs have 20+ cores, it may make sense to split
> a CPUs cores across several CPTs.
>
> > Thanks everyone for your patience in explaining things to me.
> > I'm beginning to understand what to look for and where to find it.
> >
> > So the answers to Greg's questions:
> >
> >   Where are you reading the host memory NUMA information from?
> >
> >   And why would a filesystem care about this type of thing?  Are you
> >   going to now mirror what the scheduler does with regards to NUMA
> >   topology issues?  How are you going to handle things when the topology
> >   changes?  What systems did you test this on?  What performance
> >   improvements were seen?  What downsides are there with all of this?
> >
> >
> > Are:
>
> >   - NUMA info comes from ACPI or device-tree just like for every one
> >       else.  Lustre just uses node_distance().
>
> Correct, the standard kernel interfaces for this information are used to
> obtain it, so ultimately Lustre/LNet uses the same source of truth as
> everyone else.
>
> >   - The filesystem cares about this because...  It has service
> >     thread that does part of the work of some filesystem operations
> >     (handling replies for example) and these are best handled "near"
> >     the CPU the initiated the request.  Lustre partitions
> >     all CPUs into "partitions" (cpt) each with a few cores.
> >     If the request thread and the reply thread are on different
> >     CPUs but in the same partition, then we get best throughput
> >     (is that close?)
>
> At the filesystem level, it does indeed seem to help to have the service
> threads that do work for requests run on a different core that is close to
> the core that originated the request. So preferably on the same CPU, and
> on certain multi-core CPUs there are also distance effects between cores.
> That too is one of the things the CPT mechanism handles.

Their is another very important aspect to why Lustre has a CPU partition
layer. At least at the place I work at. While the Linux kernel manages all
the NUMA nodes and CPU cores Lustre adds the ability for us to specify a
subset of everything on the system. The reason is to limit the impact of
noise on the compute nodes. Noise has a heavy impact on large scale HP
work loads that can run days or even weeks at a time. Lets take an
example system:

               |-------------|      |-------------|
   |-------|   | NUMA  0     |     | NUMA  1      |   |-------|
   | eth0  | - |             | --- |              | - | eth1  |
   |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
               |_____________|      |_____________|

In such a system it is possible with the right job schedular to start a
large parallel application on NUMA 0/ (CPU0 and CPU1). Normally such
large parallel applications will communicate between nodes using MPI,
such as openmpi, which can be configured to use eth0 only. Using the
CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
This greatly reducess the noise impact on the application running.

BTW this is one of the reasons ko2iblnd for lustre doesn't use the
generic RDMA api. The core IB layer doesn't support such isolation.
At least to my knowledge.

> >   - Not really mirroring the scheduler, maybe mirroring parts of the
> >     network layer(?)
>
> The LNet code, which is derived from Portals 3.x, is mostly an easier-to-use
> abstraction of RDMA interfaces provided by Infiniband and other similar
> hardware. It can also use TCP/IP, but that's not the primary use case.
>
> As a communication layer that builds on top of RDMA-capable hardware,
> LNet cares about such things as whether the CPU driving communication
> is close to the memory used, and also whether it is close to the interface
> used. Even in a 2-socket machine, there are measurable performance
> differences depending on whether the memory an interface connect
> to the same socket or to different sockets. On bigger hardware, like a
> 32-socket machine, the penalties are much more pronounced. At the
> time we found that the QPI links between sockets were a bottleneck
> and that performance cratered if they had to handle too much traffic.
>
> UPI, the successor to QPI is better -- has more bandwidth -- but with
> the CPUs having more and more cores I expect the scaling issues to
> remain similar.
>
> >   - We don't handle topology changes yet except in very minimal ways
> >     (cpts *can* become empty, and that can cause problems).
>
> Yes, this is a known deficiency.
>
> >   - This has been tested on .... great big things.
>
> The basic CPT mechanism predates my involvement with Lustre. I did
> work on making it more NUMA-aware. A 32-socket system was one of
> the primary test beds.
>
> >   - When multi-rails configurations are used (like ethernet-bonding,
> >     but for RDMA), we get ??? closer to theoretical bandwidth.
> >     Without these changes it scales poorly (??)
>
> The basic idea behind muti-rail configurations is that we use several
> Infiniband interfaces and LNet presents them as a single logical interface
> to Lustre. For each message, LNet picks the IB interface it should go across
> using several criteria, including NUMA distance of the interface and how
> busy it is.
>
> With these changes we could get pretty much linear scaling of LNet
> throughput by adding more interfaces.
>
> >   - The down-sides primarily are that we don't auto-configure
> >     perfectly.  This particularly affects hot-plug, but without
> >     hotplug the grouping of cpus and interfaces are focussed
> >     on .... avoiding worst case rather than achieving best case.
>
> Without hotplug the CPT grouping should be tuned to achieve a best
> case in a static configuration.
>
> Adding simple-minded hotplug tolerance (let's not call it support) would
> focus on avoiding truly pathological behaviour.
>
> > I've made up a lot of stuff there.  I'm happy not to pursue this further at the
> > moment, but if anyone would like to enhance my understanding by
> > correcting the worst errors in the above, I wouldn't object :-)
> >
> > Thanks,
> > NeilBrown
>
> PS: the NUMA effects I've mentioned above have been making the news
> lately under other names: they are part of the side channels used in various
> timing based attacks.
>
> Olaf
>
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>
_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180706/df9106cf/attachment-0001.html>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06  0:20                       ` James Simmons
  2018-07-06  0:40                         ` Patrick Farrell
@ 2018-07-06  3:11                         ` NeilBrown
  2018-07-06  5:36                           ` Doug Oucharek
  1 sibling, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-07-06  3:11 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 06 2018, James Simmons wrote:

>> NeilBrown [mailto:neilb at suse.com] wrote:
>> 
>> To help contextualize things: the Lustre code can be decomposed into three parts:
>> 
>> 1) The filesystem proper: Lustre.
>> 2) The communication protocol it uses: LNet.
>> 3) Supporting code used by Lustre and LNet: CFS.
>> 
>> Part of the supporting code is the CPT mechanism, which provides a way to
>> partition the CPUs of a system. These partitions are used to distribute queues,
>> locks, and threads across the system. It was originally introduced years ago, as
>> far as I can tell mainly to deal with certain hot locks: these were converted into
>> read/write locks with one spinlock per CPT.
>> 
>> As a general rule, CPT boundaries should respect node and socket boundaries,
>> but at the higher end, where CPUs have 20+ cores, it may make sense to split
>> a CPUs cores across several CPTs.
>> 
>> > Thanks everyone for your patience in explaining things to me.
>> > I'm beginning to understand what to look for and where to find it.
>> > 
>> > So the answers to Greg's questions:
>> > 
>> >   Where are you reading the host memory NUMA information from?
>> > 
>> >   And why would a filesystem care about this type of thing?  Are you
>> >   going to now mirror what the scheduler does with regards to NUMA
>> >   topology issues?  How are you going to handle things when the topology
>> >   changes?  What systems did you test this on?  What performance
>> >   improvements were seen?  What downsides are there with all of this?
>> > 
>> > 
>> > Are:
>> 
>> >   - NUMA info comes from ACPI or device-tree just like for every one
>> >       else.  Lustre just uses node_distance().
>> 
>> Correct, the standard kernel interfaces for this information are used to
>> obtain it, so ultimately Lustre/LNet uses the same source of truth as
>> everyone else.
>> 
>> >   - The filesystem cares about this because...  It has service
>> >     thread that does part of the work of some filesystem operations
>> >     (handling replies for example) and these are best handled "near"
>> >     the CPU the initiated the request.  Lustre partitions
>> >     all CPUs into "partitions" (cpt) each with a few cores.
>> >     If the request thread and the reply thread are on different
>> >     CPUs but in the same partition, then we get best throughput
>> >     (is that close?)
>> 
>> At the filesystem level, it does indeed seem to help to have the service
>> threads that do work for requests run on a different core that is close to
>> the core that originated the request. So preferably on the same CPU, and
>> on certain multi-core CPUs there are also distance effects between cores.
>> That too is one of the things the CPT mechanism handles.
>
> Their is another very important aspect to why Lustre has a CPU partition 
> layer. At least at the place I work at. While the Linux kernel manages all
> the NUMA nodes and CPU cores Lustre adds the ability for us to specify a 
> subset of everything on the system. The reason is to limit the impact of
> noise on the compute nodes. Noise has a heavy impact on large scale HP
> work loads that can run days or even weeks at a time. Lets take an 
> example system:
>
>                |-------------|     |-------------|
>    |-------|   | NUMA  0     |     | NUMA  1     |   |-------|
>    | eth0  | - |             | --- |             | - | eth1  |      
>    |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
>                |_____________|     |_____________|
>
> In such a system it is possible with the right job schedular to start a 
> large parallel application on NUMA 0/ (CPU0 and CPU1). Normally such
> large parallel applications will communicate between nodes using MPI,
> such as openmpi, which can be configured to use eth0 only. Using the
> CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
> This greatly reducess the noise impact on the application running.
>
> BTW this is one of the reasons ko2iblnd for lustre doesn't use the
> generic RDMA api. The core IB layer doesn't support such isolation.
> At least to my knowledge.

Thanks for that background (and for the separate explanation of how
jitter multiplies when jobs needs to synchronize periodically).

I can see that setting CPU affinity for lustre/lnet worker threads could
be important, and that it can be valuable to tie services to a
particular interface.  I cannot yet see why we need partitions for this,
rather than doing it at the CPU (or NODE) level.
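
(For comparison, doing it at the NODE level with stock kernel primitives
would look roughly like this sketch -- the thread name and helper are made
up:)

#include <linux/kthread.h>
#include <linux/topology.h>
#include <linux/err.h>

/* create a worker thread and confine it to one NUMA node's CPUs */
static struct task_struct *start_worker_on_node(int (*fn)(void *),
						void *arg, int node)
{
	struct task_struct *t;

	t = kthread_create_on_node(fn, arg, node, "wt_node/%d", node);
	if (IS_ERR(t))
		return t;

	kthread_bind_mask(t, cpumask_of_node(node));	/* NODE-level affinity */
	wake_up_process(t);
	return t;
}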

Thanks,
NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180706/b207f4ab/attachment.sig>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06  3:11                         ` NeilBrown
@ 2018-07-06  5:36                           ` Doug Oucharek
  2018-07-06  6:13                             ` NeilBrown
  0 siblings, 1 reply; 66+ messages in thread
From: Doug Oucharek @ 2018-07-06  5:36 UTC (permalink / raw)
  To: lustre-devel

When the CPT code was added to LNet back in 2012, it was to address one primary case: a need for finer grained locking on metadata servers.  LNet used to have global locks, and on metadata servers, which handle many small messages (high IOPS), much of the worker threads' time was spent in spinlocks.  So, CPT configuration was added so locks/resources could be allocated per CPT.  This way, users have control over how they want CPTs to be configured and how they want resources/locks to be divided.  For example, users may want finer grained locking on the metadata servers but not on clients.  Leaving this to be automatically configured by Linux API calls would take this flexibility away from the users who, for HPC, are very knowledgeable about what they want (i.e. we do not want to protect them from themselves).

The CPT support in LNet and LNDs has morphed to encompass more traditional NUMA and core affinity performance improvements.  For example, you can restrict a network interface to a socket (NUMA node) which has better affinity to the PCIe lanes that interface is connected to.  Rather than try to do this sort of thing automatically, we have left it to the user to know what they are doing and configure the CPTs accordingly.
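
(The NUMA node an interface's PCIe lanes hang off is already exposed by the
driver core, so the lookup side of that is trivial -- a sketch with an
invented helper name:)

#include <linux/device.h>
#include <linux/numa.h>

static int iface_home_node(struct device *dev)
{
	int node = dev_to_node(dev);	/* NUMA_NO_NODE if firmware didn't say */

	return node == NUMA_NO_NODE ? 0 : node;
}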

I think the many changes to the CPT code have really clouded its purpose.  In summary, the original purpose was finer grained locking and that needs to be maintained as the IOPS requirements of metadata servers are paramount.

James: The Verbs RDMA interface has very poor support for NUMA/core affinity.  I was going to try to devise some patches to address that but have been too busy on other things.  Perhaps the RDMA maintainer could consider updating it?

Doug

On Jul 5, 2018, at 8:11 PM, NeilBrown <neilb at suse.com<mailto:neilb@suse.com>> wrote:

On Fri, Jul 06 2018, James Simmons wrote:

NeilBrown [mailto:neilb at suse.com] wrote:

To help contextualize things: the Lustre code can be decomposed into three parts:

1) The filesystem proper: Lustre.
2) The communication protocol it uses: LNet.
3) Supporting code used by Lustre and LNet: CFS.

Part of the supporting code is the CPT mechanism, which provides a way to
partition the CPUs of a system. These partitions are used to distribute queues,
locks, and threads across the system. It was originally introduced years ago, as
far as I can tell mainly to deal with certain hot locks: these were converted into
read/write locks with one spinlock per CPT.

As a general rule, CPT boundaries should respect node and socket boundaries,
but at the higher end, where CPUs have 20+ cores, it may make sense to split
a CPUs cores across several CPTs.

Thanks everyone for your patience in explaining things to me.
I'm beginning to understand what to look for and where to find it.

So the answers to Greg's questions:

 Where are you reading the host memory NUMA information from?

 And why would a filesystem care about this type of thing?  Are you
 going to now mirror what the scheduler does with regards to NUMA
 topology issues?  How are you going to handle things when the topology
 changes?  What systems did you test this on?  What performance
 improvements were seen?  What downsides are there with all of this?


Are:

 - NUMA info comes from ACPI or device-tree just like for every one
     else.  Lustre just uses node_distance().

Correct, the standard kernel interfaces for this information are used to
obtain it, so ultimately Lustre/LNet uses the same source of truth as
everyone else.

 - The filesystem cares about this because...  It has service
   thread that does part of the work of some filesystem operations
   (handling replies for example) and these are best handled "near"
   the CPU the initiated the request.  Lustre partitions
   all CPUs into "partitions" (cpt) each with a few cores.
   If the request thread and the reply thread are on different
   CPUs but in the same partition, then we get best throughput
   (is that close?)

At the filesystem level, it does indeed seem to help to have the service
threads that do work for requests run on a different core that is close to
the core that originated the request. So preferably on the same CPU, and
on certain multi-core CPUs there are also distance effects between cores.
That too is one of the things the CPT mechanism handles.

Their is another very important aspect to why Lustre has a CPU partition
layer. At least at the place I work at. While the Linux kernel manages all
the NUMA nodes and CPU cores Lustre adds the ability for us to specify a
subset of everything on the system. The reason is to limit the impact of
noise on the compute nodes. Noise has a heavy impact on large scale HP
work loads that can run days or even weeks at a time. Lets take an
example system:

              |-------------|     |-------------|
  |-------|   | NUMA  0     |     | NUMA  1     |   |-------|
  | eth0  | - |             | --- |             | - | eth1  |
  |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
              |_____________|     |_____________|

In such a system it is possible with the right job schedular to start a
large parallel application on NUMA 0/ (CPU0 and CPU1). Normally such
large parallel applications will communicate between nodes using MPI,
such as openmpi, which can be configured to use eth0 only. Using the
CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
This greatly reducess the noise impact on the application running.

BTW this is one of the reasons ko2iblnd for lustre doesn't use the
generic RDMA api. The core IB layer doesn't support such isolation.
At least to my knowledge.

Thanks for that background (and for the separate explanation of how
jitter multiplies when jobs needs to synchronize periodically).

I can see that setting CPU affinity for lustre/lnet worker threads could
be important, and that it can be valuable to tie services to a
particular interface.  I cannot yet see why we need partitions for this,
rather that doing it at the CPU (or NODE) level.

Thanks,
NeilBrown

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180706/ba357d68/attachment-0001.html>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06  5:36                           ` Doug Oucharek
@ 2018-07-06  6:13                             ` NeilBrown
  2018-07-06 15:57                               ` James Simmons
  0 siblings, 1 reply; 66+ messages in thread
From: NeilBrown @ 2018-07-06  6:13 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jul 06 2018, Doug Oucharek wrote:

> When the CPT code was added to LNet back in 2012, it was to address
> one primary case: a need for finer grained locking on metadata
> servers.  LNet used to have global locks and metadata servers, which
> do many small messages (high IOPS), much time in the worker threads
> was spent in spinlocks.  So, CPT configuration was added so
> locks/resources could be allocated per CPT.  This way, users have
> control over how they want CPTs to be configured and how they want
> resources/locks to be divided.  For example, users may want finer
> grained locking on the metadata servers but not on clients.  Leaving
> this to be automatically configured by Linux API calls would take this
> flexibility away from the users who, for HPC, are very knowledgable
> about what they want (i.e. we do not want to protect them from
> themselves).
>
> The CPT support in LNet and LNDs has morphed to encompass more
> traditional NUMA and core affinity performance improvements.  For
> example, you can restrict a network interface to a socket (NUMA node)
> which has better affinity to the PCIe lanes that interface is
> connected to.  Rather than try to do this sort of thing automatically,
> we have left it to the user to know what they are doing and configure
> the CPTs accordingly.
>
> I think the many changes to the CPT code has realty clouded its
> purpose.  In summary, the original purpose was finer grained locking
> and that needs to be maintained as the IOPS requirements of metadata
> servers is paramount.

Thanks for the explanation.
I definitely get that fine-grained locking is a good thing.  Lustre is
not alone in this of course.
Even better than fine-grained locking is no locking.  That is not often
possible, but this
  https://github.com/neilbrown/linux/commit/ac3f8fd6e61b245fa9c14e3164203c1211c5ef6b

is an example of doing exactly that.

For the reader/writer usage of CPT locks, RCU is a better approach if it
can be made to work (usually it can) - and it scales even better.

When I was digging through the usage of locks I saw some hash tables.
It seems that a lock protected a whole table.  It is usually sufficient
for the lock to just protect a single chain (bit spin-locks can easily
store one lock per chain) and then only for writes - RCU discipline can
allow reads to proceed with only rcu_read_lock().
Would we still need per-CPT tables once that was in place?  I don't know
yet, though per-node seems likely to be sufficient when locking is per-chain.
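
(A sketch of that per-chain scheme, with invented names: hlist_bl keeps a
bit spinlock in bit 0 of each chain head, and lookups need only
rcu_read_lock().)

#include <linux/list_bl.h>
#include <linux/rculist_bl.h>
#include <linux/hash.h>
#include <linux/types.h>

#define OBJ_HASH_BITS	7

struct obj {
	struct hlist_bl_node o_hash;
	u64		     o_cookie;
};

static struct hlist_bl_head obj_hash[1 << OBJ_HASH_BITS];

/* insert: lock only the one chain being modified */
static void obj_insert(struct obj *obj)
{
	struct hlist_bl_head *head =
		&obj_hash[hash_64(obj->o_cookie, OBJ_HASH_BITS)];

	hlist_bl_lock(head);
	hlist_bl_add_head_rcu(&obj->o_hash, head);
	hlist_bl_unlock(head);
}

/* lookup: no spinlock at all, just RCU (refcounting omitted in this sketch) */
static struct obj *obj_find(u64 cookie)
{
	struct hlist_bl_head *head = &obj_hash[hash_64(cookie, OBJ_HASH_BITS)];
	struct hlist_bl_node *pos;
	struct obj *obj, *found = NULL;

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(obj, pos, head, o_hash) {
		if (obj->o_cookie == cookie) {
			found = obj;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}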

I certainly wouldn't discard CPTs without replacing them with something
better.  Near the top of my list for when I return from vacation
(leaving in a couple of days) will be to look closely at the current
fine-grained locking that you have helped me to see more clearly, and
see if I can make it even better.

Thanks,
NeilBrown

>
> James: The Verbs RDMA interface has very poor support for NUMA/core affinity.  I was going to try to devise some patches to address that but have been too busy on other things.  Perhaps the RDMA maintainer could consider updating it?
>
> Doug
>
> On Jul 5, 2018, at 8:11 PM, NeilBrown <neilb at suse.com<mailto:neilb@suse.com>> wrote:
>
> On Fri, Jul 06 2018, James Simmons wrote:
>
> NeilBrown [mailto:neilb at suse.com] wrote:
>
> To help contextualize things: the Lustre code can be decomposed into three parts:
>
> 1) The filesystem proper: Lustre.
> 2) The communication protocol it uses: LNet.
> 3) Supporting code used by Lustre and LNet: CFS.
>
> Part of the supporting code is the CPT mechanism, which provides a way to
> partition the CPUs of a system. These partitions are used to distribute queues,
> locks, and threads across the system. It was originally introduced years ago, as
> far as I can tell mainly to deal with certain hot locks: these were converted into
> read/write locks with one spinlock per CPT.
>
> As a general rule, CPT boundaries should respect node and socket boundaries,
> but at the higher end, where CPUs have 20+ cores, it may make sense to split
> a CPUs cores across several CPTs.
>
> Thanks everyone for your patience in explaining things to me.
> I'm beginning to understand what to look for and where to find it.
>
> So the answers to Greg's questions:
>
>  Where are you reading the host memory NUMA information from?
>
>  And why would a filesystem care about this type of thing?  Are you
>  going to now mirror what the scheduler does with regards to NUMA
>  topology issues?  How are you going to handle things when the topology
>  changes?  What systems did you test this on?  What performance
>  improvements were seen?  What downsides are there with all of this?
>
>
> Are:
>
>  - NUMA info comes from ACPI or device-tree just like for every one
>      else.  Lustre just uses node_distance().
>
> Correct, the standard kernel interfaces for this information are used to
> obtain it, so ultimately Lustre/LNet uses the same source of truth as
> everyone else.
>
>  - The filesystem cares about this because...  It has service
>    thread that does part of the work of some filesystem operations
>    (handling replies for example) and these are best handled "near"
>    the CPU the initiated the request.  Lustre partitions
>    all CPUs into "partitions" (cpt) each with a few cores.
>    If the request thread and the reply thread are on different
>    CPUs but in the same partition, then we get best throughput
>    (is that close?)
>
> At the filesystem level, it does indeed seem to help to have the service
> threads that do work for requests run on a different core that is close to
> the core that originated the request. So preferably on the same CPU, and
> on certain multi-core CPUs there are also distance effects between cores.
> That too is one of the things the CPT mechanism handles.
>
> Their is another very important aspect to why Lustre has a CPU partition
> layer. At least at the place I work at. While the Linux kernel manages all
> the NUMA nodes and CPU cores Lustre adds the ability for us to specify a
> subset of everything on the system. The reason is to limit the impact of
> noise on the compute nodes. Noise has a heavy impact on large scale HP
> work loads that can run days or even weeks at a time. Lets take an
> example system:
>
>               |-------------|     |-------------|
>   |-------|   | NUMA  0     |     | NUMA  1     |   |-------|
>   | eth0  | - |             | --- |             | - | eth1  |
>   |_______|   | CPU0  CPU1  |     | CPU2  CPU3  |   |_______|
>               |_____________|     |_____________|
>
> In such a system it is possible with the right job schedular to start a
> large parallel application on NUMA 0/ (CPU0 and CPU1). Normally such
> large parallel applications will communicate between nodes using MPI,
> such as openmpi, which can be configured to use eth0 only. Using the
> CPT layer in lustre we can isolate lustre to NUMA 1 and use only eth1.
> This greatly reducess the noise impact on the application running.
>
> BTW this is one of the reasons ko2iblnd for lustre doesn't use the
> generic RDMA api. The core IB layer doesn't support such isolation.
> At least to my knowledge.
>
> Thanks for that background (and for the separate explanation of how
> jitter multiplies when jobs needs to synchronize periodically).
>
> I can see that setting CPU affinity for lustre/lnet worker threads could
> be important, and that it can be valuable to tie services to a
> particular interface.  I cannot yet see why we need partitions for this,
> rather that doing it at the CPU (or NODE) level.
>
> Thanks,
> NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180706/e1e273c6/attachment.sig>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06  6:13                             ` NeilBrown
@ 2018-07-06 15:57                               ` James Simmons
  2018-07-06 16:04                                 ` Patrick Farrell
  0 siblings, 1 reply; 66+ messages in thread
From: James Simmons @ 2018-07-06 15:57 UTC (permalink / raw)
  To: lustre-devel


> > When the CPT code was added to LNet back in 2012, it was to address
> > one primary case: a need for finer grained locking on metadata
> > servers.  LNet used to have global locks and metadata servers, which
> > do many small messages (high IOPS), much time in the worker threads
> > was spent in spinlocks.  So, CPT configuration was added so
> > locks/resources could be allocated per CPT.  This way, users have
> > control over how they want CPTs to be configured and how they want
> > resources/locks to be divided.  For example, users may want finer
> > grained locking on the metadata servers but not on clients.  Leaving
> > this to be automatically configured by Linux API calls would take this
> > flexibility away from the users who, for HPC, are very knowledgable
> > about what they want (i.e. we do not want to protect them from
> > themselves).
> >
> > The CPT support in LNet and LNDs has morphed to encompass more
> > traditional NUMA and core affinity performance improvements.  For
> > example, you can restrict a network interface to a socket (NUMA node)
> > which has better affinity to the PCIe lanes that interface is
> > connected to.  Rather than try to do this sort of thing automatically,
> > we have left it to the user to know what they are doing and configure
> > the CPTs accordingly.
> >
> > I think the many changes to the CPT code has realty clouded its
> > purpose.  In summary, the original purpose was finer grained locking
> > and that needs to be maintained as the IOPS requirements of metadata
> > servers is paramount.
> 
> Thanks for the explanation.
> I definitely get that fine-grained locking is a good thing.  Lustre is
> not alone in this of course.
> Even better than fine-grained locking is no locking.  That is not often
> possible, but this
>   https://github.com/neilbrown/linux/commit/ac3f8fd6e61b245fa9c14e3164203c1211c5ef6b
> 
> is an example of doing exactly that.
> 
> For the read/writer usage of CPT locks, RCU is a better approach if it
> can be made to work (usually it can) - and it scales even better.
> 
> When I was digging through the usage of locks I saw some hash tables.
> It seems that a lock protected a whole table.  It is usually sufficient
> for the lock to just protect a single chain (bit spin-locks can easily
> store one lock per chain) and then only for writes - RCU discipline can
> allow reads to proceed with only rcu_read_lock().
> Would we still need per-CPT tables once that was in place?  I don't know
> yet, though per-node seems likely to be sufficient when locking is per-chain.
> 
> I certainly wouldn't discard CPTs without replacing them with something
> better.  Near the top of my list for when I return from vacation
> (leaving in a couple of days) will be to look closely at the current
> fine-grained locking that you have helped me to see more clearly, and
> see if I can make it even better.

If RCU can provide better scaling then it's best to replace CPT handling in
those cases. Let's land the Multi-Rail stuff first since it makes the most
heavy use of the CPT code. From there we can get a good idea of how to
move forward. I don't think we can easily abandon the CPT infrastructure
in general since we need it for partitioning to reduce noise. What would
be ideal is to integrate the partitioning work into the general Linux kernel.
While Lustre attempts to reduce noise on nodes, the rest of the kernel
doesn't. If the Linux kernel supported this it would be a big win for
HPC systems. The monster HPC systems today will be general hardware 5+
years down the road.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
  2018-07-06 15:57                               ` James Simmons
@ 2018-07-06 16:04                                 ` Patrick Farrell
  0 siblings, 0 replies; 66+ messages in thread
From: Patrick Farrell @ 2018-07-06 16:04 UTC (permalink / raw)
  To: lustre-devel

Yeah, but they still won't really care much about noise.  Noise is really only a big problem if you're compounding it like HPC jobs do, otherwise it's negligible.  You worry about average time and maybe worst case - not how noisy the average is, unless it suffers from wide excursions.  Lots of small excursions in execution time ("noise/jitter") don't matter.  (Unless you're an HPC job.)

The real time people care more about noise, though I believe they're still more concerned about worst cases and bounds than jitter.  Maybe some real time people are intensely worried about jitter for some use cases.

So this concern is not going mainstream even if the systems do, and the scheduler behavior required to minimize noise is sometimes not the same behavior required to improve responsiveness, reduce power consumption, etc.

Just food for thought.

On 7/6/18, 10:57 AM, "lustre-devel on behalf of James Simmons" <lustre-devel-bounces at lists.lustre.org on behalf of jsimmons@infradead.org> wrote:

    
    > > When the CPT code was added to LNet back in 2012, it was to address
    > > one primary case: a need for finer grained locking on metadata
    > > servers.  LNet used to have global locks and metadata servers, which
    > > do many small messages (high IOPS), much time in the worker threads
    > > was spent in spinlocks.  So, CPT configuration was added so
    > > locks/resources could be allocated per CPT.  This way, users have
    > > control over how they want CPTs to be configured and how they want
    > > resources/locks to be divided.  For example, users may want finer
    > > grained locking on the metadata servers but not on clients.  Leaving
    > > this to be automatically configured by Linux API calls would take this
    > > flexibility away from the users who, for HPC, are very knowledgable
    > > about what they want (i.e. we do not want to protect them from
    > > themselves).
    > >
    > > The CPT support in LNet and LNDs has morphed to encompass more
    > > traditional NUMA and core affinity performance improvements.  For
    > > example, you can restrict a network interface to a socket (NUMA node)
    > > which has better affinity to the PCIe lanes that interface is
    > > connected to.  Rather than try to do this sort of thing automatically,
    > > we have left it to the user to know what they are doing and configure
    > > the CPTs accordingly.
    > >
    > > I think the many changes to the CPT code has realty clouded its
    > > purpose.  In summary, the original purpose was finer grained locking
    > > and that needs to be maintained as the IOPS requirements of metadata
    > > servers is paramount.
    > 
    > Thanks for the explanation.
    > I definitely get that fine-grained locking is a good thing.  Lustre is
    > not alone in this of course.
    > Even better than fine-grained locking is no locking.  That is not often
    > possible, but this
    >   https://github.com/neilbrown/linux/commit/ac3f8fd6e61b245fa9c14e3164203c1211c5ef6b
    > 
    > is an example of doing exactly that.
    > 
    > For the read/writer usage of CPT locks, RCU is a better approach if it
    > can be made to work (usually it can) - and it scales even better.
    > 
    > When I was digging through the usage of locks I saw some hash tables.
    > It seems that a lock protected a whole table.  It is usually sufficient
    > for the lock to just protect a single chain (bit spin-locks can easily
    > store one lock per chain) and then only for writes - RCU discipline can
    > allow reads to proceed with only rcu_read_lock().
    > Would we still need per-CPT tables once that was in place?  I don't know
    > yet, though per-node seems likely to be sufficient when locking is per-chain.
    > 
    > I certainly wouldn't discard CPTs without replacing them with something
    > better.  Near the top of my list for when I return from vacation
    > (leaving in a couple of days) will be to look closely at the current
    > fine-grained locking that you have helped me to see more clearly, and
    > see if I can make it even better.
    
    If RCU can provide better scaling then its best to replace CPT handling in
    those cases. Lets land the Mult-Rail stuff first since it makes the most
    heavy use of the CPT code. From there we can get a good idea of how to
    move forward. I don't think we can easily abandon the CPT infrastructure
    in general since we need it for partitioning to reduce noise. What would
    be ideal is integrate the partitoning work to the general linux kernel.
    While lustre attempts to reduce noise on nodes the rest of the kernel 
    doesn't. If the linux kernel supported this it would be a big win for
    HPC systems. The monster HPC systems today will be general hardware 5+
    years down the road.
    _______________________________________________
    lustre-devel mailing list
    lustre-devel at lists.lustre.org
    http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
    

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-07-06 16:04 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-24 21:20 [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 01/26] staging: lustre: libcfs: remove useless CPU partition code James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 02/26] staging: lustre: libcfs: rename variable i to cpu James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 03/26] staging: lustre: libcfs: properly handle failure cases in SMP code James Simmons
2018-06-25  0:20   ` NeilBrown
2018-06-26  0:33     ` James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 04/26] staging: lustre: libcfs: replace MAX_NUMNODES with nr_node_ids James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 05/26] staging: lustre: libcfs: remove excess space James Simmons
2018-06-25  0:35   ` NeilBrown
2018-06-26  0:55     ` James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 06/26] staging: lustre: libcfs: replace num_possible_cpus() with nr_cpu_ids James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support James Simmons
2018-06-25  0:39   ` NeilBrown
2018-06-25 18:22     ` Doug Oucharek
2018-06-27  2:44       ` NeilBrown
2018-06-27 12:42         ` Patrick Farrell
2018-06-28  1:17           ` NeilBrown
2018-06-29 17:19             ` Doug Oucharek
2018-06-29 17:27               ` Amir Shehata
2018-06-29 17:47                 ` Weber, Olaf
2018-07-04  5:22                   ` NeilBrown
2018-07-04  8:40                     ` Weber, Olaf
2018-07-05  1:57                       ` NeilBrown
2018-07-06  0:20                       ` James Simmons
2018-07-06  0:40                         ` Patrick Farrell
2018-07-06  3:11                         ` NeilBrown
2018-07-06  5:36                           ` Doug Oucharek
2018-07-06  6:13                             ` NeilBrown
2018-07-06 15:57                               ` James Simmons
2018-07-06 16:04                                 ` Patrick Farrell
2018-06-26  0:39     ` James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 08/26] staging: lustre: libcfs: add cpu distance handling James Simmons
2018-06-25  0:48   ` NeilBrown
2018-06-26  1:15     ` James Simmons
2018-06-27  2:50       ` NeilBrown
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 09/26] staging: lustre: libcfs: use distance in cpu and node handling James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 10/26] staging: lustre: libcfs: provide debugfs files for distance handling James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 11/26] staging: lustre: libcfs: invert error handling for cfs_cpt_table_print James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 12/26] staging: lustre: libcfs: fix libcfs_cpu coding style James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 13/26] staging: lustre: libcfs: use int type for CPT identification James Simmons
2018-06-25  0:57   ` NeilBrown
2018-06-26  0:42     ` James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 14/26] staging: lustre: libcfs: rename i to node for cfs_cpt_set_nodemask James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 15/26] staging: lustre: libcfs: rename i to cpu for cfs_cpt_bind James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 16/26] staging: lustre: libcfs: rename cpumask_var_t variables to *_mask James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 17/26] staging: lustre: libcfs: update debug messages James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 18/26] staging: lustre: libcfs: make tolerant to offline CPUs and empty NUMA nodes James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 19/26] staging: lustre: libcfs: report NUMA node instead of just node James Simmons
2018-06-25  1:09   ` NeilBrown
2018-06-25  1:11     ` NeilBrown
2018-06-25 22:57       ` James Simmons
2018-06-26  0:54     ` James Simmons
2018-06-27  2:49       ` NeilBrown
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 20/26] staging: lustre: libcfs: update debug messages in CPT code James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 21/26] staging: lustre: libcfs: rework CPU pattern parsing code James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 22/26] staging: lustre: libcfs: change CPT estimate algorithm James Simmons
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 23/26] staging: lustre: ptlrpc: use current CPU instead of hardcoded 0 James Simmons
2018-06-25  2:38   ` NeilBrown
2018-06-25 22:51     ` James Simmons
2018-06-26  0:34       ` NeilBrown
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 24/26] staging: lustre: libcfs: restore debugfs table reporting for UMP James Simmons
2018-06-25  1:27   ` NeilBrown
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 25/26] staging: lustre: libcfs: make cfs_cpt_tab a static structure James Simmons
2018-06-25  1:32   ` NeilBrown
2018-06-24 21:20 ` [lustre-devel] [PATCH v3 26/26] staging: lustre: libcfs: restore UMP support James Simmons
2018-06-25  1:33 ` [lustre-devel] [PATCH v3 00/26] staging: lustre: libcfs: SMP rework NeilBrown
