* [RFC workqueue/driver-core PATCH 0/5] Add NUMA aware async_schedule calls
@ 2018-09-26 21:51 ` Alexander Duyck
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This patch set improves the locality of the async_schedule calls used to
provide deferred initialization.

This patch set originally started out focused on just the one call to
async_schedule_domain in the nvdimm tree that was being used to defer the
device_add call. However, after doing some digging I realized the scope of
this was much broader than I had originally planned. As such, I went
through and reworked the underlying infrastructure, down to replacing the
queue_work call itself with a function of my own, and opted to provide a
NUMA-aware solution that would work for a broader audience.

I am submitting this as an RFC to figure out just how far off I am from
where I need to be on this patch set and to determine which tree I should
ultimately be submitting this to.

---

Alexander Duyck (5):
      workqueue: Provide queue_work_near to queue work near a given NUMA node
      async: Add support for queueing on specific NUMA node
      driver core: Probe devices asynchronously instead of the driver
      driver core: Use new async_schedule_dev command
      nvdimm: Schedule device registration on node local to the device


 drivers/base/bus.c        |   23 +-------
 drivers/base/dd.c         |   44 +++++++++++++++
 drivers/base/power/main.c |   12 ++--
 drivers/nvdimm/bus.c      |   11 +++-
 include/linux/async.h     |   27 ++++++++-
 include/linux/workqueue.h |    2 +
 kernel/async.c            |   62 ++++++++++++++++------
 kernel/workqueue.c        |  129 ++++++++++++++++++++++++++++++++++++++++++++-
 8 files changed, 258 insertions(+), 52 deletions(-)

--
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

* [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 21:51 ` Alexander Duyck
@ 2018-09-26 21:51   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This patch provides a new function, queue_work_near, which is meant to
schedule work on an unbound CPU as near as possible to the requested NUMA
node. The main motivation for this is to help asynchronous init improve
boot times for devices that are local to a specific node.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |  129 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 60d673e15632..1f9f0a65437b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -463,6 +463,8 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
+extern bool queue_work_near(int node, struct workqueue_struct *wq,
+			    struct work_struct *work);
 extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e..a971d3c4096e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/nmi.h>
+#include <linux/device.h>
 
 #include "workqueue_internal.h"
 
@@ -1332,8 +1333,9 @@ static bool is_chained_work(struct workqueue_struct *wq)
  * by wq_unbound_cpumask.  Otherwise, round robin among the allowed ones to
  * avoid perturbing sensitive tasks.
  */
-static int wq_select_unbound_cpu(int cpu)
+static int wq_select_unbound_cpu(void)
 {
+	int cpu = raw_smp_processor_id();
 	static bool printed_dbg_warning;
 	int new_cpu;
 
@@ -1385,7 +1387,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 		return;
 retry:
 	if (req_cpu == WORK_CPU_UNBOUND)
-		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
+		cpu = wq_select_unbound_cpu();
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))
@@ -1492,6 +1494,129 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
+/**
+ * workqueue_select_unbound_cpu_near - Select an unbound CPU based on NUMA node
+ * @node: NUMA node ID that we want to bind a CPU from
+ *
+ * This function will attempt to find a "random" cpu available to the unbound
+ * workqueues on a given node. If there are no CPUs available on the given
+ * node it will return WORK_CPU_UNBOUND indicating that we should just
+ * schedule to any available CPU if we need to schedule this work.
+ */
+static int workqueue_select_unbound_cpu_near(int node)
+{
+	const struct cpumask *wq_cpumask, *node_cpumask;
+	int cpu;
+
+	/* No point in doing this if NUMA isn't enabled for workqueues */
+	if (!wq_numa_enabled)
+		return WORK_CPU_UNBOUND;
+
+	/* delay binding to CPU if node is not valid or online */
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		return WORK_CPU_UNBOUND;
+
+	/* If wq_unbound_cpumask is empty then just use cpu_online_mask */
+	wq_cpumask = cpumask_empty(wq_unbound_cpumask) ? cpu_online_mask :
+							 wq_unbound_cpumask;
+
+	/*
+	 * If node has no CPUs, or no CPUs in the unbound cpumask then we
+	 * need to try and find the nearest node that does have CPUs in the
+	 * unbound cpumask.
+	 */
+	if (!nr_cpus_node(node) ||
+	    !cpumask_intersects(cpumask_of_node(node), wq_cpumask)) {
+		int min_val = INT_MAX, best_node = NUMA_NO_NODE;
+		int this_node, val;
+
+		for_each_online_node(this_node) {
+			if (this_node == node)
+				continue;
+
+			val = node_distance(node, this_node);
+			if (min_val < val)
+				continue;
+
+			if (!nr_cpus_node(this_node) ||
+			    !cpumask_intersects(cpumask_of_node(this_node),
+						wq_cpumask))
+				continue;
+
+			best_node = this_node;
+			min_val = val;
+		}
+
+		/* If we failed to find a close node just defer */
+		if (best_node == NUMA_NO_NODE)
+			return WORK_CPU_UNBOUND;
+
+		/* update node to reflect optimal value */
+		node = best_node;
+	}
+
+
+	/* Use local node/cpu if we are already there */
+	cpu = raw_smp_processor_id();
+	if (node == cpu_to_node(cpu) &&
+	    cpumask_test_cpu(cpu, wq_unbound_cpumask))
+		return cpu;
+
+	/*
+	 * Reuse the same value as wq_select_unbound_cpu above to prevent
+	 * us from mapping the same CPU each time. The impact to
+	 * wq_select_unbound_cpu should be minimal since the above function
+	 * only uses it when it has to load balance on remote CPUs similar
+	 * to what I am doing here.
+	 */
+	cpu = __this_cpu_read(wq_rr_cpu_last);
+	node_cpumask = cpumask_of_node(node);
+	cpu = cpumask_next_and(cpu, wq_cpumask, node_cpumask);
+	if (unlikely(cpu >= nr_cpu_ids)) {
+		cpu = cpumask_first_and(wq_cpumask, node_cpumask);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return WORK_CPU_UNBOUND;
+	}
+	__this_cpu_write(wq_rr_cpu_last, cpu);
+
+	return cpu;
+}
+
+/**
+ * queue_work_near - queue work on the nearest unbound cpu to a given NUMA node
+ * @node: NUMA node that we are targeting the work for
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * We queue the work to a specific CPU based on a given NUMA node, the
+ * caller must ensure it can't go away.
+ *
+ * This function will only make a best effort attempt at getting this onto
+ * the right NUMA node. If no node is requested or the requested node is
+ * offline then we just fall back to standard queue_work behavior.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_work_near(int node, struct workqueue_struct *wq,
+		     struct work_struct *work)
+{
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		int cpu = workqueue_select_unbound_cpu_near(node);
+
+		__queue_work(cpu, wq, work);
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_near);
+
 void delayed_work_timer_fn(struct timer_list *t)
 {
 	struct delayed_work *dwork = from_timer(dwork, t, timer);




* [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
  2018-09-26 21:51 ` Alexander Duyck
@ 2018-09-26 21:51   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This patch introduces four new variants of the async_schedule_ functions
that allow scheduling on a specific NUMA node.

The first two functions are async_schedule_near and
async_schedule_near_domain, which map to async_schedule and
async_schedule_domain but provide NUMA-node-specific functionality. They
replace the original functions, which become inline wrappers that call the
new functions while passing NUMA_NO_NODE.

The other two functions are async_schedule_dev and
async_schedule_dev_domain, which provide NUMA-specific functionality when
the data member passed is a device and that device has a NUMA node other
than NUMA_NO_NODE.

The main motivation behind this is the need to schedule device-specific
init work on specific NUMA nodes in order to improve the performance of
memory initialization.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/async.h |   27 +++++++++++++++++++--
 kernel/async.c        |   62 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 69 insertions(+), 20 deletions(-)

diff --git a/include/linux/async.h b/include/linux/async.h
index 6b0226bdaadc..abf3ee9102df 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -14,6 +14,9 @@
 
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/numa.h>
+
+struct device;
 
 typedef u64 async_cookie_t;
 typedef void (*async_func_t) (void *data, async_cookie_t cookie);
@@ -37,9 +40,27 @@ struct async_domain {
 	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
 				      .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_t func, void *data);
-extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
-					    struct async_domain *domain);
+async_cookie_t async_schedule_near(async_func_t func, void *data,
+				   int node);
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node,
+					  struct async_domain *domain);
+
+static inline async_cookie_t async_schedule(async_func_t func, void *data)
+{
+	return async_schedule_near(func, data, NUMA_NO_NODE);
+}
+
+static inline async_cookie_t
+async_schedule_domain(async_func_t func, void *data,
+		      struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, data, NUMA_NO_NODE, domain);
+}
+
+async_cookie_t async_schedule_dev(async_func_t func, struct device *dev);
+async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
+					 struct async_domain *domain);
 void async_unregister_domain(struct async_domain *domain);
 extern void async_synchronize_full(void);
 extern void async_synchronize_full_domain(struct async_domain *domain);
diff --git a/kernel/async.c b/kernel/async.c
index a893d6170944..13fcf222b89a 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -56,6 +56,7 @@ synchronization with the async_synchronize_full() function, before returning
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/device.h>
 
 #include "workqueue_internal.h"
 
@@ -149,7 +150,21 @@ static void async_run_entry_fn(struct work_struct *work)
 	wake_up(&async_done);
 }
 
-static async_cookie_t __async_schedule(async_func_t func, void *data, struct async_domain *domain)
+/**
+ * async_schedule_near_domain - schedule a function for asynchronous execution within a certain domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.  A
+ * synchronization domain is specified via @domain.  Note: This function
+ * may be called from atomic or non-atomic contexts.
+ */
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node, struct async_domain *domain)
 {
 	struct async_entry *entry;
 	unsigned long flags;
@@ -195,43 +210,56 @@ static async_cookie_t __async_schedule(async_func_t func, void *data, struct asy
 	current->flags |= PF_USED_ASYNC;
 
 	/* schedule for execution */
-	queue_work(system_unbound_wq, &entry->work);
+	queue_work_near(node, system_unbound_wq, &entry->work);
 
 	return newcookie;
 }
+EXPORT_SYMBOL_GPL(async_schedule_near_domain);
 
 /**
- * async_schedule - schedule a function for asynchronous execution
+ * async_schedule_near - schedule a function for asynchronous execution
  * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
  *
  * Returns an async_cookie_t that may be used for checkpointing later.
  * Note: This function may be called from atomic or non-atomic contexts.
  */
-async_cookie_t async_schedule(async_func_t func, void *data)
+async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
 {
-	return __async_schedule(func, data, &async_dfl_domain);
+	return async_schedule_near_domain(func, data, node, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule);
+EXPORT_SYMBOL_GPL(async_schedule_near);
 
 /**
- * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
+ * async_schedule_dev_domain - schedule a function for asynchronous execution within a certain domain
  * @func: function to execute asynchronously
- * @data: data pointer to pass to the function
+ * @dev: device that we are scheduling this work for
  * @domain: the domain
  *
- * Returns an async_cookie_t that may be used for checkpointing later.
- * @domain may be used in the async_synchronize_*_domain() functions to
- * wait within a certain synchronization domain rather than globally.  A
- * synchronization domain is specified via @domain.  Note: This function
- * may be called from atomic or non-atomic contexts.
+ * Device specific version of async_schedule_near_domain that provides some
+ * NUMA awareness based on the device node.
+ */
+async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
+					 struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
+}
+EXPORT_SYMBOL_GPL(async_schedule_dev_domain);
+
+/**
+ * async_schedule_dev - schedule a function for asynchronous execution
+ * @func: function to execute asynchronously
+ * @dev: device that we are scheduling this work for
+ *
+ * Device specific version of async_schedule_near that provides some NUMA
+ * awareness based on the device node.
  */
-async_cookie_t async_schedule_domain(async_func_t func, void *data,
-				     struct async_domain *domain)
+async_cookie_t async_schedule_dev(async_func_t func, struct device *dev)
 {
-	return __async_schedule(func, data, domain);
+	return async_schedule_dev_domain(func, dev, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule_domain);
+EXPORT_SYMBOL_GPL(async_schedule_dev);
 
 /**
  * async_synchronize_full - synchronize all asynchronous function calls

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
@ 2018-09-26 21:51   ` Alexander Duyck
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, dave.jiang, rafael, vishal.l.verma, jiangshanlai,
	pavel, zwisler, dan.j.williams

This patch introduces four new variants of the async_schedule_ functions
that allow scheduling on a specific NUMA node.

The first two are async_schedule_near and async_schedule_near_domain,
which map to async_schedule and async_schedule_domain but add NUMA node
awareness. The original functions are now inline wrappers that call the
new functions while passing NUMA_NO_NODE.

The other two are async_schedule_dev and async_schedule_dev_domain,
which take a device as the data argument and provide NUMA awareness
whenever that device has a node other than NUMA_NO_NODE.

The main motivation is the need to schedule device-specific init work
on a specific NUMA node in order to improve the performance of memory
initialization.
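As a rough usage sketch (the foo_* names below are invented for
illustration; only the async_schedule_* helpers come from this series),
a caller wanting node-local deferred init might do:

```c
/* Illustrative only: foo_deferred_init/foo_add_device are hypothetical. */
#include <linux/async.h>
#include <linux/device.h>

static void foo_deferred_init(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	/* Runs on a workqueue worker near dev_to_node(dev). */
	dev_info(dev, "deferred init on node %d\n", dev_to_node(dev));
}

static void foo_add_device(struct device *dev)
{
	/*
	 * Behaves like plain async_schedule() when the device has no
	 * node assigned (dev_to_node(dev) == NUMA_NO_NODE).
	 */
	async_schedule_dev(foo_deferred_init, dev);
}
```

async_schedule_near() provides the same node hint for callers that have
a node number but no struct device.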

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/async.h |   27 +++++++++++++++++++--
 kernel/async.c        |   62 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 69 insertions(+), 20 deletions(-)

diff --git a/include/linux/async.h b/include/linux/async.h
index 6b0226bdaadc..abf3ee9102df 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -14,6 +14,9 @@
 
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/numa.h>
+
+struct device;
 
 typedef u64 async_cookie_t;
 typedef void (*async_func_t) (void *data, async_cookie_t cookie);
@@ -37,9 +40,27 @@ struct async_domain {
 	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
 				      .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_t func, void *data);
-extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
-					    struct async_domain *domain);
+async_cookie_t async_schedule_near(async_func_t func, void *data,
+				   int node);
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node,
+					  struct async_domain *domain);
+
+static inline async_cookie_t async_schedule(async_func_t func, void *data)
+{
+	return async_schedule_near(func, data, NUMA_NO_NODE);
+}
+
+static inline async_cookie_t
+async_schedule_domain(async_func_t func, void *data,
+		      struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, data, NUMA_NO_NODE, domain);
+}
+
+async_cookie_t async_schedule_dev(async_func_t func, struct device *dev);
+async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
+					 struct async_domain *domain);
 void async_unregister_domain(struct async_domain *domain);
 extern void async_synchronize_full(void);
 extern void async_synchronize_full_domain(struct async_domain *domain);
diff --git a/kernel/async.c b/kernel/async.c
index a893d6170944..13fcf222b89a 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -56,6 +56,7 @@ synchronization with the async_synchronize_full() function, before returning
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/device.h>
 
 #include "workqueue_internal.h"
 
@@ -149,7 +150,21 @@ static void async_run_entry_fn(struct work_struct *work)
 	wake_up(&async_done);
 }
 
-static async_cookie_t __async_schedule(async_func_t func, void *data, struct async_domain *domain)
+/**
+ * async_schedule_near_domain - schedule a function for asynchronous execution within a certain domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.  A
+ * synchronization domain is specified via @domain.  Note: This function
+ * may be called from atomic or non-atomic contexts.
+ */
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node, struct async_domain *domain)
 {
 	struct async_entry *entry;
 	unsigned long flags;
@@ -195,43 +210,56 @@ static async_cookie_t __async_schedule(async_func_t func, void *data, struct asy
 	current->flags |= PF_USED_ASYNC;
 
 	/* schedule for execution */
-	queue_work(system_unbound_wq, &entry->work);
+	queue_work_near(node, system_unbound_wq, &entry->work);
 
 	return newcookie;
 }
+EXPORT_SYMBOL_GPL(async_schedule_near_domain);
 
 /**
- * async_schedule - schedule a function for asynchronous execution
+ * async_schedule_near - schedule a function for asynchronous execution
  * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
  *
  * Returns an async_cookie_t that may be used for checkpointing later.
  * Note: This function may be called from atomic or non-atomic contexts.
  */
-async_cookie_t async_schedule(async_func_t func, void *data)
+async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
 {
-	return __async_schedule(func, data, &async_dfl_domain);
+	return async_schedule_near_domain(func, data, node, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule);
+EXPORT_SYMBOL_GPL(async_schedule_near);
 
 /**
- * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
+ * async_schedule_dev_domain - schedule a function for asynchronous execution within a certain domain
  * @func: function to execute asynchronously
- * @data: data pointer to pass to the function
+ * @dev: device that we are scheduling this work for
  * @domain: the domain
  *
- * Returns an async_cookie_t that may be used for checkpointing later.
- * @domain may be used in the async_synchronize_*_domain() functions to
- * wait within a certain synchronization domain rather than globally.  A
- * synchronization domain is specified via @domain.  Note: This function
- * may be called from atomic or non-atomic contexts.
+ * Device specific version of async_schedule_near_domain that provides some
+ * NUMA awareness based on the device node.
+ */
+async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
+					 struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
+}
+EXPORT_SYMBOL_GPL(async_schedule_dev_domain);
+
+/**
+ * async_schedule_dev - schedule a function for asynchronous execution
+ * @func: function to execute asynchronously
+ * @dev: device that we are scheduling this work for
+ *
+ * Device specific version of async_schedule_near that provides some NUMA
+ * awareness based on the device node.
  */
-async_cookie_t async_schedule_domain(async_func_t func, void *data,
-				     struct async_domain *domain)
+async_cookie_t async_schedule_dev(async_func_t func, struct device *dev)
 {
-	return __async_schedule(func, data, domain);
+	return async_schedule_dev_domain(func, dev, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule_domain);
+EXPORT_SYMBOL_GPL(async_schedule_dev);
 
 /**
  * async_synchronize_full - synchronize all asynchronous function calls


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
  2018-09-26 21:51 ` Alexander Duyck
  (?)
@ 2018-09-26 21:51   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This change makes probing asynchronous per device instead of per
driver. As a result we see the same behavior whether the device is
registered before the driver or after, and we avoid serializing
initialization when the driver is not loaded until after the devices
have already been added.

The motivation is that if we have a set of devices that take a
significant amount of time to initialize, we can greatly reduce the
total load time by probing them in parallel instead of one at a time.
In addition, each device can live on a different NUMA node, so using a
single thread on one CPU to initialize all of the devices for a given
driver can perform poorly on a multi-node system.

One issue I can see with this patch is that I am using the
dev_set/get_drvdata functions to stash the driver pointer in the device
while waiting for the asynchronous init to complete. For now this is
protected by the device lock together with the fact that dev->driver is
still NULL.
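Condensed, the handoff described above looks like this (a sketch of the
pattern only, with the scheduling and worker sides shown back to back):

```c
/* Scheduling side (__driver_attach): publish drv under the device lock. */
device_lock(dev);
if (!dev->driver) {			/* nothing bound yet */
	get_device(dev);		/* keep dev alive for the worker */
	dev_set_drvdata(dev, drv);	/* temporary stash of the driver */
	async_schedule(__driver_attach_async_helper, dev);
}
device_unlock(dev);

/* Worker side: re-check dev->driver under the same lock before probing. */
device_lock(dev);
if (!dev->driver)
	driver_probe_device(dev_get_drvdata(dev), dev);
device_unlock(dev);
put_device(dev);
```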

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/base/bus.c |   23 +++--------------------
 drivers/base/dd.c  |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 8bfd27ec73d6..2a17bed657ec 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -616,17 +616,6 @@ static ssize_t uevent_store(struct device_driver *drv, const char *buf,
 }
 static DRIVER_ATTR_WO(uevent);
 
-static void driver_attach_async(void *_drv, async_cookie_t cookie)
-{
-	struct device_driver *drv = _drv;
-	int ret;
-
-	ret = driver_attach(drv);
-
-	pr_debug("bus: '%s': driver %s async attach completed: %d\n",
-		 drv->bus->name, drv->name, ret);
-}
-
 /**
  * bus_add_driver - Add a driver to the bus.
  * @drv: driver.
@@ -659,15 +648,9 @@ int bus_add_driver(struct device_driver *drv)
 
 	klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
 	if (drv->bus->p->drivers_autoprobe) {
-		if (driver_allows_async_probing(drv)) {
-			pr_debug("bus: '%s': probing driver %s asynchronously\n",
-				drv->bus->name, drv->name);
-			async_schedule(driver_attach_async, drv);
-		} else {
-			error = driver_attach(drv);
-			if (error)
-				goto out_unregister;
-		}
+		error = driver_attach(drv);
+		if (error)
+			goto out_unregister;
 	}
 	module_add_driver(drv->owner, drv);
 
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 169412ee4ae8..5ba366c1cb83 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -864,6 +864,29 @@ void device_initial_probe(struct device *dev)
 	__device_attach(dev, true);
 }
 
+static void __driver_attach_async_helper(void *_dev, async_cookie_t cookie)
+{
+	struct device *dev = _dev;
+
+	if (dev->parent && dev->bus->need_parent_lock)
+		device_lock(dev->parent);
+	device_lock(dev);
+
+	if (!dev->driver) {
+		struct device_driver *drv = dev_get_drvdata(dev);
+
+		driver_probe_device(drv, dev);
+	}
+
+	dev_dbg(dev, "async probe completed\n");
+
+	device_unlock(dev);
+	if (dev->parent && dev->bus->need_parent_lock)
+		device_unlock(dev->parent);
+
+	put_device(dev);
+}
+
 static int __driver_attach(struct device *dev, void *data)
 {
 	struct device_driver *drv = data;
@@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
 		return ret;
 	} /* ret > 0 means positive match */
 
+	if (driver_allows_async_probing(drv)) {
+		/*
+		 * Instead of probing the device synchronously we will
+		 * probe it asynchronously to allow for more parallelism.
+		 *
+		 * We only take the device lock here in order to guarantee
+		 * that the dev->driver and driver_data fields are protected.
+		 */
+		dev_dbg(dev, "scheduling asynchronous probe\n");
+		device_lock(dev);
+		if (!dev->driver) {
+			get_device(dev);
+			dev_set_drvdata(dev, drv);
+			async_schedule(__driver_attach_async_helper, dev);
+		}
+		device_unlock(dev);
+		return 0;
+	}
+
 	if (dev->parent && dev->bus->need_parent_lock)
 		device_lock(dev->parent);
 	device_lock(dev);


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC workqueue/driver-core PATCH 4/5] driver core: Use new async_schedule_dev command
  2018-09-26 21:51 ` Alexander Duyck
  (?)
@ 2018-09-26 21:51   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:51 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This change switches the driver core over to the device-specific
async_schedule_dev variants when deferring device-related work. This
should yield a slight performance improvement, since any device whose
setup is sensitive to latency/locality will now be initialized on the
node closest to the device.
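The conversion itself is mechanical; each call site changes along these
lines (a sketch mirroring the hunks below):

```c
/* Before: the work item could land on any CPU in system_unbound_wq. */
async_schedule(async_resume, dev);

/* After: the work item is queued near dev_to_node(dev) when known. */
async_schedule_dev(async_resume, dev);
```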

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/base/dd.c         |    4 ++--
 drivers/base/power/main.c |   12 ++++++------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 5ba366c1cb83..81472dc44a70 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -826,7 +826,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 			 */
 			dev_dbg(dev, "scheduling asynchronous probe\n");
 			get_device(dev);
-			async_schedule(__device_attach_async_helper, dev);
+			async_schedule_dev(__device_attach_async_helper, dev);
 		} else {
 			pm_request_idle(dev);
 		}
@@ -927,7 +927,7 @@ static int __driver_attach(struct device *dev, void *data)
 		if (!dev->driver) {
 			get_device(dev);
 			dev_set_drvdata(dev, drv);
-			async_schedule(__driver_attach_async_helper, dev);
+			async_schedule_dev(__driver_attach_async_helper, dev);
 		}
 		device_unlock(dev);
 		return 0;
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 3f68e2919dc5..8495d9b1e9d0 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -726,7 +726,7 @@ void dpm_noirq_resume_devices(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume_noirq, dev);
+			async_schedule_dev(async_resume_noirq, dev);
 		}
 	}
 
@@ -883,7 +883,7 @@ void dpm_resume_early(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume_early, dev);
+			async_schedule_dev(async_resume_early, dev);
 		}
 	}
 
@@ -1047,7 +1047,7 @@ void dpm_resume(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume, dev);
+			async_schedule_dev(async_resume, dev);
 		}
 	}
 
@@ -1366,7 +1366,7 @@ static int device_suspend_noirq(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_noirq, dev);
+		async_schedule_dev(async_suspend_noirq, dev);
 		return 0;
 	}
 	return __device_suspend_noirq(dev, pm_transition, false);
@@ -1569,7 +1569,7 @@ static int device_suspend_late(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_late, dev);
+		async_schedule_dev(async_suspend_late, dev);
 		return 0;
 	}
 
@@ -1830,7 +1830,7 @@ static int device_suspend(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend, dev);
+		async_schedule_dev(async_suspend, dev);
 		return 0;
 	}
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume_early, dev);
+			async_schedule_dev(async_resume_early, dev);
 		}
 	}
 
@@ -1047,7 +1047,7 @@ void dpm_resume(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume, dev);
+			async_schedule_dev(async_resume, dev);
 		}
 	}
 
@@ -1366,7 +1366,7 @@ static int device_suspend_noirq(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_noirq, dev);
+		async_schedule_dev(async_suspend_noirq, dev);
 		return 0;
 	}
 	return __device_suspend_noirq(dev, pm_transition, false);
@@ -1569,7 +1569,7 @@ static int device_suspend_late(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_late, dev);
+		async_schedule_dev(async_suspend_late, dev);
 		return 0;
 	}
 
@@ -1830,7 +1830,7 @@ static int device_suspend(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend, dev);
+		async_schedule_dev(async_suspend, dev);
 		return 0;
 	}

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC workqueue/driver-core PATCH 5/5] nvdimm: Schedule device registration on node local to the device
  2018-09-26 21:51 ` Alexander Duyck
  (?)
@ 2018-09-26 21:52   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 21:52 UTC (permalink / raw)
  To: linux-nvdimm, gregkh, linux-pm, linux-kernel, tj, akpm
  Cc: len.brown, rafael, jiangshanlai, pavel, zwisler

This patch is meant to force the device registration for nvdimm devices
to be performed closer to the actual device. This is achieved by using
either the NUMA node ID of the region or, failing that, of the parent.
By doing this we can have everything above the region placed based on
the region, and everything below the region based on the nvdimm bus.

By guaranteeing NUMA locality I see an improvement of up to 25% for
per-node init of a system with 12TB of persistent memory.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/nvdimm/bus.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 96154670bf07..f663d6ff524e 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -23,6 +23,7 @@
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/cpu.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
@@ -513,11 +514,15 @@ void __nd_device_register(struct device *dev)
 		set_dev_node(dev, to_nd_region(dev)->numa_node);
 
 	dev->bus = &nvdimm_bus_type;
-	if (dev->parent)
+	if (dev->parent) {
 		get_device(dev->parent);
+		if (dev_to_node(dev) == NUMA_NO_NODE)
+			set_dev_node(dev, dev_to_node(dev->parent));
+	}
 	get_device(dev);
-	async_schedule_domain(nd_async_device_register, dev,
-			&nd_async_domain);
+
+	async_schedule_dev_domain(nd_async_device_register, dev,
+				  &nd_async_domain);
 }
 
 void nd_device_register(struct device *dev)
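The node-inheritance hunk in __nd_device_register() above can be modeled in user space as a small sketch; the struct layout and the nd_inherit_parent_node name are assumptions made for illustration only:

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

struct device {
	int numa_node;
	struct device *parent;
};

static int dev_to_node(const struct device *dev)
{
	return dev->numa_node;
}

static void set_dev_node(struct device *dev, int node)
{
	dev->numa_node = node;
}

/* Mirrors the new hunk: if the device itself has no node (i.e. it is
 * not a region), inherit the parent's node so that the later
 * async_schedule_dev_domain() call gets a usable placement hint. */
static void nd_inherit_parent_node(struct device *dev)
{
	if (dev->parent && dev_to_node(dev) == NUMA_NO_NODE)
		set_dev_node(dev, dev_to_node(dev->parent));
}
```

A device that already carries a region node keeps it; only node-less children fall back to the nvdimm bus parent.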

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 69+ messages in thread



* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 21:51   ` Alexander Duyck
  (?)
@ 2018-09-26 21:53     ` Tejun Heo
  -1 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2018-09-26 21:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

Hello,

On Wed, Sep 26, 2018 at 02:51:38PM -0700, Alexander Duyck wrote:
> This patch provides a new function queue_work_near which is meant to
> schedule work on the nearest unbound CPU to the requested NUMA node. The
> main motivation for this is to help assist asynchronous init to better
> improve boot times for devices that are local to a specific node.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Why not just use unbound workqueues, which are NUMA-affine by default?
Are there big enough advantages?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread



* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 21:53     ` Tejun Heo
@ 2018-09-26 22:05       ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 22:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm



On 9/26/2018 2:53 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 02:51:38PM -0700, Alexander Duyck wrote:
>> This patch provides a new function queue_work_near which is meant to
>> schedule work on the nearest unbound CPU to the requested NUMA node. The
>> main motivation for this is to help assist asynchronous init to better
>> improve boot times for devices that are local to a specific node.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Why not just use unbound workqueues, which are NUMA-affine by default?
> Are there big enough advantages?
> 
> Thanks.

I am using unbound workqueues. However there isn't an interface that 
exposes the NUMA bits of them directly. All I am doing with this patch 
is adding "queue_work_near" which takes a NUMA node as an argument and 
then copies the logic of "queue_work_on" with the exception that I am 
doing a check to verify that there is an intersection between 
wq_unbound_cpumask and the cpumask of the node, and then passing a CPU 
from that intersection into "__queue_work".

Thanks.

- Alex
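The selection step described in this reply — intersect the node's cpumask with wq_unbound_cpumask, then hand a CPU from that intersection to __queue_work() — can be sketched in user space along these lines. Modeling cpumasks as single 64-bit words, picking the lowest set bit, and falling back to any unbound CPU are simplifying assumptions, not the kernel's actual implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Pick a CPU for work targeted at a NUMA node. node_cpumask holds the
 * CPUs local to the node; wq_unbound_cpumask holds the CPUs unbound
 * workqueues may run on. Returns a CPU number, or -1 if none exist. */
static int pick_unbound_cpu(uint64_t node_cpumask, uint64_t wq_unbound_cpumask)
{
	/* Prefer a CPU that is both node-local and unbound-eligible. */
	uint64_t eligible = node_cpumask & wq_unbound_cpumask;

	/* No node-local eligible CPU: fall back to any unbound CPU. */
	if (!eligible)
		eligible = wq_unbound_cpumask;

	for (int cpu = 0; cpu < 64; cpu++)
		if (eligible & (1ULL << cpu))
			return cpu;

	return -1;	/* no CPUs available at all */
}
```

The empty-intersection fallback is what distinguishes this "near" semantic from the strict placement of queue_work_on().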

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 22:05       ` Alexander Duyck
  (?)
@ 2018-09-26 22:09         ` Tejun Heo
  -1 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2018-09-26 22:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

Hello,

On Wed, Sep 26, 2018 at 03:05:17PM -0700, Alexander Duyck wrote:
> I am using unbound workqueues. However there isn't an interface that
> exposes the NUMA bits of them directly. All I am doing with this
> patch is adding "queue_work_near" which takes a NUMA node as an
> argument and then copies the logic of "queue_work_on" with the
> exception that I am doing a check to verify that there is an
> intersection between wq_unbound_cpumask and the cpumask of the node,
> and then passing a CPU from that intersection into "__queue_work".

Can it just take a cpu id and not feed that to __queue_work()?  That
looks like a lot of extra logic.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread



* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 22:09         ` Tejun Heo
@ 2018-09-26 22:19           ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-26 22:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

On 9/26/2018 3:09 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 03:05:17PM -0700, Alexander Duyck wrote:
>> I am using unbound workqueues. However there isn't an interface that
>> exposes the NUMA bits of them directly. All I am doing with this
>> patch is adding "queue_work_near" which takes a NUMA node as an
>> argument and then copies the logic of "queue_work_on" with the
>> exception that I am doing a check to verify that there is an
>> intersection between wq_unbound_cpumask and the cpumask of the node,
>> and then passing a CPU from that intersection into "__queue_work".
> 
> Can it just take a cpu id and not feed that to __queue_work()?  That
> looks like a lot of extra logic.
> 
> Thanks.

I could probably just use queue_work_on, but is there any issue if I am 
passing CPU values that are not in the wq_unbound_cpumask? That was 
mostly my concern. Also, for an unbound queue do I need to worry about 
the hotplug lock? I wasn't sure, since it is called out as something to 
be concerned about when using queue_work_on, but in __queue_work the 
value is just used to determine which node to grab a work queue from.

I forgot to address your question about the advantages. They are pretty 
significant. The test system I was working with was initializing 3TB of 
nvdimm memory per node. If the node is aligned it takes something like 
24 seconds, whereas an unaligned core can take 36 seconds or more.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
  2018-09-26 21:51   ` Alexander Duyck
  (?)
@ 2018-09-27  0:31     ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-27  0:31 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> This patch introduces four new variants of the async_schedule_ functions
> that allow scheduling on a specific NUMA node.
>
> The first two functions are async_schedule_near and
> async_schedule_near_domain which end up mapping to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions which were moved to inline function
> definitions that call the new functions while passing NUMA_NO_NODE.
>
> The second two functions are async_schedule_dev and
> async_schedule_dev_domain which provide NUMA specific functionality when
> passing a device as the data member and that device has a NUMA node other
> than NUMA_NO_NODE.
>
> The main motivation behind this is to address the need to be able to
> schedule device specific init work on specific NUMA nodes in order to
> improve performance of memory initialization.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
[..]
>  /**
> - * async_schedule - schedule a function for asynchronous execution
> + * async_schedule_near - schedule a function for asynchronous execution
>   * @func: function to execute asynchronously
>   * @data: data pointer to pass to the function
> + * @node: NUMA node that we want to schedule this on or close to
>   *
>   * Returns an async_cookie_t that may be used for checkpointing later.
>   * Note: This function may be called from atomic or non-atomic contexts.
>   */
> -async_cookie_t async_schedule(async_func_t func, void *data)
> +async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
>  {
> -       return __async_schedule(func, data, &async_dfl_domain);
> +       return async_schedule_near_domain(func, data, node, &async_dfl_domain);
>  }
> -EXPORT_SYMBOL_GPL(async_schedule);
> +EXPORT_SYMBOL_GPL(async_schedule_near);

Looks good to me. The _near() suffix makes it clear that we're doing a
best effort hint to the work placement compared to the strict
expectations of _on routines.

>
>  /**
> - * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
> + * async_schedule_dev_domain - schedule a function for asynchronous execution within a certain domain
>   * @func: function to execute asynchronously
> - * @data: data pointer to pass to the function
> + * @dev: device that we are scheduling this work for
>   * @domain: the domain
>   *
> - * Returns an async_cookie_t that may be used for checkpointing later.
> - * @domain may be used in the async_synchronize_*_domain() functions to
> - * wait within a certain synchronization domain rather than globally.  A
> - * synchronization domain is specified via @domain.  Note: This function
> - * may be called from atomic or non-atomic contexts.
> + * Device specific version of async_schedule_near_domain that provides some
> + * NUMA awareness based on the device node.
> + */
> +async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
> +                                        struct async_domain *domain)
> +{
> +       return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
> +}
> +EXPORT_SYMBOL_GPL(async_schedule_dev_domain);

This seems unnecessary and restrictive. Callers may want to pass
something other than dev as the parameter to the async function, and
dev_to_node() is not an onerous burden to place on callers.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
@ 2018-09-27  0:31     ` Dan Williams
  0 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-27  0:31 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: linux-nvdimm, Greg KH, Linux-pm mailing list,
	Linux Kernel Mailing List, Tejun Heo, Andrew Morton, Brown, Len,
	Dave Jiang, Rafael J. Wysocki, Vishal L Verma, jiangshanlai,
	Pavel Machek, zwisler

On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> This patch introduces four new variants of the async_schedule_ functions
> that allow scheduling on a specific NUMA node.
>
> The first two functions are async_schedule_near and
> async_schedule_near_domain which end up mapping to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions which were moved to inline function
> definitions that call the new functions while passing NUMA_NO_NODE.
>
> The second two functions are async_schedule_dev and
> async_schedule_dev_domain which provide NUMA specific functionality when
> passing a device as the data member and that device has a NUMA node other
> than NUMA_NO_NODE.
>
> The main motivation behind this is to address the need to be able to
> schedule device specific init work on specific NUMA nodes in order to
> improve performance of memory initialization.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
[..]
>  /**
> - * async_schedule - schedule a function for asynchronous execution
> + * async_schedule_near - schedule a function for asynchronous execution
>   * @func: function to execute asynchronously
>   * @data: data pointer to pass to the function
> + * @node: NUMA node that we want to schedule this on or close to
>   *
>   * Returns an async_cookie_t that may be used for checkpointing later.
>   * Note: This function may be called from atomic or non-atomic contexts.
>   */
> -async_cookie_t async_schedule(async_func_t func, void *data)
> +async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
>  {
> -       return __async_schedule(func, data, &async_dfl_domain);
> +       return async_schedule_near_domain(func, data, node, &async_dfl_domain);
>  }
> -EXPORT_SYMBOL_GPL(async_schedule);
> +EXPORT_SYMBOL_GPL(async_schedule_near);

Looks good to me. The _near() suffix makes it clear that we're doing a
best effort hint to the work placement compared to the strict
expectations of _on routines.

>
>  /**
> - * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
> + * async_schedule_dev_domain - schedule a function for asynchronous execution within a certain domain
>   * @func: function to execute asynchronously
> - * @data: data pointer to pass to the function
> + * @dev: device that we are scheduling this work for
>   * @domain: the domain
>   *
> - * Returns an async_cookie_t that may be used for checkpointing later.
> - * @domain may be used in the async_synchronize_*_domain() functions to
> - * wait within a certain synchronization domain rather than globally.  A
> - * synchronization domain is specified via @domain.  Note: This function
> - * may be called from atomic or non-atomic contexts.
> + * Device specific version of async_schedule_near_domain that provides some
> + * NUMA awareness based on the device node.
> + */
> +async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
> +                                        struct async_domain *domain)
> +{
> +       return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
> +}
> +EXPORT_SYMBOL_GPL(async_schedule_dev_domain);

This seems unnecessary and restrictive. Callers may want to pass
something other than dev as the parameter to the async function, and
dev_to_node() is not an onerous burden to place on callers.

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
  2018-09-26 21:51   ` Alexander Duyck
  (?)
@ 2018-09-27  0:48     ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-27  0:48 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> This change makes it so that we probe devices asynchronously instead of the
> driver. This results in us seeing the same behavior if the device is
> registered before the driver or after. This way we can avoid serializing
> the initialization should the driver not be loaded until after the devices
> have already been added.
>
> The motivation behind this is that if we have a set of devices that
> take a significant amount of time to load we can greatly reduce the time to
> load by processing them in parallel instead of one at a time. In addition,
> each device can exist on a different node so placing a single thread on one
> CPU to initialize all of the devices for a given driver can result in poor
> performance on a system with multiple nodes.
>
> One issue I can see with this patch is that I am using the
> dev_set/get_drvdata functions to store the driver in the device while I am
> waiting on the asynchronous init to complete. For now I am protecting it by
> using the lack of a dev->driver and the device lock.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
[..]
> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
>                 return ret;
>         } /* ret > 0 means positive match */
>
> +       if (driver_allows_async_probing(drv)) {
> +               /*
> +                * Instead of probing the device synchronously we will
> +                * probe it asynchronously to allow for more parallelism.
> +                *
> +                * We only take the device lock here in order to guarantee
> +                * that the dev->driver and driver_data fields are protected
> +                */
> +               dev_dbg(dev, "scheduling asynchronous probe\n");
> +               device_lock(dev);
> +               if (!dev->driver) {
> +                       get_device(dev);
> +                       dev_set_drvdata(dev, drv);
> +                       async_schedule(__driver_attach_async_helper, dev);

I'm not sure async drivers / sub-systems are ready for their devices
to show up in parallel. While userspace should not be relying on
kernel device names, people get upset when devices change kernel names
from one boot to the next, and I can see this change leading to that
scenario.

If a driver / sub-system wants more parallelism than what
driver_allows_async_probing() provides it should do it locally, for
example, like libata does.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
  2018-09-27  0:31     ` Dan Williams
  (?)
@ 2018-09-27 15:16       ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-27 15:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki



On 9/26/2018 5:31 PM, Dan Williams wrote:
> On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
>>
>> This patch introduces four new variants of the async_schedule_ functions
>> that allow scheduling on a specific NUMA node.
>>
>> The first two functions are async_schedule_near and
>> async_schedule_near_domain which end up mapping to async_schedule and
>> async_schedule_domain but provide NUMA node specific functionality. They
>> replace the original functions which were moved to inline function
>> definitions that call the new functions while passing NUMA_NO_NODE.
>>
>> The second two functions are async_schedule_dev and
>> async_schedule_dev_domain which provide NUMA specific functionality when
>> passing a device as the data member and that device has a NUMA node other
>> than NUMA_NO_NODE.
>>
>> The main motivation behind this is to address the need to be able to
>> schedule device specific init work on specific NUMA nodes in order to
>> improve performance of memory initialization.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> [..]
>>   /**
>> - * async_schedule - schedule a function for asynchronous execution
>> + * async_schedule_near - schedule a function for asynchronous execution
>>    * @func: function to execute asynchronously
>>    * @data: data pointer to pass to the function
>> + * @node: NUMA node that we want to schedule this on or close to
>>    *
>>    * Returns an async_cookie_t that may be used for checkpointing later.
>>    * Note: This function may be called from atomic or non-atomic contexts.
>>    */
>> -async_cookie_t async_schedule(async_func_t func, void *data)
>> +async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
>>   {
>> -       return __async_schedule(func, data, &async_dfl_domain);
>> +       return async_schedule_near_domain(func, data, node, &async_dfl_domain);
>>   }
>> -EXPORT_SYMBOL_GPL(async_schedule);
>> +EXPORT_SYMBOL_GPL(async_schedule_near);
> 
> Looks good to me. The _near() suffix makes it clear that we're doing a
> best effort hint to the work placement compared to the strict
> expectations of _on routines.
> 
>>
>>   /**
>> - * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
>> + * async_schedule_dev_domain - schedule a function for asynchronous execution within a certain domain
>>    * @func: function to execute asynchronously
>> - * @data: data pointer to pass to the function
>> + * @dev: device that we are scheduling this work for
>>    * @domain: the domain
>>    *
>> - * Returns an async_cookie_t that may be used for checkpointing later.
>> - * @domain may be used in the async_synchronize_*_domain() functions to
>> - * wait within a certain synchronization domain rather than globally.  A
>> - * synchronization domain is specified via @domain.  Note: This function
>> - * may be called from atomic or non-atomic contexts.
>> + * Device specific version of async_schedule_near_domain that provides some
>> + * NUMA awareness based on the device node.
>> + */
>> +async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
>> +                                        struct async_domain *domain)
>> +{
>> +       return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
>> +}
>> +EXPORT_SYMBOL_GPL(async_schedule_dev_domain);
> 
> This seems unnecessary and restrictive. Callers may want to pass
> something other than dev as the parameter to the async function, and
> dev_to_node() is not on onerous burden to place on callers.


That is what async_schedule_near_domain is for; they can call that. The 
"dev" versions of the calls are just supposed to be helpers, since one of 
the most common parameters to the async_schedule calls is a device. I 
thought I would put together a function that takes care of all of this 
for us so I could drop an argument and avoid having to use dev_to_node 
everywhere.

^ permalink raw reply	[flat|nested] 69+ messages in thread


* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
  2018-09-27  0:48     ` Dan Williams
  (?)
@ 2018-09-27 15:27       ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-27 15:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki



On 9/26/2018 5:48 PM, Dan Williams wrote:
> On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
>>
>> This change makes it so that we probe devices asynchronously instead of the
>> driver. This results in us seeing the same behavior if the device is
>> registered before the driver or after. This way we can avoid serializing
>> the initialization should the driver not be loaded until after the devices
>> have already been added.
>>
>> The motivation behind this is that if we have a set of devices that
>> take a significant amount of time to load we can greatly reduce the time to
>> load by processing them in parallel instead of one at a time. In addition,
>> each device can exist on a different node so placing a single thread on one
>> CPU to initialize all of the devices for a given driver can result in poor
>> performance on a system with multiple nodes.
>>
>> One issue I can see with this patch is that I am using the
>> dev_set/get_drvdata functions to store the driver in the device while I am
>> waiting on the asynchronous init to complete. For now I am protecting it by
>> using the lack of a dev->driver and the device lock.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> [..]
>> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
>>                  return ret;
>>          } /* ret > 0 means positive match */
>>
>> +       if (driver_allows_async_probing(drv)) {
>> +               /*
>> +                * Instead of probing the device synchronously we will
>> +                * probe it asynchronously to allow for more parallelism.
>> +                *
>> +                * We only take the device lock here in order to guarantee
>> +                * that the dev->driver and driver_data fields are protected
>> +                */
>> +               dev_dbg(dev, "scheduling asynchronous probe\n");
>> +               device_lock(dev);
>> +               if (!dev->driver) {
>> +                       get_device(dev);
>> +                       dev_set_drvdata(dev, drv);
>> +                       async_schedule(__driver_attach_async_helper, dev);
> 
> I'm not sure async drivers / sub-systems are ready for their devices
> to show up in parallel. While userspace should not be relying on
> kernel device names, people get upset when devices change kernel names
> from one boot to the next, and I can see this change leading to that
> scenario.

The thing is, the current async behavior already does this if the driver 
is loaded before the device is added. All I am doing is making the 
driver-loaded-first behavior the standard rather than having the two 
orderings behave differently. This way we get consistent behavior.

> If a driver / sub-system wants more parallelism than what
> driver_allows_async_probing() provides it should do it locally, for
> example, like libata does.

So where I actually saw this was with the legacy pmem setup I had. After 
doing all the work to parallelize things in the driver it had no effect. 
That was because the nd_pmem driver wasn't loaded yet, so all the 
device_add calls did was add the device without attaching the nd_pmem 
driver. Then when the driver loaded it serialized the probe calls, 
resulting in initialization taking twice as long as it needed to.

This seems to affect standard persistent memory as well. The only 
difference is that instead of probing the device on the first pass we 
kick it back and reprobe it in nd_pmem_probe/nd_pfn_probe in order to 
set the correct personality and that in turn allows us to asynchronously 
reschedule the work on the correct CPU and deserialize it.


_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
@ 2018-09-27 15:27       ` Alexander Duyck
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-27 15:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Greg KH, Linux-pm mailing list,
	Linux Kernel Mailing List, Tejun Heo, Andrew Morton, Brown, Len,
	Dave Jiang, Rafael J. Wysocki, Vishal L Verma, jiangshanlai,
	Pavel Machek, zwisler



On 9/26/2018 5:48 PM, Dan Williams wrote:
> On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
>>
>> This change makes it so that we probe devices asynchronously instead of the
>> driver. This results in us seeing the same behavior if the device is
>> registered before the driver or after. This way we can avoid serializing
>> the initialization should the driver not be loaded until after the devices
>> have already been added.
>>
>> The motivation behind this is that if we have a set of devices that
>> take a significant amount of time to load we can greatly reduce the time to
>> load by processing them in parallel instead of one at a time. In addition,
>> each device can exist on a different node so placing a single thread on one
>> CPU to initialize all of the devices for a given driver can result in poor
>> performance on a system with multiple nodes.
>>
>> One issue I can see with this patch is that I am using the
>> dev_set/get_drvdata functions to store the driver in the device while I am
>> waiting on the asynchronous init to complete. For now I am protecting it by
>> using the lack of a dev->driver and the device lock.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> [..]
>> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
>>                  return ret;
>>          } /* ret > 0 means positive match */
>>
>> +       if (driver_allows_async_probing(drv)) {
>> +               /*
>> +                * Instead of probing the device synchronously we will
>> +                * probe it asynchronously to allow for more parallelism.
>> +                *
>> +                * We only take the device lock here in order to guarantee
>> +                * that the dev->driver and driver_data fields are protected
>> +                */
>> +               dev_dbg(dev, "scheduling asynchronous probe\n");
>> +               device_lock(dev);
>> +               if (!dev->driver) {
>> +                       get_device(dev);
>> +                       dev_set_drvdata(dev, drv);
>> +                       async_schedule(__driver_attach_async_helper, dev);
> 
> I'm not sure async drivers / sub-systems are ready for their devices
> to show up in parallel. While userspace should not be relying on
> kernel device names, people get upset when devices change kernel names
> from one boot to the next, and I can see this change leading to that
> scenario.

The thing is the current async behavior already does this if the driver
is loaded before the device is added. All I am doing is making that
driver-loaded-first behavior the standard instead of letting it work
the other way around. This way we get consistent behavior.

> If a driver / sub-system wants more parallelism than what
> driver_allows_async_probing() provides it should do it locally, for
> example, like libata does.

So where I actually saw this was with the pmem legacy setup I had. After
doing all the work to parallelize things in the driver it had no effect.
That was because the nd_pmem driver wasn't loaded yet, so all the
device_add calls did was add the device; they didn't attach the nd_pmem
driver. Then, when the driver loaded, it serialized the probe calls,
resulting in it taking twice as long as necessary to initialize the
memory.

This seems to affect standard persistent memory as well. The only 
difference is that instead of probing the device on the first pass we 
kick it back and reprobe it in nd_pmem_probe/nd_pfn_probe in order to 
set the correct personality and that in turn allows us to asynchronously 
reschedule the work on the correct CPU and deserialize it.
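[Editorial note: the handoff described in the quoted hunk can be modeled as a compilable toy. All names below are simplified stand-ins, not the real driver-core types: the scheduling side stashes the driver pointer in drvdata under the device lock, and the async helper later re-takes the lock and binds it, with pthreads standing in for the kernel's async machinery.]

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified stand-ins for the objects under discussion; the real ones
 * are struct device / struct device_driver in the kernel. */
struct driver { const char *name; };
struct device {
	pthread_mutex_t lock;  /* models device_lock() */
	struct driver *driver; /* models dev->driver */
	void *drvdata;         /* models dev_set/get_drvdata() */
};

/* The async helper: re-take the device lock, pick the driver back out
 * of drvdata, and bind it (the real helper would go on to probe). */
static void *attach_async_helper(void *_dev)
{
	struct device *dev = _dev;

	pthread_mutex_lock(&dev->lock);
	dev->driver = dev->drvdata; /* "probe" succeeds */
	dev->drvdata = NULL;
	pthread_mutex_unlock(&dev->lock);
	return NULL;
}

/* The scheduling side from the quoted hunk: only stash the driver and
 * queue the helper if no driver is bound yet, all under the lock. */
static int schedule_async_attach(struct device *dev, struct driver *drv,
				 pthread_t *thread)
{
	int queued = 0;

	pthread_mutex_lock(&dev->lock);
	if (!dev->driver) {
		dev->drvdata = drv;
		pthread_create(thread, NULL, attach_async_helper, dev);
		queued = 1;
	}
	pthread_mutex_unlock(&dev->lock);
	return queued;
}
```

This is only a sketch of the locking pattern, not the actual patch: the real code also takes a device reference and defers the probe through the async infrastructure rather than a raw thread.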



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
  2018-09-27 15:16       ` Alexander Duyck
@ 2018-09-27 19:48         ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-27 19:48 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Thu, Sep 27, 2018 at 8:24 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
[..]
> >> - * Returns an async_cookie_t that may be used for checkpointing later.
> >> - * @domain may be used in the async_synchronize_*_domain() functions to
> >> - * wait within a certain synchronization domain rather than globally.  A
> >> - * synchronization domain is specified via @domain.  Note: This function
> >> - * may be called from atomic or non-atomic contexts.
> >> + * Device specific version of async_schedule_near_domain that provides some
> >> + * NUMA awareness based on the device node.
> >> + */
> >> +async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
> >> +                                        struct async_domain *domain)
> >> +{
> >> +       return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
> >> +}
> >> +EXPORT_SYMBOL_GPL(async_schedule_dev_domain);
> >
> > This seems unnecessary and restrictive. Callers may want to pass
> > something other than dev as the parameter to the async function, and
> > dev_to_node() is not an onerous burden to place on callers.
>
>
> That is what async_schedule_near_domain is for, they can call that. The
> "dev" versions of the calls as just supposed to be helpers since one of
> the most common parameters to the async_schedule calls is a device, so I
> thought I would just put together a function that takes care of all this
> for us so I could drop an argument and avoid having to use dev_to_node
> everywhere.

Yeah, makes sense, I guess I was reacting to the fact that this
expands the number of exports unnecessarily. The other async routines
are exported because they hide internal implementation details of the
async implementation. The async_schedule_dev* helpers can just be
static inline wrappers.
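[Editorial note: the wrapper Dan is describing could be as small as the sketch below. The types here are stand-ins rather than the real kernel headers, and a stub replaces the actual queueing logic; only async_schedule_near_domain would need an export, with the dev helper reduced to a static inline.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types involved; the real
 * definitions live in <linux/async.h> and <linux/device.h>. */
typedef unsigned long long async_cookie_t;
typedef void (*async_func_t)(void *data, async_cookie_t cookie);

struct device { int numa_node; };
struct async_domain { int unused; };

static int dev_to_node(struct device *dev)
{
	return dev->numa_node;
}

/* The one exported entry point that hides the implementation details.
 * Stub: the real function queues work near the requested NUMA node;
 * here we just echo the node back so the wrapper can be exercised. */
static async_cookie_t async_schedule_near_domain(async_func_t func,
						 void *data, int node,
						 struct async_domain *domain)
{
	(void)func; (void)data; (void)domain;
	return (async_cookie_t)node;
}

/* The dev helper needs no export of its own: it can be a static
 * inline wrapper that derives the node from the device. */
static inline async_cookie_t
async_schedule_dev_domain(async_func_t func, struct device *dev,
			  struct async_domain *domain)
{
	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
}
```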
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node
  2018-09-27 19:48         ` Dan Williams
@ 2018-09-27 20:03           ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-09-27 20:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On 9/27/2018 12:48 PM, Dan Williams wrote:
> On Thu, Sep 27, 2018 at 8:24 AM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> [..]
>>>> - * Returns an async_cookie_t that may be used for checkpointing later.
>>>> - * @domain may be used in the async_synchronize_*_domain() functions to
>>>> - * wait within a certain synchronization domain rather than globally.  A
>>>> - * synchronization domain is specified via @domain.  Note: This function
>>>> - * may be called from atomic or non-atomic contexts.
>>>> + * Device specific version of async_schedule_near_domain that provides some
>>>> + * NUMA awareness based on the device node.
>>>> + */
>>>> +async_cookie_t async_schedule_dev_domain(async_func_t func, struct device *dev,
>>>> +                                        struct async_domain *domain)
>>>> +{
>>>> +       return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(async_schedule_dev_domain);
>>>
>>> This seems unnecessary and restrictive. Callers may want to pass
>>> something other than dev as the parameter to the async function, and
>>> dev_to_node() is not an onerous burden to place on callers.
>>
>>
>> That is what async_schedule_near_domain is for, they can call that. The
>> "dev" versions of the calls are just supposed to be helpers since one of
>> the most common parameters to the async_schedule calls is a device, so I
>> thought I would just put together a function that takes care of all this
>> for us so I could drop an argument and avoid having to use dev_to_node
>> everywhere.
> 
> Yeah, makes sense, I guess I was reacting to the fact that this
> expands the number of exports unnecessarily. The other async routines
> are exported because they hide internal implementation details of the
> async implementation. The async_schedule_dev* helpers can just be
> static inline wrappers.

I can do them as inline wrappers for the next patch set. Shouldn't be 
too much of an issue.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
  2018-09-27 15:27       ` Alexander Duyck
  (?)
@ 2018-09-28  2:48         ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-28  2:48 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Thu, Sep 27, 2018 at 8:31 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
>
>
> On 9/26/2018 5:48 PM, Dan Williams wrote:
> > On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> >>
> >> This change makes it so that we probe devices asynchronously instead of the
> >> driver. This results in us seeing the same behavior if the device is
> >> registered before the driver or after. This way we can avoid serializing
> >> the initialization should the driver not be loaded until after the devices
> >> have already been added.
> >>
> >> The motivation behind this is that if we have a set of devices that
> >> take a significant amount of time to load we can greatly reduce the time to
> >> load by processing them in parallel instead of one at a time. In addition,
> >> each device can exist on a different node so placing a single thread on one
> >> CPU to initialize all of the devices for a given driver can result in poor
> >> performance on a system with multiple nodes.
> >>
> >> One issue I can see with this patch is that I am using the
> >> dev_set/get_drvdata functions to store the driver in the device while I am
> >> waiting on the asynchronous init to complete. For now I am protecting it by
> >> using the lack of a dev->driver and the device lock.
> >>
> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > [..]
> >> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
> >>                  return ret;
> >>          } /* ret > 0 means positive match */
> >>
> >> +       if (driver_allows_async_probing(drv)) {
> >> +               /*
> >> +                * Instead of probing the device synchronously we will
> >> +                * probe it asynchronously to allow for more parallelism.
> >> +                *
> >> +                * We only take the device lock here in order to guarantee
> >> +                * that the dev->driver and driver_data fields are protected
> >> +                */
> >> +               dev_dbg(dev, "scheduling asynchronous probe\n");
> >> +               device_lock(dev);
> >> +               if (!dev->driver) {
> >> +                       get_device(dev);
> >> +                       dev_set_drvdata(dev, drv);
> >> +                       async_schedule(__driver_attach_async_helper, dev);
> >
> > I'm not sure async drivers / sub-systems are ready for their devices
> > to show up in parallel. While userspace should not be relying on
> > kernel device names, people get upset when devices change kernel names
> > from one boot to the next, and I can see this change leading to that
> > scenario.
>
> The thing is the current async behavior already does this if the driver
> is loaded before the device is added. All I am doing is making that
> driver-loaded-first behavior the standard instead of letting it work
> the other way around. This way we get consistent behavior.

Ok, I can see the consistency argument. It's still a behavior change
that needs testing. Configurations that have been living with the
default of synchronous probing of the devices on the bus for a
later-arriving driver might be surprised.

That said, I was confusing async probing with device registration in
my thinking, so some of the discovery order / naming concerns may not
be as bad as I was imagining. Sub-systems that would be broken by this
behavior change would already be broken if a driver is built-in vs
module.

So, consider this an Acked-by.

> > If a driver / sub-system wants more parallelism than what
> > driver_allows_async_probing() provides it should do it locally, for
> > example, like libata does.
>
> So where I actually saw this was with the pmem legacy setup I had. After
> doing all the work to parallelize things in the driver it had no effect.
> That was because the nd_pmem driver wasn't loaded yet, so all the
> device_add calls did was add the device; they didn't attach the nd_pmem
> driver. Then, when the driver loaded, it serialized the probe calls,
> resulting in it taking twice as long as necessary to initialize the
> memory.
>
> This seems to affect standard persistent memory as well. The only
> difference is that instead of probing the device on the first pass we
> kick it back and reprobe it in nd_pmem_probe/nd_pfn_probe in order to
> set the correct personality and that in turn allows us to asynchronously
> reschedule the work on the correct CPU and deserialize it.

I don't see any problems with this series with the nvdimm unit tests,
but note those tests do not check for discovery order / naming
discrepancies.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
@ 2018-09-28  2:48         ` Dan Williams
  0 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-28  2:48 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: linux-nvdimm, Greg KH, Linux-pm mailing list,
	Linux Kernel Mailing List, Tejun Heo, Andrew Morton, Brown, Len,
	Dave Jiang, Rafael J. Wysocki, Vishal L Verma, jiangshanlai,
	Pavel Machek, zwisler

On Thu, Sep 27, 2018 at 8:31 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
>
>
> On 9/26/2018 5:48 PM, Dan Williams wrote:
> > On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> >>
> >> This change makes it so that we probe devices asynchronously instead of the
> >> driver. This results in us seeing the same behavior if the device is
> >> registered before the driver or after. This way we can avoid serializing
> >> the initialization should the driver not be loaded until after the devices
> >> have already been added.
> >>
> >> The motivation behind this is that if we have a set of devices that
> >> take a significant amount of time to load we can greatly reduce the time to
> >> load by processing them in parallel instead of one at a time. In addition,
> >> each device can exist on a different node so placing a single thread on one
> >> CPU to initialize all of the devices for a given driver can result in poor
> >> performance on a system with multiple nodes.
> >>
> >> One issue I can see with this patch is that I am using the
> >> dev_set/get_drvdata functions to store the driver in the device while I am
> >> waiting on the asynchronous init to complete. For now I am protecting it by
> >> using the lack of a dev->driver and the device lock.
> >>
> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > [..]
> >> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
> >>                  return ret;
> >>          } /* ret > 0 means positive match */
> >>
> >> +       if (driver_allows_async_probing(drv)) {
> >> +               /*
> >> +                * Instead of probing the device synchronously we will
> >> +                * probe it asynchronously to allow for more parallelism.
> >> +                *
> >> +                * We only take the device lock here in order to guarantee
> >> +                * that the dev->driver and driver_data fields are protected
> >> +                */
> >> +               dev_dbg(dev, "scheduling asynchronous probe\n");
> >> +               device_lock(dev);
> >> +               if (!dev->driver) {
> >> +                       get_device(dev);
> >> +                       dev_set_drvdata(dev, drv);
> >> +                       async_schedule(__driver_attach_async_helper, dev);
> >
> > I'm not sure async drivers / sub-systems are ready for their devices
> > to show up in parallel. While userspace should not be relying on
> > kernel device names, people get upset when devices change kernel names
> > from one boot to the next, and I can see this change leading to that
> > scenario.
>
> The thing is the current async behavior already does this if the driver
> is loaded before the device is added. All I am doing is making the
> behavior with the driver loaded first the standard instead of letting it
> work the other way around. This way we get consistent behavior.

Ok, I can see the consistency argument. It's still a behavior change
that needs testing. Configurations that have been living with the
default of synchronous probing of the devices on the bus for a later
arriving driver might be surprised.

That said, I was confusing async probing with device registration in
my thinking, so some of the discovery order / naming concerns may not
be as bad as I was imagining. Sub-systems that would be broken by this
behavior change would already be broken if a driver is built-in vs
module.

So, consider this, an Acked-by.

> > If a driver / sub-system wants more parallelism than what
> > driver_allows_async_probing() provides it should do it locally, for
> > example, like libata does.
>
> So where I actually saw this was with the pmem legacy setup I had. After
> doing all the work to parallelize things in the driver it had no effect.
> That was because the nd_pmem driver wasn't loaded yet so all the
> device_add calls did is add the device but didn't attach the nd_pmem
> driver. Then when the driver loaded it serialized the probe calls
> resulting in it taking twice as long as it needed to in order to
> initialize the memory.
>
> This seems to affect standard persistent memory as well. The only
> difference is that instead of probing the device on the first pass we
> kick it back and reprobe it in nd_pmem_probe/nd_pfn_probe in order to
> set the correct personality and that in turn allows us to asynchronously
> reschedule the work on the correct CPU and deserialize it.

I don't see any problems with this series with the nvdimm unit tests,
but note those tests do not check for discovery order / naming
discrepancies.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver
@ 2018-09-28  2:48         ` Dan Williams
  0 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-28  2:48 UTC (permalink / raw)
  To: alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai-Re5JQEeQqe8AvxtiuMwx3w, Linux Kernel Mailing List,
	zwisler-DgEjT+Ai2ygdnm+yROfE0A, Pavel Machek, Tejun Heo,
	Andrew Morton, Rafael J. Wysocki

On Thu, Sep 27, 2018 at 8:31 AM Alexander Duyck
<alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
>
>
>
> On 9/26/2018 5:48 PM, Dan Williams wrote:
> > On Wed, Sep 26, 2018 at 2:51 PM Alexander Duyck
> > <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> >>
> >> This change makes it so that we probe devices asynchronously instead of the
> >> driver. This results in us seeing the same behavior if the device is
> >> registered before the driver or after. This way we can avoid serializing
> >> the initialization should the driver not be loaded until after the devices
> >> have already been added.
> >>
> >> The motivation behind this is that if we have a set of devices that
> >> take a significant amount of time to load we can greatly reduce the time to
> >> load by processing them in parallel instead of one at a time. In addition,
> >> each device can exist on a different node so placing a single thread on one
> >> CPU to initialize all of the devices for a given driver can result in poor
> >> performance on a system with multiple nodes.
> >>
> >> One issue I can see with this patch is that I am using the
> >> dev_set/get_drvdata functions to store the driver in the device while I am
> >> waiting on the asynchronous init to complete. For now I am protecting it by
> >> using the lack of a dev->driver and the device lock.
> >>
> >> Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> > [..]
> >> @@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
> >>                  return ret;
> >>          } /* ret > 0 means positive match */
> >>
> >> +       if (driver_allows_async_probing(drv)) {
> >> +               /*
> >> +                * Instead of probing the device synchronously we will
> >> +                * probe it asynchronously to allow for more parallelism.
> >> +                *
> >> +                * We only take the device lock here in order to guarantee
> >> +                * that the dev->driver and driver_data fields are protected
> >> +                */
> >> +               dev_dbg(dev, "scheduling asynchronous probe\n");
> >> +               device_lock(dev);
> >> +               if (!dev->driver) {
> >> +                       get_device(dev);
> >> +                       dev_set_drvdata(dev, drv);
> >> +                       async_schedule(__driver_attach_async_helper, dev);
> >
> > I'm not sure async drivers / sub-systems are ready for their devices
> > to show up in parallel. While userspace should not be relying on
> > kernel device names, people get upset when devices change kernel names
> > from one boot to the next, and I can see this change leading to that
> > scenario.
>
> The thing is, the current async behavior already does this if the driver
> is loaded before the device is added. All I am doing is making the
> driver-loaded-first behavior the standard instead of letting it work
> the other way around. This way we get consistent behavior.

Ok, I can see the consistency argument. It's still a behavior change
that needs testing. Configurations that have been living with the
default of synchronous probing of the devices on the bus for a later
arriving driver might be surprised.

That said, I was confusing async probing with device registration in
my thinking, so some of the discovery order / naming concerns may not
be as bad as I was imagining. Sub-systems that would be broken by this
behavior change would already be broken if a driver is built-in vs
module.

So, consider this, an Acked-by.

> > If a driver / sub-system wants more parallelism than what
> > driver_allows_async_probing() provides it should do it locally, for
> > example, like libata does.
>
> So where I actually saw this was with the pmem legacy setup I had. After
> doing all the work to parallelize things in the driver it had no effect.
> That was because the nd_pmem driver wasn't loaded yet, so all the
> device_add calls did was add the device without attaching the nd_pmem
> driver. Then when the driver loaded it serialized the probe calls
> resulting in it taking twice as long as it needed to in order to
> initialize the memory.
>
> This seems to affect standard persistent memory as well. The only
> difference is that instead of probing the device on the first pass we
> kick it back and reprobe it in nd_pmem_probe/nd_pfn_probe in order to
> set the correct personality and that in turn allows us to asynchronously
> reschedule the work on the correct CPU and deserialize it.

I don't see any problems with this series with the nvdimm unit tests,
but note those tests do not check for discovery order / naming
discrepancies.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 4/5] driver core: Use new async_schedule_dev command
  2018-09-26 21:51   ` Alexander Duyck
  (?)
@ 2018-09-28 17:42     ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-28 17:42 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Wed, Sep 26, 2018 at 2:52 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> This change makes it so that we use the device specific version of the
> async_schedule commands to defer various tasks related to devices. By doing
> this we should see a slight improvement in performance as any device that
> is sensitive to latency/locality in the setup will now be initializing on
> the node closest to the device.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  drivers/base/dd.c         |    4 ++--
>  drivers/base/power/main.c |   12 ++++++------
>  2 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 5ba366c1cb83..81472dc44a70 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -826,7 +826,7 @@ static int __device_attach(struct device *dev, bool allow_async)
>                          */
>                         dev_dbg(dev, "scheduling asynchronous probe\n");
>                         get_device(dev);
> -                       async_schedule(__device_attach_async_helper, dev);
> +                       async_schedule_dev(__device_attach_async_helper, dev);
>                 } else {
>                         pm_request_idle(dev);
>                 }
> @@ -927,7 +927,7 @@ static int __driver_attach(struct device *dev, void *data)
>                 if (!dev->driver) {
>                         get_device(dev);
>                         dev_set_drvdata(dev, drv);
> -                       async_schedule(__driver_attach_async_helper, dev);
> +                       async_schedule_dev(__driver_attach_async_helper, dev);
>                 }

Above looks good to me...

>                 device_unlock(dev);
>                 return 0;
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 3f68e2919dc5..8495d9b1e9d0 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c

Let's split these changes into their own patch, just in case it causes
a suspend regression we can revert it independently of the async
probing change. It might also be worthwhile to have a positive
indication that this improves the latency of power transitions.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 5/5] nvdimm: Schedule device registration on node local to the device
  2018-09-26 21:52   ` Alexander Duyck
@ 2018-09-28 17:46     ` Dan Williams
  -1 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2018-09-28 17:46 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Brown, Len, Linux-pm mailing list, Greg KH, linux-nvdimm,
	jiangshanlai, Linux Kernel Mailing List, zwisler, Pavel Machek,
	Tejun Heo, Andrew Morton, Rafael J. Wysocki

On Wed, Sep 26, 2018 at 2:52 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> This patch is meant to force the device registration for nvdimm devices to
> be closer to the actual device. This is achieved by using either the NUMA
> node ID of the region, or of the parent. By doing this we can have
> everything above the region based on the region, and everything below the
> region based on the nvdimm bus.
>
> By guaranteeing NUMA locality I see an improvement of as high as 25% for
> per-node init of a system with 12TB of persistent memory.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Looks good to me. This does have a dependency on the current contents
of libnvdimm development tree.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-09-26 22:19           ` Alexander Duyck
  (?)
@ 2018-10-01 16:01             ` Tejun Heo
  -1 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2018-10-01 16:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

Hello,

On Wed, Sep 26, 2018 at 03:19:21PM -0700, Alexander Duyck wrote:
> On 9/26/2018 3:09 PM, Tejun Heo wrote:
> I could just use queue_work_on probably, but is there any issue if I
> am passing CPU values that are not in the wq_unbound_cpumask? That

That should be fine.  If it can't find any available cpu, it'll fall
back to round-robin.  We probably can improve it so that it can
consider the numa distance when falling back.

> was mostly my concern. Also for an unbound queue do I need to worry
> about the hotplug lock? I wasn't sure if that was the case or not as

Issuers don't need to worry about them.

> I know it is called out as something to be concerned with using
> queue_work_on, but in __queue_work the value is just used to
> determine which node to grab a work queue from.

It might be better to leave queue_work_on() to be used for per-cpu
workqueues and introduce queue_work_near() as you suggested.  I just
don't want it to duplicate the node selection code in it.  Would that
work?

> I forgot to address your question about the advantages. They are
> pretty significant. The test system I was working with was
> initializing 3TB of nvdimm memory per node. If the node is aligned
> it takes something like 24 seconds, whereas an unaligned core can
> take 36 seconds or more.

Oh yeah, sure, numa affinity matters quite a bit on memory heavy
workloads.  I was mistaken that you were adding numa affinity
to per-cpu workqueues.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-10-01 16:01             ` Tejun Heo
  (?)
@ 2018-10-01 21:54               ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-10-01 21:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

On 10/1/2018 9:01 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 03:19:21PM -0700, Alexander Duyck wrote:
>> On 9/26/2018 3:09 PM, Tejun Heo wrote:
>> I could just use queue_work_on probably, but is there any issue if I
>> am passing CPU values that are not in the wq_unbound_cpumask? That
> 
> That should be fine.  If it can't find any available cpu, it'll fall
> back to round-robin.  We probably can improve it so that it can
> consider the numa distance when falling back.
> 
>> was mostly my concern. Also for an unbound queue do I need to worry
>> about the hotplug lock? I wasn't sure if that was the case or not as
> 
> Issuers don't need to worry about them.
> 
>> I know it is called out as something to be concerned with using
>> queue_work_on, but in __queue_work the value is just used to
>> determine which node to grab a work queue from.
> 
> It might be better to leave queue_work_on() to be used for per-cpu
> workqueues and introduce queue_work_near() as you suggested.  I just
> don't want it to duplicate the node selection code in it.  Would that
> work?

So if I understand what you are saying correctly, we default to 
round-robin when a given node has no CPUs attached to it. I could probably 
work with that if that is the default behavior instead of adding much of 
the complexity I already have.

The question I have then is what should I do about workqueues that 
aren't WQ_UNBOUND if they attempt to use queue_work_near? In that case I 
should be looking for some way to go from a node to a CPU, shouldn't I? 
If so should I look at doing something like wq_select_unbound_cpu that 
uses the node cpumask instead of the wq_unbound_cpumask?

- Alex



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-10-01 21:54               ` Alexander Duyck
  (?)
@ 2018-10-02 17:41                 ` Tejun Heo
  -1 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2018-10-02 17:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

Hello,

On Mon, Oct 01, 2018 at 02:54:39PM -0700, Alexander Duyck wrote:
> >It might be better to leave queue_work_on() to be used for per-cpu
> >workqueues and introduce queue_work_near() as you suggested.  I just
> >don't want it to duplicate the node selection code in it.  Would that
> >work?
> 
> So if I understand what you are saying correctly, we default to
> round-robin when a given node has no CPUs attached to it. I could
> probably work with that if that is the default behavior instead of
> adding much of the complexity I already have.

Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
round-robin.  We can probably do better there and find the nearest
node considering topology.

> The question I have then is what should I do about workqueues that
> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that

Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
out later and users could already do that anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-10-02 17:41                 ` Tejun Heo
  (?)
@ 2018-10-02 18:23                   ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-10-02 18:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

On 10/2/2018 10:41 AM, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 01, 2018 at 02:54:39PM -0700, Alexander Duyck wrote:
>>> It might be better to leave queue_work_on() to be used for per-cpu
>>> workqueues and introduce queue_work_near() as you suggested.  I just
>>> don't want it to duplicate the node selection code in it.  Would that
>>> work?
>>
>> So if I understand what you are saying correctly, we default to
>> round-robin when a given node has no CPUs attached to it. I could
>> probably work with that if that is the default behavior instead of
>> adding much of the complexity I already have.
> 
> Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
> requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
> round-robin.  We can probably do better there and find the nearest
> node considering topology.

Well, if we could get wq_select_unbound_cpu() doing the right thing based
on node topology, that would be most of my work solved right there.
Basically I could just pass WQ_CPU_UNBOUND with the correct node and it
would take care of getting to the right CPU.

>> The question I have then is what should I do about workqueues that
>> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
> 
> Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
> out later and users could already do that anyway.
> 
> Thanks.

So are you saying I should just return an error for now if somebody 
tries to use something other than an unbound workqueue with 
queue_work_near, and expect everyone else to just use queue_work_on for 
the other workqueue types?

Thanks.

- Alex
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm




* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-10-02 18:23                   ` Alexander Duyck
  (?)
@ 2018-10-02 18:41                     ` Tejun Heo
  -1 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2018-10-02 18:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

Hello,

On Tue, Oct 02, 2018 at 11:23:26AM -0700, Alexander Duyck wrote:
> >Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
> >requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
> >round-robin.  We can probably do better there and find the nearest
> >node considering topology.
> 
> Well if we could get wq_select_unbound_cpu doing the right thing
> based on node topology that would be most of my work solved right
> there. Basically I could just pass WQ_CPU_UNBOUND with the correct
> node and it would take care of getting to the right CPU.

Yeah, something like that.  It might be better to keep the function
taking a cpu for consistency, since everything else passes around cpus.

> >>The question I have then is what should I do about workqueues that
> >>aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
> >
> >Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
> >out later and users could already do that anyway.
> 
> So are you saying I should just return an error for now if somebody
> tries to use something other than an unbound workqueue with
> queue_work_near, and expect everyone else to just use queue_work_on
> for the other workqueue types?

Oh, I meant that let's not add a new interface for now and just use
queue_work_on() for your use case too.

Thanks.

-- 
tejun




* Re: [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node
  2018-10-02 18:41                     ` Tejun Heo
  (?)
@ 2018-10-02 20:49                       ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2018-10-02 20:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: len.brown, linux-pm, gregkh, linux-nvdimm, jiangshanlai,
	linux-kernel, zwisler, pavel, rafael, akpm

On 10/2/2018 11:41 AM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Oct 02, 2018 at 11:23:26AM -0700, Alexander Duyck wrote:
>>> Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
>>> requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
>>> round-robin.  We can probably do better there and find the nearest
>>> node considering topology.
>>
>> Well if we could get wq_select_unbound_cpu doing the right thing
>> based on node topology that would be most of my work solved right
>> there. Basically I could just pass WQ_CPU_UNBOUND with the correct
>> node and it would take care of getting to the right CPU.
> 
> Yeah, sth like that.  It might be better to keep the function to take
> cpu for consistency as everything else passes around cpu.
> 
>>>> The question I have then is what should I do about workqueues that
>>>> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
>>>
>>> Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
>>> out later and users could already do that anyway.
>>
>> So are you saying I should just return an error for now if somebody
>> tries to use something other than an unbound workqueue with
>> queue_work_near, and expect everyone else to just use queue_work_on
>> for the other workqueue types?
> 
> Oh, I meant that let's not add a new interface for now and just use
> queue_work_on() for your use case too.
> 
> Thanks.

So the only issue is that I was hoping to get away with not having to
add additional preemption handling. That was the motivation behind doing
queue_work_near: I could just wrap it all in the same local_irq_save so
that I don't have to worry about the CPU I am on changing.

What I may look at doing is just greatly reducing the
workqueue_select_unbound_cpu_near function to essentially just perform a
few tests and then use the result of a cpumask_any_and of the
cpumask_of_node and the cpu_online_mask. I'll probably rename it while I
am at it, since I am going to be getting away from the "unbound" checks
in the logic.

- Alex





Thread overview: 69+ messages
2018-09-26 21:51 [RFC workqueue/driver-core PATCH 0/5] Add NUMA aware async_schedule calls Alexander Duyck
2018-09-26 21:51 ` [RFC workqueue/driver-core PATCH 1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node Alexander Duyck
2018-09-26 21:53   ` Tejun Heo
2018-09-26 22:05     ` Alexander Duyck
2018-09-26 22:09       ` Tejun Heo
2018-09-26 22:19         ` Alexander Duyck
2018-10-01 16:01           ` Tejun Heo
2018-10-01 21:54             ` Alexander Duyck
2018-10-02 17:41               ` Tejun Heo
2018-10-02 18:23                 ` Alexander Duyck
2018-10-02 18:41                   ` Tejun Heo
2018-10-02 20:49                     ` Alexander Duyck
2018-09-26 21:51 ` [RFC workqueue/driver-core PATCH 2/5] async: Add support for queueing on specific NUMA node Alexander Duyck
2018-09-27  0:31   ` Dan Williams
2018-09-27 15:16     ` Alexander Duyck
2018-09-27 19:48       ` Dan Williams
2018-09-27 20:03         ` Alexander Duyck
2018-09-26 21:51 ` [RFC workqueue/driver-core PATCH 3/5] driver core: Probe devices asynchronously instead of the driver Alexander Duyck
2018-09-27  0:48   ` Dan Williams
2018-09-27 15:27     ` Alexander Duyck
2018-09-28  2:48       ` Dan Williams
2018-09-26 21:51 ` [RFC workqueue/driver-core PATCH 4/5] driver core: Use new async_schedule_dev command Alexander Duyck
2018-09-28 17:42   ` Dan Williams
2018-09-26 21:52 ` [RFC workqueue/driver-core PATCH 5/5] nvdimm: Schedule device registration on node local to the device Alexander Duyck
2018-09-28 17:46   ` Dan Williams
