* [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls
@ 2018-10-10 23:07 Alexander Duyck
  2018-10-10 23:07 ` [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node Alexander Duyck
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:07 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This patch set provides functionality to help improve the locality of the
async_schedule calls used to provide deferred initialization.

This patch set originally started out with me focused on just the one call
to async_schedule_domain in the nvdimm tree that was being used to defer
the device_add call. However, after doing some digging I realized the scope
of this was much broader than I had originally planned. As such, I went
through and reworked the underlying infrastructure, down to replacing the
queue_work call itself with a function of my own, and opted to provide a
NUMA aware solution that would work for a broader audience.
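
To give a rough idea of the end result, a caller that currently defers
per-device work with async_schedule can switch to the device-aware variant
added in patch 2 and have the work queued near the device's NUMA node.
(This is only a sketch; probe_one() is a hypothetical placeholder, not
code from this series.)

	/* before: work runs on whichever CPU the unbound workqueue picks */
	async_schedule(probe_one, dev);

	/* after: work is queued near dev_to_node(dev) */
	async_schedule_dev(probe_one, dev);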

RFC->v1:
    Dropped nvdimm patch to submit later.
        It relies on code in the libnvdimm development tree.
    Simplified queue_work_near to just convert node into a CPU.
    Split up driver core and PM core patches.
v1->v2:
    Renamed queue_work_near to queue_work_node
    Added WARN_ON_ONCE if we use queue_work_node with per-cpu workqueue

---

Alexander Duyck (5):
      workqueue: Provide queue_work_node to queue work near a given NUMA node
      async: Add support for queueing on specific NUMA node
      driver core: Probe devices asynchronously instead of the driver
      driver core: Attach devices on CPU local to device node
      PM core: Use new async_schedule_dev command


 drivers/base/bus.c        |   23 ++----------
 drivers/base/dd.c         |   44 +++++++++++++++++++++++-
 drivers/base/power/main.c |   12 +++---
 include/linux/async.h     |   36 ++++++++++++++++++-
 include/linux/workqueue.h |    2 +
 kernel/async.c            |   47 ++++++++++++-------------
 kernel/workqueue.c        |   84 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 193 insertions(+), 55 deletions(-)

--


* [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node
  2018-10-10 23:07 [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls Alexander Duyck
@ 2018-10-10 23:07 ` Alexander Duyck
  2018-10-11 15:04   ` Tejun Heo
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific " Alexander Duyck
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:07 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This patch provides a new function, queue_work_node, which is meant to
schedule work on a "random" CPU of the requested NUMA node. The main
motivation for this is to help asynchronous init improve boot times for
devices that are local to a specific node.

For now we just default to the first CPU that is in the intersection of
the cpumask of the node and the online cpumask. The only exception is that
if the current CPU is already local to the node, we just use it. This
should work for our purposes, as we are currently only using this for
unbound work, so the CPU will be translated to a node anyway instead of
being used directly.

As we are only using the first CPU to represent the NUMA node for now, I
am limiting the scope of the function so that it can only be used with
unbound workqueues.
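
As a usage sketch (not code from this patch; the workqueue, work function,
and node value below are hypothetical), a caller that knows the target node
would pair the new helper with an unbound workqueue:

static struct work_struct my_work;

static void my_work_fn(struct work_struct *work)
{
	/* node-local initialization would go here */
}

static int __init my_init(void)
{
	struct workqueue_struct *wq;

	wq = alloc_workqueue("my_wq", WQ_UNBOUND, 0);
	if (!wq)
		return -ENOMEM;

	INIT_WORK(&my_work, my_work_fn);

	/*
	 * Queue near NUMA node 1; if the node is invalid or has no online
	 * CPUs this degrades to normal queue_work behavior. A per-cpu
	 * workqueue here would trip the WARN_ON_ONCE added by this patch.
	 */
	queue_work_node(1, wq, &my_work);
	return 0;
}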

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---

RFC->v1:
    Simplified queue_work_near to just convert node into a CPU.
v1->v2:
    Renamed queue_work_near to queue_work_node
    Added WARN_ON_ONCE if we use queue_work_node with per-cpu workqueue

 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |   84 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 60d673e15632..1f50c1e586e7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -463,6 +463,8 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
+extern bool queue_work_node(int node, struct workqueue_struct *wq,
+			    struct work_struct *work);
 extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e..6ed7c2eb84b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1492,6 +1492,90 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
+/**
+ * workqueue_select_cpu_near - Select a CPU based on NUMA node
+ * @node: NUMA node ID that we want to bind a CPU from
+ *
+ * This function will attempt to find a "random" cpu available on a given
+ * node. If there are no CPUs available on the given node it will return
+ * WORK_CPU_UNBOUND indicating that we should just schedule to any
+ * available CPU if we need to schedule this work.
+ */
+static int workqueue_select_cpu_near(int node)
+{
+	int cpu;
+
+	/* No point in doing this if NUMA isn't enabled for workqueues */
+	if (!wq_numa_enabled)
+		return WORK_CPU_UNBOUND;
+
+	/* Delay binding to CPU if node is not valid or online */
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		return WORK_CPU_UNBOUND;
+
+	/* Use local node/cpu if we are already there */
+	cpu = raw_smp_processor_id();
+	if (node == cpu_to_node(cpu))
+		return cpu;
+
+	/* Use "random" otherwise known as "first" online CPU of node */
+	cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+
+	/* If CPU is valid return that, otherwise just defer */
+	return (cpu < nr_cpu_ids) ? cpu : WORK_CPU_UNBOUND;
+}
+
+/**
+ * queue_work_node - queue work on a "random" cpu for a given NUMA node
+ * @node: NUMA node that we are targeting the work for
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * We queue the work to a "random" CPU within a given NUMA node. The basic
+ * idea here is to provide a way to somehow associate work with a given
+ * NUMA node.
+ *
+ * This function will only make a best effort attempt at getting this onto
+ * the right NUMA node. If no node is requested or the requested node is
+ * offline then we just fall back to standard queue_work behavior.
+ *
+ * Currently the "random" CPU ends up being the first available CPU in the
+ * intersection of cpu_online_mask and the cpumask of the node, unless we
+ * are running on the node. In that case we just use the current CPU.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_work_node(int node, struct workqueue_struct *wq,
+		     struct work_struct *work)
+{
+	unsigned long flags;
+	bool ret = false;
+
+	/*
+	 * This current implementation is specific to unbound workqueues.
+	 * Specifically we only return the first available CPU for a given
+	 * node instead of cycling through individual CPUs within the node.
+	 *
+	 * If this is used with a per-cpu workqueue then the logic in
+	 * workqueue_select_cpu_near would need to be updated to allow for
+	 * some round robin type logic.
+	 */
+	WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND));
+
+	local_irq_save(flags);
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		int cpu = workqueue_select_cpu_near(node);
+
+		__queue_work(cpu, wq, work);
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_node);
+
 void delayed_work_timer_fn(struct timer_list *t)
 {
 	struct delayed_work *dwork = from_timer(dwork, t, timer);



* [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific NUMA node
  2018-10-10 23:07 [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls Alexander Duyck
  2018-10-10 23:07 ` [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node Alexander Duyck
@ 2018-10-10 23:08 ` Alexander Duyck
  2018-10-11 10:47   ` Greg KH
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 3/5] driver core: Probe devices asynchronously instead of the driver Alexander Duyck
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:08 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This patch introduces four new variants of the async_schedule_ functions
that allow scheduling on a specific NUMA node.

The first two functions are async_schedule_near and
async_schedule_near_domain which end up mapping to async_schedule and
async_schedule_domain but provide NUMA node specific functionality. They
replace the original functions which were moved to inline function
definitions that call the new functions while passing NUMA_NO_NODE.

The second two functions are async_schedule_dev and
async_schedule_dev_domain which provide NUMA specific functionality when
passing a device as the data member and that device has a NUMA node other
than NUMA_NO_NODE.

The main motivation behind this is to address the need to be able to
schedule device specific init work on specific NUMA nodes in order to
improve performance of memory initialization.
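
For illustration only (do_init(), data, dev, the node value, and my_domain
are hypothetical placeholders rather than code from this patch), the
relationship between the variants looks roughly like this:

static void do_init(void *data, async_cookie_t cookie)
{
	/* hypothetical deferred init work */
}

static ASYNC_DOMAIN(my_domain);

static void example(void *data, struct device *dev)
{
	async_cookie_t cookie;

	/* explicit node, default domain */
	cookie = async_schedule_near(do_init, data, 1);

	/* explicit node within a private synchronization domain */
	cookie = async_schedule_near_domain(do_init, data, 1, &my_domain);

	/* node taken from dev_to_node(dev) */
	cookie = async_schedule_dev(do_init, dev);
	cookie = async_schedule_dev_domain(do_init, dev, &my_domain);

	/* existing callers are unchanged and map to NUMA_NO_NODE */
	cookie = async_schedule(do_init, data);
}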

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---

v1->v2:
    Replaced call to queue_work_near with call to queue_work_node

 include/linux/async.h |   36 +++++++++++++++++++++++++++++++++---
 kernel/async.c        |   47 ++++++++++++++++++++++-------------------------
 2 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/include/linux/async.h b/include/linux/async.h
index 6b0226bdaadc..1f6cb0d50263 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -14,6 +14,8 @@
 
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/numa.h>
+#include <linux/device.h>
 
 typedef u64 async_cookie_t;
 typedef void (*async_func_t) (void *data, async_cookie_t cookie);
@@ -37,9 +39,37 @@ struct async_domain {
 	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
 				      .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_t func, void *data);
-extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
-					    struct async_domain *domain);
+async_cookie_t async_schedule_near(async_func_t func, void *data,
+				   int node);
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node,
+					  struct async_domain *domain);
+
+static inline async_cookie_t async_schedule(async_func_t func, void *data)
+{
+	return async_schedule_near(func, data, NUMA_NO_NODE);
+}
+
+static inline async_cookie_t
+async_schedule_domain(async_func_t func, void *data,
+		      struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, data, NUMA_NO_NODE, domain);
+}
+
+static inline async_cookie_t
+async_schedule_dev(async_func_t func, struct device *dev)
+{
+	return async_schedule_near(func, dev, dev_to_node(dev));
+}
+
+static inline async_cookie_t
+async_schedule_dev_domain(async_func_t func, struct device *dev,
+			  struct async_domain *domain)
+{
+	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
+}
+
 void async_unregister_domain(struct async_domain *domain);
 extern void async_synchronize_full(void);
 extern void async_synchronize_full_domain(struct async_domain *domain);
diff --git a/kernel/async.c b/kernel/async.c
index a893d6170944..c393c169dbcb 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -149,7 +149,21 @@ static void async_run_entry_fn(struct work_struct *work)
 	wake_up(&async_done);
 }
 
-static async_cookie_t __async_schedule(async_func_t func, void *data, struct async_domain *domain)
+/**
+ * async_schedule_near_domain - schedule a function for asynchronous execution within a certain domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.  A
+ * synchronization domain is specified via @domain.  Note: This function
+ * may be called from atomic or non-atomic contexts.
+ */
+async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
+					  int node, struct async_domain *domain)
 {
 	struct async_entry *entry;
 	unsigned long flags;
@@ -195,43 +209,26 @@ static async_cookie_t __async_schedule(async_func_t func, void *data, struct asy
 	current->flags |= PF_USED_ASYNC;
 
 	/* schedule for execution */
-	queue_work(system_unbound_wq, &entry->work);
+	queue_work_node(node, system_unbound_wq, &entry->work);
 
 	return newcookie;
 }
+EXPORT_SYMBOL_GPL(async_schedule_near_domain);
 
 /**
- * async_schedule - schedule a function for asynchronous execution
+ * async_schedule_near - schedule a function for asynchronous execution
  * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
  *
  * Returns an async_cookie_t that may be used for checkpointing later.
  * Note: This function may be called from atomic or non-atomic contexts.
  */
-async_cookie_t async_schedule(async_func_t func, void *data)
-{
-	return __async_schedule(func, data, &async_dfl_domain);
-}
-EXPORT_SYMBOL_GPL(async_schedule);
-
-/**
- * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
- * @func: function to execute asynchronously
- * @data: data pointer to pass to the function
- * @domain: the domain
- *
- * Returns an async_cookie_t that may be used for checkpointing later.
- * @domain may be used in the async_synchronize_*_domain() functions to
- * wait within a certain synchronization domain rather than globally.  A
- * synchronization domain is specified via @domain.  Note: This function
- * may be called from atomic or non-atomic contexts.
- */
-async_cookie_t async_schedule_domain(async_func_t func, void *data,
-				     struct async_domain *domain)
+async_cookie_t async_schedule_near(async_func_t func, void *data, int node)
 {
-	return __async_schedule(func, data, domain);
+	return async_schedule_near_domain(func, data, node, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule_domain);
+EXPORT_SYMBOL_GPL(async_schedule_near);
 
 /**
  * async_synchronize_full - synchronize all asynchronous function calls



* [workqueue/driver-core PATCH v2 3/5] driver core: Probe devices asynchronously instead of the driver
  2018-10-10 23:07 [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls Alexander Duyck
  2018-10-10 23:07 ` [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node Alexander Duyck
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific " Alexander Duyck
@ 2018-10-10 23:08 ` Alexander Duyck
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node Alexander Duyck
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 5/5] PM core: Use new async_schedule_dev command Alexander Duyck
  4 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:08 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This change makes it so that we probe devices asynchronously instead of the
driver. This results in us seeing the same behavior whether the device is
registered before the driver or after it. This way we can avoid serializing
the initialization should the driver not be loaded until after the devices
have already been added.

The motivation behind this is that if we have a set of devices that take a
significant amount of time to initialize, we can greatly reduce that time
by processing them in parallel instead of one at a time. In addition, each
device can exist on a different node, so placing a single thread on one CPU
to initialize all of the devices for a given driver can result in poor
performance on a system with multiple nodes.

One issue I can see with this patch is that I am using the
dev_set/get_drvdata functions to store the driver in the device while I am
waiting on the asynchronous init to complete. For now I am protecting it by
using the lack of a dev->driver and the device lock.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/base/bus.c |   23 +++--------------------
 drivers/base/dd.c  |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 8bfd27ec73d6..2a17bed657ec 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -616,17 +616,6 @@ static ssize_t uevent_store(struct device_driver *drv, const char *buf,
 }
 static DRIVER_ATTR_WO(uevent);
 
-static void driver_attach_async(void *_drv, async_cookie_t cookie)
-{
-	struct device_driver *drv = _drv;
-	int ret;
-
-	ret = driver_attach(drv);
-
-	pr_debug("bus: '%s': driver %s async attach completed: %d\n",
-		 drv->bus->name, drv->name, ret);
-}
-
 /**
  * bus_add_driver - Add a driver to the bus.
  * @drv: driver.
@@ -659,15 +648,9 @@ int bus_add_driver(struct device_driver *drv)
 
 	klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
 	if (drv->bus->p->drivers_autoprobe) {
-		if (driver_allows_async_probing(drv)) {
-			pr_debug("bus: '%s': probing driver %s asynchronously\n",
-				drv->bus->name, drv->name);
-			async_schedule(driver_attach_async, drv);
-		} else {
-			error = driver_attach(drv);
-			if (error)
-				goto out_unregister;
-		}
+		error = driver_attach(drv);
+		if (error)
+			goto out_unregister;
 	}
 	module_add_driver(drv->owner, drv);
 
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 169412ee4ae8..5ba366c1cb83 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -864,6 +864,29 @@ void device_initial_probe(struct device *dev)
 	__device_attach(dev, true);
 }
 
+static void __driver_attach_async_helper(void *_dev, async_cookie_t cookie)
+{
+	struct device *dev = _dev;
+
+	if (dev->parent && dev->bus->need_parent_lock)
+		device_lock(dev->parent);
+	device_lock(dev);
+
+	if (!dev->driver) {
+		struct device_driver *drv = dev_get_drvdata(dev);
+
+		driver_probe_device(drv, dev);
+	}
+
+	dev_dbg(dev, "async probe completed\n");
+
+	device_unlock(dev);
+	if (dev->parent && dev->bus->need_parent_lock)
+		device_unlock(dev->parent);
+
+	put_device(dev);
+}
+
 static int __driver_attach(struct device *dev, void *data)
 {
 	struct device_driver *drv = data;
@@ -891,6 +914,25 @@ static int __driver_attach(struct device *dev, void *data)
 		return ret;
 	} /* ret > 0 means positive match */
 
+	if (driver_allows_async_probing(drv)) {
+		/*
+		 * Instead of probing the device synchronously we will
+		 * probe it asynchronously to allow for more parallelism.
+		 *
+		 * We only take the device lock here in order to guarantee
+		 * that the dev->driver and driver_data fields are protected
+		 */
+		dev_dbg(dev, "scheduling asynchronous probe\n");
+		device_lock(dev);
+		if (!dev->driver) {
+			get_device(dev);
+			dev_set_drvdata(dev, drv);
+			async_schedule(__driver_attach_async_helper, dev);
+		}
+		device_unlock(dev);
+		return 0;
+	}
+
 	if (dev->parent && dev->bus->need_parent_lock)
 		device_lock(dev->parent);
 	device_lock(dev);



* [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node
  2018-10-10 23:07 [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls Alexander Duyck
                   ` (2 preceding siblings ...)
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 3/5] driver core: Probe devices asynchronously instead of the driver Alexander Duyck
@ 2018-10-10 23:08 ` Alexander Duyck
  2018-10-11 10:45   ` Greg KH
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 5/5] PM core: Use new async_schedule_dev command Alexander Duyck
  4 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:08 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This change makes it so that we call the asynchronous probe routines on a
CPU local to the device node. By doing this we should be able to improve
our initialization time significantly as we can avoid having to access the
device from a remote node which may introduce higher latency.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/base/dd.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 5ba366c1cb83..81472dc44a70 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -826,7 +826,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 			 */
 			dev_dbg(dev, "scheduling asynchronous probe\n");
 			get_device(dev);
-			async_schedule(__device_attach_async_helper, dev);
+			async_schedule_dev(__device_attach_async_helper, dev);
 		} else {
 			pm_request_idle(dev);
 		}
@@ -927,7 +927,7 @@ static int __driver_attach(struct device *dev, void *data)
 		if (!dev->driver) {
 			get_device(dev);
 			dev_set_drvdata(dev, drv);
-			async_schedule(__driver_attach_async_helper, dev);
+			async_schedule_dev(__driver_attach_async_helper, dev);
 		}
 		device_unlock(dev);
 		return 0;



* [workqueue/driver-core PATCH v2 5/5] PM core: Use new async_schedule_dev command
  2018-10-10 23:07 [workqueue/driver-core PATCH v2 0/5] Add NUMA aware async_schedule calls Alexander Duyck
                   ` (3 preceding siblings ...)
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node Alexander Duyck
@ 2018-10-10 23:08 ` Alexander Duyck
  4 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-10 23:08 UTC (permalink / raw)
  To: tj, gregkh, akpm, linux-kernel
  Cc: len.brown, rafael, linux-pm, jiangshanlai, pavel, zwisler,
	alexander.h.duyck

This change makes it so that we use the device specific version of the
async_schedule commands to defer various tasks related to power management.
By doing this we should see a slight improvement in performance, as any
device that is sensitive to latency/locality during setup will now be
initialized on the node closest to the device.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/base/power/main.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index a690fd400260..ebb8b61b52e9 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -726,7 +726,7 @@ void dpm_noirq_resume_devices(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume_noirq, dev);
+			async_schedule_dev(async_resume_noirq, dev);
 		}
 	}
 
@@ -883,7 +883,7 @@ void dpm_resume_early(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume_early, dev);
+			async_schedule_dev(async_resume_early, dev);
 		}
 	}
 
@@ -1047,7 +1047,7 @@ void dpm_resume(pm_message_t state)
 		reinit_completion(&dev->power.completion);
 		if (is_async(dev)) {
 			get_device(dev);
-			async_schedule(async_resume, dev);
+			async_schedule_dev(async_resume, dev);
 		}
 	}
 
@@ -1366,7 +1366,7 @@ static int device_suspend_noirq(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_noirq, dev);
+		async_schedule_dev(async_suspend_noirq, dev);
 		return 0;
 	}
 	return __device_suspend_noirq(dev, pm_transition, false);
@@ -1569,7 +1569,7 @@ static int device_suspend_late(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend_late, dev);
+		async_schedule_dev(async_suspend_late, dev);
 		return 0;
 	}
 
@@ -1833,7 +1833,7 @@ static int device_suspend(struct device *dev)
 
 	if (is_async(dev)) {
 		get_device(dev);
-		async_schedule(async_suspend, dev);
+		async_schedule_dev(async_suspend, dev);
 		return 0;
 	}
 



* Re: [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node Alexander Duyck
@ 2018-10-11 10:45   ` Greg KH
  2018-10-11 15:50     ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Greg KH @ 2018-10-11 10:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: tj, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler

On Wed, Oct 10, 2018 at 04:08:40PM -0700, Alexander Duyck wrote:
> This change makes it so that we call the asynchronous probe routines on a
> CPU local to the device node. By doing this we should be able to improve
> our initialization time significantly as we can avoid having to access the
> device from a remote node which may introduce higher latency.

This is nice in theory, but what kind of real numbers does this show?
There's a lot of added complexity here, and what is the benefit?

Benchmarks or bootcharts that we can see would be great to have, thanks.

greg k-h


* Re: [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific NUMA node
  2018-10-10 23:08 ` [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific " Alexander Duyck
@ 2018-10-11 10:47   ` Greg KH
  2018-10-11 15:51     ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Greg KH @ 2018-10-11 10:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: tj, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler

On Wed, Oct 10, 2018 at 04:08:21PM -0700, Alexander Duyck wrote:
> This patch introduces four new variants of the async_schedule_ functions
> that allow scheduling on a specific NUMA node.
> 
> The first two functions are async_schedule_near and
> async_schedule_near_domain which end up mapping to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions which were moved to inline function
> definitions that call the new functions while passing NUMA_NO_NODE.
> 
> The second two functions are async_schedule_dev and
> async_schedule_dev_domain which provide NUMA specific functionality when
> passing a device as the data member and that device has a NUMA node other
> than NUMA_NO_NODE.
> 
> The main motivation behind this is to address the need to be able to
> schedule device specific init work on specific NUMA nodes in order to
> improve performance of memory initialization.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
> 
> v1->v2:
>     Replaced call to queue_work_near with call to queue_work_node
> 
>  include/linux/async.h |   36 +++++++++++++++++++++++++++++++++---
>  kernel/async.c        |   47 ++++++++++++++++++++++-------------------------
>  2 files changed, 55 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/async.h b/include/linux/async.h
> index 6b0226bdaadc..1f6cb0d50263 100644
> --- a/include/linux/async.h
> +++ b/include/linux/async.h
> @@ -14,6 +14,8 @@
>  
>  #include <linux/types.h>
>  #include <linux/list.h>
> +#include <linux/numa.h>
> +#include <linux/device.h>
>  
>  typedef u64 async_cookie_t;
>  typedef void (*async_func_t) (void *data, async_cookie_t cookie);
> @@ -37,9 +39,37 @@ struct async_domain {
>  	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
>  				      .registered = 0 }
>  
> -extern async_cookie_t async_schedule(async_func_t func, void *data);
> -extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
> -					    struct async_domain *domain);
> +async_cookie_t async_schedule_near(async_func_t func, void *data,
> +				   int node);
> +async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
> +					  int node,
> +					  struct async_domain *domain);
> +
> +static inline async_cookie_t async_schedule(async_func_t func, void *data)
> +{
> +	return async_schedule_near(func, data, NUMA_NO_NODE);
> +}
> +
> +static inline async_cookie_t
> +async_schedule_domain(async_func_t func, void *data,
> +		      struct async_domain *domain)
> +{
> +	return async_schedule_near_domain(func, data, NUMA_NO_NODE, domain);
> +}
> +
> +static inline async_cookie_t
> +async_schedule_dev(async_func_t func, struct device *dev)
> +{
> +	return async_schedule_near(func, dev, dev_to_node(dev));
> +}
> +
> +static inline async_cookie_t
> +async_schedule_dev_domain(async_func_t func, struct device *dev,
> +			  struct async_domain *domain)
> +{
> +	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
> +}

No kerneldoc for these functions?  We lost it for
async_schedule_domain() which used to be documented before this patch :(

thanks,

greg k-h


* Re: [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node
  2018-10-10 23:07 ` [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node Alexander Duyck
@ 2018-10-11 15:04   ` Tejun Heo
  2018-10-11 16:49     ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2018-10-11 15:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: gregkh, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler

On Wed, Oct 10, 2018 at 04:07:42PM -0700, Alexander Duyck wrote:
> This patch provides a new function, queue_work_node, which is meant to
> schedule work on a "random" CPU of the requested NUMA node. The main
> motivation for this is to help asynchronous init improve boot times for
> devices that are local to a specific node.
> 
> For now we just default to the first CPU that is in the intersection of
> the cpumask of the node and the online cpumask. The only exception is that
> if the current CPU is already local to the node, we just use it. This
> should work for our purposes, as we are currently only using this for
> unbound work, so the CPU will be translated to a node anyway instead of
> being used directly.
> 
> As we are only using the first CPU to represent the NUMA node for now, I
> am limiting the scope of the function so that it can only be used with
> unbound workqueues.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Acked-by: Tejun Heo <tj@kernel.org>

Please let me know how you wanna route the patch.

Thanks.

-- 
tejun


* Re: [workqueue/driver-core PATCH v2 4/5] driver core: Attach devices on CPU local to device node
  2018-10-11 10:45   ` Greg KH
@ 2018-10-11 15:50     ` Alexander Duyck
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-11 15:50 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler



On 10/11/2018 3:45 AM, Greg KH wrote:
> On Wed, Oct 10, 2018 at 04:08:40PM -0700, Alexander Duyck wrote:
>> This change makes it so that we call the asynchronous probe routines on a
>> CPU local to the device node. By doing this we should be able to improve
>> our initialization time significantly as we can avoid having to access the
>> device from a remote node which may introduce higher latency.
> 
> This is nice in theory, but what kind of real numbers does this show?
> There's a lot of added complexity here, and what is the benefit?
> 
> Benchmarks or bootcharts that we can see would be great to have, thanks.
> 
> greg k-h
> 

In the case of persistent memory init the cost of getting the wrong node 
is pretty significant. On my test system with 3TB per node, just getting 
the initialization node matched up to the memory node dropped 
initialization time from 39 seconds down to about 26 seconds per node.

We are already starting to see code like this pop up in subsystems 
anyway. For example, the PCI code already has logic similar to what I am 
adding here [1]. I'm hoping that by placing this change in the core 
device code we can start consolidating it, so we don't have all the 
individual drivers or subsystems implementing their own NUMA specific 
init logic.

This is likely going to become more of an issue in the future, as we now 
have CPUs like the AMD Ryzen Threadripper out there that are getting 
people to start discussing NUMA in the consumer space.

- Alex

[1] 
https://elixir.bootlin.com/linux/v4.19-rc7/source/drivers/pci/pci-driver.c#L331


* Re: [workqueue/driver-core PATCH v2 2/5] async: Add support for queueing on specific NUMA node
  2018-10-11 10:47   ` Greg KH
@ 2018-10-11 15:51     ` Alexander Duyck
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-11 15:51 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler



On 10/11/2018 3:47 AM, Greg KH wrote:
> On Wed, Oct 10, 2018 at 04:08:21PM -0700, Alexander Duyck wrote:
>> This patch introduces four new variants of the async_schedule_ functions
>> that allow scheduling on a specific NUMA node.
>>
>> The first two functions are async_schedule_near and
>> async_schedule_near_domain which end up mapping to async_schedule and
>> async_schedule_domain but provide NUMA node specific functionality. They
>> replace the original functions which were moved to inline function
>> definitions that call the new functions while passing NUMA_NO_NODE.
>>
>> The second two functions are async_schedule_dev and
>> async_schedule_dev_domain which provide NUMA specific functionality when
>> passing a device as the data member and that device has a NUMA node other
>> than NUMA_NO_NODE.
>>
>> The main motivation behind this is to address the need to be able to
>> schedule device specific init work on specific NUMA nodes in order to
>> improve performance of memory initialization.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>> ---
>>
>> v1->v2:
>>      Replaced call to queue_work_near with call to queue_work_node
>>
>>   include/linux/async.h |   36 +++++++++++++++++++++++++++++++++---
>>   kernel/async.c        |   47 ++++++++++++++++++++++-------------------------
>>   2 files changed, 55 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/async.h b/include/linux/async.h
>> index 6b0226bdaadc..1f6cb0d50263 100644
>> --- a/include/linux/async.h
>> +++ b/include/linux/async.h
>> @@ -14,6 +14,8 @@
>>   
>>   #include <linux/types.h>
>>   #include <linux/list.h>
>> +#include <linux/numa.h>
>> +#include <linux/device.h>
>>   
>>   typedef u64 async_cookie_t;
>>   typedef void (*async_func_t) (void *data, async_cookie_t cookie);
>> @@ -37,9 +39,37 @@ struct async_domain {
>>   	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
>>   				      .registered = 0 }
>>   
>> -extern async_cookie_t async_schedule(async_func_t func, void *data);
>> -extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
>> -					    struct async_domain *domain);
>> +async_cookie_t async_schedule_near(async_func_t func, void *data,
>> +				   int node);
>> +async_cookie_t async_schedule_near_domain(async_func_t func, void *data,
>> +					  int node,
>> +					  struct async_domain *domain);
>> +
>> +static inline async_cookie_t async_schedule(async_func_t func, void *data)
>> +{
>> +	return async_schedule_near(func, data, NUMA_NO_NODE);
>> +}
>> +
>> +static inline async_cookie_t
>> +async_schedule_domain(async_func_t func, void *data,
>> +		      struct async_domain *domain)
>> +{
>> +	return async_schedule_near_domain(func, data, NUMA_NO_NODE, domain);
>> +}
>> +
>> +static inline async_cookie_t
>> +async_schedule_dev(async_func_t func, struct device *dev)
>> +{
>> +	return async_schedule_near(func, dev, dev_to_node(dev));
>> +}
>> +
>> +static inline async_cookie_t
>> +async_schedule_dev_domain(async_func_t func, struct device *dev,
>> +			  struct async_domain *domain)
>> +{
>> +	return async_schedule_near_domain(func, dev, dev_to_node(dev), domain);
>> +}
> 
> No kerneldoc for these functions?  We lost it for
> async_schedule_domain() which used to be documented before this patch :(
> 
> thanks,
> 
> greg k-h
> 

I will add it for v3.

Thanks.

- Alex


* Re: [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node
  2018-10-11 15:04   ` Tejun Heo
@ 2018-10-11 16:49     ` Alexander Duyck
  2018-10-11 17:02       ` Greg KH
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2018-10-11 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler



On 10/11/2018 8:04 AM, Tejun Heo wrote:
> On Wed, Oct 10, 2018 at 04:07:42PM -0700, Alexander Duyck wrote:
>> This patch provides a new function, queue_work_node, which is meant to
>> schedule work on a "random" CPU of the requested NUMA node. The main
>> motivation for this is to help asynchronous init improve boot times for
>> devices that are local to a specific node.
>>
>> For now we just default to the first CPU that is in the intersection of
>> the cpumask of the node and the online cpumask. The only exception is that
>> if the current CPU is already local to the node, we just use it. This
>> should work for our purposes, as we are currently only using this for
>> unbound work, so the CPU will be translated to a node anyway instead of
>> being used directly.
>>
>> As we are only using the first CPU to represent the NUMA node for now, I
>> am limiting the scope of the function so that it can only be used with
>> unbound workqueues.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> 
> Please let me know how you wanna route the patch.
> 
> Thanks.

I would be good with routing the patches through you if that works. I 
had included you, Greg, and Andrew as I wasn't sure how you guys had 
wanted this routed since this affected both the workqueue and device trees.

I'll update the patches to resolve the lack of kerneldoc for the new 
"async_" functions and add some comments to the patch descriptions on 
the gains seen related to some of the specific patches for v3.

Thanks.

- Alex


* Re: [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node
  2018-10-11 16:49     ` Alexander Duyck
@ 2018-10-11 17:02       ` Greg KH
  2018-10-11 17:13         ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Greg KH @ 2018-10-11 17:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Tejun Heo, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler

On Thu, Oct 11, 2018 at 09:49:59AM -0700, Alexander Duyck wrote:
> 
> 
> On 10/11/2018 8:04 AM, Tejun Heo wrote:
> > On Wed, Oct 10, 2018 at 04:07:42PM -0700, Alexander Duyck wrote:
> > > This patch provides a new function, queue_work_node, which is meant to
> > > schedule work on a "random" CPU of the requested NUMA node. The main
> > > motivation for this is to help asynchronous init improve boot times for
> > > devices that are local to a specific node.
> > > 
> > > For now we just default to the first CPU that is in the intersection of
> > > the cpumask of the node and the online cpumask. The only exception is that
> > > if the current CPU is already local to the node, we just use it. This
> > > should work for our purposes, as we are currently only using this for
> > > unbound work, so the CPU will be translated to a node anyway instead of
> > > being used directly.
> > > 
> > > As we are only using the first CPU to represent the NUMA node for now, I
> > > am limiting the scope of the function so that it can only be used with
> > > unbound workqueues.
> > > 
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > Acked-by: Tejun Heo <tj@kernel.org>
> > 
> > Please let me know how you wanna route the patch.
> > 
> > Thanks.
> 
> I would be good with routing the patches through you if that works. I had
> included you, Greg, and Andrew as I wasn't sure how you guys had wanted this
> routed since this affected both the workqueue and device trees.
> 
> I'll update the patches to resolve the lack of kerneldoc for the new
> "async_" functions and add some comments to the patch descriptions on the
> gains seen related to some of the specific patches for v3.

As Tejun has acked this, and it affects the driver core, I'll be glad to
take it.

thanks,

greg k-h


* Re: [workqueue/driver-core PATCH v2 1/5] workqueue: Provide queue_work_node to queue work near a given NUMA node
  2018-10-11 17:02       ` Greg KH
@ 2018-10-11 17:13         ` Alexander Duyck
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2018-10-11 17:13 UTC (permalink / raw)
  To: Greg KH
  Cc: Tejun Heo, akpm, linux-kernel, len.brown, rafael, linux-pm,
	jiangshanlai, pavel, zwisler

On 10/11/2018 10:02 AM, Greg KH wrote:
> On Thu, Oct 11, 2018 at 09:49:59AM -0700, Alexander Duyck wrote:
>>
>>
>> On 10/11/2018 8:04 AM, Tejun Heo wrote:
>>> On Wed, Oct 10, 2018 at 04:07:42PM -0700, Alexander Duyck wrote:
>>>> This patch provides a new function, queue_work_node, which is meant to
>>>> schedule work on a "random" CPU of the requested NUMA node. The main
>>>> motivation for this is to help asynchronous init improve boot times for
>>>> devices that are local to a specific node.
>>>>
>>>> For now we just default to the first CPU that is in the intersection of
>>>> the cpumask of the node and the online cpumask. The only exception is that
>>>> if the current CPU is already local to the node, we just use it. This
>>>> should work for our purposes, as we are currently only using this for
>>>> unbound work, so the CPU will be translated to a node anyway instead of
>>>> being used directly.
>>>>
>>>> As we are only using the first CPU to represent the NUMA node for now, I
>>>> am limiting the scope of the function so that it can only be used with
>>>> unbound workqueues.
>>>>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> Acked-by: Tejun Heo <tj@kernel.org>
>>>
>>> Please let me know how you wanna route the patch.
>>>
>>> Thanks.
>>
>> I would be good with routing the patches through you if that works. I had
>> included you, Greg, and Andrew as I wasn't sure how you guys had wanted this
>> routed since this affected both the workqueue and device trees.
>>
>> I'll update the patches to resolve the lack of kerneldoc for the new
>> "async_" functions and add some comments to the patch descriptions on the
>> gains seen related to some of the specific patches for v3.
> 
> As Tejun has acked this, and it affects the driver core, I'll be glad to
> take it.
> 
> thanks,
> 
> greg k-h

Okay. That works. I will drop Tejun and Andrew to the Cc, and include you 
on the To line for the next submission so it is a bit clearer who should 
be applying this.

Thanks.

- Alex

