linux-crypto.vger.kernel.org archive mirror
* [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

v2 -> v3:
 - Rebase onto cryptodev.

v1 -> v2:
 - Updated patch 8 to avoid queueing the reorder work if the next object
   by sequence number isn't ready yet (Herbert)
 - Added Steffen's ack to all but patch 8 since that one changed.

RFC -> v1:
 - Included Tejun's acks.
 - Added testing section to cover letter.

Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job.  Though the serial
parts rely on per-CPU queues for correct ordering, the parallel parts
don't need them, and it improves performance on NUMA machines to run the
job locally and let the scheduler pick the CPU within a node on a busy
system.
  
This series makes parallel padata jobs run on unbound workqueues.
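
A minimal sketch of a padata client once the series is applied
(hypothetical names; instance allocation, padata_start(), and error
handling elided -- see Documentation/padata.txt for the real API):

    #include <linux/padata.h>

    static void my_parallel(struct padata_priv *padata)
    {
            /* Runs on an unbound worker with BHs off; must not sleep. */
            padata_do_serial(padata);
    }

    static void my_serial(struct padata_priv *padata)
    {
            /* Runs on the callback CPU, in submission order. */
    }

    static int my_submit(struct padata_instance *pinst,
                         struct padata_priv *padata, int *cb_cpu)
    {
            padata->parallel = my_parallel;
            padata->serial = my_serial;
            return padata_do_parallel(pinst, padata, cb_cpu);
    }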

Patch    Description
-----    -----------

    1    Make a padata instance allocate its workqueue internally.

    2    Unconfine some recently-confined workqueue interfaces.

  3-6    Address recursive CPU hotplug locking issue.

         padata_alloc* requires its callers to hold the CPU hotplug lock,
         but allocating an unbound workqueue and calling
         apply_workqueue_attrs also take it.  Fix by removing the
         requirement for callers of padata_alloc* (see the sketch after
         this list).

  7-8    Add a second workqueue for each padata instance that's dedicated to
         parallel jobs.

    9    Small cleanup.
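
The following sketch shows the recursion that patches 3-6 untangle
(simplified from the call chain in patch 6's changelog):

    get_online_cpus();                      /* caller of padata_alloc* */
    alloc_workqueue("foo", WQ_UNBOUND, 0);
        /* -> alloc_and_link_pwqs()
         *    -> apply_workqueue_attrs()
         *       -> get_online_cpus()          recursive acquisition
         */
    put_online_cpus();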

Performance
-----------

Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.

For repeatability, modprobe was bound to a CPU and the serial cpumasks
for both pencrypt and pdecrypt were also restricted to a CPU different
from modprobe's.

  # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
  # modprobe tcrypt mode=211 sec=1
  # modprobe tcrypt mode=215 sec=1
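
For reference, the binding looked roughly like this (the CPU numbers and
mask values here are illustrative, not a record of the exact runs):

  # taskset -c 0 modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
  # echo 2 > /sys/kernel/pcrypt/pencrypt/serial_cpumask
  # echo 2 > /sys/kernel/pcrypt/pdecrypt/serial_cpumask
  # taskset -c 0 modprobe tcrypt mode=211 sec=1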

Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)

                             base                test
                             ----------------    ---------------
speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

 117.2x       160      16        960       30     112555   24775
 135.1x       160      64        845      246     114145   25124
 113.2x       160     256        993       17     112395   24714
 111.3x       160     512       1000        0     111252   23755
 110.0x       160    1024        983       16     108153   22374
 104.2x       160    2048        985       22     102563   20530
  98.5x       160    4096        998        3      98346   18777
  86.2x       160    8192       1000        0      86173   14480

(pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)

 127.2x       160      16        997        5     126834   24244
 128.4x       160      64       1000        0     128438   23261
 127.6x       160     256        992        7     126627   23493
 124.0x       160     512       1000        0     123958   22746
 122.8x       160    1024        989       20     121372   22632
 112.8x       160    2048        998        3     112602   18287
 106.9x       160    4096        994       10     106255   16111
  91.7x       160    8192       1000        0      91742   11670

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

 242.2x       160      16       2363      141     572189   16846
 242.1x       160      64       2397      151     580424   11923
 231.1x       160     256       2472       21     571387   16364
 237.6x       160     512       2429       24     577264    8692
 238.3x       160    1024       2384       97     568155    6621
 216.3x       160    2048       2453       74     530627    3480
 209.2x       160    4096       2381      206     498192   19177
 176.5x       160    8192       2323      157     410013    9903

multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)

 220.3x       160      16       2341      228     515733   91317
 216.6x       160      64       2467       33     534381  101262
 217.7x       160     256       2451       45     533443   85418
 213.8x       160     512       2485       26     531293   83767
 211.0x       160    1024       2472       28     521677   80339
 200.8x       160    2048       2459       67     493808   63587
 188.8x       160    4096       2491        9     470325   58055
 159.9x       160    8192       2459       51     393147   25756

Idle system (tcrypt run by itself)

                             base                test
                             ----------------    ---------------
speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

   2.5x       160      16      63412    43075     161615    1034
   4.1x       160      64      39554    24006     161653     981
   6.0x       160     256      26504     1436     160110    1158
   6.2x       160     512      25500       40     157018     951
   5.9x       160    1024      25777     1094     151852     915
   5.8x       160    2048      24653      218     143756     508
   5.6x       160    4096      24333       20     136752     548
   5.0x       160    8192      23310       15     117660     481

(pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)

   2.4x       160      16      53471    48279     128047   31328
   3.4x       160      64      37712    20855     128187   31074
   4.5x       160     256      27911     4378     126430   31084
   4.9x       160     512      25346      175     123870   29099
   3.1x       160    1024      38452    23118     120817   26846
   4.7x       160    2048      24612      187     115036   23942
   4.5x       160    4096      24217      114     109583   21559
   4.2x       160    8192      23144      108      96850   16686

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

   1.0x       160      16     412157     3855     426973    1591
   1.0x       160      64     412600     4410     431920    4224
   1.1x       160     256     410352     3254     453691   17831
   1.2x       160     512     406293     4948     473491   39818
   1.2x       160    1024     395123     7804     478539   27660
   1.2x       160    2048     385144     7601     453720   17579
   1.2x       160    4096     371989     3631     449923   15331
   1.2x       160    8192     346723     1617     399824   18559

multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)

   1.1x       160      16     407317     1487     452619   14404
   1.1x       160      64     411821     4261     464059   23541
   1.2x       160     256     408941     4945     477483   36576
   1.2x       160     512     406451      611     472661   11038
   1.2x       160    1024     394813     2667     456357   11452
   1.2x       160    2048     390291     4175     448928    8957
   1.2x       160    4096     371904     1068     449344   14225
   1.2x       160    8192     344227     1973     404397   19540

Testing
-------

In addition to the bare metal performance runs above, this series was
tested in a kvm guest with the tcrypt module (mode=215).  All
combinations of CPUs among parallel_cpumask, serial_cpumask, and CPU
hotplug online/offline were run with 3 possible CPUs, and over 2000
random combinations of these were run with 8 possible CPUs.  Workqueue
events were used throughout to verify that all parallel and serial
workers executed on only the CPUs allowed by the cpumask sysfs files.
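
One way to make that kind of check, sketched with illustrative commands
rather than the exact ones used:

  # echo 1 > /sys/kernel/debug/tracing/events/workqueue/workqueue_execute_start/enable
  # modprobe tcrypt mode=215 sec=1
  # grep -E 'padata_(parallel|serial)_worker' /sys/kernel/debug/tracing/trace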

Finally, tcrypt mode=215 was run at each patch in the series when built
with and without CONFIG_PADATA/CONFIG_CRYPTO_PCRYPT.

v2:  https://lore.kernel.org/linux-crypto/20190829173038.21040-1-daniel.m.jordan@oracle.com/
v1:  https://lore.kernel.org/linux-crypto/20190813005224.30779-1-daniel.m.jordan@oracle.com/
RFC: https://lore.kernel.org/lkml/20190725212505.15055-1-daniel.m.jordan@oracle.com/

Daniel Jordan (9):
  padata: allocate workqueue internally
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  workqueue: require CPU hotplug read exclusion for
    apply_workqueue_attrs
  padata: make padata_do_parallel find alternate callback CPU
  pcrypt: remove padata cpumask notifier
  padata, pcrypt: take CPU hotplug lock internally in
    padata_alloc_possible
  padata: use separate workqueues for parallel and serial work
  padata: unbind parallel jobs from specific CPUs
  padata: remove cpu_index from the parallel_queue

 Documentation/padata.txt  |  12 +--
 crypto/pcrypt.c           | 167 ++++---------------------------
 include/linux/padata.h    |  16 +--
 include/linux/workqueue.h |   4 +
 kernel/padata.c           | 201 ++++++++++++++++++++++----------------
 kernel/workqueue.c        |  25 +++--
 6 files changed, 170 insertions(+), 255 deletions(-)

-- 
2.23.0



* [PATCH v3 1/9] padata: allocate workqueue internally
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan, Jonathan Corbet, linux-doc

Move workqueue allocation inside of padata to prepare for further
changes to how padata uses workqueues.

This guarantees the workqueue is created with max_active=1, which padata
relies on to work correctly.  No functional change.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 Documentation/padata.txt | 12 ++++++------
 crypto/pcrypt.c          | 13 ++-----------
 include/linux/padata.h   |  3 +--
 kernel/padata.c          | 24 +++++++++++++++---------
 4 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/Documentation/padata.txt b/Documentation/padata.txt
index b103d0c82000..b37ba1eaace3 100644
--- a/Documentation/padata.txt
+++ b/Documentation/padata.txt
@@ -16,10 +16,12 @@ overall control of how tasks are to be run::
 
     #include <linux/padata.h>
 
-    struct padata_instance *padata_alloc(struct workqueue_struct *wq,
+    struct padata_instance *padata_alloc(const char *name,
 					 const struct cpumask *pcpumask,
 					 const struct cpumask *cbcpumask);
 
+'name' simply identifies the instance.
+
 The pcpumask describes which processors will be used to execute work
 submitted to this instance in parallel. The cbcpumask defines which
 processors are allowed to be used as the serialization callback processor.
@@ -128,8 +130,7 @@ in that CPU mask or about a not running instance.
 
 Each task submitted to padata_do_parallel() will, in turn, be passed to
 exactly one call to the above-mentioned parallel() function, on one CPU, so
-true parallelism is achieved by submitting multiple tasks.  Despite the
-fact that the workqueue is used to make these calls, parallel() is run with
+true parallelism is achieved by submitting multiple tasks.  parallel() runs with
 software interrupts disabled and thus cannot sleep.  The parallel()
 function gets the padata_priv structure pointer as its lone parameter;
 information about the actual work to be done is probably obtained by using
@@ -148,7 +149,7 @@ fact with a call to::
 At some point in the future, padata_do_serial() will trigger a call to the
 serial() function in the padata_priv structure.  That call will happen on
 the CPU requested in the initial call to padata_do_parallel(); it, too, is
-done through the workqueue, but with local software interrupts disabled.
+run with local software interrupts disabled.
 Note that this call may be deferred for a while since the padata code takes
 pains to ensure that tasks are completed in the order in which they were
 submitted.
@@ -159,5 +160,4 @@ when a padata instance is no longer needed::
     void padata_free(struct padata_instance *pinst);
 
 This function will busy-wait while any remaining tasks are completed, so it
-might be best not to call it while there is work outstanding.  Shutting
-down the workqueue, if necessary, should be done separately.
+might be best not to call it while there is work outstanding.
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 0edf5b54fc77..d67293063c7f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -20,7 +20,6 @@
 
 struct padata_pcrypt {
 	struct padata_instance *pinst;
-	struct workqueue_struct *wq;
 
 	/*
 	 * Cpumask for callback CPUs. It should be
@@ -397,14 +396,9 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
 
 	get_online_cpus();
 
-	pcrypt->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
-				     1, name);
-	if (!pcrypt->wq)
-		goto err;
-
-	pcrypt->pinst = padata_alloc_possible(pcrypt->wq);
+	pcrypt->pinst = padata_alloc_possible(name);
 	if (!pcrypt->pinst)
-		goto err_destroy_workqueue;
+		goto err;
 
 	mask = kmalloc(sizeof(*mask), GFP_KERNEL);
 	if (!mask)
@@ -437,8 +431,6 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
 	kfree(mask);
 err_free_padata:
 	padata_free(pcrypt->pinst);
-err_destroy_workqueue:
-	destroy_workqueue(pcrypt->wq);
 err:
 	put_online_cpus();
 
@@ -452,7 +444,6 @@ static void pcrypt_fini_padata(struct padata_pcrypt *pcrypt)
 
 	padata_stop(pcrypt->pinst);
 	padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-	destroy_workqueue(pcrypt->wq);
 	padata_free(pcrypt->pinst);
 }
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 8da673861d99..839d9319920a 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -151,8 +151,7 @@ struct padata_instance {
 #define	PADATA_INVALID	4
 };
 
-extern struct padata_instance *padata_alloc_possible(
-					struct workqueue_struct *wq);
+extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern int padata_do_parallel(struct padata_instance *pinst,
 			      struct padata_priv *padata, int cb_cpu);
diff --git a/kernel/padata.c b/kernel/padata.c
index b60cc3dcee58..58728cd7f40c 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -805,6 +805,7 @@ static void __padata_free(struct padata_instance *pinst)
 	padata_free_pd(pinst->pd);
 	free_cpumask_var(pinst->cpumask.pcpu);
 	free_cpumask_var(pinst->cpumask.cbcpu);
+	destroy_workqueue(pinst->wq);
 	kfree(pinst);
 }
 
@@ -938,13 +939,13 @@ static struct kobj_type padata_attr_type = {
  * padata_alloc - allocate and initialize a padata instance and specify
  *                cpumasks for serial and parallel workers.
  *
- * @wq: workqueue to use for the allocated padata instance
+ * @name: used to identify the instance
  * @pcpumask: cpumask that will be used for padata parallelization
  * @cbcpumask: cpumask that will be used for padata serialization
  *
  * Must be called from a cpus_read_lock() protected region
  */
-static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
+static struct padata_instance *padata_alloc(const char *name,
 					    const struct cpumask *pcpumask,
 					    const struct cpumask *cbcpumask)
 {
@@ -955,11 +956,16 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
 	if (!pinst)
 		goto err;
 
-	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+	pinst->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
+				    1, name);
+	if (!pinst->wq)
 		goto err_free_inst;
+
+	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+		goto err_free_wq;
 	if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
 		free_cpumask_var(pinst->cpumask.pcpu);
-		goto err_free_inst;
+		goto err_free_wq;
 	}
 	if (!padata_validate_cpumask(pinst, pcpumask) ||
 	    !padata_validate_cpumask(pinst, cbcpumask))
@@ -971,8 +977,6 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
 
 	rcu_assign_pointer(pinst->pd, pd);
 
-	pinst->wq = wq;
-
 	cpumask_copy(pinst->cpumask.pcpu, pcpumask);
 	cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
 
@@ -990,6 +994,8 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
 err_free_masks:
 	free_cpumask_var(pinst->cpumask.pcpu);
 	free_cpumask_var(pinst->cpumask.cbcpu);
+err_free_wq:
+	destroy_workqueue(pinst->wq);
 err_free_inst:
 	kfree(pinst);
 err:
@@ -1001,14 +1007,14 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
  *                         Use the cpu_possible_mask for serial and
  *                         parallel workers.
  *
- * @wq: workqueue to use for the allocated padata instance
+ * @name: used to identify the instance
  *
  * Must be called from a cpus_read_lock() protected region
  */
-struct padata_instance *padata_alloc_possible(struct workqueue_struct *wq)
+struct padata_instance *padata_alloc_possible(const char *name)
 {
 	lockdep_assert_cpus_held();
-	return padata_alloc(wq, cpu_possible_mask, cpu_possible_mask);
+	return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
 }
 EXPORT_SYMBOL(padata_alloc_possible);
 
-- 
2.23.0



* [PATCH v3 2/9] workqueue: unconfine alloc/apply/free_workqueue_attrs()
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

padata will use these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h | 4 ++++
 kernel/workqueue.c        | 6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b7c585b5ec1c..4261d1c6e87b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -435,6 +435,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
+struct workqueue_attrs *alloc_workqueue_attrs(void);
+void free_workqueue_attrs(struct workqueue_attrs *attrs);
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+			  const struct workqueue_attrs *attrs);
 int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 601d61150b65..f53705ff3ff1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3329,7 +3329,7 @@ EXPORT_SYMBOL_GPL(execute_in_process_context);
  *
  * Undo alloc_workqueue_attrs().
  */
-static void free_workqueue_attrs(struct workqueue_attrs *attrs)
+void free_workqueue_attrs(struct workqueue_attrs *attrs)
 {
 	if (attrs) {
 		free_cpumask_var(attrs->cpumask);
@@ -3345,7 +3345,7 @@ static void free_workqueue_attrs(struct workqueue_attrs *attrs)
  *
  * Return: The allocated new workqueue_attr on success. %NULL on failure.
  */
-static struct workqueue_attrs *alloc_workqueue_attrs(void)
+struct workqueue_attrs *alloc_workqueue_attrs(void)
 {
 	struct workqueue_attrs *attrs;
 
@@ -4032,7 +4032,7 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
  *
  * Return: 0 on success and -errno on failure.
  */
-static int apply_workqueue_attrs(struct workqueue_struct *wq,
+int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs)
 {
 	int ret;
-- 
2.23.0



* [PATCH v3 3/9] workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.

This avoids lockdep complaints about nested calls to get_online_cpus in
a future patch, where padata calls apply_workqueue_attrs while changing
other CPU-hotplug-sensitive data structures with the CPU hotplug read
lock already held.
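
With this patch, callers follow the pattern (sketch):

    get_online_cpus();
    ret = apply_workqueue_attrs(wq, attrs);
    put_online_cpus();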

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f53705ff3ff1..bc2e09a8ea61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4030,6 +4030,8 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
  *
  * Performs GFP_KERNEL allocations.
  *
+ * Assumes caller has CPU hotplug read exclusion, i.e. get_online_cpus().
+ *
  * Return: 0 on success and -errno on failure.
  */
 int apply_workqueue_attrs(struct workqueue_struct *wq,
@@ -4037,9 +4039,11 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 {
 	int ret;
 
-	apply_wqattrs_lock();
+	lockdep_assert_cpus_held();
+
+	mutex_lock(&wq_pool_mutex);
 	ret = apply_workqueue_attrs_locked(wq, attrs);
-	apply_wqattrs_unlock();
+	mutex_unlock(&wq_pool_mutex);
 
 	return ret;
 }
@@ -4152,16 +4156,21 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			mutex_unlock(&wq->mutex);
 		}
 		return 0;
-	} else if (wq->flags & __WQ_ORDERED) {
+	}
+
+	get_online_cpus();
+	if (wq->flags & __WQ_ORDERED) {
 		ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
 		/* there should only be single pwq for ordering guarantee */
 		WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
 			      wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
 		     "ordering guarantee broken for workqueue %s\n", wq->name);
-		return ret;
 	} else {
-		return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
+		ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
 	}
+	put_online_cpus();
+
+	return ret;
 }
 
 static int wq_clamp_max_active(int max_active, unsigned int flags,
-- 
2.23.0



* [PATCH v3 4/9] padata: make padata_do_parallel find alternate callback CPU
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

padata_do_parallel currently returns -EINVAL if the callback CPU isn't
in the callback cpumask.

pcrypt tries to prevent this situation by keeping its own callback
cpumask in sync with padata's and checking that the callback CPU it
passes to padata is valid.  Make padata handle this instead.

padata_do_parallel now takes a pointer to the callback CPU and updates
it for the caller if an alternate CPU is used.  Overall behavior in
terms of which callback CPUs are chosen stays the same.
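
From a caller's point of view (sketch; pcrypt's conversion below does
exactly this):

    err = padata_do_parallel(pinst, padata, &ctx->cb_cpu);
    /* On success, ctx->cb_cpu holds the CPU that the serial callback
     * will run on, possibly a fallback that padata chose. */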

Prepares for removal of the padata cpumask notifier in pcrypt, which
will fix a lockdep complaint about nested acquisition of the CPU hotplug
lock later in the series.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 crypto/pcrypt.c        | 33 ++-------------------------------
 include/linux/padata.h |  2 +-
 kernel/padata.c        | 27 ++++++++++++++++++++-------
 3 files changed, 23 insertions(+), 39 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index d67293063c7f..efca962ab12a 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -57,35 +57,6 @@ struct pcrypt_aead_ctx {
 	unsigned int cb_cpu;
 };
 
-static int pcrypt_do_parallel(struct padata_priv *padata, unsigned int *cb_cpu,
-			      struct padata_pcrypt *pcrypt)
-{
-	unsigned int cpu_index, cpu, i;
-	struct pcrypt_cpumask *cpumask;
-
-	cpu = *cb_cpu;
-
-	rcu_read_lock_bh();
-	cpumask = rcu_dereference_bh(pcrypt->cb_cpumask);
-	if (cpumask_test_cpu(cpu, cpumask->mask))
-			goto out;
-
-	if (!cpumask_weight(cpumask->mask))
-			goto out;
-
-	cpu_index = cpu % cpumask_weight(cpumask->mask);
-
-	cpu = cpumask_first(cpumask->mask);
-	for (i = 0; i < cpu_index; i++)
-		cpu = cpumask_next(cpu, cpumask->mask);
-
-	*cb_cpu = cpu;
-
-out:
-	rcu_read_unlock_bh();
-	return padata_do_parallel(pcrypt->pinst, padata, cpu);
-}
-
 static int pcrypt_aead_setkey(struct crypto_aead *parent,
 			      const u8 *key, unsigned int keylen)
 {
@@ -157,7 +128,7 @@ static int pcrypt_aead_encrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = pcrypt_do_parallel(padata, &ctx->cb_cpu, &pencrypt);
+	err = padata_do_parallel(pencrypt.pinst, padata, &ctx->cb_cpu);
 	if (!err)
 		return -EINPROGRESS;
 
@@ -199,7 +170,7 @@ static int pcrypt_aead_decrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = pcrypt_do_parallel(padata, &ctx->cb_cpu, &pdecrypt);
+	err = padata_do_parallel(pdecrypt.pinst, padata, &ctx->cb_cpu);
 	if (!err)
 		return -EINPROGRESS;
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 839d9319920a..f7851f8e2190 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -154,7 +154,7 @@ struct padata_instance {
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern int padata_do_parallel(struct padata_instance *pinst,
-			      struct padata_priv *padata, int cb_cpu);
+			      struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
 			      cpumask_var_t cpumask);
diff --git a/kernel/padata.c b/kernel/padata.c
index 58728cd7f40c..9a17922ec436 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -94,17 +94,19 @@ static void padata_parallel_worker(struct work_struct *parallel_work)
  *
  * @pinst: padata instance
  * @padata: object to be parallelized
- * @cb_cpu: cpu the serialization callback function will run on,
- *          must be in the serial cpumask of padata(i.e. cpumask.cbcpu).
+ * @cb_cpu: pointer to the CPU that the serialization callback function should
+ *          run on.  If it's not in the serial cpumask of @pinst
+ *          (i.e. cpumask.cbcpu), this function selects a fallback CPU and if
+ *          none found, returns -EINVAL.
  *
  * The parallelization callback function will run with BHs off.
  * Note: Every object which is parallelized by padata_do_parallel
  * must be seen by padata_do_serial.
  */
 int padata_do_parallel(struct padata_instance *pinst,
-		       struct padata_priv *padata, int cb_cpu)
+		       struct padata_priv *padata, int *cb_cpu)
 {
-	int target_cpu, err;
+	int i, cpu, cpu_index, target_cpu, err;
 	struct padata_parallel_queue *queue;
 	struct parallel_data *pd;
 
@@ -116,8 +118,19 @@ int padata_do_parallel(struct padata_instance *pinst,
 	if (!(pinst->flags & PADATA_INIT) || pinst->flags & PADATA_INVALID)
 		goto out;
 
-	if (!cpumask_test_cpu(cb_cpu, pd->cpumask.cbcpu))
-		goto out;
+	if (!cpumask_test_cpu(*cb_cpu, pd->cpumask.cbcpu)) {
+		if (!cpumask_weight(pd->cpumask.cbcpu))
+			goto out;
+
+		/* Select an alternate fallback CPU and notify the caller. */
+		cpu_index = *cb_cpu % cpumask_weight(pd->cpumask.cbcpu);
+
+		cpu = cpumask_first(pd->cpumask.cbcpu);
+		for (i = 0; i < cpu_index; i++)
+			cpu = cpumask_next(cpu, pd->cpumask.cbcpu);
+
+		*cb_cpu = cpu;
+	}
 
 	err =  -EBUSY;
 	if ((pinst->flags & PADATA_RESET))
@@ -129,7 +142,7 @@ int padata_do_parallel(struct padata_instance *pinst,
 	err = 0;
 	atomic_inc(&pd->refcnt);
 	padata->pd = pd;
-	padata->cb_cpu = cb_cpu;
+	padata->cb_cpu = *cb_cpu;
 
 	target_cpu = padata_cpu_hash(pd);
 	padata->cpu = target_cpu;
-- 
2.23.0



* [PATCH v3 5/9] pcrypt: remove padata cpumask notifier
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

Now that padata_do_parallel takes care of finding an alternate callback
CPU, there's no need for pcrypt's callback cpumask, so remove it and the
notifier callback that keeps it in sync.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 crypto/pcrypt.c | 125 +++++++-----------------------------------------
 1 file changed, 18 insertions(+), 107 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index efca962ab12a..2ec36e6a132f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -18,33 +18,8 @@
 #include <linux/cpu.h>
 #include <crypto/pcrypt.h>
 
-struct padata_pcrypt {
-	struct padata_instance *pinst;
-
-	/*
-	 * Cpumask for callback CPUs. It should be
-	 * equal to serial cpumask of corresponding padata instance,
-	 * so it is updated when padata notifies us about serial
-	 * cpumask change.
-	 *
-	 * cb_cpumask is protected by RCU. This fact prevents us from
-	 * using cpumask_var_t directly because the actual type of
-	 * cpumsak_var_t depends on kernel configuration(particularly on
-	 * CONFIG_CPUMASK_OFFSTACK macro). Depending on the configuration
-	 * cpumask_var_t may be either a pointer to the struct cpumask
-	 * or a variable allocated on the stack. Thus we can not safely use
-	 * cpumask_var_t with RCU operations such as rcu_assign_pointer or
-	 * rcu_dereference. So cpumask_var_t is wrapped with struct
-	 * pcrypt_cpumask which makes possible to use it with RCU.
-	 */
-	struct pcrypt_cpumask {
-		cpumask_var_t mask;
-	} *cb_cpumask;
-	struct notifier_block nblock;
-};
-
-static struct padata_pcrypt pencrypt;
-static struct padata_pcrypt pdecrypt;
+static struct padata_instance *pencrypt;
+static struct padata_instance *pdecrypt;
 static struct kset           *pcrypt_kset;
 
 struct pcrypt_instance_ctx {
@@ -128,7 +103,7 @@ static int pcrypt_aead_encrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = padata_do_parallel(pencrypt.pinst, padata, &ctx->cb_cpu);
+	err = padata_do_parallel(pencrypt, padata, &ctx->cb_cpu);
 	if (!err)
 		return -EINPROGRESS;
 
@@ -170,7 +145,7 @@ static int pcrypt_aead_decrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = padata_do_parallel(pdecrypt.pinst, padata, &ctx->cb_cpu);
+	err = padata_do_parallel(pdecrypt, padata, &ctx->cb_cpu);
 	if (!err)
 		return -EINPROGRESS;
 
@@ -317,36 +292,6 @@ static int pcrypt_create(struct crypto_template *tmpl, struct rtattr **tb)
 	return -EINVAL;
 }
 
-static int pcrypt_cpumask_change_notify(struct notifier_block *self,
-					unsigned long val, void *data)
-{
-	struct padata_pcrypt *pcrypt;
-	struct pcrypt_cpumask *new_mask, *old_mask;
-	struct padata_cpumask *cpumask = (struct padata_cpumask *)data;
-
-	if (!(val & PADATA_CPU_SERIAL))
-		return 0;
-
-	pcrypt = container_of(self, struct padata_pcrypt, nblock);
-	new_mask = kmalloc(sizeof(*new_mask), GFP_KERNEL);
-	if (!new_mask)
-		return -ENOMEM;
-	if (!alloc_cpumask_var(&new_mask->mask, GFP_KERNEL)) {
-		kfree(new_mask);
-		return -ENOMEM;
-	}
-
-	old_mask = pcrypt->cb_cpumask;
-
-	cpumask_copy(new_mask->mask, cpumask->cbcpu);
-	rcu_assign_pointer(pcrypt->cb_cpumask, new_mask);
-	synchronize_rcu();
-
-	free_cpumask_var(old_mask->mask);
-	kfree(old_mask);
-	return 0;
-}
-
 static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
 {
 	int ret;
@@ -359,63 +304,29 @@ static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
 	return ret;
 }
 
-static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
-			      const char *name)
+static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
 {
 	int ret = -ENOMEM;
-	struct pcrypt_cpumask *mask;
 
 	get_online_cpus();
 
-	pcrypt->pinst = padata_alloc_possible(name);
-	if (!pcrypt->pinst)
-		goto err;
-
-	mask = kmalloc(sizeof(*mask), GFP_KERNEL);
-	if (!mask)
-		goto err_free_padata;
-	if (!alloc_cpumask_var(&mask->mask, GFP_KERNEL)) {
-		kfree(mask);
-		goto err_free_padata;
-	}
-
-	cpumask_and(mask->mask, cpu_possible_mask, cpu_online_mask);
-	rcu_assign_pointer(pcrypt->cb_cpumask, mask);
-
-	pcrypt->nblock.notifier_call = pcrypt_cpumask_change_notify;
-	ret = padata_register_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-	if (ret)
-		goto err_free_cpumask;
+	*pinst = padata_alloc_possible(name);
+	if (!*pinst)
+		return ret;
 
-	ret = pcrypt_sysfs_add(pcrypt->pinst, name);
+	ret = pcrypt_sysfs_add(*pinst, name);
 	if (ret)
-		goto err_unregister_notifier;
+		padata_free(*pinst);
 
 	put_online_cpus();
 
-	return ret;
-
-err_unregister_notifier:
-	padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-err_free_cpumask:
-	free_cpumask_var(mask->mask);
-	kfree(mask);
-err_free_padata:
-	padata_free(pcrypt->pinst);
-err:
-	put_online_cpus();
-
 	return ret;
 }
 
-static void pcrypt_fini_padata(struct padata_pcrypt *pcrypt)
+static void pcrypt_fini_padata(struct padata_instance *pinst)
 {
-	free_cpumask_var(pcrypt->cb_cpumask->mask);
-	kfree(pcrypt->cb_cpumask);
-
-	padata_stop(pcrypt->pinst);
-	padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-	padata_free(pcrypt->pinst);
+	padata_stop(pinst);
+	padata_free(pinst);
 }
 
 static struct crypto_template pcrypt_tmpl = {
@@ -440,13 +351,13 @@ static int __init pcrypt_init(void)
 	if (err)
 		goto err_deinit_pencrypt;
 
-	padata_start(pencrypt.pinst);
-	padata_start(pdecrypt.pinst);
+	padata_start(pencrypt);
+	padata_start(pdecrypt);
 
 	return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
-	pcrypt_fini_padata(&pencrypt);
+	pcrypt_fini_padata(pencrypt);
 err_unreg_kset:
 	kset_unregister(pcrypt_kset);
 err:
@@ -455,8 +366,8 @@ static int __init pcrypt_init(void)
 
 static void __exit pcrypt_exit(void)
 {
-	pcrypt_fini_padata(&pencrypt);
-	pcrypt_fini_padata(&pdecrypt);
+	pcrypt_fini_padata(pencrypt);
+	pcrypt_fini_padata(pdecrypt);
 
 	kset_unregister(pcrypt_kset);
 	crypto_unregister_template(&pcrypt_tmpl);
-- 
2.23.0



* [PATCH v3 6/9] padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

With pcrypt's cpumask no longer used, take the CPU hotplug lock inside
padata_alloc_possible.

Useful later in the series for avoiding nested acquisition of the CPU
hotplug lock in padata when padata_alloc_possible is allocating an
unbound workqueue.

Without this patch, this nested acquisition would happen later in the
series:

      pcrypt_init_padata
        get_online_cpus
        alloc_padata_possible
          alloc_padata
            alloc_workqueue(WQ_UNBOUND)   // later in the series
              alloc_and_link_pwqs
                apply_wqattrs_lock
                  get_online_cpus         // recursive rwsem acquisition

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 crypto/pcrypt.c |  4 ----
 kernel/padata.c | 17 +++++++++--------
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 2ec36e6a132f..543792e0ebf0 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -308,8 +308,6 @@ static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
 {
 	int ret = -ENOMEM;
 
-	get_online_cpus();
-
 	*pinst = padata_alloc_possible(name);
 	if (!*pinst)
 		return ret;
@@ -318,8 +316,6 @@ static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
 	if (ret)
 		padata_free(*pinst);
 
-	put_online_cpus();
-
 	return ret;
 }
 
diff --git a/kernel/padata.c b/kernel/padata.c
index 9a17922ec436..8a362923c488 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -955,8 +955,6 @@ static struct kobj_type padata_attr_type = {
  * @name: used to identify the instance
  * @pcpumask: cpumask that will be used for padata parallelization
  * @cbcpumask: cpumask that will be used for padata serialization
- *
- * Must be called from a cpus_read_lock() protected region
  */
 static struct padata_instance *padata_alloc(const char *name,
 					    const struct cpumask *pcpumask,
@@ -974,11 +972,13 @@ static struct padata_instance *padata_alloc(const char *name,
 	if (!pinst->wq)
 		goto err_free_inst;
 
+	get_online_cpus();
+
 	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
-		goto err_free_wq;
+		goto err_put_cpus;
 	if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
 		free_cpumask_var(pinst->cpumask.pcpu);
-		goto err_free_wq;
+		goto err_put_cpus;
 	}
 	if (!padata_validate_cpumask(pinst, pcpumask) ||
 	    !padata_validate_cpumask(pinst, cbcpumask))
@@ -1002,12 +1002,16 @@ static struct padata_instance *padata_alloc(const char *name,
 #ifdef CONFIG_HOTPLUG_CPU
 	cpuhp_state_add_instance_nocalls_cpuslocked(hp_online, &pinst->node);
 #endif
+
+	put_online_cpus();
+
 	return pinst;
 
 err_free_masks:
 	free_cpumask_var(pinst->cpumask.pcpu);
 	free_cpumask_var(pinst->cpumask.cbcpu);
-err_free_wq:
+err_put_cpus:
+	put_online_cpus();
 	destroy_workqueue(pinst->wq);
 err_free_inst:
 	kfree(pinst);
@@ -1021,12 +1025,9 @@ static struct padata_instance *padata_alloc(const char *name,
  *                         parallel workers.
  *
  * @name: used to identify the instance
- *
- * Must be called from a cpus_read_lock() protected region
  */
 struct padata_instance *padata_alloc_possible(const char *name)
 {
-	lockdep_assert_cpus_held();
 	return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
 }
 EXPORT_SYMBOL(padata_alloc_possible);
-- 
2.23.0



* [PATCH v3 7/9] padata: use separate workqueues for parallel and serial work
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

padata currently uses one per-CPU workqueue per instance for all work.

Prepare for running parallel jobs on an unbound workqueue by introducing
dedicated workqueues for parallel and serial work.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/padata.h |  6 ++++--
 kernel/padata.c        | 28 ++++++++++++++++++----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index f7851f8e2190..e7978f8942ca 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -127,7 +127,8 @@ struct parallel_data {
  * struct padata_instance - The overall control structure.
  *
  * @cpu_notifier: cpu hotplug notifier.
- * @wq: The workqueue in use.
+ * @parallel_wq: The workqueue used for parallel work.
+ * @serial_wq: The workqueue used for serial work.
  * @pd: The internal control structure.
  * @cpumask: User supplied cpumasks for parallel and serial works.
  * @cpumask_change_notifier: Notifiers chain for user-defined notify
@@ -139,7 +140,8 @@ struct parallel_data {
  */
 struct padata_instance {
 	struct hlist_node		 node;
-	struct workqueue_struct		*wq;
+	struct workqueue_struct		*parallel_wq;
+	struct workqueue_struct		*serial_wq;
 	struct parallel_data		*pd;
 	struct padata_cpumask		cpumask;
 	struct blocking_notifier_head	 cpumask_change_notifier;
diff --git a/kernel/padata.c b/kernel/padata.c
index 8a362923c488..669f5d53d357 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -152,7 +152,7 @@ int padata_do_parallel(struct padata_instance *pinst,
 	list_add_tail(&padata->list, &queue->parallel.list);
 	spin_unlock(&queue->parallel.lock);
 
-	queue_work_on(target_cpu, pinst->wq, &queue->work);
+	queue_work_on(target_cpu, pinst->parallel_wq, &queue->work);
 
 out:
 	rcu_read_unlock_bh();
@@ -261,7 +261,7 @@ static void padata_reorder(struct parallel_data *pd)
 		list_add_tail(&padata->list, &squeue->serial.list);
 		spin_unlock(&squeue->serial.lock);
 
-		queue_work_on(cb_cpu, pinst->wq, &squeue->work);
+		queue_work_on(cb_cpu, pinst->serial_wq, &squeue->work);
 	}
 
 	spin_unlock_bh(&pd->lock);
@@ -278,7 +278,7 @@ static void padata_reorder(struct parallel_data *pd)
 
 	next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
 	if (!list_empty(&next_queue->reorder.list))
-		queue_work(pinst->wq, &pd->reorder_work);
+		queue_work(pinst->serial_wq, &pd->reorder_work);
 }
 
 static void invoke_padata_reorder(struct work_struct *work)
@@ -818,7 +818,8 @@ static void __padata_free(struct padata_instance *pinst)
 	padata_free_pd(pinst->pd);
 	free_cpumask_var(pinst->cpumask.pcpu);
 	free_cpumask_var(pinst->cpumask.cbcpu);
-	destroy_workqueue(pinst->wq);
+	destroy_workqueue(pinst->serial_wq);
+	destroy_workqueue(pinst->parallel_wq);
 	kfree(pinst);
 }
 
@@ -967,18 +968,23 @@ static struct padata_instance *padata_alloc(const char *name,
 	if (!pinst)
 		goto err;
 
-	pinst->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
-				    1, name);
-	if (!pinst->wq)
+	pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_MEM_RECLAIM |
+					     WQ_CPU_INTENSIVE, 1, name);
+	if (!pinst->parallel_wq)
 		goto err_free_inst;
 
 	get_online_cpus();
 
-	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+	pinst->serial_wq = alloc_workqueue("%s_serial", WQ_MEM_RECLAIM |
+					   WQ_CPU_INTENSIVE, 1, name);
+	if (!pinst->serial_wq)
 		goto err_put_cpus;
+
+	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+		goto err_free_serial_wq;
 	if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
 		free_cpumask_var(pinst->cpumask.pcpu);
-		goto err_put_cpus;
+		goto err_free_serial_wq;
 	}
 	if (!padata_validate_cpumask(pinst, pcpumask) ||
 	    !padata_validate_cpumask(pinst, cbcpumask))
@@ -1010,9 +1016,11 @@ static struct padata_instance *padata_alloc(const char *name,
 err_free_masks:
 	free_cpumask_var(pinst->cpumask.pcpu);
 	free_cpumask_var(pinst->cpumask.cbcpu);
+err_free_serial_wq:
+	destroy_workqueue(pinst->serial_wq);
 err_put_cpus:
 	put_online_cpus();
-	destroy_workqueue(pinst->wq);
+	destroy_workqueue(pinst->parallel_wq);
 err_free_inst:
 	kfree(pinst);
 err:
-- 
2.23.0



* [PATCH v3 8/9] padata: unbind parallel jobs from specific CPUs
From: Daniel Jordan @ 2019-09-06  1:40 UTC
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job.  Though the serial
parts rely on per-CPU queues for correct ordering, the parallel parts
don't need them, and it improves performance on NUMA machines to run the
job locally and let the scheduler pick the CPU within a node on a busy
system.

So, make the parallel workqueue unbound.

Update the parallel workqueue's cpumask when the instance's parallel
cpumask changes.

Now that parallel jobs no longer run on max_active=1 workqueues, two or
more parallel works that hash to the same CPU may run simultaneously,
finish out of order, and so be serialized out of order.  Prevent this by
keeping the works sorted on the reorder list by sequence number and
checking that in the reordering logic.

padata_get_next becomes padata_find_next so it can be reused for the end
of padata_reorder, where it's used to avoid uselessly queueing work when
the next job by sequence number isn't finished yet but a later job that
hashed to the same CPU has.

The ENODATA case in padata_find_next no longer makes sense because
parallel jobs aren't bound to specific CPUs.  The EINPROGRESS case takes
care of the scenario where a parallel job is potentially running on the
same CPU as padata_find_next, and with only one error code left, just
use NULL instead.
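
An illustrative example of the reordering change: with four parallel
CPUs, jobs with seq_nr 2 and 6 both hash to the same CPU.  If job 6
finishes first, padata_do_serial sorts it onto that CPU's reorder list,
and padata_find_next sees that its seq_nr doesn't match pd->processed
and returns NULL rather than serializing it early.  Once job 2 arrives,
it sorts ahead of job 6 and both are serialized in order.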

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/padata.h |   3 ++
 kernel/padata.c        | 118 +++++++++++++++++++++++------------------
 2 files changed, 68 insertions(+), 53 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index e7978f8942ca..43d3fd9d17fc 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -35,6 +35,7 @@ struct padata_priv {
 	struct parallel_data	*pd;
 	int			cb_cpu;
 	int			cpu;
+	unsigned int		seq_nr;
 	int			info;
 	void                    (*parallel)(struct padata_priv *padata);
 	void                    (*serial)(struct padata_priv *padata);
@@ -105,6 +106,7 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @processed: Number of already processed objects.
  * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
  * @reorder_work: work struct for reordering.
@@ -117,6 +119,7 @@ struct parallel_data {
 	atomic_t			reorder_objects;
 	atomic_t			refcnt;
 	atomic_t			seq_nr;
+	unsigned int			processed;
 	int				cpu;
 	struct padata_cpumask		cpumask;
 	struct work_struct		reorder_work;
diff --git a/kernel/padata.c b/kernel/padata.c
index 669f5d53d357..832224dcf2e1 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -46,18 +46,13 @@ static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 	return target_cpu;
 }
 
-static int padata_cpu_hash(struct parallel_data *pd)
+static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
 {
-	unsigned int seq_nr;
-	int cpu_index;
-
 	/*
 	 * Hash the sequence numbers to the cpus by taking
 	 * seq_nr mod. number of cpus in use.
 	 */
-
-	seq_nr = atomic_inc_return(&pd->seq_nr);
-	cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
+	int cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
 
 	return padata_index_to_cpu(pd, cpu_index);
 }
@@ -144,7 +139,8 @@ int padata_do_parallel(struct padata_instance *pinst,
 	padata->pd = pd;
 	padata->cb_cpu = *cb_cpu;
 
-	target_cpu = padata_cpu_hash(pd);
+	padata->seq_nr = atomic_inc_return(&pd->seq_nr);
+	target_cpu = padata_cpu_hash(pd, padata->seq_nr);
 	padata->cpu = target_cpu;
 	queue = per_cpu_ptr(pd->pqueue, target_cpu);
 
@@ -152,7 +148,7 @@ int padata_do_parallel(struct padata_instance *pinst,
 	list_add_tail(&padata->list, &queue->parallel.list);
 	spin_unlock(&queue->parallel.lock);
 
-	queue_work_on(target_cpu, pinst->parallel_wq, &queue->work);
+	queue_work(pinst->parallel_wq, &queue->work);
 
 out:
 	rcu_read_unlock_bh();
@@ -162,21 +158,19 @@ int padata_do_parallel(struct padata_instance *pinst,
 EXPORT_SYMBOL(padata_do_parallel);
 
 /*
- * padata_get_next - Get the next object that needs serialization.
+ * padata_find_next - Find the next object that needs serialization.
  *
  * Return values are:
  *
  * A pointer to the control struct of the next object that needs
  * serialization, if present in one of the percpu reorder queues.
  *
- * -EINPROGRESS, if the next object that needs serialization will
+ * NULL, if the next object that needs serialization will
  *  be parallel processed by another cpu and is not yet present in
  *  the cpu's reorder queue.
- *
- * -ENODATA, if this cpu has to do the parallel processing for
- *  the next object.
  */
-static struct padata_priv *padata_get_next(struct parallel_data *pd)
+static struct padata_priv *padata_find_next(struct parallel_data *pd,
+					    bool remove_object)
 {
 	struct padata_parallel_queue *next_queue;
 	struct padata_priv *padata;
@@ -187,28 +181,30 @@ static struct padata_priv *padata_get_next(struct parallel_data *pd)
 	reorder = &next_queue->reorder;
 
 	spin_lock(&reorder->lock);
-	if (!list_empty(&reorder->list)) {
-		padata = list_entry(reorder->list.next,
-				    struct padata_priv, list);
-
-		list_del_init(&padata->list);
-		atomic_dec(&pd->reorder_objects);
+	if (list_empty(&reorder->list)) {
+		spin_unlock(&reorder->lock);
+		return NULL;
+	}
 
-		pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
-					    false);
+	padata = list_entry(reorder->list.next, struct padata_priv, list);
 
+	/*
+	 * Checks the rare case where two or more parallel jobs have hashed to
+	 * the same CPU and one of the later ones finishes first.
+	 */
+	if (padata->seq_nr != pd->processed) {
 		spin_unlock(&reorder->lock);
-		goto out;
+		return NULL;
 	}
-	spin_unlock(&reorder->lock);
 
-	if (__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index) {
-		padata = ERR_PTR(-ENODATA);
-		goto out;
+	if (remove_object) {
+		list_del_init(&padata->list);
+		atomic_dec(&pd->reorder_objects);
+		++pd->processed;
+		pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false);
 	}
 
-	padata = ERR_PTR(-EINPROGRESS);
-out:
+	spin_unlock(&reorder->lock);
 	return padata;
 }
 
@@ -234,26 +230,16 @@ static void padata_reorder(struct parallel_data *pd)
 		return;
 
 	while (1) {
-		padata = padata_get_next(pd);
+		padata = padata_find_next(pd, true);
 
 		/*
 		 * If the next object that needs serialization is parallel
 		 * processed by another cpu and is still on it's way to the
 		 * cpu's reorder queue, nothing to do for now.
 		 */
-		if (PTR_ERR(padata) == -EINPROGRESS)
+		if (!padata)
 			break;
 
-		/*
-		 * This cpu has to do the parallel processing of the next
-		 * object. It's waiting in the cpu's parallelization queue,
-		 * so exit immediately.
-		 */
-		if (PTR_ERR(padata) == -ENODATA) {
-			spin_unlock_bh(&pd->lock);
-			return;
-		}
-
 		cb_cpu = padata->cb_cpu;
 		squeue = per_cpu_ptr(pd->squeue, cb_cpu);
 
@@ -277,7 +263,8 @@ static void padata_reorder(struct parallel_data *pd)
 	smp_mb();
 
 	next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
-	if (!list_empty(&next_queue->reorder.list))
+	if (!list_empty(&next_queue->reorder.list) &&
+	    padata_find_next(pd, false))
 		queue_work(pinst->serial_wq, &pd->reorder_work);
 }
 
@@ -332,9 +319,14 @@ void padata_do_serial(struct padata_priv *padata)
 	struct parallel_data *pd = padata->pd;
 	struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
 							   padata->cpu);
+	struct padata_priv *cur;
 
 	spin_lock(&pqueue->reorder.lock);
-	list_add_tail(&padata->list, &pqueue->reorder.list);
+	/* Sort in ascending order of sequence number. */
+	list_for_each_entry_reverse(cur, &pqueue->reorder.list, list)
+		if (cur->seq_nr < padata->seq_nr)
+			break;
+	list_add(&padata->list, &cur->list);
 	atomic_inc(&pd->reorder_objects);
 	spin_unlock(&pqueue->reorder.lock);
 
@@ -353,17 +345,36 @@ static int padata_setup_cpumasks(struct parallel_data *pd,
 				 const struct cpumask *pcpumask,
 				 const struct cpumask *cbcpumask)
 {
-	if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
-		return -ENOMEM;
+	struct workqueue_attrs *attrs;
+	int err = -ENOMEM;
 
+	if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
+		goto out;
 	cpumask_and(pd->cpumask.pcpu, pcpumask, cpu_online_mask);
-	if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL)) {
-		free_cpumask_var(pd->cpumask.pcpu);
-		return -ENOMEM;
-	}
 
+	if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
+		goto free_pcpu_mask;
 	cpumask_and(pd->cpumask.cbcpu, cbcpumask, cpu_online_mask);
+
+	attrs = alloc_workqueue_attrs();
+	if (!attrs)
+		goto free_cbcpu_mask;
+
+	/* Restrict parallel_wq workers to pd->cpumask.pcpu. */
+	cpumask_copy(attrs->cpumask, pd->cpumask.pcpu);
+	err = apply_workqueue_attrs(pd->pinst->parallel_wq, attrs);
+	free_workqueue_attrs(attrs);
+	if (err < 0)
+		goto free_cbcpu_mask;
+
 	return 0;
+
+free_cbcpu_mask:
+	free_cpumask_var(pd->cpumask.cbcpu);
+free_pcpu_mask:
+	free_cpumask_var(pd->cpumask.pcpu);
+out:
+	return err;
 }
 
 static void __padata_list_init(struct padata_list *pd_list)
@@ -429,6 +440,8 @@ static struct parallel_data *padata_alloc_pd(struct padata_instance *pinst,
 	pd->squeue = alloc_percpu(struct padata_serial_queue);
 	if (!pd->squeue)
 		goto err_free_pqueue;
+
+	pd->pinst = pinst;
 	if (padata_setup_cpumasks(pd, pcpumask, cbcpumask) < 0)
 		goto err_free_squeue;
 
@@ -437,7 +450,6 @@ static struct parallel_data *padata_alloc_pd(struct padata_instance *pinst,
 	atomic_set(&pd->seq_nr, -1);
 	atomic_set(&pd->reorder_objects, 0);
 	atomic_set(&pd->refcnt, 0);
-	pd->pinst = pinst;
 	spin_lock_init(&pd->lock);
 	pd->cpu = cpumask_first(pd->cpumask.pcpu);
 	INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
@@ -968,8 +980,8 @@ static struct padata_instance *padata_alloc(const char *name,
 	if (!pinst)
 		goto err;
 
-	pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_MEM_RECLAIM |
-					     WQ_CPU_INTENSIVE, 1, name);
+	pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_UNBOUND, 0,
+					     name);
 	if (!pinst->parallel_wq)
 		goto err_free_inst;
 
-- 
2.23.0
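
A minimal userspace sketch of the reordering scheme the padata_do_serial()
and padata_find_next() hunks above implement, for readers following along.
A hand-rolled circular list stands in for the kernel's list_head, and all
names here (struct obj, reorder, processed, insert_sorted, find_next) are
illustrative, not the kernel's: parallel jobs finish in any order, the
serial path inserts each finished object in ascending sequence-number
order, and the consumer takes the head only when it carries the next
expected sequence number.

  #include <stdio.h>
  #include <stdlib.h>

  struct obj {
  	unsigned int seq_nr;
  	struct obj *prev, *next;
  };

  static struct obj reorder = { 0, &reorder, &reorder };	/* dummy head */
  static unsigned int processed;	/* seq_nr the serial side expects next */

  /* Walk backwards and insert after the last entry with a smaller seq_nr,
   * mirroring the list_for_each_entry_reverse() loop in padata_do_serial.
   * Jobs usually finish roughly in order, so the scan is normally short. */
  static void insert_sorted(struct obj *new)
  {
  	struct obj *cur;

  	for (cur = reorder.prev; cur != &reorder; cur = cur->prev)
  		if (cur->seq_nr < new->seq_nr)
  			break;
  	new->prev = cur;
  	new->next = cur->next;
  	cur->next->prev = new;
  	cur->next = new;
  }

  /* Return the head only if it is the next object in sequence, like
   * padata_find_next(pd, true); otherwise leave the list untouched. */
  static struct obj *find_next(void)
  {
  	struct obj *head = reorder.next;

  	if (head == &reorder || head->seq_nr != processed)
  		return NULL;
  	head->prev->next = head->next;
  	head->next->prev = head->prev;
  	processed++;
  	return head;
  }

  int main(void)
  {
  	unsigned int done_order[] = { 3, 0, 2, 1 };	/* completion order */
  	struct obj *o;

  	for (int i = 0; i < 4; i++) {
  		o = malloc(sizeof(*o));
  		o->seq_nr = done_order[i];
  		insert_sorted(o);
  	}
  	while ((o = find_next())) {
  		printf("serialize seq %u\n", o->seq_nr);	/* 0, 1, 2, 3 */
  		free(o);
  	}
  	return 0;
  }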


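The padata_setup_cpumasks() and alloc_workqueue() hunks above combine the
workqueue interfaces this series unconfines: alloc_workqueue_attrs() and
apply_workqueue_attrs().  Below is a minimal sketch of the same pattern
for built-in code, assuming the post-series API (no GFP argument to
alloc_workqueue_attrs(), and CPU hotplug read exclusion held around
apply_workqueue_attrs() per patch 3); the names demo_wq, demo_setup, and
the workqueue name are illustrative, not part of padata.

  #include <linux/workqueue.h>
  #include <linux/cpumask.h>
  #include <linux/cpu.h>

  static struct workqueue_struct *demo_wq;

  static int demo_setup(const struct cpumask *mask)
  {
  	struct workqueue_attrs *attrs;
  	int err;

  	/* Unbound: workers may run on any CPU the wq's attrs allow. */
  	demo_wq = alloc_workqueue("demo_parallel", WQ_UNBOUND, 0);
  	if (!demo_wq)
  		return -ENOMEM;

  	attrs = alloc_workqueue_attrs();
  	if (!attrs) {
  		err = -ENOMEM;
  		goto free_wq;
  	}

  	/* Pin the unbound workers to 'mask'. */
  	cpumask_copy(attrs->cpumask, mask);

  	/* apply_workqueue_attrs() needs CPU hotplug read exclusion
  	 * after patch 3; take it here for the sketch. */
  	cpus_read_lock();
  	err = apply_workqueue_attrs(demo_wq, attrs);
  	cpus_read_unlock();

  	free_workqueue_attrs(attrs);
  	if (err < 0)
  		goto free_wq;
  	return 0;

  free_wq:
  	destroy_workqueue(demo_wq);
  	return err;
  }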

* [PATCH v3 9/9] padata: remove cpu_index from the parallel_queue
  2019-09-06  1:40 [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs Daniel Jordan
                   ` (7 preceding siblings ...)
  2019-09-06  1:40 ` [PATCH v3 8/9] padata: unbind parallel jobs from specific CPUs Daniel Jordan
@ 2019-09-06  1:40 ` Daniel Jordan
  2019-09-13 11:29 ` [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs Herbert Xu
  9 siblings, 0 replies; 11+ messages in thread
From: Daniel Jordan @ 2019-09-06  1:40 UTC (permalink / raw)
  To: Herbert Xu, Steffen Klassert
  Cc: Lai Jiangshan, Peter Zijlstra, Tejun Heo, linux-crypto,
	linux-kernel, Daniel Jordan

With the removal of the ENODATA case from padata_get_next, the cpu_index
field is no longer useful, so it can go away.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/padata.h |  2 --
 kernel/padata.c        | 13 ++-----------
 2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 43d3fd9d17fc..23717eeaad23 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -75,14 +75,12 @@ struct padata_serial_queue {
  * @swork: work struct for serialization.
  * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
- * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
        struct padata_list    parallel;
        struct padata_list    reorder;
        struct work_struct    work;
        atomic_t              num_obj;
-       int                   cpu_index;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 832224dcf2e1..c3fec1413295 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -400,21 +400,12 @@ static void padata_init_squeues(struct parallel_data *pd)
 /* Initialize all percpu queues used by parallel workers */
 static void padata_init_pqueues(struct parallel_data *pd)
 {
-	int cpu_index, cpu;
+	int cpu;
 	struct padata_parallel_queue *pqueue;
 
-	cpu_index = 0;
-	for_each_possible_cpu(cpu) {
+	for_each_cpu(cpu, pd->cpumask.pcpu) {
 		pqueue = per_cpu_ptr(pd->pqueue, cpu);
 
-		if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
-			pqueue->cpu_index = -1;
-			continue;
-		}
-
-		pqueue->cpu_index = cpu_index;
-		cpu_index++;
-
 		__padata_list_init(&pqueue->reorder);
 		__padata_list_init(&pqueue->parallel);
 		INIT_WORK(&pqueue->work, padata_parallel_worker);
-- 
2.23.0


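The loop change above is the heart of the patch: iterate only the CPUs in
pd->cpumask.pcpu rather than visiting every possible CPU and skipping the
rest, which is what the cpu_index bookkeeping existed to support.  A
userspace sketch of the same idea, with a plain bitmask standing in for
cpumask_t and a macro playing the role of for_each_cpu(); NR_CPUS, the
macro, and the mask value are all illustrative.

  #include <stdio.h>

  #define NR_CPUS 8

  /* Analogue of for_each_cpu(cpu, mask): visit only the set bits. */
  #define for_each_set_cpu(cpu, mask) \
  	for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++) \
  		if ((mask) & (1u << (cpu)))

  int main(void)
  {
  	unsigned int pcpu_mask = 0x2a;	/* CPUs 1, 3, 5 */
  	int cpu;

  	for_each_set_cpu(cpu, pcpu_mask)
  		printf("init parallel queue on cpu %d\n", cpu);
  	return 0;
  }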

* Re: [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs
  2019-09-06  1:40 [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs Daniel Jordan
                   ` (8 preceding siblings ...)
  2019-09-06  1:40 ` [PATCH v3 9/9] padata: remove cpu_index from the parallel_queue Daniel Jordan
@ 2019-09-13 11:29 ` Herbert Xu
  9 siblings, 0 replies; 11+ messages in thread
From: Herbert Xu @ 2019-09-13 11:29 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Steffen Klassert, Lai Jiangshan, Peter Zijlstra, Tejun Heo,
	linux-crypto, linux-kernel

On Thu, Sep 05, 2019 at 09:40:20PM -0400, Daniel Jordan wrote:
> v2 -> v3:
>  - Rebase onto cryptodev.
> 
> v1 -> v2:
>  - Updated patch 8 to avoid queueing the reorder work if the next object
>    by sequence number isn't ready yet (Herbert)
>  - Added Steffen's ack to all but patch 8 since that one changed.
> 
> RFC -> v1:
>  - Included Tejun's acks.
>  - Added testing section to cover letter.
> 
> Padata binds the parallel part of a job to a single CPU and round-robins
> over all CPUs in the system for each successive job.  Though the serial
> parts rely on per-CPU queues for correct ordering, they're not necessary
> for parallel work, and it improves performance to run the job locally on
> NUMA machines and let the scheduler pick the CPU within a node on a busy
> system.
>   
> This series makes parallel padata jobs run on unbound workqueues.
> 
> Patch    Description
> -----    -----------
> 
>     1    Make a padata instance allocate its workqueue internally.
> 
>     2    Unconfine some recently-confined workqueue interfaces.
> 
>   3-6    Address recursive CPU hotplug locking issue.
> 
>          padata_alloc* requires its callers to hold this lock, but allocating
>          an unbound workqueue and calling apply_workqueue_attrs also take it.
>          Fix by removing the requirement for callers of padata_alloc*.
> 
>   7-8    Add a second workqueue for each padata instance that's dedicated to
>          parallel jobs.
> 
>     9    Small cleanup.
> 
> Performance
> -----------
> 
> Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.
> 
> For repeatability, modprobe was bound to a CPU and the serial cpumasks
> for both pencrypt and pdecrypt were also restricted to a CPU different
> from modprobe's.
> 
>   # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
>   # modprobe tcrypt mode=211 sec=1
>   # modprobe tcrypt mode=215 sec=1
> 
> Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)
> 
>                              base                test
>                              ----------------    ---------------
> speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev
> 
> (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
> 
>  117.2x       160      16        960       30     112555   24775
>  135.1x       160      64        845      246     114145   25124
>  113.2x       160     256        993       17     112395   24714
>  111.3x       160     512       1000        0     111252   23755
>  110.0x       160    1024        983       16     108153   22374
>  104.2x       160    2048        985       22     102563   20530
>   98.5x       160    4096        998        3      98346   18777
>   86.2x       160    8192       1000        0      86173   14480
> 
> (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)
> 
>  127.2x       160      16        997        5     126834   24244
>  128.4x       160      64       1000        0     128438   23261
>  127.6x       160     256        992        7     126627   23493
>  124.0x       160     512       1000        0     123958   22746
>  122.8x       160    1024        989       20     121372   22632
>  112.8x       160    2048        998        3     112602   18287
>  106.9x       160    4096        994       10     106255   16111
>   91.7x       160    8192       1000        0      91742   11670
> 
> multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
> 
>  242.2x       160      16       2363      141     572189   16846
>  242.1x       160      64       2397      151     580424   11923
>  231.1x       160     256       2472       21     571387   16364
>  237.6x       160     512       2429       24     577264    8692
>  238.3x       160    1024       2384       97     568155    6621
>  216.3x       160    2048       2453       74     530627    3480
>  209.2x       160    4096       2381      206     498192   19177
>  176.5x       160    8192       2323      157     410013    9903
> 
> multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)
> 
>  220.3x       160      16       2341      228     515733   91317
>  216.6x       160      64       2467       33     534381  101262
>  217.7x       160     256       2451       45     533443   85418
>  213.8x       160     512       2485       26     531293   83767
>  211.0x       160    1024       2472       28     521677   80339
>  200.8x       160    2048       2459       67     493808   63587
>  188.8x       160    4096       2491        9     470325   58055
>  159.9x       160    8192       2459       51     393147   25756
> 
> Idle system (tcrypt run by itself)
> 
>                              base                test
>                              ----------------    ---------------
> speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev
> 
> (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
> 
>    2.5x       160      16      63412    43075     161615    1034
>    4.1x       160      64      39554    24006     161653     981
>    6.0x       160     256      26504     1436     160110    1158
>    6.2x       160     512      25500       40     157018     951
>    5.9x       160    1024      25777     1094     151852     915
>    5.8x       160    2048      24653      218     143756     508
>    5.6x       160    4096      24333       20     136752     548
>    5.0x       160    8192      23310       15     117660     481
> 
> (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)
> 
>    2.4x       160      16      53471    48279     128047   31328
>    3.4x       160      64      37712    20855     128187   31074
>    4.5x       160     256      27911     4378     126430   31084
>    4.9x       160     512      25346      175     123870   29099
>    3.1x       160    1024      38452    23118     120817   26846
>    4.7x       160    2048      24612      187     115036   23942
>    4.5x       160    4096      24217      114     109583   21559
>    4.2x       160    8192      23144      108      96850   16686
> 
> multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
> 
>    1.0x       160      16     412157     3855     426973    1591
>    1.0x       160      64     412600     4410     431920    4224
>    1.1x       160     256     410352     3254     453691   17831
>    1.2x       160     512     406293     4948     473491   39818
>    1.2x       160    1024     395123     7804     478539   27660
>    1.2x       160    2048     385144     7601     453720   17579
>    1.2x       160    4096     371989     3631     449923   15331
>    1.2x       160    8192     346723     1617     399824   18559
> 
> multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)
> 
>    1.1x       160      16     407317     1487     452619   14404
>    1.1x       160      64     411821     4261     464059   23541
>    1.2x       160     256     408941     4945     477483   36576
>    1.2x       160     512     406451      611     472661   11038
>    1.2x       160    1024     394813     2667     456357   11452
>    1.2x       160    2048     390291     4175     448928    8957
>    1.2x       160    4096     371904     1068     449344   14225
>    1.2x       160    8192     344227     1973     404397   19540
> 
> Testing
> -------
> 
> In addition to the bare metal performance runs above, this series was
> tested in a kvm guest with the tcrypt module (mode=215).  All
> combinations of CPUs among parallel_cpumask, serial_cpumask, and CPU
> hotplug online/offline were run with 3 possible CPUs, and over 2000
> random combinations of these were run with 8 possible CPUs.  Workqueue
> events were used throughout to verify that all parallel and serial
> workers executed on only the CPUs allowed by the cpumask sysfs files.
> 
> Finally, tcrypt mode=215 was run at each patch in the series when built
> with and without CONFIG_PADATA/CONFIG_CRYPTO_PCRYPT.
> 
> v2:  https://lore.kernel.org/linux-crypto/20190829173038.21040-1-daniel.m.jordan@oracle.com/
> v1:  https://lore.kernel.org/linux-crypto/20190813005224.30779-1-daniel.m.jordan@oracle.com/
> RFC: https://lore.kernel.org/lkml/20190725212505.15055-1-daniel.m.jordan@oracle.com/
> 
> Daniel Jordan (9):
>   padata: allocate workqueue internally
>   workqueue: unconfine alloc/apply/free_workqueue_attrs()
>   workqueue: require CPU hotplug read exclusion for
>     apply_workqueue_attrs
>   padata: make padata_do_parallel find alternate callback CPU
>   pcrypt: remove padata cpumask notifier
>   padata, pcrypt: take CPU hotplug lock internally in
>     padata_alloc_possible
>   padata: use separate workqueues for parallel and serial work
>   padata: unbind parallel jobs from specific CPUs
>   padata: remove cpu_index from the parallel_queue
> 
>  Documentation/padata.txt  |  12 +--
>  crypto/pcrypt.c           | 167 ++++---------------------------
>  include/linux/padata.h    |  16 +--
>  include/linux/workqueue.h |   4 +
>  kernel/padata.c           | 201 ++++++++++++++++++++++----------------
>  kernel/workqueue.c        |  25 +++--
>  6 files changed, 170 insertions(+), 255 deletions(-)

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

