* [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
@ 2023-05-19  0:16 Tejun Heo
  2023-05-19  0:16 ` [PATCH 01/24] workqueue: Drop the special locking rule for worker->flags and worker_pool->flags Tejun Heo
                   ` (27 more replies)
  0 siblings, 28 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

Hello,

tl;dr
=====

Unbound workqueues used to spray work items inside each NUMA node, which
isn't great on CPUs w/ multiple L3 caches. This patchset implements
mechanisms to improve and configure execution locality.

Long patchset, so a bit of traffic control:

PeterZ
------

The patchset uses p->wake_cpu to best-effort migrate workers on wake-ups.
Migrating on wake-ups fits the bill: it's cheaper and it matches the
semantics in that workqueue is hinting to the scheduler which CPU has the
most continuity rather than forcing a specific migration. However,
p->wake_cpu is only used in the scheduler proper, so this would be
introducing a new way of using the field.

Please see 0021 and 0024.
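
The wake-up hinting described in this section can be modeled roughly as
follows. This is a minimal userspace sketch in Python, not the kernel code:
Worker, hint_wake_cpu() and the "lowest CPU in the scope" policy are
hypothetical names and choices made only for illustration. The actual logic
is in 0021 and 0024.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class Worker:
    # Stand-in for the task's p->wake_cpu: the CPU the scheduler would
    # consider first on the next wake-up.
    wake_cpu: int


def hint_wake_cpu(worker: Worker, scope_cpus: FrozenSet[int]) -> int:
    """Nudge the worker back toward its scope before waking it up.

    This only updates a hint. Nothing here pins the worker, so a real
    scheduler would remain free to place it elsewhere.
    """
    if worker.wake_cpu not in scope_cpus:
        # Hypothetical policy: pick the lowest-numbered CPU in the scope.
        worker.wake_cpu = min(scope_cpus)
    return worker.wake_cpu


if __name__ == "__main__":
    scope = frozenset({0, 1, 2})
    print(hint_wake_cpu(Worker(wake_cpu=7), scope))  # outside the scope
    print(hint_wake_cpu(Worker(wake_cpu=1), scope))  # already inside
```

The point of the sketch is that the hint is rewritten only when the worker
last ran outside its scope; a worker that is already inside keeps its CPU.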

Linus, PeterZ
-------------

Most of the patches are workqueue-internal plumbing and probably aren't
terribly interesting. However, the performance picture turned out less
straightforward than I had hoped, most likely due to loss of
work-conservation from the scheduler in high fan-out scenarios. I'll describe
it in this cover letter. Please read on.

In terms of patches, 0021-0024 are probably the interesting ones.

Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
-------------------------------------------------------------------------

Can you please test this patchset and see whether the performance problems
are resolved? After the patchset, unbound workqueues default to
soft-affining on cache boundaries, which should hopefully resolve the issues
that you guys have been seeing on recent kernels on heterogeneous CPUs.

If you want to try different settings, please read:

 https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Alasdair Kergon, Mike Snitzer, DM folks
---------------------------------------

I ran fio on top of dm-crypt to compare performance of different
configurations. It mostly behaved as expected but please let me know if
anything doesn't look right. Also, DM_CRYPT_SAME_CPU can now be implemented
by applying strict "cpu" scope on the unbound workqueue and it would make
sense to add WQ_SYSFS to the kcryptd workqueue so that users can tune the
settings on the fly.

Lai
---

I'd really appreciate it if you could go over the series. 0003, 0007, 0009,
0018-0021, 0023-0024 are the interesting ones. 0003, 0019, 0020 and 0024 are
the potentially risky ones. I've been testing them pretty extensively for
over a week but the series can really use your eyes.


Introduction
============

Recently, IIRC, with ChromeOS moving to a newer kernel, there has been a
flurry of reports on unbound workqueues showing high execution latency and
low bandwidth. AFAICT, all were on heterogeneous CPUs with multiple L3
caches. Please see the discussions in the following threads.

 https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
 https://lkml.kernel.org/r/ZFvpJb9Dh0FCkLQA@google.com

While details from the second thread indicate that the problem is unlikely
to originate solely from workqueue, workqueue does have an underlying
problem that can be amplified through behavior changes in other subsystems.

Being an async execution mechanism, workqueue inherently disconnects issuer
and executor. Worker pools are shared across workqueues whenever possible
further severing the connection. While this disconnection is true for all
workqueues, for per-cpu workqueues, this doesn't harm locality because they
still all end up on the same CPU. For unbound workqueues tho, we end up
spraying work items across CPUs which blows away all locality.

This was already a noticeable issue with NUMA machines and 4c16bd327c74
("workqueue: implement NUMA affinity for unbound workqueues") implemented
NUMA support which segments unbound workqueues on NUMA node boundaries so
that each subset that's contained within a NUMA node is served by a separate
worker pool. Work items still get sprayed but they get sprayed only inside
their NUMA node.

This has been mostly fine but CPUs became a lot more complex with many more
cores and multiple L3 caches inside a single node with differing distances
across them, and it looks like it's high time to improve workqueue's
locality awareness.


Design
======

There are two basic ideas.

1. Expand the segmenting so that unbound workqueues can be segmented on
   different boundaries including L3 caches.

2. The knowledge about CPU topology is already in the scheduler. Instead of
   trying to steer things on its own, workqueue can inform the scheduler of
   the to-be-executed work item's locality and get out of its way. (Linus's
   suggestion)

#2 is nice in that it takes workqueue out of a business that it's not good
at. Even with #2, #1 is likely still useful as persistent locality of
workers affects cache hit ratio and memory distance on stack and shared data
structures.

Given the above, it could be that workqueue can do something like the
following and keep everyone more or less happy:

* Segment unbound workqueues according to L3 cache boundaries
  (cpus_share_cache()).

* Don't hard-affine workers but make a best-effort attempt to locate them on
  the same CPUs as the work items they are about to execute.

This way, the hot workers would stay mostly affine to the L3 cache domain
they're attached to and the scheduler would have the full knowledge of the
preferred locality when starting execution of a work item. As the workers
aren't strictly confined, the scheduler is free to move them outside that L3
cache domain or even across NUMA nodes according to the hardware
configuration and load situation. In short, we get both locality and
work-conservation.

Unfortunately, this didn't work out due to the pronounced inverse
correlation between locality and work-conservation, which will be
illustrated in the following "Results" section. As this meant that there is
not going to be one good enough behavior for everyone, I ended up
implementing a number of different behaviors that can be selected
dynamically per-workqueue.

There are three parameters:

* Affinity scope: Determines how unbound workqueues are segmented. Five
  possible values.

  CPU: Every CPU in its own scope. The unbound workqueue behaves like a
       per-cpu one.

  SMT: SMT siblings in the same scope.

  CACHE: cpus_share_cache() defines the scopes, usually on L3 boundaries.

  NUMA: NUMA nodes define the scopes. This is the same behavior as before
        this patchset.

  SYSTEM: A single scope across the whole system.

* Affinity strictness: If a scope is strict, every worker is confined inside
  its scope and the scheduler can't move it outside. If non-strict, a worker
  is brought back into its scope (called repatriation) when it starts to
  execute a work item but then can be moved outside the scope if the
  scheduler deems that the better course of action.

* Localization: If a workqueue enables localization, a worker is moved to
  the CPU that issued the work item at the start of execution. Afterwards,
  the scheduler can migrate the task according to the above affinity scope
  and strictness settings. Interestingly, this turned out to be not very
  useful in practice. It's implemented in the last two patches but they
  aren't for upstream inclusion, at least for now. See the "Results"
  section.

Combined, they allow a wide variety of configurations. It's easy to emulate
WQ_CPU_INTENSIVE per-cpu workqueues or the behavior before this patchset.
The settings can be changed anytime through both apply_workqueue_attrs() and
the sysfs interface allowing easy and flexible tuning.
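
To make the five affinity scopes listed above concrete, here is a minimal
Python sketch that partitions CPUs into scopes. It is an illustration only:
the TOPOLOGY table, its CPU numbering (SMT siblings at cpu and cpu + 12,
three cores per L3) and build_pods() are assumptions made for this example,
not the data structures used by the patchset.

```python
from collections import defaultdict

# Hypothetical topology loosely modeled on a 12-core / 24-thread part with
# four L3 caches and a single NUMA node. The numbering is an assumption.
NR_CORES = 12
TOPOLOGY = [
    {"cpu": cpu, "core": cpu % NR_CORES, "l3": (cpu % NR_CORES) // 3, "node": 0}
    for cpu in range(2 * NR_CORES)
]

# Each scope type is defined by the topology attribute that CPUs must share.
SCOPE_KEYS = {
    "cpu": lambda t: t["cpu"],
    "smt": lambda t: t["core"],
    "cache": lambda t: t["l3"],
    "numa": lambda t: t["node"],
    "system": lambda t: 0,
}


def build_pods(topology, scope):
    """Group CPU ids into lists, one list per scope instance."""
    key = SCOPE_KEYS[scope]
    pods = defaultdict(list)
    for entry in topology:
        pods[key(entry)].append(entry["cpu"])
    return [sorted(cpus) for _, cpus in sorted(pods.items())]


if __name__ == "__main__":
    for scope in SCOPE_KEYS:
        print(scope, len(build_pods(TOPOLOGY, scope)))
```

On this made-up single-node topology, "numa" and "system" produce the same
single group, which is the same situation the SYSTEM configuration in the
experiments below is described as being in.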

Let's see how different configurations perform.


Experiment Setup
================

Here's the relevant part of the experiment setup.

* Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
  Core-to-core latencies across L3 caches are ~2.6x worse than within each
  L3 cache. ie. it's worse but not hugely so. This means that the impact of
  L3 cache locality is noticeable in these experiments but may be subdued
  compared to other setups.

* The CPU's clock boost is disabled, so that all the clocks are running very
  close to the stock 3.8GHz for test consistency.

* dm-crypt setup with default parameters on Samsung 980 PRO SSD.

* Each combination is run five times.

Three fio scenarios are run directly against the dm-crypt device.

  HIGH: Enough issuers and work spread across the machine

     fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
     --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=24 --time_based \
     --group_reporting --name=iops-test-job --verify=sha512

  MED: Fewer issuers, enough work for saturation

     fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
     --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 --time_based \
     --group_reporting --name=iops-test-job --verify=sha512

  LOW: Even fewer issuers, not enough work to saturate

     fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
     --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 --time_based \
     --group_reporting --name=iops-test-job --verify=sha512

  They are the same except for the number of issuers. The issuers generate
  data every time, write, read back and then verify the content. Each
  issuer can have 64 IOs in flight.

Five workqueue configurations are used:

  SYSTEM        A single scope across the system. All workers belong to the
                same pool and work items are sprayed all over. On this CPU,
		this is the same behavior as before this patchset.

  CACHE         Four scopes, each serving an L3 cache domain. The scopes are
                non-strict, so workers start inside their scopes but the
                scheduler can move them outside as needed.

  CACHE+STRICT  The same as CACHE but the scopes are strict. Workers are
                confined within their scopes.

  SYSTEM+LOCAL  A single scope across the system but with localization
                turned on. A worker is moved to the issuing CPU of the work
                item it's about to execute.

  CACHE+LOCAL   Four scopes, each serving an L3 cache domain, with
                localization turned on. A worker is moved to the issuing CPU
                of the work item it's about to execute and is a lot more
                likely to be already inside the matching L3 cache domain.


Results
=======

1. HIGH: Enough issuers and work spread across the machine

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM              1159.40 ±1.34     99.31 ±0.02                 0.00
  CACHE               1166.40 ±0.89     99.34 ±0.01                +0.60
  CACHE+STRICT        1166.00 ±0.71     99.35 ±0.01                +0.57
  SYSTEM+LOCAL        1157.40 ±1.67     99.50 ±0.01                -0.17
  CACHE+LOCAL         1160.40 ±0.55     99.50 ±0.02                +0.09

The differences are small. Here are some observations.

* CACHE and CACHE+STRICT are clearly performing better. The difference is
  not big but consistent, which isn't too surprising given that these caches
  are pretty close. The CACHE ones are doing more work for the same number
  of CPU cycles compared to SYSTEM.

  This scenario is the best case for CACHE+STRICT, so it's encouraging that
  the non-strict variant can match.

* SYSTEM/CACHE+LOCAL aren't doing horribly but aren't doing better either.
  They burned slightly more CPU, possibly from the scheduler needing to move
  the workers away more frequently.
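
One crude way to quantify "more work for the same number of CPU cycles" is
to divide bandwidth by CPU utilization. The Python sketch below does that
with the means from the HIGH table above. bw_per_util() is a hypothetical
helper, and the ratio is only a rough proxy that ignores the reported
variance.

```python
# (bandwidth in MiBps, CPU util in %) means copied from the HIGH table.
HIGH = {
    "SYSTEM": (1159.40, 99.31),
    "CACHE": (1166.40, 99.34),
    "CACHE+STRICT": (1166.00, 99.35),
    "SYSTEM+LOCAL": (1157.40, 99.50),
    "CACHE+LOCAL": (1160.40, 99.50),
}


def bw_per_util(results):
    """Return MiBps delivered per percentage point of CPU utilization."""
    return {name: bw / util for name, (bw, util) in results.items()}


if __name__ == "__main__":
    for name, eff in bw_per_util(HIGH).items():
        print(f"{name:14s} {eff:.3f} MiBps per util %")
```

By this measure both CACHE variants come out ahead of SYSTEM and both +LOCAL
variants come out behind it, which matches the observations above.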


2. MED: Fewer issuers, enough work for saturation

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM             1155.40  ±0.89     97.41 ±0.05                 0.00
  CACHE              1154.40  ±1.14     96.15 ±0.09                -0.09
  CACHE+STRICT       1112.00  ±4.64     93.26 ±0.35                -3.76
  SYSTEM+LOCAL       1066.80  ±2.17     86.70 ±0.10                -7.67
  CACHE+LOCAL        1034.60  ±1.52     83.00 ±0.07               -10.46

There are still eight issuers and plenty of work to go around. However, now,
depending on the configuration, we're starting to see a significant loss of
work-conservation where CPUs sit idle while there's work to do.

* CACHE is doing okay. It's just a bit slower. Further testing may be needed
  to definitively confirm the bandwidth gap but the CPU util difference
  seems real, so while minute, it did lose a bit of work-conservation.
  Comparing it to CACHE+STRICT, it's starting to show the benefits of
  non-strict scopes.

* CACHE+STRICT's lower bw and utilization make sense. It'd be easy to
  mis-locate issuers so that some cache domains are temporarily underloaded.
  With scheduler not being allowed to move tasks around, there will be some
  utilization bubbles.

* SYSTEM/CACHE+LOCAL are not doing great with significantly lower
  utilizations. There is more than enough work to do in the system
  (changing --numjobs to 6, for example, doesn't change the picture much for
  SYSTEM) but the kernel is seemingly unable to dispatch workers to idle
  CPUs fast enough.


3. LOW: Even fewer issuers, not enough work to saturate

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM              993.60  ±1.82     75.49 ±0.06                 0.00
  CACHE               973.40  ±1.52     74.90 ±0.07                -2.03
  CACHE+STRICT        828.20  ±4.49     66.84 ±0.29               -16.65
  SYSTEM+LOCAL        884.20 ±18.39     64.14 ±1.12               -11.01
  CACHE+LOCAL         860.60  ±4.72     61.46 ±0.24               -13.39

Now, there isn't enough work to be done to saturate the whole system. This
amplifies the pattern which started to show up in the MED scenario.

* CACHE is now noticeably slower but not crazily so.

* CACHE+STRICT shows the severe work-conservation limitation of strict
  scopes.

* SYSTEM/CACHE+LOCAL show the same pattern as in the MED scenario but with
  larger work-conservation deficits.
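
The "vs. SYSTEM BW (%)" columns are the relative difference of each mean
bandwidth against the SYSTEM mean in the same scenario. The Python sketch
below recomputes them from the three tables above; the dictionary layout and
the rel_bw() helper are just illustrative names.

```python
# Mean bandwidth in MiBps, copied from the HIGH, MED and LOW tables.
BANDWIDTH = {
    "HIGH": {"SYSTEM": 1159.40, "CACHE": 1166.40, "CACHE+STRICT": 1166.00,
             "SYSTEM+LOCAL": 1157.40, "CACHE+LOCAL": 1160.40},
    "MED": {"SYSTEM": 1155.40, "CACHE": 1154.40, "CACHE+STRICT": 1112.00,
            "SYSTEM+LOCAL": 1066.80, "CACHE+LOCAL": 1034.60},
    "LOW": {"SYSTEM": 993.60, "CACHE": 973.40, "CACHE+STRICT": 828.20,
            "SYSTEM+LOCAL": 884.20, "CACHE+LOCAL": 860.60},
}


def rel_bw(scenario, config):
    """Bandwidth of config relative to SYSTEM in the same scenario, in %."""
    base = BANDWIDTH[scenario]["SYSTEM"]
    return (BANDWIDTH[scenario][config] - base) / base * 100.0


if __name__ == "__main__":
    for scenario in BANDWIDTH:
        for config in BANDWIDTH[scenario]:
            print(f"{scenario:4s} {config:14s} {rel_bw(scenario, config):+7.2f}")
```

Reading one configuration across the scenarios makes the amplification easy
to see. For example, CACHE+STRICT goes from slightly ahead of SYSTEM in HIGH
to well behind it in LOW.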


Conclusions
===========

The tradeoff between efficiency gains from improved locality and bandwidth
loss from work-conservation deficit poses a conundrum. The scenarios used
for testing, while limited, are plausible representations of what heavy
workqueue usage might look like in practice, and they clearly show that
there is no one straightforward option which can cover all cases.

* The efficiency advantage of CACHE over SYSTEM is, while consistent and
  noticeable, small. However, the impact is dependent on the distances
  between the scopes and may be more pronounced in processors with more
  complex topologies.

  While the loss of work-conservation in certain scenarios hurts, it is a
  lot better than CACHE+STRICT and maximizing workqueue utilization is
  unlikely to be the common case anyway. As such, CACHE is the default
  affinity scope for unbound pools.

  There's some risk to this decision and we might have to reassess depending
  on how this works out in the wild.

* In all three scenarios, +LOCAL didn't show any advantage. There may be
  scenarios where the boost from L1/2 cache locality is visible but the
  severe work-conservation deficits really hurt.

  I haven't looked into it too deeply here but the behavior where CPUs sit
  idle while there's work to do is in line with what we (Meta) have
  been seeing with CFS across multiple workloads and I believe Google has
  similar experiences. Tuning load balancing and migration costs through
  debugfs helps but not fully. Deeper discussion on this topic is a bit out
  of scope here and there are on-going efforts on this front, so we can
  revisit later.

  So, at least in the current state, +LOCAL doesn't make sense. Let's table
  it for now. I put the two patches to implement it at the end of the series
  w/o SOB.


PATCHES
=======

This series contains the following patches.

 0001-workqueue-Drop-the-special-locking-rule-for-worker-f.patch
 0002-workqueue-Cleanups-around-process_scheduled_works.patch
 0003-workqueue-Not-all-work-insertion-needs-to-wake-up-a-.patch
 0004-workqueue-Rename-wq-cpu_pwqs-to-wq-cpu_pwq.patch
 0005-workqueue-Relocate-worker-and-work-management-functi.patch
 0006-workqueue-Remove-module-param-disable_numa-and-sysfs.patch
 0007-workqueue-Use-a-kthread_worker-to-release-pool_workq.patch
 0008-workqueue-Make-per-cpu-pool_workqueues-allocated-and.patch
 0009-workqueue-Make-unbound-workqueues-to-use-per-cpu-poo.patch
 0010-workqueue-Rename-workqueue_attrs-no_numa-to-ordered.patch
 0011-workqueue-Rename-NUMA-related-names-to-use-pod-inste.patch
 0012-workqueue-Move-wq_pod_init-below-workqueue_init.patch
 0013-workqueue-Initialize-unbound-CPU-pods-later-in-the-b.patch
 0014-workqueue-Generalize-unbound-CPU-pods.patch
 0015-workqueue-Add-tools-workqueue-wq_dump.py-which-print.patch
 0016-workqueue-Modularize-wq_pod_type-initialization.patch
 0017-workqueue-Add-multiple-affinity-scopes-and-interface.patch
 0018-workqueue-Factor-out-work-to-worker-assignment-and-c.patch
 0019-workqueue-Factor-out-need_more_worker-check-and-work.patch
 0020-workqueue-Add-workqueue_attrs-__pod_cpumask.patch
 0021-workqueue-Implement-non-strict-affinity-scope-for-un.patch
 0022-workqueue-Add-Affinity-Scopes-and-Performance-sectio.patch
 0023-workqueue-Add-pool_workqueue-cpu.patch
 0024-workqueue-Implement-localize-to-issuing-CPU-for-unbo.patch

This series is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v1

diffstat follows. Thanks.

 Documentation/admin-guide/kernel-parameters.txt |   21
 Documentation/core-api/workqueue.rst            |  379 ++++++++++-
 include/linux/workqueue.h                       |   82 ++
 init/main.c                                     |    1
 kernel/workqueue.c                              | 1613 +++++++++++++++++++++++++++-----------------------
 kernel/workqueue_internal.h                     |    2
 tools/workqueue/wq_dump.py                      |  180 +++++
 tools/workqueue/wq_monitor.py                   |   29
 8 files changed, 1525 insertions(+), 782 deletions(-)



* [PATCH 01/24] workqueue: Drop the special locking rule for worker->flags and worker_pool->flags
From: Tejun Heo @ 2023-05-19  0:16 UTC
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.

Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().

This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assigning work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.

The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c          | 17 +++--------------
 kernel/workqueue_internal.h |  2 +-
 2 files changed, 4 insertions(+), 15 deletions(-)
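
The change in sanity checks described in the message above (from "the caller
must be the worker itself" to "the caller must hold the pool lock") can be
sketched in userspace Python as follows. This is an assumption-laden model,
not the kernel code: NOT_RUNNING is an illustrative bit rather than the
kernel's WORKER_NOT_RUNNING mask, every new Worker is simply counted as
running, and Lock.locked() is weaker than lockdep_assert_held() because it
doesn't check who holds the lock.

```python
import threading

NOT_RUNNING = 0x1  # illustrative bit, not the kernel's flag mask


class Pool:
    def __init__(self):
        self.lock = threading.Lock()  # stand-in for pool->lock
        self.nr_running = 0


class Worker:
    def __init__(self, pool):
        self.pool = pool
        self.flags = 0
        pool.nr_running += 1  # simplification: new workers count as running


def worker_set_flags(worker, flags):
    pool = worker.pool
    # The caller may be any thread, as long as the pool lock is taken.
    assert pool.lock.locked()
    # Transitioning into NOT_RUNNING: adjust nr_running.
    if (flags & NOT_RUNNING) and not (worker.flags & NOT_RUNNING):
        pool.nr_running -= 1
    worker.flags |= flags


def worker_clr_flags(worker, flags):
    pool = worker.pool
    assert pool.lock.locked()
    oflags = worker.flags
    worker.flags &= ~flags
    # Transitioning out of NOT_RUNNING: adjust nr_running.
    if (flags & NOT_RUNNING) and (oflags & NOT_RUNNING):
        if not (worker.flags & NOT_RUNNING):
            pool.nr_running += 1
```

With the ownership check gone, a thread other than the worker can flip the
flags as long as it holds the lock, which is what the later patches that
assign work items before wake-up rely on.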

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ee16ddb0647c..9a97db94e1dc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -121,11 +121,6 @@ enum {
  *
  * L: pool->lock protected.  Access with pool->lock held.
  *
- * X: During normal operation, modification requires pool->lock and should
- *    be done only from local cpu.  Either disabling preemption on local
- *    cpu or grabbing pool->lock is enough for read access.  If
- *    POOL_DISASSOCIATED is set, it's identical to L.
- *
  * K: Only modified by worker while holding pool->lock. Can be safely read by
  *    self, while holding pool->lock or from IRQ context if %current is the
  *    kworker.
@@ -159,7 +154,7 @@ struct worker_pool {
 	int			cpu;		/* I: the associated cpu */
 	int			node;		/* I: the associated node ID */
 	int			id;		/* I: pool ID */
-	unsigned int		flags;		/* X: flags */
+	unsigned int		flags;		/* L: flags */
 
 	unsigned long		watchdog_ts;	/* L: watchdog timestamp */
 	bool			cpu_stall;	/* WD: stalled cpu bound pool */
@@ -901,15 +896,12 @@ static void wake_up_worker(struct worker_pool *pool)
  * @flags: flags to set
  *
  * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
  */
 static inline void worker_set_flags(struct worker *worker, unsigned int flags)
 {
 	struct worker_pool *pool = worker->pool;
 
-	WARN_ON_ONCE(worker->task != current);
+	lockdep_assert_held(&pool->lock);
 
 	/* If transitioning into NOT_RUNNING, adjust nr_running. */
 	if ((flags & WORKER_NOT_RUNNING) &&
@@ -926,16 +918,13 @@ static inline void worker_set_flags(struct worker *worker, unsigned int flags)
  * @flags: flags to clear
  *
  * Clear @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
  */
 static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
 {
 	struct worker_pool *pool = worker->pool;
 	unsigned int oflags = worker->flags;
 
-	WARN_ON_ONCE(worker->task != current);
+	lockdep_assert_held(&pool->lock);
 
 	worker->flags &= ~flags;
 
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 6b1d66e28269..f6275944ada7 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -48,7 +48,7 @@ struct worker {
 						/* A: runs through worker->node */
 
 	unsigned long		last_active;	/* K: last active timestamp */
-	unsigned int		flags;		/* X: flags */
+	unsigned int		flags;		/* L: flags */
 	int			id;		/* I: worker id */
 
 	/*
-- 
2.40.1



* [PATCH 02/24] workqueue: Cleanups around process_scheduled_works()
From: Tejun Heo @ 2023-05-19  0:16 UTC
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

* Drop the trivial optimization in worker_thread() where it bypasses calling
  process_scheduled_works() if the first work item isn't linked. This is a
  mostly pointless micro optimization and gets in the way of improving the
  work processing path.

* Consolidate pool->watchdog_ts updates in the two callers into
  process_scheduled_works().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)
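
The consolidated loop described in the message above can be modeled in
userspace Python roughly as follows. This is a sketch under assumptions:
Pool, Worker and the injectable clock parameter are hypothetical stand-ins,
and calling work() directly takes the place of process_one_work().

```python
import time
from collections import deque


class Pool:
    def __init__(self):
        self.watchdog_ts = None  # stand-in for pool->watchdog_ts


class Worker:
    def __init__(self, pool):
        self.pool = pool
        self.scheduled = deque()  # stand-in for worker->scheduled


def process_scheduled_works(worker, clock=time.monotonic):
    """Run everything on worker.scheduled, touching the watchdog once."""
    first = True
    while worker.scheduled:
        work = worker.scheduled.popleft()
        if first:
            # Only the first item of a batch refreshes the timestamp.
            worker.pool.watchdog_ts = clock()
            first = False
        work()  # stand-in for process_one_work(); may queue more work


if __name__ == "__main__":
    pool = Pool()
    worker = Worker(pool)
    worker.scheduled.extend([lambda: print("a"), lambda: print("b")])
    process_scheduled_works(worker)
```

Because the timestamp update lives inside the loop, callers no longer need
to track a "first" flag of their own, and an empty list leaves the timestamp
untouched.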

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9a97db94e1dc..c1e56ba4a038 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2634,9 +2634,15 @@ __acquires(&pool->lock)
  */
 static void process_scheduled_works(struct worker *worker)
 {
-	while (!list_empty(&worker->scheduled)) {
-		struct work_struct *work = list_first_entry(&worker->scheduled,
-						struct work_struct, entry);
+	struct work_struct *work;
+	bool first = true;
+
+	while ((work = list_first_entry_or_null(&worker->scheduled,
+						struct work_struct, entry))) {
+		if (first) {
+			worker->pool->watchdog_ts = jiffies;
+			first = false;
+		}
 		process_one_work(worker, work);
 	}
 }
@@ -2717,17 +2723,8 @@ static int worker_thread(void *__worker)
 			list_first_entry(&pool->worklist,
 					 struct work_struct, entry);
 
-		pool->watchdog_ts = jiffies;
-
-		if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
-			/* optimization path, not strictly necessary */
-			process_one_work(worker, work);
-			if (unlikely(!list_empty(&worker->scheduled)))
-				process_scheduled_works(worker);
-		} else {
-			move_linked_works(work, &worker->scheduled, NULL);
-			process_scheduled_works(worker);
-		}
+		move_linked_works(work, &worker->scheduled, NULL);
+		process_scheduled_works(worker);
 	} while (keep_working(pool));
 
 	worker_set_flags(worker, WORKER_PREP);
@@ -2802,7 +2799,6 @@ static int rescuer_thread(void *__rescuer)
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
 		struct work_struct *work, *n;
-		bool first = true;
 
 		__set_current_state(TASK_RUNNING);
 		list_del_init(&pwq->mayday_node);
@@ -2820,12 +2816,9 @@ static int rescuer_thread(void *__rescuer)
 		WARN_ON_ONCE(!list_empty(scheduled));
 		list_for_each_entry_safe(work, n, &pool->worklist, entry) {
 			if (get_work_pwq(work) == pwq) {
-				if (first)
-					pool->watchdog_ts = jiffies;
 				move_linked_works(work, scheduled, &n);
 				pwq->stats[PWQ_STAT_RESCUED]++;
 			}
-			first = false;
 		}
 
 		if (!list_empty(scheduled)) {
-- 
2.40.1



* [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
  2023-05-19  0:16 ` [PATCH 01/24] workqueue: Drop the special locking rule for worker->flags and worker_pool->flags Tejun Heo
  2023-05-19  0:16 ` [PATCH 02/24] workqueue: Cleanups around process_scheduled_works() Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-23  9:54   ` Lai Jiangshan
  2023-08-08  1:15   ` [PATCH v2 " Tejun Heo
  2023-05-19  0:16 ` [PATCH 04/24] workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq Tejun Heo
                   ` (24 subsequent siblings)
  27 siblings, 2 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or a flush work item is queued,
there's no reason to try to wake up a worker.

This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().

While at it:

* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
  pool.

* Every caller of insert_work() calls debug_work_activate(). Consolidate the
  invocations into insert_work().

* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
  to better accommodate future changes.

This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 37 ++++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 19 deletions(-)
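
The split described in the message above, where insertion only inserts and
the queueing path decides whether to wake a worker, can be sketched in
userspace Python as follows. This is a simplified model under assumptions:
Pool, Pwq and the wakeups counter are hypothetical, and the "no worker is
running" test stands in for __need_more_worker().

```python
class Pool:
    def __init__(self):
        self.worklist = []   # stand-in for pool->worklist
        self.nr_running = 0  # workers currently running
        self.wakeups = 0     # counts simulated wake_up_worker() calls


class Pwq:
    def __init__(self, pool, max_active):
        self.pool = pool
        self.max_active = max_active
        self.nr_active = 0
        self.inactive_works = []  # stand-in for pwq->inactive_works


def insert_work(work, head):
    # Insertion no longer tries to wake anyone up.
    head.append(work)


def queue_work(pwq, work):
    pool = pwq.pool
    if pwq.nr_active < pwq.max_active:
        pwq.nr_active += 1
        insert_work(work, pool.worklist)
        # Only a newly active work item can justify a wake-up.
        if pool.nr_running == 0:
            pool.wakeups += 1
    else:
        insert_work(work, pwq.inactive_works)


if __name__ == "__main__":
    pool = Pool()
    pwq = Pwq(pool, max_active=1)
    queue_work(pwq, "a")
    queue_work(pwq, "b")
    print(pool.worklist, pwq.inactive_works, pool.wakeups)
```

In this model the second item lands on the inactive list and triggers no
wake-up, which is the case the message above calls out as not needing one.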

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c1e56ba4a038..0d5eb436d31a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1523,7 +1523,7 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
 static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 			struct list_head *head, unsigned int extra_flags)
 {
-	struct worker_pool *pool = pwq->pool;
+	debug_work_activate(work);
 
 	/* record the work call stack in order to print it in KASAN reports */
 	kasan_record_aux_stack_noalloc(work);
@@ -1532,9 +1532,6 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 	set_work_pwq(work, pwq, extra_flags);
 	list_add_tail(&work->entry, head);
 	get_pwq(pwq);
-
-	if (__need_more_worker(pool))
-		wake_up_worker(pool);
 }
 
 /*
@@ -1588,8 +1585,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct pool_workqueue *pwq;
-	struct worker_pool *last_pool;
-	struct list_head *worklist;
+	struct worker_pool *last_pool, *pool;
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
 
@@ -1623,13 +1619,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
 	}
 
+	pool = pwq->pool;
+
 	/*
 	 * If @work was previously on a different pool, it might still be
 	 * running there, in which case the work needs to be queued on that
 	 * pool to guarantee non-reentrancy.
 	 */
 	last_pool = get_work_pool(work);
-	if (last_pool && last_pool != pwq->pool) {
+	if (last_pool && last_pool != pool) {
 		struct worker *worker;
 
 		raw_spin_lock(&last_pool->lock);
@@ -1638,13 +1636,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 
 		if (worker && worker->current_pwq->wq == wq) {
 			pwq = worker->current_pwq;
+			pool = pwq->pool;
 		} else {
 			/* meh... not running there, queue here */
 			raw_spin_unlock(&last_pool->lock);
-			raw_spin_lock(&pwq->pool->lock);
+			raw_spin_lock(&pool->lock);
 		}
 	} else {
-		raw_spin_lock(&pwq->pool->lock);
+		raw_spin_lock(&pool->lock);
 	}
 
 	/*
@@ -1657,7 +1656,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	 */
 	if (unlikely(!pwq->refcnt)) {
 		if (wq->flags & WQ_UNBOUND) {
-			raw_spin_unlock(&pwq->pool->lock);
+			raw_spin_unlock(&pool->lock);
 			cpu_relax();
 			goto retry;
 		}
@@ -1676,21 +1675,22 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	work_flags = work_color_to_flags(pwq->work_color);
 
 	if (likely(pwq->nr_active < pwq->max_active)) {
+		if (list_empty(&pool->worklist))
+			pool->watchdog_ts = jiffies;
+
 		trace_workqueue_activate_work(work);
 		pwq->nr_active++;
-		worklist = &pwq->pool->worklist;
-		if (list_empty(worklist))
-			pwq->pool->watchdog_ts = jiffies;
+		insert_work(pwq, work, &pool->worklist, work_flags);
+
+		if (__need_more_worker(pool))
+			wake_up_worker(pool);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
-		worklist = &pwq->inactive_works;
+		insert_work(pwq, work, &pwq->inactive_works, work_flags);
 	}
 
-	debug_work_activate(work);
-	insert_work(pwq, work, worklist, work_flags);
-
 out:
-	raw_spin_unlock(&pwq->pool->lock);
+	raw_spin_unlock(&pool->lock);
 	rcu_read_unlock();
 }
 
@@ -2994,7 +2994,6 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
 	pwq->nr_in_flight[work_color]++;
 	work_flags |= work_color_to_flags(work_color);
 
-	debug_work_activate(&barr->work);
 	insert_work(pwq, &barr->work, head, work_flags);
 }
 
-- 
2.40.1



* [PATCH 04/24] workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (2 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 05/24] workqueue: Relocate worker and work management functions Tejun Heo
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0d5eb436d31a..80b2bd01c718 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -320,7 +320,7 @@ struct workqueue_struct {
 
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
-	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+	struct pool_workqueue __percpu *cpu_pwq; /* I: per-cpu pwqs */
 	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
 };
 
@@ -1616,7 +1616,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	} else {
 		if (req_cpu == WORK_CPU_UNBOUND)
 			cpu = raw_smp_processor_id();
-		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
+		pwq = per_cpu_ptr(wq->cpu_pwq, cpu);
 	}
 
 	pool = pwq->pool;
@@ -3807,7 +3807,7 @@ static void rcu_free_wq(struct rcu_head *rcu)
 	wq_free_lockdep(wq);
 
 	if (!(wq->flags & WQ_UNBOUND))
-		free_percpu(wq->cpu_pwqs);
+		free_percpu(wq->cpu_pwq);
 	else
 		free_workqueue_attrs(wq->unbound_attrs);
 
@@ -4501,13 +4501,13 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 	int cpu, ret;
 
 	if (!(wq->flags & WQ_UNBOUND)) {
-		wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
-		if (!wq->cpu_pwqs)
+		wq->cpu_pwq = alloc_percpu(struct pool_workqueue);
+		if (!wq->cpu_pwq)
 			return -ENOMEM;
 
 		for_each_possible_cpu(cpu) {
 			struct pool_workqueue *pwq =
-				per_cpu_ptr(wq->cpu_pwqs, cpu);
+				per_cpu_ptr(wq->cpu_pwq, cpu);
 			struct worker_pool *cpu_pools =
 				per_cpu(cpu_worker_pools, cpu);
 
@@ -4888,7 +4888,7 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
 		cpu = smp_processor_id();
 
 	if (!(wq->flags & WQ_UNBOUND))
-		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
+		pwq = per_cpu_ptr(wq->cpu_pwq, cpu);
 	else
 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
-- 
2.40.1



* [PATCH 05/24] workqueue: Relocate worker and work management functions
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (3 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 04/24] workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 06/24] workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa Tejun Heo
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.

No functional changes.
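
For readers unfamiliar with find_worker_executing_work(), its comment in the
diff below explains why a worker must match on both the work item address and
the work function. The following is a minimal userspace sketch of that check,
not the kernel implementation: every model_* name, func_a() and func_b() are
made up for illustration, and a linear scan stands in for the busy_hash
lookup.

```c
/* Userspace model only; every name here is hypothetical, not kernel API. */
#include <assert.h>
#include <stddef.h>

struct model_work;
typedef void (*model_func_t)(struct model_work *work);

struct model_work {
	model_func_t func;
};

struct model_worker {
	struct model_work *current_work;	/* address of the item being run */
	model_func_t current_func;		/* function recorded when it started */
};

static void func_a(struct model_work *work) { (void)work; }
static void func_b(struct model_work *work) { (void)work; }

/* Linear scan standing in for the busy_hash lookup keyed by work address. */
static struct model_worker *model_find_executing(struct model_worker *workers,
						 size_t nr,
						 struct model_work *work)
{
	for (size_t i = 0; i < nr; i++)
		if (workers[i].current_work == work &&
		    workers[i].current_func == work->func)
			return &workers[i];

	return NULL;
}

/*
 * Returns 1 if a recycled item living at the same address is not mistaken
 * for the item the worker is still executing, 0 otherwise.
 */
static int demo_recycled_work(void)
{
	struct model_work work = { .func = func_a };
	struct model_worker worker = {
		.current_work = &work,
		.current_func = func_a,
	};

	/* Still the original item: address and function both match. */
	if (model_find_executing(&worker, 1, &work) != &worker)
		return 0;

	/* The memory is reused for an unrelated item with another function. */
	work.func = func_b;
	return model_find_executing(&worker, 1, &work) == NULL;
}
```

In this model, matching on the address alone would report the recycled item
as still executing; comparing the function as well filters that case out.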

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 340 ++++++++++++++++++++++-----------------------
 1 file changed, 168 insertions(+), 172 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 80b2bd01c718..6ec22eb87283 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -860,36 +860,6 @@ static bool too_many_workers(struct worker_pool *pool)
 	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
 
-/*
- * Wake up functions.
- */
-
-/* Return the first idle worker.  Called with pool->lock held. */
-static struct worker *first_idle_worker(struct worker_pool *pool)
-{
-	if (unlikely(list_empty(&pool->idle_list)))
-		return NULL;
-
-	return list_first_entry(&pool->idle_list, struct worker, entry);
-}
-
-/**
- * wake_up_worker - wake up an idle worker
- * @pool: worker pool to wake worker from
- *
- * Wake up the first idle worker of @pool.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock).
- */
-static void wake_up_worker(struct worker_pool *pool)
-{
-	struct worker *worker = first_idle_worker(pool);
-
-	if (likely(worker))
-		wake_up_process(worker->task);
-}
-
 /**
  * worker_set_flags - set worker flags and adjust nr_running accordingly
  * @worker: self
@@ -938,6 +908,174 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
 			pool->nr_running++;
 }
 
+/* Return the first idle worker.  Called with pool->lock held. */
+static struct worker *first_idle_worker(struct worker_pool *pool)
+{
+	if (unlikely(list_empty(&pool->idle_list)))
+		return NULL;
+
+	return list_first_entry(&pool->idle_list, struct worker, entry);
+}
+
+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state.  Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+	struct worker_pool *pool = worker->pool;
+
+	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
+	    WARN_ON_ONCE(!list_empty(&worker->entry) &&
+			 (worker->hentry.next || worker->hentry.pprev)))
+		return;
+
+	/* can't use worker_set_flags(), also called from create_worker() */
+	worker->flags |= WORKER_IDLE;
+	pool->nr_idle++;
+	worker->last_active = jiffies;
+
+	/* idle_list is LIFO */
+	list_add(&worker->entry, &pool->idle_list);
+
+	if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
+		mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
+
+	/* Sanity check nr_running. */
+	WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state.  Update stats.
+ *
+ * LOCKING:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+	struct worker_pool *pool = worker->pool;
+
+	if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE)))
+		return;
+	worker_clr_flags(worker, WORKER_IDLE);
+	pool->nr_idle--;
+	list_del_init(&worker->entry);
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @pool: pool of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @pool by searching
+ * @pool->busy_hash which is keyed by the address of @work.  For a worker
+ * to match, its current execution should match the address of @work and
+ * its work function.  This is to avoid unwanted dependency between
+ * unrelated work executions through a work item being recycled while still
+ * being executed.
+ *
+ * This is a bit tricky.  A work item may be freed once its execution
+ * starts and nothing prevents the freed area from being recycled for
+ * another work item.  If the same work item address ends up being reused
+ * before the original execution finishes, workqueue will identify the
+ * recycled work item as currently executing and make it wait until the
+ * current execution finishes, introducing an unwanted dependency.
+ *
+ * This function checks the work item address and work function to avoid
+ * false positives.  Note that this isn't complete as one may construct a
+ * work function which can introduce dependency onto itself through a
+ * recycled work item.  Well, if somebody wants to shoot oneself in the
+ * foot that badly, there's only so much we can do, and if such deadlock
+ * actually occurs, it should be easy to locate the culprit work function.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock).
+ *
+ * Return:
+ * Pointer to worker which is executing @work if found, %NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct worker_pool *pool,
+						 struct work_struct *work)
+{
+	struct worker *worker;
+
+	hash_for_each_possible(pool->busy_hash, worker, hentry,
+			       (unsigned long)work)
+		if (worker->current_work == work &&
+		    worker->current_func == work->func)
+			return worker;
+
+	return NULL;
+}
+
+/**
+ * move_linked_works - move linked works to a list
+ * @work: start of series of works to be scheduled
+ * @head: target list to append @work to
+ * @nextp: out parameter for nested worklist walking
+ *
+ * Schedule linked works starting from @work to @head.  Work series to
+ * be scheduled starts at @work and includes any consecutive work with
+ * WORK_STRUCT_LINKED set in its predecessor.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of
+ * the last scheduled work.  This allows move_linked_works() to be
+ * nested inside outer list_for_each_entry_safe().
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void move_linked_works(struct work_struct *work, struct list_head *head,
+			      struct work_struct **nextp)
+{
+	struct work_struct *n;
+
+	/*
+	 * Linked worklist will always end before the end of the list,
+	 * use NULL for list head.
+	 */
+	list_for_each_entry_safe_from(work, n, NULL, entry) {
+		list_move_tail(&work->entry, head);
+		if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
+			break;
+	}
+
+	/*
+	 * If we're already inside safe list traversal and have moved
+	 * multiple works to the scheduled queue, the next position
+	 * needs to be updated.
+	 */
+	if (nextp)
+		*nextp = n;
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @pool: worker pool to wake worker from
+ *
+ * Wake up the first idle worker of @pool.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock).
+ */
+static void wake_up_worker(struct worker_pool *pool)
+{
+	struct worker *worker = first_idle_worker(pool);
+
+	if (likely(worker))
+		wake_up_process(worker->task);
+}
+
 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
 
 /*
@@ -1183,94 +1321,6 @@ work_func_t wq_worker_last_func(struct task_struct *task)
 	return worker->last_func;
 }
 
-/**
- * find_worker_executing_work - find worker which is executing a work
- * @pool: pool of interest
- * @work: work to find worker for
- *
- * Find a worker which is executing @work on @pool by searching
- * @pool->busy_hash which is keyed by the address of @work.  For a worker
- * to match, its current execution should match the address of @work and
- * its work function.  This is to avoid unwanted dependency between
- * unrelated work executions through a work item being recycled while still
- * being executed.
- *
- * This is a bit tricky.  A work item may be freed once its execution
- * starts and nothing prevents the freed area from being recycled for
- * another work item.  If the same work item address ends up being reused
- * before the original execution finishes, workqueue will identify the
- * recycled work item as currently executing and make it wait until the
- * current execution finishes, introducing an unwanted dependency.
- *
- * This function checks the work item address and work function to avoid
- * false positives.  Note that this isn't complete as one may construct a
- * work function which can introduce dependency onto itself through a
- * recycled work item.  Well, if somebody wants to shoot oneself in the
- * foot that badly, there's only so much we can do, and if such deadlock
- * actually occurs, it should be easy to locate the culprit work function.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock).
- *
- * Return:
- * Pointer to worker which is executing @work if found, %NULL
- * otherwise.
- */
-static struct worker *find_worker_executing_work(struct worker_pool *pool,
-						 struct work_struct *work)
-{
-	struct worker *worker;
-
-	hash_for_each_possible(pool->busy_hash, worker, hentry,
-			       (unsigned long)work)
-		if (worker->current_work == work &&
-		    worker->current_func == work->func)
-			return worker;
-
-	return NULL;
-}
-
-/**
- * move_linked_works - move linked works to a list
- * @work: start of series of works to be scheduled
- * @head: target list to append @work to
- * @nextp: out parameter for nested worklist walking
- *
- * Schedule linked works starting from @work to @head.  Work series to
- * be scheduled starts at @work and includes any consecutive work with
- * WORK_STRUCT_LINKED set in its predecessor.
- *
- * If @nextp is not NULL, it's updated to point to the next work of
- * the last scheduled work.  This allows move_linked_works() to be
- * nested inside outer list_for_each_entry_safe().
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock).
- */
-static void move_linked_works(struct work_struct *work, struct list_head *head,
-			      struct work_struct **nextp)
-{
-	struct work_struct *n;
-
-	/*
-	 * Linked worklist will always end before the end of the list,
-	 * use NULL for list head.
-	 */
-	list_for_each_entry_safe_from(work, n, NULL, entry) {
-		list_move_tail(&work->entry, head);
-		if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
-			break;
-	}
-
-	/*
-	 * If we're already inside safe list traversal and have moved
-	 * multiple works to the scheduled queue, the next position
-	 * needs to be updated.
-	 */
-	if (nextp)
-		*nextp = n;
-}
-
 /**
  * get_pwq - get an extra reference on the specified pool_workqueue
  * @pwq: pool_workqueue to get
@@ -1954,60 +2004,6 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 }
 EXPORT_SYMBOL(queue_rcu_work);
 
-/**
- * worker_enter_idle - enter idle state
- * @worker: worker which is entering idle state
- *
- * @worker is entering idle state.  Update stats and idle timer if
- * necessary.
- *
- * LOCKING:
- * raw_spin_lock_irq(pool->lock).
- */
-static void worker_enter_idle(struct worker *worker)
-{
-	struct worker_pool *pool = worker->pool;
-
-	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
-	    WARN_ON_ONCE(!list_empty(&worker->entry) &&
-			 (worker->hentry.next || worker->hentry.pprev)))
-		return;
-
-	/* can't use worker_set_flags(), also called from create_worker() */
-	worker->flags |= WORKER_IDLE;
-	pool->nr_idle++;
-	worker->last_active = jiffies;
-
-	/* idle_list is LIFO */
-	list_add(&worker->entry, &pool->idle_list);
-
-	if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
-		mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
-
-	/* Sanity check nr_running. */
-	WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
-}
-
-/**
- * worker_leave_idle - leave idle state
- * @worker: worker which is leaving idle state
- *
- * @worker is leaving idle state.  Update stats.
- *
- * LOCKING:
- * raw_spin_lock_irq(pool->lock).
- */
-static void worker_leave_idle(struct worker *worker)
-{
-	struct worker_pool *pool = worker->pool;
-
-	if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE)))
-		return;
-	worker_clr_flags(worker, WORKER_IDLE);
-	pool->nr_idle--;
-	list_del_init(&worker->entry);
-}
-
 static struct worker *alloc_worker(int node)
 {
 	struct worker *worker;
-- 
2.40.1



* [PATCH 06/24] workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (4 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 05/24] workqueue: Relocate worker and work management functions Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 07/24] workqueue: Use a kthread_worker to release pool_workqueues Tejun Heo
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         |  9 ---
 kernel/workqueue.c                            | 73 -------------------
 2 files changed, 82 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3ed7dda4c994..042275425c32 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6943,15 +6943,6 @@
 			threshold repeatedly. They are likely good
 			candidates for using WQ_UNBOUND workqueues instead.
 
-	workqueue.disable_numa
-			By default, all work items queued to unbound
-			workqueues are affine to the NUMA nodes they're
-			issued on, which results in better behavior in
-			general.  If NUMA affinity needs to be disabled for
-			whatever reason, this option can be used.  Note
-			that this also can be controlled per-workqueue for
-			workqueues visible under /sys/bus/workqueue/.
-
 	workqueue.power_efficient
 			Per-cpu workqueues are generally preferred because
 			they show better performance thanks to cache
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6ec22eb87283..f39d04e7e5f9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -337,9 +337,6 @@ static cpumask_var_t *wq_numa_possible_cpumask;
 static unsigned long wq_cpu_intensive_thresh_us = 10000;
 module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
 
-static bool wq_disable_numa;
-module_param_named(disable_numa, wq_disable_numa, bool, 0444);
-
 /* see the comment above the definition of WQ_POWER_EFFICIENT */
 static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
 module_param_named(power_efficient, wq_power_efficient, bool, 0444);
@@ -5777,10 +5774,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
  *
  * Unbound workqueues have the following extra attributes.
  *
- *  pool_ids	RO int	: the associated pool IDs for each node
  *  nice	RW int	: nice value of the workers
  *  cpumask	RW mask	: bitmask of allowed CPUs for the workers
- *  numa	RW bool	: whether enable NUMA affinity
  */
 struct wq_device {
 	struct workqueue_struct		*wq;
@@ -5833,28 +5828,6 @@ static struct attribute *wq_sysfs_attrs[] = {
 };
 ATTRIBUTE_GROUPS(wq_sysfs);
 
-static ssize_t wq_pool_ids_show(struct device *dev,
-				struct device_attribute *attr, char *buf)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	const char *delim = "";
-	int node, written = 0;
-
-	cpus_read_lock();
-	rcu_read_lock();
-	for_each_node(node) {
-		written += scnprintf(buf + written, PAGE_SIZE - written,
-				     "%s%d:%d", delim, node,
-				     unbound_pwq_by_node(wq, node)->pool->id);
-		delim = " ";
-	}
-	written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
-	rcu_read_unlock();
-	cpus_read_unlock();
-
-	return written;
-}
-
 static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
 			    char *buf)
 {
@@ -5945,50 +5918,9 @@ static ssize_t wq_cpumask_store(struct device *dev,
 	return ret ?: count;
 }
 
-static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
-			    char *buf)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	int written;
-
-	mutex_lock(&wq->mutex);
-	written = scnprintf(buf, PAGE_SIZE, "%d\n",
-			    !wq->unbound_attrs->no_numa);
-	mutex_unlock(&wq->mutex);
-
-	return written;
-}
-
-static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
-			     const char *buf, size_t count)
-{
-	struct workqueue_struct *wq = dev_to_wq(dev);
-	struct workqueue_attrs *attrs;
-	int v, ret = -ENOMEM;
-
-	apply_wqattrs_lock();
-
-	attrs = wq_sysfs_prep_attrs(wq);
-	if (!attrs)
-		goto out_unlock;
-
-	ret = -EINVAL;
-	if (sscanf(buf, "%d", &v) == 1) {
-		attrs->no_numa = !v;
-		ret = apply_workqueue_attrs_locked(wq, attrs);
-	}
-
-out_unlock:
-	apply_wqattrs_unlock();
-	free_workqueue_attrs(attrs);
-	return ret ?: count;
-}
-
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
-	__ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
-	__ATTR(numa, 0644, wq_numa_show, wq_numa_store),
 	__ATTR_NULL,
 };
 
@@ -6362,11 +6294,6 @@ static void __init wq_numa_init(void)
 	if (num_possible_nodes() <= 1)
 		return;
 
-	if (wq_disable_numa) {
-		pr_info("workqueue: NUMA affinity support disabled\n");
-		return;
-	}
-
 	for_each_possible_cpu(cpu) {
 		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
-- 
2.40.1



* [PATCH 07/24] workqueue: Use a kthread_worker to release pool_workqueues
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (5 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 06/24] workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 08/24] workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones Tejun Heo
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus risks causing an A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.

While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too, making the current approach
unusable.

Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.
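
To illustrate the hazard, here is a minimal userspace sketch, not kernel code.
It assumes that queueing work on a target takes that target's non-recursive
lock, so a final put made under a pool lock cannot safely queue onto the same
pool, while a dedicated worker with its own lock is always safe. All model_*
and demo_* names are hypothetical.

```c
/* Userspace model only; every name here is hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>

/* A non-recursive lock: taking it twice from one context would deadlock. */
struct model_lock {
	bool held;
};

static bool model_trylock(struct model_lock *lock)
{
	if (lock->held)
		return false;	/* a real spinlock would spin forever here */
	lock->held = true;
	return true;
}

static void model_unlock(struct model_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Anything work can be queued on: a worker pool or the release worker. */
struct model_queue {
	struct model_lock lock;
	int nr_queued;
};

/* Queueing needs the target's own lock. */
static bool model_queue_work(struct model_queue *target)
{
	if (!model_trylock(&target->lock))
		return false;
	target->nr_queued++;
	model_unlock(&target->lock);
	return true;
}

struct model_pwq {
	struct model_queue *pool;
	int refcnt;
};

/*
 * Drop a reference with pwq->pool's lock held. On the final put, bounce the
 * release to @release_target. Returns false if queueing would self-deadlock.
 */
static bool model_put_pwq(struct model_pwq *pwq,
			  struct model_queue *release_target)
{
	assert(pwq->pool->lock.held);
	if (--pwq->refcnt)
		return true;
	return model_queue_work(release_target);
}

/* Returns 1 if bouncing to the pool whose lock is held would deadlock. */
static int demo_bounce_to_same_pool(void)
{
	struct model_queue pool = { .lock = { .held = false }, .nr_queued = 0 };
	struct model_pwq pwq = { .pool = &pool, .refcnt = 1 };
	bool ok;

	if (!model_trylock(&pool.lock))
		return -1;
	ok = model_put_pwq(&pwq, &pool);
	model_unlock(&pool.lock);

	return ok ? 0 : 1;
}

/* Returns the number of releases queued on the dedicated worker. */
static int demo_bounce_to_dedicated_worker(void)
{
	struct model_queue pool = { .lock = { .held = false }, .nr_queued = 0 };
	struct model_queue release_worker = { .lock = { .held = false }, .nr_queued = 0 };
	struct model_pwq pwq = { .pool = &pool, .refcnt = 1 };
	bool ok;

	if (!model_trylock(&pool.lock))
		return -1;
	ok = model_put_pwq(&pwq, &release_worker);
	model_unlock(&pool.lock);

	return ok ? release_worker.nr_queued : -1;
}
```

The dedicated target never shares a lock with any pool, which is why the
model needs no special-casing of which workqueue type is doing the put.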

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f39d04e7e5f9..7addda9b37b9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -256,12 +256,12 @@ struct pool_workqueue {
 	u64			stats[PWQ_NR_STATS];
 
 	/*
-	 * Release of unbound pwq is punted to system_wq.  See put_pwq()
-	 * and pwq_unbound_release_workfn() for details.  pool_workqueue
-	 * itself is also RCU protected so that the first pwq can be
-	 * determined without grabbing wq->mutex.
+	 * Release of unbound pwq is punted to a kthread_worker. See put_pwq()
+	 * and pwq_unbound_release_workfn() for details. pool_workqueue itself
+	 * is also RCU protected so that the first pwq can be determined without
+	 * grabbing wq->mutex.
 	 */
-	struct work_struct	unbound_release_work;
+	struct kthread_work	unbound_release_work;
 	struct rcu_head		rcu;
 } __aligned(1 << WORK_STRUCT_FLAG_BITS);
 
@@ -389,6 +389,13 @@ static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
 /* I: attributes used when instantiating ordered pools on demand */
 static struct workqueue_attrs *ordered_wq_attrs[NR_STD_WORKER_POOLS];
 
+/*
+ * I: kthread_worker to release pwq's. pwq release needs to be bounced to a
+ * process context while holding a pool lock. Bounce to a dedicated kthread
+ * worker to avoid A-A deadlocks.
+ */
+static struct kthread_worker *pwq_release_worker;
+
 struct workqueue_struct *system_wq __read_mostly;
 EXPORT_SYMBOL(system_wq);
 struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -1347,14 +1354,10 @@ static void put_pwq(struct pool_workqueue *pwq)
 	if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND)))
 		return;
 	/*
-	 * @pwq can't be released under pool->lock, bounce to
-	 * pwq_unbound_release_workfn().  This never recurses on the same
-	 * pool->lock as this path is taken only for unbound workqueues and
-	 * the release work item is scheduled on a per-cpu workqueue.  To
-	 * avoid lockdep warning, unbound pool->locks are given lockdep
-	 * subclass of 1 in get_unbound_pool().
+	 * @pwq can't be released under pool->lock, bounce to a dedicated
+	 * kthread_worker to avoid A-A deadlocks.
 	 */
-	schedule_work(&pwq->unbound_release_work);
+	kthread_queue_work(pwq_release_worker, &pwq->unbound_release_work);
 }
 
 /**
@@ -3948,7 +3951,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	if (!pool || init_worker_pool(pool) < 0)
 		goto fail;
 
-	lockdep_set_subclass(&pool->lock, 1);	/* see put_pwq() */
 	copy_workqueue_attrs(pool->attrs, attrs);
 	pool->node = target_node;
 
@@ -3982,10 +3984,10 @@ static void rcu_free_pwq(struct rcu_head *rcu)
 }
 
 /*
- * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt
- * and needs to be destroyed.
+ * Scheduled on pwq_release_worker by put_pwq() when an unbound pwq hits zero
+ * refcnt and needs to be destroyed.
  */
-static void pwq_unbound_release_workfn(struct work_struct *work)
+static void pwq_unbound_release_workfn(struct kthread_work *work)
 {
 	struct pool_workqueue *pwq = container_of(work, struct pool_workqueue,
 						  unbound_release_work);
@@ -4093,7 +4095,8 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	INIT_LIST_HEAD(&pwq->inactive_works);
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
-	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+	kthread_init_work(&pwq->unbound_release_work,
+			  pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
@@ -6419,6 +6422,9 @@ void __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");
+	BUG_ON(IS_ERR(pwq_release_worker));
+
 	/*
 	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
 	 * CPU to node mapping may not be available that early on some
-- 
2.40.1



* [PATCH 08/24] workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (6 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 07/24] workqueue: Use a kthread_worker to release pool_workqueues Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues Tejun Heo
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.

Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.

In preparation, this patch makes per-cpu pwq's allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.

pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu pwq's.
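
As a rough illustration of the data structure change, the following userspace
sketch models a per-CPU table that holds pointers to individually allocated,
reference counted objects rather than embedding them. It is not the kernel
implementation: a plain array stands in for the percpu allocation,
calloc()/free() stand in for the slab cache, and all model_* and demo_* names
are hypothetical.

```c
/* Userspace model only; every name here is hypothetical, not kernel API. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4

struct model_pwq {
	int cpu;
	int refcnt;
};

struct model_wq {
	/* one pointer per CPU instead of one embedded struct per CPU */
	struct model_pwq *cpu_pwq[MODEL_NR_CPUS];
};

/* Free whatever is still allocated; free(NULL) is a no-op. */
static void model_free_pwqs(struct model_wq *wq)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		free(wq->cpu_pwq[cpu]);
		wq->cpu_pwq[cpu] = NULL;
	}
}

/* Allocate one pwq per CPU, each starting with a single base reference. */
static int model_alloc_pwqs(struct model_wq *wq)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		wq->cpu_pwq[cpu] = NULL;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		struct model_pwq *pwq = calloc(1, sizeof(*pwq));

		if (!pwq) {
			model_free_pwqs(wq);	/* unwind partial allocation */
			return -1;
		}
		pwq->cpu = cpu;
		pwq->refcnt = 1;
		wq->cpu_pwq[cpu] = pwq;
	}
	return 0;
}

static void model_get_pwq(struct model_pwq *pwq)
{
	pwq->refcnt++;
}

/* Drop a reference; returns 1 if it was the last one and the pwq was freed. */
static int model_put_pwq(struct model_wq *wq, struct model_pwq *pwq)
{
	if (--pwq->refcnt)
		return 0;
	wq->cpu_pwq[pwq->cpu] = NULL;
	free(pwq);
	return 1;
}

/* Returns how many puts were needed before the CPU 2 pwq was freed. */
static int demo_refcounted_cpu_pwq(void)
{
	struct model_wq wq;
	struct model_pwq *pwq;
	int nr_puts = 0;

	if (model_alloc_pwqs(&wq))
		return -1;

	pwq = wq.cpu_pwq[2];	/* lookup is a pointer load, then a dereference */
	model_get_pwq(pwq);	/* e.g. an in-flight user pins the pwq */

	do {
		nr_puts++;
	} while (!model_put_pwq(&wq, pwq));

	if (wq.cpu_pwq[2] != NULL) {
		model_free_pwqs(&wq);
		return -1;
	}

	model_free_pwqs(&wq);
	return nr_puts;
}
```

In this model the object outlives the base reference for as long as someone
else holds it, which is the lifetime rule the unbound pwq's already follow.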

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 74 +++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7addda9b37b9..574d2e12417d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -257,11 +257,11 @@ struct pool_workqueue {
 
 	/*
 	 * Release of unbound pwq is punted to a kthread_worker. See put_pwq()
-	 * and pwq_unbound_release_workfn() for details. pool_workqueue itself
-	 * is also RCU protected so that the first pwq can be determined without
+	 * and pwq_release_workfn() for details. pool_workqueue itself is also
+	 * RCU protected so that the first pwq can be determined without
 	 * grabbing wq->mutex.
 	 */
-	struct kthread_work	unbound_release_work;
+	struct kthread_work	release_work;
 	struct rcu_head		rcu;
 } __aligned(1 << WORK_STRUCT_FLAG_BITS);
 
@@ -320,7 +320,7 @@ struct workqueue_struct {
 
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
-	struct pool_workqueue __percpu *cpu_pwq; /* I: per-cpu pwqs */
+	struct pool_workqueue __percpu **cpu_pwq; /* I: per-cpu pwqs */
 	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
 };
 
@@ -1351,13 +1351,11 @@ static void put_pwq(struct pool_workqueue *pwq)
 	lockdep_assert_held(&pwq->pool->lock);
 	if (likely(--pwq->refcnt))
 		return;
-	if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND)))
-		return;
 	/*
 	 * @pwq can't be released under pool->lock, bounce to a dedicated
 	 * kthread_worker to avoid A-A deadlocks.
 	 */
-	kthread_queue_work(pwq_release_worker, &pwq->unbound_release_work);
+	kthread_queue_work(pwq_release_worker, &pwq->release_work);
 }
 
 /**
@@ -1666,7 +1664,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	} else {
 		if (req_cpu == WORK_CPU_UNBOUND)
 			cpu = raw_smp_processor_id();
-		pwq = per_cpu_ptr(wq->cpu_pwq, cpu);
+		pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
 	}
 
 	pool = pwq->pool;
@@ -3987,31 +3985,30 @@ static void rcu_free_pwq(struct rcu_head *rcu)
  * Scheduled on pwq_release_worker by put_pwq() when an unbound pwq hits zero
  * refcnt and needs to be destroyed.
  */
-static void pwq_unbound_release_workfn(struct kthread_work *work)
+static void pwq_release_workfn(struct kthread_work *work)
 {
 	struct pool_workqueue *pwq = container_of(work, struct pool_workqueue,
-						  unbound_release_work);
+						  release_work);
 	struct workqueue_struct *wq = pwq->wq;
 	struct worker_pool *pool = pwq->pool;
 	bool is_last = false;
 
 	/*
-	 * when @pwq is not linked, it doesn't hold any reference to the
+	 * When @pwq is not linked, it doesn't hold any reference to the
 	 * @wq, and @wq is invalid to access.
 	 */
 	if (!list_empty(&pwq->pwqs_node)) {
-		if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
-			return;
-
 		mutex_lock(&wq->mutex);
 		list_del_rcu(&pwq->pwqs_node);
 		is_last = list_empty(&wq->pwqs);
 		mutex_unlock(&wq->mutex);
 	}
 
-	mutex_lock(&wq_pool_mutex);
-	put_unbound_pool(pool);
-	mutex_unlock(&wq_pool_mutex);
+	if (wq->flags & WQ_UNBOUND) {
+		mutex_lock(&wq_pool_mutex);
+		put_unbound_pool(pool);
+		mutex_unlock(&wq_pool_mutex);
+	}
 
 	call_rcu(&pwq->rcu, rcu_free_pwq);
 
@@ -4095,8 +4092,7 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	INIT_LIST_HEAD(&pwq->inactive_works);
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
-	kthread_init_work(&pwq->unbound_release_work,
-			  pwq_unbound_release_workfn);
+	kthread_init_work(&pwq->release_work, pwq_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
@@ -4497,20 +4493,25 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 	int cpu, ret;
 
 	if (!(wq->flags & WQ_UNBOUND)) {
-		wq->cpu_pwq = alloc_percpu(struct pool_workqueue);
+		wq->cpu_pwq = alloc_percpu(struct pool_workqueue *);
 		if (!wq->cpu_pwq)
-			return -ENOMEM;
+			goto enomem;
 
 		for_each_possible_cpu(cpu) {
-			struct pool_workqueue *pwq =
+			struct pool_workqueue **pwq_p =
 				per_cpu_ptr(wq->cpu_pwq, cpu);
-			struct worker_pool *cpu_pools =
-				per_cpu(cpu_worker_pools, cpu);
+			struct worker_pool *pool =
+				&(per_cpu_ptr(cpu_worker_pools, cpu)[highpri]);
 
-			init_pwq(pwq, wq, &cpu_pools[highpri]);
+			*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
+						       pool->node);
+			if (!*pwq_p)
+				goto enomem;
+
+			init_pwq(*pwq_p, wq, pool);
 
 			mutex_lock(&wq->mutex);
-			link_pwq(pwq);
+			link_pwq(*pwq_p);
 			mutex_unlock(&wq->mutex);
 		}
 		return 0;
@@ -4529,6 +4530,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 	cpus_read_unlock();
 
 	return ret;
+
+enomem:
+	if (wq->cpu_pwq) {
+		for_each_possible_cpu(cpu)
+			kfree(*per_cpu_ptr(wq->cpu_pwq, cpu));
+		free_percpu(wq->cpu_pwq);
+		wq->cpu_pwq = NULL;
+	}
+	return -ENOMEM;
 }
 
 static int wq_clamp_max_active(int max_active, unsigned int flags,
@@ -4702,7 +4712,7 @@ static bool pwq_busy(struct pool_workqueue *pwq)
 void destroy_workqueue(struct workqueue_struct *wq)
 {
 	struct pool_workqueue *pwq;
-	int node;
+	int cpu, node;
 
 	/*
 	 * Remove it from sysfs first so that sanity check failure doesn't
@@ -4762,12 +4772,8 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	mutex_unlock(&wq_pool_mutex);
 
 	if (!(wq->flags & WQ_UNBOUND)) {
-		wq_unregister_lockdep(wq);
-		/*
-		 * The base ref is never dropped on per-cpu pwqs.  Directly
-		 * schedule RCU free.
-		 */
-		call_rcu(&wq->rcu, rcu_free_wq);
+		for_each_possible_cpu(cpu)
+			put_pwq_unlocked(*per_cpu_ptr(wq->cpu_pwq, cpu));
 	} else {
 		/*
 		 * We're the sole accessor of @wq at this point.  Directly
@@ -4884,7 +4890,7 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
 		cpu = smp_processor_id();
 
 	if (!(wq->flags & WQ_UNBOUND))
-		pwq = per_cpu_ptr(wq->cpu_pwq, cpu);
+		pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
 	else
 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
-- 
2.40.1



* [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (7 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 08/24] workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-22  6:41   ` Leon Romanovsky
  2023-05-19  0:16 ` [PATCH 10/24] workqueue: Rename workqueue_attrs->no_numa to ->ordered Tejun Heo
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo,
	Dennis Dalessandro, Jason Gunthorpe, Leon Romanovsky,
	Karsten Graul, Wenjia Zhang, Jan Karcher

A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.

As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:

* Because @max_active is per-pwq, the meaning of @max_active changes
  depending on the machine configuration and whether workqueue NUMA locality
  support is enabled.

* Makes per-cpu and unbound code deviate.

* Gets in the way of making workqueue CPU locality awareness more flexible.
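
To make the first point concrete, here is a rough userspace sketch (not
kernel code; total_active_limit() is a made-up helper) of how the effective
ceiling follows the number of pwqs when @max_active is enforced per pwq:

```c
#include <assert.h>

/*
 * Hypothetical helper: upper bound on concurrently active work items of a
 * workqueue when @max_active is applied to each pwq separately.
 */
static int total_active_limit(int max_active, int nr_pwqs)
{
	assert(max_active >= 1 && nr_pwqs >= 1);
	return max_active * nr_pwqs;
}
```

With @max_active of 16, a two-node machine with NUMA locality support
enabled would end up with a ceiling of 32 (one pwq per node), while the same
machine with it disabled would get 16 (a single shared pwq).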

This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:

* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
  just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
  workqueues.

* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
  the specified pwq to the target CPU's wq->cpu_pwq.

* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
  unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
  This makes the return value of wq_calc_node_cpumask() unnecessary. It now
  returns void.

* @max_active now means the same thing for both per-cpu and unbound
  workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
  documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
  used in workqueue implementation and will be removed later.

* All unbound pwq operations which used to be per-numa-node are now per-cpu.

For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion are complete.

One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Input on the behavior change would be very much appreciated.
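
As a toy illustration of the change in scope (userspace only, every name is
made up, and it deliberately ignores how work items become inactive in the
first place), compare what a caller on a given CPU would be told under the
two schemes:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH		4	/* hypothetical machine: 4 CPUs ... */
#define CPUS_PER_NODE_SKETCH	2	/* ... spread over 2 NUMA nodes */

/* inactive work items, attributed to the CPU they were queued from */
struct wq_load_sketch {
	int nr_inactive[NR_CPUS_SKETCH];
};

/* old scope: one pwq per node, so any backlog in the node is reported */
static bool congested_per_node(const struct wq_load_sketch *load, int cpu)
{
	int first, i;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	first = cpu / CPUS_PER_NODE_SKETCH * CPUS_PER_NODE_SKETCH;
	for (i = first; i < first + CPUS_PER_NODE_SKETCH; i++)
		if (load->nr_inactive[i] > 0)
			return true;
	return false;
}

/* new scope: one pwq per CPU, so only that CPU's backlog is reported */
static bool congested_per_cpu(const struct wq_load_sketch *load, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	return load->nr_inactive[cpu] > 0;
}
```

If only CPU 1 has a backlog, a caller on CPU 0 would have seen the workqueue
as congested under the per-node scheme but not under the per-cpu one. Users
that treat the result as a node-wide signal may want to take that into
account.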

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
---
 Documentation/core-api/workqueue.rst |  21 ++-
 include/linux/workqueue.h            |   8 +-
 kernel/workqueue.c                   | 218 ++++++++++-----------------
 3 files changed, 89 insertions(+), 158 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index a4c9b9d1905f..8e541c5d8fa9 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -220,17 +220,16 @@ resources, scheduled and executed.
 ``max_active``
 --------------
 
-``@max_active`` determines the maximum number of execution contexts
-per CPU which can be assigned to the work items of a wq.  For example,
-with ``@max_active`` of 16, at most 16 work items of the wq can be
-executing at the same time per CPU.
-
-Currently, for a bound wq, the maximum limit for ``@max_active`` is
-512 and the default value used when 0 is specified is 256.  For an
-unbound wq, the limit is higher of 512 and 4 *
-``num_possible_cpus()``.  These values are chosen sufficiently high
-such that they are not the limiting factor while providing protection
-in runaway cases.
+``@max_active`` determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq. For example, with
+``@max_active`` of 16, at most 16 work items of the wq can be executing
+at the same time per CPU. This is always a per-CPU attribute, even for
+unbound workqueues.
+
+The maximum limit for ``@max_active`` is 512 and the default value used
+when 0 is specified is 256. These values are chosen sufficiently high
+such that they are not the limiting factor while providing protection in
+runaway cases.
 
 The number of active work items of a wq is usually regulated by the
 users of the wq, more specifically, by how many work items the users
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 3992c994787f..d1b681f67985 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -342,14 +342,10 @@ enum {
 	__WQ_ORDERED_EXPLICIT	= 1 << 19, /* internal: alloc_ordered_workqueue() */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
-	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
+	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
 };
 
-/* unbound wq's aren't per-cpu, scale max_active according to #cpus */
-#define WQ_UNBOUND_MAX_ACTIVE	\
-	max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)
-
 /*
  * System-wide workqueues which are always present.
  *
@@ -390,7 +386,7 @@ extern struct workqueue_struct *system_freezable_power_efficient_wq;
  * alloc_workqueue - allocate a workqueue
  * @fmt: printf format for the name of the workqueue
  * @flags: WQ_* flags
- * @max_active: max in-flight work items, 0 for default
+ * @max_active: max in-flight work items per CPU, 0 for default
  * remaining args: args for @fmt
  *
  * Allocate a workqueue with the specified parameters.  For detailed
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 574d2e12417d..43f3bb801bd9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -320,8 +320,7 @@ struct workqueue_struct {
 
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
-	struct pool_workqueue __percpu **cpu_pwq; /* I: per-cpu pwqs */
-	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
+	struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -602,35 +601,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
 	return ret;
 }
 
-/**
- * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
- * @wq: the target workqueue
- * @node: the node ID
- *
- * This must be called with any of wq_pool_mutex, wq->mutex or RCU
- * read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- *
- * Return: The unbound pool_workqueue for @node.
- */
-static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
-						  int node)
-{
-	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
-
-	/*
-	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
-	 * delayed item is pending.  The plan is to keep CPU -> NODE
-	 * mapping valid and stable across CPU on/offlines.  Once that
-	 * happens, this workaround can be removed.
-	 */
-	if (unlikely(node == NUMA_NO_NODE))
-		return wq->dfl_pwq;
-
-	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
-}
-
 static unsigned int work_color_to_flags(int color)
 {
 	return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1657,16 +1627,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	rcu_read_lock();
 retry:
 	/* pwq which will be used unless @work is executing elsewhere */
-	if (wq->flags & WQ_UNBOUND) {
-		if (req_cpu == WORK_CPU_UNBOUND)
+	if (req_cpu == WORK_CPU_UNBOUND) {
+		if (wq->flags & WQ_UNBOUND)
 			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
-		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
-	} else {
-		if (req_cpu == WORK_CPU_UNBOUND)
+		else
 			cpu = raw_smp_processor_id();
-		pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
 	}
 
+	pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
 	pool = pwq->pool;
 
 	/*
@@ -1695,12 +1663,11 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	}
 
 	/*
-	 * pwq is determined and locked.  For unbound pools, we could have
-	 * raced with pwq release and it could already be dead.  If its
-	 * refcnt is zero, repeat pwq selection.  Note that pwqs never die
-	 * without another pwq replacing it in the numa_pwq_tbl or while
-	 * work items are executing on it, so the retrying is guaranteed to
-	 * make forward-progress.
+	 * pwq is determined and locked. For unbound pools, we could have raced
+	 * with pwq release and it could already be dead. If its refcnt is zero,
+	 * repeat pwq selection. Note that unbound pwqs never die without
+	 * another pwq replacing it in cpu_pwq or while work items are executing
+	 * on it, so the retrying is guaranteed to make forward-progress.
 	 */
 	if (unlikely(!pwq->refcnt)) {
 		if (wq->flags & WQ_UNBOUND) {
@@ -3799,12 +3766,8 @@ static void rcu_free_wq(struct rcu_head *rcu)
 		container_of(rcu, struct workqueue_struct, rcu);
 
 	wq_free_lockdep(wq);
-
-	if (!(wq->flags & WQ_UNBOUND))
-		free_percpu(wq->cpu_pwq);
-	else
-		free_workqueue_attrs(wq->unbound_attrs);
-
+	free_percpu(wq->cpu_pwq);
+	free_workqueue_attrs(wq->unbound_attrs);
 	kfree(wq);
 }
 
@@ -4157,11 +4120,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  *
  * The caller is responsible for ensuring that the cpumask of @node stays
  * stable.
- *
- * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
- * %false if equal.
  */
-static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+static void wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
 				 int cpu_going_down, cpumask_t *cpumask)
 {
 	if (!wq_numa_enabled || attrs->no_numa)
@@ -4178,23 +4138,18 @@ static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
 	/* yeap, return possible CPUs in @node that @attrs wants */
 	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
 
-	if (cpumask_empty(cpumask)) {
+	if (cpumask_empty(cpumask))
 		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
 				"possible intersect\n");
-		return false;
-	}
-
-	return !cpumask_equal(cpumask, attrs->cpumask);
+	return;
 
 use_dfl:
 	cpumask_copy(cpumask, attrs->cpumask);
-	return false;
 }
 
-/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
-static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
-						   int node,
-						   struct pool_workqueue *pwq)
+/* install @pwq into @wq's cpu_pwq and return the old pwq */
+static struct pool_workqueue *install_unbound_pwq(struct workqueue_struct *wq,
+					int cpu, struct pool_workqueue *pwq)
 {
 	struct pool_workqueue *old_pwq;
 
@@ -4204,8 +4159,8 @@ static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
 	/* link_pwq() can handle duplicate calls */
 	link_pwq(pwq);
 
-	old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
-	rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	old_pwq = rcu_access_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu));
+	rcu_assign_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu), pwq);
 	return old_pwq;
 }
 
@@ -4222,10 +4177,10 @@ struct apply_wqattrs_ctx {
 static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx)
 {
 	if (ctx) {
-		int node;
+		int cpu;
 
-		for_each_node(node)
-			put_pwq_unlocked(ctx->pwq_tbl[node]);
+		for_each_possible_cpu(cpu)
+			put_pwq_unlocked(ctx->pwq_tbl[cpu]);
 		put_pwq_unlocked(ctx->dfl_pwq);
 
 		free_workqueue_attrs(ctx->attrs);
@@ -4242,11 +4197,11 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 {
 	struct apply_wqattrs_ctx *ctx;
 	struct workqueue_attrs *new_attrs, *tmp_attrs;
-	int node;
+	int cpu;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
-	ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_node_ids), GFP_KERNEL);
+	ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_cpu_ids), GFP_KERNEL);
 
 	new_attrs = alloc_workqueue_attrs();
 	tmp_attrs = alloc_workqueue_attrs();
@@ -4280,14 +4235,16 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	if (!ctx->dfl_pwq)
 		goto out_free;
 
-	for_each_node(node) {
-		if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
-			ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
-			if (!ctx->pwq_tbl[node])
-				goto out_free;
-		} else {
+	for_each_possible_cpu(cpu) {
+		if (new_attrs->no_numa) {
 			ctx->dfl_pwq->refcnt++;
-			ctx->pwq_tbl[node] = ctx->dfl_pwq;
+			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
+		} else {
+			wq_calc_node_cpumask(new_attrs, cpu_to_node(cpu), -1,
+					     tmp_attrs->cpumask);
+			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, tmp_attrs);
+			if (!ctx->pwq_tbl[cpu])
+				goto out_free;
 		}
 	}
 
@@ -4310,7 +4267,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 /* set attrs and install prepared pwqs, @ctx points to old pwqs on return */
 static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
 {
-	int node;
+	int cpu;
 
 	/* all pwqs have been created successfully, let's install'em */
 	mutex_lock(&ctx->wq->mutex);
@@ -4318,9 +4275,9 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
 	copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
 
 	/* save the previous pwq and install the new one */
-	for_each_node(node)
-		ctx->pwq_tbl[node] = numa_pwq_tbl_install(ctx->wq, node,
-							  ctx->pwq_tbl[node]);
+	for_each_possible_cpu(cpu)
+		ctx->pwq_tbl[cpu] = install_unbound_pwq(ctx->wq, cpu,
+							ctx->pwq_tbl[cpu]);
 
 	/* @dfl_pwq might not have been used, ensure it's linked */
 	link_pwq(ctx->dfl_pwq);
@@ -4448,20 +4405,13 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	cpumask = target_attrs->cpumask;
 
 	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
-	pwq = unbound_pwq_by_node(wq, node);
 
-	/*
-	 * Let's determine what needs to be done.  If the target cpumask is
-	 * different from the default pwq's, we need to compare it to @pwq's
-	 * and create a new one if they don't match.  If the target cpumask
-	 * equals the default pwq's, the default pwq should be used.
-	 */
-	if (wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask)) {
-		if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
-			return;
-	} else {
-		goto use_dfl_pwq;
-	}
+	/* nothing to do if the target cpumask matches the current pwq */
+	wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask);
+	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
+					lockdep_is_held(&wq_pool_mutex));
+	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
+		return;
 
 	/* create a new pwq */
 	pwq = alloc_unbound_pwq(wq, target_attrs);
@@ -4473,7 +4423,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 
 	/* Install the new pwq. */
 	mutex_lock(&wq->mutex);
-	old_pwq = numa_pwq_tbl_install(wq, node, pwq);
+	old_pwq = install_unbound_pwq(wq, cpu, pwq);
 	goto out_unlock;
 
 use_dfl_pwq:
@@ -4481,7 +4431,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	raw_spin_lock_irq(&wq->dfl_pwq->pool->lock);
 	get_pwq(wq->dfl_pwq);
 	raw_spin_unlock_irq(&wq->dfl_pwq->pool->lock);
-	old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
+	old_pwq = install_unbound_pwq(wq, cpu, wq->dfl_pwq);
 out_unlock:
 	mutex_unlock(&wq->mutex);
 	put_pwq_unlocked(old_pwq);
@@ -4492,11 +4442,11 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 	bool highpri = wq->flags & WQ_HIGHPRI;
 	int cpu, ret;
 
-	if (!(wq->flags & WQ_UNBOUND)) {
-		wq->cpu_pwq = alloc_percpu(struct pool_workqueue *);
-		if (!wq->cpu_pwq)
-			goto enomem;
+	wq->cpu_pwq = alloc_percpu(struct pool_workqueue *);
+	if (!wq->cpu_pwq)
+		goto enomem;
 
+	if (!(wq->flags & WQ_UNBOUND)) {
 		for_each_possible_cpu(cpu) {
 			struct pool_workqueue **pwq_p =
 				per_cpu_ptr(wq->cpu_pwq, cpu);
@@ -4544,13 +4494,11 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 static int wq_clamp_max_active(int max_active, unsigned int flags,
 			       const char *name)
 {
-	int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
-
-	if (max_active < 1 || max_active > lim)
+	if (max_active < 1 || max_active > WQ_MAX_ACTIVE)
 		pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
-			max_active, name, 1, lim);
+			max_active, name, 1, WQ_MAX_ACTIVE);
 
-	return clamp_val(max_active, 1, lim);
+	return clamp_val(max_active, 1, WQ_MAX_ACTIVE);
 }
 
 /*
@@ -4594,7 +4542,6 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 					 unsigned int flags,
 					 int max_active, ...)
 {
-	size_t tbl_size = 0;
 	va_list args;
 	struct workqueue_struct *wq;
 	struct pool_workqueue *pwq;
@@ -4614,10 +4561,7 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 		flags |= WQ_UNBOUND;
 
 	/* allocate wq and format name */
-	if (flags & WQ_UNBOUND)
-		tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);
-
-	wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
+	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
 		return NULL;
 
@@ -4712,7 +4656,7 @@ static bool pwq_busy(struct pool_workqueue *pwq)
 void destroy_workqueue(struct workqueue_struct *wq)
 {
 	struct pool_workqueue *pwq;
-	int cpu, node;
+	int cpu;
 
 	/*
 	 * Remove it from sysfs first so that sanity check failure doesn't
@@ -4771,29 +4715,23 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del_rcu(&wq->list);
 	mutex_unlock(&wq_pool_mutex);
 
-	if (!(wq->flags & WQ_UNBOUND)) {
-		for_each_possible_cpu(cpu)
-			put_pwq_unlocked(*per_cpu_ptr(wq->cpu_pwq, cpu));
-	} else {
-		/*
-		 * We're the sole accessor of @wq at this point.  Directly
-		 * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
-		 * @wq will be freed when the last pwq is released.
-		 */
-		for_each_node(node) {
-			pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
-			RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
-			put_pwq_unlocked(pwq);
-		}
+	/*
+	 * We're the sole accessor of @wq. Directly access cpu_pwq and dfl_pwq
+	 * to put the base refs. @wq will be auto-destroyed from the last
+	 * pwq_put. RCU read lock prevents @wq from going away from under us.
+	 */
+	rcu_read_lock();
 
-		/*
-		 * Put dfl_pwq.  @wq may be freed any time after dfl_pwq is
-		 * put.  Don't access it afterwards.
-		 */
-		pwq = wq->dfl_pwq;
-		wq->dfl_pwq = NULL;
+	for_each_possible_cpu(cpu) {
+		pwq = rcu_access_pointer(*per_cpu_ptr(wq->cpu_pwq, cpu));
+		RCU_INIT_POINTER(*per_cpu_ptr(wq->cpu_pwq, cpu), NULL);
 		put_pwq_unlocked(pwq);
 	}
+
+	put_pwq_unlocked(wq->dfl_pwq);
+	wq->dfl_pwq = NULL;
+
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
 
@@ -4870,10 +4808,11 @@ bool current_is_workqueue_rescuer(void)
  * unreliable and only useful as advisory hints or for debugging.
  *
  * If @cpu is WORK_CPU_UNBOUND, the test is performed on the local CPU.
- * Note that both per-cpu and unbound workqueues may be associated with
- * multiple pool_workqueues which have separate congested states.  A
- * workqueue being congested on one CPU doesn't mean the workqueue is also
- * contested on other CPUs / NUMA nodes.
+ *
+ * With the exception of ordered workqueues, all workqueues have per-cpu
+ * pool_workqueues, each with its own congested state. A workqueue being
+ * congested on one CPU doesn't mean that the workqueue is contested on any
+ * other CPUs.
  *
  * Return:
  * %true if congested, %false otherwise.
@@ -4889,12 +4828,9 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
 	if (cpu == WORK_CPU_UNBOUND)
 		cpu = smp_processor_id();
 
-	if (!(wq->flags & WQ_UNBOUND))
-		pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
-	else
-		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
-
+	pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);
 	ret = !list_empty(&pwq->inactive_works);
+
 	preempt_enable();
 	rcu_read_unlock();
 
@@ -6399,7 +6335,7 @@ void __init workqueue_init_early(void)
 	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
-					    WQ_UNBOUND_MAX_ACTIVE);
+					    WQ_MAX_ACTIVE);
 	system_freezable_wq = alloc_workqueue("events_freezable",
 					      WQ_FREEZABLE, 0);
 	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
-- 
2.40.1



* [PATCH 10/24] workqueue: Rename workqueue_attrs->no_numa to ->ordered
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (8 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 11/24] workqueue: Rename NUMA related names to use pod instead Tejun Heo
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.

Just a rename. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |  6 +++---
 kernel/workqueue.c        | 19 +++++++++----------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d1b681f67985..8cc9b86d3256 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -141,13 +141,13 @@ struct workqueue_attrs {
 	cpumask_var_t cpumask;
 
 	/**
-	 * @no_numa: disable NUMA affinity
+	 * @ordered: work items must be executed one by one in queueing order
 	 *
-	 * Unlike other fields, ``no_numa`` isn't a property of a worker_pool. It
+	 * Unlike other fields, ``ordered`` isn't a property of a worker_pool. It
 	 * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
 	 * doesn't participate in pool hash calculations or equality comparisons.
 	 */
-	bool no_numa;
+	bool ordered;
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 43f3bb801bd9..6a5d227949d9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3653,10 +3653,10 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 	cpumask_copy(to->cpumask, from->cpumask);
 	/*
 	 * Unlike hash and equality test, this function doesn't ignore
-	 * ->no_numa as it is used for both pool and wq attrs.  Instead,
-	 * get_unbound_pool() explicitly clears ->no_numa after copying.
+	 * ->ordered as it is used for both pool and wq attrs.  Instead,
+	 * get_unbound_pool() explicitly clears ->ordered after copying.
 	 */
-	to->no_numa = from->no_numa;
+	to->ordered = from->ordered;
 }
 
 /* hash value of the content of @attr */
@@ -3916,10 +3916,10 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	pool->node = target_node;
 
 	/*
-	 * no_numa isn't a worker_pool attribute, always clear it.  See
+	 * ordered isn't a worker_pool attribute, always clear it.  See
 	 * 'struct workqueue_attrs' comments for detail.
 	 */
-	pool->attrs->no_numa = false;
+	pool->attrs->ordered = false;
 
 	if (worker_pool_assign_id(pool) < 0)
 		goto fail;
@@ -4124,7 +4124,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 static void wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
 				 int cpu_going_down, cpumask_t *cpumask)
 {
-	if (!wq_numa_enabled || attrs->no_numa)
+	if (!wq_numa_enabled || attrs->ordered)
 		goto use_dfl;
 
 	/* does @node have any online CPUs @attrs wants? */
@@ -4236,7 +4236,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 		goto out_free;
 
 	for_each_possible_cpu(cpu) {
-		if (new_attrs->no_numa) {
+		if (new_attrs->ordered) {
 			ctx->dfl_pwq->refcnt++;
 			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
 		} else {
@@ -4393,7 +4393,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	lockdep_assert_held(&wq_pool_mutex);
 
 	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
-	    wq->unbound_attrs->no_numa)
+	    wq->unbound_attrs->ordered)
 		return;
 
 	/*
@@ -6323,11 +6323,10 @@ void __init workqueue_init_early(void)
 		/*
 		 * An ordered wq should have only one pwq as ordering is
 		 * guaranteed by max_active which is enforced by pwqs.
-		 * Turn off NUMA so that dfl_pwq is used for all nodes.
 		 */
 		BUG_ON(!(attrs = alloc_workqueue_attrs()));
 		attrs->nice = std_nice[i];
-		attrs->no_numa = true;
+		attrs->ordered = true;
 		ordered_wq_attrs[i] = attrs;
 	}
 
-- 
2.40.1



* [PATCH 11/24] workqueue: Rename NUMA related names to use pod instead
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (9 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 10/24] workqueue: Rename workqueue_attrs->no_numa to ->ordered Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 12/24] workqueue: Move wq_pod_init() below workqueue_init() Tejun Heo
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.

While "pod" isn't a very common term, it's short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.

* wq_numa_possible_cpumask -> wq_pod_cpus

* wq_numa_enabled -> wq_pod_enabled

* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf

* workqueue_select_cpu_near -> select_numa_node_cpu

  This rename is different from others. The function is only used by
  queue_work_node() and specifically tries to find a CPU in the specified
  NUMA node. As workqueue affinity will become more flexible and untied from
  NUMA, this function's name should specifically describe that it's for
  NUMA.

* wq_calc_node_cpumask -> wq_calc_pod_cpumask

* wq_update_unbound_numa -> wq_update_pod

* wq_numa_init -> wq_pod_init

* node -> pod in local variables

Only renames. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 162 +++++++++++++++++++++------------------------
 1 file changed, 76 insertions(+), 86 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6a5d227949d9..08ab40371697 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -325,8 +325,7 @@ struct workqueue_struct {
 
 static struct kmem_cache *pwq_cache;
 
-static cpumask_var_t *wq_numa_possible_cpumask;
-					/* possible CPUs of each node */
+static cpumask_var_t *wq_pod_cpus;	/* possible CPUs of each node */
 
 /*
  * Per-cpu work items which run for longer than the following threshold are
@@ -342,10 +341,10 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);
 
 static bool wq_online;			/* can kworkers be created yet? */
 
-static bool wq_numa_enabled;		/* unbound NUMA affinity enabled */
+static bool wq_pod_enabled;		/* unbound CPU pod affinity enabled */
 
-/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
-static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_pod_attrs_buf;
 
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_MUTEX(wq_pool_attach_mutex); /* protects worker attach/detach */
@@ -1742,7 +1741,7 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 EXPORT_SYMBOL(queue_work_on);
 
 /**
- * workqueue_select_cpu_near - Select a CPU based on NUMA node
+ * select_numa_node_cpu - Select a CPU based on NUMA node
  * @node: NUMA node ID that we want to select a CPU from
  *
  * This function will attempt to find a "random" cpu available on a given
@@ -1750,12 +1749,12 @@ EXPORT_SYMBOL(queue_work_on);
  * WORK_CPU_UNBOUND indicating that we should just schedule to any
  * available CPU if we need to schedule this work.
  */
-static int workqueue_select_cpu_near(int node)
+static int select_numa_node_cpu(int node)
 {
 	int cpu;
 
 	/* No point in doing this if NUMA isn't enabled for workqueues */
-	if (!wq_numa_enabled)
+	if (!wq_pod_enabled)
 		return WORK_CPU_UNBOUND;
 
 	/* Delay binding to CPU if node is not valid or online */
@@ -1814,7 +1813,7 @@ bool queue_work_node(int node, struct workqueue_struct *wq,
 	local_irq_save(flags);
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
-		int cpu = workqueue_select_cpu_near(node);
+		int cpu = select_numa_node_cpu(node);
 
 		__queue_work(cpu, wq, work);
 		ret = true;
@@ -3883,8 +3882,8 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
-	int node;
-	int target_node = NUMA_NO_NODE;
+	int pod;
+	int target_pod = NUMA_NO_NODE;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3896,24 +3895,23 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 		}
 	}
 
-	/* if cpumask is contained inside a NUMA node, we belong to that node */
-	if (wq_numa_enabled) {
-		for_each_node(node) {
-			if (cpumask_subset(attrs->cpumask,
-					   wq_numa_possible_cpumask[node])) {
-				target_node = node;
+	/* if cpumask is contained inside a pod, we belong to that pod */
+	if (wq_pod_enabled) {
+		for_each_node(pod) {
+			if (cpumask_subset(attrs->cpumask, wq_pod_cpus[pod])) {
+				target_pod = pod;
 				break;
 			}
 		}
 	}
 
 	/* nope, create a new one */
-	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_node);
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_pod);
 	if (!pool || init_worker_pool(pool) < 0)
 		goto fail;
 
 	copy_workqueue_attrs(pool->attrs, attrs);
-	pool->node = target_node;
+	pool->node = target_pod;
 
 	/*
 	 * ordered isn't a worker_pool attribute, always clear it.  See
@@ -4103,40 +4101,38 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 }
 
 /**
- * wq_calc_node_cpumask - calculate a wq_attrs' cpumask for the specified node
+ * wq_calc_pod_cpumask - calculate a wq_attrs' cpumask for a pod
  * @attrs: the wq_attrs of the default pwq of the target workqueue
- * @node: the target NUMA node
+ * @pod: the target CPU pod
  * @cpu_going_down: if >= 0, the CPU to consider as offline
  * @cpumask: outarg, the resulting cpumask
  *
- * Calculate the cpumask a workqueue with @attrs should use on @node.  If
- * @cpu_going_down is >= 0, that cpu is considered offline during
- * calculation.  The result is stored in @cpumask.
+ * Calculate the cpumask a workqueue with @attrs should use on @pod. If
+ * @cpu_going_down is >= 0, that cpu is considered offline during calculation.
+ * The result is stored in @cpumask.
  *
- * If NUMA affinity is not enabled, @attrs->cpumask is always used.  If
- * enabled and @node has online CPUs requested by @attrs, the returned
- * cpumask is the intersection of the possible CPUs of @node and
- * @attrs->cpumask.
+ * If pod affinity is not enabled, @attrs->cpumask is always used. If enabled
+ * and @pod has online CPUs requested by @attrs, the returned cpumask is the
+ * intersection of the possible CPUs of @pod and @attrs->cpumask.
  *
- * The caller is responsible for ensuring that the cpumask of @node stays
- * stable.
+ * The caller is responsible for ensuring that the cpumask of @pod stays stable.
  */
-static void wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+static void wq_calc_pod_cpumask(const struct workqueue_attrs *attrs, int pod,
 				 int cpu_going_down, cpumask_t *cpumask)
 {
-	if (!wq_numa_enabled || attrs->ordered)
+	if (!wq_pod_enabled || attrs->ordered)
 		goto use_dfl;
 
-	/* does @node have any online CPUs @attrs wants? */
-	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
+	/* does @pod have any online CPUs @attrs wants? */
+	cpumask_and(cpumask, cpumask_of_node(pod), attrs->cpumask);
 	if (cpu_going_down >= 0)
 		cpumask_clear_cpu(cpu_going_down, cpumask);
 
 	if (cpumask_empty(cpumask))
 		goto use_dfl;
 
-	/* yeap, return possible CPUs in @node that @attrs wants */
-	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+	/* yeap, return possible CPUs in @pod that @attrs wants */
+	cpumask_and(cpumask, attrs->cpumask, wq_pod_cpus[pod]);
 
 	if (cpumask_empty(cpumask))
 		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
@@ -4240,8 +4236,8 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 			ctx->dfl_pwq->refcnt++;
 			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
 		} else {
-			wq_calc_node_cpumask(new_attrs, cpu_to_node(cpu), -1,
-					     tmp_attrs->cpumask);
+			wq_calc_pod_cpumask(new_attrs, cpu_to_node(cpu), -1,
+					    tmp_attrs->cpumask);
 			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, tmp_attrs);
 			if (!ctx->pwq_tbl[cpu])
 				goto out_free;
@@ -4332,12 +4328,11 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
- * machines, this function maps a separate pwq to each NUMA node with
- * possibles CPUs in @attrs->cpumask so that work items are affine to the
- * NUMA node it was issued on.  Older pwqs are released as in-flight work
- * items finish.  Note that a work item which repeatedly requeues itself
- * back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq. Unless disabled, this function maps
+ * a separate pwq to each CPU pod with possible CPUs in @attrs->cpumask so that
+ * work items are affine to the pod it was issued on. Older pwqs are released as
+ * in-flight work items finish. Note that a work item which repeatedly requeues
+ * itself back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.
  *
@@ -4360,31 +4355,28 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 }
 
 /**
- * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
+ * wq_update_pod - update pod affinity of a wq for CPU hot[un]plug
  * @wq: the target workqueue
  * @cpu: the CPU coming up or going down
  * @online: whether @cpu is coming up or going down
  *
  * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
- * %CPU_DOWN_FAILED.  @cpu is being hot[un]plugged, update NUMA affinity of
- * @wq accordingly.
- *
- * If NUMA affinity can't be adjusted due to memory allocation failure, it
- * falls back to @wq->dfl_pwq which may not be optimal but is always
- * correct.
- *
- * Note that when the last allowed CPU of a NUMA node goes offline for a
- * workqueue with a cpumask spanning multiple nodes, the workers which were
- * already executing the work items for the workqueue will lose their CPU
- * affinity and may execute on any CPU.  This is similar to how per-cpu
- * workqueues behave on CPU_DOWN.  If a workqueue user wants strict
- * affinity, it's the user's responsibility to flush the work item from
- * CPU_DOWN_PREPARE.
+ * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update pod affinity of @wq
+ * accordingly.
+ *
+ * If pod affinity can't be adjusted due to memory allocation failure, it falls
+ * back to @wq->dfl_pwq which may not be optimal but is always correct.
+ *
+ * Note that when the last allowed CPU of a pod goes offline for a workqueue
+ * with a cpumask spanning multiple pods, the workers which were already
+ * executing the work items for the workqueue will lose their CPU affinity and
+ * may execute on any CPU. This is similar to how per-cpu workqueues behave on
+ * CPU_DOWN. If a workqueue user wants strict affinity, it's the user's
+ * responsibility to flush the work item from CPU_DOWN_PREPARE.
  */
-static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
-				   bool online)
+static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 {
-	int node = cpu_to_node(cpu);
+	int pod = cpu_to_node(cpu);
 	int cpu_off = online ? -1 : cpu;
 	struct pool_workqueue *old_pwq = NULL, *pwq;
 	struct workqueue_attrs *target_attrs;
@@ -4392,7 +4384,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 
 	lockdep_assert_held(&wq_pool_mutex);
 
-	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
+	if (!wq_pod_enabled || !(wq->flags & WQ_UNBOUND) ||
 	    wq->unbound_attrs->ordered)
 		return;
 
@@ -4401,13 +4393,13 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	 * Let's use a preallocated one.  The following buf is protected by
 	 * CPU hotplug exclusion.
 	 */
-	target_attrs = wq_update_unbound_numa_attrs_buf;
+	target_attrs = wq_update_pod_attrs_buf;
 	cpumask = target_attrs->cpumask;
 
 	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
 
 	/* nothing to do if the target cpumask matches the current pwq */
-	wq_calc_node_cpumask(wq->dfl_pwq->pool->attrs, node, cpu_off, cpumask);
+	wq_calc_pod_cpumask(wq->dfl_pwq->pool->attrs, pod, cpu_off, cpumask);
 	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
 					lockdep_is_held(&wq_pool_mutex));
 	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
@@ -4416,7 +4408,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 	/* create a new pwq */
 	pwq = alloc_unbound_pwq(wq, target_attrs);
 	if (!pwq) {
-		pr_warn("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
+		pr_warn("workqueue: allocation failed while updating CPU pod affinity of \"%s\"\n",
 			wq->name);
 		goto use_dfl_pwq;
 	}
@@ -4547,11 +4539,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 	struct pool_workqueue *pwq;
 
 	/*
-	 * Unbound && max_active == 1 used to imply ordered, which is no
-	 * longer the case on NUMA machines due to per-node pools.  While
+	 * Unbound && max_active == 1 used to imply ordered, which is no longer
+	 * the case on many machines due to per-pod pools. While
 	 * alloc_ordered_workqueue() is the right way to create an ordered
-	 * workqueue, keep the previous behavior to avoid subtle breakages
-	 * on NUMA.
+	 * workqueue, keep the previous behavior to avoid subtle breakages.
 	 */
 	if ((flags & WQ_UNBOUND) && max_active == 1)
 		flags |= __WQ_ORDERED;
@@ -5432,9 +5423,9 @@ int workqueue_online_cpu(unsigned int cpu)
 		mutex_unlock(&wq_pool_attach_mutex);
 	}
 
-	/* update NUMA affinity of unbound workqueues */
+	/* update pod affinity of unbound workqueues */
 	list_for_each_entry(wq, &workqueues, list)
-		wq_update_unbound_numa(wq, cpu, true);
+		wq_update_pod(wq, cpu, true);
 
 	mutex_unlock(&wq_pool_mutex);
 	return 0;
@@ -5450,10 +5441,10 @@ int workqueue_offline_cpu(unsigned int cpu)
 
 	unbind_workers(cpu);
 
-	/* update NUMA affinity of unbound workqueues */
+	/* update pod affinity of unbound workqueues */
 	mutex_lock(&wq_pool_mutex);
 	list_for_each_entry(wq, &workqueues, list)
-		wq_update_unbound_numa(wq, cpu, false);
+		wq_update_pod(wq, cpu, false);
 	mutex_unlock(&wq_pool_mutex);
 
 	return 0;
@@ -6231,7 +6222,7 @@ static inline void wq_watchdog_init(void) { }
 
 #endif	/* CONFIG_WQ_WATCHDOG */
 
-static void __init wq_numa_init(void)
+static void __init wq_pod_init(void)
 {
 	cpumask_var_t *tbl;
 	int node, cpu;
@@ -6246,8 +6237,8 @@ static void __init wq_numa_init(void)
 		}
 	}
 
-	wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs();
-	BUG_ON(!wq_update_unbound_numa_attrs_buf);
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
 
 	/*
 	 * We want masks of possible CPUs of each node which isn't readily
@@ -6266,8 +6257,8 @@ static void __init wq_numa_init(void)
 		cpumask_set_cpu(cpu, tbl[node]);
 	}
 
-	wq_numa_possible_cpumask = tbl;
-	wq_numa_enabled = true;
+	wq_pod_cpus = tbl;
+	wq_pod_enabled = true;
 }
 
 /**
@@ -6367,15 +6358,14 @@ void __init workqueue_init(void)
 	BUG_ON(IS_ERR(pwq_release_worker));
 
 	/*
-	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
-	 * CPU to node mapping may not be available that early on some
-	 * archs such as power and arm64.  As per-cpu pools created
-	 * previously could be missing node hint and unbound pools NUMA
-	 * affinity, fix them up.
+	 * It'd be simpler to initialize pods in workqueue_init_early() but CPU
+	 * to node mapping may not be available that early on some archs such as
+	 * power and arm64. As per-cpu pools created previously could be missing
+	 * node hint and unbound pool pod affinity, fix them up.
 	 *
 	 * Also, while iterating workqueues, create rescuers if requested.
 	 */
-	wq_numa_init();
+	wq_pod_init();
 
 	mutex_lock(&wq_pool_mutex);
 
@@ -6386,7 +6376,7 @@ void __init workqueue_init(void)
 	}
 
 	list_for_each_entry(wq, &workqueues, list) {
-		wq_update_unbound_numa(wq, smp_processor_id(), true);
+		wq_update_pod(wq, smp_processor_id(), true);
 		WARN(init_rescuer(wq),
 		     "workqueue: failed to create early rescuer for %s",
 		     wq->name);
-- 
2.40.1



* [PATCH 12/24] workqueue: Move wq_pod_init() below workqueue_init()
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (10 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 11/24] workqueue: Rename NUMA related names to use pod instead Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 13/24] workqueue: Initialize unbound CPU pods later in the boot Tejun Heo
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

wq_pod_init() is called from workqueue_init() and is responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().

The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.

Just a relocation. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 78 ++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 38 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 08ab40371697..914a69f83d59 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6222,44 +6222,7 @@ static inline void wq_watchdog_init(void) { }
 
 #endif	/* CONFIG_WQ_WATCHDOG */
 
-static void __init wq_pod_init(void)
-{
-	cpumask_var_t *tbl;
-	int node, cpu;
-
-	if (num_possible_nodes() <= 1)
-		return;
-
-	for_each_possible_cpu(cpu) {
-		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
-			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
-			return;
-		}
-	}
-
-	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
-	BUG_ON(!wq_update_pod_attrs_buf);
-
-	/*
-	 * We want masks of possible CPUs of each node which isn't readily
-	 * available.  Build one from cpu_to_node() which should have been
-	 * fully initialized by now.
-	 */
-	tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);
-	BUG_ON(!tbl);
-
-	for_each_node(node)
-		BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
-				node_online(node) ? node : NUMA_NO_NODE));
-
-	for_each_possible_cpu(cpu) {
-		node = cpu_to_node(cpu);
-		cpumask_set_cpu(cpu, tbl[node]);
-	}
-
-	wq_pod_cpus = tbl;
-	wq_pod_enabled = true;
-}
+static void wq_pod_init(void);
 
 /**
  * workqueue_init_early - early init for workqueue subsystem
@@ -6399,6 +6362,45 @@ void __init workqueue_init(void)
 	wq_watchdog_init();
 }
 
+static void __init wq_pod_init(void)
+{
+	cpumask_var_t *tbl;
+	int node, cpu;
+
+	if (num_possible_nodes() <= 1)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
+			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+			return;
+		}
+	}
+
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
+
+	/*
+	 * We want masks of possible CPUs of each node which isn't readily
+	 * available.  Build one from cpu_to_node() which should have been
+	 * fully initialized by now.
+	 */
+	tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);
+	BUG_ON(!tbl);
+
+	for_each_node(node)
+		BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
+				node_online(node) ? node : NUMA_NO_NODE));
+
+	for_each_possible_cpu(cpu) {
+		node = cpu_to_node(cpu);
+		cpumask_set_cpu(cpu, tbl[node]);
+	}
+
+	wq_pod_cpus = tbl;
+	wq_pod_enabled = true;
+}
+
 /*
  * Despite the naming, this is a no-op function which is here only for avoiding
  * link error. Since compile-time warning may fail to catch, we will need to
-- 
2.40.1



* [PATCH 13/24] workqueue: Initialize unbound CPU pods later in the boot
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (11 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 12/24] workqueue: Move wq_pod_init() below workqueue_init() Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-19  0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.

Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().

As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
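
The ordering described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: the model_* names, the
four-CPU layout and the "pool id = 1 + pod" numbering are all hypothetical
and only mimic the sequence of events, not the kernel data structures.

```c
#include <assert.h>

#define MODEL_NR_CPUS	4
#define MODEL_DFL_POOL	0	/* stand-in for the default worker pool */

/* Hypothetical model: the pool each CPU of one workqueue currently uses. */
struct model_wq {
	int cpu_pool[MODEL_NR_CPUS];
};

/* CPU -> pod mapping; -1 means "topology not known yet". */
static int model_cpu_pod[MODEL_NR_CPUS] = { -1, -1, -1, -1 };

/* Workqueues created before topology init share the default pool. */
static void model_alloc_wq(struct model_wq *wq)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		wq->cpu_pool[cpu] = MODEL_DFL_POOL;
}

/* Stand-in for wq_update_pod(): point @cpu at the pool of its pod. */
static void model_update_pod(struct model_wq *wq, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_cpu_pod[cpu] >= 0)
		wq->cpu_pool[cpu] = 1 + model_cpu_pod[cpu];
}

/* Third stage: topology becomes known, then existing wqs are revisited. */
static void model_init_topology(struct model_wq **wqs, int nr_wqs)
{
	/* assumed topology: CPUs 0-1 form pod 0, CPUs 2-3 form pod 1 */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_cpu_pod[cpu] = cpu / 2;

	for (int i = 0; i < nr_wqs; i++)
		for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			model_update_pod(wqs[i], cpu);
}
```

A workqueue allocated with model_alloc_wq() maps every CPU to the default
pool until model_init_topology() runs, after which each CPU points at its
pod's pool.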

Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.

This patch changes the initialization sequence but the end result should be
the same.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |  1 +
 init/main.c               |  1 +
 kernel/workqueue.c        | 68 +++++++++++++++++++++++----------------
 3 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 8cc9b86d3256..b8961c8ea5b3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -709,5 +709,6 @@ int workqueue_offline_cpu(unsigned int cpu);
 
 void __init workqueue_init_early(void);
 void __init workqueue_init(void);
+void __init workqueue_init_topology(void);
 
 #endif
diff --git a/init/main.c b/init/main.c
index af50044deed5..6bd5fffce2e6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1565,6 +1565,7 @@ static noinline void __init kernel_init_freeable(void)
 	smp_init();
 	sched_init_smp();
 
+	workqueue_init_topology();
 	padata_init();
 	page_alloc_init_late();
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 914a69f83d59..add6f5fc799b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6222,17 +6222,15 @@ static inline void wq_watchdog_init(void) { }
 
 #endif	/* CONFIG_WQ_WATCHDOG */
 
-static void wq_pod_init(void);
-
 /**
  * workqueue_init_early - early init for workqueue subsystem
  *
- * This is the first half of two-staged workqueue subsystem initialization
- * and invoked as soon as the bare basics - memory allocation, cpumasks and
- * idr are up.  It sets up all the data structures and system workqueues
- * and allows early boot code to create workqueues and queue/cancel work
- * items.  Actual work item execution starts only after kthreads can be
- * created and scheduled right before early initcalls.
+ * This is the first step of three-staged workqueue subsystem initialization and
+ * invoked as soon as the bare basics - memory allocation, cpumasks and idr are
+ * up. It sets up all the data structures and system workqueues and allows early
+ * boot code to create workqueues and queue/cancel work items. Actual work item
+ * execution starts only after kthreads can be created and scheduled right
+ * before early initcalls.
  */
 void __init workqueue_init_early(void)
 {
@@ -6247,6 +6245,9 @@ void __init workqueue_init_early(void)
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
+
 	/* initialize CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
@@ -6305,11 +6306,11 @@ void __init workqueue_init_early(void)
 /**
  * workqueue_init - bring workqueue subsystem fully online
  *
- * This is the latter half of two-staged workqueue subsystem initialization
- * and invoked as soon as kthreads can be created and scheduled.
- * Workqueues have been created and work items queued on them, but there
- * are no kworkers executing the work items yet.  Populate the worker pools
- * with the initial workers and enable future kworker creations.
+ * This is the second step of three-staged workqueue subsystem initialization
+ * and invoked as soon as kthreads can be created and scheduled. Workqueues have
+ * been created and work items queued on them, but there are no kworkers
+ * executing the work items yet. Populate the worker pools with the initial
+ * workers and enable future kworker creations.
  */
 void __init workqueue_init(void)
 {
@@ -6320,18 +6321,12 @@ void __init workqueue_init(void)
 	pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");
 	BUG_ON(IS_ERR(pwq_release_worker));
 
-	/*
-	 * It'd be simpler to initialize pods in workqueue_init_early() but CPU
-	 * to node mapping may not be available that early on some archs such as
-	 * power and arm64. As per-cpu pools created previously could be missing
-	 * node hint and unbound pool pod affinity, fix them up.
-	 *
-	 * Also, while iterating workqueues, create rescuers if requested.
-	 */
-	wq_pod_init();
-
 	mutex_lock(&wq_pool_mutex);
 
+	/*
+	 * Per-cpu pools created earlier could be missing node hint. Fix them
+	 * up. Also, create a rescuer for workqueues that requested it.
+	 */
 	for_each_possible_cpu(cpu) {
 		for_each_cpu_worker_pool(pool, cpu) {
 			pool->node = cpu_to_node(cpu);
@@ -6339,7 +6334,6 @@ void __init workqueue_init(void)
 	}
 
 	list_for_each_entry(wq, &workqueues, list) {
-		wq_update_pod(wq, smp_processor_id(), true);
 		WARN(init_rescuer(wq),
 		     "workqueue: failed to create early rescuer for %s",
 		     wq->name);
@@ -6362,8 +6356,16 @@ void __init workqueue_init(void)
 	wq_watchdog_init();
 }
 
-static void __init wq_pod_init(void)
+/**
+ * workqueue_init_topology - initialize CPU pods for unbound workqueues
+ *
+ * This is the third step of three-staged workqueue subsystem initialization and
+ * invoked after SMP and topology information are fully initialized. It
+ * initializes the unbound CPU pods accordingly.
+ */
+void __init workqueue_init_topology(void)
 {
+	struct workqueue_struct *wq;
 	cpumask_var_t *tbl;
 	int node, cpu;
 
@@ -6377,8 +6379,7 @@ static void __init wq_pod_init(void)
 		}
 	}
 
-	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
-	BUG_ON(!wq_update_pod_attrs_buf);
+	mutex_lock(&wq_pool_mutex);
 
 	/*
 	 * We want masks of possible CPUs of each node which isn't readily
@@ -6399,6 +6400,19 @@ static void __init wq_pod_init(void)
 
 	wq_pod_cpus = tbl;
 	wq_pod_enabled = true;
+
+	/*
+	 * Workqueues allocated earlier would have all CPUs sharing the default
+	 * worker pool. Explicitly call wq_update_pod() on all workqueue and CPU
+	 * combinations to apply per-pod sharing.
+	 */
+	list_for_each_entry(wq, &workqueues, list) {
+		for_each_online_cpu(cpu) {
+			wq_update_pod(wq, cpu, true);
+		}
+	}
+
+	mutex_unlock(&wq_pool_mutex);
 }
 
 /*
-- 
2.40.1



* [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (12 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 13/24] workqueue: Initialize unbound CPU pods later in the boot Tejun Heo
@ 2023-05-19  0:16 ` Tejun Heo
  2023-05-30  8:06   ` K Prateek Nayak
  2023-05-30 21:18   ` Sandeep Dhavale
  2023-05-19  0:17 ` [PATCH 15/24] workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration Tejun Heo
                   ` (13 subsequent siblings)
  27 siblings, 2 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:16 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:

* workqueue_attrs->affn_scope is added. Each enum value represents the type
  of boundaries that define the pods. There are currently two scopes -
  WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
  - one pod per NUMA node. The latter defines one global pod across the
  whole system.

* struct wq_pod_type is added which describes how pods are configured for
  each affinity scope. For each pod, it lists the member CPUs and the
  preferred NUMA node for memory allocations. The reverse mapping from CPU
  to pod is also available.

* wq_pod_enabled is dropped. Pod is now always enabled. The previously
  disabled behavior is now implemented through WQ_AFFN_SYSTEM.

* get_unbound_pool() wants to determine the NUMA node to allocate memory
  from for the new pool. The variables are renamed from node to pod but the
  logic still assumes they're one and the same. Clearly distinguish them -
  walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
  NUMA node.

* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
  node. Take @cpu instead and determine the cpumask to use from the pod_type
  matching @attrs.

* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
  NULL so that it can indicate -EINVAL on invalid affinity scopes.
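
For the last item, the difference between returning NULL and returning an
error-encoded pointer can be sketched in userspace as below. The model_*
helpers are local re-implementations written for this example, not the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and model_prepare() is a hypothetical
stand-in for apply_wqattrs_prepare(). The -ENOMEM path is an assumption; only
-EINVAL is named above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO	4095

/* Local stand-ins that encode a negative errno in a pointer value. */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int model_is_err(const void *p)
{
	/* the top MODEL_MAX_ERRNO addresses are reserved for error codes */
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

enum model_scope { MODEL_SCOPE_NUMA, MODEL_SCOPE_SYSTEM, MODEL_SCOPE_NR };

struct model_ctx {
	enum model_scope scope;
};

/* Hypothetical prepare step: the caller can now tell *why* it failed. */
static struct model_ctx *model_prepare(int scope)
{
	struct model_ctx *ctx;

	if (scope < 0 || scope >= MODEL_SCOPE_NR)
		return model_err_ptr(-EINVAL);

	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return model_err_ptr(-ENOMEM);	/* assumed failure mode */

	ctx->scope = (enum model_scope)scope;
	return ctx;
}
```

A caller checks the result with model_is_err() and extracts the code with
model_ptr_err() instead of testing for NULL, which is what lets an invalid
scope be reported separately from an allocation failure.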

This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.
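
As a rough illustration of the per-scope grouping, the pod tables can be
modeled in userspace like this. It is a simplified sketch: the model_* names
are hypothetical, arrays are fixed-size instead of allocated, cpumasks are
64-bit words, and using -1 as "no preferred node" for the system-wide pod is
an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS	8

enum model_scope { MODEL_AFFN_NUMA, MODEL_AFFN_SYSTEM, MODEL_AFFN_NR_TYPES };

/* Simplified stand-in for struct wq_pod_type. */
struct model_pod_type {
	int		nr_pods;			/* number of pods */
	uint64_t	pod_cpus[MODEL_NR_CPUS];	/* pod -> cpus */
	int		pod_node[MODEL_NR_CPUS];	/* pod -> node */
	int		cpu_pod[MODEL_NR_CPUS];		/* cpu -> pod */
};

static struct model_pod_type model_pod_types[MODEL_AFFN_NR_TYPES];

/* SYSTEM scope: a single pod holding every CPU. */
static void model_init_system(void)
{
	struct model_pod_type *pt = &model_pod_types[MODEL_AFFN_SYSTEM];

	pt->nr_pods = 1;
	pt->pod_cpus[0] = ((uint64_t)1 << MODEL_NR_CPUS) - 1;
	pt->pod_node[0] = -1;	/* assumption: no preferred node */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		pt->cpu_pod[cpu] = 0;
}

/* NUMA scope: one pod per node, built once from a cpu -> node table. */
static void model_init_numa(const int *cpu_node, int nr_nodes)
{
	struct model_pod_type *pt = &model_pod_types[MODEL_AFFN_NUMA];

	assert(nr_nodes > 0 && nr_nodes <= MODEL_NR_CPUS);
	pt->nr_pods = nr_nodes;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		int pod = cpu_node[cpu];

		assert(pod >= 0 && pod < nr_nodes);
		pt->pod_cpus[pod] |= (uint64_t)1 << cpu;
		pt->pod_node[pod] = pod;
		pt->cpu_pod[cpu] = pod;
	}
}

/* Mirrors the fallback in wqattrs_pod_type(): SYSTEM until others exist. */
static const struct model_pod_type *model_pod_type(enum model_scope scope)
{
	const struct model_pod_type *pt = &model_pod_types[scope];

	if (pt->nr_pods)
		return pt;
	pt = &model_pod_types[MODEL_AFFN_SYSTEM];
	assert(pt->nr_pods);
	return pt;
}
```

With only model_init_system() called, a NUMA lookup falls back to the single
system-wide pod; once model_init_numa() has run, the same lookup returns the
per-node grouping.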

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |  31 +++++++-
 kernel/workqueue.c        | 154 ++++++++++++++++++++++++--------------
 2 files changed, 125 insertions(+), 60 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b8961c8ea5b3..a2f826b6ec9a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -124,6 +124,15 @@ struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+enum wq_affn_scope {
+	WQ_AFFN_NUMA,			/* one pod per NUMA node */
+	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
+
+	WQ_AFFN_NR_TYPES,
+
+	WQ_AFFN_DFL = WQ_AFFN_NUMA,
+};
+
 /**
  * struct workqueue_attrs - A struct for workqueue attributes.
  *
@@ -140,12 +149,26 @@ struct workqueue_attrs {
 	 */
 	cpumask_var_t cpumask;
 
+	/*
+	 * Below fields aren't properties of a worker_pool. They only modify how
+	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
+	 * participate in pool hash calculations or equality comparisons.
+	 */
+
 	/**
-	 * @ordered: work items must be executed one by one in queueing order
+	 * @affn_scope: unbound CPU affinity scope
 	 *
-	 * Unlike other fields, ``ordered`` isn't a property of a worker_pool. It
-	 * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
-	 * doesn't participate in pool hash calculations or equality comparisons.
+	 * CPU pods are used to improve execution locality of unbound work
+	 * items. There are multiple pod types, one for each wq_affn_scope, and
+	 * every CPU in the system belongs to one pod in every pod type. CPUs
+	 * that belong to the same pod share the worker pool. For example,
+	 * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
+	 * pool for each NUMA node.
+	 */
+	enum wq_affn_scope affn_scope;
+
+	/**
+	 * @ordered: work items must be executed one by one in queueing order
 	 */
 	bool ordered;
 };
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index add6f5fc799b..dae1787833cb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -325,7 +325,18 @@ struct workqueue_struct {
 
 static struct kmem_cache *pwq_cache;
 
-static cpumask_var_t *wq_pod_cpus;	/* possible CPUs of each node */
+/*
+ * Each pod type describes how CPUs should be grouped for unbound workqueues.
+ * See the comment above workqueue_attrs->affn_scope.
+ */
+struct wq_pod_type {
+	int			nr_pods;	/* number of pods */
+	cpumask_var_t		*pod_cpus;	/* pod -> cpus */
+	int			*pod_node;	/* pod -> node */
+	int			*cpu_pod;	/* cpu -> pod */
+};
+
+static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
 
 /*
  * Per-cpu work items which run for longer than the following threshold are
@@ -341,8 +352,6 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);
 
 static bool wq_online;			/* can kworkers be created yet? */
 
-static bool wq_pod_enabled;		/* unbound CPU pod affinity enabled */
-
 /* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
 static struct workqueue_attrs *wq_update_pod_attrs_buf;
 
@@ -1753,10 +1762,6 @@ static int select_numa_node_cpu(int node)
 {
 	int cpu;
 
-	/* No point in doing this if NUMA isn't enabled for workqueues */
-	if (!wq_pod_enabled)
-		return WORK_CPU_UNBOUND;
-
 	/* Delay binding to CPU if node is not valid or online */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
 		return WORK_CPU_UNBOUND;
@@ -3639,6 +3644,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
 		goto fail;
 
 	cpumask_copy(attrs->cpumask, cpu_possible_mask);
+	attrs->affn_scope = WQ_AFFN_DFL;
 	return attrs;
 fail:
 	free_workqueue_attrs(attrs);
@@ -3650,11 +3656,13 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 {
 	to->nice = from->nice;
 	cpumask_copy(to->cpumask, from->cpumask);
+
 	/*
-	 * Unlike hash and equality test, this function doesn't ignore
-	 * ->ordered as it is used for both pool and wq attrs.  Instead,
-	 * get_unbound_pool() explicitly clears ->ordered after copying.
+	 * Unlike hash and equality test, copying shouldn't ignore wq-only
+	 * fields as copying is used for both pool and wq attrs. Instead,
+	 * get_unbound_pool() explicitly clears the fields.
 	 */
+	to->affn_scope = from->affn_scope;
 	to->ordered = from->ordered;
 }
 
@@ -3680,6 +3688,24 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
 	return true;
 }
 
+/* find wq_pod_type to use for @attrs */
+static const struct wq_pod_type *
+wqattrs_pod_type(const struct workqueue_attrs *attrs)
+{
+	struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];
+
+	if (likely(pt->nr_pods))
+		return pt;
+
+	/*
+	 * Before workqueue_init_topology(), only SYSTEM is available which is
+	 * initialized in workqueue_init_early().
+	 */
+	pt = &wq_pod_types[WQ_AFFN_SYSTEM];
+	BUG_ON(!pt->nr_pods);
+	return pt;
+}
+
 /**
  * init_worker_pool - initialize a newly zalloc'd worker_pool
  * @pool: worker_pool to initialize
@@ -3880,10 +3906,10 @@ static void put_unbound_pool(struct worker_pool *pool)
  */
 static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
+	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
-	int pod;
-	int target_pod = NUMA_NO_NODE;
+	int pod, node = NUMA_NO_NODE;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3895,28 +3921,24 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 		}
 	}
 
-	/* if cpumask is contained inside a pod, we belong to that pod */
-	if (wq_pod_enabled) {
-		for_each_node(pod) {
-			if (cpumask_subset(attrs->cpumask, wq_pod_cpus[pod])) {
-				target_pod = pod;
-				break;
-			}
+	/* If cpumask is contained inside a NUMA pod, that's our NUMA node */
+	for (pod = 0; pod < pt->nr_pods; pod++) {
+		if (cpumask_subset(attrs->cpumask, pt->pod_cpus[pod])) {
+			node = pt->pod_node[pod];
+			break;
 		}
 	}
 
 	/* nope, create a new one */
-	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_pod);
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
 	if (!pool || init_worker_pool(pool) < 0)
 		goto fail;
 
 	copy_workqueue_attrs(pool->attrs, attrs);
-	pool->node = target_pod;
+	pool->node = node;
 
-	/*
-	 * ordered isn't a worker_pool attribute, always clear it.  See
-	 * 'struct workqueue_attrs' comments for detail.
-	 */
+	/* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
+	pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
 	pool->attrs->ordered = false;
 
 	if (worker_pool_assign_id(pool) < 0)
@@ -4103,7 +4125,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 /**
  * wq_calc_pod_cpumask - calculate a wq_attrs' cpumask for a pod
  * @attrs: the wq_attrs of the default pwq of the target workqueue
- * @pod: the target CPU pod
+ * @cpu: the target CPU
  * @cpu_going_down: if >= 0, the CPU to consider as offline
  * @cpumask: outarg, the resulting cpumask
  *
@@ -4117,30 +4139,29 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  *
  * The caller is responsible for ensuring that the cpumask of @pod stays stable.
  */
-static void wq_calc_pod_cpumask(const struct workqueue_attrs *attrs, int pod,
-				 int cpu_going_down, cpumask_t *cpumask)
+static void wq_calc_pod_cpumask(const struct workqueue_attrs *attrs, int cpu,
+				int cpu_going_down, cpumask_t *cpumask)
 {
-	if (!wq_pod_enabled || attrs->ordered)
-		goto use_dfl;
+	const struct wq_pod_type *pt = wqattrs_pod_type(attrs);
+	int pod = pt->cpu_pod[cpu];
 
 	/* does @pod have any online CPUs @attrs wants? */
-	cpumask_and(cpumask, cpumask_of_node(pod), attrs->cpumask);
+	cpumask_and(cpumask, pt->pod_cpus[pod], attrs->cpumask);
+	cpumask_and(cpumask, cpumask, cpu_online_mask);
 	if (cpu_going_down >= 0)
 		cpumask_clear_cpu(cpu_going_down, cpumask);
 
-	if (cpumask_empty(cpumask))
-		goto use_dfl;
+	if (cpumask_empty(cpumask)) {
+		cpumask_copy(cpumask, attrs->cpumask);
+		return;
+	}
 
 	/* yeap, return possible CPUs in @pod that @attrs wants */
-	cpumask_and(cpumask, attrs->cpumask, wq_pod_cpus[pod]);
+	cpumask_and(cpumask, attrs->cpumask, pt->pod_cpus[pod]);
 
 	if (cpumask_empty(cpumask))
 		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
 				"possible intersect\n");
-	return;
-
-use_dfl:
-	cpumask_copy(cpumask, attrs->cpumask);
 }
 
 /* install @pwq into @wq's cpu_pwq and return the old pwq */
@@ -4197,6 +4218,10 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 
 	lockdep_assert_held(&wq_pool_mutex);
 
+	if (WARN_ON(attrs->affn_scope < 0 ||
+		    attrs->affn_scope >= WQ_AFFN_NR_TYPES))
+		return ERR_PTR(-EINVAL);
+
 	ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_cpu_ids), GFP_KERNEL);
 
 	new_attrs = alloc_workqueue_attrs();
@@ -4236,8 +4261,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 			ctx->dfl_pwq->refcnt++;
 			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
 		} else {
-			wq_calc_pod_cpumask(new_attrs, cpu_to_node(cpu), -1,
-					    tmp_attrs->cpumask);
+			wq_calc_pod_cpumask(new_attrs, cpu, -1, tmp_attrs->cpumask);
 			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, tmp_attrs);
 			if (!ctx->pwq_tbl[cpu])
 				goto out_free;
@@ -4257,7 +4281,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	free_workqueue_attrs(tmp_attrs);
 	free_workqueue_attrs(new_attrs);
 	apply_wqattrs_cleanup(ctx);
-	return NULL;
+	return ERR_PTR(-ENOMEM);
 }
 
 /* set attrs and install prepared pwqs, @ctx points to old pwqs on return */
@@ -4313,8 +4337,8 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
 	}
 
 	ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
-	if (!ctx)
-		return -ENOMEM;
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
 
 	/* the ctx has been prepared successfully, let's commit it */
 	apply_wqattrs_commit(ctx);
@@ -4376,7 +4400,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
  */
 static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 {
-	int pod = cpu_to_node(cpu);
 	int cpu_off = online ? -1 : cpu;
 	struct pool_workqueue *old_pwq = NULL, *pwq;
 	struct workqueue_attrs *target_attrs;
@@ -4384,8 +4407,7 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 
 	lockdep_assert_held(&wq_pool_mutex);
 
-	if (!wq_pod_enabled || !(wq->flags & WQ_UNBOUND) ||
-	    wq->unbound_attrs->ordered)
+	if (!(wq->flags & WQ_UNBOUND) || wq->unbound_attrs->ordered)
 		return;
 
 	/*
@@ -4399,7 +4421,7 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
 
 	/* nothing to do if the target cpumask matches the current pwq */
-	wq_calc_pod_cpumask(wq->dfl_pwq->pool->attrs, pod, cpu_off, cpumask);
+	wq_calc_pod_cpumask(wq->dfl_pwq->pool->attrs, cpu, cpu_off, cpumask);
 	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
 					lockdep_is_held(&wq_pool_mutex));
 	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
@@ -5640,8 +5662,8 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
 			continue;
 
 		ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
-		if (!ctx) {
-			ret = -ENOMEM;
+		if (IS_ERR(ctx)) {
+			ret = PTR_ERR(ctx);
 			break;
 		}
 
@@ -6234,6 +6256,7 @@ static inline void wq_watchdog_init(void) { }
  */
 void __init workqueue_init_early(void)
 {
+	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_SYSTEM];
 	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
 	int i, cpu;
 
@@ -6248,6 +6271,22 @@ void __init workqueue_init_early(void)
 	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
 	BUG_ON(!wq_update_pod_attrs_buf);
 
+	/* initialize WQ_AFFN_SYSTEM pods */
+	pt->pod_cpus = kcalloc(1, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
+	pt->pod_node = kcalloc(1, sizeof(pt->pod_node[0]), GFP_KERNEL);
+	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
+	BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
+
+	BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[0], GFP_KERNEL, NUMA_NO_NODE));
+
+	wq_update_pod_attrs_buf = alloc_workqueue_attrs();
+	BUG_ON(!wq_update_pod_attrs_buf);
+
+	pt->nr_pods = 1;
+	cpumask_copy(pt->pod_cpus[0], cpu_possible_mask);
+	pt->pod_node[0] = NUMA_NO_NODE;
+	pt->cpu_pod[0] = 0;
+
 	/* initialize CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
@@ -6365,8 +6404,8 @@ void __init workqueue_init(void)
  */
 void __init workqueue_init_topology(void)
 {
+	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
 	struct workqueue_struct *wq;
-	cpumask_var_t *tbl;
 	int node, cpu;
 
 	if (num_possible_nodes() <= 1)
@@ -6386,20 +6425,23 @@ void __init workqueue_init_topology(void)
 	 * available.  Build one from cpu_to_node() which should have been
 	 * fully initialized by now.
 	 */
-	tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);
-	BUG_ON(!tbl);
+	pt->pod_cpus = kcalloc(nr_node_ids, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
+	pt->pod_node = kcalloc(nr_node_ids, sizeof(pt->pod_node[0]), GFP_KERNEL);
+	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
+	BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
 
 	for_each_node(node)
-		BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
+		BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[node], GFP_KERNEL,
 				node_online(node) ? node : NUMA_NO_NODE));
 
 	for_each_possible_cpu(cpu) {
 		node = cpu_to_node(cpu);
-		cpumask_set_cpu(cpu, tbl[node]);
+		cpumask_set_cpu(cpu, pt->pod_cpus[node]);
+		pt->pod_node[node] = node;
+		pt->cpu_pod[cpu] = node;
 	}
 
-	wq_pod_cpus = tbl;
-	wq_pod_enabled = true;
+	pt->nr_pods = nr_node_ids;
 
 	/*
 	 * Workqueues allocated earlier would have all CPUs sharing the default
-- 
2.40.1



* [PATCH 15/24] workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (13 preceding siblings ...)
  2023-05-19  0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 16/24] workqueue: Modularize wq_pod_type initialization Tejun Heo
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Lack of visibility has always been a pain point for workqueues. While the
recently added wq_monitor.py improved the situation, it's still difficult to
understand what worker pools are active in the system, how workqueues map to
them and why. The lack of visibility into how workqueues are configured is
going to become more noticeable as workqueue improves locality awareness and
provides more mechanisms to customize locality related behaviors.

Now that the basic framework for more flexible locality support is in place,
this is a good time to improve the situation. This patch adds
tools/workqueue/wq_dump.py which prints out the topology configuration,
worker pools and how workqueues are mapped to pools. Read the command's help
message for more details.
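
As a reading aid for the masks the tool prints (for example
"pod_cpus [1]=0000000c"), here is a minimal sketch, not part of the patch,
that turns one hex mask word into a list of CPU numbers. It assumes a single
word printed as bare hex; the helper name mask_to_cpus is made up for
illustration.

```python
def mask_to_cpus(mask_hex):
    """Return the CPU numbers whose bits are set in one hex mask word."""
    value = int(mask_hex, 16)
    # Bit N set means CPU N is part of the mask.
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


if __name__ == "__main__":
    # Masks taken from the sample output in the documentation hunk.
    for mask in ("00000003", "0000000c", "0000000f"):
        print(mask, mask_to_cpus(mask))
```

With the sample output in the documentation hunk, this maps NUMA pod 0 to
CPUs 0-1 and pod 1 to CPUs 2-3.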

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/core-api/workqueue.rst |  59 ++++++++++
 tools/workqueue/wq_dump.py           | 166 +++++++++++++++++++++++++++
 2 files changed, 225 insertions(+)
 create mode 100644 tools/workqueue/wq_dump.py

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 8e541c5d8fa9..c9e46acd339b 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -347,6 +347,65 @@ Guidelines
   level of locality in wq operations and work item execution.
 
 
+Examining Configuration
+=======================
+
+Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
+configuration, worker pools and how workqueues map to the pools: ::
+
+  $ tools/workqueue/wq_dump.py
+  Affinity Scopes
+  ===============
+  wq_unbound_cpumask=0000000f
+
+  NUMA
+    nr_pods  2
+    pod_cpus [0]=00000003 [1]=0000000c
+    pod_node [0]=0 [1]=1
+    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1
+
+  SYSTEM
+    nr_pods  1
+    pod_cpus [0]=0000000f
+    pod_node [0]=-1
+    cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0
+
+  Worker Pools
+  ============
+  pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
+  pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
+  pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
+  pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
+  pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
+  pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
+  pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
+  pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
+  pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
+  pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
+  pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
+  pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
+  pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
+  pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c
+
+  Workqueue CPU -> pool
+  =====================
+  [    workqueue \ CPU              0  1  2  3 dfl]
+  events                   percpu   0  2  4  6
+  events_highpri           percpu   1  3  5  7
+  events_long              percpu   0  2  4  6
+  events_unbound           unbound  9  9 10 10  8
+  events_freezable         percpu   0  2  4  6
+  events_power_efficient   percpu   0  2  4  6
+  events_freezable_power_  percpu   0  2  4  6
+  rcu_gp                   percpu   0  2  4  6
+  rcu_par_gp               percpu   0  2  4  6
+  slub_flushwq             percpu   0  2  4  6
+  netns                    ordered  8  8  8  8  8
+  ...
+
+See the command's help message for more info.
+
+
 Monitoring
 ==========
 
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
new file mode 100644
index 000000000000..ddd0bb4395ea
--- /dev/null
+++ b/tools/workqueue/wq_dump.py
@@ -0,0 +1,166 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <tj@kernel.org>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to show the current workqueue configuration. For more
+info on drgn, visit https://github.com/osandov/drgn.
+
+Affinity Scopes
+===============
+
+Shows the CPUs that can be used for unbound workqueues and how they will be
+grouped by each available affinity type. For each type:
+
+  nr_pods   number of CPU pods in the affinity type
+  pod_cpus  CPUs in each pod
+  pod_node  NUMA node for memory allocation for each pod
+  cpu_pod   pod that each CPU is associated with
+
+Worker Pools
+============
+
+Lists all worker pools indexed by their ID. For each pool:
+
+  ref       number of pool_workqueue's associated with this pool
+  nice      nice value of the worker threads in the pool
+  idle      number of idle workers
+  workers   number of all workers
+  cpu       CPU the pool is associated with (per-cpu pool)
+  cpus      CPUs the workers in the pool can run on (unbound pool)
+
+Workqueue CPU -> pool
+=====================
+
+Lists all workqueues along with their type and worker pool association. For
+each workqueue:
+
+  NAME TYPE POOL_ID...
+
+  NAME      name of the workqueue
+  TYPE      percpu, unbound or ordered
+  POOL_ID   worker pool ID associated with each possible CPU
+"""
+
+import sys
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.percpu import per_cpu_ptr
+from drgn.helpers.linux.cpumask import for_each_cpu,for_each_possible_cpu
+from drgn.helpers.linux.idr import idr_for_each
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+args = parser.parse_args()
+
+def err(s):
+    print(s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+def cpumask_str(cpumask):
+    output = ""
+    base = 0
+    v = 0
+    for cpu in for_each_cpu(cpumask[0]):
+        while cpu - base >= 32:
+            output += f'{v:08x} '
+            base += 32
+            v = 0
+        v |= 1 << (cpu - base)
+    if v > 0:
+        output += f'{v:08x}'
+    return output.strip()
+
+worker_pool_idr         = prog['worker_pool_idr']
+workqueues              = prog['workqueues']
+wq_unbound_cpumask      = prog['wq_unbound_cpumask']
+wq_pod_types            = prog['wq_pod_types']
+
+WQ_UNBOUND              = prog['WQ_UNBOUND']
+WQ_ORDERED              = prog['__WQ_ORDERED']
+WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
+
+WQ_AFFN_NUMA            = prog['WQ_AFFN_NUMA']
+WQ_AFFN_SYSTEM          = prog['WQ_AFFN_SYSTEM']
+
+print('Affinity Scopes')
+print('===============')
+
+print(f'wq_unbound_cpumask={cpumask_str(wq_unbound_cpumask)}')
+
+def print_pod_type(pt):
+    print(f'  nr_pods  {pt.nr_pods.value_()}')
+
+    print('  pod_cpus', end='')
+    for pod in range(pt.nr_pods):
+        print(f' [{pod}]={cpumask_str(pt.pod_cpus[pod])}', end='')
+    print('')
+
+    print('  pod_node', end='')
+    for pod in range(pt.nr_pods):
+        print(f' [{pod}]={pt.pod_node[pod].value_()}', end='')
+    print('')
+
+    print(f'  cpu_pod ', end='')
+    for cpu in for_each_possible_cpu(prog):
+        print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
+    print('')
+
+print('')
+print('NUMA')
+print_pod_type(wq_pod_types[WQ_AFFN_NUMA])
+print('')
+print('SYSTEM')
+print_pod_type(wq_pod_types[WQ_AFFN_SYSTEM])
+
+print('')
+print('Worker Pools')
+print('============')
+
+max_pool_id_len = 0
+max_ref_len = 0
+for pi, pool in idr_for_each(worker_pool_idr):
+    pool = drgn.Object(prog, 'struct worker_pool', address=pool)
+    max_pool_id_len = max(max_pool_id_len, len(f'{pi}'))
+    max_ref_len = max(max_ref_len, len(f'{pool.refcnt.value_()}'))
+
+for pi, pool in idr_for_each(worker_pool_idr):
+    pool = drgn.Object(prog, 'struct worker_pool', address=pool)
+    print(f'pool[{pi:0{max_pool_id_len}}] ref={pool.refcnt.value_():{max_ref_len}} nice={pool.attrs.nice.value_():3} ', end='')
+    print(f'idle/workers={pool.nr_idle.value_():3}/{pool.nr_workers.value_():3} ', end='')
+    if pool.cpu >= 0:
+        print(f'cpu={pool.cpu.value_():3}', end='')
+    else:
+        print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
+    print('')
+
+print('')
+print('Workqueue CPU -> pool')
+print('=====================')
+
+print('[    workqueue \\ CPU            ', end='')
+for cpu in for_each_possible_cpu(prog):
+    print(f' {cpu:{max_pool_id_len}}', end='')
+print(' dfl]')
+
+for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+    print(f'{wq.name.string_().decode()[-24:]:24}', end='')
+    if wq.flags & WQ_UNBOUND:
+        if wq.flags & WQ_ORDERED:
+            print(' ordered', end='')
+        else:
+            print(' unbound', end='')
+    else:
+        print(' percpu ', end='')
+
+    for cpu in for_each_possible_cpu(prog):
+        pool_id = per_cpu_ptr(wq.cpu_pwq, cpu)[0].pool.id.value_()
+        field_len = max(len(str(cpu)), max_pool_id_len)
+        print(f' {pool_id:{field_len}}', end='')
+
+    if wq.flags & WQ_UNBOUND:
+        print(f' {wq.dfl_pwq.pool.id.value_():{max_pool_id_len}}', end='')
+    print('')
-- 
2.40.1



* [PATCH 16/24] workqueue: Modularize wq_pod_type initialization
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (14 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 15/24] workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 17/24] workqueue: Add multiple affinity scopes and interface to select them Tejun Heo
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

While wq_pod_types[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type(), which takes as an argument a
callback that determines whether two CPUs should share a pod.

init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa() which tests whether two CPUs belong to the same NUMA node.
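
The scan described above can be modelled in userspace. The following is a
rough Python sketch of the same two passes, not the kernel code: the CPU
list, the node mapping and the use of dicts and lists in place of
cpumask_var_t are all assumptions made for illustration.

```python
def init_pod_type(cpus, cpus_share_pod, cpu_to_node):
    """Model of init_pod_type(); @cpus must be in ascending order."""
    cpu_pod = {}
    nr_pods = 0

    # Pass 1: compare each CPU against the lower-numbered ones. Reuse the
    # pod of the first one it shares with, else allocate the next pod ID.
    for cur in cpus:
        for pre in cpus:
            if pre >= cur:
                cpu_pod[cur] = nr_pods
                nr_pods += 1
                break
            if cpus_share_pod(cur, pre):
                cpu_pod[cur] = cpu_pod[pre]
                break

    # Pass 2: derive pod -> cpus and pod -> node from cpu -> pod.
    pod_cpus = [[] for _ in range(nr_pods)]
    pod_node = [None] * nr_pods
    for cpu in cpus:
        pod_cpus[cpu_pod[cpu]].append(cpu)
        pod_node[cpu_pod[cpu]] = cpu_to_node(cpu)
    return cpu_pod, pod_cpus, pod_node


# Hypothetical machine where CPU numbering interleaves the two nodes.
NODE_OF = {0: 1, 1: 0, 2: 1, 3: 0}


def cpus_share_numa(cpu0, cpu1):
    return NODE_OF[cpu0] == NODE_OF[cpu1]


if __name__ == "__main__":
    print(init_pod_type(sorted(NODE_OF), cpus_share_numa, NODE_OF.get))
```

On this made-up layout pod 0 holds CPUs 0 and 2, which sit on node 1, so the
pod ID and the node ID differ and only pod_node[] records the node.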

This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
is tracked separately in pod_type->pod_node[]. This makes adding new
affinity types pretty easy.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 84 +++++++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 34 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dae1787833cb..1734b8a11a4c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6395,6 +6395,54 @@ void __init workqueue_init(void)
 	wq_watchdog_init();
 }
 
+/*
+ * Initialize @pt by first initializing @pt->cpu_pod[] with pod IDs according to
+ * @cpus_share_pod(). Each subset of CPUs that share a pod is assigned a unique
+ * and consecutive pod ID. The rest of @pt is initialized accordingly.
+ */
+static void __init init_pod_type(struct wq_pod_type *pt,
+				 bool (*cpus_share_pod)(int, int))
+{
+	int cur, pre, cpu, pod;
+
+	pt->nr_pods = 0;
+
+	/* init @pt->cpu_pod[] according to @cpus_share_pod() */
+	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
+	BUG_ON(!pt->cpu_pod);
+
+	for_each_possible_cpu(cur) {
+		for_each_possible_cpu(pre) {
+			if (pre >= cur) {
+				pt->cpu_pod[cur] = pt->nr_pods++;
+				break;
+			}
+			if (cpus_share_pod(cur, pre)) {
+				pt->cpu_pod[cur] = pt->cpu_pod[pre];
+				break;
+			}
+		}
+	}
+
+	/* init the rest to match @pt->cpu_pod[] */
+	pt->pod_cpus = kcalloc(pt->nr_pods, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
+	pt->pod_node = kcalloc(pt->nr_pods, sizeof(pt->pod_node[0]), GFP_KERNEL);
+	BUG_ON(!pt->pod_cpus || !pt->pod_node);
+
+	for (pod = 0; pod < pt->nr_pods; pod++)
+		BUG_ON(!zalloc_cpumask_var(&pt->pod_cpus[pod], GFP_KERNEL));
+
+	for_each_possible_cpu(cpu) {
+		cpumask_set_cpu(cpu, pt->pod_cpus[pt->cpu_pod[cpu]]);
+		pt->pod_node[pt->cpu_pod[cpu]] = cpu_to_node(cpu);
+	}
+}
+
+static bool __init cpus_share_numa(int cpu0, int cpu1)
+{
+	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
+}
+
 /**
  * workqueue_init_topology - initialize CPU pods for unbound workqueues
  *
@@ -6404,45 +6452,13 @@ void __init workqueue_init(void)
  */
 void __init workqueue_init_topology(void)
 {
-	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
 	struct workqueue_struct *wq;
-	int node, cpu;
-
-	if (num_possible_nodes() <= 1)
-		return;
+	int cpu;
 
-	for_each_possible_cpu(cpu) {
-		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
-			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
-			return;
-		}
-	}
+	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
 
 	mutex_lock(&wq_pool_mutex);
 
-	/*
-	 * We want masks of possible CPUs of each node which isn't readily
-	 * available.  Build one from cpu_to_node() which should have been
-	 * fully initialized by now.
-	 */
-	pt->pod_cpus = kcalloc(nr_node_ids, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
-	pt->pod_node = kcalloc(nr_node_ids, sizeof(pt->pod_node[0]), GFP_KERNEL);
-	pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
-	BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
-
-	for_each_node(node)
-		BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[node], GFP_KERNEL,
-				node_online(node) ? node : NUMA_NO_NODE));
-
-	for_each_possible_cpu(cpu) {
-		node = cpu_to_node(cpu);
-		cpumask_set_cpu(cpu, pt->pod_cpus[node]);
-		pt->pod_node[node] = node;
-		pt->cpu_pod[cpu] = node;
-	}
-
-	pt->nr_pods = nr_node_ids;
-
 	/*
 	 * Workqueues allocated earlier would have all CPUs sharing the default
 	 * worker pool. Explicitly call wq_update_pod() on all workqueue and CPU
-- 
2.40.1



* [PATCH 17/24] workqueue: Add multiple affinity scopes and interface to select them
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (15 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 16/24] workqueue: Modularize wq_pod_type initialization Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 18/24] workqueue: Factor out work to worker assignment and collision handling Tejun Heo
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.

Also add the module parameter "workqueue.default_affinity_scope" to override
the default scope and the "affinity_scope" sysfs file to configure it per
workqueue. wq_dump.py and the documentation are updated accordingly.
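
For reference, the name matching done by parse_affn_scope() in this patch is
a case-insensitive prefix match against the scope names in enum order. A
small Python model of that behavior, offered as a sketch only (it returns
None where the kernel function returns -EINVAL):

```python
# Same order as enum wq_affn_scope in this patch.
WQ_AFFN_NAMES = ["cpu", "smt", "cache", "numa", "system"]


def parse_affn_scope(val):
    """Return the index of the first scope name that prefixes @val."""
    lowered = val.lower()
    for i, name in enumerate(WQ_AFFN_NAMES):
        # Models strncasecmp(val, name, strlen(name)) == 0.
        if lowered.startswith(name):
            return i
    return None


if __name__ == "__main__":
    for text in ("cache\n", "NUMA", "bogus"):
        print(repr(text), parse_affn_scope(text))
```

Because only the leading strlen(name) characters are compared, a trailing
newline from "echo" is accepted, and so is any other trailing text.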

This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.

Many modern machines often have multiple L3 caches while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unnecessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.
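
To make the trade-off concrete, here is a toy Python sketch of which CPUs a
work item issued on a given CPU may run on under each scope. The 8-CPU
topology (one NUMA node, two L3 caches, two SMT threads per core) and the
helper pod_of() are hypothetical and only model the grouping, not the
kernel data structures.

```python
# Hypothetical machine: cpu -> (core, llc, node).
TOPOLOGY = {
    0: (0, 0, 0), 1: (0, 0, 0),
    2: (1, 0, 0), 3: (1, 0, 0),
    4: (2, 1, 0), 5: (2, 1, 0),
    6: (3, 1, 0), 7: (3, 1, 0),
}

# Grouping key used by each affinity scope in this model.
SCOPE_KEYS = {
    "cpu": lambda cpu: cpu,
    "smt": lambda cpu: TOPOLOGY[cpu][0],
    "cache": lambda cpu: TOPOLOGY[cpu][1],
    "numa": lambda cpu: TOPOLOGY[cpu][2],
    "system": lambda cpu: 0,
}


def pod_of(scope, cpu):
    """Return the CPUs sharing a pod with @cpu under @scope."""
    key = SCOPE_KEYS[scope]
    return sorted(c for c in TOPOLOGY if key(c) == key(cpu))


if __name__ == "__main__":
    for scope in SCOPE_KEYS:
        print(f"{scope:6} {pod_of(scope, 5)}")
```

In this model "numa" lets an item issued on CPU 5 run on all eight CPUs and
cross the L3 boundary, while "cache" keeps it on CPUs 4-7 and leaves CPUs
0-3 unavailable to that issuer even when they are idle.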

While dependent on the specifics of the workload, there's usually a noticeable
penalty in crossing L3 boundaries, so let's default to CACHE. This issue
will be further addressed and documented with examples in future patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         |  12 ++
 Documentation/core-api/workqueue.rst          |  63 ++++++++++
 include/linux/workqueue.h                     |   5 +-
 kernel/workqueue.c                            | 110 +++++++++++++++++-
 tools/workqueue/wq_dump.py                    |  15 ++-
 5 files changed, 193 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 042275425c32..0aa7fd68a024 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6958,6 +6958,18 @@
 			The default value of this parameter is determined by
 			the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
 
+	workqueue.default_affinity_scope=
+			Select the default affinity scope to use for unbound
+			workqueues. Can be one of "cpu", "smt", "cache",
+			"numa" and "system". Default is "cache". For more
+			information, see the Affinity Scopes section in
+			Documentation/core-api/workqueue.rst.
+
+			This can be updated after boot through the matching
+			file under /sys/module/workqueue/parameters.
+			However, the changed default will only apply to
+			unbound workqueues created afterwards.
+
 	workqueue.debug_force_rr_cpu
 			Workqueue used to implicitly guarantee that work
 			items queued without explicit CPU specified are put
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index c9e46acd339b..56af317508c9 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -347,6 +347,51 @@ Guidelines
   level of locality in wq operations and work item execution.
 
 
+Affinity Scopes
+===============
+
+An unbound workqueue groups CPUs according to its affinity scope to improve
+cache locality. For example, if a workqueue is using the default affinity
+scope of "cache", it will group CPUs according to last level cache
+boundaries. A work item queued on the workqueue will be processed by a
+worker running on one of the CPUs which share the last level cache with the
+issuing CPU.
+
+Workqueue currently supports the following five affinity scopes.
+
+``cpu``
+  CPUs are not grouped. A work item issued on one CPU is processed by a
+  worker on the same CPU. This makes unbound workqueues behave as per-cpu
+  workqueues without concurrency management.
+
+``smt``
+  CPUs are grouped according to SMT boundaries. This usually means that the
+  logical threads of each physical CPU core are grouped together.
+
+``cache``
+  CPUs are grouped according to cache boundaries. Which specific cache
+  boundary is used is determined by the arch code. L3 is used in a lot of
+  cases. This is the default affinity scope.
+
+``numa``
+  CPUs are grouped according to NUMA boundaries.
+
+``system``
+  All CPUs are put in the same group. Workqueue makes no effort to process a
+  work item on a CPU close to the issuing CPU.
+
+The default affinity scope can be changed with the module parameter
+``workqueue.default_affinity_scope`` and a specific workqueue's affinity
+scope can be changed using ``apply_workqueue_attrs()``.
+
+If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
+related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
+directory.
+
+``affinity_scope``
+  Read to see the current affinity scope. Write to change.
+
+
 Examining Configuration
 =======================
 
@@ -358,6 +403,24 @@ Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
   ===============
   wq_unbound_cpumask=0000000f
 
+  CPU
+    nr_pods  4
+    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+    pod_node [0]=0 [1]=0 [2]=1 [3]=1
+    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3
+
+  SMT
+    nr_pods  4
+    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+    pod_node [0]=0 [1]=0 [2]=1 [3]=1
+    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3
+
+  CACHE (default)
+    nr_pods  2
+    pod_cpus [0]=00000003 [1]=0000000c
+    pod_node [0]=0 [1]=1
+    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1
+
   NUMA
     nr_pods  2
     pod_cpus [0]=00000003 [1]=0000000c
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a2f826b6ec9a..a01b5dcbbeb9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -125,12 +125,15 @@ struct rcu_work {
 };
 
 enum wq_affn_scope {
+	WQ_AFFN_CPU,			/* one pod per CPU */
+	WQ_AFFN_SMT,			/* one pod per SMT */
+	WQ_AFFN_CACHE,			/* one pod per LLC */
 	WQ_AFFN_NUMA,			/* one pod per NUMA node */
 	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
 
 	WQ_AFFN_NR_TYPES,
 
-	WQ_AFFN_DFL = WQ_AFFN_NUMA,
+	WQ_AFFN_DFL = WQ_AFFN_CACHE,
 };
 
 /**
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1734b8a11a4c..bb0900602408 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -337,6 +337,15 @@ struct wq_pod_type {
 };
 
 static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_DFL;
+
+static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
+	[WQ_AFFN_CPU]			= "cpu",
+	[WQ_AFFN_SMT]			= "smt",
+	[WQ_AFFN_CACHE]			= "cache",
+	[WQ_AFFN_NUMA]			= "numa",
+	[WQ_AFFN_SYSTEM]		= "system",
+};
 
 /*
  * Per-cpu work items which run for longer than the following threshold are
@@ -3644,7 +3653,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
 		goto fail;
 
 	cpumask_copy(attrs->cpumask, cpu_possible_mask);
-	attrs->affn_scope = WQ_AFFN_DFL;
+	attrs->affn_scope = wq_affn_dfl;
 	return attrs;
 fail:
 	free_workqueue_attrs(attrs);
@@ -5721,19 +5730,55 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	return ret;
 }
 
+static int parse_affn_scope(const char *val)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(wq_affn_names); i++) {
+		if (!strncasecmp(val, wq_affn_names[i], strlen(wq_affn_names[i])))
+			return i;
+	}
+	return -EINVAL;
+}
+
+static int wq_affn_dfl_set(const char *val, const struct kernel_param *kp)
+{
+	int affn;
+
+	affn = parse_affn_scope(val);
+	if (affn < 0)
+		return affn;
+
+	wq_affn_dfl = affn;
+	return 0;
+}
+
+static int wq_affn_dfl_get(char *buffer, const struct kernel_param *kp)
+{
+	return scnprintf(buffer, PAGE_SIZE, "%s\n", wq_affn_names[wq_affn_dfl]);
+}
+
+static const struct kernel_param_ops wq_affn_dfl_ops = {
+	.set	= wq_affn_dfl_set,
+	.get	= wq_affn_dfl_get,
+};
+
+module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
+
 #ifdef CONFIG_SYSFS
 /*
  * Workqueues with WQ_SYSFS flag set is visible to userland via
  * /sys/bus/workqueue/devices/WQ_NAME.  All visible workqueues have the
  * following attributes.
  *
- *  per_cpu	RO bool	: whether the workqueue is per-cpu or unbound
- *  max_active	RW int	: maximum number of in-flight work items
+ *  per_cpu		RO bool	: whether the workqueue is per-cpu or unbound
+ *  max_active		RW int	: maximum number of in-flight work items
  *
  * Unbound workqueues have the following extra attributes.
  *
- *  nice	RW int	: nice value of the workers
- *  cpumask	RW mask	: bitmask of allowed CPUs for the workers
+ *  nice		RW int	: nice value of the workers
+ *  cpumask		RW mask	: bitmask of allowed CPUs for the workers
+ *  affinity_scope	RW str  : worker CPU affinity scope (cpu, smt, cache, numa, system)
  */
 struct wq_device {
 	struct workqueue_struct		*wq;
@@ -5876,9 +5921,47 @@ static ssize_t wq_cpumask_store(struct device *dev,
 	return ret ?: count;
 }
 
+static ssize_t wq_affn_scope_show(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	int written;
+
+	mutex_lock(&wq->mutex);
+	written = scnprintf(buf, PAGE_SIZE, "%s\n",
+			    wq_affn_names[wq->unbound_attrs->affn_scope]);
+	mutex_unlock(&wq->mutex);
+
+	return written;
+}
+
+static ssize_t wq_affn_scope_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int affn, ret = -ENOMEM;
+
+	affn = parse_affn_scope(buf);
+	if (affn < 0)
+		return affn;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (attrs) {
+		attrs->affn_scope = affn;
+		ret = apply_workqueue_attrs_locked(wq, attrs);
+	}
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
 	__ATTR_NULL,
 };
 
@@ -6438,6 +6521,20 @@ static void __init init_pod_type(struct wq_pod_type *pt,
 	}
 }
 
+static bool __init cpus_dont_share(int cpu0, int cpu1)
+{
+	return false;
+}
+
+static bool __init cpus_share_smt(int cpu0, int cpu1)
+{
+#ifdef CONFIG_SCHED_SMT
+	return cpumask_test_cpu(cpu0, cpu_smt_mask(cpu1));
+#else
+	return false;
+#endif
+}
+
 static bool __init cpus_share_numa(int cpu0, int cpu1)
 {
 	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
@@ -6455,6 +6552,9 @@ void __init workqueue_init_topology(void)
 	struct workqueue_struct *wq;
 	int cpu;
 
+	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
+	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
+	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
 
 	mutex_lock(&wq_pool_mutex);
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index ddd0bb4395ea..43ab71a193b8 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -78,11 +78,16 @@ worker_pool_idr         = prog['worker_pool_idr']
 workqueues              = prog['workqueues']
 wq_unbound_cpumask      = prog['wq_unbound_cpumask']
 wq_pod_types            = prog['wq_pod_types']
+wq_affn_dfl             = prog['wq_affn_dfl']
+wq_affn_names           = prog['wq_affn_names']
 
 WQ_UNBOUND              = prog['WQ_UNBOUND']
 WQ_ORDERED              = prog['__WQ_ORDERED']
 WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
 
+WQ_AFFN_CPU             = prog['WQ_AFFN_CPU']
+WQ_AFFN_SMT             = prog['WQ_AFFN_SMT']
+WQ_AFFN_CACHE           = prog['WQ_AFFN_CACHE']
 WQ_AFFN_NUMA            = prog['WQ_AFFN_NUMA']
 WQ_AFFN_SYSTEM          = prog['WQ_AFFN_SYSTEM']
 
@@ -109,12 +114,10 @@ print(f'wq_unbound_cpumask={cpumask_str(wq_unbound_cpumask)}')
         print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
     print('')
 
-print('')
-print('NUMA')
-print_pod_type(wq_pod_types[WQ_AFFN_NUMA])
-print('')
-print('SYSTEM')
-print_pod_type(wq_pod_types[WQ_AFFN_SYSTEM])
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+    print('')
+    print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
+    print_pod_type(wq_pod_types[affn])
 
 print('')
 print('Worker Pools')
-- 
2.40.1
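
For readers who want to see how a sharing predicate turns into the pod tables
printed by wq_dump.py, here is a small user-space sketch. It is not the kernel
implementation: build_pods() and the 4-CPU topology (CPUs 0-1 and 2-3 each
sharing a cache) are assumptions chosen to match the sample output in the
documentation hunk above.

```python
def build_pods(nr_cpus, cpus_share):
    """Group CPUs into pods using a pairwise sharing predicate.

    A CPU joins the pod of the first lower-numbered CPU it shares with;
    otherwise it opens a new pod. Returns (pod_cpus, cpu_pod) where
    pod_cpus[pod] is a bitmask of member CPUs and cpu_pod[cpu] is the pod id.
    """
    cpu_pod = [None] * nr_cpus
    pod_cpus = []
    for cpu in range(nr_cpus):
        for prev in range(cpu):
            if cpus_share(cpu, prev):
                cpu_pod[cpu] = cpu_pod[prev]
                break
        else:
            # No earlier CPU shares with this one: start a new pod.
            cpu_pod[cpu] = len(pod_cpus)
            pod_cpus.append(0)
        pod_cpus[cpu_pod[cpu]] |= 1 << cpu
    return pod_cpus, cpu_pod


# Hypothetical predicates for a 4-CPU machine without SMT.
def cpus_dont_share(cpu0, cpu1):
    return False                      # "cpu" scope: every CPU is its own pod


def cpus_share_cache(cpu0, cpu1):
    return cpu0 // 2 == cpu1 // 2     # assumed: CPUs {0,1} and {2,3} share a cache


def cpus_share_system(cpu0, cpu1):
    return True                       # "system" scope: one pod for everything
```

With these assumed predicates, the "cache" scope yields two pods and the "cpu"
scope yields four, which is the shape shown in the wq_dump.py sample.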


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 18/24] workqueue: Factor out work to worker assignment and collision handling
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (16 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 17/24] workqueue: Add multiple affinity scopes and interface to select them Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 19/24] workqueue: Factor out need_more_worker() check and worker wake-up Tejun Heo
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_scheduled_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.

This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to execute the work once assigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.

This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and performs collision handling while doing so. As
collisions are now handled at assignment time, process_one_work() no longer
needs to worry about them.

After this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.

This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
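
As a rough illustration of the new flow, the following user-space sketch
models assign_work() with plain Python lists. It is a simplification under
stated assumptions: the Work, Worker and Pool classes are hypothetical, the
busy hash is replaced by a linear scan over workers, and the function match
done by the real find_worker_executing_work() is left out.

```python
class Work:
    def __init__(self, name, linked=False):
        self.name = name
        # Models WORK_STRUCT_LINKED: the next item on the list belongs to
        # the same series and must move together with this one.
        self.linked = linked


class Worker:
    def __init__(self, name):
        self.name = name
        self.scheduled = []       # models worker->scheduled
        self.current_work = None  # models worker->current_work


class Pool:
    def __init__(self):
        self.worklist = []        # models pool->worklist
        self.workers = []


def find_worker_executing_work(pool, work):
    """Return the worker currently executing @work, or None."""
    for worker in pool.workers:
        if worker.current_work is work:
            return worker
    return None


def move_linked_works(pool, work, head):
    """Move @work and any consecutive linked items from the worklist to @head."""
    i = pool.worklist.index(work)
    while True:
        item = pool.worklist.pop(i)
        head.append(item)
        if not item.linked or i >= len(pool.worklist):
            break


def assign_work(pool, work, worker):
    """Assign @work to @worker unless another worker is already running it."""
    collision = find_worker_executing_work(pool, work)
    if collision is not None:
        # Punt to the worker that is still executing this work item.
        move_linked_works(pool, work, collision.scheduled)
        return False
    move_linked_works(pool, work, worker.scheduled)
    return True
```

The point of the sketch is the return value: a caller only proceeds to
process its scheduled list when assign_work() reports that the work item
actually landed on it.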

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 80 ++++++++++++++++++++++++++++++----------------
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb0900602408..a2e6c2be3a06 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1015,13 +1015,10 @@ static struct worker *find_worker_executing_work(struct worker_pool *pool,
  * @head: target list to append @work to
  * @nextp: out parameter for nested worklist walking
  *
- * Schedule linked works starting from @work to @head.  Work series to
- * be scheduled starts at @work and includes any consecutive work with
- * WORK_STRUCT_LINKED set in its predecessor.
- *
- * If @nextp is not NULL, it's updated to point to the next work of
- * the last scheduled work.  This allows move_linked_works() to be
- * nested inside outer list_for_each_entry_safe().
+ * Schedule linked works starting from @work to @head. Work series to be
+ * scheduled starts at @work and includes any consecutive work with
+ * WORK_STRUCT_LINKED set in its predecessor. See assign_work() for details on
+ * @nextp.
  *
  * CONTEXT:
  * raw_spin_lock_irq(pool->lock).
@@ -1050,6 +1047,48 @@ static void move_linked_works(struct work_struct *work, struct list_head *head,
 		*nextp = n;
 }
 
+/**
+ * assign_work - assign a work item and its linked work items to a worker
+ * @work: work to assign
+ * @worker: worker to assign to
+ * @nextp: out parameter for nested worklist walking
+ *
+ * Assign @work and its linked work items to @worker. If @work is already being
+ * executed by another worker in the same pool, it'll be punted there.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of the last
+ * scheduled work. This allows assign_work() to be nested inside
+ * list_for_each_entry_safe().
+ *
+ * Returns %true if @work was successfully assigned to @worker. %false if @work
+ * was punted to another worker already executing it.
+ */
+static bool assign_work(struct work_struct *work, struct worker *worker,
+			struct work_struct **nextp)
+{
+	struct worker_pool *pool = worker->pool;
+	struct worker *collision;
+
+	lockdep_assert_held(&pool->lock);
+
+	/*
+	 * A single work shouldn't be executed concurrently by multiple workers.
+	 * __queue_work() ensures that @work doesn't jump to a different pool
+	 * while still running in the previous pool. Here, we should ensure that
+	 * @work is not executed concurrently by multiple workers from the same
+	 * pool. Check whether anyone is already processing the work. If so,
+	 * defer the work to the currently executing one.
+	 */
+	collision = find_worker_executing_work(pool, work);
+	if (unlikely(collision)) {
+		move_linked_works(work, &collision->scheduled, nextp);
+		return false;
+	}
+
+	move_linked_works(work, &worker->scheduled, nextp);
+	return true;
+}
+
 /**
  * wake_up_worker - wake up an idle worker
  * @pool: worker pool to wake worker from
@@ -2442,7 +2481,6 @@ __acquires(&pool->lock)
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
 	unsigned long work_data;
-	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -2459,18 +2497,6 @@ __acquires(&pool->lock)
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
 
-	/*
-	 * A single work shouldn't be executed concurrently by
-	 * multiple workers on a single cpu.  Check whether anyone is
-	 * already processing the work.  If so, defer the work to the
-	 * currently executing one.
-	 */
-	collision = find_worker_executing_work(pool, work);
-	if (unlikely(collision)) {
-		move_linked_works(work, &collision->scheduled, NULL);
-		return;
-	}
-
 	/* claim and dequeue */
 	debug_work_deactivate(work);
 	hash_add(pool->busy_hash, &worker->hentry, (unsigned long)work);
@@ -2697,8 +2723,8 @@ static int worker_thread(void *__worker)
 			list_first_entry(&pool->worklist,
 					 struct work_struct, entry);
 
-		move_linked_works(work, &worker->scheduled, NULL);
-		process_scheduled_works(worker);
+		if (assign_work(work, worker, NULL))
+			process_scheduled_works(worker);
 	} while (keep_working(pool));
 
 	worker_set_flags(worker, WORKER_PREP);
@@ -2742,7 +2768,6 @@ static int rescuer_thread(void *__rescuer)
 {
 	struct worker *rescuer = __rescuer;
 	struct workqueue_struct *wq = rescuer->rescue_wq;
-	struct list_head *scheduled = &rescuer->scheduled;
 	bool should_stop;
 
 	set_user_nice(current, RESCUER_NICE_LEVEL);
@@ -2787,15 +2812,14 @@ static int rescuer_thread(void *__rescuer)
 		 * Slurp in all works issued via this workqueue and
 		 * process'em.
 		 */
-		WARN_ON_ONCE(!list_empty(scheduled));
+		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
 		list_for_each_entry_safe(work, n, &pool->worklist, entry) {
-			if (get_work_pwq(work) == pwq) {
-				move_linked_works(work, scheduled, &n);
+			if (get_work_pwq(work) == pwq &&
+			    assign_work(work, rescuer, &n))
 				pwq->stats[PWQ_STAT_RESCUED]++;
-			}
 		}
 
-		if (!list_empty(scheduled)) {
+		if (!list_empty(&rescuer->scheduled)) {
 			process_scheduled_works(rescuer);
 
 			/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 19/24] workqueue: Factor out need_more_worker() check and worker wake-up
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (17 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 18/24] workqueue: Factor out work to worker assignment and collision handling Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 20/24] workqueue: Add workqueue_attrs->__pod_cpumask Tejun Heo
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

Checking need_more_worker() and calling wake_up_worker() is a repeated
pattern. Let's add kick_pool(), which checks need_more_worker() and
open-codes wake_up_worker(), and replace wake_up_worker() uses. The following
conversions aren't one-to-one:

* __queue_work() was using __need_more_worker() because it knows that
  pool->worklist isn't empty. Switching to kick_pool() adds an extra
  list_empty() test.

* create_worker() always needs to wake up the newly minted worker whether
  there's more work to do or not to avoid triggering hung task check on the
  new task. Keep the current wake_up_process() and still add kick_pool().
  This may lead to an extra wakeup which isn't harmful.

* pwq_adjust_max_active() was explicitly checking whether it needs to wake
  up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
  a worker when necessary, this explicit check is no longer necessary and
  dropped.

* unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
  a need_more_worker() test. This avoids spurious wakeups and shouldn't
  break anything.

wake_up_worker() is dropped as kick_pool() replaces all its users. After
this patch, all paths that wake up a non-rescuer worker to initiate work
item execution use kick_pool(). This will enable future changes to improve
locality.
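
The combined check-and-wake behavior can be sketched in user space as
follows. This is only a model: the Pool class and its "woken" list are
hypothetical stand-ins for struct worker_pool and wake_up_process(), and
locking is omitted.

```python
class Pool:
    def __init__(self):
        self.worklist = []       # pending work items
        self.nr_running = 0      # workers currently running
        self.idle_workers = []   # first entry models first_idle_worker()
        self.woken = []          # records wake-ups instead of wake_up_process()


def need_more_worker(pool):
    """True if there is pending work and nobody is running."""
    return bool(pool.worklist) and pool.nr_running == 0


def kick_pool(pool):
    """Wake the first idle worker if necessary; return whether one was woken."""
    if not need_more_worker(pool) or not pool.idle_workers:
        return False
    pool.woken.append(pool.idle_workers[0])
    return True
```

Because kick_pool() returns whether a wake-up happened, callers such as the
concurrency-management paths can bump their wake-up statistics only when a
worker was actually woken.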

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 87 ++++++++++++++++++++--------------------------
 1 file changed, 37 insertions(+), 50 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a2e6c2be3a06..58aec5cc5722 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -805,11 +805,6 @@ static bool work_is_canceling(struct work_struct *work)
  * they're being called with pool->lock held.
  */
 
-static bool __need_more_worker(struct worker_pool *pool)
-{
-	return !pool->nr_running;
-}
-
 /*
  * Need to wake up a worker?  Called from anything but currently
  * running workers.
@@ -820,7 +815,7 @@ static bool __need_more_worker(struct worker_pool *pool)
  */
 static bool need_more_worker(struct worker_pool *pool)
 {
-	return !list_empty(&pool->worklist) && __need_more_worker(pool);
+	return !list_empty(&pool->worklist) && !pool->nr_running;
 }
 
 /* Can I start working?  Called from busy but !running workers. */
@@ -1090,20 +1085,23 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 }
 
 /**
- * wake_up_worker - wake up an idle worker
- * @pool: worker pool to wake worker from
- *
- * Wake up the first idle worker of @pool.
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
  *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock).
+ * @pool may have pending work items. Wake up worker if necessary. Returns
+ * whether a worker was woken up.
  */
-static void wake_up_worker(struct worker_pool *pool)
+static bool kick_pool(struct worker_pool *pool)
 {
 	struct worker *worker = first_idle_worker(pool);
 
-	if (likely(worker))
-		wake_up_process(worker->task);
+	lockdep_assert_held(&pool->lock);
+
+	if (!need_more_worker(pool) || !worker)
+		return false;
+
+	wake_up_process(worker->task);
+	return true;
 }
 
 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
@@ -1271,10 +1269,9 @@ void wq_worker_sleeping(struct task_struct *task)
 	}
 
 	pool->nr_running--;
-	if (need_more_worker(pool)) {
+	if (kick_pool(pool))
 		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
-		wake_up_worker(pool);
-	}
+
 	raw_spin_unlock_irq(&pool->lock);
 }
 
@@ -1312,10 +1309,8 @@ void wq_worker_tick(struct task_struct *task)
 	wq_cpu_intensive_report(worker->current_func);
 	pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
 
-	if (need_more_worker(pool)) {
+	if (kick_pool(pool))
 		pwq->stats[PWQ_STAT_CM_WAKEUP]++;
-		wake_up_worker(pool);
-	}
 
 	raw_spin_unlock(&pool->lock);
 }
@@ -1752,9 +1747,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 		trace_workqueue_activate_work(work);
 		pwq->nr_active++;
 		insert_work(pwq, work, &pool->worklist, work_flags);
-
-		if (__need_more_worker(pool))
-			wake_up_worker(pool);
+		kick_pool(pool);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
 		insert_work(pwq, work, &pwq->inactive_works, work_flags);
@@ -2160,9 +2153,18 @@ static struct worker *create_worker(struct worker_pool *pool)
 
 	/* start the newly created worker */
 	raw_spin_lock_irq(&pool->lock);
+
 	worker->pool->nr_workers++;
 	worker_enter_idle(worker);
+	kick_pool(pool);
+
+	/*
+	 * @worker is waiting on a completion in kthread() and will trigger hung
+	 * check if not woken up soon. As kick_pool() might not have woken it
+	 * up, wake it up explicitly once more.
+	 */
 	wake_up_process(worker->task);
+
 	raw_spin_unlock_irq(&pool->lock);
 
 	return worker;
@@ -2525,14 +2527,12 @@ __acquires(&pool->lock)
 		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
 
 	/*
-	 * Wake up another worker if necessary.  The condition is always
-	 * false for normal per-cpu workers since nr_running would always
-	 * be >= 1 at this point.  This is used to chain execution of the
-	 * pending work items for WORKER_NOT_RUNNING workers such as the
-	 * UNBOUND and CPU_INTENSIVE ones.
+	 * Kick @pool if necessary. It's always noop for per-cpu worker pools
+	 * since nr_running would always be >= 1 at this point. This is used to
+	 * chain execution of the pending work items for WORKER_NOT_RUNNING
+	 * workers such as the UNBOUND and CPU_INTENSIVE ones.
 	 */
-	if (need_more_worker(pool))
-		wake_up_worker(pool);
+	kick_pool(pool);
 
 	/*
 	 * Record the last pool and clear PENDING which should be the last
@@ -2852,12 +2852,10 @@ static int rescuer_thread(void *__rescuer)
 		put_pwq(pwq);
 
 		/*
-		 * Leave this pool.  If need_more_worker() is %true, notify a
-		 * regular worker; otherwise, we end up with 0 concurrency
-		 * and stalling the execution.
+		 * Leave this pool. Notify regular workers; otherwise, we end up
+		 * with 0 concurrency and stalling the execution.
 		 */
-		if (need_more_worker(pool))
-			wake_up_worker(pool);
+		kick_pool(pool);
 
 		raw_spin_unlock_irq(&pool->lock);
 
@@ -4068,24 +4066,13 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
 	 * is updated and visible.
 	 */
 	if (!freezable || !workqueue_freezing) {
-		bool kick = false;
-
 		pwq->max_active = wq->saved_max_active;
 
 		while (!list_empty(&pwq->inactive_works) &&
-		       pwq->nr_active < pwq->max_active) {
+		       pwq->nr_active < pwq->max_active)
 			pwq_activate_first_inactive(pwq);
-			kick = true;
-		}
 
-		/*
-		 * Need to kick a worker after thawed or an unbound wq's
-		 * max_active is bumped. In realtime scenarios, always kicking a
-		 * worker will cause interference on the isolated cpu cores, so
-		 * let's kick iff work items were activated.
-		 */
-		if (kick)
-			wake_up_worker(pwq->pool);
+		kick_pool(pwq->pool);
 	} else {
 		pwq->max_active = 0;
 	}
@@ -5351,7 +5338,7 @@ static void unbind_workers(int cpu)
 		 * worker blocking could lead to lengthy stalls.  Kick off
 		 * unbound chain execution of currently pending work items.
 		 */
-		wake_up_worker(pool);
+		kick_pool(pool);
 
 		raw_spin_unlock_irq(&pool->lock);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 20/24] workqueue: Add workqueue_attrs->__pod_cpumask
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (18 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 19/24] workqueue: Factor out need_more_worker() check and worker wake-up Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 21/24] workqueue: Implement non-strict affinity scope for unbound workqueues Tejun Heo
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

workqueue_attrs has two uses:

* to specify the required unbound workqueue properties by users

* to match worker_pool's properties to workqueues by core code

For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.

Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.

To enable that, worker_pools need to distinguish the strict affinity that
they have to follow (because that's the restriction coming from the user) and
the soft affinity that they want to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.

This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
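
To make the relationship between ->cpumask and ->__pod_cpumask concrete, here
is a user-space sketch of the calculation done by wq_calc_pod_cpumask() in the
diff below, using Python integers as bitmasks. The function name
calc_pod_cpumask() and the example pod layout (CPUs 0-1 in one pod, 2-3 in
another) are assumptions for illustration only.

```python
def calc_pod_cpumask(cpumask, pod_cpus, online, cpu_going_down=-1):
    """Return the pod cpumask for a workqueue with @cpumask on a given pod.

    All masks are integers with bit N set for CPU N.
    """
    # Does the pod have any online CPUs that the workqueue wants?
    wanted_online = pod_cpus & cpumask & online
    if cpu_going_down >= 0:
        wanted_online &= ~(1 << cpu_going_down)

    if wanted_online == 0:
        # Nothing usable in this pod: fall back to the whole cpumask.
        return cpumask

    # Otherwise use the possible (not only online) CPUs of the pod that
    # the workqueue is allowed on.
    return cpumask & pod_cpus


WQ_CPUMASK = 0b0101   # workqueue restricted to CPUs 0 and 2
POD0 = 0b0011         # assumed pod containing CPUs 0 and 1
POD1 = 0b1100         # assumed pod containing CPUs 2 and 3
ALL_ONLINE = 0b1111
```

With these values, the two pods get pod cpumasks covering CPU 0 and CPU 2
respectively while both keep the same workqueue-wide cpumask, and a pod whose
only wanted CPU is going down falls back to the full cpumask.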

* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
  that the pool's workers must stay within. This is currently always
  ->__pod_cpumask as all boundaries are still strict.

* As a workqueue_attrs can now track both the associated workqueues' cpumask
  and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
  out-argument. Drop @cpumask and instead store the result in
  ->__pod_cpumask.

* The above also simplifies apply_wqattrs_prepare() as the same
  workqueue_attrs can be used to create all pods associated with a
  workqueue. tmp_attrs is dropped.

The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
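
The sharing change can be modelled by looking at what identifies a pool. The
sketch below is a user-space approximation: PoolRegistry is a hypothetical
stand-in for the unbound pool hash, and the key mirrors the fields compared
by wqattrs_equal() after this patch (nice, cpumask and __pod_cpumask).

```python
class PoolRegistry:
    """Hands out one pool id per distinct (nice, cpumask, pod_cpumask)."""

    def __init__(self):
        self.pools = {}

    def get_unbound_pool(self, nice, cpumask, pod_cpumask):
        key = (nice, cpumask, pod_cpumask)
        if key not in self.pools:
            self.pools[key] = len(self.pools)   # allocate a new pool id
        return self.pools[key]


reg = PoolRegistry()

# First workqueue: cpumask {0, 2}, CPUs 0 and 2 in different pods.
a_pod0 = reg.get_unbound_pool(0, 0b0101, 0b0001)
a_pod1 = reg.get_unbound_pool(0, 0b0101, 0b0100)

# Second workqueue: cpumask {0, 2, 3}, CPUs 2 and 3 in the same pod.
b_pod0 = reg.get_unbound_pool(0, 0b1101, 0b0001)
b_pod1 = reg.get_unbound_pool(0, 0b1101, 0b1100)
```

Here a_pod0 and b_pod0 have the same pod cpumask but different workqueue
cpumasks, so they end up as separate pools, while a third workqueue with the
same cpumask as the first would reuse the first workqueue's pools.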

While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h | 16 +++++++++
 kernel/workqueue.c        | 74 ++++++++++++++++++++-------------------
 2 files changed, 54 insertions(+), 36 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a01b5dcbbeb9..7a0fc0919e0a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -149,9 +149,25 @@ struct workqueue_attrs {
 
 	/**
 	 * @cpumask: allowed CPUs
+	 *
+	 * Work items in this workqueue are affine to these CPUs and not allowed
+	 * to execute on other CPUs. A pool serving a workqueue must have the
+	 * same @cpumask.
 	 */
 	cpumask_var_t cpumask;
 
+	/**
+	 * @__pod_cpumask: internal attribute used to create per-pod pools
+	 *
+	 * Internal use only.
+	 *
+	 * Per-pod unbound worker pools are used to improve locality. Always a
+	 * subset of ->cpumask. A workqueue can be associated with multiple
+	 * worker pools with disjoint @__pod_cpumask's. Whether the enforcement
+	 * of a pool's @__pod_cpumask is strict depends on @affn_strict.
+	 */
+	cpumask_var_t __pod_cpumask;
+
 	/*
 	 * Below fields aren't properties of a worker_pool. They only modify how
 	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 58aec5cc5722..daebc28d09ab 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2029,6 +2029,11 @@ static struct worker *alloc_worker(int node)
 	return worker;
 }
 
+static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
+{
+	return pool->attrs->__pod_cpumask;
+}
+
 /**
  * worker_attach_to_pool() - attach a worker to a pool
  * @worker: worker to be attached
@@ -2054,7 +2059,7 @@ static void worker_attach_to_pool(struct worker *worker,
 		kthread_set_per_cpu(worker->task, pool->cpu);
 
 	if (worker->rescue_wq)
-		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+		set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));
 
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
@@ -2146,7 +2151,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	}
 
 	set_user_nice(worker->task, pool->attrs->nice);
-	kthread_bind_mask(worker->task, pool->attrs->cpumask);
+	kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
 
 	/* successful, attach the worker to the pool */
 	worker_attach_to_pool(worker, pool);
@@ -3652,6 +3657,7 @@ void free_workqueue_attrs(struct workqueue_attrs *attrs)
 {
 	if (attrs) {
 		free_cpumask_var(attrs->cpumask);
+		free_cpumask_var(attrs->__pod_cpumask);
 		kfree(attrs);
 	}
 }
@@ -3673,6 +3679,8 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
 		goto fail;
 	if (!alloc_cpumask_var(&attrs->cpumask, GFP_KERNEL))
 		goto fail;
+	if (!alloc_cpumask_var(&attrs->__pod_cpumask, GFP_KERNEL))
+		goto fail;
 
 	cpumask_copy(attrs->cpumask, cpu_possible_mask);
 	attrs->affn_scope = wq_affn_dfl;
@@ -3687,6 +3695,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 {
 	to->nice = from->nice;
 	cpumask_copy(to->cpumask, from->cpumask);
+	cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
 
 	/*
 	 * Unlike hash and equality test, copying shouldn't ignore wq-only
@@ -3705,6 +3714,8 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 	hash = jhash_1word(attrs->nice, hash);
 	hash = jhash(cpumask_bits(attrs->cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
+	hash = jhash(cpumask_bits(attrs->__pod_cpumask),
+		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	return hash;
 }
 
@@ -3716,6 +3727,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
 		return false;
 	if (!cpumask_equal(a->cpumask, b->cpumask))
 		return false;
+	if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
+		return false;
 	return true;
 }
 
@@ -3952,9 +3965,9 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 		}
 	}
 
-	/* If cpumask is contained inside a NUMA pod, that's our NUMA node */
+	/* If __pod_cpumask is contained inside a NUMA pod, that's our node */
 	for (pod = 0; pod < pt->nr_pods; pod++) {
-		if (cpumask_subset(attrs->cpumask, pt->pod_cpus[pod])) {
+		if (cpumask_subset(attrs->__pod_cpumask, pt->pod_cpus[pod])) {
 			node = pt->pod_node[pod];
 			break;
 		}
@@ -4147,11 +4160,10 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  * @attrs: the wq_attrs of the default pwq of the target workqueue
  * @cpu: the target CPU
  * @cpu_going_down: if >= 0, the CPU to consider as offline
- * @cpumask: outarg, the resulting cpumask
  *
  * Calculate the cpumask a workqueue with @attrs should use on @pod. If
  * @cpu_going_down is >= 0, that cpu is considered offline during calculation.
- * The result is stored in @cpumask.
+ * The result is stored in @attrs->__pod_cpumask.
  *
  * If pod affinity is not enabled, @attrs->cpumask is always used. If enabled
  * and @pod has online CPUs requested by @attrs, the returned cpumask is the
@@ -4159,27 +4171,27 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  *
  * The caller is responsible for ensuring that the cpumask of @pod stays stable.
  */
-static void wq_calc_pod_cpumask(const struct workqueue_attrs *attrs, int cpu,
-				int cpu_going_down, cpumask_t *cpumask)
+static void wq_calc_pod_cpumask(struct workqueue_attrs *attrs, int cpu,
+				int cpu_going_down)
 {
 	const struct wq_pod_type *pt = wqattrs_pod_type(attrs);
 	int pod = pt->cpu_pod[cpu];
 
 	/* does @pod have any online CPUs @attrs wants? */
-	cpumask_and(cpumask, pt->pod_cpus[pod], attrs->cpumask);
-	cpumask_and(cpumask, cpumask, cpu_online_mask);
+	cpumask_and(attrs->__pod_cpumask, pt->pod_cpus[pod], attrs->cpumask);
+	cpumask_and(attrs->__pod_cpumask, attrs->__pod_cpumask, cpu_online_mask);
 	if (cpu_going_down >= 0)
-		cpumask_clear_cpu(cpu_going_down, cpumask);
+		cpumask_clear_cpu(cpu_going_down, attrs->__pod_cpumask);
 
-	if (cpumask_empty(cpumask)) {
-		cpumask_copy(cpumask, attrs->cpumask);
+	if (cpumask_empty(attrs->__pod_cpumask)) {
+		cpumask_copy(attrs->__pod_cpumask, attrs->cpumask);
 		return;
 	}
 
 	/* yeap, return possible CPUs in @pod that @attrs wants */
-	cpumask_and(cpumask, attrs->cpumask, pt->pod_cpus[pod]);
+	cpumask_and(attrs->__pod_cpumask, attrs->cpumask, pt->pod_cpus[pod]);
 
-	if (cpumask_empty(cpumask))
+	if (cpumask_empty(attrs->__pod_cpumask))
 		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
 				"possible intersect\n");
 }
@@ -4233,7 +4245,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 		      const cpumask_var_t unbound_cpumask)
 {
 	struct apply_wqattrs_ctx *ctx;
-	struct workqueue_attrs *new_attrs, *tmp_attrs;
+	struct workqueue_attrs *new_attrs;
 	int cpu;
 
 	lockdep_assert_held(&wq_pool_mutex);
@@ -4245,8 +4257,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_cpu_ids), GFP_KERNEL);
 
 	new_attrs = alloc_workqueue_attrs();
-	tmp_attrs = alloc_workqueue_attrs();
-	if (!ctx || !new_attrs || !tmp_attrs)
+	if (!ctx || !new_attrs)
 		goto out_free;
 
 	/*
@@ -4259,13 +4270,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbound_cpumask);
 	if (unlikely(cpumask_empty(new_attrs->cpumask)))
 		cpumask_copy(new_attrs->cpumask, unbound_cpumask);
-
-	/*
-	 * We may create multiple pwqs with differing cpumasks.  Make a
-	 * copy of @new_attrs which will be modified and used to obtain
-	 * pools.
-	 */
-	copy_workqueue_attrs(tmp_attrs, new_attrs);
+	cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask);
 
 	/*
 	 * If something goes wrong during CPU up/down, we'll fall back to
@@ -4281,8 +4286,8 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 			ctx->dfl_pwq->refcnt++;
 			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
 		} else {
-			wq_calc_pod_cpumask(new_attrs, cpu, -1, tmp_attrs->cpumask);
-			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, tmp_attrs);
+			wq_calc_pod_cpumask(new_attrs, cpu, -1);
+			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs);
 			if (!ctx->pwq_tbl[cpu])
 				goto out_free;
 		}
@@ -4291,14 +4296,13 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	/* save the user configured attrs and sanitize it. */
 	copy_workqueue_attrs(new_attrs, attrs);
 	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
+	cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask);
 	ctx->attrs = new_attrs;
 
 	ctx->wq = wq;
-	free_workqueue_attrs(tmp_attrs);
 	return ctx;
 
 out_free:
-	free_workqueue_attrs(tmp_attrs);
 	free_workqueue_attrs(new_attrs);
 	apply_wqattrs_cleanup(ctx);
 	return ERR_PTR(-ENOMEM);
@@ -4423,7 +4427,6 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 	int cpu_off = online ? -1 : cpu;
 	struct pool_workqueue *old_pwq = NULL, *pwq;
 	struct workqueue_attrs *target_attrs;
-	cpumask_t *cpumask;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -4436,15 +4439,14 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 	 * CPU hotplug exclusion.
 	 */
 	target_attrs = wq_update_pod_attrs_buf;
-	cpumask = target_attrs->cpumask;
-
-	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+	copy_workqueue_attrs(target_attrs, wq->dfl_pwq->pool->attrs);
 
 	/* nothing to do if the target cpumask matches the current pwq */
-	wq_calc_pod_cpumask(wq->dfl_pwq->pool->attrs, cpu, cpu_off, cpumask);
+	wq_calc_pod_cpumask(target_attrs, cpu, cpu_off);
 	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
 					lockdep_is_held(&wq_pool_mutex));
-	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
+	if (cpumask_equal(target_attrs->__pod_cpumask,
+			  pwq->pool->attrs->cpumask))
 		return;
 
 	/* create a new pwq */
@@ -5371,7 +5373,7 @@ static void rebind_workers(struct worker_pool *pool)
 	for_each_pool_worker(worker, pool) {
 		kthread_set_per_cpu(worker->task, pool->cpu);
 		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
-						  pool->attrs->cpumask) < 0);
+						  pool_allowed_cpus(pool)) < 0);
 	}
 
 	raw_spin_lock_irq(&pool->lock);
-- 
2.40.1



* [PATCH 21/24] workqueue: Implement non-strict affinity scope for unbound workqueues
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (19 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 20/24] workqueue: Add workqueue_attrs->__pod_cpumask Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 22/24] workqueue: Add "Affinity Scopes and Performance" section to documentation Tejun Heo
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.

While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.

While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.

This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring the task back within the pod if the worker
is outside, i.e., work items start executing within their affinity scope but
can be migrated outside as the scheduler sees fit. This removes the hard cap
on utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
  ->__pod_cpumask so that the workers are allowed to run on any CPU that
  the associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
  the field to a CPU within the pod.
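
The two rules above can be modelled in user space. What follows is only a
rough sketch for illustration, not the kernel code: CPU masks are plain
frozensets, PoolAttrs and Worker are hypothetical stand-ins for
workqueue_attrs and the worker task, only unbound pools are covered, and a
simple round-robin pick is assumed in place of cpumask_any_distribute().

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class PoolAttrs:
    """Hypothetical stand-in for the workqueue_attrs fields used here."""
    cpumask: frozenset          # CPUs the workqueue is allowed to use
    pod_cpumask: frozenset      # CPUs of the pod this pool serves
    affn_strict: bool = False   # non-strict is the new default


@dataclass
class Worker:
    """Hypothetical stand-in for the idle worker's task."""
    wake_cpu: int               # CPU the scheduler is hinted to wake it on


def pool_allowed_cpus(attrs):
    """CPUs a worker may run on: the pod only when the scope is strict."""
    return attrs.pod_cpumask if attrs.affn_strict else attrs.cpumask


_round_robin = count()


def kick_pool(attrs, worker):
    """Repatriate the worker if needed; return True if wake_cpu was changed."""
    if attrs.affn_strict or worker.wake_cpu in attrs.pod_cpumask:
        return False
    pod = sorted(attrs.pod_cpumask)
    # Assumption: round-robin stands in for cpumask_any_distribute().
    worker.wake_cpu = pod[next(_round_robin) % len(pod)]
    return True


if __name__ == "__main__":
    all_cpus = frozenset(range(8))
    pod0 = PoolAttrs(cpumask=all_cpus, pod_cpumask=frozenset(range(4)))
    stray = Worker(wake_cpu=6)          # last ran outside its pod
    print("allowed:", sorted(pool_allowed_cpus(pod0)))
    print("repatriated:", kick_pool(pod0, stray), "wake_cpu:", stray.wake_cpu)
```

In this model a strict pool never needs repatriation because its workers
cannot leave the pod in the first place, which mirrors why the kernel check
is gated on !affn_strict.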

This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.

While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 Documentation/core-api/workqueue.rst | 30 +++++++++---
 include/linux/workqueue.h            | 11 +++++
 kernel/workqueue.c                   | 73 +++++++++++++++++++++++++++-
 tools/workqueue/wq_dump.py           | 16 ++++--
 tools/workqueue/wq_monitor.py        | 21 +++++---
 5 files changed, 131 insertions(+), 20 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 56af317508c9..c73a6df6a118 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -353,9 +353,10 @@ Affinity Scopes
 An unbound workqueue groups CPUs according to its affinity scope to improve
 cache locality. For example, if a workqueue is using the default affinity
 scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be processed by a
-worker running on one of the CPUs which share the last level cache with the
-issuing CPU.
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.
 
 Workqueue currently supports the following five affinity scopes.
 
@@ -391,6 +392,21 @@ directory.
 ``affinity_scope``
   Read to see the current affinity scope. Write to change.
 
+``affinity_strict``
+  0 by default indicating that affinity scopes are not strict. When a work
+  item starts execution, workqueue makes a best-effort attempt to ensure
+  that the worker is inside its affinity scope, which is called
+  repatriation. Once started, the scheduler is free to move the worker
+  anywhere in the system as it sees fit. This enables benefiting from scope
+  locality while still being able to utilize other CPUs if necessary and
+  available.
+
+  If set to 1, all workers of the scope are guaranteed always to be in the
+  scope. This may be useful when crossing affinity scopes has other
+  implications, for example, in terms of power consumption or workload
+  isolation. Strict NUMA scope can also be used to match the workqueue
+  behavior of older kernels.
+
 
 Examining Configuration
 =======================
@@ -475,21 +491,21 @@ Monitoring
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
 
   $ tools/workqueue/wq_monitor.py events
-                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
   events                      18545     0      6.1       0       5       -       -
   events_highpri                  8     0      0.0       0       0       -       -
   events_long                     3     0      0.0       0       0       -       -
-  events_unbound              38306     0      0.1       -       -       -       -
+  events_unbound              38306     0      0.1       -       7       -       -
   events_freezable                0     0      0.0       0       0       -       -
   events_power_efficient      29598     0      0.2       0       0       -       -
   events_freezable_power_        10     0      0.0       0       0       -       -
   sock_diag_events                0     0      0.0       0       0       -       -
 
-                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
   events                      18548     0      6.1       0       5       -       -
   events_highpri                  8     0      0.0       0       0       -       -
   events_long                     3     0      0.0       0       0       -       -
-  events_unbound              38322     0      0.1       -       -       -       -
+  events_unbound              38322     0      0.1       -       7       -       -
   events_freezable                0     0      0.0       0       0       -       -
   events_power_efficient      29603     0      0.2       0       0       -       -
   events_freezable_power_        10     0      0.0       0       0       -       -
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7a0fc0919e0a..751eb915e3f0 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -168,6 +168,17 @@ struct workqueue_attrs {
 	 */
 	cpumask_var_t __pod_cpumask;
 
+	/**
+	 * @affn_strict: affinity scope is strict
+	 *
+	 * If clear, workqueue will make a best-effort attempt at starting the
+	 * worker inside @__pod_cpumask but the scheduler is free to migrate it
+	 * outside.
+	 *
+	 * If set, workers are only allowed to run inside @__pod_cpumask.
+	 */
+	bool affn_strict;
+
 	/*
 	 * Below fields aren't properties of a worker_pool. They only modify how
 	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index daebc28d09ab..3ce4c18e139c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -210,6 +210,7 @@ enum pool_workqueue_stats {
 	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_REPATRIATED,	/* unbound workers brought back into scope */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
 	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
 
@@ -1094,13 +1095,41 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 static bool kick_pool(struct worker_pool *pool)
 {
 	struct worker *worker = first_idle_worker(pool);
+	struct task_struct *p;
 
 	lockdep_assert_held(&pool->lock);
 
 	if (!need_more_worker(pool) || !worker)
 		return false;
 
-	wake_up_process(worker->task);
+	p = worker->task;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Idle @worker is about to execute @work and waking up provides an
+	 * opportunity to migrate @worker at a lower cost by setting the task's
+	 * wake_cpu field. Let's see if we want to move @worker to improve
+	 * execution locality.
+	 *
+	 * We're waking the worker that went idle the latest and there's some
+	 * chance that @worker is marked idle but hasn't gone off CPU yet. If
+	 * so, setting the wake_cpu won't do anything. As this is a best-effort
+	 * optimization and the race window is narrow, let's leave as-is for
+	 * now. If this becomes pronounced, we can skip over workers which are
+	 * still on cpu when picking an idle worker.
+	 *
+	 * If @pool has non-strict affinity, @worker might have ended up outside
+	 * its affinity scope. Repatriate.
+	 */
+	if (!pool->attrs->affn_strict &&
+	    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
+		struct work_struct *work = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+		p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
+		get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
+	}
+#endif
+	wake_up_process(p);
 	return true;
 }
 
@@ -2031,7 +2060,10 @@ static struct worker *alloc_worker(int node)
 
 static cpumask_t *pool_allowed_cpus(struct worker_pool *pool)
 {
-	return pool->attrs->__pod_cpumask;
+	if (pool->cpu < 0 && pool->attrs->affn_strict)
+		return pool->attrs->__pod_cpumask;
+	else
+		return pool->attrs->cpumask;
 }
 
 /**
@@ -3696,6 +3728,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 	to->nice = from->nice;
 	cpumask_copy(to->cpumask, from->cpumask);
 	cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
+	to->affn_strict = from->affn_strict;
 
 	/*
 	 * Unlike hash and equality test, copying shouldn't ignore wq-only
@@ -3716,6 +3749,7 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
 	hash = jhash(cpumask_bits(attrs->__pod_cpumask),
 		     BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
+	hash = jhash_1word(attrs->affn_strict, hash);
 	return hash;
 }
 
@@ -3729,6 +3763,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
 		return false;
 	if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
 		return false;
+	if (a->affn_strict != b->affn_strict)
+		return false;
 	return true;
 }
 
@@ -5792,6 +5828,7 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
  *  nice		RW int	: nice value of the workers
  *  cpumask		RW mask	: bitmask of allowed CPUs for the workers
  *  affinity_scope	RW str  : worker CPU affinity scope (cache, numa, none)
+ *  affinity_strict	RW bool : worker CPU affinity is strict
  */
 struct wq_device {
 	struct workqueue_struct		*wq;
@@ -5971,10 +6008,42 @@ static ssize_t wq_affn_scope_store(struct device *dev,
 	return ret ?: count;
 }
 
+static ssize_t wq_affinity_strict_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n",
+			 wq->unbound_attrs->affn_strict);
+}
+
+static ssize_t wq_affinity_strict_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int v, ret = -ENOMEM;
+
+	if (sscanf(buf, "%d", &v) != 1)
+		return -EINVAL;
+
+	apply_wqattrs_lock();
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (attrs) {
+		attrs->affn_strict = (bool)v;
+		ret = apply_workqueue_attrs_locked(wq, attrs);
+	}
+	apply_wqattrs_unlock();
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
 	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
+	__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
 	__ATTR_NULL,
 };
 
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index 43ab71a193b8..d0df5833f2c1 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -36,10 +36,11 @@ Workqueue CPU -> pool
 Lists all workqueues along with their type and worker pool association. For
 each workqueue:
 
-  NAME TYPE POOL_ID...
+  NAME TYPE[,FLAGS] POOL_ID...
 
   NAME      name of the workqueue
   TYPE      percpu, unbound or ordered
+  FLAGS     S: strict affinity scope
   POOL_ID   worker pool ID associated with each possible CPU
 """
 
@@ -138,13 +139,16 @@ max_ref_len = 0
         print(f'cpu={pool.cpu.value_():3}', end='')
     else:
         print(f'cpus={cpumask_str(pool.attrs.cpumask)}', end='')
+        print(f' pod_cpus={cpumask_str(pool.attrs.__pod_cpumask)}', end='')
+        if pool.attrs.affn_strict:
+            print(' strict', end='')
     print('')
 
 print('')
 print('Workqueue CPU -> pool')
 print('=====================')
 
-print('[    workqueue \ CPU            ', end='')
+print('[    workqueue     \\     type   CPU', end='')
 for cpu in for_each_possible_cpu(prog):
     print(f' {cpu:{max_pool_id_len}}', end='')
 print(' dfl]')
@@ -153,11 +157,15 @@ print(' dfl]')
     print(f'{wq.name.string_().decode()[-24:]:24}', end='')
     if wq.flags & WQ_UNBOUND:
         if wq.flags & WQ_ORDERED:
-            print(' ordered', end='')
+            print(' ordered   ', end='')
         else:
             print(' unbound', end='')
+            if wq.unbound_attrs.affn_strict:
+                print(',S ', end='')
+            else:
+                print('   ', end='')
     else:
-        print(' percpu ', end='')
+        print(' percpu    ', end='')
 
     for cpu in for_each_possible_cpu(prog):
         pool_id = per_cpu_ptr(wq.cpu_pwq, cpu)[0].pool.id.value_()
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
index 6e258d123e8c..a8856a9c45dc 100644
--- a/tools/workqueue/wq_monitor.py
+++ b/tools/workqueue/wq_monitor.py
@@ -20,8 +20,11 @@ https://github.com/osandov/drgn.
            and got excluded from concurrency management to avoid stalling
            other work items.
 
-  CMwake   The number of concurrency-management wake-ups while executing a
-           work item of the workqueue.
+  CMW/RPR  For per-cpu workqueues, the number of concurrency-management
+           wake-ups while executing a work item of the workqueue. For
+           unbound workqueues, the number of times a worker was repatriated
+           to its affinity scope after being migrated to an off-scope CPU by
+           the scheduler.
 
   mayday   The number of times the rescuer was requested while waiting for
            new worker creation.
@@ -65,6 +68,7 @@ PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']	# work items completed exec
 PWQ_STAT_CPU_TIME       = prog['PWQ_STAT_CPU_TIME']     # total CPU time consumed
 PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']    # concurrency-management worker wakeups
+PWQ_STAT_REPATRIATED    = prog['PWQ_STAT_REPATRIATED']  # unbound workers brought back into scope
 PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']	# maydays to rescuer
 PWQ_STAT_RESCUED        = prog['PWQ_STAT_RESCUED']	# linked work items executed by rescuer
 PWQ_NR_STATS            = prog['PWQ_NR_STATS']
@@ -89,22 +93,25 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
                  'cpu_time'             : self.stats[PWQ_STAT_CPU_TIME],
                  'cpu_intensive'        : self.stats[PWQ_STAT_CPU_INTENSIVE],
                  'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
+                 'repatriated'          : self.stats[PWQ_STAT_REPATRIATED],
                  'mayday'               : self.stats[PWQ_STAT_MAYDAY],
                  'rescued'              : self.stats[PWQ_STAT_RESCUED], }
 
     def table_header_str():
         return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
-            f'{"CPUitsv":>7} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+            f'{"CPUitsv":>7} {"CMW/RPR":>7} {"mayday":>7} {"rescued":>7}'
 
     def table_row_str(self):
         cpu_intensive = '-'
-        cm_wakeup = '-'
+        cmw_rpr = '-'
         mayday = '-'
         rescued = '-'
 
-        if not self.unbound:
+        if self.unbound:
+            cmw_rpr = str(self.stats[PWQ_STAT_REPATRIATED])
+        else:
             cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
-            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+            cmw_rpr = str(self.stats[PWQ_STAT_CM_WAKEUP])
 
         if self.mem_reclaim:
             mayday = str(self.stats[PWQ_STAT_MAYDAY])
@@ -115,7 +122,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
               f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
               f'{cpu_intensive:>7} ' \
-              f'{cm_wakeup:>7} ' \
+              f'{cmw_rpr:>7} ' \
               f'{mayday:>7} ' \
               f'{rescued:>7} '
         return out.rstrip(':')
-- 
2.40.1



* [PATCH 22/24] workqueue: Add "Affinity Scopes and Performance" section to documentation
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (20 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 21/24] workqueue: Implement non-strict affinity scope for unbound workqueues Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 23/24] workqueue: Add pool_workqueue->cpu Tejun Heo
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

With affinity scopes and their strictness setting added, unbound workqueues
should now be able to cover a wide variety of configurations and use cases.
Unfortunately, the performance picture is not entirely straightforward due
to a trade-off between efficiency and work-conservation, which in some
situations necessitates manual configuration.

This patch adds an "Affinity Scopes and Performance" section to
Documentation/core-api/workqueue.rst which illustrates the trade-off with a
set of experiments and provides some guidelines.
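
For anyone wanting to double-check the percentages quoted alongside the
result tables, here is a small helper. It is only a convenience sketch: the
mean read bandwidths are copied from the three tables added by this patch,
while the dictionary layout and the relative_change() name are made up for
illustration.

```python
# Mean read bandwidths (MiBps) copied from the three result tables.
BANDWIDTH = {
    "scenario 1 (24 issuers)": {
        "system": 1159.40, "cache": 1166.40, "cache (strict)": 1166.00,
    },
    "scenario 2 (8 issuers)": {
        "system": 1155.40, "cache": 1154.40, "cache (strict)": 1112.00,
    },
    "scenario 3 (4 issuers)": {
        "system": 993.60, "cache": 973.40, "cache (strict)": 828.20,
    },
}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    for name, row in BANDWIDTH.items():
        base = row["system"]
        deltas = ", ".join(
            f"{scope}: {relative_change(bw, base):+.1f}%"
            for scope, bw in row.items()
            if scope != "system"
        )
        print(f"{name}: vs system -> {deltas}")
```

The "system" scope is used as the baseline because the documentation
compares the cache-affine configurations against it.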

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/core-api/workqueue.rst | 184 ++++++++++++++++++++++++++-
 1 file changed, 179 insertions(+), 5 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index c73a6df6a118..4a8e764f41ae 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -1,6 +1,6 @@
-====================================
-Concurrency Managed Workqueue (cmwq)
-====================================
+=========
+Workqueue
+=========
 
 :Date: September, 2010
 :Author: Tejun Heo <tj@kernel.org>
@@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle.
 When a new work item gets queued, the worker begins executing again.
 
 
-Why cmwq?
-=========
+Why Concurrency Managed Workqueue?
+==================================
 
 In the original wq implementation, a multi threaded (MT) wq had one
 worker thread per CPU and a single threaded (ST) wq had one worker
@@ -408,6 +408,180 @@ directory.
   behavior of older kernels.
 
 
+Affinity Scopes and Performance
+===============================
+
+It'd be ideal if an unbound workqueue's behavior were optimal for the vast
+majority of use cases without further tuning. Unfortunately, in the current
+kernel, there exists a pronounced trade-off between locality and utilization
+necessitating explicit configurations when workqueues are heavily used.
+
+Higher locality leads to higher efficiency where more work is performed for
+the same number of consumed CPU cycles. However, higher locality may also
+cause lower overall system utilization if the work items are not spread
+enough across the affinity scopes by the issuers. The following performance
+testing with dm-crypt clearly illustrates this trade-off.
+
+The tests are run on a CPU with 12-cores/24-threads split across four L3
+caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+``/dev/dm-0`` is a dm-crypt device created on an NVMe SSD (Samsung 990 PRO)
+and opened with ``cryptsetup`` with default settings.
+
+
+Scenario 1: Enough issuers and work spread across the machine
+-------------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+    --name=iops-test-job --verify=sha512
+
+There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+makes ``fio`` generate and read back the content each time which makes
+execution locality matter between the issuer and ``kcryptd``. The following
+are the read bandwidths and CPU utilizations depending on different affinity
+scope settings on ``kcryptd`` measured over five runs. Bandwidths are in
+MiBps, and CPU util in percent.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1159.40 ±1.34
+     - 99.31 ±0.02
+
+   * - cache
+     - 1166.40 ±0.89
+     - 99.34 ±0.01
+
+   * - cache (strict)
+     - 1166.00 ±0.71
+     - 99.35 ±0.01
+
+With enough issuers spread across the system, there is no downside to
+"cache", strict or otherwise. All three configurations saturate the whole
+machine but the cache-affine ones outperform by 0.6% thanks to improved
+locality.
+
+
+Scenario 2: Fewer issuers, enough work for saturation
+-----------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+The only difference from the previous scenario is ``--numjobs=8``. There are
+a third as many issuers but there is still enough total work to saturate the
+system.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1155.40 ±0.89
+     - 97.41 ±0.05
+
+   * - cache
+     - 1154.40 ±1.14
+     - 96.15 ±0.09
+
+   * - cache (strict)
+     - 1112.00 ±4.64
+     - 93.26 ±0.35
+
+This is more than enough work to saturate the system. Both "system" and
+"cache" are nearly saturating the machine but not fully. "cache" is using
+less CPU but the better efficiency puts it at the same bandwidth as
+"system".
+
+Eight issuers moving around over four L3 cache scopes still allow "cache
+(strict)" to mostly saturate the machine but the loss of work conservation
+is now starting to hurt with 3.7% bandwidth loss.
+
+
+Scenario 3: Even fewer issuers, not enough work to saturate
+-----------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+Again, the only difference is ``--numjobs=4``. With the number of issuers
+reduced to four, there now isn't enough work to saturate the whole system
+and the bandwidth becomes dependent on completion latencies.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 993.60 ±1.82
+     - 75.49 ±0.06
+
+   * - cache
+     - 973.40 ±1.52
+     - 74.90 ±0.07
+
+   * - cache (strict)
+     - 828.20 ±4.49
+     - 66.84 ±0.29
+
+Now, the tradeoff between locality and utilization is clearer. "cache" shows
+2% bandwidth loss compared to "system" and "cache (strict)" a whopping 17%.
+
+
+Conclusion and Recommendations
+------------------------------
+
+In the above experiments, the efficiency advantage of the "cache" affinity
+scope over "system" is, while consistent and noticeable, small. However, the
+impact is dependent on the distances between the scopes and may be more
+pronounced in processors with more complex topologies.
+
+While the loss of work-conservation in certain scenarios hurts, it is a lot
+better than "cache (strict)" and maximizing workqueue utilization is
+unlikely to be the common case anyway. As such, "cache" is the default
+affinity scope for unbound pools.
+
+* As there is no one option which is great for most cases, workqueue usages
+  that may consume a significant amount of CPU are recommended to configure
+  the workqueues using ``apply_workqueue_attrs()`` and/or enable
+  ``WQ_SYSFS``.
+
+* An unbound workqueue with strict "cpu" affinity scope behaves the same as
+  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to the
+  latter and an unbound workqueue provides a lot more flexibility.
+
+* Affinity scopes are introduced in Linux v6.5. To emulate the previous
+  behavior, use strict "numa" affinity scope.
+
+* The loss of work-conservation in non-strict affinity scopes is likely
+  originating from the scheduler. There is no theoretical reason why the
+  kernel wouldn't be able to do the right thing and maintain
+  work-conservation in most cases. As such, it is possible that future
+  scheduler improvements may make most of these tunables unnecessary.
+
+
 Examining Configuration
 =======================
 
-- 
2.40.1



* [PATCH 23/24] workqueue: Add pool_workqueue->cpu
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (21 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 22/24] workqueue: Add "Affinity Scopes and Performance" section to documentation Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:17 ` [PATCH 24/24] workqueue: Implement localize-to-issuing-CPU for unbound workqueues Tejun Heo
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

For both per-cpu and unbound workqueues, pwq's (pool_workqueue's) are
per-cpu. For per-cpu workqueues, we can find out the associated CPU from
pwq->pool->cpu but unbound pools don't have specific CPUs associated. Let's
add pwq->cpu so that given an unbound work item, we can determine which CPU
it was queued on through get_work_pwq(work)->cpu.

This will be used to improve execution locality on unbound workqueues.
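
As a rough illustration of what pwq->cpu records, here is a toy Python model,
not the kernel code. The Pwq class, build_pwq_table() and the pod layout are
hypothetical names made up for this sketch. It assumes, as in the diff that
follows, that the default pwq is tagged with -1 and that every other pwq is
tagged with the CPU it was created for while sharing a pool with the rest of
its pod.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pwq:
    # Hypothetical stand-in for pwq->pool: the set of CPUs the pool serves.
    pool_cpus: frozenset
    # Stand-in for pwq->cpu; -1 marks the default pwq with no specific CPU.
    cpu: int


def build_pwq_table(pods, ordered=False):
    """Return (dfl_pwq, {cpu: pwq}) for a list of pods (lists of CPU ids)."""
    all_cpus = frozenset(cpu for pod in pods for cpu in pod)
    dfl_pwq = Pwq(all_cpus, -1)
    table = {}
    for pod in pods:
        for cpu in pod:
            if ordered:
                # Ordered workqueues use the default pwq for every CPU.
                table[cpu] = dfl_pwq
            else:
                # One pwq per CPU, backed by the pool of the CPU's pod.
                table[cpu] = Pwq(frozenset(pod), cpu)
    return dfl_pwq, table


if __name__ == "__main__":
    _, tbl = build_pwq_table([[0, 1], [2, 3]])
    for cpu, pwq in sorted(tbl.items()):
        print(cpu, pwq.cpu, sorted(pwq.pool_cpus))
```

In this model, looking up the pwq of a queued work item is enough to recover
the issuing CPU even though the backing pool spans several CPUs.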

NOT_FOR_UPSTREAM
---
 kernel/workqueue.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3ce4c18e139c..4efb0bd6f2e0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -226,6 +226,7 @@ enum pool_workqueue_stats {
 struct pool_workqueue {
 	struct worker_pool	*pool;		/* I: the associated pool */
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
+	int			cpu;		/* I: the associated CPU */
 	int			work_color;	/* L: current color */
 	int			flush_color;	/* L: flushing color */
 	int			refcnt;		/* L: reference count */
@@ -4131,7 +4132,7 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
 
 /* initialize newly allocated @pwq which is associated with @wq and @pool */
 static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
-		     struct worker_pool *pool)
+		     struct worker_pool *pool, int cpu)
 {
 	BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
@@ -4139,6 +4140,7 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 
 	pwq->pool = pool;
 	pwq->wq = wq;
+	pwq->cpu = cpu;
 	pwq->flush_color = -1;
 	pwq->refcnt = 1;
 	INIT_LIST_HEAD(&pwq->inactive_works);
@@ -4169,8 +4171,9 @@ static void link_pwq(struct pool_workqueue *pwq)
 }
 
 /* obtain a pool matching @attr and create a pwq associating the pool and @wq */
-static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
-					const struct workqueue_attrs *attrs)
+static struct pool_workqueue *
+alloc_unbound_pwq(struct workqueue_struct *wq,
+		  const struct workqueue_attrs *attrs, int cpu)
 {
 	struct worker_pool *pool;
 	struct pool_workqueue *pwq;
@@ -4187,7 +4190,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 		return NULL;
 	}
 
-	init_pwq(pwq, wq, pool);
+	init_pwq(pwq, wq, pool, cpu);
 	return pwq;
 }
 
@@ -4313,7 +4316,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	 * the default pwq covering whole @attrs->cpumask.  Always create
 	 * it even if we don't use it immediately.
 	 */
-	ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
+	ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs, -1);
 	if (!ctx->dfl_pwq)
 		goto out_free;
 
@@ -4323,7 +4326,7 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 			ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
 		} else {
 			wq_calc_pod_cpumask(new_attrs, cpu, -1);
-			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs);
+			ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs, cpu);
 			if (!ctx->pwq_tbl[cpu])
 				goto out_free;
 		}
@@ -4486,7 +4489,7 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu, bool online)
 		return;
 
 	/* create a new pwq */
-	pwq = alloc_unbound_pwq(wq, target_attrs);
+	pwq = alloc_unbound_pwq(wq, target_attrs, cpu);
 	if (!pwq) {
 		pr_warn("workqueue: allocation failed while updating CPU pod affinity of \"%s\"\n",
 			wq->name);
@@ -4530,7 +4533,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			if (!*pwq_p)
 				goto enomem;
 
-			init_pwq(*pwq_p, wq, pool);
+			init_pwq(*pwq_p, wq, pool, cpu);
 
 			mutex_lock(&wq->mutex);
 			link_pwq(*pwq_p);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 24/24] workqueue: Implement localize-to-issuing-CPU for unbound workqueues
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (22 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 23/24] workqueue: Add pool_workqueue->cpu Tejun Heo
@ 2023-05-19  0:17 ` Tejun Heo
  2023-05-19  0:41 ` [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Linus Torvalds
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-19  0:17 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, Tejun Heo

The non-strict cache affinity scope provides a reasonable default behavior
for improving execution locality while avoiding strict utilization limits
and the overhead of too-fine-grained scopes. However, it ignores L1/2
locality which may benefit some workloads.

This patch implements workqueue_attrs->localize which, when turned on, tries
to put the worker on the work item's issuing CPU when starting execution in
the same way non-strict cache affinity is implemented. As it uses the same
task_struct->wake_cpu, the same caveats apply. It isn't clear whether this
is an acceptable use of the scheduler property and there is a small race
window where the setting from position_worker() may be ignored.

To locate a worker on the work item's issuing CPU, we need to pre-assign the
work item to the worker before waking it up; otherwise, we can't know which
exact worker the work item is going to be assigned to. For work items that
request localization, this patch updates kick_pool() to pre-assign each work
item to an idle worker and take it out of the idle state before waking it
up. In turn, worker_thread() directly proceeds to work item execution if
IDLE was already clear when it woke up.
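
The pre-assignment loop described above can be modelled roughly as follows.
This is a simplified Python sketch, not the kernel implementation: the Work
and Worker classes and the list-based pool are hypothetical, need_more_worker()
is reduced to "the worklist is non-empty", and the repatriation step and the
possibility of assign_work() failing are left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Work:
    name: str
    issuing_cpu: int   # stands in for get_work_pwq(work)->cpu, -1 if none
    localize: bool     # stands in for wq->unbound_attrs->localize


@dataclass
class Worker:
    name: str
    wake_cpu: int
    idle: bool = True
    scheduled: list = field(default_factory=list)


def kick_pool(worklist, idle_workers):
    """Wake idle workers for pending work; return the workers woken up."""
    woken = []
    while worklist and idle_workers:
        worker = idle_workers[0]
        work = worklist[0]
        # Pre-assign only if localization is requested, the work item has an
        # issuing CPU, and at least one other idle worker would remain.
        if work.localize and work.issuing_cpu >= 0 and len(idle_workers) > 1:
            worklist.popleft()
            worker.scheduled.append(work)
            idle_workers.popleft()
            worker.idle = False
            worker.wake_cpu = work.issuing_cpu   # hint, not a hard binding
            woken.append(worker)
            continue
        # Ordinary path: wake one idle worker and let it pull work itself.
        woken.append(worker)
        break
    return woken


if __name__ == "__main__":
    works = deque(Work(n, c, True) for n, c in (("a", 2), ("b", 5), ("c", 7)))
    idle = deque(Worker(f"w{i}", wake_cpu=0) for i in range(3))
    for w in kick_pool(works, idle):
        print(w.name, w.idle, w.wake_cpu, [x.name for x in w.scheduled])
```

With three localized work items and three idle workers, the model pre-assigns
the first two and wakes the last worker the ordinary way, which mirrors the
nr_idle > 1 guard that keeps an idle worker around for the manager role.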

Theoretically, localizing to the issuing CPU without any hard restrictions
should be the best option as it tells the scheduler the best CPU to use for
locality without any restrictions on future scheduler decisions. However, in
practice, this doesn't work out that way due to loss of work conservation.
As such, this patch isn't for upstream yet. See the cover letter for further
discussion.

NOT_FOR_UPSTREAM
---
 Documentation/core-api/workqueue.rst |  38 +++---
 include/linux/workqueue.h            |  10 ++
 kernel/workqueue.c                   | 183 +++++++++++++++++++--------
 tools/workqueue/wq_dump.py           |   7 +-
 tools/workqueue/wq_monitor.py        |   8 +-
 5 files changed, 170 insertions(+), 76 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 4a8e764f41ae..3a7b3b0e7196 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -665,25 +665,25 @@ Monitoring
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
 
   $ tools/workqueue/wq_monitor.py events
-                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
-  events                      18545     0      6.1       0       5       -       -
-  events_highpri                  8     0      0.0       0       0       -       -
-  events_long                     3     0      0.0       0       0       -       -
-  events_unbound              38306     0      0.1       -       7       -       -
-  events_freezable                0     0      0.0       0       0       -       -
-  events_power_efficient      29598     0      0.2       0       0       -       -
-  events_freezable_power_        10     0      0.0       0       0       -       -
-  sock_diag_events                0     0      0.0       0       0       -       -
-
-                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
-  events                      18548     0      6.1       0       5       -       -
-  events_highpri                  8     0      0.0       0       0       -       -
-  events_long                     3     0      0.0       0       0       -       -
-  events_unbound              38322     0      0.1       -       7       -       -
-  events_freezable                0     0      0.0       0       0       -       -
-  events_power_efficient      29603     0      0.2       0       0       -       -
-  events_freezable_power_        10     0      0.0       0       0       -       -
-  sock_diag_events                0     0      0.0       0       0       -       -
+                              total  infl  CPUtime  CPUlocal CPUhog CMW/RPR  mayday rescued
+  events                      18545     0      6.1     18545      0       5       -       -
+  events_highpri                  8     0      0.0         8      0       0       -       -
+  events_long                     3     0      0.0         3      0       0       -       -
+  events_unbound              38306     0      0.1      9432      -       7       -       -
+  events_freezable                0     0      0.0         0      0       0       -       -
+  events_power_efficient      29598     0      0.2     29598      0       0       -       -
+  events_freezable_power_        10     0      0.0        10      0       0       -       -
+  sock_diag_events                0     0      0.0         0      0       0       -       -
+
+                              total  infl  CPUtime  CPUlocal CPUhog CMW/RPR  mayday rescued
+  events                      18548     0      6.1     18548      0       5       -       -
+  events_highpri                  8     0      0.0         8      0       0       -       -
+  events_long                     3     0      0.0         3      0       0       -       -
+  events_unbound              38322     0      0.1      9440      -       7       -       -
+  events_freezable                0     0      0.0         0      0       0       -       -
+  events_power_efficient      29603     0      0.2     29603      0       0       -       -
+  events_freezable_power_        10     0      0.0        10      0       0       -       -
+  sock_diag_events                0     0      0.0         0      0       0       -       -
 
   ...
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 751eb915e3f0..d989f95f6646 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -197,6 +197,16 @@ struct workqueue_attrs {
 	 */
 	enum wq_affn_scope affn_scope;
 
+	/**
+	 * @localize: always put worker on work item's issuing CPU
+	 *
+	 * When starting execution of a work item, always move the assigned
+	 * worker to the CPU the work item was issued on. The scheduler is free
+	 * to move the worker around afterwards as allowed by the affinity
+	 * scope.
+	 */
+	bool localize;
+
 	/**
 	 * @ordered: work items must be executed one by one in queueing order
 	 */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4efb0bd6f2e0..b2e914655f05 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -208,6 +208,7 @@ enum pool_workqueue_stats {
 	PWQ_STAT_STARTED,	/* work items started execution */
 	PWQ_STAT_COMPLETED,	/* work items completed execution */
 	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
+	PWQ_STAT_CPU_LOCAL,	/* work items started on the issuing CPU */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
 	PWQ_STAT_REPATRIATED,	/* unbound workers brought back into scope */
@@ -1087,51 +1088,76 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 }
 
 /**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool - wake up workers and optionally assign work items to them
  * @pool: pool to kick
  *
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * @pool may have pending work items. Either wake up one idle worker or multiple
+ * with work items pre-assigned. See the in-line comments.
  */
 static bool kick_pool(struct worker_pool *pool)
 {
-	struct worker *worker = first_idle_worker(pool);
-	struct task_struct *p;
+	bool woken_up = false;
+	struct worker *worker;
 
 	lockdep_assert_held(&pool->lock);
 
-	if (!need_more_worker(pool) || !worker)
-		return false;
-
-	p = worker->task;
-
+	while (need_more_worker(pool) && (worker = first_idle_worker(pool))) {
+		struct task_struct *p = worker->task;
 #ifdef CONFIG_SMP
-	/*
-	 * Idle @worker is about to execute @work and waking up provides an
-	 * opportunity to migrate @worker at a lower cost by setting the task's
-	 * wake_cpu field. Let's see if we want to move @worker to improve
-	 * execution locality.
-	 *
-	 * We're waking the worker that went idle the latest and there's some
-	 * chance that @worker is marked idle but hasn't gone off CPU yet. If
-	 * so, setting the wake_cpu won't do anything. As this is a best-effort
-	 * optimization and the race window is narrow, let's leave as-is for
-	 * now. If this becomes pronounced, we can skip over workers which are
-	 * still on cpu when picking an idle worker.
-	 *
-	 * If @pool has non-strict affinity, @worker might have ended up outside
-	 * its affinity scope. Repatriate.
-	 */
-	if (!pool->attrs->affn_strict &&
-	    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
 		struct work_struct *work = list_first_entry(&pool->worklist,
 						struct work_struct, entry);
-		p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
-		get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
-	}
+		struct pool_workqueue *pwq = get_work_pwq(work);
+		struct workqueue_struct *wq = pwq->wq;
+
+		/*
+		 * Idle @worker is about to execute @work and waking up provides
+		 * an opportunity to migrate @worker at a lower cost by setting
+		 * the task's wake_cpu field. Let's see if we want to move
+		 * @worker to improve execution locality.
+		 *
+		 * We're waking the worker that went idle the latest and there's
+		 * some chance that @worker is marked idle but hasn't gone off
+		 * CPU yet. If so, setting the wake_cpu won't do anything. As
+		 * this is a best-effort optimization and the race window is
+		 * narrow, let's leave as-is for now. If this becomes
+		 * pronounced, we can skip over workers which are still on cpu
+		 * when picking an idle worker.
+		 */
+
+		/*
+		 * If @work's workqueue requests localization, @work has CPU
+		 * assigned and there are enough idle workers, pre-assign @work
+		 * to @worker and tell the scheduler to try to wake up @worker
+		 * on @work's issuing CPU. Be careful that ->localize is a
+		 * workqueue attribute, not a pool one.
+		 */
+		if (wq->unbound_attrs && wq->unbound_attrs->localize &&
+		    pwq->cpu >= 0 && pool->nr_idle > 1) {
+			if (assign_work(work, worker, NULL)) {
+				worker_leave_idle(worker);
+				p->wake_cpu = pwq->cpu;
+				wake_up_process(worker->task);
+				woken_up = true;
+				continue;
+			}
+		}
+
+		/*
+		 * If @pool has non-strict affinity, @worker might have ended up
+		 * outside its affinity scope. Repatriate.
+		 */
+		if (!pool->attrs->affn_strict &&
+		    !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
+			p->wake_cpu = cpumask_any_distribute(
+						pool->attrs->__pod_cpumask);
+			pwq->stats[PWQ_STAT_REPATRIATED]++;
+		}
 #endif
-	wake_up_process(p);
-	return true;
+		wake_up_process(p);
+		return true;
+	}
+
+	return woken_up;
 }
 
 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
@@ -2607,6 +2633,8 @@ __acquires(&pool->lock)
 	 */
 	lockdep_invariant_state(true);
 	pwq->stats[PWQ_STAT_STARTED]++;
+	if (pwq->cpu == smp_processor_id())
+		pwq->stats[PWQ_STAT_CPU_LOCAL]++;
 	trace_workqueue_execute_start(work);
 	worker->current_func(work);
 	/*
@@ -2730,22 +2758,26 @@ static int worker_thread(void *__worker)
 		return 0;
 	}
 
-	worker_leave_idle(worker);
-recheck:
-	/* no more worker necessary? */
-	if (!need_more_worker(pool))
-		goto sleep;
-
-	/* do we need to manage? */
-	if (unlikely(!may_start_working(pool)) && manage_workers(worker))
-		goto recheck;
-
 	/*
-	 * ->scheduled list can only be filled while a worker is
-	 * preparing to process a work or actually processing it.
-	 * Make sure nobody diddled with it while I was sleeping.
+	 * If kick_pool() assigned a work item to us, it made sure that there
+	 * are other idle workers to serve the manager role and moved us out of
+	 * the idle state already. If IDLE is clear, skip manager check and
+	 * start executing the work items on @worker->scheduled right away.
 	 */
-	WARN_ON_ONCE(!list_empty(&worker->scheduled));
+	if (worker->flags & WORKER_IDLE) {
+		WARN_ON_ONCE(!list_empty(&worker->scheduled));
+		worker_leave_idle(worker);
+
+		while (true) {
+			/* no more worker necessary? */
+			if (!need_more_worker(pool))
+				goto sleep;
+			/* do we need to manage? */
+			if (likely(may_start_working(pool)) ||
+			    !manage_workers(worker))
+				break;
+		}
+	}
 
 	/*
 	 * Finish PREP stage.  We're guaranteed to have at least one idle
@@ -2756,14 +2788,31 @@ static int worker_thread(void *__worker)
 	 */
 	worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
 
-	do {
-		struct work_struct *work =
-			list_first_entry(&pool->worklist,
-					 struct work_struct, entry);
+	/*
+	 * If we woke up with IDLE cleared, there may already be work items on
+	 * ->scheduled. Always run process_scheduled_works() at least once. Note
+	 * that ->scheduled can be empty even after !IDLE wake-up as the
+	 * scheduled work item could have been canceled in-between.
+	 */
+	process_scheduled_works(worker);
 
-		if (assign_work(work, worker, NULL))
-			process_scheduled_works(worker);
-	} while (keep_working(pool));
+	/*
+	 * For unbound workqueues, the following keep_working() would be true
+	 * only when there are worker shortages. Otherwise, work items would
+	 * have been assigned to workers on queueing.
+	 */
+	while (keep_working(pool)) {
+		struct work_struct *work = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+		/*
+		 * An unbound @worker here might not be on the same CPU as @work
+		 * which is unfortunate if the workqueue has localization turned
+		 * on. However, it shouldn't be a problem in practice as this
+		 * path isn't taken often for unbound workqueues.
+		 */
+		assign_work(work, worker, NULL);
+		process_scheduled_works(worker);
+	}
 
 	worker_set_flags(worker, WORKER_PREP);
 sleep:
@@ -3737,6 +3786,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
 	 * get_unbound_pool() explicitly clears the fields.
 	 */
 	to->affn_scope = from->affn_scope;
+	to->localize = from->localize;
 	to->ordered = from->ordered;
 }
 
@@ -4020,6 +4070,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 
 	/* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
 	pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
+	pool->attrs->localize = false;
 	pool->attrs->ordered = false;
 
 	if (worker_pool_assign_id(pool) < 0)
@@ -5832,6 +5883,7 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
  *  cpumask		RW mask	: bitmask of allowed CPUs for the workers
  *  affinity_scope	RW str  : worker CPU affinity scope (cache, numa, none)
  *  affinity_strict	RW bool : worker CPU affinity is strict
+ *  localize		RW bool : localize worker to work's origin CPU
  */
 struct wq_device {
 	struct workqueue_struct		*wq;
@@ -6042,11 +6094,34 @@ static ssize_t wq_affinity_strict_store(struct device *dev,
 	return ret ?: count;
 }
 
+static ssize_t wq_localize_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->localize);
+}
+
+static ssize_t wq_localize_store(struct device *dev,
+				 struct device_attribute *attr, const char *buf,
+				 size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	int v;
+
+	if (sscanf(buf, "%d", &v) != 1)
+		return -EINVAL;
+
+	wq->unbound_attrs->localize = v;
+	return count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
 	__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
 	__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
+	__ATTR(localize, 0644, wq_localize_show, wq_localize_store),
 	__ATTR_NULL,
 };
 
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d0df5833f2c1..036fb89260a3 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -41,6 +41,7 @@ Lists all workqueues along with their type and worker pool association. For
   NAME      name of the workqueue
   TYPE      percpu, unbound or ordered
   FLAGS     S: strict affinity scope
+            L: localize worker to work item's issuing CPU
   POOL_ID   worker pool ID associated with each possible CPU
 """
 
@@ -160,8 +161,10 @@ print(' dfl]')
             print(' ordered   ', end='')
         else:
             print(' unbound', end='')
-            if wq.unbound_attrs.affn_strict:
-                print(',S ', end='')
+            strict = wq.unbound_attrs.affn_strict
+            local = wq.unbound_attrs.localize
+            if strict or local:
+                print(f',{"S" if strict else "_"}{"L" if local else "_"}', end='')
             else:
                 print('   ', end='')
     else:
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
index a8856a9c45dc..a0b0cd50b629 100644
--- a/tools/workqueue/wq_monitor.py
+++ b/tools/workqueue/wq_monitor.py
@@ -15,6 +15,9 @@ https://github.com/osandov/drgn.
            sampled from scheduler ticks and only provides ballpark
            measurement. "nohz_full=" CPUs are excluded from measurement.
 
+  CPUlocal The number of times a work item starts executing on the same CPU
+           that the work item was issued on.
+
   CPUitsv  The number of times a concurrency-managed work item hogged CPU
            longer than the threshold (workqueue.cpu_intensive_thresh_us)
            and got excluded from concurrency management to avoid stalling
@@ -66,6 +69,7 @@ WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
 PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']      # work items started execution
 PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']	# work items completed execution
 PWQ_STAT_CPU_TIME       = prog['PWQ_STAT_CPU_TIME']     # total CPU time consumed
+PWQ_STAT_CPU_LOCAL      = prog['PWQ_STAT_CPU_LOCAL']    # work items started on the issuing CPU
 PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']    # concurrency-management worker wakeups
 PWQ_STAT_REPATRIATED    = prog['PWQ_STAT_REPATRIATED']  # unbound workers brought back into scope
@@ -91,6 +95,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
                  'started'              : self.stats[PWQ_STAT_STARTED],
                  'completed'            : self.stats[PWQ_STAT_COMPLETED],
                  'cpu_time'             : self.stats[PWQ_STAT_CPU_TIME],
+                 'cpu_local'            : self.stats[PWQ_STAT_CPU_LOCAL],
                  'cpu_intensive'        : self.stats[PWQ_STAT_CPU_INTENSIVE],
                  'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
                  'repatriated'          : self.stats[PWQ_STAT_REPATRIATED],
@@ -98,7 +103,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
                  'rescued'              : self.stats[PWQ_STAT_RESCUED], }
 
     def table_header_str():
-        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} {"CPUlocal":>8} '\
             f'{"CPUitsv":>7} {"CMW/RPR":>7} {"mayday":>7} {"rescued":>7}'
 
     def table_row_str(self):
@@ -121,6 +126,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
               f'{self.stats[PWQ_STAT_STARTED]:8} ' \
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
               f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
+              f'{self.stats[PWQ_STAT_CPU_LOCAL]:8} ' \
               f'{cpu_intensive:>7} ' \
               f'{cmw_rpr:>7} ' \
               f'{mayday:>7} ' \
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (23 preceding siblings ...)
  2023-05-19  0:17 ` [PATCH 24/24] workqueue: Implement localize-to-issuing-CPU for unbound workqueues Tejun Heo
@ 2023-05-19  0:41 ` Linus Torvalds
  2023-05-19 22:35   ` Tejun Heo
  2023-05-23 11:18 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  27 siblings, 1 reply; 73+ messages in thread
From: Linus Torvalds @ 2023-05-19  0:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

On Thu, May 18, 2023 at 5:17 PM Tejun Heo <tj@kernel.org> wrote:
>
> Most of the patchset are workqueue internal plumbing and probably aren't
> terribly interesting. However, the performance picture turned out less
> straight-forward than I had hoped, most likely due to loss of
> work-conservation from scheduler in high fan-out scenarios. I'll describe it
> in this cover letter. Please read on.

So my reaction here is that I think your benchmarking was about
throughput, but the recent changes that triggered this discussion were
about latency for random small stuff.

Maybe your "LOW" tests might be close to that, but looking at that fio
benchmark line you quoted, I don't think so.

IOW, I think that what the fsverity code ended up seeing was literally
*serial* IO that was fast enough that it was better done on the local
CPU immediately, and that that was the reason for why it wanted to
remove WQ_UNBOUND.

IOW, I think you should go even lower than your "LOW", and test
basically "--iodepth=1" to a ramdisk. A load where scheduling to any
other CPU is literally *always* a mistake, because the IO is basically
entirely synchronous, and it's better to just do the work on the same
CPU and be done with it.

That may sound like an outlier thing, but I don't think it's
necessarily even all that odd. I think that "depth=1" is likely the
common case for many real loads.

That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
read workqueue") really talks about startup costs. They are about
things like "page in the executable", which is all almost 100%
serialized with no parallelism at all. Even read-ahead ends up being
serial, in that it's likely one single contiguous IO.

Yes, latency tends to be harder to benchmark than throughput, but I
really think latency trumps throughput 95% of the time. And all your
benchmark loads looked like throughput loads to me: they just weren't
using *all* the CPU capacity you had.

Yes, writeback can have lovely throughput behavior and saturate the IO
because you have lots of parallelism. But reads are often 100% serial
for one thread, and often you don't *have* more than one thread.

So I think your "not enough work to saturate" is still ludicrously
over-the-top. You should not aim for "not enough work to saturate 24
threads". You should aim for "basically completely single-threaded".
Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
by at least an order of magnitude.

                  Linus

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19  0:41 ` [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Linus Torvalds
@ 2023-05-19 22:35   ` Tejun Heo
  2023-05-19 23:03     ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-05-19 22:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jiangshanlai, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

Hello, Linus.

On Thu, May 18, 2023 at 05:41:29PM -0700, Linus Torvalds wrote:
...
> That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
> read workqueue") really talks about startup costs. They are about
> things like "page in the executable", which is all almost 100%
> serialized with no parallelism at all. Even read-ahead ends up being
> serial, in that it's likely one single contiguous IO.

I should have explained my thought process better here. I don't think the
fsverity and other similar recent reports on heterogeneous ARM CPUs are
caused directly by workqueue. Please take a look at the following message
from Brian Norris.

 https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/T/#u

While it's difficult to tell anything definitive from the message, the
reported performance ordering is

 4.19 > pinning worker to a CPU >> fixating CPU freq > kthread_worker

where the difference between 4.19 and pinning to one CPU is pretty small.
So, it does line up with other reports in that the higher latencies and
lower performance come from work items getting sprayed across CPUs.

However, the two kernel versions tested are 4.19 and 5.15. I audited the
commits in-between and didn't spot anything which would materially change
unbound workers' affinities or how unbound work items would be assigned to
them.

Also, f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue") is reporting a 30-fold increase in scheduler latency, which I take
to be the time from a work item being queued to the start of execution. That's
unlikely to be just from crossing a cache boundary. There must be other
interactions (e.g. with some powersaving state transitions).

That said, workqueue, by spraying work items across cache boundaries, does
provide a condition in which this sort of problem can be significantly
amplified. workqueue isn't the root cause of these problems and can't fix
it; however,

* The current workqueue behavior is silly on machines with multiple L3
  caches such as recent AMD CPUs w/ chiplets and heterogeneous ARM ones.
  Improving workqueue locality is likely to lessen the severity of the
  recently reported latency problems, possibly to the extent that it won't
  matter anymore.

* The fact that the remedy people are going for is switching to percpu
  workqueue is bad. That is going to hurt other use cases, often severely,
  and their only solution would be reverting back to unbound workqueues.

So, my expectation with the posted patchset is that most of the severe
chrome problems will go away, hopefully. Even in the case that that's not
sufficient, unbound workqueues now provide enough flexibility to work around
these problems without resorting to switching to per-cpu workqueues thus
avoiding affecting other use cases negatively.

The performance benchmarks are not directed towards the reported latency
problems. The reported magnitude seems very hardware specific to me and I
have no way of reproducing. The benchmarks are more to justify switching the
default boundaries from NUMA to cache.

> Yes, latency tends to be harder to benchmark than throughput, but I
> really think latency trumps throughput 95% of the time. And all your
> benchmark loads looked like throughput loads to me: they just weren't
> using *all* the CPU capacity you had.
> 
> Yes, writeback can have lovely throughput behavior and saturate the IO
> because you have lots of parallelism. But reads are often 100% serial
> for one thread, and often you don't *have* more than one thread.
> 
> So I think your "not enough work to saturate" is still ludicrously
> over-the-top. You should not aim for "not enough work to saturate 24
> threads". You should aim for "basically completely single-threaded".
> Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
> by at least an order of magnitude.


More Experiments
================

I ran several more test sets. An extra workqueue configuration CPU+STRICT is
added which is very similar to using CPU_INTENSIVE per-cpu workqueues. Also,
CPU util is now per-single-CPU, ie. 100% indicates using one full CPU rather
than all CPUs because otherwise the numbers were too small. The tests are
run against dm-crypt on a tmpfs backed loop device.


4. LLOW

Similar benchmark as before but --bs is now 512 bytes and --iodepth and
--numjobs are dialed down to 4 and 1, respectively. Compared to LOW, both IO
size and concurrency are 64 times lower.

  taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
	--ioengine=libaio --iodepth=4 --runtime=60 --numjobs=1 \
	--time_based --group_reporting --name=iops-test-job --verify=sha512

Note that fio is pinned to one CPU because otherwise I was getting random
bi-modal behaviors depending on what the scheduler was doing making
comparisons difficult.
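
For reference, the 0x8 affinity mask handed to taskset has only bit 3 set,
so fio ends up confined to CPU 3. Below is a minimal userspace sketch of
that decoding. The helper names are made up for illustration and are not
part of taskset or of any kernel API.

```c
#include <assert.h>

/* Hypothetical helper: number of CPUs selected by an affinity bitmask. */
int mask_weight(unsigned long mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Hypothetical helper: lowest CPU index set in a non-empty mask. */
int mask_first_cpu(unsigned long mask)
{
	int cpu = 0;

	assert(mask != 0);	/* an empty mask selects no CPU */
	while (!(mask & 1UL)) {
		mask >>= 1;
		cpu++;
	}
	return cpu;
}
```

mask_weight(0x8) is 1 and mask_first_cpu(0x8) is 3, ie. a single CPU, the
fourth one counting from zero.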

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM               55.08  ±0.29    290.80 ±0.64                 0.00
  CACHE                64.42  ±0.47    291.51 ±0.30               +16.96
  CACHE+STRICT         65.18  ±1.14    303.74 ±1.79               +18.34
  CPU+STRICT           32.86  ±0.34    159.05 ±0.37               -48.99
  SYSTEM+LOCAL         56.76  ±0.30    286.78 ±0.50                +3.05
  CACHE+LOCAL          57.74  ±0.11    291.65 ±0.80                +4.83

The polarities of +LOCAL's relative BWs flipped showing gains over SYSTEM.
However, the fact that they're significantly worse than CACHE didn't change.

This is 4 in-flight 512 byte IOs, entirely plausible on systems of any size.
Even at this low level of concurrency, the downside of using per-cpu
workqueue (CPU+STRICT) is clear.


5. SYNC

Let's push it further. It's now single threaded synchronous read/write(2)'s
w/ 512 byte blocks. If we're going to be able to extract meaningful gains
from localizing to the issuing CPU, this should show it.

  taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
	--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 \
	--time_based --group_reporting --name=iops-test-job --verify=sha512

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM               18.64  ±0.19     88.41 ±0.47                 0.00
  CACHE                21.46  ±0.11     91.39 ±0.26               +15.13
  CACHE+STRICT         20.80  ±0.23     90.68 ±0.15               +11.59
  CPU+STRICT           21.12  ±0.61     95.29 ±0.51                -1.58
  SYSTEM+LOCAL         20.80  ±0.20     90.12 ±0.34               +11.59
  CACHE+LOCAL          21.46  ±0.09     93.18 ±0.04               +15.13

Now +LOCAL's are doing quite a bit better than SYSTEM but still can't beat
CACHE. Interestingly, CPU+STRICT performs noticeably worse than others while
occupying CPU for longer. I haven't thought too much about why this would be,
maybe because the benefits of using resources on other CPUs beat the
overhead of doing so?


Summary
=======

The overall conclusion doesn't change much. +LOCAL's fare better with lower
concurrency but still can't beat CACHE. I tried other combinations and
against SSD too. Nothing significantly deviates from the overall pattern.

I'm sure we can concoct a workload which uses significantly less than one
full CPU (so that utilizing other CPUs' resources doesn't bring enough gain)
and can stay within L1/2 to maximize the benefit of not having to go through
L3, but it looks like that's going to take some stretching. At least on the
CPU I'm testing, it looks like letting loose in each L3 domain is the right
thing to do.

And, ARM folks, AFAICS, workqueue is unlikely to be the root cause of the
significant regressions observed after switching to a recent kernel. However,
this patchset should hopefully alleviate the symptoms significantly, or,
failing that, at least make working around a lot easier. So, please test and
please don't switch CPU-heavy unbound workqueues to per-cpu. That's
guaranteed to hurt other heavier-weight usages.

Thanks.

-- 
tejun


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19 22:35   ` Tejun Heo
@ 2023-05-19 23:03     ` Tejun Heo
  2023-05-23  1:51       ` Linus Torvalds
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-05-19 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jiangshanlai, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

Oh, a bit of addition.

Once below saturation, latency and bw are mostly the two sides of the same
coin but just to be sure, here are latency results. The single-threaded sync
IO is run with 1ms interval between IOs.

  taskset 0x8 fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=512 \
	--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 --time_based \
	--group_reporting --name=iops-test-job --verify=sha512 --thinktime=1ms

SYSTEM

  read: IOPS=480, BW=240KiB/s (246kB/s)(14.1MiB/60001msec)
    clat (usec): min=8, max=401, avg=30.96, stdev= 9.60
     lat (usec): min=8, max=401, avg=31.01, stdev= 9.60
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   13], 10.00th=[   25], 20.00th=[   27],
     | 30.00th=[   28], 40.00th=[   29], 50.00th=[   29], 60.00th=[   30],
     | 70.00th=[   32], 80.00th=[   42], 90.00th=[   44], 95.00th=[   44],
     | 99.00th=[   46], 99.50th=[   46], 99.90th=[   56], 99.95th=[   71],
     | 99.99th=[  253]
   bw (  KiB/s): min=  214, max=  265, per=99.85%, avg=240.29, stdev=11.35, samples=119
   iops        : min=  428, max=  530, avg=480.59, stdev=22.70, samples=119

CPU_STRICT

  read: IOPS=474, BW=237KiB/s (243kB/s)(385KiB/1624msec)
    clat (usec): min=9, max=240, avg=28.00, stdev=11.20
     lat (usec): min=9, max=240, avg=28.05, stdev=11.20
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   26], 10.00th=[   26], 20.00th=[   26],
     | 30.00th=[   27], 40.00th=[   28], 50.00th=[   28], 60.00th=[   28],
     | 70.00th=[   29], 80.00th=[   30], 90.00th=[   31], 95.00th=[   31],
     | 99.00th=[   32], 99.50th=[   50], 99.90th=[  241], 99.95th=[  241],
     | 99.99th=[  241]

CACHE

  read: IOPS=479, BW=240KiB/s (245kB/s)(14.0MiB/60002msec)
    clat (nsec): min=7874, max=75922, avg=13342.34, stdev=6906.53
     lat (nsec): min=7904, max=75952, avg=13386.08, stdev=6906.60
    clat percentiles (nsec):
     |  1.00th=[ 8384],  5.00th=[ 8896], 10.00th=[ 9152], 20.00th=[ 9408],
     | 30.00th=[ 9536], 40.00th=[ 9920], 50.00th=[10432], 60.00th=[10688],
     | 70.00th=[11072], 80.00th=[13632], 90.00th=[27264], 95.00th=[28288],
     | 99.00th=[30592], 99.50th=[30848], 99.90th=[41216], 99.95th=[56064],
     | 99.99th=[74240]
   bw (  KiB/s): min=  204, max=  269, per=99.69%, avg=239.67, stdev=11.02, samples=119
   iops        : min=  408, max=  538, avg=479.34, stdev=22.04, samples=119


It's a bit confusing because fio switched to printing nsecs for CACHE but
CPU_STRICT (per-cpu)'s average completion latency is, expectedly, better
than SYSTEM - 28us vs. 31us, but CACHE's is way better at 13.3us.
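
Since fio mixes usec and nsec in the output above, here is a small sketch
for putting the averages on the same scale before comparing them. The enum
and function names are hypothetical and only serve the illustration; the
input numbers are the avg clat values from the fio output above.

```c
#include <assert.h>

/* Units fio may pick when printing latencies. */
enum lat_unit { LAT_NSEC, LAT_USEC, LAT_MSEC };

/* Convert a latency value reported in @unit to microseconds. */
double lat_to_usec(double val, enum lat_unit unit)
{
	switch (unit) {
	case LAT_NSEC:
		return val / 1000.0;
	case LAT_MSEC:
		return val * 1000.0;
	default:
		return val;	/* already in usec */
	}
}

/* Percent change of @val relative to @base; negative means lower latency. */
double pct_change(double base, double val)
{
	assert(base > 0.0);
	return (val - base) / base * 100.0;
}
```

With the averages above, lat_to_usec(13342.34, LAT_NSEC) is about 13.34us,
which pct_change() puts at roughly 57% below SYSTEM's 30.96us, while
CPU_STRICT's 28.00us is roughly 10% below.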

Thanks.

-- 
tejun


* Re: [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues
  2023-05-19  0:16 ` [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues Tejun Heo
@ 2023-05-22  6:41   ` Leon Romanovsky
  2023-05-22 12:27     ` Dennis Dalessandro
  0 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2023-05-22  6:41 UTC (permalink / raw)
  To: Tejun Heo, Dennis Dalessandro
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	Dennis Dalessandro, Jason Gunthorpe, Karsten Graul, Wenjia Zhang,
	Jan Karcher

On Thu, May 18, 2023 at 02:16:54PM -1000, Tejun Heo wrote:
> A pwq (pool_workqueue) represents an association between a workqueue and a
> worker_pool. When a work item is queued, the workqueue selects the pwq to
> use, which in turn determines the pool, and queues the work item to the pool
> through the pwq. pwq is also what implements the maximum concurrency limit -
> @max_active.
> 
> As a per-cpu workqueue should be associated with a different worker_pool on
> each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
> However, unbound workqueues were sharing a pwq within each NUMA node by
> default. The sharing has several downsides:
> 
> * Because @max_active is per-pwq, the meaning of @max_active changes
>   depending on the machine configuration and whether workqueue NUMA locality
>   support is enabled.
> 
> * Makes per-cpu and unbound code deviate.
> 
> * Gets in the way of making workqueue CPU locality awareness more flexible.
> 
> This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
> workqueues do by making the following changes:
> 
> * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
>   just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
>   workqueues.
> 
> * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
>   the specified pwq to the target CPU's wq->cpu_pwq.
> 
> * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
>   unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
>   This makes the return value of wq_calc_node_cpumask() unnecessary. It now
>   returns void.
> 
> * @max_active now means the same thing for both per-cpu and unbound
>   workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
>   documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
>   used in workqueue implementation and will be removed later.
> 
> * All unbound pwq operations which used to be per-numa-node are now per-cpu.
> 
> For most unbound workqueue users, this shouldn't cause noticeable changes.
> Work item issue and completion will be a small bit faster, flush_workqueue()
> would become a bit more expensive, and the total concurrency limit would
> likely become higher. All @max_active==1 use cases are currently being
> audited for conversion into alloc_ordered_workqueue() and they shouldn't be
> affected once the audit and conversion is complete.
> 
> One area where the behavior change may be more noticeable is
> workqueue_congested() as the reported congestion state is now per CPU
> instead of NUMA node. There are only two users of this interface -
> drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
> cc'd. Inputs on the behavior change would be very much appreciated.

At least for hfi1, it seems like your changes won't cause any
differences as the NUMA node is expected to be connected to the closest
CPU anyway in setups relevant to hfi1.

Dennis, am I right?

Thanks


* Re: [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues
  2023-05-22  6:41   ` Leon Romanovsky
@ 2023-05-22 12:27     ` Dennis Dalessandro
  0 siblings, 0 replies; 73+ messages in thread
From: Dennis Dalessandro @ 2023-05-22 12:27 UTC (permalink / raw)
  To: Leon Romanovsky, Tejun Heo
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	Jason Gunthorpe, Karsten Graul, Wenjia Zhang, Jan Karcher

On 5/22/23 2:41 AM, Leon Romanovsky wrote:
> On Thu, May 18, 2023 at 02:16:54PM -1000, Tejun Heo wrote:
>> A pwq (pool_workqueue) represents an association between a workqueue and a
>> worker_pool. When a work item is queued, the workqueue selects the pwq to
>> use, which in turn determines the pool, and queues the work item to the pool
>> through the pwq. pwq is also what implements the maximum concurrency limit -
>> @max_active.
>>
>> As a per-cpu workqueue should be associated with a different worker_pool on
>> each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
>> However, unbound workqueues were sharing a pwq within each NUMA node by
>> default. The sharing has several downsides:
>>
>> * Because @max_active is per-pwq, the meaning of @max_active changes
>>   depending on the machine configuration and whether workqueue NUMA locality
>>   support is enabled.
>>
>> * Makes per-cpu and unbound code deviate.
>>
>> * Gets in the way of making workqueue CPU locality awareness more flexible.
>>
>> This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
>> workqueues do by making the following changes:
>>
>> * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
>>   just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
>>   workqueues.
>>
>> * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
>>   the specified pwq to the target CPU's wq->cpu_pwq.
>>
>> * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
>>   unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
>>   This makes the return value of wq_calc_node_cpumask() unnecessary. It now
>>   returns void.
>>
>> * @max_active now means the same thing for both per-cpu and unbound
>>   workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
>>   documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
>>   used in workqueue implementation and will be removed later.
>>
>> * All unbound pwq operations which used to be per-numa-node are now per-cpu.
>>
>> For most unbound workqueue users, this shouldn't cause noticeable changes.
>> Work item issue and completion will be a small bit faster, flush_workqueue()
>> would become a bit more expensive, and the total concurrency limit would
>> likely become higher. All @max_active==1 use cases are currently being
>> audited for conversion into alloc_ordered_workqueue() and they shouldn't be
>> affected once the audit and conversion is complete.
>>
>> One area where the behavior change may be more noticeable is
>> workqueue_congested() as the reported congestion state is now per CPU
>> instead of NUMA node. There are only two users of this interface -
>> drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
>> cc'd. Inputs on the behavior change would be very much appreciated.
> 
> At least for hfi1, it seems like your changes won't cause any
> differences as the NUMA node is expected to be connected to the closest
> CPU anyway in setups relevant to hfi1.
> 
> Dennis, am I right?
> 
> Thanks

I can see there being an impact as to when things are considered congested since
it's now CPU based vs NUMA. However, this seems like it's a good thing for hfi1.
The purpose of the code in hfi1 is to decide if QP processing should yield the
CPU and allow other QPs to make progress.
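
To illustrate the difference in what gets reported as congested, here is a
simplified userspace model. It assumes a pwq counts as congested once work
items are being held back by its max_active limit. All names are
hypothetical and the structure is a stand-in, not the kernel's
pool_workqueue.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS		4
#define MODEL_CPUS_PER_NODE	2

/* Simplified stand-in for a pool_workqueue; not the kernel structure. */
struct model_pwq {
	int max_active;		/* concurrency limit of this pwq */
	int nr_active;		/* work items currently allowed to run */
	int nr_inactive;	/* work items held back by max_active */
};

/* Queue one work item, holding it back once max_active is reached. */
void model_queue_work(struct model_pwq *pwq)
{
	assert(pwq->max_active > 0);
	if (pwq->nr_active < pwq->max_active)
		pwq->nr_active++;
	else
		pwq->nr_inactive++;
}

/* Congested means at least one work item is waiting on max_active. */
bool model_congested(const struct model_pwq *pwq)
{
	return pwq->nr_inactive > 0;
}

/* Old behavior: all CPUs of a NUMA node share one pwq. */
int pwq_idx_per_node(int cpu)
{
	return cpu / MODEL_CPUS_PER_NODE;
}

/* New behavior: each CPU has its own pwq. */
int pwq_idx_per_cpu(int cpu)
{
	return cpu;
}
```

In this model, overloading CPU 0 makes CPU 1 look congested too when pwqs
are shared per node, whereas with per-cpu pwqs only CPU 0 reports
congestion.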

Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19 23:03     ` Tejun Heo
@ 2023-05-23  1:51       ` Linus Torvalds
  2023-05-23 17:59         ` Linus Torvalds
  0 siblings, 1 reply; 73+ messages in thread
From: Linus Torvalds @ 2023-05-23  1:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

On Fri, May 19, 2023 at 4:03 PM Tejun Heo <tj@kernel.org> wrote:
>
> Oh, a bit of addition.
>
> Once below saturation, latency and bw are mostly the two sides of the same
> coin but just to be sure, here are latency results.

Ok, looks good, consider me convinced.

I still would like to hear from the actual arm crowd that caused this
all and have those odd big.LITTLE cases.

            Linus


* Re: [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker
  2023-05-19  0:16 ` [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker Tejun Heo
@ 2023-05-23  9:54   ` Lai Jiangshan
  2023-05-23 21:37     ` Tejun Heo
  2023-08-08  1:15   ` [PATCH v2 " Tejun Heo
  1 sibling, 1 reply; 73+ messages in thread
From: Lai Jiangshan @ 2023-05-23  9:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

On Fri, May 19, 2023 at 8:17 AM Tejun Heo <tj@kernel.org> wrote:

> +       pool = pwq->pool;
> +
>         /*
>          * If @work was previously on a different pool, it might still be
>          * running there, in which case the work needs to be queued on that
>          * pool to guarantee non-reentrancy.
>          */
>         last_pool = get_work_pool(work);
> -       if (last_pool && last_pool != pwq->pool) {
> +       if (last_pool && last_pool != pool) {
>                 struct worker *worker;
>
>                 raw_spin_lock(&last_pool->lock);
> @@ -1638,13 +1636,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
>
>                 if (worker && worker->current_pwq->wq == wq) {
>                         pwq = worker->current_pwq;
> +                       pool = pwq->pool;

The code above does a "raw_spin_lock(&last_pool->lock);", and
the code next does a "raw_spin_unlock(&pool->lock);".

                            WARN_ON_ONCE(pool != last_pool);

can be added here to serve as a comment.

>                 } else {
>                         /* meh... not running there, queue here */
>                         raw_spin_unlock(&last_pool->lock);
> -                       raw_spin_lock(&pwq->pool->lock);
> +                       raw_spin_lock(&pool->lock);
>                 }
>         } else {
> -               raw_spin_lock(&pwq->pool->lock);
> +               raw_spin_lock(&pool->lock);
>         }
>


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (24 preceding siblings ...)
  2023-05-19  0:41 ` [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Linus Torvalds
@ 2023-05-23 11:18 ` Peter Zijlstra
  2023-05-23 16:12   ` Vincent Guittot
  2023-05-26  1:12   ` Tejun Heo
  2023-06-12 23:56 ` Brian Norris
  2023-08-08  1:22 ` Tejun Heo
  27 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-23 11:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, torvalds, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, gautham.shenoy,
	Vincent Guittot

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:

> Most of the patchset are workqueue internal plumbing and probably aren't
> terribly interesting. However, the performance picture turned out less
> straight-forward than I had hoped, most likely due to loss of
> work-conservation from scheduler in high fan-out scenarios. I'll describe it
> in this cover letter. Please read on.

> Here's the relevant part of the experiment setup.
> 
> * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
>   Core-to-core latencies across L3 caches are ~2.6x worse than within each
>   L3 cache. ie. it's worse but not hugely so. This means that the impact of
>   L3 cache locality is noticeable in these experiments but may be subdued
>   compared to other setups.

*blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
threads per LLC. That's puny.

/me goes google a bit.. So these are Zen2 things which nominally have 4
cores per CCX which has 16M of L3, but these must be binned parts with only
3 functional cores per CCX.

Zen3 then went to 8 cores per CCX with double the L3.

> 2. MED: Fewer issuers, enough work for saturation
> 
>                   Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
>   ----------------------------------------------------------------------
>   SYSTEM             1155.40  ±0.89     97.41 ±0.05                 0.00
>   CACHE              1154.40  ±1.14     96.15 ±0.09                -0.09
>   CACHE+STRICT       1112.00  ±4.64     93.26 ±0.35                -3.76
>   SYSTEM+LOCAL       1066.80  ±2.17     86.70 ±0.10                -7.67
>   CACHE+LOCAL        1034.60  ±1.52     83.00 ±0.07               -10.46
> 
> There are still eight issuers and plenty of work to go around. However, now,
> depending on the configuration, we're starting to see a significant loss of
> work-conservation where CPUs sit idle while there's work to do.
> 
> * CACHE is doing okay. It's just a bit slower. Further testing may be needed
>   to definitively confirm the bandwidth gap but the CPU util difference
>   seems real, so while minute, it did lose a bit of work-conservation.
>   Comparing it to CACHE+STRICT, it's starting to show the benefits of
>   non-strict scopes.

So wakeup based placement is mostly all about LLC, and given this thing
has dinky small LLCs it will pile up on the one LLC you target and leave
the others idle until the regular idle balancer decides to make an
appearance and move some around.

But if these are fairly short running tasks, I can well see that not
going to help much.


Much of this was tuned back when there was 1 L3 per Node; something
which is still more or less true for Intel but clearly not for these
things.


The below is a bit crude and completely untested, but it might help. The
flip side of that coin is of course that people are going to complain
about how selecting a CPU is more expensive now and how this hurts their
performance :/

Basically it will try and iterate all L3s in a node; wakeup will still
refuse to cross node boundaries.
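
As a rough userspace illustration of that idea (not the patch itself), the
sketch below models one node with 12 CPUs split into 4 LLCs of 3 CPUs each
and scans the other LLCs in circular order for an idle CPU. The constants
and function names are assumptions made for the example, and the visiting
order is simplified compared to walking the sched groups.

```c
#include <assert.h>
#include <stdbool.h>

#define SIS_NR_CPUS		12	/* CPUs in the modeled node */
#define SIS_CPUS_PER_LLC	3	/* CPUs sharing one L3 */

/* Return the first idle CPU in @llc, or -1 if every CPU there is busy. */
int model_idle_cpu_in_llc(const bool *idle, int llc)
{
	int cpu;

	for (cpu = llc * SIS_CPUS_PER_LLC;
	     cpu < (llc + 1) * SIS_CPUS_PER_LLC; cpu++) {
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

/*
 * The caller is assumed to have already scanned @target's own LLC, so
 * only the remaining LLCs of the node are visited, each exactly once.
 * Returns an idle CPU or -1; CPUs outside the node are never considered.
 */
int model_select_idle_node(const bool *idle, int target)
{
	int nr_llcs = SIS_NR_CPUS / SIS_CPUS_PER_LLC;
	int home, step, cpu;

	assert(target >= 0 && target < SIS_NR_CPUS);
	home = target / SIS_CPUS_PER_LLC;

	for (step = 1; step < nr_llcs; step++) {
		cpu = model_idle_cpu_in_llc(idle, (home + step) % nr_llcs);
		if (cpu >= 0)
			return cpu;
	}
	return -1;
}
```

The extra cost mentioned above shows up here as the outer loop: in the
worst case every LLC in the node is scanned before giving up.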

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..ddb7f16a07a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	return idle_cpu;
 }
 
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
+	struct sched_group *sg;
+
+	if (!sd_node || sd_node == sd)
+		return -1;
+
+	sg = sd_node->groups;
+	do {
+		int cpu = cpumask_first(sched_group_span(sg));
+		struct sched_domain *sd_child;
+
+		sd_child = per_cpu(sd_llc, cpu);
+		if (sd_child != sd) {
+			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+			if ((unsigned int)i < nr_cpumask_bits)
+				return i;
+		}
+
+		sg = sg->next;
+	} while (sg != sd_node->groups);
+
+	return -1;
+}
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (sched_feat(SIS_NODE)) {
+		i = select_idle_node(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
+
 	return target;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..f965cd4a981e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 678446251c35..d2e0e2e496a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..d94cbc2164ca 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	while (sd && sd->parent) {
+		/*
+		 * SD_NUMA is the first domain spanning nodes, therefore @sd
+		 * must be the domain that spans a single node.
+		 */
+		if (sd->parent->flags & SD_NUMA)
+			break;
+
+		sd = sd->parent;
+	}
+	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 11:18 ` Peter Zijlstra
@ 2023-05-23 16:12   ` Vincent Guittot
  2023-05-24  7:34     ` Peter Zijlstra
  2023-06-05  4:46     ` Gautham R. Shenoy
  2023-05-26  1:12   ` Tejun Heo
  1 sibling, 2 replies; 73+ messages in thread
From: Vincent Guittot @ 2023-05-23 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, jiangshanlai, torvalds, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	gautham.shenoy

On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>
> > Most of the patchset are workqueue internal plumbing and probably aren't
> > terribly interesting. However, the performance picture turned out less
> > straight-forward than I had hoped, most likely due to loss of
> > work-conservation from scheduler in high fan-out scenarios. I'll describe it
> > in this cover letter. Please read on.
>
> > Here's the relevant part of the experiment setup.
> >
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> >   Core-to-core latencies across L3 caches are ~2.6x worse than within each
> >   L3 cache. ie. it's worse but not hugely so. This means that the impact of
> >   L3 cache locality is noticeable in these experiments but may be subdued
> >   compared to other setups.
>
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
>
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must be binned parts with only
> 3 functional cores per CCX.
>
> Zen3 then went to 8 cores per CCX with double the L3.
>
> > 2. MED: Fewer issuers, enough work for saturation
> >
> >                   Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
> >   ----------------------------------------------------------------------
> >   SYSTEM             1155.40  ±0.89     97.41 ±0.05                 0.00
> >   CACHE              1154.40  ±1.14     96.15 ±0.09                -0.09
> >   CACHE+STRICT       1112.00  ±4.64     93.26 ±0.35                -3.76
> >   SYSTEM+LOCAL       1066.80  ±2.17     86.70 ±0.10                -7.67
> >   CACHE+LOCAL        1034.60  ±1.52     83.00 ±0.07               -10.46
> >
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> >
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> >   to definitively confirm the bandwidth gap but the CPU util difference
> >   seems real, so while minute, it did lose a bit of work-conservation.
> >   Comparing it to CACHE+STRICT, it's starting to show the benefits of
> >   non-strict scopes.
>
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.
>
> But if these are fairly short running tasks, I can well see that not
> going to help much.
>
>
> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.
>
>
> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
>
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.

That reminds me of some discussions about systems with a fast on-die
interconnect, where we would like to run wider than the LLC at wakeup
(i.e. at DIE level), something like the CLUSTER level but on the other
side of MC.

Another possibility to investigate would be that each wakeup of a
worker is mostly unrelated to the previous one and only the waker
matters, so we should use -1 for the prev_cpu.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         return idle_cpu;
>  }
>
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +       struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> +       struct sched_group *sg;
> +
> +       if (!sd_node || sd_node == sd)
> +               return -1;
> +
> +       sg = sd_node->groups;
> +       do {
> +               int cpu = cpumask_first(sched_group_span(sg));
> +               struct sched_domain *sd_child;
> +
> +               sd_child = per_cpu(sd_llc, cpu);
> +               if (sd_child != sd) {
> +                       int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> +                       if ((unsigned int)i < nr_cpumask_bits)
> +                               return i;
> +               }
> +
> +               sg = sg->next;
> +       } while (sg != sd_node->groups);
> +
> +       return -1;
> +}
> +
>  /*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> +       if (sched_feat(SIS_NODE)) {
> +               i = select_idle_node(p, sd, target);
> +               if ((unsigned)i < nr_cpumask_bits)
> +                       return i;
> +       }
> +
>         return target;
>  }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..f965cd4a981e 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_PROP, false)
>  SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 678446251c35..d2e0e2e496a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca4472281c28..d94cbc2164ca 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
>         per_cpu(sd_llc_id, cpu) = id;
>         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> +       while (sd && sd->parent) {
> +               /*
> +                * SD_NUMA is the first domain spanning nodes, therefore @sd
> +                * must be the domain that spans a single node.
> +                */
> +               if (sd->parent->flags & SD_NUMA)
> +                       break;
> +
> +               sd = sd->parent;
> +       }
> +       rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> +
>         sd = lowest_flag_domain(cpu, SD_NUMA);
>         rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23  1:51       ` Linus Torvalds
@ 2023-05-23 17:59         ` Linus Torvalds
  2023-05-23 20:08           ` Rik van Riel
  2023-05-23 21:36           ` Sandeep Dhavale
  0 siblings, 2 replies; 73+ messages in thread
From: Linus Torvalds @ 2023-05-23 17:59 UTC (permalink / raw)
  To: Tejun Heo, Sandeep Dhavale
  Cc: jiangshanlai, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

On Mon, May 22, 2023 at 6:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, looks good, consider me convinced.

Ugh, and I just got the erofs pull, which made me aware of how there's
*another* hack for "worker threads don't work well on Android", which
now doubled down on setting special scheduler flags for them too.

See commit cf7f2732b4b8 ("erofs: use HIPRI by default if per-cpu
kthreads are enabled"), but the whole deeper problem goes back much
further (commit 3fffb589b9a6: "erofs: add per-cpu threads for
decompression as an option").

See also

    https://lore.kernel.org/lkml/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com/

I really hate how we have random drivers and filesystems doing random
workarounds for "kthread workers don't work well enough, so add random
tweaks".

> I still would like to hear from the actual arm crowd that caused this
> all and have those odd BIG.little cases.

Sandeep, any chance that you could try out Tejun's series with plain
workers, and compare it to the percpu threads?

See

     https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/

for the beginning of this thread. The aim really is to try to figure
out - and hopefully fix - the fact that Android loads seem to trigger
all these random hacks that don't make any sense on a high level, and
seem to be workarounds rather than real fixes. Scheduling percpu
workers is a horrible horrible fix and not at all good in general, and
it would be much better if we could just improve workqueue behavior in
the odd Android situation.

              Linus

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 17:59         ` Linus Torvalds
@ 2023-05-23 20:08           ` Rik van Riel
  2023-05-23 21:36           ` Sandeep Dhavale
  1 sibling, 0 replies; 73+ messages in thread
From: Rik van Riel @ 2023-05-23 20:08 UTC (permalink / raw)
  To: torvalds, tj, dhavale
  Cc: nhuck, brho, Kernel Team, agk, peterz, linux-kernel, joshdon,
	briannorris, snitzer, jiangshanlai, void

On Tue, 2023-05-23 at 10:59 -0700, Linus Torvalds wrote:
> 
> I really hate how we have random drivers and filesystems doing random
> workarounds for "kthread workers don't work well enough, so add
> random
> tweaks".

Part of this seems to be due to the way CFS works.

CFS policy seems to make sense for a lot of workloads, but
there are some cases with kworkers where the CFS policies
just don't work quite right. Unfortunately the scheduler
problem space is not all that well explored, and it isn't
clear what the desired behavior of a scheduler should be
in every case.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 17:59         ` Linus Torvalds
  2023-05-23 20:08           ` Rik van Riel
@ 2023-05-23 21:36           ` Sandeep Dhavale
  1 sibling, 0 replies; 73+ messages in thread
From: Sandeep Dhavale @ 2023-05-23 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, jiangshanlai, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void

>
> Sandeep, any chance that you could try out Tejun's series with plain
> workers, and compare it to the percpu threads?
>
> See
>
>      https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/
>
Hi Linus,
I will try out the series and report back with the results.

Thanks,
Sandeep.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker
  2023-05-23  9:54   ` Lai Jiangshan
@ 2023-05-23 21:37     ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-05-23 21:37 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

On Tue, May 23, 2023 at 05:54:10PM +0800, Lai Jiangshan wrote:
> On Fri, May 19, 2023 at 8:17 AM Tejun Heo <tj@kernel.org> wrote:
> 
> > +       pool = pwq->pool;
> > +
> >         /*
> >          * If @work was previously on a different pool, it might still be
> >          * running there, in which case the work needs to be queued on that
> >          * pool to guarantee non-reentrancy.
> >          */
> >         last_pool = get_work_pool(work);
> > -       if (last_pool && last_pool != pwq->pool) {
> > +       if (last_pool && last_pool != pool) {
> >                 struct worker *worker;
> >
> >                 raw_spin_lock(&last_pool->lock);
> > @@ -1638,13 +1636,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> >
> >                 if (worker && worker->current_pwq->wq == wq) {
> >                         pwq = worker->current_pwq;
> > +                       pool = pwq->pool;
> 
> The code above does a "raw_spin_lock(&last_pool->lock);", and
> the code next does a "raw_spin_unlock(&pool->lock);".
> 
>                             WARN_ON_ONCE(pool != last_pool);
> 
> can be added here and served as a comment.

Yeah, this is a bit confusing. Added WARN_ON_ONCE().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 16:12   ` Vincent Guittot
@ 2023-05-24  7:34     ` Peter Zijlstra
  2023-05-24 13:15       ` Vincent Guittot
  2023-06-05  4:46     ` Gautham R. Shenoy
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-24  7:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, jiangshanlai, torvalds, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	gautham.shenoy

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:

> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and it cares only
> waker. so we should use -1 for the prev_cpu

Tejun is actually overriding p->wake_cpu in this series to target a
specific LLC -- with the explicit purpose to keep the workers near
enough.

But the problem is that with lots of short tasks we then overload the
LLC and are not running long enough for the idle load-balancer to spread
things, leading to idle time.

And that is specific to these lots-of-little-LLCs topologies.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-24  7:34     ` Peter Zijlstra
@ 2023-05-24 13:15       ` Vincent Guittot
  0 siblings, 0 replies; 73+ messages in thread
From: Vincent Guittot @ 2023-05-24 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, jiangshanlai, torvalds, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	gautham.shenoy

On Wed, 24 May 2023 at 09:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>
> > Another possibility to investigate would be that each wakeup of a
> > worker is mostly unrelated to the previous one and it cares only
> > waker. so we should use -1 for the prev_cpu
>
> Tejun is actually overriding p->wake_cpu in this series to target a
> specific LLC -- with the explicit purpose to keep the workers near
> enough.

yes, so -1 for prev_cpu was a good way to forget the irrelevant prev
cpu without trying to abuse it in order to wake the worker up close to the waker

>
> But the problem is that with lots of short tasks we then overload the
> LLC and are not running long enough for the idle load-balancer to spread
> things, leading to idle time.

I expect to not pile up workers in the same LLC if we keep the
workqueue cpu affinity to the die. The worker will wake up in the LLC
of the callers and callers should be spread on the die

>
> And that is specific to this lots of little LLC topologies.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 11:18 ` Peter Zijlstra
  2023-05-23 16:12   ` Vincent Guittot
@ 2023-05-26  1:12   ` Tejun Heo
  2023-05-30 11:32     ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-05-26  1:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: jiangshanlai, torvalds, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, gautham.shenoy,
	Vincent Guittot

Hello,

On Tue, May 23, 2023 at 01:18:18PM +0200, Peter Zijlstra wrote:
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> >   Core-to-core latencies across L3 caches are ~2.6x worse than within each
> >   L3 cache. ie. it's worse but not hugely so. This means that the impact of
> >   L3 cache locality is noticeable in these experiments but may be subdued
> >   compared to other setups.
> 
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
> 
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must binned parts with only
> 3 functional cores per CCX.

Yeah.

> Zen3 then went to 8 cores per CCX with double the L3.

Yeah, they basically doubled each L3 domain.

> > 2. MED: Fewer issuers, enough work for saturation
> > 
> >                   Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
> >   ----------------------------------------------------------------------
> >   SYSTEM             1155.40  ±0.89     97.41 ±0.05                 0.00
> >   CACHE              1154.40  ±1.14     96.15 ±0.09                -0.09
> >   CACHE+STRICT       1112.00  ±4.64     93.26 ±0.35                -3.76
> >   SYSTEM+LOCAL       1066.80  ±2.17     86.70 ±0.10                -7.67
> >   CACHE+LOCAL        1034.60  ±1.52     83.00 ±0.07               -10.46
> > 
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> > 
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> >   to definitively confirm the bandwidth gap but the CPU util difference
> >   seems real, so while minute, it did lose a bit of work-conservation.
> >   Comparing it to CACHE+STRICT, it's starting to show the benefits of
> >   non-strict scopes.
> 
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.

While this processor configuration isn't the most typical, I have a
difficult time imagining newer chiplet designs behaving much differently.
The core problem is that while there is benefit to improving locality within
each L3 domain, the distance across L3 domains isn't that far either, so
unless loads get balanced aggressively across L3 domains while staying
within L3 when immediately possible, you lose capacity.

> But if these are fairly short running tasks, I can well see that not
> going to help much.

Right, the tasks themselves aren't necessarily short-running but they do
behave that way due to discontinuation and repatriation across work item
boundaries.

> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.

Yeah, the same for workqueue. It assumed that the distance within a NUMA
node is indistinguishable, which no longer holds.

> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
> 
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	return idle_cpu;
>  }
>  
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));

I had to rename the local variable because it was masking the global percpu
one during initialization but after that the result looks really good. I did
the same HIGH, MED, LOW scenarios. +SN indicates that SIS_NODE is turned on.
The machine set-up was slightly different so the baseline numbers aren't
directly comparable to the original results; however, the relative bw %
comparisons should hold.

RESULTS
=======

1. HIGH: Enough issuers and work spread across the machine

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM              1162.80 ±0.45     99.29 ±0.03                 0.00
  CACHE               1169.60 ±0.55     99.35 ±0.02                +0.58
  CACHE+SN            1169.00 ±0.00     99.37 ±0.01                +0.53
  CACHE+SN+LOCAL      1165.00 ±0.71     99.48 ±0.03                +0.19

This is an over-saturation scenario and the CPUs aren't having any trouble
finding work to do no matter what. The slight gain is most likely from
improved L3 locality and +SN doesn't behave noticeably differently from
CACHE.


2. MED: Fewer issuers, enough work for saturation

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM              1157.20 ±0.84     97.34 ±0.07                 0.00
  CACHE               1155.60 ±1.34     96.09 ±0.10                -0.14
  CACHE+SN            1163.80 ±0.45     96.90 ±0.07                +0.57
  CACHE+SN+LOCAL      1052.00 ±1.00     85.84 ±0.11                -9.09

+LOCAL is still sad but CACHE+SN now maintains a gain over SYSTEM similar to
the HIGH scenario. Compared to CACHE, CACHE+SN shows slight but discernible
increases in both bandwidth and CPU util, which is great.


3. LOW: Even fewer issuers, not enough work to saturate

                  Bandwidth (MiBps)    CPU util (%)    vs. SYSTEM BW (%)
  ----------------------------------------------------------------------
  SYSTEM               995.00 ±4.42     75.60 ±0.27                 0.00
  CACHE                971.00 ±3.39     74.86 ±0.18                -2.41
  CACHE+SN             998.40 ±4.04     74.91 ±0.27                +0.34
  CACHE+SN+LOCAL       957.60 ±2.51     69.80 ±0.17                -3.76

+LOCAL still sucks but CACHE+SN wins over SYSTEM albeit within a single
sigma. It's clear that CACHE+SN outperforms CACHE by a significant margin
and makes up the loss compared to SYSTEM.


CONCLUSION
==========

With SIS_NODE enabled, there's no downside to CACHE. It always
outperforms or matches SYSTEM. It's possible that the overhead of searching
further for an idle CPU is more pronounced on bigger machines but most
likely so will be the gains. This looks like a no-brainer improvement to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-19  0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
@ 2023-05-30  8:06   ` K Prateek Nayak
  2023-06-07  1:50     ` Tejun Heo
  2023-05-30 21:18   ` Sandeep Dhavale
  1 sibling, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-05-30  8:06 UTC (permalink / raw)
  To: Tejun Heo, jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

Hello Tejun,

I ran into a NULL pointer dereference when trying to test a build
of the "affinity-scopes-v1" branch from your workqueue tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/?h=affinity-scopes-v1)
Inlining the splat, some debug details, and the workaround I used below.

On 5/19/2023 5:46 AM, Tejun Heo wrote:
> While renamed to pod, the code still assumes that the pods are defined by
> NUMA boundaries. Let's generalize it:
> 
> * workqueue_attrs->affn_scope is added. Each enum represents the type of
>   boundaries that define the pods. There are currently two scopes -
>   WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
>   - one pod per NUMA node. The latter defines one global pod across the
>   whole system.
> 
> * struct wq_pod_type is added which describes how pods are configured for
>   each affinity scope. For each pod, it lists the member CPUs and the
>   preferred NUMA node for memory allocations. The reverse mapping from CPU
>   to pod is also available.
> 
> * wq_pod_enabled is dropped. Pod is now always enabled. The previously
>   disabled behavior is now implemented through WQ_AFFN_SYSTEM.
> 
> * get_unbound_pool() wants to determine the NUMA node to allocate memory
>   from for the new pool. The variables are renamed from node to pod but the
>   logic still assumes they're one and the same. Clearly distinguish them -
>   walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
>   NUMA node.
> 
> * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
>   node. Take @cpu instead and determine the cpumask to use from the pod_type
>   matching @attrs.
> 
> * apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
>   NULL so that it can indicate -EINVAL on invalid affinity scopes.
> 
> This patch allows CPUs to be grouped into pods however desired per type.
> While this patch causes some internal behavior changes, nothing material
> should change for workqueue users.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/workqueue.h |  31 +++++++-
>  kernel/workqueue.c        | 154 ++++++++++++++++++++++++--------------
>  2 files changed, 125 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index b8961c8ea5b3..a2f826b6ec9a 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -124,6 +124,15 @@ struct rcu_work {
>  	struct workqueue_struct *wq;
>  };
>  
> +enum wq_affn_scope {
> +	WQ_AFFN_NUMA,			/* one pod per NUMA node */
> +	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
> +
> +	WQ_AFFN_NR_TYPES,
> +
> +	WQ_AFFN_DFL = WQ_AFFN_NUMA,
> +};
> +
>  /**
>   * struct workqueue_attrs - A struct for workqueue attributes.
>   *
> @@ -140,12 +149,26 @@ struct workqueue_attrs {
>  	 */
>  	cpumask_var_t cpumask;
>  
> +	/*
> +	 * Below fields aren't properties of a worker_pool. They only modify how
> +	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
> +	 * participate in pool hash calculations or equality comparisons.
> +	 */
> +
>  	/**
> -	 * @ordered: work items must be executed one by one in queueing order
> +	 * @affn_scope: unbound CPU affinity scope
>  	 *
> -	 * Unlike other fields, ``ordered`` isn't a property of a worker_pool. It
> -	 * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
> -	 * doesn't participate in pool hash calculations or equality comparisons.
> +	 * CPU pods are used to improve execution locality of unbound work
> +	 * items. There are multiple pod types, one for each wq_affn_scope, and
> +	 * every CPU in the system belongs to one pod in every pod type. CPUs
> +	 * that belong to the same pod share the worker pool. For example,
> +	 * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
> +	 * pool for each NUMA node.
> +	 */
> +	enum wq_affn_scope affn_scope;
> +
> +	/**
> +	 * @ordered: work items must be executed one by one in queueing order
>  	 */
>  	bool ordered;
>  };
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index add6f5fc799b..dae1787833cb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -325,7 +325,18 @@ struct workqueue_struct {
>  
>  static struct kmem_cache *pwq_cache;
>  
> -static cpumask_var_t *wq_pod_cpus;	/* possible CPUs of each node */
> +/*
> + * Each pod type describes how CPUs should be grouped for unbound workqueues.
> + * See the comment above workqueue_attrs->affn_scope.
> + */
> +struct wq_pod_type {
> +	int			nr_pods;	/* number of pods */
> +	cpumask_var_t		*pod_cpus;	/* pod -> cpus */
> +	int			*pod_node;	/* pod -> node */
> +	int			*cpu_pod;	/* cpu -> pod */
> +};
> +
> +static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
>  
>  /*
>   * Per-cpu work items which run for longer than the following threshold are
> @@ -341,8 +352,6 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);
>  
>  static bool wq_online;			/* can kworkers be created yet? */
>  
> -static bool wq_pod_enabled;		/* unbound CPU pod affinity enabled */
> -
>  /* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
>  static struct workqueue_attrs *wq_update_pod_attrs_buf;
>  
> @@ -1753,10 +1762,6 @@ static int select_numa_node_cpu(int node)
>  {
>  	int cpu;
>  
> -	/* No point in doing this if NUMA isn't enabled for workqueues */
> -	if (!wq_pod_enabled)
> -		return WORK_CPU_UNBOUND;
> -
>  	/* Delay binding to CPU if node is not valid or online */
>  	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
>  		return WORK_CPU_UNBOUND;
> @@ -3639,6 +3644,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
>  		goto fail;
>  
>  	cpumask_copy(attrs->cpumask, cpu_possible_mask);
> +	attrs->affn_scope = WQ_AFFN_DFL;
>  	return attrs;
>  fail:
>  	free_workqueue_attrs(attrs);
> @@ -3650,11 +3656,13 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
>  {
>  	to->nice = from->nice;
>  	cpumask_copy(to->cpumask, from->cpumask);
> +
>  	/*
> -	 * Unlike hash and equality test, this function doesn't ignore
> -	 * ->ordered as it is used for both pool and wq attrs.  Instead,
> -	 * get_unbound_pool() explicitly clears ->ordered after copying.
> +	 * Unlike hash and equality test, copying shouldn't ignore wq-only
> +	 * fields as copying is used for both pool and wq attrs. Instead,
> +	 * get_unbound_pool() explicitly clears the fields.
>  	 */
> +	to->affn_scope = from->affn_scope;
>  	to->ordered = from->ordered;
>  }
>  
> @@ -3680,6 +3688,24 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
>  	return true;
>  }
>  
> +/* find wq_pod_type to use for @attrs */
> +static const struct wq_pod_type *
> +wqattrs_pod_type(const struct workqueue_attrs *attrs)
> +{
> +	struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];
> +
> +	if (likely(pt->nr_pods))
> +		return pt;
> +
> +	/*
> +	 * Before workqueue_init_topology(), only SYSTEM is available which is
> +	 * initialized in workqueue_init_early().
> +	 */
> +	pt = &wq_pod_types[WQ_AFFN_SYSTEM];
> +	BUG_ON(!pt->nr_pods);
> +	return pt;
> +}
> +
>  /**
>   * init_worker_pool - initialize a newly zalloc'd worker_pool
>   * @pool: worker_pool to initialize
> @@ -3880,10 +3906,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>   */
>  static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  {
> +	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
>  	u32 hash = wqattrs_hash(attrs);
>  	struct worker_pool *pool;
> -	int pod;
> -	int target_pod = NUMA_NO_NODE;
> +	int pod, node = NUMA_NO_NODE;
>  
>  	lockdep_assert_held(&wq_pool_mutex);
>  
> @@ -3895,28 +3921,24 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  		}
>  	}
>  
> -	/* if cpumask is contained inside a pod, we belong to that pod */
> -	if (wq_pod_enabled) {
> -		for_each_node(pod) {
> -			if (cpumask_subset(attrs->cpumask, wq_pod_cpus[pod])) {
> -				target_pod = pod;
> -				break;
> -			}
> +	/* If cpumask is contained inside a NUMA pod, that's our NUMA node */
> +	for (pod = 0; pod < pt->nr_pods; pod++) {
> +		if (cpumask_subset(attrs->cpumask, pt->pod_cpus[pod])) {
> +			node = pt->pod_node[pod];
> +			break;
>  		}
>  	}
>  
>  	/* nope, create a new one */
> -	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_pod);
> +	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
>  	if (!pool || init_worker_pool(pool) < 0)
>  		goto fail;
>  
>  	copy_workqueue_attrs(pool->attrs, attrs);
> -	pool->node = target_pod;
> +	pool->node = node;
>  
> -	/*
> -	 * ordered isn't a worker_pool attribute, always clear it.  See
> -	 * 'struct workqueue_attrs' comments for detail.
> -	 */
> +	/* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
> +	pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;

When booting into the kernel, I got the following NULL pointer
dereferencing error:

    [    4.280321] BUG: kernel NULL pointer dereference, address: 0000000000000004
    [    4.284172] #PF: supervisor read access in kernel mode
    [    4.284172] #PF: error_code(0x0000) - not-present page
    [    4.284172] PGD 0 P4D 0
    [    4.284172] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [    4.284172] CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.4.0-rc1-tj-wq+ #448
    [    4.284172] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [    4.284172] RIP: 0010:wq_calc_pod_cpumask+0x5a/0x180
    [    4.284172] Code: 24 e0 08 94 96 4d 8d bc 24 e0 08 94 96 85 d2 0f 84 ec 00 00 00 49 8b 47 18 49 63 f5 48 8b 53 08 48 8b 7b 10 8b 0d 56 4b f0 01 <4c> 63 2c b0 49 8b 47 08 4a 8b 34 e8 e8 25 4f 63 00 48 8b 7b 10 8b
    [    4.284172] RSP: 0018:ffff9b548d20fd88 EFLAGS: 00010286
    [    4.284172] RAX: 0000000000000000 RBX: ffff8baec0048380 RCX: 0000000000000100
    [    4.284172] RDX: ffff8baec0159440 RSI: 0000000000000001 RDI: ffff8baec0159dc0
    [    4.284172] RBP: ffff9b548d20fdb0 R08: ffffffffffffffff R09: ffffffffffffffff
    [    4.284172] R10: ffffffffffffffff R11: ffffffffffffffff R12: 00000000000000a0
    [    4.284172] R13: 0000000000000001 R14: ffffffffffffffff R15: ffffffff96940980
    [    4.284172] FS:  0000000000000000(0000) GS:ffff8bed3d240000(0000) knlGS:0000000000000000
    [    4.284172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [    4.284172] CR2: 0000000000000004 CR3: 000000530ae3c001 CR4: 0000000000770ee0
    [    4.284172] PKRU: 55555554
    [    4.284172] Call Trace:
    [    4.284172]  <TASK>
    [    4.284172]  wq_update_pod+0x89/0x1e0
    [    4.284172]  workqueue_online_cpu+0x1fc/0x250
    [    4.284172]  ? watchdog_nmi_enable+0x12/0x20
    [    4.284172]  ? __pfx_workqueue_online_cpu+0x10/0x10
    [    4.284172]  cpuhp_invoke_callback+0x165/0x4b0
    [    4.284172]  ? try_to_wake_up+0x279/0x690
    [    4.284172]  cpuhp_thread_fun+0xc4/0x1b0
    [    4.284172]  ? __pfx_smpboot_thread_fn+0x10/0x10
    [    4.284172]  smpboot_thread_fn+0xe7/0x1e0
    [    4.284172]  kthread+0xfb/0x130
    [    4.284172]  ? __pfx_kthread+0x10/0x10
    [    4.284172]  ret_from_fork+0x2c/0x50
    [    4.284172]  </TASK>
    [    4.284172] Modules linked in:
    [    4.284172] CR2: 0000000000000004
    [    4.284172] ---[ end trace 0000000000000000 ]---

I was basically hitting the following, seemingly impossible scenario in
wqattrs_pod_type():

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3825,8 +3825,10 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
 {
        struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];

-       if (likely(pt->nr_pods))
+       if (likely(pt->nr_pods)) {
+               BUG_ON(!pt->cpu_pod); /* No pods despite thinking we have? */
                return pt;
+       }

        /*
         * Before workqueue_init_topology(), only SYSTEM is available which is
--

Logging the value of "attrs->affn_scope" when hitting the scenario gave
me "5" which corresponds to "WQ_AFFN_NR_TYPES". The kernel was reading a
value beyond the wq_pod_types[] bounds.

This value for "affn_scope" is only set in the above hunk and I got the
kernel to boot by making the following change:

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4069,7 +4071,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
        pool->node = node;

        /* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
-       pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
+       pool->attrs->affn_scope = wq_affn_dfl;
        pool->attrs->localize = false;
        pool->attrs->ordered = false;
--

Let me know if the above approach is correct since I'm not sure whether
the case for "affn_scope" being set to "WQ_AFFN_NR_TYPES" has a special
significance that is being handled elsewhere. Thank you :)

>  	pool->attrs->ordered = false;
>  
>  	if (worker_pool_assign_id(pool) < 0)
> [..snip..]
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-26  1:12   ` Tejun Heo
@ 2023-05-30 11:32     ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-30 11:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, torvalds, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, gautham.shenoy,
	Vincent Guittot

On Thu, May 25, 2023 at 03:12:45PM -1000, Tejun Heo wrote:

> CONCLUSION
> ==========
> 
> With the SIS_NODE enabled, there's no downside to CACHE. It always
> outperforms or matches SYSTEM. It's possible that the overhead of searching
> further for an idle CPU is more pronounced on bigger machines but most
> likely so will be the gains. This looks like a no brainer improvement to me.

OK, looking at it again, I think it can be done a little simpler, but it
should be mostly the same.

I'll go queue the below in sched/core, we'll see if anything comes up
negative.

---
Subject: sched/fair: Multi-LLC select_idle_sibling()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 30 13:20:46 CEST 2023

Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   38 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |    1 +
 2 files changed, 39 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_s
 }
 
 /*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	struct sched_domain *parent = sd->parent;
+	struct sched_group *sg;
+
+	/* Make sure to not cross nodes. */
+	if (!parent || parent->flags & SD_NUMA)
+		return -1;
+
+	sg = parent->groups;
+	do {
+		int cpu = cpumask_first(sched_group_span(sg));
+		struct sched_domain *sd_child;
+
+		sd_child = per_cpu(sd_llc, cpu);
+		if (sd_child != sd) {
+			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+			if ((unsigned)i < nr_cpumask_bits)
+				return i;
+		}
+
+		sg = sg->next;
+	} while (sg != parent->groups);
+
+	return -1;
+}
+
+/*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
  * maximize capacity.
@@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct ta
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (sched_feat(SIS_NODE)) {
+		i = select_idle_node(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
+
 	return target;
 }
 
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
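
The group walk in select_idle_node() can be modelled in userspace as
below. This is a simplified sketch, not the scheduler code: it assumes 12
CPUs arranged as 4 LLCs of 3 CPUs each (the topology from the commit
message, ignoring SMT), it compares group pointers instead of looking up
per_cpu(sd_llc, cpu), and all names (llc_group, build_ring(), scan_llc(),
pick_idle_in_node()) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_LLCS		4
#define CPUS_PER_LLC	3
#define NR_CPUS		(NR_LLCS * CPUS_PER_LLC)

/* One LLC worth of CPUs; groups inside a node form a circular list. */
struct llc_group {
	int first_cpu;
	struct llc_group *next;
};

/* Idle state per CPU; everything starts out busy. */
static bool cpu_idle[NR_CPUS];

/* Link @n groups into a ring, each covering CPUS_PER_LLC consecutive CPUs. */
void build_ring(struct llc_group *ring, int n)
{
	for (int i = 0; i < n; i++) {
		ring[i].first_cpu = i * CPUS_PER_LLC;
		ring[i].next = &ring[(i + 1) % n];
	}
}

/* Return the first idle CPU in @g, or -1 if the whole LLC is busy. */
static int scan_llc(const struct llc_group *g)
{
	for (int cpu = g->first_cpu; cpu < g->first_cpu + CPUS_PER_LLC; cpu++)
		if (cpu_idle[cpu])
			return cpu;
	return -1;
}

/*
 * Walk every LLC in the node except @local (which the caller has already
 * searched) and return the first idle CPU found, or -1 if there is none.
 */
int pick_idle_in_node(struct llc_group *groups, const struct llc_group *local)
{
	struct llc_group *g = groups;

	do {
		if (g != local) {
			int cpu = scan_llc(g);

			if (cpu >= 0)
				return cpu;
		}
		g = g->next;
	} while (g != groups);

	return -1;
}
```

The do/while shape mirrors the patch: the list is circular, so the walk
stops when it returns to the first group. The local LLC is skipped because
it has already been scanned by the time this fallback runs.
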

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-19  0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
  2023-05-30  8:06   ` K Prateek Nayak
@ 2023-05-30 21:18   ` Sandeep Dhavale
  2023-05-31 12:14     ` K Prateek Nayak
  1 sibling, 1 reply; 73+ messages in thread
From: Sandeep Dhavale @ 2023-05-30 21:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	kernel-team

Hi Tejun,

> @@ -6234,6 +6256,7 @@ static inline void wq_watchdog_init(void) { }
>   */
>  void __init workqueue_init_early(void)
>  {
> +       struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_SYSTEM];
>         int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
>         int i, cpu;
>
> @@ -6248,6 +6271,22 @@ void __init workqueue_init_early(void)
>         wq_update_pod_attrs_buf = alloc_workqueue_attrs();
>         BUG_ON(!wq_update_pod_attrs_buf);
>
> +       /* initialize WQ_AFFN_SYSTEM pods */
> +       pt->pod_cpus = kcalloc(1, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
> +       pt->pod_node = kcalloc(1, sizeof(pt->pod_node[0]), GFP_KERNEL);
> +       pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
> +       BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
> +
> +       BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[0], GFP_KERNEL, NUMA_NO_NODE));
> +
> +       wq_update_pod_attrs_buf = alloc_workqueue_attrs();
> +       BUG_ON(!wq_update_pod_attrs_buf);
> +

Looks like allocation for wq_update_pod_attrs_buf is already being
done in the preceding code block.

I am trying to evaluate this series to see if it helps with the
scheduling delays we have seen in EROFS.
In addition to the panic and fix reported by Prateek [0], I am seeing
stability issues that occur only with the series applied.
I am testing on a Pixel 6 with the android-mainline kernel [1].

The panic seems to be in the context of a kworker for the events_unbound wq.
The only significant change made directly to the events_unbound wq was in
patch [2]:

@@ -6399,7 +6335,7 @@ void __init workqueue_init_early(void)
  system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
  system_long_wq = alloc_workqueue("events_long", 0, 0);
  system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
-    WQ_UNBOUND_MAX_ACTIVE);
+    WQ_MAX_ACTIVE);
  system_freezable_wq = alloc_workqueue("events_freezable",
       WQ_FREEZABLE, 0);
  system_power_efficient_wq = alloc_workqueue("events_power_efficient",

Panic log:
[  316.386684][  T115] Unable to handle kernel paging request at virtual address ffffffd2745a0160
[  316.386936][  T115] Mem abort info:
[  316.387027][  T115]   ESR = 0x0000000096000007
[  316.387137][  T115]   EC = 0x25: DABT (current EL), IL = 32 bits
[  316.387284][  T115]   SET = 0, FnV = 0
[  316.387378][  T115]   EA = 0, S1PTW = 0
[  316.387475][  T115]   FSC = 0x07: level 3 translation fault
[  316.387606][  T115] Data abort info:
[  316.387694][  T115]   ISV = 0, ISS = 0x00000007
[  316.387804][  T115]   CM = 0, WnR = 0
[  316.387897][  T115] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081dec000
[  316.388071][  T115] [ffffffd2745a0160] pgd=10000009d83ff003, p4d=10000009d83ff003, pud=10000009d83ff003, pmd=10000009d83fb003, pte=0000000000000000
[  316.388491][  T115] Internal error: Oops: 0000000096000007 [#1] PREEMPT SMP
[  316.388765][  T115] debug-snapshot dss: core register saved(CPU:2)
[  316.388993][  T115] debug-snapshot dss: ECC error check erridr_el1.num = 0x2
[  316.389260][  T115] debug-snapshot dss: ERRSELR_EL1.SEL = 0, NOT Error, ERXSTATUS_EL1 = 0x0
[  316.389578][  T115] debug-snapshot dss: ERRSELR_EL1.SEL = 1, NOT Error, ERXSTATUS_EL1 = 0x0
[  316.389898][  T115] debug-snapshot dss: context saved(CPU:2)
[  316.390112][  T115] item - log_kevents is disabled
[  316.390300][  T115] Modules linked in: sec_touch(OE) ftm5(OE)
bcmdhd4389(OE) goog_touch_interface(OE) snd_soc_cs40l2x(OE)
haptics_cs40l2x(OE) google_dock(OE) lwis(OE) panel_boe_nt37290(OE)
panel_samsung_s6e3hc4(OE) panel_samsung_s6e3hc3_c10(OE)
panel_samsung_s6e3fc3_p10(OE) stmvl53l1(OE) slg51000_core(OE)
slg51000_regulator(OE) pinctrl_slg51000(OE) nfc mac802154
ieee802154_socket ieee802154_6lowpan ieee802154 nhc_udp nhc_routing
nhc_mobility nhc_ipv6 nhc_hop nhc_fragment nhc_dest 6lowpan diag tipc
mac80211 l2tp_ppp l2tp_core hidp rfcomm can_gw can_bcm can_raw can
cfg80211 8021q btsdio hci_uart btqca btbcm bluetooth ftdi_sio
usbserial cdc_acm r8153_ecm aqc111 cdc_ncm cdc_eem cdc_ether
ax88179_178a asix usbnet r8152 rtl8150 pptp pppox ppp_mppe ppp_deflate
bsd_comp ppp_generic slhc slcan vcan can_dev mii libarc4 bigocean(OE)
st33spi(OE) st54spi(OE) st21nfc(OE) nitrous(OE) rfkill
exynos_reboot(OE) heatmap(OE) touch_bus_negotiator(OE)
touch_offload(OE) aoc_alsa_dev(OE) aoc_alsa_dev_util(OE)
aoc_uwb_platform_drv(OE)
[  316.390708][  T115]  aoc_uwb_service_dev(OE) aoc_channel_dev(OE)
aoc_control_dev(OE) aoc_char_dev(OE) aoc_core(OE) mailbox_wc(OE)
audiometrics(OE) snd_soc_cs35l41_i2c(OE) snd_soc_cs35l41_spi(OE)
snd_soc_cs35l41(OE) snd_soc_wm_adsp(OE) max20339(OE) pca9468(OE)
p9221(OE) max77759_charger(OE) max77729_charger(OE) max77729_uic(OE)
max77729_pmic(OE) max1720x_battery(OE) overheat_mitigation(OE)
google_cpm(OE) google_dual_batt_gauge(OE) google_charger(OE)
google_battery(OE) google_bms(OE) abrolhos(OE) mali_kbase(OE)
mali_pixel(OE) panel_samsung_s6e3hc3(OE) panel_samsung_sofef01(OE)
panel_samsung_s6e3fc3(OE) panel_samsung_s6e3hc2(OE)
panel_samsung_emul(OE) panel_samsung_drv(OE) exynos_drm(OE)
arm_memlat_mon(OE) governor_memlat(OE) memlat_devfreq(OE)
exynos_acme(OE) s3c2410_wdt(OE) trusty_virtio(OE) trusty_test(OE)
trusty_log(OE) trusty_irq(OE) gs101_spmic_thermal(OE) gpu_cooling(OE)
debug_reboot(OE) smfc(OE) exynos_mfc(OE) i2c_exynos5(OE)
rtc_s2mpg10(OE) keycombo(OE) goodixfp(OE) usbc_cooling_dev(OE)
tcpci_max77759(OE)
[  316.393987][  T115]  max77759_contaminant(OE) bc_max77759(OE)
max77759_helper(OE) tcpci_fusb307(OE) slg46826(OE) usb_psy(OE)
usb_f_dm1(OE) usb_f_dm(OE) xhci_exynos(OE) ufs_exynos_gs(OE)
s2mpg1x_gpio(OE) bcm47765(OE) sscoredump(OE) sbb_mux(OE) gsc_spi(OE)
g2d(OE) samsung_iommu(OE) samsung_iommu_group(OE) exyswd_rng(OE)
exynos_tty(OE) max77826_gs_regulator(OE) boot_control_sysfs(OE)
exynos_seclog(OE) dbgcore_dump(OE) pixel_stat_mm(OE)
pixel_stat_sysfs(OE) sysrq_hook(OE) hardlockup_debug(OE) eh(OE)
cp_thermal_zone(OE) cpif(OE) bts(OE) exynos_dit(OE) cpif_page(OE)
boot_device_spi(OE) bcm_dbg(OE) exynos_bcm_dbg_dump(OE) gsa_gsc(OE)
slc_acpm(OE) slc_pmon(OE) slc_dummy(OE) acpm_mbox_test(OE)
exynos_devfreq(OE) exynos_dm(OE) slc_pt(OE) power_stats(OE)
exynos_pd_dbg(OE) pixel_em(OE) gs_thermal(OE) google_bcl(OE)
i2c_acpm(OE) s2mpg11_regulator(OE) s2mpg10_regulator(OE) odpm(OE)
s2mpg10_powermeter(OE) s2mpg10_mfd(OE) s2mpg11_powermeter(OE)
pmic_class(OE) s2mpg11_mfd(OE) exynos_cpuhp(OE) pixel_boot_metrics(OE)
exynos_adv_tracer_s2d(OE)
[  316.397483][  T115]  keydebug(OE) exynos_coresight_etm(OE)
exynos_ecc_handler(OE) exynos_coresight(OE) exynos_debug_test(OE)
pixel_debug_test(OE) ehld(OE) sjtag_driver(OE) exynos_adv_tracer(OE)
gsa(OE) trusty_ipc(OE) samsung_dma_heap(OE) trusty_core(OE)
samsung_secure_iova(OE) deferred_free_helper(OE) page_pool(OE)
hardlockup_watchdog(OE) debug_snapshot_debug_kinfo(OE)
debug_snapshot_qd(OE) debug_snapshot_sfrdump(OE) exynos_pd(OE)
dwc3_exynos_usb(OE) gvotable(OE) clk_exynos_gs(OE) pcie_exynos_gs(OE)
exynos_pm(OE) acpm_flexpmu_dbg(OE) pcie_exynos_gs101_rc_cal(OE)
shm_ipc(OE) spi_s3c64xx(OE) samsung_dma(OE) pl330(OE) s2mpu(OE)
logbuffer(OE) itmon(OE) exynos_cpupm(OE) exynos_mct(OE) cmupmucal(OE)
exynos_pm_qos(OE) gs_acpm(OE) kernel_top(OE) dss(OE)
pixel_suspend_diag(OE) systrace(OE) ect_parser(OE) gs_chipid(OE)
pinctrl_exynos_gs(OE) phy_exynos_mipi(OE) phy_exynos_mipi_dsim(OE)
exynos_pmu_if(OE) phy_exynos_usbdrd_super(OE) exynos_pd_el3(OE)
arm_dsu_pmu(E) softdog(E) pps_gpio(E) i2c_dev(E) spidev(E) sg(E)
at24(E) zram zsmalloc
[  316.404101][  T115] CPU: 2 PID: 115 Comm: kworker/u24:2 Tainted: G        W  OE      6.3.0-mainline-maybe-dirty #1
[  316.404491][  T115] Hardware name: Oriole DVT (DT)
[  316.404678][  T115] Workqueue: events_unbound idle_cull_fn
[  316.404882][  T115] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  316.405176][  T115] pc : available_idle_cpu+0x20/0x60
[  316.405368][  T115] lr : select_task_rq_fair+0x1d0/0x17d8
[  316.405574][  T115] sp : ffffffc008dfbb40
[  316.405728][  T115] x29: ffffffc008dfbc10 x28: 0000000000000000 x27: 0000000000000008
[  316.406028][  T115] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000008
[  316.406323][  T115] x23: 0000000000000000 x22: 0000000000000400 x21: 0000000000000000
[  316.406623][  T115] x20: 0000000000000008 x19: ffffff8800812380 x18: ffffffc008cdf040
[  316.406925][  T115] x17: 00000000aa3494c0 x16: 00000000aa3494c0 x15: 0000000000019ed5
[  316.407221][  T115] x14: 0000000000000001 x13: 000000000001a2d5 x12: 0000000000000010
[  316.407521][  T115] x11: 0000000000000400 x10: de8448a6b7c5d500 x9 : ffffffd27459f6c0
[  316.407822][  T115] x8 : ffffffd27459f6c0 x7 : 0000000000008080 x6 : 0000000000000000
[  316.408118][  T115] x5 : ffffff894f35c590 x4 : 0000646e756f626e x3 : 0000000000000008
[  316.408418][  T115] x2 : 0000000000000001 x1 : ffffff8800812380 x0 : 0000000000000008
[  316.408724][  T115] Call trace:
[  316.408842][  T115]  available_idle_cpu+0x20/0x60
[  316.409020][  T115]  try_to_wake_up+0x4ec/0x85c
[  316.409190][  T115]  wake_up_process+0x18/0x28
[  316.409359][  T115]  wake_dying_workers+0x5c/0xe8
[  316.409539][  T115]  idle_cull_fn+0xdc/0x11c
[  316.409705][  T115]  process_scheduled_works+0x208/0x45c
[  316.409905][  T115]  worker_thread+0x22c/0x31c
[  316.410074][  T115]  kthread+0x114/0x1c0
[  316.410229][  T115]  ret_from_fork+0x10/0x20
[  316.410399][  T115] Code: b00105c9 911b0129 f8605908 8b090108 (f9455109)
[  316.410651][  T115] ---[ end trace 0000000000000000 ]---
[  316.410853][  T115] Kernel panic - not syncing: Oops: Fatal exception
[  316.411097][  T115] SMP: stopping secondary CPUs

Do you think the change in patch [2] could be related?

Thanks,
Sandeep.

[0] https://lore.kernel.org/all/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com/
[1] https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline
[2] https://lore.kernel.org/all/20230519001709.2563-10-tj@kernel.org/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-30 21:18   ` Sandeep Dhavale
@ 2023-05-31 12:14     ` K Prateek Nayak
  2023-06-07 22:13       ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-05-31 12:14 UTC (permalink / raw)
  To: Sandeep Dhavale, Tejun Heo
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	kernel-team

Hello Sandeep,

I too am seeing a similar crash with the same call stack, albeit with a
different error, a little while after the kernel boots. I'll inline
the details below.

On 5/31/2023 2:48 AM, Sandeep Dhavale wrote:
> Hi Tejun,
> 
>> @@ -6234,6 +6256,7 @@ static inline void wq_watchdog_init(void) { }
>>   */
>>  void __init workqueue_init_early(void)
>>  {
>> +       struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_SYSTEM];
>>         int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
>>         int i, cpu;
>>
>> @@ -6248,6 +6271,22 @@ void __init workqueue_init_early(void)
>>         wq_update_pod_attrs_buf = alloc_workqueue_attrs();
>>         BUG_ON(!wq_update_pod_attrs_buf);
>>
>> +       /* initialize WQ_AFFN_SYSTEM pods */
>> +       pt->pod_cpus = kcalloc(1, sizeof(pt->pod_cpus[0]), GFP_KERNEL);
>> +       pt->pod_node = kcalloc(1, sizeof(pt->pod_node[0]), GFP_KERNEL);
>> +       pt->cpu_pod = kcalloc(nr_cpu_ids, sizeof(pt->cpu_pod[0]), GFP_KERNEL);
>> +       BUG_ON(!pt->pod_cpus || !pt->pod_node || !pt->cpu_pod);
>> +
>> +       BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[0], GFP_KERNEL, NUMA_NO_NODE));
>> +
>> +       wq_update_pod_attrs_buf = alloc_workqueue_attrs();
>> +       BUG_ON(!wq_update_pod_attrs_buf);
>> +
> 
> Looks like allocation for wq_update_pod_attrs_buf is already being
> done in the preceding code block.
> 
> I am trying to evaluate this series to see if it helps with the
> scheduling delays we have seen in EROFS.
> In addition to the panic and fix reported by Prateek [0], I am having
> stability issues only with the series applied.
> I am testing with Pixel 6 and android-mainline kernel [1]
> 
> The panic seems to be in the context of kworker for events_unbound wq.
> The only significant change directly to events_unbound wq was in patch [2]
> 
> @@ -6399,7 +6335,7 @@ void __init workqueue_init_early(void)
>   system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
>   system_long_wq = alloc_workqueue("events_long", 0, 0);
>   system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
> -    WQ_UNBOUND_MAX_ACTIVE);
> +    WQ_MAX_ACTIVE);
>   system_freezable_wq = alloc_workqueue("events_freezable",
>        WQ_FREEZABLE, 0);
>   system_power_efficient_wq = alloc_workqueue("events_power_efficient",
> 
> Panic log:
> [  316.386684][  T115] Unable to handle kernel paging request at
> virtual address ffffffd2745a0160
> [  316.386936][  T115] Mem abort info:
> [  316.387027][  T115]   ESR = 0x0000000096000007
> [  316.387137][  T115]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  316.387284][  T115]   SET = 0, FnV = 0
> [  316.387378][  T115]   EA = 0, S1PTW = 0
> [  316.387475][  T115]   FSC = 0x07: level 3 translation fault
> [  316.387606][  T115] Data abort info:
> [  316.387694][  T115]   ISV = 0, ISS = 0x00000007
> [  316.387804][  T115]   CM = 0, WnR = 0
> [  316.387897][  T115] swapper pgtable: 4k pages, 39-bit VAs,
> pgdp=0000000081dec000
> [  316.388071][  T115] [ffffffd2745a0160] pgd=10000009d83ff003,
> p4d=10000009d83ff003, pud=10000009d83ff003, pmd=10000009d83fb003,
> pte=0000000000000000
> [  316.388491][  T115] Internal error: Oops: 0000000096000007 [#1] PREEMPT SMP
> [  316.388765][  T115] debug-snapshot dss: core register saved(CPU:2)
> [  316.388993][  T115] debug-snapshot dss: ECC error check erridr_el1.num = 0x2
> [  316.389260][  T115] debug-snapshot dss: ERRSELR_EL1.SEL = 0, NOT
> Error, ERXSTATUS_EL1 = 0x0
> [  316.389578][  T115] debug-snapshot dss: ERRSELR_EL1.SEL = 1, NOT
> Error, ERXSTATUS_EL1 = 0x0
> [  316.389898][  T115] debug-snapshot dss: context saved(CPU:2)
> [  316.390112][  T115] item - log_kevents is disabled
> [..snip..]
> [  316.404101][  T115] CPU: 2 PID: 115 Comm: kworker/u24:2 Tainted: G
>       W  OE      6.3.0-mainline-maybe-dirty #1
> [  316.404491][  T115] Hardware name: Oriole DVT (DT)
> [  316.404678][  T115] Workqueue: events_unbound idle_cull_fn
> [  316.404882][  T115] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT
> -SSBS BTYPE=--)
> [  316.405176][  T115] pc : available_idle_cpu+0x20/0x60
> [  316.405368][  T115] lr : select_task_rq_fair+0x1d0/0x17d8
> [  316.405574][  T115] sp : ffffffc008dfbb40
> [  316.405728][  T115] x29: ffffffc008dfbc10 x28: 0000000000000000
> x27: 0000000000000008
> [  316.406028][  T115] x26: 0000000000000000 x25: 0000000000000001
> x24: 0000000000000008
> [  316.406323][  T115] x23: 0000000000000000 x22: 0000000000000400
> x21: 0000000000000000
> [  316.406623][  T115] x20: 0000000000000008 x19: ffffff8800812380
> x18: ffffffc008cdf040
> [  316.406925][  T115] x17: 00000000aa3494c0 x16: 00000000aa3494c0
> x15: 0000000000019ed5
> [  316.407221][  T115] x14: 0000000000000001 x13: 000000000001a2d5
> x12: 0000000000000010
> [  316.407521][  T115] x11: 0000000000000400 x10: de8448a6b7c5d500 x9
> : ffffffd27459f6c0
> [  316.407822][  T115] x8 : ffffffd27459f6c0 x7 : 0000000000008080 x6
> : 0000000000000000
> [  316.408118][  T115] x5 : ffffff894f35c590 x4 : 0000646e756f626e x3
> : 0000000000000008
> [  316.408418][  T115] x2 : 0000000000000001 x1 : ffffff8800812380 x0
> : 0000000000000008
> [  316.408724][  T115] Call trace:
> [  316.408842][  T115]  available_idle_cpu+0x20/0x60
> [  316.409020][  T115]  try_to_wake_up+0x4ec/0x85c
> [  316.409190][  T115]  wake_up_process+0x18/0x28
> [  316.409359][  T115]  wake_dying_workers+0x5c/0xe8
> [  316.409539][  T115]  idle_cull_fn+0xdc/0x11c
> [  316.409705][  T115]  process_scheduled_works+0x208/0x45c
> [  316.409905][  T115]  worker_thread+0x22c/0x31c
> [  316.410074][  T115]  kthread+0x114/0x1c0
> [  316.410229][  T115]  ret_from_fork+0x10/0x20
> [  316.410399][  T115] Code: b00105c9 911b0129 f8605908 8b090108 (f9455109)
> [  316.410651][  T115] ---[ end trace 0000000000000000 ]---
> [  316.410853][  T115] Kernel panic - not syncing: Oops: Fatal exception
> [  316.411097][  T115] SMP: stopping secondary CPUs
> 
> Do you think the change in patch [2] could be related?

I have hit the following two errors, both at the exact same RIP:

1) General Protection Fault

    [  320.476222] general protection fault, probably for non-canonical address 0xfbcb2fe8ef894d01: 0000 [#1] PREEMPT SMP NOPTI
    [  320.487110] CPU: 16 PID: 1553 Comm: kworker/u512:1 Not tainted 6.4.0-rc1-tj-wq-please-boot+ #457
    [  320.495289] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [  320.502855] Workqueue: events_unbound idle_cull_fn
    [  320.507663] RIP: 0010:select_task_rq_fair+0x9bd/0x2570
    [  320.512812] Code: ff 0f 1f 44 00 00 49 c7 c6 28 15 02 00 48 81 bd 60 ff ff ff ff 1f 00 00 0f 87 dc 17 00 00 4d 01 f5 49 8b 45 00 48 85 c0 74 0b <8b> 40 08 85 c0 0f 85 36 11 00 00 8b 75 98 8b 7d a8 e8 7d 01 ff ff
    [  320.531559] RSP: 0018:ffffb7ba505c3c58 EFLAGS: 00010086
    [  320.536784] RAX: fbcb2fe8ef894cf9 RBX: ffffffffa5454538 RCX: 0000000000000010
    [  320.543916] RDX: 542058454d4f4400 RSI: 0000000000000100 RDI: 0000000000000080
    [  320.551050] RBP: ffffb7ba505c3db8 R08: 0000000000000000 R09: 0000000000000012
    [  320.558182] R10: ffff9db1c0159620 R11: ffffffffffffffff R12: ffff9df03d633840
    [  320.565315] R13: ffffffffa5454528 R14: 0000000000021528 R15: ffff9db1cb1b8000
    [  320.572447] FS:  0000000000000000(0000) GS:ffff9df03d600000(0000) knlGS:0000000000000000
    [  320.580535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  320.586280] CR2: 000055c6dc75d008 CR3: 000000807d43c004 CR4: 0000000000770ee0
    [  320.593414] PKRU: 55555554
    [  320.596126] Call Trace:
    [  320.598581]  <TASK>
    [  320.600687]  ? raw_spin_rq_unlock+0x14/0x40
    [  320.604877]  ? affine_move_task+0x29c/0x580
    [  320.609065]  ? update_load_avg+0x82/0x790
    [  320.613079]  ? __set_cpus_allowed_ptr_locked+0x146/0x1c0
    [  320.618390]  try_to_wake_up+0x121/0x690
    [  320.622230]  wake_up_process+0x19/0x20
    [  320.625983]  idle_cull_fn+0x9d/0x130
    [  320.629560]  process_one_work+0x190/0x360
    [  320.633576]  worker_thread+0x2c7/0x440
    [  320.637326]  ? __pfx_worker_thread+0x10/0x10
    [  320.641600]  kthread+0xfb/0x130
    [  320.644755]  ? __pfx_kthread+0x10/0x10
    [  320.648507]  ret_from_fork+0x2c/0x50
    [  320.652097]  </TASK>
    [  320.654288] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge
    stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc ipmi_ssif nls_iso8859_1 intel_rapl_msr intel_rapl_common amd64_edac kvm_amd kvm rapl dell_smbios dcdbas dell_wmi_descriptor wmi_bmof ccp ptdma
    k10temp acpi_ipmi ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4
    btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel
    drm_shmem_helper sha512_ssse3 drm_kms_helper syscopyarea sysfillrect aesni_intel sysimgblt crypto_simd cryptd tg3 xhci_pci drm
    [  320.654405]  xhci_pci_renesas megaraid_sas wmi
    [  320.748401] ---[ end trace 0000000000000000 ]---

2) NULL Pointer Dereference

    [  320.700972] BUG: kernel NULL pointer dereference, address: 0000000000000007
    [  320.707942] #PF: supervisor read access in kernel mode
    [  320.713079] #PF: error_code(0x0000) - not-present page
    [  320.718220] PGD 0 P4D 0
    [  320.720758] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [  320.725118] CPU: 200 PID: 3718 Comm: kworker/u522:2 Not tainted 6.4.0-rc1-tj-wq-test+ #470
    [  320.733376] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [  320.740942] Workqueue: events_unbound idle_cull_fn
    [  320.745744] RIP: 0010:select_task_rq_fair+0x9bd/0x2570
    [  320.750883] Code: ff 0f 1f 44 00 00 49 c7 c6 28 15 02 00 48 81 bd 60 ff ff ff ff 1f 00 00 0f 87 dc 17 00 00 4d 01 f5 49 8b 45 00 48 85 c0 74 0b <8b> 40 08 85 c0 0f 85 36 11 00 00 8b 75 98 8b 7d a8 e8 7d 01 ff ff
    [  320.769628] RSP: 0018:ffff9d9bd663fc58 EFLAGS: 00010086
    [  320.774856] RAX: ffffffffffffffff RBX: ffffffffafc54538 RCX: 00000000000000c8
    [  320.781989] RDX: cccccccccccccccc RSI: 0000000000000100 RDI: 0000000000000000
    [  320.789122] RBP: ffff9d9bd663fdb8 R08: 0000000000000000 R09: 0000000000000001
    [  320.796254] R10: ffff8f73801599c0 R11: ffffffffffffffff R12: ffff8ff1f3e33840
    [  320.803388] R13: ffffffffafc54528 R14: 0000000000021528 R15: ffff8fb306fe4d40
    [  320.810519] FS:  0000000000000000(0000) GS:ffff8ff1f3e00000(0000) knlGS:0000000000000000
    [  320.818606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  320.824353] CR2: 0000000000000007 CR3: 000000807d43c003 CR4: 0000000000770ee0
    [  320.831484] PKRU: 55555554
    [  320.834197] Call Trace:
    [  320.836651]  <TASK>
    [  320.838760]  ? raw_spin_rq_unlock+0x14/0x40
    [  320.842944]  ? affine_move_task+0x29c/0x580
    [  320.847129]  ? update_load_avg+0x82/0x790
    [  320.851144]  ? __set_cpus_allowed_ptr_locked+0x146/0x1c0
    [  320.856453]  try_to_wake_up+0x121/0x690
    [  320.860295]  wake_up_process+0x19/0x20
    [  320.864046]  idle_cull_fn+0x9d/0x130
    [  320.867625]  process_one_work+0x190/0x360
    [  320.871638]  ? __pfx_worker_thread+0x10/0x10
    [  320.875912]  worker_thread+0x2c7/0x440
    [  320.879665]  ? __pfx_worker_thread+0x10/0x10
    [  320.883935]  kthread+0xfb/0x130
    [  320.887083]  ? __pfx_kthread+0x10/0x10
    [  320.890837]  ret_from_fork+0x2c/0x50
    [  320.894414]  </TASK>
    [  320.896608] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge
    stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc ipmi_ssif nls_iso8859_1 intel_rapl_msr intel_rapl_common amd64_edac kvm_amd kvm rapl dell_smbios dcdbas dell_wmi_descriptor wmi_bmof ccp ptdma
    k10temp acpi_ipmi ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables
    autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea
    crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel sha512_ssse3 sysimgblt aesni_intel crypto_simd cryptd tg3 drm xhci_pci
    [  320.896686]  xhci_pci_renesas megaraid_sas wmi
    [  320.990684] CR2: 0000000000000007
    [  320.994006] ---[ end trace 0000000000000000 ]---

The RIP points to the dereference of sd_llc_shared->has_idle_cores:

    $ scripts/faddr2line vmlinux select_task_rq_fair+0x9bd
    select_task_rq_fair+0x9bd/0x2570:
    test_idle_cores at kernel/sched/fair.c:6830
    (inlined by) select_idle_sibling at kernel/sched/fair.c:7189
    (inlined by) select_task_rq_fair at kernel/sched/fair.c:7710
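
Both oopses are consistent with a non-NULL but invalid pointer being
dereferenced at offset 8: in each, the faulting address equals RAX + 8
(0xfbcb2fe8ef894cf9 + 8 = 0xfbcb2fe8ef894d01, and 0xffffffffffffffff + 8
wraps around to 0x7, the CR2 value). A small userspace sketch of that
arithmetic follows. struct sds_model is a hypothetical stand-in whose
layout is assumed (two int-sized members ahead of has_idle_cores); it is
not copied from the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout: two 4-byte members precede has_idle_cores, which puts it
 * at offset 8 and matches the "mov 0x8(%rax),%eax" in the Code: line.
 */
struct sds_model {
	int ref;
	int nr_busy_cpus;
	int has_idle_cores;
	int nr_idle_scan;
};

/*
 * Address that a read of ->has_idle_cores touches for a given pointer
 * value. The arithmetic is modulo 2^64, so an all-ones pointer wraps.
 */
uint64_t faulting_address(uint64_t sds_ptr)
{
	return sds_ptr + (uint64_t)offsetof(struct sds_model, has_idle_cores);
}
```

Note that a plain NULL check does not help in either case: both RAX values
are non-zero, so the "if (sds)" style test passes and the load goes ahead.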

My kernel is somewhat stable (I have not seen a panic for ~45min but I
was not stress testing the system either during that time) with the
following changes:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b2e914655f05..a279cc9c2248 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2247,7 +2247,7 @@ static void unbind_worker(struct worker *worker)
        if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
        else
-               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
 }

 static void wake_dying_workers(struct list_head *cull_list)
--

However, the bits above were not directly changed by this patch and have
been in workqueue.c since commit 46a4d679ef88 ("workqueue: Avoid a false
warning in unbind_workers()"). I can only suspect something else changed
that has uncovered another issue in my case. You can give it a try and
see if it helps your case too.

I'll wait for Tejun's response however, since I have no explanation as to
why the above workaround improves the system stability in my case :)

> 
> Thanks,
> Sandeep.
> 
> [0] https://lore.kernel.org/all/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com/
> [1] https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline
> [2] https://lore.kernel.org/all/20230519001709.2563-10-tj@kernel.org/

--
Thanks and Regards,
Prateek

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-23 16:12   ` Vincent Guittot
  2023-05-24  7:34     ` Peter Zijlstra
@ 2023-06-05  4:46     ` Gautham R. Shenoy
  2023-06-07 14:42       ` Libo Chen
  1 sibling, 1 reply; 73+ messages in thread
From: Gautham R. Shenoy @ 2023-06-05  4:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Tejun Heo, jiangshanlai, torvalds, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, libo.chen, srikar

Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So wakeup based placement is mostly all about LLC, and given this thing
> > has dinky small LLCs it will pile up on the one LLC you target and leave
> > the others idle until the regular idle balancer decides to make an
> > appearance and move some around.
> >
> > But if these are fairly short running tasks, I can well see that not
> > going to help much.
> >
> >
> > Much of this was tuned back when there was 1 L3 per Node; something
> > which is still more or less true for Intel but clearly not for these
> > things.
> >
> >
> > The below is a bit crude and completely untested, but it might help. The
> > flip side of that coin is of course that people are going to complain
> > about how selecting a CPU is more expensive now and how this hurts their
> > performance :/
> >
> > Basically it will try and iterate all L3s in a node; wakeup will still
> > refuse to cross node boundaries.
> 
> That reminds me of some discussion about systems with a fast on-die
> interconnect where we would like to run wider than the LLC at wakeup (i.e.
> DIE level), something like the CLUSTER level but on the other side of
> MC
>

Adding Libo Chen who was a part of this discussion. IIRC, the problem was
that there was no MC domain on that system, which would have made the
SMT domain the sd_llc. But since the core is single threaded,
the SMT domain would be degenerated, thus leaving no domain which has
the SD_SHARE_PKG_RESOURCES flag.

If I understand correctly, Peter's patch won't help in such a
situation.

However, it should help POWER10 which has the SMT domain as the LLC
and previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/)
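
To make the degenerate-domain case above concrete, here is a minimal
userspace sketch. It assumes a heavily simplified model: a domain that
spans a single CPU is dropped, and the LLC domain is the highest domain,
walking up from the bottom, that still carries the shared-cache flag. All
toy_* names and the flag value are made up for illustration; the kernel's
real degeneration rules are more involved.

```c
#include <stddef.h>

#define SD_SHARE_PKG_RESOURCES 0x1	/* illustrative value, not the kernel's */

/* Hypothetical, stripped-down stand-in for struct sched_domain */
struct toy_domain {
	const char *name;
	unsigned int flags;
	unsigned int span_weight;	/* number of CPUs spanned */
	struct toy_domain *parent;
};

/* Assumption: a domain spanning one CPU is degenerate and is skipped */
static struct toy_domain *toy_skip_degenerate(struct toy_domain *sd)
{
	while (sd && sd->span_weight <= 1)
		sd = sd->parent;
	return sd;
}

/* Highest domain, walking up from @sd, that still has @flag set */
static struct toy_domain *toy_highest_flag_domain(struct toy_domain *sd,
						   unsigned int flag)
{
	struct toy_domain *hsd = NULL;

	for (sd = toy_skip_degenerate(sd); sd;
	     sd = toy_skip_degenerate(sd->parent)) {
		if (!(sd->flags & flag))
			break;
		hsd = sd;
	}
	return hsd;	/* NULL models "no sd_llc at all" */
}
```

With a single-threaded SMT level directly below DIE, the walk returns NULL,
which is the "no domain has the flag" situation described above.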


--
Thanks and Regards
gautham.


> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and it only cares about
> the waker, so we should use -1 for the prev_cpu
> 
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..ddb7f16a07a9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         return idle_cpu;
> >  }
> >
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +       struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> > +       struct sched_group *sg;
> > +
> > +       if (!sd_node || sd_node == sd)
> > +               return -1;
> > +
> > +       sg = sd_node->groups;
> > +       do {
> > +               int cpu = cpumask_first(sched_group_span(sg));
> > +               struct sched_domain *sd_child;
> > +
> > +               sd_child = per_cpu(sd_llc, cpu);
> > +               if (sd_child != sd) {
> > +                       int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > +                       if ((unsigned int)i < nr_cpumask_bits)
> > +                               return i;
> > +               }
> > +
> > +               sg = sg->next;
> > +       } while (sg != sd_node->groups);
> > +
> > +       return -1;
> > +}
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >         if ((unsigned)i < nr_cpumask_bits)
> >                 return i;
> >
> > +       if (sched_feat(SIS_NODE)) {
> > +               i = select_idle_node(p, sd, target);
> > +               if ((unsigned)i < nr_cpumask_bits)
> > +                       return i;
> > +       }
> > +
> >         return target;
> >  }
> >
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..f965cd4a981e 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   */
> >  SCHED_FEAT(SIS_PROP, false)
> >  SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_NODE, true)
> >
> >  /*
> >   * Issue a WARN when we do multiple update_rq_clock() calls
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 678446251c35..d2e0e2e496a6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DECLARE_PER_CPU(int, sd_llc_size);
> >  DECLARE_PER_CPU(int, sd_llc_id);
> >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index ca4472281c28..d94cbc2164ca 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DEFINE_PER_CPU(int, sd_llc_size);
> >  DEFINE_PER_CPU(int, sd_llc_id);
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> >         per_cpu(sd_llc_id, cpu) = id;
> >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > +       while (sd && sd->parent) {
> > +               /*
> > +                * SD_NUMA is the first domain spanning nodes, therefore @sd
> > +                * must be the domain that spans a single node.
> > +                */
> > +               if (sd->parent->flags & SD_NUMA)
> > +                       break;
> > +
> > +               sd = sd->parent;
> > +       }
> > +       rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> > +
> >         sd = lowest_flag_domain(cpu, SD_NUMA);
> >         rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> >
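
For readers skimming the quoted diff, the group walk in select_idle_node()
can be modelled in userspace roughly as follows. This is only a sketch
under simplifying assumptions: each group stands for one LLC made of
consecutive CPU numbers, idleness is a plain bool array, and all toy_*
names are hypothetical rather than kernel API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sched_group, i.e. one LLC within the node */
struct toy_group {
	int first_cpu;			/* first CPU of this LLC */
	int nr_cpus;			/* CPUs first_cpu .. first_cpu + nr_cpus - 1 */
	struct toy_group *next;		/* circular list, like sched_group->next */
};

/* Return the first idle CPU of @sg, or -1 if there is none */
static int toy_scan_llc(const struct toy_group *sg, const bool *idle)
{
	for (int cpu = sg->first_cpu; cpu < sg->first_cpu + sg->nr_cpus; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}

/*
 * Visit every LLC of the node except @own (already searched by the caller)
 * and return the first idle CPU found, or -1. The walk never leaves the
 * circular list, which models "wakeup will not cross node boundaries".
 */
static int toy_select_idle_node(struct toy_group *groups,
				const struct toy_group *own, const bool *idle)
{
	struct toy_group *sg = groups;

	if (!sg)
		return -1;

	do {
		if (sg != own) {
			int cpu = toy_scan_llc(sg, idle);

			if (cpu >= 0)
				return cpu;
		}
		sg = sg->next;
	} while (sg != groups);

	return -1;
}
```

The cost concern raised in the quoted text is visible here: in the worst
case every LLC in the node is scanned before falling back to the target.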

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-30  8:06   ` K Prateek Nayak
@ 2023-06-07  1:50     ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-06-07  1:50 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void

Sorry about the delay. We moved last week and that ended up being a lot more
disruptive than I expected.

On Tue, May 30, 2023 at 01:36:13PM +0530, K Prateek Nayak wrote:
> I ran into a NULL pointer dereferencing issue when trying to test a build
> of the "affinity-scopes-v1" branch from your workqueue tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/?h=affinity-scopes-v1)
> Inlining the splat, some debug details, and workaround I used below.
...
>     [    4.280321] BUG: kernel NULL pointer dereference, address: 0000000000000004
...
>     [    4.284172]  wq_update_pod+0x89/0x1e0
>     [    4.284172]  workqueue_online_cpu+0x1fc/0x250
>     [    4.284172]  cpuhp_invoke_callback+0x165/0x4b0
>     [    4.284172]  cpuhp_thread_fun+0xc4/0x1b0
>     [    4.284172]  smpboot_thread_fn+0xe7/0x1e0
>     [    4.284172]  kthread+0xfb/0x130
>     [    4.284172]  ret_from_fork+0x2c/0x50
>     [    4.284172]  </TASK>
>     [    4.284172] Modules linked in:
>     [    4.284172] CR2: 0000000000000004
>     [    4.284172] ---[ end trace 0000000000000000 ]---
> 
> I was basically hitting the following, seemingly impossible scenario in
> wqattrs_pod_type():
> 
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3825,8 +3825,10 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
>  {
>         struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];
> 
> -       if (likely(pt->nr_pods))
> +       if (likely(pt->nr_pods)) {
> +               BUG_ON(!pt->cpu_pod); /* No pods despite thinking we have? */
>                 return pt;
> +       }
> 
>         /*
>          * Before workqueue_init_topology(), only SYSTEM is available which is
> --
> 
> Logging the value of "attrs->affn_scope" when hitting the scenario gave
> me "5" which corresponds to "WQ_AFFN_NR_TYPES". The kernel was reading a
> value beyond the wq_pod_types[] bounds.
> 
> This value for "affn_scope" is only set in the above hunk and I got the
> kernel to boot by making the following change:
> 
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4069,7 +4071,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>         pool->node = node;
> 
>         /* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
> -       pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
> +       pool->attrs->affn_scope = wq_affn_dfl;
>         pool->attrs->localize = false;
>         pool->attrs->ordered = false;

I see. The code is a bit too subtle. wq_update_pod() is abusing dfl_pwq's
attrs to access its wq_unbound_cpumask-filtered cpumask but
wqattrs_pod_type() now expects to (rightfully) get the attrs of the
workqueue in question, not its dfl_pwq's.
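
The failure mode can be illustrated with a small userspace sketch. It
assumes five scopes followed by a sentinel, matching the reported value of
5; the toy_* names are hypothetical, and the bounds check is only there to
show why the sentinel must never be used as an index. It is not the fix
(that follows below and passes the right attrs instead).

```c
#include <stddef.h>

/* Hypothetical mirror of the affinity scope enum; sentinel comes last */
enum toy_affn_scope {
	TOY_AFFN_CPU,
	TOY_AFFN_SMT,
	TOY_AFFN_CACHE,
	TOY_AFFN_NUMA,
	TOY_AFFN_SYSTEM,
	TOY_AFFN_NR_TYPES,	/* == 5, also used as "cleared" in pool attrs */
};

struct toy_pod_type {
	int nr_pods;
};

static struct toy_pod_type toy_pod_types[TOY_AFFN_NR_TYPES];

/*
 * Look up the pod type for @scope. Pool attrs carry the sentinel, so
 * indexing with them would read one element past the array; return NULL
 * instead of touching memory outside toy_pod_types[].
 */
static struct toy_pod_type *toy_pod_type_lookup(int scope)
{
	if (scope < 0 || scope >= TOY_AFFN_NR_TYPES)
		return NULL;
	return &toy_pod_types[scope];
}
```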

A proper fix follows. This goes right before
0014-workqueue-Generalize-unbound-CPU-pods.patch. Some of the subsequent
patches need to be updated. I'll post an updated patch series later.
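
As a rough userspace model of the cpumask relationships the patch below
describes (#U, #A and #P), assuming at most 64 CPUs and a plain integer
bitmask in place of struct cpumask. The toy_* names are hypothetical, and
the per-pod step is shown as a bare intersection without the hotplug
handling that wq_calc_pod_cpumask() performs.

```c
#include <stdint.h>

typedef uint64_t toy_cpumask_t;	/* one bit per CPU, up to 64 CPUs */

/*
 * #U -> #A: filter the user-specified mask by the unbound mask and fall
 * back to the unbound mask when the two do not overlap.
 */
static toy_cpumask_t toy_actualize_cpumask(toy_cpumask_t user,
					   toy_cpumask_t unbound)
{
	toy_cpumask_t actual = user & unbound;

	return actual ? actual : unbound;
}

/* #A -> #P: the subset of #A that falls inside one pod (simplified) */
static toy_cpumask_t toy_pod_cpumask(toy_cpumask_t actual, toy_cpumask_t pod)
{
	return actual & pod;
}
```

For example, a user mask of CPUs 0-3 with an unbound mask of CPUs 2-5
yields #A = CPUs 2-3, while a user mask of CPUs 0-1 against an unbound mask
of CPUs 4-7 falls back to CPUs 4-7.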

Thanks.

From: Tejun Heo <tj@kernel.org>
Subject: workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()

For an unbound pool, multiple cpumasks are involved.

U: The user-specified cpumask (may be filtered with cpu_possible_mask).

A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
   leaves no CPU, wq_unbound_cpumask is used.

P: Per-pod subsets of #A.

wq->unbound_attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.

wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.

This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.

This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool->attrs.

This shouldn't cause any behavior changes in the current code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
---
 kernel/workqueue.c |   43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3681,6 +3681,20 @@ static bool wqattrs_equal(const struct w
 	return true;
 }
 
+/* Update @attrs with actually available CPUs */
+static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
+				      const cpumask_t *unbound_cpumask)
+{
+	/*
+	 * Calculate the effective CPU mask of @attrs given @unbound_cpumask. If
+	 * @attrs->cpumask doesn't overlap with @unbound_cpumask, we fallback to
+	 * @unbound_cpumask.
+	 */
+	cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask);
+	if (unlikely(cpumask_empty(attrs->cpumask)))
+		cpumask_copy(attrs->cpumask, unbound_cpumask);
+}
+
 /**
  * init_worker_pool - initialize a newly zalloc'd worker_pool
  * @pool: worker_pool to initialize
@@ -4206,32 +4220,22 @@ apply_wqattrs_prepare(struct workqueue_s
 		goto out_free;
 
 	/*
-	 * Calculate the attrs of the default pwq with unbound_cpumask
-	 * which is wq_unbound_cpumask or to set to wq_unbound_cpumask.
-	 * If the user configured cpumask doesn't overlap with the
-	 * wq_unbound_cpumask, we fallback to the wq_unbound_cpumask.
-	 */
-	copy_workqueue_attrs(new_attrs, attrs);
-	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbound_cpumask);
-	if (unlikely(cpumask_empty(new_attrs->cpumask)))
-		cpumask_copy(new_attrs->cpumask, unbound_cpumask);
-
-	/*
-	 * We may create multiple pwqs with differing cpumasks.  Make a
-	 * copy of @new_attrs which will be modified and used to obtain
-	 * pools.
-	 */
-	copy_workqueue_attrs(tmp_attrs, new_attrs);
-
-	/*
 	 * If something goes wrong during CPU up/down, we'll fall back to
 	 * the default pwq covering whole @attrs->cpumask.  Always create
 	 * it even if we don't use it immediately.
 	 */
+	copy_workqueue_attrs(new_attrs, attrs);
+	wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);
 	ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
 	if (!ctx->dfl_pwq)
 		goto out_free;
 
+	/*
+	 * We may create multiple pwqs with differing cpumasks. Make a copy of
+	 * @new_attrs which will be modified and used to obtain pools.
+	 */
+	copy_workqueue_attrs(tmp_attrs, new_attrs);
+
 	for_each_possible_cpu(cpu) {
 		if (new_attrs->ordered) {
 			ctx->dfl_pwq->refcnt++;
@@ -4398,9 +4402,10 @@ static void wq_update_pod(struct workque
 	cpumask = target_attrs->cpumask;
 
 	copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+	wqattrs_actualize_cpumask(target_attrs, wq_unbound_cpumask);
 
 	/* nothing to do if the target cpumask matches the current pwq */
-	wq_calc_pod_cpumask(wq->dfl_pwq->pool->attrs, pod, cpu_off, cpumask);
+	wq_calc_pod_cpumask(target_attrs, pod, cpu_off, cpumask);
 	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
 					lockdep_is_held(&wq_pool_mutex));
 	if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-05  4:46     ` Gautham R. Shenoy
@ 2023-06-07 14:42       ` Libo Chen
  0 siblings, 0 replies; 73+ messages in thread
From: Libo Chen @ 2023-06-07 14:42 UTC (permalink / raw)
  To: Gautham R. Shenoy, Vincent Guittot, Peter Zijlstra
  Cc: Tejun Heo, jiangshanlai, torvalds, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void, srikar



On 6/4/23 9:46 PM, Gautham R. Shenoy wrote:
> Hello Vincent,
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>>> So wakeup based placement is mostly all about LLC, and given this thing
>>> has dinky small LLCs it will pile up on the one LLC you target and leave
>>> the others idle until the regular idle balancer decides to make an
>>> appearance and move some around.
>>>
>>> But if these are fairly short running tasks, I can well see that not
>>> going to help much.
>>>
>>>
>>> Much of this was tuned back when there was 1 L3 per Node; something
>>> which is still more or less true for Intel but clearly not for these
>>> things.
>>>
>>>
>>> The below is a bit crude and completely untested, but it might help. The
>>> flip side of that coin is of course that people are going to complain
>>> about how selecting a CPU is more expensive now and how this hurts their
>>> performance :/
>>>
>>> Basically it will try and iterate all L3s in a node; wakeup will still
>>> refuse to cross node boundaries.
>> That reminds me of some discussion about systems with a fast on-die
>> interconnect where we would like to run wider than the LLC at wakeup (i.e.
>> DIE level), something like the CLUSTER level but on the other side of
>> MC
>>
> Adding Libo Chen who was a part of this discussion. IIRC, the problem was
> that there was no MC domain on that system, which would have made the
> SMT domain the sd_llc. But since the core is single threaded,
> the SMT domain would be degenerated, thus leaving no domain which has
> the SD_SHARE_PKG_RESOURCES flag.
>
> If I understand correctly, Peter's patch won't help in such a
> situation.

Yes, you have some ARM platforms that have no L3 cache (they have SLCs
though, which are memory-side caches) and no hyperthreading, so the lowest
domain level is DIE and every single wakee task goes back to its previous
CPU no matter what because an LLC doesn't exist. For such platforms, we
would want to expand the search range, similar to AMD, to take advantage
of idle cores on the same DIE.

I think we need a new abstraction here to accommodate all these different
cache topologies: replace SD_LLC with, for example, SD_WAKEUP and allow a
wider search range independent of the LLC.


Libo


> However, it should help POWER10 which has the SMT domain as the LLC
> and previously it was observed that moving the wakeup search to the
> parent domain was helpful (Ref:
> https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/)
>
>
> --
> Thanks and Regards
> gautham.
>
>
>> Another possibility to investigate would be that each wakeup of a
>> worker is mostly unrelated to the previous one and it only cares about
>> the waker, so we should use -1 for the prev_cpu
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 48b6f0ca13ac..ddb7f16a07a9 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>          return idle_cpu;
>>>   }
>>>
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> +       struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
>>> +       struct sched_group *sg;
>>> +
>>> +       if (!sd_node || sd_node == sd)
>>> +               return -1;
>>> +
>>> +       sg = sd_node->groups;
>>> +       do {
>>> +               int cpu = cpumask_first(sched_group_span(sg));
>>> +               struct sched_domain *sd_child;
>>> +
>>> +               sd_child = per_cpu(sd_llc, cpu);
>>> +               if (sd_child != sd) {
>>> +                       int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>>> +                       if ((unsigned int)i < nr_cpumask_bits)
>>> +                               return i;
>>> +               }
>>> +
>>> +               sg = sg->next;
>>> +       } while (sg != sd_node->groups);
>>> +
>>> +       return -1;
>>> +}
>>> +
>>>   /*
>>>    * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>>>    * the task fits. If no CPU is big enough, but there are idle ones, try to
>>> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>          if ((unsigned)i < nr_cpumask_bits)
>>>                  return i;
>>>
>>> +       if (sched_feat(SIS_NODE)) {
>>> +               i = select_idle_node(p, sd, target);
>>> +               if ((unsigned)i < nr_cpumask_bits)
>>> +                       return i;
>>> +       }
>>> +
>>>          return target;
>>>   }
>>>
>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..f965cd4a981e 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>>    */
>>>   SCHED_FEAT(SIS_PROP, false)
>>>   SCHED_FEAT(SIS_UTIL, true)
>>> +SCHED_FEAT(SIS_NODE, true)
>>>
>>>   /*
>>>    * Issue a WARN when we do multiple update_rq_clock() calls
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 678446251c35..d2e0e2e496a6 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>   DECLARE_PER_CPU(int, sd_llc_size);
>>>   DECLARE_PER_CPU(int, sd_llc_id);
>>>   DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index ca4472281c28..d94cbc2164ca 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>   DEFINE_PER_CPU(int, sd_llc_size);
>>>   DEFINE_PER_CPU(int, sd_llc_id);
>>>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
>>>          per_cpu(sd_llc_id, cpu) = id;
>>>          rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>>
>>> +       while (sd && sd->parent) {
>>> +               /*
>>> +                * SD_NUMA is the first domain spanning nodes, therefore @sd
>>> +                * must be the domain that spans a single node.
>>> +                */
>>> +               if (sd->parent->flags & SD_NUMA)
>>> +                       break;
>>> +
>>> +               sd = sd->parent;
>>> +       }
>>> +       rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
>>> +
>>>          sd = lowest_flag_domain(cpu, SD_NUMA);
>>>          rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>>>


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-05-31 12:14     ` K Prateek Nayak
@ 2023-06-07 22:13       ` Tejun Heo
  2023-06-08  3:01         ` K Prateek Nayak
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-06-07 22:13 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello,

On Wed, May 31, 2023 at 05:44:57PM +0530, K Prateek Nayak wrote:
...
> The RIP points to dereferencing sd_llc_shared->has_idle_cores
> 
>     $ scripts/faddr2line vmlinux select_task_rq_fair+0x9bd
>     select_task_rq_fair+0x9bd/0x2570:
>     test_idle_cores at kernel/sched/fair.c:6830
>     (inlined by) select_idle_sibling at kernel/sched/fair.c:7189
>     (inlined by) select_task_rq_fair at kernel/sched/fair.c:7710

Hmm... the only thing I can think of is workqueue setting ->wake_cpu to
something invalid.

> My kernel is somewhat stable (I have not seen a panic for ~45min but I
> was not stress testing the system either during that time) with the
> following changes:
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index b2e914655f05..a279cc9c2248 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2247,7 +2247,7 @@ static void unbind_worker(struct worker *worker)
>         if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
>                 WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
>         else
> -               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
> +               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
>  }

I'm not sure why changing the cpus_allowed_ptr would make a difference here.
Maybe the chain of events involves CPUs going offline and the above migrates
the tasks, resetting their ->wake_cpu.
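
To illustrate the hypothesis, here is a minimal userspace sketch of the
kind of sanity check that would catch an invalid CPU hint before it is
used to index per-CPU data. The toy_* helpers, the 64-CPU limit and the
fallback to the previous CPU are assumptions for illustration only; this
is not what the debug branch below does.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 64	/* assumption: masks fit in one 64-bit word */

/* True if @cpu is a real CPU number and is set in @allowed */
static bool toy_cpu_valid(int cpu, uint64_t allowed)
{
	if (cpu < 0 || cpu >= TOY_NR_CPUS)
		return false;
	return (allowed & (UINT64_C(1) << cpu)) != 0;
}

/*
 * Use @hint as the wake-up CPU only if it passes the check; otherwise
 * ignore the hint and keep @prev_cpu.
 */
static int toy_pick_wake_cpu(int hint, int prev_cpu, uint64_t allowed)
{
	return toy_cpu_valid(hint, allowed) ? hint : prev_cpu;
}
```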

Can you please try the following branch and see if any of the warnings
triggers?

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-dbg-invalid-cpu

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-07 22:13       ` Tejun Heo
@ 2023-06-08  3:01         ` K Prateek Nayak
  2023-06-08 22:50           ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-06-08  3:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello Tejun,

On 6/8/2023 3:43 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, May 31, 2023 at 05:44:57PM +0530, K Prateek Nayak wrote:
> ...
>> The RIP points to dereferencing sd_llc_shared->has_idle_cores
>>
>>     $ scripts/faddr2line vmlinux select_task_rq_fair+0x9bd
>>     select_task_rq_fair+0x9bd/0x2570:
>>     test_idle_cores at kernel/sched/fair.c:6830
>>     (inlined by) select_idle_sibling at kernel/sched/fair.c:7189
>>     (inlined by) select_task_rq_fair at kernel/sched/fair.c:7710
> 
> Hmm... the only thing I can think of is workqueue setting ->wake_cpu to
> something invalid.
> 
>> My kernel is somewhat stable (I have not seen a panic for ~45min but I
>> was not stress testing the system either during that time) with the
>> following changes:
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index b2e914655f05..a279cc9c2248 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -2247,7 +2247,7 @@ static void unbind_worker(struct worker *worker)
>>         if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
>>                 WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
>>         else
>> -               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
>> +               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
>>  }
> 
> I'm not sure why changing the cpus_allowed_ptr would make a difference here.
> Maybe the chain of events involves CPUs going offline and the above migrates
> the tasks, resetting their ->wake_cpu.
> 
> Can you please try the following branch and see if any of the warnings
> triggers?
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-dbg-invalid-cpu

Thank you for sharing the debug branch. I've managed to hit one of
the WARN_ON_ONCE()s consistently but I haven't seen a kernel panic
yet. Sharing the traces below:

o Early Boot

    [    4.182411] ------------[ cut here ]------------
    [    4.186313] WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
    [    4.186313] Modules linked in:
    [    4.186313] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-rc1-tj-wq-valid-cpu+ #481
    [    4.186313] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [    4.186313] RIP: 0010:kick_pool+0xdb/0xe0
    [    4.186313] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
    [    4.186313] RSP: 0018:ffffbc1b800e7dd8 EFLAGS: 00010046
    [    4.186313] RAX: 0000000000000100 RBX: ffff97c73d2321c0 RCX: 0000000000000000
    [    4.186313] RDX: 0000000000000040 RSI: 0000000000000001 RDI: ffff9788c0159728
    [    4.186313] RBP: ffffbc1b800e7df0 R08: 0000000000000100 R09: ffff9788c01593e0
    [    4.186313] R10: ffff9788c01593c0 R11: 0000000000000001 R12: ffffffff8c582430
    [    4.186313] R13: ffff9788c03fcd40 R14: 0000000000000000 R15: ffff97c73d2324b0
    [    4.186313] FS:  0000000000000000(0000) GS:ffff97c73d200000(0000) knlGS:0000000000000000
    [    4.186313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [    4.186313] CR2: ffff97cecee01000 CR3: 000000470d43a001 CR4: 0000000000770ef0
    [    4.186313] PKRU: 55555554
    [    4.186313] Call Trace:
    [    4.186313]  <TASK>
    [    4.186313]  create_worker+0x14e/0x280
    [    4.186313]  ? wake_up_process+0x15/0x20
    [    4.186313]  workqueue_init+0x22a/0x3d0
    [    4.186313]  kernel_init_freeable+0x1fe/0x4f0
    [    4.186313]  ? __pfx_kernel_init+0x10/0x10
    [    4.186313]  kernel_init+0x1b/0x1f0
    [    4.186313]  ? __pfx_kernel_init+0x10/0x10
    [    4.186313]  ret_from_fork+0x2c/0x50
    [    4.186313]  </TASK>
    [    4.186313] ---[ end trace 0000000000000000 ]---

o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
  run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen a few
  different stack traces so far. Including all of them below just in case:

  o First

    [  780.818319] ------------[ cut here ]------------
    [  780.822952] WARNING: CPU: 190 PID: 10639 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
    [  780.830959] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
    nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common
    amd64_edac kvm_amd dell_smbios dcdbas kvm acpi_ipmi rapl dell_wmi_descriptor wmi_bmof ccp ptdma k10temp ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc 
    scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
    async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt 
    crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm tg3 cryptd xhci_pci
    [  780.831013]  xhci_pci_renesas megaraid_sas wmi
    [  780.925011] CPU: 190 PID: 10639 Comm: stress-ng-iomix Tainted: G        W          6.4.0-rc1-tj-wq-valid-cpu+ #481
    [  780.935351] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [  780.942917] RIP: 0010:kick_pool+0xdb/0xe0
    [  780.946938] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
    [  780.965684] RSP: 0018:ffffbc1b943f7920 EFLAGS: 00010046
    [  780.970910] RAX: 0000000000000100 RBX: ffff97c73f1b24f0 RCX: 0000000000000000
    [  780.978040] RDX: 0000000000000040 RSI: 0000000000000001 RDI: ffff9788c0165648
    [  780.985167] RBP: ffffbc1b943f7938 R08: 0000000000000100 R09: ffffffff8c489768
    [  780.992298] R10: c8b223d88897ffff R11: 0000000000000000 R12: ffff9788c6fc3c48
    [  780.999422] R13: ffff978915854d40 R14: ffff9788cb82a000 R15: ffff9788c6fc3c40
    [  781.006547] FS:  00007fb7f5f39f00(0000) GS:ffff97c73f180000(0000) knlGS:0000000000000000
    [  781.014631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  781.020370] CR2: 00007fb7f5da0890 CR3: 0000000114810003 CR4: 0000000000770ee0
    [  781.027504] PKRU: 55555554
    [  781.030215] Call Trace:
    [  781.032663]  <TASK>
    [  781.034771]  __queue_work.part.0+0x1f2/0x4c0
    [  781.039040]  __queue_work+0x37/0x90
    [  781.042531]  __queue_delayed_work+0x67/0xa0
    [  781.046717]  mod_delayed_work_on+0x5e/0xa0
    [  781.050817]  kblockd_mod_delayed_work_on+0x1b/0x30
    [  781.055610]  blk_mq_delay_run_hw_queue+0xe1/0x150
    [  781.060316]  blk_mq_run_hw_queue+0x132/0x1a0
    [  781.064590]  blk_mq_submit_bio+0x3dd/0x600
    [  781.068689]  __submit_bio+0xa6/0x1a0
    [  781.072267]  submit_bio_noacct_nocheck+0x2c1/0x380
    [  781.077061]  ? page_mapping+0x18/0x50
    [  781.080726]  submit_bio_noacct+0x1b7/0x570
    [  781.084825]  submit_bio+0x47/0x70
    [  781.088144]  submit_bh_wbc+0x133/0x150
    [  781.091897]  submit_bh+0x10/0x20
    [  781.095129]  ext4_read_bh+0x51/0xb0
    [  781.098622]  ext4_read_inode_bitmap+0x41e/0x5d0
    [  781.103156]  __ext4_new_inode+0x370/0x1740
    [  781.107256]  ? ext4_fname_prepare_lookup+0x8f/0xd0
    [  781.112047]  ? from_kgid+0x12/0x20
    [  781.115454]  ext4_mkdir+0x157/0x330
    [  781.118947]  vfs_mkdir+0x195/0x250
    [  781.122353]  do_mkdirat+0x128/0x160
    [  781.125845]  __x64_sys_mkdir+0x4c/0x70
    [  781.129599]  do_syscall_64+0x5c/0x90
    [  781.133177]  ? exit_to_user_mode_prepare+0x35/0x170
    [  781.138058]  ? irqentry_exit_to_user_mode+0x9/0x20
    [  781.142850]  ? irqentry_exit+0x3b/0x50
    [  781.146602]  ? exc_page_fault+0x8a/0x180
    [  781.150528]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
    [  781.155579] RIP: 0033:0x7fb7f5d1460b
    [  781.159160] Code: 8b 05 29 48 10 00 41 bc ff ff ff ff 64 c7 00 16 00 00 00 e9 4f ff ff ff e8 22 21 02 00 66 90 f3 0f 1e fa b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 47 10 00 f7 d8 64 89 01 48
    [  781.177907] RSP: 002b:00007ffc6ad02d08 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
    [  781.185472] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb7f5d1460b
    [  781.192596] RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00007ffc6ad02d10
    [  781.199728] RBP: 000000000000298f R08: 00007fb7f5dd1460 R09: 00007ffc6ad02a60
    [  781.206853] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc6ad04e90
    [  781.213985] R13: 00007ffc6ad04fe0 R14: 00007ffc6ad02d10 R15: 00007fb7f5bbddc0
    [  781.221111]  </TASK>
    [  781.223302] ---[ end trace 0000000000000000 ]---

  o Second

    [  807.362430] ------------[ cut here ]------------
    [  807.367062] WARNING: CPU: 164 PID: 3274 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
    [  807.374981] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 
    nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common 
    amd64_edac kvm_amd dell_smbios dcdbas kvm acpi_ipmi rapl dell_wmi_descriptor wmi_bmof ccp ptdma k10temp ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc 
    scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
    async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt 
    crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm tg3 cryptd xhci_pci
    [  807.375031]  xhci_pci_renesas megaraid_sas wmi
    [  807.469026] CPU: 164 PID: 3274 Comm: jbd2/sda4-8 Tainted: G        W          6.4.0-rc1-tj-wq-valid-cpu+ #481
    [  807.478931] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [  807.486497] RIP: 0010:kick_pool+0xdb/0xe0
    [  807.490510] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
    [  807.509257] RSP: 0018:ffffbc1b9422f9a8 EFLAGS: 00010046
    [  807.514482] RAX: 0000000000000100 RBX: ffff97c73eb324f0 RCX: 0000000000000000
    [  807.521605] RDX: 0000000000000080 RSI: 0000000000000041 RDI: ffff9788c0165410
    [  807.528731] RBP: ffffbc1b9422f9c0 R08: 0000000000000100 R09: ffffffff8c489768
    [  807.535862] R10: 0000000000000000 R11: ffff97890272fff0 R12: ffff9788fd6f2248
    [  807.542995] R13: ffff9789198f4d40 R14: ffff9788cb05d000 R15: ffff9788fd6f2240
    [  807.550120] FS:  0000000000000000(0000) GS:ffff97c73eb00000(0000) knlGS:0000000000000000
    [  807.558205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  807.563943] CR2: 00007fcda13a0890 CR3: 00000001172c6005 CR4: 0000000000770ee0
    [  807.571067] PKRU: 55555554
    [  807.573771] Call Trace:
    [  807.576220]  <TASK>
    [  807.578325]  __queue_work.part.0+0x1f2/0x4c0
    [  807.582604]  __queue_work+0x37/0x90
    [  807.586095]  __queue_delayed_work+0x67/0xa0
    [  807.590273]  mod_delayed_work_on+0x5e/0xa0
    [  807.594371]  kblockd_mod_delayed_work_on+0x1b/0x30
    [  807.599164]  blk_mq_kick_requeue_list+0x1c/0x30
    [  807.603698]  blk_flush_complete_seq+0x1c5/0x2c0
    [  807.608231]  ? __blk_mq_alloc_requests+0x2e8/0x330
    [  807.613024]  blk_insert_flush+0xfc/0x180
    [  807.616950]  blk_mq_submit_bio+0x539/0x600
    [  807.621048]  __submit_bio+0xa6/0x1a0
    [  807.624628]  submit_bio_noacct_nocheck+0x2c1/0x380
    [  807.629422]  ? page_mapping+0x18/0x50
    [  807.633095]  submit_bio_noacct+0x1b7/0x570
    [  807.637186]  submit_bio+0x60/0x70
    [  807.640498]  submit_bh_wbc+0x133/0x150
    [  807.644250]  submit_bh+0x10/0x20
    [  807.647484]  journal_submit_commit_record.part.0.constprop.0+0x11d/0x1d0
    [  807.654181]  jbd2_journal_commit_transaction+0x1448/0x1980
    [  807.659659]  ? lock_timer_base+0x3b/0xd0
    [  807.663586]  kjournald2+0xab/0x270
    [  807.666992]  ? __pfx_autoremove_wake_function+0x10/0x10
    [  807.672219]  ? __pfx_kjournald2+0x10/0x10
    [  807.676229]  kthread+0xf7/0x130
    [  807.679376]  ? __pfx_kthread+0x10/0x10
    [  807.683128]  ret_from_fork+0x2c/0x50
    [  807.686707]  </TASK>
    [  807.688891] ---[ end trace 0000000000000000 ]---

  o Third
    
    [ 1244.765696] ------------[ cut here ]------------
    [ 1244.770323] WARNING: CPU: 60 PID: 19932 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
    [ 1244.778251] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 
    nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common 
    amd64_edac kvm_amd dell_smbios dcdbas kvm acpi_ipmi rapl dell_wmi_descriptor wmi_bmof ccp ptdma k10temp ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc 
    scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
    async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt 
    crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm tg3 cryptd xhci_pci
    [ 1244.778299]  xhci_pci_renesas megaraid_sas wmi
    [ 1244.872295] CPU: 60 PID: 19932 Comm: stress-ng Tainted: G        W          6.4.0-rc1-tj-wq-valid-cpu+ #481
    [ 1244.882061] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [ 1244.889628] RIP: 0010:kick_pool+0xdb/0xe0
    [ 1244.893640] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
    [ 1244.912385] RSP: 0018:ffffbc1b8de10e08 EFLAGS: 00010046
    [ 1244.917604] RAX: 0000000000000100 RBX: ffff97c73e1321c0 RCX: 0000000000000000
    [ 1244.924737] RDX: 0000000000000040 RSI: 0000000000000039 RDI: ffff9788c015de68
    [ 1244.931869] RBP: ffffbc1b8de10e20 R08: 0000000000000100 R09: ffffffff8c489768
    [ 1244.939000] R10: 0000000000000002 R11: 000000000000013a R12: ffff97c73e120d20
    [ 1244.946125] R13: ffff9788fae73380 R14: ffff9788c0171400 R15: ffff97c73e120d18
    [ 1244.953252] FS:  00007f8747b89f00(0000) GS:ffff97c73e100000(0000) knlGS:0000000000000000
    [ 1244.961335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1244.967074] CR2: 00007fff1207f170 CR3: 00000001182d8003 CR4: 0000000000770ee0
    [ 1244.974197] PKRU: 55555554
    [ 1244.976902] Call Trace:
    [ 1244.979347]  <IRQ>
    [ 1244.981369]  __queue_work.part.0+0x1f2/0x4c0
    [ 1244.985637]  delayed_work_timer_fn+0x3c/0x90
    [ 1244.989903]  ? __pfx_delayed_work_timer_fn+0x10/0x10
    [ 1244.994867]  call_timer_fn+0x2c/0x150
    [ 1244.998534]  ? __pfx_delayed_work_timer_fn+0x10/0x10
    [ 1245.003491]  __run_timers.part.0+0x16f/0x2a0
    [ 1245.007763]  ? ktime_get+0x46/0xc0
    [ 1245.011162]  ? native_apic_msr_write+0x30/0x40
    [ 1245.015606]  ? lapic_next_event+0x20/0x30
    [ 1245.019619]  ? clockevents_program_event+0xad/0x130
    [ 1245.024499]  run_timer_softirq+0x2a/0x60
    [ 1245.028418]  __do_softirq+0xdd/0x31a
    [ 1245.031996]  ? hrtimer_interrupt+0x12b/0x240
    [ 1245.036269]  __irq_exit_rcu+0x83/0xb0
    [ 1245.039934]  irq_exit_rcu+0xe/0x20
    [ 1245.043333]  sysvec_apic_timer_interrupt+0x80/0x90
    [ 1245.048126]  </IRQ>
    [ 1245.050231]  <TASK>
    [ 1245.052327]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
    [ 1245.057468] RIP: 0010:pcpu_alloc+0x3fe/0x830
    [ 1245.061741] Code: 98 32 4f 01 89 da 31 f6 83 c3 01 4c 01 e7 48 03 3c d0 4c 89 ea e8 f2 d3 c7 00 8b 35 6c 75 aa 01 48 63 d3 48 c7 c7 e0 5e 80 8c <e8> 6d da 3c 00 39 05 57 75 aa 01 48 89 c3 77 bf 48 8b 45 88 49 03
    [ 1245.080485] RSP: 0018:ffffbc1b9e607c00 EFLAGS: 00000246
    [ 1245.085705] RAX: ffffdc1b7548010c RBX: 00000000000000db RCX: 0000000000000000
    [ 1245.092838] RDX: 00000000000000db RSI: 0000000000000100 RDI: ffffffff8c805ee0
    [ 1245.099970] RBP: ffffbc1b9e607c88 R08: 00000000000000da R09: 0000000000000004
    [ 1245.107103] R10: ffffdc1b7548010c R11: 0000000000000044 R12: 000000000000010c
    [ 1245.114227] R13: 0000000000000004 R14: ffff97890e267800 R15: 0000000000000000
    [ 1245.121362]  __alloc_percpu_gfp+0x12/0x20
    [ 1245.125371]  __percpu_counter_init+0x23/0x90
    [ 1245.129643]  mm_init+0x2cb/0x430
    [ 1245.132878]  copy_process+0x10d6/0x1d70
    [ 1245.136716]  kernel_clone+0x9d/0x3c0
    [ 1245.140288]  ? __handle_mm_fault+0x8f2/0xd40
    [ 1245.144560]  __do_sys_clone+0x66/0x90
    [ 1245.148226]  __x64_sys_clone+0x25/0x30
    [ 1245.151971]  do_syscall_64+0x5c/0x90
    [ 1245.155548]  ? exit_to_user_mode_prepare+0x35/0x170
    [ 1245.160430]  ? irqentry_exit_to_user_mode+0x9/0x20
    [ 1245.165221]  ? irqentry_exit+0x3b/0x50
    [ 1245.168966]  ? exc_page_fault+0x8a/0x180
    [ 1245.172882]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
    [ 1245.177927] RIP: 0033:0x7f87478eabc7
    [ 1245.181506] Code: bb 04 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 41 89 c0 85 c0 75 2c 64 48 8b 04 25 10 00
    [ 1245.200243] RSP: 002b:00007fff1207ec08 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
    [ 1245.207801] RAX: ffffffffffffffda RBX: 00007f8747d0f040 RCX: 00007f87478eabc7
    [ 1245.214924] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
    [ 1245.222050] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fff1207f170
    [ 1245.229184] R10: 00007f8747b8a1d0 R11: 0000000000000246 R12: 0000000000000001
    [ 1245.236315] R13: 000000000000001e R14: 00007fff1207f170 R15: 00007f87477d2310
    [ 1245.243440]  </TASK>
    [ 1245.245624] ---[ end trace 0000000000000000 ]---

  o Fourth

    [ 1471.729126] ------------[ cut here ]------------
    [ 1471.733841] WARNING: CPU: 9 PID: 21785 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
    [ 1471.741683] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 
    nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio overlay binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common 
    amd64_edac kvm_amd dell_smbios dcdbas kvm acpi_ipmi rapl dell_wmi_descriptor wmi_bmof ccp ptdma k10temp ipmi_si acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc 
    scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
    async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt 
    crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm tg3 cryptd xhci_pci
    [ 1471.741770]  xhci_pci_renesas
    [ 1471.763157] stress-ng-iomix (21877): drop_caches: 1
    [ 1471.771749] stress-ng-iomix (21863): drop_caches: 1
    [ 1471.771999] stress-ng-iomix (21870): drop_caches: 1
    [ 1471.831323]  megaraid_sas wmi
    [ 1471.831329] CPU: 9 PID: 21785 Comm: stress-ng-iomix Tainted: G        W          6.4.0-rc1-tj-wq-valid-cpu+ #481
    [ 1471.831336] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [ 1471.831338] RIP: 0010:kick_pool+0xdb/0xe0
    [ 1471.873732] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
    [ 1471.892478] RSP: 0000:ffffbc1b809fce08 EFLAGS: 00010046
    [ 1471.897701] RAX: 0000000000000100 RBX: ffff97c73d4721c0 RCX: 0000000000000000
    [ 1471.904836] RDX: 0000000000000040 RSI: 000000000000000b RDI: ffff9788c015b2e8
    [ 1471.911967] RBP: ffffbc1b809fce20 R08: 0000000000000100 R09: ffffffff8c489768
    [ 1471.919101] R10: 0000000000000001 R11: 0000000000000201 R12: ffff97c73d46e8c8
    [ 1471.926233] R13: ffff9788e3b799c0 R14: ffff9788c02f7400 R15: ffff97c73d46e8c0
    [ 1471.933365] FS:  00007f7bcbc2bf00(0000) GS:ffff97c73d440000(0000) knlGS:0000000000000000
    [ 1471.941453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1471.947196] CR2: 00007f7bcbd74048 CR3: 00000001240a6006 CR4: 0000000000770ee0
    [ 1471.954329] PKRU: 55555554
    [ 1471.957042] Call Trace:
    [ 1471.959490]  <IRQ>
    [ 1471.961511]  __queue_work.part.0+0x1f2/0x4c0
    [ 1471.965790]  delayed_work_timer_fn+0x3c/0x90
    [ 1471.970060]  ? __pfx_delayed_work_timer_fn+0x10/0x10
    [ 1471.975027]  call_timer_fn+0x2c/0x150
    [ 1471.978692]  ? __pfx_delayed_work_timer_fn+0x10/0x10
    [ 1471.983657]  __run_timers.part.0+0x16f/0x2a0
    [ 1471.987932]  ? ktime_get+0x46/0xc0
    [ 1471.991338]  ? native_apic_msr_write+0x30/0x40
    [ 1471.995785]  ? lapic_next_event+0x20/0x30
    [ 1471.999797]  ? clockevents_program_event+0xad/0x130
    [ 1472.004677]  run_timer_softirq+0x4b/0x60
    [ 1472.008603]  __do_softirq+0xdd/0x31a
    [ 1472.012183]  ? hrtimer_interrupt+0x12b/0x240
    [ 1472.016456]  __irq_exit_rcu+0x83/0xb0
    [ 1472.020121]  irq_exit_rcu+0xe/0x20
    [ 1472.023528]  sysvec_apic_timer_interrupt+0x80/0x90
    [ 1472.028319]  </IRQ>
    [ 1472.030427]  <TASK>
    [ 1472.032532]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
    [ 1472.037669] RIP: 0010:shmem_get_folio_gfp+0x109/0x750
    [ 1472.042722] Code: 49 0f ba 2f 00 0f 82 e2 03 00 00 48 8b 7d c8 48 8b 45 b8 48 39 47 18 0f 85 ea 03 00 00 83 7d c4 03 0f 84 ba 04 00 00 48 8b 07 <a8> 04 0f 84 5b 02 00 00 48 8b 7d c8 48 8b 45 a0 45 31 ff 48 89 38
    [ 1472.061470] RSP: 0000:ffffbc1bb58a7c30 EFLAGS: 00000297
    [ 1472.066697] RAX: 0017ffffc008001d RBX: ffff978928060fb0 RCX: 0000000000000002
    [ 1472.073826] RDX: 0000000080000000 RSI: 0000000000000000 RDI: fffff60845288200
    [ 1472.080951] RBP: ffffbc1bb58a7cb8 R08: 0000000000100cca R09: ffff9788e8d485c0
    [ 1472.088085] R10: 0000000000000000 R11: ffff9788e8d485c0 R12: ffff9788e8d485c0
    [ 1472.095219] R13: ffff9788e9d02a00 R14: 0000000000000000 R15: fffff60845288200
    [ 1472.102356]  shmem_fault+0x78/0x290
    [ 1472.105845]  __do_fault+0x39/0x140
    [ 1472.109250]  do_fault+0x2cf/0x430
    [ 1472.112569]  __handle_mm_fault+0x73b/0xd40
    [ 1472.116670]  handle_mm_fault+0x9e/0x2e0
    [ 1472.120506]  do_user_addr_fault+0x243/0x770
    [ 1472.124695]  exc_page_fault+0x79/0x180
    [ 1472.128448]  asm_exc_page_fault+0x27/0x30
    [ 1472.132458] RIP: 0033:0x5647c727779d
    [ 1472.136039] Code: ac 24 a8 00 00 00 48 8d 51 60 48 8b 4c 24 10 48 89 84 24 c8 00 00 00 48 8b 03 48 89 8c 24 c0 00 00 00 48 89 94 24 b8 00 00 00 <0f> 11 00 48 c7 40 10 00 00 00 00 48 8b 05 91 18 2e 00 48 8b 40 10
    [ 1472.154783] RSP: 002b:00007ffe655a2160 EFLAGS: 00010202
    [ 1472.160009] RAX: 00007f7bcbd74048 RBX: 00007ffe655a25a0 RCX: 00007f7bcb975cb8
    [ 1472.167142] RDX: 00007f7bcb759060 RSI: 0000000000000000 RDI: 00007ffe655a21e0
    [ 1472.174266] RBP: 00007ffe655a2330 R08: 00007f7bcbbd1460 R09: 00007ffe655a1f80
    [ 1472.181390] R10: 00007ffe655a2120 R11: 0000000000000246 R12: 00007f7bcb975390
    [ 1472.188523] R13: 0000000000005519 R14: 00007ffe655a25a0 R15: 00007f7bcb975e48
    [ 1472.195650]  </TASK>
    [ 1472.197842] ---[ end trace 0000000000000000 ]---


This is the same WARN_ON_ONCE() you had added in the HEAD commit:

    $ scripts/faddr2line vmlinux kick_pool+0xdb
    kick_pool+0xdb/0xe0:
    kick_pool at kernel/workqueue.c:1130 (discriminator 1)

    $ sed -n 1130,1132p kernel/workqueue.c
    if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
        p->wake_cpu = wake_cpu;
    get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;

Let me know if you need any more data from my test setup.
P.S. The kernel is still up and running (~30min) despite hitting this
WARN_ON_ONCE() in my case :)

> 
> Thanks.
> 
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-08  3:01         ` K Prateek Nayak
@ 2023-06-08 22:50           ` Tejun Heo
  2023-06-09  3:43             ` K Prateek Nayak
                               ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Tejun Heo @ 2023-06-08 22:50 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello,

On Thu, Jun 08, 2023 at 08:31:34AM +0530, K Prateek Nayak wrote:
...
> Thank you for sharing the debug branch. I've managed to hit one of the
> WARN_ON_ONCE()s consistently but I still haven't seen a kernel panic
> yet. Sharing the traces below:

Yeah, that's good. It does a dirty fix-up. Shouldn't crash.

> o Early Boot
> 
>     [    4.182411] ------------[ cut here ]------------
>     [    4.186313] WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
>     [    4.186313] Modules linked in:
>     [    4.186313] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-rc1-tj-wq-valid-cpu+ #481
>     [    4.186313] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
>     [    4.186313] RIP: 0010:kick_pool+0xdb/0xe0
>     [    4.186313] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
>     [    4.186313] RSP: 0018:ffffbc1b800e7dd8 EFLAGS: 00010046
>     [    4.186313] RAX: 0000000000000100 RBX: ffff97c73d2321c0 RCX: 0000000000000000
>     [    4.186313] RDX: 0000000000000040 RSI: 0000000000000001 RDI: ffff9788c0159728
>     [    4.186313] RBP: ffffbc1b800e7df0 R08: 0000000000000100 R09: ffff9788c01593e0
>     [    4.186313] R10: ffff9788c01593c0 R11: 0000000000000001 R12: ffffffff8c582430
>     [    4.186313] R13: ffff9788c03fcd40 R14: 0000000000000000 R15: ffff97c73d2324b0
>     [    4.186313] FS:  0000000000000000(0000) GS:ffff97c73d200000(0000) knlGS:0000000000000000
>     [    4.186313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [    4.186313] CR2: ffff97cecee01000 CR3: 000000470d43a001 CR4: 0000000000770ef0
>     [    4.186313] PKRU: 55555554
>     [    4.186313] Call Trace:
>     [    4.186313]  <TASK>
>     [    4.186313]  create_worker+0x14e/0x280
>     [    4.186313]  ? wake_up_process+0x15/0x20
>     [    4.186313]  workqueue_init+0x22a/0x3d0
>     [    4.186313]  kernel_init_freeable+0x1fe/0x4f0
>     [    4.186313]  ? __pfx_kernel_init+0x10/0x10
>     [    4.186313]  kernel_init+0x1b/0x1f0
>     [    4.186313]  ? __pfx_kernel_init+0x10/0x10
>     [    4.186313]  ret_from_fork+0x2c/0x50
>     [    4.186313]  </TASK>
>     [    4.186313] ---[ end trace 0000000000000000 ]---
> 
> o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
>   run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen a few
>   different stack traces so far. Including all below just in case:
...
> This is the same WARN_ON_ONCE() you had added in the HEAD commit:
> 
>     $ scripts/faddr2line vmlinux kick_pool+0xdb
>     kick_pool+0xdb/0xe0:
>     kick_pool at kernel/workqueue.c:1130 (discriminator 1)
> 
>     $ sed -n 1130,1132p kernel/workqueue.c
>     if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
>         p->wake_cpu = wake_cpu;
>     get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
> 
> Let me know if you need any more data from my test setup.
> P.S. The kernel is still up and running (~30min) despite hitting this
> WARN_ON_ONCE() in my case :)

Okay, that was me being stupid and not initializing the new fields for
per-cpu workqueues. Can you please test the following branch? It should have
both bugs fixed properly.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2

If that doesn't crash, I'd love to hear how it affects the perf regressions
reported over the past few months.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-08 22:50           ` Tejun Heo
@ 2023-06-09  3:43             ` K Prateek Nayak
  2023-06-14 18:49               ` Sandeep Dhavale
  2023-06-19  4:30             ` Swapnil Sapkal
  2023-07-05  7:04             ` K Prateek Nayak
  2 siblings, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-06-09  3:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team, Swapnil Sapkal

Hello Tejun,

On 6/9/2023 4:20 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jun 08, 2023 at 08:31:34AM +0530, K Prateek Nayak wrote:
>> [..snip..]
>> o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
>>   run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen a few
>>   different stack traces so far. Including all below just in case:
> ...
>> This is the same WARN_ON_ONCE() you had added in the HEAD commit:
>>
>>     $ scripts/faddr2line vmlinux kick_pool+0xdb
>>     kick_pool+0xdb/0xe0:
>>     kick_pool at kernel/workqueue.c:1130 (discriminator 1)
>>
>>     $ sed -n 1130,1132p kernel/workqueue.c
>>     if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
>>         p->wake_cpu = wake_cpu;
>>     get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
>>
>> Let me know if you need any more data from my test setup.
>> P.S. The kernel is still up and running (~30min) despite hitting this
>> WARN_ON_ONCE() in my case :)
> 
> Okay, that was me being stupid and not initializing the new fields for
> per-cpu workqueues. Can you please test the following branch? It should have
> both bugs fixed properly.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2

I've not run into any panics or warnings with this one. Kernel has been
stable for ~30min while running stress-ng iomix. We'll resume the testing
with v2 :)

> 
> If that doesn't crash, I'd love to hear how it affects the perf regressions
> reported over the past few months.
> 
> Thanks.
> 

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (25 preceding siblings ...)
  2023-05-23 11:18 ` Peter Zijlstra
@ 2023-06-12 23:56 ` Brian Norris
  2023-06-13  2:48   ` Tejun Heo
  2023-08-08  1:22 ` Tejun Heo
  27 siblings, 1 reply; 73+ messages in thread
From: Brian Norris @ 2023-06-12 23:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, nhuck, agk, snitzer, void, treapking

Hi,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> In terms of patches, 0021-0024 are probably the interesting ones.
> 
> Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
> -------------------------------------------------------------------------
> 
> Can you please test this patchset and see whether the performance problems
> are resolved? After the patchset, unbound workqueues default to
> soft-affining on cache boundaries, which should hopefully resolve the issues
> that you guys have been seeing on recent kernels on heterogeneous CPUs.
> 
> If you want to try different settings, please read:
> 
>  https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Thanks for the CC; my colleague tried out your patches (ported to 5.15
with some minor difficulty), and aside from some crashes (already noted
by others, although we didn't pull the proposed v2 fixes), he didn't
notice a significant change in performance on our particular test system
and WiFi-throughput workload. I don't think we expected a lot though,
per the discussion at:

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Brian

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-12 23:56 ` Brian Norris
@ 2023-06-13  2:48   ` Tejun Heo
  2023-06-13  9:26     ` Pin-yen Lin
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-06-13  2:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, nhuck, agk, snitzer, void, treapking

Hello,

On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> Thanks for the CC; my colleague tried out your patches (ported to 5.15
> with some minor difficulty), and aside from some crashes (already noted
> by others, although we didn't pull the proposed v2 fixes), he didn't

Yeah, there were a few subtle bugs that v2 fixes.

> notice a significant change in performance on our particular test system
> and WiFi-throughput workload. I don't think we expected a lot though,
> per the discussion at:
> 
> https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

That's disappointing. I was actually expecting that the default behavior
would restrain migrations across L3 boundaries strongly enough to make a
meaningful difference. Can you enable WQ_SYSFS and test the following
configs?

 1. affinity_scope = cache, affinity_strict = 1

 2. affinity_scope = cpu, affinity_strict = 0

 3. affinity_scope = cpu, affinity_strict = 1

#3 basically turns it into a percpu workqueue, so it should perform more or
less the same as a percpu workqueue without affecting everyone else.
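
For reference, a minimal sketch of applying one of the configs above from a
shell. It assumes the workqueue was created with WQ_SYSFS and that its
attributes show up as affinity_scope and affinity_strict under
/sys/devices/virtual/workqueue/<name>/. The helper name set_wq_affinity and
its optional base-directory argument are made up for illustration and are
not part of the patchset.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# usage: set_wq_affinity <wq-name> <scope> <strict> [sysfs-base]
set_wq_affinity() {
    wq=$1
    scope=$2
    strict=$3
    # Assumed location of WQ_SYSFS workqueues; override for dry runs.
    base=${4:-/sys/devices/virtual/workqueue}
    dir="$base/$wq"

    [ -d "$dir" ] || { echo "no such workqueue: $dir" >&2; return 1; }

    # Write the scope first, then the strictness flag.
    echo "$scope"  > "$dir/affinity_scope"  || return 1
    echo "$strict" > "$dir/affinity_strict" || return 1
}

# Config #3 from the list above (needs root; "writeback" is only an
# example workqueue name):
# set_wq_affinity writeback cpu 1
```

Passing a scratch directory as the fourth argument lets the helper be tried
without touching the live system.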

Any chance you can post the topology details on the affected setup? How are
the caches and cores laid out?
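
One way to collect that, sketched under the assumption that the platform
exposes the usual per-CPU cache directories in sysfs
(/sys/devices/system/cpu/cpuN/cache/indexM/ with level, type and
shared_cpu_list). The function name dump_cache_topology is hypothetical. On
systems that don't populate these files it simply prints nothing.

```shell
#!/bin/sh
# Print "cpuN L<level> <type>: <cpus sharing this cache>" for every cache.
# usage: dump_cache_topology [cpu-sysfs-base]
dump_cache_topology() {
    base=${1:-/sys/devices/system/cpu}
    for idx in "$base"/cpu[0-9]*/cache/index[0-9]*; do
        # Skip unmatched globs and entries without sharing info.
        [ -r "$idx/shared_cpu_list" ] || continue
        cpu=${idx#"$base"/}      # strip the base prefix -> cpuN/cache/indexM
        cpu=${cpu%%/*}           # keep only cpuN
        printf '%s L%s %s: %s\n' "$cpu" \
            "$(cat "$idx/level")" "$(cat "$idx/type")" \
            "$(cat "$idx/shared_cpu_list")"
    done
}

dump_cache_topology
```

CPUs that list the same shared_cpu_list for their highest-level cache are
the ones the "cache" affinity scope would be expected to group together.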

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-13  2:48   ` Tejun Heo
@ 2023-06-13  9:26     ` Pin-yen Lin
  2023-06-21 19:16       ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: Pin-yen Lin @ 2023-06-13  9:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Brian Norris, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, nhuck, agk, snitzer, void

Hi Tejun,

On Tue, Jun 13, 2023 at 10:48 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> > Thanks for the CC; my colleague tried out your patches (ported to 5.15
> > with some minor difficulty), and aside from some crashes (already noted
> > by others, although we didn't pull the proposed v2 fixes), he didn't
>
> Yeah, there were a few subtle bugs that v2 fixes.
>
> > notice a significant change in performance on our particular test system
> > and WiFi-throughput workload. I don't think we expected a lot though,
> > per the discussion at:
> >
> > https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/
>
> That's disappointing. I was actually expecting that the default behavior
> would restrain migrations across L3 boundaries strongly enough to make a
> meaningful difference. Can you enable WQ_SYSFS and test the following
> configs?
>
>  1. affinity_scope = cache, affinity_strict = 1
>
>  2. affinity_scope = cpu, affinity_strict = 0
>
>  3. affinity_scope = cpu, affinity_strict = 1

I pulled down the v2 series and tried these settings on our 5.15 kernel.
Unfortunately, none of them showed a significant improvement in
throughput. It's hard to tell which one is the best because of the
noise, but the throughput is still far from what we get on our 4.19
kernel or by simply pinning everything to a single core.

All 4 settings (the 3 settings listed above plus the default) yield
results between 90 and 120 Mbps, while pinning tasks to a single core
consistently reaches >250 Mbps.
>
> #3 basically turns it into a percpu workqueue, so it should perform more or
> less the same as a percpu workqueue without affecting everyone else.
>
> Any chance you can post the topology details on the affected setup? How are
> the caches and cores laid out?

The core layout is listed at [1], but I'm not familiar with its cache
configuration.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Best regards,
Pin-yen
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-09  3:43             ` K Prateek Nayak
@ 2023-06-14 18:49               ` Sandeep Dhavale
  2023-06-21 20:14                 ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: Sandeep Dhavale @ 2023-06-14 18:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	kernel-team, Swapnil Sapkal, kprateek.nayak

Hi Tejun,
Thank you for your patches! I tested affinity-scopes-v2 with app launch
benchmarks. The numbers below are the total scheduling latency for erofs
kworkers; the last column is with percpu highpri kthreads, that is:
CONFIG_EROFS_FS_PCPU_KTHREAD=y
CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI=y

Scheduling latency is the time from when a task becomes eligible to run
to when it actually starts running. The test does 50 cold app launches for
each configuration and aggregates the numbers.

| Metric       | Upstream | Cache nostrict | CPU nostrict | PCPU hpri |
|--------------+----------+----------------+--------------+-----------|
| Average (us) | 12286    | 7440           | 4435         | 2717      |
| Median (us)  | 12528    | 3901           | 3258         | 2476      |
| Minimum (us) | 287      | 555            | 638          | 357       |
| Maximum (us) | 35600    | 35911          | 13364        | 6874      |
| Stdev (us)   | 7918     | 7503           | 3323         | 1918      |
|--------------+----------+----------------+--------------+-----------|
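
As a rough illustration of how the summary rows in the table can be
produced from raw per-task scheduling latencies, here is a minimal sketch.
The function name summarize and the sample values are made up for
illustration and are not measurements from this test. Whether the Stdev
row uses the sample or the population form is not stated, so the sample
form used here is an assumption.

```python
import statistics


def summarize(latencies_us):
    """Return the table's summary rows for a list of latencies in us."""
    return {
        "average": statistics.mean(latencies_us),
        "median": statistics.median(latencies_us),
        "minimum": min(latencies_us),
        "maximum": max(latencies_us),
        # Sample standard deviation (n - 1 denominator); an assumption.
        "stdev": statistics.stdev(latencies_us),
    }


# Made-up values for illustration only.
example = [300, 500, 700, 900, 1100]
summary = summarize(example)
```

A large gap between median and average, as in the "Cache nostrict" column,
points to a small number of very long waits rather than a uniform slowdown.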

We see here that with affinity-scopes-v2 (which defaults to cache nostrict),
there is a good improvement when compared to the current codebase.
Affinity scope "CPU nostrict" for the erofs workqueue has even better numbers
for my test launches, and it logically resembles the percpu highpri kthreads
approach. Percpu highpri kthreads have the lowest latency and variation,
probably because they run at a higher priority: those threads are set up
with sched_set_fifo_low().

At a high level, the app launch numbers themselves improved with your series,
as the entire workqueue subsystem improved across the board.

Thanks,
Sandeep.

On Thu, Jun 8, 2023 at 8:43 PM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
>
> Hello Tejun,
>
> On 6/9/2023 4:20 AM, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Jun 08, 2023 at 08:31:34AM +0530, K Prateek Nayak wrote:
> >> [..snip..]
> >> o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
> >>   run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen a few
> >>   different stack traces so far. Including all below just in case:
> > ...
> >> This is the same WARN_ON_ONCE() you had added in the HEAD commit:
> >>
> >>     $ scripts/faddr2line vmlinux kick_pool+0xdb
> >>     kick_pool+0xdb/0xe0:
> >>     kick_pool at kernel/workqueue.c:1130 (discriminator 1)
> >>
> >>     $ sed -n 1130,1132p kernel/workqueue.c
> >>     if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
> >>         p->wake_cpu = wake_cpu;
> >>     get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
> >>
> >> Let me know if you need any more data from my test setup.
> >> P.S. The kernel is still up and running (~30min) despite hitting this
> >> WARN_ON_ONCE() in my case :)
> >
> > Okay, that was me being stupid and not initializing the new fields for
> > per-cpu workqueues. Can you please test the following branch? It should have
> > both bugs fixed properly.
> >
> >  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
>
> I've not run into any panics or warnings with this one. Kernel has been
> stable for ~30min while running stress-ng iomix. We'll resume the testing
> with v2 :)
>
> >
> > If that doesn't crash, I'd love to hear how it affects the perf regressions
> > reported over the past few months.
> >
> > Thanks.
> >
>
> --
> Thanks and Regards,
> Prateek
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-08 22:50           ` Tejun Heo
  2023-06-09  3:43             ` K Prateek Nayak
@ 2023-06-19  4:30             ` Swapnil Sapkal
  2023-06-21 20:38               ` Tejun Heo
  2023-07-05  7:04             ` K Prateek Nayak
  2 siblings, 1 reply; 73+ messages in thread
From: Swapnil Sapkal @ 2023-06-19  4:30 UTC (permalink / raw)
  To: Tejun Heo, K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello Tejun,

On 6/9/2023 4:20 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jun 08, 2023 at 08:31:34AM +0530, K Prateek Nayak wrote:
> ...
>> Thank you for sharing the debug branch. I've managed to hit one of
>> the WARN_ON_ONCE() consistently but I still haven't seen a kernel panic
>> yet. Sharing the traces below:
> 
> Yeah, that's good. It does a dirty fix-up. Shouldn't crash.
> 
>> o Early Boot
>>
>>      [    4.182411] ------------[ cut here ]------------
>>      [    4.186313] WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
>>      [    4.186313] Modules linked in:
>>      [    4.186313] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-rc1-tj-wq-valid-cpu+ #481
>>      [    4.186313] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
>>      [    4.186313] RIP: 0010:kick_pool+0xdb/0xe0
>>      [    4.186313] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
>>      [    4.186313] RSP: 0018:ffffbc1b800e7dd8 EFLAGS: 00010046
>>      [    4.186313] RAX: 0000000000000100 RBX: ffff97c73d2321c0 RCX: 0000000000000000
>>      [    4.186313] RDX: 0000000000000040 RSI: 0000000000000001 RDI: ffff9788c0159728
>>      [    4.186313] RBP: ffffbc1b800e7df0 R08: 0000000000000100 R09: ffff9788c01593e0
>>      [    4.186313] R10: ffff9788c01593c0 R11: 0000000000000001 R12: ffffffff8c582430
>>      [    4.186313] R13: ffff9788c03fcd40 R14: 0000000000000000 R15: ffff97c73d2324b0
>>      [    4.186313] FS:  0000000000000000(0000) GS:ffff97c73d200000(0000) knlGS:0000000000000000
>>      [    4.186313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>      [    4.186313] CR2: ffff97cecee01000 CR3: 000000470d43a001 CR4: 0000000000770ef0
>>      [    4.186313] PKRU: 55555554
>>      [    4.186313] Call Trace:
>>      [    4.186313]  <TASK>
>>      [    4.186313]  create_worker+0x14e/0x280
>>      [    4.186313]  ? wake_up_process+0x15/0x20
>>      [    4.186313]  workqueue_init+0x22a/0x3d0
>>      [    4.186313]  kernel_init_freeable+0x1fe/0x4f0
>>      [    4.186313]  ? __pfx_kernel_init+0x10/0x10
>>      [    4.186313]  kernel_init+0x1b/0x1f0
>>      [    4.186313]  ? __pfx_kernel_init+0x10/0x10
>>      [    4.186313]  ret_from_fork+0x2c/0x50
>>      [    4.186313]  </TASK>
>>      [    4.186313] ---[ end trace 0000000000000000 ]---
>>
>> o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
>>    run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen a few
>>    different stack traces so far. Including all below just in case:
> ...
>> This is the same WARN_ON_ONCE() you had added in the HEAD commit:
>>
>>      $ scripts/faddr2line vmlinux kick_pool+0xdb
>>      kick_pool+0xdb/0xe0:
>>      kick_pool at kernel/workqueue.c:1130 (discriminator 1)
>>
>>      $ sed -n 1130,1132p kernel/workqueue.c
>>      if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
>>          p->wake_cpu = wake_cpu;
>>      get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
>>
>> Let me know if you need any more data from my test setup.
>> P.S. The kernel is still up and running (~30min) despite hitting this
>> WARN_ON_ONCE() in my case :)
> 
> Okay, that was me being stupid and not initializing the new fields for
> per-cpu workqueues. Can you please test the following branch? It should have
> both bugs fixed properly.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
> 
Thanks for the patchset. I tested it with fio tests.
Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
in NPS1, NPS2 and NPS4 modes.

With affinity-scopes-v2, below are the observations:
BW, LAT AVG and CLAT AVG show improvement with some combinations
of the params in NPS1 and NPS2, while all other combinations of params
show no loss or gain in performance. Those combinations showing
improvement are marked with ### and those showing a drop in performance
are marked with ***. CLAT 99 shows mixed results in all the NPS modes.
SLAT 99 suffers tremendously in all NPS modes.

Params:
TEST:
     * W  - Sequential Write
     * R  - Sequential Read
     * RW - Random Write
     * RR - Random Read

Threads: numjobs

BS: Block size
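
The percentages in parentheses in the tables that follow appear to be
relative to the baseline column, signed so that a positive number means an
improvement (higher bandwidth or lower latency). That convention is inferred
from the rows themselves and is not stated explicitly, so treat the sketch
below as an assumption; the function name pct_change is hypothetical. The
table values were presumably computed from unrounded measurements, so
recomputing from the printed numbers can differ in the last digit.

```python
def pct_change(baseline, value, higher_is_better):
    """Signed percent change vs. baseline; positive means improvement."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = value - baseline if higher_is_better else baseline - value
    return 100.0 * delta / baseline


# First NPS1 rows: bandwidth (MB/s) and submission latency (us), CPU scope.
bw = pct_change(444, 441, higher_is_better=True)
slat = pct_change(66.1, 68.9, higher_is_better=False)
```

With this sign convention a large negative SLAT entry such as -356.15 means
the latency grew to several times the baseline.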

==================================================================================================================================
                                                       ~~~~~~~~~~~~~~~~NPS1~~~~~~~~~~~~~~~~
==================================================================================================================================

TEST Threads BS  IODEPTH   Metric   NPS1-6.4.0-rc1           CPU                 SMT             CACHE               NUMA                SYSTEM
W       1   1M     8         BW       444 MB/s       441 MB/s(-0.67)     451 MB/s( 1.57)     448 MB/s( 0.90)    449 MB/s( 1.12)     447 MB/s( 0.67)
R       1   1M     8         BW       668 MB/s       666 MB/s(-0.29)     770 MB/s(15.26)     654 MB/s(-2.09)    660 MB/s(-1.19)     775 MB/s(16.01)     ###
RW      1  64K    16         BW       438 MB/s       437 MB/s(-0.22)     437 MB/s(-0.22)     433 MB/s(-1.14)    436 MB/s(-0.45)     436 MB/s(-0.45)
RR      1  64K    16         BW       465 MB/s       461 MB/s(-0.86)     467 MB/s( 0.43)     465 MB/s( 0.00)    463 MB/s(-0.43)     463 MB/s(-0.43)
RW      1   8K    32         BW       257 MB/s       262 MB/s( 1.94)     278 MB/s( 8.17)     293 MB/s(14.00)    258 MB/s( 0.38)     282 MB/s( 9.72)     ###
RW      1   8K    32         BW       537 MB/s       537 MB/s( 0.00)     532 MB/s(-0.93)     531 MB/s(-1.11)    535 MB/s(-0.37)     533 MB/s(-0.74)
W       2   1M     8         BW       452 MB/s       452 MB/s( 0.00)     450 MB/s(-0.44)     448 MB/s(-0.88)    449 MB/s(-0.66)     451 MB/s(-0.22)
R       2   1M     8         BW       686 MB/s       685 MB/s(-0.14)     813 MB/s(18.51)     818 MB/s(19.24)    817 MB/s(19.09)     817 MB/s(19.09)     ###
RW      2  64K    16         BW       435 MB/s       434 MB/s(-0.22)     438 MB/s( 0.68)     439 MB/s( 0.91)    434 MB/s(-0.22)     437 MB/s( 0.45)
RR      2  64K    16         BW       565 MB/s       563 MB/s(-0.35)     569 MB/s( 0.70)     560 MB/s(-0.88)    566 MB/s( 0.17)     569 MB/s( 0.70)
RW      2   8K    32         BW       346 MB/s       349 MB/s( 0.86)     344 MB/s(-0.57)     348 MB/s( 0.57)    347 MB/s( 0.28)     338 MB/s(-2.31)
RW      2   8K    32         BW       549 MB/s       546 MB/s(-0.54)     545 MB/s(-0.72)     542 MB/s(-1.27)    540 MB/s(-1.63)     546 MB/s(-0.54)
W       4   1M     8         BW       451 MB/s       451 MB/s( 0.00)     449 MB/s(-0.44)     429 MB/s(-4.87)    451 MB/s( 0.00)     452 MB/s( 0.22)
R       4   1M     8         BW       832 MB/s       829 MB/s(-0.36)     830 MB/s(-0.24)     834 MB/s( 0.24)    825 MB/s(-0.84)     814 MB/s(-2.16)
RW      4  64K    16         BW       438 MB/s       434 MB/s(-0.91)     431 MB/s(-1.59)     433 MB/s(-1.14)    435 MB/s(-0.68)     436 MB/s(-0.45)
RR      4  64K    16         BW       558 MB/s       565 MB/s( 1.25)     566 MB/s( 1.43)     563 MB/s( 0.89)    567 MB/s( 1.61)     568 MB/s( 1.79)
RW      4   8K    32         BW       349 MB/s       343 MB/s(-1.71)     346 MB/s(-0.85)     349 MB/s( 0.00)    338 MB/s(-3.15)     345 MB/s(-1.14)
RW      4   8K    32         BW       548 MB/s       547 MB/s(-0.18)     546 MB/s(-0.36)     540 MB/s(-1.45)    548 MB/s( 0.00)     547 MB/s(-0.18)
W       8   1M     8         BW       452 MB/s       447 MB/s(-1.10)     450 MB/s(-0.44)     436 MB/s(-3.53)    450 MB/s(-0.44)     449 MB/s(-0.66)
R       8   1M     8         BW       831 MB/s       831 MB/s( 0.00)     830 MB/s(-0.12)     826 MB/s(-0.60)    834 MB/s( 0.36)     830 MB/s(-0.12)
RW      8  64K    16         BW       430 MB/s       436 MB/s( 1.39)     431 MB/s( 0.23)     434 MB/s( 0.93)    432 MB/s( 0.46)     436 MB/s( 1.39)
RR      8  64K    16         BW       559 MB/s       568 MB/s( 1.61)     560 MB/s( 0.17)     568 MB/s( 1.61)    562 MB/s( 0.53)     564 MB/s( 0.89)
RW      8   8K    32         BW       344 MB/s       349 MB/s( 1.45)     349 MB/s( 1.45)     345 MB/s( 0.29)    347 MB/s( 0.87)     345 MB/s( 0.29)
RW      8   8K    32         BW       545 MB/s       546 MB/s( 0.18)     547 MB/s( 0.36)     546 MB/s( 0.18)    534 MB/s(-2.01)     547 MB/s( 0.36)
W      16   1M     8         BW       450 MB/s       448 MB/s(-0.44)     449 MB/s(-0.22)     451 MB/s( 0.22)    450 MB/s( 0.00)     450 MB/s( 0.00)
R      16   1M     8         BW       831 MB/s       830 MB/s(-0.12)     833 MB/s( 0.24)     833 MB/s( 0.24)    835 MB/s( 0.48)     833 MB/s( 0.24)
RW     16  64K    16         BW       436 MB/s       435 MB/s(-0.22)     435 MB/s(-0.22)     432 MB/s(-0.91)    434 MB/s(-0.45)     428 MB/s(-1.83)
RR     16  64K    16         BW       568 MB/s       563 MB/s(-0.88)     568 MB/s( 0.00)     565 MB/s(-0.52)    563 MB/s(-0.88)     562 MB/s(-1.05)
RW     16   8K    32         BW       349 MB/s       349 MB/s( 0.00)     346 MB/s(-0.85)     346 MB/s(-0.85)    342 MB/s(-2.00)     345 MB/s(-1.14)
RW     16   8K    32         BW       543 MB/s       545 MB/s( 0.36)     542 MB/s(-0.18)     534 MB/s(-1.65)    535 MB/s(-1.47)     543 MB/s( 0.00)
W      32   1M     8         BW       450 MB/s       450 MB/s( 0.00)     452 MB/s( 0.44)     451 MB/s( 0.22)    449 MB/s(-0.22)     451 MB/s( 0.22)
R      32   1M     8         BW       832 MB/s       832 MB/s( 0.00)     834 MB/s( 0.24)     837 MB/s( 0.60)    834 MB/s( 0.24)     837 MB/s( 0.60)
RW     32  64K    16         BW       436 MB/s       434 MB/s(-0.45)     431 MB/s(-1.14)     435 MB/s(-0.22)    433 MB/s(-0.68)     434 MB/s(-0.45)
RR     32  64K    16         BW       559 MB/s       565 MB/s( 1.07)     566 MB/s( 1.25)     566 MB/s( 1.25)    559 MB/s( 0.00)     567 MB/s( 1.43)
RW     32   8K    32         BW       349 MB/s       349 MB/s( 0.00)     349 MB/s( 0.00)     342 MB/s(-2.00)    348 MB/s(-0.28)     343 MB/s(-1.71)
RW     32   8K    32         BW       535 MB/s       535 MB/s( 0.00)     524 MB/s(-2.05)     531 MB/s(-0.74)    534 MB/s(-0.18)     523 MB/s(-2.24)
W      64   1M     8         BW       451 MB/s       451 MB/s( 0.00)     450 MB/s(-0.22)     450 MB/s(-0.22)    451 MB/s( 0.00)     451 MB/s( 0.00)
R      64   1M     8         BW       831 MB/s       834 MB/s( 0.36)     837 MB/s( 0.72)     835 MB/s( 0.48)    835 MB/s( 0.48)     834 MB/s( 0.36)
RW     64  64K    16         BW       437 MB/s       435 MB/s(-0.45)     436 MB/s(-0.22)     437 MB/s( 0.00)    436 MB/s(-0.22)     437 MB/s( 0.00)
RR     64  64K    16         BW       568 MB/s       568 MB/s( 0.00)     563 MB/s(-0.88)     564 MB/s(-0.70)    564 MB/s(-0.70)     559 MB/s(-1.58)
RW     64   8K    32         BW       336 MB/s       347 MB/s( 3.27)     347 MB/s( 3.27)     346 MB/s( 2.97)    346 MB/s( 2.97)     347 MB/s( 3.27)
RW     64   8K    32         BW       523 MB/s       534 MB/s( 2.10)     531 MB/s( 1.52)     521 MB/s(-0.38)    528 MB/s( 0.95)     519 MB/s(-0.76)
W     128   1M     8         BW       452 MB/s       450 MB/s(-0.44)     449 MB/s(-0.66)     449 MB/s(-0.66)    451 MB/s(-0.22)     449 MB/s(-0.66)
R     128   1M     8         BW       832 MB/s       831 MB/s(-0.12)     834 MB/s( 0.24)     836 MB/s( 0.48)    834 MB/s( 0.24)     834 MB/s( 0.24)
RW    128  64K    16         BW       435 MB/s       435 MB/s( 0.00)     431 MB/s(-0.91)     433 MB/s(-0.45)    434 MB/s(-0.22)     431 MB/s(-0.91)
RR    128  64K    16         BW       561 MB/s       564 MB/s( 0.53)     555 MB/s(-1.06)     562 MB/s( 0.17)    567 MB/s( 1.06)     565 MB/s( 0.71)
RW    128   8K    32         BW       348 MB/s       335 MB/s(-3.73)     345 MB/s(-0.86)     347 MB/s(-0.28)    348 MB/s( 0.00)     345 MB/s(-0.86)
RW    128   8K    32         BW       518 MB/s       530 MB/s( 2.31)     530 MB/s( 2.31)     517 MB/s(-0.19)    531 MB/s( 2.50)     519 MB/s( 0.19)
W     256   1M     8         BW       448 MB/s       448 MB/s( 0.00)     450 MB/s( 0.44)     449 MB/s( 0.22)    451 MB/s( 0.66)     449 MB/s( 0.22)
R     256   1M     8         BW       833 MB/s       830 MB/s(-0.36)     833 MB/s( 0.00)     834 MB/s( 0.12)    836 MB/s( 0.36)     836 MB/s( 0.36)
RW    256  64K    16         BW       436 MB/s       433 MB/s(-0.68)     432 MB/s(-0.91)     434 MB/s(-0.45)    433 MB/s(-0.68)     433 MB/s(-0.68)
RR    256  64K    16         BW       569 MB/s       565 MB/s(-0.70)     566 MB/s(-0.52)     563 MB/s(-1.05)    565 MB/s(-0.70)     566 MB/s(-0.52)
RW    256   8K    32         BW       340 MB/s       337 MB/s(-0.88)     336 MB/s(-1.17)     332 MB/s(-2.35)    336 MB/s(-1.17)     332 MB/s(-2.35)
RW    256   8K    32         BW       530 MB/s       528 MB/s(-0.37)     528 MB/s(-0.37)     528 MB/s(-0.37)    528 MB/s(-0.37)     528 MB/s(-0.37)
W     512   1M     8         BW       449 MB/s       450 MB/s( 0.22)     450 MB/s( 0.22)     447 MB/s(-0.44)    451 MB/s( 0.44)     450 MB/s( 0.22)
R     512   1M     8         BW       830 MB/s       830 MB/s( 0.00)     833 MB/s( 0.36)     832 MB/s( 0.24)    833 MB/s( 0.36)     833 MB/s( 0.36)
RW    512  64K    16         BW       436 MB/s       431 MB/s(-1.14)     433 MB/s(-0.68)     434 MB/s(-0.45)    433 MB/s(-0.68)     430 MB/s(-1.37)
RR    512  64K    16         BW       567 MB/s       564 MB/s(-0.52)     567 MB/s( 0.00)     556 MB/s(-1.94)    565 MB/s(-0.35)     565 MB/s(-0.35)
RW    512   8K    32         BW       332 MB/s       333 MB/s( 0.30)     330 MB/s(-0.60)     334 MB/s( 0.60)    333 MB/s( 0.30)     326 MB/s(-1.80)
RW    512   8K    32         BW       524 MB/s       531 MB/s( 1.33)     520 MB/s(-0.76)     529 MB/s( 0.95)    531 MB/s( 1.33)     522 MB/s(-0.38)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS1-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    SLAT AVG           66.1 us           68.9 us (-4.23)         69.4 us (-4.91)         68.9 us (-4.23)      301.7 us (-356.15)     458.5 us (-593.22)
R       1   1M     8    SLAT AVG           57.2 us           50.8 us (11.22)         43.4 us (24.09)         50.8 us (11.22)      251.5 us (-339.65)        38.5 us (32.70)
RW      1  64K    16    SLAT AVG           16.5 us          20.4 us (-23.03)         18.0 us (-9.06)        20.4 us (-23.03)        20.5 us (-23.76)       20.2 us (-21.88)
RR      1  64K    16    SLAT AVG           10.6 us           10.9 us (-1.97)        11.8 us (-10.32)         10.9 us (-1.97)        16.0 us (-50.04)       12.5 us (-17.55)
RW      1   8K    32    SLAT AVG           28.1 us           22.7 us (19.15)         25.1 us (10.78)         22.7 us (19.15)         28.2 us (-0.39)        24.3 us (13.49)
RW      1   8K    32    SLAT AVG           11.0 us           11.1 us (-1.36)         11.1 us (-1.00)         11.1 us (-1.36)         11.0 us ( 0.00)        11.2 us (-1.82)
W       2   1M     8    SLAT AVG           72.6 us           79.5 us (-9.49)       100.8 us (-38.86)         79.5 us (-9.49)      346.4 us (-377.23)        59.9 us (17.44)
R       2   1M     8    SLAT AVG           57.3 us           44.3 us (22.70)        64.6 us (-12.65)         44.3 us (22.70)      255.9 us (-346.59)        43.4 us (24.22)	
RW      2  64K    16    SLAT AVG           17.3 us           18.2 us (-4.96)         17.6 us (-1.44)         18.2 us (-4.96)        21.5 us (-24.09)        16.8 us ( 2.88)
RR      2  64K    16    SLAT AVG           12.0 us           11.7 us ( 2.58)        13.3 us (-10.76)         11.7 us ( 2.58)        19.9 us (-66.02)       16.7 us (-39.31)
RW      2   8K    32    SLAT AVG           16.5 us           16.2 us ( 1.51)         15.6 us ( 5.10)         16.2 us ( 1.51)         15.8 us ( 4.19)        16.3 us ( 0.97)
RW      2   8K    32    SLAT AVG            7.8 us          10.3 us (-31.92)          8.4 us (-7.82)        10.3 us (-31.92)        12.9 us (-65.64)         8.2 us (-5.38)
W       4   1M     8    SLAT AVG           73.3 us       912.7 us (-1144.29)         78.5 us (-6.99)     912.7 us (-1144.29)        82.7 us (-12.70)       92.3 us (-25.82)
R       4   1M     8    SLAT AVG           25.5 us          44.5 us (-74.77)        41.0 us (-60.80)        44.5 us (-74.77)        40.8 us (-59.94)      56.3 us (-120.71)
RW      4  64K    16    SLAT AVG           18.9 us           20.7 us (-9.58)         19.0 us (-0.84)         20.7 us (-9.58)         17.8 us ( 5.87)        17.9 us ( 5.39)
RR      4  64K    16    SLAT AVG            9.3 us            9.0 us ( 2.69)          7.2 us (22.41)          9.0 us ( 2.69)          7.7 us (16.70)         8.6 us ( 7.43)
RW      4   8K    32    SLAT AVG           89.4 us           90.9 us (-1.61)         90.5 us (-1.18)         90.9 us (-1.61)         93.6 us (-4.68)        91.4 us (-2.25)
RW      4   8K    32    SLAT AVG           57.1 us           58.0 us (-1.61)         57.2 us (-0.31)         58.0 us (-1.61)         57.0 us ( 0.03)        57.2 us (-0.19)
W       8   1M     8    SLAT AVG         3944.1 us      12078.7 us (-206.24)        408.8 us (89.63)    12078.7 us (-206.24)        658.8 us (83.29)       684.3 us (82.65)
R       8   1M     8    SLAT AVG         1289.7 us       6871.1 us (-432.75)         31.7 us (97.54)     6871.1 us (-432.75)         32.3 us (97.49)    5000.1 us (-287.69)
RW      8  64K    16    SLAT AVG         1039.6 us         1035.4 us ( 0.40)       1051.6 us (-1.14)       1035.4 us ( 0.40)       1047.2 us (-0.73)      1035.9 us ( 0.35)
RR      8  64K    16    SLAT AVG          797.7 us          762.9 us ( 4.35)        767.2 us ( 3.82)        762.9 us ( 4.35)        762.9 us ( 4.35)       754.0 us ( 5.47)
RW      8   8K    32    SLAT AVG          188.9 us          187.9 us ( 0.51)        186.6 us ( 1.24)        187.9 us ( 0.51)        187.1 us ( 0.93)       188.4 us ( 0.25)
RW      8   8K    32    SLAT AVG          118.9 us          118.8 us ( 0.05)        118.4 us ( 0.40)        118.8 us ( 0.05)        121.1 us (-1.88)       118.6 us ( 0.26)
W      16   1M     8    SLAT AVG        34424.9 us        30639.2 us (10.99)      32747.9 us ( 4.87)      30639.2 us (10.99)      32097.8 us ( 6.75)     33219.7 us ( 3.50)
R      16   1M     8    SLAT AVG        16336.7 us       18195.1 us (-11.37)      16116.6 us ( 1.34)     18195.1 us (-11.37)      17533.2 us (-7.32)     17440.2 us (-6.75)
RW     16  64K    16    SLAT AVG         2394.8 us         2419.3 us (-1.02)       2403.7 us (-0.37)       2419.3 us (-1.02)       2410.5 us (-0.65)      2438.8 us (-1.83)
RR     16  64K    16    SLAT AVG         1839.5 us         1849.9 us (-0.56)       1835.8 us ( 0.20)       1849.9 us (-0.56)       1855.0 us (-0.84)      1856.0 us (-0.89)
RW     16   8K    32    SLAT AVG          373.5 us          376.4 us (-0.78)        377.4 us (-1.04)        376.4 us (-0.78)        382.0 us (-2.27)       378.4 us (-1.30)
RW     16   8K    32    SLAT AVG          239.9 us          243.8 us (-1.63)        240.4 us (-0.20)        243.8 us (-1.63)        243.3 us (-1.43)       239.8 us ( 0.02)
W      32   1M     8    SLAT AVG        73985.5 us        74152.6 us (-0.22)      73794.5 us ( 0.25)      74152.6 us (-0.22)      74439.5 us (-0.61)     74070.6 us (-0.11)
R      32   1M     8    SLAT AVG        40168.0 us        39981.1 us ( 0.46)      40034.8 us ( 0.33)      39981.1 us ( 0.46)      40110.7 us ( 0.14)     39998.8 us ( 0.42)
RW     32  64K    16    SLAT AVG         4803.5 us         4812.7 us (-0.18)       4858.6 us (-1.14)       4812.7 us (-0.18)       4842.3 us (-0.80)      4833.7 us (-0.62)
RR     32  64K    16    SLAT AVG         3751.9 us         3703.0 us ( 1.30)       3705.6 us ( 1.23)       3703.0 us ( 1.30)       3747.9 us ( 0.10)      3695.5 us ( 1.50)
RW     32   8K    32    SLAT AVG          748.2 us          765.0 us (-2.23)        749.8 us (-0.21)        765.0 us (-2.23)        750.0 us (-0.23)       762.4 us (-1.89)
RW     32   8K    32    SLAT AVG          488.1 us          492.2 us (-0.83)        498.1 us (-2.05)        492.2 us (-0.83)        489.2 us (-0.22)       498.9 us (-2.21)
W      64   1M     8    SLAT AVG       148152.2 us       148734.5 us (-0.39)     148558.1 us (-0.27)     148734.5 us (-0.39)     148389.2 us (-0.15)    148354.1 us (-0.13)
R      64   1M     8    SLAT AVG        80564.5 us        80255.7 us ( 0.38)      80000.7 us ( 0.69)      80255.7 us ( 0.38)      80228.2 us ( 0.41)     80298.6 us ( 0.32)
RW     64  64K    16    SLAT AVG         9587.7 us         9594.3 us (-0.06)       9615.0 us (-0.28)       9594.3 us (-0.06)       9625.2 us (-0.39)      9582.7 us ( 0.05)
RR     64  64K    16    SLAT AVG         7382.9 us         7433.5 us (-0.68)       7452.1 us (-0.93)       7433.5 us (-0.68)       7435.5 us (-0.71)      7493.8 us (-1.50)
RW     64   8K    32    SLAT AVG         1557.9 us         1511.4 us ( 2.98)       1506.5 us ( 3.29)       1511.4 us ( 2.98)       1512.7 us ( 2.90)      1510.4 us ( 3.04)
RW     64   8K    32    SLAT AVG         1000.1 us         1003.3 us (-0.31)        985.7 us ( 1.44)       1003.3 us (-0.31)        990.2 us ( 0.99)      1007.7 us (-0.75)
W     128   1M     8    SLAT AVG       295432.7 us       297571.5 us (-0.72)     297037.9 us (-0.54)     297571.5 us (-0.72)     295636.5 us (-0.06)    297350.6 us (-0.64)
R     128   1M     8    SLAT AVG       160875.1 us       160175.8 us ( 0.43)     160566.7 us ( 0.19)     160175.8 us ( 0.43)     160508.1 us ( 0.22)    160453.9 us ( 0.26)
RW    128  64K    16    SLAT AVG        19287.8 us        19382.9 us (-0.49)      19468.8 us (-0.93)      19382.9 us (-0.49)      19331.2 us (-0.22)     19439.9 us (-0.78)
RR    128  64K    16    SLAT AVG        14959.2 us        14924.1 us ( 0.23)      15106.7 us (-0.98)      14924.1 us ( 0.23)      14793.5 us ( 1.10)     14851.7 us ( 0.71)
RW    128   8K    32    SLAT AVG         3010.8 us         3018.8 us (-0.26)       3031.2 us (-0.67)       3018.8 us (-0.26)       3014.0 us (-0.10)      3039.5 us (-0.95)
RW    128   8K    32    SLAT AVG         2021.4 us         2027.9 us (-0.31)       1977.2 us ( 2.18)       2027.9 us (-0.31)       1974.2 us ( 2.33)      2020.2 us ( 0.06)
W     256   1M     8    SLAT AVG       594516.2 us       593344.6 us ( 0.19)     591076.4 us ( 0.57)     593344.6 us ( 0.19)     591613.2 us ( 0.48)    593075.6 us ( 0.24)
R     256   1M     8    SLAT AVG       320966.6 us       320537.3 us ( 0.13)     320940.2 us ( 0.00)     320537.3 us ( 0.13)     319791.9 us ( 0.36)    319869.3 us ( 0.34)
RW    256  64K    16    SLAT AVG        38470.8 us        38651.1 us (-0.46)      38834.8 us (-0.94)      38651.1 us (-0.46)      38680.5 us (-0.54)     38710.4 us (-0.62)
RR    256  64K    16    SLAT AVG        29492.5 us        29761.0 us (-0.91)      29622.6 us (-0.44)      29761.0 us (-0.91)      29695.4 us (-0.68)     29613.8 us (-0.41)
RW    256   8K    32    SLAT AVG         6158.0 us         6298.9 us (-2.28)       6242.2 us (-1.36)       6298.9 us (-2.28)       6227.6 us (-1.13)      6314.0 us (-2.53)
RW    256   8K    32    SLAT AVG         3953.4 us         3966.4 us (-0.32)       3966.6 us (-0.33)       3966.4 us (-0.32)       3966.1 us (-0.32)      3967.7 us (-0.36)
W     512   1M     8    SLAT AVG      1179263.7 us      1183382.0 us (-0.34)    1177489.5 us ( 0.15)    1183382.0 us (-0.34)    1176261.1 us ( 0.25)   1179015.2 us ( 0.02)
R     512   1M     8    SLAT AVG       642650.9 us       640613.8 us ( 0.31)     640251.1 us ( 0.37)     640613.8 us ( 0.31)     640184.1 us ( 0.38)    640297.9 us ( 0.36)
RW    512  64K    16    SLAT AVG        76900.6 us        77168.2 us (-0.34)      77472.5 us (-0.74)      77168.2 us (-0.34)      77447.1 us (-0.71)     77943.0 us (-1.35)
RR    512  64K    16    SLAT AVG        59081.4 us        60308.1 us (-2.07)      59080.9 us ( 0.00)      60308.1 us (-2.07)      59382.6 us (-0.50)     59304.8 us (-0.37)
RW    512   8K    32    SLAT AVG        12631.9 us        12528.2 us ( 0.82)      12693.9 us (-0.49)      12528.2 us ( 0.82)      12573.6 us ( 0.46)     12862.3 us (-1.82)
RW    512   8K    32    SLAT AVG         8005.1 us         7926.6 us ( 0.97)       8056.9 us (-0.64)       7926.6 us ( 0.97)       7898.8 us ( 1.32)      8030.2 us (-0.31)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS1-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    LAT AVG         18873.5 us        19012.7 us (-0.73)      18599.8 us ( 1.45)      18705.3 us ( 0.89)      18670.2 us ( 1.07)      18779.1 us ( 0.49)
R       1   1M     8    LAT AVG         12560.7 us        12584.7 us (-0.19)      10898.3 us (13.23)      12832.0 us (-2.15)      12698.5 us (-1.09)      10822.3 us (13.83)	###
RW      1  64K    16    LAT AVG          2393.5 us         2398.3 us (-0.19)       2400.1 us (-0.27)       2420.1 us (-1.11)       2404.4 us (-0.45)       2404.3 us (-0.44)
RR      1  64K    16    LAT AVG          2255.0 us         2272.7 us (-0.78)       2247.0 us ( 0.35)       2255.7 us (-0.03)       2264.7 us (-0.43)       2266.0 us (-0.48)
RW      1   8K    32    LAT AVG          1019.4 us          999.6 us ( 1.94)        941.9 us ( 7.60)        894.6 us (12.24)       1015.6 us ( 0.37)        927.5 us ( 9.01)	###
RW      1   8K    32    LAT AVG           488.1 us          488.1 us ( 0.00)        492.0 us (-0.80)        493.2 us (-1.05)        490.0 us (-0.39)        491.6 us (-0.72)
W       2   1M     8    LAT AVG         37154.1 us        37130.3 us ( 0.06)      37249.5 us (-0.25)      37464.1 us (-0.83)      37341.6 us (-0.50)      37166.3 us (-0.03)
R       2   1M     8    LAT AVG         24440.8 us        24500.7 us (-0.24)      20643.3 us (15.53)      20503.9 us (16.10)      20516.2 us (16.05)      20501.1 us (16.11)	###
RW      2  64K    16    LAT AVG          4818.2 us         4829.2 us (-0.22)       4789.4 us ( 0.59)       4778.7 us ( 0.81)       4832.8 us (-0.30)       4799.2 us ( 0.39)
RR      2  64K    16    LAT AVG          3709.9 us         3721.5 us (-0.31)       3682.9 us ( 0.72)       3744.2 us (-0.92)       3706.0 us ( 0.10)       3685.1 us ( 0.67)
RW      2   8K    32    LAT AVG          1515.3 us         1501.8 us ( 0.89)       1521.8 us (-0.42)       1505.7 us ( 0.62)       1509.5 us ( 0.37)       1549.5 us (-2.26)
RW      2   8K    32    LAT AVG           954.7 us          960.1 us (-0.57)        961.3 us (-0.70)        966.4 us (-1.22)        969.9 us (-1.59)        959.6 us (-0.51)
W       4   1M     8    LAT AVG         74360.0 us        74440.0 us (-0.10)      74720.0 us (-0.48)      78230.0 us (-5.20)      74380.0 us (-0.02)      74230.0 us ( 0.17)
R       4   1M     8    LAT AVG         40320.9 us        40481.1 us (-0.39)      40400.5 us (-0.19)      40217.7 us ( 0.25)      40650.0 us (-0.81)      41220.0 us (-2.22)
RW      4  64K    16    LAT AVG          9585.6 us         9672.2 us (-0.90)       9728.6 us (-1.49)       9691.8 us (-1.10)       9647.6 us (-0.64)       9611.9 us (-0.27)
RR      4  64K    16    LAT AVG          7519.7 us         7418.3 us ( 1.34)       7408.0 us ( 1.48)       7449.3 us ( 0.93)       7398.8 us ( 1.60)       7382.4 us ( 1.82)
RW      4   8K    32    LAT AVG          3000.6 us         3055.7 us (-1.83)       3027.2 us (-0.88)       3003.4 us (-0.09)       3099.8 us (-3.30)       3041.1 us (-1.34)
RW      4   8K    32    LAT AVG          1914.2 us         1915.4 us (-0.06)       1921.2 us (-0.36)       1939.9 us (-1.34)       1913.9 us ( 0.01)       1917.3 us (-0.16)
W       8   1M     8    LAT AVG        148540.0 us       150070.0 us (-1.03)     149050.0 us (-0.34)     153720.0 us (-3.48)     148920.0 us (-0.25)     149250.0 us (-0.47)
R       8   1M     8    LAT AVG         80670.0 us        80730.0 us (-0.07)      80830.0 us (-0.19)      81210.0 us (-0.66)      80410.0 us ( 0.32)      80860.0 us (-0.23)
RW      8  64K    16    LAT AVG         19487.3 us        19224.5 us ( 1.34)      19465.0 us ( 0.11)      19325.6 us ( 0.82)      19399.4 us ( 0.45)      19246.7 us ( 1.23)
RR      8  64K    16    LAT AVG         14994.2 us        14766.2 us ( 1.52)      14982.7 us ( 0.07)      14775.1 us ( 1.46)      14931.8 us ( 0.41)      14883.3 us ( 0.73)
RW      8   8K    32    LAT AVG          6098.8 us         6000.5 us ( 1.61)       6016.8 us ( 1.34)       6081.0 us ( 0.29)       6038.8 us ( 0.98)       6077.1 us ( 0.35)
RW      8   8K    32    LAT AVG          3846.1 us         3838.3 us ( 0.20)       3829.9 us ( 0.41)       3842.2 us ( 0.09)       3929.8 us (-2.17)       3834.1 us ( 0.31)
W      16   1M     8    LAT AVG        298020.0 us       299060.0 us (-0.34)     298760.0 us (-0.24)     297220.0 us ( 0.26)     298150.0 us (-0.04)     297920.0 us ( 0.03)
R      16   1M     8    LAT AVG        161430.0 us       161560.0 us (-0.08)     160960.0 us ( 0.29)     161000.0 us ( 0.26)     160620.0 us ( 0.50)     160910.0 us ( 0.32)
RW     16  64K    16    LAT AVG         38437.0 us        38550.0 us (-0.29)      38560.0 us (-0.31)      38828.1 us (-1.01)      38668.9 us (-0.60)      39151.1 us (-1.85)
RR     16  64K    16    LAT AVG         29535.4 us        29771.5 us (-0.79)      29506.2 us ( 0.09)      29689.1 us (-0.52)      29780.0 us (-0.82)      29833.0 us (-1.00)
RW     16   8K    32    LAT AVG         12007.2 us        12010.5 us (-0.02)      12129.0 us (-1.01)      12104.1 us (-0.80)      12279.2 us (-2.26)      12159.6 us (-1.26)
RW     16   8K    32    LAT AVG          7723.1 us         7693.3 us ( 0.38)       7739.2 us (-0.20)       7852.6 us (-1.67)       7835.0 us (-1.44)       7719.8 us ( 0.04)
W      32   1M     8    LAT AVG        594030.0 us       594460.0 us (-0.07)     592020.0 us ( 0.33)     593830.0 us ( 0.03)     596150.0 us (-0.35)     593740.0 us ( 0.04)
R      32   1M     8    LAT AVG        322030.0 us       321990.0 us ( 0.01)     321110.0 us ( 0.28)     320030.0 us ( 0.62)     321330.0 us ( 0.21)     320100.0 us ( 0.59)
RW     32  64K    16    LAT AVG         76880.0 us        77240.0 us (-0.46)      77760.0 us (-1.14)      77030.0 us (-0.19)      77500.0 us (-0.80)      77350.0 us (-0.61)
RR     32  64K    16    LAT AVG         60050.0 us        59360.0 us ( 1.14)      59300.0 us ( 1.24)      59270.0 us ( 1.29)      59990.0 us ( 0.09)      59140.0 us ( 1.51)
RW     32   8K    32    LAT AVG         24015.3 us        24011.3 us ( 0.01)      24060.5 us (-0.18)      24548.8 us (-2.22)      24066.4 us (-0.21)      24464.5 us (-1.87)
RW     32   8K    32    LAT AVG         15672.8 us        15679.1 us (-0.04)      15997.6 us (-2.07)      15803.4 us (-0.83)      15708.2 us (-0.22)      16024.3 us (-2.24)
W      64   1M     8    LAT AVG       1180380.0 us      1182930.0 us (-0.21)    1183380.0 us (-0.25)    1184430.0 us (-0.34)    1182230.0 us (-0.15)    1182160.0 us (-0.15)
R      64   1M     8    LAT AVG        643000.0 us       640920.0 us ( 0.32)     638430.0 us ( 0.71)     640520.0 us ( 0.38)     640210.0 us ( 0.43)     640790.0 us ( 0.34)
RW     64  64K    16    LAT AVG        153320.0 us       153940.0 us (-0.40)     153760.0 us (-0.28)     153420.0 us (-0.06)     153930.0 us (-0.39)     153230.0 us ( 0.05)
RR     64  64K    16    LAT AVG        118080.0 us       117950.0 us ( 0.11)     119190.0 us (-0.94)     118880.0 us (-0.67)     118920.0 us (-0.71)     119850.0 us (-1.49)
RW     64   8K    32    LAT AVG         49917.6 us        48266.9 us ( 3.30)      48276.1 us ( 3.28)      48412.1 us ( 3.01)      48469.3 us ( 2.90)      48382.5 us ( 3.07)
RW     64   8K    32    LAT AVG         32057.3 us        31420.8 us ( 1.98)      31590.3 us ( 1.45)      32162.1 us (-0.32)      31738.9 us ( 0.99)      32299.8 us (-0.75)
W     128   1M     8    LAT AVG       2333670.0 us      2346190.0 us (-0.53)    2346610.0 us (-0.55)    2350920.0 us (-0.73)    2335550.0 us (-0.08)    2348390.0 us (-0.63)
R     128   1M     8    LAT AVG       1278850.0 us      1280040.0 us (-0.09)    1276030.0 us ( 0.22)    1272630.0 us ( 0.48)    1275300.0 us ( 0.27)    1274870.0 us ( 0.31)
RW    128  64K    16    LAT AVG        308035.9 us       307683.7 us ( 0.11)     310932.1 us (-0.94)     309541.4 us (-0.48)     308725.3 us (-0.22)     310436.3 us (-0.77)
RR    128  64K    16    LAT AVG        239010.0 us       237570.0 us ( 0.60)     241340.0 us (-0.97)     238440.0 us ( 0.23)     236350.0 us ( 1.11)     237280.0 us ( 0.72)
RW    128   8K    32    LAT AVG         96345.8 us        99931.1 us (-3.72)      97013.5 us (-0.69)      96631.2 us (-0.29)      96438.0 us (-0.09)      97250.1 us (-0.93)
RW    128   8K    32    LAT AVG         64715.5 us        63316.5 us ( 2.16)      63299.7 us ( 2.18)      64922.5 us (-0.31)      63201.7 us ( 2.33)      64675.1 us ( 0.06)
W     256   1M     8    LAT AVG       4626300.0 us      4625540.0 us ( 0.01)    4602410.0 us ( 0.51)    4619430.0 us ( 0.14)    4595730.0 us ( 0.66)    4618030.0 us ( 0.17)
R     256   1M     8    LAT AVG       2529380.0 us      2537400.0 us (-0.31)    2528460.0 us ( 0.03)    2525920.0 us ( 0.13)    2520700.0 us ( 0.34)    2519440.0 us ( 0.39)
RW    256  64K    16    LAT AVG        612973.5 us       616360.7 us (-0.55)     618768.4 us (-0.94)     615890.4 us (-0.47)     616345.8 us (-0.55)     616782.3 us (-0.62)
RR    256  64K    16    LAT AVG        470350.0 us       473020.0 us (-0.56)     472380.0 us (-0.43)     474630.0 us (-0.90)     473600.0 us (-0.69)     472230.0 us (-0.39)
RW    256   8K    32    LAT AVG        196851.8 us       198838.9 us (-1.00)     199539.2 us (-1.36)     201364.1 us (-2.29)     199054.7 us (-1.11)     201788.0 us (-2.50)
RW    256   8K    32    LAT AVG        126451.5 us       126998.5 us (-0.43)     126875.7 us (-0.33)     126866.1 us (-0.32)     126859.2 us (-0.32)     126906.1 us (-0.35)
W     512   1M     8    LAT AVG       8904540.0 us      8898560.0 us ( 0.06)    8886410.0 us ( 0.20)    8926630.0 us (-0.24)    8885030.0 us ( 0.21)    8899820.0 us ( 0.05)
R     512   1M     8    LAT AVG       4978800.0 us      4977430.0 us ( 0.02)    4959540.0 us ( 0.38)    4960300.0 us ( 0.37)    4959810.0 us ( 0.38)    4958410.0 us ( 0.40)
RW    512  64K    16    LAT AVG       1220062.3 us      1234352.5 us (-1.17)    1228689.0 us (-0.70)    1223663.9 us (-0.29)    1227745.8 us (-0.62)    1236206.0 us (-1.32)
RR    512  64K    16    LAT AVG        938860.0 us       944890.1 us (-0.64)     938720.0 us ( 0.01)     958220.0 us (-2.06)     943460.0 us (-0.48)     942294.2 us (-0.36)
RW    512   8K    32    LAT AVG        403088.8 us       401876.5 us ( 0.30)     404934.0 us (-0.45)     399775.0 us ( 0.82)     401125.4 us ( 0.48)     410164.4 us (-1.75)
RW    512   8K    32    LAT AVG        255725.7 us       252158.8 us ( 1.39)     257406.7 us (-0.65)     253236.3 us ( 0.97)     252352.8 us ( 1.31)     256544.0 us (-0.31)
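
For reading the parenthesized numbers in the latency tables above: they appear to be the relative change against the baseline column (NPS1-6.4.0-rc1), signed so that a positive value means lower latency, and truncated rather than rounded to two decimals. That is inferred from the rows themselves, not taken from the script that generated the tables, so treat the sketch below as an assumption. The helper name pct_change and the higher_is_better flag (for bandwidth-style metrics, where the sign flips) are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN


def pct_change(baseline: str, value: str, higher_is_better: bool = False) -> Decimal:
    """Relative change vs. baseline in percent, positive meaning "better".

    Assumption: values are truncated toward zero at two decimals, which
    matches the rows spot-checked in the tables. Inputs are strings so that
    Decimal keeps the printed values exact.
    """
    base, val = Decimal(baseline), Decimal(value)
    # Latency: lower is better, so improvement = baseline - value.
    # Bandwidth: higher is better, so improvement = value - baseline.
    diff = (val - base) if higher_is_better else (base - val)
    return (diff / base * 100).quantize(Decimal("0.01"), rounding=ROUND_DOWN)


# "RW 1 8K 32 LAT AVG" row: baseline 1019.4 us, CPU 999.6 us, CACHE 894.6 us
print(pct_change("1019.4", "999.6"))   # 1.94
print(pct_change("1019.4", "894.6"))   # 12.24
```

With this reading, the rows marked ### are the ones where several affinity scopes move the metric by roughly 10% or more in the same direction.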

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS1-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT AVG        18807.1 us        18955.8 us (-0.79)      18530.2 us ( 1.47)      18636.2 us ( 0.90)      18368.3 us ( 2.33)      18320.4 us ( 2.58)
R       1   1M     8    CLAT AVG        12503.4 us        12532.8 us (-0.23)      10854.8 us (13.18)      12781.1 us (-2.22)      12446.9 us ( 0.45)      10783.7 us (13.75)	###
RW      1  64K    16    CLAT AVG         2376.9 us         2382.9 us (-0.25)       2381.9 us (-0.21)       2399.7 us (-0.95)       2383.9 us (-0.29)       2384.0 us (-0.29)
RR      1  64K    16    CLAT AVG         2244.2 us         2262.0 us (-0.79)       2235.2 us ( 0.40)       2244.7 us (-0.02)       2248.6 us (-0.19)       2253.4 us (-0.40)
RW      1   8K    32    CLAT AVG          991.2 us          971.8 us ( 1.96)        916.7 us ( 7.51)        871.8 us (12.05)        987.3 us ( 0.39)        903.1 us ( 8.88)	###
RW      1   8K    32    CLAT AVG          477.0 us          477.2 us (-0.03)        480.8 us (-0.79)        482.0 us (-1.04)        479.0 us (-0.40)        480.4 us (-0.70)
W       2   1M     8    CLAT AVG        37081.3 us        37066.1 us ( 0.04)      37148.4 us (-0.18)      37384.4 us (-0.81)      36995.0 us ( 0.23)      37106.2 us (-0.06)
R       2   1M     8    CLAT AVG        24383.4 us        24442.0 us (-0.24)      20578.6 us (15.60)      20459.5 us (16.09)      20260.2 us (16.90)      20457.6 us (16.10)	###
RW      2  64K    16    CLAT AVG         4800.8 us         4806.5 us (-0.12)       4771.7 us ( 0.60)       4760.4 us ( 0.84)       4811.2 us (-0.21)       4782.3 us ( 0.38)
RR      2  64K    16    CLAT AVG         3697.9 us         3708.1 us (-0.27)       3669.6 us ( 0.76)       3732.5 us (-0.93)       3686.1 us ( 0.31)       3668.3 us ( 0.80)
RW      2   8K    32    CLAT AVG         1498.7 us         1486.6 us ( 0.81)       1506.1 us (-0.48)       1489.5 us ( 0.61)       1493.7 us ( 0.33)       1533.2 us (-2.29)
RW      2   8K    32    CLAT AVG          946.8 us          952.2 us (-0.57)        952.9 us (-0.64)        956.0 us (-0.97)        956.9 us (-1.06)        951.3 us (-0.47)
W       4   1M     8    CLAT AVG        74280.0 us        74350.0 us (-0.09)      74640.0 us (-0.48)      77320.0 us (-4.09)      74300.0 us (-0.02)      74140.0 us ( 0.18)
R       4   1M     8    CLAT AVG        40295.3 us        40456.7 us (-0.40)      40359.3 us (-0.15)      40173.0 us ( 0.30)      40610.0 us (-0.78)      41170.0 us (-2.17)
RW      4  64K    16    CLAT AVG         9566.6 us         9652.0 us (-0.89)       9709.4 us (-1.49)       9671.0 us (-1.09)       9629.8 us (-0.66)       9593.9 us (-0.28)
RR      4  64K    16    CLAT AVG         7510.3 us         7410.9 us ( 1.32)       7400.8 us ( 1.45)       7440.1 us ( 0.93)       7391.0 us ( 1.58)       7373.7 us ( 1.81)
RW      4   8K    32    CLAT AVG         2911.1 us         2963.5 us (-1.80)       2936.6 us (-0.87)       2912.4 us (-0.04)       3006.1 us (-3.26)       2949.6 us (-1.32)
RW      4   8K    32    CLAT AVG         1857.1 us         1858.2 us (-0.06)       1863.8 us (-0.36)       1881.8 us (-1.33)       1856.8 us ( 0.01)       1860.1 us (-0.15)
W       8   1M     8    CLAT AVG       144600.0 us       139300.0 us ( 3.66)     148640.0 us (-2.79)     141640.0 us ( 2.04)     148260.0 us (-2.53)     148570.0 us (-2.74)
R       8   1M     8    CLAT AVG        79380.0 us        79410.0 us (-0.03)      80800.0 us (-1.78)      74340.0 us ( 6.34)      80380.0 us (-1.25)      75860.0 us ( 4.43)
RW      8  64K    16    CLAT AVG        18447.5 us        18194.6 us ( 1.37)      18413.4 us ( 0.18)      18290.1 us ( 0.85)      18352.1 us ( 0.51)      18210.7 us ( 1.28)
RR      8  64K    16    CLAT AVG        14196.4 us        14012.8 us ( 1.29)      14215.4 us (-0.13)      14012.1 us ( 1.29)      14168.8 us ( 0.19)      14129.1 us ( 0.47)
RW      8   8K    32    CLAT AVG         5909.7 us         5814.4 us ( 1.61)       5830.2 us ( 1.34)       5893.0 us ( 0.28)       5851.5 us ( 0.98)       5888.5 us ( 0.35)
RW      8   8K    32    CLAT AVG         3727.1 us         3719.5 us ( 0.20)       3711.4 us ( 0.42)       3723.4 us ( 0.10)       3808.6 us (-2.18)       3715.5 us ( 0.31)
W      16   1M     8    CLAT AVG       263600.0 us       268210.0 us (-1.74)     266010.0 us (-0.91)     266580.0 us (-1.13)     266050.0 us (-0.92)     264700.0 us (-0.41)
R      16   1M     8    CLAT AVG       145090.0 us       145380.0 us (-0.19)     144850.0 us ( 0.16)     142800.0 us ( 1.57)     143080.0 us ( 1.38)     143470.0 us ( 1.11)
RW     16  64K    16    CLAT AVG        36042.1 us        36140.0 us (-0.27)      36154.9 us (-0.31)      36408.6 us (-1.01)      36258.2 us (-0.59)      36712.1 us (-1.85)
RR     16  64K    16    CLAT AVG        27695.8 us        27918.9 us (-0.80)      27670.3 us ( 0.09)      27839.1 us (-0.51)      27924.9 us (-0.82)      27976.8 us (-1.01)
RW     16   8K    32    CLAT AVG        11633.6 us        11636.8 us (-0.02)      11751.5 us (-1.01)      11727.5 us (-0.80)      11897.1 us (-2.26)      11781.2 us (-1.26)
RW     16   8K    32    CLAT AVG         7483.1 us         7454.2 us ( 0.38)       7498.7 us (-0.20)       7608.7 us (-1.67)       7591.6 us (-1.44)       7479.9 us ( 0.04)
W      32   1M     8    CLAT AVG       520040.0 us       520480.0 us (-0.08)     518230.0 us ( 0.34)     519670.0 us ( 0.07)     521710.0 us (-0.32)     519670.0 us ( 0.07)
R      32   1M     8    CLAT AVG       281860.0 us       281780.0 us ( 0.02)     281080.0 us ( 0.27)     280040.0 us ( 0.64)     281220.0 us ( 0.22)     280100.0 us ( 0.62)
RW     32  64K    16    CLAT AVG        72080.0 us        72420.0 us (-0.47)      72900.0 us (-1.13)      72210.0 us (-0.18)      72650.0 us (-0.79)      72520.0 us (-0.61)
RR     32  64K    16    CLAT AVG        56290.0 us        55650.0 us ( 1.13)      55599.2 us ( 1.22)      55560.0 us ( 1.29)      56240.0 us ( 0.08)      55450.0 us ( 1.49)
RW     32   8K    32    CLAT AVG        23266.9 us        23262.7 us ( 0.01)      23310.6 us (-0.18)      23783.6 us (-2.22)      23316.3 us (-0.21)      23702.0 us (-1.87)
RW     32   8K    32    CLAT AVG        15184.6 us        15190.7 us (-0.04)      15499.3 us (-2.07)      15311.1 us (-0.83)      15218.8 us (-0.22)      15525.3 us (-2.24)
W      64   1M     8    CLAT AVG      1032230.0 us      1034480.0 us (-0.21)    1034820.0 us (-0.25)    1035700.0 us (-0.33)    1033840.0 us (-0.15)    1033810.0 us (-0.15)
R      64   1M     8    CLAT AVG       562430.0 us       560610.0 us ( 0.32)     558430.0 us ( 0.71)     560270.0 us ( 0.38)     559980.0 us ( 0.43)     560490.0 us ( 0.34)
RW     64  64K    16    CLAT AVG       143730.0 us       144310.0 us (-0.40)     144140.0 us (-0.28)     143830.0 us (-0.06)     144300.0 us (-0.39)     143650.0 us ( 0.05)
RR     64  64K    16    CLAT AVG       110696.6 us       110580.0 us ( 0.10)     111730.0 us (-0.93)     111450.0 us (-0.68)     111480.0 us (-0.70)     112360.0 us (-1.50)
RW     64   8K    32    CLAT AVG        48359.6 us        46760.4 us ( 3.30)      46769.4 us ( 3.28)      46900.6 us ( 3.01)      46956.4 us ( 2.90)      46872.0 us ( 3.07)
RW     64   8K    32    CLAT AVG        31057.0 us        30440.2 us ( 1.98)      30604.5 us ( 1.45)      31158.7 us (-0.32)      30748.6 us ( 0.99)      31291.9 us (-0.75)
W     128   1M     8    CLAT AVG      2038230.0 us      2049330.0 us (-0.54)    2049570.0 us (-0.55)    2053340.0 us (-0.74)    2039920.0 us (-0.08)    2051040.0 us (-0.62)
R     128   1M     8    CLAT AVG      1117970.0 us      1118990.0 us (-0.09)    1115460.0 us ( 0.22)    1112450.0 us ( 0.49)    1114790.0 us ( 0.28)    1114420.0 us ( 0.31)
RW    128  64K    16    CLAT AVG       288747.9 us       288417.1 us ( 0.11)     291463.0 us (-0.94)     290158.2 us (-0.48)     289393.9 us (-0.22)     290996.2 us (-0.77)
RR    128  64K    16    CLAT AVG       224050.0 us       222700.0 us ( 0.60)     226233.8 us (-0.97)     223514.1 us ( 0.23)     221560.0 us ( 1.11)     222430.0 us ( 0.72)
RW    128   8K    32    CLAT AVG        93334.8 us        96808.4 us (-3.72)      93982.1 us (-0.69)      93612.2 us (-0.29)      93423.9 us (-0.09)      94210.4 us (-0.93)
RW    128   8K    32    CLAT AVG        62693.9 us        61338.6 us ( 2.16)      61322.3 us ( 2.18)      62894.5 us (-0.31)      61227.4 us ( 2.33)      62654.8 us ( 0.06)
W     256   1M     8    CLAT AVG      4031780.0 us      4031190.0 us ( 0.01)    4011330.0 us ( 0.50)    4026090.0 us ( 0.14)    4004120.0 us ( 0.68)    4024950.0 us ( 0.16)
R     256   1M     8    CLAT AVG      2208410.0 us      2215390.0 us (-0.31)    2207520.0 us ( 0.04)    2205380.0 us ( 0.13)    2200910.0 us ( 0.33)    2199570.0 us ( 0.40)
RW    256  64K    16    CLAT AVG       574502.5 us       577676.6 us (-0.55)     579933.4 us (-0.94)     577239.0 us (-0.47)     577665.1 us (-0.55)     578071.7 us (-0.62)
RR    256  64K    16    CLAT AVG       440852.9 us       443354.7 us (-0.56)     442760.0 us (-0.43)     444870.0 us (-0.91)     443903.7 us (-0.69)     442620.0 us (-0.40)
RW    256   8K    32    CLAT AVG       190693.7 us       192617.6 us (-1.00)     193296.9 us (-1.36)     195065.0 us (-2.29)     192827.0 us (-1.11)     195473.8 us (-2.50)
RW    256   8K    32    CLAT AVG       122497.9 us       123027.8 us (-0.43)     122909.0 us (-0.33)     122899.6 us (-0.32)     122892.9 us (-0.32)     122938.3 us (-0.35)
W     512   1M     8    CLAT AVG      7725280.0 us      7720740.0 us ( 0.05)    7708920.0 us ( 0.21)    7743240.0 us (-0.23)    7708770.0 us ( 0.21)    7720800.0 us ( 0.05)
R     512   1M     8    CLAT AVG      4336150.0 us      4334880.0 us ( 0.02)    4319290.0 us ( 0.38)    4319680.0 us ( 0.37)    4319620.0 us ( 0.38)    4318120.0 us ( 0.41)
RW    512  64K    16    CLAT AVG      1143161.4 us      1156521.8 us (-1.16)    1151216.1 us (-0.70)    1146495.4 us (-0.29)    1150298.3 us (-0.62)    1158262.8 us (-1.32)
RR    512  64K    16    CLAT AVG       879775.4 us       885430.6 us (-0.64)     879635.5 us ( 0.01)     897911.9 us (-2.06)     884080.0 us (-0.48)     882989.2 us (-0.36)
RW    512   8K    32    CLAT AVG       390456.8 us       389277.5 us ( 0.30)     392239.9 us (-0.45)     387246.6 us ( 0.82)     388551.7 us ( 0.48)     397301.9 us (-1.75)
RW    512   8K    32    CLAT AVG       247720.5 us       244265.8 us ( 1.39)     249349.6 us (-0.65)     245309.5 us ( 0.97)     244453.8 us ( 1.31)     248513.6 us (-0.32)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS1-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT  99        27132.0 us        27395.0 us (-0.96)       21890.0 us (19.32)      22414.0 us (17.38)      22152.0 us (18.35)      22414.0 us (17.38)
R       1   1M     8    CLAT  99        13829.0 us        13829.0 us ( 0.00)      16188.0 us (-17.05)      14222.0 us (-2.84)      14484.0 us (-4.73)      14091.0 us (-1.89)
RW      1  64K    16    CLAT  99         2802.0 us         2868.0 us (-2.35)        2868.0 us (-2.35)       2999.0 us (-7.03)       2933.0 us (-4.67)       2868.0 us (-2.35)
RR      1  64K    16    CLAT  99         4178.0 us         4228.0 us (-1.19)        4228.0 us (-1.19)       4228.0 us (-1.19)       4293.0 us (-2.75)       4293.0 us (-2.75)
RW      1   8K    32    CLAT  99         1418.0 us         1303.0 us ( 8.11)        1221.0 us (13.89)       1221.0 us (13.89)       1270.0 us (10.43)       1237.0 us (12.76)
RW      1   8K    32    CLAT  99          766.0 us          775.0 us (-1.17)         766.0 us ( 0.00)        775.0 us (-1.17)        766.0 us ( 0.00)        766.0 us ( 0.00)
W       2   1M     8    CLAT  99        42730.0 us        42206.0 us ( 1.22)       43254.0 us (-1.22)      44827.0 us (-4.90)      44827.0 us (-4.90)      43779.0 us (-2.45)
R       2   1M     8    CLAT  99        27132.0 us        27132.0 us ( 0.00)       25035.0 us ( 7.72)      22938.0 us (15.45)      28181.0 us (-3.86)      22938.0 us (15.45)
RW      2  64K    16    CLAT  99         5604.0 us         5735.0 us (-2.33)        5538.0 us ( 1.17)       5473.0 us ( 2.33)       5866.0 us (-4.67)       5604.0 us ( 0.00)
RR      2  64K    16    CLAT  99         6587.0 us         6587.0 us ( 0.00)        6521.0 us ( 1.00)       6652.0 us (-0.98)       6587.0 us ( 0.00)       6456.0 us ( 1.98)
RW      2   8K    32    CLAT  99         3523.0 us         3490.0 us ( 0.93)        3556.0 us (-0.93)       3458.0 us ( 1.84)       3523.0 us ( 0.00)       3556.0 us (-0.93)
RW      2   8K    32    CLAT  99         2311.0 us         2343.0 us (-1.38)        2311.0 us ( 0.00)       2278.0 us ( 1.42)       2180.0 us ( 5.66)       2311.0 us ( 0.00)
W       4   1M     8    CLAT  99        91000.0 us        93000.0 us (-2.19)       92000.0 us (-1.09)      93000.0 us (-2.19)      89000.0 us ( 2.19)      89000.0 us ( 2.19)
R       4   1M     8    CLAT  99        49546.0 us        49546.0 us ( 0.00)       42730.0 us (13.75)      42206.0 us (14.81)     58000.0 us (-17.06)     66000.0 us (-33.20)
RW      4  64K    16    CLAT  99        22676.0 us        22938.0 us (-1.15)       23462.0 us (-3.46)      23200.0 us (-2.31)      23200.0 us (-2.31)      23462.0 us (-3.46)
RR      4  64K    16    CLAT  99        17695.0 us        17695.0 us ( 0.00)       17695.0 us ( 0.00)      17695.0 us ( 0.00)      17433.0 us ( 1.48)      17433.0 us ( 1.48)
RW      4   8K    32    CLAT  99         6259.0 us         6390.0 us (-2.09)        6259.0 us ( 0.00)       6259.0 us ( 0.00)       6456.0 us (-3.14)       6390.0 us (-2.09)
RW      4   8K    32    CLAT  99         3949.0 us         3949.0 us ( 0.00)        3982.0 us (-0.83)       3949.0 us ( 0.00)       3949.0 us ( 0.00)       3982.0 us (-0.83)
W       8   1M     8    CLAT  99       262000.0 us      334000.0 us (-27.48)      236000.0 us ( 9.92)    338000.0 us (-29.00)     226000.0 us (13.74)     239000.0 us ( 8.77)
R       8   1M     8    CLAT  99        92000.0 us        89000.0 us ( 3.26)       85000.0 us ( 7.60)    171000.0 us (-85.86)      84000.0 us ( 8.69)    171000.0 us (-85.86)
RW      8  64K    16    CLAT  99        43779.0 us        42730.0 us ( 2.39)       42730.0 us ( 2.39)      42730.0 us ( 2.39)      42730.0 us ( 2.39)      42730.0 us ( 2.39)
RR      8  64K    16    CLAT  99        33817.0 us        34866.0 us (-3.10)       35390.0 us (-4.65)      34341.0 us (-1.54)      35390.0 us (-4.65)      35914.0 us (-6.20)
RW      8   8K    32    CLAT  99        11600.0 us        11469.0 us ( 1.12)       11469.0 us ( 1.12)      11731.0 us (-1.12)      11600.0 us ( 0.00)      11469.0 us ( 1.12)
RW      8   8K    32    CLAT  99         7177.0 us         7177.0 us ( 0.00)        7177.0 us ( 0.00)       7177.0 us ( 0.00)       7242.0 us (-0.90)       7177.0 us ( 0.00)
W      16   1M     8    CLAT  99       600000.0 us      684000.0 us (-14.00)      659000.0 us (-9.83)     659000.0 us (-9.83)    676000.0 us (-12.66)     651000.0 us (-8.50)
R      16   1M     8    CLAT  99       388000.0 us       393000.0 us (-1.28)      393000.0 us (-1.28)     334000.0 us (13.91)     363000.0 us ( 6.44)     376000.0 us ( 3.09)
RW     16  64K    16    CLAT  99        63177.0 us        64226.0 us (-1.66)       64226.0 us (-1.66)      64750.0 us (-2.48)      64750.0 us (-2.48)      66847.0 us (-5.80)
RR     16  64K    16    CLAT  99        49021.0 us        48497.0 us ( 1.06)       48497.0 us ( 1.06)      48497.0 us ( 1.06)      49021.0 us ( 0.00)      50070.0 us (-2.13)
RW     16   8K    32    CLAT  99        19530.0 us        19530.0 us ( 0.00)       20000.0 us (-2.40)      20055.0 us (-2.68)      21000.0 us (-7.52)      20000.0 us (-2.40)
RW     16   8K    32    CLAT  99        12518.0 us        12256.0 us ( 2.09)       12387.0 us ( 1.04)      12780.0 us (-2.09)      12780.0 us (-2.09)      12387.0 us ( 1.04)
W      32   1M     8    CLAT  99       894000.0 us       919000.0 us (-2.79)      919000.0 us (-2.79)     911000.0 us (-1.90)     902000.0 us (-0.89)     852000.0 us ( 4.69)
R      32   1M     8    CLAT  99       460000.0 us       451000.0 us ( 1.95)      464000.0 us (-0.86)     472000.0 us (-2.60)     456000.0 us ( 0.86)     485000.0 us (-5.43)
RW     32  64K    16    CLAT  99       102000.0 us       103000.0 us (-0.98)      104000.0 us (-1.96)     102000.0 us ( 0.00)     104000.0 us (-1.96)     105000.0 us (-2.94)
RR     32  64K    16    CLAT  99        78119.0 us        77071.0 us ( 1.34)       77071.0 us ( 1.34)      79168.0 us (-1.34)      79168.0 us (-1.34)      79168.0 us (-1.34)
RW     32   8K    32    CLAT  99        35390.0 us        38536.0 us (-8.88)       35390.0 us ( 0.00)     42000.0 us (-18.67)      35390.0 us ( 0.00)     39000.0 us (-10.20)
RW     32   8K    32    CLAT  99        22938.0 us        22938.0 us ( 0.00)       23725.0 us (-3.43)      23462.0 us (-2.28)      23200.0 us (-1.14)      23725.0 us (-3.43)
W      64   1M     8    CLAT  99      1418000.0 us      1401000.0 us ( 1.19)     1536000.0 us (-8.32)   1620000.0 us (-14.24)    1469000.0 us (-3.59)   1586000.0 us (-11.84)
R      64   1M     8    CLAT  99       835000.0 us       902000.0 us (-8.02)      911000.0 us (-9.10)     844000.0 us (-1.07)     911000.0 us (-9.10)     852000.0 us (-2.03)
RW     64  64K    16    CLAT  99       184000.0 us       188000.0 us (-2.17)      186000.0 us (-1.08)     197000.0 us (-7.06)     188000.0 us (-2.17)     194000.0 us (-5.43)
RR     64  64K    16    CLAT  99       153000.0 us       153000.0 us ( 0.00)      159000.0 us (-3.92)     157000.0 us (-2.61)     157000.0 us (-2.61)     159000.0 us (-3.92)
RW     64   8K    32    CLAT  99        99000.0 us        85000.0 us (14.14)       86000.0 us (13.13)      93000.0 us ( 6.06)      83000.0 us (16.16)      85000.0 us (14.14)
RW     64   8K    32    CLAT  99        48497.0 us        45876.0 us ( 5.40)       46924.0 us ( 3.24)      48497.0 us ( 0.00)      47973.0 us ( 1.08)      49021.0 us (-1.08)
W     128   1M     8    CLAT  99      3004000.0 us      3104000.0 us (-3.32)     3138000.0 us (-4.46)    3004000.0 us ( 0.00)    2903000.0 us ( 3.36)    2836000.0 us ( 5.59)
R     128   1M     8    CLAT  99      1720000.0 us      1552000.0 us ( 9.76)     1754000.0 us (-1.97)    1838000.0 us (-6.86)    1653000.0 us ( 3.89)    1804000.0 us (-4.88)
RW    128  64K    16    CLAT  99       405000.0 us       393000.0 us ( 2.96)      397000.0 us ( 1.97)     426000.0 us (-5.18)     397000.0 us ( 1.97)     418000.0 us (-3.20)
RR    128  64K    16    CLAT  99       342000.0 us       326000.0 us ( 4.67)      351000.0 us (-2.63)     338000.0 us ( 1.16)     317000.0 us ( 7.30)     342000.0 us ( 0.00)
RW    128   8K    32    CLAT  99       180000.0 us      207000.0 us (-15.00)      182000.0 us (-1.11)     184000.0 us (-2.22)     176000.0 us ( 2.22)     194000.0 us (-7.77)
RW    128   8K    32    CLAT  99       103000.0 us       100000.0 us ( 2.91)       99000.0 us ( 3.88)     103000.0 us ( 0.00)      99000.0 us ( 3.88)     103000.0 us ( 0.00)
W     256   1M     8    CLAT  99      5403000.0 us      5403000.0 us ( 0.00)     5403000.0 us ( 0.00)    5537000.0 us (-2.48)    5403000.0 us ( 0.00)    5537000.0 us (-2.48)
R     256   1M     8    CLAT  99      3138000.0 us      3138000.0 us ( 0.00)     3239000.0 us (-3.21)    3037000.0 us ( 3.21)    2937000.0 us ( 6.40)   3473000.0 us (-10.67)
RW    256  64K    16    CLAT  99       802000.0 us       802000.0 us ( 0.00)      785000.0 us ( 2.11)     818000.0 us (-1.99)     844000.0 us (-5.23)     810000.0 us (-0.99)
RR    256  64K    16    CLAT  99       659000.0 us      760000.0 us (-15.32)      651000.0 us ( 1.21)     684000.0 us (-3.79)     634000.0 us ( 3.79)     651000.0 us ( 1.21)
RW    256   8K    32    CLAT  99       393000.0 us       405000.0 us (-3.05)      409000.0 us (-4.07)     401000.0 us (-2.03)     397000.0 us (-1.01)    435000.0 us (-10.68)
RW    256   8K    32    CLAT  99       201000.0 us       207000.0 us (-2.98)      203000.0 us (-0.99)     203000.0 us (-0.99)     203000.0 us (-0.99)     205000.0 us (-1.99)
W     512   1M     8    CLAT  99     10402000.0 us     10537000.0 us (-1.29)    10805000.0 us (-3.87)   10537000.0 us (-1.29)   10402000.0 us ( 0.00)   10939000.0 us (-5.16)
R     512   1M     8    CLAT  99      5738000.0 us      5738000.0 us ( 0.00)     5873000.0 us (-2.35)    6275000.0 us (-9.35)    5671000.0 us ( 1.16)    6074000.0 us (-5.85)
RW    512  64K    16    CLAT  99      1502000.0 us      1502000.0 us ( 0.00)     1519000.0 us (-1.13)    1536000.0 us (-2.26)    1485000.0 us ( 1.13)    1536000.0 us (-2.26)
RR    512  64K    16    CLAT  99      1301000.0 us      1351000.0 us (-3.84)     1217000.0 us ( 6.45)    1318000.0 us (-1.30)    1267000.0 us ( 2.61)    1250000.0 us ( 3.92)
RW    512   8K    32    CLAT  99       802000.0 us       802000.0 us ( 0.00)      793000.0 us ( 1.12)     793000.0 us ( 1.12)     844000.0 us (-5.23)    894000.0 us (-11.47)
RW    512   8K    32    CLAT  99       414000.0 us       405000.0 us ( 2.17)      414000.0 us ( 0.00)     405000.0 us ( 2.17)     405000.0 us ( 2.17)     422000.0 us (-1.93)

==================================================================================================================================
                                                     ~~~~~~~~~~~~~~~~NPS2~~~~~~~~~~~~~~~~
==================================================================================================================================

TEST Threads BS  IODEPTH Metric   NPS2-6.4.0-rc1       CPU                 SMT                 CACHE               NUMA              SYSTEM
W       1   1M     8        BW       435 MB/s      446 MB/s( 2.52)     447 MB/s( 2.75)     449 MB/s( 3.21)     445 MB/s( 2.29)     451 MB/s( 3.67)
R       1   1M     8        BW       678 MB/s      766 MB/s(12.97)     763 MB/s(12.53)     770 MB/s(13.56)     763 MB/s(12.53)     668 MB/s(-1.47)      ###
RW      1  64K    16        BW       436 MB/s      434 MB/s(-0.45)     437 MB/s( 0.22)     434 MB/s(-0.45)     435 MB/s(-0.22)     434 MB/s(-0.45)
RR      1  64K    16        BW       467 MB/s      465 MB/s(-0.42)     462 MB/s(-1.07)     462 MB/s(-1.07)     461 MB/s(-1.28)     467 MB/s( 0.00)
RW      1   8K    32        BW       250 MB/s      253 MB/s( 1.20)     231 MB/s(-7.60)     211 MB/s(-15.60)    229 MB/s(-8.40)     213 MB/s(-14.80)     ***
RW      1   8K    32        BW       537 MB/s      535 MB/s(-0.37)     536 MB/s(-0.18)     544 MB/s( 1.30)     535 MB/s(-0.37)     539 MB/s( 0.37)
W       2   1M     8        BW       451 MB/s      449 MB/s(-0.44)     450 MB/s(-0.22)     452 MB/s( 0.22)     450 MB/s(-0.22)     449 MB/s(-0.44)
R       2   1M     8        BW       678 MB/s      814 MB/s(20.05)     782 MB/s(15.33)     805 MB/s(18.73)     819 MB/s(20.79)     807 MB/s(19.02)      ###
RW      2  64K    16        BW       438 MB/s      438 MB/s( 0.00)     434 MB/s(-0.91)     436 MB/s(-0.45)     434 MB/s(-0.91)     433 MB/s(-1.14)
RR      2  64K    16        BW       565 MB/s      566 MB/s( 0.17)     561 MB/s(-0.70)     566 MB/s( 0.17)     564 MB/s(-0.17)     565 MB/s( 0.00)
RW      2   8K    32        BW       350 MB/s      336 MB/s(-4.00)     349 MB/s(-0.28)     348 MB/s(-0.57)     336 MB/s(-4.00)     348 MB/s(-0.57)
RW      2   8K    32        BW       547 MB/s      545 MB/s(-0.36)     544 MB/s(-0.54)     546 MB/s(-0.18)     544 MB/s(-0.54)     545 MB/s(-0.36)
W       4   1M     8        BW       429 MB/s      449 MB/s( 4.66)     450 MB/s( 4.89)     443 MB/s( 3.26)     449 MB/s( 4.66)     445 MB/s( 3.72)
R       4   1M     8        BW       823 MB/s      823 MB/s( 0.00)     828 MB/s( 0.60)     828 MB/s( 0.60)     833 MB/s( 1.21)     833 MB/s( 1.21)
RW      4  64K    16        BW       437 MB/s      433 MB/s(-0.91)     436 MB/s(-0.22)     434 MB/s(-0.68)     433 MB/s(-0.91)     433 MB/s(-0.91)
RR      4  64K    16        BW       565 MB/s      560 MB/s(-0.88)     566 MB/s( 0.17)     567 MB/s( 0.35)     560 MB/s(-0.88)     565 MB/s( 0.00)
RW      4   8K    32        BW       347 MB/s      347 MB/s( 0.00)     343 MB/s(-1.15)     349 MB/s( 0.57)     348 MB/s( 0.28)     337 MB/s(-2.88)
RW      4   8K    32        BW       546 MB/s      545 MB/s(-0.18)     545 MB/s(-0.18)     543 MB/s(-0.54)     546 MB/s( 0.00)     541 MB/s(-0.91)
W       8   1M     8        BW       447 MB/s      449 MB/s( 0.44)     450 MB/s( 0.67)     448 MB/s( 0.22)     449 MB/s( 0.44)     449 MB/s( 0.44)
R       8   1M     8        BW       822 MB/s      825 MB/s( 0.36)     827 MB/s( 0.60)     826 MB/s( 0.48)     834 MB/s( 1.45)     834 MB/s( 1.45)
RW      8  64K    16        BW       433 MB/s      430 MB/s(-0.69)     433 MB/s( 0.00)     437 MB/s( 0.92)     432 MB/s(-0.23)     436 MB/s( 0.69)
RR      8  64K    16        BW       569 MB/s      566 MB/s(-0.52)     564 MB/s(-0.87)     567 MB/s(-0.35)     564 MB/s(-0.87)     566 MB/s(-0.52)
RW      8   8K    32        BW       348 MB/s      344 MB/s(-1.14)     340 MB/s(-2.29)     347 MB/s(-0.28)     347 MB/s(-0.28)     342 MB/s(-1.72)
RW      8   8K    32        BW       544 MB/s      534 MB/s(-1.83)     536 MB/s(-1.47)     536 MB/s(-1.47)     530 MB/s(-2.57)     542 MB/s(-0.36)
W      16   1M     8        BW       451 MB/s      446 MB/s(-1.10)     451 MB/s( 0.00)     449 MB/s(-0.44)     451 MB/s( 0.00)     449 MB/s(-0.44)
R      16   1M     8        BW       833 MB/s      831 MB/s(-0.24)     833 MB/s( 0.00)     833 MB/s( 0.00)     833 MB/s( 0.00)     834 MB/s( 0.12)
RW     16  64K    16        BW       433 MB/s      436 MB/s( 0.69)     430 MB/s(-0.69)     436 MB/s( 0.69)     435 MB/s( 0.46)     434 MB/s( 0.23)
RR     16  64K    16        BW       566 MB/s      566 MB/s( 0.00)     565 MB/s(-0.17)     561 MB/s(-0.88)     562 MB/s(-0.70)     560 MB/s(-1.06)
RW     16   8K    32        BW       347 MB/s      339 MB/s(-2.30)     345 MB/s(-0.57)     331 MB/s(-4.61)     342 MB/s(-1.44)     346 MB/s(-0.28)
RW     16   8K    32        BW       538 MB/s      538 MB/s( 0.00)     532 MB/s(-1.11)     536 MB/s(-0.37)     537 MB/s(-0.18)     539 MB/s( 0.18)
W      32   1M     8        BW       451 MB/s      446 MB/s(-1.10)     451 MB/s( 0.00)     450 MB/s(-0.22)     450 MB/s(-0.22)     449 MB/s(-0.44)
R      32   1M     8        BW       830 MB/s      831 MB/s( 0.12)     835 MB/s( 0.60)     834 MB/s( 0.48)     834 MB/s( 0.48)     837 MB/s( 0.84)
RW     32  64K    16        BW       435 MB/s      432 MB/s(-0.68)     433 MB/s(-0.45)     433 MB/s(-0.45)     436 MB/s( 0.22)     431 MB/s(-0.91)
RR     32  64K    16        BW       571 MB/s      567 MB/s(-0.70)     563 MB/s(-1.40)     568 MB/s(-0.52)     567 MB/s(-0.70)     560 MB/s(-1.92)
RW     32   8K    32        BW       349 MB/s      341 MB/s(-2.29)     345 MB/s(-1.14)     347 MB/s(-0.57)     345 MB/s(-1.14)     347 MB/s(-0.57)
RW     32   8K    32        BW       534 MB/s      521 MB/s(-2.43)     533 MB/s(-0.18)     530 MB/s(-0.74)     522 MB/s(-2.24)     522 MB/s(-2.24)
W      64   1M     8        BW       450 MB/s      441 MB/s(-2.00)     447 MB/s(-0.66)     450 MB/s( 0.00)     448 MB/s(-0.44)     449 MB/s(-0.22)
R      64   1M     8        BW       832 MB/s      832 MB/s( 0.00)     835 MB/s( 0.36)     835 MB/s( 0.36)     835 MB/s( 0.36)     835 MB/s( 0.36)
RW     64  64K    16        BW       432 MB/s      435 MB/s( 0.69)     434 MB/s( 0.46)     435 MB/s( 0.69)     432 MB/s( 0.00)     434 MB/s( 0.46)
RR     64  64K    16        BW       569 MB/s      565 MB/s(-0.70)     558 MB/s(-1.93)     567 MB/s(-0.35)     565 MB/s(-0.70)     562 MB/s(-1.23)
RW     64   8K    32        BW       346 MB/s      343 MB/s(-0.86)     346 MB/s( 0.00)     347 MB/s( 0.28)     347 MB/s( 0.28)     345 MB/s(-0.28)
RW     64   8K    32        BW       527 MB/s      526 MB/s(-0.18)     521 MB/s(-1.13)     526 MB/s(-0.18)     528 MB/s( 0.18)     526 MB/s(-0.18)
W     128   1M     8        BW       449 MB/s      449 MB/s( 0.00)     450 MB/s( 0.22)     448 MB/s(-0.22)     446 MB/s(-0.66)     451 MB/s( 0.44)
R     128   1M     8        BW       832 MB/s      831 MB/s(-0.12)     835 MB/s( 0.36)     833 MB/s( 0.12)     834 MB/s( 0.24)     835 MB/s( 0.36)
RW    128  64K    16        BW       436 MB/s      432 MB/s(-0.91)     434 MB/s(-0.45)     432 MB/s(-0.91)     435 MB/s(-0.22)     432 MB/s(-0.91)
RR    128  64K    16        BW       564 MB/s      565 MB/s( 0.17)     564 MB/s( 0.00)     559 MB/s(-0.88)     562 MB/s(-0.35)     561 MB/s(-0.53)
RW    128   8K    32        BW       346 MB/s      345 MB/s(-0.28)     328 MB/s(-5.20)     342 MB/s(-1.15)     346 MB/s( 0.00)     347 MB/s( 0.28)
RW    128   8K    32        BW       519 MB/s      527 MB/s( 1.54)     525 MB/s( 1.15)     530 MB/s( 2.11)     518 MB/s(-0.19)     526 MB/s( 1.34)
W     256   1M     8        BW       450 MB/s      449 MB/s(-0.22)     451 MB/s( 0.22)     450 MB/s( 0.00)     447 MB/s(-0.66)     448 MB/s(-0.44)
R     256   1M     8        BW       832 MB/s      830 MB/s(-0.24)     834 MB/s( 0.24)     834 MB/s( 0.24)     834 MB/s( 0.24)     833 MB/s( 0.12)
RW    256  64K    16        BW       436 MB/s      431 MB/s(-1.14)     434 MB/s(-0.45)     432 MB/s(-0.91)     431 MB/s(-1.14)     433 MB/s(-0.68)
RR    256  64K    16        BW       565 MB/s      573 MB/s( 1.41)     569 MB/s( 0.70)     559 MB/s(-1.06)     560 MB/s(-0.88)     558 MB/s(-1.23)
RW    256   8K    32        BW       332 MB/s      339 MB/s( 2.10)     337 MB/s( 1.50)     334 MB/s( 0.60)     328 MB/s(-1.20)     340 MB/s( 2.40)
RW    256   8K    32        BW       521 MB/s      524 MB/s( 0.57)     524 MB/s( 0.57)     528 MB/s( 1.34)     528 MB/s( 1.34)     529 MB/s( 1.53)
W     512   1M     8        BW       450 MB/s      448 MB/s(-0.44)     446 MB/s(-0.88)     449 MB/s(-0.22)     448 MB/s(-0.44)     450 MB/s( 0.00)
R     512   1M     8        BW       832 MB/s      830 MB/s(-0.24)     834 MB/s( 0.24)     832 MB/s( 0.00)     833 MB/s( 0.12)     834 MB/s( 0.24)
RW    512  64K    16        BW       430 MB/s      433 MB/s( 0.69)     435 MB/s( 1.16)     431 MB/s( 0.23)     430 MB/s( 0.00)     432 MB/s( 0.46)
RR    512  64K    16        BW       564 MB/s      559 MB/s(-0.88)     563 MB/s(-0.17)     568 MB/s( 0.70)     564 MB/s( 0.00)     563 MB/s(-0.17)
RW    512   8K    32        BW       338 MB/s      332 MB/s(-1.77)     320 MB/s(-5.32)     329 MB/s(-2.66)     328 MB/s(-2.95)     332 MB/s(-1.77)
RW    512   8K    32        BW       530 MB/s      531 MB/s( 0.18)     529 MB/s(-0.18)     531 MB/s( 0.18)     526 MB/s(-0.75)     528 MB/s(-0.37)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS2-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8     SLAT AVG           69.1 us          55.6 us (19.51)        78.8 us (-14.05)        78.2 us (-13.27)      424.9 us (-515.12)      366.1 us (-430.01)
R       1   1M     8     SLAT AVG           50.8 us          49.8 us ( 1.96)         54.6 us (-7.46)         38.8 us (23.64)      300.3 us (-491.10)      264.5 us (-420.72)
RW      1  64K    16     SLAT AVG           15.5 us          15.7 us (-0.90)        18.1 us (-16.62)        18.5 us (-19.32)        20.8 us (-34.21)        23.6 us (-51.99)
RR      1  64K    16     SLAT AVG           11.0 us          10.8 us ( 1.36)        12.8 us (-17.04)         11.0 us ( 0.00)        12.4 us (-13.30)        14.7 us (-34.18)
RW      1   8K    32     SLAT AVG           30.0 us          28.5 us ( 4.96)         32.7 us (-8.88)        36.4 us (-21.23)        33.1 us (-10.31)        36.1 us (-20.23)
RW      1   8K    32     SLAT AVG           10.8 us          10.9 us (-0.83)         10.9 us (-0.27)          9.0 us (17.00)         10.9 us (-1.20)          9.8 us ( 9.33)
W       2   1M     8     SLAT AVG           76.9 us         96.2 us (-25.12)      176.0 us (-128.84)         73.3 us ( 4.70)      606.3 us (-688.56)      334.7 us (-335.33)
R       2   1M     8     SLAT AVG           54.8 us          44.9 us (18.04)        90.1 us (-64.31)         57.5 us (-4.97)      337.8 us (-516.10)         59.5 us (-8.62)
RW      2  64K    16     SLAT AVG           17.9 us          16.5 us ( 7.94)         16.2 us ( 9.56)         16.0 us (10.51)        29.0 us (-62.24)        21.5 us (-20.19)
RR      2  64K    16     SLAT AVG           11.1 us          11.6 us (-4.32)         11.4 us (-2.25)         11.6 us (-4.50)        20.4 us (-83.16)        17.0 us (-52.83)
RW      2   8K    32     SLAT AVG           15.1 us         20.1 us (-33.26)         14.6 us ( 2.91)         16.1 us (-6.89)        20.9 us (-38.23)        19.3 us (-27.96)
RW      2   8K    32     SLAT AVG            7.7 us           7.6 us ( 1.42)          8.2 us (-6.20)          8.0 us (-3.61)          7.6 us ( 1.80)          7.7 us ( 0.64)
W       4   1M     8     SLAT AVG          648.3 us          96.6 us (85.10)         91.6 us (85.86)         77.3 us (88.06)         94.6 us (85.41)         77.2 us (88.08)
R       4   1M     8     SLAT AVG           29.0 us         34.5 us (-18.99)        38.9 us (-34.39)        33.8 us (-16.71)        40.7 us (-40.46)        42.7 us (-47.34)
RW      4  64K    16     SLAT AVG           18.1 us          18.5 us (-1.93)         19.0 us (-4.96)        20.4 us (-12.30)        21.7 us (-19.53)         17.4 us ( 4.24)
RR      4  64K    16     SLAT AVG            8.1 us           8.6 us (-7.45)        10.4 us (-29.81)          7.4 us ( 7.70)        10.6 us (-31.05)          7.7 us ( 3.97)
RW      4   8K    32     SLAT AVG           91.4 us          91.0 us ( 0.35)         91.9 us (-0.62)         90.8 us ( 0.55)         91.1 us ( 0.33)         94.1 us (-2.98)
RW      4   8K    32     SLAT AVG           57.2 us          57.5 us (-0.41)         57.5 us (-0.40)         57.6 us (-0.62)         57.3 us (-0.12)         58.0 us (-1.25)
W       8   1M     8     SLAT AVG        12322.7 us       12821.8 us (-4.05)        398.1 us (96.76)       9293.3 us (24.58)      11211.7 us ( 9.01)        821.0 us (93.33)
R       8   1M     8     SLAT AVG         6641.7 us        5776.2 us (13.03)       5578.0 us (16.01)       5779.9 us (12.97)         39.9 us (99.39)         31.8 us (99.52)
RW      8  64K    16     SLAT AVG         1032.7 us        1045.4 us (-1.23)       1036.4 us (-0.35)       1027.5 us ( 0.50)       1037.6 us (-0.47)       1031.1 us ( 0.15)
RR      8  64K    16     SLAT AVG          782.9 us         786.7 us (-0.47)        784.8 us (-0.23)        780.7 us ( 0.29)        755.5 us ( 3.50)        786.5 us (-0.45)
RW      8   8K    32     SLAT AVG          186.6 us         188.7 us (-1.11)        190.6 us (-2.16)        187.3 us (-0.38)        186.9 us (-0.15)        189.9 us (-1.77)
RW      8   8K    32     SLAT AVG          119.1 us         121.0 us (-1.61)        120.7 us (-1.36)        120.7 us (-1.32)        122.0 us (-2.45)        119.6 us (-0.38)
W      16   1M     8     SLAT AVG        32861.6 us       32716.3 us ( 0.44)      31509.8 us ( 4.11)      30533.0 us ( 7.08)      32283.6 us ( 1.75)      32754.9 us ( 0.32)
R      16   1M     8     SLAT AVG        16347.8 us       17288.0 us (-5.75)      15926.0 us ( 2.58)      16020.2 us ( 2.00)     18334.2 us (-12.15)      15997.7 us ( 2.14)
RW     16  64K    16     SLAT AVG         2417.2 us        2400.9 us ( 0.67)       2433.3 us (-0.66)       2399.3 us ( 0.74)       2405.2 us ( 0.49)       2409.3 us ( 0.32)
RR     16  64K    16     SLAT AVG         1845.4 us        1847.9 us (-0.13)       1848.6 us (-0.17)       1862.7 us (-0.93)       1857.8 us (-0.67)       1866.0 us (-1.11)
RW     16   8K    32     SLAT AVG          375.6 us         385.2 us (-2.55)        377.8 us (-0.55)        393.8 us (-4.82)        381.1 us (-1.45)        377.5 us (-0.48)
RW     16   8K    32     SLAT AVG          241.8 us         242.2 us (-0.15)        244.7 us (-1.18)        242.8 us (-0.41)        242.6 us (-0.33)        241.9 us (-0.02)
W      32   1M     8     SLAT AVG        73911.1 us       74878.8 us (-1.30)      74155.4 us (-0.33)      74296.7 us (-0.52)      74348.2 us (-0.59)      74324.0 us (-0.55)
R      32   1M     8     SLAT AVG        40310.4 us       40317.9 us (-0.01)      40044.0 us ( 0.66)      40068.1 us ( 0.60)      40102.7 us ( 0.51)      40012.5 us ( 0.73)
RW     32  64K    16     SLAT AVG         4817.5 us        4845.3 us (-0.57)       4837.6 us (-0.41)       4844.4 us (-0.55)       4801.6 us ( 0.33)       4866.2 us (-1.01)
RR     32  64K    16     SLAT AVG         3669.8 us        3692.9 us (-0.62)       3722.4 us (-1.43)       3688.1 us (-0.49)       3692.9 us (-0.63)       3739.9 us (-1.91)
RW     32   8K    32     SLAT AVG          749.1 us         766.7 us (-2.34)        756.2 us (-0.95)        754.2 us (-0.67)        757.9 us (-1.17)        752.8 us (-0.49)
RW     32   8K    32     SLAT AVG          489.0 us         501.6 us (-2.56)        490.2 us (-0.22)        492.7 us (-0.74)        500.0 us (-2.24)        500.3 us (-2.30)
W      64   1M     8     SLAT AVG       148503.8 us      151397.1 us (-1.94)     149573.1 us (-0.72)     148545.5 us (-0.02)     149269.4 us (-0.51)     148909.0 us (-0.27)
R      64   1M     8     SLAT AVG        80548.3 us       80499.8 us ( 0.06)      80232.7 us ( 0.39)      80205.2 us ( 0.42)      80258.1 us ( 0.36)      80230.6 us ( 0.39)
RW     64  64K    16     SLAT AVG         9695.2 us        9633.5 us ( 0.63)       9655.9 us ( 0.40)       9633.2 us ( 0.63)       9706.1 us (-0.11)       9655.1 us ( 0.41)
RR     64  64K    16     SLAT AVG         7369.6 us        7416.4 us (-0.63)       7515.1 us (-1.97)       7390.9 us (-0.28)       7425.1 us (-0.75)       7459.2 us (-1.21)
RW     64   8K    32     SLAT AVG         1513.6 us        1527.1 us (-0.89)       1513.4 us ( 0.00)       1508.3 us ( 0.35)       1509.7 us ( 0.25)       1518.2 us (-0.30)
RW     64   8K    32     SLAT AVG          993.1 us         994.2 us (-0.11)       1004.7 us (-1.16)        994.7 us (-0.15)        990.9 us ( 0.22)        995.1 us (-0.19)
W     128   1M     8     SLAT AVG       297703.8 us      297483.3 us ( 0.07)     296868.0 us ( 0.28)     298124.7 us (-0.14)     299233.1 us (-0.51)     296398.2 us ( 0.43)
R     128   1M     8     SLAT AVG       161014.5 us      161170.9 us (-0.09)     160367.7 us ( 0.40)     160632.0 us ( 0.23)     160586.0 us ( 0.26)     160336.6 us ( 0.42)
RW    128  64K    16     SLAT AVG        19211.9 us       19422.0 us (-1.09)      19306.8 us (-0.49)      19395.6 us (-0.95)      19289.8 us (-0.40)      19392.3 us (-0.93)
RR    128  64K    16     SLAT AVG        14862.2 us       14833.2 us ( 0.19)      14860.5 us ( 0.01)      15005.8 us (-0.96)      14911.1 us (-0.32)      14940.7 us (-0.52)
RW    128   8K    32     SLAT AVG         3030.6 us        3030.9 us (-0.01)       3192.1 us (-5.33)       3058.8 us (-0.93)       3029.4 us ( 0.04)       3022.2 us ( 0.27)
RW    128   8K    32     SLAT AVG         2018.3 us        1987.8 us ( 1.51)       1996.1 us ( 1.10)       1976.4 us ( 2.07)       2020.3 us (-0.09)       1992.0 us ( 1.30)
W     256   1M     8     SLAT AVG       592084.3 us      594206.8 us (-0.35)     590998.4 us ( 0.18)     592508.6 us (-0.07)     595737.1 us (-0.61)     594953.0 us (-0.48)
R     256   1M     8     SLAT AVG       321411.8 us      322217.7 us (-0.25)     320759.6 us ( 0.20)     320717.4 us ( 0.21)     320562.7 us ( 0.26)     320770.3 us ( 0.19)
RW    256  64K    16     SLAT AVG        38441.5 us       38932.8 us (-1.27)      38603.2 us (-0.42)      38838.2 us (-1.03)      38852.9 us (-1.07)      38748.7 us (-0.79)
RR    256  64K    16     SLAT AVG        29700.6 us       29264.1 us ( 1.46)      29467.9 us ( 0.78)      30018.6 us (-1.07)      29943.0 us (-0.81)      30025.3 us (-1.09)
RW    256   8K    32     SLAT AVG         6306.1 us        6183.0 us ( 1.95)       6215.5 us ( 1.43)       6270.0 us ( 0.57)       6378.8 us (-1.15)       6153.7 us ( 2.41)
RW    256   8K    32     SLAT AVG         4021.9 us        3997.3 us ( 0.61)       4000.6 us ( 0.53)       3971.6 us ( 1.25)       3968.0 us ( 1.34)       3964.8 us ( 1.41)
W     512   1M     8     SLAT AVG      1178070.9 us     1184901.3 us (-0.57)    1186833.2 us (-0.74)    1180252.0 us (-0.18)    1182115.7 us (-0.34)    1178710.0 us (-0.05)
R     512   1M     8     SLAT AVG       641182.0 us      642169.6 us (-0.15)     639244.4 us ( 0.30)     640601.4 us ( 0.09)     640300.0 us ( 0.13)     639403.3 us ( 0.27)
RW    512  64K    16     SLAT AVG        77976.6 us       77342.1 us ( 0.81)      77068.0 us ( 1.16)      77776.2 us ( 0.25)      77920.1 us ( 0.07)      77561.6 us ( 0.53)
RR    512  64K    16     SLAT AVG        59461.4 us       60000.9 us (-0.90)      59580.6 us (-0.20)      59058.1 us ( 0.67)      59463.6 us ( 0.00)      59570.8 us (-0.18)
RW    512   8K    32     SLAT AVG        12415.5 us       12622.5 us (-1.66)      13094.8 us (-5.47)      12724.1 us (-2.48)      12763.2 us (-2.80)      12630.6 us (-1.73)
RW    512   8K    32     SLAT AVG         7909.1 us        7887.7 us ( 0.27)       7918.4 us (-0.11)       7895.4 us ( 0.17)       7964.5 us (-0.70)       7934.6 us (-0.32)
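
For reference, the parenthesized percentages in these tables look like the relative change against the NPS2-6.4.0-rc1 baseline column, signed so that positive means better: higher for BW, lower for the latency metrics. Below is a minimal sketch of that arithmetic. It assumes this sign convention, and `pct_delta` is a made-up name, not the script that generated the tables. The table values were presumably computed from unrounded measurements, so recomputing from the printed numbers can differ by a few hundredths of a percent.

```python
def pct_delta(baseline, value, higher_is_better):
    """Relative change versus baseline in percent, signed so positive means better.

    Assumption: this mirrors how the parenthesized numbers in the tables
    appear to be derived; it is not taken from the original tooling.
    """
    change = (value - baseline) / baseline * 100.0
    return change if higher_is_better else -change


# BW row "R 1 1M 8": baseline 678 MB/s vs. CPU scope 766 MB/s (table shows 12.97)
bw = pct_delta(678.0, 766.0, higher_is_better=True)

# SLAT row "W 1 1M 8": baseline 69.1 us vs. CPU scope 55.6 us (table shows 19.51)
slat = pct_delta(69.1, 55.6, higher_is_better=False)

print(f"BW delta:   {bw:.2f}%")
print(f"SLAT delta: {slat:.2f}%")
```

With this convention a large negative SLAT entry such as NUMA on "W 1 1M 8" means submission latency grew several-fold over the baseline, not that it shrank.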

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS2-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    LAT AVG          19297.0 us       18791.7 us ( 2.61)      18753.9 us ( 2.81)      18699.8 us ( 3.09)      18845.7 us ( 2.33)      18609.5 us ( 3.56)
R       1   1M     8    LAT AVG          12365.8 us       10945.3 us (11.48)      10996.6 us (11.07)      10897.7 us (11.87)      10983.9 us (11.17)      12542.8 us (-1.43)      ###
RW      1  64K    16    LAT AVG           2403.7 us        2417.9 us (-0.59)       2397.6 us ( 0.25)       2412.8 us (-0.37)       2407.2 us (-0.14)       2416.9 us (-0.55)
RR      1  64K    16    LAT AVG           2243.4 us        2252.7 us (-0.41)       2269.8 us (-1.17)       2269.5 us (-1.16)       2274.8 us (-1.40)       2246.7 us (-0.14)
RW      1   8K    32    LAT AVG           1048.6 us        1036.3 us ( 1.17)       1132.8 us (-8.03)      1240.9 us (-18.34)       1144.5 us (-9.14)      1230.1 us (-17.30)     ***
RW      1   8K    32    LAT AVG            487.5 us         489.4 us (-0.38)        489.0 us (-0.30)        481.4 us ( 1.24)        489.7 us (-0.46)        485.9 us ( 0.31)
W       2   1M     8    LAT AVG          37195.7 us       37376.8 us (-0.48)      37257.8 us (-0.16)      37149.2 us ( 0.12)      37268.1 us (-0.19)      37394.1 us (-0.53)
R       2   1M     8    LAT AVG          24724.5 us       20605.9 us (16.65)      20925.6 us (15.36)      20822.3 us (15.78)      20452.6 us (17.27)      20788.8 us (15.91)      ###
RW      2  64K    16    LAT AVG           4785.8 us        4787.2 us (-0.02)       4831.0 us (-0.94)       4809.7 us (-0.49)       4835.9 us (-1.04)       4842.3 us (-1.17)
RR      2  64K    16    LAT AVG           3709.1 us        3705.3 us ( 0.10)       3734.4 us (-0.68)       3705.5 us ( 0.09)       3714.5 us (-0.14)       3712.1 us (-0.08)
RW      2   8K    32    LAT AVG           1498.2 us        1558.8 us (-4.04)       1503.5 us (-0.35)       1505.8 us (-0.50)       1559.9 us (-4.12)       1507.3 us (-0.61)
RW      2   8K    32    LAT AVG            958.9 us         961.4 us (-0.26)        964.0 us (-0.52)        959.3 us (-0.03)        963.1 us (-0.43)        962.4 us (-0.36)
W       4   1M     8    LAT AVG          78130.0 us       74680.0 us ( 4.41)      74480.0 us ( 4.67)      75710.0 us ( 3.09)      74690.0 us ( 4.40)      75400.0 us ( 3.49)
R       4   1M     8    LAT AVG          40768.6 us       40740.5 us ( 0.06)      40510.0 us ( 0.63)      40505.2 us ( 0.64)      40259.2 us ( 1.24)      40254.5 us ( 1.26)
RW      4  64K    16    LAT AVG           9605.4 us        9691.5 us (-0.89)       9620.0 us (-0.15)       9656.0 us (-0.52)       9678.1 us (-0.75)       9682.3 us (-0.80)
RR      4  64K    16    LAT AVG           7420.9 us        7487.2 us (-0.89)       7414.9 us ( 0.08)       7398.6 us ( 0.30)       7483.0 us (-0.83)       7424.1 us (-0.04)
RW      4   8K    32    LAT AVG           3023.4 us        3020.3 us ( 0.10)       3056.4 us (-1.09)       3002.8 us ( 0.68)       3012.6 us ( 0.36)       3112.2 us (-2.93)
RW      4   8K    32    LAT AVG           1919.6 us        1924.9 us (-0.28)       1925.0 us (-0.28)       1929.6 us (-0.52)       1921.2 us (-0.08)       1938.1 us (-0.96)
W       8   1M     8    LAT AVG         149910.0 us      149250.0 us ( 0.44)     148870.0 us ( 0.69)     149530.0 us ( 0.25)     149350.0 us ( 0.37)     149450.0 us ( 0.30)
R       8   1M     8    LAT AVG          81590.0 us       81300.0 us ( 0.35)      81080.0 us ( 0.62)      81180.0 us ( 0.50)      80480.0 us ( 1.36)      80460.0 us ( 1.38)
RW      8  64K    16    LAT AVG          19367.0 us       19502.8 us (-0.70)      19363.3 us ( 0.01)      19211.2 us ( 0.80)      19430.5 us (-0.32)      19240.7 us ( 0.65)
RR      8  64K    16    LAT AVG          14734.9 us       14824.1 us (-0.60)      14859.8 us (-0.84)      14788.1 us (-0.36)      14861.6 us (-0.85)      14823.3 us (-0.59)
RW      8   8K    32    LAT AVG           6021.7 us        6086.8 us (-1.08)       6158.6 us (-2.27)       6049.4 us (-0.45)       6043.2 us (-0.35)       6138.1 us (-1.93)
RW      8   8K    32    LAT AVG           3854.7 us        3927.1 us (-1.87)       3910.6 us (-1.45)       3909.6 us (-1.42)       3953.4 us (-2.55)       3869.2 us (-0.37)
W      16   1M     8    LAT AVG         297360.0 us      300350.0 us (-1.00)     297290.0 us ( 0.02)     298660.0 us (-0.43)     297220.0 us ( 0.04)     298610.0 us (-0.42)
R      16   1M     8    LAT AVG         160930.0 us      161420.0 us (-0.30)     161070.0 us (-0.08)     161030.0 us (-0.06)     161050.0 us (-0.07)     160890.0 us ( 0.02)
RW     16  64K    16    LAT AVG          38766.1 us       38505.2 us ( 0.67)      39020.0 us (-0.65)      38490.0 us ( 0.71)      38600.0 us ( 0.42)      38630.0 us ( 0.35)
RR     16  64K    16    LAT AVG          29626.2 us       29661.3 us (-0.11)      29674.6 us (-0.16)      29892.6 us (-0.89)      29823.5 us (-0.66)      29943.9 us (-1.07)
RW     16   8K    32    LAT AVG          12076.9 us       12381.3 us (-2.52)      12146.7 us (-0.57)      12670.9 us (-4.91)      12262.3 us (-1.53)      12135.0 us (-0.48)
RW     16   8K    32    LAT AVG           7789.0 us        7799.5 us (-0.13)       7884.9 us (-1.23)       7818.6 us (-0.37)       7814.3 us (-0.32)       7787.8 us ( 0.01)
W      32   1M     8    LAT AVG         592290.0 us      599280.0 us (-1.18)     593750.0 us (-0.24)     594670.0 us (-0.40)     594960.0 us (-0.45)     595560.0 us (-0.55)
R      32   1M     8    LAT AVG         322550.0 us      322490.0 us ( 0.01)     321040.0 us ( 0.46)     321260.0 us ( 0.39)     321170.0 us ( 0.42)     320170.0 us ( 0.73)
RW     32  64K    16    LAT AVG          77090.0 us       77550.0 us (-0.59)      77420.0 us (-0.42)      77540.0 us (-0.58)      76840.0 us ( 0.32)      77880.0 us (-1.02)
RR     32  64K    16    LAT AVG          58730.0 us       59110.0 us (-0.64)      59570.0 us (-1.43)      59030.0 us (-0.51)      59100.0 us (-0.63)      59860.0 us (-1.92)
RW     32   8K    32    LAT AVG          24033.6 us       24611.1 us (-2.40)      24278.6 us (-1.01)      24196.4 us (-0.67)      24323.3 us (-1.20)      24150.8 us (-0.48)
RW     32   8K    32    LAT AVG          15705.9 us       16108.6 us (-2.56)      15740.4 us (-0.21)      15823.3 us (-0.74)      16059.2 us (-2.24)      16067.6 us (-2.30)
W      64   1M     8    LAT AVG        1181790.0 us     1205320.0 us (-1.99)    1191750.0 us (-0.84)    1182410.0 us (-0.05)    1187650.0 us (-0.49)    1185000.0 us (-0.27)
R      64   1M     8    LAT AVG         642870.0 us      642440.0 us ( 0.06)     640340.0 us ( 0.39)     640070.0 us ( 0.43)     640470.0 us ( 0.37)     640230.0 us ( 0.41)
RW     64  64K    16    LAT AVG         155034.9 us      154060.0 us ( 0.62)     154420.0 us ( 0.39)     154036.2 us ( 0.64)     155193.0 us (-0.10)     154390.0 us ( 0.41)
RR     64  64K    16    LAT AVG         117870.0 us      118620.0 us (-0.63)     120190.0 us (-1.96)     118210.0 us (-0.28)     118760.0 us (-0.75)     119300.0 us (-1.21)
RW     64   8K    32    LAT AVG          48484.2 us       48924.7 us (-0.90)      48494.8 us (-0.02)      48313.4 us ( 0.35)      48365.4 us ( 0.24)      48633.1 us (-0.30)
RW     64   8K    32    LAT AVG          31832.3 us       31867.7 us (-0.11)      32206.7 us (-1.17)      31882.3 us (-0.15)      31764.6 us ( 0.21)      31893.0 us (-0.19)
W     128   1M     8    LAT AVG        2352040.0 us     2348600.0 us ( 0.14)    2345670.0 us ( 0.27)    2356160.0 us (-0.17)    2364560.0 us (-0.53)    2339640.0 us ( 0.52)
R     128   1M     8    LAT AVG        1279470.0 us     1280950.0 us (-0.11)    1274610.0 us ( 0.37)    1276170.0 us ( 0.25)    1276130.0 us ( 0.26)    1274270.0 us ( 0.40)
RW    128  64K    16    LAT AVG         306840.0 us      310198.0 us (-1.09)     308347.0 us (-0.49)     309761.9 us (-0.95)     308085.9 us (-0.40)     309698.6 us (-0.93)
RR    128  64K    16    LAT AVG         237460.0 us      237000.0 us ( 0.19)     237440.0 us ( 0.00)     239750.0 us (-0.96)     238230.0 us (-0.32)     238720.0 us (-0.53)
RW    128   8K    32    LAT AVG          96980.1 us       97003.4 us (-0.02)     102153.3 us (-5.33)      97877.5 us (-0.92)      96946.7 us ( 0.03)      96710.6 us ( 0.27)
RW    128   8K    32    LAT AVG          64616.8 us       63642.1 us ( 1.50)      63909.6 us ( 1.09)      63277.5 us ( 2.07)      64682.4 us (-0.10)      63776.5 us ( 1.30)
W     256   1M     8    LAT AVG        4606440.0 us     4622600.0 us (-0.35)    4600690.0 us ( 0.12)    4611210.0 us (-0.10)    4634460.0 us (-0.60)    4622390.0 us (-0.34)
R     256   1M     8    LAT AVG        2532970.0 us     2538450.0 us (-0.21)    2528080.0 us ( 0.19)    2527250.0 us ( 0.22)    2525630.0 us ( 0.28)    2528030.0 us ( 0.19)
RW    256  64K    16    LAT AVG         612510.0 us      620369.9 us (-1.28)     615088.4 us (-0.42)     618839.9 us (-1.03)     619143.8 us (-1.08)     617441.3 us (-0.80)
RR    256  64K    16    LAT AVG         473690.0 us      466760.0 us ( 1.46)     469950.0 us ( 0.78)     478720.0 us (-1.06)     477500.0 us (-0.80)     478822.6 us (-1.08)
RW    256   8K    32    LAT AVG         201568.8 us      197627.5 us ( 1.95)     198687.3 us ( 1.42)     200441.8 us ( 0.55)     203869.3 us (-1.14)     196735.5 us ( 2.39)
RW    256   8K    32    LAT AVG         128646.6 us      127860.1 us ( 0.61)     127963.8 us ( 0.53)     127036.7 us ( 1.25)     126916.1 us ( 1.34)     126818.1 us ( 1.42)
W     512   1M     8    LAT AVG        8881860.0 us     8927670.0 us (-0.51)    8946760.0 us (-0.73)    8904220.0 us (-0.25)    8919030.0 us (-0.41)    8891800.0 us (-0.11)
R     512   1M     8    LAT AVG        4966570.0 us     4973960.0 us (-0.14)    4952160.0 us ( 0.29)    4961550.0 us ( 0.10)    4961600.0 us ( 0.10)    4952570.0 us ( 0.28)
RW    512  64K    16    LAT AVG        1236690.4 us     1226780.1 us ( 0.80)    1222633.1 us ( 1.13)    1233756.8 us ( 0.23)    1235800.0 us ( 0.07)    1230154.7 us ( 0.52)
RR    512  64K    16    LAT AVG         944850.0 us      953374.4 us (-0.90)     946790.0 us (-0.20)     938402.2 us ( 0.68)     944849.6 us ( 0.00)     946559.5 us (-0.18)
RW    512   8K    32    LAT AVG         396049.9 us      402844.4 us (-1.71)     417229.0 us (-5.34)     405910.2 us (-2.48)     407128.3 us (-2.79)     403012.6 us (-1.75)
RW    512   8K    32    LAT AVG         252679.3 us      251994.9 us ( 0.27)     252958.8 us (-0.11)     252244.2 us ( 0.17)     254444.5 us (-0.69)     253498.1 us (-0.32)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS2-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT AVG         19227.7 us       18736.0 us ( 2.55)      18674.9 us ( 2.87)      18621.4 us ( 3.15)      18420.7 us ( 4.19)      18243.2 us ( 5.12)
R       1   1M     8    CLAT AVG         12315.0 us       10895.4 us (11.52)      10941.9 us (11.14)      10858.8 us (11.82)      10683.5 us (13.24)      12278.2 us ( 0.29)      ###
RW      1  64K    16    CLAT AVG          2388.1 us        2402.2 us (-0.59)       2379.4 us ( 0.36)       2394.2 us (-0.25)       2386.2 us ( 0.07)       2393.2 us (-0.21)
RR      1  64K    16    CLAT AVG          2232.4 us        2241.8 us (-0.42)       2256.9 us (-1.09)       2258.5 us (-1.17)       2262.3 us (-1.34)       2231.9 us ( 0.02)
RW      1   8K    32    CLAT AVG          1018.5 us        1007.6 us ( 1.06)       1099.9 us (-8.00)      1204.4 us (-18.25)       1111.2 us (-9.10)      1193.8 us (-17.21)     ***
RW      1   8K    32    CLAT AVG           476.6 us         478.4 us (-0.37)        478.1 us (-0.31)        472.3 us ( 0.89)        478.7 us (-0.44)        476.0 us ( 0.11)
W       2   1M     8    CLAT AVG         37118.6 us       37280.3 us (-0.43)      37081.5 us ( 0.09)      37075.6 us ( 0.11)      36661.6 us ( 1.23)      37059.1 us ( 0.16)
R       2   1M     8    CLAT AVG         24669.6 us       20560.8 us (16.65)      20835.4 us (15.54)      20764.6 us (15.82)      20114.7 us (18.46)      20729.2 us (15.97)      ###
RW      2  64K    16    CLAT AVG          4767.9 us        4770.6 us (-0.05)       4814.7 us (-0.98)       4793.6 us (-0.54)       4806.8 us (-0.81)       4820.7 us (-1.10)
RR      2  64K    16    CLAT AVG          3697.9 us        3693.6 us ( 0.11)       3723.0 us (-0.67)       3693.8 us ( 0.11)       3694.1 us ( 0.10)       3695.1 us ( 0.07)
RW      2   8K    32    CLAT AVG          1483.0 us        1538.6 us (-3.74)       1488.8 us (-0.38)       1489.6 us (-0.43)       1539.0 us (-3.77)       1487.9 us (-0.32)
RW      2   8K    32    CLAT AVG           951.1 us         953.7 us (-0.27)        955.7 us (-0.48)        951.2 us ( 0.00)        955.4 us (-0.45)        954.7 us (-0.37)
W       4   1M     8    CLAT AVG         77480.0 us       74580.0 us ( 3.74)      74380.0 us ( 4.00)      75630.0 us ( 2.38)      74600.0 us ( 3.71)      75330.0 us ( 2.77)
R       4   1M     8    CLAT AVG         40739.5 us       40705.9 us ( 0.08)      40470.0 us ( 0.66)      40471.2 us ( 0.65)      40218.4 us ( 1.27)      40211.7 us ( 1.29)
RW      4  64K    16    CLAT AVG          9587.1 us        9673.0 us (-0.89)       9600.8 us (-0.14)       9635.6 us (-0.50)       9656.3 us (-0.72)       9664.9 us (-0.81)
RR      4  64K    16    CLAT AVG          7412.8 us        7478.5 us (-0.88)       7404.4 us ( 0.11)       7391.1 us ( 0.29)       7472.3 us (-0.80)       7416.3 us (-0.04)
RW      4   8K    32    CLAT AVG          2932.0 us        2929.1 us ( 0.09)       2964.4 us (-1.10)       2911.8 us ( 0.68)       2921.4 us ( 0.36)       3018.1 us (-2.93)
RW      4   8K    32    CLAT AVG          1862.3 us        1867.4 us (-0.27)       1867.5 us (-0.27)       1871.9 us (-0.51)       1863.8 us (-0.08)       1880.0 us (-0.95)
W       8   1M     8    CLAT AVG        137590.0 us      136420.0 us ( 0.85)     148470.0 us (-7.90)     140240.0 us (-1.92)     138140.0 us (-0.39)     148630.0 us (-8.02)
R       8   1M     8    CLAT AVG         74950.0 us       75530.0 us (-0.77)      75500.0 us (-0.73)      75400.0 us (-0.60)      80440.0 us (-7.32)      80430.0 us (-7.31)
RW      8  64K    16    CLAT AVG         18334.2 us       18457.3 us (-0.67)      18326.8 us ( 0.04)      18183.5 us ( 0.82)      18392.8 us (-0.31)      18209.5 us ( 0.68)
RR      8  64K    16    CLAT AVG         13951.9 us       14037.3 us (-0.61)      14074.9 us (-0.88)      14007.3 us (-0.39)      14106.0 us (-1.10)      14036.7 us (-0.60)
RW      8   8K    32    CLAT AVG          5835.0 us        5898.0 us (-1.07)       5967.8 us (-2.27)       5861.9 us (-0.46)       5856.2 us (-0.36)       5948.0 us (-1.93)
RW      8   8K    32    CLAT AVG          3735.5 us        3806.0 us (-1.88)       3789.8 us (-1.45)       3788.9 us (-1.42)       3831.2 us (-2.56)       3749.6 us (-0.37)
W      16   1M     8    CLAT AVG        264490.0 us      267630.0 us (-1.18)     265780.0 us (-0.48)     268120.0 us (-1.37)     264940.0 us (-0.17)     265860.0 us (-0.51)
R      16   1M     8    CLAT AVG        144590.0 us      144140.0 us ( 0.31)     145140.0 us (-0.38)     145010.0 us (-0.29)     142710.0 us ( 1.30)     144890.0 us (-0.20)
RW     16  64K    16    CLAT AVG         36348.8 us       36104.2 us ( 0.67)      36591.5 us (-0.66)      36091.1 us ( 0.70)      36190.0 us ( 0.43)      36225.1 us ( 0.34)
RR     16  64K    16    CLAT AVG         27780.7 us       27813.2 us (-0.11)      27825.8 us (-0.16)      28029.8 us (-0.89)      27965.6 us (-0.66)      28077.8 us (-1.06)
RW     16   8K    32    CLAT AVG         11701.2 us       11996.0 us (-2.51)      11768.8 us (-0.57)      12277.0 us (-4.92)      11881.0 us (-1.53)      11757.5 us (-0.48)
RW     16   8K    32    CLAT AVG          7547.1 us        7557.2 us (-0.13)       7640.1 us (-1.23)       7575.6 us (-0.37)       7571.5 us (-0.32)       7545.8 us ( 0.01)
W      32   1M     8    CLAT AVG        518380.0 us      524400.0 us (-1.16)     519600.0 us (-0.23)     520380.0 us (-0.38)     520620.0 us (-0.43)     521230.0 us (-0.54)
R      32   1M     8    CLAT AVG        282240.0 us      282180.0 us ( 0.02)     280990.0 us ( 0.44)     281190.0 us ( 0.37)     281070.0 us ( 0.41)     280160.0 us ( 0.73)
RW     32  64K    16    CLAT AVG         72280.0 us       72710.0 us (-0.59)      72580.0 us (-0.41)      72690.0 us (-0.56)      72040.0 us ( 0.33)      73020.0 us (-1.02)
RR     32  64K    16    CLAT AVG         55060.0 us       55410.0 us (-0.63)      55850.0 us (-1.43)      55340.0 us (-0.50)      55410.0 us (-0.63)      56120.0 us (-1.92)
RW     32   8K    32    CLAT AVG         23284.4 us       23844.2 us (-2.40)      23522.2 us (-1.02)      23442.1 us (-0.67)      23565.3 us (-1.20)      23397.8 us (-0.48)
RW     32   8K    32    CLAT AVG         15216.7 us       15606.9 us (-2.56)      15250.1 us (-0.21)      15330.5 us (-0.74)      15559.1 us (-2.24)      15567.2 us (-2.30)
W      64   1M     8    CLAT AVG       1033280.0 us     1053920.0 us (-1.99)    1042180.0 us (-0.86)    1033870.0 us (-0.05)    1038380.0 us (-0.49)    1036090.0 us (-0.27)
R      64   1M     8    CLAT AVG        562320.0 us      561940.0 us ( 0.06)     560110.0 us ( 0.39)     559860.0 us ( 0.43)     560210.0 us ( 0.37)     560000.0 us ( 0.41)
RW     64  64K    16    CLAT AVG        145339.5 us      144420.0 us ( 0.63)     144760.0 us ( 0.39)     144402.8 us ( 0.64)     145486.7 us (-0.10)     144730.0 us ( 0.41)
RR     64  64K    16    CLAT AVG        110500.0 us      111200.0 us (-0.63)     112670.0 us (-1.96)     110814.7 us (-0.28)     111330.0 us (-0.75)     111840.0 us (-1.21)
RW     64   8K    32    CLAT AVG         46970.5 us       47397.5 us (-0.90)      46981.2 us (-0.02)      46805.0 us ( 0.35)      46855.6 us ( 0.24)      47114.7 us (-0.30)
RW     64   8K    32    CLAT AVG         30839.1 us       30873.3 us (-0.11)      31201.9 us (-1.17)      30887.5 us (-0.15)      30773.6 us ( 0.21)      30897.8 us (-0.19)
W     128   1M     8    CLAT AVG       2054340.0 us     2051110.0 us ( 0.15)    2048800.0 us ( 0.26)    2058030.0 us (-0.17)    2065320.0 us (-0.53)    2043240.0 us ( 0.54)
R     128   1M     8    CLAT AVG       1118460.0 us     1119780.0 us (-0.11)    1114250.0 us ( 0.37)    1115540.0 us ( 0.26)    1115540.0 us ( 0.26)    1113930.0 us ( 0.40)
RW    128  64K    16    CLAT AVG        287625.9 us      290775.8 us (-1.09)     289040.0 us (-0.49)     290366.1 us (-0.95)     288795.9 us (-0.40)     290306.1 us (-0.93)
RR    128  64K    16    CLAT AVG        222601.0 us      222170.0 us ( 0.19)     222580.0 us ( 0.00)     224743.7 us (-0.96)     223320.0 us (-0.32)     223775.6 us (-0.52)
RW    128   8K    32    CLAT AVG         93949.4 us       93972.4 us (-0.02)      98961.0 us (-5.33)      94818.5 us (-0.92)      93917.1 us ( 0.03)      93688.3 us ( 0.27)
RW    128   8K    32    CLAT AVG         62598.3 us       61654.1 us ( 1.50)      61913.4 us ( 1.09)      61301.0 us ( 2.07)      62662.0 us (-0.10)      61784.4 us ( 1.30)
W     256   1M     8    CLAT AVG       4014360.0 us     4028390.0 us (-0.34)    4009690.0 us ( 0.11)    4018700.0 us (-0.10)    4038720.0 us (-0.60)    4027440.0 us (-0.32)
R     256   1M     8    CLAT AVG       2211560.0 us     2216230.0 us (-0.21)    2207320.0 us ( 0.19)    2206530.0 us ( 0.22)    2205070.0 us ( 0.29)    2207260.0 us ( 0.19)
RW    256  64K    16    CLAT AVG        574070.0 us      581436.9 us (-1.28)     576484.9 us (-0.42)     580001.4 us (-1.03)     580290.6 us (-1.08)     578692.3 us (-0.80)
RR    256  64K    16    CLAT AVG        443990.0 us      437491.6 us ( 1.46)     440490.0 us ( 0.78)     448700.0 us (-1.06)     447560.0 us (-0.80)     448797.1 us (-1.08)
RW    256   8K    32    CLAT AVG        195262.5 us      191444.3 us ( 1.95)     192471.6 us ( 1.42)     194171.7 us ( 0.55)     197490.4 us (-1.14)     190581.6 us ( 2.39)
RW    256   8K    32    CLAT AVG        124624.6 us      123862.7 us ( 0.61)     123963.1 us ( 0.53)     123065.0 us ( 1.25)     122948.0 us ( 1.34)     122853.2 us ( 1.42)
W     512   1M     8    CLAT AVG       7703790.0 us     7742770.0 us (-0.50)    7759930.0 us (-0.72)    7723970.0 us (-0.26)    7736920.0 us (-0.43)    7713090.0 us (-0.12)
R     512   1M     8    CLAT AVG       4325390.0 us     4331790.0 us (-0.14)    4312920.0 us ( 0.28)    4320950.0 us ( 0.10)    4321300.0 us ( 0.09)    4313160.0 us ( 0.28)
RW    512  64K    16    CLAT AVG       1158713.5 us     1149437.7 us ( 0.80)    1145564.9 us ( 1.13)    1155980.4 us ( 0.23)    1157880.0 us ( 0.07)    1152592.9 us ( 0.52)
RR    512  64K    16    CLAT AVG        885387.7 us      893373.2 us (-0.90)     887200.0 us (-0.20)     879343.9 us ( 0.68)     885385.7 us ( 0.00)     886988.4 us (-0.18)
RW    512   8K    32    CLAT AVG        383634.2 us      390221.6 us (-1.71)     404134.0 us (-5.34)     393185.9 us (-2.48)     394364.9 us (-2.79)     390381.9 us (-1.75)
RW    512   8K    32    CLAT AVG        244770.0 us      244107.0 us ( 0.27)     245040.3 us (-0.11)     244348.7 us ( 0.17)     246479.8 us (-0.69)     245563.4 us (-0.32)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS2-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT  99        25560.0 us         22676.0 us (11.28)      22414.0 us (12.30)      22414.0 us (12.30)      22414.0 us (12.30)      21103.0 us (17.43)
R       1   1M     8    CLAT  99        13829.0 us         13829.0 us ( 0.00)      13829.0 us ( 0.00)     15664.0 us (-13.26)      14615.0 us (-5.68)      14091.0 us (-1.89)
RW      1  64K    16    CLAT  99         2900.0 us          3064.0 us (-5.65)       2835.0 us ( 2.24)       2999.0 us (-3.41)       2868.0 us ( 1.10)       2868.0 us ( 1.10)
RR      1  64K    16    CLAT  99         4113.0 us          4228.0 us (-2.79)       4359.0 us (-5.98)       4293.0 us (-4.37)       4293.0 us (-4.37)       4080.0 us ( 0.80)
RW      1   8K    32    CLAT  99         1237.0 us         1418.0 us (-14.63)       1303.0 us (-5.33)      1418.0 us (-14.63)      1467.0 us (-18.59)      1418.0 us (-14.63)
RW      1   8K    32    CLAT  99          766.0 us           766.0 us ( 0.00)        766.0 us ( 0.00)        758.0 us ( 1.04)        766.0 us ( 0.00)        766.0 us ( 0.00)
W       2   1M     8    CLAT  99        43254.0 us         44303.0 us (-2.42)      43779.0 us (-1.21)      43779.0 us (-1.21)      43254.0 us ( 0.00)      46400.0 us (-7.27)
R       2   1M     8    CLAT  99        27395.0 us         27395.0 us ( 0.00)      25822.0 us ( 5.74)      26084.0 us ( 4.78)      25560.0 us ( 6.69)      25297.0 us ( 7.65)
RW      2  64K    16    CLAT  99         5538.0 us          5604.0 us (-1.19)       5932.0 us (-7.11)       5800.0 us (-4.73)       5735.0 us (-3.55)       5997.0 us (-8.28)
RR      2  64K    16    CLAT  99         6521.0 us          6521.0 us ( 0.00)       6587.0 us (-1.01)       6521.0 us ( 0.00)       6587.0 us (-1.01)       6587.0 us (-1.01)
RW      2   8K    32    CLAT  99         3490.0 us          3621.0 us (-3.75)       3490.0 us ( 0.00)       3490.0 us ( 0.00)       3589.0 us (-2.83)       3490.0 us ( 0.00)
RW      2   8K    32    CLAT  99         2311.0 us          2343.0 us (-1.38)       2311.0 us ( 0.00)       2311.0 us ( 0.00)       2343.0 us (-1.38)       2343.0 us (-1.38)
W       4   1M     8    CLAT  99        90000.0 us         92000.0 us (-2.22)      88000.0 us ( 2.22)    130000.0 us (-44.44)      92000.0 us (-2.22)      95000.0 us (-5.55)
R       4   1M     8    CLAT  99        45351.0 us         47449.0 us (-4.62)      49021.0 us (-8.09)     58459.0 us (-28.90)      42730.0 us ( 5.77)      42730.0 us ( 5.77)
RW      4  64K    16    CLAT  99        22938.0 us         23200.0 us (-1.14)      22676.0 us ( 1.14)      23200.0 us (-1.14)      23200.0 us (-1.14)      23462.0 us (-2.28)
RR      4  64K    16    CLAT  99        17695.0 us         17695.0 us ( 0.00)      17695.0 us ( 0.00)      17433.0 us ( 1.48)      17695.0 us ( 0.00)      17695.0 us ( 0.00)
RW      4   8K    32    CLAT  99         6259.0 us          6325.0 us (-1.05)       6325.0 us (-1.05)       6259.0 us ( 0.00)       6259.0 us ( 0.00)       6325.0 us (-1.05)
RW      4   8K    32    CLAT  99         3982.0 us          3949.0 us ( 0.82)       3949.0 us ( 0.82)       3949.0 us ( 0.82)       3982.0 us ( 0.00)       3949.0 us ( 0.82)
W       8   1M     8    CLAT  99       321000.0 us        326000.0 us (-1.55)     241000.0 us (24.92)     317000.0 us ( 1.24)     321000.0 us ( 0.00)     234000.0 us (27.10)
R       8   1M     8    CLAT  99       171000.0 us        167000.0 us ( 2.33)     169000.0 us ( 1.16)     169000.0 us ( 1.16)      85000.0 us (50.29)      84000.0 us (50.87)
RW      8  64K    16    CLAT  99        43254.0 us         43254.0 us ( 0.00)      42730.0 us ( 1.21)      42730.0 us ( 1.21)      43254.0 us ( 0.00)      42730.0 us ( 1.21)
RR      8  64K    16    CLAT  99        32900.0 us         33162.0 us (-0.79)      33424.0 us (-1.59)      33162.0 us (-0.79)      35390.0 us (-7.56)      33162.0 us (-0.79)
RW      8   8K    32    CLAT  99        11469.0 us         11469.0 us ( 0.00)      12125.0 us (-5.71)      11731.0 us (-2.28)      11994.0 us (-4.57)      11994.0 us (-4.57)
RW      8   8K    32    CLAT  99         7242.0 us          7242.0 us ( 0.00)       7242.0 us ( 0.00)       7242.0 us ( 0.00)       7308.0 us (-0.91)       7242.0 us ( 0.00)
W      16   1M     8    CLAT  99       634000.0 us        667000.0 us (-5.20)     684000.0 us (-7.88)     693000.0 us (-9.30)     659000.0 us (-3.94)     651000.0 us (-2.68)
R      16   1M     8    CLAT  99       376000.0 us        355000.0 us ( 5.58)     401000.0 us (-6.64)     393000.0 us (-4.52)     326000.0 us (13.29)     393000.0 us (-4.52)
RW     16  64K    16    CLAT  99        63701.0 us         63701.0 us ( 0.00)      64226.0 us (-0.82)      62653.0 us ( 1.64)      63177.0 us ( 0.82)      62653.0 us ( 1.64)
RR     16  64K    16    CLAT  99        48497.0 us         48497.0 us ( 0.00)      48497.0 us ( 0.00)      49021.0 us (-1.08)      49021.0 us (-1.08)      49021.0 us (-1.08)
RW     16   8K    32    CLAT  99        21000.0 us         20000.0 us ( 4.76)      21000.0 us ( 0.00)      23000.0 us (-9.52)      21365.0 us (-1.73)      21000.0 us ( 0.00)
RW     16   8K    32    CLAT  99        12518.0 us         12518.0 us ( 0.00)      12649.0 us (-1.04)      12780.0 us (-2.09)      12649.0 us (-1.04)      12518.0 us ( 0.00)
W      32   1M     8    CLAT  99       902000.0 us        844000.0 us ( 6.43)     844000.0 us ( 6.43)     894000.0 us ( 0.88)     894000.0 us ( 0.88)     936000.0 us (-3.76)
R      32   1M     8    CLAT  99       464000.0 us        477000.0 us (-2.80)     468000.0 us (-0.86)     451000.0 us ( 2.80)     456000.0 us ( 1.72)     481000.0 us (-3.66)
RW     32  64K    16    CLAT  99       103000.0 us        103000.0 us ( 0.00)     103000.0 us ( 0.00)     103000.0 us ( 0.00)     102000.0 us ( 0.97)     104000.0 us (-0.97)
RR     32  64K    16    CLAT  99        77071.0 us         78119.0 us (-1.35)      78119.0 us (-1.35)      77071.0 us ( 0.00)      77071.0 us ( 0.00)      78119.0 us (-1.35)
RW     32   8K    32    CLAT  99        38011.0 us        45000.0 us (-18.38)      40109.0 us (-5.51)     43000.0 us (-13.12)      40000.0 us (-5.23)      37000.0 us ( 2.65)
RW     32   8K    32    CLAT  99        22938.0 us         23987.0 us (-4.57)      22938.0 us ( 0.00)      23200.0 us (-1.14)      23725.0 us (-3.43)      23725.0 us (-3.43)
W      64   1M     8    CLAT  99      1687000.0 us       1687000.0 us ( 0.00)    1552000.0 us ( 8.00)    1620000.0 us ( 3.97)    1687000.0 us ( 0.00)    1687000.0 us ( 0.00)
R      64   1M     8    CLAT  99       885000.0 us        885000.0 us ( 0.00)     877000.0 us ( 0.90)     919000.0 us (-3.84)     919000.0 us (-3.84)     902000.0 us (-1.92)
RW     64  64K    16    CLAT  99       199000.0 us        190000.0 us ( 4.52)     197000.0 us ( 1.00)     199000.0 us ( 0.00)     201000.0 us (-1.00)     190000.0 us ( 4.52)
RR     64  64K    16    CLAT  99       155000.0 us        157000.0 us (-1.29)     153000.0 us ( 1.29)     159000.0 us (-2.58)     157000.0 us (-1.29)     157000.0 us (-1.29)
RW     64   8K    32    CLAT  99        93000.0 us         89000.0 us ( 4.30)      94000.0 us (-1.07)      87000.0 us ( 6.45)      85000.0 us ( 8.60)      91000.0 us ( 2.15)
RW     64   8K    32    CLAT  99        48497.0 us         47973.0 us ( 1.08)      48497.0 us ( 0.00)      48497.0 us ( 0.00)      47973.0 us ( 1.08)      47973.0 us ( 1.08)
W     128   1M     8    CLAT  99      3205000.0 us       3104000.0 us ( 3.15)    3239000.0 us (-1.06)    3239000.0 us (-1.06)    3339000.0 us (-4.18)    3104000.0 us ( 3.15)
R     128   1M     8    CLAT  99      1770000.0 us       1703000.0 us ( 3.78)    1888000.0 us (-6.66)    1687000.0 us ( 4.68)    1821000.0 us (-2.88)    1720000.0 us ( 2.82)
RW    128  64K    16    CLAT  99       355000.0 us       430000.0 us (-21.12)    393000.0 us (-10.70)    430000.0 us (-21.12)    393000.0 us (-10.70)     363000.0 us (-2.25)
RR    128  64K    16    CLAT  99       326000.0 us        342000.0 us (-4.90)     338000.0 us (-3.68)     338000.0 us (-3.68)     338000.0 us (-3.68)     326000.0 us ( 0.00)
RW    128   8K    32    CLAT  99       190000.0 us        184000.0 us ( 3.15)    224000.0 us (-17.89)     192000.0 us (-1.05)     188000.0 us ( 1.05)     182000.0 us ( 4.21)
RW    128   8K    32    CLAT  99       103000.0 us        100000.0 us ( 2.91)     101000.0 us ( 1.94)      99000.0 us ( 3.88)     103000.0 us ( 0.00)     101000.0 us ( 1.94)
W     256   1M     8    CLAT  99      5336000.0 us       5671000.0 us (-6.27)    5537000.0 us (-3.76)    5671000.0 us (-6.27)    5470000.0 us (-2.51)    5470000.0 us (-2.51)
R     256   1M     8    CLAT  99      3004000.0 us      3675000.0 us (-22.33)    3138000.0 us (-4.46)    3205000.0 us (-6.69)    3239000.0 us (-7.82)    3037000.0 us (-1.09)
RW    256  64K    16    CLAT  99       709000.0 us       793000.0 us (-11.84)     726000.0 us (-2.39)    869000.0 us (-22.56)     776000.0 us (-9.44)    818000.0 us (-15.37)
RR    256  64K    16    CLAT  99       651000.0 us        600000.0 us ( 7.83)     642000.0 us ( 1.38)     642000.0 us ( 1.38)     709000.0 us (-8.90)     709000.0 us (-8.90)
RW    256   8K    32    CLAT  99       443000.0 us        393000.0 us (11.28)     418000.0 us ( 5.64)     418000.0 us ( 5.64)     422000.0 us ( 4.74)     397000.0 us (10.38)
RW    256   8K    32    CLAT  99       207000.0 us        207000.0 us ( 0.00)     205000.0 us ( 0.96)     201000.0 us ( 2.89)     205000.0 us ( 0.96)     201000.0 us ( 2.89)
W     512   1M     8    CLAT  99     10537000.0 us      10402000.0 us ( 1.28)   10537000.0 us ( 0.00)   10671000.0 us (-1.27)   10402000.0 us ( 1.28)   10805000.0 us (-2.54)
R     512   1M     8    CLAT  99      5604000.0 us       5805000.0 us (-3.58)    5805000.0 us (-3.58)    6007000.0 us (-7.19)    5805000.0 us (-3.58)    5873000.0 us (-4.80)
RW    512  64K    16    CLAT  99      1552000.0 us       1485000.0 us ( 4.31)    1452000.0 us ( 6.44)    1620000.0 us (-4.38)    1502000.0 us ( 3.22)    1519000.0 us ( 2.12)
RR    512  64K    16    CLAT  99      1167000.0 us      1301000.0 us (-11.48)    1217000.0 us (-4.28)   1334000.0 us (-14.31)   1301000.0 us (-11.48)   1301000.0 us (-11.48)
RW    512   8K    32    CLAT  99       785000.0 us        810000.0 us (-3.18)     827000.0 us (-5.35)     793000.0 us (-1.01)    869000.0 us (-10.70)     802000.0 us (-2.16)
RW    512   8K    32    CLAT  99       409000.0 us        422000.0 us (-3.17)     405000.0 us ( 0.97)     426000.0 us (-4.15)     409000.0 us ( 0.00)     409000.0 us ( 0.00)
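
A note on reading the tables: the parenthesized figures appear to be signed percentage changes against the baseline column, with positive meaning better (higher BW, lower latency). That is inferred from the numbers, not stated anywhere in the tables. A minimal sketch under that assumption follows; `pct_delta` is a hypothetical helper, not part of any benchmark tooling. Recomputing from the rounded values shown can differ from the tables in the last digit.

```python
def pct_delta(baseline, value, higher_is_better):
    """Signed percentage change vs. baseline; positive means improvement.

    Assumed convention, inferred from the tables: bandwidth improves when it
    goes up, latency improves when it goes down.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (value - baseline) / baseline * 100.0
    return change if higher_is_better else -change


# Bandwidth (higher is better): NPS4, R, 1 thread, 1M, SYSTEM column.
bw = pct_delta(670, 762, higher_is_better=True)

# Latency (lower is better): NPS2, W, 1 thread, 1M, CLAT 99, CPU column.
lat = pct_delta(25560.0, 22676.0, higher_is_better=False)

print(f"{bw:.2f} {lat:.2f}")  # prints "13.73 11.28"
```

Both results match the table entries (13.73) and (11.28) for those rows.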

==================================================================================================================================
                                                 ~~~~~~~~~~~~~~~~NPS4~~~~~~~~~~~~~~~~
==================================================================================================================================

TEST Threads BS  IODEPTH Metric   NPS4-6.4.0-rc1       CPU                 SMT                 CACHE               NUMA                SYSTEM
W       1   1M     8        BW       448 MB/s      448 MB/s( 0.00)     450 MB/s( 0.44)     447 MB/s(-0.22)     447 MB/s(-0.22)     450 MB/s( 0.44)
R       1   1M     8        BW       670 MB/s      660 MB/s(-1.49)     664 MB/s(-0.89)     671 MB/s( 0.14)     660 MB/s(-1.49)     762 MB/s(13.73)      ###
RW      1  64K    16        BW       434 MB/s      435 MB/s( 0.23)     438 MB/s( 0.92)     437 MB/s( 0.69)     436 MB/s( 0.46)     434 MB/s( 0.00)
RR      1  64K    16        BW       467 MB/s      462 MB/s(-1.07)     467 MB/s( 0.00)     466 MB/s(-0.21)     466 MB/s(-0.21)     465 MB/s(-0.42)
RW      1   8K    32        BW       251 MB/s      230 MB/s(-8.36)     243 MB/s(-3.18)     213 MB/s(-15.13)    238 MB/s(-5.17)     224 MB/s(-10.75)     ***
RW      1   8K    32        BW       536 MB/s      514 MB/s(-4.10)     536 MB/s( 0.00)     536 MB/s( 0.00)     536 MB/s( 0.00)     537 MB/s( 0.18)
W       2   1M     8        BW       451 MB/s      451 MB/s( 0.00)     451 MB/s( 0.00)     450 MB/s(-0.22)     450 MB/s(-0.22)     450 MB/s(-0.22)
R       2   1M     8        BW       819 MB/s      689 MB/s(-15.87)    673 MB/s(-17.82)    789 MB/s(-3.66)     792 MB/s(-3.29)     784 MB/s(-4.27)      ***
RW      2  64K    16        BW       434 MB/s      437 MB/s( 0.69)     439 MB/s( 1.15)     438 MB/s( 0.92)     437 MB/s( 0.69)     436 MB/s( 0.46)
RR      2  64K    16        BW       570 MB/s      566 MB/s(-0.70)     564 MB/s(-1.05)     565 MB/s(-0.87)     567 MB/s(-0.52)     568 MB/s(-0.35)
RW      2   8K    32        BW       349 MB/s      348 MB/s(-0.28)     339 MB/s(-2.86)     345 MB/s(-1.14)     348 MB/s(-0.28)     346 MB/s(-0.85)
RW      2   8K    32        BW       535 MB/s      544 MB/s( 1.68)     545 MB/s( 1.86)     546 MB/s( 2.05)     545 MB/s( 1.86)     545 MB/s( 1.86)
W       4   1M     8        BW       441 MB/s      434 MB/s(-1.58)     448 MB/s( 1.58)     448 MB/s( 1.58)     452 MB/s( 2.49)     448 MB/s( 1.58)
R       4   1M     8        BW       824 MB/s      834 MB/s( 1.21)     821 MB/s(-0.36)     831 MB/s( 0.84)     825 MB/s( 0.12)     828 MB/s( 0.48)
RW      4  64K    16        BW       434 MB/s      438 MB/s( 0.92)     431 MB/s(-0.69)     435 MB/s( 0.23)     434 MB/s( 0.00)     434 MB/s( 0.00)
RR      4  64K    16        BW       565 MB/s      562 MB/s(-0.53)     566 MB/s( 0.17)     566 MB/s( 0.17)     565 MB/s( 0.00)     564 MB/s(-0.17)
RW      4   8K    32        BW       349 MB/s      346 MB/s(-0.85)     347 MB/s(-0.57)     346 MB/s(-0.85)     346 MB/s(-0.85)     337 MB/s(-3.43)
RW      4   8K    32        BW       544 MB/s      542 MB/s(-0.36)     544 MB/s( 0.00)     543 MB/s(-0.18)     539 MB/s(-0.91)     543 MB/s(-0.18)
W       8   1M     8        BW       449 MB/s      447 MB/s(-0.44)     452 MB/s( 0.66)     450 MB/s( 0.22)     449 MB/s( 0.00)     448 MB/s(-0.22)
R       8   1M     8        BW       829 MB/s      825 MB/s(-0.48)     827 MB/s(-0.24)     830 MB/s( 0.12)     834 MB/s( 0.60)     831 MB/s( 0.24)
RW      8  64K    16        BW       431 MB/s      436 MB/s( 1.16)     435 MB/s( 0.92)     432 MB/s( 0.23)     436 MB/s( 1.16)     436 MB/s( 1.16)
RR      8  64K    16        BW       567 MB/s      565 MB/s(-0.35)     567 MB/s( 0.00)     555 MB/s(-2.11)     562 MB/s(-0.88)     567 MB/s( 0.00)
RW      8   8K    32        BW       347 MB/s      346 MB/s(-0.28)     344 MB/s(-0.86)     348 MB/s( 0.28)     348 MB/s( 0.28)     348 MB/s( 0.28)
RW      8   8K    32        BW       540 MB/s      541 MB/s( 0.18)     542 MB/s( 0.37)     539 MB/s(-0.18)     540 MB/s( 0.00)     541 MB/s( 0.18)
W      16   1M     8        BW       452 MB/s      450 MB/s(-0.44)     450 MB/s(-0.44)     450 MB/s(-0.44)     450 MB/s(-0.44)     452 MB/s( 0.00)
R      16   1M     8        BW       831 MB/s      831 MB/s( 0.00)     833 MB/s( 0.24)     834 MB/s( 0.36)     834 MB/s( 0.36)     834 MB/s( 0.36)
RW     16  64K    16        BW       436 MB/s      437 MB/s( 0.22)     434 MB/s(-0.45)     434 MB/s(-0.45)     434 MB/s(-0.45)     438 MB/s( 0.45)
RR     16  64K    16        BW       566 MB/s      556 MB/s(-1.76)     560 MB/s(-1.06)     565 MB/s(-0.17)     560 MB/s(-1.06)     570 MB/s( 0.70)
RW     16   8K    32        BW       348 MB/s      348 MB/s( 0.00)     348 MB/s( 0.00)     347 MB/s(-0.28)     346 MB/s(-0.57)     348 MB/s( 0.00)
RW     16   8K    32        BW       539 MB/s      540 MB/s( 0.18)     538 MB/s(-0.18)     538 MB/s(-0.18)     534 MB/s(-0.92)     536 MB/s(-0.55)
W      32   1M     8        BW       452 MB/s      452 MB/s( 0.00)     450 MB/s(-0.44)     451 MB/s(-0.22)     448 MB/s(-0.88)     451 MB/s(-0.22)
R      32   1M     8        BW       831 MB/s      832 MB/s( 0.12)     833 MB/s( 0.24)     837 MB/s( 0.72)     835 MB/s( 0.48)     835 MB/s( 0.48)
RW     32  64K    16        BW       435 MB/s      431 MB/s(-0.91)     436 MB/s( 0.22)     436 MB/s( 0.22)     434 MB/s(-0.22)     438 MB/s( 0.68)
RR     32  64K    16        BW       568 MB/s      560 MB/s(-1.40)     560 MB/s(-1.40)     559 MB/s(-1.58)     562 MB/s(-1.05)     565 MB/s(-0.52)
RW     32   8K    32        BW       347 MB/s      345 MB/s(-0.57)     345 MB/s(-0.57)     346 MB/s(-0.28)     341 MB/s(-1.72)     346 MB/s(-0.28)
RW     32   8K    32        BW       524 MB/s      536 MB/s( 2.29)     536 MB/s( 2.29)     536 MB/s( 2.29)     532 MB/s( 1.52)     531 MB/s( 1.33)
W      64   1M     8        BW       450 MB/s      452 MB/s( 0.44)     450 MB/s( 0.00)     450 MB/s( 0.00)     447 MB/s(-0.66)     451 MB/s( 0.22)
R      64   1M     8        BW       831 MB/s      832 MB/s( 0.12)     835 MB/s( 0.48)     835 MB/s( 0.48)     833 MB/s( 0.24)     835 MB/s( 0.48)
RW     64  64K    16        BW       429 MB/s      434 MB/s( 1.16)     434 MB/s( 1.16)     431 MB/s( 0.46)     430 MB/s( 0.23)     438 MB/s( 2.09)
RR     64  64K    16        BW       568 MB/s      564 MB/s(-0.70)     558 MB/s(-1.76)     567 MB/s(-0.17)     562 MB/s(-1.05)     566 MB/s(-0.35)
RW     64   8K    32        BW       342 MB/s      342 MB/s( 0.00)     345 MB/s( 0.87)     339 MB/s(-0.87)     344 MB/s( 0.58)     346 MB/s( 1.16)
RW     64   8K    32        BW       532 MB/s      529 MB/s(-0.56)     530 MB/s(-0.37)     529 MB/s(-0.56)     533 MB/s( 0.18)     530 MB/s(-0.37)
W     128   1M     8        BW       451 MB/s      451 MB/s( 0.00)     449 MB/s(-0.44)     450 MB/s(-0.22)     447 MB/s(-0.88)     451 MB/s( 0.00)
R     128   1M     8        BW       831 MB/s      832 MB/s( 0.12)     834 MB/s( 0.36)     833 MB/s( 0.24)     834 MB/s( 0.36)     835 MB/s( 0.48)
RW    128  64K    16        BW       435 MB/s      435 MB/s( 0.00)     434 MB/s(-0.22)     430 MB/s(-1.14)     430 MB/s(-1.14)     433 MB/s(-0.45)
RR    128  64K    16        BW       565 MB/s      568 MB/s( 0.53)     561 MB/s(-0.70)     562 MB/s(-0.53)     567 MB/s( 0.35)     566 MB/s( 0.17)
RW    128   8K    32        BW       339 MB/s      338 MB/s(-0.29)     340 MB/s( 0.29)     340 MB/s( 0.29)     343 MB/s( 1.17)     339 MB/s( 0.00)
RW    128   8K    32        BW       528 MB/s      533 MB/s( 0.94)     524 MB/s(-0.75)     527 MB/s(-0.18)     528 MB/s( 0.00)     526 MB/s(-0.37)
W     256   1M     8        BW       449 MB/s      451 MB/s( 0.44)     450 MB/s( 0.22)     450 MB/s( 0.22)     450 MB/s( 0.22)     447 MB/s(-0.44)
R     256   1M     8        BW       831 MB/s      831 MB/s( 0.00)     834 MB/s( 0.36)     834 MB/s( 0.36)     837 MB/s( 0.72)     835 MB/s( 0.48)
RW    256  64K    16        BW       434 MB/s      432 MB/s(-0.46)     434 MB/s( 0.00)     430 MB/s(-0.92)     430 MB/s(-0.92)     436 MB/s( 0.46)
RR    256  64K    16        BW       567 MB/s      567 MB/s( 0.00)     559 MB/s(-1.41)     566 MB/s(-0.17)     569 MB/s( 0.35)     558 MB/s(-1.58)
RW    256   8K    32        BW       331 MB/s      336 MB/s( 1.51)     332 MB/s( 0.30)     333 MB/s( 0.60)     331 MB/s( 0.00)     332 MB/s( 0.30)
RW    256   8K    32        BW       520 MB/s      528 MB/s( 1.53)     533 MB/s( 2.50)     529 MB/s( 1.73)     531 MB/s( 2.11)     530 MB/s( 1.92)
W     512   1M     8        BW       449 MB/s      451 MB/s( 0.44)     451 MB/s( 0.44)     450 MB/s( 0.22)     446 MB/s(-0.66)     451 MB/s( 0.44)
R     512   1M     8        BW       830 MB/s      830 MB/s( 0.00)     834 MB/s( 0.48)     834 MB/s( 0.48)     835 MB/s( 0.60)     835 MB/s( 0.60)
RW    512  64K    16        BW       430 MB/s      432 MB/s( 0.46)     429 MB/s(-0.23)     435 MB/s( 1.16)     434 MB/s( 0.93)     434 MB/s( 0.93)
RR    512  64K    16        BW       564 MB/s      557 MB/s(-1.24)     567 MB/s( 0.53)     568 MB/s( 0.70)     567 MB/s( 0.53)     565 MB/s( 0.17)
RW    512   8K    32        BW       335 MB/s      334 MB/s(-0.29)     325 MB/s(-2.98)     333 MB/s(-0.59)     329 MB/s(-1.79)     317 MB/s(-5.37)
RW    512   8K    32        BW       529 MB/s      534 MB/s( 0.94)     529 MB/s( 0.00)     532 MB/s( 0.56)     526 MB/s(-0.56)     525 MB/s(-0.75)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS4-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    SLAT AVG           62.3 us            58.1 us ( 6.69)        71.8 us (-15.40)        79.8 us (-28.17)     434.2 us (-597.41)      434.3 us (-597.51)
R       1   1M     8    SLAT AVG           50.0 us            51.3 us (-2.68)         51.5 us (-3.12)         53.3 us (-6.70)     253.7 us (-407.90)      188.8 us (-278.05)
RW      1  64K    16    SLAT AVG           19.3 us            18.9 us ( 2.48)         19.0 us ( 1.49)         19.4 us (-0.20)       21.6 us (-11.63)        21.6 us (-11.94)
RR      1  64K    16    SLAT AVG           10.9 us            10.9 us (-0.18)         10.9 us ( 0.09)         11.3 us (-3.66)       15.7 us (-43.68)        15.0 us (-36.99)
RW      1   8K    32    SLAT AVG           29.2 us           33.1 us (-13.31)         30.7 us (-4.99)        35.6 us (-21.93)        31.3 us (-7.22)        33.9 us (-16.05)
RW      1   8K    32    SLAT AVG           11.0 us           13.5 us (-22.79)         10.9 us ( 0.90)         11.0 us ( 0.00)        11.0 us ( 0.45)         11.0 us ( 0.27)
W       2   1M     8    SLAT AVG           60.2 us           73.9 us (-22.74)        75.5 us (-25.36)        79.8 us (-32.61)     353.4 us (-487.09)      402.5 us (-568.68)
R       2   1M     8    SLAT AVG           38.1 us           53.8 us (-41.15)        54.2 us (-42.28)        56.6 us (-48.58)     244.2 us (-541.02)      230.8 us (-505.72)
RW      2  64K    16    SLAT AVG           17.2 us            18.6 us (-8.09)         18.2 us (-6.11)         15.6 us ( 9.31)        18.8 us (-9.26)         18.8 us (-9.26)
RR      2  64K    16    SLAT AVG           11.8 us            11.3 us ( 3.90)         12.1 us (-2.37)         11.6 us ( 2.03)       17.0 us (-44.35)        15.6 us (-32.06)
RW      2   8K    32    SLAT AVG           19.2 us            19.2 us ( 0.00)        22.1 us (-15.21)         18.8 us ( 1.92)        17.3 us ( 9.79)         19.1 us ( 0.26)
RW      2   8K    32    SLAT AVG           15.9 us             9.8 us (38.49)          9.8 us (38.68)          8.1 us (49.46)         8.0 us (50.03)          7.8 us (51.15)
W       4   1M     8    SLAT AVG           50.5 us         233.5 us (-362.32)        77.3 us (-52.95)        56.7 us (-12.27)       66.9 us (-32.48)        99.0 us (-95.96)
R       4   1M     8    SLAT AVG           26.8 us            26.5 us ( 1.15)        31.8 us (-18.45)         28.1 us (-4.88)       38.0 us (-41.53)        31.1 us (-16.03)
RW      4  64K    16    SLAT AVG           18.7 us            16.8 us (10.33)         16.4 us (12.04)         17.9 us ( 4.06)        20.2 us (-8.40)         20.4 us (-9.10)
RR      4  64K    16    SLAT AVG            8.7 us             8.3 us ( 4.83)        10.4 us (-19.44)          7.6 us (12.31)       11.9 us (-37.16)          8.1 us ( 7.24)
RW      4   8K    32    SLAT AVG           90.6 us            91.1 us (-0.64)         91.0 us (-0.49)         91.6 us (-1.11)        91.4 us (-0.97)         94.2 us (-4.04)
RW      4   8K    32    SLAT AVG           57.6 us            57.9 us (-0.52)         57.5 us ( 0.20)         57.7 us (-0.17)        58.2 us (-1.09)         57.7 us (-0.19)
W       8   1M     8    SLAT AVG          853.9 us        9189.1 us (-976.13)     8983.1 us (-952.01)     7175.0 us (-740.25)    8332.7 us (-875.83)     5838.1 us (-583.70)
R       8   1M     8    SLAT AVG         3294.6 us         5723.9 us (-73.73)      4980.4 us (-51.16)      4666.1 us (-41.62)        36.7 us (98.88)      3956.3 us (-20.08)
RW      8  64K    16    SLAT AVG         1041.9 us          1024.4 us ( 1.68)       1029.9 us ( 1.15)       1043.6 us (-0.15)      1030.3 us ( 1.11)       1025.5 us ( 1.57)
RR      8  64K    16    SLAT AVG          787.3 us           787.9 us (-0.08)        786.5 us ( 0.10)        800.7 us (-1.70)       794.5 us (-0.92)        783.9 us ( 0.42)
RW      8   8K    32    SLAT AVG          187.2 us           187.1 us ( 0.06)        188.1 us (-0.48)        186.6 us ( 0.33)       186.7 us ( 0.28)        186.3 us ( 0.47)
RW      8   8K    32    SLAT AVG          119.8 us           119.7 us ( 0.12)        119.5 us ( 0.27)        120.0 us (-0.10)       119.8 us ( 0.02)        119.6 us ( 0.17)
W      16   1M     8    SLAT AVG        30315.0 us         30625.4 us (-1.02)      32137.1 us (-6.01)      30116.3 us ( 0.65)     31053.4 us (-2.43)      31008.5 us (-2.28)
R      16   1M     8    SLAT AVG        16275.2 us         16990.0 us (-4.39)      17302.0 us (-6.30)      17030.9 us (-4.64)     16874.7 us (-3.68)      17401.9 us (-6.92)
RW     16  64K    16    SLAT AVG         2400.1 us          2392.8 us ( 0.30)       2407.4 us (-0.30)       2410.8 us (-0.44)      2408.7 us (-0.35)       2386.5 us ( 0.56)
RR     16  64K    16    SLAT AVG         1846.4 us          1877.7 us (-1.69)       1864.6 us (-0.98)       1849.9 us (-0.18)      1865.6 us (-1.03)       1833.6 us ( 0.69)
RW     16   8K    32    SLAT AVG          374.5 us           375.3 us (-0.20)        374.1 us ( 0.09)        376.0 us (-0.39)       376.9 us (-0.64)        375.1 us (-0.17)
RW     16   8K    32    SLAT AVG          241.4 us           241.2 us ( 0.09)        242.0 us (-0.26)        242.0 us (-0.26)       243.5 us (-0.86)        242.7 us (-0.54)
W      32   1M     8    SLAT AVG        73872.1 us         73953.3 us (-0.10)      74297.0 us (-0.57)      74054.8 us (-0.24)     74434.9 us (-0.76)      74080.7 us (-0.28)
R      32   1M     8    SLAT AVG        40261.1 us         40265.5 us (-0.01)      40204.9 us ( 0.13)      40013.5 us ( 0.61)     40080.2 us ( 0.44)      40067.1 us ( 0.48)
RW     32  64K    16    SLAT AVG         4816.5 us          4863.3 us (-0.97)       4811.5 us ( 0.10)       4800.6 us ( 0.33)      4823.0 us (-0.13)       4788.6 us ( 0.57)
RR     32  64K    16    SLAT AVG         3688.6 us          3739.6 us (-1.38)       3743.9 us (-1.49)       3746.3 us (-1.56)      3726.2 us (-1.01)       3711.5 us (-0.62)
RW     32   8K    32    SLAT AVG          754.1 us           755.9 us (-0.24)        758.1 us (-0.53)        755.5 us (-0.18)       766.4 us (-1.63)        754.2 us (-0.01)
RW     32   8K    32    SLAT AVG          498.7 us           487.0 us ( 2.35)        487.0 us ( 2.35)        487.3 us ( 2.28)       491.3 us ( 1.48)        492.1 us ( 1.33)
W      64   1M     8    SLAT AVG       148752.0 us        147926.2 us ( 0.55)     148587.9 us ( 0.11)     148633.2 us ( 0.07)    149428.0 us (-0.45)     148405.9 us ( 0.23)
R      64   1M     8    SLAT AVG        80606.5 us         80529.7 us ( 0.09)      80204.7 us ( 0.49)      80194.3 us ( 0.51)     80396.8 us ( 0.26)      80189.6 us ( 0.51)
RW     64  64K    16    SLAT AVG         9772.9 us          9666.2 us ( 1.09)       9655.6 us ( 1.20)       9723.9 us ( 0.50)      9749.5 us ( 0.23)       9577.8 us ( 1.99)
RR     64  64K    16    SLAT AVG         7383.6 us          7438.6 us (-0.74)       7506.0 us (-1.65)       7396.8 us (-0.17)      7459.8 us (-1.03)       7412.2 us (-0.38)
RW     64   8K    32    SLAT AVG         1529.0 us          1530.1 us (-0.07)       1516.3 us ( 0.82)       1540.0 us (-0.72)      1520.8 us ( 0.53)       1509.7 us ( 1.25)
RW     64   8K    32    SLAT AVG          983.7 us           989.8 us (-0.62)        986.5 us (-0.27)        989.2 us (-0.56)       981.8 us ( 0.18)        986.5 us (-0.27)
W     128   1M     8    SLAT AVG       296380.6 us        296001.7 us ( 0.12)     297235.8 us (-0.28)     296878.9 us (-0.16)    298735.0 us (-0.79)     296306.0 us ( 0.02)
R     128   1M     8    SLAT AVG       161104.5 us        160888.0 us ( 0.13)     160462.5 us ( 0.39)     160671.3 us ( 0.26)    160605.6 us ( 0.30)     160380.7 us ( 0.44)
RW    128  64K    16    SLAT AVG        19256.4 us         19254.8 us ( 0.00)      19327.9 us (-0.37)      19506.9 us (-1.30)     19488.9 us (-1.20)      19378.4 us (-0.63)
RR    128  64K    16    SLAT AVG        14851.0 us         14771.8 us ( 0.53)      14936.6 us (-0.57)      14914.6 us (-0.42)     14793.7 us ( 0.38)      14815.2 us ( 0.24)
RW    128   8K    32    SLAT AVG         3091.8 us          3101.3 us (-0.30)       3080.1 us ( 0.38)       3078.0 us ( 0.44)      3051.1 us ( 1.31)       3089.4 us ( 0.07)
RW    128   8K    32    SLAT AVG         1984.3 us          1964.1 us ( 1.01)       1998.0 us (-0.68)       1987.0 us (-0.13)      1982.6 us ( 0.08)       1989.6 us (-0.26)
W     256   1M     8    SLAT AVG       593779.9 us        590960.2 us ( 0.47)     592330.2 us ( 0.24)     591166.7 us ( 0.44)    591994.6 us ( 0.30)     595833.3 us (-0.34)
R     256   1M     8    SLAT AVG       321605.7 us        321601.3 us ( 0.00)     320435.1 us ( 0.36)     320533.0 us ( 0.33)    319474.0 us ( 0.66)     320265.3 us ( 0.41)
RW    256  64K    16    SLAT AVG        38612.8 us         38783.6 us (-0.44)      38644.0 us (-0.08)      39005.2 us (-1.01)     39029.8 us (-1.07)      38475.6 us ( 0.35)
RR    256  64K    16    SLAT AVG        29577.9 us         29557.0 us ( 0.07)      29991.3 us (-1.39)      29626.1 us (-0.16)     29478.7 us ( 0.33)      30031.8 us (-1.53)
RW    256   8K    32    SLAT AVG         6322.1 us          6239.8 us ( 1.30)       6322.0 us ( 0.00)       6289.9 us ( 0.50)      6322.6 us ( 0.00)       6315.4 us ( 0.10)
RW    256   8K    32    SLAT AVG         4031.9 us          3971.9 us ( 1.48)       3931.0 us ( 2.50)       3958.5 us ( 1.82)      3946.0 us ( 2.13)       3952.8 us ( 1.96)
W     512   1M     8    SLAT AVG      1181861.0 us       1175091.3 us ( 0.57)    1176164.6 us ( 0.48)    1178815.3 us ( 0.25)   1188861.1 us (-0.59)    1175876.0 us ( 0.50)
R     512   1M     8    SLAT AVG       642286.7 us        642470.8 us (-0.02)     639078.2 us ( 0.49)     639487.9 us ( 0.43)    638784.9 us ( 0.54)     638974.1 us ( 0.51)
RW    512  64K    16    SLAT AVG        77948.1 us         77613.7 us ( 0.42)      78153.8 us (-0.26)      76970.0 us ( 1.25)     77217.4 us ( 0.93)      77266.0 us ( 0.87)
RR    512  64K    16    SLAT AVG        59445.2 us         60176.2 us (-1.22)      59122.7 us ( 0.54)      59027.2 us ( 0.70)     59118.5 us ( 0.54)      59320.5 us ( 0.20)
RW    512   8K    32    SLAT AVG        12500.4 us         12531.4 us (-0.24)      12888.5 us (-3.10)      12581.7 us (-0.64)     12723.7 us (-1.78)      13221.9 us (-5.77)
RW    512   8K    32    SLAT AVG         7916.5 us          7849.6 us ( 0.84)       7923.1 us (-0.08)       7879.0 us ( 0.47)      7975.6 us (-0.74)       7984.2 us (-0.85)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS4-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    LAT AVG         18732.9 us         18737.9 us (-0.02)      18658.4 us ( 0.39)      18753.9 us (-0.11)      18773.9 us (-0.21)      18628.3 us ( 0.55)
R       1   1M     8    LAT AVG         12524.3 us         12699.7 us (-1.40)      12631.6 us (-0.85)      12502.5 us ( 0.17)      12701.1 us (-1.41)      10995.7 us (12.20)
RW      1  64K    16    LAT AVG          2417.7 us          2407.6 us ( 0.42)       2392.5 us ( 1.04)       2396.7 us ( 0.86)       2406.7 us ( 0.45)       2417.2 us ( 0.02)
RR      1  64K    16    LAT AVG          2243.9 us          2269.1 us (-1.12)       2245.3 us (-0.06)       2251.8 us (-0.35)       2251.4 us (-0.33)       2256.0 us (-0.54)
RW      1   8K    32    LAT AVG          1044.2 us          1138.0 us (-8.98)       1076.8 us (-3.12)      1228.3 us (-17.63)       1098.8 us (-5.23)      1167.5 us (-11.80)	***
RW      1   8K    32    LAT AVG           488.3 us           509.5 us (-4.32)        488.4 us (-0.01)        488.4 us (-0.01)        488.9 us (-0.10)        488.2 us ( 0.03)
W       2   1M     8    LAT AVG         37167.4 us         37207.4 us (-0.10)      37168.2 us ( 0.00)      37254.2 us (-0.23)      37268.1 us (-0.27)      37257.1 us (-0.24)
R       2   1M     8    LAT AVG         20469.5 us        24349.5 us (-18.95)     24938.8 us (-21.83)      21267.6 us (-3.89)      21178.3 us (-3.46)      21380.9 us (-4.45)	***
RW      2  64K    16    LAT AVG          4829.4 us          4793.6 us ( 0.74)       4777.0 us ( 1.08)       4791.1 us ( 0.79)       4799.4 us ( 0.62)       4807.6 us ( 0.45)
RR      2  64K    16    LAT AVG          3681.5 us          3704.5 us (-0.62)       3718.6 us (-1.00)       3711.2 us (-0.80)       3700.3 us (-0.51)       3690.4 us (-0.24)
RW      2   8K    32    LAT AVG          1502.4 us          1505.2 us (-0.18)       1544.4 us (-2.79)       1517.3 us (-0.98)       1505.7 us (-0.21)       1512.7 us (-0.68)
RW      2   8K    32    LAT AVG           979.6 us           962.6 us ( 1.73)        961.7 us ( 1.82)        960.2 us ( 1.97)        960.9 us ( 1.90)        961.7 us ( 1.82)
W       4   1M     8    LAT AVG         76080.0 us         77240.0 us (-1.52)      74850.0 us ( 1.61)      74810.0 us ( 1.66)      74300.0 us ( 2.33)      74850.0 us ( 1.61)
R       4   1M     8    LAT AVG         40700.0 us         40213.8 us ( 1.19)      40840.0 us (-0.34)      40380.0 us ( 0.78)      40649.6 us ( 0.12)      40535.9 us ( 0.40)
RW      4  64K    16    LAT AVG          9672.1 us          9584.1 us ( 0.91)       9734.8 us (-0.64)       9651.5 us ( 0.21)       9668.8 us ( 0.03)       9665.8 us ( 0.06)
RR      4  64K    16    LAT AVG          7425.0 us          7464.0 us (-0.52)       7403.0 us ( 0.29)       7410.8 us ( 0.19)       7416.3 us ( 0.11)       7430.8 us (-0.07)
RW      4   8K    32    LAT AVG          3001.2 us          3026.7 us (-0.84)       3023.0 us (-0.72)       3030.3 us (-0.96)       3026.6 us (-0.84)       3113.3 us (-3.73)
RW      4   8K    32    LAT AVG          1928.3 us          1935.5 us (-0.37)       1925.9 us ( 0.12)       1930.1 us (-0.09)       1944.0 us (-0.81)       1932.3 us (-0.20)
W       8   1M     8    LAT AVG        149300.0 us        149980.0 us (-0.45)     148300.0 us ( 0.66)     149120.0 us ( 0.12)     149390.0 us (-0.06)     149770.0 us (-0.31)
R       8   1M     8    LAT AVG         80870.0 us         81320.0 us (-0.55)      81070.0 us (-0.24)      80850.0 us ( 0.02)      80390.0 us ( 0.59)      80720.0 us ( 0.18)
RW      8  64K    16    LAT AVG         19448.9 us         19242.6 us ( 1.06)      19262.2 us ( 0.95)      19398.2 us ( 0.26)      19232.1 us ( 1.11)      19218.6 us ( 1.18)
RR      8  64K    16    LAT AVG         14796.3 us         14847.3 us (-0.34)      14797.2 us ( 0.00)      15119.5 us (-2.18)      14916.2 us (-0.81)      14781.4 us ( 0.10)
RW      8   8K    32    LAT AVG          6047.7 us          6050.1 us (-0.04)       6101.5 us (-0.89)       6022.9 us ( 0.40)       6025.6 us ( 0.36)       6026.6 us ( 0.34)
RW      8   8K    32    LAT AVG          3882.6 us          3876.6 us ( 0.15)       3870.5 us ( 0.31)       3887.3 us (-0.12)       3879.9 us ( 0.06)       3873.8 us ( 0.22)
W      16   1M     8    LAT AVG        296370.0 us        298100.0 us (-0.58)     297800.0 us (-0.48)     298200.0 us (-0.61)     297680.0 us (-0.44)     296310.0 us ( 0.02)
R      16   1M     8    LAT AVG        161300.0 us        161390.0 us (-0.05)     160980.0 us ( 0.19)     160780.0 us ( 0.32)     160810.0 us ( 0.30)     160880.0 us ( 0.26)
RW     16  64K    16    LAT AVG         38487.7 us         38374.7 us ( 0.29)      38610.0 us (-0.31)      38658.2 us (-0.44)      38630.0 us (-0.36)      38275.5 us ( 0.55)
RR     16  64K    16    LAT AVG         29638.3 us         30145.1 us (-1.70)      29932.4 us (-0.99)      29696.0 us (-0.19)      29948.7 us (-1.04)      29431.4 us ( 0.69)
RW     16   8K    32    LAT AVG         12044.0 us         12065.8 us (-0.18)      12037.5 us ( 0.05)      12092.1 us (-0.39)      12115.0 us (-0.58)      12066.7 us (-0.18)
RW     16   8K    32    LAT AVG          7775.6 us          7766.6 us ( 0.11)       7799.4 us (-0.30)       7799.3 us (-0.30)       7848.9 us (-0.94)       7818.1 us (-0.54)
W      32   1M     8    LAT AVG        591510.0 us        592030.0 us (-0.08)     593750.0 us (-0.37)     593550.0 us (-0.34)     596040.0 us (-0.76)     592030.0 us (-0.08)
R      32   1M     8    LAT AVG        322210.0 us        322200.0 us ( 0.00)     321640.0 us ( 0.17)     320260.0 us ( 0.60)     320870.0 us ( 0.41)     320820.0 us ( 0.43)
RW     32  64K    16    LAT AVG         77081.9 us         77833.3 us (-0.97)      76997.1 us ( 0.10)      76827.8 us ( 0.32)      77182.7 us (-0.13)      76637.3 us ( 0.57)
RR     32  64K    16    LAT AVG         59040.0 us         59850.0 us (-1.37)      59920.0 us (-1.49)      59960.0 us (-1.55)      59640.0 us (-1.01)      59410.0 us (-0.62)
RW     32   8K    32    LAT AVG         24199.4 us         24268.3 us (-0.28)      24336.6 us (-0.56)      24245.6 us (-0.19)      24589.6 us (-1.61)      24201.8 us ( 0.00)
RW     32   8K    32    LAT AVG         16018.1 us         15645.5 us ( 2.32)      15643.5 us ( 2.33)      15655.9 us ( 2.26)      15780.0 us ( 1.48)      15803.1 us ( 1.34)
W      64   1M     8    LAT AVG       1183590.0 us       1178070.0 us ( 0.46)    1182430.0 us ( 0.09)    1182590.0 us ( 0.08)    1188620.0 us (-0.42)    1181150.0 us ( 0.20)
R      64   1M     8    LAT AVG        643320.0 us        642690.0 us ( 0.09)     640020.0 us ( 0.51)     640020.0 us ( 0.51)     641200.0 us ( 0.32)     639880.0 us ( 0.53)
RW     64  64K    16    LAT AVG        156286.8 us        154573.7 us ( 1.09)     154403.1 us ( 1.20)     155494.2 us ( 0.50)     155897.6 us ( 0.24)     153158.4 us ( 2.00)
RR     64  64K    16    LAT AVG        118090.0 us        118970.0 us (-0.74)     120053.2 us (-1.66)     118300.0 us (-0.17)     119310.0 us (-1.03)     118560.0 us (-0.39)
RW     64   8K    32    LAT AVG         49002.3 us         49039.6 us (-0.07)      48583.4 us ( 0.85)      49376.2 us (-0.76)      48719.0 us ( 0.57)      48381.2 us ( 1.26)
RW     64   8K    32    LAT AVG         31536.2 us         31728.7 us (-0.61)      31620.2 us (-0.26)      31709.0 us (-0.54)      31472.4 us ( 0.20)      31624.2 us (-0.27)
W     128   1M     8    LAT AVG       2340540.0 us       2337820.0 us ( 0.11)    2348970.0 us (-0.36)    2346440.0 us (-0.25)    2360080.0 us (-0.83)    2340320.0 us ( 0.00)
R     128   1M     8    LAT AVG       1280260.0 us       1278600.0 us ( 0.12)    1275210.0 us ( 0.39)    1276960.0 us ( 0.25)    1276230.0 us ( 0.31)    1274360.0 us ( 0.46)
RW    128  64K    16    LAT AVG        307561.4 us        307534.7 us ( 0.00)     308677.7 us (-0.36)     311555.9 us (-1.29)     311239.9 us (-1.19)     309502.9 us (-0.63)
RR    128  64K    16    LAT AVG        237280.0 us        236010.0 us ( 0.53)     238640.0 us (-0.57)     238290.0 us (-0.42)     236336.6 us ( 0.39)     236710.0 us ( 0.24)
RW    128   8K    32    LAT AVG         98969.2 us         99287.5 us (-0.32)      98565.9 us ( 0.40)      98519.5 us ( 0.45)      97638.0 us ( 1.34)      98877.1 us ( 0.09)
RW    128   8K    32    LAT AVG         63532.9 us         62884.8 us ( 1.02)      63967.4 us (-0.68)      63616.2 us (-0.13)      63476.9 us ( 0.08)      63700.4 us (-0.26)
W     256   1M     8    LAT AVG       4619020.0 us       4598290.0 us ( 0.44)    4609920.0 us ( 0.19)    4602230.0 us ( 0.36)    4607110.0 us ( 0.25)    4635140.0 us (-0.34)
R     256   1M     8    LAT AVG       2534380.0 us       2534220.0 us ( 0.00)    2525090.0 us ( 0.36)    2525490.0 us ( 0.35)    2517740.0 us ( 0.65)    2523890.0 us ( 0.41)
RW    256  64K    16    LAT AVG        615320.0 us        618037.1 us (-0.44)     615780.0 us (-0.07)     621510.0 us (-1.00)     621920.9 us (-1.07)     613060.0 us ( 0.36)
RR    256  64K    16    LAT AVG        471690.4 us        471371.8 us ( 0.06)     478251.4 us (-1.39)     472438.7 us (-0.15)     470120.0 us ( 0.33)     478930.0 us (-1.53)
RW    256   8K    32    LAT AVG        202092.8 us        199491.8 us ( 1.28)     202068.0 us ( 0.01)     201072.0 us ( 0.50)     202110.8 us ( 0.00)     201913.0 us ( 0.08)
RW    256   8K    32    LAT AVG        128962.1 us        127047.8 us ( 1.48)     125736.9 us ( 2.50)     126621.1 us ( 1.81)     126219.0 us ( 2.12)     126436.9 us ( 1.95)
W     512   1M     8    LAT AVG       8903560.0 us       8865780.0 us ( 0.42)    8874540.0 us ( 0.32)    8885480.0 us ( 0.20)    8958760.0 us (-0.61)    8868010.0 us ( 0.39)
R     512   1M     8    LAT AVG       4974260.0 us       4977280.0 us (-0.06)    4950860.0 us ( 0.47)    4953810.0 us ( 0.41)    4948690.0 us ( 0.51)    4949640.0 us ( 0.49)
RW    512  64K    16    LAT AVG       1236179.1 us       1231250.0 us ( 0.39)    1239790.0 us (-0.29)    1221070.0 us ( 1.22)    1225070.0 us ( 0.89)    1225888.4 us ( 0.83)
RR    512  64K    16    LAT AVG        944620.0 us        956230.0 us (-1.22)     939480.0 us ( 0.54)     938029.6 us ( 0.69)     939450.0 us ( 0.54)     942670.0 us ( 0.20)
RW    512   8K    32    LAT AVG        399053.1 us        399963.3 us (-0.22)     411143.4 us (-3.02)     401355.4 us (-0.57)     406139.1 us (-1.77)     421692.5 us (-5.67)
RW    512   8K    32    LAT AVG        252931.0 us        250780.0 us ( 0.85)     253137.2 us (-0.08)     251712.5 us ( 0.48)     254804.8 us (-0.74)     255068.4 us (-0.84)
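
The parenthesized figures in these tables appear to be the percent change relative to the NPS4-6.4.0-rc1 baseline column, signed so that a lower latency than baseline is positive. A minimal sketch of that arithmetic, assuming this reading is right (the helper name pct_delta is hypothetical and not part of the benchmark tooling):

```python
def pct_delta(baseline_us: float, value_us: float) -> float:
    """Percent change versus baseline; positive means lower (better) latency."""
    return (baseline_us - value_us) / baseline_us * 100.0


# Values copied from the "R 1 1M 8 LAT AVG" row: baseline vs. SYSTEM column.
baseline = 12524.3
system = 10995.7
print(f"R/1/1M/8 SYSTEM: {pct_delta(baseline, system):.2f}%")

# Values copied from the "RW 1 8K 32 LAT AVG" row: baseline vs. CACHE column.
print(f"RW/1/8K/32 CACHE: {pct_delta(1044.2, 1228.3):.2f}%")
```

Recomputed values may differ from the table in the last digit, since the table entries are themselves printed to one decimal place and the deltas were presumably derived from unrounded data.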

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS4-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT AVG        18670.5 us         18679.7 us (-0.04)      18586.3 us ( 0.45)       18673.9 us (-0.01)     18339.6 us ( 1.77)      18193.9 us ( 2.55)
R       1   1M     8    CLAT AVG        12474.2 us         12648.3 us (-1.39)      12579.9 us (-0.84)       12449.1 us ( 0.20)     12447.3 us ( 0.21)      10806.7 us (13.36)
RW      1  64K    16    CLAT AVG         2398.3 us          2388.6 us ( 0.40)       2373.3 us ( 1.04)        2377.2 us ( 0.87)      2385.0 us ( 0.55)       2395.4 us ( 0.11)
RR      1  64K    16    CLAT AVG         2232.9 us          2258.1 us (-1.13)       2234.3 us (-0.06)        2240.4 us (-0.33)      2235.6 us (-0.12)       2241.0 us (-0.36)
RW      1   8K    32    CLAT AVG         1014.8 us          1104.8 us (-8.87)       1045.9 us (-3.07)       1192.6 us (-17.52)      1067.4 us (-5.18)      1133.4 us (-11.69)	***
RW      1   8K    32    CLAT AVG          477.2 us           495.9 us (-3.90)        477.4 us (-0.04)         477.3 us (-0.02)       477.8 us (-0.11)        477.1 us ( 0.02)
W       2   1M     8    CLAT AVG        37106.9 us         37133.3 us (-0.07)      37092.5 us ( 0.03)       37174.2 us (-0.18)     36914.5 us ( 0.51)      36854.4 us ( 0.68)
R       2   1M     8    CLAT AVG        20431.3 us        24295.6 us (-18.91)     24884.5 us (-21.79)       21210.8 us (-3.81)     20933.9 us (-2.46)      21149.9 us (-3.51)	***
RW      2  64K    16    CLAT AVG         4812.1 us          4774.9 us ( 0.77)       4758.7 us ( 1.11)        4775.5 us ( 0.76)      4780.5 us ( 0.65)       4788.8 us ( 0.48)
RR      2  64K    16    CLAT AVG         3669.6 us          3693.1 us (-0.63)       3706.4 us (-1.00)        3699.6 us (-0.81)      3683.2 us (-0.36)       3674.8 us (-0.14)
RW      2   8K    32    CLAT AVG         1483.1 us          1485.9 us (-0.18)       1522.2 us (-2.63)        1498.4 us (-1.02)      1488.3 us (-0.34)       1493.4 us (-0.69)
RW      2   8K    32    CLAT AVG          963.5 us           952.7 us ( 1.11)        951.8 us ( 1.21)         952.0 us ( 1.18)       952.8 us ( 1.10)        953.8 us ( 1.00)
W       4   1M     8    CLAT AVG        76030.0 us         77010.0 us (-1.28)      74770.0 us ( 1.65)       74750.0 us ( 1.68)     74230.0 us ( 2.36)      74750.0 us ( 1.68)
R       4   1M     8    CLAT AVG        40670.0 us         40187.1 us ( 1.18)      40810.0 us (-0.34)       40350.0 us ( 0.78)     40611.6 us ( 0.14)      40504.6 us ( 0.40)
RW      4  64K    16    CLAT AVG         9653.3 us          9567.3 us ( 0.89)       9718.3 us (-0.67)        9633.5 us ( 0.20)      9648.4 us ( 0.05)       9645.3 us ( 0.08)
RR      4  64K    16    CLAT AVG         7416.2 us          7455.6 us (-0.53)       7392.5 us ( 0.31)        7403.1 us ( 0.17)      7404.3 us ( 0.16)       7422.6 us (-0.08)
RW      4   8K    32    CLAT AVG         2910.6 us          2935.5 us (-0.85)       2931.9 us (-0.73)        2938.7 us (-0.96)      2935.1 us (-0.84)       3018.9 us (-3.72)
RW      4   8K    32    CLAT AVG         1870.6 us          1877.5 us (-0.37)       1868.3 us ( 0.12)        1872.3 us (-0.09)      1885.6 us (-0.80)       1874.5 us (-0.21)
W       8   1M     8    CLAT AVG       148450.0 us        140790.0 us ( 5.15)     139320.0 us ( 6.15)      141940.0 us ( 4.38)    141060.0 us ( 4.97)     143930.0 us ( 3.04)
R       8   1M     8    CLAT AVG        77580.0 us         75600.0 us ( 2.55)      76090.0 us ( 1.92)       76180.0 us ( 1.80)     80360.0 us (-3.58)      76770.0 us ( 1.04)
RW      8  64K    16    CLAT AVG        18406.9 us         18218.1 us ( 1.02)      18232.2 us ( 0.94)       18354.5 us ( 0.28)     18201.7 us ( 1.11)      18192.9 us ( 1.16)
RR      8  64K    16    CLAT AVG        14008.9 us         14059.2 us (-0.35)      14010.7 us (-0.01)       14318.7 us (-2.21)     14121.6 us (-0.80)      13997.4 us ( 0.08)
RW      8   8K    32    CLAT AVG         5860.4 us          5863.0 us (-0.04)       5913.3 us (-0.90)        5836.2 us ( 0.41)      5838.8 us ( 0.36)       5840.2 us ( 0.34)
RW      8   8K    32    CLAT AVG         3762.6 us          3756.9 us ( 0.15)       3750.9 us ( 0.31)        3767.2 us (-0.12)      3760.0 us ( 0.06)       3754.1 us ( 0.22)
W      16   1M     8    CLAT AVG       266060.0 us        267470.0 us (-0.52)     265670.0 us ( 0.14)      268080.0 us (-0.75)    266620.0 us (-0.21)     265300.0 us ( 0.28)
R      16   1M     8    CLAT AVG       145030.0 us        144400.0 us ( 0.43)     143680.0 us ( 0.93)      143750.0 us ( 0.88)    143940.0 us ( 0.75)     143480.0 us ( 1.06)
RW     16  64K    16    CLAT AVG        36087.5 us         35981.8 us ( 0.29)      36200.0 us (-0.31)       36247.3 us (-0.44)     36220.0 us (-0.36)      35888.9 us ( 0.55)
RR     16  64K    16    CLAT AVG        27791.8 us         28267.3 us (-1.71)      28067.7 us (-0.99)       27846.0 us (-0.19)     28083.0 us (-1.04)      27597.7 us ( 0.69)
RW     16   8K    32    CLAT AVG        11669.4 us         11690.4 us (-0.17)      11663.3 us ( 0.05)       11716.0 us (-0.39)     11738.0 us (-0.58)      11691.4 us (-0.18)
RW     16   8K    32    CLAT AVG         7534.0 us          7525.3 us ( 0.11)       7557.2 us (-0.30)        7557.1 us (-0.30)      7605.3 us (-0.94)       7575.3 us (-0.54)
W      32   1M     8    CLAT AVG       517640.0 us        518070.0 us (-0.08)     519450.0 us (-0.34)      519490.0 us (-0.35)    521610.0 us (-0.76)     517950.0 us (-0.05)
R      32   1M     8    CLAT AVG       281950.0 us        281940.0 us ( 0.00)     281430.0 us ( 0.18)      280250.0 us ( 0.60)    280790.0 us ( 0.41)     280750.0 us ( 0.42)
RW     32  64K    16    CLAT AVG        72265.2 us         72969.8 us (-0.97)      72185.4 us ( 0.11)       72027.0 us ( 0.32)     72359.4 us (-0.13)      71848.5 us ( 0.57)
RR     32  64K    16    CLAT AVG        55350.0 us         56110.0 us (-1.37)      56180.0 us (-1.49)       56214.0 us (-1.56)     55910.0 us (-1.01)      55690.0 us (-0.61)
RW     32   8K    32    CLAT AVG        23445.2 us         23512.2 us (-0.28)      23578.3 us (-0.56)       23490.0 us (-0.19)     23823.0 us (-1.61)      23447.4 us ( 0.00)
RW     32   8K    32    CLAT AVG        15519.2 us         15158.3 us ( 2.32)      15156.3 us ( 2.33)       15168.5 us ( 2.26)     15288.6 us ( 1.48)      15310.9 us ( 1.34)
W      64   1M     8    CLAT AVG      1034840.0 us       1030150.0 us ( 0.45)    1033850.0 us ( 0.09)     1033950.0 us ( 0.08)   1039190.0 us (-0.42)    1032740.0 us ( 0.20)
R      64   1M     8    CLAT AVG       562710.0 us        562160.0 us ( 0.09)     559810.0 us ( 0.51)      559830.0 us ( 0.51)    560810.0 us ( 0.33)     559690.0 us ( 0.53)
RW     64  64K    16    CLAT AVG       146513.6 us        144907.3 us ( 1.09)     144747.3 us ( 1.20)      145770.1 us ( 0.50)    146147.9 us ( 0.24)     143580.4 us ( 2.00)
RR     64  64K    16    CLAT AVG       110708.2 us        111530.0 us (-0.74)     112546.9 us (-1.66)      110910.0 us (-0.18)    111850.0 us (-1.03)     111140.0 us (-0.39)
RW     64   8K    32    CLAT AVG        47473.1 us         47509.3 us (-0.07)      47066.9 us ( 0.85)       47836.0 us (-0.76)     47198.1 us ( 0.57)      46871.4 us ( 1.26)
RW     64   8K    32    CLAT AVG        30552.4 us         30738.8 us (-0.61)      30633.6 us (-0.26)       30719.6 us (-0.54)     30490.4 us ( 0.20)      30637.7 us (-0.27)
W     128   1M     8    CLAT AVG      2044160.0 us       2041810.0 us ( 0.11)    2051730.0 us (-0.37)     2049560.0 us (-0.26)   2061340.0 us (-0.84)    2044010.0 us ( 0.00)
R     128   1M     8    CLAT AVG      1119150.0 us       1117720.0 us ( 0.12)    1114750.0 us ( 0.39)     1116290.0 us ( 0.25)   1115630.0 us ( 0.31)    1113980.0 us ( 0.46)
RW    128  64K    16    CLAT AVG       288304.8 us        288279.7 us ( 0.00)     289349.5 us (-0.36)      292048.8 us (-1.29)    291750.7 us (-1.19)     290124.2 us (-0.63)
RR    128  64K    16    CLAT AVG       222423.9 us        221240.0 us ( 0.53)     223699.9 us (-0.57)      223371.5 us (-0.42)    221542.7 us ( 0.39)     221900.0 us ( 0.23)
RW    128   8K    32    CLAT AVG        95877.2 us         96186.1 us (-0.32)      95485.6 us ( 0.40)       95441.3 us ( 0.45)     94586.7 us ( 1.34)      95787.5 us ( 0.09)
RW    128   8K    32    CLAT AVG        61548.4 us         60920.6 us ( 1.02)      61969.3 us (-0.68)       61629.1 us (-0.13)     61494.1 us ( 0.08)      61710.7 us (-0.26)
W     256   1M     8    CLAT AVG      4025240.0 us       4007330.0 us ( 0.44)    4017590.0 us ( 0.19)     4011060.0 us ( 0.35)   4015110.0 us ( 0.25)    4039310.0 us (-0.34)
R     256   1M     8    CLAT AVG      2212780.0 us       2212620.0 us ( 0.00)    2204660.0 us ( 0.36)     2204950.0 us ( 0.35)   2198270.0 us ( 0.65)    2203630.0 us ( 0.41)
RW    256  64K    16    CLAT AVG       576700.0 us        579253.3 us (-0.44)     577130.0 us (-0.07)      582507.9 us (-1.00)    582890.8 us (-1.07)     574580.0 us ( 0.36)
RR    256  64K    16    CLAT AVG       442112.2 us        441814.6 us ( 0.06)     448260.0 us (-1.39)      442812.3 us (-0.15)    440640.0 us ( 0.33)     448895.9 us (-1.53)
RW    256   8K    32    CLAT AVG       195770.5 us        193251.8 us ( 1.28)     195745.9 us ( 0.01)      194781.9 us ( 0.50)    195788.0 us ( 0.00)     195597.5 us ( 0.08)
RW    256   8K    32    CLAT AVG       124930.0 us        123075.7 us ( 1.48)     121805.7 us ( 2.50)      122662.4 us ( 1.81)    122272.8 us ( 2.12)     122484.0 us ( 1.95)
W     512   1M     8    CLAT AVG      7721700.0 us       7690690.0 us ( 0.40)    7698370.0 us ( 0.30)     7706670.0 us ( 0.19)   7769900.0 us (-0.62)    7692130.0 us ( 0.38)
R     512   1M     8    CLAT AVG      4331980.0 us       4334810.0 us (-0.06)    4311790.0 us ( 0.46)     4314320.0 us ( 0.40)   4309900.0 us ( 0.50)    4310660.0 us ( 0.49)
RW    512  64K    16    CLAT AVG      1158230.7 us       1153641.0 us ( 0.39)    1161630.0 us (-0.29)     1144103.0 us ( 1.21)   1147850.0 us ( 0.89)    1148622.1 us ( 0.82)
RR    512  64K    16    CLAT AVG       885170.0 us        896058.4 us (-1.23)     880356.7 us ( 0.54)      879002.1 us ( 0.69)    880330.0 us ( 0.54)     883352.1 us ( 0.20)
RW    512   8K    32    CLAT AVG       386552.5 us        387431.7 us (-0.22)     398254.8 us (-3.02)      388773.6 us (-0.57)    393415.3 us (-1.77)     408470.4 us (-5.67)
RW    512   8K    32    CLAT AVG       245014.4 us        242930.3 us ( 0.85)     245214.0 us (-0.08)      243833.3 us ( 0.48)    246829.0 us (-0.74)     247084.1 us (-0.84)

==================================================================================================================================

TEST Threads BS  IODEPTH Metric       NPS4-6.4.0-rc1              CPU                     SMT                    CACHE                   NUMA                    SYSTEM
W       1   1M     8    CLAT  99        22414.0 us         23725.0 us (-5.84)      22152.0 us ( 1.16)      22414.0 us ( 0.00)      22414.0 us ( 0.00)      21103.0 us ( 5.84)
R       1   1M     8    CLAT  99        13960.0 us         14353.0 us (-2.81)      13960.0 us ( 0.00)      13698.0 us ( 1.87)      14353.0 us (-2.81)      13829.0 us ( 0.93)
RW      1  64K    16    CLAT  99         3064.0 us          2966.0 us ( 3.19)       2802.0 us ( 8.55)       2769.0 us ( 9.62)       2966.0 us ( 3.19)       2999.0 us ( 2.12)
RR      1  64K    16    CLAT  99         4146.0 us          4293.0 us (-3.54)       4178.0 us (-0.77)       4178.0 us (-0.77)       4228.0 us (-1.97)       4293.0 us (-3.54)
RW      1   8K    32    CLAT  99         1287.0 us          1352.0 us (-5.05)       1237.0 us ( 3.88)      1500.0 us (-16.55)       1369.0 us (-6.37)      1467.0 us (-13.98)
RW      1   8K    32    CLAT  99          766.0 us           783.0 us (-2.21)        766.0 us ( 0.00)        766.0 us ( 0.00)        766.0 us ( 0.00)        766.0 us ( 0.00)
W       2   1M     8    CLAT  99        43254.0 us         43254.0 us ( 0.00)      42730.0 us ( 1.21)      43779.0 us (-1.21)      43779.0 us (-1.21)      45876.0 us (-6.06)
R       2   1M     8    CLAT  99        26608.0 us         26608.0 us ( 0.00)      27919.0 us (-4.92)      25822.0 us ( 2.95)      26346.0 us ( 0.98)      26346.0 us ( 0.98)
RW      2  64K    16    CLAT  99         5997.0 us          5604.0 us ( 6.55)       5473.0 us ( 8.73)       5538.0 us ( 7.65)       5735.0 us ( 4.36)       5604.0 us ( 6.55)
RR      2  64K    16    CLAT  99         6390.0 us          6521.0 us (-2.05)       6587.0 us (-3.08)       6521.0 us (-2.05)       6456.0 us (-1.03)       6456.0 us (-1.03)
RW      2   8K    32    CLAT  99         3490.0 us          3523.0 us (-0.94)       3556.0 us (-1.89)       3523.0 us (-0.94)       3490.0 us ( 0.00)       3523.0 us (-0.94)
RW      2   8K    32    CLAT  99         2147.0 us          2278.0 us (-6.10)       2278.0 us (-6.10)       2278.0 us (-6.10)       2311.0 us (-7.63)       2343.0 us (-9.12)
W       4   1M     8    CLAT  99        89000.0 us         90000.0 us (-1.12)    100000.0 us (-12.35)      91000.0 us (-2.24)      92000.0 us (-3.37)    109000.0 us (-22.47)
R       4   1M     8    CLAT  99        57934.0 us         45351.0 us (21.71)      54789.0 us ( 5.42)      52167.0 us ( 9.95)      57410.0 us ( 0.90)      54264.0 us ( 6.33)
RW      4  64K    16    CLAT  99        23462.0 us         22938.0 us ( 2.23)      23200.0 us ( 1.11)      23200.0 us ( 1.11)      23200.0 us ( 1.11)      23462.0 us ( 0.00)
RR      4  64K    16    CLAT  99        17695.0 us         17957.0 us (-1.48)      17695.0 us ( 0.00)      17695.0 us ( 0.00)      17695.0 us ( 0.00)      17695.0 us ( 0.00)
RW      4   8K    32    CLAT  99         6194.0 us          6325.0 us (-2.11)       6325.0 us (-2.11)       6325.0 us (-2.11)       6194.0 us ( 0.00)       6259.0 us (-1.04)
RW      4   8K    32    CLAT  99         3916.0 us          3884.0 us ( 0.81)       3916.0 us ( 0.00)       3884.0 us ( 0.81)       3851.0 us ( 1.65)       3949.0 us (-0.84)
W       8   1M     8    CLAT  99       213000.0 us       309000.0 us (-45.07)    309000.0 us (-45.07)    292000.0 us (-37.08)    305000.0 us (-43.19)    288000.0 us (-35.21)
R       8   1M     8    CLAT  99       148000.0 us       169000.0 us (-14.18)    163000.0 us (-10.13)     161000.0 us (-8.78)      84000.0 us (43.24)     155000.0 us (-4.72)
RW      8  64K    16    CLAT  99        42730.0 us         42730.0 us ( 0.00)      42730.0 us ( 0.00)      42730.0 us ( 0.00)      42730.0 us ( 0.00)      42730.0 us ( 0.00)
RR      8  64K    16    CLAT  99        32900.0 us         33162.0 us (-0.79)      32900.0 us ( 0.00)      33817.0 us (-2.78)      33817.0 us (-2.78)      33162.0 us (-0.79)
RW      8   8K    32    CLAT  99        11731.0 us         11731.0 us ( 0.00)      11863.0 us (-1.12)      11600.0 us ( 1.11)      11731.0 us ( 0.00)      11731.0 us ( 0.00)
RW      8   8K    32    CLAT  99         7177.0 us          7242.0 us (-0.90)       7242.0 us (-0.90)       7242.0 us (-0.90)       7242.0 us (-0.90)       7242.0 us (-0.90)
W      16   1M     8    CLAT  99       667000.0 us        676000.0 us (-1.34)     667000.0 us ( 0.00)     718000.0 us (-7.64)     676000.0 us (-1.34)     684000.0 us (-2.54)
R      16   1M     8    CLAT  99       384000.0 us        363000.0 us ( 5.46)     359000.0 us ( 6.51)     363000.0 us ( 5.46)     372000.0 us ( 3.12)     355000.0 us ( 7.55)
RW     16  64K    16    CLAT  99        62653.0 us         61604.0 us ( 1.67)      63177.0 us (-0.83)      63177.0 us (-0.83)      64226.0 us (-2.51)      62653.0 us ( 0.00)
RR     16  64K    16    CLAT  99        48497.0 us         49546.0 us (-2.16)      49021.0 us (-1.08)      48497.0 us ( 0.00)      50594.0 us (-4.32)      47973.0 us ( 1.08)
RW     16   8K    32    CLAT  99        21000.0 us         21000.0 us ( 0.00)      20579.0 us ( 2.00)      21000.0 us ( 0.00)      21000.0 us ( 0.00)      20579.0 us ( 2.00)
RW     16   8K    32    CLAT  99        12518.0 us         12518.0 us ( 0.00)      12387.0 us ( 1.04)      12387.0 us ( 1.04)      12387.0 us ( 1.04)      12649.0 us (-1.04)
W      32   1M     8    CLAT  99       877000.0 us        927000.0 us (-5.70)     894000.0 us (-1.93)     902000.0 us (-2.85)    969000.0 us (-10.49)     877000.0 us ( 0.00)
R      32   1M     8    CLAT  99       477000.0 us        477000.0 us ( 0.00)     468000.0 us ( 1.88)     464000.0 us ( 2.72)     472000.0 us ( 1.04)     485000.0 us (-1.67)
RW     32  64K    16    CLAT  99       102000.0 us        104000.0 us (-1.96)     102000.0 us ( 0.00)     102000.0 us ( 0.00)     105000.0 us (-2.94)     101000.0 us ( 0.98)
RR     32  64K    16    CLAT  99        78119.0 us         79168.0 us (-1.34)      79168.0 us (-1.34)      79168.0 us (-1.34)      82000.0 us (-4.96)      78119.0 us ( 0.00)
RW     32   8K    32    CLAT  99        43000.0 us         44000.0 us (-2.32)      41000.0 us ( 4.65)      38000.0 us (11.62)      40000.0 us ( 6.97)      37000.0 us (13.95)
RW     32   8K    32    CLAT  99        23987.0 us         22414.0 us ( 6.55)      21890.0 us ( 8.74)      22152.0 us ( 7.64)      22938.0 us ( 4.37)      23200.0 us ( 3.28)
W      64   1M     8    CLAT  99      1636000.0 us       1620000.0 us ( 0.97)    1703000.0 us (-4.09)    1620000.0 us ( 0.97)    1737000.0 us (-6.17)    1687000.0 us (-3.11)
R      64   1M     8    CLAT  99       927000.0 us        885000.0 us ( 4.53)     944000.0 us (-1.83)     894000.0 us ( 3.55)     927000.0 us ( 0.00)     936000.0 us (-0.97)
RW     64  64K    16    CLAT  99       199000.0 us        197000.0 us ( 1.00)     197000.0 us ( 1.00)     205000.0 us (-3.01)     207000.0 us (-4.02)     184000.0 us ( 7.53)
RR     64  64K    16    CLAT  99       157000.0 us        159000.0 us (-1.27)     161000.0 us (-2.54)     159000.0 us (-1.27)     163000.0 us (-3.82)     155000.0 us ( 1.27)
RW     64   8K    32    CLAT  99       100000.0 us        101000.0 us (-1.00)      94000.0 us ( 6.00)      94000.0 us ( 6.00)      88000.0 us (12.00)      87000.0 us (13.00)
RW     64   8K    32    CLAT  99        47449.0 us         47973.0 us (-1.10)      46924.0 us ( 1.10)      47973.0 us (-1.10)      45876.0 us ( 3.31)      47449.0 us ( 0.00)
W     128   1M     8    CLAT  99      2903000.0 us       3004000.0 us (-3.47)    3171000.0 us (-9.23)    3071000.0 us (-5.78)    3138000.0 us (-8.09)    2970000.0 us (-2.30)
R     128   1M     8    CLAT  99      1854000.0 us       1770000.0 us ( 4.53)    1720000.0 us ( 7.22)    1737000.0 us ( 6.31)    1687000.0 us ( 9.00)    1905000.0 us (-2.75)
RW    128  64K    16    CLAT  99       422000.0 us        363000.0 us (13.98)     376000.0 us (10.90)     418000.0 us ( 0.94)     430000.0 us (-1.89)     405000.0 us ( 4.02)
RR    128  64K    16    CLAT  99       347000.0 us        342000.0 us ( 1.44)     363000.0 us (-4.61)     355000.0 us (-2.30)     342000.0 us ( 1.44)     355000.0 us (-2.30)
RW    128   8K    32    CLAT  99       201000.0 us        203000.0 us (-0.99)     194000.0 us ( 3.48)     190000.0 us ( 5.47)     192000.0 us ( 4.47)     199000.0 us ( 0.99)
RW    128   8K    32    CLAT  99       101000.0 us         95000.0 us ( 5.94)     101000.0 us ( 0.00)     101000.0 us ( 0.00)     101000.0 us ( 0.00)     101000.0 us ( 0.00)
W     256   1M     8    CLAT  99      5671000.0 us       5738000.0 us (-1.18)    5336000.0 us ( 5.90)    5537000.0 us ( 2.36)    5805000.0 us (-2.36)    5805000.0 us (-2.36)
R     256   1M     8    CLAT  99      2970000.0 us       3037000.0 us (-2.25)    2903000.0 us ( 2.25)    3171000.0 us (-6.76)   3440000.0 us (-15.82)    3004000.0 us (-1.14)
RW    256  64K    16    CLAT  99       793000.0 us        760000.0 us ( 4.16)     676000.0 us (14.75)     835000.0 us (-5.29)     869000.0 us (-9.58)     735000.0 us ( 7.31)
RR    256  64K    16    CLAT  99       760000.0 us        651000.0 us (14.34)     667000.0 us (12.23)     827000.0 us (-8.81)     760000.0 us ( 0.00)     684000.0 us (10.00)
RW    256   8K    32    CLAT  99       435000.0 us        418000.0 us ( 3.90)     439000.0 us (-0.91)     439000.0 us (-0.91)     418000.0 us ( 3.90)     422000.0 us ( 2.98)
RW    256   8K    32    CLAT  99       211000.0 us        203000.0 us ( 3.79)     197000.0 us ( 6.63)     201000.0 us ( 4.73)     201000.0 us ( 4.73)     199000.0 us ( 5.68)
W     512   1M     8    CLAT  99     11476000.0 us      10671000.0 us ( 7.01)   10000000.0 us (12.86)   10939000.0 us ( 4.67)   10939000.0 us ( 4.67)   10939000.0 us ( 4.67)
R     512   1M     8    CLAT  99      6007000.0 us       5873000.0 us ( 2.23)    5738000.0 us ( 4.47)    5738000.0 us ( 4.47)    5873000.0 us ( 2.23)    5604000.0 us ( 6.70)
RW    512  64K    16    CLAT  99      1485000.0 us       1469000.0 us ( 1.07)    1536000.0 us (-3.43)    1435000.0 us ( 3.36)    1552000.0 us (-4.51)    1485000.0 us ( 0.00)
RR    512  64K    16    CLAT  99      1351000.0 us       1234000.0 us ( 8.66)    1301000.0 us ( 3.70)    1200000.0 us (11.17)    1267000.0 us ( 6.21)    1234000.0 us ( 8.66)
RW    512   8K    32    CLAT  99       793000.0 us        810000.0 us (-2.14)    885000.0 us (-11.60)     793000.0 us ( 0.00)     818000.0 us (-3.15)     827000.0 us (-4.28)
RW    512   8K    32    CLAT  99       414000.0 us        393000.0 us ( 5.07)     426000.0 us (-2.89)     397000.0 us ( 4.10)     414000.0 us ( 0.00)     418000.0 us (-0.96)

==================================================================================================================================
--
Thanks and Regards,
Swapnil

> If that doesn't crash, I'd love to hear how it affects the perf regressions
> reported over the past few months.
> 
> Thanks.
> 



* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-13  9:26     ` Pin-yen Lin
@ 2023-06-21 19:16       ` Tejun Heo
  2023-06-21 19:31         ` Linus Torvalds
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-06-21 19:16 UTC (permalink / raw)
  To: Pin-yen Lin
  Cc: Brian Norris, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, nhuck, agk, snitzer, void

Hello, Pin-yen.

On Tue, Jun 13, 2023 at 05:26:48PM +0800, Pin-yen Lin wrote:
...
> >  1. affinity_scope = cache, affinity_strict = 1
> >
> >  2. affinity_scope = cpu, affinity_strict = 0
> >
> >  3. affinity_scope = cpu, affinity_strict = 1
> 
> I pulled down v2 series and tried these settings on our 5.15 kernel.
> Unfortunately none of them showed significant improvement on the
> throughput. It's hard to tell which one is the best because of the
> noise, but the throughput is still all far from our 4.19 kernel or
> simply pinning everything to a single core.
> 
> All 4 settings (the 3 settings listed above plus the default) yield
> results between 90 and 120 Mbps, while pinning tasks to a single core
> consistently reaches >250 Mbps.

I find that perplexing given that switching to a per-cpu workqueue remedies
the situation quite a bit, which is how this patchset came to be. #3 is the
same as per-cpu workqueue, so if you're seeing noticeably different
performance numbers between #3 and per-cpu workqueue, there's something
wrong with either the code or test setup.

Also, if you have to pin to a single CPU or some subset of CPUs, you can just
set WQ_SYSFS for the workqueue and set affinities in its sysfs interface
instead of hard-coding the workaround for specific hardware.
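
A minimal sketch of the sysfs route, assuming the workqueue was created
with WQ_SYSFS and shows up under /sys/devices/virtual/workqueue/<name>/
with a "cpumask" attribute. The helper names and the workqueue name used
in the usage note are hypothetical. The mask is written as comma-separated
32-bit hex chunks, most significant chunk first.

```python
from pathlib import Path

# Assumed location of WQ_SYSFS workqueue attributes; pass another root
# if the attributes live elsewhere on your system.
WQ_SYSFS_ROOT = Path("/sys/devices/virtual/workqueue")


def cpus_to_mask(cpus):
    """Build a cpumask string with one bit set per CPU number in `cpus`."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    chunks = []
    while True:
        # Emit the low 32 bits as one zero-padded hex chunk.
        chunks.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
        if mask == 0:
            break
    # The most significant chunk goes first.
    return ",".join(reversed(chunks))


def pin_workqueue(name, cpus, root=WQ_SYSFS_ROOT):
    """Write the cpumask attribute of the workqueue called `name`."""
    attr = Path(root) / name / "cpumask"
    attr.write_text(cpus_to_mask(cpus) + "\n")
    return attr
```

For example, pin_workqueue("example_wq", [0, 1, 2, 3]) would restrict a
workqueue named "example_wq" (a made-up name) to CPUs 0-3. Writing to
sysfs needs root.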

Thanks.

-- 
tejun


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-21 19:16       ` Tejun Heo
@ 2023-06-21 19:31         ` Linus Torvalds
  2023-06-29  9:49           ` Pin-yen Lin
  0 siblings, 1 reply; 73+ messages in thread
From: Linus Torvalds @ 2023-06-21 19:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pin-yen Lin, Brian Norris, jiangshanlai, peterz, linux-kernel,
	kernel-team, joshdon, brho, nhuck, agk, snitzer, void

On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
>
> I find that perplexing given that switching to a per-cpu workqueue remedies
> the situation quite a bit, which is how this patchset came to be. #3 is the
> same as per-cpu workqueue, so if you're seeing noticeably different
> performance numbers between #3 and per-cpu workqueue, there's something
> wrong with either the code or test setup.

Or maybe there's some silly thinko in the wq code that is hidden by
the percpu code.

For example, WQ_UNBOUND triggers a lot of other overhead at least on
wq allocation and free. Maybe some of that stuff then indirectly
affects workqueue execution even when strict cpu affinity is set.

Pin-yen Lin - can you do a system-wide profile of the two cases (the
percpu case vs the "strict cpu affinity" one), to see if something
stands out?

             Linus


* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-14 18:49               ` Sandeep Dhavale
@ 2023-06-21 20:14                 ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-06-21 20:14 UTC (permalink / raw)
  To: Sandeep Dhavale
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void,
	kernel-team, Swapnil Sapkal, kprateek.nayak

Hello,

On Wed, Jun 14, 2023 at 11:49:53AM -0700, Sandeep Dhavale wrote:
> Thank you for your patches! I tested the affinity-scopes-v2 with app launch
> benchmarks. The numbers below are total scheduling latency for erofs kworkers
> and last column is with percpu highpri kthreads that is
> CONFIG_EROFS_FS_PCPU_KTHREAD=y
> CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI=y
> 
> Scheduling latency is the latency between when the task became eligible to run
> to when it actually started running. The test does 50 cold app launches for each
> and aggregates the numbers.
> 
> | Table        | Upstream | Cache nostrict | CPU nostrict | PCPU hpri |
> |--------------+----------+----------------+--------------+-----------|
> | Average (us) | 12286    | 7440           | 4435         | 2717      |
> | Median (us)  | 12528    | 3901           | 3258         | 2476      |
> | Minimum (us) | 287      | 555            | 638          | 357       |
> | Maximum (us) | 35600    | 35911          | 13364        | 6874      |
> | Stdev (us)   | 7918     | 7503           | 3323         | 1918      |
> |--------------+----------+----------------+--------------+-----------|
> 
> We see here that with affinity-scopes-v2 (which defaults to cache nostrict),
> there is a good improvement when compared to the current codebase.
> Affinity scope "CPU nostrict" for erofs workqueue has even better numbers
> for my test launches and it resembles logically to percpu highpri kthreads
> approach. Percpu highpri kthreads has the lowest latency and variation,
> probably down to running at higher priority as those threads are set to
> sched_set_fifo_low().

If you set the workqueue to CPU strict and set its nice value to -19 in the
sysfs interface, it should behave similarly to the hardcoded PCPU hpri. I'd
also love to see the comparison between strict and nostrict if possible.
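
A rough sketch of that configuration, assuming the workqueue has WQ_SYSFS
set and its directory exposes the "affinity_scope", "affinity_strict" and
"nice" attributes from this series. The function name and the directory
passed in are hypothetical.

```python
from pathlib import Path


def configure_workqueue(wq_dir, scope="cpu", strict=True, nice=-19):
    """Write affinity scope, strictness and nice value for one workqueue.

    `wq_dir` is the workqueue's sysfs directory, assumed to look like
    /sys/devices/virtual/workqueue/<name>.
    """
    wq_dir = Path(wq_dir)
    settings = {
        "affinity_scope": scope,
        "affinity_strict": "1" if strict else "0",
        "nice": str(nice),
    }
    for attr, value in settings.items():
        # One value per attribute file, newline-terminated.
        (wq_dir / attr).write_text(value + "\n")
    return settings
```

Calling it again with strict=False gives the nostrict variant for the
comparison.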

> At high level, the app launch numbers itself improved with your series as
> entire workqueue subsystem improved across the board.

Glad to hear.

Thanks.

-- 
tejun


* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-19  4:30             ` Swapnil Sapkal
@ 2023-06-21 20:38               ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-06-21 20:38 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: K Prateek Nayak, Sandeep Dhavale, jiangshanlai, torvalds, peterz,
	linux-kernel, kernel-team, joshdon, brho, briannorris, nhuck,
	agk, snitzer, void, kernel-team

Hello, Swapnil.

On Mon, Jun 19, 2023 at 10:00:33AM +0530, Swapnil Sapkal wrote:
...
> Thanks for the patchset. I tested the patchset with fiotests.
> Tests were run on a dual socket 3rd Generation EPYC server(2 x64C/128T)
> with NPS1, NPS2 and NPS4 modes.

Can you elaborate or point me to a doc explaining the differences between
NPS1, 2 and 4? My feeble attempt at googling didn't lead to anything useful.
What's the test doing and how long are they running?

> With affinity-scopes-v2, below are the observations:
> BW, LAT AVG and CLAT AVG show improvement with some combinations
> of the params in NPS1 and NPS2 while all other combinations of params
> show no loss or gain in the performance. Those combinations showing
> improvement are marked with ### and those showing drop in performance
> are marked with ***. CLAT 99 shows mixed results in all the NPS modes.
> SLAT 99 is suffering tremendously in all NPS modes.

Lower thread count tests showing larger variance is consistent with my
experience. Sometimes the scheduling and its interaction with workload seems
to exhibit bi(or higher degree)-modal behaviors and the swings get a lot
more severe when clock boosting is involved.

Outside of that tho, I'm having a difficult time interpreting the results.
It's definitely possible that I made some mistakes but in theory NUMA should
behave about the same as before the patchset, which seems to hold for most
of the results but there are some striking outliers.

So, here's a suggestion. How about we pick two scenarios, one where CACHE is
doing better and one worse, and then run those two specific scenarios
multiple times and see how consistent the results are?
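
One way to judge consistency, as a hedged sketch: repeat each scenario N
times, then compare the difference in means against the run-to-run spread.
The function names and the threshold k=2 are arbitrary assumptions, not an
established methodology.

```python
import statistics


def summarize_runs(samples):
    """Return mean, sample stdev and coefficient of variation (percent)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    cv_pct = 100.0 * stdev / mean if mean else float("nan")
    return {"mean": mean, "stdev": stdev, "cv_pct": cv_pct}


def delta_exceeds_noise(base_runs, test_runs, k=2.0):
    """True if the mean difference is larger than k times the bigger stdev."""
    base = summarize_runs(base_runs)
    test = summarize_runs(test_runs)
    noise = max(base["stdev"], test["stdev"])
    return abs(test["mean"] - base["mean"]) > k * noise
```

If delta_exceeds_noise() is False for a scenario, the outlier in a single
run is probably within the noise and not a real effect of the scope
setting.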

Thanks.

-- 
tejun


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-21 19:31         ` Linus Torvalds
@ 2023-06-29  9:49           ` Pin-yen Lin
  2023-07-03 21:47             ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: Pin-yen Lin @ 2023-06-29  9:49 UTC (permalink / raw)
  To: Linus Torvalds, Tejun Heo
  Cc: Brian Norris, jiangshanlai, peterz, linux-kernel, kernel-team,
	joshdon, brho, nhuck, agk, snitzer, void

Hi Linus and Tejun,

On Thu, Jun 22, 2023 at 3:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
> >
> > I find that perplexing given that switching to a per-cpu workqueue remedies
> > the situation quite a bit, which is how this patchset came to be. #3 is the
> > same as per-cpu workqueue, so if you're seeing noticeably different
> > performance numbers between #3 and per-cpu workqueue, there's something
> > wrong with either the code or test setup.
>
In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
better results. But given that pinning tasks to a single CPU core
helps, we thought that the regression is related to the behavior of
WQ_UNBOUND. Our findings are listed in [1].

We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
thanks for the suggestion.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

> Or maybe there's some silly thinko in the wq code that is hidden by
> the percpu code.
>
> For example, WQ_UNBOUND triggers a lot of other overhead at least on
> wq allocation and free. Maybe some of that stuff then indirectly
> affects workqueue execution even when strict cpu affinity is set.
>
> Pin-yen Lin - can you do a system-wide profile of the two cases (the
> percpu case vs the "strict cpu affinity" one), to see if something
> stands out?

The two actually have similar performance, so I guess the profiling
is not interesting for you. Please let me know if you want to see any
data and I'll be happy to collect it and post it here.

Best regards,
Pin-yen
>
>              Linus


* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-06-29  9:49           ` Pin-yen Lin
@ 2023-07-03 21:47             ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-07-03 21:47 UTC (permalink / raw)
  To: Pin-yen Lin
  Cc: Linus Torvalds, Brian Norris, jiangshanlai, peterz, linux-kernel,
	kernel-team, joshdon, brho, nhuck, agk, snitzer, void

Hello, Pin-yen.

On Thu, Jun 29, 2023 at 05:49:27PM +0800, Pin-yen Lin wrote:
> > > I find that perplexing given that switching to a per-cpu workqueue remedies
> > > the situation quite a bit, which is how this patchset came to be. #3 is the
> > > same as per-cpu workqueue, so if you're seeing noticeably different
> > > performance numbers between #3 and per-cpu workqueue, there's something
> > > wrong with either the code or test setup.
> >
> In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
> better results. But given that pinning tasks to a single CPU core
> helps, we thought that the regression is related to the behavior of
> WQ_UNBOUND. Our findings are listed in [1].

I see.

> We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
> thanks for the suggestion.

Yeah, I have no idea why there's such a drastic regression and the only way
to alleviate that is pinning the execution to a single CPU, which is also
different from other reports. It seems plausible that there are some
scheduling behavior changes and that's interacting negatively with power
saving but I'm sure that's the first thing you guys looked into.

From workqueue's POV, I'm afraid using WQ_SYSFS and pinning it as needed
seems like about what can be done for your specific case.

Thanks.

-- 
tejun


* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-06-08 22:50           ` Tejun Heo
  2023-06-09  3:43             ` K Prateek Nayak
  2023-06-19  4:30             ` Swapnil Sapkal
@ 2023-07-05  7:04             ` K Prateek Nayak
  2023-07-05 18:39               ` Tejun Heo
  2 siblings, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-07-05  7:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello Tejun,

On 6/9/2023 4:20 AM, Tejun Heo wrote:
> [..snip..]
> 
> Can you please test the following branch? It should have
> both bugs fixed properly.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
> 
> If that doesn't crash, I'd love to hear how it affects the perf regressions
> reported over the past few months.

Sorry about the delay. I'll leave the detailed results of the testing below.
The results are from a dual socket 3rd Generation EPYC system (2 x 64C/128T).

tl;dr

- Apart from tbench and netperf, the rest of the benchmarks show no
  difference out of the box.

- SPECjbb2015 Multi-JVM sees a small uplift in max-jOPS with certain
  affinity scopes.

- tbench and netperf seem to be unhappy throughout. None of the affinity
  scopes seem to bring back the performance. I'll dig more into this.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
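
The CPU numbering in each NPS mode follows a regular pattern, which the
sketch below spells out. It assumes that cores are numbered contiguously
across sockets and that each core's SMT sibling sits at an offset equal to
the total core count (128 on this machine). The function name is made up
for illustration.

```python
def nps_nodes(nps, sockets=2, cores_per_socket=64):
    """Return one ((first, last), (smt_first, smt_last)) pair per NUMA node.

    `nps` is the number of NUMA nodes each socket is divided into.
    """
    total_cores = sockets * cores_per_socket
    cores_per_node = cores_per_socket // nps
    nodes = []
    for node in range(sockets * nps):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        # SMT siblings are assumed to mirror the core range, shifted
        # by the total number of cores.
        nodes.append(((first, last), (first + total_cores, last + total_cores)))
    return nodes


for mode in (1, 2, 4):
    print(f"NPS{mode}:")
    for idx, (cores, smt) in enumerate(nps_nodes(mode)):
        print(f"    Node {idx}: {cores[0]}-{cores[1]}, {smt[0]}-{smt[1]}")
```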

Benchmark Results:

Kernel versions:
- base:             affinity-scopes-v2 branch at
                    commit 18c8ae813156 ("workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0")

- affinity_scopes:  affinity-scopes-v2 branch at
                    commit a4da9f618d3e ("workqueue: Add "Affinity Scopes and Performance" section to documentation")
                    running with the default affinity scope.

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			base		   affinity_scopes
 1-groups:	   0.00 (0.00 pct)	   3.68 (0.00 pct)
 2-groups:	   4.41 (0.00 pct)	   4.40 (0.22 pct)
 4-groups:	   4.91 (0.00 pct)	   4.87 (0.81 pct)
 8-groups:	   5.64 (0.00 pct)	   5.74 (-1.77 pct)
16-groups:	   7.72 (0.00 pct)	   7.54 (2.33 pct)

o NPS2

Test:			base		   affinity_scopes
 1-groups:	   3.74 (0.00 pct)	   3.85 (-2.94 pct)
 2-groups:	   4.38 (0.00 pct)	   4.34 (0.91 pct)
 4-groups:	   4.87 (0.00 pct)	   4.80 (1.43 pct)
 8-groups:	   5.42 (0.00 pct)	   5.40 (0.36 pct)
16-groups:	   6.75 (0.00 pct)	   7.02 (-4.00 pct)

o NPS4

Test:			base		   affinity_scopes
 1-groups:	   3.90 (0.00 pct)	   3.84 (1.53 pct)
 2-groups:	   4.40 (0.00 pct)	   4.39 (0.22 pct)
 4-groups:	   4.86 (0.00 pct)	   4.85 (0.20 pct)
 8-groups:	   5.44 (0.00 pct)	   5.44 (0.00 pct)
16-groups:	   7.20 (0.00 pct)	   7.08 (1.66 pct)
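
For reading the "pct" columns: a positive value means affinity_scopes did
better than base, whichever direction is better for that benchmark. The
sketch below is how such a figure can be derived. The function name is
hypothetical, and the printed tables appear to truncate to two decimals
rather than round, so the last digit may differ.

```python
def pct_change(base, new, higher_is_better):
    """Percent change relative to base, signed so positive = improvement."""
    if base == 0:
        raise ValueError("baseline must be non-zero")
    # For time or latency a smaller value is better, so flip the sign.
    delta = (new - base) if higher_is_better else (base - new)
    return 100.0 * delta / base


# hackbench reports time (lower is better): NPS1, 8-groups row.
print(f"{pct_change(5.64, 5.74, higher_is_better=False):.2f} pct")
```

Throughput benchmarks use the same helper with higher_is_better=True.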

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	base		  affinity_scopes
  1:	  26.00 (0.00 pct)	  26.00 (0.00 pct)
  2:	  26.00 (0.00 pct)	  28.00 (-7.69 pct)
  4:	  31.00 (0.00 pct)	  28.00 (9.67 pct)
  8:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
 16:	  49.00 (0.00 pct)	  47.00 (4.08 pct)
 32:	  78.00 (0.00 pct)	  81.00 (-3.84 pct)
 64:	 170.00 (0.00 pct)	 173.00 (-1.76 pct)
128:	 369.00 (0.00 pct)	 344.00 (6.77 pct)
256:	 49600.00 (0.00 pct)	 48704.00 (1.80 pct)
512:	 93568.00 (0.00 pct)	 93824.00 (-0.27 pct)

o NPS2

#workers:	base		  affinity_scopes
  1:	  24.00 (0.00 pct)	  23.00 (4.16 pct)
  2:	  29.00 (0.00 pct)	  25.00 (13.79 pct)
  4:	  31.00 (0.00 pct)	  32.00 (-3.22 pct)
  8:	  43.00 (0.00 pct)	  39.00 (9.30 pct)
 16:	  52.00 (0.00 pct)	  52.00 (0.00 pct)
 32:	  82.00 (0.00 pct)	  89.00 (-8.53 pct)
 64:	 179.00 (0.00 pct)	 154.00 (13.96 pct)
128:	 400.00 (0.00 pct)	 360.00 (10.00 pct)
256:	 49856.00 (0.00 pct)	 48576.00 (2.56 pct)
512:	 93056.00 (0.00 pct)	 91520.00 (1.65 pct)

o NPS4

#workers:	base		  affinity_scopes
  1:	  25.00 (0.00 pct)	  22.00 (12.00 pct)
  2:	  26.00 (0.00 pct)	  27.00 (-3.84 pct)
  4:	  29.00 (0.00 pct)	  28.00 (3.44 pct)
  8:	  48.00 (0.00 pct)	  44.00 (8.33 pct)
 16:	  55.00 (0.00 pct)	  59.00 (-7.27 pct)
 32:	  88.00 (0.00 pct)	  84.00 (4.54 pct)
 64:	 166.00 (0.00 pct)	 173.00 (-4.21 pct)
128:	 374.00 (0.00 pct)	 368.00 (1.60 pct)
256:	 49600.00 (0.00 pct)	 49856.00 (-0.51 pct)
512:	 93824.00 (0.00 pct)	 93568.00 (0.27 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	base		  affinity_scopes
    1	 450.40 (0.00 pct)	 456.71 (1.40 pct)
    2	 872.50 (0.00 pct)	 882.38 (1.13 pct)
    4	 1630.13 (0.00 pct)	 1605.48 (-1.51 pct)
    8	 3139.90 (0.00 pct)	 3041.39 (-3.13 pct)
   16	 6113.51 (0.00 pct)	 5449.58 (-10.86 pct)
   32	 11024.64 (0.00 pct)	 9147.71 (-17.02 pct)
   64	 19081.96 (0.00 pct)	 14843.46 (-22.21 pct)
  128	 30956.07 (0.00 pct)	 27493.35 (-11.18 pct)
  256	 42829.46 (0.00 pct)	 36913.54 (-13.81 pct)
  512	 42395.69 (0.00 pct)	 36165.41 (-14.69 pct)
 1024	 41973.51 (0.00 pct)	 38530.57 (-8.20 pct)

o NPS2

Clients:	base		  affinity_scopes
    1	 451.37 (0.00 pct)	 450.97 (-0.08 pct)
    2	 875.07 (0.00 pct)	 874.08 (-0.11 pct)
    4	 1636.31 (0.00 pct)	 1639.60 (0.20 pct)
    8	 3162.48 (0.00 pct)	 3074.73 (-2.77 pct)
   16	 5794.93 (0.00 pct)	 5502.22 (-5.05 pct)
   32	 11205.26 (0.00 pct)	 8979.27 (-19.86 pct)
   64	 20770.79 (0.00 pct)	 17151.10 (-17.42 pct)
  128	 30485.82 (0.00 pct)	 26953.16 (-11.58 pct)
  256	 40161.93 (0.00 pct)	 35892.11 (-10.63 pct)
  512	 44513.43 (0.00 pct)	 38876.31 (-12.66 pct)
 1024	 42781.13 (0.00 pct)	 38313.23 (-10.44 pct)

o NPS4

Clients:	base		  affinity_scopes
    1	 451.25 (0.00 pct)	 447.95 (-0.73 pct)
    2	 877.94 (0.00 pct)	 877.93 (0.00 pct)
    4	 1641.74 (0.00 pct)	 1653.17 (0.69 pct)
    8	 3140.87 (0.00 pct)	 3050.94 (-2.86 pct)
   16	 5878.87 (0.00 pct)	 5291.66 (-9.98 pct)
   32	 10910.11 (0.00 pct)	 9745.45 (-10.67 pct)
   64	 18814.62 (0.00 pct)	 16708.46 (-11.19 pct)
  128	 29238.49 (0.00 pct)	 27598.00 (-5.61 pct)
  256	 42119.54 (0.00 pct)	 38464.91 (-8.67 pct)
  512	 41645.81 (0.00 pct)	 40330.03 (-3.15 pct)
 1024	 41977.06 (0.00 pct)	 39540.55 (-5.80 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:		base		   affinity_scopes
 Copy:	 245676.59 (0.00 pct)	 333646.71 (35.80 pct)
Scale:	 206545.41 (0.00 pct)	 205706.04 (-0.40 pct)
  Add:	 213506.82 (0.00 pct)	 236739.07 (10.88 pct)
Triad:	 217679.43 (0.00 pct)	 249263.46 (14.50 pct)

- 100 Runs:

Test:		base		   affinity_scopes
 Copy:	 318060.91 (0.00 pct)	 326025.89 (2.50 pct)
Scale:	 213943.40 (0.00 pct)	 207647.37 (-2.94 pct)
  Add:	 237892.53 (0.00 pct)	 232164.59 (-2.40 pct)
Triad:	 245672.84 (0.00 pct)	 246333.21 (0.26 pct)

o NPS2

- 10 Runs:

Test:		base		   affinity_scopes
 Copy:	 296632.20 (0.00 pct)	 291153.63 (-1.84 pct)
Scale:	 206193.90 (0.00 pct)	 216368.42 (4.93 pct)
  Add:	 240363.50 (0.00 pct)	 245954.23 (2.32 pct)
Triad:	 242748.60 (0.00 pct)	 238606.20 (-1.70 pct)

- 100 Runs:

Test:		base		   affinity_scopes
 Copy:	 322535.79 (0.00 pct)	 315020.03 (-2.33 pct)
Scale:	 217723.56 (0.00 pct)	 220172.32 (1.12 pct)
  Add:	 248117.72 (0.00 pct)	 250557.17 (0.98 pct)
Triad:	 257768.66 (0.00 pct)	 248264.00 (-3.68 pct)

o NPS4

- 10 Runs:

Test:		base		   affinity_scopes
 Copy:	 274067.54 (0.00 pct)	 302804.77 (10.48 pct)
Scale:	 224944.53 (0.00 pct)	 230112.39 (2.29 pct)
  Add:	 229318.09 (0.00 pct)	 241939.54 (5.50 pct)
Triad:	 230175.89 (0.00 pct)	 253613.85 (10.18 pct)

- 100 Runs:

Test:		base		   affinity_scopes
 Copy:	 338922.96 (0.00 pct)	 348183.65 (2.73 pct)
Scale:	 240262.45 (0.00 pct)	 245939.67 (2.36 pct)
  Add:	 256968.24 (0.00 pct)	 260657.01 (1.43 pct)
Triad:	 262644.16 (0.00 pct)	 262286.46 (-0.13 pct)

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

Test:			base		   affinity_scopes
 1-clients:	 100910.82 (0.00 pct)	 102553.83 (1.62 pct)
 2-clients:	 99777.76 (0.00 pct)	 99390.14 (-0.38 pct)
 4-clients:	 97676.17 (0.00 pct)	 95856.63 (-1.86 pct)
 8-clients:	 95413.11 (0.00 pct)	 88801.05 (-6.92 pct)
16-clients:	 88961.66 (0.00 pct)	 78807.71 (-11.41 pct)
32-clients:	 82199.83 (0.00 pct)	 73372.46 (-10.73 pct)
64-clients:	 66094.87 (0.00 pct)	 58487.61 (-11.50 pct)
128-clients:	 43833.63 (0.00 pct)	 42005.47 (-4.17 pct)
256-clients:	 38917.58 (0.00 pct)	 22653.73 (-41.79 pct)

o NPS2

Test:			base		   affinity_scopes
 1-clients:	 101745.99 (0.00 pct)	 102703.66 (0.94 pct)
 2-clients:	 100013.62 (0.00 pct)	 99536.20 (-0.47 pct)
 4-clients:	 97124.42 (0.00 pct)	 95261.28 (-1.91 pct)
 8-clients:	 92110.60 (0.00 pct)	 87714.72 (-4.77 pct)
16-clients:	 84578.86 (0.00 pct)	 77329.65 (-8.57 pct)
32-clients:	 78272.91 (0.00 pct)	 72114.77 (-7.86 pct)
64-clients:	 61595.20 (0.00 pct)	 58001.87 (-5.83 pct)
128-clients:	 44119.18 (0.00 pct)	 40057.91 (-9.20 pct)
256-clients:	 36221.03 (0.00 pct)	 21468.40 (-40.72 pct)

o NPS4

Test:			base		   affinity_scopes
 1-clients:	 102711.93 (0.00 pct)	 103244.49 (0.51 pct)
 2-clients:	 101655.11 (0.00 pct)	 98764.88 (-2.84 pct)
 4-clients:	 98519.58 (0.00 pct)	 94439.88 (-4.14 pct)
 8-clients:	 94247.56 (0.00 pct)	 88618.17 (-5.97 pct)
16-clients:	 87515.03 (0.00 pct)	 82392.50 (-5.85 pct)
32-clients:	 81486.07 (0.00 pct)	 74022.13 (-9.15 pct)
64-clients:	 68436.64 (0.00 pct)	 60303.48 (-11.88 pct)
128-clients:	 49393.57 (0.00 pct)	 43924.74 (-11.07 pct)
256-clients:	 41111.27 (0.00 pct)	 27694.64 (-32.63 pct)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

					base		     affinity_scopes
Hmean     unixbench-dhry2reg-1      41194259.44 (   0.00%)    41044431.89 (  -0.36%)
Hmean     unixbench-dhry2reg-512  6252840065.42 (   0.00%)  6244309194.01 (  -0.14%)
Amean     unixbench-syscall-1        2534936.20 (   0.00%)     2517701.13 *   0.68%*
Amean     unixbench-syscall-512      8037812.87 (   0.00%)     7379945.23 *   8.18%*
Hmean     unixbench-pipe-1           2391449.08 (   0.00%)     2392275.16 (   0.03%)
Hmean     unixbench-pipe-512       340010431.31 (   0.00%)   339389300.96 (  -0.18%)
Hmean     unixbench-spawn-1             4471.68 (   0.00%)        4568.80 (   2.17%)
Hmean     unixbench-spawn-512          66246.39 (   0.00%)       62380.27 *  -5.84%*
Hmean     unixbench-execl-1             3695.11 (   0.00%)        3663.75 *  -0.85%*
Hmean     unixbench-execl-512          12526.29 (   0.00%)       11833.41 (  -5.53%)

o NPS2

					base		     affinity_scopes
Hmean     unixbench-dhry2reg-1      40812348.19 (   0.00%)    41044955.13 (   0.57%)
Hmean     unixbench-dhry2reg-512  6248963826.97 (   0.00%)  6244114150.91 (  -0.08%)
Amean     unixbench-syscall-1        2479433.67 (   0.00%)     2498544.70 (  -0.77%)
Amean     unixbench-syscall-512      8064530.47 (   0.00%)     8064139.93 (   0.00%)
Hmean     unixbench-pipe-1           2393194.62 (   0.00%)     2365328.39 (  -1.16%)
Hmean     unixbench-pipe-512       339553534.72 (   0.00%)   340930432.76 (   0.41%)
Hmean     unixbench-spawn-1             4777.52 (   0.00%)        4975.71 (   4.15%)
Hmean     unixbench-spawn-512          67467.26 (   0.00%)       63427.50 *  -5.99%*
Hmean     unixbench-execl-1             3640.89 (   0.00%)        3636.52 (  -0.12%)
Hmean     unixbench-execl-512          14182.44 (   0.00%)       13584.16 (  -4.22%)

o NPS4

					base		     affinity_scopes
Hmean     unixbench-dhry2reg-1      41075499.61 (   0.00%)    41222189.50 (   0.36%)
Hmean     unixbench-dhry2reg-512  6250307266.90 (   0.00%)  6251044709.08 (   0.01%)
Amean     unixbench-syscall-1        2538714.30 (   0.00%)     2521520.87 *   0.68%*
Amean     unixbench-syscall-512      7514126.30 (   0.00%)     7534175.47 (  -0.27%)
Hmean     unixbench-pipe-1           2393641.60 (   0.00%)     2379400.79 (  -0.59%)
Hmean     unixbench-pipe-512       339424173.78 (   0.00%)   341229694.29 *   0.53%*
Hmean     unixbench-spawn-1             5421.34 (   0.00%)        5556.23 (   2.49%)
Hmean     unixbench-spawn-512          64071.52 (   0.00%)       65783.47 *   2.67%*
Hmean     unixbench-execl-1             3629.56 (   0.00%)        3670.13 *   1.12%*
Hmean     unixbench-execl-512          13641.24 (   0.00%)       13848.81 (   1.52%)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

base:			298681.00 (var: 2.31%)
affinity_scopes		295106.33 (var: 2.22%) (-1.19%)

o NPS2:

base:			296570.00 (var: 1.01%)
affinity_scopes		298637.67 (var: 1.50%) (0.70%)

o NPS4:

base			297181.67 (var: 0.46%)
affinity_scopes		294253.33 (var: 0.80%) (-0.99%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1:

- 1 CCD

base:			1.00 (var: 0.14%)
affinity_scopes:	1.01 (var: 0.51%) (+1.19%)

- 2 CCD

base:			1.00 (var: 0.74%)
affinity_scopes:	0.99 (var: 0.47%) (-1.02%)

- 4 CCD

base:			1.00 (var: 0.33%)
affinity_scopes:	0.99 (var: 0.47%) (-0.95%)

- 8 CCD

base:			1.00 (var: 0.62%)
affinity_scopes:	0.99 (var: 2.30%) (-1.42%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Benchmarks run with multiple affinity scopes ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

- tbench

Clients:     base                     cpu                    cache                   numa                    system
    1    450.40 (0.00 pct)       459.44 (2.00 pct)       457.12 (1.49 pct)       456.36 (1.32 pct)       456.75 (1.40 pct)
    2    872.50 (0.00 pct)       869.68 (-0.32 pct)      890.59 (2.07 pct)       878.87 (0.73 pct)       890.14 (2.02 pct)
    4    1630.13 (0.00 pct)      1621.24 (-0.54 pct)     1634.74 (0.28 pct)      1628.62 (-0.09 pct)     1646.57 (1.00 pct)
    8    3139.90 (0.00 pct)      3044.58 (-3.03 pct)     3099.49 (-1.28 pct)     3081.43 (-1.86 pct)     3151.16 (0.35 pct)
   16    6113.51 (0.00 pct)      5555.17 (-9.13 pct)     5465.09 (-10.60 pct)    5661.31 (-7.39 pct)     5742.58 (-6.06 pct)
   32    11024.64 (0.00 pct)     9574.62 (-13.15 pct)    9282.62 (-15.80 pct)    9542.00 (-13.44 pct)    9916.66 (-10.05 pct)
   64    19081.96 (0.00 pct)     15656.53 (-17.95 pct)   15176.12 (-20.46 pct)   16527.77 (-13.38 pct)   15097.97 (-20.87 pct)
  128    30956.07 (0.00 pct)     28277.80 (-8.65 pct)    27662.76 (-10.63 pct)   27817.94 (-10.13 pct)   28925.78 (-6.55 pct)
  256    42829.46 (0.00 pct)     38646.48 (-9.76 pct)    38355.27 (-10.44 pct)   37073.24 (-13.43 pct)   34391.01 (-19.70 pct)
  512    42395.69 (0.00 pct)     36931.34 (-12.88 pct)   39259.49 (-7.39 pct)    36571.62 (-13.73 pct)   36245.55 (-14.50 pct)
 1024    41973.51 (0.00 pct)     38817.07 (-7.52 pct)    38733.15 (-7.72 pct)    38864.45 (-7.40 pct)    35728.70 (-14.87 pct)

- netperf

                        base                    cpu                     cache                   numa                    system
 1-clients:      100910.82 (0.00 pct)    103440.72 (2.50 pct)    102592.36 (1.66 pct)    103199.49 (2.26 pct)    103561.90 (2.62 pct)
 2-clients:      99777.76 (0.00 pct)     100414.00 (0.63 pct)    100305.89 (0.52 pct)    99890.90 (0.11 pct)     101512.46 (1.73 pct)
 4-clients:      97676.17 (0.00 pct)     96624.28 (-1.07 pct)    95966.77 (-1.75 pct)    97105.22 (-0.58 pct)    97972.11 (0.30 pct)
 8-clients:      95413.11 (0.00 pct)     89926.72 (-5.75 pct)    89977.14 (-5.69 pct)    91020.10 (-4.60 pct)    92458.94 (-3.09 pct)
16-clients:      88961.66 (0.00 pct)     81295.02 (-8.61 pct)    79144.83 (-11.03 pct)   80216.42 (-9.83 pct)    85439.68 (-3.95 pct)
32-clients:      82199.83 (0.00 pct)     77914.00 (-5.21 pct)    75055.66 (-8.69 pct)    76813.94 (-6.55 pct)    80768.87 (-1.74 pct)
64-clients:      66094.87 (0.00 pct)     64419.91 (-2.53 pct)    63718.37 (-3.59 pct)    60370.40 (-8.66 pct)    66179.58 (0.12 pct)
128-clients:     43833.63 (0.00 pct)     42936.08 (-2.04 pct)    44554.69 (1.64 pct)     42666.82 (-2.66 pct)    45543.69 (3.90 pct)
256-clients:     38917.58 (0.00 pct)     24807.28 (-36.25 pct)   20517.01 (-47.28 pct)   21651.40 (-44.36 pct)   23778.87 (-38.89 pct)

- SPECjbb2015 Multi-JVM

	       max-jOPS	     critical-jOPS
base:		 0.00%		 0.00%
smt:            -1.11%		-1.84%
cpu:             2.86%		-1.35%
cache:           2.86%		-1.66%
numa:            1.43%		-1.49%
system:          0.08%		-0.41%


I'll dig deeper into the tbench and netperf regressions. I'm not sure
why the regression is observed for all the affinity scopes. I'll look
into the IBS profile and see if something obvious pops up. Meanwhile, if
there is any specific data you would like me to collect or any benchmark
you would like me to test, let me know.

--
Thanks and Regards,
Prateek

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-07-05  7:04             ` K Prateek Nayak
@ 2023-07-05 18:39               ` Tejun Heo
  2023-07-11  3:02                 ` K Prateek Nayak
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-07-05 18:39 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello,

On Wed, Jul 05, 2023 at 12:34:48PM +0530, K Prateek Nayak wrote:
> - Apart from tbench and netperf, the rest of the benchmarks show no
>   difference out of the box.

Just looking at the data, it's a bit difficult for me to judge. I suppose
most of the differences are due to run-to-run variance? It'd be really
useful if the data included standard deviations (whether historical or
directly from multiple runs).

> - SPECjbb2015 Multi-jVM sees small uplift to max-jOPS with certain
>   affinity scopes.
> 
> - tbench and netperf seem to be unhappy throughout. None of the affinity
>   scopes seem to bring back the performance. I'll dig more into this.

Yeah, that seems pretty consistent.

> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> - 10 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 245676.59 (0.00 pct)	 333646.71 (35.80 pct)
> Scale:	 206545.41 (0.00 pct)	 205706.04 (-0.40 pct)
>   Add:	 213506.82 (0.00 pct)	 236739.07 (10.88 pct)
> Triad:	 217679.43 (0.00 pct)	 249263.46 (14.50 pct)
> 
> - 100 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 318060.91 (0.00 pct)	 326025.89 (2.50 pct)
> Scale:	 213943.40 (0.00 pct)	 207647.37 (-2.94 pct)
>   Add:	 237892.53 (0.00 pct)	 232164.59 (-2.40 pct)
> Triad:	 245672.84 (0.00 pct)	 246333.21 (0.26 pct)
> 
> o NPS2
> 
> - 10 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 296632.20 (0.00 pct)	 291153.63 (-1.84 pct)
> Scale:	 206193.90 (0.00 pct)	 216368.42 (4.93 pct)
>   Add:	 240363.50 (0.00 pct)	 245954.23 (2.32 pct)
> Triad:	 242748.60 (0.00 pct)	 238606.20 (-1.70 pct)
> 
> - 100 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 322535.79 (0.00 pct)	 315020.03 (-2.33 pct)
> Scale:	 217723.56 (0.00 pct)	 220172.32 (1.12 pct)
>   Add:	 248117.72 (0.00 pct)	 250557.17 (0.98 pct)
> Triad:	 257768.66 (0.00 pct)	 248264.00 (-3.68 pct)
> 
> o NPS4
> 
> - 10 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 274067.54 (0.00 pct)	 302804.77 (10.48 pct)
> Scale:	 224944.53 (0.00 pct)	 230112.39 (2.29 pct)
>   Add:	 229318.09 (0.00 pct)	 241939.54 (5.50 pct)
> Triad:	 230175.89 (0.00 pct)	 253613.85 (10.18 pct)
> 
> - 100 Runs:
> 
> Test:		base		   affinity_scopes
>  Copy:	 338922.96 (0.00 pct)	 348183.65 (2.73 pct)
> Scale:	 240262.45 (0.00 pct)	 245939.67 (2.36 pct)
>   Add:	 256968.24 (0.00 pct)	 260657.01 (1.43 pct)
> Triad:	 262644.16 (0.00 pct)	 262286.46 (-0.13 pct)

The differences seem more consistent and pronounced for this benchmark too.
Is this just expected variance for this benchmark?

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ Benchmarks run with multiple affinity scope ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> o NPS1
> 
> - tbench
> 
> Clients:     base                     cpu                    cache                   numa                    system
>     1    450.40 (0.00 pct)       459.44 (2.00 pct)       457.12 (1.49 pct)       456.36 (1.32 pct)       456.75 (1.40 pct)
>     2    872.50 (0.00 pct)       869.68 (-0.32 pct)      890.59 (2.07 pct)       878.87 (0.73 pct)       890.14 (2.02 pct)
>     4    1630.13 (0.00 pct)      1621.24 (-0.54 pct)     1634.74 (0.28 pct)      1628.62 (-0.09 pct)     1646.57 (1.00 pct)
>     8    3139.90 (0.00 pct)      3044.58 (-3.03 pct)     3099.49 (-1.28 pct)     3081.43 (-1.86 pct)     3151.16 (0.35 pct)
>    16    6113.51 (0.00 pct)      5555.17 (-9.13 pct)     5465.09 (-10.60 pct)    5661.31 (-7.39 pct)     5742.58 (-6.06 pct)
>    32    11024.64 (0.00 pct)     9574.62 (-13.15 pct)    9282.62 (-15.80 pct)    9542.00 (-13.44 pct)    9916.66 (-10.05 pct)
>    64    19081.96 (0.00 pct)     15656.53 (-17.95 pct)   15176.12 (-20.46 pct)   16527.77 (-13.38 pct)   15097.97 (-20.87 pct)
>   128    30956.07 (0.00 pct)     28277.80 (-8.65 pct)    27662.76 (-10.63 pct)   27817.94 (-10.13 pct)   28925.78 (-6.55 pct)
>   256    42829.46 (0.00 pct)     38646.48 (-9.76 pct)    38355.27 (-10.44 pct)   37073.24 (-13.43 pct)   34391.01 (-19.70 pct)
>   512    42395.69 (0.00 pct)     36931.34 (-12.88 pct)   39259.49 (-7.39 pct)    36571.62 (-13.73 pct)   36245.55 (-14.50 pct)
>  1024    41973.51 (0.00 pct)     38817.07 (-7.52 pct)    38733.15 (-7.72 pct)    38864.45 (-7.40 pct)    35728.70 (-14.87 pct)
> 
> - netperf
> 
>                         base                    cpu                     cache                   numa                    system
>  1-clients:      100910.82 (0.00 pct)    103440.72 (2.50 pct)    102592.36 (1.66 pct)    103199.49 (2.26 pct)    103561.90 (2.62 pct)
>  2-clients:      99777.76 (0.00 pct)     100414.00 (0.63 pct)    100305.89 (0.52 pct)    99890.90 (0.11 pct)     101512.46 (1.73 pct)
>  4-clients:      97676.17 (0.00 pct)     96624.28 (-1.07 pct)    95966.77 (-1.75 pct)    97105.22 (-0.58 pct)    97972.11 (0.30 pct)
>  8-clients:      95413.11 (0.00 pct)     89926.72 (-5.75 pct)    89977.14 (-5.69 pct)    91020.10 (-4.60 pct)    92458.94 (-3.09 pct)
> 16-clients:      88961.66 (0.00 pct)     81295.02 (-8.61 pct)    79144.83 (-11.03 pct)   80216.42 (-9.83 pct)    85439.68 (-3.95 pct)
> 32-clients:      82199.83 (0.00 pct)     77914.00 (-5.21 pct)    75055.66 (-8.69 pct)    76813.94 (-6.55 pct)    80768.87 (-1.74 pct)
> 64-clients:      66094.87 (0.00 pct)     64419.91 (-2.53 pct)    63718.37 (-3.59 pct)    60370.40 (-8.66 pct)    66179.58 (0.12 pct)
> 128-clients:     43833.63 (0.00 pct)     42936.08 (-2.04 pct)    44554.69 (1.64 pct)     42666.82 (-2.66 pct)    45543.69 (3.90 pct)
> 256-clients:     38917.58 (0.00 pct)     24807.28 (-36.25 pct)   20517.01 (-47.28 pct)   21651.40 (-44.36 pct)   23778.87 (-38.89 pct)
> 
> - SPECjbb2015 Multi-JVM
> 
> 	       max-jOPS	     critical-jOPS
> base:		 0.00%		 0.00%
> smt:            -1.11%		-1.84%
> cpu:             2.86%		-1.35%
> cache:           2.86%		-1.66%
> numa:            1.43%		-1.49%
> system:          0.08%		-0.41%
> 
> 
> I'll go dig deeper into the tbench and netperf regressions. I'm not sure
> why the regression is observed for all the affinity scopes. I'll look
> into IBS profile and see if something obvious pops up. Meanwhile if there
> is any specific data you would like me to collect or benchmark you would
> like me to test, let me know.

Yeah, that's a bit surprising given that in terms of affinity behavior
"numa" should be identical to base. The only meaningful differences that I
can think of are when the work item is assigned to its worker and maybe how
the pwq max_active limit is applied. Hmm... can you monitor the number of
kworker kthreads while running the benchmark? No need to do the whole
matrix, just comparing base against numa should be enough.
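
One possible way to sample the kworker count is sketched below. This is a minimal userspace C sketch and not part of the patchset: it assumes kworker threads can be recognized by a "kworker/" prefix in /proc/<pid>/comm, and the helper names is_kworker_comm() and count_kworkers() are hypothetical, chosen only for illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Assumption: kworker threads report a comm starting with "kworker/". */
int is_kworker_comm(const char *comm)
{
	return strncmp(comm, "kworker/", 8) == 0;
}

/*
 * Take one sample: walk the numeric entries of /proc and count those
 * whose comm looks like a kworker. Returns -1 if /proc can't be opened.
 */
int count_kworkers(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int count = 0;

	if (!proc)
		return -1;

	while ((de = readdir(proc)) != NULL) {
		char path[288];
		char comm[64];
		FILE *f;

		/* Only PID directories are of interest. */
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "/proc/%s/comm", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* the task may have exited meanwhile */

		if (fgets(comm, sizeof(comm), f) && is_kworker_comm(comm))
			count++;
		fclose(f);
	}

	closedir(proc);
	return count;
}
```

Calling count_kworkers() periodically (say, once a second) during a base run and during a numa run would give two series that can be compared directly.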

Thanks.

-- 
tejun

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-07-05 18:39               ` Tejun Heo
@ 2023-07-11  3:02                 ` K Prateek Nayak
  2023-07-31 23:52                   ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: K Prateek Nayak @ 2023-07-11  3:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello Tejun,

On 7/6/2023 12:09 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 05, 2023 at 12:34:48PM +0530, K Prateek Nayak wrote:
>> - Apart from tbench and netperf, the rest of the benchmarks show no
>>   difference out of the box.
> 
> Just looking at the data, it's a bit difficult for me to judge. I suppose
> most of differences are due to run-to-run variances? It'd be really useful
> if the data contains standard deviation (whether historical or directly from
> multiple runs).

I'll make sure to include this from now on.

> 
>> - SPECjbb2015 Multi-jVM sees small uplift to max-jOPS with certain
>>   affinity scopes.
>>
>> - tbench and netperf seem to be unhappy throughout. None of the affinity
>>   scopes seem to bring back the performance. I'll dig more into this.
> 
> Yeah, that seems pretty consistent.
> 
>> ~~~~~~~~~~
>> ~ stream ~
>> ~~~~~~~~~~
>>
>> o NPS1
>>
>> - 10 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 245676.59 (0.00 pct)	 333646.71 (35.80 pct)
>> Scale:	 206545.41 (0.00 pct)	 205706.04 (-0.40 pct)
>>   Add:	 213506.82 (0.00 pct)	 236739.07 (10.88 pct)
>> Triad:	 217679.43 (0.00 pct)	 249263.46 (14.50 pct)
>>
>> - 100 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 318060.91 (0.00 pct)	 326025.89 (2.50 pct)
>> Scale:	 213943.40 (0.00 pct)	 207647.37 (-2.94 pct)
>>   Add:	 237892.53 (0.00 pct)	 232164.59 (-2.40 pct)
>> Triad:	 245672.84 (0.00 pct)	 246333.21 (0.26 pct)
>>
>> o NPS2
>>
>> - 10 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 296632.20 (0.00 pct)	 291153.63 (-1.84 pct)
>> Scale:	 206193.90 (0.00 pct)	 216368.42 (4.93 pct)
>>   Add:	 240363.50 (0.00 pct)	 245954.23 (2.32 pct)
>> Triad:	 242748.60 (0.00 pct)	 238606.20 (-1.70 pct)
>>
>> - 100 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 322535.79 (0.00 pct)	 315020.03 (-2.33 pct)
>> Scale:	 217723.56 (0.00 pct)	 220172.32 (1.12 pct)
>>   Add:	 248117.72 (0.00 pct)	 250557.17 (0.98 pct)
>> Triad:	 257768.66 (0.00 pct)	 248264.00 (-3.68 pct)
>>
>> o NPS4
>>
>> - 10 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 274067.54 (0.00 pct)	 302804.77 (10.48 pct)
>> Scale:	 224944.53 (0.00 pct)	 230112.39 (2.29 pct)
>>   Add:	 229318.09 (0.00 pct)	 241939.54 (5.50 pct)
>> Triad:	 230175.89 (0.00 pct)	 253613.85 (10.18 pct)
>>
>> - 100 Runs:
>>
>> Test:		base		   affinity_scopes
>>  Copy:	 338922.96 (0.00 pct)	 348183.65 (2.73 pct)
>> Scale:	 240262.45 (0.00 pct)	 245939.67 (2.36 pct)
>>   Add:	 256968.24 (0.00 pct)	 260657.01 (1.43 pct)
>> Triad:	 262644.16 (0.00 pct)	 262286.46 (-0.13 pct)
> 
> The differences seem more consistent and pronounced for this benchmark too.
> Is this just expected variance for this benchmark?

Stream's changes are mostly due to run-to-run variance.

> 
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Benchmarks run with multiple affinity scope ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - tbench
>>
>> Clients:     base                     cpu                    cache                   numa                    system
>>     1    450.40 (0.00 pct)       459.44 (2.00 pct)       457.12 (1.49 pct)       456.36 (1.32 pct)       456.75 (1.40 pct)
>>     2    872.50 (0.00 pct)       869.68 (-0.32 pct)      890.59 (2.07 pct)       878.87 (0.73 pct)       890.14 (2.02 pct)
>>     4    1630.13 (0.00 pct)      1621.24 (-0.54 pct)     1634.74 (0.28 pct)      1628.62 (-0.09 pct)     1646.57 (1.00 pct)
>>     8    3139.90 (0.00 pct)      3044.58 (-3.03 pct)     3099.49 (-1.28 pct)     3081.43 (-1.86 pct)     3151.16 (0.35 pct)
>>    16    6113.51 (0.00 pct)      5555.17 (-9.13 pct)     5465.09 (-10.60 pct)    5661.31 (-7.39 pct)     5742.58 (-6.06 pct)
>>    32    11024.64 (0.00 pct)     9574.62 (-13.15 pct)    9282.62 (-15.80 pct)    9542.00 (-13.44 pct)    9916.66 (-10.05 pct)
>>    64    19081.96 (0.00 pct)     15656.53 (-17.95 pct)   15176.12 (-20.46 pct)   16527.77 (-13.38 pct)   15097.97 (-20.87 pct)
>>   128    30956.07 (0.00 pct)     28277.80 (-8.65 pct)    27662.76 (-10.63 pct)   27817.94 (-10.13 pct)   28925.78 (-6.55 pct)
>>   256    42829.46 (0.00 pct)     38646.48 (-9.76 pct)    38355.27 (-10.44 pct)   37073.24 (-13.43 pct)   34391.01 (-19.70 pct)
>>   512    42395.69 (0.00 pct)     36931.34 (-12.88 pct)   39259.49 (-7.39 pct)    36571.62 (-13.73 pct)   36245.55 (-14.50 pct)
>>  1024    41973.51 (0.00 pct)     38817.07 (-7.52 pct)    38733.15 (-7.72 pct)    38864.45 (-7.40 pct)    35728.70 (-14.87 pct)
>>
>> - netperf
>>
>>                         base                    cpu                     cache                   numa                    system
>>  1-clients:      100910.82 (0.00 pct)    103440.72 (2.50 pct)    102592.36 (1.66 pct)    103199.49 (2.26 pct)    103561.90 (2.62 pct)
>>  2-clients:      99777.76 (0.00 pct)     100414.00 (0.63 pct)    100305.89 (0.52 pct)    99890.90 (0.11 pct)     101512.46 (1.73 pct)
>>  4-clients:      97676.17 (0.00 pct)     96624.28 (-1.07 pct)    95966.77 (-1.75 pct)    97105.22 (-0.58 pct)    97972.11 (0.30 pct)
>>  8-clients:      95413.11 (0.00 pct)     89926.72 (-5.75 pct)    89977.14 (-5.69 pct)    91020.10 (-4.60 pct)    92458.94 (-3.09 pct)
>> 16-clients:      88961.66 (0.00 pct)     81295.02 (-8.61 pct)    79144.83 (-11.03 pct)   80216.42 (-9.83 pct)    85439.68 (-3.95 pct)
>> 32-clients:      82199.83 (0.00 pct)     77914.00 (-5.21 pct)    75055.66 (-8.69 pct)    76813.94 (-6.55 pct)    80768.87 (-1.74 pct)
>> 64-clients:      66094.87 (0.00 pct)     64419.91 (-2.53 pct)    63718.37 (-3.59 pct)    60370.40 (-8.66 pct)    66179.58 (0.12 pct)
>> 128-clients:     43833.63 (0.00 pct)     42936.08 (-2.04 pct)    44554.69 (1.64 pct)     42666.82 (-2.66 pct)    45543.69 (3.90 pct)
>> 256-clients:     38917.58 (0.00 pct)     24807.28 (-36.25 pct)   20517.01 (-47.28 pct)   21651.40 (-44.36 pct)   23778.87 (-38.89 pct)
>>
>> - SPECjbb2015 Multi-JVM
>>
>> 	       max-jOPS	     critical-jOPS
>> base:		 0.00%		 0.00%
>> smt:            -1.11%		-1.84%
>> cpu:             2.86%		-1.35%
>> cache:           2.86%		-1.66%
>> numa:            1.43%		-1.49%
>> system:          0.08%		-0.41%
>>
>>
>> I'll go dig deeper into the tbench and netperf regressions. I'm not sure
>> why the regression is observed for all the affinity scopes. I'll look
>> into IBS profile and see if something obvious pops up. Meanwhile if there
>> is any specific data you would like me to collect or benchmark you would
>> like me to test, let me know.
> 
> Yeah, that's a bit surprising given that in terms of affinity behavior
> "numa" should be identical to base. The only meaningful differences that I
> can think of is when the work item is assigned to its worker and maybe how
> pwq max_active limit is applied. Hmm... can you monitor the number of
> kworker kthreads while running the benchmark? No need to do the whole
> matrix, just comparing base against numa should be enough.

Sure. I'll get back to you with the data soon.

> 
> Thanks.
> 

--
Thanks and Regards,
Prateek

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-07-11  3:02                 ` K Prateek Nayak
@ 2023-07-31 23:52                   ` Tejun Heo
  2023-08-08  1:08                     ` Tejun Heo
  0 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-07-31 23:52 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello,

On Tue, Jul 11, 2023 at 08:32:27AM +0530, K Prateek Nayak wrote:
> > Yeah, that's a bit surprising given that in terms of affinity behavior
> > "numa" should be identical to base. The only meaningful differences that I
> > can think of is when the work item is assigned to its worker and maybe how
> > pwq max_active limit is applied. Hmm... can you monitor the number of
> > kworker kthreads while running the benchmark? No need to do the whole
> > matrix, just comparing base against numa should be enough.
> 
> Sure. I'll get back to you with the data soon.

Any updates? I'd like to proceed with the patchset as it helps resolve
problems others are reporting. I can try to reproduce the results too if you
can share more details on how they're run.

Thanks.

-- 
tejun

* Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
  2023-07-31 23:52                   ` Tejun Heo
@ 2023-08-08  1:08                     ` Tejun Heo
  0 siblings, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-08-08  1:08 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sandeep Dhavale, jiangshanlai, torvalds, peterz, linux-kernel,
	kernel-team, joshdon, brho, briannorris, nhuck, agk, snitzer,
	void, kernel-team

Hello,

On Mon, Jul 31, 2023 at 01:52:21PM -1000, Tejun Heo wrote:
> On Tue, Jul 11, 2023 at 08:32:27AM +0530, K Prateek Nayak wrote:
> > > Yeah, that's a bit surprising given that in terms of affinity behavior
> > > "numa" should be identical to base. The only meaningful differences that I
> > > can think of is when the work item is assigned to its worker and maybe how
> > > pwq max_active limit is applied. Hmm... can you monitor the number of
> > > kworker kthreads while running the benchmark? No need to do the whole
> > > matrix, just comparing base against numa should be enough.
> > 
> > Sure. I'll get back to you with the data soon.
> 
> Any updates? I'd like to proceed with the patchset as it helps resolving
> problems others are reporting. I can try to reproduce the results too if you
> can share more details on how they're run.

Prateek sent me how he tested along with workqueue traces. I tried to
reproduce on an AMD Zen 2 machine and here are the findings:

* The test has a high run-to-run variance. Even with cpufreq boost turned
  off, the numbers reported every second within each run are relatively
  stable, but adjacent runs can report significantly different numbers.
  Maybe initial thread placement has lingering effects?

  On a Ryzen 3900X, 15 runs of `./tbench -c ./client.txt -t 60 32 127.0.0.1`:

    Before: AVG=9066.43 STDEV=42.65
    After : AVG=9076.11 STDEV=60.50

  Given the stdev, I don't think this is indicating any meaningful
  difference.

* I looked at what was consuming CPU during the benchmark runs and also at
  Prateek's workqueue traces. None of the operations that tbench performs
  directly involves workqueue. I couldn't find a mechanism by which
  workqueue differences would cause any meaningful performance difference.

At least for tbench results, I couldn't find any signal.
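
The "no meaningful difference" reading of the AVG/STDEV numbers above can be sanity-checked with a rough sketch like the following. It assumes 15 runs for each of Before and After, treats STDEV as the sample standard deviation, and uses "within k standard errors of the difference" as an informal threshold. The function names are hypothetical, and this is not how the numbers above were evaluated. The comparison is done on squared quantities so that no square root is needed.

```c
#include <stddef.h>

/*
 * Squared standard error of the difference between two independent
 * sample means: sd_a^2 / n_a + sd_b^2 / n_b.
 */
double diff_se_squared(double sd_a, size_t n_a, double sd_b, size_t n_b)
{
	return sd_a * sd_a / (double)n_a + sd_b * sd_b / (double)n_b;
}

/*
 * Returns 1 when |mean_b - mean_a| is smaller than k standard errors of
 * the difference, i.e. when the gap is plausibly just run-to-run noise.
 * Both sides of the comparison are squared to avoid calling sqrt().
 */
int within_noise(double mean_a, double sd_a, size_t n_a,
		 double mean_b, double sd_b, size_t n_b, double k)
{
	double diff = mean_b - mean_a;

	return diff * diff < k * k * diff_se_squared(sd_a, n_a, sd_b, n_b);
}
```

With the figures above (9066.43 / 42.65 and 9076.11 / 60.50, 15 runs each, as assumed) and k = 2, the difference of roughly 10 is well inside the noise band, which is consistent with the conclusion that the stdev swamps the delta.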

Thanks.

-- 
tejun

* [PATCH v2 03/24] workqueue: Not all work insertion needs to wake up a worker
  2023-05-19  0:16 ` [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker Tejun Heo
  2023-05-23  9:54   ` Lai Jiangshan
@ 2023-08-08  1:15   ` Tejun Heo
  1 sibling, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-08-08  1:15 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or a flush work item is queued,
there's no reason to try to wake up a worker.

This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().

While at it:

* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
  pool.

* Every caller of insert_work() calls debug_work_activate(). Consolidate the
  invocations into insert_work().

* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
  to better accommodate future changes.

This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.

v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
    suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 kernel/workqueue.c |   38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1539,7 +1539,7 @@ fail:
 static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 			struct list_head *head, unsigned int extra_flags)
 {
-	struct worker_pool *pool = pwq->pool;
+	debug_work_activate(work);
 
 	/* record the work call stack in order to print it in KASAN reports */
 	kasan_record_aux_stack_noalloc(work);
@@ -1548,9 +1548,6 @@ static void insert_work(struct pool_work
 	set_work_pwq(work, pwq, extra_flags);
 	list_add_tail(&work->entry, head);
 	get_pwq(pwq);
-
-	if (__need_more_worker(pool))
-		wake_up_worker(pool);
 }
 
 /*
@@ -1604,8 +1601,7 @@ static void __queue_work(int cpu, struct
 			 struct work_struct *work)
 {
 	struct pool_workqueue *pwq;
-	struct worker_pool *last_pool;
-	struct list_head *worklist;
+	struct worker_pool *last_pool, *pool;
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
 
@@ -1639,13 +1635,15 @@ retry:
 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
 	}
 
+	pool = pwq->pool;
+
 	/*
 	 * If @work was previously on a different pool, it might still be
 	 * running there, in which case the work needs to be queued on that
 	 * pool to guarantee non-reentrancy.
 	 */
 	last_pool = get_work_pool(work);
-	if (last_pool && last_pool != pwq->pool) {
+	if (last_pool && last_pool != pool) {
 		struct worker *worker;
 
 		raw_spin_lock(&last_pool->lock);
@@ -1654,13 +1652,15 @@ retry:
 
 		if (worker && worker->current_pwq->wq == wq) {
 			pwq = worker->current_pwq;
+			pool = pwq->pool;
+			WARN_ON_ONCE(pool != last_pool);
 		} else {
 			/* meh... not running there, queue here */
 			raw_spin_unlock(&last_pool->lock);
-			raw_spin_lock(&pwq->pool->lock);
+			raw_spin_lock(&pool->lock);
 		}
 	} else {
-		raw_spin_lock(&pwq->pool->lock);
+		raw_spin_lock(&pool->lock);
 	}
 
 	/*
@@ -1673,7 +1673,7 @@ retry:
 	 */
 	if (unlikely(!pwq->refcnt)) {
 		if (wq->flags & WQ_UNBOUND) {
-			raw_spin_unlock(&pwq->pool->lock);
+			raw_spin_unlock(&pool->lock);
 			cpu_relax();
 			goto retry;
 		}
@@ -1692,21 +1692,22 @@ retry:
 	work_flags = work_color_to_flags(pwq->work_color);
 
 	if (likely(pwq->nr_active < pwq->max_active)) {
+		if (list_empty(&pool->worklist))
+			pool->watchdog_ts = jiffies;
+
 		trace_workqueue_activate_work(work);
 		pwq->nr_active++;
-		worklist = &pwq->pool->worklist;
-		if (list_empty(worklist))
-			pwq->pool->watchdog_ts = jiffies;
+		insert_work(pwq, work, &pool->worklist, work_flags);
+
+		if (__need_more_worker(pool))
+			wake_up_worker(pool);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
-		worklist = &pwq->inactive_works;
+		insert_work(pwq, work, &pwq->inactive_works, work_flags);
 	}
 
-	debug_work_activate(work);
-	insert_work(pwq, work, worklist, work_flags);
-
 out:
-	raw_spin_unlock(&pwq->pool->lock);
+	raw_spin_unlock(&pool->lock);
 	rcu_read_unlock();
 }
 
@@ -3010,7 +3011,6 @@ static void insert_wq_barrier(struct poo
 	pwq->nr_in_flight[work_color]++;
 	work_flags |= work_color_to_flags(work_color);
 
-	debug_work_activate(&barr->work);
 	insert_work(pwq, &barr->work, head, work_flags);
 }
 

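The effect of the patch above can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions: the model_* structures and functions are hypothetical stand-ins rather than the kernel's types, locking and the worklists themselves are omitted, and "need more worker" is reduced to "no worker is currently running". It shows the post-patch rule that a wakeup is attempted only when a work item is queued as active.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a worker pool; not the kernel structure. */
struct model_pool {
	int nr_queued;		/* items on the pool's active worklist */
	int nr_running;		/* workers currently running */
	int wakeups;		/* number of wakeup attempts made */
};

/* Hypothetical stand-in for a pool_workqueue. */
struct model_pwq {
	struct model_pool *pool;
	int nr_active;		/* active items charged to this pwq */
	int max_active;		/* limit on concurrently active items */
	int nr_inactive;	/* items parked on the inactive list */
};

/* Simplified condition: wake someone only if nobody is running. */
static bool model_need_more_worker(const struct model_pool *pool)
{
	return pool->nr_running == 0;
}

/*
 * Queue one work item. Returns true if it became active. A wakeup is
 * attempted only on the active path; an item that goes to the inactive
 * list never triggers one.
 */
bool model_queue_work(struct model_pwq *pwq)
{
	struct model_pool *pool = pwq->pool;

	if (pwq->nr_active < pwq->max_active) {
		pwq->nr_active++;
		pool->nr_queued++;
		if (model_need_more_worker(pool))
			pool->wakeups++;
		return true;
	}

	pwq->nr_inactive++;
	return false;
}
```

Queueing two items on a pwq with max_active of 1 and no running workers yields one wakeup attempt in this model: the second item is parked as inactive without touching the pool.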
* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
                   ` (26 preceding siblings ...)
  2023-06-12 23:56 ` Brian Norris
@ 2023-08-08  1:22 ` Tejun Heo
  2023-08-08  2:58   ` K Prateek Nayak
  27 siblings, 1 reply; 73+ messages in thread
From: Tejun Heo @ 2023-08-08  1:22 UTC (permalink / raw)
  To: jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void, kprateek.nayak

Hello,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> Unbound workqueues used to spray work items inside each NUMA node, which
> isn't great on CPUs w/ multiple L3 caches. This patchset implements
> mechanisms to improve and configure execution locality.

The patchset shows minor perf improvements for some but more importantly
gives users more control over worker placement, which helps work around
some of the recently reported performance regressions. Prateek reported
concerning regressions with tbench but I couldn't reproduce them and can't
see how tbench would be affected at all given the benchmark doesn't involve
workqueue operations in any noticeable way.

Assuming that the tbench difference was a testing artifact, I'm applying the
patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
really appreciate if you could repeat the test and see whether the
difference persists.

Thanks.

-- 
tejun

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-08-08  1:22 ` Tejun Heo
@ 2023-08-08  2:58   ` K Prateek Nayak
  2023-08-08  7:59     ` Tejun Heo
  2023-08-18  4:05     ` K Prateek Nayak
  0 siblings, 2 replies; 73+ messages in thread
From: K Prateek Nayak @ 2023-08-08  2:58 UTC (permalink / raw)
  To: Tejun Heo, jiangshanlai
  Cc: torvalds, peterz, linux-kernel, kernel-team, joshdon, brho,
	briannorris, nhuck, agk, snitzer, void

Hello Tejun,

On 8/8/2023 6:52 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>> Unbound workqueues used to spray work items inside each NUMA node, which
>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>> mechanisms to improve and configure execution locality.
> 
> The patchset shows minor perf improvements for some but more importantly
> gives users more control over worker placement which helps working around
> some of the recently reported performance regressions. Prateek reported
> concerning regressions with tbench but I couldn't reproduce it and can't see
> how tbench would be affected at all given the benchmark doesn't involve
> workqueue operations in any noticeable way.
> 
> Assuming that the tbench difference was a testing artifact, I'm applying the
> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
> really appreciate if you could repeat the test and see whether the
> difference persists.

Sure. I'll retest with the for-6.6 branch and post the results here once
the tests are done. I'll repeat the same procedure: test with the defaults,
then rerun the benchmarks that show any difference in results with the
various affinity scopes.

Let me know if you have any suggestions.

--
Thanks and Regards,
Prateek

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-08-08  2:58   ` K Prateek Nayak
@ 2023-08-08  7:59     ` Tejun Heo
  2023-08-18  4:05     ` K Prateek Nayak
  1 sibling, 0 replies; 73+ messages in thread
From: Tejun Heo @ 2023-08-08  7:59 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: jiangshanlai, torvalds, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void

Hello,

On Tue, Aug 08, 2023 at 08:28:53AM +0530, K Prateek Nayak wrote:
> > Assuming that the tbench difference was a testing artifact, I'm applying the
> > patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
> > really appreciate if you could repeat the test and see whether the
> > difference persists.
> 
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.

I think it'd be helpful to pick one benchmark setup which shows a clear
difference and repeat it multiple times while taking measures to avoid
systematic biases (e.g. instead of running all of control followed by all of
test, separate them into several segments and interleave them).
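
The interleaving idea can be sketched as a run-order generator. This is a hypothetical helper, not an existing tool: it assumes both kernels get the same number of runs, and it marks control runs as 'C' and test runs as 'T'. A real harness would also have to switch kernels or settings between segments, which is not shown here.

```c
#include <stddef.h>

/*
 * Fill @out with a run order that alternates segments of control ('C')
 * and test ('T') runs, @segment_len runs at a time, until each arm has
 * @runs_per_arm runs. Returns the number of runs written, or 0 if the
 * arguments are unusable. @out is NUL-terminated on success.
 */
size_t build_interleaved_schedule(char *out, size_t out_len,
				  size_t runs_per_arm, size_t segment_len)
{
	size_t n = 0, done = 0, i;

	if (segment_len == 0 || out_len < 2 * runs_per_arm + 1)
		return 0;

	while (done < runs_per_arm) {
		size_t chunk = runs_per_arm - done;

		/* The last segment may be shorter than segment_len. */
		if (chunk > segment_len)
			chunk = segment_len;

		for (i = 0; i < chunk; i++)
			out[n++] = 'C';
		for (i = 0; i < chunk; i++)
			out[n++] = 'T';

		done += chunk;
	}

	out[n] = '\0';
	return n;
}
```

For example, 5 runs per arm with a segment length of 2 gives the order CCTTCCTTCT instead of CCCCCTTTTT, so slow drifts in machine state affect both arms more evenly.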

Thanks.

-- 
tejun

* Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
  2023-08-08  2:58   ` K Prateek Nayak
  2023-08-08  7:59     ` Tejun Heo
@ 2023-08-18  4:05     ` K Prateek Nayak
  1 sibling, 0 replies; 73+ messages in thread
From: K Prateek Nayak @ 2023-08-18  4:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, jiangshanlai, peterz, linux-kernel, kernel-team,
	joshdon, brho, briannorris, nhuck, agk, snitzer, void

Hello Tejun,

On 8/8/2023 8:28 AM, K Prateek Nayak wrote:
> Hello Tejun,
> 
> On 8/8/2023 6:52 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>>> Unbound workqueues used to spray work items inside each NUMA node, which
>>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>>> mechanisms to improve and configure execution locality.
>>
>> The patchset shows minor perf improvements for some but more importantly
>> gives users more control over worker placement which helps working around
>> some of the recently reported performance regressions. Prateek reported
>> concerning regressions with tbench but I couldn't reproduce it and can't see
>> how tbench would be affected at all given the benchmark doesn't involve
>> workqueue operations in any noticeable way.
>>
>> Assuming that the tbench difference was a testing artifact, I'm applying the
>> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
>> really appreciate if you could repeat the test and see whether the
>> difference persists.
> 
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.

Sorry I'm lagging on the test queue. Following are the results of the
standard benchmarks run on a dual-socket 3rd Generation EPYC system
(2 x 64C/128T).

tl;dr

- No noticeable difference in performance.
- The netperf and tbench regressions are gone now, and the base numbers
  are also much higher than before (sorry for the false alarm!)

Following are the results:

base:	affinity-scopes-v2 branch at commit 18c8ae813156 ("workqueue:
	Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us
	is 0")

affinity-scope:	affinity-scopes-v2 branch at commit a4da9f618d3e
	("workqueue: Add "Affinity Scopes and Performance" section to
	documentation")

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          base[pct imp](CV)    affinity-scope[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.76)     0.99 [  0.56]( 3.02)
 2-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.94]( 2.36)
 4-groups     1.00 [ -0.00]( 1.49)     1.02 [ -2.20]( 1.91)
 8-groups     1.00 [ -0.00]( 1.12)     1.00 [ -0.00]( 0.93)
16-groups     1.00 [ -0.00]( 3.64)     1.01 [ -0.87]( 2.66)
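
Each cell in these tables shows the mean normalized to base, the
percentage improvement in brackets, and (presumably) the coefficient of
variation in parentheses. The sketch below shows one way such a cell
could be derived from raw per-run samples. The formulas used by the
tooling behind this report are not stated in the thread, so the sign
convention, the arithmetic mean and the sample standard deviation are
assumptions, and summarize is a hypothetical name.

```python
import statistics


def summarize(base_runs, test_runs, lower_is_better):
    """Return (normalized mean, pct improvement, CV in percent) for test_runs.

    Assumed conventions, not taken from the report's tooling:
    - normalized = mean(test) / mean(base)
    - pct improvement is positive when test is better than base
    - CV = sample standard deviation / mean * 100
    """
    base_mean = statistics.mean(base_runs)
    test_mean = statistics.mean(test_runs)
    normalized = test_mean / base_mean
    change = (test_mean - base_mean) / base_mean * 100.0
    # For time or latency, a decrease counts as an improvement.
    pct_imp = -change if lower_is_better else change
    cv = statistics.stdev(test_runs) / test_mean * 100.0
    return normalized, pct_imp, cv


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    norm, imp, cv = summarize([10.0, 10.2, 9.8], [9.9, 10.1, 9.7], True)
    print(f"{norm:.2f} [{imp:6.2f}]({cv:5.2f})")
```

Under these assumptions, a difference between base and test that is
smaller than the CV of either column is hard to distinguish from
run-to-run noise.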


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:  base[pct imp](CV)    affinity-scope[pct imp](CV)
    1     1.00 [  0.00]( 0.47)     1.00 [ -0.21]( 1.03)
    2     1.00 [  0.00]( 0.10)     1.00 [  0.00]( 0.45)
    4     1.00 [  0.00]( 1.60)     1.00 [ -0.18]( 0.83)
    8     1.00 [  0.00]( 0.13)     1.00 [ -0.26]( 0.59)
   16     1.00 [  0.00]( 1.69)     1.02 [  2.05]( 1.08)
   32     1.00 [  0.00]( 0.35)     1.00 [ -0.36]( 2.47)
   64     1.00 [  0.00]( 0.43)     1.00 [  0.45]( 2.54)
  128     1.00 [  0.00]( 0.31)     0.99 [ -0.82]( 0.58)
  256     1.00 [  0.00]( 1.81)     0.98 [ -1.84]( 1.80)
  512     1.00 [  0.00]( 0.54)     1.00 [  0.04]( 0.06)
 1024     1.00 [  0.00]( 0.13)     1.01 [  1.01]( 0.42)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     base[pct imp](CV)    affinity-scope[pct imp](CV)
 Copy     1.00 [  0.00]( 6.45)     1.03 [  2.50]( 5.75)
Scale     1.00 [  0.00]( 6.21)     1.03 [  3.36]( 0.75)
  Add     1.00 [  0.00]( 6.10)     1.04 [  4.23]( 1.81)
Triad     1.00 [  0.00]( 7.24)     1.03 [  3.49]( 3.41)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     base[pct imp](CV)    affinity-scope[pct imp](CV)
 Copy     1.00 [  0.00]( 1.98)     1.00 [  0.40]( 2.57)
Scale     1.00 [  0.00]( 4.88)     1.00 [ -0.07]( 5.11)
  Add     1.00 [  0.00]( 4.60)     1.00 [  0.23]( 5.21)
Triad     1.00 [  0.00]( 6.21)     1.03 [  2.85]( 2.55)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       base[pct imp](CV)    affinity-scope[pct imp](CV)
 1-clients     1.00 [  0.00]( 1.84)     1.01 [  0.99]( 0.72)
 2-clients     1.00 [  0.00]( 0.64)     1.01 [  0.53]( 0.77)
 4-clients     1.00 [  0.00]( 0.75)     1.01 [  0.54]( 0.96)
 8-clients     1.00 [  0.00]( 0.83)     1.00 [ -0.21]( 1.03)
16-clients     1.00 [  0.00]( 0.75)     1.00 [  0.31]( 0.81)
32-clients     1.00 [  0.00]( 0.82)     1.00 [  0.12]( 1.57)
64-clients     1.00 [  0.00]( 2.30)     1.00 [ -0.28]( 2.39)
128-clients    1.00 [  0.00]( 2.54)     0.99 [ -1.01]( 2.61)
256-clients    1.00 [  0.00]( 4.37)     1.01 [  1.23]( 2.69)
512-clients    1.00 [  0.00](48.73)     1.01 [  0.99](46.07)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: base[pct imp](CV)    affinity-scope[pct imp](CV)
  1     1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 2.28)
  2     1.00 [ -0.00]( 8.55)     0.96 [  4.00]( 4.17)
  4     1.00 [ -0.00]( 3.81)     0.94 [  6.45]( 8.78)
  8     1.00 [ -0.00]( 2.78)     0.97 [  2.78]( 4.81)
 16     1.00 [ -0.00]( 1.22)     0.96 [  4.26]( 1.27)
 32     1.00 [ -0.00]( 2.02)     0.97 [  2.63]( 3.99)
 64     1.00 [ -0.00]( 5.65)     0.99 [  0.62]( 1.65)
128     1.00 [ -0.00]( 5.17)     0.98 [  1.91]( 8.12)
256     1.00 [ -0.00](10.79)     1.07 [ -6.82]( 7.18)
512     1.00 [ -0.00]( 1.24)     0.99 [  0.54]( 1.37)



==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                           base                affinity-scope
Hmean     unixbench-dhry2reg-1      40947261.77 (   0.00%)    41078213.81 (   0.32%)
Hmean     unixbench-dhry2reg-512  6243140251.68 (   0.00%)  6240938691.75 (  -0.04%)
Amean     unixbench-syscall-1        2932806.37 (   0.00%)     2871035.50 *   2.11%*
Amean     unixbench-syscall-512      7689448.00 (   0.00%)     8406697.27 *   9.33%*
Hmean     unixbench-pipe-1           2577667.42 (   0.00%)     2497979.59 *  -3.09%*
Hmean     unixbench-pipe-512       363366036.45 (   0.00%)   356991588.20 *  -1.75%*
Hmean     unixbench-spawn-1             4446.97 (   0.00%)        4760.91 *   7.06%*
Hmean     unixbench-spawn-512          68983.49 (   0.00%)       68464.78 *  -0.75%*
Hmean     unixbench-execl-1             3894.20 (   0.00%)        3857.78 (  -0.94%)
Hmean     unixbench-execl-512          12716.76 (   0.00%)       13067.63 (   2.76%)


==================================================================
Test          : tbench (Various Affinity Scopes)
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:   base[pct imp](CV)         cpu[pct imp](CV)         smt[pct imp](CV)       cache[pct imp](CV)        numa[pct imp](CV)      system[pct imp](CV)
    1     1.00 [  0.00]( 0.47)     1.00 [  0.11]( 0.95)     1.00 [  0.23]( 1.97)     1.01 [  1.01]( 0.29)     1.00 [  0.07]( 0.57)     1.01 [  1.36]( 0.36)
    2     1.00 [  0.00]( 0.10)     1.01 [  1.14]( 0.27)     0.99 [ -0.84]( 0.51)     1.01 [  1.05]( 0.50)     1.00 [  0.24]( 0.75)     1.00 [ -0.29]( 1.22)
    4     1.00 [  0.00]( 1.60)     1.02 [  2.07]( 1.42)     1.02 [  1.65]( 0.46)     1.02 [  2.45]( 0.83)     1.00 [  0.36]( 1.33)     1.02 [  2.37]( 0.57)
    8     1.00 [  0.00]( 0.13)     1.00 [ -0.02]( 0.61)     1.00 [  0.14]( 0.57)     1.01 [  0.88]( 0.33)     1.00 [ -0.26]( 0.30)     1.01 [  0.90]( 1.48)
   16     1.00 [  0.00]( 1.69)     1.03 [  3.10]( 0.69)     1.04 [  3.66]( 1.36)     1.02 [  2.36]( 0.62)     1.02 [  1.61]( 1.63)     1.04 [  3.77]( 1.00)
   32     1.00 [  0.00]( 0.35)     0.97 [ -3.49]( 0.62)     0.97 [ -3.21]( 0.77)     1.00 [ -0.24]( 3.77)     0.96 [ -4.08]( 4.43)     0.97 [ -2.81]( 3.50)
   64     1.00 [  0.00]( 0.43)     1.00 [  0.20]( 1.66)     0.99 [ -0.61]( 0.81)     1.03 [  2.87]( 0.55)     1.02 [  2.16]( 2.31)     0.98 [ -2.32]( 3.63)
  128     1.00 [  0.00]( 0.31)     1.01 [  1.44]( 1.33)     1.01 [  0.72]( 0.46)     1.01 [  1.33]( 0.67)     1.00 [  0.38]( 0.58)     1.01 [  1.44]( 1.35)
  256     1.00 [  0.00]( 1.81)     0.98 [ -2.10]( 1.05)     0.97 [ -2.50]( 0.42)     0.97 [ -3.46]( 0.91)     0.99 [ -0.79]( 0.85)     0.96 [ -3.83]( 0.29)
  512     1.00 [  0.00]( 0.54)     1.00 [  0.37]( 1.12)     0.99 [ -1.33]( 0.44)     1.00 [ -0.19]( 0.94)     1.01 [  0.87]( 1.05)     0.99 [ -1.08]( 0.12)
 1024     1.00 [  0.00]( 0.13)     1.01 [  1.10]( 0.49)     1.00 [  0.47]( 0.28)     1.00 [  0.33]( 0.73)     1.00 [  0.48]( 0.69)     1.00 [  0.01]( 0.47)

==================================================================

ycsb-mongodb and DeathStarBench do not see any difference in
performance. I'll go and test more NPS modes / more machines.
Meanwhile, please feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek

Thread overview: 73+ messages
2023-05-19  0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
2023-05-19  0:16 ` [PATCH 01/24] workqueue: Drop the special locking rule for worker->flags and worker_pool->flags Tejun Heo
2023-05-19  0:16 ` [PATCH 02/24] workqueue: Cleanups around process_scheduled_works() Tejun Heo
2023-05-19  0:16 ` [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker Tejun Heo
2023-05-23  9:54   ` Lai Jiangshan
2023-05-23 21:37     ` Tejun Heo
2023-08-08  1:15   ` [PATCH v2 " Tejun Heo
2023-05-19  0:16 ` [PATCH 04/24] workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq Tejun Heo
2023-05-19  0:16 ` [PATCH 05/24] workqueue: Relocate worker and work management functions Tejun Heo
2023-05-19  0:16 ` [PATCH 06/24] workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa Tejun Heo
2023-05-19  0:16 ` [PATCH 07/24] workqueue: Use a kthread_worker to release pool_workqueues Tejun Heo
2023-05-19  0:16 ` [PATCH 08/24] workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones Tejun Heo
2023-05-19  0:16 ` [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues Tejun Heo
2023-05-22  6:41   ` Leon Romanovsky
2023-05-22 12:27     ` Dennis Dalessandro
2023-05-19  0:16 ` [PATCH 10/24] workqueue: Rename workqueue_attrs->no_numa to ->ordered Tejun Heo
2023-05-19  0:16 ` [PATCH 11/24] workqueue: Rename NUMA related names to use pod instead Tejun Heo
2023-05-19  0:16 ` [PATCH 12/24] workqueue: Move wq_pod_init() below workqueue_init() Tejun Heo
2023-05-19  0:16 ` [PATCH 13/24] workqueue: Initialize unbound CPU pods later in the boot Tejun Heo
2023-05-19  0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
2023-05-30  8:06   ` K Prateek Nayak
2023-06-07  1:50     ` Tejun Heo
2023-05-30 21:18   ` Sandeep Dhavale
2023-05-31 12:14     ` K Prateek Nayak
2023-06-07 22:13       ` Tejun Heo
2023-06-08  3:01         ` K Prateek Nayak
2023-06-08 22:50           ` Tejun Heo
2023-06-09  3:43             ` K Prateek Nayak
2023-06-14 18:49               ` Sandeep Dhavale
2023-06-21 20:14                 ` Tejun Heo
2023-06-19  4:30             ` Swapnil Sapkal
2023-06-21 20:38               ` Tejun Heo
2023-07-05  7:04             ` K Prateek Nayak
2023-07-05 18:39               ` Tejun Heo
2023-07-11  3:02                 ` K Prateek Nayak
2023-07-31 23:52                   ` Tejun Heo
2023-08-08  1:08                     ` Tejun Heo
2023-05-19  0:17 ` [PATCH 15/24] workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration Tejun Heo
2023-05-19  0:17 ` [PATCH 16/24] workqueue: Modularize wq_pod_type initialization Tejun Heo
2023-05-19  0:17 ` [PATCH 17/24] workqueue: Add multiple affinity scopes and interface to select them Tejun Heo
2023-05-19  0:17 ` [PATCH 18/24] workqueue: Factor out work to worker assignment and collision handling Tejun Heo
2023-05-19  0:17 ` [PATCH 19/24] workqueue: Factor out need_more_worker() check and worker wake-up Tejun Heo
2023-05-19  0:17 ` [PATCH 20/24] workqueue: Add workqueue_attrs->__pod_cpumask Tejun Heo
2023-05-19  0:17 ` [PATCH 21/24] workqueue: Implement non-strict affinity scope for unbound workqueues Tejun Heo
2023-05-19  0:17 ` [PATCH 22/24] workqueue: Add "Affinity Scopes and Performance" section to documentation Tejun Heo
2023-05-19  0:17 ` [PATCH 23/24] workqueue: Add pool_workqueue->cpu Tejun Heo
2023-05-19  0:17 ` [PATCH 24/24] workqueue: Implement localize-to-issuing-CPU for unbound workqueues Tejun Heo
2023-05-19  0:41 ` [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Linus Torvalds
2023-05-19 22:35   ` Tejun Heo
2023-05-19 23:03     ` Tejun Heo
2023-05-23  1:51       ` Linus Torvalds
2023-05-23 17:59         ` Linus Torvalds
2023-05-23 20:08           ` Rik van Riel
2023-05-23 21:36           ` Sandeep Dhavale
2023-05-23 11:18 ` Peter Zijlstra
2023-05-23 16:12   ` Vincent Guittot
2023-05-24  7:34     ` Peter Zijlstra
2023-05-24 13:15       ` Vincent Guittot
2023-06-05  4:46     ` Gautham R. Shenoy
2023-06-07 14:42       ` Libo Chen
2023-05-26  1:12   ` Tejun Heo
2023-05-30 11:32     ` Peter Zijlstra
2023-06-12 23:56 ` Brian Norris
2023-06-13  2:48   ` Tejun Heo
2023-06-13  9:26     ` Pin-yen Lin
2023-06-21 19:16       ` Tejun Heo
2023-06-21 19:31         ` Linus Torvalds
2023-06-29  9:49           ` Pin-yen Lin
2023-07-03 21:47             ` Tejun Heo
2023-08-08  1:22 ` Tejun Heo
2023-08-08  2:58   ` K Prateek Nayak
2023-08-08  7:59     ` Tejun Heo
2023-08-18  4:05     ` K Prateek Nayak