* [PATCH -V3 0/9] mm: PCP high auto-tuning
@ 2023-10-16  5:29 Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit Huang Ying
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

The page allocation performance requirements of different workloads
are often different.  So, we need to tune the PCP (Per-CPU Pageset)
high value on each CPU automatically to optimize page allocation
performance.

The patches in the series are as follows,

[1/9] mm, pcp: avoid to drain PCP when process exit
[2/9] cacheinfo: calculate per-CPU data cache size
[3/9] mm, pcp: reduce lock contention for draining high-order pages
[4/9] mm: restrict the pcp batch scale factor to avoid too long latency
[5/9] mm, page_alloc: scale the number of pages that are batch allocated
[6/9] mm: add framework for PCP high auto-tuning
[7/9] mm: tune PCP high automatically
[8/9] mm, pcp: decrease PCP high if free pages < high watermark
[9/9] mm, pcp: reduce detecting time of consecutive high order page freeing

Patches [1/9], [2/9], and [3/9] optimize PCP draining for consecutive
high-order page freeing.

Patches [4/9] and [5/9] optimize batch freeing and allocation.

Patches [6/9], [7/9], and [8/9] implement and optimize a PCP high
auto-tuning method.

Patch [9/9] optimizes PCP draining for consecutive high-order page
freeing based on PCP high auto-tuning.

The test results for patches with performance impact are as follows,

kbuild
======

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server used by the 0-Day kbuild service.

	build time   lock contend%	free_high	alloc_zone
	----------	----------	---------	----------
base	     100.0	      14.0          100.0            100.0
patch1	      99.5	      12.8	     19.5	      95.6
patch3	      99.4	      12.6	      7.1	      95.6
patch5	      98.6	      11.0	      8.1	      97.1
patch7	      95.1	       0.5	      2.8	      15.6
patch9	      95.0	       1.0	      8.8	      20.0

The PCP draining optimization (patches [1/9] and [3/9]) and the PCP
batch allocation optimization (patch [5/9]) reduce zone lock
contention a little.  The PCP high auto-tuning (patches [7/9] and
[9/9]) reduces build time visibly, because the tuning target, the
number of pages allocated from the zone, decreases greatly, so the
zone lock contention cycles% decreases greatly too.

With the PCP tuning patches (patches [7/9] and [9/9]), the average
memory used during the test increases by up to 18.4% because more
pages are cached in the PCP.  But at the end of the test, the used
memory decreases to the same level as that of the base kernel.  That
is, the pages cached in the PCP are released back to the zone once
they are no longer used actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
processes.

	     score   lock contend%	free_high	alloc_zone  cache miss rate%
	     -----	----------	---------	----------  ----------------
base	     100.0	       2.1          100.0            100.0	         1.3
patch1	      99.4	       2.1	     99.4	      99.4		 1.3
patch3	     106.4	       1.3	     13.3	     106.3		 1.3
patch5	     106.0	       1.2	     13.2	     105.9		 1.3
patch7	     103.4	       1.9	      6.7	      90.3		 7.6
patch9	     108.6	       1.3	     13.7	     108.6		 1.3

The PCP draining optimization (patches [1/9] and [3/9]) improves
performance.  The PCP high auto-tuning (patch [7/9]) reduces
performance a little because PCP draining cannot always be triggered
in time, so the cache miss rate% increases.  The further PCP draining
optimization (patch [9/9]) based on PCP tuning restores the
performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16 pairs
of processes.

	     score   lock contend%	free_high	alloc_zone  cache miss rate%
	     -----	----------	---------	----------  ----------------
base	     100.0	      51.4          100.0            100.0	         0.2
patch1	     116.8	      46.1           69.5	     104.3	         0.2
patch3	     199.1	      21.3            7.0	     104.9	         0.2
patch5	     200.0	      20.8            7.1	     106.9	         0.3
patch7	     191.6	      19.9            6.8	     103.8	         2.8
patch9	     193.4	      21.7            7.0	     104.7	         2.1

The PCP draining optimization (patches [1/9] and [3/9]) improves
performance greatly.  The PCP tuning (patch [7/9]) reduces performance
a little because PCP draining cannot always be triggered in time.  The
further PCP draining optimization (patch [9/9]) based on PCP tuning
partly restores the performance.

The patchset adds several fields to struct per_cpu_pages.  The struct
layout before/after the patchset is as follows,

base
====

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        batch;                /*    12     4 */
	short int                  free_factor;          /*    16     2 */
	short int                  expire;               /*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           lists[13];            /*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        high_min;             /*    12     4 */
	int                        high_max;             /*    16     4 */
	int                        batch;                /*    20     4 */
	u8                         flags;                /*    24     1 */
	u8                         alloc_factor;         /*    25     1 */
	u8                         expire;               /*    26     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  free_count;           /*    28     2 */

	/* XXX 2 bytes hole, try to pack */

	struct list_head           lists[13];            /*    32   208 */

	/* size: 256, cachelines: 4, members: 11 */
	/* sum members: 237, holes: 2, sum holes: 3 */
	/* padding: 16 */
} __attribute__((__aligned__(64)));

The size of the struct is unchanged by the patchset.

Changelog:

v3:

- Fix the per-CPU data cache slice size calculation in [3/9].  Thanks
  Mel for pointing this out!

- Fix a PCP->high decrement amount issue when free pages become low.

- Dropped the original "[9/10] mm, pcp: avoid to reduce PCP high
  unnecessarily", because there is not enough performance data to
  support it.

- Add some comments per Mel's comments.  Thanks!

- Add a Kconfig option for max batch scale factor in [4/9] per Mel's
  comments.  Thanks!

- Collected Acked-by.  Thanks Mel!

v2:

- Fix the kbuild test configuration and results.  Thanks Andrew for
  the reminder about test results!

- Add document for sysctl behavior extension in [06/10] per Andrew's comments.

Best Regards,
Huang, Ying


* [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice Huang Ying
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when it is mostly used for high-order page freeing, to improve the
reuse of cache-hot pages between the page allocating and freeing
CPUs.

But the PCP draining mechanism may be triggered unexpectedly when a
process exits.  With a customized trace point, it was found that PCP
draining (free_high == true) was triggered by an order-1 page freeing
with the following call stack,

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the page table PGD
freeing (mm_free_pgd()).  It's an order-1 page freeing if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.
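
For reference, the order-1 size comes from the x86 PGD allocation
order, which doubles the PGD when page table isolation is enabled
(from arch/x86/include/asm/pgalloc.h of kernels in this era, quoted
from memory, so check your tree):

#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
 * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
 * both 8k in size and 8k-aligned.  That lets us just flip bit 12
 * in a pointer to swap between the two 4k halves.
 */
#define PGD_ALLOCATION_ORDER 1
#else
#define PGD_ALLOCATION_ORDER 0
#endif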

Just before that, page freeing with the following call stack was
found,

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of the process's user pages are freed without any
  page allocation, so it's highly possible that pcp->free_factor
  becomes > 0.  In fact, this is the expected behavior to improve
  process exit performance.

- after freeing all user pages, the PGD is freed, which is an order-1
  page freeing, so the PCP is drained.

All in all, when a process exits, it's highly possible that the PCP
will be drained.  This is an unexpected behavior.

To avoid this, in this patch, the PCP draining is only triggered
after 2 consecutive high-order page freeings.
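
As a minimal sketch of the new condition (assuming pcp->free_factor
is nonzero and ignoring pcp->count, both of which also gate free_high
in the real code), replaying the flag logic over a sequence of freed
orders shows that draining only fires on the second consecutive
high-order free:

#include <stdbool.h>
#include <stdio.h>

#define PCPF_PREV_FREE_HIGH_ORDER	0x1
#define PAGE_ALLOC_COSTLY_ORDER		3

int main(void)
{
	int orders[] = { 1, 1, 0, 1, 1 };
	unsigned int flags = 0;

	for (int i = 0; i < 5; i++) {
		int order = orders[i];
		bool free_high = false;

		if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
			free_high = flags & PCPF_PREV_FREE_HIGH_ORDER;
			flags |= PCPF_PREV_FREE_HIGH_ORDER;
		} else {
			flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
		}
		/* prints 0, 1, 0, 0, 1: the order-0 free resets the run */
		printf("free order-%d: free_high=%d\n", order, free_high);
	}
	return 0;
}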

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server used by the 0-Day kbuild service.
With the patch, the cycles% of the spinlock contention (mostly for
zone lock) decreases from 14.0% to 12.8% (with PCP size == 367).  The
number of PCP drains for high-order page freeing (free_high)
decreases by 80.5%.

This helps network workloads too through reduced zone lock contention.
On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16 pairs of processes increases by 16.8%.  The cycles% of
the spinlock contention (mostly for zone lock) decreases from 51.4% to
46.1%.  The number of PCP drains for high-order page
freeing (free_high) decreases by 30.5%.  The cache miss rate stays at
0.2%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h | 12 +++++++++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..19c40a6f7e45 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,22 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/*
+ * Flags used in pcp->flags field.
+ *
+ * PCPF_PREV_FREE_HIGH_ORDER: a high-order page is freed in the
+ * previous page freeing.  To avoid to drain PCP for an accident
+ * high-order page freeing.
+ */
+#define	PCPF_PREV_FREE_HIGH_ORDER	BIT(0)
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..295e61f0c49d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
-- 
2.39.2



* [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-19 12:11   ` Mel Gorman
  2023-10-16  5:29 ` [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages Huang Ying
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying,
	Sudeep Holla, Mel Gorman, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter

Calculate the size of the per-CPU data cache slice.  This can be used
to estimate the size of the data cache slice that can be used by one
CPU under ideal circumstances.  Both DATA caches and UNIFIED caches
are used in the calculation.  So, users need to consider the impact
of code cache usage.

Because the cache inclusive/non-inclusive information isn't available
now, we just use the size of the per-CPU slice of the LLC to make the
result more predictable across architectures.  This may be improved
when more cache information is available in the future.
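
As a rough worked example (topology values assumed, not from the
patch), the computed slice mirrors llc->size /
cpumask_weight(&llc->shared_cpu_map):

#include <stdio.h>

int main(void)
{
	unsigned int llc_size = 32u << 20;	/* 32MB unified LLC, assumed */
	unsigned int nr_shared = 16;		/* CPUs sharing that LLC, assumed */

	/* per_cpu_data_slice_size = llc->size / nr_shared: 2048 KB here */
	printf("per-CPU data slice: %u KB\n", (llc_size / nr_shared) >> 10);
	return 0;
}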

A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 drivers/base/cacheinfo.c  | 49 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..585c66fce9d9 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,48 @@ static int cache_add_dev(unsigned int cpu)
 	return rc;
 }
 
+/*
+ * Calculate the size of the per-CPU data cache slice.  This can be
+ * used to estimate the size of the data cache slice that can be used
+ * by one CPU under ideal circumstances.  UNIFIED caches are counted
+ * in addition to DATA caches.  So, please consider code cache usage
+ * when use the result.
+ *
+ * Because the cache inclusive/non-inclusive information isn't
+ * available, we just use the size of the per-CPU slice of LLC to make
+ * the result more predictable across architectures.
+ */
+static void update_per_cpu_data_slice_size_cpu(unsigned int cpu)
+{
+	struct cpu_cacheinfo *ci;
+	struct cacheinfo *llc;
+	unsigned int nr_shared;
+
+	if (!last_level_cache_is_valid(cpu))
+		return;
+
+	ci = ci_cacheinfo(cpu);
+	llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);
+
+	if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED)
+		return;
+
+	nr_shared = cpumask_weight(&llc->shared_cpu_map);
+	if (nr_shared)
+		ci->per_cpu_data_slice_size = llc->size / nr_shared;
+}
+
+static void update_per_cpu_data_slice_size(bool cpu_online, unsigned int cpu)
+{
+	unsigned int icpu;
+
+	for_each_online_cpu(icpu) {
+		if (!cpu_online && icpu == cpu)
+			continue;
+		update_per_cpu_data_slice_size_cpu(icpu);
+	}
+}
+
 static int cacheinfo_cpu_online(unsigned int cpu)
 {
 	int rc = detect_cache_attributes(cpu);
@@ -906,7 +948,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		return rc;
 	rc = cache_add_dev(cpu);
 	if (rc)
-		free_cache_attributes(cpu);
+		goto err;
+	update_per_cpu_data_slice_size(true, cpu);
+	return 0;
+err:
+	free_cache_attributes(cpu);
 	return rc;
 }
 
@@ -916,6 +962,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 		cpu_cache_sysfs_exit(cpu);
 
 	free_cache_attributes(cpu);
+	update_per_cpu_data_slice_size(false, cpu);
 	return 0;
 }
 
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..d504eb4b49ab 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {
 
 struct cpu_cacheinfo {
 	struct cacheinfo *info_list;
+	unsigned int per_cpu_data_slice_size;
 	unsigned int num_levels;
 	unsigned int num_leaves;
 	bool cpu_map_populated;
-- 
2.39.2



* [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-27  6:23   ` kernel test robot
  2023-11-06  6:22   ` kernel test robot
  2023-10-16  5:29 ` [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency Huang Ying
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Sudeep Holla, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when it is mostly used for high-order page freeing, to improve the
reuse of cache-hot pages between the page allocating and freeing
CPUs.

On systems with a small per-CPU data cache slice, pages shouldn't be
cached before draining to guarantee that they stay cache-hot.  But on
systems with a large per-CPU data cache slice, some pages can be
cached before draining to reduce zone lock contention.

So, in this patch, instead of draining without any caching,
"pcp->batch" pages will be cached in the PCP before draining if the
size of the per-CPU data cache slice is more than "3 * batch".

In theory, if the size of the per-CPU data cache slice is more than
"2 * batch", we can reuse cache-hot pages between CPUs.  But
considering other cache usage (code, other data accesses, etc.), "3 *
batch" is used.

Note: "3 * batch" is chosen to make sure the optimization works on
recent x86_64 server CPUs.  If you want to increase it, please check
whether it breaks the optimization.
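
A hypothetical worked example of the threshold (batch and page size
assumed, not from the patch): with the default batch of 63 and 4KB
pages, "3 * batch" corresponds to about 756KB, so the per-CPU data
cache slice must exceed that before pages are preserved:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* 4KB pages, assumed */

/* Mirrors the (slice >> PAGE_SHIFT) > 3 * batch test in this patch. */
static bool can_preserve_batch(unsigned int slice_bytes, int batch)
{
	return (slice_bytes >> PAGE_SHIFT) > 3 * batch;
}

int main(void)
{
	int batch = 63;	/* default for zones > 1GB */

	printf("1.5MB slice: %d\n", can_preserve_batch(1536u << 10, batch)); /* 1 */
	printf("384KB slice: %d\n", can_preserve_batch(384u << 10, batch));  /* 0 */
	return 0;
}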

On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16 pairs of processes increases by 70.5%.  The cycles% of
the spinlock contention (mostly for zone lock) decreases from 46.1%
to 21.3%.  The number of PCP drains for high-order page
freeing (free_high) decreases by 89.9%.  The cache miss rate stays at
0.2%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 drivers/base/cacheinfo.c |  2 ++
 include/linux/gfp.h      |  1 +
 include/linux/mmzone.h   |  6 ++++++
 mm/page_alloc.c          | 38 +++++++++++++++++++++++++++++++++++++-
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 585c66fce9d9..f1e79263fe61 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -950,6 +950,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 	if (rc)
 		goto err;
 	update_per_cpu_data_slice_size(true, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 err:
 	free_cache_attributes(cpu);
@@ -963,6 +964,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 	free_cache_attributes(cpu);
 	update_per_cpu_data_slice_size(false, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..665edc11fb9f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+void setup_pcp_cacheinfo(void);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 19c40a6f7e45..cdff247e8c6f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,8 +682,14 @@ enum zone_watermarks {
  * PCPF_PREV_FREE_HIGH_ORDER: a high-order page is freed in the
  * previous page freeing.  To avoid to drain PCP for an accident
  * high-order page freeing.
+ *
+ * PCPF_FREE_HIGH_BATCH: preserve "pcp->batch" pages in PCP before
+ * draining PCP for consecutive high-order pages freeing without
+ * allocation if data cache slice of CPU is large enough.  To reduce
+ * zone lock contention and keep cache-hot pages reusing.
  */
 #define	PCPF_PREV_FREE_HIGH_ORDER	BIT(0)
+#define	PCPF_FREE_HIGH_BATCH		BIT(1)
 
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 295e61f0c49d..ba2d8f06523e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -52,6 +52,7 @@
 #include <linux/psi.h>
 #include <linux/khugepaged.h>
 #include <linux/delayacct.h>
+#include <linux/cacheinfo.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
 		free_high = (pcp->free_factor &&
-			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
+			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
+			      pcp->count >= READ_ONCE(pcp->batch)));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
@@ -5418,6 +5421,39 @@ static void zone_pcp_update(struct zone *zone, int cpu_online)
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
+static void zone_pcp_update_cacheinfo(struct zone *zone)
+{
+	int cpu;
+	struct per_cpu_pages *pcp;
+	struct cpu_cacheinfo *cci;
+
+	for_each_online_cpu(cpu) {
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		cci = get_cpu_cacheinfo(cpu);
+		/*
+		 * If data cache slice of CPU is large enough, "pcp->batch"
+		 * pages can be preserved in PCP before draining PCP for
+		 * consecutive high-order pages freeing without allocation.
+		 * This can reduce zone lock contention without hurting
+		 * cache-hot pages sharing.
+		 */
+		spin_lock(&pcp->lock);
+		if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
+			pcp->flags |= PCPF_FREE_HIGH_BATCH;
+		else
+			pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
+		spin_unlock(&pcp->lock);
+	}
+}
+
+void setup_pcp_cacheinfo(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update_cacheinfo(zone);
+}
+
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
-- 
2.39.2



* [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (2 preceding siblings ...)
  2023-10-16  5:29 ` [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-19 12:12   ` Mel Gorman
  2023-10-16  5:29 ` [PATCH -V3 5/9] mm, page_alloc: scale the number of pages that are batch allocated Huang Ying
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

In the page allocator, the PCP (Per-CPU Pageset) is refilled and
drained in batches to increase page allocation throughput, reduce
page allocation/freeing latency per page, and reduce zone lock
contention.  But too large a batch size will cause too long a maximal
allocation/freeing latency, which may punish arbitrary users.  So the
default batch size is chosen carefully (in zone_batchsize(), the
value is 63 for zones > 1GB) to avoid that.

In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
are batch freed"), the batch size is scaled up for large numbers of
page freeings to improve page freeing performance and reduce zone lock
contention.  A similar optimization can be used for large numbers of
page allocations too.

To find a suitable max batch scale factor (that is, max effective
batch size), some tests and measurements were done on several
machines as follows.

A set of debug patches was implemented as follows,

- Set PCP high to be 2 * batch to reduce the effect of PCP high

- Disable free batch size scaling to get the raw performance.

- The code with the zone lock held is extracted from rmqueue_bulk()
  and free_pcppages_bulk() into 2 separate functions to make it easy
  to measure the function run time with the ftrace function_graph
  tracer.

- The batch size is hard coded to be 63 (default), 127, 255, 511,
  1023, 2047, 4095.

Then will-it-scale/page_fault1 is used to generate the page
allocation/freeing workload.  The page allocation/freeing throughput
(page/s) is measured via will-it-scale.  The page allocation/freeing
average latency (alloc/free latency avg, in us) and the
allocation/freeing latency at the 99th percentile (alloc/free latency
99%, in us) are measured with the ftrace function_graph tracer.

The test results are as follows,

Sapphire Rapids Server
======================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	513633.4	 2.33		 3.57		 2.67		  6.83
 127	517616.7	 4.35		 6.65		 4.22		 13.03
 255	520822.8	 8.29		13.32		 7.52		 25.24
 511	524122.0	15.79		23.42		14.02		 49.35
1023	525980.5	30.25		44.19		25.36		 94.88
2047	526793.6	59.39		84.50		45.22		140.81

Ice Lake Server
===============
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	620210.3	 2.21		 3.68		 2.02		 4.35
 127	627003.0	 4.09		 6.86		 3.51		 8.28
 255	630777.5	 7.70		13.50		 6.17		15.97
 511	633651.5	14.85		22.62		11.66		31.08
1023	637071.1	28.55		42.02		20.81		54.36
2047	638089.7	56.54		84.06		39.28		91.68

Cascade Lake Server
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	404706.7	 3.29		  5.03		 3.53		  4.75
 127	422475.2	 6.12		  9.09		 6.36		  8.76
 255	411522.2	11.68		 16.97		10.90		 16.39
 511	428124.1	22.54		 31.28		19.86		 32.25
1023	414718.4	43.39		 62.52		40.00		 66.33
2047	429848.7	86.64		120.34		71.14		106.08

Comet Lake Desktop
==================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	795183.13	 2.18		 3.55		 2.03		 3.05
 127	803067.85	 3.91		 6.56		 3.85		 5.52
 255	812771.10	 7.35		10.80		 7.14		10.20
 511	817723.48	14.17		27.54		13.43		30.31
1023	818870.19	27.72		40.10		27.89		46.28

Coffee Lake Desktop
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	510542.8	 3.13		  4.40		 2.48		 3.43
 127	514288.6	 5.97		  7.89		 4.65		 6.04
 255	516889.7	11.86		 15.58		 8.96		12.55
 511	519802.4	23.10		 28.81		16.95		26.19
1023	520802.7	45.30		 52.51		33.19		45.95
2047	519997.1	90.63		104.00		65.26		81.74

From the above data, to restrict the allocation/freeing latency to be
less than 100 us in most cases, the max batch scale factor needs to
be less than or equal to 5.  With the default batch of 63, a scale
factor of 5 gives an effective batch size of 63 << 5 = 2016 pages,
which matches the ~2047 rows above.

Although it is reasonable to use 5 as the max batch scale factor for
the systems tested, there are also slower systems, where a smaller
value should be used to constrain the page allocation/freeing
latency.

So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added
to set the max batch scale factor.  Its default value is 5, and users
can reduce it when necessary.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/Kconfig      | 11 +++++++++++
 mm/page_alloc.c |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..ece4f2847e2b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,17 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
+config PCP_BATCH_SCALE_MAX
+	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
+	default 5
+	range 0 6
+	help
+	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
+	  batches.  The batch number is scaled automatically to improve page
+	  allocation/free throughput.  But too large scale factor may hurt
+	  latency.  This option sets the upper limit of scale factor to limit
+	  the maximum latency.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba2d8f06523e..a5a5a4c3cd2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2340,7 +2340,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free)
+	if (batch < max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 
-- 
2.39.2



* [PATCH -V3 5/9] mm, page_alloc: scale the number of pages that are batch allocated
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (3 preceding siblings ...)
  2023-10-16  5:29 ` [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-16  5:29 ` [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning Huang Ying
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

When a task is allocating a large number of order-0 pages, it may
acquire the zone->lock multiple times, allocating pages in batches.
This may unnecessarily contend on the zone lock when allocating a
very large number of pages.  This patch adapts the size of the batch
based on the recent pattern to scale the batch size for subsequent
allocations.
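
A minimal sketch of the scaling pattern (base batch and scale limit
assumed; the max_nr_alloc cap from the real nr_pcp_alloc() is ignored
here): consecutive order-0 refills without any intervening free
double the batch up to the scale limit, while any free halves
alloc_factor again:

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5	/* assumed, see patch [4/9] */

int main(void)
{
	int base_batch = 63;	/* default for zones > 1GB */
	int alloc_factor = 0;

	/* prints 63, 126, 252, 504, 1008, 2016, 2016 pages */
	for (int refill = 1; refill <= 7; refill++) {
		int batch = base_batch << alloc_factor;

		printf("refill %d: %d pages\n", refill, batch);
		if (alloc_factor < PCP_BATCH_SCALE_MAX)
			alloc_factor++;
	}
	/* a freeing would halve the factor: alloc_factor >>= 1 */
	return 0;
}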

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server used by the 0-Day kbuild service.
With the patch, the cycles% of the spinlock contention (mostly for
zone lock) decreases from 12.6% to 11.0% (with PCP size == 367).

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  3 ++-
 mm/page_alloc.c        | 53 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cdff247e8c6f..ba548ae20686 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -697,9 +697,10 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
+	u8 alloc_factor;	/* batch scaling factor during allocate */
 	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
+	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a5a5a4c3cd2b..eeef0ead1c2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2373,6 +2373,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	int pindex;
 	bool free_high = false;
 
+	/*
+	 * On freeing, reduce the number of pages that are batch allocated.
+	 * See nr_pcp_alloc() where alloc_factor is increased for subsequent
+	 * allocations.
+	 */
+	pcp->alloc_factor >>= 1;
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -2679,6 +2685,42 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+{
+	int high, batch, max_nr_alloc;
+
+	high = READ_ONCE(pcp->high);
+	batch = READ_ONCE(pcp->batch);
+
+	/* Check for PCP disabled or boot pageset */
+	if (unlikely(high < batch))
+		return 1;
+
+	/*
+	 * Double the number of pages allocated each time there is subsequent
+	 * allocation of order-0 pages without any freeing.
+	 */
+	if (!order) {
+		max_nr_alloc = max(high - pcp->count - batch, batch);
+		batch <<= pcp->alloc_factor;
+		if (batch <= max_nr_alloc &&
+		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+			pcp->alloc_factor++;
+		batch = min(batch, max_nr_alloc);
+	}
+
+	/*
+	 * Scale batch relative to order if batch implies free pages
+	 * can be stored on the PCP. Batch can be 1 for small zones or
+	 * for boot pagesets which should never store free pages as
+	 * the pages may belong to arbitrary zones.
+	 */
+	if (batch > 1)
+		batch = max(batch >> order, 2);
+
+	return batch;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -2691,18 +2733,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = READ_ONCE(pcp->batch);
+			int batch = nr_pcp_alloc(pcp, order);
 			int alloced;
 
-			/*
-			 * Scale batch relative to order if batch implies
-			 * free pages can be stored on the PCP. Batch can
-			 * be 1 for small zones or for boot pagesets which
-			 * should never store free pages as the pages may
-			 * belong to arbitrary zones.
-			 */
-			if (batch > 1)
-				batch = max(batch >> order, 2);
 			alloced = rmqueue_bulk(zone, order,
 					batch, list,
 					migratetype, alloc_flags);
-- 
2.39.2



* [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (4 preceding siblings ...)
  2023-10-16  5:29 ` [PATCH -V3 5/9] mm, page_alloc: scale the number of pages that are batch allocated Huang Ying
@ 2023-10-16  5:29 ` Huang Ying
  2023-10-19 12:16   ` Mel Gorman
  2023-10-16  5:30 ` [PATCH -V3 7/9] mm: tune PCP high automatically Huang Ying
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

The page allocation performance requirements of different workloads
are usually different.  So, we need to tune the PCP (per-CPU pageset)
high value to optimize the workload page allocation performance.
Now, we have a system-wide sysctl knob (percpu_pagelist_high_fraction)
to tune PCP high by hand.  But it's hard to find the best value by
hand.  And one global configuration may not work best for all the
different workloads that run on the same system.  One solution to
these issues is to tune the PCP high of each CPU automatically.

This patch adds the framework for PCP high auto-tuning.  With it,
pcp->high of each CPU will be changed automatically by tuning
algorithm at runtime.  The minimal high (pcp->high_min) is the
original PCP high value calculated based on the low watermark pages.
While the maximal high (pcp->high_max) is the PCP high value when
percpu_pagelist_high_fraction sysctl knob is set to
MIN_PERCPU_PAGELIST_HIGH_FRACTION.  That is, the maximal pcp->high
that can be set via sysctl knob by hand.
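
A hypothetical worked example of the resulting auto-tuning range
(zone size, watermark, and CPU count all assumed), following the
zone_highsize() logic with MIN_PERCPU_PAGELIST_HIGH_FRACTION assumed
to be 8 as in mainline:

#include <stdio.h>

int main(void)
{
	long managed_pages = 16L << 20;	/* 64GB zone with 4KB pages, assumed */
	long low_wmark_pages = 16384;	/* zone low watermark, assumed */
	int nr_local_cpus = 32;		/* CPUs local to the zone, assumed */

	/* default case: high based on the zone low watermark */
	long high_min = low_wmark_pages / nr_local_cpus;
	/* maximal case: managed pages / MIN_PERCPU_PAGELIST_HIGH_FRACTION */
	long high_max = (managed_pages / 8) / nr_local_cpus;

	/* prints [512, 65536] pages per CPU for these values */
	printf("pcp->high range: [%ld, %ld] pages per CPU\n",
	       high_min, high_max);
	return 0;
}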

It's possible that PCP high auto-tuning doesn't work well for some
workloads.  So, when PCP high is tuned by hand via the sysctl knob,
the auto-tuning will be disabled.  The PCP high set by hand will be
used instead.

This patch only adds the framework, so pcp->high will always be set
to pcp->high_min (the original default).  The actual auto-tuning
algorithm will be added in the following patches in the series.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  5 ++-
 mm/page_alloc.c        | 71 +++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ba548ae20686..ec3f7daedcc7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -695,6 +695,8 @@ struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
+	int high_min;		/* min high watermark */
+	int high_max;		/* max high watermark */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
@@ -854,7 +856,8 @@ struct zone {
 	 * the high and batch values are copied to individual pagesets for
 	 * faster access
 	 */
-	int pageset_high;
+	int pageset_high_min;
+	int pageset_high_max;
 	int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeef0ead1c2a..1fb2c6ebde9c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2350,7 +2350,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		       bool free_high)
 {
-	int high = READ_ONCE(pcp->high);
+	int high = READ_ONCE(pcp->high_min);
 
 	if (unlikely(!high || free_high))
 		return 0;
@@ -2689,7 +2689,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
 {
 	int high, batch, max_nr_alloc;
 
-	high = READ_ONCE(pcp->high);
+	high = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 
 	/* Check for PCP disabled or boot pageset */
@@ -5296,14 +5296,15 @@ static int zone_batchsize(struct zone *zone)
 }
 
 static int percpu_pagelist_high_fraction;
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+			 int high_fraction)
 {
 #ifdef CONFIG_MMU
 	int high;
 	int nr_split_cpus;
 	unsigned long total_pages;
 
-	if (!percpu_pagelist_high_fraction) {
+	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
 		 * low watermark so that if they are full then background
@@ -5316,15 +5317,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 		 * value is based on a fraction of the managed pages in the
 		 * zone.
 		 */
-		total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+		total_pages = zone_managed_pages(zone) / high_fraction;
 	}
 
 	/*
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined. For memory nodes that have no CPUs, split pcp->high across
-	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * onlined. For memory nodes that have no CPUs, split the high value
+	 * across all online CPUs to mitigate the risk that reclaim is triggered
 	 * prematurely due to pages stored on pcp lists.
 	 */
 	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -5352,19 +5353,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
  * However, guaranteeing these relations at all times would require e.g. write
  * barriers here but also careful usage of read barriers at the read side, and
  * thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
  * exist).
  */
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
-		unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min,
+			   unsigned long high_max, unsigned long batch)
 {
 	WRITE_ONCE(pcp->batch, batch);
-	WRITE_ONCE(pcp->high, high);
+	WRITE_ONCE(pcp->high_min, high_min);
+	WRITE_ONCE(pcp->high_max, high_max);
 }
 
 static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -5384,20 +5387,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	 * need to be as careful as pageset_update() as nobody can access the
 	 * pageset yet.
 	 */
-	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->high_min = BOOT_PAGESET_HIGH;
+	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
 	pcp->free_factor = 0;
 }
 
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-		unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
+					      unsigned long high_max, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
 		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-		pageset_update(pcp, high, batch);
+		pageset_update(pcp, high_min, high_max, batch);
 	}
 }
 
@@ -5407,19 +5411,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
 {
-	int new_high, new_batch;
+	int new_high_min, new_high_max, new_batch;
 
 	new_batch = max(1, zone_batchsize(zone));
-	new_high = zone_highsize(zone, new_batch, cpu_online);
+	if (percpu_pagelist_high_fraction) {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online,
+					     percpu_pagelist_high_fraction);
+		/*
+		 * PCP high is tuned manually, disable auto-tuning via
+		 * setting high_min and high_max to the manual value.
+		 */
+		new_high_max = new_high_min;
+	} else {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online, 0);
+		new_high_max = zone_highsize(zone, new_batch, cpu_online,
+					     MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+	}
 
-	if (zone->pageset_high == new_high &&
+	if (zone->pageset_high_min == new_high_min &&
+	    zone->pageset_high_max == new_high_max &&
 	    zone->pageset_batch == new_batch)
 		return;
 
-	zone->pageset_high = new_high;
+	zone->pageset_high_min = new_high_min;
+	zone->pageset_high_max = new_high_max;
 	zone->pageset_batch = new_batch;
 
-	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+	__zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max,
+					  new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -5528,7 +5547,8 @@ __meminit void zone_pcp_init(struct zone *zone)
 	 */
 	zone->per_cpu_pageset = &boot_pageset;
 	zone->per_cpu_zonestats = &boot_zonestats;
-	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_high_min = BOOT_PAGESET_HIGH;
+	zone->pageset_high_max = BOOT_PAGESET_HIGH;
 	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
 	if (populated_zone(zone))
@@ -6430,13 +6450,14 @@ EXPORT_SYMBOL(free_contig_range);
 void zone_pcp_disable(struct zone *zone)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	__zone_set_pageset_high_and_batch(zone, 0, 1);
+	__zone_set_pageset_high_and_batch(zone, 0, 0, 1);
 	__drain_all_pages(zone, true);
 }
 
 void zone_pcp_enable(struct zone *zone)
 {
-	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	__zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+		zone->pageset_high_max, zone->pageset_batch);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
-- 
2.39.2



* [PATCH -V3 7/9] mm: tune PCP high automatically
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (5 preceding siblings ...)
  2023-10-16  5:29 ` [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning Huang Ying
@ 2023-10-16  5:30 ` Huang Ying
  2023-10-31  2:50   ` kernel test robot
  2023-10-16  5:30 ` [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark Huang Ying
  2023-10-16  5:30 ` [PATCH -V3 9/9] mm, pcp: reduce detecting time of consecutive high order page freeing Huang Ying
  8 siblings, 1 reply; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Michal Hocko, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Pavel Tatashin, Matthew Wilcox,
	Christoph Lameter

The targets of the automatic PCP high tuning are as follows,

- Minimize allocation/freeing from/to the shared zone

- Minimize idle pages in the PCP

- Minimize pages in the PCP if the system's free pages are too few

To reach these targets, the following tuning algorithm is designed,

- When we refill the PCP via allocating from the zone, increase PCP
  high.  Because if we had a larger PCP, we could avoid allocating
  from the zone.

- In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possible idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in allocating path, decrease PCP high and free some pages in
  freeing path.

So, the PCP high can be tuned to the page allocating/freeing depth of
workloads eventually.

One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become
the maximal value even if the allocating/freeing depth is small.  But
this isn't a severe issue, because there are no idle pages in this
case.

One alternative choice is to increase PCP high when we drain the PCP
via trying to free pages to the zone, but not to increase it during
PCP refilling.  This can avoid the issue above.  But if the number of
pages allocated is much less than that of pages freed on a CPU, there
will be many idle pages in the PCP and it is hard to free these idle
pages.

PCP high is decreased by 1/8 (>> 3) periodically.  The value 1/8 is
somewhat arbitrary; it just makes sure that the idle PCP pages will
be freed eventually.
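
A minimal sketch of the periodic decay (initial high and high_min
assumed; the batch-based cap on the per-round decrement in
decay_pcp_high() is ignored): a fully idle PCP converges back to
high_min in a few tens of vmstat updates:

#include <stdio.h>

int main(void)
{
	int high = 10000, high_min = 1000;	/* assumed values */
	int rounds = 0;

	while (high > high_min) {
		high -= high >> 3;	/* decrease by 1/8 per update */
		if (high < high_min)
			high = high_min;
		rounds++;
	}
	/* prints 18 updates for these values */
	printf("reached high_min after %d updates\n", rounds);
	return 0;
}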

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server used by the 0-Day kbuild service.
With the patch, the build time decreases by 3.5%.  The cycles% of the
spinlock contention (mostly for zone lock) decreases from 11.0% to
0.5%.  The number of PCP drains for high-order page
freeing (free_high) decreases by 65.6%.  The number of pages
allocated from the zone (instead of from the PCP) decreases by 83.9%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 119 ++++++++++++++++++++++++++++++++++----------
 mm/vmstat.c         |   8 +--
 3 files changed, 99 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@ extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fb2c6ebde9c..8382ad2cdfd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2157,6 +2157,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	return i;
 }
 
+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return whether there are addition works to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	int high_min, to_drain, batch;
+	int todo = 0;
+
+	high_min = READ_ONCE(pcp->high_min);
+	batch = READ_ONCE(pcp->batch);
+	/*
+	 * Decrease pcp->high periodically to try to free possible
+	 * idle PCP pages.  And, avoid to free too many pages to
+	 * control latency.  This caps pcp->high decrement too.
+	 */
+	if (pcp->high > high_min) {
+		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+				 pcp->high - (pcp->high >> 3), high_min);
+		if (pcp->high > high_min)
+			todo++;
+	}
+
+	to_drain = pcp->count - pcp->high;
+	if (to_drain > 0) {
+		spin_lock(&pcp->lock);
+		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		spin_unlock(&pcp->lock);
+		todo++;
+	}
+
+	return todo;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
@@ -2318,14 +2352,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
 	return true;
 }
 
-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
-	int batch = READ_ONCE(pcp->batch);
 
-	/* Free everything if batch freeing high-order pages. */
+	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return pcp->count;
+		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2340,7 +2373,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+	if (batch <= max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 
@@ -2348,28 +2381,48 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 }
 
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
-		       bool free_high)
+		       int batch, bool free_high)
 {
-	int high = READ_ONCE(pcp->high_min);
+	int high, high_min, high_max;
 
-	if (unlikely(!high || free_high))
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+	if (unlikely(!high))
 		return 0;
 
-	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-		return high;
+	if (unlikely(free_high)) {
+		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+				high_min);
+		return 0;
+	}
 
 	/*
 	 * If reclaim is active, limit the number of pages that can be
 	 * stored on pcp lists
 	 */
-	return min(READ_ONCE(pcp->batch) << 2, high);
+	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		return min(batch << 2, pcp->high);
+	}
+
+	if (pcp->count >= high && high_min != high_max) {
+		int need_high = (batch << pcp->free_factor) + batch;
+
+		/* pcp->high should be large enough to hold batch freed pages */
+		if (pcp->high < need_high)
+			pcp->high = clamp(need_high, high_min, high_max);
+	}
+
+	return high;
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
 				   unsigned int order)
 {
-	int high;
+	int high, batch;
 	int pindex;
 	bool free_high = false;
 
@@ -2384,6 +2437,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
+	batch = READ_ONCE(pcp->batch);
 	/*
 	 * As high-order pages other than THP's stored on PCP can contribute
 	 * to fragmentation, limit the number stored when PCP is heavily
@@ -2394,14 +2448,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		free_high = (pcp->free_factor &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
-			      pcp->count >= READ_ONCE(pcp->batch)));
+			      pcp->count >= batch));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	high = nr_pcp_high(pcp, zone, free_high);
+	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+				   pcp, pindex);
 	}
 }
 
@@ -2685,24 +2740,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 {
-	int high, batch, max_nr_alloc;
+	int high, base_batch, batch, max_nr_alloc;
+	int high_max, high_min;
 
-	high = READ_ONCE(pcp->high_min);
-	batch = READ_ONCE(pcp->batch);
+	base_batch = READ_ONCE(pcp->batch);
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
 
 	/* Check for PCP disabled or boot pageset */
-	if (unlikely(high < batch))
+	if (unlikely(high < base_batch))
 		return 1;
 
+	if (order)
+		batch = base_batch;
+	else
+		batch = (base_batch << pcp->alloc_factor);
+
 	/*
-	 * Double the number of pages allocated each time there is subsequent
-	 * allocation of order-0 pages without any freeing.
+	 * If pcp->high were larger, we could avoid allocating from
+	 * the zone.
 	 */
+	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+		high = pcp->high = min(high + batch, high_max);
+
 	if (!order) {
-		max_nr_alloc = max(high - pcp->count - batch, batch);
-		batch <<= pcp->alloc_factor;
+		max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+		/*
+		 * Double the number of pages allocated each time there is
+		 * subsequent allocation of order-0 pages without any freeing.
+		 */
 		if (batch <= max_nr_alloc &&
 		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 			pcp->alloc_factor++;
@@ -2733,7 +2802,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = nr_pcp_alloc(pcp, order);
+			int batch = nr_pcp_alloc(pcp, zone, order);
 			int alloced;
 
 			alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
 		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #endif
 			}
 		}
-#ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
 			cond_resched();
+
+			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
-		}
 #endif
+		}
 	}
 
 	for_each_online_pgdat(pgdat) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (6 preceding siblings ...)
  2023-10-16  5:30 ` [PATCH -V3 7/9] mm: tune PCP high automatically Huang Ying
@ 2023-10-16  5:30 ` Huang Ying
  2023-10-19 12:33   ` Mel Gorman
  2023-10-16  5:30 ` [PATCH -V3 9/9] mm, pcp: reduce detecting time of consecutive high order page freeing Huang Ying
  8 siblings, 1 reply; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

One goal of the PCP is to minimize the number of pages cached in it
when the system is short of free pages.  To reach that goal, when page
reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop
increasing PCP high in the allocation path, and we decrease PCP high
and free some pages in the freeing path.  But this may be too late,
because the background page reclaim may already have introduced
latency for some workloads.  So, in this patch, during page allocation
we detect whether the number of free pages in the zone is below the
high watermark.  If so, we stop increasing PCP high in the allocation
path, and we decrease PCP high and free some pages in the freeing
path.  With this, we reduce the possibility of premature background
page reclaim caused by a too-large PCP.

The high watermark check is done in the allocation path to reduce the
overhead in the hotter freeing path.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ec3f7daedcc7..c88770381aaf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1018,6 +1018,7 @@ enum zone_flags {
 					 * Cleared when kswapd is woken.
 					 */
 	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
+	ZONE_BELOW_HIGH,		/* zone is below high watermark. */
 };
 
 static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8382ad2cdfd4..253fc7d0498e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2407,7 +2407,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return min(batch << 2, pcp->high);
 	}
 
-	if (pcp->count >= high && high_min != high_max) {
+	if (high_min == high_max)
+		return high;
+
+	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		high = max(pcp->count, high_min);
+	} else if (pcp->count >= high) {
 		int need_high = (batch << pcp->free_factor) + batch;
 
 		/* pcp->high should be large enough to hold batch freed pages */
@@ -2457,6 +2463,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
 				   pcp, pindex);
+		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
+		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
+				      ZONE_MOVABLE, 0))
+			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
 	}
 }
 
@@ -2763,7 +2773,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 	 * If we had larger pcp->high, we could avoid to allocate from
 	 * zone.
 	 */
-	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+	if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
 		high = pcp->high = min(high + batch, high_max);
 
 	if (!order) {
@@ -3225,6 +3235,25 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		/*
+		 * Detect whether the number of free pages is below the
+		 * high watermark.  If so, we will decrease pcp->high and
+		 * free PCP pages in the free path to reduce the
+		 * possibility of premature page reclaim.  Detection is
+		 * done here to avoid doing it in the hotter free path.
+		 */
+		if (test_bit(ZONE_BELOW_HIGH, &zone->flags))
+			goto check_alloc_wmark;
+
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_fast(zone, order, mark,
+					ac->highest_zoneidx, alloc_flags,
+					gfp_mask))
+			goto try_this_zone;
+		else
+			set_bit(ZONE_BELOW_HIGH, &zone->flags);
+
+check_alloc_wmark:
 		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac->highest_zoneidx, alloc_flags,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH -V3 9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
  2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
                   ` (7 preceding siblings ...)
  2023-10-16  5:30 ` [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark Huang Ying
@ 2023-10-16  5:30 ` Huang Ying
  8 siblings, 0 replies; 20+ messages in thread
From: Huang Ying @ 2023-10-16  5:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Arjan Van De Ven, Huang Ying, Mel Gorman,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

In the current PCP auto-tuning design, if many more pages are
allocated than freed on a CPU, the PCP high may reach its maximal
value even if the allocating/freeing depth is small, for example, on
the sender side of network workloads.  If a CPU that was originally
used as a sender is used as a receiver after a context switch, the
whole PCP must be filled to the maximal high before PCP draining is
triggered for consecutive high-order freeing.  This hurts the
performance of some network workloads.

To solve the issue, in this patch we track consecutive page freeing
with a counter instead of relying on PCP draining.  So, we can detect
consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64 process
pairs.  With the patch, the network bandwidth improves by 5.0%.  This
restores the performance drop caused by PCP auto-tuning.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 27 +++++++++++++++------------
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c88770381aaf..57086c57b8e4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -700,10 +700,10 @@ struct per_cpu_pages {
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
-	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
+	short free_count;	/* consecutive free count */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 253fc7d0498e..28088dd7a968 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2369,13 +2369,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 	max_nr_free = high - batch;
 
 	/*
-	 * Double the number of pages freed each time there is subsequent
-	 * freeing of pages without any allocation.
+	 * Increase the batch number to the count of consecutively
+	 * freed pages to reduce zone lock contention.
 	 */
-	batch <<= pcp->free_factor;
-	if (batch <= max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
-		pcp->free_factor++;
-	batch = clamp(batch, min_nr_free, max_nr_free);
+	batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
 
 	return batch;
 }
@@ -2403,7 +2400,9 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	 * stored on pcp lists
 	 */
 	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		int free_count = max_t(int, pcp->free_count, batch);
+
+		pcp->high = max(high - free_count, high_min);
 		return min(batch << 2, pcp->high);
 	}
 
@@ -2411,10 +2410,12 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return high;
 
 	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		int free_count = max_t(int, pcp->free_count, batch);
+
+		pcp->high = max(high - free_count, high_min);
 		high = max(pcp->count, high_min);
 	} else if (pcp->count >= high) {
-		int need_high = (batch << pcp->free_factor) + batch;
+		int need_high = pcp->free_count + batch;
 
 		/* pcp->high should be large enough to hold batch freed pages */
 		if (pcp->high < need_high)
@@ -2451,7 +2452,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_factor &&
+		free_high = (pcp->free_count >= batch &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= batch));
@@ -2459,6 +2460,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
+	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+		pcp->free_count += (1 << order);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2855,7 +2858,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
-	 * See nr_pcp_free() where free_factor is increased for subsequent
-	 * frees.
+	 * See nr_pcp_free() where free_count is used as the batch size
+	 * for subsequent frees.
 	 */
-	pcp->free_factor >>= 1;
+	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
@@ -5488,7 +5491,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	pcp->high_min = BOOT_PAGESET_HIGH;
 	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
-	pcp->free_factor = 0;
+	pcp->free_count = 0;
 }
 
 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice
  2023-10-16  5:29 ` [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice Huang Ying
@ 2023-10-19 12:11   ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2023-10-19 12:11 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Sudeep Holla, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter

On Mon, Oct 16, 2023 at 01:29:55PM +0800, Huang Ying wrote:
> This can be used to estimate the size of the data cache slice that can
> be used by one CPU under ideal circumstances.  Both DATA caches and
> UNIFIED caches are used in the calculation, so users need to consider
> the impact of code cache usage.
> 
> Because the cache inclusive/non-inclusive information isn't available
> now, we just use the size of the per-CPU slice of LLC to make the
> result more predictable across architectures.  This may be improved
> when more cache information is available in the future.
> 
> A brute-force algorithm that iterates over all online CPUs is used to
> avoid allocating an extra cpumask, especially in the offline callback.
> 
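For illustration, the estimate amounts to dividing each DATA or
UNIFIED cache's size by the number of CPUs sharing it.  A minimal
sketch, assuming the cacheinfo fields from include/linux/cacheinfo.h
(this is not the patch's code; the iteration over cache levels and
online CPUs is omitted):

static unsigned int cache_slice_size(struct cacheinfo *ci)
{
	unsigned int nr_sharing = cpumask_weight(&ci->shared_cpu_map);

	/* Only DATA and UNIFIED caches count toward the data slice. */
	if (ci->type != CACHE_TYPE_DATA && ci->type != CACHE_TYPE_UNIFIED)
		return 0;

	return nr_sharing ? ci->size / nr_sharing : 0;
}
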
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency
  2023-10-16  5:29 ` [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency Huang Ying
@ 2023-10-19 12:12   ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2023-10-19 12:12 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

On Mon, Oct 16, 2023 at 01:29:57PM +0800, Huang Ying wrote:
> In the page allocator, the PCP (Per-CPU Pageset) is refilled and
> drained in batches to increase page allocation throughput, reduce
> page allocation/freeing latency per page, and reduce zone lock
> contention.  But a too-large batch size will cause a too-long maximal
> allocation/freeing latency, which may punish arbitrary users.  So the
> default batch size is chosen carefully (in zone_batchsize(), the
> value is 63 for zones > 1GB) to avoid that.
> 
> In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
> are batch freed"), the batch size is scaled up for large numbers of
> page frees to improve page freeing performance and reduce zone lock
> contention.  A similar optimization can be used for large numbers of
> page allocations too.
> 
> To find out a suitable max batch scale factor (that is, the max
> effective batch size), some tests and measurements were done on
> several machines as follows.
> 
> A set of debug patches is implemented as follows,
> 
> - Set PCP high to be 2 * batch to reduce the effect of PCP high
> 
> - Disable free batch size scaling to get the raw performance.
> 
> - The code with the zone lock held is extracted from rmqueue_bulk() and
>   free_pcppages_bulk() into 2 separate functions to make it easy to
>   measure the function run time with the ftrace function_graph tracer.
> 
> - The batch size is hard coded to be 63 (default), 127, 255, 511,
>   1023, 2047, 4095.
> 
> Then will-it-scale/page_fault1 is used to generate the page
> allocation/freeing workload.  The page allocation/freeing throughput
> (page/s) is measured via will-it-scale.  The page allocation/freeing
> average latency (alloc/free latency avg, in us) and allocation/freeing
> latency at the 99th percentile (alloc/free latency 99%, in us) are
> measured with the ftrace function_graph tracer.
> 
> The test results are as follows,
> 
> Sapphire Rapids Server
> ======================
> Batch	throughput	free latency	free latency	alloc latency	alloc latency
> 	page/s		avg / us	99% / us	avg / us	99% / us
> -----	----------	------------	------------	-------------	-------------
>   63	513633.4	 2.33		 3.57		 2.67		  6.83
>  127	517616.7	 4.35		 6.65		 4.22		 13.03
>  255	520822.8	 8.29		13.32		 7.52		 25.24
>  511	524122.0	15.79		23.42		14.02		 49.35
> 1023	525980.5	30.25		44.19		25.36		 94.88
> 2047	526793.6	59.39		84.50		45.22		140.81
> 
> Ice Lake Server
> ===============
> Batch	throughput	free latency	free latency	alloc latency	alloc latency
> 	page/s		avg / us	99% / us	avg / us	99% / us
> -----	----------	------------	------------	-------------	-------------
>   63	620210.3	 2.21		 3.68		 2.02		 4.35
>  127	627003.0	 4.09		 6.86		 3.51		 8.28
>  255	630777.5	 7.70		13.50		 6.17		15.97
>  511	633651.5	14.85		22.62		11.66		31.08
> 1023	637071.1	28.55		42.02		20.81		54.36
> 2047	638089.7	56.54		84.06		39.28		91.68
> 
> Cascade Lake Server
> ===================
> Batch	throughput	free latency	free latency	alloc latency	alloc latency
> 	page/s		avg / us	99% / us	avg / us	99% / us
> -----	----------	------------	------------	-------------	-------------
>   63	404706.7	 3.29		  5.03		 3.53		  4.75
>  127	422475.2	 6.12		  9.09		 6.36		  8.76
>  255	411522.2	11.68		 16.97		10.90		 16.39
>  511	428124.1	22.54		 31.28		19.86		 32.25
> 1023	414718.4	43.39		 62.52		40.00		 66.33
> 2047	429848.7	86.64		120.34		71.14		106.08
> 
> Comet Lake Desktop
> ===================
> Batch	throughput	free latency	free latency	alloc latency	alloc latency
> 	page/s		avg / us	99% / us	avg / us	99% / us
> -----	----------	------------	------------	-------------	-------------
>   63	795183.13	 2.18		 3.55		 2.03		 3.05
>  127	803067.85	 3.91		 6.56		 3.85		 5.52
>  255	812771.10	 7.35		10.80		 7.14		10.20
>  511	817723.48	14.17		27.54		13.43		30.31
> 1023	818870.19	27.72		40.10		27.89		46.28
> 
> Coffee Lake Desktop
> ===================
> Batch	throughput	free latency	free latency	alloc latency	alloc latency
> 	page/s		avg / us	99% / us	avg / us	99% / us
> -----	----------	------------	------------	-------------	-------------
>   63	510542.8	 3.13		  4.40		 2.48		 3.43
>  127	514288.6	 5.97		  7.89		 4.65		 6.04
>  255	516889.7	11.86		 15.58		 8.96		12.55
>  511	519802.4	23.10		 28.81		16.95		26.19
> 1023	520802.7	45.30		 52.51		33.19		45.95
> 2047	519997.1	90.63		104.00		65.26		81.74
> 
> From the above data, to keep the allocation/freeing latency below
> 100 us most of the time, the max batch scale factor needs to be less
> than or equal to 5 (that is, a max effective batch size of
> 63 << 5 = 2016 pages).
> 
> Although it is reasonable to use 5 as the max batch scale factor for
> the systems tested, there are also slower systems, where a smaller
> value should be used to constrain the page allocation/freeing latency.
> 
> So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added
> to set the max batch scale factor.  Its default value is 5, and users
> can reduce it when necessary.
> 
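As a usage illustration (not from the patch itself): a
latency-sensitive system could set

	CONFIG_PCP_BATCH_SCALE_MAX=3

in its kernel config, capping the effective batch at 63 << 3 = 504
pages, which per the 511-page rows in the tables above keeps the 99th
percentile free latency roughly in the 23-31 us range on the tested
systems.
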
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Acked-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning
  2023-10-16  5:29 ` [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning Huang Ying
@ 2023-10-19 12:16   ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2023-10-19 12:16 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

On Mon, Oct 16, 2023 at 01:29:59PM +0800, Huang Ying wrote:
> The page allocation performance requirements of different workloads
> are usually different.  So, we need to tune the PCP (per-CPU pageset)
> high to optimize workload page allocation performance.  Now, we have
> a system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP
> high by hand.  But it's hard to find the best value by hand, and one
> global configuration may not work best for the different workloads
> that run on the same system.  One solution to these issues is to tune
> the PCP high of each CPU automatically.
> 
> This patch adds the framework for PCP high auto-tuning.  With it,
> pcp->high of each CPU will be changed automatically by the tuning
> algorithm at runtime.  The minimal high (pcp->high_min) is the
> original PCP high value calculated based on the low watermark pages.
> The maximal high (pcp->high_max) is the PCP high value when the
> percpu_pagelist_high_fraction sysctl knob is set to
> MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high
> that can be set by hand via the sysctl knob.
> 
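The bounds can be seen being enforced in the diffs earlier in this
thread: in nr_pcp_high() and nr_pcp_alloc(), each consumer of
pcp->high first pulls a possibly stale value back into the configured
range,

	high = pcp->high = clamp(pcp->high, high_min, high_max);

before using it.
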
> It's possible that PCP high auto-tuning doesn't work well for some
> workloads.  So, when PCP high is tuned by hand via the sysctl knob,
> the auto-tuning will be disabled.  The PCP high set by hand will be
> used instead.
> 
> This patch only adds the framework, so pcp->high will always be set
> to pcp->high_min (the original default).  The actual auto-tuning
> algorithm is added in the following patches in the series.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark
  2023-10-16  5:30 ` [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark Huang Ying
@ 2023-10-19 12:33   ` Mel Gorman
  2023-10-20  3:30     ` Huang, Ying
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2023-10-19 12:33 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

On Mon, Oct 16, 2023 at 01:30:01PM +0800, Huang Ying wrote:
> One goal of the PCP is to minimize the number of pages cached in it
> when the system is short of free pages.  To reach that goal, when page
> reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop
> increasing PCP high in the allocation path, and we decrease PCP high
> and free some pages in the freeing path.  But this may be too late,
> because the background page reclaim may already have introduced
> latency for some workloads.  So, in this patch, during page allocation
> we detect whether the number of free pages in the zone is below the
> high watermark.  If so, we stop increasing PCP high in the allocation
> path, and we decrease PCP high and free some pages in the freeing
> path.  With this, we reduce the possibility of premature background
> page reclaim caused by a too-large PCP.
> 
> The high watermark check is done in the allocation path to reduce the
> overhead in the hotter freeing path.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 33 +++++++++++++++++++++++++++++++--
>  2 files changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ec3f7daedcc7..c88770381aaf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1018,6 +1018,7 @@ enum zone_flags {
>  					 * Cleared when kswapd is woken.
>  					 */
>  	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
> +	ZONE_BELOW_HIGH,		/* zone is below high watermark. */
>  };
>  
>  static inline unsigned long zone_managed_pages(struct zone *zone)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8382ad2cdfd4..253fc7d0498e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2407,7 +2407,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  		return min(batch << 2, pcp->high);
>  	}
>  
> -	if (pcp->count >= high && high_min != high_max) {
> +	if (high_min == high_max)
> +		return high;
> +
> +	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
> +		pcp->high = max(high - (batch << pcp->free_factor), high_min);
> +		high = max(pcp->count, high_min);
> +	} else if (pcp->count >= high) {
>  		int need_high = (batch << pcp->free_factor) + batch;
>  
>  		/* pcp->high should be large enough to hold batch freed pages */
> @@ -2457,6 +2463,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>  				   pcp, pindex);
> +		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
> +		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
> +				      ZONE_MOVABLE, 0))
> +			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
>  	}
>  }
>  

This is a relatively fast path and freeing pages should not need to check
watermarks. While the overhead is mitigated because it applies only when
the watermark is below high, that is also potentially an unbounded condition
if a workload is sized precisely enough. Why not clear this bit when kswapd
is going to sleep after reclaiming enough pages in a zone?
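
A hypothetical sketch of that alternative (this is not code from the
series; the helper name is made up, and the call site is assumed to be
kswapd's sleep preparation in mm/vmscan.c):

static void clear_zone_below_high(pg_data_t *pgdat, int highest_zoneidx)
{
	int i;

	for (i = 0; i <= highest_zoneidx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!managed_zone(zone))
			continue;

		/* kswapd restored the high watermark; stop shrinking PCP */
		if (zone_watermark_ok_safe(zone, 0, high_wmark_pages(zone),
					   highest_zoneidx))
			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
	}
}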

If you agree then a follow-up patch classed as a micro-optimisation is
sufficient to avoid redoing all the results again. For most of your
tests, it should be performance-neutral or borderline noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark
  2023-10-19 12:33   ` Mel Gorman
@ 2023-10-20  3:30     ` Huang, Ying
  2023-10-23  9:26       ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Huang, Ying @ 2023-10-20  3:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

Mel Gorman <mgorman@techsingularity.net> writes:

> On Mon, Oct 16, 2023 at 01:30:01PM +0800, Huang Ying wrote:
>> One goal of the PCP is to minimize the number of pages cached in it
>> when the system is short of free pages.  To reach that goal, when page
>> reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop
>> increasing PCP high in the allocation path, and we decrease PCP high
>> and free some pages in the freeing path.  But this may be too late,
>> because the background page reclaim may already have introduced
>> latency for some workloads.  So, in this patch, during page allocation
>> we detect whether the number of free pages in the zone is below the
>> high watermark.  If so, we stop increasing PCP high in the allocation
>> path, and we decrease PCP high and free some pages in the freeing
>> path.  With this, we reduce the possibility of premature background
>> page reclaim caused by a too-large PCP.
>> 
>> The high watermark check is done in the allocation path to reduce the
>> overhead in the hotter freeing path.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Johannes Weiner <jweiner@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christoph Lameter <cl@linux.com>
>> ---
>>  include/linux/mmzone.h |  1 +
>>  mm/page_alloc.c        | 33 +++++++++++++++++++++++++++++++--
>>  2 files changed, 32 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index ec3f7daedcc7..c88770381aaf 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1018,6 +1018,7 @@ enum zone_flags {
>>  					 * Cleared when kswapd is woken.
>>  					 */
>>  	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
>> +	ZONE_BELOW_HIGH,		/* zone is below high watermark. */
>>  };
>>  
>>  static inline unsigned long zone_managed_pages(struct zone *zone)
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 8382ad2cdfd4..253fc7d0498e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2407,7 +2407,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>>  		return min(batch << 2, pcp->high);
>>  	}
>>  
>> -	if (pcp->count >= high && high_min != high_max) {
>> +	if (high_min == high_max)
>> +		return high;
>> +
>> +	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
>> +		pcp->high = max(high - (batch << pcp->free_factor), high_min);
>> +		high = max(pcp->count, high_min);
>> +	} else if (pcp->count >= high) {
>>  		int need_high = (batch << pcp->free_factor) + batch;
>>  
>>  		/* pcp->high should be large enough to hold batch freed pages */
>> @@ -2457,6 +2463,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>>  	if (pcp->count >= high) {
>>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>>  				   pcp, pindex);
>> +		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
>> +		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
>> +				      ZONE_MOVABLE, 0))
>> +			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
>>  	}
>>  }
>>  
>
> This is a relatively fast path and freeing pages should not need to check
> watermarks.

Another thing that mitigates the overhead is that the watermark check
only occurs when we free pages from the PCP to the buddy allocator.
That is, in most cases, only once per 63 pages freed.
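
(Where the 63 comes from: a simplified sketch of the default batch
computation, zone_batchsize() in mm/page_alloc.c; the real function
has additional config-dependent handling, and rounddown_pow_of_two()
is from <linux/log2.h>:)

static int zone_batchsize_sketch(unsigned long managed_pages)
{
	unsigned long batch;

	/* ~1/1024 of the zone, capped at 1MB worth of pages */
	batch = min(managed_pages >> 10,
		    (unsigned long)(1024 * 1024) / PAGE_SIZE);
	batch /= 4;
	if (batch < 1)
		batch = 1;

	/*
	 * Round to a power of two minus 1.  For a >= 1GB zone with
	 * 4KB pages, batch is 64 here; 64 + 32 rounds down to 64,
	 * giving the familiar 63.
	 */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}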

> While the overhead is mitigated because it applies only when
> the watermark is below high, that is also potentially an unbounded condition
> if a workload is sized precisely enough. Why not clear this bit when kswapd
> is going to sleep after reclaiming enough pages in a zone?

IIUC, if the number of free pages is kept larger than the low watermark,
then kswapd will have no opportunity to be woken up, even if the number
of free pages was at some point smaller than the high watermark.
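
(This follows from where kswapd wakeups originate: roughly, the
allocation path only wakes kswapd after a zone fails its low
watermark check, simplified:)

	/*
	 * Allocator slowpath, simplified: kswapd is woken only when a
	 * zone fails its low watermark, so free pages can stay between
	 * the low and high watermarks indefinitely without waking it.
	 */
	if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
			       highest_zoneidx, alloc_flags))
		wakeup_kswapd(zone, gfp_mask, order, highest_zoneidx);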

> If you agree then a follow-up patch classed as a micro-optimisation is
> sufficient to avoid redoing all the results again. For most of your
> tests, it should be performance-neutral or borderline noise.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark
  2023-10-20  3:30     ` Huang, Ying
@ 2023-10-23  9:26       ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2023-10-23  9:26 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Arjan Van De Ven,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter

On Fri, Oct 20, 2023 at 11:30:53AM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Mon, Oct 16, 2023 at 01:30:01PM +0800, Huang Ying wrote:
> >> One goal of the PCP is to minimize the number of pages cached in it
> >> when the system is short of free pages.  To reach that goal, when page
> >> reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop
> >> increasing PCP high in the allocation path, and we decrease PCP high
> >> and free some pages in the freeing path.  But this may be too late,
> >> because the background page reclaim may already have introduced
> >> latency for some workloads.  So, in this patch, during page allocation
> >> we detect whether the number of free pages in the zone is below the
> >> high watermark.  If so, we stop increasing PCP high in the allocation
> >> path, and we decrease PCP high and free some pages in the freeing
> >> path.  With this, we reduce the possibility of premature background
> >> page reclaim caused by a too-large PCP.
> >> 
> >> The high watermark check is done in the allocation path to reduce the
> >> overhead in the hotter freeing path.
> >> 
> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Johannes Weiner <jweiner@redhat.com>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> Cc: Christoph Lameter <cl@linux.com>
> >> ---
> >>  include/linux/mmzone.h |  1 +
> >>  mm/page_alloc.c        | 33 +++++++++++++++++++++++++++++++--
> >>  2 files changed, 32 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index ec3f7daedcc7..c88770381aaf 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -1018,6 +1018,7 @@ enum zone_flags {
> >>  					 * Cleared when kswapd is woken.
> >>  					 */
> >>  	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
> >> +	ZONE_BELOW_HIGH,		/* zone is below high watermark. */
> >>  };
> >>  
> >>  static inline unsigned long zone_managed_pages(struct zone *zone)
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 8382ad2cdfd4..253fc7d0498e 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2407,7 +2407,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> >>  		return min(batch << 2, pcp->high);
> >>  	}
> >>  
> >> -	if (pcp->count >= high && high_min != high_max) {
> >> +	if (high_min == high_max)
> >> +		return high;
> >> +
> >> +	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
> >> +		pcp->high = max(high - (batch << pcp->free_factor), high_min);
> >> +		high = max(pcp->count, high_min);
> >> +	} else if (pcp->count >= high) {
> >>  		int need_high = (batch << pcp->free_factor) + batch;
> >>  
> >>  		/* pcp->high should be large enough to hold batch freed pages */
> >> @@ -2457,6 +2463,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> >>  	if (pcp->count >= high) {
> >>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> >>  				   pcp, pindex);
> >> +		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
> >> +		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
> >> +				      ZONE_MOVABLE, 0))
> >> +			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
> >>  	}
> >>  }
> >>  
> >
> > This is a relatively fast path and freeing pages should not need to check
> > watermarks.
> 
> Another thing that mitigates the overhead is that the watermark check
> only occurs when we free pages from the PCP to the buddy allocator.
> That is, in most cases, only once per 63 pages freed.
> 

True

> > While the overhead is mitigated because it applies only when
> > the watermark is below high, that is also potentially an unbounded condition
> > if a workload is sized precisely enough. Why not clear this bit when kswapd
> > is going to sleep after reclaiming enough pages in a zone?
> 
> IIUC, if the number of free pages is kept larger than the low watermark,
> then kswapd will have no opportunity to be woken up, even if the number
> of free pages was at some point smaller than the high watermark.
> 

Also true, and I did think of that later. I guess it's ok; the chances
are that the series overall offsets any micro-costs like this, so I'm
happy. If, for some reason, this overhead is noticeable (doubtful), then
it can be revisited.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
  2023-10-16  5:29 ` [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages Huang Ying
@ 2023-10-27  6:23   ` kernel test robot
  2023-11-06  6:22   ` kernel test robot
  1 sibling, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-10-27  6:23 UTC (permalink / raw)
  To: Huang Ying
  Cc: oe-lkp, lkp, Mel Gorman, Andrew Morton, Sudeep Holla,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter,
	linux-kernel, linux-mm, ying.huang, feng.tang, fengwei.yin,
	Arjan Van De Ven, oliver.sang



Hello,

kernel test robot noticed a 14.6% improvement of netperf.Throughput_Mbps on:


commit: f5ddc662f07d7d99e9cfc5e07778e26c7394caf8 ("[PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages")
url: https://github.com/intel-lab-lkp/linux/commits/Huang-Ying/mm-pcp-avoid-to-drain-PCP-when-process-exit/20231017-143633
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 36b2d7dd5a8ac95c8c1e69bdc93c4a6e2dc28a23
patch link: https://lore.kernel.org/all/20231016053002.756205-4-ying.huang@intel.com/
patch subject: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages

testcase: netperf
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:

	ip: ipv4
	runtime: 300s
	nr_threads: 200%
	cluster: cs-localhost
	send_size: 10K
	test: SCTP_STREAM_MANY
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231027/202310271441.71ce0a9-oliver.sang@intel.com

=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/send_size/tbox_group/test/testcase:
  cs-localhost/gcc-12/performance/ipv4/x86_64-rhel-8.3/200%/debian-11.1-x86_64-20220510.cgz/300s/10K/lkp-icl-2sp2/SCTP_STREAM_MANY/netperf

commit: 
  c828e65251 ("cacheinfo: calculate size of per-CPU data cache slice")
  f5ddc662f0 ("mm, pcp: reduce lock contention for draining high-order pages")

c828e65251502516 f5ddc662f07d7d99e9cfc5e0777 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     26471           -11.1%      23520        uptime.idle
 2.098e+10           -14.1%  1.802e+10        cpuidle..time
 5.798e+08           +14.3%  6.628e+08        cpuidle..usage
 1.329e+09           +14.7%  1.525e+09        numa-numastat.node0.local_node
 1.329e+09           +14.7%  1.525e+09        numa-numastat.node0.numa_hit
 1.336e+09           +14.6%  1.531e+09        numa-numastat.node1.local_node
 1.336e+09           +14.6%  1.531e+09        numa-numastat.node1.numa_hit
 1.329e+09           +14.7%  1.525e+09        numa-vmstat.node0.numa_hit
 1.329e+09           +14.7%  1.525e+09        numa-vmstat.node0.numa_local
 1.336e+09           +14.6%  1.531e+09        numa-vmstat.node1.numa_hit
 1.336e+09           +14.6%  1.531e+09        numa-vmstat.node1.numa_local
     26.31 ± 12%     +33.0%      35.00 ± 10%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.kmalloc_trace.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
    229.00 ± 13%     -24.7%     172.33 ±  5%  perf-sched.wait_and_delay.count.__cond_resched.__kmem_cache_alloc_node.kmalloc_trace.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
    929.50 ±  2%      +8.2%       1005 ±  4%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
     26.30 ± 12%     +33.0%      35.00 ± 10%  perf-sched.wait_time.avg.ms.__cond_resched.__kmem_cache_alloc_node.kmalloc_trace.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
     53.98           -14.1%      46.36        vmstat.cpu.id
     58.15           +17.6%      68.37        vmstat.procs.r
   3720385           +15.6%    4301904        vmstat.system.cs
   1991764           +14.5%    2281507        vmstat.system.in
     53.69            -7.7       46.03        mpstat.cpu.all.idle%
      2.10            +0.3        2.44        mpstat.cpu.all.irq%
      7.25            +1.3        8.58        mpstat.cpu.all.soft%
     35.74            +5.7       41.46        mpstat.cpu.all.sys%
      1.23            +0.3        1.49        mpstat.cpu.all.usr%
   2047040            +2.9%    2105598        proc-vmstat.nr_file_pages
   1377160            +4.2%    1435588        proc-vmstat.nr_shmem
 2.665e+09           +14.7%  3.056e+09        proc-vmstat.numa_hit
 2.665e+09           +14.7%  3.056e+09        proc-vmstat.numa_local
 1.534e+10           +14.6%  1.758e+10        proc-vmstat.pgalloc_normal
 1.534e+10           +14.6%  1.758e+10        proc-vmstat.pgfree
      1296           +16.3%       1507        turbostat.Avg_MHz
     49.98            +8.1       58.12        turbostat.Busy%
 5.797e+08           +14.3%  6.628e+08        turbostat.C1
     53.88            -7.6       46.34        turbostat.C1%
     50.02           -16.3%      41.88        turbostat.CPU%c1
 6.081e+08           +14.5%  6.961e+08        turbostat.IRQ
    391.82            +3.5%     405.41        turbostat.PkgWatt
      2204           +14.6%       2527        netperf.ThroughputBoth_Mbps
    564378           +14.6%     647027        netperf.ThroughputBoth_total_Mbps
      2204           +14.6%       2527        netperf.Throughput_Mbps
    564378           +14.6%     647027        netperf.Throughput_total_Mbps
    146051            +5.9%     154705        netperf.time.involuntary_context_switches
      3011           +16.8%       3516        netperf.time.percent_of_cpu_this_job_got
      8875           +16.6%      10351        netperf.time.system_time
    221.39           +18.0%     261.14        netperf.time.user_time
   2759631            +8.0%    2981144        netperf.time.voluntary_context_switches
 2.067e+09           +14.6%  2.369e+09        netperf.workload
   2920531           +34.4%    3925407        sched_debug.cfs_rq:/.avg_vruntime.avg
   3172407 ±  2%     +36.5%    4331807 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.max
   2801767           +35.2%    3787891 ±  2%  sched_debug.cfs_rq:/.avg_vruntime.min
     45404 ±  5%     +33.3%      60516 ± 11%  sched_debug.cfs_rq:/.avg_vruntime.stddev
   2817265 ± 10%     +40.6%    3961862        sched_debug.cfs_rq:/.left_vruntime.max
    376003 ± 18%     +51.2%     568331 ± 13%  sched_debug.cfs_rq:/.left_vruntime.stddev
   2920531           +34.4%    3925407        sched_debug.cfs_rq:/.min_vruntime.avg
   3172407 ±  2%     +36.5%    4331807 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
   2801767           +35.2%    3787891 ±  2%  sched_debug.cfs_rq:/.min_vruntime.min
     45404 ±  5%     +33.3%      60516 ± 11%  sched_debug.cfs_rq:/.min_vruntime.stddev
   2817265 ± 10%     +40.6%    3961862        sched_debug.cfs_rq:/.right_vruntime.max
    376003 ± 18%     +51.2%     568331 ± 13%  sched_debug.cfs_rq:/.right_vruntime.stddev
    157.25 ±  6%     +13.3%     178.14 ±  4%  sched_debug.cfs_rq:/.util_est_enqueued.avg
   4361500           +15.5%    5035528        sched_debug.cpu.nr_switches.avg
   4674667           +14.7%    5363125        sched_debug.cpu.nr_switches.max
   3947619           +14.1%    4504637 ±  2%  sched_debug.cpu.nr_switches.min
      0.56            -3.7%       0.54        perf-stat.i.MPKI
 2.293e+10           +14.3%  2.622e+10        perf-stat.i.branch-instructions
 1.449e+08           +15.6%  1.675e+08        perf-stat.i.branch-misses
      2.15            -0.1        2.05        perf-stat.i.cache-miss-rate%
  67409238           +10.2%   74274510        perf-stat.i.cache-misses
 3.199e+09           +15.7%  3.702e+09        perf-stat.i.cache-references
   3765045           +15.6%    4353228        perf-stat.i.context-switches
      1.42            +1.7%       1.45        perf-stat.i.cpi
 1.717e+11           +16.5%      2e+11        perf-stat.i.cpu-cycles
      5094           +51.1%       7695 ±  3%  perf-stat.i.cpu-migrations
      2554            +5.7%       2699        perf-stat.i.cycles-between-cache-misses
  3.28e+10           +14.5%  3.756e+10        perf-stat.i.dTLB-loads
    329792 ± 11%     +37.3%     452936 ± 15%  perf-stat.i.dTLB-store-misses
  2.04e+10           +14.7%  2.339e+10        perf-stat.i.dTLB-stores
 1.205e+11           +14.4%  1.379e+11        perf-stat.i.instructions
      0.71            -1.7%       0.69        perf-stat.i.ipc
      1.34           +16.5%       1.56        perf-stat.i.metric.GHz
    221.29            +7.4%     237.74        perf-stat.i.metric.K/sec
    619.67           +14.5%     709.77        perf-stat.i.metric.M/sec
   7031738           +14.3%    8034255        perf-stat.i.node-load-misses
     79.94            -1.3       78.62        perf-stat.i.node-store-miss-rate%
   3349862 ±  2%      +9.2%    3656880        perf-stat.i.node-stores
      0.56            -3.7%       0.54        perf-stat.overall.MPKI
      2.11            -0.1        2.01        perf-stat.overall.cache-miss-rate%
      1.42            +1.8%       1.45        perf-stat.overall.cpi
      2546            +5.7%       2692        perf-stat.overall.cycles-between-cache-misses
      0.70            -1.8%       0.69        perf-stat.overall.ipc
     79.91            -1.4       78.54        perf-stat.overall.node-store-miss-rate%
 2.286e+10           +14.3%  2.614e+10        perf-stat.ps.branch-instructions
 1.444e+08           +15.6%  1.669e+08        perf-stat.ps.branch-misses
  67192773           +10.2%   74037940        perf-stat.ps.cache-misses
 3.189e+09           +15.7%   3.69e+09        perf-stat.ps.cache-references
   3753095           +15.6%    4339552        perf-stat.ps.context-switches
 1.711e+11           +16.5%  1.994e+11        perf-stat.ps.cpu-cycles
      5078           +51.1%       7674 ±  3%  perf-stat.ps.cpu-migrations
 3.269e+10           +14.5%  3.743e+10        perf-stat.ps.dTLB-loads
    328489 ± 11%     +37.3%     451131 ± 15%  perf-stat.ps.dTLB-store-misses
 2.033e+10           +14.7%  2.331e+10        perf-stat.ps.dTLB-stores
 1.201e+11           +14.4%  1.374e+11        perf-stat.ps.instructions
   7009249           +14.3%    8009170        perf-stat.ps.node-load-misses
   3339511 ±  2%      +9.2%    3645997        perf-stat.ps.node-stores
 3.635e+13           +14.3%  4.155e+13        perf-stat.total.instructions
      4.40 ±  2%      -1.5        2.87        perf-profile.calltrace.cycles-pp.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.sock_recvmsg
      5.83            -1.4        4.41        perf-profile.calltrace.cycles-pp.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
      1.92 ±  3%      -1.4        0.55        perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
     22.33            -1.3       21.03        perf-profile.calltrace.cycles-pp.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg.___sys_recvmsg
     22.42            -1.3       21.12        perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.____sys_recvmsg.___sys_recvmsg.__sys_recvmsg
     22.75            -1.3       21.48        perf-profile.calltrace.cycles-pp.sock_recvmsg.____sys_recvmsg.___sys_recvmsg.__sys_recvmsg.do_syscall_64
     23.44            -1.2       22.20        perf-profile.calltrace.cycles-pp.____sys_recvmsg.___sys_recvmsg.__sys_recvmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
     24.65            -1.2       23.47        perf-profile.calltrace.cycles-pp.___sys_recvmsg.__sys_recvmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvmsg
     25.14            -1.2       23.98        perf-profile.calltrace.cycles-pp.__sys_recvmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvmsg
     25.46            -1.1       24.31        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvmsg
     25.59            -1.1       24.46        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.recvmsg
     26.47            -1.1       25.36        perf-profile.calltrace.cycles-pp.recvmsg
      3.57 ±  6%      -0.6        2.93 ±  9%  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.__kmalloc_large_node.__kmalloc_node_track_caller
      5.22 ±  2%      -0.4        4.79        perf-profile.calltrace.cycles-pp.__alloc_pages.__kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb
      4.76 ±  2%      -0.4        4.33        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.__kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve
      0.96            -0.4        0.59 ±  2%  perf-profile.calltrace.cycles-pp.release_sock.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
      3.16 ±  2%      -0.3        2.84        perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush
      3.14 ±  2%      -0.3        2.82        perf-profile.calltrace.cycles-pp.__kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit
      3.18 ±  2%      -0.3        2.86        perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter
      3.44 ±  2%      -0.3        3.13        perf-profile.calltrace.cycles-pp.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
      1.62 ±  3%      -0.3        1.34 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.rmqueue.get_page_from_freelist.__alloc_pages.__kmalloc_large_node
      1.49 ±  3%      -0.3        1.22 ±  3%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.rmqueue.get_page_from_freelist.__alloc_pages
      1.46 ±  2%      -0.2        1.25 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__free_pages_ok.skb_release_data.kfree_skb_reason
      1.62 ±  2%      -0.2        1.43 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__free_pages_ok.skb_release_data.kfree_skb_reason.sctp_recvmsg
      1.99 ±  2%      -0.2        1.80        perf-profile.calltrace.cycles-pp.__free_pages_ok.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
      0.76            -0.2        0.58        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
      0.85            -0.1        0.74        perf-profile.calltrace.cycles-pp.__slab_free.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
      0.84            -0.1        0.73        perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put
      1.37            -0.1        1.28        perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack
      2.65            -0.1        2.57        perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user
      2.56            -0.1        2.48        perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty
      2.49 ±  2%      -0.1        2.42        perf-profile.calltrace.cycles-pp.__kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk
      1.92            -0.1        1.85        perf-profile.calltrace.cycles-pp.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
      0.62            +0.0        0.64        perf-profile.calltrace.cycles-pp.simple_copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.sctp_recvmsg.inet_recvmsg
      0.65            +0.0        0.68        perf-profile.calltrace.cycles-pp.sctp_chunk_put.sctp_ulpevent_free.sctp_recvmsg.inet_recvmsg.sock_recvmsg
      0.89            +0.0        0.93        perf-profile.calltrace.cycles-pp.copy_msghdr_from_user.___sys_recvmsg.__sys_recvmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.24            +0.0        1.28        perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sctp_data_ready
      0.56 ±  2%      +0.0        0.60        perf-profile.calltrace.cycles-pp.sctp_packet_config.sctp_outq_select_transport.sctp_outq_flush_data.sctp_outq_flush.sctp_cmd_interpreter
      1.32            +0.0        1.36        perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sctp_data_ready.sctp_ulpq_tail_event
      1.29            +0.0        1.33        perf-profile.calltrace.cycles-pp.sctp_ulpevent_free.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
      0.71 ±  2%      +0.0        0.75        perf-profile.calltrace.cycles-pp.sctp_outq_select_transport.sctp_outq_flush_data.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
      0.61            +0.0        0.66        perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.schedule_timeout
      0.62            +0.1        0.67        perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
      1.50            +0.1        1.56        perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.sctp_data_ready.sctp_ulpq_tail_event.sctp_ulpq_tail_data
      1.58            +0.1        1.64        perf-profile.calltrace.cycles-pp.__wake_up_common_lock.sctp_data_ready.sctp_ulpq_tail_event.sctp_ulpq_tail_data.sctp_cmd_interpreter
      0.70            +0.1        0.76        perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.schedule_timeout.sctp_skb_recv_datagram
      1.02            +0.1        1.08        perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single
      2.02            +0.1        2.09        perf-profile.calltrace.cycles-pp.sctp_outq_flush_data.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND
      1.86            +0.1        1.93        perf-profile.calltrace.cycles-pp.sctp_data_ready.sctp_ulpq_tail_event.sctp_ulpq_tail_data.sctp_cmd_interpreter.sctp_do_sm
      0.76            +0.1        0.83        perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single
      0.73            +0.1        0.80        perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
      0.89            +0.1        0.96        perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
      0.82            +0.1        0.89        perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put
      0.95            +0.1        1.03        perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      2.06            +0.1        2.14        perf-profile.calltrace.cycles-pp.sctp_ulpq_tail_event.sctp_ulpq_tail_data.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv
      3.68            +0.1        3.76        perf-profile.calltrace.cycles-pp.copyin._copy_from_iter.sctp_user_addto_chunk.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
      0.98            +0.1        1.06        perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
      1.34            +0.1        1.43        perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
      1.38            +0.1        1.47        perf-profile.calltrace.cycles-pp.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack
      1.54            +0.1        1.63        perf-profile.calltrace.cycles-pp.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
      1.25 ±  2%      +0.1        1.35        perf-profile.calltrace.cycles-pp.__sk_mem_raise_allocated.__sk_mem_schedule.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
      1.28 ±  2%      +0.1        1.38        perf-profile.calltrace.cycles-pp.__sk_mem_schedule.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg
      1.82            +0.1        1.93        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.acpi_safe_halt
      2.00            +0.1        2.11        perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter
      1.39            +0.1        1.50        perf-profile.calltrace.cycles-pp.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.sock_recvmsg
      4.39            +0.1        4.51        perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      2.68            +0.2        2.84        perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
      2.98            +0.2        3.14        perf-profile.calltrace.cycles-pp.sctp_ulpevent_make_rcvmsg.sctp_ulpq_tail_data.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv
      1.88            +0.2        2.06        perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.sctp_skb_recv_datagram.sctp_recvmsg
      0.34 ± 70%      +0.2        0.54        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.schedule_timeout.sctp_skb_recv_datagram
     10.32            +0.2       10.53        perf-profile.calltrace.cycles-pp.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
      3.60            +0.2        3.81        perf-profile.calltrace.cycles-pp.sctp_skb_recv_datagram.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
      1.94            +0.2        2.14        perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.sctp_skb_recv_datagram.sctp_recvmsg.inet_recvmsg
      2.20            +0.2        2.41        perf-profile.calltrace.cycles-pp.schedule_timeout.sctp_skb_recv_datagram.sctp_recvmsg.inet_recvmsg.sock_recvmsg
     10.93            +0.2       11.16        perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
     10.51            +0.2       10.74        perf-profile.calltrace.cycles-pp.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg
      7.26            +0.2        7.50        perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.sctp_recvmsg
     11.17            +0.2       11.42        perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      5.40            +0.2        5.64        perf-profile.calltrace.cycles-pp.sctp_ulpq_tail_data.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv.sctp_rcv
     11.25            +0.2       11.50        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      7.38            +0.3        7.64        perf-profile.calltrace.cycles-pp._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.sctp_recvmsg.inet_recvmsg
     20.03            +0.3       20.29        perf-profile.calltrace.cycles-pp.sctp_backlog_rcv.__release_sock.release_sock.sctp_sendmsg.sock_sendmsg
     20.09            +0.3       20.36        perf-profile.calltrace.cycles-pp.__release_sock.release_sock.sctp_sendmsg.sock_sendmsg.____sys_sendmsg
     20.30            +0.3       20.57        perf-profile.calltrace.cycles-pp.release_sock.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg
      8.40            +0.3        8.68        perf-profile.calltrace.cycles-pp.__skb_datagram_iter.skb_copy_datagram_iter.sctp_recvmsg.inet_recvmsg.sock_recvmsg
      8.44            +0.3        8.72        perf-profile.calltrace.cycles-pp.skb_copy_datagram_iter.sctp_recvmsg.inet_recvmsg.sock_recvmsg.____sys_recvmsg
     11.85            +0.3       12.14        perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     11.22            +0.3       11.52        perf-profile.calltrace.cycles-pp.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND
     13.26            +0.3       13.61        perf-profile.calltrace.cycles-pp.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc
     13.21            +0.4       13.59        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     13.25            +0.4       13.64        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
     13.24            +0.4       13.62        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     13.34            +0.4       13.74        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     15.70            +0.4       16.12        perf-profile.calltrace.cycles-pp.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg
      0.55            +0.5        1.02 ± 19%  perf-profile.calltrace.cycles-pp.__sk_mem_raise_allocated.__sk_mem_schedule.sctp_ulpevent_make_rcvmsg.sctp_ulpq_tail_data.sctp_cmd_interpreter
      0.66 ± 28%      +0.5        1.14        perf-profile.calltrace.cycles-pp.__sk_mem_schedule.sctp_ulpevent_make_rcvmsg.sctp_ulpq_tail_data.sctp_cmd_interpreter.sctp_do_sm
      0.00            +0.5        0.54        perf-profile.calltrace.cycles-pp.sctp_sf_eat_data_6_2.sctp_do_sm.sctp_assoc_bh_rcv.sctp_rcv.ip_protocol_deliver_rcu
     51.26            +0.5       51.80        perf-profile.calltrace.cycles-pp.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg
     15.28            +0.5       15.82        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
     51.76            +0.6       52.32        perf-profile.calltrace.cycles-pp.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64
     53.77            +0.6       54.34        perf-profile.calltrace.cycles-pp.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6.06 ±  2%      -2.4        3.68        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      5.94 ±  2%      -2.2        3.75        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      6.84            -1.9        4.97        perf-profile.children.cycles-pp.skb_release_data
      3.64            -1.7        1.92        perf-profile.children.cycles-pp.free_unref_page
      2.04 ±  2%      -1.7        0.34 ±  2%  perf-profile.children.cycles-pp.free_pcppages_bulk
      5.84            -1.4        4.42        perf-profile.children.cycles-pp.kfree_skb_reason
     22.43            -1.3       21.14        perf-profile.children.cycles-pp.inet_recvmsg
     22.67            -1.3       21.39        perf-profile.children.cycles-pp.sctp_recvmsg
     22.76            -1.3       21.50        perf-profile.children.cycles-pp.sock_recvmsg
     23.46            -1.2       22.22        perf-profile.children.cycles-pp.____sys_recvmsg
     24.68            -1.2       23.50        perf-profile.children.cycles-pp.___sys_recvmsg
     25.16            -1.2       24.00        perf-profile.children.cycles-pp.__sys_recvmsg
     26.69            -1.1       25.59        perf-profile.children.cycles-pp.recvmsg
     82.77            -0.5       82.24        perf-profile.children.cycles-pp.do_syscall_64
     83.14            -0.5       82.63        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      5.02            -0.5        4.53        perf-profile.children.cycles-pp.get_page_from_freelist
      5.46            -0.5        4.98        perf-profile.children.cycles-pp.__alloc_pages
      5.96            -0.5        5.50        perf-profile.children.cycles-pp.__kmalloc_node_track_caller
      6.21            -0.5        5.76        perf-profile.children.cycles-pp.kmalloc_reserve
      3.86            -0.5        3.41        perf-profile.children.cycles-pp.rmqueue
      5.88            -0.5        5.44        perf-profile.children.cycles-pp.__kmalloc_large_node
      7.47            -0.4        7.07        perf-profile.children.cycles-pp.__alloc_skb
      0.65 ±  3%      -0.3        0.30 ±  5%  perf-profile.children.cycles-pp.sctp_wait_for_sndbuf
      1.91            -0.3        1.58        perf-profile.children.cycles-pp._raw_spin_lock_bh
      1.78            -0.3        1.46        perf-profile.children.cycles-pp.lock_sock_nested
      4.43            -0.2        4.22        perf-profile.children.cycles-pp.consume_skb
      6.00            -0.2        5.80        perf-profile.children.cycles-pp.sctp_outq_sack
      5.82            -0.2        5.62        perf-profile.children.cycles-pp.sctp_chunk_put
      2.00 ±  2%      -0.2        1.82 ±  2%  perf-profile.children.cycles-pp.__free_pages_ok
      1.20            -0.2        1.04        perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.27            -0.1        1.16        perf-profile.children.cycles-pp.__slab_free
      0.39            -0.1        0.32 ±  2%  perf-profile.children.cycles-pp.__free_one_page
      0.86 ±  2%      -0.1        0.79        perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.42            -0.1        0.36 ±  2%  perf-profile.children.cycles-pp.__zone_watermark_ok
      0.45 ±  2%      -0.1        0.40 ±  2%  perf-profile.children.cycles-pp.rmqueue_bulk
      0.54            -0.0        0.51        perf-profile.children.cycles-pp.__list_add_valid_or_report
      0.65 ±  2%      -0.0        0.62        perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.47 ±  2%      -0.0        0.44 ±  2%  perf-profile.children.cycles-pp.__kmalloc
      0.25 ±  3%      -0.0        0.22 ±  2%  perf-profile.children.cycles-pp.__irq_exit_rcu
      0.24 ±  4%      -0.0        0.22 ±  3%  perf-profile.children.cycles-pp.perf_event_task_tick
      0.24 ±  3%      -0.0        0.22 ±  3%  perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
      0.15 ±  5%      -0.0        0.13 ±  4%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
      0.11 ±  4%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.sctp_assoc_rwnd_increase
      0.06            +0.0        0.07        perf-profile.children.cycles-pp.ct_idle_exit
      0.12 ±  3%      +0.0        0.13 ±  2%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.42            +0.0        0.44        perf-profile.children.cycles-pp.free_unref_page_prepare
      0.14 ±  2%      +0.0        0.16 ±  3%  perf-profile.children.cycles-pp.check_stack_object
      0.13 ±  2%      +0.0        0.15 ±  3%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
      0.27            +0.0        0.28        perf-profile.children.cycles-pp.update_curr
      0.22            +0.0        0.24 ±  2%  perf-profile.children.cycles-pp.__switch_to_asm
      0.16 ±  2%      +0.0        0.18 ±  3%  perf-profile.children.cycles-pp.__update_load_avg_se
      0.29 ±  2%      +0.0        0.30        perf-profile.children.cycles-pp.sctp_outq_flush_ctrl
      0.42            +0.0        0.44        perf-profile.children.cycles-pp.free_large_kmalloc
      0.13 ±  2%      +0.0        0.15 ±  7%  perf-profile.children.cycles-pp.update_cfs_group
      0.40            +0.0        0.42 ±  2%  perf-profile.children.cycles-pp.loopback_xmit
      0.24 ±  3%      +0.0        0.26 ±  2%  perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      0.45            +0.0        0.47        perf-profile.children.cycles-pp.dev_hard_start_xmit
      0.20            +0.0        0.23 ±  3%  perf-profile.children.cycles-pp.set_next_entity
      0.63            +0.0        0.65        perf-profile.children.cycles-pp.simple_copy_to_iter
      0.13 ±  3%      +0.0        0.15 ±  3%  perf-profile.children.cycles-pp.sk_leave_memory_pressure
      0.30            +0.0        0.32 ±  2%  perf-profile.children.cycles-pp.sctp_inet_skb_msgname
      0.54 ±  2%      +0.0        0.57 ±  2%  perf-profile.children.cycles-pp.__copy_skb_header
      0.31            +0.0        0.34        perf-profile.children.cycles-pp.___perf_sw_event
      0.27 ±  3%      +0.0        0.30 ±  2%  perf-profile.children.cycles-pp.security_socket_recvmsg
      0.24 ±  3%      +0.0        0.26        perf-profile.children.cycles-pp.ipv4_dst_check
      0.42 ±  2%      +0.0        0.44 ±  3%  perf-profile.children.cycles-pp.page_counter_try_charge
      1.30            +0.0        1.33        perf-profile.children.cycles-pp.try_to_wake_up
      0.42            +0.0        0.45        perf-profile.children.cycles-pp.__mod_node_page_state
      0.79            +0.0        0.82        perf-profile.children.cycles-pp.__skb_clone
      0.44            +0.0        0.48        perf-profile.children.cycles-pp.aa_sk_perm
      0.30            +0.0        0.33 ±  4%  perf-profile.children.cycles-pp.accept_connection
      0.30            +0.0        0.33 ±  4%  perf-profile.children.cycles-pp.spawn_child
      0.30            +0.0        0.33 ±  4%  perf-profile.children.cycles-pp.process_requests
      0.36            +0.0        0.40        perf-profile.children.cycles-pp.prepare_task_switch
      0.28 ±  2%      +0.0        0.31 ±  5%  perf-profile.children.cycles-pp.recv_sctp_stream_1toMany
      0.66            +0.0        0.70        perf-profile.children.cycles-pp.sctp_addrs_lookup_transport
      0.69            +0.0        0.72        perf-profile.children.cycles-pp.__sctp_rcv_lookup
      0.39 ±  3%      +0.0        0.43        perf-profile.children.cycles-pp.dst_release
      1.36            +0.0        1.40        perf-profile.children.cycles-pp.autoremove_wake_function
      0.77            +0.0        0.81        perf-profile.children.cycles-pp.kmem_cache_alloc_node
      1.31            +0.0        1.35        perf-profile.children.cycles-pp.sctp_ulpevent_free
      0.92            +0.0        0.96        perf-profile.children.cycles-pp.try_charge_memcg
      0.64            +0.0        0.69        perf-profile.children.cycles-pp.dequeue_entity
      0.83            +0.0        0.88        perf-profile.children.cycles-pp.sctp_packet_config
      2.48            +0.0        2.53        perf-profile.children.cycles-pp.copy_msghdr_from_user
      0.61 ±  3%      +0.1        0.66 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_uncharge_skmem
      0.66            +0.1        0.71        perf-profile.children.cycles-pp.enqueue_entity
      1.56            +0.1        1.61        perf-profile.children.cycles-pp.__wake_up_common
      1.39            +0.1        1.45        perf-profile.children.cycles-pp.kmem_cache_free
      1.02 ±  2%      +0.1        1.08        perf-profile.children.cycles-pp.sctp_outq_select_transport
      0.00            +0.1        0.06 ±  9%  perf-profile.children.cycles-pp.pick_next_task_idle
      1.64            +0.1        1.70        perf-profile.children.cycles-pp.__wake_up_common_lock
      0.86 ±  3%      +0.1        0.92        perf-profile.children.cycles-pp.pick_next_task_fair
      0.58            +0.1        0.64        perf-profile.children.cycles-pp.update_load_avg
      1.56            +0.1        1.62        perf-profile.children.cycles-pp.__check_object_size
      0.71            +0.1        0.77        perf-profile.children.cycles-pp.dequeue_task_fair
      0.86            +0.1        0.93        perf-profile.children.cycles-pp.sctp_eat_data
      1.92            +0.1        1.99        perf-profile.children.cycles-pp.sctp_data_ready
      1.05            +0.1        1.12        perf-profile.children.cycles-pp.ttwu_do_activate
      0.26 ± 32%      +0.1        0.33 ±  4%  perf-profile.children.cycles-pp.accept_connections
      2.16            +0.1        2.22        perf-profile.children.cycles-pp.sctp_ulpq_tail_event
      0.76            +0.1        0.83        perf-profile.children.cycles-pp.enqueue_task_fair
      0.78            +0.1        0.86        perf-profile.children.cycles-pp.activate_task
      0.98            +0.1        1.05        perf-profile.children.cycles-pp.sctp_sf_eat_data_6_2
      0.97            +0.1        1.04        perf-profile.children.cycles-pp.schedule_idle
      3.22            +0.1        3.30        perf-profile.children.cycles-pp.sctp_outq_flush_data
      1.78            +0.1        1.85        perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
      1.48            +0.1        1.56        perf-profile.children.cycles-pp.sctp_wfree
      1.38            +0.1        1.46        perf-profile.children.cycles-pp.sched_ttwu_pending
      3.80            +0.1        3.89        perf-profile.children.cycles-pp.copyin
      3.92            +0.1        4.00        perf-profile.children.cycles-pp._copy_from_iter
     10.14            +0.1       10.24        perf-profile.children.cycles-pp.sctp_datamsg_from_user
      1.87            +0.1        1.97        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      4.48            +0.1        4.59        perf-profile.children.cycles-pp.sctp_user_addto_chunk
      2.04            +0.1        2.15        perf-profile.children.cycles-pp.__sysvec_call_function_single
      6.96            +0.1        7.09        perf-profile.children.cycles-pp.__memcpy
      7.57            +0.1        7.71        perf-profile.children.cycles-pp.sctp_packet_pack
      3.20            +0.1        3.34        perf-profile.children.cycles-pp.sctp_ulpevent_make_rcvmsg
      1.85            +0.2        2.00        perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
     12.41            +0.2       12.56        perf-profile.children.cycles-pp.sctp_rcv
      2.74            +0.2        2.90        perf-profile.children.cycles-pp.sysvec_call_function_single
      2.41            +0.2        2.57        perf-profile.children.cycles-pp.__sk_mem_raise_allocated
      2.48            +0.2        2.65        perf-profile.children.cycles-pp.__sk_mem_schedule
     13.86            +0.2       14.04        perf-profile.children.cycles-pp.__do_softirq
     13.28            +0.2       13.45        perf-profile.children.cycles-pp.process_backlog
     13.31            +0.2       13.49        perf-profile.children.cycles-pp.__napi_poll
     13.45            +0.2       13.63        perf-profile.children.cycles-pp.net_rx_action
      2.04            +0.2        2.21        perf-profile.children.cycles-pp.schedule
      2.28            +0.2        2.46        perf-profile.children.cycles-pp.schedule_timeout
     12.53            +0.2       12.71        perf-profile.children.cycles-pp.ip_local_deliver_finish
     13.05            +0.2       13.23        perf-profile.children.cycles-pp.__netif_receive_skb_one_core
     12.51            +0.2       12.69        perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
     29.73            +0.2       29.92        perf-profile.children.cycles-pp.sctp_outq_flush
      3.63            +0.2        3.84        perf-profile.children.cycles-pp.sctp_skb_recv_datagram
     13.78            +0.2       13.98        perf-profile.children.cycles-pp.do_softirq
      5.68            +0.2        5.89        perf-profile.children.cycles-pp.sctp_ulpq_tail_data
     13.98            +0.2       14.20        perf-profile.children.cycles-pp.__local_bh_enable_ip
      3.22            +0.2        3.44        perf-profile.children.cycles-pp.skb_release_head_state
      2.90            +0.2        3.13        perf-profile.children.cycles-pp.__schedule
     36.67            +0.2       36.90        perf-profile.children.cycles-pp.sctp_do_sm
     36.13            +0.2       36.36        perf-profile.children.cycles-pp.sctp_cmd_interpreter
     10.99            +0.2       11.22        perf-profile.children.cycles-pp.acpi_safe_halt
      7.30            +0.2        7.54        perf-profile.children.cycles-pp.copyout
     14.37            +0.2       14.61        perf-profile.children.cycles-pp.__dev_queue_xmit
     11.01            +0.2       11.26        perf-profile.children.cycles-pp.acpi_idle_enter
     14.53            +0.2       14.78        perf-profile.children.cycles-pp.ip_finish_output2
      7.40            +0.3        7.65        perf-profile.children.cycles-pp._copy_to_iter
     15.04            +0.3       15.29        perf-profile.children.cycles-pp.__ip_queue_xmit
     11.26            +0.3       11.52        perf-profile.children.cycles-pp.cpuidle_enter_state
     11.33            +0.3       11.59        perf-profile.children.cycles-pp.cpuidle_enter
     29.10            +0.3       29.37        perf-profile.children.cycles-pp.sctp_sendmsg_to_asoc
      8.41            +0.3        8.69        perf-profile.children.cycles-pp.__skb_datagram_iter
      8.45            +0.3        8.73        perf-profile.children.cycles-pp.skb_copy_datagram_iter
     11.94            +0.3       12.25        perf-profile.children.cycles-pp.cpuidle_idle_call
      9.15            +0.4        9.52        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
     13.25            +0.4       13.64        perf-profile.children.cycles-pp.start_secondary
     13.32            +0.4       13.71        perf-profile.children.cycles-pp.do_idle
     13.34            +0.4       13.74        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     13.34            +0.4       13.74        perf-profile.children.cycles-pp.cpu_startup_entry
     16.00            +0.4       16.41        perf-profile.children.cycles-pp.sctp_primitive_SEND
     52.23            +0.6       52.80        perf-profile.children.cycles-pp.sock_sendmsg
     52.14            +0.6       52.72        perf-profile.children.cycles-pp.sctp_sendmsg
     54.28            +0.6       54.87        perf-profile.children.cycles-pp.____sys_sendmsg
     56.24            +0.6       56.85        perf-profile.children.cycles-pp.___sys_sendmsg
     56.83            +0.6       57.45        perf-profile.children.cycles-pp.__sys_sendmsg
      6.05 ±  2%      -2.4        3.68        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.97            -0.2        0.81        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      1.26            -0.1        1.14        perf-profile.self.cycles-pp.__slab_free
      1.22            -0.1        1.14        perf-profile.self.cycles-pp.rmqueue
      0.40            -0.1        0.35 ±  2%  perf-profile.self.cycles-pp.__zone_watermark_ok
      0.46            -0.0        0.42        perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.18 ±  4%      -0.0        0.16 ±  4%  perf-profile.self.cycles-pp.__free_one_page
      0.15 ±  5%      -0.0        0.13 ±  4%  perf-profile.self.cycles-pp.__intel_pmu_enable_all
      0.10 ±  3%      +0.0        0.11        perf-profile.self.cycles-pp._copy_to_iter
      0.31            +0.0        0.32        perf-profile.self.cycles-pp.sctp_v4_xmit
      0.24            +0.0        0.26 ±  2%  perf-profile.self.cycles-pp.__sys_sendmsg
      0.06 ±  7%      +0.0        0.08        perf-profile.self.cycles-pp.dequeue_task_fair
      0.07 ±  5%      +0.0        0.09 ±  5%  perf-profile.self.cycles-pp.newidle_balance
      0.40            +0.0        0.42        perf-profile.self.cycles-pp.sctp_skb_recv_datagram
      0.19 ±  3%      +0.0        0.20 ±  2%  perf-profile.self.cycles-pp.menu_select
      0.11 ±  4%      +0.0        0.13 ±  4%  perf-profile.self.cycles-pp.enqueue_task_fair
      0.37            +0.0        0.39        perf-profile.self.cycles-pp.__check_object_size
      0.78            +0.0        0.80        perf-profile.self.cycles-pp._raw_spin_lock_bh
      0.22 ±  2%      +0.0        0.24 ±  2%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.21            +0.0        0.23        perf-profile.self.cycles-pp.__switch_to_asm
      0.27            +0.0        0.30        perf-profile.self.cycles-pp.___perf_sw_event
      0.38            +0.0        0.40 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.12 ±  3%      +0.0        0.15 ±  6%  perf-profile.self.cycles-pp.update_cfs_group
      0.35            +0.0        0.38        perf-profile.self.cycles-pp.____sys_recvmsg
      0.05            +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.schedule
      0.20            +0.0        0.22 ±  2%  perf-profile.self.cycles-pp.update_load_avg
      0.28 ±  2%      +0.0        0.31        perf-profile.self.cycles-pp.sctp_inet_skb_msgname
      0.23 ±  3%      +0.0        0.25        perf-profile.self.cycles-pp.ipv4_dst_check
      0.12 ±  4%      +0.0        0.14 ±  3%  perf-profile.self.cycles-pp.sk_leave_memory_pressure
      0.58            +0.0        0.61 ±  2%  perf-profile.self.cycles-pp.kmem_cache_alloc_node
      0.41            +0.0        0.44 ±  2%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.36            +0.0        0.39 ±  2%  perf-profile.self.cycles-pp.aa_sk_perm
      0.27 ±  3%      +0.0        0.30 ±  5%  perf-profile.self.cycles-pp.recv_sctp_stream_1toMany
      0.78            +0.0        0.82        perf-profile.self.cycles-pp.sctp_recvmsg
      0.71            +0.0        0.74        perf-profile.self.cycles-pp.sctp_sendmsg
      0.38 ±  3%      +0.0        0.42 ±  2%  perf-profile.self.cycles-pp.dst_release
      1.36            +0.1        1.42        perf-profile.self.cycles-pp.kmem_cache_free
      0.51 ±  2%      +0.1        0.58 ±  2%  perf-profile.self.cycles-pp.__sk_mem_raise_allocated
      0.63            +0.1        0.70 ±  2%  perf-profile.self.cycles-pp.sctp_eat_data
      0.47 ±  4%      +0.1        0.55 ±  3%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated
      3.77            +0.1        3.86        perf-profile.self.cycles-pp.copyin
      6.90            +0.1        7.03        perf-profile.self.cycles-pp.__memcpy
      7.30            +0.2        7.45        perf-profile.self.cycles-pp.acpi_safe_halt
      7.26            +0.2        7.49        perf-profile.self.cycles-pp.copyout
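
(For reference, call-graph rows like the perf-profile.*.cycles-pp entries
above come from perf sampling collected by the lkp harness.  A minimal
manual sketch that produces roughly comparable caller-folded output,
assuming a system-wide record taken while the benchmark is running; the
10s window and 0.5% threshold are illustrative choices, not what the
robot used:

	# sample all CPUs with call graphs for ~10 seconds
	perf record -a -g -- sleep 10
	# fold samples into caller-ordered chains, hiding entries below 0.5%
	perf report --stdio -g graph,0.5,caller
)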




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH -V3 7/9] mm: tune PCP high automatically
  2023-10-16  5:30 ` [PATCH -V3 7/9] mm: tune PCP high automatically Huang Ying
@ 2023-10-31  2:50   ` kernel test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-10-31  2:50 UTC (permalink / raw)
  To: Huang Ying
  Cc: oe-lkp, lkp, Mel Gorman, Michal Hocko, Andrew Morton,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Pavel Tatashin, Matthew Wilcox, Christoph Lameter, linux-mm,
	ying.huang, feng.tang, fengwei.yin, linux-kernel,
	Arjan Van De Ven, oliver.sang



Hello,

kernel test robot noticed an 8.4% improvement of will-it-scale.per_process_ops on:


commit: ba6149e96007edcdb01284c1531ebd49b4720f72 ("[PATCH -V3 7/9] mm: tune PCP high automatically")
url: https://github.com/intel-lab-lkp/linux/commits/Huang-Ying/mm-pcp-avoid-to-drain-PCP-when-process-exit/20231017-143633
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 36b2d7dd5a8ac95c8c1e69bdc93c4a6e2dc28a23
patch link: https://lore.kernel.org/all/20231016053002.756205-8-ying.huang@intel.com/
patch subject: [PATCH -V3 7/9] mm: tune PCP high automatically

testcase: will-it-scale
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

	nr_task: 16
	mode: process
	test: page_fault2
	cpufreq_governor: performance


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231031/202310311001.edbc5817-oliver.sang@intel.com
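
For rerunning this locally, a minimal sketch of the usual lkp-tests flow
(assuming the job file from the archive above has been saved as job.yaml;
the exact sub-commands and options are documented in the lkp-tests wiki
linked at the end of the robot's reports):

	# clone the 0-Day test harness and install the dependencies
	# declared in the job file
	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml

	# run the will-it-scale job against the booted kernel under test
	sudo bin/lkp run job.yaml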

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/process/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/page_fault2/will-it-scale

commit: 
  9f9d0b0869 ("mm: add framework for PCP high auto-tuning")
  ba6149e960 ("mm: tune PCP high automatically")

9f9d0b08696fb316 ba6149e96007edcdb01284c1531 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.29            +0.0        0.32        mpstat.cpu.all.usr%
   1434135 ±  2%     +15.8%    1660688 ±  4%  numa-meminfo.node0.AnonPages.max
     22.97            +2.0%      23.43        turbostat.RAMWatt
    213121 ±  5%     -19.5%     171478 ±  7%  meminfo.DirectMap4k
   8031428           +12.0%    8998346        meminfo.Memused
   9777522           +14.3%   11178004        meminfo.max_used_kB
   4913700            +8.4%    5326025        will-it-scale.16.processes
    307105            +8.4%     332876        will-it-scale.per_process_ops
   4913700            +8.4%    5326025        will-it-scale.workload
 1.488e+09            +8.5%  1.614e+09        proc-vmstat.numa_hit
 1.487e+09            +8.4%  1.612e+09        proc-vmstat.numa_local
 1.486e+09            +8.3%  1.609e+09        proc-vmstat.pgalloc_normal
 1.482e+09            +8.3%  1.604e+09        proc-vmstat.pgfault
 1.486e+09            +8.3%  1.609e+09        proc-vmstat.pgfree
   2535424 ±  2%      +6.2%    2693888 ±  2%  proc-vmstat.unevictable_pgs_scanned
      0.04 ±  9%     +62.2%       0.06 ± 20%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
     85.33 ±  7%     +36.1%     116.17 ±  8%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
    475.33 ±  3%     +24.8%     593.33 ±  4%  perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.16 ± 17%    +449.1%       0.87 ± 39%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.03 ± 10%     +94.1%       0.07 ± 26%  perf-sched.wait_time.avg.ms.__cond_resched.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault
      0.04 ±  9%     +62.2%       0.06 ± 20%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.16 ± 17%    +449.1%       0.87 ± 39%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
     14.01            +6.0%      14.85        perf-stat.i.MPKI
  5.79e+09            +3.6%  6.001e+09        perf-stat.i.branch-instructions
      0.20 ±  2%      +0.0        0.21 ±  2%  perf-stat.i.branch-miss-rate%
  12098037 ±  2%      +8.5%   13122446 ±  2%  perf-stat.i.branch-misses
     82.90            +2.1       85.03        perf-stat.i.cache-miss-rate%
 4.005e+08            +9.8%  4.399e+08        perf-stat.i.cache-misses
  4.83e+08            +7.1%  5.174e+08        perf-stat.i.cache-references
      2.29            -3.2%       2.22        perf-stat.i.cpi
    164.08            -9.0%     149.33        perf-stat.i.cycles-between-cache-misses
 7.091e+09            +4.2%  7.392e+09        perf-stat.i.dTLB-loads
      0.97            +0.0        1.01        perf-stat.i.dTLB-store-miss-rate%
  40301594            +8.8%   43829422        perf-stat.i.dTLB-store-misses
 4.121e+09            +4.4%  4.302e+09        perf-stat.i.dTLB-stores
     83.96            +2.6       86.59        perf-stat.i.iTLB-load-miss-rate%
  10268085 ±  3%     +23.0%   12628681 ±  3%  perf-stat.i.iTLB-load-misses
 2.861e+10            +3.7%  2.966e+10        perf-stat.i.instructions
      2796 ±  3%     -15.7%       2356 ±  3%  perf-stat.i.instructions-per-iTLB-miss
      0.44            +3.3%       0.45        perf-stat.i.ipc
    984.67            +9.6%       1078        perf-stat.i.metric.K/sec
     78.05            +4.2%      81.29        perf-stat.i.metric.M/sec
   4913856            +8.4%    5329060        perf-stat.i.minor-faults
 1.356e+08           +10.6%  1.499e+08        perf-stat.i.node-loads
  32443508            +7.6%   34908277        perf-stat.i.node-stores
   4913858            +8.4%    5329062        perf-stat.i.page-faults
     14.00            +6.0%      14.83        perf-stat.overall.MPKI
      0.21 ±  2%      +0.0        0.22 ±  2%  perf-stat.overall.branch-miss-rate%
     82.92            +2.1       85.02        perf-stat.overall.cache-miss-rate%
      2.29            -3.1%       2.21        perf-stat.overall.cpi
    163.33            -8.6%     149.29        perf-stat.overall.cycles-between-cache-misses
      0.97            +0.0        1.01        perf-stat.overall.dTLB-store-miss-rate%
     84.00            +2.6       86.61        perf-stat.overall.iTLB-load-miss-rate%
      2789 ±  3%     -15.7%       2350 ±  3%  perf-stat.overall.instructions-per-iTLB-miss
      0.44            +3.2%       0.45        perf-stat.overall.ipc
   1754985            -4.7%    1673375        perf-stat.overall.path-length
 5.771e+09            +3.6%  5.981e+09        perf-stat.ps.branch-instructions
  12074113 ±  2%      +8.4%   13094204 ±  2%  perf-stat.ps.branch-misses
 3.992e+08            +9.8%  4.384e+08        perf-stat.ps.cache-misses
 4.814e+08            +7.1%  5.157e+08        perf-stat.ps.cache-references
 7.068e+09            +4.2%  7.367e+09        perf-stat.ps.dTLB-loads
  40167519            +8.7%   43680173        perf-stat.ps.dTLB-store-misses
 4.107e+09            +4.4%  4.288e+09        perf-stat.ps.dTLB-stores
  10234325 ±  3%     +23.0%   12587000 ±  3%  perf-stat.ps.iTLB-load-misses
 2.852e+10            +3.6%  2.956e+10        perf-stat.ps.instructions
   4897507            +8.4%    5310921        perf-stat.ps.minor-faults
 1.351e+08           +10.5%  1.494e+08        perf-stat.ps.node-loads
  32335421            +7.6%   34789913        perf-stat.ps.node-stores
   4897509            +8.4%    5310923        perf-stat.ps.page-faults
 8.623e+12            +3.4%  8.912e+12        perf-stat.total.instructions
      9.86 ±  3%      -8.4        1.49 ±  5%  perf-profile.calltrace.cycles-pp.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages
      8.11 ±  3%      -7.5        0.58 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist
      8.10 ±  3%      -7.5        0.58 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.rmqueue_bulk.__rmqueue_pcplist.rmqueue
      7.52 ±  3%      -6.4        1.15 ±  5%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush.zap_pte_range
      7.90 ±  4%      -6.4        1.55 ±  4%  perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range
      5.78 ±  4%      -5.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush
      5.78 ±  4%      -5.8        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page_list.release_pages
     10.90 ±  3%      -5.3        5.59 ±  2%  perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault
     10.57 ±  3%      -5.3        5.26 ±  3%  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.__folio_alloc.vma_alloc_folio
     10.21 ±  3%      -5.3        4.94 ±  3%  perf-profile.calltrace.cycles-pp.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages.__folio_alloc
     11.18 ±  3%      -5.3        5.91 ±  2%  perf-profile.calltrace.cycles-pp.__folio_alloc.vma_alloc_folio.do_cow_fault.do_fault.__handle_mm_fault
     11.15 ±  3%      -5.3        5.88 ±  2%  perf-profile.calltrace.cycles-pp.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault.do_fault
     11.56 ±  3%      -5.2        6.37 ±  2%  perf-profile.calltrace.cycles-pp.vma_alloc_folio.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      9.76 ±  3%      -4.3        5.50 ±  6%  perf-profile.calltrace.cycles-pp.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range.unmap_page_range
     10.18 ±  3%      -4.2        5.95 ±  5%  perf-profile.calltrace.cycles-pp.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_vmi_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.39 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap.__x64_sys_munmap
     14.08 ±  3%      -3.6       10.49        perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.do_vmi_align_munmap.do_vmi_munmap
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region.do_vmi_align_munmap
      1.60 ±  2%      -0.7        0.86 ±  6%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid_or_report.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist
      0.96 ±  3%      -0.4        0.56 ±  3%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
      1.00 ±  4%      -0.4        0.62 ±  4%  perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
      1.26 ±  4%      -0.1        1.11 ±  2%  perf-profile.calltrace.cycles-pp.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region.do_vmi_align_munmap
      1.28 ±  3%      -0.1        1.16 ±  3%  perf-profile.calltrace.cycles-pp.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region.do_vmi_align_munmap.do_vmi_munmap
      1.28 ±  4%      -0.1        1.17 ±  2%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap
      0.60 ±  3%      -0.0        0.57        perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      0.55 ±  3%      +0.0        0.60        perf-profile.calltrace.cycles-pp.__perf_sw_event.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      0.73 ±  3%      +0.1        0.79 ±  2%  perf-profile.calltrace.cycles-pp.lock_vma_under_rcu.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      0.68 ±  3%      +0.1        0.78 ±  3%  perf-profile.calltrace.cycles-pp.page_remove_rmap.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
      0.57 ±  7%      +0.1        0.71 ±  8%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.folio_add_new_anon_rmap.set_pte_range.finish_fault.do_cow_fault
      1.41 ±  3%      +0.1        1.55        perf-profile.calltrace.cycles-pp.sync_regs.asm_exc_page_fault.testcase
      0.77 ±  4%      +0.2        0.93 ±  5%  perf-profile.calltrace.cycles-pp.folio_add_new_anon_rmap.set_pte_range.finish_fault.do_cow_fault.do_fault
      0.94 ±  3%      +0.2        1.12 ±  3%  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault
      0.36 ± 70%      +0.2        0.57        perf-profile.calltrace.cycles-pp.__perf_sw_event.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      1.26 ±  5%      +0.2        1.47 ±  3%  perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_fault.__do_fault.do_cow_fault
      1.61 ±  5%      +0.3        1.87 ±  3%  perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fault.__do_fault.do_cow_fault.do_fault
      1.75 ±  5%      +0.3        2.05 ±  3%  perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.do_cow_fault.do_fault.__handle_mm_fault
      1.86 ±  4%      +0.3        2.17 ±  2%  perf-profile.calltrace.cycles-pp.__do_fault.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      0.17 ±141%      +0.4        0.58 ±  3%  perf-profile.calltrace.cycles-pp.xas_load.filemap_get_entry.shmem_get_folio_gfp.shmem_fault.__do_fault
      2.60 ±  3%      +0.5        3.14 ±  5%  perf-profile.calltrace.cycles-pp._compound_head.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
      4.51 ±  3%      +0.7        5.16        perf-profile.calltrace.cycles-pp._raw_spin_lock.__pte_offset_map_lock.finish_fault.do_cow_fault.do_fault
      4.65 ±  3%      +0.7        5.32        perf-profile.calltrace.cycles-pp.__pte_offset_map_lock.finish_fault.do_cow_fault.do_fault.__handle_mm_fault
      1.61 ±  3%      +1.9        3.52 ±  6%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma
      0.85 ±  2%      +1.9        2.77 ± 13%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.zap_pte_range
      0.84 ±  2%      +1.9        2.76 ± 13%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
      0.85 ±  2%      +1.9        2.78 ± 12%  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range
      1.71 ±  3%      +1.9        3.64 ±  6%  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault
      1.70 ±  2%      +1.9        3.63 ±  6%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma.set_pte_range
      3.31 ±  2%      +2.2        5.52 ±  5%  perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault.do_cow_fault
      3.46 ±  2%      +2.2        5.71 ±  5%  perf-profile.calltrace.cycles-pp.folio_add_lru_vma.set_pte_range.finish_fault.do_cow_fault.do_fault
      4.47 ±  2%      +2.4        6.90 ±  4%  perf-profile.calltrace.cycles-pp.set_pte_range.finish_fault.do_cow_fault.do_fault.__handle_mm_fault
      9.22 ±  2%      +3.1       12.33 ±  2%  perf-profile.calltrace.cycles-pp.finish_fault.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
     44.13 ±  3%      +3.2       47.34        perf-profile.calltrace.cycles-pp.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
     44.27 ±  3%      +3.2       47.49        perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
     45.63 ±  2%      +3.3       48.95        perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      0.00            +3.4        3.37 ±  2%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid_or_report.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages
     46.88 ±  3%      +3.4       50.29        perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
     49.40 ±  2%      +3.6       53.03        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
     49.59 ±  2%      +3.7       53.24        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
     59.06 ±  2%      +4.5       63.60        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
     56.32 ±  3%      +4.6       60.89        perf-profile.calltrace.cycles-pp.testcase
     20.16 ±  3%      +4.9       25.10        perf-profile.calltrace.cycles-pp.copy_page.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
     16.66 ±  3%      -8.8        7.83 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     16.48 ±  3%      -8.8        7.66 ±  8%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      9.90 ±  3%      -8.4        1.50 ±  5%  perf-profile.children.cycles-pp.rmqueue_bulk
      8.92 ±  3%      -6.7        2.18 ±  2%  perf-profile.children.cycles-pp.free_unref_page_list
      8.47 ±  3%      -6.7        1.74 ±  4%  perf-profile.children.cycles-pp.free_pcppages_bulk
     10.96 ±  3%      -5.3        5.64 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
     10.62 ±  3%      -5.3        5.30 ±  2%  perf-profile.children.cycles-pp.rmqueue
     10.26 ±  3%      -5.3        4.97 ±  3%  perf-profile.children.cycles-pp.__rmqueue_pcplist
     11.24 ±  3%      -5.3        5.96 ±  2%  perf-profile.children.cycles-pp.__alloc_pages
     11.18 ±  3%      -5.3        5.92 ±  2%  perf-profile.children.cycles-pp.__folio_alloc
     11.57 ±  3%      -5.2        6.37 ±  2%  perf-profile.children.cycles-pp.vma_alloc_folio
     11.19 ±  3%      -4.4        6.82 ±  5%  perf-profile.children.cycles-pp.release_pages
     11.46 ±  3%      -4.3        7.12 ±  5%  perf-profile.children.cycles-pp.tlb_batch_pages_flush
     15.52 ±  3%      -3.7       11.81        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     15.52 ±  3%      -3.7       11.81        perf-profile.children.cycles-pp.do_syscall_64
     15.41 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.do_vmi_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.do_vmi_align_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__x64_sys_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__vm_munmap
     15.39 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.unmap_region
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.unmap_vmas
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.unmap_page_range
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.zap_pmd_range
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.zap_pte_range
      2.60 ±  3%      -2.0        0.56 ±  4%  perf-profile.children.cycles-pp.__free_one_page
      1.28 ±  3%      -0.1        1.17 ±  2%  perf-profile.children.cycles-pp.tlb_finish_mmu
      0.15 ± 19%      -0.1        0.08 ± 14%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      0.61 ±  3%      -0.0        0.58 ±  2%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      0.11 ±  6%      -0.0        0.08 ±  7%  perf-profile.children.cycles-pp.__mod_zone_page_state
      0.25 ±  4%      +0.0        0.26        perf-profile.children.cycles-pp.error_entry
      0.15 ±  3%      +0.0        0.17 ±  4%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.12 ±  8%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
      0.18 ±  3%      +0.0        0.20 ±  4%  perf-profile.children.cycles-pp.access_error
      0.07 ±  5%      +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.task_tick_fair
      0.04 ± 45%      +0.0        0.06 ±  7%  perf-profile.children.cycles-pp.page_counter_try_charge
      0.30 ±  4%      +0.0        0.32        perf-profile.children.cycles-pp.down_read_trylock
      0.27 ±  3%      +0.0        0.30 ±  2%  perf-profile.children.cycles-pp.up_read
      0.15 ±  8%      +0.0        0.18 ±  3%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
      0.02 ±142%      +0.1        0.07 ± 29%  perf-profile.children.cycles-pp.ret_from_fork_asm
      0.44 ±  2%      +0.1        0.49 ±  3%  perf-profile.children.cycles-pp.mas_walk
      0.46 ±  4%      +0.1        0.52 ±  2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.67 ±  3%      +0.1        0.73 ±  4%  perf-profile.children.cycles-pp.lock_mm_and_find_vma
      0.42 ±  3%      +0.1        0.48 ±  2%  perf-profile.children.cycles-pp.free_swap_cache
      0.43 ±  4%      +0.1        0.49 ±  2%  perf-profile.children.cycles-pp.free_pages_and_swap_cache
      0.30 ±  5%      +0.1        0.37 ±  3%  perf-profile.children.cycles-pp.xas_descend
      0.86 ±  3%      +0.1        0.92        perf-profile.children.cycles-pp.___perf_sw_event
      0.73 ±  3%      +0.1        0.80        perf-profile.children.cycles-pp.lock_vma_under_rcu
      0.40 ±  2%      +0.1        0.47        perf-profile.children.cycles-pp.__mod_node_page_state
      0.01 ±223%      +0.1        0.09 ± 12%  perf-profile.children.cycles-pp.shmem_get_policy
      0.53 ±  2%      +0.1        0.62 ±  2%  perf-profile.children.cycles-pp.__mod_lruvec_state
      1.09 ±  3%      +0.1        1.18        perf-profile.children.cycles-pp.__perf_sw_event
      0.50 ±  5%      +0.1        0.60 ±  3%  perf-profile.children.cycles-pp.xas_load
      0.68 ±  3%      +0.1        0.78 ±  3%  perf-profile.children.cycles-pp.page_remove_rmap
      1.45 ±  3%      +0.1        1.60        perf-profile.children.cycles-pp.sync_regs
      0.77 ±  4%      +0.2        0.93 ±  5%  perf-profile.children.cycles-pp.folio_add_new_anon_rmap
      0.84 ±  5%      +0.2        1.02 ±  7%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
      0.96 ±  4%      +0.2        1.15 ±  3%  perf-profile.children.cycles-pp.lru_add_fn
      1.27 ±  5%      +0.2        1.48 ±  3%  perf-profile.children.cycles-pp.filemap_get_entry
      1.62 ±  4%      +0.3        1.88 ±  3%  perf-profile.children.cycles-pp.shmem_get_folio_gfp
      1.75 ±  5%      +0.3        2.06 ±  3%  perf-profile.children.cycles-pp.shmem_fault
      1.87 ±  4%      +0.3        2.18 ±  2%  perf-profile.children.cycles-pp.__do_fault
      2.19 ±  2%      +0.3        2.51        perf-profile.children.cycles-pp.native_irq_return_iret
      2.64 ±  4%      +0.5        3.18 ±  6%  perf-profile.children.cycles-pp._compound_head
      4.62 ±  3%      +0.6        5.26        perf-profile.children.cycles-pp._raw_spin_lock
      4.67 ±  3%      +0.7        5.34        perf-profile.children.cycles-pp.__pte_offset_map_lock
      3.32 ±  2%      +2.2        5.54 ±  5%  perf-profile.children.cycles-pp.folio_batch_move_lru
      3.47 ±  2%      +2.2        5.72 ±  5%  perf-profile.children.cycles-pp.folio_add_lru_vma
      4.49 ±  2%      +2.4        6.92 ±  4%  perf-profile.children.cycles-pp.set_pte_range
      9.25 ±  2%      +3.1       12.36 ±  2%  perf-profile.children.cycles-pp.finish_fault
      2.25 ±  2%      +3.1        5.36 ±  2%  perf-profile.children.cycles-pp.__list_del_entry_valid_or_report
     44.16 ±  3%      +3.2       47.37        perf-profile.children.cycles-pp.do_cow_fault
     44.28 ±  3%      +3.2       47.50        perf-profile.children.cycles-pp.do_fault
     45.66 ±  2%      +3.3       48.98        perf-profile.children.cycles-pp.__handle_mm_fault
     46.91 ±  2%      +3.4       50.33        perf-profile.children.cycles-pp.handle_mm_fault
     49.44 ±  2%      +3.6       53.08        perf-profile.children.cycles-pp.do_user_addr_fault
     49.62 ±  2%      +3.6       53.27        perf-profile.children.cycles-pp.exc_page_fault
      2.70 ±  3%      +4.1        6.75 ±  8%  perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
     55.26 ±  2%      +4.2       59.44        perf-profile.children.cycles-pp.asm_exc_page_fault
     58.13 ±  3%      +4.6       62.72        perf-profile.children.cycles-pp.testcase
     20.19 ±  3%      +4.9       25.14        perf-profile.children.cycles-pp.copy_page
     16.48 ±  3%      -8.8        7.66 ±  8%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      2.53 ±  3%      -2.0        0.54 ±  3%  perf-profile.self.cycles-pp.__free_one_page
      0.12 ±  4%      -0.1        0.05 ± 46%  perf-profile.self.cycles-pp.rmqueue_bulk
      0.14 ± 19%      -0.1        0.08 ± 14%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.10 ±  3%      -0.0        0.08 ± 10%  perf-profile.self.cycles-pp.__mod_zone_page_state
      0.13 ±  5%      +0.0        0.14 ±  2%  perf-profile.self.cycles-pp.free_unref_page_commit
      0.13 ±  3%      +0.0        0.14 ±  3%  perf-profile.self.cycles-pp.exc_page_fault
      0.15 ±  5%      +0.0        0.17 ±  4%  perf-profile.self.cycles-pp.__pte_offset_map
      0.04 ± 44%      +0.0        0.06 ±  6%  perf-profile.self.cycles-pp.page_counter_try_charge
      0.18 ±  3%      +0.0        0.20 ±  4%  perf-profile.self.cycles-pp.access_error
      0.30 ±  3%      +0.0        0.32 ±  2%  perf-profile.self.cycles-pp.down_read_trylock
      0.16 ±  6%      +0.0        0.18        perf-profile.self.cycles-pp.set_pte_range
      0.26 ±  2%      +0.0        0.29 ±  3%  perf-profile.self.cycles-pp.up_read
      0.15 ±  8%      +0.0        0.18 ±  4%  perf-profile.self.cycles-pp.folio_add_lru_vma
      0.15 ±  8%      +0.0        0.18 ±  3%  perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
      0.22 ±  6%      +0.0        0.26 ±  5%  perf-profile.self.cycles-pp.__alloc_pages
      0.32 ±  6%      +0.0        0.36 ±  3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.28 ±  5%      +0.0        0.32 ±  4%  perf-profile.self.cycles-pp.do_cow_fault
      0.14 ±  7%      +0.0        0.18 ±  6%  perf-profile.self.cycles-pp.shmem_fault
      0.34 ±  5%      +0.0        0.38 ±  4%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__cond_resched
      0.44 ±  3%      +0.1        0.49 ±  4%  perf-profile.self.cycles-pp.page_remove_rmap
      0.41 ±  3%      +0.1        0.47 ±  3%  perf-profile.self.cycles-pp.free_swap_cache
      0.75 ±  3%      +0.1        0.81 ±  2%  perf-profile.self.cycles-pp.___perf_sw_event
      0.91 ±  2%      +0.1        0.98 ±  2%  perf-profile.self.cycles-pp.__handle_mm_fault
      0.29 ±  6%      +0.1        0.36 ±  3%  perf-profile.self.cycles-pp.xas_descend
      0.38 ±  2%      +0.1        0.45 ±  2%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.01 ±223%      +0.1        0.09 ±  8%  perf-profile.self.cycles-pp.shmem_get_policy
      0.58 ±  3%      +0.1        0.66 ±  2%  perf-profile.self.cycles-pp.release_pages
      0.44 ±  4%      +0.1        0.54 ±  3%  perf-profile.self.cycles-pp.lru_add_fn
      1.44 ±  3%      +0.1        1.59        perf-profile.self.cycles-pp.sync_regs
      2.18 ±  2%      +0.3        2.50        perf-profile.self.cycles-pp.native_irq_return_iret
      4.36 ±  3%      +0.4        4.76        perf-profile.self.cycles-pp.testcase
      2.61 ±  4%      +0.5        3.14 ±  5%  perf-profile.self.cycles-pp._compound_head
      4.60 ±  3%      +0.6        5.23        perf-profile.self.cycles-pp._raw_spin_lock
      2.23 ±  2%      +3.1        5.34 ±  2%  perf-profile.self.cycles-pp.__list_del_entry_valid_or_report
     20.10 ±  3%      +4.9       25.02        perf-profile.self.cycles-pp.copy_page




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
  2023-10-16  5:29 ` [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages Huang Ying
  2023-10-27  6:23   ` kernel test robot
@ 2023-11-06  6:22   ` kernel test robot
  2023-11-06  6:38     ` Huang, Ying
  1 sibling, 1 reply; 20+ messages in thread
From: kernel test robot @ 2023-11-06  6:22 UTC (permalink / raw)
  To: Huang Ying
  Cc: oe-lkp, lkp, Mel Gorman, Andrew Morton, Sudeep Holla,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter,
	linux-kernel, linux-mm, ying.huang, feng.tang, fengwei.yin,
	Arjan Van De Ven, oliver.sang


hi, Huang Ying,

sorry for the late report.
we previously reported
"a 14.6% improvement of netperf.Throughput_Mbps"
in
https://lore.kernel.org/all/202310271441.71ce0a9-oliver.sang@intel.com/

later, our auto-bisect tool captured a regression on a netperf test with
different configurations; unfortunately, it treated the commit as already
'reported', so we missed sending this report the first time.

now sending it again, FYI.


Hello,

kernel test robot noticed a -60.4% regression of netperf.Throughput_Mbps on:


commit: f5ddc662f07d7d99e9cfc5e07778e26c7394caf8 ("[PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages")
url: https://github.com/intel-lab-lkp/linux/commits/Huang-Ying/mm-pcp-avoid-to-drain-PCP-when-process-exit/20231017-143633
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 36b2d7dd5a8ac95c8c1e69bdc93c4a6e2dc28a23
patch link: https://lore.kernel.org/all/20231016053002.756205-4-ying.huang@intel.com/
patch subject: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages

testcase: netperf
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:

	ip: ipv4
	runtime: 300s
	nr_threads: 50%
	cluster: cs-localhost
	test: UDP_STREAM
	cpufreq_governor: performance
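
For reference, a rough manual approximation of this job (a sketch only; the
actual lkp-tests invocation, socket buffer settings, and CPU binding may
differ) is 64 concurrent netperf UDP_STREAM pairs over loopback, i.e. 50% of
the 128 logical CPUs, each running for 300s:

	# hypothetical approximation of the job above, not the exact lkp command
	netserver                        # start the netperf server on localhost
	for i in $(seq 1 64); do         # nr_threads: 50% of 128 logical CPUs
		netperf -t UDP_STREAM -H 127.0.0.1 -l 300 &   # runtime: 300s
	done
	wait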



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311061311.8d63998-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231106/202311061311.8d63998-oliver.sang@intel.com
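
For reference, the usual lkp-tests reproduction flow (assuming the job.yaml
from the archive above; exact steps may vary between lkp-tests versions) is
roughly:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml           # job file is in the archive above
	sudo bin/lkp split-job --any job.yaml   # generate the yaml file for lkp run
	sudo bin/lkp run generated-yaml-file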

=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/tbox_group/test/testcase:
  cs-localhost/gcc-12/performance/ipv4/x86_64-rhel-8.3/50%/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp2/UDP_STREAM/netperf

commit: 
  c828e65251 ("cacheinfo: calculate size of per-CPU data cache slice")
  f5ddc662f0 ("mm, pcp: reduce lock contention for draining high-order pages")

c828e65251502516 f5ddc662f07d7d99e9cfc5e0777 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      7321 ±  4%     +28.2%       9382        uptime.idle
     50.65 ±  4%      -4.0%      48.64        boot-time.boot
      6042 ±  4%      -4.2%       5785        boot-time.idle
 1.089e+09 ±  2%    +232.1%  3.618e+09        cpuidle..time
   1087075 ±  2%  +24095.8%   2.63e+08        cpuidle..usage
   3357014           +99.9%    6710312        vmstat.memory.cache
     48731 ± 19%   +4666.5%    2322787        vmstat.system.cs
    144637          +711.2%    1173334        vmstat.system.in
      2.59 ±  2%      +6.2        8.79        mpstat.cpu.all.idle%
      1.01            +0.7        1.66        mpstat.cpu.all.irq%
      6.00            -3.2        2.79        mpstat.cpu.all.soft%
      1.13 ±  2%      -0.1        1.02        mpstat.cpu.all.usr%
 1.407e+09 ±  3%     -28.2%  1.011e+09        numa-numastat.node0.local_node
 1.407e+09 ±  3%     -28.2%   1.01e+09        numa-numastat.node0.numa_hit
 1.469e+09 ±  8%     -32.0%  9.979e+08        numa-numastat.node1.local_node
 1.469e+09 ±  8%     -32.1%  9.974e+08        numa-numastat.node1.numa_hit
    103.00 ± 19%     -44.0%      57.67 ± 20%  perf-c2c.DRAM.local
      8970 ± 12%     -89.4%     951.00 ±  4%  perf-c2c.DRAM.remote
      8192 ±  5%     +68.5%      13807        perf-c2c.HITM.local
      6675 ± 11%     -92.6%     491.00 ±  2%  perf-c2c.HITM.remote
   1051014 ±  2%  +24922.0%   2.63e+08        turbostat.C1
      2.75 ±  2%      +6.5        9.29        turbostat.C1%
      2.72 ±  2%    +178.3%       7.57        turbostat.CPU%c1
      0.09           -22.2%       0.07        turbostat.IPC
  44589125          +701.5%  3.574e+08        turbostat.IRQ
    313.00 ± 57%   +1967.0%       6469 ±  8%  turbostat.POLL
     70.33            +3.3%      72.67        turbostat.PkgTmp
     44.23 ±  4%     -31.8%      30.15 ±  2%  turbostat.RAMWatt
    536096          +583.7%    3665194        meminfo.Active
    535414          +584.4%    3664543        meminfo.Active(anon)
   3238301          +103.2%    6579677        meminfo.Cached
   1204424          +278.9%    4563575        meminfo.Committed_AS
    469093           +47.9%     693889 ±  3%  meminfo.Inactive
    467250           +48.4%     693496 ±  3%  meminfo.Inactive(anon)
     53615          +562.5%     355225 ±  4%  meminfo.Mapped
   5223078           +64.1%    8571212        meminfo.Memused
    557305          +599.6%    3899111        meminfo.Shmem
   5660207           +58.9%    8993642        meminfo.max_used_kB
     78504 ±  3%     -30.1%      54869        netperf.ThroughputBoth_Mbps
   5024292 ±  3%     -30.1%    3511666        netperf.ThroughputBoth_total_Mbps
      7673 ±  5%    +249.7%      26832        netperf.ThroughputRecv_Mbps
    491074 ±  5%    +249.7%    1717287        netperf.ThroughputRecv_total_Mbps
     70831 ±  2%     -60.4%      28037        netperf.Throughput_Mbps
   4533217 ±  2%     -60.4%    1794379        netperf.Throughput_total_Mbps
      5439            +9.4%       5949        netperf.time.percent_of_cpu_this_job_got
     16206            +9.4%      17728        netperf.time.system_time
    388.14           -51.9%     186.53        netperf.time.user_time
 2.876e+09 ±  3%     -30.1%   2.01e+09        netperf.workload
    177360 ± 30%     -36.0%     113450 ± 20%  numa-meminfo.node0.AnonPages
    255926 ± 12%     -40.6%     152052 ± 12%  numa-meminfo.node0.AnonPages.max
     22582 ± 61%    +484.2%     131916 ± 90%  numa-meminfo.node0.Mapped
    138287 ± 17%     +22.6%     169534 ± 12%  numa-meminfo.node1.AnonHugePages
    267468 ± 20%     +29.1%     345385 ±  6%  numa-meminfo.node1.AnonPages
    346204 ± 18%     +34.5%     465696 ±  2%  numa-meminfo.node1.AnonPages.max
    279416 ± 19%     +77.0%     494652 ± 18%  numa-meminfo.node1.Inactive
    278445 ± 19%     +77.6%     494393 ± 18%  numa-meminfo.node1.Inactive(anon)
     31726 ± 45%    +607.7%     224533 ± 45%  numa-meminfo.node1.Mapped
      4802 ±  6%     +19.4%       5733 ±  3%  numa-meminfo.node1.PageTables
    297323 ± 12%    +792.6%    2653850 ± 63%  numa-meminfo.node1.Shmem
     44325 ± 30%     -36.0%      28379 ± 20%  numa-vmstat.node0.nr_anon_pages
      5590 ± 61%    +491.0%      33042 ± 90%  numa-vmstat.node0.nr_mapped
 1.407e+09 ±  3%     -28.2%   1.01e+09        numa-vmstat.node0.numa_hit
 1.407e+09 ±  3%     -28.2%  1.011e+09        numa-vmstat.node0.numa_local
     66858 ± 20%     +29.2%      86385 ±  6%  numa-vmstat.node1.nr_anon_pages
     69601 ± 20%     +77.8%     123729 ± 18%  numa-vmstat.node1.nr_inactive_anon
      7953 ± 45%    +608.3%      56335 ± 45%  numa-vmstat.node1.nr_mapped
      1201 ±  6%     +19.4%       1434 ±  3%  numa-vmstat.node1.nr_page_table_pages
     74288 ± 11%    +792.6%     663111 ± 63%  numa-vmstat.node1.nr_shmem
     69601 ± 20%     +77.8%     123728 ± 18%  numa-vmstat.node1.nr_zone_inactive_anon
 1.469e+09 ±  8%     -32.1%  9.974e+08        numa-vmstat.node1.numa_hit
 1.469e+09 ±  8%     -32.0%  9.979e+08        numa-vmstat.node1.numa_local
    133919          +584.2%     916254        proc-vmstat.nr_active_anon
    111196            +3.3%     114828        proc-vmstat.nr_anon_pages
   5602484            -1.5%    5518799        proc-vmstat.nr_dirty_background_threshold
  11218668            -1.5%   11051092        proc-vmstat.nr_dirty_threshold
    809646          +103.2%    1645012        proc-vmstat.nr_file_pages
  56374629            -1.5%   55536913        proc-vmstat.nr_free_pages
    116775           +48.4%     173349 ±  3%  proc-vmstat.nr_inactive_anon
     13386 ±  2%    +563.3%      88793 ±  4%  proc-vmstat.nr_mapped
      2286            +6.5%       2434        proc-vmstat.nr_page_table_pages
    139393          +599.4%     974869        proc-vmstat.nr_shmem
     29092            +6.6%      31019        proc-vmstat.nr_slab_reclaimable
    133919          +584.2%     916254        proc-vmstat.nr_zone_active_anon
    116775           +48.4%     173349 ±  3%  proc-vmstat.nr_zone_inactive_anon
     32135 ± 11%    +257.2%     114797 ± 21%  proc-vmstat.numa_hint_faults
     20858 ± 16%    +318.3%      87244 ±  6%  proc-vmstat.numa_hint_faults_local
 2.876e+09 ±  3%     -30.2%  2.008e+09        proc-vmstat.numa_hit
 2.876e+09 ±  3%     -30.2%  2.008e+09        proc-vmstat.numa_local
     25453 ±  7%     -75.2%       6324 ± 30%  proc-vmstat.numa_pages_migrated
    178224 ±  2%     +76.6%     314680 ±  7%  proc-vmstat.numa_pte_updates
    160889 ±  3%    +267.6%     591393 ±  6%  proc-vmstat.pgactivate
 2.295e+10 ±  3%     -30.2%  1.601e+10        proc-vmstat.pgalloc_normal
   1026605           +21.9%    1251671        proc-vmstat.pgfault
 2.295e+10 ±  3%     -30.2%  1.601e+10        proc-vmstat.pgfree
     25453 ±  7%     -75.2%       6324 ± 30%  proc-vmstat.pgmigrate_success
     39208 ±  2%      -6.1%      36815        proc-vmstat.pgreuse
   3164416           -20.3%    2521344 ±  2%  proc-vmstat.unevictable_pgs_scanned
  19248627           -22.1%   14989905        sched_debug.cfs_rq:/.avg_vruntime.avg
  20722680           -24.9%   15569530        sched_debug.cfs_rq:/.avg_vruntime.max
  17634233           -22.5%   13663168        sched_debug.cfs_rq:/.avg_vruntime.min
    949063 ±  2%     -70.5%     280388        sched_debug.cfs_rq:/.avg_vruntime.stddev
      0.78 ± 10%    -100.0%       0.00        sched_debug.cfs_rq:/.h_nr_running.min
      0.16 ±  8%    +113.3%       0.33 ±  2%  sched_debug.cfs_rq:/.h_nr_running.stddev
      0.56 ±141%  +2.2e+07%     122016 ± 52%  sched_debug.cfs_rq:/.left_vruntime.avg
     45.01 ±141%  +2.2e+07%   10035976 ± 28%  sched_debug.cfs_rq:/.left_vruntime.max
      4.58 ±141%  +2.3e+07%    1072762 ± 36%  sched_debug.cfs_rq:/.left_vruntime.stddev
      5814 ± 10%    -100.0%       0.00        sched_debug.cfs_rq:/.load.min
      5.39 ±  9%     -73.2%       1.44 ± 10%  sched_debug.cfs_rq:/.load_avg.min
  19248627           -22.1%   14989905        sched_debug.cfs_rq:/.min_vruntime.avg
  20722680           -24.9%   15569530        sched_debug.cfs_rq:/.min_vruntime.max
  17634233           -22.5%   13663168        sched_debug.cfs_rq:/.min_vruntime.min
    949063 ±  2%     -70.5%     280388        sched_debug.cfs_rq:/.min_vruntime.stddev
      0.78 ± 10%    -100.0%       0.00        sched_debug.cfs_rq:/.nr_running.min
      0.06 ±  8%    +369.2%       0.30 ±  3%  sched_debug.cfs_rq:/.nr_running.stddev
      4.84 ± 26%   +1611.3%      82.79 ± 67%  sched_debug.cfs_rq:/.removed.load_avg.avg
     27.92 ± 12%   +3040.3%     876.79 ± 68%  sched_debug.cfs_rq:/.removed.load_avg.stddev
      0.56 ±141%  +2.2e+07%     122016 ± 52%  sched_debug.cfs_rq:/.right_vruntime.avg
     45.06 ±141%  +2.2e+07%   10035976 ± 28%  sched_debug.cfs_rq:/.right_vruntime.max
      4.59 ±141%  +2.3e+07%    1072762 ± 36%  sched_debug.cfs_rq:/.right_vruntime.stddev
    900.25           -10.4%     806.45        sched_debug.cfs_rq:/.runnable_avg.avg
    533.28 ±  4%     -87.0%      69.56 ± 39%  sched_debug.cfs_rq:/.runnable_avg.min
    122.77 ±  2%     +92.9%     236.86        sched_debug.cfs_rq:/.runnable_avg.stddev
    896.13           -10.8%     799.44        sched_debug.cfs_rq:/.util_avg.avg
    379.06 ±  4%     -83.4%      62.94 ± 37%  sched_debug.cfs_rq:/.util_avg.min
    116.35 ±  8%     +99.4%     232.04        sched_debug.cfs_rq:/.util_avg.stddev
    550.87           -14.2%     472.66 ±  2%  sched_debug.cfs_rq:/.util_est_enqueued.avg
      1124 ±  8%     +18.2%       1329 ±  3%  sched_debug.cfs_rq:/.util_est_enqueued.max
    134.17 ± 30%    -100.0%       0.00        sched_debug.cfs_rq:/.util_est_enqueued.min
    558243 ±  6%     -66.9%     184666        sched_debug.cpu.avg_idle.avg
     12860 ± 11%     -56.1%       5644        sched_debug.cpu.avg_idle.min
    365635           -53.5%     169863 ±  5%  sched_debug.cpu.avg_idle.stddev
      9.56 ±  3%     -28.4%       6.84 ±  8%  sched_debug.cpu.clock.stddev
      6999 ±  2%     -85.6%       1007 ±  3%  sched_debug.cpu.clock_task.stddev
      3985 ± 10%    -100.0%       0.00        sched_debug.cpu.curr->pid.min
    491.71 ± 10%    +209.3%       1520 ±  4%  sched_debug.cpu.curr->pid.stddev
    270.19 ±141%   +1096.6%       3233 ± 51%  sched_debug.cpu.max_idle_balance_cost.stddev
      0.78 ± 10%    -100.0%       0.00        sched_debug.cpu.nr_running.min
      0.15 ±  6%    +121.7%       0.34 ±  2%  sched_debug.cpu.nr_running.stddev
     62041 ± 15%   +4280.9%    2717948        sched_debug.cpu.nr_switches.avg
   1074922 ± 14%    +292.6%    4220307 ±  2%  sched_debug.cpu.nr_switches.max
      1186 ±  2%  +1.2e+05%    1379073 ±  4%  sched_debug.cpu.nr_switches.min
    132392 ± 21%    +294.6%     522476 ±  5%  sched_debug.cpu.nr_switches.stddev
      6.44 ±  4%     +21.4%       7.82 ± 12%  sched_debug.cpu.nr_uninterruptible.stddev
      6.73 ± 13%     -84.8%       1.02 ±  5%  perf-stat.i.MPKI
 1.652e+10 ±  2%     -22.2%  1.285e+10        perf-stat.i.branch-instructions
      0.72            +0.0        0.75        perf-stat.i.branch-miss-rate%
  1.19e+08 ±  3%     -19.8%   95493630        perf-stat.i.branch-misses
     27.46 ± 12%     -26.2        1.30 ±  4%  perf-stat.i.cache-miss-rate%
 5.943e+08 ± 10%     -88.6%   67756219 ±  5%  perf-stat.i.cache-misses
 2.201e+09          +143.7%  5.364e+09        perf-stat.i.cache-references
     48911 ± 19%   +4695.4%    2345525        perf-stat.i.context-switches
      3.66 ±  2%     +28.5%       4.71        perf-stat.i.cpi
 3.228e+11            -4.1%  3.097e+11        perf-stat.i.cpu-cycles
    190.51         +1363.7%       2788 ± 10%  perf-stat.i.cpu-migrations
    803.99 ±  6%    +510.2%       4905 ±  5%  perf-stat.i.cycles-between-cache-misses
      0.00 ± 16%      +0.0        0.01 ± 14%  perf-stat.i.dTLB-load-miss-rate%
    755654 ± 18%    +232.4%    2512024 ± 14%  perf-stat.i.dTLB-load-misses
 2.385e+10 ±  2%     -26.9%  1.742e+10        perf-stat.i.dTLB-loads
      0.00 ± 31%      +0.0        0.01 ± 35%  perf-stat.i.dTLB-store-miss-rate%
    305657 ± 36%    +200.0%     916822 ± 35%  perf-stat.i.dTLB-store-misses
 1.288e+10 ±  2%     -28.8%  9.179e+09        perf-stat.i.dTLB-stores
 8.789e+10 ±  2%     -25.2%  6.578e+10        perf-stat.i.instructions
      0.28 ±  2%     -21.6%       0.22        perf-stat.i.ipc
      2.52            -4.1%       2.42        perf-stat.i.metric.GHz
    873.89 ± 12%     -67.0%     288.04 ±  8%  perf-stat.i.metric.K/sec
    435.61 ±  2%     -19.6%     350.06        perf-stat.i.metric.M/sec
      2799           +29.9%       3637 ±  2%  perf-stat.i.minor-faults
     99.74            -2.6       97.11        perf-stat.i.node-load-miss-rate%
 1.294e+08 ± 12%     -92.4%    9879207 ±  7%  perf-stat.i.node-load-misses
     76.55           +16.4       92.92        perf-stat.i.node-store-miss-rate%
 2.257e+08 ± 10%     -90.4%   21721672 ±  8%  perf-stat.i.node-store-misses
  69217511 ± 13%     -97.7%    1625810 ±  7%  perf-stat.i.node-stores
      2799           +29.9%       3637 ±  2%  perf-stat.i.page-faults
      6.79 ± 13%     -84.9%       1.03 ±  5%  perf-stat.overall.MPKI
      0.72            +0.0        0.74        perf-stat.overall.branch-miss-rate%
     27.06 ± 12%     -25.8        1.26 ±  4%  perf-stat.overall.cache-miss-rate%
      3.68 ±  2%     +28.1%       4.71        perf-stat.overall.cpi
    549.38 ± 10%    +736.0%       4592 ±  5%  perf-stat.overall.cycles-between-cache-misses
      0.00 ± 18%      +0.0        0.01 ± 14%  perf-stat.overall.dTLB-load-miss-rate%
      0.00 ± 36%      +0.0        0.01 ± 35%  perf-stat.overall.dTLB-store-miss-rate%
      0.27 ±  2%     -22.0%       0.21        perf-stat.overall.ipc
     99.80            -2.4       97.37        perf-stat.overall.node-load-miss-rate%
     76.60           +16.4       93.03        perf-stat.overall.node-store-miss-rate%
      9319            +5.8%       9855        perf-stat.overall.path-length
 1.646e+10 ±  2%     -22.2%  1.281e+10        perf-stat.ps.branch-instructions
 1.186e+08 ±  3%     -19.8%   95167897        perf-stat.ps.branch-misses
 5.924e+08 ± 10%     -88.6%   67384354 ±  5%  perf-stat.ps.cache-misses
 2.193e+09          +143.4%  5.339e+09        perf-stat.ps.cache-references
     49100 ± 19%   +4668.0%    2341074        perf-stat.ps.context-switches
 3.218e+11            -4.1%  3.087e+11        perf-stat.ps.cpu-cycles
    189.73         +1368.4%       2786 ± 10%  perf-stat.ps.cpu-migrations
    753056 ± 18%    +229.9%    2484575 ± 14%  perf-stat.ps.dTLB-load-misses
 2.377e+10 ±  2%     -26.9%  1.737e+10        perf-stat.ps.dTLB-loads
    304509 ± 36%    +199.1%     910856 ± 35%  perf-stat.ps.dTLB-store-misses
 1.284e+10 ±  2%     -28.7%  9.152e+09        perf-stat.ps.dTLB-stores
  8.76e+10 ±  2%     -25.2%  6.557e+10        perf-stat.ps.instructions
      2791           +28.2%       3580 ±  2%  perf-stat.ps.minor-faults
  1.29e+08 ± 12%     -92.4%    9815672 ±  7%  perf-stat.ps.node-load-misses
  2.25e+08 ± 10%     -90.4%   21575943 ±  8%  perf-stat.ps.node-store-misses
  69002373 ± 13%     -97.7%    1615410 ±  7%  perf-stat.ps.node-stores
      2791           +28.2%       3580 ±  2%  perf-stat.ps.page-faults
  2.68e+13 ±  2%     -26.1%  1.981e+13        perf-stat.total.instructions
      0.00 ± 35%   +2600.0%       0.04 ± 23%  perf-sched.sch_delay.avg.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
      1.18 ±  9%     -98.1%       0.02 ± 32%  perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      0.58 ±  3%     -62.1%       0.22 ± 97%  perf-sched.sch_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.51 ± 22%     -82.7%       0.09 ± 11%  perf-sched.sch_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
      0.25 ± 23%     -59.6%       0.10 ± 10%  perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.03 ± 42%     -64.0%       0.01 ± 15%  perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      0.04 ±  7%    +434.6%       0.23 ± 36%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      1.00 ± 20%     -84.1%       0.16 ± 78%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.01 ±  7%     -70.0%       0.00        perf-sched.sch_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
      0.02 ±  2%    +533.9%       0.12 ± 43%  perf-sched.sch_delay.avg.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
      0.03 ±  7%    +105.9%       0.06 ± 33%  perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      0.01 ± 15%     +67.5%       0.02 ±  8%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.09 ± 50%     -85.7%       0.01 ± 33%  perf-sched.sch_delay.avg.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
      0.04 ±  7%    +343.4%       0.16 ±  6%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.06 ± 41%   +3260.7%       1.88 ± 30%  perf-sched.sch_delay.max.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
      3.78           -96.2%       0.14 ±  3%  perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      2.86 ±  4%     -72.6%       0.78 ±113%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      4.09 ±  7%     -34.1%       2.69 ±  7%  perf-sched.sch_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
      3.09 ± 37%     -64.1%       1.11 ±  5%  perf-sched.sch_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
      0.00 ±141%   +6200.0%       0.13 ± 82%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      3.94           -40.5%       2.35 ± 48%  perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      1.63 ± 21%     -77.0%       0.38 ± 90%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      7.29 ± 39%    +417.5%      37.72 ± 16%  perf-sched.sch_delay.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
      3.35 ± 14%     -51.7%       1.62 ±  3%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.05 ± 13%   +2245.1%       1.13 ± 40%  perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
      3.01 ± 26%    +729.6%      25.01 ± 91%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      1.93 ± 59%     -85.5%       0.28 ± 62%  perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
      0.01           -50.0%       0.00        perf-sched.total_sch_delay.average.ms
      7.29 ± 39%    +468.8%      41.46 ± 26%  perf-sched.total_sch_delay.max.ms
      6.04 ±  4%     -94.1%       0.35        perf-sched.total_wait_and_delay.average.ms
    205790 ±  3%   +1811.0%    3932742        perf-sched.total_wait_and_delay.count.ms
      6.03 ±  4%     -94.2%       0.35        perf-sched.total_wait_time.average.ms
     75.51 ± 41%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
     23.01 ± 17%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
     23.82 ±  7%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
     95.27 ± 41%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
     55.86 ±141%   +1014.6%     622.64 ±  5%  perf-sched.wait_and_delay.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      0.07 ± 23%     -82.5%       0.01        perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
    137.41 ±  3%    +345.1%     611.63 ±  2%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      0.04 ±  5%     -49.6%       0.02        perf-sched.wait_and_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
    536.33 ±  5%     -46.5%     287.00        perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     21.67 ± 32%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
      5.67 ±  8%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
      1.67 ± 56%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
      5.67 ± 29%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
      5.33 ± 23%     +93.8%      10.33 ± 25%  perf-sched.wait_and_delay.count.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    101725 ±  3%     +15.3%     117243 ± 10%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
    100.00 ±  7%     -80.3%      19.67 ±  2%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
     97762 ±  4%   +3794.8%    3807606        perf-sched.wait_and_delay.count.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
      1091 ±  9%    +111.9%       2311 ±  3%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    604.50 ± 43%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
     37.41 ±  9%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
     27.08 ± 13%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
    275.41 ± 32%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
      1313 ± 69%    +112.1%       2786 ± 15%  perf-sched.wait_and_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    333.38 ±141%    +200.4%       1001        perf-sched.wait_and_delay.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      1000           -96.8%      31.85 ± 48%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
     17.99 ± 33%    +387.5%      87.71 ±  8%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
      0.33 ± 19%     -74.1%       0.09 ± 10%  perf-sched.wait_time.avg.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
      0.02 ± 53%    +331.4%       0.10 ± 50%  perf-sched.wait_time.avg.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
      0.09 ± 65%     -75.9%       0.02 ±  9%  perf-sched.wait_time.avg.ms.__cond_resched.aa_sk_perm.security_socket_sendmsg.sock_sendmsg.__sys_sendto
      0.02 ± 22%     -70.2%       0.01 ±141%  perf-sched.wait_time.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
     75.51 ± 41%    -100.0%       0.04 ± 42%  perf-sched.wait_time.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
      0.10 ± 36%     -80.3%       0.02 ±  9%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_node.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb
      0.55 ± 61%     -94.9%       0.03 ± 45%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_node.kmalloc_reserve.__alloc_skb.alloc_skb_with_frags
     23.01 ± 17%    -100.0%       0.00 ±141%  perf-sched.wait_time.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
     23.82 ±  7%     -99.7%       0.07 ± 57%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
     95.27 ± 41%    -100.0%       0.03 ± 89%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
     56.30 ±139%   +1005.5%     622.44 ±  5%  perf-sched.wait_time.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      2.78 ± 66%     -98.2%       0.05 ± 52%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      0.07 ± 23%     -82.5%       0.01        perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
    137.37 ±  3%    +345.1%     611.40 ±  2%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      0.02 ±  5%     -41.9%       0.01 ±  3%  perf-sched.wait_time.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
    536.32 ±  5%     -46.5%     286.98        perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      4.66 ± 20%     -56.7%       2.02 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
      0.03 ± 63%    +995.0%       0.37 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
      1.67 ± 87%     -92.6%       0.12 ± 57%  perf-sched.wait_time.max.ms.__cond_resched.aa_sk_perm.security_socket_sendmsg.sock_sendmsg.__sys_sendto
      0.54 ±117%     -95.1%       0.03 ±105%  perf-sched.wait_time.max.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
      0.06 ± 49%     -89.1%       0.01 ±141%  perf-sched.wait_time.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
    604.50 ± 43%    -100.0%       0.16 ± 83%  perf-sched.wait_time.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
      2.77 ± 45%     -95.4%       0.13 ± 64%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_node.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb
      2.86 ± 45%     -94.3%       0.16 ± 91%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_node.kmalloc_reserve.__alloc_skb.alloc_skb_with_frags
     37.41 ±  9%    -100.0%       0.01 ±141%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
     27.08 ± 13%     -99.7%       0.08 ± 61%  perf-sched.wait_time.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
    275.41 ± 32%    -100.0%       0.03 ± 89%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
      1313 ± 69%    +112.1%       2786 ± 15%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    334.74 ±140%    +198.9%       1000        perf-sched.wait_time.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
     21.74 ± 58%     -95.4%       1.00 ±103%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      1000           -97.6%      24.49 ± 50%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
     10.90 ± 27%    +682.9%      85.36 ±  6%  perf-sched.wait_time.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
     32.91 ± 58%     -63.5%      12.01 ±115%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
    169.97 ±  7%     -49.2%      86.29 ± 15%  perf-sched.wait_time.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     44.08           -19.8       24.25        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb
     44.47           -19.6       24.87        perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg
     43.63           -19.5       24.15        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page.skb_release_data
     45.62           -19.2       26.39        perf-profile.calltrace.cycles-pp.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg
     45.62           -19.2       26.40        perf-profile.calltrace.cycles-pp.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
     45.00           -19.1       25.94        perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg
     50.41           -16.8       33.64 ± 39%  perf-profile.calltrace.cycles-pp.accept_connections.main.__libc_start_main
     50.41           -16.8       33.64 ± 39%  perf-profile.calltrace.cycles-pp.accept_connection.accept_connections.main.__libc_start_main
     50.41           -16.8       33.64 ± 39%  perf-profile.calltrace.cycles-pp.spawn_child.accept_connection.accept_connections.main.__libc_start_main
     50.41           -16.8       33.64 ± 39%  perf-profile.calltrace.cycles-pp.process_requests.spawn_child.accept_connection.accept_connections.main
     99.92           -14.2       85.72 ± 15%  perf-profile.calltrace.cycles-pp.main.__libc_start_main
     99.96           -14.2       85.77 ± 15%  perf-profile.calltrace.cycles-pp.__libc_start_main
     50.10            -8.6       41.52        perf-profile.calltrace.cycles-pp.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
     50.11            -8.6       41.55        perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
     50.13            -8.5       41.64        perf-profile.calltrace.cycles-pp.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
     50.28            -8.0       42.27        perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom
     50.29            -8.0       42.29        perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni
     50.31            -7.9       42.42        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests
     50.32            -7.8       42.47        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests.spawn_child
     50.36            -7.6       42.78        perf-profile.calltrace.cycles-pp.recvfrom.recv_omni.process_requests.spawn_child.accept_connection
     50.41            -7.3       43.07        perf-profile.calltrace.cycles-pp.recv_omni.process_requests.spawn_child.accept_connection.accept_connections
     19.93 ±  2%      -6.6       13.36        perf-profile.calltrace.cycles-pp.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
     19.44 ±  2%      -6.3       13.16        perf-profile.calltrace.cycles-pp._copy_from_iter.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg
     18.99 ±  2%      -6.1       12.90        perf-profile.calltrace.cycles-pp.copyin._copy_from_iter.ip_generic_getfrag.__ip_append_data.ip_make_skb
      8.95            -5.1        3.82        perf-profile.calltrace.cycles-pp.udp_send_skb.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto
      8.70            -5.0        3.71        perf-profile.calltrace.cycles-pp.ip_send_skb.udp_send_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
      8.10            -4.6        3.45        perf-profile.calltrace.cycles-pp.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg.sock_sendmsg
      7.69            -4.4        3.27        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg
      6.51            -3.7        2.78        perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb
      6.47            -3.7        2.75        perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb
      6.41            -3.7        2.71        perf-profile.calltrace.cycles-pp.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2
      5.88            -3.5        2.43        perf-profile.calltrace.cycles-pp.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
      5.73            -3.4        2.35        perf-profile.calltrace.cycles-pp.__napi_poll.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip
      5.69            -3.4        2.33        perf-profile.calltrace.cycles-pp.process_backlog.__napi_poll.net_rx_action.__do_softirq.do_softirq
      5.36            -3.2        2.19        perf-profile.calltrace.cycles-pp.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action.__do_softirq
      4.59            -2.7        1.89        perf-profile.calltrace.cycles-pp.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
      4.55 ±  2%      -2.7        1.88        perf-profile.calltrace.cycles-pp.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll
      4.40 ±  2%      -2.6        1.81        perf-profile.calltrace.cycles-pp.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog
      3.81 ±  2%      -2.2        1.57        perf-profile.calltrace.cycles-pp.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
      3.75 ±  2%      -2.2        1.55        perf-profile.calltrace.cycles-pp.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
      2.21 ±  2%      -1.6        0.63        perf-profile.calltrace.cycles-pp.__ip_make_skb.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
      1.94 ±  2%      -1.4        0.51 ±  2%  perf-profile.calltrace.cycles-pp.__ip_select_ident.__ip_make_skb.ip_make_skb.udp_sendmsg.sock_sendmsg
      1.14            -0.6        0.51        perf-profile.calltrace.cycles-pp.sock_alloc_send_pskb.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
      0.00            +0.5        0.53 ±  2%  perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.00            +0.7        0.69        perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
      0.00            +0.7        0.71        perf-profile.calltrace.cycles-pp.__sk_mem_raise_allocated.__sk_mem_schedule.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb
      0.00            +0.7        0.72        perf-profile.calltrace.cycles-pp.__sk_mem_schedule.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv
      0.00            +1.0        0.99 ± 20%  perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp
      0.00            +1.0        1.01 ± 20%  perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
      0.00            +1.1        1.05 ± 20%  perf-profile.calltrace.cycles-pp.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg
      0.00            +1.1        1.12        perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      0.00            +1.2        1.18 ± 20%  perf-profile.calltrace.cycles-pp.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg.sock_recvmsg
      0.00            +1.3        1.32        perf-profile.calltrace.cycles-pp.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu
      0.00            +2.2        2.23        perf-profile.calltrace.cycles-pp.__skb_recv_udp.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
     49.51            +2.6       52.08        perf-profile.calltrace.cycles-pp.send_udp_stream.main.__libc_start_main
     49.49            +2.6       52.07        perf-profile.calltrace.cycles-pp.send_omni_inner.send_udp_stream.main.__libc_start_main
      0.00            +3.0        2.96 ±  2%  perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
     48.71            +3.0       51.73        perf-profile.calltrace.cycles-pp.sendto.send_omni_inner.send_udp_stream.main.__libc_start_main
      0.00            +3.1        3.06 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      0.00            +3.1        3.09        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
     48.34            +3.2       51.56        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream.main
      0.00            +3.3        3.33 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     48.13            +3.8       51.96        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream
     47.82            +4.0       51.82        perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner
     47.70            +4.1       51.76        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
      0.00            +4.1        4.08        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.00            +4.1        4.10        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.00            +4.1        4.10        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
      0.00            +4.1        4.14        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
      0.00            +4.3        4.35 ±  2%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
     46.52            +4.8       51.27        perf-profile.calltrace.cycles-pp.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
     46.04            +5.0       51.08        perf-profile.calltrace.cycles-pp.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
      3.67            +8.0       11.63        perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg
      3.71            +8.1       11.80        perf-profile.calltrace.cycles-pp._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
      3.96            +8.5       12.42        perf-profile.calltrace.cycles-pp.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg
      3.96            +8.5       12.44        perf-profile.calltrace.cycles-pp.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
     35.13           +11.3       46.39        perf-profile.calltrace.cycles-pp.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto
     32.68 ±  2%     +13.0       45.65        perf-profile.calltrace.cycles-pp.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
     10.27           +20.3       30.59        perf-profile.calltrace.cycles-pp.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
     10.24           +20.3       30.58        perf-profile.calltrace.cycles-pp.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg
      9.84           +20.5       30.32        perf-profile.calltrace.cycles-pp.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb
      9.59           +20.5       30.11        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data
      8.40           +21.0       29.42        perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill
      6.13           +21.9       28.05        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.rmqueue_bulk.rmqueue.get_page_from_freelist
      6.20           +22.0       28.15        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
      6.46           +22.5       28.91        perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.skb_page_frag_refill
     48.24           -21.8       26.43        perf-profile.children.cycles-pp.skb_release_data
     47.19           -21.2       25.98        perf-profile.children.cycles-pp.free_unref_page
     44.48           -19.6       24.88        perf-profile.children.cycles-pp.free_pcppages_bulk
     45.62           -19.2       26.40        perf-profile.children.cycles-pp.__consume_stateless_skb
     99.95           -14.2       85.76 ± 15%  perf-profile.children.cycles-pp.main
     99.96           -14.2       85.77 ± 15%  perf-profile.children.cycles-pp.__libc_start_main
     50.10            -8.6       41.53        perf-profile.children.cycles-pp.udp_recvmsg
     50.11            -8.6       41.56        perf-profile.children.cycles-pp.inet_recvmsg
     50.13            -8.5       41.65        perf-profile.children.cycles-pp.sock_recvmsg
     50.29            -8.0       42.28        perf-profile.children.cycles-pp.__sys_recvfrom
     50.29            -8.0       42.30        perf-profile.children.cycles-pp.__x64_sys_recvfrom
     50.38            -7.5       42.86        perf-profile.children.cycles-pp.recvfrom
     50.41            -7.3       43.07        perf-profile.children.cycles-pp.accept_connections
     50.41            -7.3       43.07        perf-profile.children.cycles-pp.accept_connection
     50.41            -7.3       43.07        perf-profile.children.cycles-pp.spawn_child
     50.41            -7.3       43.07        perf-profile.children.cycles-pp.process_requests
     50.41            -7.3       43.07        perf-profile.children.cycles-pp.recv_omni
     19.96 ±  2%      -6.5       13.50        perf-profile.children.cycles-pp.ip_generic_getfrag
     19.46 ±  2%      -6.2       13.28        perf-profile.children.cycles-pp._copy_from_iter
     19.21 ±  2%      -6.1       13.14        perf-profile.children.cycles-pp.copyin
      8.96            -5.1        3.86        perf-profile.children.cycles-pp.udp_send_skb
      8.72            -5.0        3.75        perf-profile.children.cycles-pp.ip_send_skb
      8.11            -4.6        3.49        perf-profile.children.cycles-pp.ip_finish_output2
      7.72            -4.4        3.32        perf-profile.children.cycles-pp.__dev_queue_xmit
     98.71            -4.1       94.59        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     98.51            -4.0       94.46        perf-profile.children.cycles-pp.do_syscall_64
      6.49            -3.7        2.78        perf-profile.children.cycles-pp.do_softirq
      6.51            -3.7        2.82        perf-profile.children.cycles-pp.__local_bh_enable_ip
      6.43            -3.7        2.78        perf-profile.children.cycles-pp.__do_softirq
      5.90            -3.4        2.46        perf-profile.children.cycles-pp.net_rx_action
      5.74            -3.4        2.38        perf-profile.children.cycles-pp.__napi_poll
      5.71            -3.4        2.36        perf-profile.children.cycles-pp.process_backlog
      5.37            -3.2        2.21        perf-profile.children.cycles-pp.__netif_receive_skb_one_core
      4.60            -2.7        1.91        perf-profile.children.cycles-pp.ip_local_deliver_finish
      4.57 ±  2%      -2.7        1.90        perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
      4.42 ±  2%      -2.6        1.83        perf-profile.children.cycles-pp.__udp4_lib_rcv
      3.82 ±  2%      -2.2        1.58 ±  2%  perf-profile.children.cycles-pp.udp_unicast_rcv_skb
      3.78 ±  2%      -2.2        1.57 ±  2%  perf-profile.children.cycles-pp.udp_queue_rcv_one_skb
      2.23 ±  2%      -1.6        0.65 ±  2%  perf-profile.children.cycles-pp.__ip_make_skb
      1.95 ±  2%      -1.4        0.52 ±  3%  perf-profile.children.cycles-pp.__ip_select_ident
      1.51 ±  4%      -1.2        0.34        perf-profile.children.cycles-pp.free_unref_page_commit
      1.17            -0.7        0.51 ±  2%  perf-profile.children.cycles-pp.ip_route_output_flow
      1.15            -0.6        0.52        perf-profile.children.cycles-pp.sock_alloc_send_pskb
      0.91            -0.5        0.39        perf-profile.children.cycles-pp.alloc_skb_with_frags
      0.86            -0.5        0.37        perf-profile.children.cycles-pp.__alloc_skb
      0.83            -0.5        0.36 ±  2%  perf-profile.children.cycles-pp.ip_route_output_key_hash_rcu
      0.75            -0.4        0.32        perf-profile.children.cycles-pp.dev_hard_start_xmit
      0.72            -0.4        0.31 ±  3%  perf-profile.children.cycles-pp.fib_table_lookup
      0.67            -0.4        0.28        perf-profile.children.cycles-pp.loopback_xmit
      0.70 ±  2%      -0.4        0.33        perf-profile.children.cycles-pp.__zone_watermark_ok
      0.47 ±  4%      -0.3        0.15        perf-profile.children.cycles-pp.kmem_cache_free
      0.57            -0.3        0.26        perf-profile.children.cycles-pp.kmem_cache_alloc_node
      0.46            -0.3        0.18 ±  2%  perf-profile.children.cycles-pp.ip_rcv
      0.42            -0.3        0.17        perf-profile.children.cycles-pp.move_addr_to_kernel
      0.41            -0.2        0.16 ±  2%  perf-profile.children.cycles-pp.__udp4_lib_lookup
      0.32            -0.2        0.13        perf-profile.children.cycles-pp.__netif_rx
      0.30            -0.2        0.12        perf-profile.children.cycles-pp.netif_rx_internal
      0.30            -0.2        0.12        perf-profile.children.cycles-pp._copy_from_user
      0.31            -0.2        0.13        perf-profile.children.cycles-pp.kmalloc_reserve
      0.63            -0.2        0.46 ±  2%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.28            -0.2        0.11        perf-profile.children.cycles-pp.enqueue_to_backlog
      0.27            -0.2        0.11        perf-profile.children.cycles-pp.udp4_lib_lookup2
      0.29            -0.2        0.13 ±  6%  perf-profile.children.cycles-pp.send_data
      0.25            -0.2        0.10        perf-profile.children.cycles-pp.__netif_receive_skb_core
      0.23 ±  2%      -0.1        0.10 ±  4%  perf-profile.children.cycles-pp.security_socket_sendmsg
      0.19 ±  2%      -0.1        0.06        perf-profile.children.cycles-pp.ip_rcv_core
      0.37            -0.1        0.24        perf-profile.children.cycles-pp.irqtime_account_irq
      0.21            -0.1        0.08        perf-profile.children.cycles-pp.sock_wfree
      0.21 ±  3%      -0.1        0.08        perf-profile.children.cycles-pp.validate_xmit_skb
      0.20 ±  2%      -0.1        0.08        perf-profile.children.cycles-pp.ip_output
      0.22 ±  2%      -0.1        0.10 ±  4%  perf-profile.children.cycles-pp.ip_rcv_finish_core
      0.20 ±  6%      -0.1        0.09 ±  5%  perf-profile.children.cycles-pp.__mkroute_output
      0.21 ±  2%      -0.1        0.09 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.28            -0.1        0.18        perf-profile.children.cycles-pp._raw_spin_trylock
      0.34 ±  3%      -0.1        0.25        perf-profile.children.cycles-pp.__slab_free
      0.13 ±  3%      -0.1        0.05        perf-profile.children.cycles-pp.siphash_3u32
      0.12 ±  4%      -0.1        0.03 ± 70%  perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
      0.14 ±  3%      -0.1        0.06 ±  7%  perf-profile.children.cycles-pp.__ip_local_out
      0.20 ±  2%      -0.1        0.12        perf-profile.children.cycles-pp.aa_sk_perm
      0.18 ±  2%      -0.1        0.10        perf-profile.children.cycles-pp.get_pfnblock_flags_mask
      0.12 ±  3%      -0.1        0.05        perf-profile.children.cycles-pp.sk_filter_trim_cap
      0.13            -0.1        0.06        perf-profile.children.cycles-pp.ip_setup_cork
      0.13 ±  7%      -0.1        0.06 ±  8%  perf-profile.children.cycles-pp.fib_lookup_good_nhc
      0.15 ±  3%      -0.1        0.08 ±  5%  perf-profile.children.cycles-pp.skb_set_owner_w
      0.11 ±  4%      -0.1        0.05        perf-profile.children.cycles-pp.dst_release
      0.23 ±  2%      -0.1        0.17 ±  2%  perf-profile.children.cycles-pp.__entry_text_start
      0.11            -0.1        0.05        perf-profile.children.cycles-pp.ipv4_mtu
      0.20 ±  2%      -0.1        0.15 ±  3%  perf-profile.children.cycles-pp.__list_add_valid_or_report
      0.10            -0.1        0.05        perf-profile.children.cycles-pp.ip_send_check
      0.31 ±  2%      -0.0        0.26 ±  3%  perf-profile.children.cycles-pp.sockfd_lookup_light
      0.27            -0.0        0.22 ±  2%  perf-profile.children.cycles-pp.__fget_light
      0.63            -0.0        0.58        perf-profile.children.cycles-pp.__check_object_size
      0.15 ±  3%      -0.0        0.11        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.13            -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.alloc_pages
      0.27            -0.0        0.24        perf-profile.children.cycles-pp.sched_clock_cpu
      0.11 ±  4%      -0.0        0.08 ±  6%  perf-profile.children.cycles-pp.__cond_resched
      0.14 ±  3%      -0.0        0.11        perf-profile.children.cycles-pp.free_tail_page_prepare
      0.11            -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.09 ±  9%      -0.0        0.06 ±  7%  perf-profile.children.cycles-pp.__xfrm_policy_check2
      0.23 ±  2%      -0.0        0.21 ±  2%  perf-profile.children.cycles-pp.sched_clock
      0.14 ±  3%      -0.0        0.11 ±  4%  perf-profile.children.cycles-pp.prep_compound_page
      0.21 ±  2%      -0.0        0.20 ±  2%  perf-profile.children.cycles-pp.native_sched_clock
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.task_tick_fair
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.check_stack_object
      0.18 ±  2%      +0.0        0.20 ±  2%  perf-profile.children.cycles-pp.perf_event_task_tick
      0.18 ±  2%      +0.0        0.19 ±  2%  perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
      0.31 ±  3%      +0.0        0.33        perf-profile.children.cycles-pp.tick_sched_handle
      0.31 ±  3%      +0.0        0.33        perf-profile.children.cycles-pp.update_process_times
      0.41 ±  2%      +0.0        0.43        perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.40 ±  2%      +0.0        0.42        perf-profile.children.cycles-pp.hrtimer_interrupt
      0.32 ±  2%      +0.0        0.34        perf-profile.children.cycles-pp.tick_sched_timer
      0.36 ±  2%      +0.0        0.39        perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.06 ±  7%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.05 ±  8%      +0.0        0.10        perf-profile.children.cycles-pp._raw_spin_lock_bh
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.update_cfs_group
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.cpuidle_governor_latency_req
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.flush_smp_call_function_queue
      0.00            +0.1        0.05 ±  8%  perf-profile.children.cycles-pp.prepare_to_wait_exclusive
      0.07            +0.1        0.13 ±  3%  perf-profile.children.cycles-pp.__mod_zone_page_state
      0.00            +0.1        0.06 ± 13%  perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.security_socket_recvmsg
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.select_task_rq_fair
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.tick_irq_enter
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.tick_nohz_idle_enter
      0.42 ±  2%      +0.1        0.49 ±  2%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.ktime_get
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.__get_user_4
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.update_rq_clock
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.select_task_rq
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.native_apic_msr_eoi
      0.49            +0.1        0.57 ±  2%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.11 ± 11%      +0.1        0.19 ±  2%  perf-profile.children.cycles-pp._raw_spin_lock
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.update_rq_clock_task
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.__update_load_avg_se
      0.00            +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.irq_enter_rcu
      0.00            +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.__irq_exit_rcu
      0.00            +0.1        0.09        perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      0.00            +0.1        0.09        perf-profile.children.cycles-pp.update_blocked_averages
      0.00            +0.1        0.09        perf-profile.children.cycles-pp.update_sg_lb_stats
      0.00            +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.set_next_entity
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.__switch_to_asm
      0.00            +0.1        0.11 ± 12%  perf-profile.children.cycles-pp._copy_to_user
      0.00            +0.1        0.12 ±  3%  perf-profile.children.cycles-pp.menu_select
      0.00            +0.1        0.12 ±  3%  perf-profile.children.cycles-pp.recv_data
      0.00            +0.1        0.12 ±  3%  perf-profile.children.cycles-pp.update_sd_lb_stats
      0.00            +0.1        0.13 ±  3%  perf-profile.children.cycles-pp.native_irq_return_iret
      0.00            +0.1        0.13 ±  3%  perf-profile.children.cycles-pp.__switch_to
      0.00            +0.1        0.13 ±  3%  perf-profile.children.cycles-pp.find_busiest_group
      0.00            +0.1        0.14        perf-profile.children.cycles-pp.finish_task_switch
      0.00            +0.1        0.15 ±  3%  perf-profile.children.cycles-pp.update_curr
      0.00            +0.2        0.15 ±  3%  perf-profile.children.cycles-pp.mem_cgroup_uncharge_skmem
      0.00            +0.2        0.16        perf-profile.children.cycles-pp.ttwu_queue_wakelist
      0.05            +0.2        0.22 ±  2%  perf-profile.children.cycles-pp.page_counter_try_charge
      0.00            +0.2        0.17 ±  2%  perf-profile.children.cycles-pp.load_balance
      0.00            +0.2        0.17 ±  2%  perf-profile.children.cycles-pp.___perf_sw_event
      0.02 ±141%      +0.2        0.19 ±  2%  perf-profile.children.cycles-pp.page_counter_uncharge
      0.33            +0.2        0.52        perf-profile.children.cycles-pp.__free_one_page
      0.02 ±141%      +0.2        0.21 ±  2%  perf-profile.children.cycles-pp.drain_stock
      0.00            +0.2        0.20 ±  2%  perf-profile.children.cycles-pp.prepare_task_switch
      0.16 ±  3%      +0.2        0.38 ±  2%  perf-profile.children.cycles-pp.simple_copy_to_iter
      0.07 ± 11%      +0.2        0.31        perf-profile.children.cycles-pp.refill_stock
      0.07 ±  6%      +0.2        0.31 ±  4%  perf-profile.children.cycles-pp.move_addr_to_user
      0.00            +0.2        0.24        perf-profile.children.cycles-pp.enqueue_entity
      0.00            +0.2        0.25        perf-profile.children.cycles-pp.update_load_avg
      0.21 ±  2%      +0.3        0.48        perf-profile.children.cycles-pp.__list_del_entry_valid_or_report
      0.00            +0.3        0.31 ±  4%  perf-profile.children.cycles-pp.dequeue_entity
      0.08 ±  5%      +0.3        0.40 ±  3%  perf-profile.children.cycles-pp.try_charge_memcg
      0.00            +0.3        0.33        perf-profile.children.cycles-pp.enqueue_task_fair
      0.00            +0.4        0.35 ±  2%  perf-profile.children.cycles-pp.dequeue_task_fair
      0.00            +0.4        0.35 ±  2%  perf-profile.children.cycles-pp.activate_task
      0.00            +0.4        0.36 ±  2%  perf-profile.children.cycles-pp.try_to_wake_up
      0.00            +0.4        0.37 ±  2%  perf-profile.children.cycles-pp.autoremove_wake_function
      0.00            +0.4        0.39 ±  3%  perf-profile.children.cycles-pp.newidle_balance
      0.12 ±  8%      +0.4        0.51 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
      0.00            +0.4        0.39        perf-profile.children.cycles-pp.ttwu_do_activate
      0.00            +0.4        0.40 ±  2%  perf-profile.children.cycles-pp.__wake_up_common
      0.18 ±  4%      +0.4        0.59        perf-profile.children.cycles-pp.udp_rmem_release
      0.11 ±  7%      +0.4        0.52        perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
      0.00            +0.4        0.43        perf-profile.children.cycles-pp.__wake_up_common_lock
      0.00            +0.5        0.46        perf-profile.children.cycles-pp.sched_ttwu_pending
      0.00            +0.5        0.49        perf-profile.children.cycles-pp.sock_def_readable
      0.00            +0.5        0.53 ±  2%  perf-profile.children.cycles-pp.pick_next_task_fair
      0.00            +0.5        0.54 ±  2%  perf-profile.children.cycles-pp.schedule_idle
      0.00            +0.6        0.55        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.15 ±  3%      +0.6        0.73 ±  2%  perf-profile.children.cycles-pp.__sk_mem_raise_allocated
      0.00            +0.6        0.57        perf-profile.children.cycles-pp.__sysvec_call_function_single
      0.16 ±  5%      +0.6        0.74 ±  2%  perf-profile.children.cycles-pp.__sk_mem_schedule
      0.00            +0.8        0.78        perf-profile.children.cycles-pp.sysvec_call_function_single
      0.41 ±  3%      +0.9        1.33 ±  2%  perf-profile.children.cycles-pp.__udp_enqueue_schedule_skb
      0.00            +1.2        1.16 ±  2%  perf-profile.children.cycles-pp.schedule
      0.00            +1.2        1.21 ±  2%  perf-profile.children.cycles-pp.schedule_timeout
      0.00            +1.3        1.33 ±  2%  perf-profile.children.cycles-pp.__skb_wait_for_more_packets
      0.00            +1.7        1.66 ±  2%  perf-profile.children.cycles-pp.__schedule
      0.27 ±  3%      +2.0        2.25        perf-profile.children.cycles-pp.__skb_recv_udp
     50.41            +2.4       52.81        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.00            +2.7        2.68        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
     49.78            +2.7       52.49        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.00            +3.0        2.98        perf-profile.children.cycles-pp.acpi_safe_halt
      0.00            +3.0        3.00        perf-profile.children.cycles-pp.acpi_idle_enter
     49.51            +3.1       52.57        perf-profile.children.cycles-pp.send_udp_stream
     49.50            +3.1       52.56        perf-profile.children.cycles-pp.send_omni_inner
      0.00            +3.1        3.10        perf-profile.children.cycles-pp.cpuidle_enter_state
      0.00            +3.1        3.12        perf-profile.children.cycles-pp.cpuidle_enter
      0.00            +3.4        3.37        perf-profile.children.cycles-pp.cpuidle_idle_call
     48.90            +3.4       52.30        perf-profile.children.cycles-pp.sendto
     47.85            +4.0       51.83        perf-profile.children.cycles-pp.__x64_sys_sendto
     47.73            +4.0       51.77        perf-profile.children.cycles-pp.__sys_sendto
      0.00            +4.1        4.10        perf-profile.children.cycles-pp.start_secondary
      0.00            +4.1        4.13        perf-profile.children.cycles-pp.do_idle
      0.00            +4.1        4.14        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
      0.00            +4.1        4.14        perf-profile.children.cycles-pp.cpu_startup_entry
     46.54            +4.7       51.28        perf-profile.children.cycles-pp.sock_sendmsg
     46.10            +5.0       51.11        perf-profile.children.cycles-pp.udp_sendmsg
      3.70            +8.0       11.71        perf-profile.children.cycles-pp.copyout
      3.71            +8.1       11.80        perf-profile.children.cycles-pp._copy_to_iter
      3.96            +8.5       12.43        perf-profile.children.cycles-pp.__skb_datagram_iter
      3.96            +8.5       12.44        perf-profile.children.cycles-pp.skb_copy_datagram_iter
     35.14           +11.3       46.40        perf-profile.children.cycles-pp.ip_make_skb
     32.71 ±  2%     +13.0       45.66        perf-profile.children.cycles-pp.__ip_append_data
     10.28           +20.6       30.89        perf-profile.children.cycles-pp.sk_page_frag_refill
     10.25           +20.6       30.88        perf-profile.children.cycles-pp.skb_page_frag_refill
      9.86           +20.8       30.63        perf-profile.children.cycles-pp.__alloc_pages
      9.62           +20.8       30.42        perf-profile.children.cycles-pp.get_page_from_freelist
      8.42           +21.3       29.72        perf-profile.children.cycles-pp.rmqueue
      6.47           +22.8       29.22        perf-profile.children.cycles-pp.rmqueue_bulk
     19.11 ±  2%      -6.0       13.08        perf-profile.self.cycles-pp.copyin
      1.81 ±  2%      -1.4        0.39        perf-profile.self.cycles-pp.rmqueue
      1.81 ±  2%      -1.3        0.46 ±  2%  perf-profile.self.cycles-pp.__ip_select_ident
      1.47 ±  4%      -1.2        0.31        perf-profile.self.cycles-pp.free_unref_page_commit
      1.29 ±  2%      -0.5        0.75        perf-profile.self.cycles-pp.__ip_append_data
      0.71            -0.4        0.29        perf-profile.self.cycles-pp.udp_sendmsg
      0.68 ±  2%      -0.4        0.32        perf-profile.self.cycles-pp.__zone_watermark_ok
      0.50            -0.3        0.16        perf-profile.self.cycles-pp.skb_release_data
      0.59 ±  3%      -0.3        0.26 ±  3%  perf-profile.self.cycles-pp.fib_table_lookup
      0.46 ±  4%      -0.3        0.15 ±  3%  perf-profile.self.cycles-pp.kmem_cache_free
      0.63            -0.3        0.33 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.47            -0.3        0.19        perf-profile.self.cycles-pp.__sys_sendto
      0.44            -0.2        0.21 ±  2%  perf-profile.self.cycles-pp.kmem_cache_alloc_node
      0.36            -0.2        0.16 ±  3%  perf-profile.self.cycles-pp.send_omni_inner
      0.35 ±  2%      -0.2        0.15 ±  3%  perf-profile.self.cycles-pp.ip_finish_output2
      0.29            -0.2        0.12        perf-profile.self.cycles-pp._copy_from_user
      0.24            -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.__netif_receive_skb_core
      0.22 ±  2%      -0.1        0.08 ±  5%  perf-profile.self.cycles-pp.free_unref_page
      0.19 ±  2%      -0.1        0.06        perf-profile.self.cycles-pp.ip_rcv_core
      0.21 ±  2%      -0.1        0.08        perf-profile.self.cycles-pp.__alloc_skb
      0.20 ±  2%      -0.1        0.08        perf-profile.self.cycles-pp.sock_wfree
      0.22 ±  2%      -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.send_data
      0.21            -0.1        0.09        perf-profile.self.cycles-pp.sendto
      0.21 ±  2%      -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.ip_rcv_finish_core
      0.21 ±  2%      -0.1        0.09 ±  5%  perf-profile.self.cycles-pp.__ip_make_skb
      0.20 ±  4%      -0.1        0.09 ±  5%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.21 ±  2%      -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.__dev_queue_xmit
      0.38 ±  3%      -0.1        0.27        perf-profile.self.cycles-pp.get_page_from_freelist
      0.20 ±  2%      -0.1        0.09        perf-profile.self.cycles-pp.udp_send_skb
      0.18 ±  2%      -0.1        0.07        perf-profile.self.cycles-pp.__udp_enqueue_schedule_skb
      0.18 ±  4%      -0.1        0.08 ±  6%  perf-profile.self.cycles-pp.__mkroute_output
      0.25            -0.1        0.15 ±  3%  perf-profile.self.cycles-pp._copy_from_iter
      0.27 ±  4%      -0.1        0.17 ±  2%  perf-profile.self.cycles-pp.skb_page_frag_refill
      0.16            -0.1        0.06 ±  7%  perf-profile.self.cycles-pp.sock_sendmsg
      0.33 ±  2%      -0.1        0.24        perf-profile.self.cycles-pp.__slab_free
      0.15 ±  3%      -0.1        0.06        perf-profile.self.cycles-pp.udp4_lib_lookup2
      0.38 ±  2%      -0.1        0.29 ±  2%  perf-profile.self.cycles-pp.free_unref_page_prepare
      0.26            -0.1        0.17        perf-profile.self.cycles-pp._raw_spin_trylock
      0.15            -0.1        0.06        perf-profile.self.cycles-pp.ip_output
      0.14            -0.1        0.05 ±  8%  perf-profile.self.cycles-pp.process_backlog
      0.14            -0.1        0.06        perf-profile.self.cycles-pp.ip_route_output_flow
      0.14            -0.1        0.06        perf-profile.self.cycles-pp.__udp4_lib_lookup
      0.21 ±  2%      -0.1        0.13 ±  3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.12 ±  3%      -0.1        0.05        perf-profile.self.cycles-pp.siphash_3u32
      0.13 ±  3%      -0.1        0.06 ±  8%  perf-profile.self.cycles-pp.ip_send_skb
      0.17            -0.1        0.10        perf-profile.self.cycles-pp.__do_softirq
      0.15 ±  3%      -0.1        0.08 ±  5%  perf-profile.self.cycles-pp.skb_set_owner_w
      0.17 ±  2%      -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.aa_sk_perm
      0.12            -0.1        0.05        perf-profile.self.cycles-pp.__x64_sys_sendto
      0.12 ±  6%      -0.1        0.05        perf-profile.self.cycles-pp.fib_lookup_good_nhc
      0.19 ±  2%      -0.1        0.13        perf-profile.self.cycles-pp.__list_add_valid_or_report
      0.14 ±  3%      -0.1        0.07 ±  6%  perf-profile.self.cycles-pp.net_rx_action
      0.16 ±  2%      -0.1        0.10        perf-profile.self.cycles-pp.do_syscall_64
      0.11            -0.1        0.05        perf-profile.self.cycles-pp.__udp4_lib_rcv
      0.16 ±  3%      -0.1        0.10 ±  4%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
      0.11 ±  4%      -0.1        0.05        perf-profile.self.cycles-pp.ip_route_output_key_hash_rcu
      0.10 ±  4%      -0.1        0.05        perf-profile.self.cycles-pp.ip_generic_getfrag
      0.10            -0.1        0.05        perf-profile.self.cycles-pp.ipv4_mtu
      0.26            -0.0        0.21 ±  2%  perf-profile.self.cycles-pp.__fget_light
      0.15 ±  3%      -0.0        0.11 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.24            -0.0        0.20 ±  2%  perf-profile.self.cycles-pp.__alloc_pages
      0.15 ±  3%      -0.0        0.12        perf-profile.self.cycles-pp.__check_object_size
      0.11            -0.0        0.08 ±  6%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.08 ±  5%      -0.0        0.05        perf-profile.self.cycles-pp.loopback_xmit
      0.13            -0.0        0.11 ±  4%  perf-profile.self.cycles-pp.prep_compound_page
      0.11            -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.irqtime_account_irq
      0.09 ± 10%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__xfrm_policy_check2
      0.07            -0.0        0.05        perf-profile.self.cycles-pp.alloc_pages
      0.08            -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__entry_text_start
      0.09 ±  5%      -0.0        0.07        perf-profile.self.cycles-pp.free_tail_page_prepare
      0.10            +0.0        0.11        perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
      0.06            +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.free_pcppages_bulk
      0.05 ±  8%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp._raw_spin_lock_bh
      0.07            +0.0        0.12        perf-profile.self.cycles-pp.__mod_zone_page_state
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.cpuidle_idle_call
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.udp_rmem_release
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__flush_smp_call_function_queue
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.sock_def_readable
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.update_cfs_group
      0.11 ± 11%      +0.1        0.17 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.00            +0.1        0.05 ±  8%  perf-profile.self.cycles-pp.finish_task_switch
      0.00            +0.1        0.05 ±  8%  perf-profile.self.cycles-pp.cgroup_rstat_updated
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.do_idle
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.__skb_wait_for_more_packets
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.06 ±  7%  perf-profile.self.cycles-pp.enqueue_entity
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.schedule_timeout
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.move_addr_to_user
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.menu_select
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.native_apic_msr_eoi
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.update_sg_lb_stats
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.__update_load_avg_se
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.__get_user_4
      0.00            +0.1        0.08 ±  6%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.update_curr
      0.00            +0.1        0.08 ±  5%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.00            +0.1        0.09 ±  5%  perf-profile.self.cycles-pp.try_to_wake_up
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.recvfrom
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.mem_cgroup_charge_skmem
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.update_load_avg
      0.00            +0.1        0.09 ±  5%  perf-profile.self.cycles-pp.enqueue_task_fair
      0.00            +0.1        0.10 ±  4%  perf-profile.self.cycles-pp._copy_to_iter
      0.00            +0.1        0.10 ±  4%  perf-profile.self.cycles-pp.newidle_balance
      0.00            +0.1        0.10 ±  4%  perf-profile.self.cycles-pp.recv_data
      0.00            +0.1        0.10        perf-profile.self.cycles-pp.refill_stock
      0.00            +0.1        0.10        perf-profile.self.cycles-pp.__switch_to_asm
      0.00            +0.1        0.11 ± 15%  perf-profile.self.cycles-pp._copy_to_user
      0.00            +0.1        0.12        perf-profile.self.cycles-pp.recv_omni
      0.00            +0.1        0.12        perf-profile.self.cycles-pp.mem_cgroup_uncharge_skmem
      0.00            +0.1        0.13 ±  3%  perf-profile.self.cycles-pp.native_irq_return_iret
      0.00            +0.1        0.13        perf-profile.self.cycles-pp.__switch_to
      0.06            +0.1        0.20 ±  2%  perf-profile.self.cycles-pp.rmqueue_bulk
      0.09 ±  5%      +0.1        0.23 ±  4%  perf-profile.self.cycles-pp.udp_recvmsg
      0.00            +0.1        0.14 ±  3%  perf-profile.self.cycles-pp.__skb_recv_udp
      0.00            +0.1        0.14 ±  3%  perf-profile.self.cycles-pp.___perf_sw_event
      0.08            +0.1        0.22 ±  2%  perf-profile.self.cycles-pp.__skb_datagram_iter
      0.03 ± 70%      +0.2        0.20 ±  4%  perf-profile.self.cycles-pp.page_counter_try_charge
      0.02 ±141%      +0.2        0.18 ±  4%  perf-profile.self.cycles-pp.__sys_recvfrom
      0.00            +0.2        0.17 ±  2%  perf-profile.self.cycles-pp.__schedule
      0.00            +0.2        0.17 ±  2%  perf-profile.self.cycles-pp.try_charge_memcg
      0.00            +0.2        0.17 ±  2%  perf-profile.self.cycles-pp.page_counter_uncharge
      0.00            +0.2        0.21 ±  2%  perf-profile.self.cycles-pp.__sk_mem_raise_allocated
      0.14 ±  3%      +0.2        0.36        perf-profile.self.cycles-pp.__free_one_page
      0.20 ±  2%      +0.3        0.47        perf-profile.self.cycles-pp.__list_del_entry_valid_or_report
      0.00            +2.1        2.07 ±  2%  perf-profile.self.cycles-pp.acpi_safe_halt
     49.78            +2.7       52.49        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      3.68            +8.0       11.64        perf-profile.self.cycles-pp.copyout



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
  2023-11-06  6:22   ` kernel test robot
@ 2023-11-06  6:38     ` Huang, Ying
  0 siblings, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2023-11-06  6:38 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, Mel Gorman, Andrew Morton, Sudeep Holla,
	Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen,
	Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter,
	linux-kernel, linux-mm, feng.tang, fengwei.yin, Arjan Van De Ven

Hi,

kernel test robot <oliver.sang@intel.com> writes:

> hi, Huang Ying,
>
> sorry for the late delivery of this report.
> we reported
> "a 14.6% improvement of netperf.Throughput_Mbps"
> in
> https://lore.kernel.org/all/202310271441.71ce0a9-oliver.sang@intel.com/
>
> later, our auto-bisect tool captured a regression on a netperf test with
> different configurations; unfortunately, it regarded the issue as already
> 'reported', so we missed sending this report at the time.
>
> now sending it again, FYI.
>
>
> Hello,
>
> kernel test robot noticed a -60.4% regression of netperf.Throughput_Mbps on:
>
>
> commit: f5ddc662f07d7d99e9cfc5e07778e26c7394caf8 ("[PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages")
> url: https://github.com/intel-lab-lkp/linux/commits/Huang-Ying/mm-pcp-avoid-to-drain-PCP-when-process-exit/20231017-143633
> base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 36b2d7dd5a8ac95c8c1e69bdc93c4a6e2dc28a23
> patch link: https://lore.kernel.org/all/20231016053002.756205-4-ying.huang@intel.com/
> patch subject: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
>
> testcase: netperf
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> parameters:
>
> 	ip: ipv4
> 	runtime: 300s
> 	nr_threads: 50%
> 	cluster: cs-localhost
> 	test: UDP_STREAM
> 	cpufreq_governor: performance
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add the following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202311061311.8d63998-oliver.sang@intel.com
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20231106/202311061311.8d63998-oliver.sang@intel.com
>
> =========================================================================================
> cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/tbox_group/test/testcase:
>   cs-localhost/gcc-12/performance/ipv4/x86_64-rhel-8.3/50%/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp2/UDP_STREAM/netperf
>
> commit: 
>   c828e65251 ("cacheinfo: calculate size of per-CPU data cache slice")
>   f5ddc662f0 ("mm, pcp: reduce lock contention for draining high-order pages")
>
> c828e65251502516 f5ddc662f07d7d99e9cfc5e0777 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>       7321   4%     +28.2%       9382        uptime.idle
>      50.65   4%      -4.0%      48.64        boot-time.boot
>       6042   4%      -4.2%       5785        boot-time.idle
>  1.089e+09   2%    +232.1%  3.618e+09        cpuidle..time
>    1087075   2%  +24095.8%   2.63e+08        cpuidle..usage
>    3357014           +99.9%    6710312        vmstat.memory.cache
>      48731  19%   +4666.5%    2322787        vmstat.system.cs
>     144637          +711.2%    1173334        vmstat.system.in
>       2.59   2%      +6.2        8.79        mpstat.cpu.all.idle%
>       1.01            +0.7        1.66        mpstat.cpu.all.irq%
>       6.00            -3.2        2.79        mpstat.cpu.all.soft%
>       1.13   2%      -0.1        1.02        mpstat.cpu.all.usr%
>  1.407e+09   3%     -28.2%  1.011e+09        numa-numastat.node0.local_node
>  1.407e+09   3%     -28.2%   1.01e+09        numa-numastat.node0.numa_hit
>  1.469e+09   8%     -32.0%  9.979e+08        numa-numastat.node1.local_node
>  1.469e+09   8%     -32.1%  9.974e+08        numa-numastat.node1.numa_hit
>     103.00  19%     -44.0%      57.67  20%  perf-c2c.DRAM.local
>       8970  12%     -89.4%     951.00   4%  perf-c2c.DRAM.remote
>       8192   5%     +68.5%      13807        perf-c2c.HITM.local
>       6675  11%     -92.6%     491.00   2%  perf-c2c.HITM.remote
>    1051014   2%  +24922.0%   2.63e+08        turbostat.C1
>       2.75   2%      +6.5        9.29        turbostat.C1%
>       2.72   2%    +178.3%       7.57        turbostat.CPU%c1
>       0.09           -22.2%       0.07        turbostat.IPC
>   44589125          +701.5%  3.574e+08        turbostat.IRQ
>     313.00  57%   +1967.0%       6469   8%  turbostat.POLL
>      70.33            +3.3%      72.67        turbostat.PkgTmp
>      44.23   4%     -31.8%      30.15   2%  turbostat.RAMWatt
>     536096          +583.7%    3665194        meminfo.Active
>     535414          +584.4%    3664543        meminfo.Active(anon)
>    3238301          +103.2%    6579677        meminfo.Cached
>    1204424          +278.9%    4563575        meminfo.Committed_AS
>     469093           +47.9%     693889   3%  meminfo.Inactive
>     467250           +48.4%     693496   3%  meminfo.Inactive(anon)
>      53615          +562.5%     355225   4%  meminfo.Mapped
>    5223078           +64.1%    8571212        meminfo.Memused
>     557305          +599.6%    3899111        meminfo.Shmem
>    5660207           +58.9%    8993642        meminfo.max_used_kB
>      78504   3%     -30.1%      54869        netperf.ThroughputBoth_Mbps
>    5024292   3%     -30.1%    3511666        netperf.ThroughputBoth_total_Mbps
>       7673   5%    +249.7%      26832        netperf.ThroughputRecv_Mbps
>     491074   5%    +249.7%    1717287        netperf.ThroughputRecv_total_Mbps
>      70831   2%     -60.4%      28037        netperf.Throughput_Mbps
>    4533217   2%     -60.4%    1794379        netperf.Throughput_total_Mbps

This is a UDP test, so the sender does not wait for the receiver.  In
the results, you can see that the sender throughput drops by 60.4%,
while the receiver throughput increases by 249.7%.  And far fewer
packets are dropped during the test, which is good too.
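
To make that concrete with the numbers above: before the patch the
sender pushed ~70831 Mbps while the receiver absorbed only ~7673 Mbps,
i.e. roughly 89% of the datagrams were dropped; after the patch the
sender's ~28037 Mbps is almost fully received (~26832 Mbps, only about
4% dropped).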

All in all, considering the performance of both the sender and the
receiver, I think the patch helps overall performance.
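
To illustrate the "sender does not wait" point, here is a minimal
user-space sketch (my illustration only, not netperf code; the
address, port, and buffer size are made up):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>

	int main(void)
	{
		/* Hypothetical destination; netperf sets this up itself. */
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		struct sockaddr_in dst = {
			.sin_family = AF_INET,
			.sin_port   = htons(9999),
		};
		char buf[1024] = { 0 };

		inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

		/*
		 * sendto() returns as soon as the datagram is queued.  If
		 * the receiver's socket buffer is full, the kernel just
		 * drops the datagram, so sender-side throughput alone says
		 * nothing about how much data is actually delivered.
		 */
		for (;;)
			sendto(fd, buf, sizeof(buf), 0,
			       (struct sockaddr *)&dst, sizeof(dst));
	}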

--
Best Regards,
Huang, Ying

>       5439            +9.4%       5949        netperf.time.percent_of_cpu_this_job_got
>      16206            +9.4%      17728        netperf.time.system_time
>     388.14           -51.9%     186.53        netperf.time.user_time
>  2.876e+09   3%     -30.1%   2.01e+09        netperf.workload
>     177360  30%     -36.0%     113450  20%  numa-meminfo.node0.AnonPages
>     255926  12%     -40.6%     152052  12%  numa-meminfo.node0.AnonPages.max
>      22582  61%    +484.2%     131916  90%  numa-meminfo.node0.Mapped
>     138287  17%     +22.6%     169534  12%  numa-meminfo.node1.AnonHugePages
>     267468  20%     +29.1%     345385   6%  numa-meminfo.node1.AnonPages
>     346204  18%     +34.5%     465696   2%  numa-meminfo.node1.AnonPages.max
>     279416  19%     +77.0%     494652  18%  numa-meminfo.node1.Inactive
>     278445  19%     +77.6%     494393  18%  numa-meminfo.node1.Inactive(anon)
>      31726  45%    +607.7%     224533  45%  numa-meminfo.node1.Mapped
>       4802   6%     +19.4%       5733   3%  numa-meminfo.node1.PageTables
>     297323  12%    +792.6%    2653850  63%  numa-meminfo.node1.Shmem
>      44325  30%     -36.0%      28379  20%  numa-vmstat.node0.nr_anon_pages
>       5590  61%    +491.0%      33042  90%  numa-vmstat.node0.nr_mapped
>  1.407e+09   3%     -28.2%   1.01e+09        numa-vmstat.node0.numa_hit
>  1.407e+09   3%     -28.2%  1.011e+09        numa-vmstat.node0.numa_local
>      66858  20%     +29.2%      86385   6%  numa-vmstat.node1.nr_anon_pages
>      69601  20%     +77.8%     123729  18%  numa-vmstat.node1.nr_inactive_anon
>       7953  45%    +608.3%      56335  45%  numa-vmstat.node1.nr_mapped
>       1201   6%     +19.4%       1434   3%  numa-vmstat.node1.nr_page_table_pages
>      74288  11%    +792.6%     663111  63%  numa-vmstat.node1.nr_shmem
>      69601  20%     +77.8%     123728  18%  numa-vmstat.node1.nr_zone_inactive_anon
>  1.469e+09   8%     -32.1%  9.974e+08        numa-vmstat.node1.numa_hit
>  1.469e+09   8%     -32.0%  9.979e+08        numa-vmstat.node1.numa_local
>     133919          +584.2%     916254        proc-vmstat.nr_active_anon
>     111196            +3.3%     114828        proc-vmstat.nr_anon_pages
>    5602484            -1.5%    5518799        proc-vmstat.nr_dirty_background_threshold
>   11218668            -1.5%   11051092        proc-vmstat.nr_dirty_threshold
>     809646          +103.2%    1645012        proc-vmstat.nr_file_pages
>   56374629            -1.5%   55536913        proc-vmstat.nr_free_pages
>     116775           +48.4%     173349   3%  proc-vmstat.nr_inactive_anon
>      13386   2%    +563.3%      88793   4%  proc-vmstat.nr_mapped
>       2286            +6.5%       2434        proc-vmstat.nr_page_table_pages
>     139393          +599.4%     974869        proc-vmstat.nr_shmem
>      29092            +6.6%      31019        proc-vmstat.nr_slab_reclaimable
>     133919          +584.2%     916254        proc-vmstat.nr_zone_active_anon
>     116775           +48.4%     173349   3%  proc-vmstat.nr_zone_inactive_anon
>      32135  11%    +257.2%     114797  21%  proc-vmstat.numa_hint_faults
>      20858  16%    +318.3%      87244   6%  proc-vmstat.numa_hint_faults_local
>  2.876e+09   3%     -30.2%  2.008e+09        proc-vmstat.numa_hit
>  2.876e+09   3%     -30.2%  2.008e+09        proc-vmstat.numa_local
>      25453   7%     -75.2%       6324  30%  proc-vmstat.numa_pages_migrated
>     178224   2%     +76.6%     314680   7%  proc-vmstat.numa_pte_updates
>     160889   3%    +267.6%     591393   6%  proc-vmstat.pgactivate
>  2.295e+10   3%     -30.2%  1.601e+10        proc-vmstat.pgalloc_normal
>    1026605           +21.9%    1251671        proc-vmstat.pgfault
>  2.295e+10   3%     -30.2%  1.601e+10        proc-vmstat.pgfree
>      25453   7%     -75.2%       6324  30%  proc-vmstat.pgmigrate_success
>      39208   2%      -6.1%      36815        proc-vmstat.pgreuse
>    3164416           -20.3%    2521344   2%  proc-vmstat.unevictable_pgs_scanned
>   19248627           -22.1%   14989905        sched_debug.cfs_rq:/.avg_vruntime.avg
>   20722680           -24.9%   15569530        sched_debug.cfs_rq:/.avg_vruntime.max
>   17634233           -22.5%   13663168        sched_debug.cfs_rq:/.avg_vruntime.min
>     949063   2%     -70.5%     280388        sched_debug.cfs_rq:/.avg_vruntime.stddev
>       0.78  10%    -100.0%       0.00        sched_debug.cfs_rq:/.h_nr_running.min
>       0.16   8%    +113.3%       0.33   2%  sched_debug.cfs_rq:/.h_nr_running.stddev
>       0.56 141%  +2.2e+07%     122016  52%  sched_debug.cfs_rq:/.left_vruntime.avg
>      45.01 141%  +2.2e+07%   10035976  28%  sched_debug.cfs_rq:/.left_vruntime.max
>       4.58 141%  +2.3e+07%    1072762  36%  sched_debug.cfs_rq:/.left_vruntime.stddev
>       5814  10%    -100.0%       0.00        sched_debug.cfs_rq:/.load.min
>       5.39   9%     -73.2%       1.44  10%  sched_debug.cfs_rq:/.load_avg.min
>   19248627           -22.1%   14989905        sched_debug.cfs_rq:/.min_vruntime.avg
>   20722680           -24.9%   15569530        sched_debug.cfs_rq:/.min_vruntime.max
>   17634233           -22.5%   13663168        sched_debug.cfs_rq:/.min_vruntime.min
>     949063   2%     -70.5%     280388        sched_debug.cfs_rq:/.min_vruntime.stddev
>       0.78  10%    -100.0%       0.00        sched_debug.cfs_rq:/.nr_running.min
>       0.06   8%    +369.2%       0.30   3%  sched_debug.cfs_rq:/.nr_running.stddev
>       4.84  26%   +1611.3%      82.79  67%  sched_debug.cfs_rq:/.removed.load_avg.avg
>      27.92  12%   +3040.3%     876.79  68%  sched_debug.cfs_rq:/.removed.load_avg.stddev
>       0.56 141%  +2.2e+07%     122016  52%  sched_debug.cfs_rq:/.right_vruntime.avg
>      45.06 141%  +2.2e+07%   10035976  28%  sched_debug.cfs_rq:/.right_vruntime.max
>       4.59 141%  +2.3e+07%    1072762  36%  sched_debug.cfs_rq:/.right_vruntime.stddev
>     900.25           -10.4%     806.45        sched_debug.cfs_rq:/.runnable_avg.avg
>     533.28   4%     -87.0%      69.56  39%  sched_debug.cfs_rq:/.runnable_avg.min
>     122.77   2%     +92.9%     236.86        sched_debug.cfs_rq:/.runnable_avg.stddev
>     896.13           -10.8%     799.44        sched_debug.cfs_rq:/.util_avg.avg
>     379.06   4%     -83.4%      62.94  37%  sched_debug.cfs_rq:/.util_avg.min
>     116.35   8%     +99.4%     232.04        sched_debug.cfs_rq:/.util_avg.stddev
>     550.87           -14.2%     472.66   2%  sched_debug.cfs_rq:/.util_est_enqueued.avg
>       1124   8%     +18.2%       1329   3%  sched_debug.cfs_rq:/.util_est_enqueued.max
>     134.17  30%    -100.0%       0.00        sched_debug.cfs_rq:/.util_est_enqueued.min
>     558243   6%     -66.9%     184666        sched_debug.cpu.avg_idle.avg
>      12860  11%     -56.1%       5644        sched_debug.cpu.avg_idle.min
>     365635           -53.5%     169863   5%  sched_debug.cpu.avg_idle.stddev
>       9.56   3%     -28.4%       6.84   8%  sched_debug.cpu.clock.stddev
>       6999   2%     -85.6%       1007   3%  sched_debug.cpu.clock_task.stddev
>       3985  10%    -100.0%       0.00        sched_debug.cpu.curr->pid.min
>     491.71  10%    +209.3%       1520   4%  sched_debug.cpu.curr->pid.stddev
>     270.19 141%   +1096.6%       3233  51%  sched_debug.cpu.max_idle_balance_cost.stddev
>       0.78  10%    -100.0%       0.00        sched_debug.cpu.nr_running.min
>       0.15   6%    +121.7%       0.34   2%  sched_debug.cpu.nr_running.stddev
>      62041  15%   +4280.9%    2717948        sched_debug.cpu.nr_switches.avg
>    1074922  14%    +292.6%    4220307   2%  sched_debug.cpu.nr_switches.max
>       1186   2%  +1.2e+05%    1379073   4%  sched_debug.cpu.nr_switches.min
>     132392  21%    +294.6%     522476   5%  sched_debug.cpu.nr_switches.stddev
>       6.44   4%     +21.4%       7.82  12%  sched_debug.cpu.nr_uninterruptible.stddev
>       6.73  13%     -84.8%       1.02   5%  perf-stat.i.MPKI
>  1.652e+10   2%     -22.2%  1.285e+10        perf-stat.i.branch-instructions
>       0.72            +0.0        0.75        perf-stat.i.branch-miss-rate%
>   1.19e+08   3%     -19.8%   95493630        perf-stat.i.branch-misses
>      27.46  12%     -26.2        1.30   4%  perf-stat.i.cache-miss-rate%
>  5.943e+08  10%     -88.6%   67756219   5%  perf-stat.i.cache-misses
>  2.201e+09          +143.7%  5.364e+09        perf-stat.i.cache-references
>      48911  19%   +4695.4%    2345525        perf-stat.i.context-switches
>       3.66   2%     +28.5%       4.71        perf-stat.i.cpi
>  3.228e+11            -4.1%  3.097e+11        perf-stat.i.cpu-cycles
>     190.51         +1363.7%       2788  10%  perf-stat.i.cpu-migrations
>     803.99   6%    +510.2%       4905   5%  perf-stat.i.cycles-between-cache-misses
>       0.00  16%      +0.0        0.01  14%  perf-stat.i.dTLB-load-miss-rate%
>     755654  18%    +232.4%    2512024  14%  perf-stat.i.dTLB-load-misses
>  2.385e+10   2%     -26.9%  1.742e+10        perf-stat.i.dTLB-loads
>       0.00  31%      +0.0        0.01  35%  perf-stat.i.dTLB-store-miss-rate%
>     305657  36%    +200.0%     916822  35%  perf-stat.i.dTLB-store-misses
>  1.288e+10   2%     -28.8%  9.179e+09        perf-stat.i.dTLB-stores
>  8.789e+10   2%     -25.2%  6.578e+10        perf-stat.i.instructions
>       0.28   2%     -21.6%       0.22        perf-stat.i.ipc
>       2.52            -4.1%       2.42        perf-stat.i.metric.GHz
>     873.89  12%     -67.0%     288.04   8%  perf-stat.i.metric.K/sec
>     435.61   2%     -19.6%     350.06        perf-stat.i.metric.M/sec
>       2799           +29.9%       3637   2%  perf-stat.i.minor-faults
>      99.74            -2.6       97.11        perf-stat.i.node-load-miss-rate%
>  1.294e+08  12%     -92.4%    9879207   7%  perf-stat.i.node-load-misses
>      76.55           +16.4       92.92        perf-stat.i.node-store-miss-rate%
>  2.257e+08  10%     -90.4%   21721672   8%  perf-stat.i.node-store-misses
>   69217511  13%     -97.7%    1625810   7%  perf-stat.i.node-stores
>       2799           +29.9%       3637   2%  perf-stat.i.page-faults
>       6.79  13%     -84.9%       1.03   5%  perf-stat.overall.MPKI
>       0.72            +0.0        0.74        perf-stat.overall.branch-miss-rate%
>      27.06  12%     -25.8        1.26   4%  perf-stat.overall.cache-miss-rate%
>       3.68   2%     +28.1%       4.71        perf-stat.overall.cpi
>     549.38  10%    +736.0%       4592   5%  perf-stat.overall.cycles-between-cache-misses
>       0.00  18%      +0.0        0.01  14%  perf-stat.overall.dTLB-load-miss-rate%
>       0.00  36%      +0.0        0.01  35%  perf-stat.overall.dTLB-store-miss-rate%
>       0.27   2%     -22.0%       0.21        perf-stat.overall.ipc
>      99.80            -2.4       97.37        perf-stat.overall.node-load-miss-rate%
>      76.60           +16.4       93.03        perf-stat.overall.node-store-miss-rate%
>       9319            +5.8%       9855        perf-stat.overall.path-length
>  1.646e+10   2%     -22.2%  1.281e+10        perf-stat.ps.branch-instructions
>  1.186e+08   3%     -19.8%   95167897        perf-stat.ps.branch-misses
>  5.924e+08  10%     -88.6%   67384354   5%  perf-stat.ps.cache-misses
>  2.193e+09          +143.4%  5.339e+09        perf-stat.ps.cache-references
>      49100  19%   +4668.0%    2341074        perf-stat.ps.context-switches
>  3.218e+11            -4.1%  3.087e+11        perf-stat.ps.cpu-cycles
>     189.73         +1368.4%       2786  10%  perf-stat.ps.cpu-migrations
>     753056  18%    +229.9%    2484575  14%  perf-stat.ps.dTLB-load-misses
>  2.377e+10   2%     -26.9%  1.737e+10        perf-stat.ps.dTLB-loads
>     304509  36%    +199.1%     910856  35%  perf-stat.ps.dTLB-store-misses
>  1.284e+10   2%     -28.7%  9.152e+09        perf-stat.ps.dTLB-stores
>   8.76e+10   2%     -25.2%  6.557e+10        perf-stat.ps.instructions
>       2791           +28.2%       3580   2%  perf-stat.ps.minor-faults
>   1.29e+08  12%     -92.4%    9815672   7%  perf-stat.ps.node-load-misses
>   2.25e+08  10%     -90.4%   21575943   8%  perf-stat.ps.node-store-misses
>   69002373  13%     -97.7%    1615410   7%  perf-stat.ps.node-stores
>       2791           +28.2%       3580   2%  perf-stat.ps.page-faults
>   2.68e+13   2%     -26.1%  1.981e+13        perf-stat.total.instructions
>       0.00  35%   +2600.0%       0.04  23%  perf-sched.sch_delay.avg.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
>       1.18   9%     -98.1%       0.02  32%  perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
>       0.58   3%     -62.1%       0.22  97%  perf-sched.sch_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>       0.51  22%     -82.7%       0.09  11%  perf-sched.sch_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
>       0.25  23%     -59.6%       0.10  10%  perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
>       0.03  42%     -64.0%       0.01  15%  perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
>       0.04   7%    +434.6%       0.23  36%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
>       1.00  20%     -84.1%       0.16  78%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
>       0.01   7%     -70.0%       0.00        perf-sched.sch_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>       0.02   2%    +533.9%       0.12  43%  perf-sched.sch_delay.avg.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
>       0.03   7%    +105.9%       0.06  33%  perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
>       0.01  15%     +67.5%       0.02   8%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>       0.09  50%     -85.7%       0.01  33%  perf-sched.sch_delay.avg.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
>       0.04   7%    +343.4%       0.16   6%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
>       0.06  41%   +3260.7%       1.88  30%  perf-sched.sch_delay.max.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
>       3.78           -96.2%       0.14   3%  perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
>       2.86   4%     -72.6%       0.78 113%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>       4.09   7%     -34.1%       2.69   7%  perf-sched.sch_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
>       3.09  37%     -64.1%       1.11   5%  perf-sched.sch_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
>       0.00 141%   +6200.0%       0.13  82%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
>       3.94           -40.5%       2.35  48%  perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
>       1.63  21%     -77.0%       0.38  90%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
>       7.29  39%    +417.5%      37.72  16%  perf-sched.sch_delay.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>       3.35  14%     -51.7%       1.62   3%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
>       0.05  13%   +2245.1%       1.13  40%  perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
>       3.01  26%    +729.6%      25.01  91%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
>       1.93  59%     -85.5%       0.28  62%  perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
>       0.01           -50.0%       0.00        perf-sched.total_sch_delay.average.ms
>       7.29  39%    +468.8%      41.46  26%  perf-sched.total_sch_delay.max.ms
>       6.04   4%     -94.1%       0.35        perf-sched.total_wait_and_delay.average.ms
>     205790   3%   +1811.0%    3932742        perf-sched.total_wait_and_delay.count.ms
>       6.03   4%     -94.2%       0.35        perf-sched.total_wait_time.average.ms
>      75.51  41%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
>      23.01  17%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
>      23.82   7%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
>      95.27  41%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
>      55.86 141%   +1014.6%     622.64   5%  perf-sched.wait_and_delay.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
>       0.07  23%     -82.5%       0.01        perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
>     137.41   3%    +345.1%     611.63   2%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
>       0.04   5%     -49.6%       0.02        perf-sched.wait_and_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>     536.33   5%     -46.5%     287.00        perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>      21.67  32%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
>       5.67   8%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
>       1.67  56%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
>       5.67  29%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
>       5.33  23%     +93.8%      10.33  25%  perf-sched.wait_and_delay.count.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>     101725   3%     +15.3%     117243  10%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
>     100.00   7%     -80.3%      19.67   2%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
>      97762   4%   +3794.8%    3807606        perf-sched.wait_and_delay.count.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>       1091   9%    +111.9%       2311   3%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>     604.50  43%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
>      37.41   9%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
>      27.08  13%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
>     275.41  32%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
>       1313  69%    +112.1%       2786  15%  perf-sched.wait_and_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>     333.38 141%    +200.4%       1001        perf-sched.wait_and_delay.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
>       1000           -96.8%      31.85  48%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
>      17.99  33%    +387.5%      87.71   8%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>       0.33  19%     -74.1%       0.09  10%  perf-sched.wait_time.avg.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
>       0.02  53%    +331.4%       0.10  50%  perf-sched.wait_time.avg.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
>       0.09  65%     -75.9%       0.02   9%  perf-sched.wait_time.avg.ms.__cond_resched.aa_sk_perm.security_socket_sendmsg.sock_sendmsg.__sys_sendto
>       0.02  22%     -70.2%       0.01 141%  perf-sched.wait_time.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
>      75.51  41%    -100.0%       0.04  42%  perf-sched.wait_time.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
>       0.10  36%     -80.3%       0.02   9%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_node.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb
>       0.55  61%     -94.9%       0.03  45%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_node.kmalloc_reserve.__alloc_skb.alloc_skb_with_frags
>      23.01  17%    -100.0%       0.00 141%  perf-sched.wait_time.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
>      23.82   7%     -99.7%       0.07  57%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
>      95.27  41%    -100.0%       0.03  89%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
>      56.30 139%   +1005.5%     622.44   5%  perf-sched.wait_time.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
>       2.78  66%     -98.2%       0.05  52%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
>       0.07  23%     -82.5%       0.01        perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
>     137.37   3%    +345.1%     611.40   2%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
>       0.02   5%     -41.9%       0.01   3%  perf-sched.wait_time.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>     536.32   5%     -46.5%     286.98        perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>       4.66  20%     -56.7%       2.02  26%  perf-sched.wait_time.max.ms.__cond_resched.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
>       0.03  63%    +995.0%       0.37  26%  perf-sched.wait_time.max.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
>       1.67  87%     -92.6%       0.12  57%  perf-sched.wait_time.max.ms.__cond_resched.aa_sk_perm.security_socket_sendmsg.sock_sendmsg.__sys_sendto
>       0.54 117%     -95.1%       0.03 105%  perf-sched.wait_time.max.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
>       0.06  49%     -89.1%       0.01 141%  perf-sched.wait_time.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
>     604.50  43%    -100.0%       0.16  83%  perf-sched.wait_time.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
>       2.77  45%     -95.4%       0.13  64%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_node.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb
>       2.86  45%     -94.3%       0.16  91%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_node.kmalloc_reserve.__alloc_skb.alloc_skb_with_frags
>      37.41   9%    -100.0%       0.01 141%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
>      27.08  13%     -99.7%       0.08  61%  perf-sched.wait_time.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
>     275.41  32%    -100.0%       0.03  89%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_block.shmem_alloc_and_acct_folio.shmem_get_folio_gfp.shmem_write_begin
>       1313  69%    +112.1%       2786  15%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
>     334.74 140%    +198.9%       1000        perf-sched.wait_time.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
>      21.74  58%     -95.4%       1.00 103%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
>       1000           -97.6%      24.49  50%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
>      10.90  27%    +682.9%      85.36   6%  perf-sched.wait_time.max.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>      32.91  58%     -63.5%      12.01 115%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
>     169.97   7%     -49.2%      86.29  15%  perf-sched.wait_time.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
>      44.08           -19.8       24.25        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb
>      44.47           -19.6       24.87        perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg
>      43.63           -19.5       24.15        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page.skb_release_data
>      45.62           -19.2       26.39        perf-profile.calltrace.cycles-pp.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg
>      45.62           -19.2       26.40        perf-profile.calltrace.cycles-pp.__consume_stateless_skb.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
>      45.00           -19.1       25.94        perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg.inet_recvmsg
>      50.41           -16.8       33.64  39%  perf-profile.calltrace.cycles-pp.accept_connections.main.__libc_start_main
>      50.41           -16.8       33.64  39%  perf-profile.calltrace.cycles-pp.accept_connection.accept_connections.main.__libc_start_main
>      50.41           -16.8       33.64  39%  perf-profile.calltrace.cycles-pp.spawn_child.accept_connection.accept_connections.main.__libc_start_main
>      50.41           -16.8       33.64  39%  perf-profile.calltrace.cycles-pp.process_requests.spawn_child.accept_connection.accept_connections.main
>      99.92           -14.2       85.72  15%  perf-profile.calltrace.cycles-pp.main.__libc_start_main
>      99.96           -14.2       85.77  15%  perf-profile.calltrace.cycles-pp.__libc_start_main
>      50.10            -8.6       41.52        perf-profile.calltrace.cycles-pp.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
>      50.11            -8.6       41.55        perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
>      50.13            -8.5       41.64        perf-profile.calltrace.cycles-pp.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      50.28            -8.0       42.27        perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom
>      50.29            -8.0       42.29        perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni
>      50.31            -7.9       42.42        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests
>      50.32            -7.8       42.47        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.recvfrom.recv_omni.process_requests.spawn_child
>      50.36            -7.6       42.78        perf-profile.calltrace.cycles-pp.recvfrom.recv_omni.process_requests.spawn_child.accept_connection
>      50.41            -7.3       43.07        perf-profile.calltrace.cycles-pp.recv_omni.process_requests.spawn_child.accept_connection.accept_connections
>      19.93   2%      -6.6       13.36        perf-profile.calltrace.cycles-pp.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
>      19.44   2%      -6.3       13.16        perf-profile.calltrace.cycles-pp._copy_from_iter.ip_generic_getfrag.__ip_append_data.ip_make_skb.udp_sendmsg
>      18.99   2%      -6.1       12.90        perf-profile.calltrace.cycles-pp.copyin._copy_from_iter.ip_generic_getfrag.__ip_append_data.ip_make_skb
>       8.95            -5.1        3.82        perf-profile.calltrace.cycles-pp.udp_send_skb.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto
>       8.70            -5.0        3.71        perf-profile.calltrace.cycles-pp.ip_send_skb.udp_send_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
>       8.10            -4.6        3.45        perf-profile.calltrace.cycles-pp.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg.sock_sendmsg
>       7.69            -4.4        3.27        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb.udp_sendmsg
>       6.51            -3.7        2.78        perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb.udp_send_skb
>       6.47            -3.7        2.75        perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_send_skb
>       6.41            -3.7        2.71        perf-profile.calltrace.cycles-pp.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2
>       5.88            -3.5        2.43        perf-profile.calltrace.cycles-pp.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
>       5.73            -3.4        2.35        perf-profile.calltrace.cycles-pp.__napi_poll.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip
>       5.69            -3.4        2.33        perf-profile.calltrace.cycles-pp.process_backlog.__napi_poll.net_rx_action.__do_softirq.do_softirq
>       5.36            -3.2        2.19        perf-profile.calltrace.cycles-pp.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action.__do_softirq
>       4.59            -2.7        1.89        perf-profile.calltrace.cycles-pp.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
>       4.55   2%      -2.7        1.88        perf-profile.calltrace.cycles-pp.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll
>       4.40   2%      -2.6        1.81        perf-profile.calltrace.cycles-pp.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog
>       3.81   2%      -2.2        1.57        perf-profile.calltrace.cycles-pp.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
>       3.75   2%      -2.2        1.55        perf-profile.calltrace.cycles-pp.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
>       2.21   2%      -1.6        0.63        perf-profile.calltrace.cycles-pp.__ip_make_skb.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
>       1.94   2%      -1.4        0.51   2%  perf-profile.calltrace.cycles-pp.__ip_select_ident.__ip_make_skb.ip_make_skb.udp_sendmsg.sock_sendmsg
>       1.14            -0.6        0.51        perf-profile.calltrace.cycles-pp.sock_alloc_send_pskb.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
>       0.00            +0.5        0.53   2%  perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
>       0.00            +0.7        0.69        perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
>       0.00            +0.7        0.71        perf-profile.calltrace.cycles-pp.__sk_mem_raise_allocated.__sk_mem_schedule.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb
>       0.00            +0.7        0.72        perf-profile.calltrace.cycles-pp.__sk_mem_schedule.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv
>       0.00            +1.0        0.99  20%  perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp
>       0.00            +1.0        1.01  20%  perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
>       0.00            +1.1        1.05  20%  perf-profile.calltrace.cycles-pp.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg
>       0.00            +1.1        1.12        perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
>       0.00            +1.2        1.18  20%  perf-profile.calltrace.cycles-pp.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg.sock_recvmsg
>       0.00            +1.3        1.32        perf-profile.calltrace.cycles-pp.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu
>       0.00            +2.2        2.23        perf-profile.calltrace.cycles-pp.__skb_recv_udp.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
>      49.51            +2.6       52.08        perf-profile.calltrace.cycles-pp.send_udp_stream.main.__libc_start_main
>      49.49            +2.6       52.07        perf-profile.calltrace.cycles-pp.send_omni_inner.send_udp_stream.main.__libc_start_main
>       0.00            +3.0        2.96   2%  perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
>      48.71            +3.0       51.73        perf-profile.calltrace.cycles-pp.sendto.send_omni_inner.send_udp_stream.main.__libc_start_main
>       0.00            +3.1        3.06   2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
>       0.00            +3.1        3.09        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
>      48.34            +3.2       51.56        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream.main
>       0.00            +3.3        3.33   2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
>      48.13            +3.8       51.96        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner.send_udp_stream
>      47.82            +4.0       51.82        perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.send_omni_inner
>      47.70            +4.1       51.76        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
>       0.00            +4.1        4.08        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
>       0.00            +4.1        4.10        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
>       0.00            +4.1        4.10        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
>       0.00            +4.1        4.14        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
>       0.00            +4.3        4.35   2%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
>      46.52            +4.8       51.27        perf-profile.calltrace.cycles-pp.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      46.04            +5.0       51.08        perf-profile.calltrace.cycles-pp.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
>       3.67            +8.0       11.63        perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg
>       3.71            +8.1       11.80        perf-profile.calltrace.cycles-pp._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg
>       3.96            +8.5       12.42        perf-profile.calltrace.cycles-pp.__skb_datagram_iter.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg
>       3.96            +8.5       12.44        perf-profile.calltrace.cycles-pp.skb_copy_datagram_iter.udp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
>      35.13           +11.3       46.39        perf-profile.calltrace.cycles-pp.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto
>      32.68   2%     +13.0       45.65        perf-profile.calltrace.cycles-pp.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg.__sys_sendto
>      10.27           +20.3       30.59        perf-profile.calltrace.cycles-pp.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg.sock_sendmsg
>      10.24           +20.3       30.58        perf-profile.calltrace.cycles-pp.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb.udp_sendmsg
>       9.84           +20.5       30.32        perf-profile.calltrace.cycles-pp.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data.ip_make_skb
>       9.59           +20.5       30.11        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill.__ip_append_data
>       8.40           +21.0       29.42        perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.skb_page_frag_refill.sk_page_frag_refill
>       6.13           +21.9       28.05        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.rmqueue_bulk.rmqueue.get_page_from_freelist
>       6.20           +22.0       28.15        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
>       6.46           +22.5       28.91        perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.skb_page_frag_refill
>      48.24           -21.8       26.43        perf-profile.children.cycles-pp.skb_release_data
>      47.19           -21.2       25.98        perf-profile.children.cycles-pp.free_unref_page
>      44.48           -19.6       24.88        perf-profile.children.cycles-pp.free_pcppages_bulk
>      45.62           -19.2       26.40        perf-profile.children.cycles-pp.__consume_stateless_skb
>      99.95           -14.2       85.76  15%  perf-profile.children.cycles-pp.main
>      99.96           -14.2       85.77  15%  perf-profile.children.cycles-pp.__libc_start_main
>      50.10            -8.6       41.53        perf-profile.children.cycles-pp.udp_recvmsg
>      50.11            -8.6       41.56        perf-profile.children.cycles-pp.inet_recvmsg
>      50.13            -8.5       41.65        perf-profile.children.cycles-pp.sock_recvmsg
>      50.29            -8.0       42.28        perf-profile.children.cycles-pp.__sys_recvfrom
>      50.29            -8.0       42.30        perf-profile.children.cycles-pp.__x64_sys_recvfrom
>      50.38            -7.5       42.86        perf-profile.children.cycles-pp.recvfrom
>      50.41            -7.3       43.07        perf-profile.children.cycles-pp.accept_connections
>      50.41            -7.3       43.07        perf-profile.children.cycles-pp.accept_connection
>      50.41            -7.3       43.07        perf-profile.children.cycles-pp.spawn_child
>      50.41            -7.3       43.07        perf-profile.children.cycles-pp.process_requests
>      50.41            -7.3       43.07        perf-profile.children.cycles-pp.recv_omni
>      19.96   2%      -6.5       13.50        perf-profile.children.cycles-pp.ip_generic_getfrag
>      19.46   2%      -6.2       13.28        perf-profile.children.cycles-pp._copy_from_iter
>      19.21   2%      -6.1       13.14        perf-profile.children.cycles-pp.copyin
>       8.96            -5.1        3.86        perf-profile.children.cycles-pp.udp_send_skb
>       8.72            -5.0        3.75        perf-profile.children.cycles-pp.ip_send_skb
>       8.11            -4.6        3.49        perf-profile.children.cycles-pp.ip_finish_output2
>       7.72            -4.4        3.32        perf-profile.children.cycles-pp.__dev_queue_xmit
>      98.71            -4.1       94.59        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>      98.51            -4.0       94.46        perf-profile.children.cycles-pp.do_syscall_64
>       6.49            -3.7        2.78        perf-profile.children.cycles-pp.do_softirq
>       6.51            -3.7        2.82        perf-profile.children.cycles-pp.__local_bh_enable_ip
>       6.43            -3.7        2.78        perf-profile.children.cycles-pp.__do_softirq
>       5.90            -3.4        2.46        perf-profile.children.cycles-pp.net_rx_action
>       5.74            -3.4        2.38        perf-profile.children.cycles-pp.__napi_poll
>       5.71            -3.4        2.36        perf-profile.children.cycles-pp.process_backlog
>       5.37            -3.2        2.21        perf-profile.children.cycles-pp.__netif_receive_skb_one_core
>       4.60            -2.7        1.91        perf-profile.children.cycles-pp.ip_local_deliver_finish
>       4.57   2%      -2.7        1.90        perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
>       4.42   2%      -2.6        1.83        perf-profile.children.cycles-pp.__udp4_lib_rcv
>       3.82   2%      -2.2        1.58   2%  perf-profile.children.cycles-pp.udp_unicast_rcv_skb
>       3.78   2%      -2.2        1.57   2%  perf-profile.children.cycles-pp.udp_queue_rcv_one_skb
>       2.23   2%      -1.6        0.65   2%  perf-profile.children.cycles-pp.__ip_make_skb
>       1.95   2%      -1.4        0.52   3%  perf-profile.children.cycles-pp.__ip_select_ident
>       1.51   4%      -1.2        0.34        perf-profile.children.cycles-pp.free_unref_page_commit
>       1.17            -0.7        0.51   2%  perf-profile.children.cycles-pp.ip_route_output_flow
>       1.15            -0.6        0.52        perf-profile.children.cycles-pp.sock_alloc_send_pskb
>       0.91            -0.5        0.39        perf-profile.children.cycles-pp.alloc_skb_with_frags
>       0.86            -0.5        0.37        perf-profile.children.cycles-pp.__alloc_skb
>       0.83            -0.5        0.36   2%  perf-profile.children.cycles-pp.ip_route_output_key_hash_rcu
>       0.75            -0.4        0.32        perf-profile.children.cycles-pp.dev_hard_start_xmit
>       0.72            -0.4        0.31   3%  perf-profile.children.cycles-pp.fib_table_lookup
>       0.67            -0.4        0.28        perf-profile.children.cycles-pp.loopback_xmit
>       0.70   2%      -0.4        0.33        perf-profile.children.cycles-pp.__zone_watermark_ok
>       0.47   4%      -0.3        0.15        perf-profile.children.cycles-pp.kmem_cache_free
>       0.57            -0.3        0.26        perf-profile.children.cycles-pp.kmem_cache_alloc_node
>       0.46            -0.3        0.18   2%  perf-profile.children.cycles-pp.ip_rcv
>       0.42            -0.3        0.17        perf-profile.children.cycles-pp.move_addr_to_kernel
>       0.41            -0.2        0.16   2%  perf-profile.children.cycles-pp.__udp4_lib_lookup
>       0.32            -0.2        0.13        perf-profile.children.cycles-pp.__netif_rx
>       0.30            -0.2        0.12        perf-profile.children.cycles-pp.netif_rx_internal
>       0.30            -0.2        0.12        perf-profile.children.cycles-pp._copy_from_user
>       0.31            -0.2        0.13        perf-profile.children.cycles-pp.kmalloc_reserve
>       0.63            -0.2        0.46   2%  perf-profile.children.cycles-pp.free_unref_page_prepare
>       0.28            -0.2        0.11        perf-profile.children.cycles-pp.enqueue_to_backlog
>       0.27            -0.2        0.11        perf-profile.children.cycles-pp.udp4_lib_lookup2
>       0.29            -0.2        0.13   6%  perf-profile.children.cycles-pp.send_data
>       0.25            -0.2        0.10        perf-profile.children.cycles-pp.__netif_receive_skb_core
>       0.23   2%      -0.1        0.10   4%  perf-profile.children.cycles-pp.security_socket_sendmsg
>       0.19   2%      -0.1        0.06        perf-profile.children.cycles-pp.ip_rcv_core
>       0.37            -0.1        0.24        perf-profile.children.cycles-pp.irqtime_account_irq
>       0.21            -0.1        0.08        perf-profile.children.cycles-pp.sock_wfree
>       0.21   3%      -0.1        0.08        perf-profile.children.cycles-pp.validate_xmit_skb
>       0.20   2%      -0.1        0.08        perf-profile.children.cycles-pp.ip_output
>       0.22   2%      -0.1        0.10   4%  perf-profile.children.cycles-pp.ip_rcv_finish_core
>       0.20   6%      -0.1        0.09   5%  perf-profile.children.cycles-pp.__mkroute_output
>       0.21   2%      -0.1        0.09   5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
>       0.28            -0.1        0.18        perf-profile.children.cycles-pp._raw_spin_trylock
>       0.34   3%      -0.1        0.25        perf-profile.children.cycles-pp.__slab_free
>       0.13   3%      -0.1        0.05        perf-profile.children.cycles-pp.siphash_3u32
>       0.12   4%      -0.1        0.03  70%  perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
>       0.14   3%      -0.1        0.06   7%  perf-profile.children.cycles-pp.__ip_local_out
>       0.20   2%      -0.1        0.12        perf-profile.children.cycles-pp.aa_sk_perm
>       0.18   2%      -0.1        0.10        perf-profile.children.cycles-pp.get_pfnblock_flags_mask
>       0.12   3%      -0.1        0.05        perf-profile.children.cycles-pp.sk_filter_trim_cap
>       0.13            -0.1        0.06        perf-profile.children.cycles-pp.ip_setup_cork
>       0.13   7%      -0.1        0.06   8%  perf-profile.children.cycles-pp.fib_lookup_good_nhc
>       0.15   3%      -0.1        0.08   5%  perf-profile.children.cycles-pp.skb_set_owner_w
>       0.11   4%      -0.1        0.05        perf-profile.children.cycles-pp.dst_release
>       0.23   2%      -0.1        0.17   2%  perf-profile.children.cycles-pp.__entry_text_start
>       0.11            -0.1        0.05        perf-profile.children.cycles-pp.ipv4_mtu
>       0.20   2%      -0.1        0.15   3%  perf-profile.children.cycles-pp.__list_add_valid_or_report
>       0.10            -0.1        0.05        perf-profile.children.cycles-pp.ip_send_check
>       0.31   2%      -0.0        0.26   3%  perf-profile.children.cycles-pp.sockfd_lookup_light
>       0.27            -0.0        0.22   2%  perf-profile.children.cycles-pp.__fget_light
>       0.63            -0.0        0.58        perf-profile.children.cycles-pp.__check_object_size
>       0.15   3%      -0.0        0.11        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
>       0.13            -0.0        0.09   5%  perf-profile.children.cycles-pp.alloc_pages
>       0.27            -0.0        0.24        perf-profile.children.cycles-pp.sched_clock_cpu
>       0.11   4%      -0.0        0.08   6%  perf-profile.children.cycles-pp.__cond_resched
>       0.14   3%      -0.0        0.11        perf-profile.children.cycles-pp.free_tail_page_prepare
>       0.11            -0.0        0.08   5%  perf-profile.children.cycles-pp.syscall_return_via_sysret
>       0.09   9%      -0.0        0.06   7%  perf-profile.children.cycles-pp.__xfrm_policy_check2
>       0.23   2%      -0.0        0.21   2%  perf-profile.children.cycles-pp.sched_clock
>       0.14   3%      -0.0        0.11   4%  perf-profile.children.cycles-pp.prep_compound_page
>       0.21   2%      -0.0        0.20   2%  perf-profile.children.cycles-pp.native_sched_clock
>       0.06            -0.0        0.05        perf-profile.children.cycles-pp.task_tick_fair
>       0.06            -0.0        0.05        perf-profile.children.cycles-pp.check_stack_object
>       0.18   2%      +0.0        0.20   2%  perf-profile.children.cycles-pp.perf_event_task_tick
>       0.18   2%      +0.0        0.19   2%  perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
>       0.31   3%      +0.0        0.33        perf-profile.children.cycles-pp.tick_sched_handle
>       0.31   3%      +0.0        0.33        perf-profile.children.cycles-pp.update_process_times
>       0.41   2%      +0.0        0.43        perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
>       0.40   2%      +0.0        0.42        perf-profile.children.cycles-pp.hrtimer_interrupt
>       0.32   2%      +0.0        0.34        perf-profile.children.cycles-pp.tick_sched_timer
>       0.36   2%      +0.0        0.39        perf-profile.children.cycles-pp.__hrtimer_run_queues
>       0.06   7%      +0.0        0.10   4%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
>       0.05   8%      +0.0        0.10        perf-profile.children.cycles-pp._raw_spin_lock_bh
>       0.00            +0.1        0.05        perf-profile.children.cycles-pp.update_cfs_group
>       0.00            +0.1        0.05        perf-profile.children.cycles-pp.cpuidle_governor_latency_req
>       0.00            +0.1        0.05        perf-profile.children.cycles-pp.flush_smp_call_function_queue
>       0.00            +0.1        0.05   8%  perf-profile.children.cycles-pp.prepare_to_wait_exclusive
>       0.07            +0.1        0.13   3%  perf-profile.children.cycles-pp.__mod_zone_page_state
>       0.00            +0.1        0.06  13%  perf-profile.children.cycles-pp.cgroup_rstat_updated
>       0.00            +0.1        0.06        perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
>       0.00            +0.1        0.06        perf-profile.children.cycles-pp.security_socket_recvmsg
>       0.00            +0.1        0.06        perf-profile.children.cycles-pp.select_task_rq_fair
>       0.00            +0.1        0.06        perf-profile.children.cycles-pp.tick_irq_enter
>       0.00            +0.1        0.06        perf-profile.children.cycles-pp.tick_nohz_idle_enter
>       0.42   2%      +0.1        0.49   2%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
>       0.00            +0.1        0.07   7%  perf-profile.children.cycles-pp.ktime_get
>       0.00            +0.1        0.07        perf-profile.children.cycles-pp.__get_user_4
>       0.00            +0.1        0.07        perf-profile.children.cycles-pp.update_rq_clock
>       0.00            +0.1        0.07        perf-profile.children.cycles-pp.select_task_rq
>       0.00            +0.1        0.07        perf-profile.children.cycles-pp.native_apic_msr_eoi
>       0.49            +0.1        0.57   2%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
>       0.11  11%      +0.1        0.19   2%  perf-profile.children.cycles-pp._raw_spin_lock
>       0.00            +0.1        0.08        perf-profile.children.cycles-pp.update_rq_clock_task
>       0.00            +0.1        0.08        perf-profile.children.cycles-pp.__update_load_avg_se
>       0.00            +0.1        0.09   5%  perf-profile.children.cycles-pp.irq_enter_rcu
>       0.00            +0.1        0.09   5%  perf-profile.children.cycles-pp.__irq_exit_rcu
>       0.00            +0.1        0.09        perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
>       0.00            +0.1        0.09        perf-profile.children.cycles-pp.update_blocked_averages
>       0.00            +0.1        0.09        perf-profile.children.cycles-pp.update_sg_lb_stats
>       0.00            +0.1        0.09   5%  perf-profile.children.cycles-pp.set_next_entity
>       0.00            +0.1        0.10        perf-profile.children.cycles-pp.__switch_to_asm
>       0.00            +0.1        0.11  12%  perf-profile.children.cycles-pp._copy_to_user
>       0.00            +0.1        0.12   3%  perf-profile.children.cycles-pp.menu_select
>       0.00            +0.1        0.12   3%  perf-profile.children.cycles-pp.recv_data
>       0.00            +0.1        0.12   3%  perf-profile.children.cycles-pp.update_sd_lb_stats
>       0.00            +0.1        0.13   3%  perf-profile.children.cycles-pp.native_irq_return_iret
>       0.00            +0.1        0.13   3%  perf-profile.children.cycles-pp.__switch_to
>       0.00            +0.1        0.13   3%  perf-profile.children.cycles-pp.find_busiest_group
>       0.00            +0.1        0.14        perf-profile.children.cycles-pp.finish_task_switch
>       0.00            +0.1        0.15   3%  perf-profile.children.cycles-pp.update_curr
>       0.00            +0.2        0.15   3%  perf-profile.children.cycles-pp.mem_cgroup_uncharge_skmem
>       0.00            +0.2        0.16        perf-profile.children.cycles-pp.ttwu_queue_wakelist
>       0.05            +0.2        0.22   2%  perf-profile.children.cycles-pp.page_counter_try_charge
>       0.00            +0.2        0.17   2%  perf-profile.children.cycles-pp.load_balance
>       0.00            +0.2        0.17   2%  perf-profile.children.cycles-pp.___perf_sw_event
>       0.02 141%      +0.2        0.19   2%  perf-profile.children.cycles-pp.page_counter_uncharge
>       0.33            +0.2        0.52        perf-profile.children.cycles-pp.__free_one_page
>       0.02 141%      +0.2        0.21   2%  perf-profile.children.cycles-pp.drain_stock
>       0.00            +0.2        0.20   2%  perf-profile.children.cycles-pp.prepare_task_switch
>       0.16   3%      +0.2        0.38   2%  perf-profile.children.cycles-pp.simple_copy_to_iter
>       0.07  11%      +0.2        0.31        perf-profile.children.cycles-pp.refill_stock
>       0.07   6%      +0.2        0.31   4%  perf-profile.children.cycles-pp.move_addr_to_user
>       0.00            +0.2        0.24        perf-profile.children.cycles-pp.enqueue_entity
>       0.00            +0.2        0.25        perf-profile.children.cycles-pp.update_load_avg
>       0.21   2%      +0.3        0.48        perf-profile.children.cycles-pp.__list_del_entry_valid_or_report
>       0.00            +0.3        0.31   4%  perf-profile.children.cycles-pp.dequeue_entity
>       0.08   5%      +0.3        0.40   3%  perf-profile.children.cycles-pp.try_charge_memcg
>       0.00            +0.3        0.33        perf-profile.children.cycles-pp.enqueue_task_fair
>       0.00            +0.4        0.35   2%  perf-profile.children.cycles-pp.dequeue_task_fair
>       0.00            +0.4        0.35   2%  perf-profile.children.cycles-pp.activate_task
>       0.00            +0.4        0.36   2%  perf-profile.children.cycles-pp.try_to_wake_up
>       0.00            +0.4        0.37   2%  perf-profile.children.cycles-pp.autoremove_wake_function
>       0.00            +0.4        0.39   3%  perf-profile.children.cycles-pp.newidle_balance
>       0.12   8%      +0.4        0.51   2%  perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
>       0.00            +0.4        0.39        perf-profile.children.cycles-pp.ttwu_do_activate
>       0.00            +0.4        0.40   2%  perf-profile.children.cycles-pp.__wake_up_common
>       0.18   4%      +0.4        0.59        perf-profile.children.cycles-pp.udp_rmem_release
>       0.11   7%      +0.4        0.52        perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
>       0.00            +0.4        0.43        perf-profile.children.cycles-pp.__wake_up_common_lock
>       0.00            +0.5        0.46        perf-profile.children.cycles-pp.sched_ttwu_pending
>       0.00            +0.5        0.49        perf-profile.children.cycles-pp.sock_def_readable
>       0.00            +0.5        0.53   2%  perf-profile.children.cycles-pp.pick_next_task_fair
>       0.00            +0.5        0.54   2%  perf-profile.children.cycles-pp.schedule_idle
>       0.00            +0.6        0.55        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
>       0.15   3%      +0.6        0.73   2%  perf-profile.children.cycles-pp.__sk_mem_raise_allocated
>       0.00            +0.6        0.57        perf-profile.children.cycles-pp.__sysvec_call_function_single
>       0.16   5%      +0.6        0.74   2%  perf-profile.children.cycles-pp.__sk_mem_schedule
>       0.00            +0.8        0.78        perf-profile.children.cycles-pp.sysvec_call_function_single
>       0.41   3%      +0.9        1.33   2%  perf-profile.children.cycles-pp.__udp_enqueue_schedule_skb
>       0.00            +1.2        1.16   2%  perf-profile.children.cycles-pp.schedule
>       0.00            +1.2        1.21   2%  perf-profile.children.cycles-pp.schedule_timeout
>       0.00            +1.3        1.33   2%  perf-profile.children.cycles-pp.__skb_wait_for_more_packets
>       0.00            +1.7        1.66   2%  perf-profile.children.cycles-pp.__schedule
>       0.27   3%      +2.0        2.25        perf-profile.children.cycles-pp.__skb_recv_udp
>      50.41            +2.4       52.81        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>       0.00            +2.7        2.68        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
>      49.78            +2.7       52.49        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>       0.00            +3.0        2.98        perf-profile.children.cycles-pp.acpi_safe_halt
>       0.00            +3.0        3.00        perf-profile.children.cycles-pp.acpi_idle_enter
>      49.51            +3.1       52.57        perf-profile.children.cycles-pp.send_udp_stream
>      49.50            +3.1       52.56        perf-profile.children.cycles-pp.send_omni_inner
>       0.00            +3.1        3.10        perf-profile.children.cycles-pp.cpuidle_enter_state
>       0.00            +3.1        3.12        perf-profile.children.cycles-pp.cpuidle_enter
>       0.00            +3.4        3.37        perf-profile.children.cycles-pp.cpuidle_idle_call
>      48.90            +3.4       52.30        perf-profile.children.cycles-pp.sendto
>      47.85            +4.0       51.83        perf-profile.children.cycles-pp.__x64_sys_sendto
>      47.73            +4.0       51.77        perf-profile.children.cycles-pp.__sys_sendto
>       0.00            +4.1        4.10        perf-profile.children.cycles-pp.start_secondary
>       0.00            +4.1        4.13        perf-profile.children.cycles-pp.do_idle
>       0.00            +4.1        4.14        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
>       0.00            +4.1        4.14        perf-profile.children.cycles-pp.cpu_startup_entry
>      46.54            +4.7       51.28        perf-profile.children.cycles-pp.sock_sendmsg
>      46.10            +5.0       51.11        perf-profile.children.cycles-pp.udp_sendmsg
>       3.70            +8.0       11.71        perf-profile.children.cycles-pp.copyout
>       3.71            +8.1       11.80        perf-profile.children.cycles-pp._copy_to_iter
>       3.96            +8.5       12.43        perf-profile.children.cycles-pp.__skb_datagram_iter
>       3.96            +8.5       12.44        perf-profile.children.cycles-pp.skb_copy_datagram_iter
>      35.14           +11.3       46.40        perf-profile.children.cycles-pp.ip_make_skb
>      32.71   2%     +13.0       45.66        perf-profile.children.cycles-pp.__ip_append_data
>      10.28           +20.6       30.89        perf-profile.children.cycles-pp.sk_page_frag_refill
>      10.25           +20.6       30.88        perf-profile.children.cycles-pp.skb_page_frag_refill
>       9.86           +20.8       30.63        perf-profile.children.cycles-pp.__alloc_pages
>       9.62           +20.8       30.42        perf-profile.children.cycles-pp.get_page_from_freelist
>       8.42           +21.3       29.72        perf-profile.children.cycles-pp.rmqueue
>       6.47           +22.8       29.22        perf-profile.children.cycles-pp.rmqueue_bulk
>      19.11   2%      -6.0       13.08        perf-profile.self.cycles-pp.copyin
>       1.81   2%      -1.4        0.39        perf-profile.self.cycles-pp.rmqueue
>       1.81   2%      -1.3        0.46   2%  perf-profile.self.cycles-pp.__ip_select_ident
>       1.47   4%      -1.2        0.31        perf-profile.self.cycles-pp.free_unref_page_commit
>       1.29   2%      -0.5        0.75        perf-profile.self.cycles-pp.__ip_append_data
>       0.71            -0.4        0.29        perf-profile.self.cycles-pp.udp_sendmsg
>       0.68   2%      -0.4        0.32        perf-profile.self.cycles-pp.__zone_watermark_ok
>       0.50            -0.3        0.16        perf-profile.self.cycles-pp.skb_release_data
>       0.59   3%      -0.3        0.26   3%  perf-profile.self.cycles-pp.fib_table_lookup
>       0.46   4%      -0.3        0.15   3%  perf-profile.self.cycles-pp.kmem_cache_free
>       0.63            -0.3        0.33   2%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
>       0.47            -0.3        0.19        perf-profile.self.cycles-pp.__sys_sendto
>       0.44            -0.2        0.21   2%  perf-profile.self.cycles-pp.kmem_cache_alloc_node
>       0.36            -0.2        0.16   3%  perf-profile.self.cycles-pp.send_omni_inner
>       0.35   2%      -0.2        0.15   3%  perf-profile.self.cycles-pp.ip_finish_output2
>       0.29            -0.2        0.12        perf-profile.self.cycles-pp._copy_from_user
>       0.24            -0.1        0.10   4%  perf-profile.self.cycles-pp.__netif_receive_skb_core
>       0.22   2%      -0.1        0.08   5%  perf-profile.self.cycles-pp.free_unref_page
>       0.19   2%      -0.1        0.06        perf-profile.self.cycles-pp.ip_rcv_core
>       0.21   2%      -0.1        0.08        perf-profile.self.cycles-pp.__alloc_skb
>       0.20   2%      -0.1        0.08        perf-profile.self.cycles-pp.sock_wfree
>       0.22   2%      -0.1        0.10   4%  perf-profile.self.cycles-pp.send_data
>       0.21            -0.1        0.09        perf-profile.self.cycles-pp.sendto
>       0.21   2%      -0.1        0.10   4%  perf-profile.self.cycles-pp.ip_rcv_finish_core
>       0.21   2%      -0.1        0.09   5%  perf-profile.self.cycles-pp.__ip_make_skb
>       0.20   4%      -0.1        0.09   5%  perf-profile.self.cycles-pp._raw_spin_lock_irq
>       0.21   2%      -0.1        0.10   4%  perf-profile.self.cycles-pp.__dev_queue_xmit
>       0.38   3%      -0.1        0.27        perf-profile.self.cycles-pp.get_page_from_freelist
>       0.20   2%      -0.1        0.09        perf-profile.self.cycles-pp.udp_send_skb
>       0.18   2%      -0.1        0.07        perf-profile.self.cycles-pp.__udp_enqueue_schedule_skb
>       0.18   4%      -0.1        0.08   6%  perf-profile.self.cycles-pp.__mkroute_output
>       0.25            -0.1        0.15   3%  perf-profile.self.cycles-pp._copy_from_iter
>       0.27   4%      -0.1        0.17   2%  perf-profile.self.cycles-pp.skb_page_frag_refill
>       0.16            -0.1        0.06   7%  perf-profile.self.cycles-pp.sock_sendmsg
>       0.33   2%      -0.1        0.24        perf-profile.self.cycles-pp.__slab_free
>       0.15   3%      -0.1        0.06        perf-profile.self.cycles-pp.udp4_lib_lookup2
>       0.38   2%      -0.1        0.29   2%  perf-profile.self.cycles-pp.free_unref_page_prepare
>       0.26            -0.1        0.17        perf-profile.self.cycles-pp._raw_spin_trylock
>       0.15            -0.1        0.06        perf-profile.self.cycles-pp.ip_output
>       0.14            -0.1        0.05   8%  perf-profile.self.cycles-pp.process_backlog
>       0.14            -0.1        0.06        perf-profile.self.cycles-pp.ip_route_output_flow
>       0.14            -0.1        0.06        perf-profile.self.cycles-pp.__udp4_lib_lookup
>       0.21   2%      -0.1        0.13   3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
>       0.12   3%      -0.1        0.05        perf-profile.self.cycles-pp.siphash_3u32
>       0.13   3%      -0.1        0.06   8%  perf-profile.self.cycles-pp.ip_send_skb
>       0.17            -0.1        0.10        perf-profile.self.cycles-pp.__do_softirq
>       0.15   3%      -0.1        0.08   5%  perf-profile.self.cycles-pp.skb_set_owner_w
>       0.17   2%      -0.1        0.10   4%  perf-profile.self.cycles-pp.aa_sk_perm
>       0.12            -0.1        0.05        perf-profile.self.cycles-pp.__x64_sys_sendto
>       0.12   6%      -0.1        0.05        perf-profile.self.cycles-pp.fib_lookup_good_nhc
>       0.19   2%      -0.1        0.13        perf-profile.self.cycles-pp.__list_add_valid_or_report
>       0.14   3%      -0.1        0.07   6%  perf-profile.self.cycles-pp.net_rx_action
>       0.16   2%      -0.1        0.10        perf-profile.self.cycles-pp.do_syscall_64
>       0.11            -0.1        0.05        perf-profile.self.cycles-pp.__udp4_lib_rcv
>       0.16   3%      -0.1        0.10   4%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
>       0.11   4%      -0.1        0.05        perf-profile.self.cycles-pp.ip_route_output_key_hash_rcu
>       0.10   4%      -0.1        0.05        perf-profile.self.cycles-pp.ip_generic_getfrag
>       0.10            -0.1        0.05        perf-profile.self.cycles-pp.ipv4_mtu
>       0.26            -0.0        0.21   2%  perf-profile.self.cycles-pp.__fget_light
>       0.15   3%      -0.0        0.11   4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
>       0.24            -0.0        0.20   2%  perf-profile.self.cycles-pp.__alloc_pages
>       0.15   3%      -0.0        0.12        perf-profile.self.cycles-pp.__check_object_size
>       0.11            -0.0        0.08   6%  perf-profile.self.cycles-pp.syscall_return_via_sysret
>       0.08   5%      -0.0        0.05        perf-profile.self.cycles-pp.loopback_xmit
>       0.13            -0.0        0.11   4%  perf-profile.self.cycles-pp.prep_compound_page
>       0.11            -0.0        0.09   5%  perf-profile.self.cycles-pp.irqtime_account_irq
>       0.09  10%      -0.0        0.06   7%  perf-profile.self.cycles-pp.__xfrm_policy_check2
>       0.07            -0.0        0.05        perf-profile.self.cycles-pp.alloc_pages
>       0.08            -0.0        0.06   7%  perf-profile.self.cycles-pp.__entry_text_start
>       0.09   5%      -0.0        0.07        perf-profile.self.cycles-pp.free_tail_page_prepare
>       0.10            +0.0        0.11        perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
>       0.06            +0.0        0.08   6%  perf-profile.self.cycles-pp.free_pcppages_bulk
>       0.05   8%      +0.0        0.10   4%  perf-profile.self.cycles-pp._raw_spin_lock_bh
>       0.07            +0.0        0.12        perf-profile.self.cycles-pp.__mod_zone_page_state
>       0.00            +0.1        0.05        perf-profile.self.cycles-pp.cpuidle_idle_call
>       0.00            +0.1        0.05        perf-profile.self.cycles-pp.udp_rmem_release
>       0.00            +0.1        0.05        perf-profile.self.cycles-pp.__flush_smp_call_function_queue
>       0.00            +0.1        0.05        perf-profile.self.cycles-pp.sock_def_readable
>       0.00            +0.1        0.05        perf-profile.self.cycles-pp.update_cfs_group
>       0.11  11%      +0.1        0.17   2%  perf-profile.self.cycles-pp._raw_spin_lock
>       0.00            +0.1        0.05   8%  perf-profile.self.cycles-pp.finish_task_switch
>       0.00            +0.1        0.05   8%  perf-profile.self.cycles-pp.cgroup_rstat_updated
>       0.00            +0.1        0.06        perf-profile.self.cycles-pp.do_idle
>       0.00            +0.1        0.06        perf-profile.self.cycles-pp.__skb_wait_for_more_packets
>       0.00            +0.1        0.06        perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
>       0.00            +0.1        0.06   7%  perf-profile.self.cycles-pp.enqueue_entity
>       0.00            +0.1        0.07   7%  perf-profile.self.cycles-pp.schedule_timeout
>       0.00            +0.1        0.07   7%  perf-profile.self.cycles-pp.move_addr_to_user
>       0.00            +0.1        0.07   7%  perf-profile.self.cycles-pp.menu_select
>       0.00            +0.1        0.07   7%  perf-profile.self.cycles-pp.native_apic_msr_eoi
>       0.00            +0.1        0.07   7%  perf-profile.self.cycles-pp.update_sg_lb_stats
>       0.00            +0.1        0.07        perf-profile.self.cycles-pp.__update_load_avg_se
>       0.00            +0.1        0.07        perf-profile.self.cycles-pp.__get_user_4
>       0.00            +0.1        0.08   6%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated
>       0.00            +0.1        0.08        perf-profile.self.cycles-pp.update_curr
>       0.00            +0.1        0.08   5%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
>       0.00            +0.1        0.09   5%  perf-profile.self.cycles-pp.try_to_wake_up
>       0.00            +0.1        0.09        perf-profile.self.cycles-pp.recvfrom
>       0.00            +0.1        0.09        perf-profile.self.cycles-pp.mem_cgroup_charge_skmem
>       0.00            +0.1        0.09        perf-profile.self.cycles-pp.update_load_avg
>       0.00            +0.1        0.09   5%  perf-profile.self.cycles-pp.enqueue_task_fair
>       0.00            +0.1        0.10   4%  perf-profile.self.cycles-pp._copy_to_iter
>       0.00            +0.1        0.10   4%  perf-profile.self.cycles-pp.newidle_balance
>       0.00            +0.1        0.10   4%  perf-profile.self.cycles-pp.recv_data
>       0.00            +0.1        0.10        perf-profile.self.cycles-pp.refill_stock
>       0.00            +0.1        0.10        perf-profile.self.cycles-pp.__switch_to_asm
>       0.00            +0.1        0.11  15%  perf-profile.self.cycles-pp._copy_to_user
>       0.00            +0.1        0.12        perf-profile.self.cycles-pp.recv_omni
>       0.00            +0.1        0.12        perf-profile.self.cycles-pp.mem_cgroup_uncharge_skmem
>       0.00            +0.1        0.13   3%  perf-profile.self.cycles-pp.native_irq_return_iret
>       0.00            +0.1        0.13        perf-profile.self.cycles-pp.__switch_to
>       0.06            +0.1        0.20   2%  perf-profile.self.cycles-pp.rmqueue_bulk
>       0.09   5%      +0.1        0.23   4%  perf-profile.self.cycles-pp.udp_recvmsg
>       0.00            +0.1        0.14   3%  perf-profile.self.cycles-pp.__skb_recv_udp
>       0.00            +0.1        0.14   3%  perf-profile.self.cycles-pp.___perf_sw_event
>       0.08            +0.1        0.22   2%  perf-profile.self.cycles-pp.__skb_datagram_iter
>       0.03  70%      +0.2        0.20   4%  perf-profile.self.cycles-pp.page_counter_try_charge
>       0.02 141%      +0.2        0.18   4%  perf-profile.self.cycles-pp.__sys_recvfrom
>       0.00            +0.2        0.17   2%  perf-profile.self.cycles-pp.__schedule
>       0.00            +0.2        0.17   2%  perf-profile.self.cycles-pp.try_charge_memcg
>       0.00            +0.2        0.17   2%  perf-profile.self.cycles-pp.page_counter_uncharge
>       0.00            +0.2        0.21   2%  perf-profile.self.cycles-pp.__sk_mem_raise_allocated
>       0.14   3%      +0.2        0.36        perf-profile.self.cycles-pp.__free_one_page
>       0.20   2%      +0.3        0.47        perf-profile.self.cycles-pp.__list_del_entry_valid_or_report
>       0.00            +2.1        2.07   2%  perf-profile.self.cycles-pp.acpi_safe_halt
>      49.78            +2.7       52.49        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>       3.68            +8.0       11.64        perf-profile.self.cycles-pp.copyout
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.

Thread overview: 20+ messages
2023-10-16  5:29 [PATCH -V3 0/9] mm: PCP high auto-tuning Huang Ying
2023-10-16  5:29 ` [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit Huang Ying
2023-10-16  5:29 ` [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice Huang Ying
2023-10-19 12:11   ` Mel Gorman
2023-10-16  5:29 ` [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages Huang Ying
2023-10-27  6:23   ` kernel test robot
2023-11-06  6:22   ` kernel test robot
2023-11-06  6:38     ` Huang, Ying
2023-10-16  5:29 ` [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency Huang Ying
2023-10-19 12:12   ` Mel Gorman
2023-10-16  5:29 ` [PATCH -V3 5/9] mm, page_alloc: scale the number of pages that are batch allocated Huang Ying
2023-10-16  5:29 ` [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning Huang Ying
2023-10-19 12:16   ` Mel Gorman
2023-10-16  5:30 ` [PATCH -V3 7/9] mm: tune PCP high automatically Huang Ying
2023-10-31  2:50   ` kernel test robot
2023-10-16  5:30 ` [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark Huang Ying
2023-10-19 12:33   ` Mel Gorman
2023-10-20  3:30     ` Huang, Ying
2023-10-23  9:26       ` Mel Gorman
2023-10-16  5:30 ` [PATCH -V3 9/9] mm, pcp: reduce detecting time of consecutive high order page freeing Huang Ying
