linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
@ 2023-02-08  7:35 Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling Bharata B Rao
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Hi,

Some hardware platforms can provide information about memory accesses
that can be used to do optimal page and task placement on NUMA
systems. AMD processors have a hardware facility called Instruction-
Based Sampling (IBS) that can be used to gather specific metrics
related to instruction fetch and execution activity. This facility
can be used to perform memory access profiling based on statistical
sampling.

This RFC is a proof-of-concept implementation where the access
information obtained from the hardware is used to drive NUMA balancing.
With this, it is no longer necessary to scan the address space and
introduce NUMA hint faults to build the task-to-page association.
Hence the approach taken here is to replace the address space
scanning plus hint faults with the access information provided by
the hardware. The access samples obtained from the hardware are fed
to NUMA balancing as fault-equivalents. The rest of the NUMA
balancing logic that collects/aggregates the shared/private/local/remote
faults and does page/task migrations based on the faults is retained,
except that accesses replace faults.
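
At a high level, the flow implemented by this series looks roughly
like the following (the function names are the ones introduced in
the patches that follow):

  ibs_overflow_handler()           /* NMI: read the IBS MSRs, filter samples */
    -> task_work_add()             /* defer processing to task context */
       -> task_ibs_access_work()   /* runs when the task returns to user mode */
          -> do_numa_access()      /* look up the page from paddr, pick a node */
             -> migrate_misplaced_page()  /* migrate the page if misplaced */
             -> task_numa_fault()  /* account the access as a fault equivalent */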

This early implementation is only an attempt to get a working
solution, and as such a lot of TODOs exist:

- Perf uses IBS and we are using the same IBS for access profiling here.
  There needs to be a proper way to make the use mutually exclusive.
- Is tying this up with NUMA balancing a reasonable approach or
  should we look at a completely new approach?
- When accesses replace faults in NUMA balancing, a few things have
  to be tuned differently. All such decision points need to be
  identified and appropriate tuning needs to be done.
- Hardware provided access information could be very useful for driving
  hot page promotion in tiered memory systems. Need to check if this
  requires different tuning/heuristics apart from what NUMA balancing
  already does.
- Some of the values used to program the IBS counters, like the sampling
  period, may not be optimal. The sample period adjustment follows the
  same logic as the scan period modification, which may not be ideal.
  More experimentation is required to fine-tune all these aspects.
- Currently I am acting (i.e., attempting to migrate a page) on each sampled
  access. Need to check if it makes sense to delay this and do batched page
  migration.

This RFC is mainly about showing how hardware-provided access
information could be used for NUMA balancing, but I have run a
few basic benchmarks from mmtests to check whether this causes
any severe regression/overhead for any of them. Some benchmarks
show some improvement, some show no significant change and a few
regress. I am hopeful that with more appropriate tuning there is
scope for further improvement here, especially for workloads where
NUMA placement matters.

FWIW, here are the numbers in brief:
(1st column is default kernel, 2nd column is with this patchset)

kernbench
=========
                                6.2.0-rc5              6.2.0-rc5-ibs
Amean     user-512    19385.27 (   0.00%)    18140.69 *   6.42%*
Amean     syst-512    21620.40 (   0.00%)    19984.87 *   7.56%*
Amean     elsp-512       95.91 (   0.00%)       88.60 *   7.62%*

Duration User       19385.45    18140.89
Duration System     21620.90    19985.37
Duration Elapsed       96.52       89.20

Ops NUMA alloc hit                 552153976.00   499596610.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local               552152782.00   499595620.00
Ops NUMA base-page range updates      758004.00           0.00
Ops NUMA PTE updates                  758004.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                  215654.00     1797848.00
Ops NUMA hint local faults %            2054.00     1775103.00
Ops NUMA hint local percent                0.95          98.73
Ops NUMA pages migrated               213600.00       22745.00
Ops AutoNUMA cost                       1087.63        8989.67

autonumabench
=============
Amean     syst-NUMA01                90516.91 (   0.00%)    65272.04 *  27.89%*
Amean     syst-NUMA01_THREADLOCAL        0.26 (   0.00%)        0.27 *  -3.80%*
Amean     syst-NUMA02                    1.10 (   0.00%)        1.02 *   7.24%*
Amean     syst-NUMA02_SMT                0.74 (   0.00%)        0.90 * -21.77%*
Amean     elsp-NUMA01                  747.73 (   0.00%)      625.29 *  16.37%*
Amean     elsp-NUMA01_THREADLOCAL        1.07 (   0.00%)        1.07 *  -0.13%*
Amean     elsp-NUMA02                    1.75 (   0.00%)        1.72 *   1.96%*
Amean     elsp-NUMA02_SMT                3.03 (   0.00%)        3.04 *  -0.47%*

Duration User     1312937.34  1148196.94
Duration System    633634.59   456921.29
Duration Elapsed     5289.47     4427.82

Ops NUMA alloc hit                1115625106.00   704004226.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local               599879745.00   459968338.00
Ops NUMA base-page range updates    74310268.00           0.00
Ops NUMA PTE updates                74310268.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults               110504178.00    27624054.00
Ops NUMA hint local faults %        54257985.00    17310888.00
Ops NUMA hint local percent               49.10          62.67
Ops NUMA pages migrated             11399016.00     7983717.00
Ops AutoNUMA cost                     553257.64      138271.96

tbench4 Latency
===============
Amean     latency-1           0.08 (   0.00%)        0.08 *   1.43%*
Amean     latency-2           0.10 (   0.00%)        0.11 *  -2.75%*
Amean     latency-4           0.14 (   0.00%)        0.13 *   4.31%*
Amean     latency-8           0.14 (   0.00%)        0.14 *  -0.94%*
Amean     latency-16          0.20 (   0.00%)        0.19 *   8.01%*
Amean     latency-32          0.24 (   0.00%)        0.20 *  12.92%*
Amean     latency-64          0.34 (   0.00%)        0.28 *  18.30%*
Amean     latency-128         1.71 (   0.00%)        1.44 *  16.04%*
Amean     latency-256         0.52 (   0.00%)        0.69 * -32.26%*
Amean     latency-512         3.27 (   0.00%)        5.32 * -62.62%*
Amean     latency-1024        0.00 (   0.00%)        0.00 *   0.00%*
Amean     latency-2048        0.00 (   0.00%)        0.00 *   0.00%*

tbench4 Throughput
==================
Hmean     1         504.57 (   0.00%)      496.80 *  -1.54%*
Hmean     2        1006.46 (   0.00%)      990.04 *  -1.63%*
Hmean     4        1855.11 (   0.00%)     1933.76 *   4.24%*
Hmean     8        3711.49 (   0.00%)     3582.32 *  -3.48%*
Hmean     16       6707.58 (   0.00%)     6674.46 *  -0.49%*
Hmean     32      13146.81 (   0.00%)    12649.49 *  -3.78%*
Hmean     64      20922.72 (   0.00%)    22605.55 *   8.04%*
Hmean     128     33637.07 (   0.00%)    37870.35 *  12.59%*
Hmean     256     54083.12 (   0.00%)    50257.25 *  -7.07%*
Hmean     512     72455.66 (   0.00%)    53141.88 * -26.66%*
Hmean     1024   124413.95 (   0.00%)   117398.40 *  -5.64%*
Hmean     2048   124481.61 (   0.00%)   124892.12 *   0.33%*

Ops NUMA alloc hit                2092196681.00  2007852353.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local              2092193601.00  2007849231.00
Ops NUMA base-page range updates      298999.00           0.00
Ops NUMA PTE updates                  298999.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                  287539.00     4499166.00
Ops NUMA hint local faults %           98931.00     4349685.00
Ops NUMA hint local percent               34.41          96.68
Ops NUMA pages migrated               169086.00      149476.00
Ops AutoNUMA cost                       1443.00       22498.67

Duration User       23999.54    24476.30
Duration System    160480.07   164366.91
Duration Elapsed     2685.19     2685.69

netperf-udp
===========
Hmean     send-64         226.57 (   0.00%)      225.41 *  -0.51%*
Hmean     send-128        450.89 (   0.00%)      448.90 *  -0.44%*
Hmean     send-256        899.63 (   0.00%)      898.02 *  -0.18%*
Hmean     send-1024      3510.63 (   0.00%)     3526.24 *   0.44%*
Hmean     send-2048      6493.15 (   0.00%)     6493.27 *   0.00%*
Hmean     send-3312      9778.22 (   0.00%)     9801.03 *   0.23%*
Hmean     send-4096     11523.43 (   0.00%)    11490.57 *  -0.29%*
Hmean     send-8192     18666.11 (   0.00%)    18686.99 *   0.11%*
Hmean     send-16384    28112.56 (   0.00%)    28223.81 *   0.40%*
Hmean     recv-64         226.57 (   0.00%)      225.41 *  -0.51%*
Hmean     recv-128        450.88 (   0.00%)      448.90 *  -0.44%*
Hmean     recv-256        899.63 (   0.00%)      898.01 *  -0.18%*
Hmean     recv-1024      3510.61 (   0.00%)     3526.21 *   0.44%*
Hmean     recv-2048      6493.07 (   0.00%)     6493.15 *   0.00%*
Hmean     recv-3312      9777.95 (   0.00%)     9800.85 *   0.23%*
Hmean     recv-4096     11522.87 (   0.00%)    11490.47 *  -0.28%*
Hmean     recv-8192     18665.83 (   0.00%)    18686.56 *   0.11%*
Hmean     recv-16384    28112.13 (   0.00%)    28223.73 *   0.40%*

Duration User          48.52       48.74
Duration System       931.24      925.83
Duration Elapsed     1934.05     1934.79

Ops NUMA alloc hit                  60042365.00    60144256.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                60042305.00    60144228.00
Ops NUMA base-page range updates        6630.00           0.00
Ops NUMA PTE updates                    6630.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                    5709.00       26249.00
Ops NUMA hint local faults %            3030.00       25130.00
Ops NUMA hint local percent               53.07          95.74
Ops NUMA pages migrated                 2500.00        1119.00
Ops AutoNUMA cost                         28.64         131.27

netperf-udp-rr
==============
Hmean     1   132319.16 (   0.00%)   130621.99 *  -1.28%*

Duration User           9.92        9.97
Duration System       118.02      119.26
Duration Elapsed      432.12      432.91

Ops NUMA alloc hit                    289650.00      289222.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                  289642.00      289222.00
Ops NUMA base-page range updates           1.00           0.00
Ops NUMA PTE updates                       1.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                       1.00          51.00
Ops NUMA hint local faults %               0.00          45.00
Ops NUMA hint local percent                0.00          88.24
Ops NUMA pages migrated                    1.00           6.00
Ops AutoNUMA cost                          0.01           0.26

netperf-tcp-rr
==============
Hmean     1   118141.46 (   0.00%)   115515.41 *  -2.22%*

Duration User           9.59        9.52
Duration System       120.32      121.66
Duration Elapsed      432.20      432.49

Ops NUMA alloc hit                    291257.00      290927.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                  291233.00      290923.00
Ops NUMA base-page range updates           2.00           0.00
Ops NUMA PTE updates                       2.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                       2.00          46.00
Ops NUMA hint local faults %               0.00          42.00
Ops NUMA hint local percent                0.00          91.30
Ops NUMA pages migrated                    2.00           4.00
Ops AutoNUMA cost                          0.01           0.23

dbench
======
dbench4 Latency

Amean     latency-1          2.13 (   0.00%)       10.92 *-411.44%*
Amean     latency-2         12.03 (   0.00%)        8.17 *  32.07%*
Amean     latency-4         21.12 (   0.00%)        9.60 *  54.55%*
Amean     latency-8         41.20 (   0.00%)       33.59 *  18.45%*
Amean     latency-16        76.85 (   0.00%)       75.84 *   1.31%*
Amean     latency-32        91.68 (   0.00%)       90.26 *   1.55%*
Amean     latency-64       124.61 (   0.00%)      113.31 *   9.07%*
Amean     latency-128      140.14 (   0.00%)      126.29 *   9.89%*
Amean     latency-256      155.63 (   0.00%)      142.11 *   8.69%*
Amean     latency-512      258.60 (   0.00%)      243.13 *   5.98%*

dbench4 Throughput (misleading but traditional)

Hmean     1        987.47 (   0.00%)      938.07 *  -5.00%*
Hmean     2       1750.10 (   0.00%)     1697.08 *  -3.03%*
Hmean     4       2990.33 (   0.00%)     3023.23 *   1.10%*
Hmean     8       3557.38 (   0.00%)     3863.32 *   8.60%*
Hmean     16      2705.90 (   0.00%)     2660.48 *  -1.68%*
Hmean     32      2954.08 (   0.00%)     3101.59 *   4.99%*
Hmean     64      3061.68 (   0.00%)     3206.15 *   4.72%*
Hmean     128     2867.74 (   0.00%)     3080.21 *   7.41%*
Hmean     256     2585.58 (   0.00%)     2875.44 *  11.21%*
Hmean     512     1777.80 (   0.00%)     1777.79 *  -0.00%*

Duration User        2359.02     2246.44
Duration System     18927.83    16856.91
Duration Elapsed     1901.54     1901.44

Ops NUMA alloc hit                 240556255.00   255283721.00
Ops NUMA alloc miss                   408851.00       62903.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local               240547816.00   255264974.00
Ops NUMA base-page range updates      204316.00           0.00
Ops NUMA PTE updates                  204316.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                  201101.00      287642.00
Ops NUMA hint local faults %          104199.00      153547.00
Ops NUMA hint local percent               51.81          53.38
Ops NUMA pages migrated                96158.00      134083.00
Ops AutoNUMA cost                       1008.76        1440.76

Bharata B Rao (5):
  x86/ibs: In-kernel IBS driver for page access profiling
  x86/ibs: Drive NUMA balancing via IBS access data
  x86/ibs: Enable per-process IBS from sched switch path
  x86/ibs: Adjust access faults sampling period
  x86/ibs: Delay the collection of HW-provided access info

 arch/x86/events/amd/ibs.c        |   6 +
 arch/x86/include/asm/msr-index.h |  12 ++
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/ibs.c                | 250 +++++++++++++++++++++++++++++++
 include/linux/migrate.h          |   1 +
 include/linux/mm.h               |   2 +
 include/linux/mm_types.h         |   3 +
 include/linux/sched.h            |   4 +
 include/linux/vm_event_item.h    |  12 ++
 kernel/sched/core.c              |   1 +
 kernel/sched/debug.c             |  10 ++
 kernel/sched/fair.c              | 142 ++++++++++++++++--
 kernel/sched/sched.h             |   9 ++
 mm/memory.c                      |  92 ++++++++++++
 mm/vmstat.c                      |  12 ++
 15 files changed, 544 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/mm/ibs.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
@ 2023-02-08  7:35 ` Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 2/5] x86/ibs: Drive NUMA balancing via IBS access data Bharata B Rao
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Use the IBS (Instruction-Based Sampling) feature present
in AMD processors for memory access tracking. The access
information obtained from IBS will be used in subsequent
patches to drive NUMA balancing.

An NMI handler is registered to obtain the IBS data. The
handler does not do much yet: it just filters out the
non-useful samples and collects some stats. This patch
only builds the framework; IBS execution sampling is
enabled in a subsequent patch.

TODOs
-----
1. Perf also uses IBS. For the purpose of this prototype
   just disable the use of IBS in perf. This needs to be
   done cleanly.
2. Only the required MSR bits are defined here.

About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register. When the
programmed number of ops has been dispatched, a micro-op gets
tagged, various information about the tagged micro-op's
execution is populated in the IBS execution MSRs and an
interrupt is raised. While IBS provides a lot of data for each
sample, for the purpose of memory access profiling we are
interested in the linear and physical address of the memory
access that reached DRAM. Recent AMD processors provide further
filtering where it is possible to limit the sampling to those
ops that had an L3 miss, which greatly reduces the number of
non-useful samples.

While IBS provides the capability to sample both instruction
fetch and instruction execution, only IBS execution sampling
is used here to collect data about memory accesses that occur
during instruction execution.
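
As a rough illustration (this is not part of this patch; it mirrors
the programming done by a later patch in this series), enabling IBS
execution sampling boils down to writing the desired sample period
into the op control MSR, e.g. for the 10000-op period used later:

	u64 ctl;

	/* IbsOpMaxCnt holds bits [19:4] of the desired op count */
	ctl  = (10000 >> 4) & IBS_OP_MAX_CNT;
	/* the upper bits of the count go in via the extended mask */
	ctl |= 10000 & IBS_OP_MAX_CNT_EXT_MASK;
	ctl |= IBS_OP_CNT_CTL;		/* count dispatched ops, not cycles */
	ctl |= IBS_OP_L3MISSONLY;	/* where supported: only L3-missing ops */
	wrmsrl(MSR_AMD64_IBSOPCTL, ctl | IBS_OP_ENABLE);

The IBS_OP_* and MSR_AMD64_IBSOPCTL definitions used above already
exist in asm/perf_event.h and asm/msr-index.h respectively.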

More information about IBS is available in Sec 13.3 of the
AMD64 Architecture Programmer's Manual, Volume 2: System
Programming, which is available at:
https://bugzilla.kernel.org/attachment.cgi?id=288923

Information about the MSRs used for programming IBS can be
found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
Model 11h B1, which is currently available at:
https://www.amd.com/system/files/TechDocs/55901_0.25.zip

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/events/amd/ibs.c        |   6 ++
 arch/x86/include/asm/msr-index.h |  12 +++
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/ibs.c                | 169 +++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h    |  11 ++
 mm/vmstat.c                      |  11 ++
 6 files changed, 210 insertions(+)
 create mode 100644 arch/x86/mm/ibs.c

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index da3f5ebac4e1..290e6d221844 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -1512,6 +1512,12 @@ static __init int amd_ibs_init(void)
 {
 	u32 caps;
 
+	/*
+	 * TODO: Find a clean way to disable perf IBS so that IBS
+	 * can be used for NUMA balancing.
+	 */
+	return 0;
+
 	caps = __get_ibs_caps();
 	if (!caps)
 		return -ENODEV;	/* ibs not supported by the cpu */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 37ff47552bcb..443d4cf73366 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -593,6 +593,18 @@
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
 
+/* AMD IBS MSR bits */
+#define MSR_AMD64_IBSOPDATA2_DATASRC			0x7
+#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM		0x3
+#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE	0x5
+
+#define MSR_AMD64_IBSOPDATA3_LDOP		BIT_ULL(0)
+#define MSR_AMD64_IBSOPDATA3_STOP		BIT_ULL(1)
+#define MSR_AMD64_IBSOPDATA3_DCMISS		BIT_ULL(7)
+#define MSR_AMD64_IBSOPDATA3_LADDR_VALID	BIT_ULL(17)
+#define MSR_AMD64_IBSOPDATA3_PADDR_VALID	BIT_ULL(18)
+#define MSR_AMD64_IBSOPDATA3_L2MISS		BIT_ULL(20)
+
 /* Fam 17h MSRs */
 #define MSR_F17H_IRPERF			0xc00000e9
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..e74b95a57d86 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -27,6 +27,7 @@ endif
 obj-y				:=  init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \
 				    pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o
 
+obj-$(CONFIG_NUMA_BALANCING)	+= ibs.o
 obj-y				+= pat/
 
 # Make sure __phys_addr has no stackprotector
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
new file mode 100644
index 000000000000..411dba2a88d1
--- /dev/null
+++ b/arch/x86/mm/ibs.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+
+#include <asm/nmi.h>
+#include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
+#include <asm/apic.h>
+
+static u64 ibs_config __read_mostly;
+
+static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	u64 ops_ctl, ops_data3, ops_data2;
+	u64 remote_access;
+	u64 laddr = -1, paddr = -1;
+	struct mm_struct *mm = current->mm;
+
+	rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+
+	/*
+	 * When IBS sampling period is reprogrammed via read-modify-update
+	 * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with
+	 * IBS_OP_ENABLE not set. For such cases, return as HANDLED.
+	 *
+	 * With this, the handler will say "handled" even for NMIs that
+	 * aren't related to IBS.  This stems from the limitation of
+	 * having both status and control bits in one MSR.
+	 */
+	if (!(ops_ctl & IBS_OP_VAL))
+		goto handled;
+
+	wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL);
+
+	count_vm_event(IBS_NR_EVENTS);
+
+	if (!mm) {
+		count_vm_event(IBS_KTHREAD);
+		goto handled;
+	}
+
+	rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3);
+
+	/* Load/Store ops only */
+	if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP |
+			   MSR_AMD64_IBSOPDATA3_STOP))) {
+		count_vm_event(IBS_NON_LOAD_STORES);
+		goto handled;
+	}
+
+	/* Discard the sample if it was L1 or L2 hit */
+	if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS |
+			   MSR_AMD64_IBSOPDATA3_L2MISS))) {
+		count_vm_event(IBS_DC_L2_HITS);
+		goto handled;
+	}
+
+	rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2);
+	remote_access = ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC;
+
+	/* Consider only DRAM accesses, exclude cache accesses from near ccx */
+	if (remote_access < MSR_AMD64_IBSOPDATA2_DATASRC_DRAM) {
+		count_vm_event(IBS_NEAR_CACHE_HITS);
+		goto handled;
+	}
+
+	/* Exclude hits from peer cache in far ccx */
+	if (remote_access == MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE) {
+		count_vm_event(IBS_FAR_CACHE_HITS);
+		goto handled;
+	}
+
+	/* Is linear addr valid? */
+	if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID)
+		rdmsrl(MSR_AMD64_IBSDCLINAD, laddr);
+	else {
+		count_vm_event(IBS_LADDR_INVALID);
+		goto handled;
+	}
+
+	/* Discard kernel address accesses */
+	if (laddr & (1UL << 63)) {
+		count_vm_event(IBS_KERNEL_ADDR);
+		goto handled;
+	}
+
+	/* Is phys addr valid? */
+	if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID)
+		rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr);
+	else
+		count_vm_event(IBS_PADDR_INVALID);
+
+handled:
+	return NMI_HANDLED;
+}
+
+static inline int get_ibs_lvt_offset(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_AMD64_IBSCTL, val);
+	if (!(val & IBSCTL_LVT_OFFSET_VALID))
+		return -EINVAL;
+
+	return val & IBSCTL_LVT_OFFSET_MASK;
+}
+
+static void setup_APIC_ibs(void)
+{
+	int offset;
+
+	offset = get_ibs_lvt_offset();
+	if (offset < 0)
+		goto failed;
+
+	if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
+		return;
+failed:
+	pr_warn("IBS APIC setup failed on cpu #%d\n",
+		smp_processor_id());
+}
+
+static void clear_APIC_ibs(void)
+{
+	int offset;
+
+	offset = get_ibs_lvt_offset();
+	if (offset >= 0)
+		setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1);
+}
+
+static int x86_amd_ibs_access_profile_startup(unsigned int cpu)
+{
+	setup_APIC_ibs();
+	return 0;
+}
+
+static int x86_amd_ibs_access_profile_teardown(unsigned int cpu)
+{
+	clear_APIC_ibs();
+	return 0;
+}
+
+int __init ibs_access_profiling_init(void)
+{
+	u32 caps;
+
+	ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
+
+	if (!boot_cpu_has(X86_FEATURE_IBS)) {
+		pr_info("IBS capability is unavailable for access profiling\n");
+		return 0;
+	}
+
+	caps = cpuid_eax(IBS_CPUID_FEATURES);
+	if (caps & IBS_CAPS_ZEN4)
+		ibs_config |= IBS_OP_L3MISSONLY;
+
+	register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
+
+	cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+			  "x86/amd/ibs_access_profile:starting",
+			  x86_amd_ibs_access_profile_startup,
+			  x86_amd_ibs_access_profile_teardown);
+
+	pr_info("IBS access profiling setup for NUMA Balancing\n");
+	return 0;
+}
+
+arch_initcall(ibs_access_profiling_init);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..1d55e347d16c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+#ifdef CONFIG_NUMA_BALANCING
+		IBS_NR_EVENTS,
+		IBS_KTHREAD,
+		IBS_NON_LOAD_STORES,
+		IBS_DC_L2_HITS,
+		IBS_NEAR_CACHE_HITS,
+		IBS_FAR_CACHE_HITS,
+		IBS_LADDR_INVALID,
+		IBS_KERNEL_ADDR,
+		IBS_PADDR_INVALID,
+#endif
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..c7a9d0d9ade8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1398,6 +1398,17 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
+#ifdef CONFIG_NUMA_BALANCING
+	"ibs_nr_events",
+	"ibs_kthread",
+	"ibs_non_load_stores",
+	"ibs_dc_l2_hits",
+	"ibs_near_cache_hits",
+	"ibs_far_cache_hits",
+	"ibs_invalid_laddr",
+	"ibs_kernel_addr",
+	"ibs_invalid_paddr",
+#endif
 #endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 2/5] x86/ibs: Drive NUMA balancing via IBS access data
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling Bharata B Rao
@ 2023-02-08  7:35 ` Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 3/5] x86/ibs: Enable per-process IBS from sched switch path Bharata B Rao
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Feed the page access data obtained from IBS to NUMA balancing
as hint-fault equivalents. The existing per-task and per-group
fault stats are now built from IBS-provided page access information.
With this, it is no longer necessary to scan the address space to
introduce NUMA hinting faults.

Use the task_work framework to process the IBS-sampled data. Actual
programming of IBS to generate page access information isn't
done yet.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/mm/ibs.c             | 38 ++++++++++++++-
 include/linux/migrate.h       |  1 +
 include/linux/sched.h         |  1 +
 include/linux/vm_event_item.h |  1 +
 kernel/sched/fair.c           | 10 ++++
 mm/memory.c                   | 92 +++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |  1 +
 7 files changed, 143 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index 411dba2a88d1..adbc587b1767 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include <linux/init.h>
+#include <linux/migrate.h>
+#include <linux/task_work.h>
 
 #include <asm/nmi.h>
 #include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
@@ -8,12 +10,30 @@
 
 static u64 ibs_config __read_mostly;
 
+struct ibs_access_work {
+	struct callback_head work;
+	u64 laddr, paddr;
+};
+
+void task_ibs_access_work(struct callback_head *work)
+{
+	struct ibs_access_work *iwork = container_of(work, struct ibs_access_work, work);
+	struct task_struct *p = current;
+
+	u64 laddr = iwork->laddr;
+	u64 paddr = iwork->paddr;
+
+	kfree(iwork);
+	do_numa_access(p, laddr, paddr);
+}
+
 static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
 {
 	u64 ops_ctl, ops_data3, ops_data2;
 	u64 remote_access;
 	u64 laddr = -1, paddr = -1;
 	struct mm_struct *mm = current->mm;
+	struct ibs_access_work *iwork;
 
 	rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
 
@@ -86,8 +106,24 @@ static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
 	/* Is phys addr valid? */
 	if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID)
 		rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr);
-	else
+	else {
 		count_vm_event(IBS_PADDR_INVALID);
+		goto handled;
+	}
+
+	/*
+	 * TODO: GFP_ATOMIC!
+	 */
+	iwork = kzalloc(sizeof(*iwork), GFP_ATOMIC);
+	if (!iwork)
+		goto handled;
+
+	count_vm_event(IBS_USEFUL_SAMPLES);
+
+	iwork->laddr = laddr;
+	iwork->paddr = paddr;
+	init_task_work(&iwork->work, task_ibs_access_work);
+	task_work_add(current, &iwork->work, TWA_RESUME);
 
 handled:
 	return NMI_HANDLED;
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3ef77f52a4f0..4dcce7885b0c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -216,6 +216,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
 			unsigned long npages);
 void migrate_device_finalize(unsigned long *src_pfns,
 			unsigned long *dst_pfns, unsigned long npages);
+void do_numa_access(struct task_struct *p, u64 laddr, u64 paddr);
 
 #endif /* CONFIG_MIGRATION */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..19dd4ee07436 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2420,4 +2420,5 @@ static inline void sched_core_fork(struct task_struct *p) { }
 
 extern void sched_set_stop_task(int cpu, struct task_struct *stop);
 
+DECLARE_STATIC_KEY_FALSE(hw_access_hints);
 #endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1d55e347d16c..2ccc7dee3c13 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -159,6 +159,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		IBS_LADDR_INVALID,
 		IBS_KERNEL_ADDR,
 		IBS_PADDR_INVALID,
+		IBS_USEFUL_SAMPLES,
 #endif
 #endif
 		NR_VM_EVENT_ITEMS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f8736991427..c9b9e62da779 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/migrate.h>
 
 #include <asm/switch_to.h>
 
@@ -3125,6 +3126,8 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	}
 }
 
+DEFINE_STATIC_KEY_FALSE(hw_access_hints);
+
 /*
  * Drive the periodic memory faults..
  */
@@ -3133,6 +3136,13 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	struct callback_head *work = &curr->numa_work;
 	u64 period, now;
 
+	/*
+	 * If we are using access hints from hardware (like using
+	 * IBS), don't scan the address space.
+	 */
+	if (static_branch_unlikely(&hw_access_hints))
+		return;
+
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..79096aba197c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4668,6 +4668,98 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	return mpol_misplaced(page, vma, addr);
 }
 
+/*
+ * Called from task_work context to act upon the page access.
+ *
+ * Physical address (provided by IBS) is used directly instead
+ * of walking the page tables to get to the PTE/page. Hence we
+ * don't check if PTE is writable for the TNF_NO_GROUP
+ * optimization, which means RO pages are considered for grouping.
+ */
+void do_numa_access(struct task_struct *p, u64 laddr, u64 paddr)
+{
+	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	struct page *page = NULL;
+	int page_nid = NUMA_NO_NODE;
+	int last_cpupid;
+	int target_nid;
+	int flags = 0;
+
+	if (!mm)
+		return;
+
+	if (!mmap_read_trylock(mm))
+		return;
+
+	vma = find_vma(mm, laddr);
+	if (!vma)
+		goto out_unlock;
+
+	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+		is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP))
+		goto out_unlock;
+
+	if (!vma->vm_mm ||
+	    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+		goto out_unlock;
+
+	if (!vma_is_accessible(vma))
+		goto out_unlock;
+
+	page = pfn_to_online_page(PHYS_PFN(paddr));
+	if (!page || is_zone_device_page(page))
+		goto out_unlock;
+
+	if (unlikely(!PageLRU(page)))
+		goto out_unlock;
+
+	/* TODO: handle PTE-mapped THP */
+	if (PageCompound(page))
+		goto out_unlock;
+
+	/*
+	 * Flag if the page is shared between multiple address spaces. This
+	 * is later used when determining whether to group tasks together
+	 */
+	if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+		flags |= TNF_SHARED;
+
+	last_cpupid = page_cpupid_last(page);
+	page_nid = page_to_nid(page);
+
+	/*
+	 * For memory tiering mode, cpupid of slow memory page is used
+	 * to record page access time.  So use default value.
+	 */
+	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+	    !node_is_toptier(page_nid))
+		last_cpupid = (-1 & LAST_CPUPID_MASK);
+	else
+		last_cpupid = page_cpupid_last(page);
+
+	target_nid = numa_migrate_prep(page, vma, laddr, page_nid, &flags);
+	if (target_nid == NUMA_NO_NODE) {
+		put_page(page);
+		goto out;
+	}
+
+	/* Migrate to the requested node */
+	if (migrate_misplaced_page(page, vma, target_nid)) {
+		page_nid = target_nid;
+		flags |= TNF_MIGRATED;
+	} else {
+		flags |= TNF_MIGRATE_FAIL;
+	}
+
+out:
+	if (page_nid != NUMA_NO_NODE)
+		task_numa_fault(last_cpupid, page_nid, 1, flags);
+
+out_unlock:
+	mmap_read_unlock(mm);
+}
+
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c7a9d0d9ade8..33738426ae48 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
 	"ibs_invalid_laddr",
 	"ibs_kernel_addr",
 	"ibs_invalid_paddr",
+	"ibs_useful_samples",
 #endif
 #endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 3/5] x86/ibs: Enable per-process IBS from sched switch path
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 2/5] x86/ibs: Drive NUMA balancing via IBS access data Bharata B Rao
@ 2023-02-08  7:35 ` Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 4/5] x86/ibs: Adjust access faults sampling period Bharata B Rao
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Program IBS for access profiling of threads from the
task sched switch path. IBS is programmed with the sample
period that corresponds to the incoming thread. Kernel
threads are excluded from this.

The sample period is currently kept at a fixed value of 10000.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/mm/ibs.c     | 27 +++++++++++++++++++++++++++
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  5 +++++
 5 files changed, 35 insertions(+)

diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index adbc587b1767..a479029e9262 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -8,6 +8,7 @@
 #include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
 #include <asm/apic.h>
 
+#define IBS_SAMPLE_PERIOD      10000
 static u64 ibs_config __read_mostly;
 
 struct ibs_access_work {
@@ -15,6 +16,31 @@ struct ibs_access_work {
 	u64 laddr, paddr;
 };
 
+void hw_access_sched_in(struct task_struct *prev, struct task_struct *curr)
+{
+	u64 config = 0;
+	unsigned int period;
+
+	if (!static_branch_unlikely(&hw_access_hints))
+		return;
+
+	/* Disable IBS for kernel thread */
+	if (!curr->mm)
+		goto out;
+
+	if (curr->numa_sample_period)
+		period = curr->numa_sample_period;
+	else
+		period = IBS_SAMPLE_PERIOD;
+
+
+	config = (period >> 4)  & IBS_OP_MAX_CNT;
+	config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
+	config |= ibs_config;
+out:
+	wrmsrl(MSR_AMD64_IBSOPCTL, config);
+}
+
 void task_ibs_access_work(struct callback_head *work)
 {
 	struct ibs_access_work *iwork = container_of(work, struct ibs_access_work, work);
@@ -198,6 +224,7 @@ int __init ibs_access_profiling_init(void)
 			  x86_amd_ibs_access_profile_startup,
 			  x86_amd_ibs_access_profile_teardown);
 
+	static_branch_enable(&hw_access_hints);
 	pr_info("IBS access profiling setup for NUMA Balancing\n");
 	return 0;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 19dd4ee07436..66c532418d38 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1254,6 +1254,7 @@ struct task_struct {
 	int				numa_scan_seq;
 	unsigned int			numa_scan_period;
 	unsigned int			numa_scan_period_max;
+	unsigned int			numa_sample_period;
 	int				numa_preferred_nid;
 	unsigned long			numa_migrate_retry;
 	/* Migration stamp: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838feb6adc5..1c13fed8bebc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5165,6 +5165,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = READ_ONCE(prev->__state);
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	hw_access_sched_in(prev, current);
 	finish_task(prev);
 	tick_nohz_task_switch();
 	finish_lock_switch(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9b9e62da779..3f617c799821 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3094,6 +3094,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->node_stamp			= 0;
 	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
+	p->numa_sample_period		= 0;
 	p->numa_migrate_retry		= 0;
 	/* Protect against double add, see task_tick_numa and task_numa_work */
 	p->numa_work.next		= &p->numa_work;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..953d16c802d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1723,11 +1723,16 @@ extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *p, struct task_struct *t,
 			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
+void hw_access_sched_in(struct task_struct *prev, struct task_struct *curr);
 #else
 static inline void
 init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 }
+static inline void hw_access_sched_in(struct task_struct *prev,
+				      struct task_struct *curr)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 4/5] x86/ibs: Adjust access faults sampling period
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
                   ` (2 preceding siblings ...)
  2023-02-08  7:35 ` [RFC PATCH 3/5] x86/ibs: Enable per-process IBS from sched switch path Bharata B Rao
@ 2023-02-08  7:35 ` Bharata B Rao
  2023-02-08  7:35 ` [RFC PATCH 5/5] x86/ibs: Delay the collection of HW-provided access info Bharata B Rao
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Adjust the access faults sampling period of a thread to be within
the fixed minimum and maximum values. The adjustment logic uses the
private/shared and local/remote access fault stats. The algorithm
is the same as the logic followed to adjust the scan period.

Unlike hinting faults, the min and max sampling period aren't
adjusted (yet) for access-based sampling.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/sched.h |   2 +
 kernel/sched/debug.c  |   8 +++
 kernel/sched/fair.c   | 130 +++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |   4 ++
 4 files changed, 130 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66c532418d38..101c6377abbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1257,6 +1257,8 @@ struct task_struct {
 	unsigned int			numa_sample_period;
 	int				numa_preferred_nid;
 	unsigned long			numa_migrate_retry;
+	unsigned int			numa_access_faults;
+	unsigned int			numa_access_faults_window;
 	/* Migration stamp: */
 	u64				node_stamp;
 	u64				last_task_numa_placement;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..1cf19778a232 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -334,6 +334,14 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
+	debugfs_create_u32("sample_period_def", 0644, numa,
+			   &sysctl_numa_balancing_sample_period_def);
+	debugfs_create_u32("sample_period_min", 0644, numa,
+			   &sysctl_numa_balancing_sample_period_min);
+	debugfs_create_u32("sample_period_max", 0644, numa,
+			   &sysctl_numa_balancing_sample_period_max);
+	debugfs_create_u32("access_faults_threshold", 0644, numa,
+			   &sysctl_numa_balancing_access_faults_threshold);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f617c799821..1b0665b034d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1093,6 +1093,11 @@ adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_NUMA_BALANCING
+unsigned int sysctl_numa_balancing_sample_period_def = 10000;
+unsigned int sysctl_numa_balancing_sample_period_min = 5000;
+unsigned int sysctl_numa_balancing_sample_period_max = 20000;
+unsigned int sysctl_numa_balancing_access_faults_threshold = 250;
+
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
  * calculated based on the tasks virtual memory size and
@@ -1572,6 +1577,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	struct numa_group *ng = deref_curr_numa_group(p);
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
+	bool early = false;
 
 	/*
 	 * The pages in slow memory node should be migrated according
@@ -1611,13 +1617,21 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	    !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
 		return false;
 
+	if (static_branch_unlikely(&hw_access_hints)) {
+		if (p->numa_access_faults < sysctl_numa_balancing_access_faults_threshold * 4)
+			early = true;
+	} else {
+		if (p->numa_scan_seq <= 4)
+			early = true;
+	}
+
 	/*
 	 * Allow first faults or private faults to migrate immediately early in
 	 * the lifetime of a task. The magic number 4 is based on waiting for
 	 * two full passes of the "multi-stage node selection" test that is
 	 * executed below.
 	 */
-	if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
+	if ((p->numa_preferred_nid == NUMA_NO_NODE || early) &&
 	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
 		return true;
 
@@ -2305,7 +2319,11 @@ static void numa_migrate_preferred(struct task_struct *p)
 		return;
 
 	/* Periodically retry migrating the task to the preferred node */
-	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
+	if (static_branch_unlikely(&hw_access_hints))
+		interval = min(interval, msecs_to_jiffies(p->numa_sample_period) / 16);
+	else
+		interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
+
 	p->numa_migrate_retry = jiffies + interval;
 
 	/* Success if task is already running on preferred CPU */
@@ -2430,6 +2448,77 @@ static void update_task_scan_period(struct task_struct *p,
 	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 }
 
+static void update_task_sample_period(struct task_struct *p,
+			unsigned long shared, unsigned long private)
+{
+	unsigned int period_slot;
+	int lr_ratio, ps_ratio;
+	int diff;
+
+	unsigned long remote = p->numa_faults_locality[0];
+	unsigned long local = p->numa_faults_locality[1];
+
+	/*
+	 * If there were no access faults then either the task is
+	 * completely idle or all activity is in areas that are not of interest
+	 * to automatic numa balancing. Related to that, if there were failed
+	 * migrations then it implies we are migrating too quickly or the local
+	 * node is overloaded. In either case, increase the sampling period.
+	 */
+	if (local + shared == 0 || p->numa_faults_locality[2]) {
+		p->numa_sample_period = min(sysctl_numa_balancing_sample_period_max,
+		p->numa_sample_period << 1);
+		return;
+	}
+
+	/*
+	 * Prepare to scale the sample period relative to the current period.
+	 *	 == NUMA_PERIOD_THRESHOLD sample period stays the same
+	 *       <  NUMA_PERIOD_THRESHOLD sample period decreases
+	 *	 >= NUMA_PERIOD_THRESHOLD sample period increases
+	 */
+	period_slot = DIV_ROUND_UP(p->numa_sample_period, NUMA_PERIOD_SLOTS);
+	lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
+	ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+
+	if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+		/*
+		 * Most memory accesses are local. There is no need to
+		 * do fast access sampling, since memory is already local.
+		 */
+		int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+
+		if (!slot)
+			slot = 1;
+		diff = slot * period_slot;
+	} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
+		/*
+		 * Most memory accesses are shared with other tasks.
+		 * There is no point in continuing fast access sampling,
+		 * since other tasks may just move the memory elsewhere.
+		 */
+		int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
+
+		if (!slot)
+			slot = 1;
+		diff = slot * period_slot;
+	} else {
+		/*
+		 * Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS,
+		 * yet they are not on the local NUMA node. Speed up
+		 * access sampling to get the memory moved over.
+		 */
+		int ratio = max(lr_ratio, ps_ratio);
+
+		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+	}
+
+	p->numa_sample_period = clamp(p->numa_sample_period + diff,
+				      sysctl_numa_balancing_sample_period_min,
+				      sysctl_numa_balancing_sample_period_max);
+	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
+}
+
 /*
  * Get the fraction of time the task has been running since the last
  * NUMA placement cycle. The scheduler keeps similar statistics, but
@@ -2560,16 +2649,24 @@ static void task_numa_placement(struct task_struct *p)
 	spinlock_t *group_lock = NULL;
 	struct numa_group *ng;
 
-	/*
-	 * The p->mm->numa_scan_seq field gets updated without
-	 * exclusive access. Use READ_ONCE() here to ensure
-	 * that the field is read in a single access:
-	 */
-	seq = READ_ONCE(p->mm->numa_scan_seq);
-	if (p->numa_scan_seq == seq)
-		return;
-	p->numa_scan_seq = seq;
-	p->numa_scan_period_max = task_scan_max(p);
+	if (static_branch_unlikely(&hw_access_hints)) {
+		p->numa_access_faults_window++;
+		p->numa_access_faults++;
+		if (p->numa_access_faults_window < sysctl_numa_balancing_access_faults_threshold)
+			return;
+		p->numa_access_faults_window = 0;
+	} else {
+		/*
+		 * The p->mm->numa_scan_seq field gets updated without
+		 * exclusive access. Use READ_ONCE() here to ensure
+		 * that the field is read in a single access:
+		 */
+		seq = READ_ONCE(p->mm->numa_scan_seq);
+		if (p->numa_scan_seq == seq)
+			return;
+		p->numa_scan_seq = seq;
+		p->numa_scan_period_max = task_scan_max(p);
+	}
 
 	total_faults = p->numa_faults_locality[0] +
 		       p->numa_faults_locality[1];
@@ -2672,7 +2769,10 @@ static void task_numa_placement(struct task_struct *p)
 			sched_setnuma(p, max_nid);
 	}
 
-	update_task_scan_period(p, fault_types[0], fault_types[1]);
+	if (static_branch_unlikely(&hw_access_hints))
+		update_task_sample_period(p, fault_types[0], fault_types[1]);
+	else
+		update_task_scan_period(p, fault_types[0], fault_types[1]);
 }
 
 static inline int get_numa_group(struct numa_group *grp)
@@ -3094,7 +3194,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->node_stamp			= 0;
 	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
-	p->numa_sample_period		= 0;
+	p->numa_sample_period		= sysctl_numa_balancing_sample_period_def;
+	p->numa_access_faults           = 0;
+	p->numa_access_faults_window    = 0;
 	p->numa_migrate_retry		= 0;
 	/* Protect against double add, see task_tick_numa and task_numa_work */
 	p->numa_work.next		= &p->numa_work;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 953d16c802d6..0367dc727cc4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2473,6 +2473,10 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
+extern unsigned int sysctl_numa_balancing_sample_period_def;
+extern unsigned int sysctl_numa_balancing_sample_period_min;
+extern unsigned int sysctl_numa_balancing_sample_period_max;
+extern unsigned int sysctl_numa_balancing_access_faults_threshold;
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 5/5] x86/ibs: Delay the collection of HW-provided access info
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
                   ` (3 preceding siblings ...)
  2023-02-08  7:35 ` [RFC PATCH 4/5] x86/ibs: Adjust access faults sampling period Bharata B Rao
@ 2023-02-08  7:35 ` Bharata B Rao
  2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
  2023-02-13  3:26 ` Huang, Ying
  6 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-08  7:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mgorman, peterz, mingo, bp, dave.hansen, x86, akpm, luto, tglx,
	yue.li, Ravikumar.Bangoria, Bharata B Rao

Allow an initial delay before enabling the collection
of IBS-provided access info.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/mm/ibs.c        | 18 ++++++++++++++++++
 include/linux/mm.h       |  2 ++
 include/linux/mm_types.h |  3 +++
 kernel/sched/debug.c     |  2 ++
 kernel/sched/fair.c      |  3 +++
 5 files changed, 28 insertions(+)

diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index a479029e9262..dfe5246954c0 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -16,6 +16,21 @@ struct ibs_access_work {
 	u64 laddr, paddr;
 };
 
+static bool delay_hw_access_profiling(struct mm_struct *mm)
+{
+	unsigned long delay, now = jiffies;
+
+	if (!mm->numa_hw_access_delay)
+		mm->numa_hw_access_delay = now +
+			msecs_to_jiffies(sysctl_numa_balancing_access_faults_delay);
+
+	delay = mm->numa_hw_access_delay;
+	if (time_before(now, delay))
+		return true;
+
+	return false;
+}
+
 void hw_access_sched_in(struct task_struct *prev, struct task_struct *curr)
 {
 	u64 config = 0;
@@ -28,6 +43,9 @@ void hw_access_sched_in(struct task_struct *prev, struct task_struct *curr)
 	if (!curr->mm)
 		goto out;
 
+	if (delay_hw_access_profiling(curr->mm))
+		goto out;
+
 	if (curr->numa_sample_period)
 		period = curr->numa_sample_period;
 	else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..118705a296ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1397,6 +1397,8 @@ static inline int folio_nid(const struct folio *folio)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern unsigned int sysctl_numa_balancing_access_faults_delay;
+
 /* page access time bits needs to hold at least 4 seconds */
 #define PAGE_ACCESS_TIME_MIN_BITS	12
 #if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9757067c3053..8a2fb8bf2d62 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -750,6 +750,9 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
+
+		/* HW-provided access info is collected after this initial delay */
+		unsigned long numa_hw_access_delay;
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1cf19778a232..5c76a7594358 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -342,6 +342,8 @@ static __init int sched_init_debug(void)
 			   &sysctl_numa_balancing_sample_period_max);
 	debugfs_create_u32("access_faults_threshold", 0644, numa,
 			   &sysctl_numa_balancing_access_faults_threshold);
+	debugfs_create_u32("access_faults_delay", 0644, numa,
+			   &sysctl_numa_balancing_access_faults_delay);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b0665b034d0..2e2b1e706a24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1097,6 +1097,7 @@ unsigned int sysctl_numa_balancing_sample_period_def = 10000;
 unsigned int sysctl_numa_balancing_sample_period_min = 5000;
 unsigned int sysctl_numa_balancing_sample_period_max = 20000;
 unsigned int sysctl_numa_balancing_access_faults_threshold = 250;
+unsigned int sysctl_numa_balancing_access_faults_delay = 1000;
 
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -3189,6 +3190,8 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 		if (mm_users == 1) {
 			mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 			mm->numa_scan_seq = 0;
+			mm->numa_hw_access_delay = jiffies +
+				msecs_to_jiffies(sysctl_numa_balancing_access_faults_delay);
 		}
 	}
 	p->node_stamp			= 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
                   ` (4 preceding siblings ...)
  2023-02-08  7:35 ` [RFC PATCH 5/5] x86/ibs: Delay the collection of HW-provided access info Bharata B Rao
@ 2023-02-08 18:03 ` Peter Zijlstra
  2023-02-08 18:12   ` Dave Hansen
  2023-02-09  5:57   ` Bharata B Rao
  2023-02-13  3:26 ` Huang, Ying
  6 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2023-02-08 18:03 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:

> - Perf uses IBS and we are using the same IBS for access profiling here.
>   There needs to be a proper way to make the use mutually exclusive.

No, IFF this lives it needs to use in-kernel perf.

> - Is tying this up with NUMA balancing a reasonable approach or
>   should we look at a completely new approach?

Is it giving a sufficient win to be worth it? AFAICT it doesn't come
even close to justifying it.

> - Hardware provided access information could be very useful for driving
>   hot page promotion in tiered memory systems. Need to check if this
>   requires different tuning/heuristics apart from what NUMA balancing
>   already does.

I think Huang Ying looked at that from the Intel POV and I think the
conclusion was that it doesn't really work out. What you need is
frequency information, but the PMU doesn't really give you that. You
need to process a *ton* of PMU data in-kernel.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
@ 2023-02-08 18:12   ` Dave Hansen
  2023-02-09  6:04     ` Bharata B Rao
  2023-02-09  5:57   ` Bharata B Rao
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2023-02-08 18:12 UTC (permalink / raw)
  To: Peter Zijlstra, Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/8/23 10:03, Peter Zijlstra wrote:
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
> I think Huang Ying looked at that from the Intel POV and I think the
> conclusion was that it doesn't really work out. What you need is
> frequency information, but the PMU doesn't really give you that. You
> need to process a *ton* of PMU data in-kernel.

Yeah, there were two big problems.

First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
the PEBS records.  They had to be translated back to host addresses to
be usable.  That was extra expensive.

Second, it *did* take a lot of processing to turn raw memory accesses
into actionable frequency data.  That meant that we started in a hole
performance-wise and had to make *REALLY* good decisions about page
migration to make up for it.

The performance data here don't look awful, but they don't seem to add
up to a clear win either.  I'm having a hard time imagining who would
turn this on and how widely it would get used in practice.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
  2023-02-08 18:12   ` Dave Hansen
@ 2023-02-09  5:57   ` Bharata B Rao
  2023-02-13  2:56     ` Huang, Ying
  1 sibling, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-09  5:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
> 
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>   There needs to be a proper way to make the use mutually exclusive.
> 
> No, IFF this lives it needs to use in-kernel perf.

In fact I started out with in-kernel perf by using the
perf_event_create_kernel_counter() API. However, there are issues
with using in-kernel perf:

- We want to reprogram the counter potentially on every
context switch. The IBS hardware sample counter needs to be
reprogrammed based on the incoming thread's view of the sample period.
Additionally, sampling needs to be disabled for kernel threads.
So I wanted to use perf_event_enable/disable() and
perf_event_period(). However, they take mutexes and hence it is
not possible to use them from the sched switch atomic context.

- In-kernel perf gives a per-cpu counter, but we want it to count
based on the task that is currently running, i.e., the period should
be modified on a per-task basis. I don't see how an in-kernel perf
event counter can be associated with a task like this.
	
Hence I didn't see an easy option other than making the use of IBS
by perf and by NUMA balancing mutually exclusive.
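
As a rough sketch (not code from this patchset) of the in-kernel perf
approach that was attempted, and of the calls that cannot be made from
the sched switch path; the PMU type and the overflow handler are passed
in here because the lookup/wiring details are omitted:

#include <linux/perf_event.h>
#include <linux/err.h>

/*
 * Create a sampling counter on the IBS op PMU for one CPU. The caller
 * is assumed to have looked up the dynamic PMU type of "ibs_op" and to
 * supply an overflow handler that records the sampled access.
 */
static struct perf_event *ibs_op_counter_create(unsigned int ibs_op_pmu_type,
						int cpu, u64 sample_period,
						perf_overflow_handler_t handler)
{
	struct perf_event_attr attr = {
		.type		= ibs_op_pmu_type,
		.size		= sizeof(attr),
		.sample_period	= sample_period,
		.exclude_kernel	= 1,	/* sample user-mode execution only */
	};

	return perf_event_create_kernel_counter(&attr, cpu, NULL, handler, NULL);
}

/*
 * What we would want to do on every context switch, but cannot, because
 * all three helpers below take mutexes and may sleep:
 *
 *	perf_event_period(event, incoming_task_sample_period);
 *	perf_event_enable(event);	// incoming task is a user task
 *	perf_event_disable(event);	// incoming task is a kernel thread
 */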

> 
>> - Is tying this up with NUMA balancing a reasonable approach or
>>   should we look at a completely new approach?
> 
> Is it giving sufficient win to be worth it, afaict it doesn't come even
> close to justifying it.
> 
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
> 
> I think Huang Ying looked at that from the Intel POV and I think the
> conclusion was that it doesn't really work out. What you need is
> frequency information, but the PMU doesn't really give you that. You
> need to process a *ton* of PMU data in-kernel.

What I am doing here is feeding the access data into NUMA balancing,
which already has the logic to aggregate it at the task and numa group
level and decide whether an access is actionable in terms of migrating
the page. In this context, I am not sure about the frequency
information that you and Dave are mentioning. AFAIU, existing NUMA
balancing takes care of taking the action; IBS just becomes an
alternative source of access information to NUMA hint faults.

Thanks for your inputs.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-08 18:12   ` Dave Hansen
@ 2023-02-09  6:04     ` Bharata B Rao
  2023-02-09 14:28       ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-09  6:04 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/8/2023 11:42 PM, Dave Hansen wrote:
> On 2/8/23 10:03, Peter Zijlstra wrote:
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>> I think Huang Ying looked at that from the Intel POV and I think the
>> conclusion was that it doesn't really work out. What you need is
>> frequency information, but the PMU doesn't really give you that. You
>> need to process a *ton* of PMU data in-kernel.
> 
> Yeah, there were two big problems.
> 
> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
> the PEBS records.  They had to be translated back to host addresses to
> be usable.  That was extra expensive.

Just to be clear, I am using IBS in host only and it can give both virtual
and physical address.

> 
> Second, it *did* take a lot of processing to turn raw memory accesses
> into actionable frequency data.  That meant that we started in a hole
> performance-wise and had to make *REALLY* good decisions about page
> migration to make up for it.

I touched upon the frequency aspect in reply to Peter, but please let
me know if I am missing something.

> 
> The performance data here don't look awful, but they don't seem to add
> up to a clear win either.  I'm having a hard time imagining who would
> turn this on and how widely it would get used in practice.

I am hopeful that with more appropriate tuning of the NUMA balancing
logic to work with hardware-provided access info (as opposed to
scan-based NUMA hint faults), we should be able to see a clear win.
At least theoretically we wouldn't have the overheads of address space
scanning and hint fault handling.

Thanks for your inputs.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-09  6:04     ` Bharata B Rao
@ 2023-02-09 14:28       ` Dave Hansen
  2023-02-10  4:28         ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2023-02-09 14:28 UTC (permalink / raw)
  To: Bharata B Rao, Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/8/23 22:04, Bharata B Rao wrote:
>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>> the PEBS records.  They had to be translated back to host addresses to
>> be usable.  That was extra expensive.
> Just to be clear, I am using IBS in host only and it can give both virtual
> and physical address.

Could you talk a little bit about how IBS might get used for NUMA
balancing guest memory?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-09 14:28       ` Dave Hansen
@ 2023-02-10  4:28         ` Bharata B Rao
  2023-02-10  4:40           ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-10  4:28 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/9/2023 7:58 PM, Dave Hansen wrote:
> On 2/8/23 22:04, Bharata B Rao wrote:
>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>> the PEBS records.  They had to be translated back to host addresses to
>>> be usable.  That was extra expensive.
>> Just to be clear, I am using IBS in host only and it can give both virtual
>> and physical address.
> 
> Could you talk a little bit about how IBS might get used for NUMA
> balancing guest memory?

IBS can work for guest, but will not provide physical address. Also
the support for virtualized IBS isn't upstream yet.

However I am not sure how effective or useful NUMA balancing within a guest
is, as the actual physical addresses are transparent to the guest.

Additionally when using IBS in host, it is possible to prevent collection
of samples from secure guests by using the PreventHostIBS feature.
(https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-10  4:28         ` Bharata B Rao
@ 2023-02-10  4:40           ` Dave Hansen
  2023-02-10 15:10             ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2023-02-10  4:40 UTC (permalink / raw)
  To: Bharata B Rao, Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang

On 2/9/23 20:28, Bharata B Rao wrote:
> On 2/9/2023 7:58 PM, Dave Hansen wrote:
>> On 2/8/23 22:04, Bharata B Rao wrote:
>>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>>> the PEBS records.  They had to be translated back to host addresses to
>>>> be usable.  That was extra expensive.
>>> Just to be clear, I am using IBS in host only and it can give both virtual
>>> and physical address.
>> Could you talk a little bit about how IBS might get used for NUMA
>> balancing guest memory?
> IBS can work for guest, but will not provide physical address. Also
> the support for virtualized IBS isn't upstream yet.
> 
> However I am not sure how effective or useful NUMA balancing within a guest
> is, as the actual physical addresses are transparent to the guest.
> 
> Additionally when using IBS in host, it is possible to prevent collection
> of samples from secure guests by using the PreventHostIBS feature.
> (https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)
I was wondering specifically about how a host might use IBS to balance
guest memory transparently to the guest.  Not how a guest might use IBS
to balance its own memory.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-10  4:40           ` Dave Hansen
@ 2023-02-10 15:10             ` Bharata B Rao
  0 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-10 15:10 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: linux-kernel, linux-mm, mgorman, mingo, bp, dave.hansen, x86,
	akpm, luto, tglx, yue.li, Ravikumar.Bangoria, ying.huang



On 2/10/2023 10:10 AM, Dave Hansen wrote:
> On 2/9/23 20:28, Bharata B Rao wrote:
>> On 2/9/2023 7:58 PM, Dave Hansen wrote:
>>> On 2/8/23 22:04, Bharata B Rao wrote:
>>>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>>>> the PEBS records.  They had to be translated back to host addresses to
>>>>> be usable.  That was extra expensive.
>>>> Just to be clear, I am using IBS in host only and it can give both virtual
>>>> and physical address.
>>> Could you talk a little bit about how IBS might get used for NUMA
>>> balancing guest memory?
>> IBS can work for guest, but will not provide physical address. Also
>> the support for virtualized IBS isn't upstream yet.
>>
>> However I am not sure how effective or useful NUMA balancing within a guest
>> is, as the actual physical addresses are transparent to the guest.
>>
>> Additionally when using IBS in host, it is possible to prevent collection
>> of samples from secure guests by using the PreventHostIBS feature.
>> (https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)
> I was wondering specifically about how a host might use IBS to balance
> guest memory transparently to the guest.  Now how a guest might use IBS
> to balance its own memory.

When guest memory accesses are captured by IBS in the host, IBS provides
the host physical address. Hence IBS based NUMA balancing in the host
should be able to balance guest memory transparently to the guest.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-09  5:57   ` Bharata B Rao
@ 2023-02-13  2:56     ` Huang, Ying
  2023-02-13  3:23       ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-13  2:56 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Peter Zijlstra, linux-kernel, linux-mm, mgorman, mingo, bp,
	dave.hansen, x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>> 
>> 
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>> 
>> I think Huang Ying looked at that from the Intel POV and I think the
>> conclusion was that it doesn't really work out. What you need is
>> frequency information, but the PMU doesn't really give you that. You
>> need to process a *ton* of PMU data in-kernel.
>
> What I am doing here is to feed the access data into NUMA balancing which
> already has the logic to aggregate that at task and numa group level and
> decide if that access is actionable in terms of migrating the page. In this
> context, I am not sure about the frequency information that you and Dave
> are mentioning. AFAIU, existing NUMA balancing takes care of taking
> action, IBS becomes an alternative source of access information to NUMA
> hint faults.

We do need frequency information to determine whether a page is hot
enough to be migrated to the fast memory (promotion).  What the PMU
provides is just "recently" accessed pages, not "frequently" accessed
pages.  For the current NUMA balancing implementation, please check
NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory().  In
general, it estimates the page access frequency by measuring the
latency between the page table scan and the page fault: the shorter the
latency, the higher the frequency.  This isn't perfect, but it provides
a starting point.  You need to consider how to get frequency
information via the PMU.  For example, you may count the number of
accesses for each page, age the counts periodically, and derive a hot
threshold via some statistics.
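
To illustrate, a much-simplified sketch of that latency-based estimate
(not the exact code in should_numa_migrate_memory(); names and details
differ):

#include <linux/jiffies.h>

static bool scanned_page_is_hot(unsigned int scan_time_ms,
				unsigned int hot_threshold_ms)
{
	/* Elapsed time between the PTE scan and the hint fault, in ms */
	unsigned int latency = jiffies_to_msecs(jiffies) - scan_time_ms;

	/*
	 * A short scan-to-fault latency means the page was touched soon
	 * after it was scanned, i.e. it is accessed frequently enough to
	 * be worth promoting to the top (fast) memory tier.
	 */
	return latency < hot_threshold_ms;
}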

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-13  2:56     ` Huang, Ying
@ 2023-02-13  3:23       ` Bharata B Rao
  2023-02-13  3:34         ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-13  3:23 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, linux-kernel, linux-mm, mgorman, mingo, bp,
	dave.hansen, x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 2/13/2023 8:26 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
>> On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>>>
>>>
>>>> - Hardware provided access information could be very useful for driving
>>>>   hot page promotion in tiered memory systems. Need to check if this
>>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>>   already does.
>>>
>>> I think Huang Ying looked at that from the Intel POV and I think the
>>> conclusion was that it doesn't really work out. What you need is
>>> frequency information, but the PMU doesn't really give you that. You
>>> need to process a *ton* of PMU data in-kernel.
>>
>> What I am doing here is to feed the access data into NUMA balancing which
>> already has the logic to aggregate that at task and numa group level and
>> decide if that access is actionable in terms of migrating the page. In this
>> context, I am not sure about the frequency information that you and Dave
>> are mentioning. AFAIU, existing NUMA balancing takes care of taking
>> action, IBS becomes an alternative source of access information to NUMA
>> hint faults.
> 
> We do need frequency information to determine whether a page is hot
> enough to be migrated to the fast memory (promotion).  What PMU provided
> is just "recently" accessed pages, not "frequently" accessed pages.  For
> current NUMA balancing implementation, please check
> NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory().  In
> general, it estimates the page access frequency via measuring the
> latency between page table scanning and page fault, the shorter the
> latency, the higher the frequency.  This isn't perfect, but provides a
> starting point.  You need to consider how to get frequency information
> via PMU.  For example, you may count access number for each page, aging
> them periodically, and get hot threshold via some statistics.

For the tiered memory hot page promotion case of NUMA balancing, we will
have to maintain frequency information in software when such information
isn't available from the hardware.
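
One possible shape of that software-maintained frequency information
(per-page counters bumped from the access sample handler, aged
periodically, and compared against a hot threshold) is sketched below;
the structure and names are purely illustrative:

/* Hypothetical per-page access-frequency state, for illustration only */
struct page_access_freq {
	unsigned long pfn;
	unsigned int count;	/* decayed access count */
};

/* Called for every hardware-reported access sample of this page */
static void record_hw_sample(struct page_access_freq *f)
{
	f->count++;
}

/* Called periodically so that old accesses age out of the count */
static void age_access_count(struct page_access_freq *f)
{
	f->count >>= 1;		/* halve on every aging pass */
}

/* A page is promotion-worthy once its decayed count crosses a threshold */
static bool page_is_hot(const struct page_access_freq *f,
			unsigned int hot_threshold)
{
	return f->count >= hot_threshold;
}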

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
                   ` (5 preceding siblings ...)
  2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
@ 2023-02-13  3:26 ` Huang, Ying
  2023-02-13  5:52   ` Bharata B Rao
  6 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-13  3:26 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> Hi,
>
> Some hardware platforms can provide information about memory accesses
> that can be used to do optimal page and task placement on NUMA
> systems. AMD processors have a hardware facility called Instruction-
> Based Sampling (IBS) that can be used to gather specific metrics
> related to instruction fetch and execution activity. This facility
> can be used to perform memory access profiling based on statistical
> sampling.
>
> This RFC is a proof-of-concept implementation where the access
> information obtained from the hardware is used to drive NUMA balancing.
> With this it is no longer necessary to scan the address space and
> introduce NUMA hint faults to build task-to-page association. Hence
> the approach taken here is to replace the address space scanning plus
> hint faults with the access information provided by the hardware.

Your method can avoid the address space scanning, but it cannot
actually avoid memory access faults.  The PMU will raise an NMI and
then queue a task_work to process the sampled memory accesses.  The
overhead depends on the frequency of the memory access sampling.
Please measure the overhead of your method in detail.

> The access samples obtained from hardware are fed to NUMA balancing
> as fault-equivalents. The rest of the NUMA balancing logic that
> collects/aggregates the shared/private/local/remote faults and does
> pages/task migrations based on the faults is retained except that
> accesses replace faults.
>
> This early implementation is an attempt to get a working solution
> only and as such a lot of TODOs exist:
>
> - Perf uses IBS and we are using the same IBS for access profiling here.
>   There needs to be a proper way to make the use mutually exclusive.
> - Is tying this up with NUMA balancing a reasonable approach or
>   should we look at a completely new approach?
> - When accesses replace faults in NUMA balancing, a few things have
>   to be tuned differently. All such decision points need to be
>   identified and appropriate tuning needs to be done.
> - Hardware provided access information could be very useful for driving
>   hot page promotion in tiered memory systems. Need to check if this
>   requires different tuning/heuristics apart from what NUMA balancing
>   already does.
> - Some of the values used to program the IBS counters like the sampling
>   period etc may not be the optimal or ideal values. The sample period
>   adjustment follows the same logic as scan period modification which
>   may not be ideal. More experimentation is required to fine-tune all
>   these aspects.
> - Currently I am acting (i,e., attempt to migrate a page) on each sampled
>   access. Need to check if it makes sense to delay it and do batched page
>   migration.

Your current implementation is tied to AMD IBS.  You will need an
architecture/vendor-independent framework for upstreaming.

BTW: can IBS sample memory writes too?  Or just memory reads?

> This RFC is mainly about showing how hardware provided access
> information could be used for NUMA balancing but I have run a
> few basic benchmarks from mmtests to check if this is any severe
> regression/overhead to any of those. Some benchmarks show some
> improvement, some show no significant change and a few regress.
> I am hopeful that with more appropriate tuning there is scope for
> futher improvement here especially for workloads for which NUMA
> matters.

What's your expected improvement from the PMU-based NUMA balancing?
Should it come from reduced overhead?  Higher accuracy?  Quicker
response?  I think that it may be better to prove that with appropriate
statistics for at least one workload.

> FWIW, here are the numbers in brief:
> (1st column is default kernel, 2nd column is with this patchset)
>
> kernbench
> =========
>                                 6.2.0-rc5              6.2.0-rc5-ibs
> Amean     user-512    19385.27 (   0.00%)    18140.69 *   6.42%*
> Amean     syst-512    21620.40 (   0.00%)    19984.87 *   7.56%*
> Amean     elsp-512       95.91 (   0.00%)       88.60 *   7.62%*
>
> Duration User       19385.45    18140.89
> Duration System     21620.90    19985.37
> Duration Elapsed       96.52       89.20
>
> Ops NUMA alloc hit                 552153976.00   499596610.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local               552152782.00   499595620.00
> Ops NUMA base-page range updates      758004.00           0.00
> Ops NUMA PTE updates                  758004.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                  215654.00     1797848.00
> Ops NUMA hint local faults %            2054.00     1775103.00
> Ops NUMA hint local percent                0.95          98.73
> Ops NUMA pages migrated               213600.00       22745.00
> Ops AutoNUMA cost                       1087.63        8989.67
>
> autonumabench
> =============
> Amean     syst-NUMA01                90516.91 (   0.00%)    65272.04 *  27.89%*
> Amean     syst-NUMA01_THREADLOCAL        0.26 (   0.00%)        0.27 *  -3.80%*
> Amean     syst-NUMA02                    1.10 (   0.00%)        1.02 *   7.24%*
> Amean     syst-NUMA02_SMT                0.74 (   0.00%)        0.90 * -21.77%*
> Amean     elsp-NUMA01                  747.73 (   0.00%)      625.29 *  16.37%*
> Amean     elsp-NUMA01_THREADLOCAL        1.07 (   0.00%)        1.07 *  -0.13%*
> Amean     elsp-NUMA02                    1.75 (   0.00%)        1.72 *   1.96%*
> Amean     elsp-NUMA02_SMT                3.03 (   0.00%)        3.04 *  -0.47%*
>
> Duration User     1312937.34  1148196.94
> Duration System    633634.59   456921.29
> Duration Elapsed     5289.47     4427.82
>
> Ops NUMA alloc hit                1115625106.00   704004226.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local               599879745.00   459968338.00
> Ops NUMA base-page range updates    74310268.00           0.00
> Ops NUMA PTE updates                74310268.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults               110504178.00    27624054.00
> Ops NUMA hint local faults %        54257985.00    17310888.00
> Ops NUMA hint local percent               49.10          62.67
> Ops NUMA pages migrated             11399016.00     7983717.00
> Ops AutoNUMA cost                     553257.64      138271.96
>
> tbench4 Latency
> ===============
> Amean     latency-1           0.08 (   0.00%)        0.08 *   1.43%*
> Amean     latency-2           0.10 (   0.00%)        0.11 *  -2.75%*
> Amean     latency-4           0.14 (   0.00%)        0.13 *   4.31%*
> Amean     latency-8           0.14 (   0.00%)        0.14 *  -0.94%*
> Amean     latency-16          0.20 (   0.00%)        0.19 *   8.01%*
> Amean     latency-32          0.24 (   0.00%)        0.20 *  12.92%*
> Amean     latency-64          0.34 (   0.00%)        0.28 *  18.30%*
> Amean     latency-128         1.71 (   0.00%)        1.44 *  16.04%*
> Amean     latency-256         0.52 (   0.00%)        0.69 * -32.26%*
> Amean     latency-512         3.27 (   0.00%)        5.32 * -62.62%*
> Amean     latency-1024        0.00 (   0.00%)        0.00 *   0.00%*
> Amean     latency-2048        0.00 (   0.00%)        0.00 *   0.00%*
>
> tbench4 Throughput
> ==================
> Hmean     1         504.57 (   0.00%)      496.80 *  -1.54%*
> Hmean     2        1006.46 (   0.00%)      990.04 *  -1.63%*
> Hmean     4        1855.11 (   0.00%)     1933.76 *   4.24%*
> Hmean     8        3711.49 (   0.00%)     3582.32 *  -3.48%*
> Hmean     16       6707.58 (   0.00%)     6674.46 *  -0.49%*
> Hmean     32      13146.81 (   0.00%)    12649.49 *  -3.78%*
> Hmean     64      20922.72 (   0.00%)    22605.55 *   8.04%*
> Hmean     128     33637.07 (   0.00%)    37870.35 *  12.59%*
> Hmean     256     54083.12 (   0.00%)    50257.25 *  -7.07%*
> Hmean     512     72455.66 (   0.00%)    53141.88 * -26.66%*
> Hmean     1024   124413.95 (   0.00%)   117398.40 *  -5.64%*
> Hmean     2048   124481.61 (   0.00%)   124892.12 *   0.33%*
>
> Ops NUMA alloc hit                2092196681.00  2007852353.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local              2092193601.00  2007849231.00
> Ops NUMA base-page range updates      298999.00           0.00
> Ops NUMA PTE updates                  298999.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                  287539.00     4499166.00
> Ops NUMA hint local faults %           98931.00     4349685.00
> Ops NUMA hint local percent               34.41          96.68
> Ops NUMA pages migrated               169086.00      149476.00
> Ops AutoNUMA cost                       1443.00       22498.67
>
> Duration User       23999.54    24476.30
> Duration System    160480.07   164366.91
> Duration Elapsed     2685.19     2685.69
>
> netperf-udp
> ===========
> Hmean     send-64         226.57 (   0.00%)      225.41 *  -0.51%*
> Hmean     send-128        450.89 (   0.00%)      448.90 *  -0.44%*
> Hmean     send-256        899.63 (   0.00%)      898.02 *  -0.18%*
> Hmean     send-1024      3510.63 (   0.00%)     3526.24 *   0.44%*
> Hmean     send-2048      6493.15 (   0.00%)     6493.27 *   0.00%*
> Hmean     send-3312      9778.22 (   0.00%)     9801.03 *   0.23%*
> Hmean     send-4096     11523.43 (   0.00%)    11490.57 *  -0.29%*
> Hmean     send-8192     18666.11 (   0.00%)    18686.99 *   0.11%*
> Hmean     send-16384    28112.56 (   0.00%)    28223.81 *   0.40%*
> Hmean     recv-64         226.57 (   0.00%)      225.41 *  -0.51%*
> Hmean     recv-128        450.88 (   0.00%)      448.90 *  -0.44%*
> Hmean     recv-256        899.63 (   0.00%)      898.01 *  -0.18%*
> Hmean     recv-1024      3510.61 (   0.00%)     3526.21 *   0.44%*
> Hmean     recv-2048      6493.07 (   0.00%)     6493.15 *   0.00%*
> Hmean     recv-3312      9777.95 (   0.00%)     9800.85 *   0.23%*
> Hmean     recv-4096     11522.87 (   0.00%)    11490.47 *  -0.28%*
> Hmean     recv-8192     18665.83 (   0.00%)    18686.56 *   0.11%*
> Hmean     recv-16384    28112.13 (   0.00%)    28223.73 *   0.40%*
>
> Duration User          48.52       48.74
> Duration System       931.24      925.83
> Duration Elapsed     1934.05     1934.79
>
> Ops NUMA alloc hit                  60042365.00    60144256.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                60042305.00    60144228.00
> Ops NUMA base-page range updates        6630.00           0.00
> Ops NUMA PTE updates                    6630.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                    5709.00       26249.00
> Ops NUMA hint local faults %            3030.00       25130.00
> Ops NUMA hint local percent               53.07          95.74
> Ops NUMA pages migrated                 2500.00        1119.00
> Ops AutoNUMA cost                         28.64         131.27
>
> netperf-udp-rr
> ==============
> Hmean     1   132319.16 (   0.00%)   130621.99 *  -1.28%*
>
> Duration User           9.92        9.97
> Duration System       118.02      119.26
> Duration Elapsed      432.12      432.91
>
> Ops NUMA alloc hit                    289650.00      289222.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                  289642.00      289222.00
> Ops NUMA base-page range updates           1.00           0.00
> Ops NUMA PTE updates                       1.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                       1.00          51.00
> Ops NUMA hint local faults %               0.00          45.00
> Ops NUMA hint local percent                0.00          88.24
> Ops NUMA pages migrated                    1.00           6.00
> Ops AutoNUMA cost                          0.01           0.26
>
> netperf-tcp-rr
> ==============
> Hmean     1   118141.46 (   0.00%)   115515.41 *  -2.22%*
>
> Duration User           9.59        9.52
> Duration System       120.32      121.66
> Duration Elapsed      432.20      432.49
>
> Ops NUMA alloc hit                    291257.00      290927.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                  291233.00      290923.00
> Ops NUMA base-page range updates           2.00           0.00
> Ops NUMA PTE updates                       2.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                       2.00          46.00
> Ops NUMA hint local faults %               0.00          42.00
> Ops NUMA hint local percent                0.00          91.30
> Ops NUMA pages migrated                    2.00           4.00
> Ops AutoNUMA cost                          0.01           0.23
>
> dbench
> ======
> dbench4 Latency
>
> Amean     latency-1          2.13 (   0.00%)       10.92 *-411.44%*
> Amean     latency-2         12.03 (   0.00%)        8.17 *  32.07%*
> Amean     latency-4         21.12 (   0.00%)        9.60 *  54.55%*
> Amean     latency-8         41.20 (   0.00%)       33.59 *  18.45%*
> Amean     latency-16        76.85 (   0.00%)       75.84 *   1.31%*
> Amean     latency-32        91.68 (   0.00%)       90.26 *   1.55%*
> Amean     latency-64       124.61 (   0.00%)      113.31 *   9.07%*
> Amean     latency-128      140.14 (   0.00%)      126.29 *   9.89%*
> Amean     latency-256      155.63 (   0.00%)      142.11 *   8.69%*
> Amean     latency-512      258.60 (   0.00%)      243.13 *   5.98%*
>
> dbench4 Throughput (misleading but traditional)
>
> Hmean     1        987.47 (   0.00%)      938.07 *  -5.00%*
> Hmean     2       1750.10 (   0.00%)     1697.08 *  -3.03%*
> Hmean     4       2990.33 (   0.00%)     3023.23 *   1.10%*
> Hmean     8       3557.38 (   0.00%)     3863.32 *   8.60%*
> Hmean     16      2705.90 (   0.00%)     2660.48 *  -1.68%*
> Hmean     32      2954.08 (   0.00%)     3101.59 *   4.99%*
> Hmean     64      3061.68 (   0.00%)     3206.15 *   4.72%*
> Hmean     128     2867.74 (   0.00%)     3080.21 *   7.41%*
> Hmean     256     2585.58 (   0.00%)     2875.44 *  11.21%*
> Hmean     512     1777.80 (   0.00%)     1777.79 *  -0.00%*
>
> Duration User        2359.02     2246.44
> Duration System     18927.83    16856.91
> Duration Elapsed     1901.54     1901.44
>
> Ops NUMA alloc hit                 240556255.00   255283721.00
> Ops NUMA alloc miss                   408851.00       62903.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local               240547816.00   255264974.00
> Ops NUMA base-page range updates      204316.00           0.00
> Ops NUMA PTE updates                  204316.00           0.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                  201101.00      287642.00
> Ops NUMA hint local faults %          104199.00      153547.00
> Ops NUMA hint local percent               51.81          53.38
> Ops NUMA pages migrated                96158.00      134083.00
> Ops AutoNUMA cost                       1008.76        1440.76
>
> Bharata B Rao (5):
>   x86/ibs: In-kernel IBS driver for page access profiling
>   x86/ibs: Drive NUMA balancing via IBS access data
>   x86/ibs: Enable per-process IBS from sched switch path
>   x86/ibs: Adjust access faults sampling period
>   x86/ibs: Delay the collection of HW-provided access info
>
>  arch/x86/events/amd/ibs.c        |   6 +
>  arch/x86/include/asm/msr-index.h |  12 ++
>  arch/x86/mm/Makefile             |   1 +
>  arch/x86/mm/ibs.c                | 250 +++++++++++++++++++++++++++++++
>  include/linux/migrate.h          |   1 +
>  include/linux/mm.h               |   2 +
>  include/linux/mm_types.h         |   3 +
>  include/linux/sched.h            |   4 +
>  include/linux/vm_event_item.h    |  12 ++
>  kernel/sched/core.c              |   1 +
>  kernel/sched/debug.c             |  10 ++
>  kernel/sched/fair.c              | 142 ++++++++++++++++--
>  kernel/sched/sched.h             |   9 ++
>  mm/memory.c                      |  92 ++++++++++++
>  mm/vmstat.c                      |  12 ++
>  15 files changed, 544 insertions(+), 13 deletions(-)
>  create mode 100644 arch/x86/mm/ibs.c

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-13  3:23       ` Bharata B Rao
@ 2023-02-13  3:34         ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2023-02-13  3:34 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Peter Zijlstra, linux-kernel, linux-mm, mgorman, mingo, bp,
	dave.hansen, x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 2/13/2023 8:26 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
>>>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>>>>
>>>>
>>>>> - Hardware provided access information could be very useful for driving
>>>>>   hot page promotion in tiered memory systems. Need to check if this
>>>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>>>   already does.
>>>>
>>>> I think Huang Ying looked at that from the Intel POV and I think the
>>>> conclusion was that it doesn't really work out. What you need is
>>>> frequency information, but the PMU doesn't really give you that. You
>>>> need to process a *ton* of PMU data in-kernel.
>>>
>>> What I am doing here is to feed the access data into NUMA balancing which
>>> already has the logic to aggregate that at task and numa group level and
>>> decide if that access is actionable in terms of migrating the page. In this
>>> context, I am not sure about the frequency information that you and Dave
>>> are mentioning. AFAIU, existing NUMA balancing takes care of taking
>>> action, IBS becomes an alternative source of access information to NUMA
>>> hint faults.
>> 
>> We do need frequency information to determine whether a page is hot
>> enough to be migrated to the fast memory (promotion).  What PMU provided
>> is just "recently" accessed pages, not "frequently" accessed pages.  For
>> current NUMA balancing implementation, please check
>> NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory().  In
>> general, it estimates the page access frequency via measuring the
>> latency between page table scanning and page fault, the shorter the
>> latency, the higher the frequency.  This isn't perfect, but provides a
>> starting point.  You need to consider how to get frequency information
>> via PMU.  For example, you may count access number for each page, aging
>> them periodically, and get hot threshold via some statistics.
>
> For the tiered memory hot page promotion case of NUMA balancing, we will
> have to maintain frequency information in software when such information
> isn't available from the hardware.

Yes.  It's challenging to calculate frequency information.  Please
consider how to do that.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-13  3:26 ` Huang, Ying
@ 2023-02-13  5:52   ` Bharata B Rao
  2023-02-13  6:30     ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-13  5:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 2/13/2023 8:56 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
>> Hi,
>>
>> Some hardware platforms can provide information about memory accesses
>> that can be used to do optimal page and task placement on NUMA
>> systems. AMD processors have a hardware facility called Instruction-
>> Based Sampling (IBS) that can be used to gather specific metrics
>> related to instruction fetch and execution activity. This facility
>> can be used to perform memory access profiling based on statistical
>> sampling.
>>
>> This RFC is a proof-of-concept implementation where the access
>> information obtained from the hardware is used to drive NUMA balancing.
>> With this it is no longer necessary to scan the address space and
>> introduce NUMA hint faults to build task-to-page association. Hence
>> the approach taken here is to replace the address space scanning plus
>> hint faults with the access information provided by the hardware.
> 
> You method can avoid the address space scanning, but cannot avoid memory
> access fault in fact.  PMU will raise NMI and then task_work to process
> the sampled memory accesses.  The overhead depends on the frequency of
> the memory access sampling.  Please measure the overhead of your method
> in details.

Yes, the address space scanning is avoided. I will measure the overhead
of the hint fault path vs. the NMI handling path. The actual processing
of the access from task_work context is pretty much similar to the
stats processing done for hint faults. As you note, the overhead
depends on the frequency of sampling. In the current approach, the
sampling period is per-task and varies based on the same logic that
NUMA balancing uses to vary the scan period.
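
Roughly, the kind of per-task period adjustment meant here (mirroring
update_task_scan_period(); the ratios below are made up, and the real
logic in the series may differ) would look like:

static unsigned int adjust_sample_period(unsigned int period,
					 unsigned long local,
					 unsigned long remote)
{
	unsigned long total = local + remote;

	if (!total)
		return period;

	if (local * 10 >= total * 7)	/* >= 70% local: sample less often */
		period += period / 4;
	else				/* mostly remote: sample more often */
		period -= period / 4;

	return clamp(period, sysctl_numa_balancing_sample_period_min,
		     sysctl_numa_balancing_sample_period_max);
}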

> 
>> The access samples obtained from hardware are fed to NUMA balancing
>> as fault-equivalents. The rest of the NUMA balancing logic that
>> collects/aggregates the shared/private/local/remote faults and does
>> pages/task migrations based on the faults is retained except that
>> accesses replace faults.
>>
>> This early implementation is an attempt to get a working solution
>> only and as such a lot of TODOs exist:
>>
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>   There needs to be a proper way to make the use mutually exclusive.
>> - Is tying this up with NUMA balancing a reasonable approach or
>>   should we look at a completely new approach?
>> - When accesses replace faults in NUMA balancing, a few things have
>>   to be tuned differently. All such decision points need to be
>>   identified and appropriate tuning needs to be done.
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
>> - Some of the values used to program the IBS counters like the sampling
>>   period etc may not be the optimal or ideal values. The sample period
>>   adjustment follows the same logic as scan period modification which
>>   may not be ideal. More experimentation is required to fine-tune all
>>   these aspects.
>> - Currently I am acting (i,e., attempt to migrate a page) on each sampled
>>   access. Need to check if it makes sense to delay it and do batched page
>>   migration.
> 
> You current implementation is tied with AMD IBS.  You will need a
> architecture/vendor independent framework for upstreaming.

I have tried to keep it vendor- and arch-neutral as far as possible,
and will of course revisit this to make the interfaces more robust and
useful.

I have defined a static key (hw_access_hints=false) which will be
set only by the platform driver when it detects the hardware
capability to provide memory access information. The NUMA balancing
code skips the address space scanning when it sees this capability.
The platform driver (access fault handler) will call into the NUMA
balancing API with the linear and physical address information of the
accessed sample. Hence any equivalent hardware functionality could
plug into this scheme in its current form. There are checks for this
static key at a few points in the NUMA balancing logic to decide
whether it should work based on access faults or hint faults.
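
Roughly, that interface looks like the sketch below;
hw_can_report_mem_accesses() and numa_record_hw_access() are
hypothetical names for the capability check and for the entry point
into NUMA balancing, since the exact names may differ in the patches:

#include <linux/jump_label.h>
#include <linux/sched.h>

/* Set by the platform driver when HW can report accessed addresses */
DEFINE_STATIC_KEY_FALSE(hw_access_hints);

/*
 * Entry point into NUMA balancing, called from the platform driver's
 * task_work with one sampled access (virtual + physical address),
 * which is then treated as a fault-equivalent by the balancing logic.
 */
void numa_record_hw_access(struct task_struct *p,
			   unsigned long vaddr, unsigned long paddr);

static int __init platform_access_profiler_init(void)
{
	if (hw_can_report_mem_accesses())	/* hypothetical capability check */
		static_branch_enable(&hw_access_hints);
	return 0;
}

/* NUMA balancing side: skip the address space scan when HW hints exist */
static inline bool numa_use_hw_access_hints(void)
{
	return static_branch_unlikely(&hw_access_hints);
}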

> 
> BTW: can IBS sampling memory writing too?  Or just memory reading?

IBS can tag both store and load operations.

> 
>> This RFC is mainly about showing how hardware provided access
>> information could be used for NUMA balancing but I have run a
>> few basic benchmarks from mmtests to check if this is any severe
>> regression/overhead to any of those. Some benchmarks show some
>> improvement, some show no significant change and a few regress.
>> I am hopeful that with more appropriate tuning there is scope for
>> futher improvement here especially for workloads for which NUMA
>> matters.
> 
> What's your expected improvement of the PMU based NUMA balancing?  It
> should come from reduced overhead?  higher accuracy?  Quicker response?
> I think that it may be better to prove that with appropriate statistics
> for at least one workload.

Just to clarify, unlike PEBS, IBS works independently of PMU.

I believe the improvement will come from reduced overhead due to
sampling of relevant accesses only.

I have a microbenchmark where two sets of threads bound to two
NUMA nodes access the two different halves of a memory region that
is initially allocated on the first node.

On a two-node Zen4 system, with 64 threads in each set accessing
8G of memory each from the initial allocation of 16G, I see that
IBS-driven NUMA balancing (i.e., this patchset) takes 50% less time
to complete a fixed number of memory accesses. This could well
be the best case and real workloads/benchmarks may not get this much
uplift, but it does show the potential gain to be had.
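
For reference, a minimal sketch of that kind of microbenchmark (not the
actual one used; thread counts, sizes and iteration counts here are
arbitrary), assuming a two-node system and libnuma:

#include <numa.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define MEM_SIZE	(16UL << 30)	/* 16G, first-touched on node 0 */
#define THREADS_PER_NODE 64
#define ITERS		64

struct targ { char *base; size_t len; int node; };

static void *worker(void *arg)
{
	struct targ *t = arg;

	numa_run_on_node(t->node);	/* run this thread on its node */
	for (int it = 0; it < ITERS; it++)
		for (size_t off = 0; off < t->len; off += 4096)
			t->base[off]++;	/* touch one byte per page */
	return NULL;
}

int main(void)
{
	pthread_t tid[2 * THREADS_PER_NODE];
	struct targ targs[2];
	char *mem;

	if (numa_available() < 0)
		return 1;

	/* First-touch everything from node 0 under the default local policy */
	numa_run_on_node(0);
	mem = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 0, MEM_SIZE);

	/* Node-0 threads use the first half, node-1 threads the second */
	targs[0] = (struct targ){ mem, MEM_SIZE / 2, 0 };
	targs[1] = (struct targ){ mem + MEM_SIZE / 2, MEM_SIZE / 2, 1 };

	for (int i = 0; i < 2 * THREADS_PER_NODE; i++)
		pthread_create(&tid[i], NULL, worker,
			       &targs[i / THREADS_PER_NODE]);
	for (int i = 0; i < 2 * THREADS_PER_NODE; i++)
		pthread_join(tid[i], NULL);

	munmap(mem, MEM_SIZE);
	return 0;
}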

Thanks for your inputs.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-13  5:52   ` Bharata B Rao
@ 2023-02-13  6:30     ` Huang, Ying
  2023-02-14  4:55       ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-13  6:30 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 2/13/2023 8:56 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> Hi,
>>>
>>> Some hardware platforms can provide information about memory accesses
>>> that can be used to do optimal page and task placement on NUMA
>>> systems. AMD processors have a hardware facility called Instruction-
>>> Based Sampling (IBS) that can be used to gather specific metrics
>>> related to instruction fetch and execution activity. This facility
>>> can be used to perform memory access profiling based on statistical
>>> sampling.
>>>
>>> This RFC is a proof-of-concept implementation where the access
>>> information obtained from the hardware is used to drive NUMA balancing.
>>> With this it is no longer necessary to scan the address space and
>>> introduce NUMA hint faults to build task-to-page association. Hence
>>> the approach taken here is to replace the address space scanning plus
>>> hint faults with the access information provided by the hardware.
>> 
>> You method can avoid the address space scanning, but cannot avoid memory
>> access fault in fact.  PMU will raise NMI and then task_work to process
>> the sampled memory accesses.  The overhead depends on the frequency of
>> the memory access sampling.  Please measure the overhead of your method
>> in details.
>
> Yes, the address space scanning is avoided. I will measure the overhead
> of hint fault vs NMI handling path. The actual processing of the access
> from task_work context is pretty much similar to the stats processing
> from hint faults. As you note the overhead depends on the frequency of
> sampling. In this current approach, the sampling period is per-task
> and it varies based on the same logic that NUMA balancing uses to
> vary the scan period.
>
>> 
>>> The access samples obtained from hardware are fed to NUMA balancing
>>> as fault-equivalents. The rest of the NUMA balancing logic that
>>> collects/aggregates the shared/private/local/remote faults and does
>>> pages/task migrations based on the faults is retained except that
>>> accesses replace faults.
>>>
>>> This early implementation is an attempt to get a working solution
>>> only and as such a lot of TODOs exist:
>>>
>>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>>   There needs to be a proper way to make the use mutually exclusive.
>>> - Is tying this up with NUMA balancing a reasonable approach or
>>>   should we look at a completely new approach?
>>> - When accesses replace faults in NUMA balancing, a few things have
>>>   to be tuned differently. All such decision points need to be
>>>   identified and appropriate tuning needs to be done.
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>>> - Some of the values used to program the IBS counters like the sampling
>>>   period etc may not be the optimal or ideal values. The sample period
>>>   adjustment follows the same logic as scan period modification which
>>>   may not be ideal. More experimentation is required to fine-tune all
>>>   these aspects.
>>> - Currently I am acting (i,e., attempt to migrate a page) on each sampled
>>>   access. Need to check if it makes sense to delay it and do batched page
>>>   migration.
>> 
>> You current implementation is tied with AMD IBS.  You will need a
>> architecture/vendor independent framework for upstreaming.
>
> I have tried to keep it vendor and arch neutral as far
> as possible, will re-look into this of course to make the
> interfaces more robust and useful.
>
> I have defined a static key (hw_access_hints=false) which will be
> set only by the platform driver when it detects the hardware
> capability to provide memory access information. NUMA balancing
> code skips the address space scanning when it sees this capability.
> The platform driver (access fault handler) will call into the NUMA
> balancing API with linear and physical address information of the
> accessed sample. Hence any equivalent hardware functionality could
> plug into this scheme in its current form. There are checks for this
> static key in the NUMA balancing logic at a few points to decide if
> it should work based on access faults or hint faults.
>
>> 
>> BTW: can IBS sampling memory writing too?  Or just memory reading?
>
> IBS can tag both store and load operations.

Thanks for your information!

>> 
>>> This RFC is mainly about showing how hardware provided access
>>> information could be used for NUMA balancing but I have run a
>>> few basic benchmarks from mmtests to check if this is any severe
>>> regression/overhead to any of those. Some benchmarks show some
>>> improvement, some show no significant change and a few regress.
>>> I am hopeful that with more appropriate tuning there is scope for
>>> futher improvement here especially for workloads for which NUMA
>>> matters.
>> 
>> What's your expected improvement of the PMU based NUMA balancing?  It
>> should come from reduced overhead?  higher accuracy?  Quicker response?
>> I think that it may be better to prove that with appropriate statistics
>> for at least one workload.
>
> Just to clarify, unlike PEBS, IBS works independently of PMU.

Good to known this, Thanks!

> I believe the improvement will come from reduced overhead due to
> sampling of relevant accesses only.
>
> I have a microbenchmark where two sets of threads bound to two 
> NUMA nodes access the two different halves of memory which is
> initially allocated on the 1st node.
>
> On a two node Zen4 system, with 64 threads in each set accessing
> 8G of memory each from the initial allocation of 16G, I see that
> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
> to complete a fixed number of memory accesses. This could well
> be the best case and real workloads/benchmarks may not get this much
> uplift, but it does show the potential gain to be had.

Can you find a way to show the overhead of the original implementation
and of your method, so that we can compare them?  Because you think the
improvement comes from the reduced overhead.

I am also interested in the page migration throughput per second during
the test, because I suspect your method can migrate pages faster.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-13  6:30     ` Huang, Ying
@ 2023-02-14  4:55       ` Bharata B Rao
  2023-02-15  6:07         ` Huang, Ying
  2023-02-16  8:41         ` Bharata B Rao
  0 siblings, 2 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-14  4:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>> I have a microbenchmark where two sets of threads bound to two 
>> NUMA nodes access the two different halves of memory which is
>> initially allocated on the 1st node.
>>
>> On a two node Zen4 system, with 64 threads in each set accessing
>> 8G of memory each from the initial allocation of 16G, I see that
>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>> to complete a fixed number of memory accesses. This could well
>> be the best case and real workloads/benchmarks may not get this much
>> uplift, but it does show the potential gain to be had.
> 
> Can you find a way to show the overhead of the original implementation
> and your method?  Then we can compare between them?  Because you think
> the improvement comes from the reduced overhead.

Sure, will measure the overhead.

> 
> I also have interest in the pages migration throughput per second during
> the test, because I suspect your method can migrate pages faster.

I have some data on pages migrated over time for the benchmark I mentioned
above.

                                                                                
                                Pages migrated vs Time(s)                       
    2500000 +---------------------------------------------------------------+   
            |       +       +       +       +       +       +       +       |   
            |                                               Default ******* |   
            |                                                   IBS ####### |   
            |                                                               |   
            |                                   ****************************|   
            |                                  *                            |   
    2000000 |-+                               *                           +-|   
            |                                *                              |   
            |                              **                               |   
 P          |                             *  ##                             |   
 a          |                            *###                               |   
 g          |                          **#                                  |   
 e  1500000 |-+                       *##                                 +-|   
 s          |                        ##                                     |   
            |                       #                                       |   
 m          |                      #                                        |   
 i          |                    *#                                         |   
 g          |                   *#                                          |   
 r          |                  ##                                           |   
 a  1000000 |-+               #                                           +-|   
 t          |                #                                              |   
 e          |               #*                                              |   
 d          |              #*                                               |   
            |             # *                                               |   
            |            # *                                                |   
     500000 |-+         #  *                                              +-|   
            |          #  *                                                 |   
            |         #   *                                                 |   
            |        #   *                                                  |   
            |      ##    *                                                  |   
            |     #     *                                                   |   
            |    #  +  *    +       +       +       +       +       +       |   
          0 +---------------------------------------------------------------+   
            0       20      40      60      80     100     120     140     160  
                                        Time (s)                                

So acting upon the relevant accesses early enough seems to result in
pages migrating faster in the beginning.

Here is the actual data in case the above ascii graph gets jumbled up:

numa_pages_migrated vs time in seconds
======================================

Time	Default		IBS
---------------------------
5	2639		511
10	2639		17724
15	2699		134632
20	2699		253485
25	2699		386296
30	159805		524651
35	450678		667622
40	741762		811603
45	971848		950691
50	1108475		1084537
55	1246229		1215265
60	1385920		1336521
65	1508354		1446950
70	1624068		1544890
75	1739311		1629162
80	1854639		1700068
85	1979906		1759025
90	2099857		<end>
95	2099857
100	2099857
105	2099859
110	2099859
115	2099859
120	2099859
125	2099859
130	2099859
135	2099859
140	2099859
145	2099859
150	2099859
155	2099859
160	2099859

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-14  4:55       ` Bharata B Rao
@ 2023-02-15  6:07         ` Huang, Ying
  2023-02-24  3:28           ` Bharata B Rao
  2023-02-16  8:41         ` Bharata B Rao
  1 sibling, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-15  6:07 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>> I have a microbenchmark where two sets of threads bound to two 
>>> NUMA nodes access the two different halves of memory which is
>>> initially allocated on the 1st node.
>>>
>>> On a two node Zen4 system, with 64 threads in each set accessing
>>> 8G of memory each from the initial allocation of 16G, I see that
>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>> to complete a fixed number of memory accesses. This could well
>>> be the best case and real workloads/benchmarks may not get this much
>>> uplift, but it does show the potential gain to be had.
>> 
>> Can you find a way to show the overhead of the original implementation
>> and your method?  Then we can compare between them?  Because you think
>> the improvement comes from the reduced overhead.
>
> Sure, will measure the overhead.
>
>> 
>> I also have interest in the pages migration throughput per second during
>> the test, because I suspect your method can migrate pages faster.
>
> I have some data on pages migrated over time for the benchmark I mentioned
> above.
>
>                                                                                 
>                                 Pages migrated vs Time(s)                       
>     2500000 +---------------------------------------------------------------+   
>             |       +       +       +       +       +       +       +       |   
>             |                                               Default ******* |   
>             |                                                   IBS ####### |   
>             |                                                               |   
>             |                                   ****************************|   
>             |                                  *                            |   
>     2000000 |-+                               *                           +-|   
>             |                                *                              |   
>             |                              **                               |   
>  P          |                             *  ##                             |   
>  a          |                            *###                               |   
>  g          |                          **#                                  |   
>  e  1500000 |-+                       *##                                 +-|   
>  s          |                        ##                                     |   
>             |                       #                                       |   
>  m          |                      #                                        |   
>  i          |                    *#                                         |   
>  g          |                   *#                                          |   
>  r          |                  ##                                           |   
>  a  1000000 |-+               #                                           +-|   
>  t          |                #                                              |   
>  e          |               #*                                              |   
>  d          |              #*                                               |   
>             |             # *                                               |   
>             |            # *                                                |   
>      500000 |-+         #  *                                              +-|   
>             |          #  *                                                 |   
>             |         #   *                                                 |   
>             |        #   *                                                  |   
>             |      ##    *                                                  |   
>             |     #     *                                                   |   
>             |    #  +  *    +       +       +       +       +       +       |   
>           0 +---------------------------------------------------------------+   
>             0       20      40      60      80     100     120     140     160  
>                                         Time (s)                                
>
> So acting upon the relevant accesses early enough seem to result in
> pages migrating faster in the beginning.

One way to prove this is to output the benchmark performance
periodically, so we can see how the benchmark score changes over time.

Best Regards,
Huang, Ying

> Here is the actual data in case the above ascii graph gets jumbled up:
>
> numa_pages_migrated vs time in seconds
> ======================================
>
> Time	Default		IBS
> ---------------------------
> 5	2639		511
> 10	2639		17724
> 15	2699		134632
> 20	2699		253485
> 25	2699		386296
> 30	159805		524651
> 35	450678		667622
> 40	741762		811603
> 45	971848		950691
> 50	1108475		1084537
> 55	1246229		1215265
> 60	1385920		1336521
> 65	1508354		1446950
> 70	1624068		1544890
> 75	1739311		1629162
> 80	1854639		1700068
> 85	1979906		1759025
> 90	2099857		<end>
> 95	2099857
> 100	2099857
> 105	2099859
> 110	2099859
> 115	2099859
> 120	2099859
> 125	2099859
> 130	2099859
> 135	2099859
> 140	2099859
> 145	2099859
> 150	2099859
> 155	2099859
> 160	2099859
>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-14  4:55       ` Bharata B Rao
  2023-02-15  6:07         ` Huang, Ying
@ 2023-02-16  8:41         ` Bharata B Rao
  2023-02-17  6:03           ` Huang, Ying
  1 sibling, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-16  8:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>> I have a microbenchmark where two sets of threads bound to two 
>>> NUMA nodes access the two different halves of memory which is
>>> initially allocated on the 1st node.
>>>
>>> On a two node Zen4 system, with 64 threads in each set accessing
>>> 8G of memory each from the initial allocation of 16G, I see that
>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>> to complete a fixed number of memory accesses. This could well
>>> be the best case and real workloads/benchmarks may not get this much
>>> uplift, but it does show the potential gain to be had.
>>
>> Can you find a way to show the overhead of the original implementation
>> and your method?  Then we can compare between them?  Because you think
>> the improvement comes from the reduced overhead.
> 
> Sure, will measure the overhead.

I used the ftrace function_graph tracer to measure the amount of time (in us)
spent in fault handling and task_work handling with both methods while
the above-mentioned benchmark was running.

			Default		IBS
Fault handling		29879668.71	1226770.84
Task work handling	24878.894	10635593.82
Sched switch handling			78159.846

Total			29904547.6	11940524.51

In the default case, the fault handling duration is measured
by tracing do_numa_page() and the task_work duration is measured
by tracing task_numa_work().

In the IBS case, the fault handling is tracked by the NMI handler
ibs_overflow_handler(), the task_work is tracked by task_ibs_access_work()
and the sched switch time overhead is tracked by hw_access_sched_in(). Note
that in the IBS case not much is done in the NMI handler; the bulk of the work
(page migration etc.) happens in task_work context, unlike the default case.

The breakdown of the numbers is given below:

Default
=======
			Duration	Min	Max		Avg
do_numa_page		29879668.71	0.08	317.166		17.16
task_numa_work		24878.894	0.2	3424.19		388.73
Total			29904547.6

IBS
===
			Duration	Min	Max		Avg
ibs_overflow_handler	1226770.84	0.15	104.918		1.26
task_ibs_access_work	10635593.82	0.21	398.428		29.81
hw_access_sched_in	78159.846	0.15	247.922		1.29
Total			11940524.51
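
For reference, the function_graph setup for this kind of measurement can
look roughly like the sketch below. This is only illustrative: it assumes
tracefs is mounted at /sys/kernel/tracing, and it lists the default-kernel
functions; on the patched kernel the IBS-side handlers above
(ibs_overflow_handler(), task_ibs_access_work(), hw_access_sched_in())
would be listed instead.

#include <stdio.h>
#include <stdlib.h>

/* Write a value into a tracefs control file, bailing out on error. */
static void trace_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	/* Time only the functions of interest (default kernel shown). */
	trace_write("/sys/kernel/tracing/set_graph_function",
		    "do_numa_page task_numa_work");
	/* A larger per-CPU buffer reduces the chance of silent overruns. */
	trace_write("/sys/kernel/tracing/buffer_size_kb", "65536");
	trace_write("/sys/kernel/tracing/current_tracer", "function_graph");
	trace_write("/sys/kernel/tracing/tracing_on", "1");
	/* Run the benchmark now, then read /sys/kernel/tracing/trace. */
	return 0;
}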

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-16  8:41         ` Bharata B Rao
@ 2023-02-17  6:03           ` Huang, Ying
  2023-02-24  3:36             ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-17  6:03 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
>> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>>> I have a microbenchmark where two sets of threads bound to two 
>>>> NUMA nodes access the two different halves of memory which is
>>>> initially allocated on the 1st node.
>>>>
>>>> On a two node Zen4 system, with 64 threads in each set accessing
>>>> 8G of memory each from the initial allocation of 16G, I see that
>>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>>> to complete a fixed number of memory accesses. This could well
>>>> be the best case and real workloads/benchmarks may not get this much
>>>> uplift, but it does show the potential gain to be had.
>>>
>>> Can you find a way to show the overhead of the original implementation
>>> and your method?  Then we can compare between them?  Because you think
>>> the improvement comes from the reduced overhead.
>> 
>> Sure, will measure the overhead.
>
> I used ftrace function_graph tracer to measure the amount of time (in us)
> spent in fault handling and task_work handling in both the methods when
> the above mentioned benchmark was running.
>
> 			Default		IBS
> Fault handling		29879668.71	1226770.84
> Task work handling	24878.894	10635593.82
> Sched switch handling			78159.846
>
> Total			29904547.6	11940524.51

Thanks!  You have shown the large overhead difference between the
original method and your method.  Can you show the number of pages
migrated too?  I think the overhead / page can be a good overhead
indicator as well.

Can it be translated into the performance improvement?  Per my
understanding, the total overhead is small compared with the total run time.

Best Regards,
Huang, Ying

> In the default case, the fault handling duration is measured
> by tracing do_numa_page() and the task_work duration is tracked
> by task_numa_work().
>
> In the IBS case, the fault handling is tracked by the NMI handler
> ibs_overflow_handler(), the task_work is tracked by task_ibs_access_work()
> and sched switch time overhead is tracked by hw_access_sched_in(). Note
> that in IBS case, not much is done in NMI handler but bulk of the work
> (page migration etc) happens in task_work context unlike the default case.
>
> The breakup in numbers is given below:
>
> Default
> =======
> 			Duration	Min	Max		Avg
> do_numa_page		29879668.71	0.08	317.166		17.16
> task_numa_work		24878.894	0.2	3424.19		388.73
> Total			29904547.6
>
> IBS
> ===
> 			Duration	Min	Max		Avg
> ibs_overflow_handler	1226770.84	0.15	104.918		1.26
> task_ibs_access_work	10635593.82	0.21	398.428		29.81
> hw_access_sched_in	78159.846	0.15	247.922		1.29
> Total			11940524.51
>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-15  6:07         ` Huang, Ying
@ 2023-02-24  3:28           ` Bharata B Rao
  0 siblings, 0 replies; 33+ messages in thread
From: Bharata B Rao @ 2023-02-24  3:28 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 15-Feb-23 11:37 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
>> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>>> I have a microbenchmark where two sets of threads bound to two 
>>>> NUMA nodes access the two different halves of memory which is
>>>> initially allocated on the 1st node.
>>>>
>>>> On a two node Zen4 system, with 64 threads in each set accessing
>>>> 8G of memory each from the initial allocation of 16G, I see that
>>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>>> to complete a fixed number of memory accesses. This could well
>>>> be the best case and real workloads/benchmarks may not get this much
>>>> uplift, but it does show the potential gain to be had.
>>>
>>> Can you find a way to show the overhead of the original implementation
>>> and your method?  Then we can compare between them?  Because you think
>>> the improvement comes from the reduced overhead.
>>
>> Sure, will measure the overhead.
>>
>>>
>>> I also have interest in the pages migration throughput per second during
>>> the test, because I suspect your method can migrate pages faster.
>>
>> I have some data on pages migrated over time for the benchmark I mentioned
>> above.
>>
>>                                                                                 
>>                                 Pages migrated vs Time(s)                       
>>     2500000 +---------------------------------------------------------------+   
>>             |       +       +       +       +       +       +       +       |   
>>             |                                               Default ******* |   
>>             |                                                   IBS ####### |   
>>             |                                                               |   
>>             |                                   ****************************|   
>>             |                                  *                            |   
>>     2000000 |-+                               *                           +-|   
>>             |                                *                              |   
>>             |                              **                               |   
>>  P          |                             *  ##                             |   
>>  a          |                            *###                               |   
>>  g          |                          **#                                  |   
>>  e  1500000 |-+                       *##                                 +-|   
>>  s          |                        ##                                     |   
>>             |                       #                                       |   
>>  m          |                      #                                        |   
>>  i          |                    *#                                         |   
>>  g          |                   *#                                          |   
>>  r          |                  ##                                           |   
>>  a  1000000 |-+               #                                           +-|   
>>  t          |                #                                              |   
>>  e          |               #*                                              |   
>>  d          |              #*                                               |   
>>             |             # *                                               |   
>>             |            # *                                                |   
>>      500000 |-+         #  *                                              +-|   
>>             |          #  *                                                 |   
>>             |         #   *                                                 |   
>>             |        #   *                                                  |   
>>             |      ##    *                                                  |   
>>             |     #     *                                                   |   
>>             |    #  +  *    +       +       +       +       +       +       |   
>>           0 +---------------------------------------------------------------+   
>>             0       20      40      60      80     100     120     140     160  
>>                                         Time (s)                                
>>
>> So acting upon the relevant accesses early enough seem to result in
>> pages migrating faster in the beginning.
> 
> One way to prove this is to output the benchmark performance
> periodically.  So we can find how the benchmark score change over time.

Here is the data from a different run that captures the benchmark scores
periodically. The benchmark touches a fixed amount of memory a fixed number
of times iteratively. I am capturing the iteration number for one of the
threads that starts out touching memory which is completely remote at the
beginning. A higher iteration number suggests that the thread is making
progress quickly, which eventually reflects in the benchmark score.
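
This isn't the actual benchmark code, but a minimal sketch of this kind
of periodic progress capture is below: a worker making passes over a
chunk bumps a shared counter, and the main thread samples it every few
seconds. The chunk size and pass count here are purely illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SZ	(256UL << 20)	/* illustrative; the real chunks are 8G */
#define ITERS		2000		/* 384 passes in the reported runs */

static atomic_ulong iteration;		/* passes completed by the watched thread */
static atomic_int done;

static void *worker(void *chunk)
{
	char *c = chunk;
	unsigned long it, off;

	for (it = 0; it < ITERS; it++) {
		for (off = 0; off < CHUNK_SZ; off += 4096)
			c[off]++;		/* one pass over the chunk */
		atomic_fetch_add(&iteration, 1);
	}
	atomic_store(&done, 1);
	return NULL;
}

int main(void)
{
	pthread_t w;
	char *chunk = calloc(1, CHUNK_SZ);
	unsigned long t = 0;

	if (!chunk)
		return 1;
	pthread_create(&w, NULL, worker, chunk);
	/* Sample the progress counter every 5 seconds until the worker is done. */
	while (!atomic_load(&done)) {
		sleep(5);
		t += 5;
		printf("%lus: iteration %lu\n", t, atomic_load(&iteration));
	}
	pthread_join(w, NULL);
	return 0;
}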

                                                                                
                              Access iterations vs Time                         
    500 +-------------------------------------------------------------------+   
        |       +      +       +      +       +      +       +      +     * |   
        |                                                   Default ******* |   
    450 |-+                #                                    IBS #######-|   
        |                  #                                             *  |   
        |                 #                                             *   |   
        |                 #                                             *   |   
    400 |-+              #                                             *  +-|   
        |                #                                             *    |   
 A      |            ****#*********************************************     |   
 c  350 |-+          *  #                                                 +-|   
 c      |           *   #                                                   |   
 e      |           *  #                                                    |   
 s  300 |-+        *  #                                                   +-|   
 s      |          *  #                                                     |   
        |         *  #                                                      |   
 i  250 |-+       * #                                                     +-|   
 t      |        *  #                                                       |   
 e      |        * #                                                        |   
 r      |       * #                                                         |   
 a  200 |-+     * #                                                       +-|   
 t      |       *#                                                          |   
 i      |      * #                                                          |   
 o  150 |-+    *#                                                         +-|   
 n      |     *#                                                            |   
 s      |     *#                                                            |   
    100 |-+  *#                                                           +-|   
        |    #                                                              |   
        |   #                                                               |   
        |  #                                                                |   
     50 |-#                                                               +-|   
        |#                                                                  |   
        |#      +      +       +      +       +      +       +      +       |   
      0 +-------------------------------------------------------------------+   
        0       20     40      60     80     100    120     140    160     180  
                                      Time (s)                                  
                                                                                
The way the number of migrated pages varies for the above runs is shown in
the graph below:

                                                                                
                                 Pages migrated vs Time                         
    2500000 +---------------------------------------------------------------+   
            |     +      +     +      +     +     +      +     +      +     |   
            |                                               Default ******* |   
            |                                                   IBS ####### |   
            |                                                               |   
            |                                                   ********    |   
            |                                                  *            |   
    2000000 |-+                                              **           +-|   
            |                                             ***               |   
            |                                           **                  |   
 p          |                                          *                    |   
 a          |                                        **                     |   
 g          |                                      **                       |   
 e  1500000 |-+                                   *                       +-|   
 s          |                                  ***                          |   
            |                                **                             |   
 m          |                              **                               |   
 i          |                             *                                 |   
 g          |                           **                                  |   
 r          |                          *                                    |   
 a  1000000 |-+                        *                                  +-|   
 t          |                         *                                     |   
 e          |                        *                                      |   
 d          |                       *                                       |   
            |                      *                                        |   
            |                   ##*                                         |   
     500000 |-+                #  *                                       +-|   
            |                ##  *                                          |   
            |              ##   *                                           |   
            |           ###    *                                            |   
            |          #       *                                            |   
            |      ####       *                                             |   
            |     #      +   * +      +     +     +      +     +      +     |   
          0 +---------------------------------------------------------------+   
            0     20     40    60     80   100   120    140   160    180   200  
                                        Time (s)                                
                                                                                
The final benchmark scores for the above runs compare like this:

		Default		IBS
Time (us)	174459192.0	54710778.0

Regards,
Bharata. 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-17  6:03           ` Huang, Ying
@ 2023-02-24  3:36             ` Bharata B Rao
  2023-02-27  7:54               ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-02-24  3:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria


On 17-Feb-23 11:33 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
>> On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
>>> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>>>> I have a microbenchmark where two sets of threads bound to two 
>>>>> NUMA nodes access the two different halves of memory which is
>>>>> initially allocated on the 1st node.
>>>>>
>>>>> On a two node Zen4 system, with 64 threads in each set accessing
>>>>> 8G of memory each from the initial allocation of 16G, I see that
>>>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>>>> to complete a fixed number of memory accesses. This could well
>>>>> be the best case and real workloads/benchmarks may not get this much
>>>>> uplift, but it does show the potential gain to be had.
>>>>
>>>> Can you find a way to show the overhead of the original implementation
>>>> and your method?  Then we can compare between them?  Because you think
>>>> the improvement comes from the reduced overhead.
>>>
>>> Sure, will measure the overhead.
>>
>> I used ftrace function_graph tracer to measure the amount of time (in us)
>> spent in fault handling and task_work handling in both the methods when
>> the above mentioned benchmark was running.
>>
>> 			Default		IBS
>> Fault handling		29879668.71	1226770.84
>> Task work handling	24878.894	10635593.82
>> Sched switch handling			78159.846
>>
>> Total			29904547.6	11940524.51
> 
> Thanks!  You have shown the large overhead difference between the
> original method and your method.  Can you show the number of the pages
> migrated too?  I think the overhead / page can be a good overhead
> indicator too.
> 
> Can it be translated to the performance improvement?  Per my
> understanding, the total overhead is small compared with total run time.

I captured some of the numbers that you wanted for two different runs.
The first case shows the data for a short run (a smaller number of memory
access iterations) and the second one is for a long run (a larger number
of iterations).

Short-run
=========
Time taken or overhead (us) for fault, task_work and sched_switch
handling

			Default		IBS
Fault handling		29017953.99	1196828.67
Task work handling	10354.40	10356778.53
Sched switch handling			56572.21
Total overhead		29028308.39	11610179.41

Benchmark score(us)	194050290	53963650
numa_pages_migrated	2097256		662755
Overhead / page		13.84		17.51
Pages migrated per sec	72248.64	57083.95

Default
-------
			Total		Min	Max		Avg
do_numa_page		29017953.99	0.1	307.63		15.97
task_numa_work		10354.40	2.86	4573.60		175.50
Total			29028308.39

IBS
---
			Total		Min	Max		Avg
ibs_overflow_handler	1196828.67	0.15	100.28		1.26
task_ibs_access_work	10356778.53	0.21	10504.14	28.42
hw_access_sched_in	56572.21	0.15	16.94		1.45
Total			11610179.41


Long-run
========
Time taken or overhead (us) for fault, task_work and sched_switch
handling
			Default		IBS
Fault handling		27437756.73	901406.37
Task work handling	1741.66		4902935.32
Sched switch handling			100590.33
Total overhead		27439498.38	5904932.02

Benchmark score(us)	306786210.0	153422489.0
numa_pages_migrated	2097218		1746099
Overhead / page		13.08		3.38
Pages migrated per sec	6836.08		11380.98

Default
-------
			Total		Min	Max		Avg
do_numa_page		27437756.73	0.08	363.475		15.03
task_numa_work		1741.66		3.294	1200.71		42.48
Total			27439498.38

IBS
---
			Total		Min	Max		Avg
ibs_overflow_handler	901406.37	0.15	95.51		1.06
task_ibs_access_work	4902935.32	0.22	11013.68	9.64
hw_access_sched_in	100590.33	0.14	91.97		1.52
Total			5904932.02

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-24  3:36             ` Bharata B Rao
@ 2023-02-27  7:54               ` Huang, Ying
  2023-03-01 11:21                 ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-02-27  7:54 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 17-Feb-23 11:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
>>>> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>>>>> I have a microbenchmark where two sets of threads bound to two 
>>>>>> NUMA nodes access the two different halves of memory which is
>>>>>> initially allocated on the 1st node.
>>>>>>
>>>>>> On a two node Zen4 system, with 64 threads in each set accessing
>>>>>> 8G of memory each from the initial allocation of 16G, I see that
>>>>>> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
>>>>>> to complete a fixed number of memory accesses. This could well
>>>>>> be the best case and real workloads/benchmarks may not get this much
>>>>>> uplift, but it does show the potential gain to be had.
>>>>>
>>>>> Can you find a way to show the overhead of the original implementation
>>>>> and your method?  Then we can compare between them?  Because you think
>>>>> the improvement comes from the reduced overhead.
>>>>
>>>> Sure, will measure the overhead.
>>>
>>> I used ftrace function_graph tracer to measure the amount of time (in us)
>>> spent in fault handling and task_work handling in both the methods when
>>> the above mentioned benchmark was running.
>>>
>>> 			Default		IBS
>>> Fault handling		29879668.71	1226770.84
>>> Task work handling	24878.894	10635593.82
>>> Sched switch handling			78159.846
>>>
>>> Total			29904547.6	11940524.51
>> 
>> Thanks!  You have shown the large overhead difference between the
>> original method and your method.  Can you show the number of the pages
>> migrated too?  I think the overhead / page can be a good overhead
>> indicator too.
>> 
>> Can it be translated to the performance improvement?  Per my
>> understanding, the total overhead is small compared with total run time.
>
> I captured some of the numbers that you wanted for two different runs.
> The first case shows the data for a short run (less number of memory access
> iterations) and the second one is for a long run (more number of iterations)
>
> Short-run
> =========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
> 			Default		IBS
> Fault handling		29017953.99	1196828.67
> Task work handling	10354.40	10356778.53
> Sched switch handling			56572.21
> Total overhead		29028308.39	11610179.41
>
> Benchmark score(us)	194050290	53963650
> numa_pages_migrated	2097256		662755
> Overhead / page		13.84		17.51

From above, the overhead/page is similar.

> Pages migrated per sec	72248.64	57083.95
>
> Default
> -------
> 			Total		Min	Max		Avg
> do_numa_page		29017953.99	0.1	307.63		15.97
> task_numa_work		10354.40	2.86	4573.60		175.50
> Total			29028308.39
>
> IBS
> ---
> 			Total		Min	Max		Avg
> ibs_overflow_handler	1196828.67	0.15	100.28		1.26
> task_ibs_access_work	10356778.53	0.21	10504.14	28.42
> hw_access_sched_in	56572.21	0.15	16.94		1.45
> Total			11610179.41
>
>
> Long-run
> ========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
> 			Default		IBS
> Fault handling		27437756.73	901406.37
> Task work handling	1741.66		4902935.32
> Sched switch handling			100590.33
> Total overhead		27439498.38	5904932.02
>
> Benchmark score(us)	306786210.0	153422489.0
> numa_pages_migrated	2097218		1746099
> Overhead / page		13.08		3.38

But from this, the overhead/page is quite different.

One possibility is that there are more "local" hint page faults in the
original implementation; we can check "numa_hint_faults" and
"numa_hint_faults_local" in /proc/vmstat for that.

If

  numa_hint_faults_local / numa_hint_faults

is similar, then for each page migrated the number of hint page faults is
similar, and the run time for each hint page fault handler is similar
too.  Or have I made some mistake in the analysis?

> Pages migrated per sec	6836.08		11380.98
>
> Default
> -------
> 			Total		Min	Max		Avg
> do_numa_page		27437756.73	0.08	363.475		15.03
> task_numa_work		1741.66		3.294	1200.71		42.48
> Total			27439498.38
>
> IBS
> ---
> 			Total		Min	Max		Avg
> ibs_overflow_handler	901406.37	0.15	95.51		1.06
> task_ibs_access_work	4902935.32	0.22	11013.68	9.64
> hw_access_sched_in	100590.33	0.14	91.97		1.52
> Total			5904932.02

Thank you very much for the detailed data.  Can you provide some analysis
of your data?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-02-27  7:54               ` Huang, Ying
@ 2023-03-01 11:21                 ` Bharata B Rao
  2023-03-02  8:10                   ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-03-01 11:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 27-Feb-23 1:24 PM, Huang, Ying wrote:
> Thank you very much for detailed data.  Can you provide some analysis
> for your data?

The overhead numbers I shared earlier weren't correct, as I
realized that while obtaining those numbers from function_graph
tracing, the trace buffer was silently getting overrun. I had to
reduce the number of memory access iterations to ensure that I get
the full trace buffer. I will summarize the findings
based on these new numbers below.

Just to recap - the microbenchmark is run on an AMD Genoa
two node system. The benchmark has two sets of threads
(one set affined to each node) accessing two different chunks
of memory (chunk size 8G) which are initially allocated
on the first node. The benchmark touches each page in the
chunk iteratively for a fixed number of iterations (384
in the case given below). The benchmark score is the
amount of time it takes to complete the specified number
of accesses.
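
The test program itself isn't posted here, but a stripped-down sketch of
this kind of microbenchmark is below. It only illustrates the structure
described above; the chunk size, thread count and iteration count are
scaled down and purely illustrative, and it assumes libnuma (build with
something like: gcc -O2 bench.c -o bench -lnuma -lpthread).

#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK_SZ	(1UL << 30)	/* 1G per set here; 8G per set in the runs above */
#define ITERS		16		/* 384 in the run described above */
#define NR_THREADS	8		/* 64 per node in the runs above */

struct set_arg {
	char *chunk;			/* this set's memory, first-touched on node 0 */
	int node;			/* node this set's threads run on */
};

static void *worker(void *p)
{
	struct set_arg *a = p;
	unsigned long it, off;

	numa_run_on_node(a->node);	/* affine the thread to its set's node */
	for (it = 0; it < ITERS; it++)
		for (off = 0; off < CHUNK_SZ; off += 4096)
			a->chunk[off]++;	/* touch every page of the chunk */
	return NULL;
}

int main(void)
{
	pthread_t tid[2 * NR_THREADS];
	struct set_arg a[2] = { { NULL, 0 }, { NULL, 1 } };
	struct timespec t0, t1;
	int s, t;

	if (numa_available() < 0 || numa_max_node() < 1) {
		fprintf(stderr, "need a system with at least two NUMA nodes\n");
		return 1;
	}

	/* First-touch both chunks from node 0 so all pages start out there. */
	numa_run_on_node(0);
	for (s = 0; s < 2; s++) {
		a[s].chunk = malloc(CHUNK_SZ);
		if (!a[s].chunk)
			return 1;
		memset(a[s].chunk, 0, CHUNK_SZ);
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (s = 0; s < 2; s++)
		for (t = 0; t < NR_THREADS; t++)
			pthread_create(&tid[s * NR_THREADS + t], NULL, worker, &a[s]);
	for (t = 0; t < 2 * NR_THREADS; t++)
		pthread_join(tid[t], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* The score is the time taken to complete the fixed number of accesses. */
	printf("score: %.0f us\n", (t1.tv_sec - t0.tv_sec) * 1e6 +
				   (t1.tv_nsec - t0.tv_nsec) / 1e3);
	return 0;
}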

Here is the data for the benchmark run:

Time taken or overhead (us) for fault, task_work and sched_switch
handling

				Default		IBS
Fault handling			2875354862	2602455		
Task work handling		139023		24008121
Sched switch handling				37712
Total overhead			2875493885	26648288	

Default
-------
			Total		Min	Max		Avg
do_numa_page		2875354862	0.08	392.13		22.11
task_numa_work		139023		0.14	5365.77		532.66
Total			2875493885

IBS
---
			Total		Min	Max		Avg
ibs_overflow_handler	2602455		0.14	103.91		1.29
task_ibs_access_work	24008121	0.17	485.09		37.65
hw_access_sched_in	37712		0.15	287.55		1.35
Total			26648288


				Default		IBS
Benchmark score(us)		160171762.0	40323293.0
numa_pages_migrated		2097220		511791
Overhead per page		1371		52
Pages migrated per sec		13094		12692
numa_hint_faults_local		2820311		140856
numa_hint_faults		38589520	652647
hint_faults_local/hint_faults	7%		22%

Here is the summary:

- In case of IBS, the benchmark completes 75% faster compared to
  the default case. The gain varies based on how many iterations of
  memory accesses we run as part of the benchmark. For 2048 iterations
  of accesses, I have seen a gain of around 50%.
- The overhead of NUMA balancing (as measured by the time taken in
  the fault handling, task_work time handling and sched_switch time
  handling) in the default case is seen to be pretty high compared to
  the IBS case.
- The number of hint-faults in the default case is significantly
  higher than the IBS case.
- The local hint-faults percentage is much better in the IBS
  case compared to the default case.
- As shown in the graphs (in other threads of this mail thread), in
  the default case the page migrations start a bit slowly, while the IBS
  case shows steady migrations right from the start.
- I have also shown (via graphs in other threads of this mail thread)
  that in the IBS case the benchmark is able to steadily increase
  the access iterations over time, while in the default case the
  benchmark doesn't make forward progress for a long time after
  an initial increase.
- Early migrations due to relevant access sampling from IBS
  are most probably the main reason for the uplift that the IBS
  case gets.
- It is consistently seen that the benchmark in the IBS case manages
  to complete the specified number of accesses even before the entire
  chunk of memory gets migrated. The early migrations are offsetting
  the cost of remote accesses too.
- In the IBS case, we re-program the IBS counters for the incoming
  task in the sched_switch path. It is seen that this overhead isn't
  significant enough to slow down the benchmark.
- One of the differences between the default case and the IBS case
  is about when the faults-since-last-scan is updated/folded into the
  historical faults stats and subsequent scan period update. Since we
  don't have the notion of scanning in IBS, I have a threshold (number
  of access faults) to determine when to update the historical faults
  and the IBS sample period. I need to check if quicker migrations
  could result from this change.
- Finally, all this is for the above-mentioned microbenchmark. The
  gains on other benchmarks are yet to be evaluated.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-03-01 11:21                 ` Bharata B Rao
@ 2023-03-02  8:10                   ` Huang, Ying
  2023-03-03  5:25                     ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-03-02  8:10 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 27-Feb-23 1:24 PM, Huang, Ying wrote:
>> Thank you very much for detailed data.  Can you provide some analysis
>> for your data?
>
> The overhead numbers I shared earlier weren't correct as I
> realized that while obtaining those numbers from function_graph
> tracing, the trace buffer was silently getting overrun. I had to
> reduce the number of memory access iterations to ensure that I get
> the full trace buffer. I will be summarizing the findings
> based on this new numbers below.
>
> Just to recap - The microbenchmark is run on an AMD Genoa
> two node system. The benchmark has two set of threads,
> (one affined to each node) accessing two different chunks
> of memory (chunk size 8G) which are initially allocated
> on first node. The benchmark touches each page in the
> chunk iteratively for a fixed number of iterations (384
> in this case given below). The benchmark score is the
> amount of time it takes to complete the specified number
> of accesses.
>
> Here is the data for the benchmark run:
>
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
> 				Default		IBS
> Fault handling			2875354862	2602455		
> Task work handling		139023		24008121
> Sched switch handling				37712
> Total overhead			2875493885	26648288	
>
> Default
> -------
> 			Total		Min	Max		Avg
> do_numa_page		2875354862	0.08	392.13		22.11
> task_numa_work		139023		0.14	5365.77		532.66
> Total			2875493885
>
> IBS
> ---
> 			Total		Min	Max		Avg
> ibs_overflow_handler	2602455		0.14	103.91		1.29
> task_ibs_access_work	24008121	0.17	485.09		37.65
> hw_access_sched_in	37712		0.15	287.55		1.35
> Total			26648288
>
>
> 				Default		IBS
> Benchmark score(us)		160171762.0	40323293.0
> numa_pages_migrated		2097220		511791
> Overhead per page		1371		52
> Pages migrated per sec		13094		12692
> numa_hint_faults_local		2820311		140856
> numa_hint_faults		38589520	652647

For default, numa_hint_faults >> numa_pages_migrated.  That is hard to
understand.  I guess that there aren't many shared pages in the
benchmark?  And I guess that there are enough free pages on the target
node too?

> hint_faults_local/hint_faults	7%		22%
>
> Here is the summary:
>
> - In case of IBS, the benchmark completes 75% faster compared to
>   the default case. The gain varies based on how many iterations of
>   memory accesses we run as part of the benchmark. For 2048 iterations
>   of accesses, I have seen a gain of around 50%.
> - The overhead of NUMA balancing (as measured by the time taken in
>   the fault handling, task_work time handling and sched_switch time
>   handling) in the default case is seen to be pretty high compared to
>   the IBS case.
> - The number of hint-faults in the default case is significantly
>   higher than the IBS case.
> - The local hint-faults percentage is much better in the IBS
>   case compared to the default case.
> - As shown in the graphs (in other threads of this mail thread), in
>   the default case, the page migrations start a bit slowly while IBS
>   case shows steady migrations right from the start.
> - I have also shown (via graphs in other threads of this mail thread)
>   that in IBS case the benchmark is able to steadily increase
>   the access iterations over time, while in the default case, the
>   benchmark doesn't do forward progress for a long time after
>   an initial increase.

This is hard to understand too.  Pages are migrated to the local node, but
performance doesn't improve.

> - Early migrations due to relevant access sampling from IBS,
>   is most probably the significant reason for the uplift that IBS
>   case gets.

In the original kernel, the NUMA page table scanning is delayed for a
while.  Please check the comments below from task_tick_numa().

	/*
	 * Using runtime rather than walltime has the dual advantage that
	 * we (mostly) drive the selection from busy threads and that the
	 * task needs to have done some actual work before we bother with
	 * NUMA placement.
	 */

I think this is generally reasonable, though it's not ideal for this
micro-benchmark.

Best Regards,
Huang, Ying

> - It is consistently seen that the benchmark in the IBS case manages
>   to complete the specified number of accesses even before the entire
>   chunk of memory gets migrated. The early migrations are offsetting
>   the cost of remote accesses too.
> - In the IBS case, we re-program the IBS counters for the incoming
>   task in the sched_switch path. It is seen that this overhead isn't
>   that significant to slow down the benchmark.
> - One of the differences between the default case and the IBS case
>   is about when the faults-since-last-scan is updated/folded into the
>   historical faults stats and subsequent scan period update. Since we
>   don't have the notion of scanning in IBS, I have a threshold (number
>   of access faults) to determine when to update the historical faults
>   and the IBS sample period. I need to check if quicker migrations
>   could result from this change.
> - Finally, all this is for the above mentioned microbenchmark. The
>   gains on other benchmarks is yet to be evaluated.
>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-03-02  8:10                   ` Huang, Ying
@ 2023-03-03  5:25                     ` Bharata B Rao
  2023-03-03  5:53                       ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-03-03  5:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 02-Mar-23 1:40 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>>
>> Here is the data for the benchmark run:
>>
>> Time taken or overhead (us) for fault, task_work and sched_switch
>> handling
>>
>> 				Default		IBS
>> Fault handling			2875354862	2602455		
>> Task work handling		139023		24008121
>> Sched switch handling				37712
>> Total overhead			2875493885	26648288	
>>
>> Default
>> -------
>> 			Total		Min	Max		Avg
>> do_numa_page		2875354862	0.08	392.13		22.11
>> task_numa_work		139023		0.14	5365.77		532.66
>> Total			2875493885
>>
>> IBS
>> ---
>> 			Total		Min	Max		Avg
>> ibs_overflow_handler	2602455		0.14	103.91		1.29
>> task_ibs_access_work	24008121	0.17	485.09		37.65
>> hw_access_sched_in	37712		0.15	287.55		1.35
>> Total			26648288
>>
>>
>> 				Default		IBS
>> Benchmark score(us)		160171762.0	40323293.0
>> numa_pages_migrated		2097220		511791
>> Overhead per page		1371		52
>> Pages migrated per sec		13094		12692
>> numa_hint_faults_local		2820311		140856
>> numa_hint_faults		38589520	652647
> 
> For default, numa_hint_faults >> numa_pages_migrated.  It's hard to be
> understood.

Most of the migration requests from the numa hint page fault path
are failing due to failure to isolate the pages.

This is the check in migrate_misplaced_page() from where it returns
without even trying to do the subsequent migrate_pages() call:

        isolated = numamigrate_isolate_page(pgdat, page);
        if (!isolated)
                goto out;

I will further investigate this.

> I guess that there aren't many shared pages in the
> benchmark?

I have a version of the benchmark which has a fraction of
shared memory between the sets of threads in addition to the
per-set exclusive memory. Here too the same performance
difference is seen.

> And I guess that the free pages in the target node is enough
> too?

The benchmark uses 16G in total, with 8G being accessed by
threads on each node. There is enough memory on the target
node to accept the incoming page migration requests.

> 
>> hint_faults_local/hint_faults	7%		22%
>>
>> Here is the summary:
>>
>> - In case of IBS, the benchmark completes 75% faster compared to
>>   the default case. The gain varies based on how many iterations of
>>   memory accesses we run as part of the benchmark. For 2048 iterations
>>   of accesses, I have seen a gain of around 50%.
>> - The overhead of NUMA balancing (as measured by the time taken in
>>   the fault handling, task_work time handling and sched_switch time
>>   handling) in the default case is seen to be pretty high compared to
>>   the IBS case.
>> - The number of hint-faults in the default case is significantly
>>   higher than the IBS case.
>> - The local hint-faults percentage is much better in the IBS
>>   case compared to the default case.
>> - As shown in the graphs (in other threads of this mail thread), in
>>   the default case, the page migrations start a bit slowly while IBS
>>   case shows steady migrations right from the start.
>> - I have also shown (via graphs in other threads of this mail thread)
>>   that in IBS case the benchmark is able to steadily increase
>>   the access iterations over time, while in the default case, the
>>   benchmark doesn't do forward progress for a long time after
>>   an initial increase.
> 
> Hard to understand this too.  Pages are migrated to local, but
> performance doesn't improve.

Migrations start a bit late, and too much time is spent later in the
run on hint faults and failed migration attempts (due to failure to
isolate the pages). That is probably the reason?
> 
>> - Early migrations due to relevant access sampling from IBS,
>>   is most probably the significant reason for the uplift that IBS
>>   case gets.
> 
> In original kernel, the NUMA page table scanning will delay for a
> while.  Please check the below comments in task_tick_numa().
> 
> 	/*
> 	 * Using runtime rather than walltime has the dual advantage that
> 	 * we (mostly) drive the selection from busy threads and that the
> 	 * task needs to have done some actual work before we bother with
> 	 * NUMA placement.
> 	 */
> 
> I think this is generally reasonable, while it's not best for this
> micro-benchmark.

This is in addition to the initial scan delay that we have via
sysctl_numa_balancing_scan_delay. I have an equivalent of this
initial delay, where the IBS access sampling is not started for
the task until that delay has passed.

Thanks for your observations.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-03-03  5:25                     ` Bharata B Rao
@ 2023-03-03  5:53                       ` Huang, Ying
  2023-03-06 15:30                         ` Bharata B Rao
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2023-03-03  5:53 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 02-Mar-23 1:40 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>>
>>> Here is the data for the benchmark run:
>>>
>>> Time taken or overhead (us) for fault, task_work and sched_switch
>>> handling
>>>
>>> 				Default		IBS
>>> Fault handling			2875354862	2602455		
>>> Task work handling		139023		24008121
>>> Sched switch handling				37712
>>> Total overhead			2875493885	26648288	
>>>
>>> Default
>>> -------
>>> 			Total		Min	Max		Avg
>>> do_numa_page		2875354862	0.08	392.13		22.11
>>> task_numa_work		139023		0.14	5365.77		532.66
>>> Total			2875493885
>>>
>>> IBS
>>> ---
>>> 			Total		Min	Max		Avg
>>> ibs_overflow_handler	2602455		0.14	103.91		1.29
>>> task_ibs_access_work	24008121	0.17	485.09		37.65
>>> hw_access_sched_in	37712		0.15	287.55		1.35
>>> Total			26648288
>>>
>>>
>>> 				Default		IBS
>>> Benchmark score(us)		160171762.0	40323293.0
>>> numa_pages_migrated		2097220		511791
>>> Overhead per page		1371		52
>>> Pages migrated per sec		13094		12692
>>> numa_hint_faults_local		2820311		140856
>>> numa_hint_faults		38589520	652647
>> 
>> For default, numa_hint_faults >> numa_pages_migrated.  It's hard to be
>> understood.
>
> Most of the migration requests from the numa hint page fault path
> are failing due to failure to isolate the pages.
>
> This is the check in migrate_misplaced_page() from where it returns
> without even trying to do the subsequent migrate_pages() call:
>
>         isolated = numamigrate_isolate_page(pgdat, page);
>         if (!isolated)
>                 goto out;
>
> I will further investigate this.
>
>> I guess that there aren't many shared pages in the
>> benchmark?
>
> I have a version of the benchmark which has a fraction of 
> shared memory between sets of thread in addition to the
> per-set exclusive memory. Here too the same performance
> difference is seen.
>
>> And I guess that the free pages in the target node is enough
>> too?
>
> The benchmark is using 16G totally with 8G being accessed from
> threads on either nodes. There is enough memory on the target
> node to accept the incoming page migration requests.
>
>> 
>>> hint_faults_local/hint_faults	7%		22%
>>>
>>> Here is the summary:
>>>
>>> - In case of IBS, the benchmark completes 75% faster compared to
>>>   the default case. The gain varies based on how many iterations of
>>>   memory accesses we run as part of the benchmark. For 2048 iterations
>>>   of accesses, I have seen a gain of around 50%.
>>> - The overhead of NUMA balancing (as measured by the time taken in
>>>   the fault handling, task_work time handling and sched_switch time
>>>   handling) in the default case is seen to be pretty high compared to
>>>   the IBS case.
>>> - The number of hint-faults in the default case is significantly
>>>   higher than the IBS case.
>>> - The local hint-faults percentage is much better in the IBS
>>>   case compared to the default case.
>>> - As shown in the graphs (in other threads of this mail thread), in
>>>   the default case, the page migrations start a bit slowly while IBS
>>>   case shows steady migrations right from the start.
>>> - I have also shown (via graphs in other threads of this mail thread)
>>>   that in IBS case the benchmark is able to steadily increase
>>>   the access iterations over time, while in the default case, the
>>>   benchmark doesn't do forward progress for a long time after
>>>   an initial increase.
>> 
>> Hard to understand this too.  Pages are migrated to local, but
>> performance doesn't improve.
>
> Migrations start a bit late and too much of time is spent later
> in the run in hint faults and failed migration attempts (due to failure
> to isolate the pages) is probably the reason?
>> 
>>> - Early migrations due to relevant access sampling from IBS,
>>>   is most probably the significant reason for the uplift that IBS
>>>   case gets.
>> 
>> In original kernel, the NUMA page table scanning will delay for a
>> while.  Please check the below comments in task_tick_numa().
>> 
>> 	/*
>> 	 * Using runtime rather than walltime has the dual advantage that
>> 	 * we (mostly) drive the selection from busy threads and that the
>> 	 * task needs to have done some actual work before we bother with
>> 	 * NUMA placement.
>> 	 */
>> 
>> I think this is generally reasonable, while it's not best for this
>> micro-benchmark.
>
> This is in addition to the initial scan delay that we have via
> sysctl_numa_balancing_scan_delay. I have an equivalent of this
> initial delay where the IBS access sampling is not started for
> the task until an initial delay.

What is the memory access pattern of the workload?  Uniform random or
something like a Gaussian distribution?

Anyway, it may take some time for the original method to scan enough
memory space to trigger enough hint page faults.  We can check
numa_pte_updates to see whether enough virtual space has been scanned.
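
Something like the below can be used to snapshot those counters before and
after a run (just a rough sketch that prints the /proc/vmstat counters we
have been discussing; diffing two snapshots gives the per-run deltas).

#include <stdio.h>
#include <string.h>

/* Counters discussed in this thread; diff them across the run. */
static const char *keys[] = {
	"numa_pte_updates",
	"numa_hint_faults",
	"numa_hint_faults_local",
	"numa_pages_migrated",
};

int main(void)
{
	char name[64];
	unsigned long long val;
	unsigned int i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
			if (!strcmp(name, keys[i]))
				printf("%-24s %llu\n", name, val);
	fclose(f);
	return 0;
}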

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-03-03  5:53                       ` Huang, Ying
@ 2023-03-06 15:30                         ` Bharata B Rao
  2023-03-07  2:33                           ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Bharata B Rao @ 2023-03-06 15:30 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

On 03-Mar-23 11:23 AM, Huang, Ying wrote:
> 
> What is the memory accessing pattern of the workload?  Uniform random or
> something like Gauss distribution?

Multiple iterations of uniform access from beginning to end of the
memory region.

> 
> Anyway, it may take some time for the original method to scan enough
> memory space to trigger enough hint page fault.  We can check
> numa_pte_updates to check whether enough virtual space has been scanned.

I see that numa_hint_faults is way higher (sometimes close to 5 times)
than numa_pte_updates. This doesn't make sense. Very rarely I do see
saner numbers and when that happens the benchmark score is also much better.

Looks like an issue with the default kernel itself. I will debug this
further and get back.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
  2023-03-06 15:30                         ` Bharata B Rao
@ 2023-03-07  2:33                           ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2023-03-07  2:33 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, mgorman, peterz, mingo, bp, dave.hansen,
	x86, akpm, luto, tglx, yue.li, Ravikumar.Bangoria

Bharata B Rao <bharata@amd.com> writes:

> On 03-Mar-23 11:23 AM, Huang, Ying wrote:
>> 
>> What is the memory accessing pattern of the workload?  Uniform random or
>> something like Gauss distribution?
>
> Multiple iterations of uniform access from beginning to end of the
> memory region.

I guess these are sequential accesses rather than random accesses with a
uniform distribution.

>> 
>> Anyway, it may take some time for the original method to scan enough
>> memory space to trigger enough hint page fault.  We can check
>> numa_pte_updates to check whether enough virtual space has been scanned.
>
> I see that numa_hint_faults is way higher (sometimes close to 5 times)
> than numa_pte_updates. This doesn't make sense. Very rarely I do see
> saner numbers and when that happens the benchmark score is also much better.
>
> Looks like an issue with the default kernel itself. I will debug this
> further and get back.

Yes.  It appears that something is wrong.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-03-07  2:34 UTC | newest]

Thread overview: 33+ messages
2023-02-08  7:35 [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Bharata B Rao
2023-02-08  7:35 ` [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling Bharata B Rao
2023-02-08  7:35 ` [RFC PATCH 2/5] x86/ibs: Drive NUMA balancing via IBS access data Bharata B Rao
2023-02-08  7:35 ` [RFC PATCH 3/5] x86/ibs: Enable per-process IBS from sched switch path Bharata B Rao
2023-02-08  7:35 ` [RFC PATCH 4/5] x86/ibs: Adjust access faults sampling period Bharata B Rao
2023-02-08  7:35 ` [RFC PATCH 5/5] x86/ibs: Delay the collection of HW-provided access info Bharata B Rao
2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
2023-02-08 18:12   ` Dave Hansen
2023-02-09  6:04     ` Bharata B Rao
2023-02-09 14:28       ` Dave Hansen
2023-02-10  4:28         ` Bharata B Rao
2023-02-10  4:40           ` Dave Hansen
2023-02-10 15:10             ` Bharata B Rao
2023-02-09  5:57   ` Bharata B Rao
2023-02-13  2:56     ` Huang, Ying
2023-02-13  3:23       ` Bharata B Rao
2023-02-13  3:34         ` Huang, Ying
2023-02-13  3:26 ` Huang, Ying
2023-02-13  5:52   ` Bharata B Rao
2023-02-13  6:30     ` Huang, Ying
2023-02-14  4:55       ` Bharata B Rao
2023-02-15  6:07         ` Huang, Ying
2023-02-24  3:28           ` Bharata B Rao
2023-02-16  8:41         ` Bharata B Rao
2023-02-17  6:03           ` Huang, Ying
2023-02-24  3:36             ` Bharata B Rao
2023-02-27  7:54               ` Huang, Ying
2023-03-01 11:21                 ` Bharata B Rao
2023-03-02  8:10                   ` Huang, Ying
2023-03-03  5:25                     ` Bharata B Rao
2023-03-03  5:53                       ` Huang, Ying
2023-03-06 15:30                         ` Bharata B Rao
2023-03-07  2:33                           ` Huang, Ying
