* [PATCH V2 0/3] sched/numa: Enhance vma scanning
@ 2023-02-01  8:02 Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 The patchset implements one of the enhancements to NUMA vma scanning
suggested by Mel. It is a continuation of [2]. Though I have dropped the
RFC tag, I do think some parts need more feedback and refinement.

In the existing mechanism, the scan period is derived from per-thread
stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats
at the per-process level to capture application behaviour better.

During that discussion, Mel proposed several ideas to enhance current
NUMA balancing. One of the suggestions was:

Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but would reduce scanning of
areas the thread is not interested in. The suggestion intends not to
penalize threads that have no interest in the vma, thus reducing scanning
overhead.
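
Roughly, the idea looks like the sketch below (illustrative only; the
field and helper names here are placeholders, the real implementation is
in patch 2):

	/* fault path: tag the faulting thread in the VMA (approximate) */
	vma->pid_mask |= 1UL << (current->pid % BITS_PER_LONG);

	/* scan path: skip VMAs that this thread never faulted on */
	if (!(vma->pid_mask & (1UL << (current->pid % BITS_PER_LONG))))
		continue;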

About the patchset:
Patch 1:
1) VMA scan delay logic is added (Mel): during the initial phase of a
VMA's life, scanning is delayed by sysctl_numa_balancing_scan_delay.

2) A new status structure (vma_numab) is added so as to not grow
vm_area_struct in the !NUMA_BALANCING case.

Patch 2:
3) The lower 6 bits of the PID are used as an index to remember which
PIDs accessed the VMA in the fault path. This is then used to restrict
scanning of the VMA in the scan path.

Please note that the first two scans are allowed unconditionally
(using numa_scan_seq). This may need to change, since numa_scan_seq is
per mm.

Patch 3:
4) Introduce basic periodic clearing of the accessed PIDs. This is
currently done every 4 * sysctl_numa_balancing_scan_delay interval.

This logic may need more experimentation/refinement.

Things to ponder over (and future TODOs):
==========================================
- Improvements to the logic for clearing accessing PIDs (discussed in
  detail in patch 3 itself)

- The scan period is not changed in this patchset, so we do see frequent
  scan attempts. Relaxing the scan period dynamically could improve
  results further.

Result Summary:
================
The results were obtained by running mmtests with the following config:
config-numa-autonumabench
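
For reference, each kernel was driven through mmtests roughly as below
(the exact invocation is from memory and may differ slightly between
mmtests versions):

	# from an mmtests checkout, once per kernel under test
	./run-mmtests.sh --config configs/config-numa-autonumabench 6.1.0-base
	./run-mmtests.sh --config configs/config-numa-autonumabench 6.1.0-patched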

There is a significant reduction in AutoNUMA cost in the benchmark
runs, but some of the results need improvement. I hope that the
potential changes mentioned in patch 3, and tuning numa_scan_period
based on the current scanning efficiency, will help. I will be working
on that.

SUT:
2 socket AMD Milan System
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2

256GB memory per socket amounting to 512GB in total
NPS1 NUMA configuration where each socket is a NUMA node

autonumabench
                                             6.1.0 (base)       6.1.0 (patched)
BAmean-99 syst-NUMA01                  195.84 (   0.00%)       17.79 (  90.91%)
BAmean-99 syst-NUMA01_THREADLOCAL        0.19 (   0.00%)        0.19 (   2.56%)
BAmean-99 syst-NUMA02                    0.85 (   0.00%)        0.85 (   0.00%)
BAmean-99 syst-NUMA02_SMT                0.62 (   0.00%)        0.65 (  -4.30%)
BAmean-99 elsp-NUMA01                  254.95 (   0.00%)      322.69 ( -26.57%)
BAmean-99 elsp-NUMA01_THREADLOCAL        1.04 (   0.00%)        1.05 (  -1.29%)
BAmean-99 elsp-NUMA02                    3.08 (   0.00%)        3.29 (  -6.94%)
BAmean-99 elsp-NUMA02_SMT                3.49 (   0.00%)        3.43 (   1.91%)

                                   6.1.0 (base)  6.1.0 (patched)
Ops NUMA alloc hit                  59210941.00    50772531.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                59200395.00    50771359.00
Ops NUMA base-page range updates    90670863.00       10952.00
Ops NUMA PTE updates                90670863.00       10952.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                92069634.00        9501.00
Ops NUMA hint local faults %        69966984.00        9213.00
Ops NUMA hint local percent               75.99          96.97
Ops NUMA pages migrated              8424631.00         287.00
Ops AutoNUMA cost                     461142.93          47.59

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
[2] RFC V1:
 Link: https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing-PID store logic (Thanks Mel)
 - Fence the new structure / code with CONFIG_NUMA_BALANCING (David, Mel)
 - Add logic to clear the accessed PID info (Mel)
 - More descriptive change log (Mike Rapoport)

Mel Gorman (1):
  sched/numa: Apply the scan delay to every vma instead of tasks

Raghavendra K T (2):
  sched/numa: Enhance vma scanning logic
  sched/numa: Reset the accessing PID information periodically

 include/linux/mm.h       | 23 +++++++++++++++++++
 include/linux/mm_types.h |  9 ++++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 49 ++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  1 +
 mm/memory.c              |  1 +
 6 files changed, 85 insertions(+)

-- 
2.34.1



* [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 10:24   ` Peter Zijlstra
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman, Raghavendra K T

From: Mel Gorman <mgorman@techsingularity.net>

 Avoid scanning new or very short-lived VMAs.

(Raghavendra: Add initialization in vm_area_dup())

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       |  9 +++++++++
 include/linux/mm_types.h |  7 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 17 +++++++++++++++++
 4 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 974ccca609d2..74d9df1d8982 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -611,6 +611,14 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
+#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
+#else
+static inline void vma_numab_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -619,6 +627,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_numab_init(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..e84f95a77321 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -504,6 +508,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab *numab;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..ac6f0477cf6e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_numab_init(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..060b241ce3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab) {
+			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
+			if (!vma->numab)
+				continue;
+
+			vma->numab->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * After the first scan is complete, delay the balancing scan
+		 * for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 11:15   ` Peter Zijlstra
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 During NUMA scanning, make sure only relevant vmas of the
tasks are scanned.

Before:
 All the tasks of a process participate in scanning a vma
even if they never access the vma in its lifespan.

Now:
 Except for the first few unconditional scans, if a task does
not touch a vma (excluding false positives from PID collisions),
tasks no longer scan all vmas.

Logic used:
1) 6 bits of the PID are used to mark an active bit in the vma numab
 state during a fault, to remember the PIDs accessing the vma. (Thanks Mel)

2) Subsequently, in the scan path, vma scanning is skipped if the
current PID has not accessed the vma.

3) The first two scans are still allowed unconditionally to preserve
 the earlier scanning behaviour.

Acknowledgement to Bharata B Rao <bharata@amd.com> for the initial patch
to store PID information.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 14 ++++++++++++++
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 15 +++++++++++++++
 mm/huge_memory.c         |  1 +
 mm/memory.c              |  1 +
 5 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74d9df1d8982..489422942482 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
 	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
 	return last_time << PAGE_ACCESS_TIME_BUCKETS;
 }
+
+static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
+{
+	unsigned int active_pid_bit;
+
+	if (vma->numab) {
+		active_pid_bit = current->pid % BITS_PER_LONG;
+		vma->numab->accessing_pids |= 1UL << active_pid_bit;
+	}
+}
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -1430,6 +1440,10 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 {
 	return false;
 }
+
+static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e84f95a77321..980a6a4308b6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab {
 	unsigned long next_scan;
+	unsigned long accessing_pids;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 060b241ce3c5..3505ae57c07c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	unsigned int active_pid_bit;
+
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	active_pid_bit = current->pid % BITS_PER_LONG;
+
+	return vma->numab->accessing_pids & (1UL << active_pid_bit);
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
 		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
 			continue;
 
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 811d19b5c4f6..d908aa95f3c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	bool was_writable = pmd_savedwrite(oldpmd);
 	int flags = 0;
 
+	vma_set_active_pid_bit(vma);
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
 		spin_unlock(vmf->ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..2ec3045cb8b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
 
+	vma_set_active_pid_bit(vma);
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
-- 
2.34.1



* [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 11:35   ` Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 This helps to ensure that only recently accessed PIDs scan the
VMAs.

Current implementation:
 Reset the accessing PIDs every (4 * sysctl_numa_balancing_scan_delay)
interval, after the initial scan delay period expires. The reset logic
is implemented in the scan path.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
Some of the potential ideas for clearing the accessing PIDs

1) A flag to indicate the phase in the life cycle of the vma, tied to a
timestamp (reuse next_scan or so)

VMA life cycle

t1         t2         t3                    t4         t5                   t6
|<-  DS  ->|<-  US  ->|<-        CS       ->|<-  US  ->|<-        CS       ->|
Flags are used to indicate whether we are in the DS/US/CS phase.

DS (delayed scan): initial phase where scanning is avoided for a new VMA
US (unconditional scan): brief period where scanning is allowed irrespective of the task faulting on the VMA
CS (conditional scan): longer conditional scanning phase where a task is allowed to scan only VMAs of interest


2) Maintain a duplicate list of accessing PIDs to keep track of the
history of access, and switch/reset periodically; use an OR operation
during iteration.

 Two lists of PIDs are maintained. At a regular interval the old list is
reset and the current list becomes the old list. At any point in time the
set of PIDs accessing the VMA is determined by ORing list1 and list2.

accessing_pids_list1 <-  current list
accessing_pids_list2 <-  old list

3) Also maintain a per-vma numa_seq
Currently numa_seq (how many times we have scanned the entire set of VMAs)
is maintained at the mm level. Having a per-VMA numa_seq (roughly, how many
times the current VMA has been considered for scanning) may be helpful in
some contexts (e.g. whether to allow scanning a newly created VMA
unconditionally).

4) Reset accessing PIDs at regular intervals (current implementation)

t1       t2         t3         t4         t5         t6
|<- DS ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|

The current implementation resets the accessing PIDs every 4*scan_delay
interval after the initial scan delay expires. The reset logic is
implemented in the scan path.

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 980a6a4308b6..08a007744ea1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab {
 	unsigned long next_scan;
+	unsigned long next_pid_reset;
 	unsigned long accessing_pids;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3505ae57c07c..14db6d8a5090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,8 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 	return vma->numab->accessing_pids & (1UL << active_pid_bit);
 }
 
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3035,6 +3037,10 @@ static void task_numa_work(struct callback_head *work)
 
 			vma->numab->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens after 4 times scan delay of scan start */
+			vma->numab->next_pid_reset =  vma->numab->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 		}
 
 		/*
@@ -3047,6 +3053,17 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessed(vma))
 			continue;
 
+		/*
+		 * RESET accessing PIDs regularly for old VMAs. Resetting after checking
+		 * vma for recent access to avoid clearing PID info before access..
+		 */
+		if (mm->numa_scan_seq &&
+				time_after(jiffies, vma->numab->next_pid_reset)) {
+			vma->numab->next_pid_reset =  vma->numab->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab->accessing_pids = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* Re: [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
@ 2023-02-03 10:24   ` Peter Zijlstra
  2023-02-04 17:19     ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 10:24 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman

On Wed, Feb 01, 2023 at 01:32:20PM +0530, Raghavendra K T wrote:
> From: Mel Gorman <mgorman@techsingularity.net>
> 
>  Avoid scanning new or very short-lived VMAs.
> 
> (Raghavendra: Add initialization in vm_area_dup())

Given this is a performance centric patch -- some sort of qualification
/ justification would be much appreciated.

Also, perhaps explain the rationale for the actual heuristics chosen.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm.h       |  9 +++++++++
>  include/linux/mm_types.h |  7 +++++++
>  kernel/fork.c            |  2 ++
>  kernel/sched/fair.c      | 17 +++++++++++++++++
>  4 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 974ccca609d2..74d9df1d8982 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -611,6 +611,14 @@ struct vm_operations_struct {
>  					  unsigned long addr);
>  };
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
> +#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
> +#else
> +static inline void vma_numab_init(struct vm_area_struct *vma) {}
> +static inline void vma_numab_free(struct vm_area_struct *vma) {}
> +#endif /* CONFIG_NUMA_BALANCING */

I'm tripping over the inconsistency of macros and functions here. I'd
suggest making both cases functions.
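
E.g. something like (completely untested):

	#ifdef CONFIG_NUMA_BALANCING
	static inline void vma_numab_init(struct vm_area_struct *vma)
	{
		vma->numab = NULL;
	}

	static inline void vma_numab_free(struct vm_area_struct *vma)
	{
		kfree(vma->numab);
	}
	#else
	static inline void vma_numab_init(struct vm_area_struct *vma) {}
	static inline void vma_numab_free(struct vm_area_struct *vma) {}
	#endif /* CONFIG_NUMA_BALANCING */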


> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..e84f95a77321 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -435,6 +435,10 @@ struct anon_vma_name {
>  	char name[];
>  };
>  
> +struct vma_numab {
> +	unsigned long next_scan;
> +};

I'm not sure what a numab is; contraction of new-kebab, something else?

While I appreciate short names, they'd ideally also make sense. If we
cannot come up with a better one, perhaps elucidate the reader with a
comment.

> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -504,6 +508,9 @@ struct vm_area_struct {

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..060b241ce3c5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
>  		if (!vma_is_accessible(vma))
>  			continue;
>  
> +		/* Initialise new per-VMA NUMAB state. */
> +		if (!vma->numab) {
> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
> +			if (!vma->numab)
> +				continue;
> +
> +			vma->numab->next_scan = now +
> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> +		}
> +
> +		/*
> +		 * After the first scan is complete, delay the balancing scan
> +		 * for new VMAs.
> +		 */
> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
> +			continue;

I think I sorta see why, but I'm thinking it would be good to include
more of the why in that comment.


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-03 11:15   ` Peter Zijlstra
  2023-02-03 11:27     ` Peter Zijlstra
  2023-02-04 18:14     ` Raghavendra K T
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:15 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
>  During the Numa scanning make sure only relevant vmas of the
> tasks are scanned.
> 
> Before:
>  All the tasks of a process participate in scanning the vma
> even if they do not access vma in it's lifespan.
> 
> Now:
>  Except cases of first few unconditional scans, if a process do
> not touch vma (exluding false positive cases of PID collisions)
> tasks no longer scan all vma.
> 
> Logic used:
> 1) 6 bits of PID used to mark active bit in vma numab status during
>  fault to remember PIDs accessing vma. (Thanks Mel)
> 
> 2) Subsequently in scan path, vma scanning is skipped if current PID
> had not accessed vma.
> 
> 3) First two times we do allow unconditional scan to preserve earlier
>  behaviour of scanning.
> 
> Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch
> to store pid information.
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm.h       | 14 ++++++++++++++
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 15 +++++++++++++++
>  mm/huge_memory.c         |  1 +
>  mm/memory.c              |  1 +
>  5 files changed, 32 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 74d9df1d8982..489422942482 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
>  	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
>  	return last_time << PAGE_ACCESS_TIME_BUCKETS;
>  }
> +
> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +
> +	if (vma->numab) {
> +		active_pid_bit = current->pid % BITS_PER_LONG;
> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
> +	}
> +}

Perhaps:

	if (vma->numab)
		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);

?

Or maybe even:

	bit = current->pid % BITS_PER_LONG;
	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 060b241ce3c5..3505ae57c07c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  	p->mm->numa_scan_offset = 0;
>  }
>  
> +static bool vma_is_accessed(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +
	/*
	 * Tell us why 2....
	 */
> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> +		return true;
> +
> +	active_pid_bit = current->pid % BITS_PER_LONG;
> +
> +	return vma->numab->accessing_pids & (1UL << active_pid_bit);
	return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)
> +}
> +
>  /*
>   * The expensive part of numa migration is done from task_work context.
>   * Triggered from task_tick_numa().
> @@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>  		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>  			continue;
>  
		/*
		 * tell us more...
		 */
> +		if (!vma_is_accessed(vma))
> +			continue;
> +
>  		do {
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


This feels wrong, specifically we track numa_scan_offset per mm, now, if
we divide the threads into two dis-joint groups each only using their
own set of vmas (in fact quite common for workloads with proper data
partitioning) it is possible to consistently sample one set of threads
and thus not scan the other set of vmas.

It seems somewhat unlikely, but not impossible to create significant
unfairness.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 811d19b5c4f6..d908aa95f3c3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pmd_savedwrite(oldpmd);
>  	int flags = 0;
>  
> +	vma_set_active_pid_bit(vma);
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>  		spin_unlock(vmf->ptl);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..2ec3045cb8b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pte_savedwrite(vmf->orig_pte);
>  	int flags = 0;
>  
> +	vma_set_active_pid_bit(vma);
>  	/*
>  	 * The "pte" at this point cannot be used safely without
>  	 * validation through pte_unmap_same(). It's of NUMA type but

Urghh... do_*numa_page() is two near identical functions.. is there
really no sane way to de-duplicate at least some of that?

Also, is this placement right, you're marking the thread even before we
know there's even a page there. I would expect this somewhere around
where we track lastpid.

Maybe numa_migrate_prep() ?
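
I.e. something like this (untested sketch; numa_migrate_prep() is shared
by do_numa_page() and do_huge_pmd_numa_page(), so both fault paths would
be covered):

	int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
			      unsigned long addr, int page_nid, int *flags)
	{
		/* Record the accessing task once the hinting fault is confirmed. */
		vma_set_active_pid_bit(vma);

		get_page(page);
		count_vm_numa_event(NUMA_HINT_FAULTS);
		if (page_nid == numa_node_id()) {
			count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
			*flags |= TNF_FAULT_LOCAL;
		}

		return mpol_misplaced(page, vma, addr);
	}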


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:15   ` Peter Zijlstra
@ 2023-02-03 11:27     ` Peter Zijlstra
  2023-02-04 18:18       ` Raghavendra K T
  2023-02-04 18:14     ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Fri, Feb 03, 2023 at 12:15:48PM +0100, Peter Zijlstra wrote:

> > +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
> > +{
> > +	unsigned int active_pid_bit;
> > +
> > +	if (vma->numab) {
> > +		active_pid_bit = current->pid % BITS_PER_LONG;
> > +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
> > +	}
> > +}
> 
> Perhaps:
> 
> 	if (vma->numab)
> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
> 
> ?
> 
> Or maybe even:
> 
> 	bit = current->pid % BITS_PER_LONG;
> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
> 		__set_bit(bit, &vma->numab->pids);
> 

The alternative to just taking the low n bits is to use:

  hash_32(current->pid, BITS_PER_LONG)

That mixes things up a bit.
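
E.g. (untested; hash_32() takes the number of result bits, so the index
would end up something like):

	unsigned int bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);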


* Re: [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
@ 2023-02-03 11:35   ` Peter Zijlstra
  2023-02-04 18:32     ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:35 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:

> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
> 
>  Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2  
> 
> accessing_pids_list1 <-  current list
> accessing_pids_list2 <-  old list

( I'm not sure why you think this part of the email doesn't need to be
  nicely wrapped at 76 chars.. )

This seems simple enough to me and can be trivially extended to N if
needed.

The typical implementation would look something like:

	unsigned long pids[N];
	unsigned int pid_idx;

set:
	unsigned long *pids = numab->pids + pid_idx;
	if (!__test_bit(bit, pids))
		__set_bit(bit, pids);

test:
	unsigned long pids = 0;
	for (int i = 0; i < N; i++)
		pids |= numab->pids[i];
	return __test_bit(bit, &pids);

rotate:
	idx = READ_ONCE(numab->pid_idx);
	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
	numab->pids[idx] = 0;

Note the actual rotate can be simplified to ^1 for N:=2.
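
I.e. for N == 2 the rotate becomes:

rotate:
	idx = READ_ONCE(numab->pid_idx);
	WRITE_ONCE(numab->pid_idx, idx ^ 1);
	numab->pids[idx] = 0;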


* Re: [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-03 10:24   ` Peter Zijlstra
@ 2023-02-04 17:19     ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman

On 2/3/2023 3:54 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:20PM +0530, Raghavendra K T wrote:
>> From: Mel Gorman <mgorman@techsingularity.net>
>>
>>   Avoid scanning new or very short-lived VMAs.
>>
>> (Raghavendra: Add initialization in vm_area_dup())
> 
> Given this is a performance centric patch -- some sort of qualification
> / justification would be much appreciated.
> 

Thank you, Peter, for the review.
Sure, I will add more detailed results in the cover letter and a summary
in the patch commit message.

> Also, perhaps explain the rationale for the actual heuristics chosen.
> 

Sure, I will add more detail in V3.

>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm.h       |  9 +++++++++
>>   include/linux/mm_types.h |  7 +++++++
>>   kernel/fork.c            |  2 ++
>>   kernel/sched/fair.c      | 17 +++++++++++++++++
>>   4 files changed, 35 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 974ccca609d2..74d9df1d8982 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -611,6 +611,14 @@ struct vm_operations_struct {
>>   					  unsigned long addr);
>>   };
>>   
>> +#ifdef CONFIG_NUMA_BALANCING
>> +#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
>> +#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
>> +#else
>> +static inline void vma_numab_init(struct vm_area_struct *vma) {}
>> +static inline void vma_numab_free(struct vm_area_struct *vma) {}
>> +#endif /* CONFIG_NUMA_BALANCING */
> 
> I'm tripping over the inconsistency of macros and functions here. I'd
> suggest making both cases functions.
> 
> 

Sure will do that

>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..e84f95a77321 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -435,6 +435,10 @@ struct anon_vma_name {
>>   	char name[];
>>   };
>>   
>> +struct vma_numab {
>> +	unsigned long next_scan;
>> +};
> 
> I'm not sure what a numab is; contraction of new-kebab, something else?
> 
> While I appreciate short names, they'd ideally also make sense. If we
> cannot come up with a better one, perhaps elucidate the reader with a
> comment.

Agree.. How about vma_numab, vma_numab_state or vma_numab_info as
abbreviations for vma_numa_balancing_info/state?

> 
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -504,6 +508,9 @@ struct vm_area_struct {
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e4a0b8bd941c..060b241ce3c5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
>>   		if (!vma_is_accessible(vma))
>>   			continue;
>>   
>> +		/* Initialise new per-VMA NUMAB state. */
>> +		if (!vma->numab) {
>> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
>> +			if (!vma->numab)
>> +				continue;
>> +
>> +			vma->numab->next_scan = now +
>> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>> +		}
>> +
>> +		/*
>> +		 * After the first scan is complete, delay the balancing scan
>> +		 * for new VMAs.
>> +		 */
>> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>> +			continue;
> 
> I think I sorta see why, but I'm thinking it would be good to include
> more of the why in that comment.

Sure. Will add something along the lines of: "scanning the VMAs of
short-lived tasks adds more overhead than benefit ...".


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:15   ` Peter Zijlstra
  2023-02-03 11:27     ` Peter Zijlstra
@ 2023-02-04 18:14     ` Raghavendra K T
  2023-02-07  6:41       ` Raghavendra K T
  2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 2 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
>>   During the Numa scanning make sure only relevant vmas of the
>> tasks are scanned.
>>
>> Before:
>>   All the tasks of a process participate in scanning the vma
>> even if they do not access vma in it's lifespan.
>>
>> Now:
>>   Except cases of first few unconditional scans, if a process do
>> not touch vma (exluding false positive cases of PID collisions)
>> tasks no longer scan all vma.
>>
>> Logic used:
>> 1) 6 bits of PID used to mark active bit in vma numab status during
>>   fault to remember PIDs accessing vma. (Thanks Mel)
>>
>> 2) Subsequently in scan path, vma scanning is skipped if current PID
>> had not accessed vma.
>>
>> 3) First two times we do allow unconditional scan to preserve earlier
>>   behaviour of scanning.
>>
>> Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch
>> to store pid information.
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm.h       | 14 ++++++++++++++
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 15 +++++++++++++++
>>   mm/huge_memory.c         |  1 +
>>   mm/memory.c              |  1 +
>>   5 files changed, 32 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 74d9df1d8982..489422942482 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
>>   	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
>>   	return last_time << PAGE_ACCESS_TIME_BUCKETS;
>>   }
>> +
>> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
>> +{
>> +	unsigned int active_pid_bit;
>> +
>> +	if (vma->numab) {
>> +		active_pid_bit = current->pid % BITS_PER_LONG;
>> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
>> +	}
>> +}
> 
> Perhaps:
> 
> 	if (vma->numab)
> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
> 
> ?
> 
> Or maybe even:
> 
> 	bit = current->pid % BITS_PER_LONG;
> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
> 		__set_bit(bit, &vma->numab->pids);
> 
> 

Sure ..will use one of the above.

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 060b241ce3c5..3505ae57c07c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
>>   	p->mm->numa_scan_offset = 0;
>>   }
>>   
>> +static bool vma_is_accessed(struct vm_area_struct *vma)
>> +{
>> +	unsigned int active_pid_bit;
>> +
> 	/*
> 	 * Tell us why 2....
> 	 */

Agree. The logic is mainly about allowing an unconditional scan the first
two times, to build the task/page relation. I will experiment with whether
we further need to allow two full passes, as with "multi-stage node
selection" (= 4), to take care of early migration.

The only doubt I have is that numa_scan_seq is per mm; will that create
corner cases, or do we need a separate per-vma count for when a new VMA
is created?

>> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
>> +		return true;
>> +
>> +	active_pid_bit = current->pid % BITS_PER_LONG;
>> +
>> +	return vma->numab->accessing_pids & (1UL << active_pid_bit);
> 	return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)
>> +}
>> +
>>   /*
>>    * The expensive part of numa migration is done from task_work context.
>>    * Triggered from task_tick_numa().
>> @@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>>   		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>>   			continue;
>>   
> 		/*
> 		 * tell us more...
> 		 */

Sure. This is the core of the whole logic, where we mostly want to
confine VMA scanning to the PIDs of interest.

>> +		if (!vma_is_accessed(vma))
>> +			continue;
>> +
>>   		do {
>>   			start = max(start, vma->vm_start);
>>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> 
> 
> This feels wrong, specifically we track numa_scan_offset per mm, now, if
> we divide the threads into two dis-joint groups each only using their
> own set of vmas (in fact quite common for workloads with proper data
> partitioning) it is possible to consistently sample one set of threads
> and thus not scan the other set of vmas.
> 
> It seems somewhat unlikely, but not impossible to create significant
> unfairness.
> 

Agree, but that is the reason why we allow the first few
unconditional scans. Or am I missing something?

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 811d19b5c4f6..d908aa95f3c3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	bool was_writable = pmd_savedwrite(oldpmd);
>>   	int flags = 0;
>>   
>> +	vma_set_active_pid_bit(vma);
>>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>   	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>   		spin_unlock(vmf->ptl);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 8c8420934d60..2ec3045cb8b3 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	bool was_writable = pte_savedwrite(vmf->orig_pte);
>>   	int flags = 0;
>>   
>> +	vma_set_active_pid_bit(vma);
>>   	/*
>>   	 * The "pte" at this point cannot be used safely without
>>   	 * validation through pte_unmap_same(). It's of NUMA type but
> 
> Urghh... do_*numa_page() is two near identical functions.. is there
> really no sane way to de-duplicate at least some of that?
> 

Agree. I will explore and will take that as a separate TODO.

> Also, is this placement right, you're marking the thread even before we
> know there's even a page there. I would expect this somewhere around
> where we track lastpid.
> 

Good point. I will check this again

> Maybe numa_migrate_prep() ?

Yes.. there was no hurry to record the accessing PID that early above...



* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:27     ` Peter Zijlstra
@ 2023-02-04 18:18       ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 4:57 PM, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 12:15:48PM +0100, Peter Zijlstra wrote:
> 
>>> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
>>> +{
>>> +	unsigned int active_pid_bit;
>>> +
>>> +	if (vma->numab) {
>>> +		active_pid_bit = current->pid % BITS_PER_LONG;
>>> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
>>> +	}
>>> +}
>>
>> Perhaps:
>>
>> 	if (vma->numab)
>> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
>>
>> ?
>>
>> Or maybe even:
>>
>> 	bit = current->pid % BITS_PER_LONG;
>> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
>> 		__set_bit(bit, &vma->numab->pids);
>>
> 
> The alternative to just taking the low n bits is to use:
> 
>    hash_32(current->pid, BITS_PER_LONG)
> 
> That mixes things up a bit.

Good idea. For workloads that create a smaller number of threads quickly,
the current solution might have been simpler, but when thread creation
happens over a period of time the hash function mixes things up and avoids
collisions. Will experiment with this option.


* Re: [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-03 11:35   ` Peter Zijlstra
@ 2023-02-04 18:32     ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 5:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:
> 
>> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
>>
>>   Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
>> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2
>>
>> accessing_pids_list1 <-  current list
>> accessing_pids_list2 <-  old list
> 
> ( I'm not sure why you think this part of the email doesn't need to be
>    nicely wrapped at 76 chars.. )
> 

Sorry.. copy pasted from my "idea" notes then word wrap fooled me..

> This seems simple enough to me and can be trivially extended to N if
> needed.
> 
> The typical implementation would look something like:
> 
> 	unsigned long pids[N];
> 	unsigned int pid_idx;
> 
> set:
> 	unsigned long *pids = numab->pids + pid_idx;
> 	if (!__test_bit(bit, pids))
> 		__set_bit(bit, pids);
> 
> test:
> 	unsigned long pids = 0;
> 	for (int i = 0; i < N; i++)
> 		pids |= numab->pids[i];
> 	return __test_bit(bit, &pids);
> 
> rotate:
> 	idx = READ_ONCE(numab->pid_idx);
> 	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
> 	numab->pids[idx] = 0;
> 
> Note the actual rotate can be simplified to ^1 for N:=2.

Thanks, good idea. This will be very helpful when we want to
differentiate accessing PIDs in a more granular way. Perhaps we can go
with N=2 and stick to the below simplification of your code above?

Something like:

unsigned long pids[2];

// Assume pids[1] always has the latest detail
set:
	if (!__test_bit(bit, &pids[1]))
		__set_bit(bit, &pids[1]);

test:
	unsigned long access_pids = pids[0] | pids[1];
	return __test_bit(bit, &access_pids);

rotate:
	pids[0] = pids[1];
	pids[1] = 0;


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-04 18:14     ` Raghavendra K T
@ 2023-02-07  6:41       ` Raghavendra K T
  2023-02-27  6:40         ` Raghavendra K T
  2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-07  6:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/4/2023 11:44 PM, Raghavendra K T wrote:
> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
[...]
> 
>>> +        if (!vma_is_accessed(vma))
>>> +            continue;
>>> +
>>>           do {
>>>               start = max(start, vma->vm_start);
>>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>
>>
>> This feels wrong, specifically we track numa_scan_offset per mm, now, if
>> we divide the threads into two dis-joint groups each only using their
>> own set of vmas (in fact quite common for workloads with proper data
>> partitioning) it is possible to consistently sample one set of threads
>> and thus not scan the other set of vmas.
>>
>> It seems somewhat unlikely, but not impossible to create significant
>> unfairness.
>>
> 
> Agree, But that is the reason why we want to allow first few
> unconditional scans Or am I missing something?
> 

Thinking further, maybe we can summarize the different aspects of the
thread / two-disjoint-set case into:

1) Unfairness because of the way in which threads get the opportunity
to scan.

2) The disjoint sets of vmas in the partition could be of different sizes.

3) The disjoint sets of vmas could be associated with different numbers of
threads.

Each of the above can potentially help, or make some thread do the heavy
lifting, but (2) and (3) are what I think we are trying to be okay with, by
making sure tasks mostly do not scan others' vmas.

(1) could be a real issue (though I know there are many places where we
have corrected the issue by introducing an offset in p->numa_next_scan),
but how the distribution looks in practice I take as a TODO and will post.


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-07  6:41       ` Raghavendra K T
@ 2023-02-27  6:40         ` Raghavendra K T
  2023-02-27 10:06           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-27  6:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/7/2023 12:11 PM, Raghavendra K T wrote:
> On 2/4/2023 11:44 PM, Raghavendra K T wrote:
>> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
> [...]
>>
>>>> +        if (!vma_is_accessed(vma))
>>>> +            continue;
>>>> +
>>>>           do {
>>>>               start = max(start, vma->vm_start);
>>>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>>
>>>
>>> This feels wrong, specifically we track numa_scan_offset per mm, now, if
>>> we divide the threads into two dis-joint groups each only using their
>>> own set of vmas (in fact quite common for workloads with proper data
>>> partitioning) it is possible to consistently sample one set of threads
>>> and thus not scan the other set of vmas.
>>>
>>> It seems somewhat unlikely, but not impossible to create significant
>>> unfairness.
>>>
>>
>> Agree, But that is the reason why we want to allow first few
>> unconditional scans Or am I missing something?
>>
> 
> Thinking further, may be we can summarize the different aspects of 
> thread/ two disjoint set case itself into:
> 
> 1) Unfairness because of way in which threads gets opportunity
> to scan.
> 
> 2) Disjoint set of vmas in the partition set could be of different sizes
> 
> 3) Disjoint set of vmas could be associated with different number of
> threads
> 
> Each of above can potentially help or make some thread do heavy lifting
> 
> but (2), and (3). is what I think we are trying to be Okay with by
> making sure tasks mostly do not scan others' vmas
> 
> (1) could be a real issue (though I know that there are many places we
>   have corrected the issue by introducing offset in p->numa_next_scan)
> but how the distribution looks now practically, I take it as a TODO and
> post.


Hello PeterZ,
Sorry to come back a little late with data on how unfair we are when we
have the two disjoint sets of groups mentioned above:

Program I tested with (a rough sketch of it is below):
1. A 128/256 thread program divided into two sets, each accessing a
different set of memory (set0, set1) from node1 and node2 to induce
migration (8GB divided into 4GB each).
Total iterations per thread was around 2048, so that the
128 thread program ran around 8 min, and the
256 thread program ran around 17/18 min.
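
The sketch below is illustrative only; node numbers, sizes and the
per-thread slicing are approximations of what was actually run (built
with something like gcc -O2 -pthread prog.c -lnuma):

	#include <numa.h>
	#include <pthread.h>
	#include <string.h>

	#define NTHREADS	128
	#define SET_SIZE	(4UL << 30)	/* 4GB per set */
	#define ITERS		2048

	static char *set_mem[2];

	static void *worker(void *arg)
	{
		long id = (long)arg;
		size_t slice_sz = SET_SIZE / (NTHREADS / 2);
		/* threads split into two disjoint groups: set0 and set1 */
		char *slice = set_mem[id & 1] + (id / 2) * slice_sz;

		for (int i = 0; i < ITERS; i++)
			memset(slice, i, slice_sz);	/* touch only "our" set */
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];

		/* back each set with memory from a different NUMA node */
		set_mem[0] = numa_alloc_onnode(SET_SIZE, 0);
		set_mem[1] = numa_alloc_onnode(SET_SIZE, 1);
		if (!set_mem[0] || !set_mem[1])
			return 1;

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, worker, (void *)i);
		for (long i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}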

SUT: 128 core 256 CPU Milan with 2 nodes

Some of the observations:
1) The total number of threads which got a chance to scan over the entire
program span was around 50-62%.

E.g. in the 128 thread case the total number of tasks that got to scan was
around 74-84; for 256 threads the range was 123-146.

2) There was always a little bias towards set 1 (the threads where the
remote faults occurred).

In summary: I do see that access to VMAs from disjoint sets is not fully
fair, but on the other hand it is not very bad either. There is definitely
some scope to explore/improve fairness in this area further.

Posting the result for one of the 128-thread runs (base 6.1.0 vanilla):

column1: frequency
column2: PID

set 0 threads
       1 5906
       1 5908
       2 5912
       1 5914
       1 5916
       1 5918
       1 5926
       2 5928
       3 5932
       3 5938
       1 5940
       1 5944
       2 5948
       2 5950
       1 5956
       1 5962
       1 5974
       1 5978
       1 5992
       1 6004
       1 6006
       1 6008
       1 6012
       3 6014
       1 6016
       2 6026
       2 6028
       1 6030

set 1 threads
       3 5907
       5 5909
       2 5911
       4 5913
       7 5915
       2 5917
       3 5919
       5 5921
       3 5923
       2 5925
       4 5927
       4 5929
       3 5931
       3 5933
       2 5935
       7 5937
       4 5939
       3 5941
       4 5943
       1 5945
       2 5947
       1 5949
       1 5951
       2 5953
       1 5955
       1 5957
       1 5959
       1 5961
       1 5963
       1 5965
       1 5967
       1 5969
       1 5971
       2 5975
       2 5979
       2 5981
       2 5983
       5 5993
       5 5995
      11 5997
       1 5999
       6 6003
       4 6005
       3 6007
       1 6013
       1 6015
       3 6017
       1 6019
       1 6021
       1 6023
       6 6027
       4 6029
       4 6031
       1 6033

PS: I have also tested the above with the V3 patch applied (which
incorporates your suggestions) and have not seen much deviation in the
observations.

Thanks
- Raghu


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-27  6:40         ` Raghavendra K T
@ 2023-02-27 10:06           ` Peter Zijlstra
  2023-02-27 10:12             ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-27 10:06 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Mon, Feb 27, 2023 at 12:10:41PM +0530, Raghavendra K T wrote:
> In summary: I do see that access to VMAs from disjoint sets is not fully
>  fair, But on the other hand it is not very bad too. There is definitely
> some scope or possibility to explore/improve fairness in this area
> further.

Ok, might be good to summarize some of this in a comment near here, so
that readers are aware of the caveat of this code.

> PS: I have also tested above applying V3 patch (which incorporates your
> suggestions), have not seen much deviation in observation with patch.

I'll see if I can find it in this dumpster fire I call inbox :-)


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-27 10:06           ` Peter Zijlstra
@ 2023-02-27 10:12             ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-27 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/27/2023 3:36 PM, Peter Zijlstra wrote:
> On Mon, Feb 27, 2023 at 12:10:41PM +0530, Raghavendra K T wrote:
>> In summary: I do see that access to VMAs from disjoint sets is not fully
>>   fair, But on the other hand it is not very bad too. There is definitely
>> some scope or possibility to explore/improve fairness in this area
>> further.
> 
> Ok, might be good to summarize some of this in a comment near here, so
> that readers are aware of the caveat of this code.
> 

Sure will do.

>> PS: I have also tested above applying V3 patch (which incorporates your
>> suggestions), have not seen much deviation in observation with patch.
> 
> I'll see if I can find it in this dumpster fire I call inbox :-)

Sorry, I wasn't clear there.. the V3 is not posted yet and I am about to
post it. It was a heads up.



* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-04 18:14     ` Raghavendra K T
  2023-02-07  6:41       ` Raghavendra K T
@ 2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/4/2023 11:44 PM, Raghavendra K T wrote:
> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
[...]
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 8c8420934d60..2ec3045cb8b3 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault 
>>> *vmf)
>>>       bool was_writable = pte_savedwrite(vmf->orig_pte);
>>>       int flags = 0;
>>> +    vma_set_active_pid_bit(vma);
>>>       /*
>>>        * The "pte" at this point cannot be used safely without
>>>        * validation through pte_unmap_same(). It's of NUMA type but
>>
>> Urghh... do_*numa_page() is two near identical functions.. is there
>> really no sane way to de-duplicate at least some of that?
>>
> 
> Agree. I will explore and will take that as a separate TODO.
> 

I did spend some time looking at whether there is a better way of merging
these two.
The code looks similar, as you noted, with very subtle differences (pte vs
pmd and the difference in unlocking).
Only some parts of the code can be changed easily; changing the whole
thing did not look worthwhile as of now (unless we come up with a better
idea).

This is the only comment that is perhaps not addressed properly, I feel :(

Thanks

