* [PATCH V3 0/4] sched/numa: Enhance vma scanning
@ 2023-02-28  4:50 Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

The patchset proposes one of the enhancements to NUMA vma scanning
suggested by Mel. This is a continuation of [3].

In the existing mechanism, the scan period is derived from per-thread
stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats
at the per-process level to better capture application behaviour.

During the course of that discussion, Mel proposed several ideas to enhance
the current NUMA balancing. One of the suggestions was the following:

Track which threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but it would reduce scanning of
areas the thread is not interested in. The intent of the suggestion is not
to penalize threads that have no interest in the vma, and thus to reduce
the scanning overhead.

V3 changes are mostly based on PeterZ's comments (detailed below under
"Changes since V2").

Summary of patchset:
The current patchset implements the following (a simplified sketch of the
combined mechanism is shown after this list):

1. Delay vma scanning for newly created VMAs, so that the additional
scanning overhead is not incurred for short-lived tasks
(implementation by Mel).

2. Store the information about tasks accessing a VMA in 2 windows, which
are cleared regularly at a (4 * sysctl_numa_balancing_scan_delay) interval
(with the default 1000 ms scan delay, roughly a 4 second window). This
interval was derived from experiments (suggested by PeterZ) to balance
frequent clearing against relying on obsolete access data.

3. hash_32() is used to encode which task accessed the VMA in the
per-VMA access information.

4. The VMA's access information is then used to skip scanning by tasks
that have not accessed the VMA.
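
A simplified sketch of the combined mechanism (illustrative only: the helper
names below are hypothetical, and the real implementation, with its guards
and allocation paths, is in the patches that follow):

	/* Illustrative sketch only -- not the actual patch code. */

	/* fault path: remember that the current task touched this VMA */
	static inline void vma_mark_accessed(struct vm_area_struct *vma)
	{
		unsigned int pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
	}

	/* scan path: skip a VMA the current task never faulted on */
	static inline bool vma_was_accessed(struct vm_area_struct *vma)
	{
		unsigned long pids = vma->numab_state->access_pids[0] |
				     vma->numab_state->access_pids[1];

		return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
	}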

Things to ponder over:
==========================================
- Improvement to the logic for clearing accessing PIDs (discussed in detail
  in patch 3 itself; addressed in this patchset by implementing a 2-window
  history).

- The scan period itself is not changed in this patchset, so we still see
  frequent attempts to scan. Relaxing the scan period dynamically could
  improve results further.

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

[2] RFC V1 Link: 
  https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

[3] V2 Link:
  https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/

Changes since V2:
Patch1:
 - Renaming of structure, macro to function
 - Add explanation of the heuristics
 - Add more details from results (PeterZ)
Patch2:
 - Usage of test and set bit (PeterZ)
 - Move storing of access PID info to numa_migrate_prep()
 - Add a note on fairness among tasks allowed to scan (PeterZ)
Patch3:
 - Maintain two windows of access PID information
   (PeterZ supported the implementation and gave the idea to extend
    to N windows if needed)
Patch4:
 - Apply hash_32 function to track VMA-accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing PID store logic (Thanks Mel)
 - Fence structure / code with NUMA_BALANCING (David, Mel)
 - Add the clearing of access PID logic (Mel)
 - More descriptive change log (Mike Rapoport)

Results:
Summary: A huge AutoNUMA cost reduction is seen in mmtests. Kernbench and
dbench improve by around 5%, and system time improves by a large margin
(80%+) in the mmtests autonumabench run.

kernbench
=============
                                    6.1.0-base                 6.1.0-patched
Amean     user-256    22437.65 (   0.00%)    22622.16 *  -0.82%*
Amean     syst-256     9290.30 (   0.00%)     8763.85 *   5.67%*
Amean     elsp-256      159.36 (   0.00%)      157.44 *   1.20%*

Duration User       67322.16    67876.18
Duration System     27884.89    26306.28
Duration Elapsed      498.95      494.42

Ops NUMA alloc hit                1738904367.00  1738882062.00
Ops NUMA alloc local              1738904104.00  1738881490.00
Ops NUMA base-page range updates      440526.00      272095.00
Ops NUMA PTE updates                  440526.00      272095.00
Ops NUMA hint faults                  109109.00       55630.00
Ops NUMA hint local faults %            5474.00         196.00
Ops NUMA hint local percent                5.02           0.35
Ops NUMA pages migrated               103400.00       55434.00
Ops AutoNUMA cost                        550.59         281.11

autonumabench
===============
                                    6.1.0-base                 6.1.0-patched
Amean     syst-NUMA01                  252.55 (   0.00%)       27.71 *  89.03%*
Amean     syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.23 * -12.77%*
Amean     syst-NUMA02                    0.91 (   0.00%)        0.76 *  16.22%*
Amean     syst-NUMA02_SMT                0.67 (   0.00%)        0.67 *  -1.07%*
Amean     elsp-NUMA01                  269.93 (   0.00%)      309.44 * -14.64%*
Amean     elsp-NUMA01_THREADLOCAL        1.05 (   0.00%)        1.07 *  -1.36%*
Amean     elsp-NUMA02                    3.26 (   0.00%)        3.29 *  -0.79%*
Amean     elsp-NUMA02_SMT                3.73 (   0.00%)        3.52 *   5.64%*

Duration User      318683.69   330084.06
Duration System      1780.77      206.14
Duration Elapsed     1954.30     2233.06


Ops NUMA alloc hit                  62237331.00    49179090.00
Ops NUMA alloc local                62235222.00    49177092.00
Ops NUMA base-page range updates    85303091.00       29242.00
Ops NUMA PTE updates                85303091.00       29242.00
Ops NUMA hint faults                87457481.00        8302.00
Ops NUMA hint local faults %        66665145.00        6064.00
Ops NUMA hint local percent               76.23          73.04
Ops NUMA pages migrated              9348511.00        2232.00
Ops AutoNUMA cost                     438062.15          41.76

dbench
========
dbench -t 90 <nproc>

Throughput
#clients            base                patched             %improvement
1                   842.655 MB/sec      922.305 MB/sec      9.45
16                  5062.82 MB/sec      5079.85 MB/sec      0.34
64                  9408.81 MB/sec      9980.89 MB/sec      6.08
256                 7076.59 MB/sec      7590.76 MB/sec      7.26

Mel Gorman (1):
  sched/numa: Apply the scan delay to every new vma

Raghavendra K T (3):
  sched/numa: Enhance vma scanning logic
  sched/numa: implement access PID reset logic
  sched/numa: Use hash_32 to mix up PIDs accessing VMA

 include/linux/mm.h       | 30 +++++++++++++++++++++
 include/linux/mm_types.h |  9 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c              |  3 +++
 5 files changed, 101 insertions(+)

-- 
2.34.1



* [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 2/4] sched/numa: Enhance vma scanning logic Raghavendra K T
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman, Raghavendra K T

From: Mel Gorman <mgorman@techsingularity.net>

Currently, whenever a new task is created, we wait for
sysctl_numa_balancing_scan_delay to avoid unnecessary scanning
overhead. Extend the same logic to new or very short-lived VMAs.

(Raghavendra: Add initialization in vm_area_dup())

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 16 ++++++++++++++++
 include/linux/mm_types.h |  7 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 19 +++++++++++++++++++
 4 files changed, 44 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 974ccca609d2..41cc8997d4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,7 @@
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
 #include <linux/memremap.h>
+#include <linux/slab.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -611,6 +612,20 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+	vma->numab_state = NULL;
+}
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+	kfree(vma->numab_state);
+}
+#else
+static inline void vma_numab_state_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -619,6 +634,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_numab_state_init(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..a4a1093870d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab_state {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -504,6 +508,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..6c19a3305990 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_numab_state_init(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..e39c36e71cec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3015,6 +3015,25 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab_state) {
+			vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
+				GFP_KERNEL);
+			if (!vma->numab_state)
+				continue;
+
+			vma->numab_state->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * Scanning the VMAs of short-lived tasks adds more overhead. So
+		 * delay the scan for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies,
+						vma->numab_state->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V3 2/4] sched/numa: Enhance vma scanning logic
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 3/4] sched/numa: implement access PID reset logic Raghavendra K T
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

During NUMA scanning, make sure only the relevant VMAs of a
task are scanned.

Before:
 All the tasks of a process participate in scanning every vma,
even if they never access that vma in their lifespan.

Now:
 Apart from the first few unconditional scans, if a task does
not touch a vma (excluding false-positive cases of PID collisions),
it no longer scans that vma.

Logic used:
1) 6 bits of the PID are used to mark an active bit in the vma numab
 state during a fault, to remember the PIDs accessing the vma. (Thanks Mel)

2) Subsequently, in the scan path, scanning of a vma is skipped if the
current PID has not accessed it.

3) The first two scans are still allowed unconditionally, to preserve the
 earlier scanning behaviour.

Acknowledgements to Bharata B Rao <bharata@amd.com> for the initial patch
to store PID information, and to Peter Zijlstra <peterz@infradead.org>
for the usage of test and set bit.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 14 ++++++++++++++
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 19 +++++++++++++++++++
 mm/memory.c              |  3 +++
 4 files changed, 37 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 41cc8997d4e5..097680aaca1e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1388,6 +1388,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
 	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
 	return last_time << PAGE_ACCESS_TIME_BUCKETS;
 }
+
+static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
+{
+	unsigned int pid_bit;
+
+	pid_bit = current->pid % BITS_PER_LONG;
+	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids)) {
+		__set_bit(pid_bit, &vma->numab_state->access_pids);
+	}
+}
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -1437,6 +1447,10 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 {
 	return false;
 }
+
+static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4a1093870d3..582523e73546 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab_state {
 	unsigned long next_scan;
+	unsigned long access_pids;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e39c36e71cec..05490cb2d5c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,21 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	/*
+	 * Allow unconditional access first two times, so that all the (pages)
+	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * This is also done to avoid any side effect of task scanning
+	 * amplifying the unfairness of disjoint set of VMAs' access.
+	 */
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	return test_bit(current->pid % BITS_PER_LONG,
+				&vma->numab_state->access_pids);
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3034,6 +3049,10 @@ static void task_numa_work(struct callback_head *work)
 						vma->numab_state->next_scan))
 			continue;
 
+		/* Do not scan the VMA if task has not accessed */
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..150c03a3419c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4698,6 +4698,9 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 {
 	get_page(page);
 
+	/* Record the current PID accessing the VMA */
+	vma_set_access_pid_bit(vma);
+
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-- 
2.34.1



* [PATCH V3 3/4] sched/numa: implement access PID reset logic
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 2/4] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA Raghavendra K T
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

This helps ensure that only tasks whose PIDs recently accessed the
VMA scan it.
Current implementation: (idea supported by PeterZ)
 1. Access PID information is maintained in two windows, with
access_pids[1] being the newest.

 2. Old access PID info, i.e. access_pids[0], is reset every
(4 * sysctl_numa_balancing_scan_delay) interval after the initial
scan delay period expires.

The above interval was found experimentally to be a good balance, since
it avoids resetting the access info too frequently while still clearing
the old access info regularly.
The reset logic is implemented in the scan path.
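A sketch of the rotation done in the scan path (this mirrors the hunk in
the diff below):

	/* rotate the windows: the newest access info becomes the old window */
	vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
	vma->numab_state->access_pids[1] = 0;

	/* a task is considered to have accessed the VMA if it is in either window */
	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];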

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       |  4 ++--
 include/linux/mm_types.h |  3 ++-
 kernel/sched/fair.c      | 23 +++++++++++++++++++++--
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 097680aaca1e..bd07289fc68e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1394,8 +1394,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 	unsigned int pid_bit;
 
 	pid_bit = current->pid % BITS_PER_LONG;
-	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids)) {
-		__set_bit(pid_bit, &vma->numab_state->access_pids);
+	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
+		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
 	}
 }
 #else /* !CONFIG_NUMA_BALANCING */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 582523e73546..1f1f8bfeae36 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,7 +437,8 @@ struct anon_vma_name {
 
 struct vma_numab_state {
 	unsigned long next_scan;
-	unsigned long access_pids;
+	unsigned long next_pid_reset;
+	unsigned long access_pids[2];
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05490cb2d5c6..f76d5ecaf345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2918,6 +2918,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
 
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
+	unsigned long pids;
 	/*
 	 * Allow unconditional access first two times, so that all the (pages)
 	 * of VMAs get prot_none fault introduced irrespective of accesses.
@@ -2927,10 +2928,12 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
 		return true;
 
-	return test_bit(current->pid % BITS_PER_LONG,
-				&vma->numab_state->access_pids);
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+	return test_bit(current->pid % BITS_PER_LONG, &pids);
 }
 
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3039,6 +3042,10 @@ static void task_numa_work(struct callback_head *work)
 
 			vma->numab_state->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens 4 times the scan delay after the scan start */
+			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 		}
 
 		/*
@@ -3053,6 +3060,18 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessed(vma))
 			continue;
 
+		/*
+		 * Reset access PIDs regularly for old VMAs. Resetting after checking
+		 * the vma for recent access avoids clearing the PID info before that check.
+		 */
+		if (mm->numa_scan_seq &&
+				time_after(jiffies, vma->numab_state->next_pid_reset)) {
+			vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
+			vma->numab_state->access_pids[1] = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
                   ` (2 preceding siblings ...)
  2023-02-28  4:50 ` [PATCH V3 3/4] sched/numa: implement access PID reset logic Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

Before: the last 6 bits of the PID are used as an index to store
information about the tasks accessing a VMA.

After: hash_32() is used to take care of cases where tasks are
created over a period of time, and thus reduce the collision
probability.
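
As an illustration (not part of the patch): on a 64-bit kernel, two PIDs
that differ by a multiple of BITS_PER_LONG collide under the old modulo
scheme, whereas hash_32() mixes the whole PID value first and will usually
keep them apart:

	/* old scheme: PIDs 1100 and 1164 both land on bit 12 (1100 % 64 == 1164 % 64) */
	pid_bit = 1100 % BITS_PER_LONG;
	pid_bit = 1164 % BITS_PER_LONG;

	/* new scheme: hash the full PID, then reduce to ilog2(BITS_PER_LONG) == 6 bits */
	pid_bit = hash_32(1100, ilog2(BITS_PER_LONG));
	pid_bit = hash_32(1164, ilog2(BITS_PER_LONG));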

Result:
The patch series overall improves the AutoNUMA cost by a huge
margin.
Kernbench and dbench showed around a 5% improvement, and
system time in the mmtests autonuma run showed an 80% improvement.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h  | 2 +-
 kernel/sched/fair.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bd07289fc68e..8493697d1dce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1393,7 +1393,7 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
 
-	pid_bit = current->pid % BITS_PER_LONG;
+	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
 	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
 		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f76d5ecaf345..46fd9b372e4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2929,7 +2929,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 		return true;
 
 	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(current->pid % BITS_PER_LONG, &pids);
+	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
 }
 
 #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
-- 
2.34.1



* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
                   ` (3 preceding siblings ...)
  2023-02-28  4:50 ` [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA Raghavendra K T
@ 2023-02-28 21:24 ` Andrew Morton
  2023-03-01  4:16   ` Raghavendra K T
  4 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2023-02-28 21:24 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T <raghavendra.kt@amd.com> wrote:

>  The patchset proposes one of the enhancements to numa vma scanning
> suggested by Mel. This is continuation of [3]. 
> 
> ...
> 
>  include/linux/mm.h       | 30 +++++++++++++++++++++
>  include/linux/mm_types.h |  9 +++++++
>  kernel/fork.c            |  2 ++
>  kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c              |  3 +++

It's unclear (to me) which tree would normally carry these.

But there are significant textual conflicts with the "Per-VMA locks"
patchset, and there might be functional issues as well.  So mm.git
would be the better choice.

Please can you redo and retest against tomorrow's mm-unstable branch
(git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
sched developers can take a look and provide feedback.

Thanks.


* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
@ 2023-03-01  4:16   ` Raghavendra K T
  2023-03-01 12:32     ` Raghavendra K T
  0 siblings, 1 reply; 8+ messages in thread
From: Raghavendra K T @ 2023-03-01  4:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 3/1/2023 2:54 AM, Andrew Morton wrote:
> On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>>   The patchset proposes one of the enhancements to numa vma scanning
>> suggested by Mel. This is continuation of [3].
>>
>> ...
>>
>>   include/linux/mm.h       | 30 +++++++++++++++++++++
>>   include/linux/mm_types.h |  9 +++++++
>>   kernel/fork.c            |  2 ++
>>   kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>>   mm/memory.c              |  3 +++
> 
> It's unclear (to me) which tree would normally carry these.
> 
> But there are significant textual conflicts with the "Per-VMA locks"
> patchset, and there might be functional issues as well.  So mm.git
> would be the better choice.
> 
> Please can you redo and retest against tomorrow's mm-unstable branch
> (git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
> sched developers can take a look and provide feedback.
> 

Thank you, Andrew. Sure, will do that.



* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-03-01  4:16   ` Raghavendra K T
@ 2023-03-01 12:32     ` Raghavendra K T
  0 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-03-01 12:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 3/1/2023 9:46 AM, Raghavendra K T wrote:
> On 3/1/2023 2:54 AM, Andrew Morton wrote:
>> On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T 
>> <raghavendra.kt@amd.com> wrote:
>>
>>>   The patchset proposes one of the enhancements to numa vma scanning
>>> suggested by Mel. This is continuation of [3].
>>>
>>> ...
>>>
>>>   include/linux/mm.h       | 30 +++++++++++++++++++++
>>>   include/linux/mm_types.h |  9 +++++++
>>>   kernel/fork.c            |  2 ++
>>>   kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>>>   mm/memory.c              |  3 +++
>>
>> It's unclear (to me) which tree would normally carry these.
>>
>> But there are significant textual conflicts with the "Per-VMA locks"
>> patchset, and there might be functional issues as well.  So mm.git
>> would be the better choice.
>>
>> Please can you redo and retest against tomorrow's mm-unstable branch
>> (git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
>> sched developers can take a look and provide feedback.
>>
> 
> Thank you Andrew. Sure will do that.
> 

Thanks again. Sent the rebased patches.

Just to record, so that new discussion can happen in the new posting:

https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t




