All of lore.kernel.org
 help / color / mirror / Atom feed
From: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
To: Raghavendra K T <raghavendra.kt@amd.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	rppt@kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Bharata B Rao <bharata@amd.com>,
	Aithal Srikanth <sraithal@amd.com>,
	kernel test robot <oliver.sang@intel.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>
Subject: Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning
Date: Wed, 13 Sep 2023 10:58:23 +0530	[thread overview]
Message-ID: <54fa32a9-36a8-f296-9be8-d2b28770f82e@amd.com> (raw)
In-Reply-To: <cover.1693287931.git.raghavendra.kt@amd.com>

Hello Raghu,

On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") [1]
> VMA scanning is allowed if:
> 1) The task had accessed the VMA.
>   Rationale: Reduce overhead for the tasks that had not
> touched VMA. Also filter out unnecessary scanning.
> 
> 2) Early phase of the VMA scan where mm->numa_scan_seq is less than 2.
>   Rationale: Understanding initial characteristics of VMAs and also
>   prevent VMA scanning unfairness.
> 
> While that works for most of the times to reduce scanning overhead,
>   there are some corner cases associated with it.
> 
> This was found in an internal LKP run and also reported by [2]. There was
> an attempt to fix.
> 
> Link: https://lore.kernel.org/linux-mm/cover.1685506205.git.raghavendra.kt@amd.com/T/
> 
> This is a fully different series after Mel's feedback to address the issue
>   and also a continuation of enhancing VMA scanning for NUMA balancing.
> 
> Problem statement (Disjoint VMA set):
> ======================================
> Let's look at some of the corner cases with a below example of tasks and their
> access pattern.
> 
> Consider N tasks (threads) of a process.
> Set1 tasks accessing vma_x (group of VMAs)
> Set2 tasks accessing vma_y (group of VMAs)
> 
>               Set1                      Set2
>          -------------------         --------------------
>          | task_1..task_n/2 |       | task_n/2+1..task_n |
>          -------------------         --------------------	
>                   |                             |
>                   V                             V
>          -------------------         --------------------
>          |     vma_x       |         |     vma_y         |
>          -------------------         --------------------	
> 
> Corner cases:
> (a) Out of N tasks, not all of them gets fair opportunity to scan. (PeterZ).
> suppose Set1 tasks gets more opportunity to scan (May be because of the
> activity pattern of tasks or other reasons in current design) in the above
> example, then vma_x gets scanned more number of times than vma_y.
> 
> some experiment is also done here which illustrates this unfairness:
> Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/
> 
> (b) Sizes of vmas can differ.
> Suppose size of vma_y is far greater than the size of vma_x, then a bigger
> portion of vma_y can potentially be left unscanned since scanning is bounded
> by scan_size of 256MB (default) for each iteration.
> 
> (c) Highly active threads trap a few VMAs frequently, and some of the VMAs not
> accessed for long time can potentially get starved of scanning indefinitely
> (Mel). There is a possibility of lack of enough hints/details about VMAs if it
> is needed later for migration.
> 
> (d) Allocation of memory in some specific manner (Mel).
> One example could be, Suppose a main thread allocates memory and it is not
> active. When other threads tries to act upon it, they may not have much
> hints about it, if the corresponding VMA was not scanned.
> 
> (e) VMAs that are created after two full scans of mm (mm->numa_scan_seq > 2)
> will never get scanned. (Observed rarely but very much possible depending on
> workload behaviour).
> 
> Above this, a combination of some of the above (e.g., (a) and (b)) can
> potentially amplifyi/worsen the side effect.
> 
> This patchset, tries to address the above issues by enhancing unconditional
> VMA scanning logic.
> 
> High level ideas:
> =================
> Idea-1) Depending on vma_size, populate a per vma_scan_select value, decrement it
> and when it hits zero do force scan (Mel).
> vma_scan_select value is again repopulated when it hits zero.
> 
> This is how VMA scanning phases looks like after implementation:
> 
> |<---p1--->|<-----p2----->|<-----p2----->|...
> 
> Algorithm:
> p1: New VMA, initial phase do not scan till scan_delay.
> 
> p2: Allow scanning if the task has accessed VMA or vma_scan_select hit zero.
> 
> Reinitialize vma_scan_select and repeat p2.
> 
> pros/cons:
> +  : Ratelimiting is inbuilt to the approach
> +  : vma_size is taken into account for scanning
> +/-: Scanning continues forever
> -  : Changes in vma size is taken care after force scan. i.e.,
>     vma_scan_select is repopulated only after vma_scan_select hits zero.
> 
> Idea-1 can potentially cover all the issues mentioned above.
> 
> Idea-2) Take bitmask_weight of latest access_pids value (suggested by Bharata).
> If number of tasks accessing vma is >= 1, unconditionally allow scanning.
> 
> Idea-3 ) Take bitmask_weight of access_pid history of VMA. If number of tasks
> accessing VMA is > THRESHOLD (=3), unconditionally allow scanning.
> 
> Rationale (Idea-2,3): Do not miss out scanning of critical VMAs.
> 
> Idea-4) Have a per vma_scan_seq. allow the unconditional scan till vma_scan_seq
> reaches a value proportional (or equal) to vma_size/scan_size.
> This a complimentary to Idea-1.
> 
> this is how VMA scanning phases looks like after implementation:
> 
> |<--p1--->|<-----p2----->|<-----p3----->|<-----p4----->...||<-----p2----->|<-----p3----->|<-----p4-----> ...||
>                                                          RESET                                               RESET
> Algorithm:
> p1: New VMA, initial phase do not scan till scan_delay.
> 
> p2: Allow scanning if task has accessed VMA or vma_scan_seq has reached till
>   f(vma_size)/scan_size) for e.g., f = 1/2 * vma_size/scan_size.
> 
> p3: Allow scanning if task has accessed VMA or vma_scan_seq has reached till
>   f(vma_size)/scan_size in a rate limited manner. This is an optional phase.
> 
> p4: Allow scanning iff task has accessed VMA.
> 
> Reset after p4 (optional).
> 
> Repeat p2, p3 p4
> 
> Motivation: Allow agressive scanning in the beginning followed by a rate
> limited scanning. And then completely disallow scanning to avoid unnecessary
> scanning. Reset time could be a function of scan_delay and chosen long enough
> to aid long running task to forget history and start afresh.
> 
> +  : Ratelimiting need to be taken care separately if needed.
> +/-: Scanning continues only if RESET of vma_scan_seq is implemented.
> +  : changes in vma size is taken care in every scan.
> 
>   Current patch series implements Ideas 1, 2, 3 + extension of access PID history
> idea from PeterZ.
> 
> Results:
> ======
> Base: 6.5.0-rc6+ (4853c74bd7ab)
> SUT: Milan w/ 2 numa nodes 256 cpus
> 
> mmtest		numa01_THREAD_ALLOC manual run:
> 
> 		base		patched
> real		1m22.758s	1m9.200s
> user		249m49.540s	229m30.039s
> sys		0m25.040s	3m10.451s
> 	
> numa_pte_updates 	6985	1573363
> numa_hint_faults 	2705	1022623
> numa_hint_faults_local 	2279	389633
> numa_pages_migrated 	426	632990
> 
> kernbench
> 			base			patched
> Amean     user-256    21989.09 (   0.00%)    21677.36 *   1.42%*
> Amean     syst-256    10171.34 (   0.00%)    10818.28 *  -6.36%*
> Amean     elsp-256      166.81 (   0.00%)      168.40 *  -0.95%*
> 
> Duration User       65973.18    65038.00
> Duration System     30538.92    32478.59
> Duration Elapsed      529.52      533.09
> 
> Ops NUMA PTE updates                  976844.00      962680.00
> Ops NUMA hint faults                  226763.00      245620.00
> Ops NUMA pages migrated               220146.00      207025.00
> Ops AutoNUMA cost                       1144.84        1238.77
> 
> Improvements in other benchmarks I have tested.
> Time based:
> Hashjoin	4.21%
> Btree	 	2.04%
> XSbench		0.36%
> 
> Throughput based:
> Graph500 	-3.62%
> Nas.bt		3.69%
> Nas.ft		21.91%
> 
> Note: VMA scanning improvements [1] has refined scanning so much that
> system overhead we re-introduce with additional scan look glaringly
> high. But If we consider the difference between before [1] and current
> series, overall scanning overhead is considerably reduced.
> 
> 1. Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> 2. Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> 
> Note: Patch description is again repeated in some patches to avoid any
> need to copy from cover letter again.
> 
> Peter Zijlstra (1):
>    sched/numa: Increase tasks' access history
> 
> Raghavendra K T (5):
>    sched/numa: Move up the access pid reset logic
>    sched/numa: Add disjoint vma unconditional scan logic
>    sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>    sched/numa: Allow recently accessed VMAs to be scanned
>    sched/numa: Allow scanning of shared VMAs
> 
>   include/linux/mm.h       |  12 +++--
>   include/linux/mm_types.h |   5 +-
>   kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>   3 files changed, 102 insertions(+), 24 deletions(-)
> 

I have tested this series on 4th generation EPYC processor. I am seeing improvement in autonuma-benchmark with the series.

o System Details

- 4th Generation EPYC System
- 2 x 128C/256T
- NPS1 mode

o Kernels

base:   4853c74bd7ab Merge tag 'parisc-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux


==================================================================
Test          : autonuma-benchmark
Units         : Time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
commit:
   base (4853c74bd7ab)
   base + this_series

   base (4853c74bd7ab)         base + this_series
---------------- ---------------------------
          %stddev     %change         %stddev
              \          |                \
     522.58           -11.2%     464.23        autonuma-benchmark.numa01.seconds
     273.93            -1.2%     270.75        autonuma-benchmark.numa01_THREAD_ALLOC.seconds
       0.60            +0.0%       0.60        autonuma-benchmark.numa02.seconds
     102.68            +3.7%     106.50        autonuma-benchmark.numa02_SMT.seconds

Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

--
Thanks and Regards,
Swapnil

  parent reply	other threads:[~2023-09-13  5:28 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-29  6:06 [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 1/6] sched/numa: Move up the access pid reset logic Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 2/6] sched/numa: Add disjoint vma unconditional scan logic Raghavendra K T
2023-09-12  7:50   ` kernelt test robot
2023-09-13  6:21     ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 3/6] sched/numa: Remove unconditional scan logic using mm numa_scan_seq Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 4/6] sched/numa: Increase tasks' access history Raghavendra K T
2023-09-12 14:24   ` kernel test robot
2023-09-13  6:15     ` Raghavendra K T
2023-09-13  7:34       ` Oliver Sang
2023-08-29  6:06 ` [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned Raghavendra K T
2023-09-10 15:29   ` kernel test robot
2023-09-11 11:25     ` Raghavendra K T
2023-09-12  2:22       ` Oliver Sang
2023-09-12  6:43         ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 6/6] sched/numa: Allow scanning of shared VMAs Raghavendra K T
2023-09-13  5:28 ` Swapnil Sapkal [this message]
2023-09-13  6:24   ` [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning Raghavendra K T
2023-09-19  6:30 ` Raghavendra K T
2023-09-19  7:15   ` Ingo Molnar
2023-09-19  8:06     ` Raghavendra K T
2023-09-19  9:28 ` Peter Zijlstra
2023-09-19 16:22   ` Mel Gorman
2023-09-19 19:11     ` Peter Zijlstra
2023-09-20 10:42     ` Raghavendra K T

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54fa32a9-36a8-f296-9be8-d2b28770f82e@amd.com \
    --to=swapnil.sapkal@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=bharata@amd.com \
    --cc=david@redhat.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=oliver.sang@intel.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=rppt@kernel.org \
    --cc=sraithal@amd.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.