From: Raghavendra K T <raghavendra.kt@amd.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Ingo Molnar <mingo@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	rppt@kernel.org, Bharata B Rao <bharata@amd.com>,
	Disha Talreja <dishaa.talreja@amd.com>
Subject: Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
Date: Mon, 27 Feb 2023 12:10:41 +0530	[thread overview]
Message-ID: <c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com> (raw)
In-Reply-To: <ccba1a65-fe4f-89d5-a32b-2efba30a1350@amd.com>

On 2/7/2023 12:11 PM, Raghavendra K T wrote:
> On 2/4/2023 11:44 PM, Raghavendra K T wrote:
>> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
> [...]
>>
>>>> +        if (!vma_is_accessed(vma))
>>>> +            continue;
>>>> +
>>>>           do {
>>>>               start = max(start, vma->vm_start);
>>>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>>
>>>
>>> This feels wrong: specifically, we track numa_scan_offset per mm. Now,
>>> if we divide the threads into two disjoint groups, each only using
>>> their own set of vmas (in fact quite common for workloads with proper
>>> data partitioning), it is possible to consistently sample one set of
>>> threads and thus never scan the other set's vmas.
>>>
>>> It seems somewhat unlikely, but not impossible to create significant
>>> unfairness.
>>>
>>
>> Agreed, but that is the reason why we want to allow the first few
>> unconditional scans. Or am I missing something?
>>
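(To make the "first few unconditional scans" point above concrete, the
idea is roughly the below. This is a minimal sketch rather than the
exact patch code, and the numab/access_pid field naming is indicative
only:)

static bool vma_is_accessed(struct vm_area_struct *vma)
{
	/*
	 * Let every task scan every vma unconditionally for the first
	 * couple of full passes over the address space, so that all
	 * vmas get prot_none faults installed regardless of access
	 * pattern; this also bounds the disjoint-set unfairness
	 * discussed above.
	 */
	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
		return true;

	/*
	 * Afterwards, only scan vmas this task has recently accessed
	 * (access_pid stands in for the per-vma record updated at
	 * fault time).
	 */
	return vma->numab->access_pid == current->pid;
}
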
> 
> Thinking further, maybe we can summarize the different aspects of the
> threads / two-disjoint-sets case into:
> 
> 1) Unfairness because of the way in which threads get the opportunity
> to scan.
> 
> 2) The disjoint sets of vmas in the partition could be of different
> sizes.
> 
> 3) The disjoint sets of vmas could be associated with different
> numbers of threads.
> 
> Each of the above can potentially help some threads or make some
> threads do the heavy lifting,
> 
> but (2) and (3) are what I think we are trying to be okay with, by
> making sure tasks mostly do not scan others' vmas.
> 
> (1) could be a real issue (though I know there are many places where
> we have mitigated it by introducing an offset in p->numa_next_scan),
> but how the distribution looks in practice I take as a TODO: measure
> and post.


Hello PeterZ,
Sorry to come back a little late with data on how unfair we are when we
have the two disjoint sets of groups mentioned above:

Program I tested with:
1. A 128/256-thread program divided into two sets, each accessing a
different set of memory (set0, set1) allocated on the two nodes, to
induce migration (8GB total, divided into 4GB per set); see the sketch
below.
Total iterations per thread were around 2048, so that the
128-thread program ran for around 8 min, and the
256-thread program ran for around 17-18 min.
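
(For reference, the test is essentially the below. This is a
reconstructed sketch of the setup rather than the exact program I ran;
it assumes libnuma for node-local allocation and builds with
"gcc -O2 disjoint.c -lnuma -lpthread".)

#include <numa.h>
#include <pthread.h>
#include <string.h>

#define NR_THREADS	128			/* or 256 */
#define SET_SIZE	(4UL << 30)		/* 4GB per set, 8GB total */
#define ITERATIONS	2048

static char *set_mem[2];			/* set0 and set1 buffers */

static void *worker(void *arg)
{
	long id = (long)arg;
	/*
	 * Even ids form set 0, odd ids set 1; each thread works on its
	 * own slice, so the two sets touch fully disjoint vmas.
	 */
	size_t slice = SET_SIZE / (NR_THREADS / 2);
	char *mem = set_mem[id & 1] + (id / 2) * slice;
	long i;
	size_t j;

	for (i = 0; i < ITERATIONS; i++)
		for (j = 0; j < slice; j += 4096)
			mem[j]++;
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	long i;

	/*
	 * Allocate and fault in each set on a different node so that
	 * NUMA balancing has cross-node faults/migrations to do.
	 */
	set_mem[0] = numa_alloc_onnode(SET_SIZE, 0);
	set_mem[1] = numa_alloc_onnode(SET_SIZE, 1);
	if (!set_mem[0] || !set_mem[1])
		return 1;
	memset(set_mem[0], 0, SET_SIZE);
	memset(set_mem[1], 0, SET_SIZE);

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}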

SUT: 128-core / 256-CPU Milan with 2 NUMA nodes

Some of the observations:
1) The total number of threads that got a chance to scan over the entire
program span was around 50-62%.

E.g., in the 128-thread case, the total number of tasks that got a
chance to scan was around 74-84; for 256 threads the range was 123-146.

2) There was always a slight bias towards set 1 (the threads where the
remote faults occurred).

In summary: I do see that the scanning of VMAs from disjoint sets is not
fully fair, but on the other hand it is not very bad either. There is
definitely some scope to explore and improve fairness in this area
further.
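
(For anyone following along, the structure that creates the fairness
question is roughly the below, a simplified sketch of the 6.1-era
task_numa_work() scan loop with details elided. All tasks sharing the mm
resume from the same per-mm offset, and with this series each task
additionally skips vmas it does not access:)

	start = mm->numa_scan_offset;
	vma = find_vma(mm, start);
	if (!vma) {
		/* Wrapped around: restart the scan from offset 0. */
		start = 0;
		vma = find_vma(mm, start);
	}
	for (; vma; vma = find_vma(mm, vma->vm_end)) {
		if (!vma_migratable(vma) || !vma_policy_mof(vma))
			continue;

		/*
		 * New with this series: skip vmas this task does not
		 * access. The scan offset below is still shared by all
		 * tasks of the mm, so whichever set's tasks scan more
		 * often also decide where everyone resumes next; this
		 * is the unfairness measured above.
		 */
		if (!vma_is_accessed(vma))
			continue;

		do {
			start = max(start, vma->vm_start);
			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
			end = min(end, vma->vm_end);
			nr_pte_updates = change_prot_numa(vma, start, end);
			if (nr_pte_updates)
				pages -= (end - start) >> PAGE_SHIFT;
			start = end;
			if (pages <= 0)
				goto out;
		} while (end != vma->vm_end);
	}
out:
	mm->numa_scan_offset = start;	/* shared resume point */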

Posting the result for one of the 128-thread runs (base: 6.1.0 vanilla):

column 1: frequency (number of times the task got a chance to scan)
column 2: PID

set 0 threads
       1 5906
       1 5908
       2 5912
       1 5914
       1 5916
       1 5918
       1 5926
       2 5928
       3 5932
       3 5938
       1 5940
       1 5944
       2 5948
       2 5950
       1 5956
       1 5962
       1 5974
       1 5978
       1 5992
       1 6004
       1 6006
       1 6008
       1 6012
       3 6014
       1 6016
       2 6026
       2 6028
       1 6030

set 1 threads
       3 5907
       5 5909
       2 5911
       4 5913
       7 5915
       2 5917
       3 5919
       5 5921
       3 5923
       2 5925
       4 5927
       4 5929
       3 5931
       3 5933
       2 5935
       7 5937
       4 5939
       3 5941
       4 5943
       1 5945
       2 5947
       1 5949
       1 5951
       2 5953
       1 5955
       1 5957
       1 5959
       1 5961
       1 5963
       1 5965
       1 5967
       1 5969
       1 5971
       2 5975
       2 5979
       2 5981
       2 5983
       5 5993
       5 5995
      11 5997
       1 5999
       6 6003
       4 6005
       3 6007
       1 6013
       1 6015
       3 6017
       1 6019
       1 6021
       1 6023
       6 6027
       4 6029
       4 6031
       1 6033

PS: I have also run the above test with the V3 patch applied (which
incorporates your suggestions) and have not seen much deviation in the
observations with the patch.

Thanks
- Raghu

