linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH 00/33] AutoNUMA27
       [not found] ` <20121004113943.be7f92a0.akpm@linux-foundation.org>
@ 2012-10-05 23:14   ` Andi Kleen
  2012-10-05 23:57     ` Tim Chen
  2012-10-08 20:34     ` Rik van Riel
  0 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2012-10-05 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu,  4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week. 

Did they pass review? I have some doubts.

The last time I looked it also broke numactl.

> Guys, what's the plan here?

Since they are both performance features, their ultimate benefit
is how much faster they make things (and how seldom they make things
slower).

IMHO this needs a performance shoot-out. Run both on the same 10 workloads
and see which wins. Just a lot of work. Any volunteers?

For a change like this I think less regression is actually more
important than the highest peak numbers.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:14   ` [PATCH 00/33] AutoNUMA27 Andi Kleen
@ 2012-10-05 23:57     ` Tim Chen
  2012-10-06  0:11       ` Andi Kleen
  2012-10-08 20:34     ` Rik van Riel
  1 sibling, 1 reply; 34+ messages in thread
From: Tim Chen @ 2012-10-05 23:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Rik van Riel, Johannes Weiner, Hillf Danton,
	Andrew Jones, Dan Smith, Thomas Gleixner, Paul Turner,
	Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira, Konrad

On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Thu,  4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week. 
> 
> Did they pass review? I have some doubts.
> 
> The last time I looked it also broke numactl.
> 
> > Guys, what's the plan here?
> 
> Since they are both performance features, their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower).
> 
> IMHO this needs a performance shoot-out. Run both on the same 10 workloads
> and see which wins. Just a lot of work. Any volunteers?
> 
> For a change like this I think less regression is actually more
> important than the highest peak numbers.
> 
> -Andi
> 

I remember that 3 months ago, when Alex tested the numa/sched patches,
there was a 20% regression on SPECjbb2005 due to the NUMA balancer.
Those issues may have been fixed, but we probably need to run this
benchmark against the latest code.  For most of the other kernel
performance workloads we ran, we didn't see much change.

Mauricio has a different config for this benchmark, and it would be nice
if he could also check whether there are any performance changes on his
side.

Tim




* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:57     ` Tim Chen
@ 2012-10-06  0:11       ` Andi Kleen
  2012-10-08 13:44         ` Don Morris
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2012-10-06  0:11 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Rik van Riel, Johannes Weiner, Hillf Danton,
	Andrew Jones, Dan Smith, Thomas Gleixner, Paul Turner,
	Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex, Sh

Tim Chen <tim.c.chen@linux.intel.com> writes:
>> 
>
> I remember that 3 months ago, when Alex tested the numa/sched patches,
> there was a 20% regression on SPECjbb2005 due to the NUMA balancer.

20% on anything sounds like a show stopper to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 00/33] AutoNUMA27
  2012-10-06  0:11       ` Andi Kleen
@ 2012-10-08 13:44         ` Don Morris
  0 siblings, 0 replies; 34+ messages in thread
From: Don Morris @ 2012-10-08 13:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tim Chen, Andrew Morton, Andrea Arcangeli, linux-kernel,
	linux-mm, Linus Torvalds, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex, Sh

On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>>>
>>
>> I remember that 3 months ago, when Alex tested the numa/sched patches,
>> there was a 20% regression on SPECjbb2005 due to the NUMA balancer.
> 
> 20% on anything sounds like a show stopper to me.
> 
> -Andi
> 

Much worse than that on an 8-way machine for a multi-node multi-threaded
process, from what I can tell. (Andrea's AutoNUMA microbenchmark is a
simple version of that). The contention on the page table lock
( &(&mm->page_table_lock)->rlock ) goes through the roof, with threads
constantly fighting to invalidate translations and re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw.

Running linux-next with no tweaks other than
kernel.sched_migration_cost_ns = 500000 gives:
numa01
8325.78
numa01_HARD_BIND
488.98

(Hard Bind is the case where the threads are pre-bound to the node
set holding their memory, so it should be a fairly "best case" for
comparison.)

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):
numa01
4272.93
numa01_HARD_BIND
498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:
numa01
429.07
numa01_HARD_BIND
466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:
numa01
1075.31
numa01_HARD_BIND
484.20

I didn't write down the lock contention numbers for comparison, but yes -
the contention does decrease in line with the runtimes.

There are other microbenchmarks, but those suffice to show the
regression pattern. I mentioned this to the Red Hat folks last
week, so I expect this is already being worked on. It seemed pertinent
to bring up given the discussion about the current state of linux-next
though, just so folks know. From where I'm sitting, it looks to
me like the scan period is way too aggressive and there's too much
work potentially attempted during a "scan" (by which I mean the
hard tick driven choice to invalidate in order to set up potential
migration faults). The current code walks/invalidates the entire
virtual address space, skipping few vmas. For a very large 64-bit
process, that's going to be a *lot* of translations (or even vmas
if the address space is fragmented) to walk. That's a seriously
long path coming from the timer code. I would think capping the
number of translations to process per visit would help.
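
Something along the lines of the sketch below is what I have in mind
(purely illustrative, not code from either tree; mark_one_numa_pte()
and the 256-page cap are made up):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Sketch of "cap the work per visit"; not code from either patch set.
 * mark_one_numa_pte() is a hypothetical helper that arms a NUMA
 * hinting fault on one pte/pmd and returns the next address to scan.
 */
#define SCAN_PAGES_PER_VISIT    256     /* arbitrary cap */

static unsigned long scan_some(struct mm_struct *mm, unsigned long start)
{
        unsigned long addr = start;
        int done = 0;

        down_read(&mm->mmap_sem);
        while (done < SCAN_PAGES_PER_VISIT && addr < TASK_SIZE) {
                addr = mark_one_numa_pte(mm, addr);     /* hypothetical */
                done++;
        }
        up_read(&mm->mmap_sem);

        return addr;    /* resume point for the next visit */
}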

Hope this helps the discussion,
Don Morris



* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:14   ` [PATCH 00/33] AutoNUMA27 Andi Kleen
  2012-10-05 23:57     ` Tim Chen
@ 2012-10-08 20:34     ` Rik van Riel
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2012-10-08 20:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad, dshaks

On Fri, 05 Oct 2012 16:14:44 -0700
Andi Kleen <andi@firstfloor.org> wrote:

> IMHO this needs a performance shoot-out. Run both on the same 10 workloads
> and see which wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a
4-node, 32 CPU core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and
autonuma are within the margin of error compared to manual
tuning of task affinity.  This is a big win, since the current
upstream scheduler has regressions of 10-20% when the system
runs 4 through 16 streams processes.

For specjbb, the story is more complicated. After Larry, Peter and I
fixed the obvious bugs in sched/numa and added some basic
cpu-follows-memory code (not yet in -tip AFAIK), averaged results
look like this:

baseline: 	246019
manual pinning: 285481 (+16%)
autonuma:	266626 (+8%)
sched/numa:	226540 (-8%)

This is with newer sched/numa code than what is in -tip right now.
Once Peter pushes the fixes by Larry and me into -tip, as well as
his cpu-follows-memory code, others should be able to run tests
like this as well.

Now for some other workloads, and tests on 8 node systems, etc...


Full results for the specjbb run below:

BASELINE - disabling auto numa (matches RHEL6 within 1%)

[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt:           throughput =     243639.70 SPECjbb2005 bops
spec2.txt:           throughput =     249186.20 SPECjbb2005 bops
spec3.txt:           throughput =     247216.72 SPECjbb2005 bops
spec4.txt:           throughput =     244035.60 SPECjbb2005 bops

Manual NUMACTL results are:

[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt:           throughput =     291430.22 SPECjbb2005 bops
spec2.txt:           throughput =     283550.85 SPECjbb2005 bops
spec3.txt:           throughput =     284028.71 SPECjbb2005 bops
spec4.txt:           throughput =     282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64
[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt:           throughput =     261835.01 SPECjbb2005 bops
spec2.txt:           throughput =     269053.06 SPECjbb2005 bops
spec3.txt:           throughput =     261230.50 SPECjbb2005 bops
spec4.txt:           throughput =     274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and 
Larry:

[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt:           throughput =     222349.74 SPECjbb2005 bops
spec2.txt:           throughput =     232988.59 SPECjbb2005 bops
spec3.txt:           throughput =     223386.03 SPECjbb2005 bops
spec4.txt:           throughput =     227438.11 SPECjbb2005 bops

-- 
All rights reversed.


* Re: [PATCH 00/33] AutoNUMA27
       [not found] ` <20121011101930.GM3317@csn.ul.ie>
@ 2012-10-11 14:56   ` Andrea Arcangeli
  2012-10-11 15:35     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 14:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi Mel,

On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> As a basic sniff test I added a test to MMtests for the AutoNUMA
> Benchmark on a 4-node machine and the following fell out.
> 
>                                      3.6.0                 3.6.0
>                                    vanilla        autonuma-v33r6
> User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)

Thanks a lot for the help and for looking into it!

Just curious, why are you running only numa02_SMT and
numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
without _suffix)

> 
> The performance improvements are certainly there for this basic test but
> I note the System CPU usage is very high.

Yes, migration is expensive, but after convergence has been reached the
system time should be the same as upstream.

btw, I improved things further in autonuma28 (new branch in aa.git).

> 
> The vmstats showed up this
> 
> THP fault alloc               81376       86070
> THP collapse alloc               14       40423
> THP splits                        8       41792
> 
> So we're doing a lot of splits and collapses for THP there. There is a
> possibility that khugepaged and the autonuma kernel thread are doing some
> busy work. Not a show-stopper, just interesting.
> 
> I've done no analysis at all and this was just to have something to look
> at before looking at the code closer.

Sure, the idea is to have THP native migration, then we'll do zero
collapse/splits.

> > The objective of AutoNUMA is to provide out-of-the-box performance as
> > close as possible to (and potentially faster than) manual NUMA hard
> > bindings.
> > 
> > It is not very intrusive into the kernel core and is well structured
> > into separate source modules.
> > 
> > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > NUMA placement algorithms such as numad (in userland through cpusets)
> > and schednuma (in kernel too) and was found superior in all cases.
> > 
> > Most important: not a single benchmark showed a regression yet when
> > compared to vanilla kernels. Not even on the 2 node systems where the
> > NUMA effects are less significant.
> > 
> 
> Ok, I have not run a general regression test and won't get the chance to
> soon but hopefully others will. One thing they might want to watch out
> for is System CPU time. It's possible that your AutoNUMA benchmark
> triggers a worst-case but it's worth keeping an eye on because any cost
> from that has to be offset by gains from better NUMA placements.

Good idea to monitor it indeed.

> Is STREAM really a good benchmark in this case? Unless you also ran it in
> parallel mode, it basically operates against three arrays and is not really
> NUMA friendly once the total size is greater than a NUMA node. I guess
> it makes sense to run it just to see whether autonuma breaks it :)

The way this is run is that there is 1 stream instance, then 4, then 8,
until we max out all CPUs.

I think we could run "memhog" instead of "stream" and it'd be the
same. stream probably better resembles real life computations.

The upstream scheduler lacks any notion of node affinity, so eventually
during the 5 min run one process changes node; it doesn't notice that its
memory is now elsewhere, so it stays there, and the memory can't follow
the cpu either. So then it runs much slower.

So it's the simplest test of all to get right; all it requires is some
notion of node affinity.

It's also the only workload that the home node design in schednuma in
tip.git can get right (schednuma after the current tip.git has adopted
the cpu-follows-memory design of AutoNUMA, so schednuma will have a
chance to get more right than just the stream multi-instance benchmark).

So it's just a verification that the simple stuff (single-threaded
process computing) is OK and that the upstream regression vs hard NUMA
bindings is fixed.

stream is also one case where we have to perform identical to the hard
NUMA bindings. No migration of CPU or memory must ever happen with
AutoNUMA in the stream benchmark. AutoNUMA will just monitor it and
find that it is already in the best place and it will leave it alone.

With the autonuma-benchmark it's impossible to reach identical
performance of the _HARD_BIND case because _HARD_BIND doesn't need to
do any memory migration (I'm 3 seconds away from hard bindings in a
198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).

> 
> > 
> > == iozone ==
> > 
> >                      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
> > FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
> > ====--------------------------------------------------------------------------------------------------------------
> > noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
> > autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825
> > 
> > AutoNUMA can't help much for I/O loads but you can see it seems a
> > small improvement there too. The important thing for I/O loads, is to
> > verify that there is no regression.
> > 
> 
> It probably is unreasonable to expect autonuma to handle the case where
> a file-based workload has not been tuned for NUMA. In too many cases
> it's going to be read/write based so you're not going to get the
> statistics you need.

Agreed. Some statistics may still accumulate, and it's still better than
nothing, but unless the workload is CPU and memory bound we can't
expect to see any difference.

This is meant as a verification that we're not introducing regression
to I/O bound load.

Andrea


* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 14:56   ` Andrea Arcangeli
@ 2012-10-11 15:35     ` Mel Gorman
  2012-10-12  0:41       ` Andrea Arcangeli
  2012-10-12 14:54       ` Mel Gorman
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 15:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > Benchmark on a 4-node machine and the following fell out.
> > 
> >                                      3.6.0                 3.6.0
> >                                    vanilla        autonuma-v33r6
> > User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> > User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> > System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> > System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> > Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> > Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> > CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> > CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)
> 
> Thanks a lot for the help and for looking into it!
> 
> Just curious, why are you running only numa02_SMT and
> numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> without _suffix)
> 

Bug in the testing script on my end. Each of them is run separately, and
in retrospect it looks like the THREAD_ALLOC test actually ran numa01 and
then numa01_THREAD_ALLOC. The intention was to allow additional stats to be
gathered independently of what start_bench.sh collects. Will improve it
in the future.

> > 
> > The performance improvements are certainly there for this basic test but
> > I note the System CPU usage is very high.
> 
> Yes, migration is expensive, but after convergence has been reached the
> system time should be the same as upstream.
> 

Ok.

> btw, I improved things further in autonuma28 (new branch in aa.git).
> 

Ok.

> > 
> > The vmstats showed up this
> > 
> > THP fault alloc               81376       86070
> > THP collapse alloc               14       40423
> > THP splits                        8       41792
> > 
> > So we're doing a lot of splits and collapses for THP there. There is a
> > possibility that khugepaged and the autonuma kernel thread are doing some
> > busy work. Not a show-stopper, just interesting.
> > 
> > I've done no analysis at all and this was just to have something to look
> > at before looking at the code closer.
> 
> Sure, the idea is to have THP native migration, then we'll do zero
> collapse/splits.
> 

Seems reasonable. It should be straightforward to measure when/if that happens.

> > > The objective of AutoNUMA is to provide out-of-the-box performance as
> > > close as possible to (and potentially faster than) manual NUMA hard
> > > bindings.
> > > 
> > > It is not very intrusive into the kernel core and is well structured
> > > into separate source modules.
> > > 
> > > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > > NUMA placement algorithms such as numad (in userland through cpusets)
> > > and schednuma (in kernel too) and was found superior in all cases.
> > > 
> > > Most important: not a single benchmark showed a regression yet when
> > > compared to vanilla kernels. Not even on the 2 node systems where the
> > > NUMA effects are less significant.
> > > 
> > 
> > Ok, I have not run a general regression test and won't get the chance to
> > soon but hopefully others will. One thing they might want to watch out
> > for is System CPU time. It's possible that your AutoNUMA benchmark
> > triggers a worst-case but it's worth keeping an eye on because any cost
> > from that has to be offset by gains from better NUMA placements.
> 
> Good idea to monitor it indeed.
> 

If System CPU time really does go down as this converges then that
should be obvious from monitoring vmstat over time for a test: high
usage early on, dropping as it converges. If that doesn't happen then
the tasks are not converging, the phases change constantly, or
something unexpected happened that needs to be identified.

> > Is STREAM really a good benchmark in this case? Unless you also ran it in
> > parallel mode, it basically operates against three arrays and is not really
> > NUMA friendly once the total size is greater than a NUMA node. I guess
> > it makes sense to run it just to see whether autonuma breaks it :)
> 
> The way this is run is that there is 1 stream instance, then 4, then 8,
> until we max out all CPUs.
> 

Ok. Are they separate STREAM instances or threads running on the same
arrays? 

> I think we could run "memhog" instead of "stream" and it'd be the
> same. stream probably better resembles real life computations.
> 
> The upstream scheduler lacks any notion of node affinity, so eventually
> during the 5 min run one process changes node; it doesn't notice that its
> memory is now elsewhere, so it stays there, and the memory can't follow
> the cpu either. So then it runs much slower.
> 
> So it's the simplest test of all to get right; all it requires is some
> notion of node affinity.
> 

Ok.

> It's also the only workload that the home node design in schednuma in
> tip.git can get right (schednuma after the current tip.git has adopted
> the cpu-follows-memory design of AutoNUMA, so schednuma will have a
> chance to get more right than just the stream multi-instance benchmark).
> 
> So it's just a verification that the simple stuff (single-threaded
> process computing) is OK and that the upstream regression vs hard NUMA
> bindings is fixed.
> 

Verification of the simple stuff makes sense.

> stream is also one case where we have to perform identical to the hard
> NUMA bindings. No migration of CPU or memory must ever happen with
> AutoNUMA in the stream benchmark. AutoNUMA will just monitor it and
> find that it is already in the best place and it will leave it alone.
> 
> With the autonuma-benchmark it's impossible to reach identical
> performance of the _HARD_BIND case because _HARD_BIND doesn't need to
> do any memory migration (I'm 3 seconds away from hard bindings in a
> 198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).
> 
> > 
> > > 
> > > == iozone ==
> > > 
> > >                      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
> > > FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
> > > ====--------------------------------------------------------------------------------------------------------------
> > > noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
> > > autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825
> > > 
> > > AutoNUMA can't help much for I/O loads but you can see it seems a
> > > small improvement there too. The important thing for I/O loads, is to
> > > verify that there is no regression.
> > > 
> > 
> > It probably is unreasonable to expect autonuma to handle the case where
> > a file-based workload has not been tuned for NUMA. In too many cases
> > it's going to be read/write based so you're not going to get the
> > statistics you need.
> 
> Agreed. Some statistics may still accumulate, and it's still better than
> nothing, but unless the workload is CPU and memory bound we can't
> expect to see any difference.
> 
> This is meant as a verification that we're not introducing regression
> to I/O bound load.
> 

Ok, that's more or less what I had guessed but nice to know for sure.
Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
       [not found]   ` <20121011105036.GN3317@csn.ul.ie>
@ 2012-10-11 16:07     ` Andrea Arcangeli
  2012-10-11 19:37       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi,

On Thu, Oct 11, 2012 at 11:50:36AM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> > +The AutoNUMA logic is a chain reaction resulting from the actions of
> > +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically
> 
> s/knum_scand/knuma_scand/

Applied.

> > +scans the mm structures of all active processes. It gathers the
> > +AutoNUMA mm statistics for each "anon" page in the process's working
> 
> Ok, so this will not make a different to file-based workloads but as I
> mentioned in the leader this would be a difficult proposition anyway
> because if it's read/write based, you'll have no statistics.

Oops, sorry for the confusion, but the doc is wrong on this one: it
actually tracks anything with a page_mapcount == 1, even if that is
pagecache or even .text, as long as it's only mapped in a single
process. So if you have a threaded database doing a gigantic MAP_SHARED,
it'll track and move around the whole MAP_SHARED as well as anonymous
memory or anything else that can be moved.

Changed to:

+AutoNUMA mm statistics for each not shared page in the process's
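
In other words the rule is simply this (a sketch with an invented name,
not the function in the patch):

#include <linux/mm.h>

/*
 * Minimal sketch of the eligibility rule (invented name, not the code
 * in the patch): a page is tracked and migratable when it is mapped by
 * exactly one process, whether it's anonymous memory, pagecache or .text.
 */
static bool page_is_autonuma_eligible(struct page *page)
{
        return page_mapcount(page) == 1;
}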

> > +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> > +present bit in each pte or pmd that was counted. This triggers NUMA
> > +hinting page faults described next.
> > +
> > +The mm statistics are expentially decayed by dividing the total memory
> > +in half and adding the new totals to the decayed values for each
> > +knuma_scand pass. This causes the mm statistics to resemble a simple
> > +forecasting model, taking into account some past working set data.
> > +
> > +=== NUMA hinting fault ===
> > +
> > +A NUMA hinting fault occurs when a task running on a CPU thread
> > +accesses a vma whose pte or pmd is not present and the NUMA bit is
> > +set. The NUMA hinting page fault handler returns the pte or pmd back
> > +to its present state and counts the fault's occurance in the
> > +task_autonuma structure.
> > +
> 
> So, minimally one source of System CPU overhead will be increased traps.

Correct.

It takes down 128M every 100msec, and when it has finished taking
down everything it sleeps 10 sec, then increases the pass counter and
restarts. It's not measurable: even if I do a kernel build with -j128
in tmpfs, the performance is identical whether autonuma is running or not.
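
Schematically the pacing is just (a sketch only, with invented
placeholder helpers, not the real knuma_scand):

#include <linux/delay.h>
#include <linux/kthread.h>

/*
 * Sketch of the pacing only, not the real knuma_scand:
 * whole_address_space_scanned() and scan_and_arm_numa_faults() are
 * invented placeholders.
 */
static int knuma_scand_sketch(void *dummy)
{
        unsigned long pass_counter = 0;

        while (!kthread_should_stop()) {
                while (!whole_address_space_scanned()) {
                        /* arm NUMA hinting faults on the next 128M */
                        scan_and_arm_numa_faults(128UL << 20);
                        msleep(100);            /* 100 msec between chunks */
                }
                pass_counter++;                 /* one full pass done */
                msleep(10 * 1000);              /* 10 sec between passes */
        }
        return 0;
}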

> I haven't seen the code yet obviously but I wonder if this gets accounted
> for as a minor fault? If it does, how can we distinguish between minor
> faults and numa hinting faults? If not, is it possible to get any idea of
> how many numa hinting faults were incurred? Mention it here.

Yes, it's surely accounted as a minor fault. To monitor it I normally
use:

perf probe numa_hinting_fault
perf record -e probe:numa_hinting_fault -aR -g sleep 10
perf report -g

# Samples: 345  of event 'probe:numa_hinting_fault'
# Event count (approx.): 345
#
# Overhead  Command      Shared Object                  Symbol
# ........  .......  .................  ......................
#
    64.64%     perf  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--57.40%-- sig_handler
                  |          |          
                  |          |--62.50%-- run_builtin
                  |          |          main
                  |          |          __libc_start_main
                  |          |          
                  |           --37.50%-- 0x7f47f7c6cba0
                  |                     run_builtin
                  |                     main
                  |                     __libc_start_main
                  |          
                  |--16.59%-- __poll
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                  |--9.87%-- 0x7f47f7c6cba0
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                  |--9.42%-- save_i387_xstate
                  |          do_signal
                  |          do_notify_resume
                  |          int_signal
                  |          __poll
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                   --6.73%-- sys_poll
                             system_call_fastpath
                             __poll

    21.45%     ntpd  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--66.22%-- 0x42b910
                  |          0x0
                  |          
                  |--24.32%-- __select
                  |          0x0
                  |          
                  |--4.05%-- do_signal
                  |          do_notify_resume
                  |          int_signal
                  |          __select
                  |          0x0
                  |          
                  |--2.70%-- 0x7f88827b3ba0
                  |          0x0
                  |          
                   --2.70%-- clock_gettime
                             0x1a1eb808

     7.83%     init  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--33.33%-- __select
                  |          0x0
                  |          
                  |--29.63%-- 0x404e0c
                  |          0x0
                  |          
                  |--18.52%-- 0x405820
                  |          
                  |--11.11%-- sys_select
                  |          system_call_fastpath
                  |          __select
                  |          0x0
                  |          
                   --7.41%-- 0x402528

     6.09%    sleep  [kernel.kallsyms]  [k] numa_hinting_fault
              |
              --- numa_hinting_fault
                  handle_mm_fault
                  do_page_fault
                  page_fault
                 |          
                 |--42.86%-- 0x7f0f67847fe0
                 |          0x7fff4cd6d42b
                 |          
                 |--28.57%-- 0x404007
                 |          
                 |--19.05%-- nanosleep
                 |          
                  --9.52%-- 0x4016d0
                            0x7fff4cd6d42b


Chances are we want to add more vmstat counters for this event.

> > +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> > +
> > +- Increments the total number of pages faulted for this task
> > +
> > +- Increments the number of pages faulted on the current NUMA node
> > +
> 
> So, am I correct in assuming that the rate of NUMA hinting faults will be
> related to the scan rate of knuma_scand?

This is correct. They're identical.

There's a slight chance that two threads hit the fault on the same
pte/pmd_numa concurrently, but just one of the two will actually
invoke the numa_hinting_fault() function.

> > +- If the fault was for an hugepage, the number of subpages represented
> > +  by an hugepage is added to the task statistics above
> > +
> > +- Each time the NUMA hinting page fault discoveres that another
> 
> s/discoveres/discovers/

Fixed.

> 
> > +  knuma_scand pass has occurred, it divides the total number of pages
> > +  and the pages for each NUMA node in half. This causes the task
> > +  statistics to be exponentially decayed, just as the mm statistics
> > +  are. Thus, the task statistics also resemble a simple forcasting

Also noticed forecasting ;).

> > +  model, taking into account some past NUMA hinting fault data.
> > +
> > +If the page being accessed is on the current NUMA node (same as the
> > +task), the NUMA hinting fault handler only records the nid of the
> > +current NUMA node in the page_autonuma structure field last_nid and
> > +then it'd done.
> > +
> > +Othewise, it checks if the nid of the current NUMA node matches the
> > +last_nid in the page_autonuma structure. If it matches it means it's
> > +the second NUMA hinting fault for the page occurring (on a subsequent
> > +pass of the knuma_scand daemon) from the current NUMA node.
> 
> You don't spell it out, but this is effectively a migration threshold N
> where N is the number of remote NUMA hinting faults that must be
> incurred before migration happens. The default value of this threshold
> is 2.
> 
> Is that accurate? If so, why 2?

More like 1. It needs one confirmation that the migration request comes
from the same node again (note: it is allowed to come from a different
thread as long as it's the same node, and that is very important).

Why only 1 confirmation? It's the same as page aging. We could record
the number of pagecache lookup hits instead of having just a single bit
as a reference count. But then, if the workload radically changes, it
takes too much time to adapt to the new configuration, so I usually
don't like counting.

Plus I avoided fixed numbers as much as possible. I can explain why 0
or 1, but I can't as easily explain why 5 or 8, so if I can't explain
it, I avoid it.
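
In pseudo-code the rule is roughly (sketch only, invented names,
assuming the last_nid field described in the doc):

#include <linux/types.h>

/*
 * Sketch of the last_nid logic (invented names; the real code lives in
 * the NUMA hinting fault handler): migrate only on the second fault
 * coming from the same remote node, possibly from a different thread.
 */
struct page_autonuma_sketch {           /* stand-in for page_autonuma */
        int last_nid;
};

static bool numa_migrate_confirmed(struct page_autonuma_sketch *pa,
                                   int this_nid, int page_nid)
{
        if (this_nid == page_nid)
                return false;                   /* already local */

        if (pa->last_nid != this_nid) {
                pa->last_nid = this_nid;        /* first fault from this node */
                return false;
        }

        return true;            /* confirmation received: migrate */
}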

> I don't have a better suggestion, it's just an obvious source of an
> adverse workload that could force a lot of migrations by faulting once
> per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.

Correct; for certain workloads like single-instance specjbb that wasn't
enough, but it is fixed in autonuma28: now it's faster even on a
single instance.

> I'm assuming it must be async migration then. IO in progress would be
> a bit of a surprise though! It would have to be a mapped anonymous page
> being written to swap.

It's all migrate-on-fault now, but I'm using all the methods you
implemented for compaction to avoid blocking in migrate_pages.

> > +=== Task exchange ===
> > +
> > +The following defines "weight" in the AutoNUMA balance routine's
> > +algorithm.
> > +
> > +If the tasks are threads of the same process:
> > +
> > +    weight = task weight for the NUMA node (since memory weights are
> > +             the same)
> > +
> > +If the tasks are not threads of the same process:
> > +
> > +    weight = memory weight for the NUMA node (prefer to move the task
> > +             to the memory)
> > +
> > +The following algorithm determines if the current task will be
> > +exchanged with a running task on a remote NUMA node:
> > +
> > +    this_diff: Weight of the current task on the remote NUMA node
> > +               minus its weight on the current NUMA node (only used if
> > +               a positive value). How much does the current task
> > +               prefer to run on the remote NUMA node.
> > +
> > +    other_diff: Weight of the current task on the remote NUMA node
> > +                minus the weight of the other task on the same remote
> > +                NUMA node (only used if a positive value). How much
> > +                does the current task prefer to run on the remote NUMA
> > +                node compared to the other task.
> > +
> > +    total_weight_diff = this_diff + other_diff
> > +
> > +    total_weight_diff: How favorable it is to exchange the two tasks.
> > +                       The pair of tasks with the highest
> > +                       total_weight_diff (if any) are selected for
> > +                       exchange.
> > +
> > +As mentioned above, if the two tasks are threads of the same process,
> > +the AutoNUMA balance routine uses the task_autonuma statistics. By
> > +using the task_autonuma statistics, each thread follows its own memory
> > +locality and they will not necessarily converge on the same node. This
> > +is often very desirable for processes with more threads than CPUs on
> > +each NUMA node.
> > +
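
(Just to restate the criterion quoted above in pseudo-code; a sketch
with invented names, not the actual code in the patch:)

/*
 * Restatement only, not the scheduler code.  w_this[]/w_other[] stand
 * for whatever weight applies: task statistics for threads of the same
 * process, mm statistics otherwise.
 */
static long total_weight_diff(const long *w_this, const long *w_other,
                              int this_nid, int remote_nid)
{
        long this_diff = w_this[remote_nid] - w_this[this_nid];
        long other_diff = w_this[remote_nid] - w_other[remote_nid];

        if (this_diff < 0)
                this_diff = 0;          /* only used if positive */
        if (other_diff < 0)
                other_diff = 0;         /* only used if positive */

        /* the pair with the highest value is selected for the exchange */
        return this_diff + other_diff;
}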
> 
> What about the case where two threads on different CPUs are accessing

I assume you mean on different nodes (if they're just on different CPUs
within the same node, the above won't kick in).

> separate structures that are not page-aligned (base or huge page but huge
> page would be obviously worse). Does this cause a ping-pong effect or
> otherwise mess up the statistics?

Very good point! This is exactly what I call NUMA false sharing and
it's the biggest nightmare in this whole effort.

So if there's a huge amount of this over time, the statistics will be
around 50/50 (the statistics just record the working set of the
thread).

So if there's another process (note: a process, not a thread) heavily
computing, the 50/50 won't be used and the mm statistics will be used
instead to balance the two threads against the other process. The two
threads will then converge on the same node, and their thread statistics
will change from 50/50 to 0/100, matching the mm statistics.

If there are just threads and they're all doing what you describe
above with all their memory, well then the problem has no solution,
and the new stuff in autonuma28 will deal with that too.

Ideally we should do MADV_INTERLEAVE, I didn't get that far yet but I
probably could now.

Even without the new stuff it wasn't too bad but there were a bit too
many spurious migrations in that load with autonuma27 and previous. It
was less spurious on bigger systems with many nodes because last_nid
is implicitly more accurate there (as last_nid will have more possible
values than 0|1). With autonuma28 even on 2 nodes it's perfectly fine.

If it's just 1 page false sharing and all the rest is thread-local,
the statistics will be 99/1 and the false sharing will be lost in the
noise.

The false sharing spillover caused by alignments is minor if the
threads are really computing on a lot of local memory so it's not a
concern and it will be optimized away by the last_nid plus the new
stuff.

> Ok, very obviously this will never be an RT feature but that is hardly
> a surprise and anyone who tries to enable this for RT needs their head
> examined. I'm not suggesting you do it but people running detailed
> performance analysis on scheduler-intensive workloads might want to keep
> an eye on their latency and jitter figures and how they are affected by
> this exchanging. Does ftrace show a noticable increase in wakeup latencies
> for example?

If you do:

echo 1 >/sys/kernel/mm/autonuma/debug

you will get 1 printk every single time sched_autonuma_balance
triggers a task exchange.

With autonuma28 I resolved a lot of the jittering and now there are
6/7 printk for the whole 198 seconds of numa01. CFS runs in autopilot
all the time.

With specjbb x2 overcommit, the active balancing events are reduced to
one every few sec (vs several per sec with autonuma27). In fact the
specjbb x2 overcommit load jumped ahead too with autonuma28.

About tracing events, the git branch already has tracing events to
monitor all page and task migrations, shown in an awesome "perf script
numatop" from Andrew. We likely need one more tracing event to see the
task exchanges generated specifically by autonuma balancing (we're
running short on event columns to show it in numatop though ;). Right
now that is only available as the printk above.

> > +=== task_autonuma - per task AutoNUMA data ===
> > +
> > +The task_autonuma structure is used to hold AutoNUMA data required for
> > +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> > +nodes.
> > +
> > +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> > +                scheduler balancing code, -1 if none (2 bytes)
> > +
> > +- Task NUMA statistics for this thread/process:
> > +
> > +    Total number of NUMA hinting page faults in this pass of
> > +    knuma_scand (8 bytes)
> > +
> > +    Per NUMA node number of NUMA hinting page faults in this pass of
> > +    knuma_scand (8 bytes * # of NUMA nodes)
> > +
> 
> It might be possible to put a coarse ping-pong detection counter in here
> as well by recording a decaying average of the number of pages migrated
> over a number of knuma_scand passes instead of just the last one.  If the
> value is too high, you're ping-ponging and the process should be ignored,
> possibly forever. It's not a requirement and it would be more memory
> overhead obviously but I'm throwing it out there as a suggestion if it
> ever turns out the ping-pong problem is real.

Yes, this is a problem where we have an enormous degree of freedom in
trying things, so your suggestions are very much appreciated :).

As for CPU ping-ponging, I have never seen it yet (even if it's 550/450,
it rarely switches over to 450/550, and even if it does, it doesn't
really change anything because it's a fairly rare event and neither node
is more right than the other anyway).
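
If it ever shows up, something as simple as the following would probably
be enough (purely an illustration of your suggestion, nothing like this
exists in the tree; name and threshold invented):

#include <linux/types.h>

/*
 * Illustration of the suggestion only, nothing like this exists in the
 * tree: keep a decayed per-task count of pages migrated per knuma_scand
 * pass (same halving scheme as the other statistics) and stop migrating
 * once a task is clearly ping-ponging.  Name and threshold are invented.
 */
#define PINGPONG_THRESHOLD      1024UL  /* arbitrary: decayed pages per pass */

static bool task_is_pingponging(unsigned long *migrate_decayed,
                                unsigned long migrated_this_pass)
{
        *migrate_decayed = *migrate_decayed / 2 + migrated_this_pass;
        return *migrate_decayed > PINGPONG_THRESHOLD;
}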

Thanks a lot for the help!
Andrea


* Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
       [not found]   ` <20121011110137.GQ3317@csn.ul.ie>
@ 2012-10-11 16:43     ` Andrea Arcangeli
  2012-10-11 19:48       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 12:01:37PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> > The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> > faults to identify the per NUMA node working set of the thread at
> > runtime.
> > 
> > Arming the NUMA hinting page fault mechanism works similarly to
> > setting up a mprotect(PROT_NONE) virtual range: the present bit is
> > cleared at the same time that _PAGE_NUMA is set, so when the fault
> > triggers we can identify it as a NUMA hinting page fault.
> > 
> 
> That implies that there is an atomic update requirement or at least
> an ordering requirement -- present bit must be cleared before setting
> NUMA bit. No doubt it'll be clear later in the series how this is
> accomplished. What you propose seems ok but it all depends how it's
> implemented so I'm leaving my ack off this particular patch for now.

Correct. The switch is done atomically (clear _PAGE_PRESENT at the
same time _PAGE_NUMA is set). The TLB flush is deferred (it's batched
to avoid firing an IPI for every pte/pmd_numa we establish).

It's still similar to setting a range PROT_NONE (except that the way
_PAGE_PROTNONE and _PAGE_NUMA work is the opposite, and they are
mutually exclusive, so they can easily share the same pte/pmd
bitflag). Except that PROT_NONE must be synchronous, while _PAGE_NUMA
is set lazily.

The NUMA hinting page fault also won't require any TLB flush ever.

So the whole process (establish/teardown) has an incredibly low TLB
flushing cost.
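
Schematically, arming one pte looks like this (a sketch of the mechanism
as described above, x86 flavour, not the code in the patch):

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * Sketch of arming a NUMA hinting fault on one pte (x86, where
 * _PAGE_NUMA shares the _PAGE_PROTNONE bit); not the knuma_scand code.
 * A single pte write flips both bits together, and the TLB flush is
 * batched per scanned range instead of one IPI per pte.
 */
static void pte_arm_numa_fault(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep)
{
        pte_t pte = *ptep;

        pte = pte_clear_flags(pte, _PAGE_PRESENT);
        pte = pte_set_flags(pte, _PAGE_NUMA);
        set_pte_at(mm, addr, ptep, pte);
        /* no flush_tlb_page() here: the flush is deferred and batched */
}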

The only fixed cost is in knuma_scand and the kernel entry/exit for
every not-shared page every 10 sec (or whatever you set the duration
of a knuma_scand pass to in sysfs).

Furthermore, if the pmd_scan mode is activated, I guarantee there's at
most 1 NUMA hinting page fault per 2m virtual region (even if some
accuracy is lost). You can try to set scan_pmd = 0 in sysfs and also
disable THP (echo never >enabled) to measure the exact cost per 4k
page. It's hardly measurable here. With THP the fault is also 1 per
2m virtual region but no accuracy is lost in that case (or more
precisely, there's no way to get more accuracy than that as we deal
with a pmd).


* Re: [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
       [not found]   ` <20121011111545.GR3317@csn.ul.ie>
@ 2012-10-11 16:58     ` Andrea Arcangeli
  2012-10-11 19:54       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 12:15:45PM +0100, Mel Gorman wrote:
> huh?
> 
> #define _PAGE_NUMA     _PAGE_PROTNONE
> 
> so this is effective _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PROTNONE
> 
> I suspect you are doing this because there is no requirement for
> _PAGE_NUMA == _PAGE_PROTNONE for other architectures and it was best to
> describe your intent. Is that really the case or did I miss something
> stupid?

Exactly.

It is a reminder that we need to return true in pte_present() when the
NUMA hinting page fault is armed on the pte.

Hardwiring _PAGE_NUMA to _PAGE_PROTNONE is conceptually unnecessary and
actually an artificial restriction. Other archs without a bitflag for
_PAGE_PROTNONE may want to use something else, and they'll have to deal
with pte_present() too, somehow. So this is a reminder for them as well.

> >  static inline int pte_hidden(pte_t pte)
> > @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
> >  	 * the _PAGE_PSE flag will remain set at all times while the
> >  	 * _PAGE_PRESENT bit is clear).
> >  	 */
> > -	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> > +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> > +				 _PAGE_NUMA);
> > +}
> > +
> > +#ifdef CONFIG_AUTONUMA
> > +/*
> > + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> > + * same bit too). It's set only when _PAGE_PRESET is not set and it's
> 
> same bit on x86, not necessarily anywhere else.

Yep. In fact before using _PAGE_PRESENT the two bits were different
even on x86. But I unified them. If I vary them then they will become
_PAGE_PTE_NUMA/_PAGE_PMD_NUMA and the above will fail to build without
risk of errors.

> 
> _PAGE_PRESENT?

good eye ;) corrected.

> > +/*
> > + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> > + * because they're called by the NUMA hinting minor page fault.
> 
> automatically or atomically?
> 
> I assume you meant atomically but what stops two threads faulting at the
> same time and doing to the same update? mmap_sem will be insufficient in
> that case so what is guaranteeing the atomicity. PTL?

I meant automatically; I explained myself badly and automatically may
be the wrong word. It is also atomic of course, but that wasn't the
point.

So the thing is: the numa hinting page fault hooking point is this:

	if (pte_numa(entry))
		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);

It won't get this far:

	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {

So if I don't set _PAGE_ACCESSED in pte/pmd_mknuma, the TLB miss
handler will have to set _PAGE_ACCESSED itself with an additional
write on the pte/pmd later when userland touches the page. And that
will slow us down for no good.

Because mknuma is only called in the numa hinting page fault context,
it's optimal to set _PAGE_ACCESSED too, not only _PAGE_PRESENT (and
clearing _PAGE_NUMA of course).

The basic idea is that the numa hinting page fault can only trigger
if userland touches the page, and after such an event, _PAGE_ACCESSED
would be set by the hardware no matter if there is a NUMA hinting page
fault or not (so we can optimize away the hardware action when the NUMA
hinting page fault triggers).
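
In other words the fixup amounts to something like this (sketch with an
invented name, not the patch code; the real helper is the
pte_mknonnuma() in the diff below):

/*
 * Sketch only (invented name): present and accessed are set, the NUMA
 * bit is cleared, so the TLB miss after returning to userland has
 * nothing left to write.
 */
static inline pte_t pte_numa_fixup_sketch(pte_t pte)
{
        pte = pte_clear_flags(pte, _PAGE_NUMA);
        return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}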

I tried to reword it:

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cf1d3f0..3dc6a9b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -449,12 +449,12 @@ static inline int pmd_numa(pmd_t pmd)
 #endif
 
 /*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag too because they're
+ * only called by the NUMA hinting minor page fault. If we wouldn't
+ * set the _PAGE_ACCESSED bitflag here, the TLB miss handler would be
+ * forced to set it later while filling the TLB after we return to
+ * userland. That would trigger a second write to memory that we
+ * optimize away by setting _PAGE_ACCESSED here.
  */
 static inline pte_t pte_mknonnuma(pte_t pte)
 {



* Re: [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
       [not found]   ` <20121011122255.GS3317@csn.ul.ie>
@ 2012-10-11 17:05     ` Andrea Arcangeli
  2012-10-11 20:01       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 01:22:55PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:48AM +0200, Andrea Arcangeli wrote:
> > In the special "pmd" mode of knuma_scand
> > (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> > type (_PAGE_PRESENT not set), however the pte might be
> > present. Therefore, gup_pmd_range() must return 0 in this case to
> > avoid losing a NUMA hinting page fault during gup_fast.
> > 
> 
> So if gup_fast fails, presumably we fall back to taking the mmap_sem and
> calling get_user_pages(). This is a heavier operation and I wonder if the
> cost is justified. i.e. Is the performance loss from using get_user_pages()
> offset by improved NUMA placement? I ask because we always incur the cost of
> taking mmap_sem but only sometimes get it back from improved NUMA placement.
> How bad would it be if gup_fast lost some of the NUMA hinting information?

Good question indeed. Now, I agree it wouldn't be bad to skip NUMA
hinting page faults in gup_fast for non-virt usage like
O_DIRECT/ptrace, but the only problem is that we'd lose AutoNUMA on
the memory touched by the KVM vcpus.

I've also been asked whether the vhost-net kernel thread (the KVM
in-kernel virtio backend) will be controlled by autonuma in between
use_mm/unuse_mm, and the answer is yes, but to do that it also needs
this. (See also the flush of task_autonuma_nid and the mm/task
statistics in unuse_mm to reset it back to regular kernel thread
status, uncontrolled by autonuma.)

$ git grep get_user_pages
tcm_vhost.c:            ret = get_user_pages_fast((unsigned long)ptr, 1, write, &page);
vhost.c:        r = get_user_pages_fast(log, 1, 1, &page);


* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
       [not found]   ` <20121011122827.GT3317@csn.ul.ie>
@ 2012-10-11 17:15     ` Andrea Arcangeli
  2012-10-11 20:06       ` Mel Gorman
       [not found]     ` <5076E4B2.2040301@redhat.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 01:28:27PM +0100, Mel Gorman wrote:
> s/togehter/together/

Fixed.

> 
> > + * knumad_scan structure.
> > + */
> > +struct mm_autonuma {
> 
> Nit but this is very similar in principle to mm_slot for transparent
> huge pages. It might be worth renaming both to mm_thp_slot and
> mm_autonuma_slot to set the expectation they are very similar in nature.
> Could potentially be made generic but probably overkill.

Agreed. A plain rename to mm_autonuma_slot would have the only con of
making some code spill over 80 columns ;).

> > +	/* link for knuma_scand's list of mm structures to scan */
> > +	struct list_head mm_node;
> > +	/* Pointer to associated mm structure */
> > +	struct mm_struct *mm;
> > +
> > +	/*
> > +	 * Zeroed from here during allocation, check
> > +	 * mm_autonuma_reset() if you alter the below.
> > +	 */
> > +
> > +	/*
> > +	 * Pass counter for this mm. This exist only to be able to
> > +	 * tell when it's time to apply the exponential backoff on the
> > +	 * task_autonuma statistics.
> > +	 */
> > +	unsigned long mm_numa_fault_pass;
> > +	/* Total number of pages that will trigger NUMA faults for this mm */
> > +	unsigned long mm_numa_fault_tot;
> > +	/* Number of pages that will trigger NUMA faults for each [nid] */
> > +	unsigned long mm_numa_fault[0];
> > +	/* do not add more variables here, the above array size is dynamic */
> > +};
> 
> How cache hot is this structure? nodes are sharing counters in the same
> cache lines so if updates are frequent this will bounce like a mad yoke.
> Profiles will tell for sure but it's possible that some sort of per-cpu
> hilarity will be necessary here in the future.

On autonuma27 this is only written by knuma_scand so it won't risk
bouncing.

On autonuma28 however it's updated by the numa hinting page fault
locklessly and so your concern is very real, and the cacheline bounces
will materialize. It'll cause more interconnect traffic before the
workload converges too. I thought about that, but I wanted the
mm_autonuma updated in real time as migration happens otherwise it
converges more slowly if we have to wait until the next pass to bring
mm_autonuma statistical data in sync with the migration
activities. Converging more slowly looked worse than paying more
cacheline bounces.

It's a tradeoff. And if it's not a good one, we can go back to
autonuma27 mm_autonuma stat gathering method and converge slower but
without any cacheline bouncing in the NUMA hinting page faults. At
least it's lockless.
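
To make the tradeoff concrete, the lockless update from the NUMA
hinting page fault is in spirit something like the sketch below (the
field names follow the structure quoted above; the helper name and the
exact placement of the decay are simplifications, not the literal
autonuma28 code):

/* sketch: called from the NUMA hinting page fault, no lock taken */
static void mm_autonuma_fault_account(struct mm_struct *mm, int nid,
				      int numpages, unsigned long pass)
{
	struct mm_autonuma *mma = mm->mm_autonuma;
	int i;

	/* first fault of a new knuma_scand pass: exponential backoff */
	if (unlikely(ACCESS_ONCE(mma->mm_numa_fault_pass) != pass)) {
		for_each_node(i)
			mma->mm_numa_fault[i] >>= 1;
		mma->mm_numa_fault_tot >>= 1;
		mma->mm_numa_fault_pass = pass;
	}

	/*
	 * Lockless, racy increments: threads faulting on other nodes
	 * write the same cachelines, which is where the bounces come
	 * from. Being off by a few pages is fine, the data is only
	 * statistical.
	 */
	mma->mm_numa_fault[nid] += numpages;
	mma->mm_numa_fault_tot += numpages;
}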

> > +	unsigned long task_numa_fault_pass;
> > +	/* Total number of eligible pages that triggered NUMA faults */
> > +	unsigned long task_numa_fault_tot;
> > +	/* Number of pages that triggered NUMA faults for each [nid] */
> > +	unsigned long task_numa_fault[0];
> > +	/* do not add more variables here, the above array size is dynamic */
> > +};
> > +
> 
> Same question about cache hotness.

Here it's per-thread, so there won't be risk of accesses interleaved
by different CPUs.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/33] autonuma: define the autonuma flags
       [not found]   ` <20121011134643.GU3317@csn.ul.ie>
@ 2012-10-11 17:34     ` Andrea Arcangeli
  2012-10-11 20:17       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 02:46:43PM +0100, Mel Gorman wrote:
> Should this be a SCHED_FEATURE flag?

I guess it could. It is only used by kernel/sched/numa.c which isn't
even built unless CONFIG_AUTONUMA is set. So it would require a
CONFIG_AUTONUMA in the sched feature flags unless we want to expose
no-operational bits. I'm not sure what the preferred way is.

> Have you ever identified a case where it's a good idea to set that flag?

It's currently set by default but no, I didn't do enough experiments
to tell whether it's worth copying or resetting the data.

> A child that closely shared data with its parent is not likely to also
> want to migrate to separate nodes. It just seems unnecessary to have and

Agreed, this is why the task_selected_nid is always inherited by
default (that is the CFS autopilot driver).

The question is whether the full statistics should also be inherited across
fork/clone or not. I don't know the answer yet and that's why that
knob exists.

If we retain them, autonuma_balance may decide to move the
task before a full statistics buildup has happened in the child.

The current way is to reset the data and wait for the data to build up
in the child, while we keep CFS on autopilot with task_selected_nid
(which is always inherited). I thought the current one was a good
tradeoff, but copying all the data isn't a horrible idea either.

> impossible to suggest to an administrator how the flag might be used.

Agreed. This in fact is a debug-only flag, it won't ever show up to the admin.

#ifdef CONFIG_DEBUG_VM
SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
SYSFS_ENTRY(migrate_allow_first_fault,
	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
#endif /* CONFIG_DEBUG_VM */

> 
> > +	/*
> > +	 * If set, this tells knuma_scand to trigger NUMA hinting page
> > +	 * faults at the pmd level instead of the pte level. This
> > +	 * reduces the number of NUMA hinting faults potentially
> > +	 * saving CPU time. It reduces the accuracy of the
> > +	 * task_autonuma statistics (but does not change the accuracy
> > +	 * of the mm_autonuma statistics). This flag can be toggled
> > +	 * through sysfs as runtime.
> > +	 *
> > +	 * This flag does not affect AutoNUMA with transparent
> > +	 * hugepages (THP). With THP the NUMA hinting page faults
> > +	 * always happen at the pmd level, regardless of the setting
> > +	 * of this flag. Note: there is no reduction in accuracy of
> > +	 * task_autonuma statistics with THP.
> > +	 *
> > +	 * Default set.
> > +	 */
> > +	AUTONUMA_SCAN_PMD_FLAG,
> 
> This flag and the other flags make sense. Early on we just are not going
> to know what the correct choice is. My gut says that ultimately we'll

Agreed. This is why I left these knobs in, even if I've been asked to
drop them a few times (they were perceived as adding complexity). But
for things we're not sure about, these really help to benchmark
quickly one way or another.

scan_pmd is actually not under DEBUG_VM as it looked like a more fundamental thing.

> default to PMD level *but* fall back to PTE level on a per-task basis if
> ping-pong migrations are detected. This will catch ping-pongs on data
> that is not PMD aligned although obviously data that is not page aligned
> will also suffer. Eventually I think this flag will go away but the
> behaviour will be;
> 
> default, AUTONUMA_SCAN_PMD
> if ping-pong, fallback to AUTONUMA_SCAN_PTE
> if ping-ping, AUTONUMA_SCAN_NONE

That would be ideal, good idea indeed.

> so there is a graceful degradation if autonuma is doing the wrong thing.

Makes perfect sense to me if we figure out how to reliably detect when
to make the switch.
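
Something along these lines is roughly what I'd picture, purely as a
sketch (the per-task scan_mode field, the watermarks and the way
migrations are counted are invented here for illustration, none of
this exists in the series yet):

/* sketch only: per-task graceful degradation of the scan granularity */
#define PING_PONG_LOW_WMARK	64
#define PING_PONG_HIGH_WMARK	512

enum autonuma_scan_mode {
	AUTONUMA_SCAN_PMD,	/* default */
	AUTONUMA_SCAN_PTE,	/* fall back if ping-pongs are detected */
	AUTONUMA_SCAN_NONE,	/* give up on this task entirely */
};

static void autonuma_update_scan_mode(struct task_autonuma *ta,
				      unsigned long migrated_this_pass)
{
	/* decaying average of pages migrated per knuma_scand pass */
	ta->migrate_avg = (ta->migrate_avg + migrated_this_pass) / 2;

	if (ta->migrate_avg > PING_PONG_HIGH_WMARK)
		ta->scan_mode = AUTONUMA_SCAN_NONE;
	else if (ta->migrate_avg > PING_PONG_LOW_WMARK)
		ta->scan_mode = AUTONUMA_SCAN_PTE;
	else
		ta->scan_mode = AUTONUMA_SCAN_PMD;
}

The hard part remains picking the watermarks without reintroducing the
kind of magic numbers I tried to avoid elsewhere.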

thanks!
Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
  2012-10-11 16:07     ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
@ 2012-10-11 19:37       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 19:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:07:02PM +0200, Andrea Arcangeli wrote:
> Hi,
> 
> On Thu, Oct 11, 2012 at 11:50:36AM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> > > +The AutoNUMA logic is a chain reaction resulting from the actions of
> > > +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically
> > 
> > s/knum_scand/knuma_scand/
> 
> Applied.
> 
> > > +scans the mm structures of all active processes. It gathers the
> > > +AutoNUMA mm statistics for each "anon" page in the process's working
> > 
> > Ok, so this will not make a different to file-based workloads but as I
> > mentioned in the leader this would be a difficult proposition anyway
> > because if it's read/write based, you'll have no statistics.
> 
> Oops, sorry for the confusion, but the doc is wrong on this one: it
> actually tracks anything with a page_mapcount == 1, even if that is
> pagecache or even .text as long as it's only mapped in a single
> process. So if you have a threaded database doing a gigantic MAP_SHARED,
> it'll track and move around the whole MAP_SHARED as well as anonymous
> memory or anything that can be moved.
> 

Ok, I would have expected MAP_PRIVATE in this case but I get your point.

> Changed to:
> 
> +AutoNUMA mm statistics for each not shared page in the process's
> 

Better.
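
So in effect the eligibility test in knuma_scand boils down to a
mapcount check. For other readers, something like this sketch (the
helper name is mine, the real test is spread over the pte walk):

/* sketch: is this page something autonuma should track and migrate? */
static bool knuma_scand_eligible(struct vm_area_struct *vma,
				 unsigned long addr, pte_t pteval)
{
	struct page *page = vm_normal_page(vma, addr, pteval);

	if (!page)
		return false;
	/*
	 * Anything mapped by a single process qualifies: anonymous
	 * memory, but also pagecache and even .text, as long as
	 * page_mapcount() is 1. Shared pages are skipped because
	 * there is no single "right" node for them.
	 */
	return page_mapcount(page) == 1;
}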

> > > +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> > > +present bit in each pte or pmd that was counted. This triggers NUMA
> > > +hinting page faults described next.
> > > +
> > > +The mm statistics are expentially decayed by dividing the total memory
> > > +in half and adding the new totals to the decayed values for each
> > > +knuma_scand pass. This causes the mm statistics to resemble a simple
> > > +forecasting model, taking into account some past working set data.
> > > +
> > > +=== NUMA hinting fault ===
> > > +
> > > +A NUMA hinting fault occurs when a task running on a CPU thread
> > > +accesses a vma whose pte or pmd is not present and the NUMA bit is
> > > +set. The NUMA hinting page fault handler returns the pte or pmd back
> > > +to its present state and counts the fault's occurance in the
> > > +task_autonuma structure.
> > > +
> > 
> > So, minimally one source of System CPU overhead will be increased traps.
> 
> Correct.
> 
> It takes down 128M every 100msec, and then when it has finished taking
> down everything it sleeps 10sec, then increases the pass_counter and
> restarts. It's not measurable: even if I do a kernel build with -j128
> in tmpfs, the performance is identical with autonuma running or not.
> 

Ok, I see it clearly now, particularly after reading the series. It does
mean a CPU spike every 10 seconds but it'll be detectable if it's a problem.

> > I haven't seen the code yet obviously but I wonder if this gets accounted
> > for as a minor fault? If it does, how can we distinguish between minor
> > faults and numa hinting faults? If not, is it possible to get any idea of
> > how many numa hinting faults were incurred? Mention it here.
> 
> Yes, it's surely accounted as a minor fault. To monitor it normally I
> use:
> 
> perf probe numa_hinting_fault
> perf record -e probe:numa_hinting_fault -aR -g sleep 10
> perf report -g
> 

Ok, straight-forward. Also can be recorded with trace-cmd obviously once
the probe is in place.

> # Samples: 345  of event 'probe:numa_hinting_fault'
> # Event count (approx.): 345
> #
> # Overhead  Command      Shared Object                  Symbol
> # ........  .......  .................  ......................
> #
>     64.64%     perf  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--57.40%-- sig_handler
>                   |          |          
>                   |          |--62.50%-- run_builtin
>                   |          |          main
>                   |          |          __libc_start_main
>                   |          |          
>                   |           --37.50%-- 0x7f47f7c6cba0
>                   |                     run_builtin
>                   |                     main
>                   |                     __libc_start_main
>                   |          
>                   |--16.59%-- __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.87%-- 0x7f47f7c6cba0
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.42%-- save_i387_xstate
>                   |          do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                    --6.73%-- sys_poll
>                              system_call_fastpath
>                              __poll
> 
>     21.45%     ntpd  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--66.22%-- 0x42b910
>                   |          0x0
>                   |          
>                   |--24.32%-- __select
>                   |          0x0
>                   |          
>                   |--4.05%-- do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __select
>                   |          0x0
>                   |          
>                   |--2.70%-- 0x7f88827b3ba0
>                   |          0x0
>                   |          
>                    --2.70%-- clock_gettime
>                              0x1a1eb808
> 
>      7.83%     init  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--33.33%-- __select
>                   |          0x0
>                   |          
>                   |--29.63%-- 0x404e0c
>                   |          0x0
>                   |          
>                   |--18.52%-- 0x405820
>                   |          
>                   |--11.11%-- sys_select
>                   |          system_call_fastpath
>                   |          __select
>                   |          0x0
>                   |          
>                    --7.41%-- 0x402528
> 
>      6.09%    sleep  [kernel.kallsyms]  [k] numa_hinting_fault
>               |
>               --- numa_hinting_fault
>                   handle_mm_fault
>                   do_page_fault
>                   page_fault
>                  |          
>                  |--42.86%-- 0x7f0f67847fe0
>                  |          0x7fff4cd6d42b
>                  |          
>                  |--28.57%-- 0x404007
>                  |          
>                  |--19.05%-- nanosleep
>                  |          
>                   --9.52%-- 0x4016d0
>                             0x7fff4cd6d42b
> 
> 
> Chances are we want to add more vmstat for this event.
> 
> > > +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> > > +
> > > +- Increments the total number of pages faulted for this task
> > > +
> > > +- Increments the number of pages faulted on the current NUMA node
> > > +
> > 
> > So, am I correct in assuming that the rate of NUMA hinting faults will be
> > related to the scan rate of knuma_scand?
> 
> This is correct. They're identical.
> 
> There's a slight chance that two threads hit the fault on the same
> pte/pmd_numa concurrently, but just one of the two will actually
> invoke the numa_hinting_fault() function.
> 

Ok, I see they'll be serialised by the PTL anyway.

> > > +- If the fault was for an hugepage, the number of subpages represented
> > > +  by an hugepage is added to the task statistics above
> > > +
> > > +- Each time the NUMA hinting page fault discoveres that another
> > 
> > s/discoveres/discovers/
> 
> Fixed.
> 
> > 
> > > +  knuma_scand pass has occurred, it divides the total number of pages
> > > +  and the pages for each NUMA node in half. This causes the task
> > > +  statistics to be exponentially decayed, just as the mm statistics
> > > +  are. Thus, the task statistics also resemble a simple forcasting
> 
> Also noticed forecasting ;).
> 
> > > +  model, taking into account some past NUMA hinting fault data.
> > > +
> > > +If the page being accessed is on the current NUMA node (same as the
> > > +task), the NUMA hinting fault handler only records the nid of the
> > > +current NUMA node in the page_autonuma structure field last_nid and
> > > +then it'd done.
> > > +
> > > +Othewise, it checks if the nid of the current NUMA node matches the
> > > +last_nid in the page_autonuma structure. If it matches it means it's
> > > +the second NUMA hinting fault for the page occurring (on a subsequent
> > > +pass of the knuma_scand daemon) from the current NUMA node.
> > 
> > You don't spell it out, but this is effectively a migration threshold N
> > where N is the number of remote NUMA hinting faults that must be
> > incurred before migration happens. The default value of this threshold
> > is 2.
> > 
> > Is that accurate? If so, why 2?
> 
> More like 1. It needs one confirmation that the migrate request comes
> from the same node again (note: it is allowed to come from a different
> thread as long as it's the same node and that is very important).
> 
> Why only 1 confirmation? It's the same as page aging. We could record
> the number of pagecache lookup hits, and not just have a single bit as
> reference count. But doing so, if the workload radically changes it
> takes too much time to adapt to the new configuration and so I usually
> don't like counting.
> 
> Plus I avoided as much as possible fixed numbers. I can explain why 0
> or 1, but I can't as easily explain why 5 or 8, so if I can't explain
> it, I avoid it.
> 

That's fair enough. Expressing in terms of page aging is reasonable. One
could argue for any number but ultimately it'll be related to the
workload. It's all part of the "ping-pong" detection problem. Include
this blurb in the docs.
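
For anyone skimming the thread, the check being discussed amounts to
something like the sketch below (the page_autonuma structure and its
last_nid field are from the documentation above; the lookup helper and
the function itself are my own condensation):

/* sketch: should this NUMA hinting fault migrate the page? */
static bool numa_hinting_fault_should_migrate(struct page *page,
					      int this_nid)
{
	struct page_autonuma *pa = page_autonuma(page);
	int last_nid = ACCESS_ONCE(pa->last_nid);

	/*
	 * Record who faulted on it. Migration happens only on the
	 * second fault from the same node, on a later knuma_scand
	 * pass; the faulting thread may differ, the node is what
	 * matters.
	 */
	pa->last_nid = this_nid;
	return last_nid == this_nid;
}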

> > I don't have a better suggestion, it's just an obvious source of an
> > adverse workload that could force a lot of migrations by faulting once
> > per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.
> 
> Correct, for certain workloads like single instance specjbb that
> wasn't enough, but it is fixed in autonuma28, now it's faster even on
> single instance.
> 

Ok.

> > I'm assuming it must be async migration then. IO in progress would be
> > a bit of a surprise though! It would have to be a mapped anonymous page
> > being written to swap.
> 
> It's all migrate on fault now, but I'm using all the methods you
> implemented for compaction to avoid blocking in migrate_pages.
> 

Excellent. There are other places where autonuma may need to back off
if contention is detected but it can be incrementally addressed.

> > > +=== Task exchange ===
> > > +
> > > +The following defines "weight" in the AutoNUMA balance routine's
> > > +algorithm.
> > > +
> > > +If the tasks are threads of the same process:
> > > +
> > > +    weight = task weight for the NUMA node (since memory weights are
> > > +             the same)
> > > +
> > > +If the tasks are not threads of the same process:
> > > +
> > > +    weight = memory weight for the NUMA node (prefer to move the task
> > > +             to the memory)
> > > +
> > > +The following algorithm determines if the current task will be
> > > +exchanged with a running task on a remote NUMA node:
> > > +
> > > +    this_diff: Weight of the current task on the remote NUMA node
> > > +               minus its weight on the current NUMA node (only used if
> > > +               a positive value). How much does the current task
> > > +               prefer to run on the remote NUMA node.
> > > +
> > > +    other_diff: Weight of the current task on the remote NUMA node
> > > +                minus the weight of the other task on the same remote
> > > +                NUMA node (only used if a positive value). How much
> > > +                does the current task prefer to run on the remote NUMA
> > > +                node compared to the other task.
> > > +
> > > +    total_weight_diff = this_diff + other_diff
> > > +
> > > +    total_weight_diff: How favorable it is to exchange the two tasks.
> > > +                       The pair of tasks with the highest
> > > +                       total_weight_diff (if any) are selected for
> > > +                       exchange.
> > > +
> > > +As mentioned above, if the two tasks are threads of the same process,
> > > +the AutoNUMA balance routine uses the task_autonuma statistics. By
> > > +using the task_autonuma statistics, each thread follows its own memory
> > > +locality and they will not necessarily converge on the same node. This
> > > +is often very desirable for processes with more threads than CPUs on
> > > +each NUMA node.
> > > +
> > 
> > What about the case where two threads on different CPUs are accessing
> 
> I assume on different nodes (different cpus if in the same node, the
> above won't kick in).
> 

Yes, on different nodes is the case I care about here.

> > separate structures that are not page-aligned (base or huge page but huge
> > page would be obviously worse). Does this cause a ping-pong effect or
> > otherwise mess up the statistics?
> 
> Very good point! This is exactly what I call NUMA false sharing and
> it's the biggest nightmare in this whole effort.
> 
> So if there's an huge amount of this over time the statistics will be
> around 50/50 (the statistics just record the working set of the
> thread).
> 
> So if there's another process (note: thread not) heavily computing the
> 50/50 won't be used and the mm statistics will be used instead to
> balance the two threads against the other process. And the two threads
> will converge in the same node, and then their thread statistics will
> change from 50/50 to 0/100 matching the mm statistics.
> 
> If there are just threads and they're all doing what you describe
> above with all their memory, well then the problem has no solution,
> and the new stuff in autonuma28 will deal with that too.
> 

Ok, I'll keep an eye out for it. 

> Ideally we should do MADV_INTERLEAVE, I didn't get that far yet but I
> probably could now.
> 

I like the idea that if autonuma was able to detect that it was
ping-ponging it would set MADV_INTERLEAVE and remove the task from the
list of address spaces to scan entirely. It would require more state in
the task struct.

> Even without the new stuff it wasn't too bad but there were a bit too
> many spurious migrations in that load with autonuma27 and previous. It
> was less spurious on bigger systems with many nodes because last_nid
> is implicitly more accurate there (as last_nid will have more possible
> values than 0|1). With autonuma28 even on 2 nodes it's perfectly fine.
> 
> If it's just 1 page false sharing and all the rest is thread-local,
> the statistics will be 99/1 and the false sharing will be lost in the
> noise.
> 
> The false sharing spillover caused by alignments is minor if the
> threads are really computing on a lot of local memory so it's not a
> concern and it will be optimized away by the last_nid plus the new
> stuff.
> 

Ok, so there will be examples of when this works and counter-examples
and that is unavoidable. At least the cases where it goes wrong are
known in advance. Ideally there would be a few statistics to help track
when that is going wrong.

pages_migrate_success		migrate.c
pages_migrate_fail		migrate.c
  (tracepoint to distinguish between compaction and autonuma, delete
   the equivalent counters in compaction.c. Monitoring the success
   counter over time will allow an estimate of how much inter-node
   traffic is due to autonuma)

thread_migrate_numa		autonuma.c
  (rapidly increasing count implies ping-pong. A count that is static
  indicates that it has converged. Potentially could be aggregated
  for a whole address space to see if an overall application has
  converged or not.)

PMU for remote numa accesses
  (should decrease with autonuma if working perfectly. Will increase it
   when ping-pong is occurring)

perf probe do_numahint_page() / do_numahint_pmd()
  (tracks overhead incurred from the faults, also shows up in profile)

I'm not suggesting this all has to be implemented but it'd be nice to know
in advance how we'll identify both when autonuma is getting things right
and when it's getting it wrong.

> > Ok, very obviously this will never be an RT feature but that is hardly
> > a surprise and anyone who tries to enable this for RT needs their head
> > examined. I'm not suggesting you do it but people running detailed
> > performance analysis on scheduler-intensive workloads might want to keep
> > an eye on their latency and jitter figures and how they are affected by
> > this exchanging. Does ftrace show a noticable increase in wakeup latencies
> > for example?
> 
> If you do:
> 
> echo 1 >/sys/kernel/mm/autonuma/debug
> 
> you will get 1 printk every single time sched_autonuma_balance
> triggers a task exchange.
> 

Yeah, ideally we would get away from that over time though and move to
tracepoints so it can be gathered by trace-cmd or perf depending on the
exact scenario.
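
As a rough illustration of what I mean, a tracepoint along these lines
(names invented here, nothing like this is in the series) would make
the exchange events consumable by perf or trace-cmd instead of dmesg:

/* include/trace/events/autonuma.h - sketch only */
TRACE_EVENT(autonuma_task_exchange,

	TP_PROTO(int this_cpu, int other_cpu, pid_t this_pid, pid_t other_pid),

	TP_ARGS(this_cpu, other_cpu, this_pid, other_pid),

	TP_STRUCT__entry(
		__field(int,	this_cpu)
		__field(int,	other_cpu)
		__field(pid_t,	this_pid)
		__field(pid_t,	other_pid)
	),

	TP_fast_assign(
		__entry->this_cpu	= this_cpu;
		__entry->other_cpu	= other_cpu;
		__entry->this_pid	= this_pid;
		__entry->other_pid	= other_pid;
	),

	TP_printk("cpu %d <-> cpu %d pid %d <-> pid %d",
		  __entry->this_cpu, __entry->other_cpu,
		  __entry->this_pid, __entry->other_pid)
);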

> With autonuma28 I resolved a lot of the jittering and now there are
> 6/7 printk for the whole 198 seconds of numa01. CFS runs in autopilot
> all the time.
> 

Good. trace-cmd with the wakeup plugin might also be able to detect
interference in scheduling latencies.

> With specjbb x2 overcommit, the active balancing events are reduced to
> one every few sec (vs several per sec with autonuma27). In fact the
> specjbb x2 overcommit load jumped ahead too with autonuma28.
> 
> About tracing events, the git branch already has tracing events to
> monitor all page and task migrations showed in an awesome "perf script
> numatop" from Andrew.

Cool!

> Likely we need one tracing event to see the task
> exchange generated specifically by the autonuma balancing event (we're
> running short in event columns to show it in numatop though ;). Right
> now that is only available as the printk above.
> 

Ok.

> > > +=== task_autonuma - per task AutoNUMA data ===
> > > +
> > > +The task_autonuma structure is used to hold AutoNUMA data required for
> > > +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> > > +nodes.
> > > +
> > > +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> > > +                scheduler balancing code, -1 if none (2 bytes)
> > > +
> > > +- Task NUMA statistics for this thread/process:
> > > +
> > > +    Total number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes)
> > > +
> > > +    Per NUMA node number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes * # of NUMA nodes)
> > > +
> > 
> > It might be possible to put a coarse ping-pong detection counter in here
> > as well by recording a declaying average of number of pages migrated
> > over a number of knuma_scand passes instead of just the last one.  If the
> > value is too high, you're ping-ponging and the process should be ignored,
> > possibly forever. It's not a requirement and it would be more memory
> > overhead obviously but I'm throwing it out there as a suggestion if it
> > ever turns out the ping-pong problem is real.
> 
> Yes, this is a problem where we have an enormous degree of freedom in
> trying things, so your suggestions are very appreciated :).
> 
> About ping-ponging of CPUs, I've never seen it yet (even if it's 550/450,
> it rarely switches over from 450/550, and even if it does, it doesn't
> really change anything because it's a fairly rare event and one node
> is not more right than the other anyway).
> 

I expect in practice that it's very rare that there is a workload that
does not or cannot align its data structures. It'll happen at least once
so it'd be nice to be able to identify that quickly when it does.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
  2012-10-11 16:43     ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
@ 2012-10-11 19:48       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 19:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:43:00PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 12:01:37PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> > > The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> > > faults to identify the per NUMA node working set of the thread at
> > > runtime.
> > > 
> > > Arming the NUMA hinting page fault mechanism works similarly to
> > > setting up a mprotect(PROT_NONE) virtual range: the present bit is
> > > cleared at the same time that _PAGE_NUMA is set, so when the fault
> > > triggers we can identify it as a NUMA hinting page fault.
> > > 
> > 
> > That implies that there is an atomic update requirement or at least
> > an ordering requirement -- present bit must be cleared before setting
> > NUMA bit. No doubt it'll be clear later in the series how this is
> > accomplished. What you propose seems ok but it all depends how it's
> > implemented so I'm leaving my ack off this particular patch for now.
> 
> Correct. The switch is done atomically (clear _PAGE_PRESENT at the
> same time _PAGE_NUMA is set). The tlb flush is deferred (it's batched
> to avoid firing an IPI for every pte/pmd_numa we establish).
> 

Good. I think you might still be flushing more than you need to but
commented on the patch itself.
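
To spell the establish side out for other readers, my understanding of
the switch is roughly the fragment below (heavily condensed;
pte_mknuma() is from this series, the batching variable is
illustrative only):

/* sketch: knuma_scand arming one pte, under the page table lock */
spin_lock(ptl);
ptent = *pte;
if (pte_present(ptent) && !pte_special(ptent)) {
	/*
	 * Write the new value in one store: _PAGE_PRESENT cleared and
	 * _PAGE_NUMA set, so the next access is recognisable as a
	 * NUMA hinting fault. No flush here; the TLB flush is batched
	 * and issued once per scan instead of one IPI per pte.
	 */
	set_pte_at(mm, addr, pte, pte_mknuma(ptent));
	defer_tlb_flush = true;
}
spin_unlock(ptl);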

> It's still similar to setting a range PROT_NONE (except the way
> _PAGE_PROTNONE and _PAGE_NUMA works is the opposite, and they are
> mutually exclusive, so they can easily share the same pte/pmd
> bitflag). Except PROT_NONE must be synchronous, _PAGE_NUMA is set lazily.
> 
> The NUMA hinting page fault also won't require any TLB flush ever.
> 

It sortof can. The fault itself is still a heavy operation that can do
things like this

numa_hinting_fault
 -> numa_hinting_fault_memory_follow_cpu
    -> autonuma_migrate_page
      -> sync_isolate_migratepages
	 (lru lock for single page)
      -> migrate_pages

and buried down there where it unmaps the page and makes a migration PTE
is a TLB flush due to calling ptep_clear_flush_notify(). That's a bad case
obviously and the expectation is that, as the threads converge to a node,
it's not a problem. While it's converging though it will be a heavy cost.

Tracking how often a numa_hinting_fault results in a migration should be
enough to keep an eye on it.

> So the whole process (establish/teardown) has an incredibly low TLB
> flushing cost.
> 
> The only fixed cost is in knuma_scand and the enter/exit kernel for
> every not-shared page every 10 sec (or whatever you set the duration
> of a knuma_scand pass in sysfs).
> 

10 seconds should be sufficiently low. It itself might need to adapt in
the future but at least 10 seconds now by default will not stomp too heavily.

> Furthermore, if the pmd_scan mode is activated, I guarantee there's at
> max 1 NUMA hinting page fault every 2m virtual region (even if some
> accuracy is lost). You can try to set scan_pmd = 0 in sysfs and also
> to disable THP (echo never >enabled) to measure the exact cost per 4k
> page. It's hardly measurable here. With THP the fault is also 1 every
> 2m virtual region but no accuracy is lost in that case (or more
> precisely, there's no way to get more accuracy than that as we deal
> with a pmd).
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
  2012-10-11 16:58     ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
@ 2012-10-11 19:54       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:58:47PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 12:15:45PM +0100, Mel Gorman wrote:
> > huh?
> > 
> > #define _PAGE_NUMA     _PAGE_PROTNONE
> > 
> > so this is effective _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PROTNONE
> > 
> > I suspect you are doing this because there is no requirement for
> > _PAGE_NUMA == _PAGE_PROTNONE for other architectures and it was best to
> > describe your intent. Is that really the case or did I miss something
> > stupid?
> 
> Exactly.
> 
> It's a reminder that we need to return true in pte_present when the
> NUMA hinting page fault is armed.
> 
> Hardwiring _PAGE_NUMA to _PAGE_PROTNONE conceptually is not necessary
> and it's actually an artificial restriction. Other archs without a
> bitflag for _PAGE_PROTNONE, may want to use something else and they'll
> have to deal with pte_present too, somehow. So this is a reminder for
> them as well.
> 

That's all very reasonable.
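
In other words, the contract on any arch is simply that a _PAGE_NUMA
pte still reports present. On x86 that comes for free from the bit
sharing, but written out generically it would look like this sketch
(the x86 code doesn't actually need the extra term):

static inline int pte_present(pte_t a)
{
	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
}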

> > >  static inline int pte_hidden(pte_t pte)
> > > @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
> > >  	 * the _PAGE_PSE flag will remain set at all times while the
> > >  	 * _PAGE_PRESENT bit is clear).
> > >  	 */
> > > -	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> > > +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> > > +				 _PAGE_NUMA);
> > > +}
> > > +
> > > +#ifdef CONFIG_AUTONUMA
> > > +/*
> > > + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> > > + * same bit too). It's set only when _PAGE_PRESET is not set and it's
> > 
> > same bit on x86, not necessarily anywhere else.
> 
> Yep. In fact before using _PAGE_PRESENT the two bits were different
> even on x86. But I unified them. If I vary them then they will become
> _PAGE_PTE_NUMA/_PAGE_PMD_NUMA and the above will fail to build without
> risk of errors.
> 

Ok.

> > 
> > _PAGE_PRESENT?
> 
> good eye ;) corrected.
> 
> > > +/*
> > > + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> > > + * because they're called by the NUMA hinting minor page fault.
> > 
> > automatically or atomically?
> > 
> > I assume you meant atomically but what stops two threads faulting at the
> > same time and doing to the same update? mmap_sem will be insufficient in
> > that case so what is guaranteeing the atomicity. PTL?
> 
> I meant automatically. I explained myself wrong and automatically may
> be the wrong word. It also is atomic of course but it wasn't about the
> atomic part.
> 
> So the thing is: the numa hinting page fault hooking point is this:
> 
> 	if (pte_numa(entry))
> 		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);
> 
> It won't get this far:
> 
> 	entry = pte_mkyoung(entry);
> 	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
> 
> So if I don't set _PAGE_ACCESSED in pte/pmd_mknuma, the TLB miss
> handler will have to set _PAGE_ACCESSED itself with an additional
> write on the pte/pmd later when userland touches the page. And that
> will slow us down for no good.
> 

All clear now. Letting it fall through to reach that point would be
convoluted and messy. This is a better option.

> Because mknuma is only called in the numa hinting page fault context,
> it's optimal to set _PAGE_ACCESSED too, not only _PAGE_PRESENT (and
> clearing _PAGE_NUMA of course).
> 
> The basic idea, is that the numa hinting page fault can only trigger
> if userland touches the page, and after such an event, _PAGE_ACCESSED
> would be set by the hardware no matter if there is a NUMA hinting page
> fault or not (so we can optimize away the hardware action when the NUMA
> hinting page fault triggers).
> 
> I tried to reword it:
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index cf1d3f0..3dc6a9b 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -449,12 +449,12 @@ static inline int pmd_numa(pmd_t pmd)
>  #endif
>  
>  /*
> - * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> - * because they're called by the NUMA hinting minor page fault. If we
> - * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
> - * would be forced to set it later while filling the TLB after we
> - * return to userland. That would trigger a second write to memory
> - * that we optimize away by setting _PAGE_ACCESSED here.
> + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag too because they're
> + * only called by the NUMA hinting minor page fault. If we wouldn't
> + * set the _PAGE_ACCESSED bitflag here, the TLB miss handler would be
> + * forced to set it later while filling the TLB after we return to
> + * userland. That would trigger a second write to memory that we
> + * optimize away by setting _PAGE_ACCESSED here.
>   */
>  static inline pte_t pte_mknonnuma(pte_t pte)
>  {
> 

Much better.
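
For completeness, my reading of the fault-side fixup that comment sits
on is roughly the sketch below (going by the discussion, not quoting
the patch):

/* sketch: restore the pte at the NUMA hinting fault */
static inline pte_t pte_mknonnuma(pte_t pte)
{
	pte = pte_clear_flags(pte, _PAGE_NUMA);
	return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}

i.e. the page becomes present again and _PAGE_ACCESSED is set in the
same write, so the TLB miss handler has nothing left to do.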

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
  2012-10-11 17:05     ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
@ 2012-10-11 20:01       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 20:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:05:33PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:22:55PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:48AM +0200, Andrea Arcangeli wrote:
> > > In the special "pmd" mode of knuma_scand
> > > (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> > > type (_PAGE_PRESENT not set), however the pte might be
> > > present. Therefore, gup_pmd_range() must return 0 in this case to
> > > avoid losing a NUMA hinting page fault during gup_fast.
> > > 
> > 
> > So if gup_fast fails, presumably we fall back to taking the mmap_sem and
> > calling get_user_pages(). This is a heavier operation and I wonder if the
> > cost is justified. i.e. Is the performance loss from using get_user_pages()
> > offset by improved NUMA placement? I ask because we always incur the cost of
> > taking mmap_sem but only sometimes get it back from improved NUMA placement.
> > How bad would it be if gup_fast lost some of the NUMA hinting information?
> 
> Good question indeed. Now, I agree it wouldn't be bad to skip NUMA
> hinting page faults in gup_fast for no-virt usage like
> O_DIRECT/ptrace, but the only problem is that we'd lose AutoNUMA on
> the memory touched by the KVM vcpus.
> 

Ok I see, that could be in the changelog because it's not immediately
obvious. At least, it's not as obvious as the potential downside (more GUP
fallbacks). In this context there is no way to guess what type of access
it is. AFAIK, there is no way from here to tell if it's KVM calling gup
or if it's due to O_DIRECT.
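
For anyone following along, the check itself is tiny; roughly this
fragment in the x86 gup_pmd_range() loop (condensed from the patch as
I read it):

	if (pmd_numa(pmd))
		/*
		 * Bail out of the fast path so the caller falls back
		 * to get_user_pages(): the slow path takes the NUMA
		 * hinting fault, so the placement information for
		 * e.g. KVM vcpu memory is not lost.
		 */
		return 0;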

> I've also been asked if the vhost-net kernel thread (the KVM in-kernel
> virtio backend) will be controlled by autonuma in between
> use_mm/unuse_mm, and the answer is yes, but to do that it also needs
> this. (See also the flush of task_autonuma_nid and the mm/task
> statistics in unuse_mm to reset it back to regular kernel thread
> status, uncontrolled by autonuma.)

I can understand why it needs this now. The clearing of the statistics is
still not clear to me but I asked that question in the thread that adjusts
unuse_mm already.

> 
> $ git grep get_user_pages
> tcm_vhost.c:            ret = get_user_pages_fast((unsigned long)ptr, 1, write, &page);
> vhost.c:        r = get_user_pages_fast(log, 1, 1, &page);
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 17:15     ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
@ 2012-10-11 20:06       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 20:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:15:20PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:28:27PM +0100, Mel Gorman wrote:
> > s/togehter/together/
> 
> Fixed.
> 
> > 
> > > + * knumad_scan structure.
> > > + */
> > > +struct mm_autonuma {
> > 
> > Nit but this is very similar in principle to mm_slot for transparent
> > huge pages. It might be worth renaming both to mm_thp_slot and
> > mm_autonuma_slot to set the expectation they are very similar in nature.
> > Could potentially be made generic but probably overkill.
> 
> Agreed. A plain rename to mm_autonuma_slot would have the only con of
> making some code spill over 80 col ;).
> 

Fair enough :)

> > > +	/* link for knuma_scand's list of mm structures to scan */
> > > +	struct list_head mm_node;
> > > +	/* Pointer to associated mm structure */
> > > +	struct mm_struct *mm;
> > > +
> > > +	/*
> > > +	 * Zeroed from here during allocation, check
> > > +	 * mm_autonuma_reset() if you alter the below.
> > > +	 */
> > > +
> > > +	/*
> > > +	 * Pass counter for this mm. This exist only to be able to
> > > +	 * tell when it's time to apply the exponential backoff on the
> > > +	 * task_autonuma statistics.
> > > +	 */
> > > +	unsigned long mm_numa_fault_pass;
> > > +	/* Total number of pages that will trigger NUMA faults for this mm */
> > > +	unsigned long mm_numa_fault_tot;
> > > +	/* Number of pages that will trigger NUMA faults for each [nid] */
> > > +	unsigned long mm_numa_fault[0];
> > > +	/* do not add more variables here, the above array size is dynamic */
> > > +};
> > 
> > How cache hot is this structure? nodes are sharing counters in the same
> > cache lines so if updates are frequent this will bounce like a mad yoke.
> > Profiles will tell for sure but it's possible that some sort of per-cpu
> > hilarity will be necessary here in the future.
> 
> On autonuma27 this is only written by knuma_scand so it won't risk
> bouncing.
> 
> On autonuma28 however it's updated by the numa hinting page fault
> locklessly and so your concern is very real, and the cacheline bounces
> will materialize.

It will be related to the knuma_scan thing though so once every 10
seconds, we might see a sudden spike in cache conflicts. Is that
accurate? Something like perf top might detect when this happens but it
can be inferred using perf probe on the fault handler too.

> It'll cause more interconnect traffic before the
> workload converges too. I thought about that, but I wanted the
> mm_autonuma updated in real time as migration happens otherwise it
> converges more slowly if we have to wait until the next pass to bring
> mm_autonuma statistical data in sync with the migration
> activities. Converging more slowly looked worse than paying more
> cacheline bounces.
> 

You could argue that slower converging also means more cross-node
traffic so it costs either way.

> It's a tradeoff. And if it's not a good one, we can go back to
> autonuma27 mm_autonuma stat gathering method and converge slower but
> without any cacheline bouncing in the NUMA hinting page faults. At
> least it's lockless.
> 

Yep.

> > > +	unsigned long task_numa_fault_pass;
> > > +	/* Total number of eligible pages that triggered NUMA faults */
> > > +	unsigned long task_numa_fault_tot;
> > > +	/* Number of pages that triggered NUMA faults for each [nid] */
> > > +	unsigned long task_numa_fault[0];
> > > +	/* do not add more variables here, the above array size is dynamic */
> > > +};
> > > +
> > 
> > Same question about cache hotness.
> 
> Here it's per-thread, so there won't be risk of accesses interleaved
> by different CPUs.
> 

Ok thanks. With that clarification

Acked-by: Mel Gorman <mgorman@suse.de>

While I still have concerns about the cache behaviour of this, the basic
intent of the structure will not change no matter how the problem is
addressed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/33] autonuma: define the autonuma flags
  2012-10-11 17:34     ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
@ 2012-10-11 20:17       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-11 20:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:34:42PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 02:46:43PM +0100, Mel Gorman wrote:
> > Should this be a SCHED_FEATURE flag?
> 
> I guess it could. It is only used by kernel/sched/numa.c which isn't
> even built unless CONFIG_AUTONUMA is set. So it would require a
> CONFIG_AUTONUMA in the sched feature flags unless we want to expose
> no-operational bits. I'm not sure what the preferred way is.
> 

It's fine this way for now. It just felt that it was bolted onto the
side a bit and didn't quite belong there but it could be argued either
way so just leave it alone.

> > Have you ever identified a case where it's a good idea to set that flag?
> 
> It's currently set by default but no, I didn't do enough experiments
> to tell whether it's worth copying or resetting the data.
> 

Ok, if it was something that was going to be regularly used there would
be more justification for SCHED_FEATURE.

> > A child that closely shared data with its parent is not likely to also
> > want to migrate to separate nodes. It just seems unnecessary to have and
> 
> Agreed, this is why the task_selected_nid is always inherited by
> default (that is the CFS autopilot driver).
> 
> The question is whether the full statistics should also be inherited across
> fork/clone or not. I don't know the answer yet and that's why that
> knob exists.
> 

I very strongly suspect the answer is "no".

> If we retain them, autonuma_balance may decide to move the
> task before a full statistics buildup has happened in the child.
> 
> The current way is to reset the data and wait for the data to build
> up in the child, while we keep CFS on autopilot with task_selected_nid
> (which is always inherited). I thought the current one was a good
> tradeoff, but copying all the data isn't a horrible idea either.
> 
> > impossible to suggest to an administrator how the flag might be used.
> 
> Agreed. This in fact is a debug-only flag, it won't ever show up to the admin.
> 
> #ifdef CONFIG_DEBUG_VM
> SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
> SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
> SYSFS_ENTRY(migrate_allow_first_fault,
> 	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
> #endif /* CONFIG_DEBUG_VM */
> 

Good. Nice to have just in case even if I think it'll never be used :)

> > 
> > > +	/*
> > > +	 * If set, this tells knuma_scand to trigger NUMA hinting page
> > > +	 * faults at the pmd level instead of the pte level. This
> > > +	 * reduces the number of NUMA hinting faults potentially
> > > +	 * saving CPU time. It reduces the accuracy of the
> > > +	 * task_autonuma statistics (but does not change the accuracy
> > > +	 * of the mm_autonuma statistics). This flag can be toggled
> > > +	 * through sysfs as runtime.
> > > +	 *
> > > +	 * This flag does not affect AutoNUMA with transparent
> > > +	 * hugepages (THP). With THP the NUMA hinting page faults
> > > +	 * always happen at the pmd level, regardless of the setting
> > > +	 * of this flag. Note: there is no reduction in accuracy of
> > > +	 * task_autonuma statistics with THP.
> > > +	 *
> > > +	 * Default set.
> > > +	 */
> > > +	AUTONUMA_SCAN_PMD_FLAG,
> > 
> > This flag and the other flags make sense. Early on we just are not going
> > to know what the correct choice is. My gut says that ultimately we'll
> 
> Agreed. This is why I left these knobs in, even if I've been asked to
> drop them a few times (they were perceived as adding complexity). But
> for things we're not sure about, these really help to benchmark
> quickly one way or another.
> 

I don't mind them being left in for now. They at least forced me to
consider the cases where they might be required and consider if that is
realistic or not. From that perspective alone it was worth it :)

> scan_pmd is actually not under DEBUG_VM as it looked like a more fundamental thing.
> 
> > default to PMD level *but* fall back to PTE level on a per-task basis if
> > ping-pong migrations are detected. This will catch ping-pongs on data
> > that is not PMD aligned although obviously data that is not page aligned
> > will also suffer. Eventually I think this flag will go away but the
> > behaviour will be;
> > 
> > default, AUTONUMA_SCAN_PMD
> > if ping-pong, fallback to AUTONUMA_SCAN_PTE
> > if ping-ping, AUTONUMA_SCAN_NONE
> 
> That would be ideal, good idea indeed.
> 
> > so there is a graceful degradation if autonuma is doing the wrong thing.
> 
> Makes perfect sense to me if we figure out how to reliably detect when
> to make the switch.
> 

The "reliable" part is the mess. I think it potentially would be possible
to detect it based on the number of times numa_hinting_fault() migrated
pages and decay that at each knuma_scan but that could take too long
to detect with the 10 second delays so there is no obvious good answer.
With some experience on a few different workloads, it might be a bit
more obvious. Right now what you have is good enough and we can just
keep the potential problem in mind so we'll recognise it when we see it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 10/33] autonuma: CPU follows memory algorithm
       [not found]   ` <20121011145805.GW3317@csn.ul.ie>
@ 2012-10-12  0:25     ` Andrea Arcangeli
  2012-10-12  8:29       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > This algorithm takes as input the statistical information filled by the
> > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > (p->task_autonuma), evaluates it for the current scheduled task, and
> > compares it against every other running process to see if it should
> > move the current task to another NUMA node.
> > 
> 
> That sounds expensive if there are a lot of running processes in the
> system. How often does this happen? Mention it here even though I
> realised much later that it's obvious from the patch itself.

Ok I added:

==
This algorithm will run once every ~100msec, and can be easily slowed
down further. Its computational complexity is O(nr_cpus) and it's
executed by all CPUs. The number of running threads and processes is
not going to alter the cost of this algorithm, only the online number
of CPUs is. However, in practice this will very rarely hit all CPUs'
runqueues. Most of the time it will only compute on local data in the
task_autonuma struct (for example if convergence has been
reached). Even if no convergence has been reached yet, it'll only scan
the CPUs in the NUMA nodes where the local task_autonuma data is
showing that they are worth migrating to.
==

It's configurable through sysfs, 100msec is the default.
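
In loop form, the O(nr_cpus) claim corresponds to a structure roughly
like this (a heavily condensed sketch, not the real function;
node_worth_migrating_to() is shorthand for the task_autonuma check
described above):

/* sketch: the shape of __sched_autonuma_balance() */
for_each_online_node(nid) {
	if (nid == this_nid)
		continue;
	/* skip nodes the local task_autonuma stats rule out */
	if (!node_worth_migrating_to(p, nid))
		continue;
	for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
		/*
		 * Compare weights against the task currently running
		 * on cpu's runqueue, remember the best candidate.
		 */
	}
}
/* at most one task exchange with the best candidate, if any */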

> > + * there is no affinity set for the task).
> > + */
> > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > +{
> 
> nit, but elsewhere you have
> 
> static inline TYPE and here you have
> static TYPE inline

Fixed.

> 
> > +	int task_selected_nid;
> > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > +
> > +	if (!task_autonuma)
> > +		return true;
> > +
> > +	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > +	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > +		return true;
> > +	else
> > +		return false;
> > +}
> 
> no need for else.

Removed.

> 
> > +
> > +static inline void sched_autonuma_balance(void)
> > +{
> > +	struct task_autonuma *ta = current->task_autonuma;
> > +
> > +	if (ta && current->mm)
> > +		__sched_autonuma_balance();
> > +}
> > +
> 
> Ok, so this could do with a comment explaining where it is called from.
> It is called during idle balancing at least so potentially this is every
> scheduler tick. It'll be run from softirq context so the cost will not
> be obvious to a process but the overhead will be there. What happens if
> this takes longer than a scheduler tick to run? Is that possible?

Softirqs can run for a huge amount of time so it won't do any harm.

Nested IRQs could even run on top of the softirq, and they could take
milliseconds too if they're hyper inefficient and we must still run
perfectly rock solid (with horrible latency, but still stable).

I added:

/*
 * This is called in the context of the SCHED_SOFTIRQ from
 * run_rebalance_domains().
 */

> > +/*
> > + * This function __sched_autonuma_balance() is responsible for
> 
> This function is far too shot and could do with another few pages :P

:) I tried to split it once already but gave up in the middle.

> > + * "Full convergence" is achieved when all memory accesses by a task
> > + * are 100% local to the CPU it is running on. A task's "best node" is
> 
> I think this is the first time you defined convergence in the series.
> The explanation should be included in the documentation.

Ok. It's not an easy concept to explain in words. Here's a try:

 *
 * A workload converges when all the memory of a thread or a process
 * has been placed in the NUMA node of the CPU where the process or
 * thread is running on.
 *

> > + * other_diff: how much the current task is closer to fully converge
> > + * on the node of the other CPU than the other task that is currently
> > + * running in the other CPU.
> 
> In the changelog you talked about comparing a process with every other
> running process but here it looks like you intent to examine every
> process that is *currently running* on a remote node and compare that.
> What if the best process to swap with is not currently running? Do we
> miss it?

Correct, only currently running processes are being checked. If a task
in R state goes to sleep immediately, it's not relevant where it
runs. We focus on "long running" compute tasks, so tasks that are in R
state most frequently.

> > + * If both checks succeed it guarantees that we found a way to
> > + * multilaterally improve the system wide NUMA
> > + * convergence. Multilateral here means that the same checks will not
> > + * succeed again on those same two tasks, after the task exchange, so
> > + * there is no risk of ping-pong.
> > + *
> 
> At least not in that instance of time. A new CPU binding or change in
> behaviour (such as a computation finishing and a reduce step starting)
> might change that scoring.

Yes.
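
To condense the two checks into pseudo-C for one candidate cpu (a
sketch following the documentation's definitions; weight_on() is just
shorthand for the task or mm fault ratio on a node):

/* sketch: evaluating one remote cpu against the current task */
this_diff  = weight_on(p, nid) - weight_on(p, this_nid);
other_diff = weight_on(p, nid) - weight_on(other_task, nid);

if (this_diff > 0 && other_diff > 0) {
	long total_weight_diff = this_diff + other_diff;
	/* keep the (cpu, other_task) pair with the highest total */
	if (total_weight_diff > weight_diff_max) {
		weight_diff_max = total_weight_diff;
		selected_cpu = cpu;
	}
}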

> > + * If a task exchange can happen because the two checks succeed, we
> > + * select the destination CPU that will give us the biggest increase
> > + * in system wide convergence (i.e. biggest "weight", in the above
> > + * quoted code).
> > + *
> 
> So there is a bit of luck that the best task to exchange is currently
> running. How bad is that? It depends really on the number of tasks
> running on that node and the priority. There is a chance that it doesn't
> matter as such because if all the wrong tasks are currently running then
> no exchange will take place - it was just wasted CPU. It does imply that
> AutoNUMA works best of CPUs are not over-subscribed with processes. Is
> that fair?

It seems to work fine with overcommit as well. specjbb x2 is
converging fine, as well as numa01 in parallel with numa02. It's
actually pretty cool to watch.

Try to run this:

while :; do ./nmstat -n numa; sleep 1; done

nmstat is a binary in autonuma benchmark.

Then run:

time (./numa01 & ./numa02 & wait)

The thing is, we work together with CFS, CFS in autopilot works fine,
we only need to correct the occasional error.

It works the same as the active idle balancing, that corrects the
occasional error for HT cores left idle, then CFS takes over.

> Again, I have no suggestions at all on how this might be improved and
> these comments are hand-waving towards where we *might* see problems in
> the future. If problems are actually identified in practice for
> workloads then autonuma can be turned off until the relevant problem
> area is fixed.

Exactly, it's enough to run:

echo 0 >/sys/kernel/mm/autonuma/enabled

If you want to get rid of the 2 bytes per page too, passing
"noautonuma" at boot will do it (but then /sys/kernel/mm/autonuma
disappears and you can't enable it anymore).

Plus if there's any issue with the cost of sched_autonuma_balance it's
more than enough to run "perf top" to find out.

> I would fully expect that there are parallel workloads that work on
> differenet portions of a large set of data and it would be perfectly
> reasonable for threads using the same address space to converge on
> different nodes.

Agreed. Even if they can't converge fully they could have stats like
70/30, 30/70, with 30 being numa-false-shared and we'll schedule them
right, so running faster than upstream. That 30% will also tend to
slowly distribute better over time.

> I would hope we manage to figure out a way to examine fewer processes,
> not more :)

8)))

> > +void __sched_autonuma_balance(void)
> > +{
> > +	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
> > +	int this_nid = numa_node_id();
> > +	int this_cpu = smp_processor_id();
> > +	unsigned long task_fault, task_tot, mm_fault, mm_tot;
> > +	unsigned long task_max, mm_max;
> > +	unsigned long weight_diff_max;
> > +	long uninitialized_var(s_w_nid);
> > +	long uninitialized_var(s_w_this_nid);
> > +	long uninitialized_var(s_w_other);
> > +	bool uninitialized_var(s_w_type_thread);
> > +	struct cpumask *allowed;
> > +	struct task_struct *p = current, *other_task;
> 
> So the task in question is current but this is called by the idle
> balancer. I'm missing something obvious here but it's not clear to me why
> that process is necessarily relevant. What guarantee is there that all
> tasks will eventually run this code? Maybe it doesn't matter because the
> most CPU intensive tasks are also the most likely to end up in here but
> a clarification would be nice.

Exactly. We only focus on tasks that are significantly computing. If a
task runs for just 1msec we can't possibly care where it runs and where
its memory is. If it keeps running (even in 1msec bursts), over time
that task too will be migrated to the right place.

> > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > +	struct mm_autonuma *mm_autonuma;
> > +	struct rq *rq;
> > +
> > +	/* per-cpu statically allocated in runqueues */
> > +	long *task_numa_weight;
> > +	long *mm_numa_weight;
> > +
> > +	if (!task_autonuma || !p->mm)
> > +		return;
> > +
> > +	if (!autonuma_enabled()) {
> > +		if (task_autonuma->task_selected_nid != -1)
> > +			task_autonuma->task_selected_nid = -1;
> > +		return;
> > +	}
> > +
> > +	allowed = tsk_cpus_allowed(p);
> > +	mm_autonuma = p->mm->mm_autonuma;
> > +
> > +	/*
> > +	 * If the task has no NUMA hinting page faults or if the mm
> > +	 * hasn't been fully scanned by knuma_scand yet, set task
> > +	 * selected nid to the current nid, to avoid the task bounce
> > +	 * around randomly.
> > +	 */
> > +	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
> 
> Why ACCESS_ONCE?

mm variables are altered by other threads too. Only task_autonuma is
local to this task and cannot change from under us.

I did it all lockless, I don't care if we're off once in a while.

> > +	if (!mm_tot) {
> > +		if (task_autonuma->task_selected_nid != this_nid)
> > +			task_autonuma->task_selected_nid = this_nid;
> > +		return;
> > +	}
> > +	task_tot = task_autonuma->task_numa_fault_tot;
> > +	if (!task_tot) {
> > +		if (task_autonuma->task_selected_nid != this_nid)
> > +			task_autonuma->task_selected_nid = this_nid;
> > +		return;
> > +	}
> > +
> > +	rq = cpu_rq(this_cpu);
> > +
> > +	/*
> > +	 * Verify that we can migrate the current task, otherwise try
> > +	 * again later.
> > +	 */
> > +	if (ACCESS_ONCE(rq->autonuma_balance))
> > +		return;
> > +
> > +	/*
> > +	 * The following two arrays will hold the NUMA affinity weight
> > +	 * information for the current process if scheduled on the
> > +	 * given NUMA node.
> > +	 *
> > +	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
> > +	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
> > +	 */
> > +	task_numa_weight = rq->task_numa_weight;
> > +	mm_numa_weight = rq->mm_numa_weight;
> > +
> > +	/*
> > +	 * Identify the NUMA node where this thread (task_struct), and
> > +	 * the process (mm_struct) as a whole, has the largest number
> > +	 * of NUMA faults.
> > +	 */
> > +	task_max = mm_max = 0;
> > +	selected_nid = mm_selected_nid = -1;
> > +	for_each_online_node(nid) {
> > +		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
> > +		task_fault = task_autonuma->task_numa_fault[nid];
> > +		if (mm_fault > mm_tot)
> > +			/* could be removed with a seqlock */
> > +			mm_tot = mm_fault;
> > +		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
> > +		if (task_fault > task_tot) {
> > +			task_tot = task_fault;
> > +			WARN_ON(1);
> > +		}
> > +		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
> > +		if (mm_numa_weight[nid] > mm_max) {
> > +			mm_max = mm_numa_weight[nid];
> > +			mm_selected_nid = nid;
> > +		}
> > +		if (task_numa_weight[nid] > task_max) {
> > +			task_max = task_numa_weight[nid];
> > +			selected_nid = nid;
> > +		}
> > +	}
> 
> Ok, so this is a big walk to take every time and as this happens every
> scheduler tick, it seems unlikely that the workload would be changing
> phases that often in terms of NUMA behaviour. Would it be possible for
> this to be sampled less frequently and cache the result?

Even if there are 8 nodes, this is fairly quick and only requires 2
cachelines. At 16 nodes we're at 4 cachelines. The cacheline of
task_autonuma is fully local. The one of mm_autonuma can be shared
(modulo numa hinting page faults with autonuma28, in autonuma27 it was
also sharable even despite numa hinting page faults).
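
For illustration only, the per-node weight the quoted loop computes is
just the task's (or mm's) fault ratio on that node scaled to an integer;
a standalone sketch, with the scale value of 1000 assumed here purely for
the example:

	/*
	 * Sketch of the per-node normalization: weight = share of NUMA
	 * hinting faults taken on that node, scaled to an integer.
	 * EXAMPLE_SCALE is an assumed value, not the one in the patch.
	 */
	#define EXAMPLE_SCALE 1000UL

	static unsigned long example_numa_weight(unsigned long fault,
						 unsigned long tot)
	{
		if (!tot)
			return 0;
		/* e.g. 300 faults out of 400 on this node -> weight 750 */
		return fault * EXAMPLE_SCALE / tot;
	}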

> > +			/*
> > +			 * Grab the fault/tot of the processes running
> > +			 * in the other CPUs to compute w_other.
> > +			 */
> > +			raw_spin_lock_irq(&rq->lock);
> > +			_other_task = rq->curr;
> > +			/* recheck after implicit barrier() */
> > +			mm = _other_task->mm;
> > +			if (!mm) {
> > +				raw_spin_unlock_irq(&rq->lock);
> > +				continue;
> > +			}
> > +
> 
> Is it really critical to pin those values using the lock? That seems *really*
> heavy. If the results have to be exactly stable then is there any chance
> the values could be encoded in the high and low bits of a single unsigned
> long and read without the lock?  Updates would be more expensive but that's
> in a trap anyway. This on the other hand is a scheduler path.

The reason for the lock is to prevent rq->curr, mm etc. from being
freed from under us.

> > +			/*
> > +			 * Check if the _other_task is allowed to be
> > +			 * migrated to this_cpu.
> > +			 */
> > +			if (!cpumask_test_cpu(this_cpu,
> > +					      tsk_cpus_allowed(_other_task))) {
> > +				raw_spin_unlock_irq(&rq->lock);
> > +				continue;
> > +			}
> > +
> 
> Would it not make sense to check this *before* we take the lock and
> grab all its counters? It probably will not make much of a difference in
> practice as I expect it's rare that the target CPU is running a task
> that can't migrate but it still feels the wrong way around.

It's a micro optimization to do it here. It's very rare that the above
check fails, while tot may be zero much more frequently (e.g. if the
task has just been started).

> > +	if (selected_cpu != this_cpu) {
> > +		if (autonuma_debug()) {
> > +			char *w_type_str;
> > +			w_type_str = s_w_type_thread ? "thread" : "process";
> > +			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
> > +			       p->mm, p->pid, this_nid, selected_nid,
> > +			       this_cpu, selected_cpu,
> > +			       s_w_other, s_w_nid, s_w_this_nid,
> > +			       w_type_str);
> > +		}
> 
> Can these be made tracepoints and get rid of the autonuma_debug() check?
> I recognise there is a risk that some tool might grow to depend on
> implementation details but in this case it seems very unlikely.

The debug mode also provides me with a racy dump of all mms; I
wouldn't know how to do that with tracing.

So I wouldn't remove the printk until we can replace everything with
tracing, but I'd welcome adding a tracepoint too. There are already
other proper tracepoints driving "perf script numatop".

> Ok, so I confess I did not work out if the weights and calculations really
> make sense or not but at a glance they seem reasonable and I spotted no
> obvious flaws. The function is pretty heavy though and may be doing more
> work around locking than is really necessary. That said, there will be
> workloads where the cost is justified and offset by the performance gains
> from improved NUMA locality. I just don't expect it to be a universal win so
> we'll need to keep an eye on the system CPU usage and incrementally optimise
> where possible. I suspect there will be a time when an incremental
> optimisation just does not cut it any more but by then I would also
> expect there will be more data on how autonuma behaves in practice and a
> new algorithm might be more obvious at that point.

Agreed. Chances are I can replace all this already with RCU and a
rcu_dereference or ACCESS_ONCE to grab the rq->curr->task_autonuma and
rq->curr->mm->mm_autonuma data. I didn't try yet. The task struct
shouldn't go away from under us after rcu_read_lock, the mm may be more
tricky, I haven't checked this yet. Optimizations welcome ;)
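
Purely as a sketch of that direction (not a patch, and the mm lifetime
question is deliberately left open here), the lockless read could look
roughly like this, with example_read_other_stats() being a made-up
helper:

	/*
	 * Rough sketch only: read the other CPU's currently running task
	 * NUMA stats without taking rq->lock, relying on RCU to keep the
	 * task_struct alive.  Keeping the mm alive is the part that still
	 * needs checking.
	 */
	static bool example_read_other_stats(struct rq *rq, int nid,
					     unsigned long *fault,
					     unsigned long *tot)
	{
		struct task_struct *other;
		struct task_autonuma *ta;
		bool ret = false;

		rcu_read_lock();
		other = ACCESS_ONCE(rq->curr);
		ta = ACCESS_ONCE(other->task_autonuma);
		if (ta && other->mm) {
			*fault = ACCESS_ONCE(ta->task_numa_fault[nid]);
			*tot = ACCESS_ONCE(ta->task_numa_fault_tot);
			ret = true;
		}
		rcu_read_unlock();
		return ret;
	}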

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 15:35     ` Mel Gorman
@ 2012-10-12  0:41       ` Andrea Arcangeli
  2012-10-12 14:54       ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> If System CPU time really does go down as this converges then that
> should be obvious from monitoring vmstat over time for a test. Early on
> - high usage with that dropping as it converges. If that doesn't happen
>   then the tasks are not converging, the phases change constantly or
> something unexpected happened that needs to be identified.

Yes, all measurable kernel cost should be in the memory copies
(migration and khugepaged, the latter is going to be optimized away).

The migrations must stop after the workload converges. Either
migrations are used to reach convergence or they shouldn't happen in
the first place (not in any measurable amount).

> Ok. Are they separate STREAM instances or threads running on the same
> arrays? 

My understanding is separate instances. I think it's a single threaded
benchmark and you run many copies. It was modified to run for 5min
(otherwise upstream doesn't have enough time to get it wrong, as a
result of background scheduling jitter).

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
       [not found]       ` <0000013a525a8739-2b4049fa-1cb3-4b8f-b3a7-1fa77b181590-000000@email.amazonses.com>
@ 2012-10-12  0:52         ` Andrea Arcangeli
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney

Hi Christoph,

On Fri, Oct 12, 2012 at 12:23:17AM +0000, Christoph Lameter wrote:
> On Thu, 11 Oct 2012, Rik van Riel wrote:
> 
> > These statistics are updated at page fault time, I
> > believe while holding the page table lock.
> >
> > In other words, they are in code paths where updating
> > the stats should not cause issues.
> 
> The per cpu counters in the VM were introduced because of
> counter contention caused at page fault time. This is the same code path
> where you think that there cannot be contention.

There's no contention at all in autonuma27.

I changed it in autonuma28, to get real time updates in mm_autonuma
from migration events.

There is no lock taken though (the spinlock below is taken once every
pass, very rarely). It's a few-line change, shown in detail below. The
only contention point is this:

+	ACCESS_ONCE(mm_numa_fault[access_nid]) += numpages;
+	ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot) += numpages;

autonuma28 is much more experimental than autonuma27 :)

I wouldn't focus on >1024 CPU systems for this though. The bigger the
system, the more costly any automatic placement logic will become, no
matter which algorithm is used and what computational complexity it
has, and chances are those systems will use NUMA hard bindings anyway,
considering how expensive they are to set up and maintain.

The diff looks like this; I can consider undoing it. Comments welcome.
(But the real time stats updates make autonuma28 converge faster.)

--- a/mm/autonuma.c
+++ b/mm/autonuma.c
 
 static struct knuma_scand_data {
 	struct list_head mm_head; /* entry: mm->mm_autonuma->mm_node */
 	struct mm_struct *mm;
 	unsigned long address;
-	unsigned long *mm_numa_fault_tmp;
 } knuma_scand_data = {
 	.mm_head = LIST_HEAD_INIT(knuma_scand_data.mm_head),
 };






+	unsigned long tot;
+
+	/*
+	 * Set the task's fault_pass equal to the new
+	 * mm's fault_pass, so new_pass will be false
+	 * on the next fault by this thread in this
+	 * same pass.
+	 */
+	p->task_autonuma->task_numa_fault_pass = mm_numa_fault_pass;
+
 	/* If a new pass started, degrade the stats by a factor of 2 */
 	for_each_node(nid)
 		task_numa_fault[nid] >>= 1;
 	task_autonuma->task_numa_fault_tot >>= 1;
+
+	if (mm_numa_fault_pass ==
+	    ACCESS_ONCE(mm_autonuma->mm_numa_fault_last_pass))
+		return;
+
+	spin_lock(&mm_autonuma->mm_numa_fault_lock);
+	if (unlikely(mm_numa_fault_pass ==
+		     mm_autonuma->mm_numa_fault_last_pass)) {
+		spin_unlock(&mm_autonuma->mm_numa_fault_lock);
+		return;
+	}
+	mm_autonuma->mm_numa_fault_last_pass = mm_numa_fault_pass;
+
+	tot = 0;
+	for_each_node(nid) {
+		unsigned long fault = ACCESS_ONCE(mm_numa_fault[nid]);
+		fault >>= 1;
+		ACCESS_ONCE(mm_numa_fault[nid]) = fault;
+		tot += fault;
+	}
+	mm_autonuma->mm_numa_fault_tot = tot;
+	spin_unlock(&mm_autonuma->mm_numa_fault_lock);
 }






 	task_numa_fault[access_nid] += numpages;
 	task_autonuma->task_numa_fault_tot += numpages;
 
+	ACCESS_ONCE(mm_numa_fault[access_nid]) += numpages;
+	ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot) += numpages;
+
 	local_bh_enable();
 }
 
@@ -310,28 +355,35 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
@@ -593,35 +628,26 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		goto out;
 
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		int page_nid;
-		unsigned long *fault_tmp;
 		ret = HPAGE_PMD_NR;
 
 		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-		if (autonuma_mm_working_set() && pmd_numa(*pmd)) {
+		if (pmd_numa(*pmd)) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
 		page = pmd_page(*pmd);
-
 		/* only check non-shared pages */
 		if (page_mapcount(page) != 1) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
-		page_nid = page_to_nid(page);
-		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-		fault_tmp[page_nid] += ret;
-
 		if (pmd_numa(*pmd)) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
 		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+
 		/* defer TLB flush to lower the overhead */
 		spin_unlock(&mm->page_table_lock);
 		goto out;
@@ -636,10 +662,9 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 	for (_address = address, _pte = pte; _address < end;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
-		unsigned long *fault_tmp;
 		if (!pte_present(pteval))
 			continue;
-		if (autonuma_mm_working_set() && pte_numa(pteval))
+		if (pte_numa(pteval))
 			continue;
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
@@ -647,13 +672,8 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		/* only check non-shared pages */
 		if (page_mapcount(page) != 1)
 			continue;
-
-		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-		fault_tmp[page_to_nid(page)]++;
-
 		if (pte_numa(pteval))
 			continue;
-
 		if (!autonuma_scan_pmd())
 			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
 
@@ -677,56 +697,6 @@ out:
 	return ret;
 }
 
-static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
-{
-	int nid;
-	struct mm_autonuma *mma = mm->mm_autonuma;
-	unsigned long tot;
-	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-
-	if (autonuma_mm_working_set()) {
-		for_each_node(nid) {
-			tot = fault_tmp[nid];
-			if (tot)
-				break;
-		}
-		if (!tot)
-			/* process was idle, keep the old data */
-			return;
-	}
-
-	/* FIXME: would be better protected with write_seqlock_bh() */
-	local_bh_disable();
-
-	tot = 0;
-	for_each_node(nid) {
-		unsigned long faults = fault_tmp[nid];
-		fault_tmp[nid] = 0;
-		mma->mm_numa_fault[nid] = faults;
-		tot += faults;
-	}
-	mma->mm_numa_fault_tot = tot;
-
-	local_bh_enable();
-}
-
-static void mm_numa_fault_tmp_reset(void)
-{
-	memset(knuma_scand_data.mm_numa_fault_tmp, 0,
-	       mm_autonuma_fault_size());
-}
-
-static inline void validate_mm_numa_fault_tmp(unsigned long address)
-{
-#ifdef CONFIG_DEBUG_VM
-	int nid;
-	if (address)
-		return;
-	for_each_node(nid)
-		BUG_ON(knuma_scand_data.mm_numa_fault_tmp[nid]);
-#endif
-}
-
 /*
  * Scan the next part of the mm. Keep track of the progress made and
  * return it.
@@ -758,8 +728,6 @@ static int knumad_do_scan(void)
 	}
 	address = knuma_scand_data.address;
 
-	validate_mm_numa_fault_tmp(address);
-
 	mutex_unlock(&knumad_mm_mutex);
 
 	down_read(&mm->mmap_sem);
@@ -855,9 +824,7 @@ static int knumad_do_scan(void)
 			/* tell autonuma_exit not to list_del */
 			VM_BUG_ON(mm->mm_autonuma->mm != mm);
 			mm->mm_autonuma->mm = NULL;
-			mm_numa_fault_tmp_reset();
-		} else
-			mm_numa_fault_tmp_flush(mm);
+		}
 
 		mmdrop(mm);
 	}
@@ -942,7 +916,6 @@ static int knuma_scand(void *none)
 
 	if (mm)
 		mmdrop(mm);
-	mm_numa_fault_tmp_reset();
 
 	return 0;
 }
@@ -987,11 +960,6 @@ static int start_knuma_scand(void)
 	int err = 0;
 	struct task_struct *knumad_thread;
 
-	knuma_scand_data.mm_numa_fault_tmp = kzalloc(mm_autonuma_fault_size(),
-						     GFP_KERNEL);
-	if (!knuma_scand_data.mm_numa_fault_tmp)
-		return -ENOMEM;
-
 	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
 	if (unlikely(IS_ERR(knumad_thread))) {
 		autonuma_printk(KERN_ERR

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
       [not found] ` <20121011213432.GQ3317@csn.ul.ie>
@ 2012-10-12  1:45   ` Andrea Arcangeli
  2012-10-12  8:46     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  1:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi Mel,

On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> So after getting through the full review of it, there wasn't anything
> I could not stand. I think it's *very* heavy on some of the paths like
> the idle balancer which I was not keen on and the fault paths are also
> quite heavy.  I think the weight on some of these paths can be reduced
> but not to 0 if the objectives to autonuma are to be met.
> 
> I'm not fully convinced that the task exchange is actually necessary or
> beneficial because it somewhat assumes that there is a symmetry between CPU
> and memory balancing that may not be true. The fact that it only considers

The problem is that without an active task exchange and without an
explicit call to stop_one_cpu*, there's no way to migrate a currently
running task, and clearly we need that. We could wait indefinitely,
hoping the task goes to sleep and leaves the CPU idle, or that a couple
of other tasks start and trigger load balance events.

We must move tasks even if all cpus are in a steady rq->nr_running ==
1 state and there's no other scheduler balance event that could
possibly attempt to move tasks around in such a steady state.

Of course one could hack the active idle balancing so that it does the
active NUMA balancing action, but that would be a purely artificial
complication: it would add unnecessary delay and it would provide no
benefit whatsoever.

Why don't we dump the active idle balancing too, and we hack the load
balancing to do the active idle balancing as well? Of course then the
two will be more integrated. But it'll be a mess and slower and
there's a good reason why they exist as totally separated pieces of
code working in parallel.

We can integrate it more, but in my view the result would be worse and
more complicated. Last but not the least messing the idle balancing
code to do an active NUMA balancing action (somehow invoking
stop_one_cpu* in the steady state described above) would force even
cellphones and UP kernels to deal with NUMA code somehow.
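
To make the "active" part concrete, the pattern is roughly the one of
active load balancing; a hand-waving sketch, where the
autonuma_balance_dst_cpu/autonuma_balance_work fields and the callback
name are made up for illustration (only rq->autonuma_balance exists in
the patch):

	/*
	 * Sketch only: kick the stopper thread on rq's CPU so a currently
	 * running task can be exchanged even when every runqueue sits at
	 * nr_running == 1 and no load balance event would ever fire.
	 */
	static void example_kick_numa_exchange(struct rq *rq, int dst_cpu)
	{
		if (!rq->autonuma_balance) {
			rq->autonuma_balance = true;
			rq->autonuma_balance_dst_cpu = dst_cpu;
			/* preempts rq->curr and runs the exchange callback */
			stop_one_cpu_nowait(cpu_of(rq),
					    example_numa_exchange_cpu_stop,
					    rq, &rq->autonuma_balance_work);
		}
	}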

> tasks that are currently running feels a bit random but examining all tasks
> that recently ran on the node would be far too expensive so there is no

So far this seems a good tradeoff. Nothing prevents us from scanning
deeper into the runqueues later if we find a way to do that efficiently.

> good answer. You are caught between a rock and a hard place and either
> direction you go is wrong for different reasons. You need something more

I think you described the problem perfectly ;).

> frequent than scans (because it'll converge too slowly) but doing it from
> the balancer misses some tasks and may run too frequently and it's unclear
> how it affects the current load balancer decisions. I don't have a good
> alternative solution for this but ideally it would be better integrated with
> the existing scheduler when there is more data on what those scheduling
> decisions should be. That will only come from a wide range of testing and
> the inevitable bug reports.
> 
> That said, this is concentrating on the problems without considering the
> situations where it would work very well.  I think it'll come down to HPC
> and anything jitter-sensitive will hate this while workloads like JVM,
> virtualisation or anything that uses a lot of memory without caring about
> placement will love it. It's not perfect but it's better than incurring
> the cost of remote access unconditionally.

Full agreement.

Your detailed full review was very appreciated, thanks!

Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 10/33] autonuma: CPU follows memory algorithm
  2012-10-12  0:25     ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
@ 2012-10-12  8:29       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-12  8:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Fri, Oct 12, 2012 at 02:25:13AM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > > This algorithm takes as input the statistical information filled by the
> > > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > > (p->task_autonuma), evaluates it for the current scheduled task, and
> > > compares it against every other running process to see if it should
> > > move the current task to another NUMA node.
> > > 
> > 
> > That sounds expensive if there are a lot of running processes in the
> > system. How often does this happen? Mention it here even though I
> > realised much later that it's obvious from the patch itself.
> 
> Ok I added:
> 
> ==
> This algorithm will run once every ~100msec,

~100msec (depending on the scheduler tick)

> and can be easily slowed
> down further

using the sysfs tunable ....

>. Its computational complexity is O(nr_cpus) and it's
> executed by all CPUs. The number of running threads and processes is
> not going to alter the cost of this algorithm, only the online number
> of CPUs is. However, in practice this will very rarely hit all CPUs'
> runqueues. Most of the time it will only compute on local data in the
> task_autonuma struct (for example if convergence has been
> reached). Even if no convergence has been reached yet, it'll only scan
> the CPUs in the NUMA nodes where the local task_autonuma data is
> showing that they are worth migrating to.

Ok, this explains how things currently are, which is better.

> ==
> 
> It's configurable through sysfs, 100msec is the default.
> 
> > > + * there is no affinity set for the task).
> > > + */
> > > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > > +{
> > 
> > nit, but elsewhere you have
> > 
> > static inline TYPE and here you have
> > static TYPE inline
> 
> Fixed.
> 
> > 
> > > +	int task_selected_nid;
> > > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > > +
> > > +	if (!task_autonuma)
> > > +		return true;
> > > +
> > > +	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > > +	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > > +		return true;
> > > +	else
> > > +		return false;
> > > +}
> > 
> > no need for else.
> 
> Removed.
> 
> > 
> > > +
> > > +static inline void sched_autonuma_balance(void)
> > > +{
> > > +	struct task_autonuma *ta = current->task_autonuma;
> > > +
> > > +	if (ta && current->mm)
> > > +		__sched_autonuma_balance();
> > > +}
> > > +
> > 
> > Ok, so this could do with a comment explaining where it is called from.
> > It is called during idle balancing at least so potentially this is every
> > scheduler tick. It'll be run from softirq context so the cost will not
> > be obvious to a process but the overhead will be there. What happens if
> > this takes longer than a scheduler tick to run? Is that possible?
> 
> softirqs can run for huge amount of time so it won't harm.
> 

They're allowed, but it's not free. It's not a stopper but eventually
we'll want to get away from it.

> Nested IRQs could even run on top of the softirq, and they could take
> milliseconds too if they're hyper inefficient and we must still run
> perfectly rock solid (with horrible latency, but still stable).
> 
> I added:
> 
> /*
>  * This is called in the context of the SCHED_SOFTIRQ from
>  * run_rebalance_domains().
>  */
> 

Ok. A vague idea occurred to me while mulling this over that would avoid the
walk. I did not flesh this out at all so there will be major inaccuracies
but hopefully you'll get the general idea.

The scheduler already caches some information about domains, such as
sd_llc, which stores on a per-cpu basis a pointer to the highest domain
sharing the same last level cache.

It should be possible to cache, on a per-NUMA node domain basis, the
highest mm_numafault and task_mmfault and the PID within that domain
in sd_numa_mostconverged, with one entry per NUMA node. At a scheduling
tick, the current task does the for_each_online_node(), calculates its
values, compares them to sd_numa_mostconverged and updates the cache if
necessary.

With a view to integrating this with CFS better, this update should
happen in kernel/sched/fair.c in a function called
update_convergence_stats(), or possibly even be integrated within one
of the existing CPU walkers like nohz_idle_balance or maybe in
idle_balance itself, and moved out of kernel/sched/numa.c. It shouldn't
migrate tasks at this point, which reduces the overhead in the idle
balancer.

This should integrate the whole of the following block into CFS.

        /*
         * Identify the NUMA node where this thread (task_struct), and
         * the process (mm_struct) as a whole, has the largest number
         * of NUMA faults.
         */

It then later considers doing the task exchange but only the
sd_numa_mostconverged values for each node are considered.
This gets rid of the
for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) loop with the obvious
caveat that there is no guarantee that the cached PID is eligible for exchange
but I expect that's rare (it could be mitigated by never caching pids
that are bound to a single node for example). This would make this block

        /*
         * Check the other NUMA nodes to see if there is a task we
         * should exchange places with.
         */

O(num_online_nodes()) instead of O(num_online_cpus()) and reduce
the cost of that path. It will converge slower, but only slightly slower,
as you only ever consider one task per node anyway after deciding which
one is the best.

Again, in the interest of integrating with CFS further, this whole block
should then move to kernel/sched/fair.c, possibly within load_balance(),
so they are working closer together.

That just leaves the task exchange part which can remain separate and
just called from load_balance() when autonuma is in use.

This is not a patch obviously, but I think it's important to have some
sort of integration path with CFS in mind.
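
A very rough sketch of what such a cache could look like, with all the
names below invented purely to illustrate the idea:

	/*
	 * Hand-waving sketch: one "most converged candidate" entry per NUMA
	 * node, hanging off the NUMA scheduling domain, refreshed at the
	 * scheduler tick by whatever task scores better than the cached one.
	 */
	struct sd_numa_converge_entry {
		pid_t	pid;		/* best candidate seen so far */
		long	task_weight;	/* its task NUMA weight on that node */
		long	mm_weight;	/* its mm NUMA weight on that node */
	};

	struct sd_numa_converge_cache {
		struct sd_numa_converge_entry node[MAX_NUMNODES];
	};

	static void example_update_converge_cache(struct sd_numa_converge_cache *c,
						  int nid, pid_t pid,
						  long task_w, long mm_w)
	{
		struct sd_numa_converge_entry *e = &c->node[nid];

		if (task_w > e->task_weight) {
			e->pid = pid;
			e->task_weight = task_w;
			e->mm_weight = mm_w;
		}
	}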

> > > +/*
> > > + * This function __sched_autonuma_balance() is responsible for
> > 
> > This function is far too short and could do with another few pages :P
> 
> :) I tried to split it once already but gave up in the middle.
> 

FWIW, the blocks are at least clear and it was easier to follow than I
expected.

> > > + * "Full convergence" is achieved when all memory accesses by a task
> > > + * are 100% local to the CPU it is running on. A task's "best node" is
> > 
> > I think this is the first time you defined convergence in the series.
> > The explanation should be included in the documentation.
> 
> Ok. It's not too easy a concept to explain in words. Here's a try:
> 
>  *
>  * A workload converges when all the memory of a thread or a process
>  * has been placed in the NUMA node of the CPU where the process or
>  * thread is running on.
>  *
> 

Sounds right to me.

> > > + * other_diff: how much the current task is closer to fully converge
> > > + * on the node of the other CPU than the other task that is currently
> > > + * running in the other CPU.
> > 
> > In the changelog you talked about comparing a process with every other
> > running process but here it looks like you intent to examine every
> > process that is *currently running* on a remote node and compare that.
> > What if the best process to swap with is not currently running? Do we
> > miss it?
> 
> Correct, only currently running processes are being checked. If a task
> in R state goes to sleep immediately, it's not relevant where it
> runs. We focus on "long running" compute tasks, so tasks that are in R
> state most frequently.
> 

Ok, so it can still miss some things but we're trying to reduce the
overhead, not increase it. If the best and worst PIDs were cached as I
described above, they could be updated either during idle balancing
(and potentially miss tasks, like this does) or, if high granularity
was ever required, on every reschedule. It's one call to a relatively
light function. I don't think it's necessary to have this fine
granularity though.

> > > + * If both checks succeed it guarantees that we found a way to
> > > + * multilaterally improve the system wide NUMA
> > > + * convergence. Multilateral here means that the same checks will not
> > > + * succeed again on those same two tasks, after the task exchange, so
> > > + * there is no risk of ping-pong.
> > > + *
> > 
> > At least not in that instance of time. A new CPU binding or change in
> > behaviour (such as a computation finishing and a reduce step starting)
> > might change that scoring.
> 
> Yes.
> 
> > > + * If a task exchange can happen because the two checks succeed, we
> > > + * select the destination CPU that will give us the biggest increase
> > > + * in system wide convergence (i.e. biggest "weight", in the above
> > > + * quoted code).
> > > + *
> > 
> > So there is a bit of luck that the best task to exchange is currently
> > running. How bad is that? It depends really on the number of tasks
> > running on that node and the priority. There is a chance that it doesn't
> > matter as such because if all the wrong tasks are currently running then
> > no exchange will take place - it was just wasted CPU. It does imply that
> > AutoNUMA works best if CPUs are not over-subscribed with processes. Is
> > that fair?
> 
> It seems to work fine with overcommit as well. specjbb x2 is
> converging fine, as well as numa01 in parallel with numa02. It's
> actually pretty cool to watch.
> 
> Try to run this:
> 
> while :; do ./nmstat -n numa; sleep 1; done
> 
> nmstat is a binary in autonuma benchmark.
> 
> Then run:
> 
> time (./numa01 & ./numa02 & wait)
> 
> The thing is, we work together with CFS; CFS on autopilot works fine,
> we only need to correct the occasional error.
> 
> It works the same way as active idle balancing, which corrects the
> occasional error of HT cores being left idle, then CFS takes over.
> 

Ok.

> > Again, I have no suggestions at all on how this might be improved and
> > these comments are hand-waving towards where we *might* see problems in
> > the future. If problems are actually identified in practice for
> > workloads then autonuma can be turned off until the relevant problem
> > area is fixed.
> 
> Exactly, it's enough to run:
> 
> echo 0 >/sys/kernel/mm/autonuma/enabled
> 
> If you want to get rid of the 2 bytes per page too, passing
> "noautonuma" at boot will do it (but then /sys/kernel/mm/autonuma
> disappears and you can't enable it anymore).
> 
> Plus if there's any issue with the cost of sched_autonuma_balance it's
> more than enough to run "perf top" to find out.
> 

Yep. I'm just trying to anticipate what the problems might be so when/if
I see a problem profile I'll have a rough idea what it might be due to.

> > I would fully expect that there are parallel workloads that work on
> > different portions of a large set of data and it would be perfectly
> > reasonable for threads using the same address space to converge on
> > different nodes.
> 
> Agreed. Even if they can't converge fully they could have stats like
> 70/30, 30/70, with 30 being numa-false-shared and we'll schedule them
> right, so running faster than upstream. That 30% will also tend to
> slowly distribute better over time.
> 

Ok

> > I would hope we manage to figure out a way to examine fewer processes,
> > not more :)
> 
> 8)))
> 
> > > +void __sched_autonuma_balance(void)
> > > +{
> > > +	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
> > > +	int this_nid = numa_node_id();
> > > +	int this_cpu = smp_processor_id();
> > > +	unsigned long task_fault, task_tot, mm_fault, mm_tot;
> > > +	unsigned long task_max, mm_max;
> > > +	unsigned long weight_diff_max;
> > > +	long uninitialized_var(s_w_nid);
> > > +	long uninitialized_var(s_w_this_nid);
> > > +	long uninitialized_var(s_w_other);
> > > +	bool uninitialized_var(s_w_type_thread);
> > > +	struct cpumask *allowed;
> > > +	struct task_struct *p = current, *other_task;
> > 
> > So the task in question is current but this is called by the idle
> > balancer. I'm missing something obvious here but it's not clear to me why
> > that process is necessarily relevant. What guarantee is there that all
> > tasks will eventually run this code? Maybe it doesn't matter because the
> > most CPU intensive tasks are also the most likely to end up in here but
> > a clarification would be nice.
> 
> Exactly. We only focus on tasks that are significantly computing. If a
> task runs for just 1msec we can't possibly care where it runs and where
> its memory is. If it keeps running (even in 1msec bursts), over time
> that task too will be migrated to the right place.
> 

This limitation is fine, but it should be mentioned in a comment above
__sched_autonuma_balance() for the next person that reviews this in the
future.

> > > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > > +	struct mm_autonuma *mm_autonuma;
> > > +	struct rq *rq;
> > > +
> > > +	/* per-cpu statically allocated in runqueues */
> > > +	long *task_numa_weight;
> > > +	long *mm_numa_weight;
> > > +
> > > +	if (!task_autonuma || !p->mm)
> > > +		return;
> > > +
> > > +	if (!autonuma_enabled()) {
> > > +		if (task_autonuma->task_selected_nid != -1)
> > > +			task_autonuma->task_selected_nid = -1;
> > > +		return;
> > > +	}
> > > +
> > > +	allowed = tsk_cpus_allowed(p);
> > > +	mm_autonuma = p->mm->mm_autonuma;
> > > +
> > > +	/*
> > > +	 * If the task has no NUMA hinting page faults or if the mm
> > > +	 * hasn't been fully scanned by knuma_scand yet, set task
> > > +	 * selected nid to the current nid, to avoid the task bounce
> > > +	 * around randomly.
> > > +	 */
> > > +	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
> > 
> > Why ACCESS_ONCE?
> 
> mm variables are altered by other threads too. Only task_autonuma is
> local to this task and cannot change from under us.
> 
> I did it all lockless, I don't care if we're off once in a while.
> 

Mention why ACCESS_ONCE is used in a comment the first time it appears
in kernel/sched/numa.c. It's not necessary to mention it after that.

> > > +	if (!mm_tot) {
> > > +		if (task_autonuma->task_selected_nid != this_nid)
> > > +			task_autonuma->task_selected_nid = this_nid;
> > > +		return;
> > > +	}
> > > +	task_tot = task_autonuma->task_numa_fault_tot;
> > > +	if (!task_tot) {
> > > +		if (task_autonuma->task_selected_nid != this_nid)
> > > +			task_autonuma->task_selected_nid = this_nid;
> > > +		return;
> > > +	}
> > > +
> > > +	rq = cpu_rq(this_cpu);
> > > +
> > > +	/*
> > > +	 * Verify that we can migrate the current task, otherwise try
> > > +	 * again later.
> > > +	 */
> > > +	if (ACCESS_ONCE(rq->autonuma_balance))
> > > +		return;
> > > +
> > > +	/*
> > > +	 * The following two arrays will hold the NUMA affinity weight
> > > +	 * information for the current process if scheduled on the
> > > +	 * given NUMA node.
> > > +	 *
> > > +	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
> > > +	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
> > > +	 */
> > > +	task_numa_weight = rq->task_numa_weight;
> > > +	mm_numa_weight = rq->mm_numa_weight;
> > > +
> > > +	/*
> > > +	 * Identify the NUMA node where this thread (task_struct), and
> > > +	 * the process (mm_struct) as a whole, has the largest number
> > > +	 * of NUMA faults.
> > > +	 */
> > > +	task_max = mm_max = 0;
> > > +	selected_nid = mm_selected_nid = -1;
> > > +	for_each_online_node(nid) {
> > > +		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
> > > +		task_fault = task_autonuma->task_numa_fault[nid];
> > > +		if (mm_fault > mm_tot)
> > > +			/* could be removed with a seqlock */
> > > +			mm_tot = mm_fault;
> > > +		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
> > > +		if (task_fault > task_tot) {
> > > +			task_tot = task_fault;
> > > +			WARN_ON(1);
> > > +		}
> > > +		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
> > > +		if (mm_numa_weight[nid] > mm_max) {
> > > +			mm_max = mm_numa_weight[nid];
> > > +			mm_selected_nid = nid;
> > > +		}
> > > +		if (task_numa_weight[nid] > task_max) {
> > > +			task_max = task_numa_weight[nid];
> > > +			selected_nid = nid;
> > > +		}
> > > +	}
> > 
> > Ok, so this is a big walk to take every time and as this happens every
> > scheduler tick, it seems unlikely that the workload would be changing
> > phases that often in terms of NUMA behaviour. Would it be possible for
> > this to be sampled less frequently and cache the result?
> 
> Even if there are 8 nodes, this is fairly quick and only requires 2
> cachelines. At 16 nodes we're at 4 cachelines. The cacheline of
> task_autonuma is fully local. The one of mm_autonuma can be shared
> (modulo numa hinting page faults with autonuma28, in autonuma27 it was
> also sharable even despite numa hinting page faults).
> 

Two cachelines that bounce though because of writes. I still don't
really like it but it can be lived with for now I guess, it's not my call
really. However, I'd like you to consider the suggestion above on how we
might create a per-NUMA scheduling domain cache of this information that
is only updated by a task if it scores "better" or "worse" than the current
cached value.

> > > +			/*
> > > +			 * Grab the fault/tot of the processes running
> > > +			 * in the other CPUs to compute w_other.
> > > +			 */
> > > +			raw_spin_lock_irq(&rq->lock);
> > > +			_other_task = rq->curr;
> > > +			/* recheck after implicit barrier() */
> > > +			mm = _other_task->mm;
> > > +			if (!mm) {
> > > +				raw_spin_unlock_irq(&rq->lock);
> > > +				continue;
> > > +			}
> > > +
> > 
> > Is it really critical to pin those values using the lock? That seems *really*
> > heavy. If the results have to be exactly stable then is there any chance
> > the values could be encoded in the high and low bits of a single unsigned
> > long and read without the lock?  Updates would be more expensive but that's
> > in a trap anyway. This on the other hand is a scheduler path.
> 
> The reason for the lock is to prevent rq->curr, mm etc. from being
> freed from under us.
> 

Crap, yes.

> > > +			/*
> > > +			 * Check if the _other_task is allowed to be
> > > +			 * migrated to this_cpu.
> > > +			 */
> > > +			if (!cpumask_test_cpu(this_cpu,
> > > +					      tsk_cpus_allowed(_other_task))) {
> > > +				raw_spin_unlock_irq(&rq->lock);
> > > +				continue;
> > > +			}
> > > +
> > 
> > Would it not make sense to check this *before* we take the lock and
> > grab all its counters? It probably will not make much of a difference in
> > practice as I expect it's rare that the target CPU is running a task
> > that can't migrate but it still feels the wrong way around.
> 
> It's a micro optimization to do it here. It's very rare that the above
> check fails, while tot may be zero much more frequently (e.g. if the
> task has just been started).
> 

Ok.

> > > +	if (selected_cpu != this_cpu) {
> > > +		if (autonuma_debug()) {
> > > +			char *w_type_str;
> > > +			w_type_str = s_w_type_thread ? "thread" : "process";
> > > +			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
> > > +			       p->mm, p->pid, this_nid, selected_nid,
> > > +			       this_cpu, selected_cpu,
> > > +			       s_w_other, s_w_nid, s_w_this_nid,
> > > +			       w_type_str);
> > > +		}
> > 
> > Can these be made tracepoints and get rid of the autonuma_debug() check?
> > I recognise there is a risk that some tool might grow to depend on
> > implementation details but in this case it seems very unlikely.
> 
> The debug mode also provides me with a racy dump of all mms; I
> wouldn't know how to do that with tracing.
> 

For live reporting on a terminal:

$ trace-cmd start -e autonuma:some_event_whatever_you_called_it
$ cat /sys/kernel/debug/tracing/trace_pipe
$ trace-cmd stop -e autonuma:some_event_whatever_you_called_it

you can record the trace using trace-cmd record but I suspect in this
case you want live reporting and I think this is the best way of doing
it.
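
The tracepoint itself would be an ordinary TRACE_EVENT; a minimal sketch
with a made-up event name, the fields simply mirroring the existing
printk (the usual trace header boilerplate around it is omitted):

	TRACE_EVENT(autonuma_task_exchange,

		TP_PROTO(int pid, int src_nid, int dst_nid,
			 int src_cpu, int dst_cpu),

		TP_ARGS(pid, src_nid, dst_nid, src_cpu, dst_cpu),

		TP_STRUCT__entry(
			__field(int, pid)
			__field(int, src_nid)
			__field(int, dst_nid)
			__field(int, src_cpu)
			__field(int, dst_cpu)
		),

		TP_fast_assign(
			__entry->pid     = pid;
			__entry->src_nid = src_nid;
			__entry->dst_nid = dst_nid;
			__entry->src_cpu = src_cpu;
			__entry->dst_cpu = dst_cpu;
		),

		TP_printk("pid=%d nid %d->%d cpu %d->%d",
			  __entry->pid, __entry->src_nid, __entry->dst_nid,
			  __entry->src_cpu, __entry->dst_cpu)
	);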

> So I wouldn't remove the printk until we can replace everything with
> tracing, but I'd welcome adding a tracepoint too. There are already
> other proper tracepoints driving "perf script numatop".
> 

Good.

> > Ok, so I confess I did not work out if the weights and calculations really
> > make sense or not but at a glance they seem reasonable and I spotted no
> > obvious flaws. The function is pretty heavy though and may be doing more
> > work around locking than is really necessary. That said, there will be
> > workloads where the cost is justified and offset by the performance gains
> > from improved NUMA locality. I just don't expect it to be a universal win so
> > we'll need to keep an eye on the system CPU usage and incrementally optimise
> > where possible. I suspect there will be a time when an incremental
> > optimisation just does not cut it any more but by then I would also
> > expect there will be more data on how autonuma behaves in practice and a
> > new algorithm might be more obvious at that point.
> 
> Agreed. Chances are I can replace all this already with RCU and a
> rcu_dereference or ACCESS_ONCE to grab the rq->curr->task_autonuma and
> rq->curr->mm->mm_autonuma data. I didn't try yet. The task struct
> shouldn't go away from under us after rcu_read_lock, the mm may be more
> tricky, I haven't checked this yet. Optimizations welcome ;)
> 

Optimizations are limited to hand waving and no patches for the moment
:)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-12  1:45   ` [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
@ 2012-10-12  8:46     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-12  8:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Fri, Oct 12, 2012 at 03:45:53AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> > So after getting through the full review of it, there wasn't anything
> > I could not stand. I think it's *very* heavy on some of the paths like
> > the idle balancer which I was not keen on and the fault paths are also
> > quite heavy.  I think the weight on some of these paths can be reduced
> > but not to 0 if the objectives to autonuma are to be met.
> > 
> > I'm not fully convinced that the task exchange is actually necessary or
> > beneficial because it somewhat assumes that there is a symmetry between CPU
> > and memory balancing that may not be true. The fact that it only considers
> 
> The problem is that without an active task exchange and without an
> explicit call to stop_one_cpu*, there's no way to migrate a currently
> running task, and clearly we need that. We could wait indefinitely,
> hoping the task goes to sleep and leaves the CPU idle, or that a
> couple of other tasks start and trigger load balance events.
> 

Stick that in a comment although I still don't fully see why the actual
exchange is necessary and why you cannot just move the current task to
the remote CPU's runqueue. Maybe it's something to do with them converging
faster if you do an exchange. I'll figure it out eventually.

> We must move tasks even if all cpus are in a steady rq->nr_running ==
> 1 state and there's no other scheduler balance event that could
> possibly attempt to move tasks around in such a steady state.
> 

I see, because just because there is a 1:1 mapping between tasks and
CPUs does not mean that it has converged from a NUMA perspective. The
idle balancer could be moving to an idle CPU that is poor from a NUMA
point of view. Better integration with the load balancer and caching on
a per-NUMA basis both the best and worst converged processes might help
but I'm hand-waving.

> Of course one could hack the active idle balancing so that it does the
> active NUMA balancing action, but that would be a purely artificial
> complication: it would add unnecessary delay and it would provide no
> benefit whatsoever.
> 
> Why don't we dump the active idle balancing too, and we hack the load
> balancing to do the active idle balancing as well? Of course then the
> two will be more integrated. But it'll be a mess and slower and
> there's a good reason why they exist as totally separated pieces of
> code working in parallel.
> 

I'm not 100% convinced they have to be separate but you have thought about
this a hell of a lot more than I have and I'm a scheduling dummy.

For example, to me it seems that if the load balancer was going to move a
task to an idle CPU on a remote node, it could also check it it would be
more or less converged before moving and reject the balancing if it would
be less converged after the move. This increases the search cost in the
load balancer but not necessarily any worse than what happens currently.
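
Something along these lines could act as that filter; a sketch only,
where task_numa_weight_on() is an assumed helper returning the task's
NUMA affinity weight for a node, not an existing function:

	/*
	 * Sketch: reject a load balancing move to a remote node if it would
	 * leave the task less converged than where it currently runs.
	 */
	static bool example_move_keeps_convergence(struct task_struct *p,
						   int src_cpu, int dst_cpu)
	{
		int src_nid = cpu_to_node(src_cpu);
		int dst_nid = cpu_to_node(dst_cpu);

		if (src_nid == dst_nid)
			return true;	/* same node, placement unchanged */

		return task_numa_weight_on(p, dst_nid) >=
		       task_numa_weight_on(p, src_nid);
	}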

> We can integrate it more, but in my view the result would be worse and
> more complicated. Last but not the least messing the idle balancing
> code to do an active NUMA balancing action (somehow invoking
> stop_one_cpu* in the steady state described above) would force even
> cellphones and UP kernels to deal with NUMA code somehow.
> 

hmm...

> > tasks that are currently running feels a bit random but examining all tasks
> > that recently ran on the node would be far too expensive so there is no
> 
> So far this seems a good tradeoff. Nothing prevents us from scanning
> deeper into the runqueues later if we find a way to do that efficiently.
> 

I don't think there is an efficient way to do that but I'm hoping
caching an exchange candidate on a per-NUMA basis could reduce the cost
while still converging reasonably quickly.

> > good answer. You are caught between a rock and a hard place and either
> > direction you go is wrong for different reasons. You need something more
> 
> I think you described the problem perfectly ;).
> 
> > frequent than scans (because it'll converge too slowly) but doing it from
> > the balancer misses some tasks and may run too frequently and it's unclear
> > how it affects the current load balancer decisions. I don't have a good
> > alternative solution for this but ideally it would be better integrated with
> > the existing scheduler when there is more data on what those scheduling
> > decisions should be. That will only come from a wide range of testing and
> > the inevitable bug reports.
> > 
> > That said, this is concentrating on the problems without considering the
> > situations where it would work very well.  I think it'll come down to HPC
> > and anything jitter-sensitive will hate this while workloads like JVM,
> > virtualisation or anything that uses a lot of memory without caring about
> > placement will love it. It's not perfect but it's better than incurring
> > the cost of remote access unconditionally.
> 
> Full agreement.
> 
> Your detailed full review was very appreciated, thanks!
> 

You're welcome.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 15/33] autonuma: alloc/free/init task_autonuma
       [not found]       ` <20121011175953.GT1818@redhat.com>
@ 2012-10-12 14:03         ` Rik van Riel
  0 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2012-10-12 14:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On 10/11/2012 01:59 PM, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:34:12PM -0400, Rik van Riel wrote:

>> That is indeed a future optimization I have suggested
>> in the past. Allocation of this struct could be deferred
>> until the first time knuma_scand unmaps pages from the
>> process to generate NUMA page faults.
>
> I already tried this, and quickly noticed that for mm_autonuma we
> can't, or we wouldn't have memory to queue the "mm" into knuma_scand
> in the first place.
>
> For task_autonuma we could, but then we wouldn't be able to inherit
> the task_autonuma->task_autonuma_nid across clone/fork which kind of
> makes sense to me (and it's done by default without knob at the
> moment). It's actually more important for clone than for fork but it
> might be good for fork too if it doesn't exec immediately.
>
> Another option is to move task_autonuma_nid in the task_structure
> (it's in the stack so it won't cost RAM). Then I probably can defer
> the task_autonuma if I remove the child_inheritance knob.
>
> In knuma_scand we don't have the task pointer, so task_autonuma would
> need to be allocated in the NUMA page faults, the first time it fires.

One thing that could be done is have the (few) mm and
task specific bits directly in the mm and task structs,
and have the sized-by-number-of-nodes statistics in
a separate numa_stats struct.

At that point, the numa_stats struct could be lazily
allocated, reducing the memory allocations at fork
time by 2 (and the frees at exit time, for short lived
processes).
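
Something like this, with all names invented, just to illustrate the
split:

	/*
	 * Sketch: the small fixed-size fields live directly in
	 * task_struct/mm_struct, only the per-node statistics array is
	 * allocated lazily on the first NUMA hinting fault.
	 */
	struct numa_stats {
		unsigned long	fault_tot;
		unsigned long	fault[];	/* one counter per node */
	};

	struct task_numa_bits {			/* embedded in task_struct */
		int		selected_nid;	/* always available, no alloc */
		struct numa_stats *stats;	/* NULL until first NUMA fault */
	};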

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 15:35     ` Mel Gorman
  2012-10-12  0:41       ` Andrea Arcangeli
@ 2012-10-12 14:54       ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2012-10-12 14:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> > 
> > On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > > Benchmark on a 4-node machine and the following fell out.
> > > 
> > >                                      3.6.0                 3.6.0
> > >                                    vanilla        autonuma-v33r6
> > > User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> > > User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> > > System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> > > System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> > > Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> > > Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> > > CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> > > CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)
> > 
> > Thanks a lot for the help and for looking into it!
> > 
> > Just curious, why are you running only numa02_SMT and
> > numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> > without _suffix)
> > 
> 
> Bug in the testing script on my end. Each of them is run separately and it

Ok, MMTests 0.06 (released a few minutes ago) patches autonumabench so
it can run the tests individually. I know start_bench.sh can run all the
tests itself but in time I'll want mmtests to collect additional stats
that can also be applied to other benchmarks consistently. The revised
results look like this

AUTONUMA BENCH
                                          3.6.0                 3.6.0
                                        vanilla        autonuma-v33r6
User    NUMA01               66395.58 (  0.00%)    32000.83 ( 51.80%)
User    NUMA01_THEADLOCAL    55952.48 (  0.00%)    16950.48 ( 69.71%)
User    NUMA02                6988.51 (  0.00%)     2150.56 ( 69.23%)
User    NUMA02_SMT            2914.25 (  0.00%)     1013.11 ( 65.24%)
System  NUMA01                 319.12 (  0.00%)      483.60 (-51.54%)
System  NUMA01_THEADLOCAL       40.60 (  0.00%)      184.39 (-354.16%)
System  NUMA02                   1.62 (  0.00%)       23.92 (-1376.54%)
System  NUMA02_SMT               0.90 (  0.00%)       16.20 (-1700.00%)
Elapsed NUMA01                1519.53 (  0.00%)      757.40 ( 50.16%)
Elapsed NUMA01_THEADLOCAL     1269.49 (  0.00%)      398.63 ( 68.60%)
Elapsed NUMA02                 181.12 (  0.00%)       57.09 ( 68.48%)
Elapsed NUMA02_SMT             164.18 (  0.00%)       53.16 ( 67.62%)
CPU     NUMA01                4390.00 (  0.00%)     4288.00 (  2.32%)
CPU     NUMA01_THEADLOCAL     4410.00 (  0.00%)     4298.00 (  2.54%)
CPU     NUMA02                3859.00 (  0.00%)     3808.00 (  1.32%)
CPU     NUMA02_SMT            1775.00 (  0.00%)     1935.00 ( -9.01%)

MMTests Statistics: duration
               3.6.0       3.6.0
             vanilla autonuma-v33r6
User       132257.44    52121.30
System        362.79      708.62
Elapsed      3142.66     1275.72

MMTests Statistics: vmstat
                              3.6.0       3.6.0
                            vanilla autonuma-v33r6
THP fault alloc               17660       19927
THP collapse alloc               10       12399
THP splits                        4       12637

The System CPU usage is high but is compensated for by the reduced User
and Elapsed times in this particular case.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
       [not found] <1349308275-2174-1-git-send-email-aarcange@redhat.com>
                   ` (10 preceding siblings ...)
       [not found] ` <1349308275-2174-16-git-send-email-aarcange@redhat.com>
@ 2012-10-13 18:40 ` Srikar Dronamraju
  2012-10-14  4:57   ` Andrea Arcangeli
       [not found] ` <1349308275-2174-20-git-send-email-aarcange@redhat.com>
  12 siblings, 1 reply; 34+ messages in thread
From: Srikar Dronamraju @ 2012-10-13 18:40 UTC (permalink / raw)
  To: aarcange
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

* Andrea Arcangeli <aarcange@redhat.com> [2012-10-04 01:50:42]:

> Hello everyone,
> 
> This is a new AutoNUMA27 release for Linux v3.6.
> 


Here are the results of autonumabenchmark on a 328GB, 64-core machine
with HT disabled, comparing v3.6 with autonuma27.

$ numactl -H 
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31689 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31917 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31928 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31926 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31913 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63952 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64230 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  20  20  20  20  20  20  20 
  1:  20  10  20  20  20  20  20  20 
  2:  20  20  10  20  20  20  20  20 
  3:  20  20  20  10  20  20  20  20 
  4:  20  20  20  20  10  20  20  20 
  5:  20  20  20  20  20  10  20  20 
  6:  20  20  20  20  20  20  10  20 
  7:  20  20  20  20  20  20  20  10 



          KernelVersion:                 3.6.0-mainline_v36
                        Testcase:     Min      Max      Avg
                          numa01: 1509.14  2098.75  1793.90
                numa01_HARD_BIND:  865.43  1826.40  1334.85
             numa01_INVERSE_BIND: 3242.76  3496.71  3345.12
             numa01_THREAD_ALLOC:  944.28  1418.78  1214.32
   numa01_THREAD_ALLOC_HARD_BIND:  696.33  1004.99   825.63
numa01_THREAD_ALLOC_INVERSE_BIND: 2072.88  2301.27  2186.33
                          numa02:  129.87   146.10   136.88
                numa02_HARD_BIND:   25.81    26.18    25.97
             numa02_INVERSE_BIND:  341.96   354.73   345.59
                      numa02_SMT:  160.77   246.66   186.85
            numa02_SMT_HARD_BIND:   25.77    38.86    33.57
         numa02_SMT_INVERSE_BIND:  282.61   326.76   296.44

          KernelVersion:               3.6.0-autonuma27+                            
                        Testcase:     Min      Max      Avg  %Change   
                          numa01: 1805.19  1907.11  1866.39    -3.88%  
                numa01_HARD_BIND:  953.33  2050.23  1603.29   -16.74%  
             numa01_INVERSE_BIND: 3515.14  3882.10  3715.28    -9.96%  
             numa01_THREAD_ALLOC:  323.50   362.17   348.81   248.13%  
   numa01_THREAD_ALLOC_HARD_BIND:  841.08  1205.80   977.43   -15.53%  
numa01_THREAD_ALLOC_INVERSE_BIND: 2268.35  2654.89  2439.51   -10.38%  
                          numa02:   51.64    73.35    58.88   132.47%  
                numa02_HARD_BIND:   25.23    26.31    25.93     0.15%  
             numa02_INVERSE_BIND:  338.39   355.70   344.82     0.22%  
                      numa02_SMT:   51.76    66.78    58.63   218.69%  
            numa02_SMT_HARD_BIND:   34.95    45.39    39.24   -14.45%  
         numa02_SMT_INVERSE_BIND:  287.85   300.82   295.80     0.22%  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-13 18:40 ` [PATCH 00/33] AutoNUMA27 Srikar Dronamraju
@ 2012-10-14  4:57   ` Andrea Arcangeli
  2012-10-15  8:16     ` Srikar Dronamraju
  2012-10-23 16:32     ` Srikar Dronamraju
  0 siblings, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2012-10-14  4:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

Hi Srikar,

On Sun, Oct 14, 2012 at 12:10:19AM +0530, Srikar Dronamraju wrote:
> * Andrea Arcangeli <aarcange@redhat.com> [2012-10-04 01:50:42]:
> 
> > Hello everyone,
> > 
> > This is a new AutoNUMA27 release for Linux v3.6.
> > 
> 
> 
> Here are the results of autonumabenchmark on a 328GB, 64-core machine
> with HT disabled, comparing v3.6 with autonuma27.

*snip*

>                           numa01: 1805.19  1907.11  1866.39    -3.88%  

Interesting. So numa01 should be improved in autonuma28fast. I'm not
sure why the hard binds show any difference, but I'm more concerned
with optimizing numa01. I get the same results from hard bindings on
upstream and on autonuma, which is strange.
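
For context: "hard bind" vs "inverse bind" is just whether memory is
placed on the same node the threads are pinned to, or deliberately on a
remote node. A rough libnuma illustration -- my assumption of what the
*_BIND variants amount to, not the benchmark's actual code:

	#include <numa.h>	/* libnuma, link with -lnuma */

	/*
	 * Pin the calling thread to the CPUs of cpu_node and force all
	 * new allocations onto mem_node.
	 *   hard bind:    cpu_node == mem_node (every access is local)
	 *   inverse bind: cpu_node != mem_node (every access is remote)
	 */
	static void bind_cpu_and_mem(int cpu_node, int mem_node)
	{
		struct bitmask *nodes = numa_allocate_nodemask();

		numa_run_on_node(cpu_node);
		numa_bitmask_setbit(nodes, mem_node);
		numa_set_membind(nodes);
		numa_free_nodemask(nodes);
	}

The INVERSE_BIND variants are the pathological all-remote placement, so
they bracket the results from the other side.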

Could you repeat only numa01 with the origin/autonuma28fast branch?
Also, if you could post the two PDF convergence charts generated by
numa01 on autonuma27 and autonuma28fast, it would be interesting to see
the full effect and why it is faster.

I only had time for a quick push after adding the idea to
autonuma28fast (which is further improved compared to autonuma28), but
I've already been told that it deals with numa01 on the 8 node system
very well, as expected.

numa01 on the 8 node system is a workload without a perfect solution
(other than MPOL_INTERLEAVE). Full convergence that prevents cross-node
traffic is impossible because there are 2 processes spanning 8 nodes
and all process memory is touched by all threads constantly. Yet
autonuma28fast should deal optimally with that scenario too.
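
From userspace, that kind of interleaving is roughly the following -- a
minimal libnuma sketch for illustration only, not something the
patchset does:

	#include <numa.h>	/* libnuma, link with -lnuma */
	#include <string.h>

	/*
	 * Spread the pages of one big allocation round-robin across all
	 * allowed nodes, so no single node absorbs all the remote
	 * traffic when every thread touches every page (the numa01 on
	 * >2 nodes situation).
	 */
	static void *alloc_interleaved_and_touch(size_t len)
	{
		void *p = numa_alloc_interleaved(len);

		if (p)
			memset(p, 0, len);	/* fault pages in to place them */
		return p;
	}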

As a side note: numa01 on the 2 node system instead converges fully (2
processes + 2 nodes = full convergence). numa01 on 2 nodes and numa01
on >2 nodes are very different kinds of test.

I'll release an autonuma29 behaving like 28fast if there are no
surprises. The new algorithm change in 28fast will also save memory
once I rewrite it properly.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-14  4:57   ` Andrea Arcangeli
@ 2012-10-15  8:16     ` Srikar Dronamraju
  2012-10-23 16:32     ` Srikar Dronamraju
  1 sibling, 0 replies; 34+ messages in thread
From: Srikar Dronamraju @ 2012-10-15  8:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

> 
> Interesting. So numa01 should be improved in autonuma28fast. I'm not
> sure why the hard binds show any difference, but I'm more concerned
> with optimizing numa01. I get the same results from hard bindings on
> upstream and on autonuma, which is strange.
> 
> Could you repeat only numa01 with the origin/autonuma28fast branch?

Okay, will try to get the numbers on autonuma28 soon.

> Also, if you could post the two PDF convergence charts generated by
> numa01 on autonuma27 and autonuma28fast, it would be interesting to see
> the full effect and why it is faster.

I have attached the chart for autonuma27 in a private email.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
       [not found]   ` <20121013180618.GC31442@linux.vnet.ibm.com>
@ 2012-10-15  8:24     ` Srikar Dronamraju
  2012-10-15  9:20       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Srikar Dronamraju @ 2012-10-15  8:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:

> > +
> > +bool numa_hinting_fault(struct page *page, int numpages)
> > +{
> > +	bool migrated = false;
> > +
> > +	/*
> > +	 * "current->mm" could be different from the "mm" where the
> > +	 * NUMA hinting page fault happened, if get_user_pages()
> > +	 * triggered the fault on some other process "mm". That is ok,
> > +	 * all we care about is to count the "page_nid" access on the
> > +	 * current->task_autonuma, even if the page belongs to a
> > +	 * different "mm".
> > +	 */
> > +	WARN_ON_ONCE(!current->mm);
> 
> Given the above comment, do we really need this warn_on?
> I think I have seen this warning when using autonuma.
> 

------------[ cut here ]------------
WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
Hardware name: BladeCenter HS22V -[7871AC1]-
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3
Call Trace:
 [<ffffffff8105194f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff810519aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81153f0d>] numa_hinting_fault+0x60d/0x7c0
 [<ffffffff8104ae90>] ? flush_tlb_mm_range+0x250/0x250
 [<ffffffff8103b82e>] ? physflat_send_IPI_mask+0xe/0x10
 [<ffffffff81036db5>] ? native_send_call_func_ipi+0xa5/0xd0
 [<ffffffff81154255>] pmd_numa_fixup+0x195/0x350
 [<ffffffff81135ef4>] handle_mm_fault+0x2c4/0x3d0
 [<ffffffff8113139c>] ? follow_page+0x2fc/0x4f0
 [<ffffffff81156364>] break_ksm+0x74/0xa0
 [<ffffffff81156562>] break_cow+0xa2/0xb0
 [<ffffffff81158444>] ksm_scan_thread+0xb54/0xd50
 [<ffffffff81075cf0>] ? wake_up_bit+0x40/0x40
 [<ffffffff811578f0>] ? run_store+0x340/0x340
 [<ffffffff8107563e>] kthread+0x9e/0xb0
 [<ffffffff814e8c44>] kernel_thread_helper+0x4/0x10
 [<ffffffff810755a0>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814e8c40>] ? gs_change+0x13/0x13
---[ end trace 8f50820d1887cf93 ]---


This was while running specjbb on a 2 node box. It seems pretty easy to reproduce.
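
For what it's worth, the trace itself shows why it fires: ksmd is a
kernel thread, so current->mm is NULL, and break_ksm() ends up taking
the NUMA hinting fault against the scanned process's mm. If the answer
to the question above is that the warning isn't needed, a sketch of the
obvious early bail-out would be something like the following (whether
kernel threads should skip only the accounting or the whole fault
handling is a separate question):

	/*
	 * Hypothetical early bail-out, not from the patchset: kernel
	 * threads (ksmd here) have no mm of their own, so there is no
	 * task_autonuma to account the access to.
	 */
	if (!current->mm)
		return false;	/* "migrated" is still false at this point */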

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-15  8:24     ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Srikar Dronamraju
@ 2012-10-15  9:20       ` Mel Gorman
  2012-10-15 10:00         ` Srikar Dronamraju
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2012-10-15  9:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, torvalds, akpm,
	pzijlstr, mingo, hughd, riel, hannes, dhillf, drjones, tglx, pjt,
	cl, suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

On Mon, Oct 15, 2012 at 01:54:13PM +0530, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:
> 
> > > +
> > > +bool numa_hinting_fault(struct page *page, int numpages)
> > > +{
> > > +	bool migrated = false;
> > > +
> > > +	/*
> > > +	 * "current->mm" could be different from the "mm" where the
> > > +	 * NUMA hinting page fault happened, if get_user_pages()
> > > +	 * triggered the fault on some other process "mm". That is ok,
> > > +	 * all we care about is to count the "page_nid" access on the
> > > +	 * current->task_autonuma, even if the page belongs to a
> > > +	 * different "mm".
> > > +	 */
> > > +	WARN_ON_ONCE(!current->mm);
> > 
> > Given the above comment, do we really need this warn_on?
> > I think I have seen this warning when using autonuma.
> > 
> 
> ------------[ cut here ]------------
> WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
> Hardware name: BladeCenter HS22V -[7871AC1]-
> Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3

The kernel is tainted "D" which implies that it has already oopsed
before this warning was triggered. What was the other oops?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-15  9:20       ` Mel Gorman
@ 2012-10-15 10:00         ` Srikar Dronamraju
  0 siblings, 0 replies; 34+ messages in thread
From: Srikar Dronamraju @ 2012-10-15 10:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, torvalds, akpm,
	pzijlstr, mingo, hughd, riel, hannes, dhillf, drjones, tglx, pjt,
	cl, suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

* Mel Gorman <mel@csn.ul.ie> [2012-10-15 10:20:44]:

> On Mon, Oct 15, 2012 at 01:54:13PM +0530, Srikar Dronamraju wrote:
> > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:
> > 
> > > > +
> > > > +bool numa_hinting_fault(struct page *page, int numpages)
> > > > +{
> > > > +	bool migrated = false;
> > > > +
> > > > +	/*
> > > > +	 * "current->mm" could be different from the "mm" where the
> > > > +	 * NUMA hinting page fault happened, if get_user_pages()
> > > > +	 * triggered the fault on some other process "mm". That is ok,
> > > > +	 * all we care about is to count the "page_nid" access on the
> > > > +	 * current->task_autonuma, even if the page belongs to a
> > > > +	 * different "mm".
> > > > +	 */
> > > > +	WARN_ON_ONCE(!current->mm);
> > > 
> > > Given the above comment, do we really need this warn_on?
> > > I think I have seen this warning when using autonuma.
> > > 
> > 
> > ------------[ cut here ]------------
> > WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
> > Hardware name: BladeCenter HS22V -[7871AC1]-
> > Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> > Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3
> 
> The kernel is tainted "D" which implies that it has already oopsed
> before this warning was triggered. What was the other oops?
> 

Yes, but this oops shows up even with the v3.6 kernel and is not related to the autonuma changes.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000dc
IP: [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
PGD 671ce4067 PUD 671257067 PMD 0 
Oops: 0000 [#3] SMP 
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
CPU 1 
Pid: 10833, comm: tar Tainted: G      D      3.6.0-autonuma27+ #2 IBM BladeCenter HS22V -[7871AC1]-/81Y5995     
RIP: 0010:[<ffffffffa0015543>]  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
RSP: 0018:ffff88033a10fe68  EFLAGS: 00010286
RAX: ffff880371bd5000 RBX: ffffffffa0018880 RCX: ffffffffa0015530
RDX: 0000000000000000 RSI: ffffffffa0018880 RDI: ffff88036f0af000
RBP: ffff88033a10fe68 R08: ffff88036f0af010 R09: ffffffff8152a140
R10: 0000000000002de7 R11: 0000000000000246 R12: ffff88033a10ff48
R13: 0000000000001000 R14: 0000000000ccc600 R15: ffff88036f233e40
FS:  00007f57c07c47a0(0000) GS:ffff88037fc20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000dc CR3: 0000000671e12000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 10833, threadinfo ffff88033a10e000, task ffff88036e45e7f0)
Stack:
 ffff88033a10fe98 ffffffff8132b1e7 ffff88033a10fe88 ffffffff81110b5e
 ffff88033a10fe98 ffff88036f233e60 ffff88033a10fef8 ffffffff811d2d1e
 0000000000001000 ffff88036f0af010 ffffffff8152a140 ffff88036d875e48
Call Trace:
 [<ffffffff8132b1e7>] dev_attr_show+0x27/0x50
 [<ffffffff81110b5e>] ? __get_free_pages+0xe/0x50
 [<ffffffff811d2d1e>] sysfs_read_file+0xce/0x1c0
 [<ffffffff81162ed5>] vfs_read+0xc5/0x190
 [<ffffffff811630a1>] sys_read+0x51/0x90
 [<ffffffff814e29e9>] system_call_fastpath+0x16/0x1b
Code: 89 c7 48 c7 c6 64 79 01 a0 31 c0 e8 18 8d 23 e1 c9 48 98 c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 48 89 d0 48 8b 97 c0 03 00 00 <8b> 92 dc 00 00 00 85 d2 78 1b 48 89 c7 48 c7 c6 69 79 01 a0 31 
RIP  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
 RSP <ffff88033a10fe68>
CR2: 00000000000000dc
---[ end trace f0a3a4c8c85ff69f ]---

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-14  4:57   ` Andrea Arcangeli
  2012-10-15  8:16     ` Srikar Dronamraju
@ 2012-10-23 16:32     ` Srikar Dronamraju
  1 sibling, 0 replies; 34+ messages in thread
From: Srikar Dronamraju @ 2012-10-23 16:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

* Andrea Arcangeli <aarcange@redhat.com> [2012-10-14 06:57:16]:

> I'll release an autonuma29 behaving like 28fast if there are no
> surprises. The new algorithm change in 28fast will also save memory
> once I rewrite it properly.
> 

Here are my results of specjbb2005 on a 2 node box (still on autonuma27,
but I plan to rerun on a newer release soon).


---------------------------------------------------------------------------------------------------
|          kernel|      vm|                              nofit|                                fit|
-                -        -------------------------------------------------------------------------
|                |        |            noksm|              ksm|            noksm|              ksm|
-                -        -------------------------------------------------------------------------
|                |        |   nothp|     thp|   nothp|     thp|   nothp|     thp|   nothp|     thp|
---------------------------------------------------------------------------------------------------
|    mainline_v36|    vm_1|  136085|  188500|  133871|  163638|  133540|  178159|  132460|  164763|
|                |    vm_2|   61549|   80496|   61420|   74864|   63777|   80573|   60479|   73416|
|                |    vm_3|   60688|   79349|   62244|   73289|   64394|   80803|   61040|   74258|
---------------------------------------------------------------------------------------------------
|     autonuma27_|    vm_1|  143261|  186080|  127420|  178505|  141080|  201436|  143216|  183710|
|                |    vm_2|   72224|   94368|   71309|   89576|   59098|   83750|   63813|   90862|
|                |    vm_3|   61215|   94213|   71539|   89594|   76269|   99637|   72412|   91191|
---------------------------------------------------------------------------------------------------
| improvement    |    vm_1|   5.27%|  -1.28%|  -4.82%|   9.09%|   5.65%|  13.07%|   8.12%|  11.50%|
|   from         |    vm_2|  17.34%|  17.23%|  16.10%|  19.65%|  -7.34%|   3.94%|   5.51%|  23.76%|
|  mainline      |    vm_3|   0.87%|  18.73%|  14.93%|  22.25%|  18.44%|  23.31%|  18.63%|  22.80%|
---------------------------------------------------------------------------------------------------


(Results with suggested tweaks from Andrea)

echo 0 > /sys/kernel/mm/autonuma/knuma_scand/pmd

echo 15000 > /sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs 

----------------------------------------------------------------------------------------------------
|          kernel|      vm|                               nofit|                                fit|
-                -        --------------------------------------------------------------------------
|                |        |             noksm|              ksm|            noksm|              ksm|
-                -        --------------------------------------------------------------------------
|                |        |    nothp|     thp|   nothp|     thp|   nothp|     thp|   nothp|     thp|
----------------------------------------------------------------------------------------------------
|    mainline_v36|    vm_1|   136142|  178362|  132493|  166169|  131774|  179340|  133058|  164637|
|                |    vm_2|    61143|   81943|   60998|   74195|   63725|   79530|   61916|   73183|
|                |    vm_3|    61599|   79058|   61448|   73248|   62563|   80815|   61381|   74669|
----------------------------------------------------------------------------------------------------
|     autonuma27_|    vm_1|   142023|      na|  142808|  177880|      na|  197244|  145165|  174175|
|                |    vm_2|    61071|      na|   61008|   91184|      na|   78893|   71675|   80471|
|                |    vm_3|    72646|      na|   72855|   92167|      na|   99080|   64758|   91831|
----------------------------------------------------------------------------------------------------
| improvement    |    vm_1|    4.32%|      na|   7.79%|   7.05%|      na|   9.98%|   9.10%|   5.79%|
|  from          |    vm_2|   -0.12%|      na|   0.02%|  22.90%|      na|  -0.80%|  15.76%|   9.96%|
|  mainline      |    vm_3|   17.93%|      na|  18.56%|  25.83%|      na|  22.60%|   5.50%|  22.98%|
----------------------------------------------------------------------------------------------------

Host:

    Enterprise Linux Distro
    2 NUMA nodes. 6 cores + 6 hyperthreads/node, 12 GB RAM/node.
        (total of 24 logical CPUs and 24 GB RAM) 

VMs:

    Enterprise Linux Distro
    Distro Kernel
        Main VM (VM1) -- relevant benchmark score.
            12 vCPUs

	    Either 12 GB (for the '< 1 Node' configuration, i.e. the fit case)
		 or 14 GB (for the '> 1 Node' configuration, i.e. the no-fit case)
        Noise VMs (VM2 and VM3)
            each noise VM has half of the remaining resources.
            6 vCPUs

            Either 4 GB (for '< 1 Node' configuration) or 3 GB ('> 1 Node ')
                (to sum 20 GB w/ Main VM + 4 GB for host = total 24 GB) 

Settings:

    Swapping disabled on host and VMs.
    Memory Overcommit enabled on host and VMs.
    THP on host is a variable. THP disabled on VMs.
    KSM on host is a variable. KSM disabled on VMs. 

na: refers to runs where I wasn't able to collect results.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2012-10-23 16:31 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1349308275-2174-1-git-send-email-aarcange@redhat.com>
     [not found] ` <20121004113943.be7f92a0.akpm@linux-foundation.org>
2012-10-05 23:14   ` [PATCH 00/33] AutoNUMA27 Andi Kleen
2012-10-05 23:57     ` Tim Chen
2012-10-06  0:11       ` Andi Kleen
2012-10-08 13:44         ` Don Morris
2012-10-08 20:34     ` Rik van Riel
     [not found] ` <20121011101930.GM3317@csn.ul.ie>
2012-10-11 14:56   ` Andrea Arcangeli
2012-10-11 15:35     ` Mel Gorman
2012-10-12  0:41       ` Andrea Arcangeli
2012-10-12 14:54       ` Mel Gorman
     [not found] ` <1349308275-2174-2-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011105036.GN3317@csn.ul.ie>
2012-10-11 16:07     ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
2012-10-11 19:37       ` Mel Gorman
     [not found] ` <1349308275-2174-5-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011110137.GQ3317@csn.ul.ie>
2012-10-11 16:43     ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
2012-10-11 19:48       ` Mel Gorman
     [not found] ` <1349308275-2174-6-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011111545.GR3317@csn.ul.ie>
2012-10-11 16:58     ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
2012-10-11 19:54       ` Mel Gorman
     [not found] ` <1349308275-2174-7-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011122255.GS3317@csn.ul.ie>
2012-10-11 17:05     ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
2012-10-11 20:01       ` Mel Gorman
     [not found] ` <1349308275-2174-8-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011122827.GT3317@csn.ul.ie>
2012-10-11 17:15     ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
2012-10-11 20:06       ` Mel Gorman
     [not found]     ` <5076E4B2.2040301@redhat.com>
     [not found]       ` <0000013a525a8739-2b4049fa-1cb3-4b8f-b3a7-1fa77b181590-000000@email.amazonses.com>
2012-10-12  0:52         ` Andrea Arcangeli
     [not found] ` <1349308275-2174-9-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011134643.GU3317@csn.ul.ie>
2012-10-11 17:34     ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
2012-10-11 20:17       ` Mel Gorman
     [not found] ` <1349308275-2174-11-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011145805.GW3317@csn.ul.ie>
2012-10-12  0:25     ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
2012-10-12  8:29       ` Mel Gorman
     [not found] ` <20121011213432.GQ3317@csn.ul.ie>
2012-10-12  1:45   ` [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
2012-10-12  8:46     ` Mel Gorman
     [not found] ` <1349308275-2174-16-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121011155302.GA3317@csn.ul.ie>
     [not found]     ` <50770314.7060800@redhat.com>
     [not found]       ` <20121011175953.GT1818@redhat.com>
2012-10-12 14:03         ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Rik van Riel
2012-10-13 18:40 ` [PATCH 00/33] AutoNUMA27 Srikar Dronamraju
2012-10-14  4:57   ` Andrea Arcangeli
2012-10-15  8:16     ` Srikar Dronamraju
2012-10-23 16:32     ` Srikar Dronamraju
     [not found] ` <1349308275-2174-20-git-send-email-aarcange@redhat.com>
     [not found]   ` <20121013180618.GC31442@linux.vnet.ibm.com>
2012-10-15  8:24     ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Srikar Dronamraju
2012-10-15  9:20       ` Mel Gorman
2012-10-15 10:00         ` Srikar Dronamraju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).