* [PATCH 00/33] AutoNUMA27
@ 2012-10-03 23:50 Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
                   ` (37 more replies)
  0 siblings, 38 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Hello everyone,

This is a new AutoNUMA27 release for Linux v3.6.

I believe this autonuma version addresses all of the review comments I
got upstream. The patch set has undergone a huge series of changes,
including switching the page migration implementation to synchronous,
reducing the memory overhead to a minimum, and adding internal
documentation, external documentation and benchmarking. I'm grateful
for all the reviews and contributions, including those from Rik,
Karen, Avi, Peter, Konrad, Hillf and all others, plus all the runtime
feedback received (bug reports, KVM benchmarks, etc.).

The last 4 months were fully dedicated to answering the upstream review.

Linus, Andrew, please review: the handful of performance results below
show we're in excellent shape for inclusion. Further changes such as
transparent huge page native migration and more are expected, but at
this point I would ask you to accept the current series; further
changes will be added in traditional gradual steps.

====

The objective of AutoNUMA is to provide out-of-the-box performance as
close as possible to (and potentially faster than) manual NUMA hard
bindings.

It is not very intrusive into the kernel core and is well structured
into separate source modules.

AutoNUMA was extensively tested against 3.x upstream kernels and other
NUMA placement algorithms such as numad (in userland through cpusets)
and schednuma (in kernel too) and was found superior in all cases.

Most importantly: not a single benchmark has shown a regression yet
when compared to vanilla kernels, not even on the 2-node systems where
the NUMA effects are less significant.

=== Some benchmark results ===

Key to the kernels used in the testing:

- 3.6.0         = upstream 3.6.0 kernel
- 3.6.0numactl  = 3.6.0 kernel with numactl hard NUMA bindings
- autonuma26MoF = previous autonuma version, based on the 3.6.0-rc7 kernel

== specjbb multi instance, 4 nodes, 4 instances ==

autonuma26MoF outperforms 3.6.0 by 11%, while 3.6.0numactl provides an
additional 9% increase.

3.6.0numactl:
Per-node process memory usage (in MBs):
             PID             N0             N1             N2             N3
      ----------     ----------     ----------     ----------     ----------
           38901        3075.56           0.54           0.07           7.53
           38902           1.31           0.54        3065.37           7.53
           38903           1.31           0.54           0.07        3070.10
           38904           1.31        3064.56           0.07           7.53

autonuma26MoF:
Per-node process memory usage (in MBs):
             PID             N0             N1             N2             N3
      ----------     ----------     ----------     ----------     ----------
            9704          94.85        2862.37          50.86         139.35
            9705          61.51          20.05        2963.78          40.62
            9706        2941.80          11.68         104.12           7.70
            9707          35.02          10.62           9.57        3042.25

== specjbb multi instance, 4 nodes, 8 instances (x2 CPU overcommit) ==

This verifies AutoNUMA converges with x2 overcommit too.

autonuma26MoF nmstat every 10sec:
Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7410         335.48        2369.66         194.18         191.28
           7411          50.09         100.95        2935.93          56.50
           7412        2907.98          66.71          33.71          68.93
           7413          46.70          31.59          24.24        2974.60
           7426        1493.34        1156.18         221.60         217.93
           7427         398.18         176.94         269.14        2237.49
           7428        1028.12        1471.29         202.76         366.44
           7430         126.81         451.92        2270.37         242.75
Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7410           4.09        3047.02          20.87          18.79
           7411          24.11          75.70        3012.76          32.99
           7412        3061.95          28.88          13.70          36.88
           7413          12.71           7.56          14.18        3042.85
           7426        2521.48         402.80          87.61          77.32
           7427         148.09          79.34          87.43        2767.11
           7428         279.48        2598.05          71.96         119.30
           7430          25.45         109.46        2912.09          45.03
Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7410           2.09        3057.18          16.88          14.78
           7411           8.13           4.96        3111.52          21.01
           7412        3115.94           6.91           7.71          10.92
           7413          10.23           3.53           4.20        3059.49
           7426        2982.48          63.19          32.25          11.41
           7427          68.05          21.32          47.80        2944.93
           7428          65.80        2931.43          45.93          25.73
           7430          13.56          49.91        3007.72          20.99
Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7410           2.08        3128.38          15.55           9.05
           7411           6.13           0.96        3119.53          19.14
           7412        3124.12           3.03           5.56           8.92
           7413           8.27           4.91           5.61        3130.11
           7426        3035.93           7.08          17.30          29.37
           7427          24.12           6.89           7.85        3043.63
           7428          13.77        3022.68          23.95           8.94
           7430           2.25          39.51        3044.04           6.68

== specjbb, 4 nodes, 4 instances, but start instances 1 and 2 first,
wait for them to converge, then start instances 3 and 4 under numactl
over the nodes that AutoNUMA picked to converge instances 1 and 2 ==

This verifies AutoNUMA plays along nicely with NUMA hard binding
syscalls.

autonuma26MoF nmstat every 10sec:
Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7756         426.33        1171.21         470.66        1063.76
           7757        1254.48         152.09        1415.17         244.25

Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7756         342.42        1070.75         364.70        1354.14
           7757        1260.54         152.10        1411.19         242.29
           7883           4.30        2915.12           2.93           0.00
           7884           4.30           2.21        2919.59           0.02

Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7756         318.39        1036.31         348.68        1428.66
           7757        1733.25          96.77        1075.89         160.24
           7883           4.30        2975.99           2.93           0.00
           7884           4.30           2.21        2989.96           0.02

Per-node process memory usage (in MBs):
            PID             N0             N1             N2             N3
     ----------     ----------     ----------     ----------     ----------
           7756          35.22          42.48          18.96        3035.60
           7757        3027.93           6.63          25.67           6.21
           7883           4.30        3064.35           2.93           0.00
           7884           4.30           2.21        3074.38           0.02

From the last nmstat we can't even tell which pids were run under
numactl and which were not. You can only tell by reading the first
nmstat: pids 7756 and 7757 were the two processes not run under
numactl.

pid 7756 and 7757 memory and CPUs were decided by AutoNUMA.

pid 7883 and 7884 never ran outside of node N1 and N2 respectively
because of the numactl binds.

== stream modified to run each instance for ~5min ==

Objective: compare autonuma26MoF against itself with CPU and NUMA
bindings

By running 1/4/8/16/32 tasks, we also verified that the idle balancing
is done well, maxing out all memory bandwidth.

Result is "PASS" if the performance of the kernel without bindings is
within -10% and +5% of CPU and NUMA bindings.

The upstream result is FAIL (worst DIFF is -33%, best DIFF is +1%).

The autonuma26MoF result is PASS (worst DIFF is -2%, best DIFF is +2%).

The autonuma26MoF raw numbers for this test are appended at the end
of this email.

== iozone ==

                     ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
====--------------------------------------------------------------------------------------------------------------
noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825

AutoNUMA can't help much for I/O loads, but you can see a small
improvement there too. The important thing for I/O loads is to verify
that there is no regression.

== autonuma benchmark 2 nodes & 8 nodes ==

 http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120817.pdf

== autonuma27 ==

 git clone --reference linux -b autonuma27 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Real time updated development autonuma branch:

 git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

To update:

 git fetch && git checkout -f origin/autonuma

Andrea Arcangeli (32):
  autonuma: make set_pmd_at always available
  autonuma: export is_vma_temporary_stack() even if
    CONFIG_TRANSPARENT_HUGEPAGE=n
  autonuma: define _PAGE_NUMA
  autonuma: pte_numa() and pmd_numa()
  autonuma: teach gup_fast about pmd_numa
  autonuma: mm_autonuma and task_autonuma data structures
  autonuma: define the autonuma flags
  autonuma: core autonuma.h header
  autonuma: CPU follows memory algorithm
  autonuma: add the autonuma_last_nid in the page structure
  autonuma: Migrate On Fault per NUMA node data
  autonuma: autonuma_enter/exit
  autonuma: call autonuma_setup_new_exec()
  autonuma: alloc/free/init task_autonuma
  autonuma: alloc/free/init mm_autonuma
  autonuma: prevent select_task_rq_fair to return -1
  autonuma: teach CFS about autonuma affinity
  autonuma: memory follows CPU algorithm and task/mm_autonuma stats
    collection
  autonuma: default mempolicy follow AutoNUMA
  autonuma: call autonuma_split_huge_page()
  autonuma: make khugepaged pte_numa aware
  autonuma: retain page last_nid information in khugepaged
  autonuma: split_huge_page: transfer the NUMA type from the pmd to the
    pte
  autonuma: numa hinting page faults entry points
  autonuma: reset autonuma page data when pages are freed
  autonuma: link mm/autonuma.o and kernel/sched/numa.o
  autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  autonuma: page_autonuma
  autonuma: bugcheck page_autonuma fields on newly allocated pages
  autonuma: boost khugepaged scanning rate
  autonuma: add migrate_allow_first_fault knob in sysfs
  autonuma: add mm_autonuma working set estimation

Karen Noel (1):
  autonuma: add Documentation/vm/autonuma.txt

 Documentation/vm/autonuma.txt        |  364 +++++++++
 arch/Kconfig                         |    3 +
 arch/x86/Kconfig                     |    1 +
 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   65 ++-
 arch/x86/include/asm/pgtable_types.h |   20 +
 arch/x86/mm/gup.c                    |   13 +-
 fs/exec.c                            |    7 +
 include/asm-generic/pgtable.h        |   12 +
 include/linux/autonuma.h             |   57 ++
 include/linux/autonuma_flags.h       |  159 ++++
 include/linux/autonuma_sched.h       |   59 ++
 include/linux/autonuma_types.h       |  126 +++
 include/linux/huge_mm.h              |    6 +-
 include/linux/mm_types.h             |    5 +
 include/linux/mmzone.h               |   23 +
 include/linux/page_autonuma.h        |   50 ++
 include/linux/sched.h                |    3 +
 init/main.c                          |    2 +
 kernel/fork.c                        |   18 +
 kernel/sched/Makefile                |    1 +
 kernel/sched/core.c                  |    1 +
 kernel/sched/fair.c                  |   82 ++-
 kernel/sched/numa.c                  |  638 +++++++++++++++
 kernel/sched/sched.h                 |   19 +
 mm/Kconfig                           |   17 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1414 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                     |   96 +++-
 mm/memory.c                          |   10 +
 mm/mempolicy.c                       |   12 +-
 mm/mmu_context.c                     |    3 +
 mm/page_alloc.c                      |    7 +-
 mm/page_autonuma.c                   |  237 ++++++
 mm/sparse.c                          |  126 +++-
 35 files changed, 3631 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/vm/autonuma.txt
 create mode 100644 include/linux/autonuma.h
 create mode 100644 include/linux/autonuma_flags.h
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 include/linux/autonuma_types.h
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 kernel/sched/numa.c
 create mode 100644 mm/autonuma.c
 create mode 100644 mm/page_autonuma.c

== Changelog from AutoNUMA24 to AutoNUMA27 ==

o Migrate On Fault

   At the mm mini summit there was some discussion about whether
   asynchronous migration is really needed in AutoNUMA. Peter pointed
   out that asynchronous migration could be removed without adverse
   performance effects, and that doing so would save lots of memory.

   So over the last few weeks asynchronous migration was removed and
   replaced with an ad-hoc Migrate On Fault implementation (one that
   doesn't require altering the migrate.c API).

   All CPU/memory NUMA placement decisions remained identical: the
   only change is that instead of adding a page to a migration LRU
   list and returning to userland immediately, AutoNUMA calls
   migrate_pages() before returning to userland.

   Peter was right: we found Migrate On Fault didn't degrade
   performance significantly. Migrate on Fault seems more cache
   friendly too.

   Also note: after the workload converged, all memory migration stops
   so it cannot make any difference after that.

   With Migrate On Fault, the memory cost of AutoNUMA has been reduced
   to 2 bytes per page.
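
   To make the flow above concrete, here is a minimal sketch of where
   the synchronous migration sits in the NUMA hinting fault path. It
   is not the actual mm/autonuma.c code: should_migrate() and
   new_autonuma_page() are hypothetical helpers standing in for the
   last_nid heuristic and the target-node allocation callback, and the
   migrate_pages() call assumes the 3.6-era signature.

       static void numa_hinting_fault_migrate(struct page *page, int this_nid)
       {
               LIST_HEAD(pagelist);

               if (page_to_nid(page) == this_nid)
                       return;  /* already local, nothing to migrate */
               if (!should_migrate(page, this_nid))
                       return;  /* last_nid heuristic said not yet */
               if (isolate_lru_page(page))
                       return;  /* couldn't isolate from the LRU, keep remote copy */
               list_add(&page->lru, &pagelist);

               /* synchronous migration before returning to userland */
               if (migrate_pages(&pagelist, new_autonuma_page,
                                 this_nid, false, MIGRATE_SYNC))
                       putback_lru_pages(&pagelist);
       }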

o Share the same pmd/pte bitflag (8) for both _PAGE_PROTNONE and
  _PAGE_NUMA. This means pte_numa/pmd_numa cannot be used anymore in
  code paths where mprotect(PROT_NONE) faults could trigger. Luckily
  the paths are mutually exclusive, and mprotect(PROT_NONE) regions
  cannot reach handle_mm_fault(), so no special checks on
  vma->vm_page_prot are required to find out whether it's a
  pte/pmd_numa or an mprotect(PROT_NONE) fault.

  This doesn't provide any runtime benefit but it leaves _PAGE_PAT
  free for different usage in the future, so it looks cleaner.

o New overview document added in Documentation/vm/autonuma.txt

o Lockless NUMA hinting page faults.

    Migrate On Fault needs to block and schedule within the context of
    the NUMA hinting page faults. So the VM locks must be dropped
    before the NUMA hinting page fault starts.

    This is a worthwhile change for the asynchronous migration code
    too, and it's included in an unofficial "dead" autonuma26 branch
    (the last release with asynchronous migration).

o kmap bugfix for 32bit archs in __pmd_numa_fixup (nop for x86-64)

o Converted knuma_scand to use pmd_trans_huge_lock() cleaner API.

o Fixed a kernel crash on an 8 node system during a heavy infiniband
  load if knuma_scand encounters an unstable pmd (a pmd_trans_unstable
  check was needed as knuma_scand holds the mmap_sem only for
  reading). The workload must have been using madvise(MADV_DONTNEED).

o Skip PROT_NONE regions during the knuma_scand scanning. Now that we
  share the same bitflag for mprotect(PROT_NONE) and pte/pmd_numa(),
  knuma_scand couldn't distinguish between a pte/pmd_numa and a
  PROT_NONE range during its pass unless it checks the vm_flags and
  skips the region. It wouldn't be fatal for knuma_scand to scan a
  PROT_NONE range, but it's not worth it.

o Removed the sub-directories from /sys/kernel/mm/autonuma/ (all sysfs
  files are in the same autonuma/ directory now). It looked cleaner
  this way after removing the knuma_migrated/ directory, now that the
  only kernel daemon left is knuma_scand. This also exposes fewer
  implementation details through the sysfs interface, which is a
  bonus.

o All "tuning" config tweaks in sysfs are visible only if
  CONFIG_DEBUG_VM=y.

o Lots of cleanups and minor optimizations (better variable names
  etc..).

o The ppc64 support is not included in this upstream submission until
  Ben is happy with it (but it's still included in the git branch).

== Changelog from AutoNUMA19 to AutoNUMA24 ==

o Improved lots of comments and header commit messages.

o Rewritten from scratch the comment at the top of kernel/sched/numa.c
  as the old comment wasn't well received in upstream reviews. Tried
  to describe the algorithm from a global view now.

o Added ppc64 support.

o Improved patch splitup.

o Lots of code cleanups and variable renames to make the code more readable.

o Try to take advantage of task_autonuma_nid before the knuma_scand is
  complete.

o Moved some performance tuning sysfs tweaks under DEBUG_VM so they
  won't be visible on production kernels.

o Enabled by default the working set mode for the mm_autonuma data
  collection.

o Halved the size of the mm_autonuma structure.

o scan_sleep_pass_millisecs is now more intuitive (you can set it to
  10000 to mean one pass every 10 sec; in the previous release it had
  to be set to 5000 for one pass every 10 sec).
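
  For example (the exact sysfs path is an assumption; in the current
  release the tuning files sit flat under /sys/kernel/mm/autonuma/ and
  are only visible with CONFIG_DEBUG_VM=y):

    echo 10000 > /sys/kernel/mm/autonuma/scan_sleep_pass_millisecs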

o Removed PF_THREAD_BOUND to allow CPU isolation. Turned the VM_BUG_ON
  verifying the hard binding into a WARN_ON_ONCE so the knuma_migrated
  can be moved by root anywhere safely.

o Optimized autonuma_possible() to avoid checking num_possible_nodes()
  every time.

o Added the math on the last_nid statistical effects from sched-numa
  rewrite which also introduced the last_nid logic of AutoNUMA.

o Now handle systems with holes in the NUMA nodemask. Lots of
  num_possible_nodes() uses were replaced with nr_node_ids (not such a
  nice name for that information).

o Fixed a bug affecting KSM. KSM failed to merge pages mapped with a
  pte_numa pte; now it passes LTP fine.

o More...

== Changelog from AutoNUMA-alpha14 to AutoNUMA19 ==

o sched_autonuma_balance callout location removed from schedule(); now
  it runs in the softirq along with CFS load balancing

o lots of documentation about the math in the sched_autonuma_balance algorithm

o fixed a bug in the fast path detection in sched_autonuma_balance that could
  decrease performance with many nodes

o reduced the page_autonuma memory overhead from 32 to 12 bytes per page

o fixed a crash in __pmd_numa_fixup

o knuma_numad won't scan VM_MIXEDMAP|PFNMAP (it never touched those ptes
  anyway)

o fixed a crash in autonuma_exit

o fixed a crash when split_huge_page returns 0 in knuma_migratedN as the page
  has been freed already

o assorted cleanups and probably more

== Changelog from alpha13 to alpha14 ==

o page_autonuma introduction, no memory wasted if the kernel is booted
  on not-NUMA hardware. Tested with flatmem/sparsemem on x86
  autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
  "noautonuma" kernel param disables autonuma permanently also when
  booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
  page_autonuma allocations, like cgroup_disable=memory)

o autonuma_balance only runs along with run_rebalance_domains, to
  avoid altering the usual scheduler runtime. autonuma_balance gives a
  "kick" to the scheduler after a rebalance (it overrides the load
  balance activity if needed). It's not yet tested on specjbb or more
  scheduling-intensive benchmarks, so hopefully there's no NUMA
  regression. For intensive compute loads not involving a flood of
  scheduling activity this doesn't show any performance regression,
  and it avoids altering the strict scheduler performance. It goes in
  the direction of being less intrusive with the stock scheduler
  runtime.

  Note: autonuma_balance still runs from normal context (not softirq
  context like run_rebalance_domains) to be able to wait on process
  migration (avoid _nowait), but most of the time it does nothing at
  all.

== Changelog from alpha11 to alpha13 ==

o autonuma_balance optimization (take the fast path when process is in
  the preferred NUMA node)

== TODO ==

o THP native migration (orthogonal and also needed for
  cpuset/migrate_pages(2)/numa/sched).

o powerpc has open issues to address. As a result of this work Ben
  found that other archs (not only some powerpc variants) didn't
  implement PROT_NONE properly. Sharing the same pte/pmd bit for
  _PAGE_NUMA and _PAGE_PROTNONE is quite handy, as the code paths of
  the two features are mutually exclusive, so they don't step on each
  other's toes.


== stream benchmark: autonuma26MoF vs CPU/NUMA bindings ==

NUMA is Enabled.  # of nodes              = 4 nodes (0-3)

RESULTS: (MBs/sec) (higher is better)

               |                                   S C H E D U L I N G        M O D E                                   |                       | AFFINITY  |
               |                                                                                                        |  DEFAULT COMPARED TO  |COMPARED TO|
               |            DEFAULT                            CPU AFFINITY                      NUMA AFFINITY          | AFFINITY      NUMA    |   NUMA    |
NUMBER |       |                              AVG |                              AVG |                              AVG |                       |           |
  OF   |STREAM |                             WALL |                             WALL |                             WALL | %   TEST  | %   TEST  | %   TEST  |
STREAMS|FUNCT  |   TOTAL   AVG  STDEV  SCALE  CLK |   TOTAL   AVG  STDEV  SCALE  CLK |   TOTAL   AVG  STDEV  SCALE  CLK |DIFF STATUS|DIFF STATUS|DIFF STATUS|
-------+-------+----------------------------------+----------------------------------+----------------------------------+-----------+-----------+-----------+
    1  | Add   |    5496  5496    0.0     -  1606 |    5480  5480    0.0     -  1572 |    5477  5477    0.0     -  1571 |   0  PASS |   0  PASS |   0  PASS |
    1  | Copy  |    4411  4411    0.0     -  1606 |    4522  4522    0.0     -  1572 |    4521  4521    0.0     -  1571 |  -2  PASS |  -2  PASS |   0  PASS |
    1  | Scale |    4417  4417    0.0     -  1606 |    4510  4510    0.0     -  1572 |    4514  4514    0.0     -  1571 |  -2  PASS |  -2  PASS |   0  PASS |
    1  | Triad |    5338  5338    0.0     -  1606 |    5308  5308    0.0     -  1572 |    5306  5306    0.0     -  1571 |   1  PASS |   1  PASS |   0  PASS |
    1  |   ALL |    4950  4950    0.0     -  1606 |    4987  4987    0.0     -  1572 |    4990  4990    0.0     -  1571 |  -1  PASS |  -1  PASS |   0  PASS |
    1  | A_OLD |    4916  4916    0.0     -  1606 |    4955  4955    0.0     -  1572 |    4954  4954    0.0     -  1571 |  -1  PASS |  -1  PASS |   0  PASS |

    4  | Add   |   22432  5608   81.3    4.1 1574 |   22344  5586   35.1    4.1 1562 |   22244  5561   41.8    4.1 1552 |   0  PASS |   1  PASS |   0  PASS |
    4  | Copy  |   18280  4570   65.8    4.1 1574 |   18332  4583   50.1    4.1 1562 |   18392  4598   19.5    4.1 1552 |   0  PASS |  -1  PASS |   0  PASS |
    4  | Scale |   18300  4575   63.1    4.1 1574 |   18328  4582   45.0    4.1 1562 |   18344  4586   31.9    4.1 1552 |   0  PASS |   0  PASS |   0  PASS |
    4  | Triad |   21700  5425   66.2    4.1 1574 |   21664  5416   42.7    4.1 1562 |   21560  5390   43.2    4.1 1552 |   0  PASS |   1  PASS |   0  PASS |
    4  |   ALL |   20256  5064   71.2    4.1 1574 |   20232  5058   50.3    4.1 1562 |   20204  5051   34.3    4.0 1552 |   0  PASS |   0  PASS |   0  PASS |
    4  | A_OLD |   20176  5044  495.9    4.1 1574 |   20168  5042  479.8    4.1 1562 |   20136  5034  461.8    4.1 1552 |   0  PASS |   0  PASS |   0  PASS |

    8  | Add   |   43568  5446    9.3    7.9 1614 |   43344  5418   36.5    7.9 1594 |   43144  5393   58.9    7.9 1614 |   1  PASS |   1  PASS |   0  PASS |
    8  | Copy  |   36216  4527   64.8    8.2 1614 |   36200  4525   71.6    8.0 1594 |   35904  4488  104.9    7.9 1614 |   0  PASS |   1  PASS |   1  PASS |
    8  | Scale |   36496  4562   53.1    8.3 1614 |   36528  4566   47.0    8.1 1594 |   36272  4534   83.6    8.0 1614 |   0  PASS |   1  PASS |   1  PASS |
    8  | Triad |   42600  5325   33.9    8.0 1614 |   42496  5312   48.4    8.0 1594 |   42272  5284   73.6    8.0 1614 |   0  PASS |   1  PASS |   1  PASS |
    8  |   ALL |   39640  4955   60.3    8.0 1614 |   39680  4960   55.2    8.0 1594 |   39448  4931   77.8    7.9 1614 |   0  PASS |   0  PASS |   1  PASS |
    8  | A_OLD |   39720  4965  431.9    8.1 1614 |   39640  4955  421.2    8.0 1594 |   39400  4925  429.2    8.0 1614 |   0  PASS |   1  PASS |   1  PASS |

   16  | Add   |   69216  4326  190.2   12.6 2002 |   67600  4225   23.7   12.3 1991 |   67616  4226   16.1   12.3 1989 |   2  PASS |   2  PASS |   0  PASS |
   16  | Copy  |   58800  3675  194.1   13.3 2002 |   57408  3588   19.3   12.7 1991 |   57504  3594   17.6   12.7 1989 |   2  PASS |   2  PASS |   0  PASS |
   16  | Scale |   60048  3753  135.5   13.6 2002 |   58976  3686   23.2   13.1 1991 |   58992  3687   19.1   13.1 1989 |   2  PASS |   2  PASS |   0  PASS |
   16  | Triad |   67648  4228  157.9   12.7 2002 |   66304  4144   17.9   12.5 1991 |   66176  4136   11.1   12.5 1989 |   2  PASS |   2  PASS |   0  PASS |
   16  |   ALL |   63648  3978  141.9   12.9 2002 |   62480  3905   13.8   12.5 1991 |   62480  3905   12.1   12.5 1989 |   2  PASS |   2  PASS |   0  PASS |
   16  | A_OLD |   63936  3996  332.3   13.0 2002 |   62576  3911  280.2   12.6 1991 |   62576  3911  276.8   12.6 1989 |   2  PASS |   2  PASS |   0  PASS |

   32  | Add   |   75968  2374   13.4   13.8 3562 |   75840  2370   14.1   13.8 3562 |   75840  2370   17.3   13.8 3562 |   0  PASS |   0  PASS |   0  PASS |
   32  | Copy  |   64032  2001    8.3   14.5 3562 |   64224  2007    2.0   14.2 3562 |   64160  2005    9.8   14.2 3562 |   0  PASS |   0  PASS |   0  PASS |
   32  | Scale |   65376  2043   16.7   14.8 3562 |   65248  2039   14.4   14.5 3562 |   65440  2045   21.1   14.5 3562 |   0  PASS |   0  PASS |   0  PASS |
   32  | Triad |   74144  2317   13.5   13.9 3562 |   74048  2314    7.7   14.0 3562 |   74400  2325   28.5   14.0 3562 |   0  PASS |   0  PASS |   0  PASS |
   32  |   ALL |   69440  2170    7.6   14.0 3562 |   69248  2164    2.4   13.9 3562 |   69440  2170   13.5   13.9 3562 |   0  PASS |   0  PASS |   0  PASS |
   32  | A_OLD |   69888  2184  164.9   14.2 3562 |   69824  2182  162.2   14.1 3562 |   69952  2186  164.6   14.1 3562 |   0  PASS |   0  PASS |   0  PASS |

Test Acceptance Ranges:
    Default vs CPU Affinity/NUMA:  FAIL outside [-25, 10],  WARN outside [-10,  5],  PASS within [-10,  5]
    CPU Affinity vs NUMA:          FAIL outside [-10, 10],  WARN outside [ -5,  5],  PASS within [ -5,  5]

Results: PASS


* [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 10:50   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
                   ` (36 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

From: Karen Noel <knoel@redhat.com>

Documentation of the AutoNUMA design.

Signed-off-by: Karen Noel <knoel@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/autonuma.txt |  364 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 364 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vm/autonuma.txt

diff --git a/Documentation/vm/autonuma.txt b/Documentation/vm/autonuma.txt
new file mode 100644
index 0000000..70d40b8
--- /dev/null
+++ b/Documentation/vm/autonuma.txt
@@ -0,0 +1,364 @@
+= AutoNUMA Documentation =
+
+Table of Contents:
+
+    I:   Introduction to AutoNUMA
+    II:  AutoNUMA Daemons and Algorithms
+         knuma_scand - the page scanning daemon
+         NUMA hinting fault
+         Migrate-on-fault
+         sched_autonuma_balance - the AutoNUMA balance routine
+         Scheduler load balancing
+    III: AutoNUMA Data Structures
+         mm_autonuma - per process mm AutoNUMA data
+         task_autonuma - per task AutoNUMA data
+         page_autonuma - per page AutoNUMA data
+         pte and pmd - NUMA flags
+    IV:  Definition of AutoNUMA "Active"
+    V:   AutoNUMA Flags
+
+== I: Introduction to AutoNUMA ==
+
+AutoNUMA was introduced to the Linux kernel to improve the performance
+of applications running on NUMA hardware systems. The fundamental
+principle is that an application will perform best when the threads of
+its processes are accessing memory on the same NUMA node on which the
+threads are scheduled.
+
+AutoNUMA moves tasks, which can be threads or processes, closer to the
+memory they are accessing. It also moves application data to memory
+closer to the tasks that reference it. This is all done automatically
+by the kernel when AutoNUMA is active. (See section IV for the
+definition of when AutoNUMA is active.)
+
+The following daemons are started and algorithms executed only if
+AutoNUMA is active on the system. No memory is allocated for AutoNUMA
+data structures if AutoNUMA is not active at boot time.
+
+== II: AutoNUMA Daemons and Algorithms ==
+
+The following sections describe the basic flow, or chain reaction, of
+AutoNUMA events.
+
+=== knuma_scand - the page scanning daemon ===
+
+The AutoNUMA logic is a chain reaction resulting from the actions of
+the AutoNUMA daemon, knuma_scand. The knuma_scand daemon periodically
+scans the mm structures of all active processes. It gathers the
+AutoNUMA mm statistics for each "anon" page in the process's working
+set. While scanning, knuma_scand also sets the NUMA bit and clears the
+present bit in each pte or pmd that was counted. This triggers NUMA
+hinting page faults described next.
+
+The mm statistics are exponentially decayed by dividing the total memory
+in half and adding the new totals to the decayed values for each
+knuma_scand pass. This causes the mm statistics to resemble a simple
+forecasting model, taking into account some past working set data.
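+
+As a minimal illustration of the decay (not the actual kernel code;
+the names below are made up for this example, only nr_node_ids is a
+real kernel symbol):
+
+    static void decay_mm_stats(unsigned long *mm_pages_node,
+                               unsigned long *pass_pages_node,
+                               unsigned long *mm_pages_total,
+                               unsigned long pass_pages_total)
+    {
+        int nid;
+
+        /* at the end of each knuma_scand pass, for the scanned mm */
+        *mm_pages_total = *mm_pages_total / 2 + pass_pages_total;
+        for (nid = 0; nid < nr_node_ids; nid++)
+            mm_pages_node[nid] = mm_pages_node[nid] / 2 +
+                                 pass_pages_node[nid];
+    }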
+
+=== NUMA hinting fault ===
+
+A NUMA hinting fault occurs when a task running on a CPU thread
+accesses a vma whose pte or pmd is not present and the NUMA bit is
+set. The NUMA hinting page fault handler returns the pte or pmd back
+to its present state and counts the fault's occurrence in the
+task_autonuma structure.
+
+The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
+
+- Increments the total number of pages faulted for this task
+
+- Increments the number of pages faulted on the current NUMA node
+
+- If the fault was for a hugepage, the number of subpages represented
+  by a hugepage is added to the task statistics above
+
+- Each time the NUMA hinting page fault discovers that another
+  knuma_scand pass has occurred, it divides the total number of pages
+  and the pages for each NUMA node in half. This causes the task
+  statistics to be exponentially decayed, just as the mm statistics
+  are. Thus, the task statistics also resemble a simple forecasting
+  model, taking into account some past NUMA hinting fault data.
+
+If the page being accessed is on the current NUMA node (same as the
+task), the NUMA hinting fault handler only records the nid of the
+current NUMA node in the page_autonuma structure field last_nid and
+then it is done.
+
+Otherwise, it checks if the nid of the current NUMA node matches the
+last_nid in the page_autonuma structure. If it matches, this is the
+second NUMA hinting fault for the page (occurring on a subsequent
+pass of the knuma_scand daemon) from the current NUMA node. So if it
+matches, the NUMA hinting fault handler migrates the contents of the
+page to a new page on the current NUMA node.
+
+If the NUMA node accessing the page does not match last_nid, then
+last_nid is reset to the current NUMA node (since it is considered the
+first fault again).
+
+Note: You can clear a flag (AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT) which
+causes the page to be migrated on the second NUMA hinting fault
+instead of the very first one for a newly allocated page.
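+
+As a minimal sketch of the last_nid logic described above (the helper
+name migrate_page_to_node() and the exact field layout are only
+illustrative, not the real mm/autonuma.c code):
+
+    static void last_nid_heuristic(struct page_autonuma *pa,
+                                   struct page *page,
+                                   int page_nid, int this_nid)
+    {
+        if (page_nid == this_nid) {
+            pa->last_nid = this_nid;          /* local access: record, done */
+        } else if (pa->last_nid == this_nid) {
+            /* second remote fault in a row from this node: migrate */
+            migrate_page_to_node(page, this_nid);
+        } else {
+            pa->last_nid = this_nid;          /* treat as a first fault again */
+        }
+    }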
+
+=== Migrate-on-Fault (MoF) ===
+
+If the migrate-on-fault logic is active and the NUMA hinting fault
+handler determines that the page should be migrated, a new page is
+allocated on the current NUMA node and the data is copied from the
+previous page on the remote node to the new page. The associated pte
+or pmd is modified to reference the pfn of the new page, and the
+previous page is freed to the LRU of its NUMA node. See routine
+migrate_pages() in mm/migrate.c.
+
+If no page is available on the current NUMA node or I/O is in progress
+on the page, it is not migrated and the task continues to reference
+the remote page.
+
+=== sched_autonuma_balance - the AutoNUMA balance routine ===
+
+The AutoNUMA balance routine is responsible for deciding which NUMA
+node is the best for running the current task and potentially which
+task on the remote node it should be exchanged with. It uses the mm
+statistics collected by the knuma_scand daemon and the task statistics
+collected by the NUMA hinting fault to make this decision.
+
+The AutoNUMA balance routine is invoked as part of the scheduler load
+balancing code. It exchanges the task on the current CPU's run queue
+with a current task from a remote NUMA node if that exchange would
+result in the tasks running with a smaller percentage of cross-node
+memory accesses. Because the balance routine involves only running
+tasks, it is only invoked when the scheduler is not idle
+balancing. This means that the CFS scheduler is in control of
+scheduling decisions and can move tasks to idle threads on any NUMA
+node based on traditional or new policies.
+
+The following defines "memory weight" and "task weight" in the
+AutoNUMA balance routine's algorithms.
+
+- memory weight = % of total memory from the NUMA node. Uses mm
+                  statistics collected by the knuma_scand daemon.
+
+- task weight = % of total memory faulted on the NUMA node. Uses task
+                statistics collected by the NUMA hinting fault.
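+
+Expressed as a small illustration (the variable names are made up,
+they are not the real fields):
+
+    /* both weights are percentages in the 0-100 range */
+    memory_weight = mm_pages_on_node[nid]    * 100 / mm_pages_total;
+    task_weight   = task_faults_on_node[nid] * 100 / task_faults_total;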
+
+=== task_selected_nid - The AutoNUMA preferred NUMA node ===
+
+The AutoNUMA balance routine first determines which NUMA node the
+current task has the most affinity to run on, based on the maximum
+task weight and memory weight for each NUMA node. If both max values
+are for the same NUMA node, that node's nid is stored in
+task_selected_nid.
+
+If the selected nid is the current NUMA node, the AutoNUMA balance
+routine is finished and does not proceed to compare tasks on other
+NUMA nodes.
+
+If the selected nid is not the current NUMA node, a task exchange is
+possible as described next. (Note that the task exchange algorithm
+might update task_selected_nid to a different NUMA node.)
+
+=== Task exchange ===
+
+The following defines "weight" in the AutoNUMA balance routine's
+algorithm.
+
+If the tasks are threads of the same process:
+
+    weight = task weight for the NUMA node (since memory weights are
+             the same)
+
+If the tasks are not threads of the same process:
+
+    weight = memory weight for the NUMA node (prefer to move the task
+             to the memory)
+
+The following algorithm determines if the current task will be
+exchanged with a running task on a remote NUMA node:
+
+    this_diff: Weight of the current task on the remote NUMA node
+               minus its weight on the current NUMA node (only used if
+               a positive value). How much does the current task
+               prefer to run on the remote NUMA node.
+
+    other_diff: Weight of the current task on the remote NUMA node
+                minus the weight of the other task on the same remote
+                NUMA node (only used if a positive value). How much
+                does the current task prefer to run on the remote NUMA
+                node compared to the other task.
+
+    total_weight_diff = this_diff + other_diff
+
+    total_weight_diff: How favorable it is to exchange the two tasks.
+                       The pair of tasks with the highest
+                       total_weight_diff (if any) are selected for
+                       exchange.
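+
+A compact sketch of the selection rule above, where w(task, nid)
+stands for the task weight or memory weight defined earlier (the
+names are illustrative, not the real kernel/sched/numa.c code):
+
+    this_diff  = w(current, remote_nid) - w(current, this_nid);
+    other_diff = w(current, remote_nid) - w(other, remote_nid);
+    if (this_diff > 0 && other_diff > 0 &&
+        this_diff + other_diff > best_total_weight_diff) {
+        best_total_weight_diff = this_diff + other_diff;
+        best_remote_task = other;       /* best exchange candidate so far */
+    }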
+
+As mentioned above, if the two tasks are threads of the same process,
+the AutoNUMA balance routine uses the task_autonuma statistics. By
+using the task_autonuma statistics, each thread follows its own memory
+locality and they will not necessarily converge on the same node. This
+is often very desirable for processes with more threads than CPUs on
+each NUMA node.
+
+If the two tasks are not threads of the same process, the AutoNUMA
+balance routine uses the mm_autonuma statistics to calculate the
+memory weights. This way all threads of the same process converge to
+the same node, which is the one with the highest percentage of memory
+for the process.
+
+If task_selected_nid, determined above, is not the NUMA node the
+current task will be exchanged to, task_selected_nid for this task is
+updated. This causes the AutoNUMA balance routine to favor overall
+balance of the system over a single task's preference for a NUMA node.
+
+To exchange the two tasks, the AutoNUMA balance routine stops the CPU
+that is running the remote task and exchanges the tasks on the two run
+queues. Once each task has been moved to another node, closer to most
+of the memory it is accessing, any memory for that task not in the new
+NUMA node also moves to the NUMA node over time with the
+migrate-on-fault logic.
+
+=== Scheduler Load Balancing ===
+
+Load balancing, which affects fairness more than performance,
+schedules based on AutoNUMA recommendations (task_selected_nid) unless
+the flag AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG is set.
+
+The CFS load balancer uses the task's AutoNUMA task_selected_nid when
+deciding to move a task to a different run-queue or when waking it
+up. For example, idle balancing, while looking into the run-queues of
+busy CPUs, first looks for a task with task_selected_nid set to the
+NUMA node of the idle CPU. Idle balancing falls back to scheduling
+tasks without task_selected_nid set or with a different NUMA node set
+in task_selected_nid. This allows a task to move to a different NUMA
+node and its memory will follow it to the new NUMA node over time.
+
+== III: AutoNUMA Data Structures ==
+
+The following data structures are defined for AutoNUMA. All structures
+are allocated only if AutoNUMA is active (as defined in the
+introduction).
+
+=== mm_autonuma - per process mm AutoNUMA data ===
+
+The mm_autonuma structure is used to hold AutoNUMA data required for
+each mm structure. Total size: 32 bytes + 8 * # of NUMA nodes.
+
+- Link of mm structures to be scanned by knuma_scand (8 bytes)
+
+- Pointer to associated mm structure (8 bytes)
+
+- fault_pass - pass number of knuma_scand (8 bytes)
+
+- Memory NUMA statistics for this process:
+
+    Total number of anon pages in the process working set (8 bytes)
+
+    Per NUMA node number of anon pages in the process working set (8
+    bytes * # of NUMA nodes)
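+
+A rough C sketch of the layout described above (the field names are
+illustrative and padding is ignored; only the sizes mirror the
+description):
+
+    struct mm_autonuma {
+        struct mm_autonuma *scan_link;     /* knuma_scand list link   (8) */
+        struct mm_struct *mm;              /* associated mm structure (8) */
+        unsigned long fault_pass;          /* knuma_scand pass number (8) */
+        unsigned long pages_total;         /* anon pages, working set (8) */
+        unsigned long pages_node[];        /* per NUMA node         (8*N) */
+    };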
+
+=== task_autonuma - per task AutoNUMA data ===
+
+The task_autonuma structure is used to hold AutoNUMA data required for
+each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
+nodes.
+
+- selected_nid: preferred NUMA node as determined by the AutoNUMA
+                scheduler balancing code, -1 if none (2 bytes)
+
+- Task NUMA statistics for this thread/process:
+
+    Total number of NUMA hinting page faults in this pass of
+    knuma_scand (8 bytes)
+
+    Per NUMA node number of NUMA hinting page faults in this pass of
+    knuma_scand (8 bytes * # of NUMA nodes)
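+
+Again as a rough sketch (not the real declaration; padding ignored):
+
+    struct task_autonuma {
+        short selected_nid;                /* preferred node, -1 if none (2) */
+        unsigned long faults_total;        /* faults in this pass        (8) */
+        unsigned long faults_node[];       /* per NUMA node            (8*N) */
+    };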
+
+=== page_autonuma - per page AutoNUMA data ===
+
+The page_autonuma structure is used to hold AutoNUMA data required for
+each page of memory. Total size: 2 bytes
+
+    last_nid - NUMA node for last time this page incurred a NUMA
+               hinting fault, -1 if none (2 bytes)
+
+=== pte and pmd - NUMA flags ===
+
+A bit in the pte and pmd structures is used to indicate to the page fault
+handler that the fault was incurred for NUMA purposes.
+
+    _PAGE_NUMA: a NUMA hinting fault at either the pte or pmd level (1
+                bit)
+
+        The same bit used for _PAGE_PROTNONE is used for
+        _PAGE_NUMA. This is okay because all uses of _PAGE_PROTNONE
+        are mutually exclusive of _PAGE_NUMA.
+
+Note: NUMA hinting fault at the pmd level is only used on
+architectures where pmd granularity is supported.
+
+== IV: AutoNUMA Active ==
+
+AutoNUMA is considered active when all of the following 4 conditions
+are met:
+
+- AutoNUMA is compiled into the kernel
+
+    CONFIG_AUTONUMA=y
+
+- The hardware has NUMA properties
+
+- AutoNUMA is enabled at boot time
+
+    "noautonuma" not passed to the kernel command line
+
+- AutoNUMA is enabled dynamically at run-time
+
+    CONFIG_AUTONUMA_DEFAULT_ENABLED=y
+
+  or
+
+    echo 1 >/sys/kernel/mm/autonuma/enabled
+
+== V: AutoNUMA Flags ==
+
+AUTONUMA_POSSIBLE_FLAG: The kernel was not passed the "noautonuma"
+                        boot parameter and is being run on NUMA
+                        hardware.
+
+AUTONUMA_ENABLED_FLAG: AutoNUMA is enabled (default set at compile
+                       time).
+
+AUTONUMA_DEBUG_FLAG (default 0): prints lots of debug info; set
+                                 through sysfs
+
+AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG (default 0): AutoNUMA will
+                                                     prioritize on
+                                                     NUMA affinity and
+                                                     will disregard
+                                                     inter-node
+                                                     fairness.
+
+AUTONUMA_CHILD_INHERITANCE_FLAG (default 1): AutoNUMA statistics are
+                                             copied to the child at
+                                             every fork/clone instead
+                                             of being reset, as
+                                             happens unconditionally
+                                             in execve.
+
+AUTONUMA_SCAN_PMD_FLAG (default 1): trigger NUMA hinting faults for
+                                    the pmd level instead of just the
+                                    pte level (note: for THP, NUMA
+                                    hinting faults always occur at the
+                                    pmd level)
+
+AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG (default 0): page is migrated
+                                                     on first NUMA
+                                                     hinting fault
+                                                     instead of second
+
+AUTONUMA_MM_WORKING_SET_FLAG (default 1): mm_autonuma represents a
+                                          working set estimation of
+                                          the memory used by the
+                                          process
+
+Contributors: Andrea Arcangeli, Karen Noel, Rik van Riel


* [PATCH 02/33] autonuma: make set_pmd_at always available
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 10:54   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
                   ` (35 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..5edd174 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -528,7 +528,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -539,7 +538,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {


* [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 10:54   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
                   ` (34 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.

Reported-by: Petr Holasek <pholasek@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..ad4e2e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,13 +54,13 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT HPAGE_SHIFT
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
-extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
-
 #define transparent_hugepage_enabled(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_FLAG) ||				\


* [PATCH 04/33] autonuma: define _PAGE_NUMA
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 11:01   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
                   ` (33 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (but it
could also use a different bitflag; it's up to the architecture to
decide).

It would be confusing to call the "NUMA hinting page faults"
"do_prot_none faults". They're different events, and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive to the code paths executed
by _PAGE_NUMA at all times, so that _PAGE_NUMA and _PAGE_PROTNONE
don't step on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte and
pmd are present, so the bitflag picked for _PAGE_NUMA usage must not
be used by the swap entry format.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..bf99b6a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather autonuma statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must always be zero both when the pmd is present
+ * and when it is not present, so that we don't lose information when
+ * we set it while atomically clearing the present bit.
+ *
+ * Because we share the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\


* [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 11:15   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
                   ` (32 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

NUMA hinting page faults are used:

1) to fill in the per-thread NUMA statistics stored for each thread
   in its current->task_autonuma data structure

2) to track the last_nid information in the page structure to detect
   false sharing

3) to migrate the page with Migrate On Fault if there have been enough
   NUMA hinting page faults on the page coming from remote CPUs
   (autonuma_last_nid heuristic)

NUMA hinting page faults collect information and possibly add pages
to migrate queues. They are extremely quick, and they try to remain
non-blocking even when Migrate On Fault is invoked as a result.

The generic implementation is used when CONFIG_AUTONUMA=n.
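
As a hedged sketch of what the pte-level NUMA hinting page fault
amounts to (the real handler, pte_numa_fixup(), is wired into the page
fault path by later patches in this series; locking, TLB flushing and
the THP case are omitted here):

static void pte_numa_hinting_fault_sketch(struct vm_area_struct *vma,
					  unsigned long addr, pte_t *ptep)
{
	struct page *page;

	if (!pte_numa(*ptep))
		return;		/* not a NUMA hinting page fault */

	/* make the pte present again so userland can make progress */
	set_pte_at(vma->vm_mm, addr, ptep, pte_mknonnuma(*ptep));

	page = vm_normal_page(vma, addr, *ptep);
	if (page)
		/* update the task/mm statistics, maybe queue a migration */
		numa_hinting_fault(page, 1);
}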

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable.h |   65 ++++++++++++++++++++++++++++++++++++++-
 include/asm-generic/pgtable.h  |   12 +++++++
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c3520d7..6c14b40 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
+}
+
+#ifdef CONFIG_AUTONUMA
+/*
+ * _PAGE_NUMA works identically to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only while _PAGE_PRESENT is not set, and it
+ * is never set while _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * faults trigger on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * didn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_AUTONUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..0ff87ec 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -530,6 +530,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 12:22   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
                   ` (31 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

In the special "pmd" mode of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of NUMA
type (_PAGE_PRESENT not set) while the ptes under it may still be
present. Therefore, gup_pmd_range() must return 0 in this case to
avoid losing a NUMA hinting page fault during gup_fast.

Note: gup_fast will skip over non present ptes (like numa types), so
no explicit check is needed for the pte_numa case. gup_fast will also
skip over THP when the trans huge pmd is non present. So, the pmd_numa
case will also be correctly skipped with no additional code changes
required.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 6dc9921..cad7d97 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -169,8 +169,19 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * can't because it has irq disabled and
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
+		 *
+		 * The pmd_numa() check is needed because the code
+		 * doesn't check the _PAGE_PRESENT bit of the pmd if
+		 * the gup_pte_range() path is taken. NOTE: not all
+		 * gup_fast users will access the page contents
+		 * using the CPU through the NUMA memory channels like
+		 * KVM does. So we're forced to trigger NUMA hinting
+		 * page faults unconditionally for all gup_fast users
+		 * even though NUMA hinting page faults aren't useful
+		 * to I/O drivers that will access the page with DMA
+		 * and not with the CPU.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 12:28   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input of the CPU-follows-memory algorithms in the NUMA
scheduler.
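
As a hedged sketch of how the per-nid arrays below are meant to be
filled by the NUMA hinting page faults (the real accounting, including
the pass-counter based exponential backoff, is implemented later in
this series):

static void task_numa_fault_account_sketch(struct task_struct *p,
					   int access_nid, int numpages)
{
	struct task_autonuma *ta = p->task_autonuma;

	if (!ta)
		return;		/* AutoNUMA not possible on this system */

	ta->task_numa_fault[access_nid] += numpages;
	ta->task_numa_fault_tot += numpages;
}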

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_types.h |  107 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_types.h

diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..9673ce8
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,107 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+
+/*
+ * Per-mm (per-process) structure that contains the NUMA memory
+ * placement statistics generated by the knuma scan daemon. This
+ * structure is dynamically allocated only if AutoNUMA is possible on
+ * this system. These structures are linked together in a list headed
+ * within the knumad_scan structure.
+ */
+struct mm_autonuma {
+	/* link for knuma_scand's list of mm structures to scan */
+	struct list_head mm_node;
+	/* Pointer to associated mm structure */
+	struct mm_struct *mm;
+
+	/*
+	 * Zeroed from here during allocation, check
+	 * mm_autonuma_reset() if you alter the below.
+	 */
+
+	/*
+	 * Pass counter for this mm. This exists only to be able to
+	 * tell when it's time to apply the exponential backoff on the
+	 * task_autonuma statistics.
+	 */
+	unsigned long mm_numa_fault_pass;
+	/* Total number of pages that will trigger NUMA faults for this mm */
+	unsigned long mm_numa_fault_tot;
+	/* Number of pages that will trigger NUMA faults for each [nid] */
+	unsigned long mm_numa_fault[0];
+	/* do not add more variables here, the above array size is dynamic */
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+/*
+ * Per-task (thread) structure that contains the NUMA memory placement
+ * statistics generated by the knuma scan daemon. This structure is
+ * dynamically allocated only if AutoNUMA is possible on this
+ * system. They are linked together in a list headed within the
+ * knumad_scan structure.
+ */
+struct task_autonuma {
+	/* node id the CPU scheduler should try to stick with (-1 if none) */
+	int task_selected_nid;
+
+	/*
+	 * Zeroed from here during allocation, check
+	 * mm_autonuma_reset() if you alter the below.
+	 */
+
+	/*
+	 * Pass counter for this task. When the pass counter is found
+	 * out of sync with the mm_numa_fault_pass we know it's time
+	 * to apply the exponential backoff on the task_autonuma
+	 * statistics, and then we synchronize it with
+	 * mm_numa_fault_pass. This pass counter is needed because in
+	 * knuma_scand we work on the mm and we have no visibility on
+	 * the task_autonuma. Furthermore it would be detrimental to
+	 * apply exponential backoff to all task_autonuma associated
+	 * with a certain mm_autonuma (potentially zeroing out the trail
+	 * of statistical data in task_autonuma) if the task is idle
+	 * for a long period of time (i.e. several knuma_scand passes).
+	 */
+	unsigned long task_numa_fault_pass;
+	/* Total number of eligible pages that triggered NUMA faults */
+	unsigned long task_numa_fault_tot;
+	/* Number of pages that triggered NUMA faults for each [nid] */
+	unsigned long task_numa_fault[0];
+	/* do not add more variables here, the above array size is dynamic */
+};
+
+extern int alloc_task_autonuma(struct task_struct *tsk,
+			       struct task_struct *orig,
+			       int node);
+extern void __init task_autonuma_init(void);
+extern void free_task_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_task_autonuma(struct task_struct *tsk,
+				      struct task_struct *orig,
+				      int node)
+{
+	return 0;
+}
+static inline void task_autonuma_init(void) {}
+static inline void free_task_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 08/33] autonuma: define the autonuma flags
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 13:46   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 09/33] autonuma: core autonuma.h header Andrea Arcangeli
                   ` (29 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

These flags are the ones tweaked through sysfs; they control the
behavior of AutoNUMA, from enabling/disabling it to selecting various
runtime options.
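
For illustration only (this is not a hunk from the series), the scan
daemon added by a later patch consults these flags roughly as below,
using the pte/pmd_mknuma helpers from patch 05; the flags themselves
are flipped at runtime by writing 0/1 to the corresponding files under
/sys/kernel/mm/autonuma/ (e.g. knuma_scand/pmd, shown in patch 06).

static void knuma_scand_decision_sketch(struct mm_struct *mm,
					unsigned long addr,
					pmd_t *pmd, pte_t *pte)
{
	if (!autonuma_enabled())
		return;			/* AutoNUMA currently switched off */

	if (autonuma_scan_pmd())
		/* one NUMA hinting fault per pmd range */
		set_pmd_at(mm, addr, pmd, pmd_mknuma(*pmd));
	else
		/* one NUMA hinting fault per page */
		set_pte_at(mm, addr, pte, pte_mknuma(*pte));
}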

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |  120 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 120 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_flags.h

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..630ecc5
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,120 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+/*
+ * If CONFIG_AUTONUMA=n, only autonuma_possible() is defined (as
+ * false), so that blocks of common code can be optimized away at
+ * compile time without using #ifdefs.
+ */
+
+#ifdef CONFIG_AUTONUMA
+
+enum autonuma_flag {
+	/*
+	 * Set if the kernel wasn't passed the "noautonuma" boot
+	 * parameter and the hardware is NUMA. If AutoNUMA is not
+	 * possible the value of all other flags becomes irrelevant
+	 * (they will never be checked) and AutoNUMA can't be enabled.
+	 *
+	 * No defaults: depends on hardware discovery and "noautonuma"
+	 * early param.
+	 */
+	AUTONUMA_POSSIBLE_FLAG,
+	/*
+	 * If AutoNUMA is possible, this defines if AutoNUMA is
+	 * currently enabled or disabled. It can be toggled at runtime
+	 * through sysfs.
+	 *
+	 * The default depends on CONFIG_AUTONUMA_DEFAULT_ENABLED.
+	 */
+	AUTONUMA_ENABLED_FLAG,
+	/*
+	 * If set through sysfs this will print lots of debug info
+	 * about the AutoNUMA activities in the kernel logs.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_DEBUG_FLAG,
+	/*
+	 * This defines if CFS should prioritize between load
+	 * This defines whether CFS should prioritize load balancing
+	 * fairness or NUMA affinity when there are no idle CPUs
+	 * available. If this flag is set, AutoNUMA will prioritize
+	 * NUMA affinity and it will disregard inter-node
+	 * fairness.
+	 * Default not set.
+	 */
+	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+	/*
+	 * This flag defines if the task/mm_autonuma statistics should
+	 * be inherited from the parent task/process or instead if
+	 * they should be cleared at every fork/clone. The
+	 * task/mm_autonuma statistics are always cleared across
+	 * execve and there's no way to disable that.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_CHILD_INHERITANCE_FLAG,
+	/*
+	 * If set, this tells knuma_scand to trigger NUMA hinting page
+	 * faults at the pmd level instead of the pte level. This
+	 * reduces the number of NUMA hinting faults potentially
+	 * saving CPU time. It reduces the accuracy of the
+	 * task_autonuma statistics (but does not change the accuracy
+	 * of the mm_autonuma statistics). This flag can be toggled
+	 * through sysfs at runtime.
+	 *
+	 * This flag does not affect AutoNUMA with transparent
+	 * hugepages (THP). With THP the NUMA hinting page faults
+	 * always happen at the pmd level, regardless of the setting
+	 * of this flag. Note: there is no reduction in accuracy of
+	 * task_autonuma statistics with THP.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_SCAN_PMD_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static inline bool autonuma_possible(void)
+{
+	return test_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_enabled(void)
+{
+	return test_bit(AUTONUMA_ENABLED_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_debug(void)
+{
+	return test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_sched_load_balance_strict(void)
+{
+	return test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+			&autonuma_flags);
+}
+
+static inline bool autonuma_child_inheritance(void)
+{
+	return test_bit(AUTONUMA_CHILD_INHERITANCE_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_scan_pmd(void)
+{
+	return test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+#else /* CONFIG_AUTONUMA */
+
+static inline bool autonuma_possible(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 09/33] autonuma: core autonuma.h header
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Header that defines the generic AutoNUMA specific functions.

All functions are defined unconditionally, but are only linked into
the kernel if CONFIG_AUTONUMA=y. When CONFIG_AUTONUMA=n, their call
sites are optimized away at build time (or the kernel wouldn't link).
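
As an example of the pattern (a sketch, not an actual hunk from this
series), a caller in the page fault path stays free of #ifdefs because
pte_numa() is the constant-0 stub from the previous patch when
CONFIG_AUTONUMA=n, so the compiler drops the call to the never-built
pte_numa_fixup():

	/* illustrative placement, e.g. inside handle_pte_fault() */
	if (pte_numa(entry))
		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);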

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma.h

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..02d4875
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#include <linux/autonuma_flags.h>
+
+#ifdef CONFIG_AUTONUMA
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+					     struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+#define autonuma_printk(format, args...) \
+	if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+						    struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern int pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			  unsigned long addr, pte_t pte, pte_t *ptep,
+			  pmd_t *pmd);
+extern int pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			  pmd_t *pmd);
+extern bool numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 10/33] autonuma: CPU follows memory algorithm
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 09/33] autonuma: core autonuma.h header Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 14:58   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 11/33] autonuma: add the autonuma_last_nid in the page structure Andrea Arcangeli
                   ` (27 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This algorithm takes as input the statistical information filled in
by knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->task_autonuma), evaluates it for the currently scheduled task, and
compares it against every other running process to decide whether the
current task should be moved to another NUMA node.

When the scheduler decides if the task should be migrated to a
different NUMA node or to stay in the same NUMA node, the decision is
then stored into p->task_autonuma->task_selected_nid. The fair
scheduler then tries to keep the task on the task_selected_nid.
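
To make the decision concrete, here is a made-up numeric example using
the variable names from the comment added to kernel/sched/numa.c below
(AUTONUMA_BALANCE_SCALE = 1000):

  current task (on node 0):  w_this_nid = 300  (30% of its faults local)
                             w_nid      = 700  (70% of its faults on node 1)
  task running on node 1:    w_other    = 400  (40% of its faults on node 1)

  other_diff  = w_nid - w_other    = 700 - 400 = 300 > 0
  this_diff   = w_nid - w_this_nid = 700 - 300 = 400 > 0
  weight_diff = other_diff + this_diff = 700

Both differences are positive, so exchanging the two tasks increases
the system wide NUMA convergence; among all candidate CPUs the one
with the largest weight_diff is selected.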

The code includes fixes and cleanups from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   59 ++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |    1 +
 kernel/sched/fair.c            |    4 +
 kernel/sched/numa.c            |  638 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   19 ++
 7 files changed, 729 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..8d786eb
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,59 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#include <linux/autonuma_flags.h>
+
+#ifdef CONFIG_AUTONUMA
+
+extern void __sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int strict_numa, int dst_cpu,
+					    enum cpu_idle_type idle);
+
+/*
+ * Return true if the specified CPU is in this task's selected_nid (or
+ * there is no affinity set for the task).
+ */
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+	int task_selected_nid;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+
+	if (!task_autonuma)
+		return true;
+
+	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
+	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+}
+
+static inline void sched_autonuma_balance(void)
+{
+	struct task_autonuma *ta = current->task_autonuma;
+
+	if (ta && current->mm)
+		__sched_autonuma_balance();
+}
+
+#else /* CONFIG_AUTONUMA */
+
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int strict_numa,
+						   int dst_cpu,
+						   enum cpu_idle_type idle)
+{
+	return true;
+}
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+	return true;
+}
+
+static inline void sched_autonuma_balance(void) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf78672..c80101c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -405,6 +406,10 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 23bddac..ca246e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1522,6 +1522,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *task_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 649c9f8..5a36579 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96e2b18..877f077 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
 
 #include <trace/events/sched.h>
 
@@ -4932,6 +4933,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 
 	rebalance_domains(this_cpu, idle);
 
+	if (!this_rq->idle_balance)
+		sched_autonuma_balance();
+
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..d0cbfe9
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,638 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+/*
+ * Callback used by the AutoNUMA balancer to migrate a task to the
+ * selected CPU. Invoked by stop_one_cpu_nowait().
+ */
+static int autonuma_balance_cpu_stop(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
+	struct task_struct *p = src_rq->autonuma_balance_task;
+	struct rq *dst_rq = cpu_rq(dst_cpu);
+
+	raw_spin_lock_irq(&p->pi_lock);
+	raw_spin_lock(&src_rq->lock);
+
+	/* Make sure the selected cpu hasn't gone down in the meanwhile */
+	if (unlikely(src_cpu != smp_processor_id() ||
+		     !src_rq->autonuma_balance))
+		goto out_unlock;
+
+	/* Check if the affinity changed in the meanwhile */
+	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
+		goto out_unlock;
+
+	/* Is the task to migrate still there? */
+	if (task_cpu(p) != src_cpu)
+		goto out_unlock;
+
+	BUG_ON(src_rq == dst_rq);
+
+	/* Prepare to move the task from src_rq to dst_rq */
+	double_lock_balance(src_rq, dst_rq);
+
+	/*
+	 * Supposedly pi_lock should have been enough but some code
+	 * seems to call __set_task_cpu without pi_lock.
+	 */
+	if (task_cpu(p) != src_cpu)
+		goto out_double_unlock;
+
+	/*
+	 * If the task is not on a rq, the task_selected_nid will take
+	 * care of the NUMA affinity at the next wake-up.
+	 */
+	if (p->on_rq) {
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	}
+
+out_double_unlock:
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	src_rq->autonuma_balance = false;
+	raw_spin_unlock(&src_rq->lock);
+	/* spinlocks act as barrier() so p is stored locally on the stack */
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+	return 0;
+}
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+/*
+ * This function __sched_autonuma_balance() is responsible for
+ * deciding which is the best CPU each process should be running on
+ * according to the NUMA statistics collected in mm->mm_autonuma and
+ * tsk->task_autonuma.
+ *
+ * This will not alter the active idle load balancing and most other
+ * scheduling activity, it works by exchanging running tasks across
+ * CPUs located in different NUMA nodes, when such an exchange
+ * provides a net benefit in increasing the system wide NUMA
+ * convergence.
+ *
+ * The tasks that are the closest to "fully converged" are given the
+ * maximum priority in being moved to their "best node".
+ *
+ * "Full convergence" is achieved when all memory accesses by a task
+ * are 100% local to the CPU it is running on. A task's "best node" is
+ * the NUMA node that recently had the most memory accesses from the
+ * task. The tasks that are closest to being fully converged are given
+ * maximum priority for being moved to their "best node."
+ *
+ * To find how close a task is to converging we use weights. These
+ * weights are computed using the task_autonuma and mm_autonuma
+ * statistics. These weights represent the percentage amounts of
+ * memory accesses (in AUTONUMA_BALANCE_SCALE) that each task recently
+ * had in each node. If the weight of one node is equal to
+ * AUTONUMA_BALANCE_SCALE that implies the task reached "full
+ * convergence" in that given node. Conversely, a node with a
+ * zero weight would be the "worst node" for the task.
+ *
+ * If the weights for two tasks on CPUs in different nodes are equal
+ * no switch will happen.
+ *
+ * The core math that evaluates the current CPU against the CPUs of
+ * all other nodes is this:
+ *
+ *	if (other_diff > 0 && this_diff > 0)
+ *		weight_diff = other_diff + this_diff;
+ *
+ * other_diff: how much the current task is closer to fully converge
+ * on the node of the other CPU than the other task that is currently
+ * running in the other CPU.
+ *
+ * this_diff: how much the current task is closer to converge on the
+ * node of the other CPU than in the current node.
+ *
+ * If both checks succeed it guarantees that we found a way to
+ * multilaterally improve the system wide NUMA
+ * convergence. Multilateral here means that the same checks will not
+ * succeed again on those same two tasks, after the task exchange, so
+ * there is no risk of ping-pong.
+ *
+ * If a task exchange can happen because the two checks succeed, we
+ * select the destination CPU that will give us the biggest increase
+ * in system wide convergence (i.e. the biggest "weight_diff" in the
+ * above quoted code).
+ *
+ * CFS is NUMA aware via sched_autonuma_can_migrate_task(). CFS searches
+ * CPUs in the task's task_selected_nid first during load balancing and
+ * idle balancing.
+ *
+ * The task's task_selected_nid is the node selected by
+ * __sched_autonuma_balance() when it migrates the current task to the
+ * selected cpu in the selected node during the task exchange.
+ *
+ * Once a task has been moved to another node, closer to most of the
+ * memory it has recently accessed, any memory for that task not in
+ * the new node moves slowly to the new node. This is done in the
+ * context of the NUMA hinting page fault (aka Migrate On Fault).
+ *
+ * One important thing is how we calculate the weights using
+ * task_autonuma or mm_autonuma, depending on whether the other CPU is
+ * running a thread of the current process or a thread of a different
+ * process.
+ *
+ * We use the mm_autonuma statistics to calculate the NUMA weights of
+ * the two task candidates for exchange if the task in the other CPU
+ * belongs to a different process. This way all threads of the same
+ * process will converge to the same node, which is the one with the
+ * highest percentage of memory for the process.  This will happen
+ * even if the thread's "best node" is busy running threads of a
+ * different process.
+ *
+ * If the two candidate tasks for exchange are threads of the same
+ * process, we use the task_autonuma information (because the
+ * mm_autonuma information is identical). By using the task_autonuma
+ * statistics, each thread follows its own memory locality and they
+ * will not necessarily converge on the same node. This is often very
+ * desirable for processes with more threads than CPUs on each NUMA
+ * node.
+ *
+ * To avoid the risk of NUMA false sharing it's best to schedule all
+ * threads accessing the same memory in the same node (or in as few
+ * nodes as possible if they can't fit in a single node).
+ *
+ * False sharing in the above sentence means simultaneous
+ * virtual memory accesses to the same pages of memory, by threads
+ * running in CPUs of different nodes. Sharing doesn't refer to shared
+ * memory as in tmpfs, but it refers to CLONE_VM instead.
+ *
+ * This algorithm might be expanded to take all runnable processes
+ * into account later.
+ *
+ * This algorithm is executed by every CPU in the context of the
+ * SCHED_SOFTIRQ load balancing event at regular intervals.
+ *
+ * If the task is found to have converged in the current node, we
+ * already know that the check "this_diff > 0" will not succeed, so
+ * the autonuma balancing completes without having to check any of the
+ * CPUs of the other NUMA nodes.
+ */
+void __sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
+	int this_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	unsigned long task_fault, task_tot, mm_fault, mm_tot;
+	unsigned long task_max, mm_max;
+	unsigned long weight_diff_max;
+	long uninitialized_var(s_w_nid);
+	long uninitialized_var(s_w_this_nid);
+	long uninitialized_var(s_w_other);
+	bool uninitialized_var(s_w_type_thread);
+	struct cpumask *allowed;
+	struct task_struct *p = current, *other_task;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	struct mm_autonuma *mm_autonuma;
+	struct rq *rq;
+
+	/* per-cpu statically allocated in runqueues */
+	long *task_numa_weight;
+	long *mm_numa_weight;
+
+	if (!task_autonuma || !p->mm)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (task_autonuma->task_selected_nid != -1)
+			task_autonuma->task_selected_nid = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+	mm_autonuma = p->mm->mm_autonuma;
+
+	/*
+	 * If the task has no NUMA hinting page faults or if the mm
+	 * hasn't been fully scanned by knuma_scand yet, set task
+	 * selected nid to the current nid, to avoid the task bouncing
+	 * around randomly.
+	 */
+	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
+	if (!mm_tot) {
+		if (task_autonuma->task_selected_nid != this_nid)
+			task_autonuma->task_selected_nid = this_nid;
+		return;
+	}
+	task_tot = task_autonuma->task_numa_fault_tot;
+	if (!task_tot) {
+		if (task_autonuma->task_selected_nid != this_nid)
+			task_autonuma->task_selected_nid = this_nid;
+		return;
+	}
+
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * Verify that we can migrate the current task, otherwise try
+	 * again later.
+	 */
+	if (ACCESS_ONCE(rq->autonuma_balance))
+		return;
+
+	/*
+	 * The following two arrays will hold the NUMA affinity weight
+	 * information for the current process if scheduled on the
+	 * given NUMA node.
+	 *
+	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
+	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
+	 */
+	task_numa_weight = rq->task_numa_weight;
+	mm_numa_weight = rq->mm_numa_weight;
+
+	/*
+	 * Identify the NUMA node where this thread (task_struct), and
+	 * the process (mm_struct) as a whole, has the largest number
+	 * of NUMA faults.
+	 */
+	task_max = mm_max = 0;
+	selected_nid = mm_selected_nid = -1;
+	for_each_online_node(nid) {
+		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
+		task_fault = task_autonuma->task_numa_fault[nid];
+		if (mm_fault > mm_tot)
+			/* could be removed with a seqlock */
+			mm_tot = mm_fault;
+		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
+		if (task_fault > task_tot) {
+			task_tot = task_fault;
+			WARN_ON(1);
+		}
+		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
+		if (mm_numa_weight[nid] > mm_max) {
+			mm_max = mm_numa_weight[nid];
+			mm_selected_nid = nid;
+		}
+		if (task_numa_weight[nid] > task_max) {
+			task_max = task_numa_weight[nid];
+			selected_nid = nid;
+		}
+	}
+	/*
+	 * If this NUMA node is the selected one, based on process
+	 * memory and task NUMA faults, set task_selected_nid and
+	 * we're done.
+	 */
+	if (selected_nid == this_nid && mm_selected_nid == this_nid) {
+		if (task_autonuma->task_selected_nid != selected_nid)
+			task_autonuma->task_selected_nid = selected_nid;
+		return;
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = this_nid;
+	weight_diff_max = 0;
+	other_task = NULL;
+
+	/* check that the following raw_spin_lock_irq is safe */
+	BUG_ON(irqs_disabled());
+
+	/*
+	 * Check the other NUMA nodes to see if there is a task we
+	 * should exchange places with.
+	 */
+	for_each_online_node(nid) {
+		/* No need to check our current node. */
+		if (nid == this_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			struct mm_autonuma *mma = NULL /* bugcheck */;
+			struct task_autonuma *ta = NULL /* bugcheck */;
+			unsigned long fault, tot;
+			long this_diff, other_diff;
+			long w_nid, w_this_nid, w_other;
+			bool w_type_thread;
+			struct mm_struct *mm;
+			struct task_struct *_other_task;
+
+			rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+
+			/* CFS takes care of idle balancing. */
+			if (idle_cpu(cpu))
+				continue;
+
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+
+			/*
+			 * Check if the _other_task already has a
+			 * migration pending. Do it locklessly: it's an
+			 * optimistic racy check anyway.
+			 */
+			if (ACCESS_ONCE(rq->autonuma_balance))
+				continue;
+
+			/*
+			 * Grab the fault/tot of the processes running
+			 * in the other CPUs to compute w_other.
+			 */
+			raw_spin_lock_irq(&rq->lock);
+			_other_task = rq->curr;
+			/* recheck after implicit barrier() */
+			mm = _other_task->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			if (mm == p->mm) {
+				/*
+				 * This task is another thread in the
+				 * same process. Use the task statistics.
+				 */
+				w_type_thread = true;
+				ta = _other_task->task_autonuma;
+				tot = ta->task_numa_fault_tot;
+			} else {
+				/*
+				 * This task is part of another process.
+				 * Use the mm statistics.
+				 */
+				w_type_thread = false;
+				mma = mm->mm_autonuma;
+				tot = ACCESS_ONCE(mma->mm_numa_fault_tot);
+			}
+
+			if (!tot) {
+				/* Need NUMA faults to evaluate NUMA placement. */
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			/*
+			 * Check if the _other_task is allowed to be
+			 * migrated to this_cpu.
+			 */
+			if (!cpumask_test_cpu(this_cpu,
+					      tsk_cpus_allowed(_other_task))) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			if (w_type_thread)
+				fault = ta->task_numa_fault[nid];
+			else
+				fault = ACCESS_ONCE(mma->mm_numa_fault[nid]);
+
+			raw_spin_unlock_irq(&rq->lock);
+
+			if (fault > tot)
+				tot = fault;
+			w_other = fault*AUTONUMA_BALANCE_SCALE/tot;
+
+			/*
+			 * We pre-computed the weights for the current
+			 * task in the task/mm_numa_weight arrays.
+			 * Those computations were mm/task local, and
+			 * didn't require accessing other CPUs'
+			 * runqueues.
+			 */
+			if (w_type_thread) {
+				w_nid = task_numa_weight[nid];
+				w_this_nid = task_numa_weight[this_nid];
+			} else {
+				w_nid = mm_numa_weight[nid];
+				w_this_nid = mm_numa_weight[this_nid];
+			}
+
+			/*
+			 * other_diff: How much does the current task
+			 * prefer to run on the remote NUMA node (nid)
+			 * compared to the other task on the remote
+			 * node (nid).
+			 */
+			other_diff = w_nid - w_other;
+
+			/*
+			 * this_diff: How much does the current task
+			 * prefer to run on the remote NUMA node (nid)
+			 * rather than the current NUMA node
+			 * (this_nid).
+			 */
+			this_diff = w_nid - w_this_nid;
+
+			/*
+			 * Would swapping NUMA location with this task
+			 * reduce the total number of cross-node NUMA
+			 * faults in the system?
+			 */
+			if (other_diff > 0 && this_diff > 0) {
+				unsigned long weight_diff;
+
+				weight_diff = other_diff + this_diff;
+
+				/* Remember the best candidate. */
+				if (weight_diff > weight_diff_max) {
+					weight_diff_max = weight_diff;
+					selected_cpu = cpu;
+					selected_nid = nid;
+
+					s_w_other = w_other;
+					s_w_nid = w_nid;
+					s_w_this_nid = w_this_nid;
+					s_w_type_thread = w_type_thread;
+					other_task = _other_task;
+				}
+			}
+		}
+	}
+
+	if (task_autonuma->task_selected_nid != selected_nid)
+		task_autonuma->task_selected_nid = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug()) {
+			char *w_type_str;
+			w_type_str = s_w_type_thread ? "thread" : "process";
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, this_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       s_w_other, s_w_nid, s_w_this_nid,
+			       w_type_str);
+		}
+		BUG_ON(this_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. After set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	if (rq->autonuma_balance) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = selected_cpu;
+	rq->autonuma_balance_task = p;
+	get_task_struct(p);
+
+	/* Do the actual migration. */
+	stop_one_cpu_nowait(this_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+
+	BUG_ON(!other_task);
+	rq = cpu_rq(selected_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. After set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	/*
+	 * The chance of other_task having quit in the meanwhile
+	 * and another task having reused its previous task struct is
+	 * tiny. Even if it happens the kernel will be stable.
+	 */
+	if (rq->autonuma_balance || rq->curr != other_task) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	/* take the pin on the task struct before dropping the lock */
+	get_task_struct(other_task);
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = this_cpu;
+	rq->autonuma_balance_task = other_task;
+
+	/* Do the actual migration. */
+	stop_one_cpu_nowait(selected_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+#ifdef __ia64__
+#error "NOTE: tlb_migrate_finish won't run here, review before deleting"
+#endif
+}
+
+/*
+ * The function sched_autonuma_can_migrate_task is called by CFS
+ * can_migrate_task() to prioritize on the task's
+ * task_selected_nid. It is called during load_balancing, idle
+ * balancing and in general before any task CPU migration event
+ * happens.
+ *
+ * The caller first scans the CFS migration candidate tasks passing a
+ * non-zero numa parameter, to skip tasks without AutoNUMA affinity
+ * (according to the task's task_selected_nid). If no task can be
+ * migrated in the first scan, a second scan is run with a zero numa
+ * parameter.
+ *
+ * If the numa parameter is not zero, this function allows the task
+ * migration only if the dst_cpu of the migration is in the node
+ * selected by AutoNUMA or if it's an idle load balancing event.
+ *
+ * If load_balance_strict is enabled, AutoNUMA will only allow
+ * migration of tasks for idle balancing purposes (the idle balancing
+ * of CFS is never altered by AutoNUMA). In non-strict mode the
+ * load balancing is not altered and the AutoNUMA affinity is
+ * disregarded in favor of higher fairness. The load_balance_strict
+ * knob is runtime tunable in sysfs.
+ *
+ * If load_balance_strict is enabled, it tends to partition the
+ * system. In turn it may reduce the scheduler fairness across NUMA
+ * nodes, but it should deliver higher global performance.
+ */
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int strict_numa, int dst_cpu,
+				     enum cpu_idle_type idle)
+{
+	if (task_autonuma_cpu(p, dst_cpu))
+		return true;
+
+	/* NUMA affinity is set - and to a different NUMA node */
+
+	/*
+	 * If strict_numa is not zero, it means our caller is in the
+	 * first pass so be strict and only allow migration of tasks
+	 * that passed the NUMA affinity test. If our caller finds
+	 * none in the first pass, it'll normally retry a second pass
+	 * with a zero "strict_numa" parameter.
+	 */
+	if (strict_numa)
+		return false;
+
+	/*
+	 * The idle load balancing always has higher priority than the
+	 * NUMA affinity.
+	 */
+	if (idle == CPU_NEWLY_IDLE || idle == CPU_IDLE)
+		return true;
+
+	if (autonuma_sched_load_balance_strict())
+		return false;
+	else
+		return true;
+}
+
+/*
+ * sched_autonuma_dump_mm is a purely debugging function called at
+ * regular intervals when /sys/kernel/mm/autonuma/debug is
+ * enabled. This prints in the kernel logs how the threads and
+ * processes are distributed in all NUMA nodes to easily check if the
+ * threads of the same processes are converging in the same
+ * nodes. This won't take into account kernel threads and because it
+ * runs itself from a kernel thread it won't show what was running in
+ * the current CPU, but it's simple and good enough to get what we
+ * need in the debug logs. This function can be disabled or deleted
+ * later.
+ */
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	cpumask_var_t x;
+
+	if (!alloc_cpumask_var(&x, GFP_KERNEL))
+		return;
+	cpumask_setall(x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, x);
+				}
+			}
+			printk("nid %d process %p nr_threads %d\n", nid, mm, nr);
+		}
+	}
+	free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0848fa3..9ce8151 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -467,6 +467,25 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	/* stop_one_cpu_nowait() data used by autonuma_balance_cpu_stop() */
+	bool autonuma_balance;
+	int autonuma_balance_dst_cpu;
+	struct task_struct *autonuma_balance_task;
+	struct cpu_stop_work autonuma_balance_work;
+	/*
+	 * Per-cpu arrays used to compute the per-thread and
+	 * per-process NUMA affinity weights (per nid) for the current
+	 * process. Allocated statically to avoid overflowing the
+	 * stack with large MAX_NUMNODES values.
+	 *
+	 * FIXME: allocate with dynamic num_possible_nodes() array
+	 * sizes and only if autonuma is possible, to save some dozen
+	 * KB of RAM when booting on non NUMA (or small NUMA) systems.
+	 */
+	long task_numa_weight[MAX_NUMNODES];
+	long mm_numa_weight[MAX_NUMNODES];
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 11/33] autonuma: add the autonuma_last_nid in the page structure
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data Andrea Arcangeli
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This is the basic implementation improved by later patches.

Later patches move the new field to a dynamically allocated
page_autonuma taking 2 bytes per page (only allocated if booted on
NUMA hardware, unless "noautonuma" is passed as parameter to the
kernel at boot).
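
A hedged sketch of how the new field is meant to be used by the
autonuma_last_nid false sharing heuristic mentioned in patch 05 (the
real check lands in a later patch of this series):

static bool last_nid_allows_migration_sketch(struct page *page, int this_nid)
{
	/*
	 * Only migrate if the previous NUMA hinting fault on this page
	 * came from the same node as the current one; otherwise just
	 * record the new node and wait for the next fault.
	 */
	bool migrate = (page->autonuma_last_nid == this_nid);

	if (!migrate)
		page->autonuma_last_nid = this_nid;

	return migrate;
}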

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h |   11 +++++++++++
 mm/page_alloc.c          |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c80101c..9e8398a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,6 +152,17 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+#if MAX_NUMNODES > 32767
+#error "too many nodes"
+#endif
+	short autonuma_last_nid;
+#endif
+
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77845f9..a9b18bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3793,6 +3793,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+		page->autonuma_last_nid = -1;
+#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 11/33] autonuma: add the autonuma_last_nid in the page structure Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 15:43   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 13/33] autonuma: autonuma_enter/exit Andrea Arcangeli
                   ` (25 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This defines the per node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.
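
A sketch of how these fields are intended to be used (the interval and
the page threshold below are made-up placeholders, not values from
this series; the real policy is implemented by a later patch):

static bool numa_migrate_throttled_sketch(int dst_nid, unsigned long nr_pages)
{
	pg_data_t *pgdat = NODE_DATA(dst_nid);
	bool throttled;

	spin_lock(&pgdat->autonuma_migrate_lock);
	if (time_after(jiffies,
		       pgdat->autonuma_migrate_last_jiffies + HZ)) {
		/* a new rate limiting interval starts: reset the counter */
		pgdat->autonuma_migrate_last_jiffies = jiffies;
		pgdat->autonuma_migrate_nr_pages = 0;
	}
	pgdat->autonuma_migrate_nr_pages += nr_pages;
	/* placeholder threshold: at most 64k pages per interval per node */
	throttled = pgdat->autonuma_migrate_nr_pages > (64UL << 10);
	spin_unlock(&pgdat->autonuma_migrate_lock);

	return throttled;
}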

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h |   11 +++++++++++
 mm/page_alloc.c        |    6 ++++++
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..f793541 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -709,6 +709,17 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * Lock serializing the per destination node AutoNUMA memory
+	 * migration rate limiting data.
+	 */
+	spinlock_t autonuma_migrate_lock;
+	/* Rate limiting time interval */
+	unsigned long autonuma_migrate_last_jiffies;
+	/* Number of pages migrated during the rate limiting time interval */
+	unsigned long autonuma_migrate_nr_pages;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9b18bc..ef69743 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
 #include <linux/prefetch.h>
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4398,6 +4399,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int ret;
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+	spin_lock_init(&pgdat->autonuma_migrate_lock);
+	pgdat->autonuma_migrate_nr_pages = 0;
+	pgdat->autonuma_migrate_last_jiffies = jiffies;
+#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 13/33] autonuma: autonuma_enter/exit
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 13:50   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 14/33] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
                   ` (24 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This is where we register (and unregister) "mm" structures with
AutoNUMA so that knuma_scand can scan them.

knuma_scand is the first gear in the whole AutoNUMA algorithm: it is
the daemon that scans the "mm" structures in the list and sets
pmd_numa and pte_numa so that the NUMA hinting page faults can start.
All other actions follow from that. If knuma_scand doesn't run,
AutoNUMA is fully bypassed, and if knuma_scand is stopped, all the
other AutoNUMA gears soon settle down too.
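
For reference, the registration itself (implemented later in the
series in mm/autonuma.c) boils down to appending the mm's mm_autonuma
to knuma_scand's list under the knumad_mm_mutex:

void autonuma_enter(struct mm_struct *mm)
{
	if (!autonuma_possible())
		return;

	mutex_lock(&knumad_mm_mutex);
	list_add_tail(&mm->mm_autonuma->mm_node, &knuma_scand_data.mm_head);
	mutex_unlock(&knumad_mm_mutex);
}

autonuma_exit() does the reverse, with extra care (an mmap_sem write
lock/unlock cycle) when knuma_scand is currently scanning the exiting
mm.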

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ec0495b..14d68d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -541,6 +542,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
+		autonuma_enter(mm);
 		return mm;
 	}
 
@@ -609,6 +611,7 @@ void mmput(struct mm_struct *mm)
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+		autonuma_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 14/33] autonuma: call autonuma_setup_new_exec()
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 13/33] autonuma: autonuma_enter/exit Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 15:47   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
                   ` (23 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This resets all per-thread and per-process statistics across exec
syscalls or after kernel threads detach from the mm. The past
statistical NUMA information is unlikely to be relevant for the future
in these cases.
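
For reference, the helper called from both places is implemented
later in the series (mm/autonuma.c) as a plain reset of the two
statistics containers:

void autonuma_setup_new_exec(struct task_struct *p)
{
	if (p->task_autonuma)
		task_autonuma_reset(p->task_autonuma);
	if (p->mm && p->mm->mm_autonuma)
		mm_autonuma_reset(p->mm->mm_autonuma);
}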

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/exec.c        |    7 +++++++
 mm/mmu_context.c |    3 +++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 574cf4d..1d55077 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/autonuma.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1172,6 +1173,12 @@ void setup_new_exec(struct linux_binprm * bprm)
 			
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
+
+	/*
+	 * Reset autonuma counters, as past NUMA information
+	 * is unlikely to be relevant for the future.
+	 */
+	autonuma_setup_new_exec(current);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..e6fff1c 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
 #include <linux/mmu_context.h>
 #include <linux/export.h>
 #include <linux/sched.h>
+#include <linux/autonuma.h>
 
 #include <asm/mmu_context.h>
 
@@ -52,6 +53,8 @@ void unuse_mm(struct mm_struct *mm)
 {
 	struct task_struct *tsk = current;
 
+	autonuma_setup_new_exec(tsk);
+
 	task_lock(tsk);
 	sync_mm_rss(mm);
 	tsk->mm = NULL;

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 15/33] autonuma: alloc/free/init task_autonuma
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 14/33] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-11 15:53   ` Mel Gorman
  2012-10-03 23:50 ` [PATCH 16/33] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
                   ` (22 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This is where the dynamically allocated task_autonuma structure is
handled.

This is the structure holding the per-thread NUMA statistics generated
by the NUMA hinting page faults. This per-thread NUMA statistical
information is needed by sched_autonuma_balance to make optimal NUMA
balancing decisions.

It also contains task_selected_nid, which hints the stock CPU
scheduler about the best NUMA node to schedule this thread on (as
decided by sched_autonuma_balance).

The reason for keeping this outside of the task_struct, besides not
bloating every task, is that it only needs to be allocated on NUMA
hardware. Non-NUMA hardware only pays the memory cost of one pointer
per task (which remains NULL at all times in that case).

If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
is allocated, of course.
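
The structure definition itself is added by an earlier patch and is
not part of this hunk. An approximate layout, inferred from how the
fields are used by the autonuma core later in the series
(task_autonuma_size(), task_autonuma_reset() and the NUMA hinting
fault path), looks roughly like this; the exact types and ordering
here are an assumption:

struct task_autonuma {
	int task_selected_nid;	/* preferred NUMA node, -1 until profiled */
	/* task_autonuma_reset() clears everything from here down */
	unsigned long task_numa_fault_pass;	/* last knuma_scand pass seen */
	unsigned long task_numa_fault_tot;	/* total NUMA hinting faults */
	unsigned long task_numa_fault[0];	/* per-node faults, nr_node_ids entries */
};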

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 14d68d3..1d8a7e8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -209,6 +209,7 @@ void free_task(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk->stack, -1);
 	arch_release_thread_info(tsk->stack);
+	free_task_autonuma(tsk);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
@@ -264,6 +265,9 @@ void __init fork_init(unsigned long mempages)
 	/* do the arch specific task caches init */
 	arch_task_cache_init();
 
+	/* prepare task_autonuma for alloc_task_autonuma/free_task_autonuma */
+	task_autonuma_init();
+
 	/*
 	 * The default maximum number of threads is set to a safe
 	 * value: the thread structures can take up at most half
@@ -310,6 +314,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto free_ti;
 
+	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
+		/* free_thread_info() undoes arch_dup_task_struct() too */
+		goto free_ti;
+
 	tsk->stack = ti;
 
 	setup_thread_stack(tsk, orig);

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 16/33] autonuma: alloc/free/init mm_autonuma
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-03 23:50 ` [PATCH 17/33] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This is where the mm_autonuma structure is handled.

mm_autonuma holds the link for knuma_scand's list of mm structures to
scan and a pointer to the associated mm structure for knuma_scand's
convenience.

It also contains the per-mm NUMA statistics collected by the
knuma_scand daemon. The per-mm NUMA statistics are needed by
sched_autonuma_balance to take appropriate NUMA balancing decisions
when balancing threads belonging to different processes.

Just like task_autonuma, this is only allocated at runtime if the
hardware the kernel is running on has been detected as NUMA. On
non-NUMA hardware the memory cost is reduced to one pointer per mm.

To get rid of the pointer in each mm, the kernel can be compiled with
CONFIG_AUTONUMA=n.
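
As with task_autonuma, the definition is added by an earlier patch; a
rough sketch inferred from mm_autonuma_size(), mm_autonuma_reset() and
the knuma_scand code later in the series (exact types and ordering are
an assumption):

struct mm_autonuma {
	struct list_head mm_node;	/* entry in knuma_scand's mm list */
	struct mm_struct *mm;		/* backpointer for knuma_scand */
	/* mm_autonuma_reset() clears everything from here down */
	unsigned long mm_numa_fault_pass;	/* knuma_scand pass counter */
	unsigned long mm_numa_fault_tot;	/* total pages accounted */
	unsigned long mm_numa_fault[0];	/* per-node counts, nr_node_ids entries */
};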

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1d8a7e8..697dc2f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -532,6 +532,8 @@ static void mm_init_aio(struct mm_struct *mm)
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
+	if (unlikely(alloc_mm_autonuma(mm)))
+		goto out_free_mm;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
@@ -554,6 +556,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_autonuma(mm);
+out_free_mm:
 	free_mm(mm);
 	return NULL;
 }
@@ -603,6 +607,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_autonuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -885,6 +890,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_autonuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
@@ -1707,6 +1713,7 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+	mm_autonuma_init();
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 	mmap_init();
 	nsproxy_cache_init();

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 17/33] autonuma: prevent select_task_rq_fair to return -1
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 16/33] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
@ 2012-10-03 23:50 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 18/33] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

find_idlest_cpu, when run across all domain levels, shouldn't
normally return -1. With the introduction of the NUMA affinity check
that should still be true most of the time, but it's not guaranteed if
the NUMA affinity of the task changes very fast. So it is better not
to depend on timing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 877f077..0c6bedd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2808,6 +2808,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+#ifdef CONFIG_AUTONUMA
+	if (new_cpu < 0)
+		/*
+		 * find_idlest_cpu() may return -1 if
+		 * task_autonuma_cpu() changes all the time, it's very
+		 * unlikely, but we must handle it if it ever happens.
+		 */
+		new_cpu = prev_cpu;
+#endif
+	BUG_ON(new_cpu < 0);
+
 	return new_cpu;
 }
 #endif /* CONFIG_SMP */

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2012-10-03 23:50 ` [PATCH 17/33] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-05  6:41   ` Mike Galbraith
  2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
                   ` (19 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

The CFS scheduler is still in charge of all scheduling decisions. At
times, however, AutoNUMA balancing will override them.

Generally, we'll just rely on the CFS scheduler to keep doing its
thing, while preferring the task's AutoNUMA affine node when deciding
to move a task to a different runqueue or when waking it up.

For example, idle balancing, while looking into the runqueues of busy
CPUs, will first look for a task that "wants" to run on the NUMA node
of this idle CPU (one where task_autonuma_cpu() returns true).

Most of this is encoded in can_migrate_task() becoming AutoNUMA
aware: each balancing attempt now runs two passes, the first NUMA
aware and the second one relaxed.

Idle or newidle balancing is always allowed to fall back to scheduling
non-affine AutoNUMA tasks (ones with task_selected_nid set to another
node). Load balancing, which affects fairness more than performance,
is only allowed to override AutoNUMA affinity if the flag
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set.

Tasks that haven't been fully profiled yet are not affected by this,
because their p->task_autonuma->task_selected_nid is still set to its
initial value of -1 and task_autonuma_cpu() always returns true in
that case.
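
task_autonuma_cpu() itself is defined elsewhere in the series; the
semantics relied upon by this patch amount to the following sketch
(not the actual helper):

/* true if @cpu is acceptable for @p from the AutoNUMA point of view */
static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
	int nid = p->task_autonuma ? p->task_autonuma->task_selected_nid : -1;

	/* no selected node yet (or no NUMA data): every CPU is fine */
	return nid < 0 || nid == cpu_to_node(cpu);
}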

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   67 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c6bedd..05c5c78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2622,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (!task_autonuma_cpu(p, i))
+				continue;
 			min_load = load;
 			idlest = i;
 		}
@@ -2640,12 +2642,14 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
+	bool idle_target;
 
 	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
+	 * If the task is going to be woken-up on this cpu and if it
+	 * is already idle and if this cpu is in the AutoNUMA selected
+	 * NUMA node, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
 		return cpu;
 
 	/*
@@ -2658,6 +2662,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2671,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+					 tsk_cpus_allowed(p)) {
+				/* Find autonuma cpu only in idle group */
+				if (task_autonuma_cpu(p, i)) {
+					target = i;
+					goto done;
+				}
+				if (!idle_target) {
+					idle_target = true;
+					target = i;
+				}
+			}
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
@@ -2708,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    task_autonuma_cpu(p, cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3081,6 +3096,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_SOME_PINNED 0x04
+#define LBF_NUMA	0x08
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3160,7 +3176,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * 3) are cache-hot on their current CPU, or
+	 * 4) going to be migrated to a dst_cpu not in the selected NUMA node
+	 *    if LBF_NUMA is set.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
 		int new_dst_cpu;
@@ -3195,6 +3213,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+					     env->dst_cpu, env->idle))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
@@ -3231,6 +3253,8 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+	env->flags |= autonuma_possible() ? LBF_NUMA : 0;
+numa_repeat:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
@@ -3245,8 +3269,14 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
+		env->flags &= ~LBF_NUMA;
 		return 1;
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		goto numa_repeat;
+	}
+
 	return 0;
 }
 
@@ -3271,6 +3301,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	env->flags |= autonuma_possible() ? LBF_NUMA : 0;
+numa_repeat:
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
@@ -3310,9 +3342,13 @@ static int move_tasks(struct lb_env *env)
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+		if (env->idle == CPU_NEWLY_IDLE) {
+			env->flags &= ~LBF_NUMA;
+			goto out;
+		}
 #endif
+		/* not idle anymore after pulling first task */
+		env->idle = CPU_NOT_IDLE;
 
 		/*
 		 * We only want to steal up to the prescribed amount of
@@ -3325,6 +3361,17 @@ static int move_tasks(struct lb_env *env)
 next:
 		list_move_tail(&p->se.group_node, tasks);
 	}
+	if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		if (env->imbalance > 0) {
+			env->loop = 0;
+			env->loop_break = sched_nr_migrate_break;
+			goto numa_repeat;
+		}
+	}
+#ifdef CONFIG_PREEMPT
+out:
+#endif
 
 	/*
 	 * Right now, this is one of only two places move_task() is called,

--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 18/33] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-10 22:01   ` Rik van Riel
                     ` (2 more replies)
  2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
                   ` (18 subsequent siblings)
  37 siblings, 3 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This implements the following parts of autonuma:

o knuma_scand: daemon for setting pte_numa and pmd_numa while
  gathering NUMA mm stats

o NUMA hinting page fault handler: performs Migrate On Fault and
  gathers NUMA task stats

o Migrate On Fault: in the context of the NUMA hinting page faults we
  migrate memory from remote nodes to the local node

o The rest of autonuma core logic: false sharing detection, sysfs and
  initialization routines

When knuma_scand is not running, the AutoNUMA algorithm is fully
bypassed and does not alter the runtime behavior of memory management
or the scheduler.

The whole AutoNUMA logic is a chain reaction triggered by the actions
of knuma_scand. The various parts of the code can be described as
different gears (gears as in glxgears).

knuma_scand is the first gear and it collects the mm_autonuma
per-process statistics and at the same time it sets the ptes and pmds
it scans respectively as pte_numa and pmd_numa.

The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa/pmd_numa pmd/ptes. They collect the task_autonuma
per-thread statistics. They also implement the memory follow CPU logic
where we track if pages are repeatedly accessed by remote nodes. The
memory follow CPU logic can decide to migrate pages across different
NUMA nodes using Migrate On Fault.

The third gear is Migrate On Fault. Pages pending for migration are
migrated in the context of the NUMA hinting page faults. Each
destination node has a migration rate limit configurable with sysfs.

The fourth gear is the NUMA scheduler balancing code. It evaluates
the statistical information collected in mm->mm_autonuma and
p->task_autonuma, along with the status of all CPUs, to decide whether
tasks should be migrated to CPUs in remote nodes.

The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults are the per-process
mm->mm_autonuma statistics. Those mm_autonuma statistics are collected
by the knuma_scand pmd/pte scans that are also responsible for setting
pte_numa/pmd_numa to activate the NUMA hinting page faults.

knuma_scand -> NUMA hinting page faults
  |                       |
 \|/                     \|/
mm_autonuma  <->  task_autonuma (CPU follow memory, this is mm_autonuma too)
                  page last_nid  (false thread sharing/thread shared memory detection)
                  queue or cancel page migration (memory follow CPU)
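
As a concrete illustration of the false sharing filter (the math is
documented in the header of last_nid_set() below): a page is only
considered for migration when two consecutive NUMA hinting faults on
it come from the same remote node. If a remote node accounts for a
fraction p of the faults on a page, back-to-back faults from it occur
with probability roughly p^2, so a node generating 30% of the accesses
triggers a migration attempt only about 9% of the time, while a node
generating 90% of them passes the filter about 81% of the time.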

The code includes some fixes from Hillf Danton <dhillf@gmail.com>.

Math documentation on autonuma_last_nid in the header of
last_nid_set() reworked from sched-numa code by Peter Zijlstra
<a.p.zijlstra@chello.nl>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hillf Danton <dhillf@gmail.com>
---
 mm/autonuma.c    | 1365 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c |   34 ++
 2 files changed, 1399 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..1b2530c
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1365 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on non NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_POSSIBLE_FLAG)
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	|(1<<AUTONUMA_ENABLED_FLAG)
+#endif
+	|(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 10000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* page migration rate limiting control */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knuma_scand_data {
+	struct list_head mm_head; /* entry: mm->mm_autonuma->mm_node */
+	struct mm_struct *mm;
+	unsigned long address;
+	unsigned long *mm_numa_fault_tmp;
+} knuma_scand_data = {
+	.mm_head = LIST_HEAD_INIT(knuma_scand_data.mm_head),
+};
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int last_nid;
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+static int sync_isolate_migratepages(struct list_head *migratepages,
+				     struct page *page,
+				     struct pglist_data *pgdat,
+				     bool *migrated)
+{
+	struct zone *zone;
+	struct lruvec *lruvec;
+	int nr_subpages;
+	struct page *subpage;
+	int ret = 0;
+
+	nr_subpages = 1;
+	if (PageTransHuge(page)) {
+		nr_subpages = HPAGE_PMD_NR;
+		VM_BUG_ON(!PageAnon(page));
+		/* FIXME: remove split_huge_page */
+		if (unlikely(split_huge_page(page))) {
+			autonuma_printk("autonuma migrate THP free\n");
+			goto out;
+		}
+	}
+
+	/* All THP subpages are guaranteed to be in the same zone */
+	zone = page_zone(page);
+
+	for (subpage = page; subpage < page+nr_subpages; subpage++) {
+		spin_lock_irq(&zone->lru_lock);
+
+		/* Must run under the lru_lock and before page isolation */
+		lruvec = mem_cgroup_page_lruvec(subpage, zone);
+
+		if (!__isolate_lru_page(subpage, ISOLATE_ASYNC_MIGRATE)) {
+			VM_BUG_ON(PageTransCompound(subpage));
+			del_page_from_lru_list(subpage, lruvec,
+					       page_lru(subpage));
+			inc_zone_state(zone, page_is_file_cache(subpage) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+
+			list_add(&subpage->lru, migratepages);
+			ret++;
+		} else {
+			/* losing page */
+			spin_unlock_irq(&zone->lru_lock);
+		}
+	}
+
+	/*
+	 * Pin the head subpage at least until the first
+	 * __isolate_lru_page succeeds (__isolate_lru_page pins it
+	 * again when it succeeds). If we unpin before
+	 * __isolate_lru_page succeeds, the page could be freed and
+	 * reallocated out from under us. Thus our previous checks on
+	 * the page, and the split_huge_page, would be worthless.
+	 *
+	 * We really only need to do this if "ret > 0" but it doesn't
+	 * hurt to do it unconditionally as nobody can reference
+	 * "page" anymore after this and so we can avoid an "if (ret >
+	 * 0)" branch here.
+	 */
+	put_page(page);
+	/*
+	 * Tell the caller we already released its pin, to avoid a
+	 * double free.
+	 */
+	*migrated = true;
+
+out:
+	return ret;
+}
+
+static bool autonuma_balance_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	/* FIXME: this only checks the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, not the zone.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN | __GFP_NO_KSWAPD) &
+					 ~GFP_IOFS, 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_migrate_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_migrate_lock);
+}
+
+static bool autonuma_migrate_page(struct page *page, int dst_nid,
+				  int page_nid, bool *migrated)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+	struct pglist_data *pgdat = NODE_DATA(dst_nid);
+	int nr_pages = hpage_nr_pages(page);
+	unsigned long autonuma_migrate_nr_pages = 0;
+
+	autonuma_migrate_lock(dst_nid);
+	if (time_after(jiffies, pgdat->autonuma_migrate_last_jiffies +
+		       msecs_to_jiffies(migrate_sleep_millisecs))) {
+		autonuma_migrate_nr_pages = pgdat->autonuma_migrate_nr_pages;
+		pgdat->autonuma_migrate_nr_pages = 0;
+		pgdat->autonuma_migrate_last_jiffies = jiffies;
+	}
+	if (pgdat->autonuma_migrate_nr_pages >= pages_to_migrate) {
+		autonuma_migrate_unlock(dst_nid);
+		goto out;
+	}
+	pgdat->autonuma_migrate_nr_pages += nr_pages;
+	autonuma_migrate_unlock(dst_nid);
+
+	if (autonuma_migrate_nr_pages)
+		autonuma_printk("migrated %lu pages to node %d\n",
+				autonuma_migrate_nr_pages, dst_nid);
+
+	if (autonuma_balance_pgdat(pgdat, nr_pages))
+		isolated = sync_isolate_migratepages(&migratepages,
+						     page, pgdat,
+						     migrated);
+
+	if (isolated) {
+		int err;
+		pages_migrated += isolated; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, MIGRATE_ASYNC);
+		if (err)
+			putback_lru_pages(&migratepages);
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct task_autonuma *task_autonuma,
+				   unsigned long *task_numa_fault)
+{
+	int nid;
+	/* If a new pass started, degrade the stats by a factor of 2 */
+	for_each_node(nid)
+		task_numa_fault[nid] >>= 1;
+	task_autonuma->task_numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool new_pass)
+{
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
+
+	/* prevent sched_autonuma_balance() from running on top of us */
+	local_bh_disable();
+
+	if (unlikely(new_pass))
+		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
+	task_numa_fault[access_nid] += numpages;
+	task_autonuma->task_numa_fault_tot += numpages;
+
+	local_bh_enable();
+}
+
+/*
+ * In this function we build a temporal CPU_node<->page relation by
+ * using a two-stage autonuma_last_nid filter to remove short/unlikely
+ * relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a node's CPU usage of a particular page (n_p) per total
+ * usage of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely
+ * we act on an unlikely CPU_node<->page relation.
+ */
+static inline bool last_nid_set(struct page *page, int this_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(this_nid < 0);
+	VM_BUG_ON(this_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid != this_nid) {
+		if (autonuma_last_nid >= 0)
+			ret = false;
+		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
+	}
+	return ret;
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct page *page,
+						int this_nid, int page_nid,
+						bool new_pass,
+						bool *migrated)
+{
+	if (!last_nid_set(page, this_nid))
+		goto out;
+	if (!PageLRU(page))
+		goto out;
+	if (this_nid != page_nid) {
+		if (autonuma_migrate_page(page, this_nid, page_nid,
+					  migrated))
+			return this_nid;
+	}
+out:
+	return page_nid;
+}
+
+bool numa_hinting_fault(struct page *page, int numpages)
+{
+	bool migrated = false;
+
+	/*
+	 * "current->mm" could be different from the "mm" where the
+	 * NUMA hinting page fault happened, if get_user_pages()
+	 * triggered the fault on some other process "mm". That is ok,
+	 * all we care about is to count the "page_nid" access on the
+	 * current->task_autonuma, even if the page belongs to a
+	 * different "mm".
+	 */
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int this_nid, page_nid, access_nid;
+		bool new_pass;
+
+		/*
+		 * new_pass is only true the first time the thread
+		 * faults on this pass of knuma_scand.
+		 */
+		new_pass = p->task_autonuma->task_numa_fault_pass !=
+			p->mm->mm_autonuma->mm_numa_fault_pass;
+		page_nid = page_to_nid(page);
+		this_nid = numa_node_id();
+		VM_BUG_ON(this_nid < 0);
+		VM_BUG_ON(this_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(page,
+								  this_nid,
+								  page_nid,
+								  new_pass,
+								  &migrated);
+		/* "page" has been already freed if "migrated" is true */
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, new_pass);
+		if (unlikely(new_pass))
+			/*
+			 * Set the task's fault_pass equal to the new
+			 * mm's fault_pass, so new_pass will be false
+			 * on the next fault by this thread in this
+			 * same pass.
+			 */
+			p->task_autonuma->task_numa_fault_pass =
+				p->mm->mm_autonuma->mm_numa_fault_pass;
+	}
+
+	return migrated;
+}
+
+/* NUMA hinting page fault entry point for ptes */
+int pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+	bool migrated;
+
+	/*
+	 * The "pte" at this point cannot be used safely without
+	 * validation through pte_unmap_same(). It's of NUMA type but
+	 * the pfn may be screwed if the read is non atomic.
+	 */
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto out_unlock;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	if (unlikely(page_mapcount(page) != 1))
+		goto out_unlock;
+	get_page(page);
+	pte_unmap_unlock(ptep, ptl);
+
+	migrated = numa_hinting_fault(page, 1);
+	if (!migrated)
+		put_page(page);
+out:
+	return 0;
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	goto out;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+int pmd_numa_fixup(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte, *orig_pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+	struct vm_area_struct *vma;
+	bool migrated;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return 0;
+
+	vma = find_vma(mm, _addr);
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (unlikely(page_mapcount(page) != 1))
+			continue;
+		get_page(page);
+		pte_unmap_unlock(pte, ptl);
+
+		migrated = numa_hinting_fault(page, 1);
+		if (!migrated)
+			put_page(page);
+
+		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+	return 0;
+}
+
+static inline int task_autonuma_size(void)
+{
+	return sizeof(struct task_autonuma) +
+		nr_node_ids * sizeof(unsigned long);
+}
+
+static inline int task_autonuma_reset_size(void)
+{
+	struct task_autonuma *task_autonuma = NULL;
+	return task_autonuma_size() -
+		(int)((char *)(&task_autonuma->task_numa_fault_pass) -
+		      (char *)task_autonuma);
+}
+
+static void __task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+	memset(&task_autonuma->task_numa_fault_pass, 0,
+	       task_autonuma_reset_size());
+}
+
+static void task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+	task_autonuma->task_selected_nid = -1;
+	__task_autonuma_reset(task_autonuma);
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return nr_node_ids * sizeof(unsigned long);
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size();
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->task_autonuma)
+		task_autonuma_reset(p->task_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+/*
+ * Here we search for not shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault. We also collect the
+ * mm_autonuma statistics for this process mm at the same time.
+ */
+static int knuma_scand_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		int page_nid;
+		unsigned long *fault_tmp;
+		ret = HPAGE_PMD_NR;
+
+		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+		page = pmd_page(*pmd);
+
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page_nid = page_to_nid(page);
+		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+		fault_tmp[page_nid] += ret;
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		/* defer TLB flush to lower the overhead */
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		goto out;
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		unsigned long *fault_tmp;
+		if (!pte_present(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+		fault_tmp[page_to_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		/*
+		 * Mark the page table pmd as numa if "autonuma scan
+		 * pmd" mode is enabled.
+		 */
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long tot;
+	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+
+	/* FIXME: would be better protected with write_seqlock_bh() */
+	local_bh_disable();
+
+	tot = 0;
+	for_each_node(nid) {
+		unsigned long faults = fault_tmp[nid];
+		fault_tmp[nid] = 0;
+		mma->mm_numa_fault[nid] = faults;
+		tot += faults;
+	}
+	mma->mm_numa_fault_tot = tot;
+
+	local_bh_enable();
+}
+
+static void mm_numa_fault_tmp_reset(void)
+{
+	memset(knuma_scand_data.mm_numa_fault_tmp, 0,
+	       mm_autonuma_fault_size());
+}
+
+static inline void validate_mm_numa_fault_tmp(unsigned long address)
+{
+#ifdef CONFIG_DEBUG_VM
+	int nid;
+	if (address)
+		return;
+	for_each_node(nid)
+		BUG_ON(knuma_scand_data.mm_numa_fault_tmp[nid]);
+#endif
+}
+
+/*
+ * Scan the next part of the mm. Keep track of the progress made and
+ * return it.
+ */
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knuma_scand_data.mm;
+	/*
+	 * knuma_scand_data.mm is NULL after the end of each
+	 * knuma_scand pass. So when it's NULL we start from
+	 * scratch with the very first mm in the list.
+	 */
+	if (!mm) {
+		if (unlikely(list_empty(&knuma_scand_data.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knuma_scand_data.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knuma_scand_data.address = 0;
+		knuma_scand_data.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->mm_numa_fault_pass++;
+	}
+	address = knuma_scand_data.address;
+
+	validate_mm_numa_fault_tmp(address);
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+			progress++;
+			continue;
+		}
+		/*
+		 * Skip regions mprotected with PROT_NONE. It would be
+		 * safe to scan them too, but it's worthless because
+		 * NUMA hinting page faults can't run on those.
+		 */
+		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knuma_scand_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the NUMA hinting
+		 * page faults after we finish scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knuma_scand_data.mm != mm);
+	knuma_scand_data.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knuma_scand_data.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knuma_scand_data.mm = mm_autonuma->mm;
+			atomic_inc(&knuma_scand_data.mm->mm_count);
+			knuma_scand_data.address = 0;
+			knuma_scand_data.mm->mm_autonuma->mm_numa_fault_pass++;
+		} else
+			knuma_scand_data.mm = NULL;
+
+		if (knumad_test_exit(mm)) {
+			list_del(&mm->mm_autonuma->mm_node);
+			/* tell autonuma_exit not to list_del */
+			VM_BUG_ON(mm->mm_autonuma->mm != mm);
+			mm->mm_autonuma->mm = NULL;
+			mm_numa_fault_tmp_reset();
+		} else
+			mm_numa_fault_tmp_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	/*
+	 * Serialize the knuma_scand_data against
+	 * autonuma_enter/exit().
+	 */
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+
+		/* Do one loop of scanning, keeping track of the progress */
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		/* Check if we completed one full scan pass */
+		if (unlikely(!knuma_scand_data.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knuma_scand_data.mm;
+	knuma_scand_data.mm = NULL;
+	if (mm && knumad_test_exit(mm)) {
+		list_del(&mm->mm_autonuma->mm_node);
+		/* tell autonuma_exit not to list_del */
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL;
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+	mm_numa_fault_tmp_reset();
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (!autonuma_possible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knuma_scand_data.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (!autonuma_possible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knuma_scand_data.mm == mm)
+		serialize = true;
+	else if (mm->mm_autonuma->mm) {
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL; /* debug */
+		list_del(&mm->mm_autonuma->mm_node);
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		/* prevent the mm from going away under knumad_do_scan's main loop */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knuma_scand_data.mm_numa_fault_tmp = kzalloc(mm_autonuma_fault_size(),
+						     GFP_KERNEL);
+	if (!knuma_scand_data.mm_numa_fault_tmp)
+		return -ENOMEM;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_ENABLED_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_ENABLED_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+#ifdef CONFIG_DEBUG_VM
+SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
+#endif /* CONFIG_DEBUG_VM */
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_SCAN_SLEEP_ENTRY,
+	SYSFS_SCAN_PAGES_ENTRY,
+	SYSFS_MIGRATE_SLEEP_ENTRY,
+	SYSFS_MIGRATE_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)					\
+	static ssize_t NAME ## _show(struct kobject *kobj,		\
+				     struct kobj_attribute *attr,	\
+				     char *buf)				\
+	{								\
+		return sprintf(buf, "%u\n", NAME);			\
+	}								\
+	static ssize_t NAME ## _store(struct kobject *kobj,		\
+				      struct kobj_attribute *attr,	\
+				      const char *buf, size_t count)	\
+	{								\
+		unsigned long val;					\
+		int err;						\
+									\
+		err = strict_strtoul(buf, 10, &val);			\
+		if (err || val > UINT_MAX)				\
+			return -EINVAL;					\
+		switch (SYSFS_TYPE) {					\
+		case SYSFS_SCAN_PAGES_ENTRY:				\
+		case SYSFS_MIGRATE_PAGES_ENTRY:				\
+			if (!val)					\
+				return -EINVAL;				\
+			break;						\
+		}							\
+									\
+		NAME = val;						\
+		switch (SYSFS_TYPE) {					\
+		case SYSFS_SCAN_SLEEP_ENTRY:				\
+			wake_up_interruptible(&knuma_scand_wait);	\
+			break;						\
+		}							\
+									\
+		return count;						\
+	}								\
+	static struct kobj_attribute NAME ## _attr =			\
+		__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_SCAN_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_SCAN_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_SCAN_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_MIGRATE_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_MIGRATE_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+
+	&debug_attr.attr,
+
+	/* migrate start */
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	/* migrate end */
+
+	/* scan start */
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&scan_pmd_attr.attr,
+	/* scan end */
+
+#ifdef CONFIG_DEBUG_VM
+	&sched_load_balance_strict_attr.attr,
+	&child_inheritance_attr.attr,
+#endif
+
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	return 0;
+
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (autonuma_possible()) {
+		printk("AutoNUMA permanently disabled\n");
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		WARN_ON(autonuma_possible()); /* avoid early crash */
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static bool autonuma_init_checks_failed(void)
+{
+	/* safety checks on nr_node_ids */
+	int last_nid = find_last_bit(node_states[N_POSSIBLE].bits, MAX_NUMNODES);
+	if (last_nid + 1 != nr_node_ids) {
+		WARN_ON(1);
+		return true;
+	}
+	if (num_possible_nodes() > nr_node_ids) {
+		WARN_ON(1);
+		return true;
+	}
+	return false;
+}
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (num_possible_nodes() <= 1 || !autonuma_possible()) {
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		return -EINVAL;
+	} else if (autonuma_init_checks_failed()) {
+		printk("autonuma disengaged: init checks failed\n");
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		return -EINVAL;
+	}
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *task_autonuma_cachep;
+
+int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct task_autonuma *task_autonuma;
+
+	if (!autonuma_possible())
+		goto no_numa;
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, node);
+	if (!task_autonuma)
+		goto out;
+	if (!autonuma_child_inheritance()) {
+		/*
+		 * Only reset the task NUMA stats; always inherit the
+		 * task_selected_nid. It's certainly better to start
+		 * the child on the same NUMA node as the parent, if
+		 * idle/load balancing permits. If it doesn't,
+		 * task_selected_nid is a transient entity and will
+		 * be updated accordingly.
+		 */
+		task_autonuma->task_selected_nid =
+			orig->task_autonuma->task_selected_nid;
+		__task_autonuma_reset(task_autonuma);
+	} else
+		memcpy(task_autonuma, orig->task_autonuma,
+		       task_autonuma_size());
+	VM_BUG_ON(task_autonuma->task_selected_nid < -1);
+	VM_BUG_ON(task_autonuma->task_selected_nid >= nr_node_ids);
+	tsk->task_autonuma = task_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_task_autonuma(struct task_struct *tsk)
+{
+	if (!autonuma_possible()) {
+		BUG_ON(tsk->task_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->task_autonuma);
+	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
+	tsk->task_autonuma = NULL;
+}
+
+void __init task_autonuma_init(void)
+{
+	struct task_autonuma *task_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (!autonuma_possible())
+		return;
+
+	task_autonuma_cachep =
+		kmem_cache_create("task_autonuma",
+				  task_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, numa_node_id());
+	BUG_ON(!task_autonuma);
+	task_autonuma_reset(task_autonuma);
+	BUG_ON(current->task_autonuma);
+	current->task_autonuma = task_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (!autonuma_possible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (!autonuma_child_inheritance() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+
+	/*
+	 * We're not leaking memory here: if mm->mm_autonuma is not
+	 * NULL, it is a non-refcounted copy of the parent's
+	 * mm->mm_autonuma pointer.
+	 */
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (!autonuma_possible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (!autonuma_possible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25e262a..edee54d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1036,6 +1036,40 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_AUTONUMA
+/* NUMA hinting page fault entry point for trans huge pmds */
+int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			pmd_t pmd, pmd_t *pmdp)
+{
+	struct page *page;
+	bool migrated;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
+	page = pmd_page(pmd);
+	pmd = pmd_mknonnuma(pmd);
+	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+	VM_BUG_ON(pmd_numa(*pmdp));
+	if (unlikely(page_mapcount(page) != 1))
+		goto out_unlock;
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	migrated = numa_hinting_fault(page, HPAGE_PMD_NR);
+	if (!migrated)
+		put_page(page);
+
+out:
+	return 0;
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	goto out;
+}
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
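
For illustration only (not part of the patch): the attributes registered
above by autonuma_init_sysfs() appear under /sys/kernel/mm/autonuma/
(the kobject is created under mm_kobj, i.e. /sys/kernel/mm). A minimal
userspace sketch that enables AutoNUMA and dumps the read-only counters;
the read_knob() helper exists only for this example and error handling
is kept to a minimum.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* read one numeric sysfs attribute; helper only for this example */
static long read_knob(const char *name)
{
	char path[128], buf[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/sys/kernel/mm/autonuma/%s", name);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return atol(buf);
}

int main(void)
{
	int fd = open("/sys/kernel/mm/autonuma/enabled", O_WRONLY);

	if (fd >= 0) {
		/* same as: echo 1 > /sys/kernel/mm/autonuma/enabled */
		write(fd, "1", 1);
		close(fd);
	}

	printf("full_scans=%ld pages_scanned=%ld pages_migrated=%ld\n",
	       read_knob("full_scans"), read_knob("pages_scanned"),
	       read_knob("pages_migrated"));
	return 0;
}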


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-04 20:03   ` KOSAKI Motohiro
  2012-10-11 18:32   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 21/33] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
                   ` (17 subsequent siblings)
  37 siblings, 2 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

If a task_selected_nid has already been selected for the task, try to
allocate memory from that node even if it is temporarily not the local
node. Chances are that is where most of the task's memory is already
located and where the task will run in the future.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ada3be..5cffcb6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1951,10 +1951,18 @@ retry_cpuset:
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
+	else {
+		int nid = -1;
+#ifdef CONFIG_AUTONUMA
+		if (current->task_autonuma)
+			nid = current->task_autonuma->task_selected_nid;
+#endif
+		if (nid < 0)
+			nid = numa_node_id();
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_zonelist(gfp, pol, nid),
 				policy_nodemask(gfp, pol));
+	}
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
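
For illustration only (not part of the patch): to observe where the
default policy actually placed a task's anonymous memory, the query
form of move_pages(2) can be used (with nodes == NULL it only reports,
in status[], the node currently backing each page). This is a sketch,
the page count is arbitrary; build with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 8

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	void *pages[NPAGES];
	int status[NPAGES];
	char *buf;
	int i;

	buf = mmap(NULL, NPAGES * psize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, NPAGES * psize);		/* fault the pages in */
	for (i = 0; i < NPAGES; i++)
		pages[i] = buf + i * psize;

	/* nodes == NULL: just report the node backing each page */
	if (move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
		for (i = 0; i < NPAGES; i++)
			printf("page %d on node %d\n", i, status[i]);

	munmap(buf, NPAGES * psize);
	return 0;
}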


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 21/33] autonuma: call autonuma_split_huge_page()
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:33   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 22/33] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
                   ` (16 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This transfers the autonuma_last_nid information to all tail pages
during split_huge_page.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index edee54d..152d4dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/autonuma.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1350,6 +1351,7 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
+		autonuma_migrate_split_huge_page(page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(__page_count(page) <= 0);


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 22/33] autonuma: make khugepaged pte_numa aware
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 21/33] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:36   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 23/33] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
                   ` (15 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

If any of the ptes that khugepaged is collapsing is a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.

See the inline comment for why a single pte_numa pte is enough to make
the resulting pmd a pmd_numa. If needed, the number of pte_numa ptes
required to create a pmd_numa could later be changed and made tunable
through sysfs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 152d4dd..1023e67 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1833,12 +1833,19 @@ out:
 	return isolated;
 }
 
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+/*
+ * Do the actual data copy for mapped ptes and release the mapped
+ * pages, or alternatively zero out the transparent hugepage in the
+ * mapping holes. Transfer the page_autonuma information in the
+ * process. Return true if any of the mapped ptes was of numa type.
+ */
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
 				      spinlock_t *ptl)
 {
 	pte_t *_pte;
+	bool mknuma = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1865,11 +1872,29 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			page_remove_rmap(src_page);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
+
+			/*
+			 * Only require one pte_numa mapped by a pmd
+			 * to make it a pmd_numa, too. To avoid the
+			 * risk of losing NUMA hinting page faults, it
+			 * is better to overestimate the NUMA node
+			 * affinity with a node where we just
+			 * collapsed a hugepage, rather than
+			 * underestimate it.
+			 *
+			 * Note: if AUTONUMA_SCAN_PMD_FLAG is set, we
+			 * won't find any pte_numa ptes since we're
+			 * only setting NUMA hinting at the pmd
+			 * level.
+			 */
+			mknuma |= pte_numa(pteval);
 		}
 
 		address += PAGE_SIZE;
 		page++;
 	}
+
+	return mknuma;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -1887,6 +1912,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	bool mknuma = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -2005,7 +2031,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	mknuma = pmd_numa(_pmd);
+	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -2015,6 +2042,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 	_pmd = pmd_mkhuge(_pmd);
+	if (mknuma)
+		_pmd = pmd_mknuma(_pmd);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so
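
For illustration only (not part of the patch): the rule above is "one
pte_numa among the collapsed ptes is enough to make the pmd a
pmd_numa". If the changelog's idea of making this tunable were ever
pursued, the decision point would become a threshold check along these
lines; the knob name and any sysfs plumbing are purely hypothetical.

#include <stdbool.h>

/* hypothetical tunable; 1 reproduces the behaviour of this patch */
unsigned int pmd_numa_threshold = 1;

bool collapsed_pmd_should_be_numa(unsigned int numa_pte_count)
{
	return numa_pte_count >= pmd_numa_threshold;
}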


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 23/33] autonuma: retain page last_nid information in khugepaged
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 22/33] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:44   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte Andrea Arcangeli
                   ` (14 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

When pages are collapsed, try to keep the last_nid information from one
of the original pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1023e67..78b2851 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1846,6 +1846,9 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	int autonuma_last_nid = -1;
+#endif
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1855,6 +1858,17 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
 			src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+			/* pick the first one, better than nothing */
+			if (autonuma_last_nid < 0) {
+				autonuma_last_nid =
+					ACCESS_ONCE(src_page->
+						    autonuma_last_nid);
+				if (autonuma_last_nid >= 0)
+					ACCESS_ONCE(page->autonuma_last_nid) =
+						autonuma_last_nid;
+			}
+#endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
 			release_pte_page(src_page);


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 23/33] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:45   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 25/33] autonuma: numa hinting page faults entry points Andrea Arcangeli
                   ` (13 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78b2851..757c1cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1412,6 +1412,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(*pmd))
+				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 25/33] autonuma: numa hinting page faults entry points
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:47   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 26/33] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
                   ` (12 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This is where the NUMA hinting page faults are detected and passed on
to the AutoNUMA core logic.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    2 ++
 mm/memory.c             |   10 ++++++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad4e2e0..eca2c5e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			       pmd_t pmd, pmd_t *pmdp);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 1040e87..c89b1d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/autonuma.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3456,6 +3457,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(entry))
+		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3530,6 +3534,9 @@ retry:
 		 */
 		orig_pmd = ACCESS_ONCE(*pmd);
 		if (pmd_trans_huge(orig_pmd)) {
+			if (pmd_numa(*pmd))
+				return huge_pmd_numa_fixup(mm, address,
+							   orig_pmd, pmd);
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
 			    !pmd_trans_splitting(orig_pmd)) {
@@ -3548,6 +3555,9 @@ retry:
 		}
 	}
 
+	if (pmd_numa(*pmd))
+		return pmd_numa_fixup(mm, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 26/33] autonuma: reset autonuma page data when pages are freed
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 25/33] autonuma: numa hinting page faults entry points Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 27/33] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This initializes the last_nid data at page freeing time, so when pages
are allocated later we can identify the first NUMA hinting page fault
happening on them.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef69743..e096742 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -619,6 +619,9 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+#ifdef CONFIG_AUTONUMA
+	page->autonuma_last_nid = -1;
+#endif
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
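
For illustration only (not part of the patch): a standalone userspace
model of what the "-1" reset above feeds into. The filter itself lives
in last_nid_set() in mm/autonuma.c: a freshly freed/allocated page
(last_nid == -1) lets its first hinting fault through, and afterwards a
page must be faulted by the same node twice in a row before it is
considered for migration again.

#include <stdbool.h>
#include <stdio.h>

static short last_nid = -1;	/* the value free_pages_check() resets to */

/* mirrors the logic of last_nid_set() in mm/autonuma.c */
static bool fault_passes_last_nid_filter(int this_nid)
{
	bool ret = true;

	if (last_nid != this_nid) {
		if (last_nid >= 0)
			ret = false;	/* first fault from a new node */
		last_nid = this_nid;
	}
	return ret;
}

int main(void)
{
	/* faults from node 1, then twice in a row from node 0 */
	printf("%d %d %d\n",
	       fault_passes_last_nid_filter(1),
	       fault_passes_last_nid_filter(0),
	       fault_passes_last_nid_filter(0));
	/* prints "1 0 1": only the repeated node 0 access passes again */
	return 0;
}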


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 27/33] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 26/33] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Link the AutoNUMA core and scheduler object files in the kernel if
CONFIG_AUTONUMA=y.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/Makefile |    1 +
 mm/Makefile           |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 92753e2..0fd3165 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 27/33] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-11 18:50   ` Mel Gorman
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
                   ` (9 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Add the config options to allow building the kernel with AutoNUMA.

If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/Kconfig     |    3 +++
 arch/x86/Kconfig |    1 +
 mm/Kconfig       |   17 +++++++++++++++++
 3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 72f2fa1..ee3ed89 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -281,4 +281,7 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config HAVE_ARCH_AUTONUMA
+	bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 50a1d1f..06575bc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -97,6 +97,7 @@ config X86
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
+	select HAVE_ARCH_AUTONUMA
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS || UPROBES)
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..f00a0cd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -211,6 +211,23 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config AUTONUMA
+	bool "AutoNUMA"
+	select MIGRATION
+	depends on NUMA && HAVE_ARCH_AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration.
+
+	  Saves the administrator from having to manually set up hard
+	  NUMA bindings in order to achieve optimal performance on
+	  NUMA hardware.
+
+config AUTONUMA_DEFAULT_ENABLED
+	bool "Auto NUMA default enabled"
+	depends on AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 29/33] autonuma: page_autonuma
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-04 14:16   ` Christoph Lameter
  2012-10-04 20:09   ` KOSAKI Motohiro
  2012-10-03 23:51 ` [PATCH 30/33] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
                   ` (8 subsequent siblings)
  37 siblings, 2 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Move the autonuma_last_nid from the "struct page" to a separate
page_autonuma data structure allocated in the memsection (with
sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of "struct page". The
page_autonuma data is only allocated if the kernel is booted on real
NUMA hardware and noautonuma is not passed as a parameter to the
kernel.

An alternative would be to take over 16 bits of page->flags, but:

1) 32 bits are already used (in fact 32bit archs are considering
   adding another 32 bits to avoid losing common code features), 16
   bits would be taken by the last_nid, and several bits are used by
   the per-node (readonly) zone/node information, so we would be left
   with just a handful of spare PG_ bits if we stole 16 for the
   last_nid.

2) We cannot exclude that we'll want to add more bits of information
   in the future (and more than 16 wouldn't fit in page->flags).
   Changing the format or layout of the page_autonuma structure is
   trivial compared to altering the format of page->flags, so
   page_autonuma is much more hackable than page->flags.

3) page->flags can be modified from under us with locked ops
   (lock_page and all the page flag operations), and we normally never
   change more than one bit at a time on it. So the only safe way to
   update 16 bits in page->flags would be cmpxchg: that's slow, and
   tricky code would need to be written for it (code that might have
   to be dropped later anyway because of point 2 above). Allocating
   those 2 bytes separately looks a lot cleaner, even if it costs
   0.048% of memory (2 bytes per 4096-byte page, i.e. about 512 KiB
   per GiB of RAM, and only when booting on NUMA hardware).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |    8 ++
 include/linux/autonuma_types.h |   19 +++
 include/linux/mm_types.h       |   11 --
 include/linux/mmzone.h         |   12 ++
 include/linux/page_autonuma.h  |   50 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   37 +++++--
 mm/huge_memory.c               |   13 ++-
 mm/page_alloc.c                |   14 +--
 mm/page_autonuma.c             |  237 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 ++++++++++++++++++++-
 12 files changed, 490 insertions(+), 41 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c
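
For illustration only (not part of the patch): the sparsemem lookup
below relies on a pointer-biasing trick, i.e. sparse_init_one_section()
stores "page_autonuma - section_nr_to_pfn(pnum)" in the mem_section so
that lookup_page_autonuma() can index the per-section array with the
global pfn directly. A self-contained userspace model follows; the
PAGES_PER_SECTION value is made up, and the kernel relies on this
pointer arithmetic being valid on its flat address space.

#include <assert.h>
#include <stdlib.h>

struct page_autonuma { short autonuma_last_nid; };

#define PAGES_PER_SECTION 256UL		/* model value only */

int main(void)
{
	unsigned long pnum = 3;	/* some present section */
	unsigned long section_start_pfn = pnum * PAGES_PER_SECTION;
	struct page_autonuma *map, *biased;
	unsigned long pfn;

	map = calloc(PAGES_PER_SECTION, sizeof(*map));
	assert(map);

	/* what sparse_init_one_section() stores in the mem_section */
	biased = map - section_start_pfn;

	/* what lookup_page_autonuma() computes for a pfn in this section */
	for (pfn = section_start_pfn;
	     pfn < section_start_pfn + PAGES_PER_SECTION; pfn++)
		assert(biased + pfn == map + (pfn - section_start_pfn));

	free(map);
	return 0;
}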

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 02d4875..274c616 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -10,6 +10,13 @@ extern void autonuma_exit(struct mm_struct *mm);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (autonuma_possible())
+		lookup_page_autonuma(page)->autonuma_last_nid = -1;
+}
 
 #define autonuma_printk(format, args...) \
 	if (autonuma_debug()) printk(format, ##args)
@@ -21,6 +28,7 @@ static inline void autonuma_exit(struct mm_struct *mm) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9673ce8..d0c6403 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -78,6 +78,25 @@ struct task_autonuma {
 	/* do not add more variables here, the above array size is dynamic */
 };
 
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is possible.
+ */
+struct page_autonuma {
+	/*
+	 * autonuma_last_nid records the NUMA node that accessed the
+	 * page during the last NUMA hinting page fault. If a
+	 * different node accesses the page next, AutoNUMA will not
+	 * migrate the page. This tries to avoid page thrashing by
+	 * requiring that a page be accessed by the same node twice in
+	 * a row before it is queued for migration.
+	 */
+#if MAX_NUMNODES > 32767
+#error "too many nodes"
+#endif
+	short autonuma_last_nid;
+};
+
 extern int alloc_task_autonuma(struct task_struct *tsk,
 			       struct task_struct *orig,
 			       int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9e8398a..c80101c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,17 +152,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-#if MAX_NUMNODES > 32767
-#error "too many nodes"
-#endif
-	short autonuma_last_nid;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f793541..db68389 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -710,6 +710,9 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	/*
 	 * Lock serializing the per destination node AutoNUMA memory
 	 * migration rate limiting data.
@@ -1081,6 +1084,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * With SPARSEMEM the pgdat doesn't carry the page_autonuma
+	 * pointer; it is kept per memory section instead.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_MEMCG) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..6da6c51
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#include <linux/autonuma_flags.h>
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif /* CONFIG_SPARSEMEM */
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					  unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b286730..586764f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -69,6 +69,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/file.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -456,6 +457,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 0fd3165..5a4fa30 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 1b2530c..b5c5ff6 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -55,10 +55,19 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (!autonuma_possible())
+		return;
+
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	VM_BUG_ON(page_tail_autonuma->autonuma_last_nid != -1);
+
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
 static int sync_isolate_migratepages(struct list_head *migratepages,
@@ -176,13 +185,18 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
 					  __GFP_NOMEMALLOC | __GFP_NORETRY |
 					  __GFP_NOWARN | __GFP_NO_KSWAPD) &
 					 ~GFP_IOFS, 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -291,13 +305,14 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
 static inline bool last_nid_set(struct page *page, int this_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(this_nid < 0);
 	VM_BUG_ON(this_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid != this_nid) {
 		if (autonuma_last_nid >= 0)
 			ret = false;
-		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = this_nid;
 	}
 	return ret;
 }
@@ -1185,7 +1200,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static bool autonuma_init_checks_failed(void)
 {
@@ -1209,7 +1225,12 @@ static int __init autonuma_init(void)
 
 	VM_BUG_ON(num_possible_nodes() < 1);
 	if (num_possible_nodes() <= 1 || !autonuma_possible()) {
-		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		/* should have been already initialized by page_autonuma */
+		if (autonuma_possible()) {
+			WARN_ON(1);
+			/* try to fixup if it wasn't ok */
+			clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		}
 		return -EINVAL;
 	} else if (autonuma_init_checks_failed()) {
 		printk("autonuma disengaged: init checks failed\n");
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 757c1cc..86db742 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1850,7 +1850,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 	bool mknuma = false;
 #ifdef CONFIG_AUTONUMA
 	int autonuma_last_nid = -1;
+	struct page_autonuma *src_page_an, *page_an = NULL;
+
+	if (autonuma_possible())
+		page_an = lookup_page_autonuma(page);
 #endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1862,12 +1867,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
 			/* pick the first one, better than nothing */
-			if (autonuma_last_nid < 0) {
+			if (autonuma_possible() && autonuma_last_nid < 0) {
+				src_page_an = lookup_page_autonuma(src_page);
 				autonuma_last_nid =
-					ACCESS_ONCE(src_page->
-						    autonuma_last_nid);
+					ACCESS_ONCE(src_page_an->autonuma_last_nid);
 				if (autonuma_last_nid >= 0)
-					ACCESS_ONCE(page->autonuma_last_nid) =
+					ACCESS_ONCE(page_an->autonuma_last_nid) =
 						autonuma_last_nid;
 			}
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e096742..8e6493a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -619,9 +620,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-#ifdef CONFIG_AUTONUMA
-	page->autonuma_last_nid = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3797,9 +3796,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4402,14 +4398,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int ret;
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_migrate_lock);
-	pgdat->autonuma_migrate_nr_pages = 0;
-	pgdat->autonuma_migrate_last_jiffies = jiffies;
-#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);
+	pgdat_autonuma_init(pgdat);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..d400d7f
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,237 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+#include <linux/vmalloc.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++)
+		page_autonuma->autonuma_last_nid = -1;
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	spin_lock_init(&pgdat->autonuma_migrate_lock);
+	pgdat->autonuma_migrate_nr_pages = 0;
+	pgdat->autonuma_migrate_last_jiffies = jiffies;
+
+	/* initialize autonuma_possible() */
+	if (num_possible_nodes() <= 1)
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+	int nid, fail;
+
+	/* __pgdat_autonuma_init initialized autonuma_possible() */
+	if (!autonuma_possible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+	/* memsection must be a power of two */
+	BUILD_BUG_ON(sizeof(struct mem_section) &
+		     (sizeof(struct mem_section)-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+				   unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..5b8d018 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -230,7 +231,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -239,6 +241,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -480,6 +490,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
 	set_pageblock_order();
@@ -577,6 +590,63 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	/* __pgdat_autonuma_init initialized autonuma_possible() */
+	if (autonuma_possible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc. */
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -585,6 +655,13 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_possible()) {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		} else
+			page_autonuma = NULL;
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -594,11 +671,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (autonuma_possible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -685,7 +764,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -699,8 +779,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (autonuma_possible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -717,6 +803,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (autonuma_possible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -732,6 +825,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -751,6 +845,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (autonuma_possible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -762,11 +866,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (autonuma_possible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -777,6 +886,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -784,8 +894,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 30/33] autonuma: bugcheck page_autonuma fields on newly allocated pages
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 31/33] autonuma: boost khugepaged scanning rate Andrea Arcangeli
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Debug tweak: sanity check that newly allocated pages reach the
allocator's check_new_page() with autonuma_last_nid reset to -1, and
warn and reject them as bad pages otherwise.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   15 +++++++++++++++
 mm/page_alloc.c          |    3 ++-
 2 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 274c616..9cd94cc 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -18,6 +18,20 @@ static inline void autonuma_free_page(struct page *page)
 		lookup_page_autonuma(page)->autonuma_last_nid = -1;
 }
 
+static inline int autonuma_check_new_page(struct page *page)
+{
+	struct page_autonuma *page_autonuma;
+	int ret = 0;
+	if (autonuma_possible()) {
+		page_autonuma = lookup_page_autonuma(page);
+		if (unlikely(page_autonuma->autonuma_last_nid != -1)) {
+			ret = 1;
+			WARN_ON(1);
+		}
+	}
+	return ret;
+}
+
 #define autonuma_printk(format, args...) \
 	if (autonuma_debug()) printk(format, ##args)
 
@@ -29,6 +43,7 @@ static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
 static inline void autonuma_free_page(struct page *page) {}
+static inline int autonuma_check_new_page(struct page *page) { return 0; }
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e6493a..ecb2f8d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -833,7 +833,8 @@ static inline int check_new_page(struct page *page)
 		(page->mapping != NULL)  |
 		(__page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
-		(mem_cgroup_bad_page_check(page)))) {
+		(mem_cgroup_bad_page_check(page)) |
+		autonuma_check_new_page(page))) {
 		bad_page(page);
 		return 1;
 	}


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 31/33] autonuma: boost khugepaged scanning rate
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (29 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 30/33] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 32/33] autonuma: add migrate_allow_first_fault knob in sysfs Andrea Arcangeli
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate, because every memory migration splits the
hugepages. The regular scanning rate therefore becomes too low when
lots of memory is migrated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86db742..6856468 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,12 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+	/* Hack, remove after THP native migration */
+	if (autonuma_possible()) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 32/33] autonuma: add migrate_allow_first_fault knob in sysfs
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (30 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 31/33] autonuma: boost khugepaged scanning rate Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-03 23:51 ` [PATCH 33/33] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

This sysfs control, if enabled, allows memory migrations on the first
numa hinting page fault.

If disabled, migration on the first fault is forbidden and requires a
confirmation through the last_nid logic.

By default, the first fault is allowed to migrate memory. Disabling it
may increase the time it takes to converge, but it reduces some
initial thrashing in case of NUMA false sharing.
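
For clarity, the decision this knob gates behaves roughly like the
userspace model below (a simplified sketch of the last_nid_set()
change further down in this patch, not the kernel code itself; the
names are made up for the example):

#include <stdbool.h>

static bool migrate_allow_first_fault = true;	/* the sysfs knob, default on */

/*
 * Model: may a NUMA hinting fault from node "this_nid" migrate a page
 * whose stored last_nid is "*last_nid"? A value of -1 means the page
 * has never been faulted on before.
 */
static bool may_migrate(int *last_nid, int this_nid)
{
	bool ok = true;

	if (*last_nid != this_nid) {
		/* a brand new page (-1) passes only if the knob is enabled */
		if (!migrate_allow_first_fault || *last_nid >= 0)
			ok = false;
		*last_nid = this_nid;	/* remember for the next fault */
	}
	return ok;
}

With the knob enabled, the very first fault on a freshly allocated
page (last_nid == -1) may migrate it right away; with the knob
disabled, that first fault only records the node and a second fault
from the same node is needed before a migration is allowed.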

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   20 ++++++++++++++++++++
 mm/autonuma.c                  |   11 +++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 630ecc5..988a8b5 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -73,6 +73,20 @@ enum autonuma_flag {
 	 * Default set.
 	 */
 	AUTONUMA_SCAN_PMD_FLAG,
+	/*
+	 * If not set, a page must successfully pass a last_nid check
+	 * before it can be migrated if it's the very first NUMA
+	 * hinting page fault occurring on the page. If set, the first
+	 * NUMA hinting page fault of a newly allocated page will
+	 * always pass the last_nid check.
+	 *
+	 * If set, a newly started workload can converge more quickly,
+	 * but it may incur more false positive migrations before
+	 * reaching convergence.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
 };
 
 extern unsigned long autonuma_flags;
@@ -108,6 +122,12 @@ static inline bool autonuma_scan_pmd(void)
 	return test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
 }
 
+static inline bool autonuma_migrate_allow_first_fault(void)
+{
+	return test_bit(AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
+			&autonuma_flags);
+}
+
 #else /* CONFIG_AUTONUMA */
 
 static inline bool autonuma_possible(void)
diff --git a/mm/autonuma.c b/mm/autonuma.c
index b5c5ff6..ec5b1d4 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -24,7 +24,8 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
 	|(1<<AUTONUMA_ENABLED_FLAG)
 #endif
-	|(1<<AUTONUMA_SCAN_PMD_FLAG);
+	|(1<<AUTONUMA_SCAN_PMD_FLAG)
+	|(1<<AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
 
@@ -310,7 +311,8 @@ static inline bool last_nid_set(struct page *page, int this_nid)
 	VM_BUG_ON(this_nid < 0);
 	VM_BUG_ON(this_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid != this_nid) {
-		if (autonuma_last_nid >= 0)
+		if (!autonuma_migrate_allow_first_fault() ||
+		    autonuma_last_nid >= 0)
 			ret = false;
 		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = this_nid;
 	}
@@ -1048,6 +1050,8 @@ SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
 #ifdef CONFIG_DEBUG_VM
 SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
 SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
+SYSFS_ENTRY(migrate_allow_first_fault,
+	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 #endif /* CONFIG_DEBUG_VM */
 
 #undef SYSFS_ENTRY
@@ -1130,6 +1134,9 @@ static struct attribute *autonuma_attr[] = {
 	&migrate_sleep_millisecs_attr.attr,
 	&pages_to_migrate_attr.attr,
 	&pages_migrated_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&migrate_allow_first_fault_attr.attr,
+#endif
 	/* migrate end */
 
 	/* scan start */


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 33/33] autonuma: add mm_autonuma working set estimation
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (31 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 32/33] autonuma: add migrate_allow_first_fault knob in sysfs Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
  2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Working set estimation records only memory that was recently used,
which in turn is eligible for automatic migration. It ignores memory
that is never accessed by the process, which in turn will never be
considered for migration. This can speed up NUMA convergence if large
areas of memory are never used.
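
To illustrate, the scan-side filtering behaves roughly like the
userspace model below (a simplified sketch; the real checks are the
pte_numa()/pmd_numa() tests added to knuma_scand_pmd() in the diff,
and the names here are made up for the example):

#include <stdbool.h>

struct page_model {
	bool numa_hinting;	/* marker left behind by the previous scan pass */
	int nid;		/* node the page currently resides on */
};

/*
 * One knuma_scand pass: a page still carrying the NUMA hinting marker
 * was never faulted on since the last pass, i.e. it's outside the
 * recent working set, so it is not counted into the per-node stats.
 */
static void scan_pass(struct page_model *pages, int npages,
		      unsigned long *fault_tmp, bool mm_working_set)
{
	int i;

	for (i = 0; i < npages; i++) {
		if (mm_working_set && pages[i].numa_hinting)
			continue;	/* untouched since last pass: skip it */
		fault_tmp[pages[i].nid]++;	/* recently used: count it */
		pages[i].numa_hinting = true;	/* re-arm for the next pass */
	}
}

Combined with the mm_numa_fault_tmp_flush() change below, a completely
idle process keeps its previous statistics instead of decaying to an
empty working set.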

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   25 ++++++++++++++++++++++---
 mm/autonuma.c                  |   21 +++++++++++++++++++++
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 988a8b5..1c5e625 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -60,9 +60,10 @@ enum autonuma_flag {
 	 * faults at the pmd level instead of the pte level. This
 	 * reduces the number of NUMA hinting faults potentially
 	 * saving CPU time. It reduces the accuracy of the
-	 * task_autonuma statistics (but does not change the accuracy
-	 * of the mm_autonuma statistics). This flag can be toggled
-	 * through sysfs as runtime.
+	 * task_autonuma statistics (it doesn't change the accuracy of
+	 * the mm_autonuma statistics if the mm_working_set mode is
+	 * not set). This flag can be toggled through sysfs at
+	 * runtime.
 	 *
 	 * This flag does not affect AutoNUMA with transparent
 	 * hugepages (THP). With THP the NUMA hinting page faults
@@ -87,6 +88,18 @@ enum autonuma_flag {
 	 * Default set.
 	 */
 	AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
+	/*
+	 * If set, mm_autonuma will represent a working set estimation
+	 * of the memory used by the process over the last knuma_scand
+	 * pass.
+	 *
+	 * If not set, mm_autonuma will represent all (not shared)
+	 * memory eligible for automatic migration mapped by the
+	 * process.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_MM_WORKING_SET_FLAG,
 };
 
 extern unsigned long autonuma_flags;
@@ -128,6 +141,12 @@ static inline bool autonuma_migrate_allow_first_fault(void)
 			&autonuma_flags);
 }
 
+static inline bool autonuma_mm_working_set(void)
+{
+	return test_bit(AUTONUMA_MM_WORKING_SET_FLAG,
+			&autonuma_flags);
+}
+
 #else /* CONFIG_AUTONUMA */
 
 static inline bool autonuma_possible(void)
diff --git a/mm/autonuma.c b/mm/autonuma.c
index ec5b1d4..f1e699f 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -25,6 +25,7 @@ unsigned long autonuma_flags __read_mostly =
 	|(1<<AUTONUMA_ENABLED_FLAG)
 #endif
 	|(1<<AUTONUMA_SCAN_PMD_FLAG)
+	|(1<<AUTONUMA_MM_WORKING_SET_FLAG)
 	|(1<<AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
@@ -592,6 +593,11 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 
 		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
+		if (autonuma_mm_working_set() && pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
 		page = pmd_page(*pmd);
 
 		/* only check non-shared pages */
@@ -627,6 +633,8 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		unsigned long *fault_tmp;
 		if (!pte_present(pteval))
 			continue;
+		if (autonuma_mm_working_set() && pte_numa(pteval))
+			continue;
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			continue;
@@ -670,6 +678,17 @@ static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
 	unsigned long tot;
 	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
 
+	if (autonuma_mm_working_set()) {
+		for_each_node(nid) {
+			tot = fault_tmp[nid];
+			if (tot)
+				break;
+		}
+		if (!tot)
+			/* process was idle, keep the old data */
+			return;
+	}
+
 	/* FIXME: would be better protected with write_seqlock_bh() */
 	local_bh_disable();
 
@@ -1052,6 +1071,7 @@ SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
 SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
 SYSFS_ENTRY(migrate_allow_first_fault,
 	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
+SYSFS_ENTRY(mm_working_set, AUTONUMA_MM_WORKING_SET_FLAG);
 #endif /* CONFIG_DEBUG_VM */
 
 #undef SYSFS_ENTRY
@@ -1151,6 +1171,7 @@ static struct attribute *autonuma_attr[] = {
 #ifdef CONFIG_DEBUG_VM
 	&sched_load_balance_strict_attr.attr,
 	&child_inheritance_attr.attr,
+	&mm_working_set_attr.attr,
 #endif
 
 	NULL,


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
@ 2012-10-04 14:16   ` Christoph Lameter
  2012-10-04 20:09   ` KOSAKI Motohiro
  1 sibling, 0 replies; 148+ messages in thread
From: Christoph Lameter @ 2012-10-04 14:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 4 Oct 2012, Andrea Arcangeli wrote:

> Move the autonuma_last_nid from the "struct page" to a separate
> page_autonuma data structure allocated in the memsection (with
> sparsemem) or in the pgdat (with flatmem).

Note that there is an available word in struct page before the autonuma
patches on x86_64 with CONFIG_HAVE_ALIGNED_STRUCT_PAGE.

In fact page_autonuma fills up the structure so that it fits nicely in
one 64-byte cacheline.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (32 preceding siblings ...)
  2012-10-03 23:51 ` [PATCH 33/33] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
@ 2012-10-04 18:39 ` Andrew Morton
  2012-10-04 20:49   ` Rik van Riel
                     ` (2 more replies)
  2012-10-11 10:19 ` Mel Gorman
                   ` (3 subsequent siblings)
  37 siblings, 3 replies; 148+ messages in thread
From: Andrew Morton @ 2012-10-04 18:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Mel Gorman, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu,  4 Oct 2012 01:50:42 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> This is a new AutoNUMA27 release for Linux v3.6.

Peter's numa/sched patches have been in -next for a week.  Guys, what's the
plan here?


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA
  2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
@ 2012-10-04 20:03   ` KOSAKI Motohiro
  2012-10-11 18:32   ` Mel Gorman
  1 sibling, 0 replies; 148+ messages in thread
From: KOSAKI Motohiro @ 2012-10-04 20:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt, kosaki.motohiro

(10/3/12 7:51 PM), Andrea Arcangeli wrote:
> If a task_selected_nid has already been selected for the task, try to
> allocate memory from it even if it's temporarily not the local
> node. Chances are it's where most of its memory is already located and
> where it will run in the future.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

This mempolicy part looks ok to me.
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
  2012-10-04 14:16   ` Christoph Lameter
@ 2012-10-04 20:09   ` KOSAKI Motohiro
  2012-10-05 11:31     ` Andrea Arcangeli
  1 sibling, 1 reply; 148+ messages in thread
From: KOSAKI Motohiro @ 2012-10-04 20:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt, kosaki.motohiro

> +struct page_autonuma *lookup_page_autonuma(struct page *page)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	unsigned long offset;
> +	struct page_autonuma *base;
> +
> +	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
> +#ifdef CONFIG_DEBUG_VM
> +	/*
> +	 * The sanity checks the page allocator does upon freeing a
> +	 * page can reach here before the page_autonuma arrays are
> +	 * allocated when feeding a range of pages to the allocator
> +	 * for the first time during bootup or memory hotplug.
> +	 */
> +	if (unlikely(!base))
> +		return NULL;
> +#endif

When using CONFIG_DEBUG_VM, please just use BUG_ON instead of an additional
sanity check. Otherwise only MM people would be able to find a real bug.


And I have an additional question here. What happens if memory hotplug occurs
and several autonuma_last_nid fields point to an invalid node id? My quick
skimming didn't find the hotplug callback code.



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
@ 2012-10-04 20:49   ` Rik van Riel
  2012-10-05 23:08   ` Rik van Riel
  2012-10-05 23:14     ` Andi Kleen
  2 siblings, 0 replies; 148+ messages in thread
From: Rik van Riel @ 2012-10-04 20:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/04/2012 02:39 PM, Andrew Morton wrote:
> On Thu,  4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week.  Guys, what's the
> plan here?

Both AutoNUMA and sched/numa have been extremely useful
development trees, allowing us to learn a lot about what
functionality we do (and do not) require to get NUMA
placement and scheduling to work correctly.

The AutoNUMA code base seems to work right, but may be
complex for some people. It could be simplified to
people's tastes and merged.

The sched/numa code base is not quite ready, but is
rapidly getting there. The way things are going now,
I would give it another week or two?

A few inefficiencies in the sched/numa migration code
were fixed earlier today, and cpu-follows-memory code
is being added as we speak. Both code bases should be
functionally similar real soon now.

That leaves the choice up to you folks :)


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-03 23:51 ` [PATCH 18/33] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
@ 2012-10-05  6:41   ` Mike Galbraith
  2012-10-05 11:54     ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mike Galbraith @ 2012-10-05  6:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 2012-10-04 at 01:51 +0200, Andrea Arcangeli wrote: 
> The CFS scheduler is still in charge of all scheduling decisions. At
> times, however, AutoNUMA balancing will override them.
> 
> Generally, we'll just rely on the CFS scheduler to keep doing its
> thing, while preferring the task's AutoNUMA affine node when deciding
> to move a task to a different runqueue or when waking it up.

Why does AutoNuma fiddle with wakeup decisions _within_ a node?

pgbench intensely disliked me recently depriving it of migration routes
in select_idle_sibling(), so AutoNuma saying NAK seems unlikely to make
it or ilk any happier.

> For example, idle balancing, while looking into the runqueues of busy
> CPUs, will first look for a task that "wants" to run on the NUMA node
> of this idle CPU (one where task_autonuma_cpu() returns true).
> 
> Most of this is encoded in can_migrate_task becoming AutoNUMA aware
> and running two passes for each balancing pass, the first NUMA aware,
> and the second one relaxed.
> 
> Idle or newidle balancing is always allowed to fall back to scheduling
> non-affine AutoNUMA tasks (ones with task_selected_nid set to another
> node). Load_balancing, which affects fairness more than performance,
> is only able to schedule against AutoNUMA affinity if the flag
> /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set.
> 
> Tasks that haven't been fully profiled yet, are not affected by this
> because their p->task_autonuma->task_selected_nid is still set to the
> original value of -1 and task_autonuma_cpu will always return true in
> that case.

Hm.  How does this profiling work for 1:N loads?  Once you need two or
more nodes, there is no best node for the 1, so restricting it can only
do harm.  For pgbench and ilk, loads of cross node traffic should mean
the 1 is succeeding at keeping the N busy.

-Mike


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-04 20:09   ` KOSAKI Motohiro
@ 2012-10-05 11:31     ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-05 11:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi KOSAKI,

On Thu, Oct 04, 2012 at 04:09:40PM -0400, KOSAKI Motohiro wrote:
> > +struct page_autonuma *lookup_page_autonuma(struct page *page)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	unsigned long offset;
> > +	struct page_autonuma *base;
> > +
> > +	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
> > +#ifdef CONFIG_DEBUG_VM
> > +	/*
> > +	 * The sanity checks the page allocator does upon freeing a
> > +	 * page can reach here before the page_autonuma arrays are
> > +	 * allocated when feeding a range of pages to the allocator
> > +	 * for the first time during bootup or memory hotplug.
> > +	 */
> > +	if (unlikely(!base))
> > +		return NULL;
> > +#endif
> 
> When using CONFIG_DEBUG_VM, please just use BUG_ON instead of additional
> sanity check. Otherwise only MM people might fault to find a real bug.

Agreed. But I just tried to stick to the page_cgroup.c model. I
suggest you send a patch to fix it in mm/page_cgroup.c, then I'll
synchronize mm/page_autonuma.c with whatever lands in page_cgroup.c.

The idea is that in the future it'd be nice to unify those with a
common implementation. And the closer page_cgroup.c and
page_autonuma.c are, the less work it'll be to update them to use a
common framework. And if it's never going to be worth it to unify it
(if it generates more code than it saves), well then keeping the code
as similar as possible, is still beneficial so it's easier to review both.

> And I have additional question here. What's happen if memory hotplug occur
> and several autonuma_last_nid will point to invalid node id? My quick skimming
> didn't find hotplug callback code.

last_nid is statistical info, so if it's random it's ok (I didn't add
bugchecks to trap uninitialized accesses to it, maybe I should?).

sparse_init_one_section also initializes it, and that's invoked by
sparse_add_one_section.

Those fields are also initialized when the page is freed for the
first time into the buddy allocator, but I didn't want to depend on
that; I thought an explicit init post-allocation would be more robust.

By reviewing it the only thing I found is that I was wasting a bit of
.text for 32bit builds (CONFIG_SPARSEMEM=n).

diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index d400d7f..303b427 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -14,7 +14,7 @@ void __meminit page_autonuma_map_init(struct page *page,
 		page_autonuma->autonuma_last_nid = -1;
 }
 
-static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+static void __paginginit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	spin_lock_init(&pgdat->autonuma_migrate_lock);
 	pgdat->autonuma_migrate_nr_pages = 0;
@@ -29,7 +29,7 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 
 static unsigned long total_usage;
 
-void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+void __paginginit pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	__pgdat_autonuma_init(pgdat);
 	pgdat->node_page_autonuma = NULL;
@@ -131,7 +131,7 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	return section->section_page_autonuma + pfn;
 }
 
-void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+void __paginginit pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	__pgdat_autonuma_init(pgdat);
 }


So those can be freed in a non-sparsemem build. The caller is
__paginginit too, so it should be ok.


The other page_autonuma.c functions invoked only by the sparsemem
hotplug code use __meminit, so in theory it should work (I haven't
tested it yet).

Thanks for the review!
Andrea


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-05  6:41   ` Mike Galbraith
@ 2012-10-05 11:54     ` Andrea Arcangeli
  2012-10-06  2:39       ` Mike Galbraith
  0 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-05 11:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Oct 05, 2012 at 08:41:25AM +0200, Mike Galbraith wrote:
> On Thu, 2012-10-04 at 01:51 +0200, Andrea Arcangeli wrote: 
> > The CFS scheduler is still in charge of all scheduling decisions. At
> > times, however, AutoNUMA balancing will override them.
> > 
> > Generally, we'll just rely on the CFS scheduler to keep doing its
> > thing, while preferring the task's AutoNUMA affine node when deciding
> > to move a task to a different runqueue or when waking it up.
> 
> Why does AutoNuma fiddle with wakeup decisions _within_ a node?
> 
> pgbench intensely disliked me recently depriving it of migration routes
> in select_idle_sibling(), so AutoNuma saying NAK seems unlikely to make
> it or ilk any happier.

Preferring doesn't mean NAK. It means "search affine first" if there's
not, go the usual route like if autonuma was not there.

Here is the code change to select_idle_sibling() for reference. You
can see it still falls back to the first idle target, but it keeps
going and stops as soon as a NUMA-affine idle target is found
according to task_autonuma_cpu().

Notably load and idle balancing decisions are never overridden or
NAKed: only a "preference" is added.

@@ -2658,6 +2662,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2671,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+					 tsk_cpus_allowed(p)) {
+				/* Find autonuma cpu only in idle group */
+				if (task_autonuma_cpu(p, i)) {
+					target = i;
+					goto done;
+				}
+				if (!idle_target) {
+					idle_target = true;
+					target = i;
+				}
+			}
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);

In short, there's no risk of regressions like the one that existed
until 3.6-rc6 (I had reverted that patch before it was reverted
upstream in 3.6-rc6).

> > For example, idle balancing, while looking into the runqueues of busy
> > CPUs, will first look for a task that "wants" to run on the NUMA node
> > of this idle CPU (one where task_autonuma_cpu() returns true).
> > 
> > Most of this is encoded in can_migrate_task becoming AutoNUMA aware
> > and running two passes for each balancing pass, the first NUMA aware,
> > and the second one relaxed.
> > 
> > Idle or newidle balancing is always allowed to fall back to scheduling
> > non-affine AutoNUMA tasks (ones with task_selected_nid set to another
> > node). Load_balancing, which affects fairness more than performance,
> > is only able to schedule against AutoNUMA affinity if the flag
> > /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set.
> > 
> > Tasks that haven't been fully profiled yet, are not affected by this
> > because their p->task_autonuma->task_selected_nid is still set to the
> > original value of -1 and task_autonuma_cpu will always return true in
> > that case.
> 
> Hm.  How does this profiling work for 1:N loads?  Once you need two or
> more nodes, there is no best node for the 1, so restricting it can only
> do harm.  For pgbench and ilk, loads of cross node traffic should mean
> the 1 is succeeding at keeping the N busy.

That resembles numa01 on the 8 node system. There are N threads
thrashing over all the memory of 4 nodes, and another N threads
thrashing over the memory of another 4 nodes. It still works massively
better than no autonuma.

If there are multiple threads, their affinity will vary slightly and
the task_selected_nid will distribute across nodes (and if it doesn't
distribute, idle load balancing will still work exactly as upstream).

If there's just one thread, so really 1:N, it doesn't matter on which
CPU of the 4 nodes we put it if its memory split is 25/25/25/25.

In short, in those 1:N scenarios it's usually better to just stick to
the last node the task ran on, and with AutoNUMA it does. This is why
it's better to have 1 task_selected_nid instead of 4. The node may
also have a level 3 cache, and sticking to the node preserves that
too.

See the update of task_selected_nid when no task exchange is done
(even when there are no statistics available yet), and also note why
I only updated the line below:

-       if (target == cpu && idle_cpu(cpu))
+       if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))

and not:

	if (target == prev_cpu && idle_cpu(prev_cpu))
		return prev_cpu;

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
  2012-10-04 20:49   ` Rik van Riel
@ 2012-10-05 23:08   ` Rik van Riel
  2012-10-05 23:14     ` Andi Kleen
  2 siblings, 0 replies; 148+ messages in thread
From: Rik van Riel @ 2012-10-05 23:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/04/2012 02:39 PM, Andrew Morton wrote:
> On Thu,  4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week.

It may be worth pointing out that several of those patches have
quietly slipped into -next without any prior review on lkml.

That is not the way things should go, IMHO.

> Guys, what's the plan here?

My previous email outlined some of the situation and what I
have been doing, but does not actually have a plan.

Me helping improve both code bases does not seem to have
gotten either of the two closer to merging...

I guess "prod Andrew, Hugh, Mel, and others to test and review
both NUMA code bases" might be a plan?

Does anybody have any ideas?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
@ 2012-10-05 23:14     ` Andi Kleen
  2012-10-05 23:08   ` Rik van Riel
  2012-10-05 23:14     ` Andi Kleen
  2 siblings, 0 replies; 148+ messages in thread
From: Andi Kleen @ 2012-10-05 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu,  4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week. 

Did they pass review? I have some doubts.

The last time I looked it also broke numactl.

> Guys, what's the plan here?

Since they are both performance features, their ultimate benefit
is how much faster they make things (and how seldom they make things
slower).

IMHO this needs a performance shoot-out. Run both on the same 10
workloads and see who wins. Just a lot of work. Any volunteers?

For a change like this I think less regression is actually more
important than the highest peak numbers.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:14     ` Andi Kleen
@ 2012-10-05 23:57       ` Tim Chen
  -1 siblings, 0 replies; 148+ messages in thread
From: Tim Chen @ 2012-10-05 23:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Rik van Riel, Johannes Weiner, Hillf Danton,
	Andrew Jones, Dan Smith, Thomas Gleixner, Paul Turner,
	Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira, Konrad

On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Thu,  4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week. 
> 
> Did they pass review? I have some doubts.
> 
> The last time I looked it also broke numactl.
> 
> > Guys, what's the plan here?
> 
> Since they are both performance features their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower)
> 
> IMHO needs a performance shot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of of work. Any volunteers?
> 
> For a change like this I think less regression is actually more
> important than the highest peak numbers.
> 
> -Andi
> 

I remember that 3 months ago, when Alex tested the numa/sched patches,
there was a 20% regression on SPECjbb2005 due to the numa balancer.
Those issues may have been fixed, but we probably need to run this
benchmark against the latest code.  For most of the other kernel
performance workloads we ran, we didn't see many changes.

Mauricio has a different config for this benchmark, and it would be
nice if he could also check whether there are any performance changes
on his side.

Tim



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:57       ` Tim Chen
@ 2012-10-06  0:11         ` Andi Kleen
  -1 siblings, 0 replies; 148+ messages in thread
From: Andi Kleen @ 2012-10-06  0:11 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Rik van Riel, Johannes Weiner, Hillf Danton,
	Andrew Jones, Dan Smith, Thomas Gleixner, Paul Turner,
	Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex, Sh

Tim Chen <tim.c.chen@linux.intel.com> writes:
>> 
>
> I remembered that 3 months ago when Alex tested the numa/sched patches
> there were 20% regression on SpecJbb2005 due to the numa balancer.

20% on anything sounds like a show stopper to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-05 11:54     ` Andrea Arcangeli
@ 2012-10-06  2:39       ` Mike Galbraith
  2012-10-06 12:34         ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mike Galbraith @ 2012-10-06  2:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-10-05 at 13:54 +0200, Andrea Arcangeli wrote: 
> On Fri, Oct 05, 2012 at 08:41:25AM +0200, Mike Galbraith wrote:
> > On Thu, 2012-10-04 at 01:51 +0200, Andrea Arcangeli wrote: 
> > > The CFS scheduler is still in charge of all scheduling decisions. At
> > > times, however, AutoNUMA balancing will override them.
> > > 
> > > Generally, we'll just rely on the CFS scheduler to keep doing its
> > > thing, while preferring the task's AutoNUMA affine node when deciding
> > > to move a task to a different runqueue or when waking it up.
> > 
> > Why does AutoNuma fiddle with wakeup decisions _within_ a node?
> > 
> > pgbench intensely disliked me recently depriving it of migration routes
> > in select_idle_sibling(), so AutoNuma saying NAK seems unlikely to make
> > it or ilk any happier.
> 
> Preferring doesn't mean NAK. It means "search affine first" if there's
> not, go the usual route like if autonuma was not there.

I'll rephrase.  We're searching for a processor.  What does that have
to do with NUMA?  I saw you turning want_affine off (and wonder what
that's gonna do for fluctuating vs more or less static loads), and I
get that.

> In short there's no risk of regressions like it happened until 3.6-rc6
> (I reverted that patch before it was reverted in 3.6-rc6).

(Shrug, +1000% vs -20%.  Relevant is the NUMA vs package bit, and node
stickiness vs 1:N bit)

> > Hm.  How does this profiling work for 1:N loads?  Once you need two or
> > more nodes, there is no best node for the 1, so restricting it can only
> > do harm.  For pgbench and ilk, loads of cross node traffic should mean
> > the 1 is succeeding at keeping the N busy.
> 
> That resembles numa01 on the 8 node system. There are N threads
> trashing over all the memory of 4 nodes, and another N threads
> trashing over the memory of another 4 nodes. It still work massively
> better than no autonuma.

I measured the 1 in 1:N pgbench very much preferring mobility.  The N,
dunno, but I don't imagine a large benefit for making them sticky
either.  Hohum, numbers will tell the tale.

> If there are multiple threads their affinity will vary slightly and the
> task_selected_nid will distribute (and if it doesn't distribute the
> idle load balancing will still work perfectly as upstream).
> 
> If there's just one thread, so really 1:N, it doesn't matter in which
> CPU of the 4 nodes we put it if the memory split is 25/25/25/25.

It should matter when load is not static.  Just as select_idle_sibling()
is not a great idea once you're ramped up, retained stickiness should
hurt dynamic responsiveness.  But never mind, that's just me pondering
the up/down sides of stickiness.

> In short in those 1:N scenarios, it's usually better to just stick to
> the last node it ran on, and it does with AutoNUMA. This is why it's
> better to have 1 task_selected_nid instead of 4. There may be level 3
> caches for the node too and that will preserve them too.

My point was that there is no correct node to prefer, so wondered if
AutoNuma could possibly recognize that, and not do what can only be the
wrong thing.  It needs to only tag things it is really sure about.

-Mike

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-06  2:39       ` Mike Galbraith
@ 2012-10-06 12:34         ` Andrea Arcangeli
  2012-10-07  6:07           ` Mike Galbraith
  0 siblings, 1 reply; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-06 12:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Mike,

On Sat, Oct 06, 2012 at 04:39:54AM +0200, Mike Galbraith wrote:
> On Fri, 2012-10-05 at 13:54 +0200, Andrea Arcangeli wrote: 
> > On Fri, Oct 05, 2012 at 08:41:25AM +0200, Mike Galbraith wrote:
> > > On Thu, 2012-10-04 at 01:51 +0200, Andrea Arcangeli wrote: 
> > > > The CFS scheduler is still in charge of all scheduling decisions. At
> > > > times, however, AutoNUMA balancing will override them.
> > > > 
> > > > Generally, we'll just rely on the CFS scheduler to keep doing its
> > > > thing, while preferring the task's AutoNUMA affine node when deciding
> > > > to move a task to a different runqueue or when waking it up.
> > > 
> > > Why does AutoNuma fiddle with wakeup decisions _within_ a node?
> > > 
> > > pgbench intensely disliked me recently depriving it of migration routes
> > > in select_idle_sibling(), so AutoNuma saying NAK seems unlikely to make
> > > it or ilk any happier.
> > 
> > Preferring doesn't mean NAK. It means "search affine first"; if there's
> > no affine CPU, go the usual route as if autonuma was not there.
> 
> I'll rephrase.  We're searching for a processor.  What does that have to do
> with NUMA?  I saw you turning want_affine off (and wonder what that's
> gonna do to fluctuating vs more or less static loads), and I get that.

I think you just found a mistake.

So disabling wake_affine when the wakeup CPU was on a remote NODE (it
was only turned off in that case) meant affine_sd couldn't be set, and
for certain wakeups select_idle_sibling wouldn't run (rendering
pointless some of my logic in select_idle_sibling).

So I'm reversing this hunk:

@@ -2708,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                return prev_cpu;
 
        if (sd_flag & SD_BALANCE_WAKE) {
-               if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+               if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+                   task_autonuma_cpu(p, cpu))
                        want_affine = 1;
                new_cpu = prev_cpu;
        }


Another optimization I noticed is that I should record idle_target =
true if target == cpu && idle_cpu(cpu) but task_autonuma_cpu fails, so
that we still pick the target if it was idle and there's no idle CPU
in the affine node.
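
Something along these lines is what I have in mind (just a sketch of
the idea, not the actual patch: find_idle_target() is a made-up helper
standing in for the real select_idle_sibling() integration):

static int find_idle_target(struct task_struct *p, int target,
			    const struct cpumask *candidates)
{
	int cpu, fallback = -1;

	for_each_cpu(cpu, candidates) {
		if (!idle_cpu(cpu))
			continue;
		if (task_autonuma_cpu(p, cpu))
			return cpu;	/* idle and NUMA affine: best case */
		if (fallback < 0)
			fallback = cpu;	/* idle but not affine: remember it */
	}

	/* no idle CPU in the affine node: an idle CPU still beats none */
	return fallback >= 0 ? fallback : target;
}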

> I measured the 1 in 1:N pgbench very much preferring mobility.  The N,
> dunno, but I don't imagine a large benefit for making them sticky
> either.  Hohum, numbers will tell the tale.

Mobility on non-NUMA is an entirely different matter than mobility
across NUMA nodes. Keep in mind there are tons of CPUs intra-node too
so the mobility intra node may be enough.  But I don't know exactly
what the mobility requirements of pgbench are so I can't tell for sure
and I fully agree we should collect numbers.

The availability of NUMA systems increased a lot lately so hopefully
more people will be able to test it and provide feedback.

Overall, getting the intra-node convergence wrong is more concerning
than not being optimal in the 1:N load. Getting the former wrong means
we risk delaying convergence (and having to fix it up later with
autonuma balancing events). The latter is just about maxing out all
memory channels and all HT idle cores, in a MADV_INTERLEAVE-like
behavior, and mitigating the spurious page migrations (which will
still happen seldom, and we need them to keep happening slowly to
avoid ending up using a single memory channel). But the latter is a
less deterministic case: it's harder to be faster than upstream unless
upstream does all allocations in one thread first and then starts the
other threads computing on the memory later. The 1:N case has no
perfect solution anyway, unless we just detect it and hammer it with
MADV_INTERLEAVE. But I tried to avoid hard classifications and radical
changes in behavior, and I try to do something that always works no
matter the load we throw at it. So I'm usually more concerned about
optimizing for the former case, which has a perfect solution possible.

> > If there are multiple threads their affinity will vary slightly and the
> > task_selected_nid will distribute (and if it doesn't distribute the
> > idle load balancing will still work perfectly as upstream).
> > 
> > If there's just one thread, so really 1:N, it doesn't matter in which
> > CPU of the 4 nodes we put it if the memory split is 25/25/25/25.
> 
> It should matter when load is not static.  Just as select_idle_sibling()
> is not a great idea once you're ramped up, retained stickiness should
> hurt dynamic responsiveness.  But never mind, that's just me pondering
> the up/down sides of stickiness.

Actually I'm going to test removing the above hunk.

> > In short in those 1:N scenarios, it's usually better to just stick to
> > the last node it ran on, and it does with AutoNUMA. This is why it's
> > better to have 1 task_selected_nid instead of 4. There may be level 3
> > caches for the node too and that will preserve them too.
> 
> My point was that there is no correct node to prefer, so wondered if
> AutoNuma could possibly recognize that, and not do what can only be the
> wrong thing.  It needs to only tag things it is really sure about.

You know sched/fair.c so much better than me, so you decide. AutoNUMA
is just an ideal hacking base that converges and works well, and we
can build on that. It's very easy to modify and experiment
with. All contributions are welcome ;).

I'm adding new ideas to it as I write this in some experimental branch
(just reached new records of convergence vs autonuma27, by accounting
in real time for the page migrations in mm_autonuma without having to
boost the numa hinting page fault rate).

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-06 12:34         ` Andrea Arcangeli
@ 2012-10-07  6:07           ` Mike Galbraith
  2012-10-08  7:03             ` Mike Galbraith
  0 siblings, 1 reply; 148+ messages in thread
From: Mike Galbraith @ 2012-10-07  6:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, 2012-10-06 at 14:34 +0200, Andrea Arcangeli wrote: 
> Hi Mike,

Greetings,

> On Sat, Oct 06, 2012 at 04:39:54AM +0200, Mike Galbraith wrote:

> I think you just found a mistake.
> 
> So disabling wake_affine when the wakeup CPU was on a remote NODE (it
> was only turned off in that case) meant affine_sd couldn't be set, and
> for certain wakeups select_idle_sibling wouldn't run (rendering
> pointless some of my logic in select_idle_sibling).

Well, it still looks a bit bent to me no matter how I tilt my head.

                /*
                 * If both cpu and prev_cpu are part of this domain,
                 * cpu is a valid SD_WAKE_AFFINE target.
                 */
                if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
                    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
                        affine_sd = tmp;
                        want_affine = 0;
                }

Disabling when waker/wakee are cross node makes sense to me as a cycle
saver.  If you have (SMT), MC and NODE domains, waker/wakee are cross
node, spans don't intersect, affine_sd remains NULL, the whole traverse
becomes a waste of cycles.  If WAKE_BALANCE is enabled, we'll do that
instead (which pgbench and ilk should like methinks).

> > I measured the 1 in 1:N pgbench very much preferring mobility.  The N,
> > dunno, but I don't imagine a large benefit for making them sticky
> > either.  Hohum, numbers will tell the tale.
> 
> Mobility on non-NUMA is an entirely different matter than mobility
> across NUMA nodes. Keep in mind there are tons of CPUs intra-node too
> so the mobility intra node may be enough.  But I don't know exactly
> what the mobility requirements of pgbench are so I can't tell for sure
> and I fully agree we should collect numbers.

Yeah, that 1:1 vs 1:N, load meets anti-load thing is kinda interesting.
Tune for one, you may well annihilate the other.  Numbers required.

I think we need to detect and react accordingly.  If that nasty little 1
bugger is doing a lot of work, it's very special, so I don't think you
making him sticky can help any more than me taking away wakeup options...
both remove latency-reducing options from a latency-dominated load.

But numbers talk, pondering (may = BS) walks, so I'm outta here :)

-Mike

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
  2012-10-07  6:07           ` Mike Galbraith
@ 2012-10-08  7:03             ` Mike Galbraith
  0 siblings, 0 replies; 148+ messages in thread
From: Mike Galbraith @ 2012-10-08  7:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sun, 2012-10-07 at 08:07 +0200, Mike Galbraith wrote:

> If you have (SMT), MC and NODE domains, waker/wakee are cross
> node, spans don't intersect, affine_sd remains NULL, the whole traverse
> becomes a waste of cycles.

Zzzt, horse-pookey.  NODE spans all.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-06  0:11         ` Andi Kleen
@ 2012-10-08 13:44           ` Don Morris
  -1 siblings, 0 replies; 148+ messages in thread
From: Don Morris @ 2012-10-08 13:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tim Chen, Andrew Morton, Andrea Arcangeli, linux-kernel,
	linux-mm, Linus Torvalds, Peter Zijlstra, Ingo Molnar,
	Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
	Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
	Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Alex, Sh

On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>>>
>>
>> I remembered that 3 months ago when Alex tested the numa/sched patches
>> there was a 20% regression on SpecJbb2005 due to the numa balancer.
> 
> 20% on anything sounds like a show stopper to me.
> 
> -Andi
> 

Much worse than that on an 8-way machine for a multi-node multi-threaded
process, from what I can tell. (Andrea's AutoNUMA microbenchmark is a
simple version of that). The contention on the page table lock
( &(&mm->page_table_lock)->rlock ) goes through the roof, with threads
constantly fighting to invalidate translations and re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw.

Running linux-next with no tweaks other than
kernel.sched_migration_cost_ns = 500000 gives:
numa01
8325.78
numa01_HARD_BIND
488.98

(The Hard Bind being a case where the threads are pre-bound to the
node set with their memory, so what should be a fairly "best case" for
comparison).

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):
numa01
4272.93
numa01_HARD_BIND
498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:
numa01
429.07
numa01_HARD_BIND
466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:
numa01
1075.31
numa01_HARD_BIND
484.20

I didn't write down the lock contentions for comparison, but yes -
the contention does decrease similarly to the time decreases.

There are other microbenchmarks, but those suffice to show the
regression pattern. I mentioned this to the RedHat folks last
week, so I expect this is already being worked on. It seemed pertinent
to bring up given the discussion about the current state of linux-next
though, just so folks know. From where I'm sitting, it looks to
me like the scan period is way too aggressive and there's too much
work potentially attempted during a "scan" (by which I mean the
hard tick driven choice to invalidate in order to set up potential
migration faults). The current code walks/invalidates the entire
virtual address space, skipping few vmas. For a very large 64-bit
process, that's going to be a *lot* of translations (or even vmas
if the address space is fragmented) to walk. That's a seriously
long path coming from the timer code. I would think capping the
number of translations to process per visit would help.
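
To illustrate the capping (purely hypothetical code, not taken from
either patch set; NUMA_SCAN_CHUNK and mark_range_for_numa_faults() are
invented names), each tick-driven visit could resume where the last
one stopped and only mark a bounded slice of the address space:

/* cap on how much address space one scan visit may mark */
#define NUMA_SCAN_CHUNK		(256UL << 21)	/* 512MB per visit */

static unsigned long numa_scan_one_visit(struct mm_struct *mm,
					 unsigned long offset)
{
	unsigned long end = min(offset + NUMA_SCAN_CHUNK, TASK_SIZE);

	/* arm NUMA hinting faults only for [offset, end) */
	mark_range_for_numa_faults(mm, offset, end);

	/* the caller stores this and resumes here on the next visit */
	return end >= TASK_SIZE ? 0UL : end;
}

That would bound the work done from the timer path instead of walking
the whole virtual address space in one go.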

Hope this helps the discussion,
Don Morris


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-05 23:14     ` Andi Kleen
@ 2012-10-08 20:34       ` Rik van Riel
  -1 siblings, 0 replies; 148+ messages in thread
From: Rik van Riel @ 2012-10-08 20:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad, dshaks

On Fri, 05 Oct 2012 16:14:44 -0700
Andi Kleen <andi@firstfloor.org> wrote:

> IMHO needs a performance shoot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a
4-node, 32 CPU core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and
autonuma are within the margin of error compared to manual
tuning of task affinity.  This is a big win, since the current
upstream scheduler has regressions of 10-20% when the system
runs 4 through 16 streams processes.

For specjbb, the story is more complicated. After Larry, Peter and I
fixed the obvious bugs in sched/numa and got some basic
cpu-follows-memory code going (not yet in -tip AFAIK), the averaged
results look like this:

baseline: 	246019
manual pinning: 285481 (+16%)
autonuma:	266626 (+8%)
sched/numa:	226540 (-8%)

This is with newer sched/numa code than what is in -tip right now.
Once Peter pushes the fixes by Larry and me into -tip, as well as
his cpu-follows-memory code, others should be able to run tests
like this as well.

Now for some other workloads, and tests on 8 node systems, etc...


Full results for the specjbb run below:

BASELINE - disabling auto numa (matches RHEL6 within 1%)

[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt:           throughput =     243639.70 SPECjbb2005 bops
spec2.txt:           throughput =     249186.20 SPECjbb2005 bops
spec3.txt:           throughput =     247216.72 SPECjbb2005 bops
spec4.txt:           throughput =     244035.60 SPECjbb2005 bops

Manual NUMACTL results are:

[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt:           throughput =     291430.22 SPECjbb2005 bops
spec2.txt:           throughput =     283550.85 SPECjbb2005 bops
spec3.txt:           throughput =     284028.71 SPECjbb2005 bops
spec4.txt:           throughput =     282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64
[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt:           throughput =     261835.01 SPECjbb2005 bops
spec2.txt:           throughput =     269053.06 SPECjbb2005 bops
spec3.txt:           throughput =     261230.50 SPECjbb2005 bops
spec3.txt:           throughput =     274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and 
Larry:

[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt:           throughput =     222349.74 SPECjbb2005 bops
spec2.txt:           throughput =     232988.59 SPECjbb2005 bops
spec3.txt:           throughput =     223386.03 SPECjbb2005 bops
spec4.txt:           throughput =     227438.11 SPECjbb2005 bops

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
@ 2012-10-10 22:01   ` Rik van Riel
  2012-10-10 22:36     ` Andrea Arcangeli
  2012-10-11 18:28   ` Mel Gorman
  2012-10-13 18:06   ` Srikar Dronamraju
  2 siblings, 1 reply; 148+ messages in thread
From: Rik van Riel @ 2012-10-10 22:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/03/2012 07:51 PM, Andrea Arcangeli wrote:

> +/*
> + * In this function we build a temporal CPU_node<->page relation by
> + * using a two-stage autonuma_last_nid filter to remove short/unlikely
> + * relations.
> + *
> + * Using P(p) ~ n_p / n_t as per frequentist probability, we can
> + * equate a node's CPU usage of a particular page (n_p) per total
> + * usage of this page (n_t) (in a given time-span) to a probability.
> + *
> + * Our periodic faults will then sample this probability and getting
> + * the same result twice in a row, given these samples are fully
> + * independent, is then given by P(n)^2, provided our sample period
> + * is sufficiently short compared to the usage pattern.
> + *
> + * This quadric squishes small probabilities, making it less likely
> + * we act on an unlikely CPU_node<->page relation.
> + */
> +static inline bool last_nid_set(struct page *page, int this_nid)
> +{

Could be nice to rename this function to should_migrate_page()...

> +	bool ret = true;
> +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	VM_BUG_ON(this_nid < 0);
> +	VM_BUG_ON(this_nid >= MAX_NUMNODES);
> +	if (autonuma_last_nid != this_nid) {
> +		if (autonuma_last_nid >= 0)
> +			ret = false;
> +		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
> +	}
> +	return ret;
> +}

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-10 22:01   ` Rik van Riel
@ 2012-10-10 22:36     ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-10 22:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Wed, Oct 10, 2012 at 06:01:38PM -0400, Rik van Riel wrote:
> > +static inline bool last_nid_set(struct page *page, int this_nid)
> > +{
> 
> Could be nice to rename this function to should_migrate_page()...

Done, thanks!

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (33 preceding siblings ...)
  2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
@ 2012-10-11 10:19 ` Mel Gorman
  2012-10-11 14:56     ` Andrea Arcangeli
  2012-10-11 21:34 ` Mel Gorman
                   ` (2 subsequent siblings)
  37 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 10:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:42AM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is a new AutoNUMA27 release for Linux v3.6.
> 
> I believe that this autonuma version answers all of the review
> comments I got upstream. This patch set has undergone a huge series of
> changes that includes changing the page migration implementation to
> synchronous, reduction of memory overhead to minimum, internal
> documentation, external documentation and benchmarking. I'm grateful
> for all the reviews and contributions, that includes Rik, Karen, Avi,
> Peter, Konrad, Hillf and all others, plus all runtime feedback
> received (bugreports, KVM benchmarks, etc..).
> 
> The last 4 months were fully dedicated to answer the upstream review.
> 
> Linus, Andrew, please review, as the handful of performance results
> show we're in excellent shape for inclusion. Further changes such as
> transparent huge page native migration and more are expected but at
> this point I would ask you to accept the current series and further
> changes will be added in traditional gradual steps.
> 

As a basic sniff test I added a test to MMtests for the AutoNUMA
Benchmark on a 4-node machine and the following fell out.

                                     3.6.0                 3.6.0
                                   vanilla        autonuma-v33r6
User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)

The performance improvements are certainly there for this basic test but
I note the System CPU usage is very high.

The vmstats showed up this

THP fault alloc               81376       86070
THP collapse alloc               14       40423
THP splits                        8       41792

So we're doing a lot of splits and collapses for THP there. There is a
possibility that khugepaged and the autonuma kernel thread are doing some
busy work. Not a show-stopper, just interesting.

I've done no analysis at all and this was just to have something to look
at before looking at the code closer.

> The objective of AutoNUMA is to provide out-of-the-box performance as
> close as possible to (and potentially faster than) manual NUMA hard
> bindings.
> 
> It is not very intrusive into the kernel core and is well structured
> into separate source modules.
> 
> AutoNUMA was extensively tested against 3.x upstream kernels and other
> NUMA placement algorithms such as numad (in userland through cpusets)
> and schednuma (in kernel too) and was found superior in all cases.
> 
> Most important: not a single benchmark showed a regression yet when
> compared to vanilla kernels. Not even on the 2 node systems where the
> NUMA effects are less significant.
> 

Ok, I have not run a general regression test and won't get the chance to
soon but hopefully others will. One thing they might want to watch out
for is System CPU time. It's possible that your AutoNUMA benchmark
triggers a worst-case but it's worth keeping an eye on because any cost
from that has to be offset by gains from better NUMA placements.

> === Some benchmark result ===
> 
> <SNIP>

Looked good for the most part.

> == stream modified to run each instance for ~5min ==
> 

Is STREAM really a good benchmark in this case? Unless you also ran it in
parallel mode, it basically operates on three arrays and is not really
NUMA friendly once the total size is greater than a NUMA node. I guess
it makes sense to run it just to see whether autonuma breaks it :)

> 
> == iozone ==
> 
>                      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
> FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
> ====--------------------------------------------------------------------------------------------------------------
> noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
> autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825
> 
> AutoNUMA can't help much for I/O loads but you can see it seems a
> small improvement there too. The important thing for I/O loads, is to
> verify that there is no regression.
> 

It probably is unreasonable to expect autonuma to handle the case where
a file-based workload has not been tuned for NUMA. In too many cases
it's going to be read/write based so you're not going to get the
statistics you need.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
  2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
@ 2012-10-11 10:50   ` Mel Gorman
  2012-10-11 16:07       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 10:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> +The AutoNUMA logic is a chain reaction resulting from the actions of
> +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically

s/knum_scand/knuma_scand/

> +scans the mm structures of all active processes. It gathers the
> +AutoNUMA mm statistics for each "anon" page in the process's working

Ok, so this will not make a difference to file-based workloads but as I
mentioned in the leader this would be a difficult proposition anyway
because if it's read/write based, you'll have no statistics.

> +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> +present bit in each pte or pmd that was counted. This triggers NUMA
> +hinting page faults described next.
> +
> +The mm statistics are exponentially decayed by dividing the total memory
> +in half and adding the new totals to the decayed values for each
> +knuma_scand pass. This causes the mm statistics to resemble a simple
> +forecasting model, taking into account some past working set data.
> +
> +=== NUMA hinting fault ===
> +
> +A NUMA hinting fault occurs when a task running on a CPU thread
> +accesses a vma whose pte or pmd is not present and the NUMA bit is
> +set. The NUMA hinting page fault handler returns the pte or pmd back
> +to its present state and counts the fault's occurance in the
> +task_autonuma structure.
> +

So, minimally one source of System CPU overhead will be increased traps.
I haven't seen the code yet obviously but I wonder if this gets accounted
for as a minor fault? If it does, how can we distinguish between minor
faults and numa hinting faults? If not, is it possible to get any idea of
how many numa hinting faults were incurred? Mention it here.

> +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> +
> +- Increments the total number of pages faulted for this task
> +
> +- Increments the number of pages faulted on the current NUMA node
> +

So, am I correct in assuming that the rate of NUMA hinting faults will be
related to the scan rate of knuma_scand?

> +- If the fault was for an hugepage, the number of subpages represented
> +  by an hugepage is added to the task statistics above
> +
> +- Each time the NUMA hinting page fault discoveres that another

s/discoveres/discovers/

> +  knuma_scand pass has occurred, it divides the total number of pages
> +  and the pages for each NUMA node in half. This causes the task
> +  statistics to be exponentially decayed, just as the mm statistics
> +  are. Thus, the task statistics also resemble a simple forecasting
> +  model, taking into account some past NUMA hinting fault data.
> +
> +If the page being accessed is on the current NUMA node (same as the
> +task), the NUMA hinting fault handler only records the nid of the
> +current NUMA node in the page_autonuma structure field last_nid and
> +then it's done.
> +
> +Otherwise, it checks if the nid of the current NUMA node matches the
> +last_nid in the page_autonuma structure. If it matches it means it's
> +the second NUMA hinting fault for the page occurring (on a subsequent
> +pass of the knuma_scand daemon) from the current NUMA node.

You don't spell it out, but this is effectively a migration threshold N
where N is the number of remote NUMA hinting faults that must be
incurred before migration happens. The default value of this threshold
is 2.

Is that accurate? If so, why 2?

I don't have a better suggestion, it's just an obvious source of an
adverse workload that could force a lot of migrations by faulting once
per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.

> So if it
> +matches, the NUMA hinting fault handler migrates the contents of the
> +page to a new page on the current NUMA node.
> +
> +If the NUMA node accessing the page does not match last_nid, then
> +last_nid is reset to the current NUMA node (since it is considered the
> +first fault again).
> +
> +Note: You can clear a flag (AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT) which
> +causes the page to be migrated on the second NUMA hinting fault
> +instead of the very first one for a newly allocated page.
> +
> +=== Migrate-on-Fault (MoF) ===
> +
> +If the migrate-on-fault logic is active and the NUMA hinting fault
> +handler determines that the page should be migrated, a new page is
> +allocated on the current NUMA node and the data is copied from the
> +previous page on the remote node to the new page. The associated pte
> +or pmd is modified to reference the pfn of the new page, and the
> +previous page is freed to the LRU of its NUMA node. See routine
> +migrate_pages() in mm/migrate.c.
> +
> +If no page is available on the current NUMA node or I/O is in progress
> +on the page, it is not migrated and the task continues to reference
> +the remote page.
> +

I'm assuming it must be async migration then. IO in progress would be
a bit of a surprise though! It would have to be a mapped anonymous page
being written to swap.

> +=== sched_autonuma_balance - the AutoNUMA balance routine ===
> +
> +The AutoNUMA balance routine is responsible for deciding which NUMA
> +node is the best for running the current task and potentially which
> +task on the remote node it should be exchanged with. It uses the mm
> +statistics collected by the knuma_scand daemon and the task statistics
> +collected by the NUMA hinting fault to make this decision.
> +
> +The AutoNUMA balance routine is invoked as part of the scheduler load
> +balancing code. It exchanges the task on the current CPU's run queue
> +with a current task from a remote NUMA node if that exchange would
> +result in the tasks running with a smaller percentage of cross-node
> +memory accesses. Because the balance routine involves only running
> +tasks, it is only invoked when the scheduler is not idle
> +balancing. This means that the CFS scheduler is in control of
> +scheduling decisions and can move tasks to idle threads on any NUMA
> +node based on traditional or new policies.
> +
> +The following defines "memory weight" and "task weight" in the
> +AutoNUMA balance routine's algorithms.
> +
> +- memory weight = % of total memory from the NUMA node. Uses mm
> +                  statistics collected by the knuma_scand daemon.
> +
> +- task weight = % of total memory faulted on the NUMA node. Uses task
> +                statistics collected by the NUMA hinting fault.
> +
> +=== task_selected_nid - The AutoNUMA preferred NUMA node ===
> +
> +The AutoNUMA balance routine first determines which NUMA node the
> +current task has the most affinity to run on, based on the maximum
> +task weight and memory weight for each NUMA node. If both max values
> +are for the same NUMA node, that node's nid is stored in
> +task_selected_nid.
> +
> +If the selected nid is the current NUMA node, the AutoNUMA balance
> +routine is finished and does not proceed to compare tasks on other
> +NUMA nodes.
> +
> +If the selected nid is not the current NUMA node, a task exchange is
> +possible as described next. (Note that the task exchange algorithm
> +might update task_selected_nid to a different NUMA node)
> +

Ok, it's hard to predict how this will behave in advance but the description
is appreciated.

> +=== Task exchange ===
> +
> +The following defines "weight" in the AutoNUMA balance routine's
> +algorithm.
> +
> +If the tasks are threads of the same process:
> +
> +    weight = task weight for the NUMA node (since memory weights are
> +             the same)
> +
> +If the tasks are not threads of the same process:
> +
> +    weight = memory weight for the NUMA node (prefer to move the task
> +             to the memory)
> +
> +The following algorithm determines if the current task will be
> +exchanged with a running task on a remote NUMA node:
> +
> +    this_diff: Weight of the current task on the remote NUMA node
> +               minus its weight on the current NUMA node (only used if
> +               a positive value). How much does the current task
> +               prefer to run on the remote NUMA node.
> +
> +    other_diff: Weight of the current task on the remote NUMA node
> +                minus the weight of the other task on the same remote
> +                NUMA node (only used if a positive value). How much
> +                does the current task prefer to run on the remote NUMA
> +                node compared to the other task.
> +
> +    total_weight_diff = this_diff + other_diff
> +
> +    total_weight_diff: How favorable it is to exchange the two tasks.
> +                       The pair of tasks with the highest
> +                       total_weight_diff (if any) are selected for
> +                       exchange.
> +
> +As mentioned above, if the two tasks are threads of the same process,
> +the AutoNUMA balance routine uses the task_autonuma statistics. By
> +using the task_autonuma statistics, each thread follows its own memory
> +locality and they will not necessarily converge on the same node. This
> +is often very desirable for processes with more threads than CPUs on
> +each NUMA node.
> +

What about the case where two threads on different CPUs are accessing
separate structures that are not page-aligned (base or huge page but huge
page would be obviously worse). Does this cause a ping-pong effect or
otherwise mess up the statistics?
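
For example (hypothetical layout, just to make the case concrete):

/* both workers' private state likely ends up in the same 4k page, so
 * NUMA hinting faults from either thread hit the same last_nid */
struct worker_state {
	char buf[2048];		/* well under PAGE_SIZE */
};

static struct worker_state workers[2];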

> +If the two tasks are not threads of the same process, the AutoNUMA
> +balance routine uses the mm_autonuma statistics to calculate the
> +memory weights. This way all threads of the same process converge to
> +the same node, which is the one with the highest percentage of memory
> +for the process.
> +
> +If task_selected_nid, determined above, is not the NUMA node the
> +current task will be exchanged to, task_selected_nid for this task is
> +updated. This causes the AutoNUMA balance routine to favor overall
> +balance of the system over a single task's preference for a NUMA node.
> +
> +To exchange the two tasks, the AutoNUMA balance routine stops the CPU
> +that is running the remote task and exchanges the tasks on the two run
> +queues. Once each task has been moved to another node, closer to most
> +of the memory it is accessing, any memory for that task not in the new
> +NUMA node also moves to the NUMA node over time with the
> +migrate-on-fault logic.
> +

Ok, very obviously this will never be an RT feature but that is hardly
a surprise and anyone who tries to enable this for RT needs their head
examined. I'm not suggesting you do it but people running detailed
performance analysis on scheduler-intensive workloads might want to keep
an eye on their latency and jitter figures and how they are affected by
this exchanging. Does ftrace show a noticeable increase in wakeup latencies
for example?

> +=== Scheduler Load Balancing ===
> +
> +Load balancing, which affects fairness more than performance,
> +schedules based on AutoNUMA recommendations (task_selected_nid) unless
> +the flag AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG is set.
> +
> +The CFS load balancer uses the task's AutoNUMA task_selected_nid when
> +deciding to move a task to a different run-queue or when waking it
> +up. For example, idle balancing while looking into the run-queues of
> +busy CPUs, first looks for a task with task_selected_nid set to the
> +NUMA node of the idle CPU. Idle balancing falls back to scheduling
> +tasks without task_selected_nid set or with a different NUMA node set
> +in task_selected_nid. This allows a task to move to a different NUMA
> +node and its memory will follow it to the new NUMA node over time.
> +
> +== III: AutoNUMA Data Structures ==
> +
> +The following data structures are defined for AutoNUMA. All structures
> +are allocated only if AutoNUMA is active (as defined in the
> +introduction).
> +
> +=== mm_autonuma - per process mm AutoNUMA data ===
> +
> +The mm_autonuma structure is used to hold AutoNUMA data required for
> +each mm structure. Total size: 32 bytes + 8 * # of NUMA nodes.
> +
> +- Link of mm structures to be scanned by knuma_scand (8 bytes)
> +
> +- Pointer to associated mm structure (8 bytes)
> +
> +- fault_pass - pass number of knuma_scand (8 bytes)
> +
> +- Memory NUMA statistics for this process:
> +
> +    Total number of anon pages in the process working set (8 bytes)
> +
> +    Per NUMA node number of anon pages in the process working set (8
> +    bytes * # of NUMA nodes)
> +
> +=== task_autonuma - per task AutoNUMA data ===
> +
> +The task_autonuma structure is used to hold AutoNUMA data required for
> +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> +nodes.
> +
> +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> +                scheduler balancing code, -1 if none (2 bytes)
> +
> +- Task NUMA statistics for this thread/process:
> +
> +    Total number of NUMA hinting page faults in this pass of
> +    knuma_scand (8 bytes)
> +
> +    Per NUMA node number of NUMA hinting page faults in this pass of
> +    knuma_scand (8 bytes * # of NUMA nodes)
> +

It might be possible to put a coarse ping-pong detection counter in here
as well by recording a decaying average of the number of pages migrated
over a number of knuma_scand passes instead of just the last one.  If the
value is too high, you're ping-ponging and the process should be ignored,
possibly forever. It's not a requirement and it would be more memory
overhead obviously but I'm throwing it out there as a suggestion if it
ever turns out the ping-pong problem is real.
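
Something as simple as this would probably do (entirely hypothetical,
the migrate_avg field and NUMA_PING_PONG_THRESHOLD below do not exist
in the series):

/* called once per knuma_scand pass with that pass's migration count */
static inline void task_numa_account_migrations(struct task_autonuma *ta,
						unsigned long nr_migrated)
{
	/* same halving-style decay as the fault statistics */
	ta->migrate_avg = ta->migrate_avg / 2 + nr_migrated;
}

static inline bool task_numa_ping_pong(struct task_autonuma *ta)
{
	/* a persistently high decayed value means pages keep bouncing */
	return ta->migrate_avg > NUMA_PING_PONG_THRESHOLD;
}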


> +=== page_autonuma - per page AutoNUMA data ===
> +
> +The page_autonuma structure is used to hold AutoNUMA data required for
> +each page of memory. Total size: 2 bytes
> +
> +    last_nid - NUMA node for last time this page incurred a NUMA
> +               hinting fault, -1 if none (2 bytes)
> +
> +=== pte and pmd - NUMA flags ===
> +
> +A bit in pte and pmd structures are used to indicate to the page fault
> +handler that the fault was incurred for NUMA purposes.
> +
> +    _PAGE_NUMA: a NUMA hinting fault at either the pte or pmd level (1
> +                bit)
> +
> +        The same bit used for _PAGE_PROTNONE is used for
> +        _PAGE_NUMA. This is okay because all uses of _PAGE_PROTNONE
> +        are mutually exclusive of _PAGE_NUMA.
> +
> +Note: NUMA hinting fault at the pmd level is only used on
> +architectures where pmd granularity is supported.
> +
> +== IV: AutoNUMA Active ==
> +
> +AutoNUMA is considered active when each of the following 4 conditions
> +are met:
> +
> +- AutoNUMA is compiled into the kernel
> +
> +    CONFIG_AUTONUMA=y
> +
> +- The hardware has NUMA properties
> +
> +- AutoNUMA is enabled at boot time
> +
> +    "noautonuma" not passed to the kernel command line
> +
> +- AutoNUMA is enabled dynamically at run-time
> +
> +    CONFIG_AUTONUMA_DEFAULT_ENABLED=y
> +
> +  or
> +
> +    echo 1 >/sys/kernel/mm/autonuma/enabled
> +
> +== V: AutoNUMA Flags ==
> +
> +AUTONUMA_POSSIBLE_FLAG: The kernel was not passed the "noautonuma"
> +                        boot parameter and is being run on NUMA
> +                        hardware.
> +
> +AUTONUMA_ENABLED_FLAG: AutoNUMA is enabled (default set at compile
> +                       time).
> +
> +AUTONUMA_DEBUG_FLAG (default 0): printf lots of debug info, set
> +		                 through sysfs
> +
> +AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG (default 0): AutoNUMA will
> +                                                     prioritize on
> +                                                     NUMA affinity and
> +                                                     will disregard
> +                                                     inter-node
> +                                                     fairness.
> +
> +AUTONUMA_CHILD_INHERITANCE_FLAG (default 1): AutoNUMA statistics are
> +                                             copied to the child at
> +                                             every fork/clone instead
> +                                             of resetting them like it
> +                                             happens unconditionally
> +                                             in execve.
> +
> +AUTONUMA_SCAN_PMD_FLAG (default 1): trigger NUMA hinting faults for
> +                                    the pmd level instead of just the
> +                                    pte level (note: for THP, NUMA
> +                                    hinting faults always occur at the
> +                                    pmd level)
> +
> +AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG (default 0): page is migrated
> +                                                     on first NUMA
> +                                                     hinting fault
> +                                                     instead of second
> +
> +AUTONUMA_MM_WORKING_SET_FLAG (default 1): mm_autonuma represents a
> +                                          working set estimation of
> +                                          the memory used by the
> +                                          process
> +
> +Contributors: Andrea Arcangeli, Karen Noel, Rik van Riel
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 02/33] autonuma: make set_pmd_at always available
  2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
@ 2012-10-11 10:54   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 10:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:44AM +0200, Andrea Arcangeli wrote:
> set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
> mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
> won't fail.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n
  2012-10-03 23:50 ` [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
@ 2012-10-11 10:54   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 10:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:45AM +0200, Andrea Arcangeli wrote:
> is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
> this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.
> 
> Reported-by: Petr Holasek <pholasek@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
  2012-10-03 23:50 ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
@ 2012-10-11 11:01   ` Mel Gorman
  2012-10-11 16:43       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 11:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> faults to identify the per NUMA node working set of the thread at
> runtime.
> 
> Arming the NUMA hinting page fault mechanism works similarly to
> setting up a mprotect(PROT_NONE) virtual range: the present bit is
> cleared at the same time that _PAGE_NUMA is set, so when the fault
> triggers we can identify it as a NUMA hinting page fault.
> 

That implies that there is an atomic update requirement or at least
an ordering requirement -- present bit must be cleared before setting
NUMA bit. No doubt it'll be clear later in the series how this is
accomplished. What you propose seems ok, but it all depends on how it's
implemented, so I'm leaving my ack off this particular patch for now.
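
For reference, the two bits live in the same pte word, so the ordering
reduces to computing the new flags on a local copy and publishing them
with a single pte store while holding the page table lock. A minimal
sketch of arming one pte (not the patch code; pte_mknuma() is the
helper quoted in a later patch in this series):

/*
 * Sketch only: arm a NUMA hinting fault on one pte.  The caller is
 * assumed to hold the page table lock, so no concurrent fault or
 * unmap can observe a half-done update; the cleared present bit and
 * the set NUMA bit become visible in the same store.
 */
static void sketch_arm_pte_numa(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep)
{
	pte_t pte = ptep_get_and_clear(mm, addr, ptep);

	pte = pte_mknuma(pte);		/* sets _PAGE_NUMA, clears _PAGE_PRESENT */
	set_pte_at(mm, addr, ptep, pte);	/* single store publishes both */
}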

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
  2012-10-03 23:50 ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
@ 2012-10-11 11:15   ` Mel Gorman
  2012-10-11 16:58       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 11:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:47AM +0200, Andrea Arcangeli wrote:
> Implement pte_numa and pmd_numa.
> 
> We must atomically set the numa bit and clear the present bit to
> define a pte_numa or pmd_numa.
> 

Or I could just have kept reading :/

> Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
> a thread touches a virtual address in the corresponding virtual range,
> a NUMA hinting page fault will trigger. The NUMA hinting page fault
> will clear the NUMA bit and set the present bit again to resolve the
> page fault.
> 
> NUMA hinting page faults are used:
> 
> 1) to fill in the per-thread NUMA statistic stored for each thread in
>    a current->task_autonuma data structure
> 
> 2) to track the per-node last_nid information in the page structure to
>    detect false sharing
> 
> 3) to migrate the page with Migrate On Fault if there have been enough
>    NUMA hinting page faults on the page coming from remote CPUs
>    (autonuma_last_nid heuristic)
> 
> NUMA hinting page faults collect information and possibly add pages to
> migrate queues. They are extremely quick, and they try to be

They better be :D They are certainly a contributor to the high System
CPU usage I saw in the basic tests but I expect they are a relatively
small contributor with the bulk of the time actually being consumed by
the various scanners.

> non-blocking also when Migrate On Fault is invoked as result.
> 
> The generic implementation is used when CONFIG_AUTONUMA=n.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable.h |   65 ++++++++++++++++++++++++++++++++++++++-
>  include/asm-generic/pgtable.h  |   12 +++++++
>  2 files changed, 75 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index c3520d7..6c14b40 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
>  
>  static inline int pte_present(pte_t a)
>  {
> -	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> +	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> +			       _PAGE_NUMA);
>  }
>  

huh?

#define _PAGE_NUMA     _PAGE_PROTNONE

so this is effectively _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PROTNONE

I suspect you are doing this because there is no requirement for
_PAGE_NUMA == _PAGE_PROTNONE for other architectures and it was best to
describe your intent. Is that really the case or did I miss something
stupid?
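
(For the record, on x86 the extra flag is indeed redundant; a sketch of
what the masked test reduces to there, versus a hypothetical
architecture with a dedicated bit:)

/*
 * x86 today: _PAGE_NUMA == _PAGE_PROTNONE, so pte_present() compiles
 * down to the old test.  A (hypothetical) architecture that gave
 * _PAGE_NUMA its own software bit would keep the three flags distinct
 * and the patched line would still be correct as written.
 */
static inline int sketch_pte_present_x86(pte_t a)
{
	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}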

>  static inline int pte_hidden(pte_t pte)
> @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
>  	 * the _PAGE_PSE flag will remain set at all times while the
>  	 * _PAGE_PRESENT bit is clear).
>  	 */
> -	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> +				 _PAGE_NUMA);
> +}
> +
> +#ifdef CONFIG_AUTONUMA
> +/*
> + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> + * same bit too). It's set only when _PAGE_PRESET is not set and it's

same bit on x86, not necessarily anywhere else.

_PAGE_PRESENT?

> + * never set if _PAGE_PRESENT is set.
> + *
> + * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
> + * fault triggers on those regions if pte/pmd_numa returns true
> + * (because _PAGE_PRESENT is not set).
> + */
> +static inline int pte_numa(pte_t pte)
> +{
> +	return (pte_flags(pte) &
> +		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return (pmd_flags(pmd) &
> +		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +#endif
> +
> +/*
> + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> + * because they're called by the NUMA hinting minor page fault.

automatically or atomically?

I assume you meant atomically, but what stops two threads faulting at the
same time and doing the same update? mmap_sem will be insufficient in
that case, so what is guaranteeing the atomicity? The PTL?
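
For illustration, if it is indeed the page table lock, the fault side
would follow the usual take-PTL-and-recheck pattern; a minimal sketch
(not the patch code) using the helpers quoted in this hunk:

/* Sketch: resolve a NUMA hinting fault on one pte, serialized by the PTL. */
static void sketch_fixup_pte_numa(struct vm_area_struct *vma, pmd_t *pmd,
				  unsigned long addr)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pte_t *ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	pte_t entry = *ptep;

	/*
	 * Of two threads faulting on the same address, only the first to
	 * take the PTL still sees pte_numa(); the second finds the pte
	 * already fixed up and does nothing.
	 */
	if (pte_numa(entry)) {
		entry = pte_mknonnuma(entry);
		set_pte_at(mm, addr, ptep, entry);
		update_mmu_cache(vma, addr, ptep);
	}
	pte_unmap_unlock(ptep, ptl);
}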

> If we
> + * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
> + * would be forced to set it later while filling the TLB after we
> + * return to userland. That would trigger a second write to memory
> + * that we optimize away by setting _PAGE_ACCESSED here.
> + */
> +static inline pte_t pte_mknonnuma(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_NUMA);
> +	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> +{
> +	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
> +	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> +	pte = pte_set_flags(pte, _PAGE_NUMA);
> +	return pte_clear_flags(pte, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> +	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
>  }
>  
>  static inline int pmd_none(pmd_t pmd)
> @@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
>  
>  static inline int pmd_bad(pmd_t pmd)
>  {
> +#ifdef CONFIG_AUTONUMA
> +	if (pmd_numa(pmd))
> +		return 0;
> +#endif
>  	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
>  }
>  
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index ff4947b..0ff87ec 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -530,6 +530,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
>  #endif
>  }
>  
> +#ifndef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_AUTONUMA */
> +
>  #endif /* CONFIG_MMU */
>  
>  #endif /* !__ASSEMBLY__ */
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
  2012-10-03 23:50 ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
@ 2012-10-11 12:22   ` Mel Gorman
  2012-10-11 17:05       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 12:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:48AM +0200, Andrea Arcangeli wrote:
> In the special "pmd" mode of knuma_scand
> (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> type (_PAGE_PRESENT not set), however the pte might be
> present. Therefore, gup_pmd_range() must return 0 in this case to
> avoid losing a NUMA hinting page fault during gup_fast.
> 

So if gup_fast fails, presumably we fall back to taking the mmap_sem and
calling get_user_pages(). This is a heavier operation and I wonder if the
cost is justified. i.e. Is the performance loss from using get_user_pages()
offset by improved NUMA placement? I ask because we always incur the cost of
taking mmap_sem but only sometimes get it back from improved NUMA placement.
How bad would it be if gup_fast lost some of the NUMA hinting information?
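
For context, a gup_fast "failure" is only partial here: the x86
get_user_pages_fast() retries whatever the fast walk skipped under
mmap_sem, roughly as in the simplified sketch below (not the exact 3.6
code; variable names follow arch/x86/mm/gup.c):

/* Simplified sketch of the x86 get_user_pages_fast() slow-path fallback. */
static int sketch_gup_fast_fallback(unsigned long start, int nr_pages,
				    int write, struct page **pages, int nr)
{
	/* nr = number of pages the IRQ-disabled fast walk managed to pin */
	if (nr < nr_pages) {
		int ret;

		/*
		 * Fall back for the remainder (e.g. a pmd_numa() range):
		 * take mmap_sem and let the regular fault path run,
		 * including the NUMA hinting fault.
		 */
		down_read(&current->mm->mmap_sem);
		ret = get_user_pages(current, current->mm,
				     start + ((unsigned long)nr << PAGE_SHIFT),
				     nr_pages - nr, write, 0, pages + nr, NULL);
		up_read(&current->mm->mmap_sem);
		if (ret > 0)
			nr += ret;
	}
	return nr;
}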

> Note: gup_fast will skip over non present ptes (like numa types), so
> no explicit check is needed for the pte_numa case. gup_fast will also
> skip over THP when the trans huge pmd is non present. So, the pmd_numa
> case will also be correctly skipped with no additional code changes
> required.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/mm/gup.c |   13 ++++++++++++-
>  1 files changed, 12 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 6dc9921..cad7d97 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -169,8 +169,19 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		 * can't because it has irq disabled and
>  		 * wait_split_huge_page() would never return as the
>  		 * tlb flush IPI wouldn't run.
> +		 *
> +		 * The pmd_numa() check is needed because the code
> +		 * doesn't check the _PAGE_PRESENT bit of the pmd if
> +		 * the gup_pte_range() path is taken. NOTE: not all
> +		 * gup_fast users will will access the page contents
> +		 * using the CPU through the NUMA memory channels like
> +		 * KVM does. So we're forced to trigger NUMA hinting
> +		 * page faults unconditionally for all gup_fast users
> +		 * even though NUMA hinting page faults aren't useful
> +		 * to I/O drivers that will access the page with DMA
> +		 * and not with the CPU.
>  		 */
> -		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
>  			return 0;
>  		if (unlikely(pmd_large(pmd))) {
>  			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-03 23:50 ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
@ 2012-10-11 12:28   ` Mel Gorman
  2012-10-11 15:24     ` Rik van Riel
  2012-10-11 17:15       ` Andrea Arcangeli
  0 siblings, 2 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 12:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:49AM +0200, Andrea Arcangeli wrote:
> Define the two data structures that collect the per-process (in the
> mm) and per-thread (in the task_struct) statistical information that
> are the input of the CPU follow memory algorithms in the NUMA
> scheduler.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma_types.h |  107 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/autonuma_types.h
> 
> diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
> new file mode 100644
> index 0000000..9673ce8
> --- /dev/null
> +++ b/include/linux/autonuma_types.h
> @@ -0,0 +1,107 @@
> +#ifndef _LINUX_AUTONUMA_TYPES_H
> +#define _LINUX_AUTONUMA_TYPES_H
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +#include <linux/numa.h>
> +
> +
> +/*
> + * Per-mm (per-process) structure that contains the NUMA memory
> + * placement statistics generated by the knuma scan daemon. This
> + * structure is dynamically allocated only if AutoNUMA is possible on
> + * this system. They are linked togehter in a list headed within the

s/togehter/together/

> + * knumad_scan structure.
> + */
> +struct mm_autonuma {

Nit but this is very similar in principle to mm_slot for transparent
huge pages. It might be worth renaming both to mm_thp_slot and
mm_autonuma_slot to set the expectation they are very similar in nature.
Could potentially be made generic but probably overkill.

> +	/* link for knuma_scand's list of mm structures to scan */
> +	struct list_head mm_node;
> +	/* Pointer to associated mm structure */
> +	struct mm_struct *mm;
> +
> +	/*
> +	 * Zeroed from here during allocation, check
> +	 * mm_autonuma_reset() if you alter the below.
> +	 */
> +
> +	/*
> +	 * Pass counter for this mm. This exist only to be able to
> +	 * tell when it's time to apply the exponential backoff on the
> +	 * task_autonuma statistics.
> +	 */
> +	unsigned long mm_numa_fault_pass;
> +	/* Total number of pages that will trigger NUMA faults for this mm */
> +	unsigned long mm_numa_fault_tot;
> +	/* Number of pages that will trigger NUMA faults for each [nid] */
> +	unsigned long mm_numa_fault[0];
> +	/* do not add more variables here, the above array size is dynamic */
> +};

How cache hot is this structure? Nodes are sharing counters in the same
cache lines, so if updates are frequent this will bounce like a mad yoke.
Profiles will tell for sure, but it's possible that some sort of per-cpu
hilarity will be necessary here in the future.
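
For illustration, if profiles ever do show the counters bouncing, the
usual mitigation would be padding each node's counter out to its own
cache line, along these lines (a hypothetical alternative layout, not
part of this series):

/*
 * Hypothetical: one cache line per node so NUMA hinting faults on
 * different nodes never write the same line.  Costs roughly
 * nr_node_ids * SMP_CACHE_BYTES per mm instead of
 * nr_node_ids * sizeof(unsigned long).
 */
struct mm_numa_fault_pad {
	unsigned long count;
} ____cacheline_aligned_in_smp;

struct mm_autonuma_padded {
	unsigned long mm_numa_fault_pass;
	unsigned long mm_numa_fault_tot;
	struct mm_numa_fault_pad mm_numa_fault[0];
};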

> +
> +extern int alloc_mm_autonuma(struct mm_struct *mm);
> +extern void free_mm_autonuma(struct mm_struct *mm);
> +extern void __init mm_autonuma_init(void);
> +
> +/*
> + * Per-task (thread) structure that contains the NUMA memory placement
> + * statistics generated by the knuma scan daemon. This structure is
> + * dynamically allocated only if AutoNUMA is possible on this
> + * system. They are linked togehter in a list headed within the
> + * knumad_scan structure.
> + */
> +struct task_autonuma {
> +	/* node id the CPU scheduler should try to stick with (-1 if none) */
> +	int task_selected_nid;
> +
> +	/*
> +	 * Zeroed from here during allocation, check
> +	 * mm_autonuma_reset() if you alter the below.
> +	 */
> +
> +	/*
> +	 * Pass counter for this task. When the pass counter is found
> +	 * out of sync with the mm_numa_fault_pass we know it's time
> +	 * to apply the exponential backoff on the task_autonuma
> +	 * statistics, and then we synchronize it with
> +	 * mm_numa_fault_pass. This pass counter is needed because in
> +	 * knuma_scand we work on the mm and we've no visibility on
> +	 * the task_autonuma. Furthermore it would be detrimental to
> +	 * apply exponential backoff to all task_autonuma associated
> +	 * to a certain mm_autonuma (potentially zeroing out the trail
> +	 * of statistical data in task_autonuma) if the task is idle
> +	 * for a long period of time (i.e. several knuma_scand passes).
> +	 */
> +	unsigned long task_numa_fault_pass;
> +	/* Total number of eligible pages that triggered NUMA faults */
> +	unsigned long task_numa_fault_tot;
> +	/* Number of pages that triggered NUMA faults for each [nid] */
> +	unsigned long task_numa_fault[0];
> +	/* do not add more variables here, the above array size is dynamic */
> +};
> +

Same question about cache hotness.

> +extern int alloc_task_autonuma(struct task_struct *tsk,
> +			       struct task_struct *orig,
> +			       int node);
> +extern void __init task_autonuma_init(void);
> +extern void free_task_autonuma(struct task_struct *tsk);
> +
> +#else /* CONFIG_AUTONUMA */
> +
> +static inline int alloc_mm_autonuma(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +static inline void free_mm_autonuma(struct mm_struct *mm) {}
> +static inline void mm_autonuma_init(void) {}
> +
> +static inline int alloc_task_autonuma(struct task_struct *tsk,
> +				      struct task_struct *orig,
> +				      int node)
> +{
> +	return 0;
> +}
> +static inline void task_autonuma_init(void) {}
> +static inline void free_task_autonuma(struct task_struct *tsk) {}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#endif /* _LINUX_AUTONUMA_TYPES_H */
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 08/33] autonuma: define the autonuma flags
  2012-10-03 23:50 ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
@ 2012-10-11 13:46   ` Mel Gorman
  2012-10-11 17:34       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 13:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:50AM +0200, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma_flags.h |  120 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 120 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/autonuma_flags.h
> 
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> new file mode 100644
> index 0000000..630ecc5
> --- /dev/null
> +++ b/include/linux/autonuma_flags.h
> @@ -0,0 +1,120 @@
> +#ifndef _LINUX_AUTONUMA_FLAGS_H
> +#define _LINUX_AUTONUMA_FLAGS_H
> +
> +/*
> + * If CONFIG_AUTONUMA=n only autonuma_possible() is defined (as false)
> + * to allow optimizing away at compile time blocks of common code
> + * without using #ifdefs.
> + */
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +enum autonuma_flag {
> +	/*
> +	 * Set if the kernel wasn't passed the "noautonuma" boot
> +	 * parameter and the hardware is NUMA. If AutoNUMA is not
> +	 * possible the value of all other flags becomes irrelevant
> +	 * (they will never be checked) and AutoNUMA can't be enabled.
> +	 *
> +	 * No defaults: depends on hardware discovery and "noautonuma"
> +	 * early param.
> +	 */
> +	AUTONUMA_POSSIBLE_FLAG,
> +	/*
> +	 * If AutoNUMA is possible, this defines if AutoNUMA is
> +	 * currently enabled or disabled. It can be toggled at runtime
> +	 * through sysfs.
> +	 *
> +	 * The default depends on CONFIG_AUTONUMA_DEFAULT_ENABLED.
> +	 */
> +	AUTONUMA_ENABLED_FLAG,
> +	/*
> +	 * If set through sysfs this will print lots of debug info
> +	 * about the AutoNUMA activities in the kernel logs.
> +	 *
> +	 * Default not set.
> +	 */
> +	AUTONUMA_DEBUG_FLAG,
> +	/*
> +	 * This defines if CFS should prioritize between load
> +	 * balancing fairness or NUMA affinity, if there are no idle
> +	 * CPUs available. If this flag is set AutoNUMA will
> +	 * prioritize on NUMA affinity and it will disregard
> +	 * inter-node fairness.
> +	 *
> +	 * Default not set.
> +	 */
> +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,

Should this be a SCHED_FEATURE flag?

> +	/*
> +	 * This flag defines if the task/mm_autonuma statistics should
> +	 * be inherited from the parent task/process or instead if
> +	 * they should be cleared at every fork/clone. The
> +	 * task/mm_autonuma statistics are always cleared across
> +	 * execve and there's no way to disable that.
> +	 *
> +	 * Default not set.
> +	 */
> +	AUTONUMA_CHILD_INHERITANCE_FLAG,

Have you ever identified a case where it's a good idea to set that flag?
A child that closely shares data with its parent is not likely to also
want to migrate to a separate node. The flag just seems unnecessary, and
it's impossible to suggest to an administrator how it might be used.
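
For reference, the flag only changes what the fork path does with the
freshly allocated statistics; a minimal sketch of the documented
semantics (not the patch code), using the autonuma_child_inheritance()
helper defined in this patch:

/* Sketch: initialize the child's NUMA statistics at fork/clone time. */
static void sketch_task_autonuma_fork(struct task_autonuma *child,
				      struct task_autonuma *parent,
				      size_t stats_size)
{
	if (autonuma_child_inheritance())
		memcpy(child, parent, stats_size);	/* keep parent's stats */
	else
		memset(child, 0, stats_size);		/* start from scratch */

	/* execve always resets the statistics, regardless of this flag. */
}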

> +	/*
> +	 * If set, this tells knuma_scand to trigger NUMA hinting page
> +	 * faults at the pmd level instead of the pte level. This
> +	 * reduces the number of NUMA hinting faults potentially
> +	 * saving CPU time. It reduces the accuracy of the
> +	 * task_autonuma statistics (but does not change the accuracy
> +	 * of the mm_autonuma statistics). This flag can be toggled
> +	 * through sysfs as runtime.
> +	 *
> +	 * This flag does not affect AutoNUMA with transparent
> +	 * hugepages (THP). With THP the NUMA hinting page faults
> +	 * always happen at the pmd level, regardless of the setting
> +	 * of this flag. Note: there is no reduction in accuracy of
> +	 * task_autonuma statistics with THP.
> +	 *
> +	 * Default set.
> +	 */
> +	AUTONUMA_SCAN_PMD_FLAG,

This flag and the other flags make sense. Early on we just are not going
to know what the correct choice is. My gut says that ultimately we'll
default to PMD level *but* fall back to PTE level on a per-task basis if
ping-pong migrations are detected. This will catch ping-pongs on data
that is not PMD aligned although obviously data that is not page aligned
will also suffer. Eventually I think this flag will go away but the
behaviour will be:

default, AUTONUMA_SCAN_PMD
if ping-pong, fallback to AUTONUMA_SCAN_PTE
if it still ping-pongs, AUTONUMA_SCAN_NONE

so there is a graceful degradation if autonuma is doing the wrong thing.
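
Purely as a strawman, that degradation could be tracked per task with
something like the sketch below (the scan-mode names and the ping-pong
detector are invented here):

/* Hypothetical per-task scan granularity fallback, not in this series. */
enum sketch_scan_mode { SKETCH_SCAN_PMD, SKETCH_SCAN_PTE, SKETCH_SCAN_NONE };

static enum sketch_scan_mode sketch_degrade(enum sketch_scan_mode mode,
					    bool ping_pong_detected)
{
	if (!ping_pong_detected)
		return mode;			/* current granularity works */
	if (mode == SKETCH_SCAN_PMD)
		return SKETCH_SCAN_PTE;		/* finer-grained hinting faults */
	return SKETCH_SCAN_NONE;		/* still thrashing: stop migrating */
}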

> +};
> +
> +extern unsigned long autonuma_flags;
> +
> +static inline bool autonuma_possible(void)
> +{
> +	return test_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_enabled(void)
> +{
> +	return test_bit(AUTONUMA_ENABLED_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_debug(void)
> +{
> +	return test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_load_balance_strict(void)
> +{
> +	return test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +			&autonuma_flags);
> +}
> +
> +static inline bool autonuma_child_inheritance(void)
> +{
> +	return test_bit(AUTONUMA_CHILD_INHERITANCE_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_pmd(void)
> +{
> +	return test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> +}
> +
> +#else /* CONFIG_AUTONUMA */
> +
> +static inline bool autonuma_possible(void)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#endif /* _LINUX_AUTONUMA_FLAGS_H */
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 13/33] autonuma: autonuma_enter/exit
  2012-10-03 23:50 ` [PATCH 13/33] autonuma: autonuma_enter/exit Andrea Arcangeli
@ 2012-10-11 13:50   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 13:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:55AM +0200, Andrea Arcangeli wrote:
> This is where we register (and unregister) an "mm" structure into
> AutoNUMA for knuma_scand to scan them.
> 
> knuma_scand is the first gear in the whole AutoNUMA algorithm.
> knuma_scand is the daemon that scans the "mm" structures in the list
> and sets pmd_numa and pte_numa to allow the NUMA hinting page faults
> to start. All other actions follow after that. If knuma_scand doesn't
> run, AutoNUMA is fully bypassed. If knuma_scand is stopped, soon all
> other AutoNUMA gears will settle down too.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I think there will be other cases in the future where autonuma_exit will
be used but not mandatory to deal with right now so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 10:19 ` Mel Gorman
@ 2012-10-11 14:56     ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 14:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi Mel,

On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> As a basic sniff test I added a test to MMtests for the AutoNUMA
> Benchmark on a 4-node machine and the following fell out.
> 
>                                      3.6.0                 3.6.0
>                                    vanilla        autonuma-v33r6
> User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)

Thanks a lot for the help and for looking into it!

Just curious, why are you running only numa02_SMT and
numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
without _suffix)

> 
> The performance improvements are certainly there for this basic test but
> I note the System CPU usage is very high.

Yes, migration is expensive, but after convergence has been reached the
system time should be the same as upstream.

btw, I improved things further in autonuma28 (new branch in aa.git).

> 
> The vmstats showed up this
> 
> THP fault alloc               81376       86070
> THP collapse alloc               14       40423
> THP splits                        8       41792
> 
> So we're doing a lot of splits and collapses for THP there. There is a
> possibility that khugepaged and the autonuma kernel thread are doing some
> busy work. Not a show-stopper, just interesting.
> 
> I've done no analysis at all and this was just to have something to look
> at before looking at the code closer.

Sure, the idea is to have THP native migration, then we'll do zero
collapse/splits.

> > The objective of AutoNUMA is to provide out-of-the-box performance as
> > close as possible to (and potentially faster than) manual NUMA hard
> > bindings.
> > 
> > It is not very intrusive into the kernel core and is well structured
> > into separate source modules.
> > 
> > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > NUMA placement algorithms such as numad (in userland through cpusets)
> > and schednuma (in kernel too) and was found superior in all cases.
> > 
> > Most important: not a single benchmark showed a regression yet when
> > compared to vanilla kernels. Not even on the 2 node systems where the
> > NUMA effects are less significant.
> > 
> 
> Ok, I have not run a general regression test and won't get the chance to
> soon but hopefully others will. One thing they might want to watch out
> for is System CPU time. It's possible that your AutoNUMA benchmark
> triggers a worst-case but it's worth keeping an eye on because any cost
> from that has to be offset by gains from better NUMA placements.

Good idea to monitor it indeed.

> Is STREAM really a good benchmark in this case? Unless you also ran it in
> parallel mode, it basically operates against three arrays and is not really
> NUMA friendly once the total size is greater than a NUMA node. I guess
> it makes sense to run it just to see whether autonuma breaks it :)

The way this is run is that there is 1 stream instance, then 4, then 8,
until we max out all CPUs.

I think we could run "memhog" instead of "stream" and it'd be the
same; stream probably better resembles real-life computations.

The upstream scheduler lacks any notion of NUMA affinity, so at some
point during the 5 min run a process changes node; the scheduler
doesn't notice that the process's memory is elsewhere, so the process
stays there, and the memory can't follow the CPU either. So then it
runs much slower.

So it's the simplest test of all to get right, all it requires is some
notion of node affinity.

It's also the only workload that the home node design in schednuma in
tip.git can get right (schednuma versions posted after the current
tip.git adopted AutoNUMA's CPU-follows-memory design, so schednuma will
have a chance to get more right than just the stream multi-instance
benchmark).

So it's just a verification that the simple stuff (single threaded
process computing) is OK and that the upstream regression vs hard NUMA
bindings is fixed.

stream is also one case where we have to perform identically to the hard
NUMA bindings. No migration of CPU or memory must ever happen with
AutoNUMA in the stream benchmark. AutoNUMA will just monitor it, find
that it is already in the best place, and leave it alone.

With the autonuma-benchmark it's impossible to reach performance
identical to the _HARD_BIND case, because _HARD_BIND doesn't need to
do any memory migration (I'm 3 seconds away from hard bindings in a
198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).

> 
> > 
> > == iozone ==
> > 
> >                      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
> > FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
> > ====--------------------------------------------------------------------------------------------------------------
> > noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
> > autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825
> > 
> > AutoNUMA can't help much for I/O loads but you can see it seems a
> > small improvement there too. The important thing for I/O loads, is to
> > verify that there is no regression.
> > 
> 
> It probably is unreasonable to expect autonuma to handle the case where
> a file-based workload has not been tuned for NUMA. In too many cases
> it's going to be read/write based so you're not going to get the
> statistics you need.

Agreed. Some statistics may still accumulate, which is still better than
nothing, but unless the workload is CPU and memory bound we can't
expect to see any difference.

This is meant as a verification that we're not introducing regressions
for I/O bound loads.

Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 10/33] autonuma: CPU follows memory algorithm
  2012-10-03 23:50 ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
@ 2012-10-11 14:58   ` Mel Gorman
  2012-10-12  0:25       ` Andrea Arcangeli
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 14:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> This algorithm takes as input the statistical information filled by the
> knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> (p->task_autonuma), evaluates it for the current scheduled task, and
> compares it against every other running process to see if it should
> move the current task to another NUMA node.
> 

That sounds expensive if there are a lot of running processes in the
system. How often does this happen? Mention it here even though I
realised much later that it's obvious from the patch itself.

> When the scheduler decides if the task should be migrated to a
> different NUMA node or to stay in the same NUMA node, the decision is
> then stored into p->task_autonuma->task_selected_nid. The fair
> scheduler then tries to keep the task on the task_selected_nid.
> 
> Code include fixes and cleanups from Hillf Danton <dhillf@gmail.com>.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma_sched.h |   59 ++++
>  include/linux/mm_types.h       |    5 +
>  include/linux/sched.h          |    3 +
>  kernel/sched/core.c            |    1 +
>  kernel/sched/fair.c            |    4 +
>  kernel/sched/numa.c            |  638 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h           |   19 ++
>  7 files changed, 729 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/autonuma_sched.h
>  create mode 100644 kernel/sched/numa.c
> 
> diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
> new file mode 100644
> index 0000000..8d786eb
> --- /dev/null
> +++ b/include/linux/autonuma_sched.h
> @@ -0,0 +1,59 @@
> +#ifndef _LINUX_AUTONUMA_SCHED_H
> +#define _LINUX_AUTONUMA_SCHED_H
> +
> +#include <linux/autonuma_flags.h>
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +extern void __sched_autonuma_balance(void);
> +extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
> +					    int strict_numa, int dst_cpu,
> +					    enum cpu_idle_type idle);
> +
> +/*
> + * Return true if the specified CPU is in this task's selected_nid (or
> + * there is no affinity set for the task).
> + */
> +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> +{

nit, but elsewhere you have

static inline TYPE and here you have
static TYPE inline

> +	int task_selected_nid;
> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +
> +	if (!task_autonuma)
> +		return true;
> +
> +	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> +	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> +		return true;
> +	else
> +		return false;
> +}

no need for else.

> +
> +static inline void sched_autonuma_balance(void)
> +{
> +	struct task_autonuma *ta = current->task_autonuma;
> +
> +	if (ta && current->mm)
> +		__sched_autonuma_balance();
> +}
> +

Ok, so this could do with a comment explaining where it is called from.
It is called during idle balancing at least so potentially this is every
scheduler tick. It'll be run from softirq context so the cost will not
be obvious to a process but the overhead will be there. What happens if
this takes longer than a scheduler tick to run? Is that possible?

> +#else /* CONFIG_AUTONUMA */
> +
> +static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
> +						   int strict_numa,
> +						   int dst_cpu,
> +						   enum cpu_idle_type idle)
> +{
> +	return true;
> +}
> +
> +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> +{
> +	return true;
> +}
> +
> +static inline void sched_autonuma_balance(void) {}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#endif /* _LINUX_AUTONUMA_SCHED_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index bf78672..c80101c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -13,6 +13,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
>  #include <linux/uprobes.h>
> +#include <linux/autonuma_types.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>  
> @@ -405,6 +406,10 @@ struct mm_struct {
>  	struct cpumask cpumask_allocation;
>  #endif
>  	struct uprobes_state uprobes_state;
> +#ifdef CONFIG_AUTONUMA
> +	/* this is used by the scheduler and the page allocator */
> +	struct mm_autonuma *mm_autonuma;
> +#endif
>  };
>  
>  static inline void mm_init_cpumask(struct mm_struct *mm)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 23bddac..ca246e7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1522,6 +1522,9 @@ struct task_struct {
>  	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
>  	short il_next;
>  	short pref_node_fork;
> +#ifdef CONFIG_AUTONUMA
> +	struct task_autonuma *task_autonuma;
> +#endif
>  #endif
>  	struct rcu_head rcu;
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 649c9f8..5a36579 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -72,6 +72,7 @@
>  #include <linux/slab.h>
>  #include <linux/init_task.h>
>  #include <linux/binfmts.h>
> +#include <linux/autonuma_sched.h>
>  
>  #include <asm/switch_to.h>
>  #include <asm/tlb.h>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 96e2b18..877f077 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/profile.h>
>  #include <linux/interrupt.h>
> +#include <linux/autonuma_sched.h>
>  
>  #include <trace/events/sched.h>
>  
> @@ -4932,6 +4933,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>  
>  	rebalance_domains(this_cpu, idle);
>  
> +	if (!this_rq->idle_balance)
> +		sched_autonuma_balance();
> +
>  	/*
>  	 * If this cpu has a pending nohz_balance_kick, then do the
>  	 * balancing on behalf of the other idle cpus whose ticks are
> diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
> new file mode 100644
> index 0000000..d0cbfe9
> --- /dev/null
> +++ b/kernel/sched/numa.c
> @@ -0,0 +1,638 @@
> +/*
> + *  Copyright (C) 2012  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/autonuma_sched.h>
> +#include <asm/tlb.h>
> +
> +#include "sched.h"
> +
> +/*
> + * Callback used by the AutoNUMA balancer to migrate a task to the
> + * selected CPU. Invoked by stop_one_cpu_nowait().
> + */
> +static int autonuma_balance_cpu_stop(void *data)
> +{
> +	struct rq *src_rq = data;
> +	int src_cpu = cpu_of(src_rq);
> +	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
> +	struct task_struct *p = src_rq->autonuma_balance_task;
> +	struct rq *dst_rq = cpu_rq(dst_cpu);
> +
> +	raw_spin_lock_irq(&p->pi_lock);
> +	raw_spin_lock(&src_rq->lock);
> +
> +	/* Make sure the selected cpu hasn't gone down in the meanwhile */
> +	if (unlikely(src_cpu != smp_processor_id() ||
> +		     !src_rq->autonuma_balance))
> +		goto out_unlock;
> +
> +	/* Check if the affinity changed in the meanwhile */
> +	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
> +		goto out_unlock;
> +
> +	/* Is the task to migrate still there? */
> +	if (task_cpu(p) != src_cpu)
> +		goto out_unlock;
> +
> +	BUG_ON(src_rq == dst_rq);
> +
> +	/* Prepare to move the task from src_rq to dst_rq */
> +	double_lock_balance(src_rq, dst_rq);
> +
> +	/*
> +	 * Supposedly pi_lock should have been enough but some code
> +	 * seems to call __set_task_cpu without pi_lock.
> +	 */
> +	if (task_cpu(p) != src_cpu)
> +		goto out_double_unlock;
> +
> +	/*
> +	 * If the task is not on a rq, the task_selected_nid will take
> +	 * care of the NUMA affinity at the next wake-up.
> +	 */
> +	if (p->on_rq) {
> +		deactivate_task(src_rq, p, 0);
> +		set_task_cpu(p, dst_cpu);
> +		activate_task(dst_rq, p, 0);
> +		check_preempt_curr(dst_rq, p, 0);
> +	}
> +
> +out_double_unlock:
> +	double_unlock_balance(src_rq, dst_rq);
> +out_unlock:
> +	src_rq->autonuma_balance = false;
> +	raw_spin_unlock(&src_rq->lock);
> +	/* spinlocks acts as barrier() so p is stored locally on the stack */
> +	raw_spin_unlock_irq(&p->pi_lock);
> +	put_task_struct(p);
> +	return 0;
> +}

Glanced through it only but didn't see anything obviously messed up.

> +
> +#define AUTONUMA_BALANCE_SCALE 1000
> +

I was going to ask why 1000 but the next comment talks about it.

> +/*
> + * This function __sched_autonuma_balance() is responsible for

This function is far too short and could do with another few pages :P

> + * deciding which is the best CPU each process should be running on
> + * according to the NUMA statistics collected in mm->mm_autonuma and
> + * tsk->task_autonuma.
> + *
> + * This will not alter the active idle load balancing and most other
> + * scheduling activity, it works by exchanging running tasks across
> + * CPUs located in different NUMA nodes, when such an exchange
> + * provides a net benefit in increasing the system wide NUMA
> + * convergence.
> + *
> + * The tasks that are the closest to "fully converged" are given the
> + * maximum priority in being moved to their "best node".
> + *
> + * "Full convergence" is achieved when all memory accesses by a task
> + * are 100% local to the CPU it is running on. A task's "best node" is

I think this is the first time you defined convergence in the series.
The explanation should be included in the documentation.
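
For reference, "convergence" here boils down to the per-node share of
recent NUMA hinting faults; a sketch of the weight computation implied
by the comment (field names taken from the task_autonuma structure
introduced earlier in the series):

/*
 * Sketch: per-node weight in [0, AUTONUMA_BALANCE_SCALE].  A weight of
 * AUTONUMA_BALANCE_SCALE (1000) means all recent hinting faults were
 * local to that node, i.e. the task has fully converged there.
 */
static unsigned long sketch_task_weight(struct task_autonuma *ta, int nid)
{
	if (!ta->task_numa_fault_tot)
		return 0;
	return ta->task_numa_fault[nid] * AUTONUMA_BALANCE_SCALE /
	       ta->task_numa_fault_tot;
}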

> + * the NUMA node that recently had the most memory accesses from the
> + * task. The tasks that are closest to being fully converged are given
> + * maximum priority for being moved to their "best node."
> + *
> + * To find how close a task is to converging we use weights. These
> + * weights are computed using the task_autonuma and mm_autonuma
> + * statistics. These weights represent the percentage amounts of
> + * memory accesses (in AUTONUMA_BALANCE_SCALE) that each task recently
> + * had in each node. If the weight of one node is equal to
> + * AUTONUMA_BALANCE_SCALE that implies the task reached "full
> + * convergence" in that given node. To the contrary, a node with a
> + * zero weight would be the "worst node" for the task.
> + *
> + * If the weights for two tasks on CPUs in different nodes are equal
> + * no switch will happen.
> + *
> + * The core math that evaluates the current CPU against the CPUs of
> + * all other nodes is this:
> + *
> + *	if (other_diff > 0 && this_diff > 0)
> + *		weight_diff = other_diff + this_diff;
> + *
> + * other_diff: how much the current task is closer to fully converge
> + * on the node of the other CPU than the other task that is currently
> + * running in the other CPU.

In the changelog you talked about comparing a process with every other
running process but here it looks like you intend to examine every
process that is *currently running* on a remote node and compare that.
What if the best process to swap with is not currently running? Do we
miss it?

I do not have a better suggestion on how this might be done better.
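
To make the check concrete with made-up numbers: say the current task
has weight 700 on the other CPU's node and 200 on its own node, while
the task running on that other CPU has weight 400 on that (its own)
node. Then other_diff = 700 - 400 = 300 and this_diff = 700 - 200 = 500;
both are positive, so the pair is an exchange candidate scored at
weight_diff = 800, and the candidate destination with the largest such
value is the one selected.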

> + *
> + * this_diff: how much the current task is closer to converge on the
> + * node of the other CPU than in the current node.
> + *
> + * If both checks succeed it guarantees that we found a way to
> + * multilaterally improve the system wide NUMA
> + * convergence. Multilateral here means that the same checks will not
> + * succeed again on those same two tasks, after the task exchange, so
> + * there is no risk of ping-pong.
> + *

At least not at that instant in time. A new CPU binding or change in
behaviour (such as a computation finishing and a reduce step starting)
might change that scoring.

> + * If a task exchange can happen because the two checks succeed, we
> + * select the destination CPU that will give us the biggest increase
> + * in system wide convergence (i.e. biggest "weight", in the above
> + * quoted code).
> + *

So there is a bit of luck that the best task to exchange is currently
running. How bad is that? It depends really on the number of tasks
running on that node and the priority. There is a chance that it doesn't
matter as such because if all the wrong tasks are currently running then
no exchange will take place - it was just wasted CPU. It does imply that
AutoNUMA works best if CPUs are not over-subscribed with processes. Is
that fair?

Again, I have no suggestions at all on how this might be improved and
these comments are hand-waving towards where we *might* see problems in
the future. If problems are actually identified in practice for
workloads then autonuma can be turned off until the relevant problem
area is fixed.

> + * CFS is NUMA aware via sched_autonuma_can migrate_task(). CFS searches
> + * CPUs in the task's task_selected_nid first during load balancing and
> + * idle balancing.
> + *
> + * The task's task_selected_nid is the node selected by
> + * __sched_autonuma_balance() when it migrates the current task to the
> + * selected cpu in the selected node during the task exchange.
> + *
> + * Once a task has been moved to another node, closer to most of the
> + * memory it has recently accessed, any memory for that task not in
> + * the new node moves slowly to the new node. This is done in the
> + * context of the NUMA hinting page fault (aka Migrate On Fault).
> + *
> + * One important thing is how we calculate the weights using
> + * task_autonuma or mm_autonuma, depending if the other CPU is running
> + * a thread of the current process, or a thread of a different
> + * process.
> + *
> + * We use the mm_autonuma statistics to calculate the NUMA weights of
> + * the two task candidates for exchange if the task in the other CPU
> + * belongs to a different process. This way all threads of the same
> + * process will converge to the same node, which is the one with the
> + * highest percentage of memory for the process.  This will happen
> + * even if the thread's "best node" is busy running threads of a
> + * different process.
> + *
> + * If the two candidate tasks for exchange are threads of the same
> + * process, we use the task_autonuma information (because the
> + * mm_autonuma information is identical). By using the task_autonuma
> + * statistics, each thread follows its own memory locality and they
> + * will not necessarily converge on the same node. This is often very
> + * desirable for processes with more threads than CPUs on each NUMA
> + * node.
> + *

I would fully expect that there are parallel workloads that work on
different portions of a large set of data and it would be perfectly
reasonable for threads using the same address space to converge on
different nodes.

> + * To avoid the risk of NUMA false sharing it's best to schedule all
> + * threads accessing the same memory in the same node (or in as few
> + * nodes as possible if they can't fit in a single node).
> + *
> + * False sharing in the above sentence is intended as simultaneous
> + * virtual memory accesses to the same pages of memory, by threads
> + * running in CPUs of different nodes. Sharing doesn't refer to shared
> + * memory as in tmpfs, but it refers to CLONE_VM instead.
> + *
> + * This algorithm might be expanded to take all runnable processes
> + * into account later.
> + *

I would hope we manage to figure out a way to examine fewer processes,
not more :)

> + * This algorithm is executed by every CPU in the context of the
> + * SCHED_SOFTIRQ load balancing event at regular intervals.
> + *
> + * If the task is found to have converged in the current node, we
> + * already know that the check "this_diff > 0" will not succeed, so
> + * the autonuma balancing completes without having to check any of the
> + * CPUs of the other NUMA nodes.
> + */
> +void __sched_autonuma_balance(void)
> +{
> +	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
> +	int this_nid = numa_node_id();
> +	int this_cpu = smp_processor_id();
> +	unsigned long task_fault, task_tot, mm_fault, mm_tot;
> +	unsigned long task_max, mm_max;
> +	unsigned long weight_diff_max;
> +	long uninitialized_var(s_w_nid);
> +	long uninitialized_var(s_w_this_nid);
> +	long uninitialized_var(s_w_other);
> +	bool uninitialized_var(s_w_type_thread);
> +	struct cpumask *allowed;
> +	struct task_struct *p = current, *other_task;

So the task in question is current but this is called by the idle
balancer. I'm missing something obvious here but it's not clear to me why
that process is necessarily relevant. What guarantee is there that all
tasks will eventually run this code? Maybe it doesn't matter because the
most CPU intensive tasks are also the most likely to end up in here but
a clarification would be nice.

> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +	struct mm_autonuma *mm_autonuma;
> +	struct rq *rq;
> +
> +	/* per-cpu statically allocated in runqueues */
> +	long *task_numa_weight;
> +	long *mm_numa_weight;
> +
> +	if (!task_autonuma || !p->mm)
> +		return;
> +
> +	if (!autonuma_enabled()) {
> +		if (task_autonuma->task_selected_nid != -1)
> +			task_autonuma->task_selected_nid = -1;
> +		return;
> +	}
> +
> +	allowed = tsk_cpus_allowed(p);
> +	mm_autonuma = p->mm->mm_autonuma;
> +
> +	/*
> +	 * If the task has no NUMA hinting page faults or if the mm
> +	 * hasn't been fully scanned by knuma_scand yet, set task
> +	 * selected nid to the current nid, to avoid the task bounce
> +	 * around randomly.
> +	 */
> +	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);

Why ACCESS_ONCE?

> +	if (!mm_tot) {
> +		if (task_autonuma->task_selected_nid != this_nid)
> +			task_autonuma->task_selected_nid = this_nid;
> +		return;
> +	}
> +	task_tot = task_autonuma->task_numa_fault_tot;
> +	if (!task_tot) {
> +		if (task_autonuma->task_selected_nid != this_nid)
> +			task_autonuma->task_selected_nid = this_nid;
> +		return;
> +	}
> +
> +	rq = cpu_rq(this_cpu);
> +
> +	/*
> +	 * Verify that we can migrate the current task, otherwise try
> +	 * again later.
> +	 */
> +	if (ACCESS_ONCE(rq->autonuma_balance))
> +		return;
> +
> +	/*
> +	 * The following two arrays will hold the NUMA affinity weight
> +	 * information for the current process if scheduled on the
> +	 * given NUMA node.
> +	 *
> +	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
> +	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
> +	 */
> +	task_numa_weight = rq->task_numa_weight;
> +	mm_numa_weight = rq->mm_numa_weight;
> +
> +	/*
> +	 * Identify the NUMA node where this thread (task_struct), and
> +	 * the process (mm_struct) as a whole, has the largest number
> +	 * of NUMA faults.
> +	 */
> +	task_max = mm_max = 0;
> +	selected_nid = mm_selected_nid = -1;
> +	for_each_online_node(nid) {
> +		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
> +		task_fault = task_autonuma->task_numa_fault[nid];
> +		if (mm_fault > mm_tot)
> +			/* could be removed with a seqlock */
> +			mm_tot = mm_fault;
> +		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
> +		if (task_fault > task_tot) {
> +			task_tot = task_fault;
> +			WARN_ON(1);
> +		}
> +		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
> +		if (mm_numa_weight[nid] > mm_max) {
> +			mm_max = mm_numa_weight[nid];
> +			mm_selected_nid = nid;
> +		}
> +		if (task_numa_weight[nid] > task_max) {
> +			task_max = task_numa_weight[nid];
> +			selected_nid = nid;
> +		}
> +	}

Ok, so this is a big walk to take every time and as this happens every
scheduler tick, it seems unlikely that the workload would be changing
phases that often in terms of NUMA behaviour. Would it be possible for
this to be sampled less frequently and cache the result?
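
For reference, a worked example of the normalization done in that loop,
with two nodes and AUTONUMA_BALANCE_SCALE assumed to be 1000 purely for
illustration (invented numbers):

	/*
	 * mm_numa_fault[0] = 600, mm_numa_fault[1] = 200, mm_tot = 800
	 *
	 * mm_numa_weight[0] = 600 * 1000 / 800 = 750
	 * mm_numa_weight[1] = 200 * 1000 / 800 = 250
	 *
	 * so mm_selected_nid ends up 0.  The task_numa_fault[] counters
	 * are normalized the same way to fill task_numa_weight[] and to
	 * pick selected_nid.
	 */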

> +	/*
> +	 * If this NUMA node is the selected one, based on process
> +	 * memory and task NUMA faults, set task_selected_nid and
> +	 * we're done.
> +	 */
> +	if (selected_nid == this_nid && mm_selected_nid == this_nid) {
> +		if (task_autonuma->task_selected_nid != selected_nid)
> +			task_autonuma->task_selected_nid = selected_nid;
> +		return;
> +	}
> +
> +	selected_cpu = this_cpu;
> +	selected_nid = this_nid;
> +	weight_diff_max = 0;
> +	other_task = NULL;
> +
> +	/* check that the following raw_spin_lock_irq is safe */
> +	BUG_ON(irqs_disabled());
> +
> +	/*
> +	 * Check the other NUMA nodes to see if there is a task we
> +	 * should exchange places with.
> +	 */
> +	for_each_online_node(nid) {
> +		/* No need to check our current node. */
> +		if (nid == this_nid)
> +			continue;
> +		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
> +			struct mm_autonuma *mma = NULL /* bugcheck */;
> +			struct task_autonuma *ta = NULL /* bugcheck */;
> +			unsigned long fault, tot;
> +			long this_diff, other_diff;
> +			long w_nid, w_this_nid, w_other;
> +			bool w_type_thread;
> +			struct mm_struct *mm;
> +			struct task_struct *_other_task;
> +
> +			rq = cpu_rq(cpu);
> +			if (!cpu_online(cpu))
> +				continue;
> +
> +			/* CFS takes care of idle balancing. */
> +			if (idle_cpu(cpu))
> +				continue;
> +
> +			mm = rq->curr->mm;
> +			if (!mm)
> +				continue;
> +
> +			/*
> +			 * Check if the _other_task is pending for
> +			 * migrate. Do it locklessly: it's an
> +			 * optimistic racy check anyway.
> +			 */
> +			if (ACCESS_ONCE(rq->autonuma_balance))
> +				continue;
> +
> +			/*
> +			 * Grab the fault/tot of the processes running
> +			 * in the other CPUs to compute w_other.
> +			 */
> +			raw_spin_lock_irq(&rq->lock);
> +			_other_task = rq->curr;
> +			/* recheck after implicit barrier() */
> +			mm = _other_task->mm;
> +			if (!mm) {
> +				raw_spin_unlock_irq(&rq->lock);
> +				continue;
> +			}
> +

Is it really critical to pin those values using the lock? That seems *really*
heavy. If the results have to be exactly stable then is there any chance
the values could be encoded in the high and low bits of a single unsigned
long and read without the lock?  Updates would be more expensive but that's
in a trap anyway. This on the other hand is a scheduler path.
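
As a sketch of that packing idea only (mm_numa_packed[] and the macros
below are invented names, and it assumes a 64-bit unsigned long):

	#define NUMA_STAT_PACK(fault, tot) \
		(((unsigned long)(fault) << 32) | (unsigned long)(u32)(tot))
	#define NUMA_STAT_FAULT(v)	((unsigned long)((v) >> 32))
	#define NUMA_STAT_TOT(v)	((unsigned long)(u32)(v))

	/* writer, still serialized by the page table lock */
	ACCESS_ONCE(mma->mm_numa_packed[nid]) = NUMA_STAT_PACK(fault, tot);

	/* lockless reader in the balance path */
	unsigned long packed = ACCESS_ONCE(mma->mm_numa_packed[nid]);
	fault = NUMA_STAT_FAULT(packed);
	tot = NUMA_STAT_TOT(packed);

The obvious cost is that each counter would be limited to 32 bits.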


> +			if (mm == p->mm) {
> +				/*
> +				 * This task is another thread in the
> +				 * same process. Use the task statistics.
> +				 */
> +				w_type_thread = true;
> +				ta = _other_task->task_autonuma;
> +				tot = ta->task_numa_fault_tot;
> +			} else {
> +				/*
> +				 * This task is part of another process.
> +				 * Use the mm statistics.
> +				 */
> +				w_type_thread = false;
> +				mma = mm->mm_autonuma;
> +				tot = ACCESS_ONCE(mma->mm_numa_fault_tot);
> +			}
> +
> +			if (!tot) {
> +				/* Need NUMA faults to evaluate NUMA placement. */
> +				raw_spin_unlock_irq(&rq->lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * Check if the _other_task is allowed to be
> +			 * migrated to this_cpu.
> +			 */
> +			if (!cpumask_test_cpu(this_cpu,
> +					      tsk_cpus_allowed(_other_task))) {
> +				raw_spin_unlock_irq(&rq->lock);
> +				continue;
> +			}
> +

Would it not make sense to check this *before* we take the lock and
grab all its counters? It probably will not make much of a difference in
practice as I expect it's rare that the target CPU is running a task
that can't migrate but it still feels the wrong way around.
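
A sketch of that reordering (illustrative only; rq->curr can change at
any time without the lock, so the test still has to be repeated once the
lock is held):

	/* cheap lockless pre-filter: a race here only skips one candidate */
	if (!cpumask_test_cpu(this_cpu,
			      tsk_cpus_allowed(ACCESS_ONCE(rq->curr))))
		continue;

	raw_spin_lock_irq(&rq->lock);
	_other_task = rq->curr;
	/* recheck now that rq->curr is stable */
	if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(_other_task))) {
		raw_spin_unlock_irq(&rq->lock);
		continue;
	}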


> +			if (w_type_thread)
> +				fault = ta->task_numa_fault[nid];
> +			else
> +				fault = ACCESS_ONCE(mma->mm_numa_fault[nid]);
> +
> +			raw_spin_unlock_irq(&rq->lock);
> +
> +			if (fault > tot)
> +				tot = fault;
> +			w_other = fault*AUTONUMA_BALANCE_SCALE/tot;
> +
> +			/*
> +			 * We pre-computed the weights for the current
> +			 * task in the task/mm_numa_weight arrays.
> +			 * Those computations were mm/task local, and
> +			 * didn't require accessing other CPUs'
> +			 * runqueues.
> +			 */
> +			if (w_type_thread) {
> +				w_nid = task_numa_weight[nid];
> +				w_this_nid = task_numa_weight[this_nid];
> +			} else {
> +				w_nid = mm_numa_weight[nid];
> +				w_this_nid = mm_numa_weight[this_nid];
> +			}
> +
> +			/*
> +			 * other_diff: How much does the current task
> +			 * prefer to run on the remote NUMA node (nid)
> +			 * compared to the other task on the remote
> +			 * node (nid).
> +			 */
> +			other_diff = w_nid - w_other;
> +
> +			/*
> +			 * this_diff: How much does the current task
> +			 * prefer to run on the remote NUMA node (nid)
> +			 * rather than the current NUMA node
> +			 * (this_nid).
> +			 */
> +			this_diff = w_nid - w_this_nid;
> +
> +			/*
> +			 * Would swapping NUMA location with this task
> +			 * reduce the total number of cross-node NUMA
> +			 * faults in the system?
> +			 */
> +			if (other_diff > 0 && this_diff > 0) {
> +				unsigned long weight_diff;
> +
> +				weight_diff = other_diff + this_diff;
> +
> +				/* Remember the best candidate. */
> +				if (weight_diff > weight_diff_max) {
> +					weight_diff_max = weight_diff;
> +					selected_cpu = cpu;
> +					selected_nid = nid;
> +
> +					s_w_other = w_other;
> +					s_w_nid = w_nid;
> +					s_w_this_nid = w_this_nid;
> +					s_w_type_thread = w_type_thread;
> +					other_task = _other_task;
> +				}
> +			}
> +		}
> +	}
> +
> +	if (task_autonuma->task_selected_nid != selected_nid)
> +		task_autonuma->task_selected_nid = selected_nid;
> +	if (selected_cpu != this_cpu) {
> +		if (autonuma_debug()) {
> +			char *w_type_str;
> +			w_type_str = s_w_type_thread ? "thread" : "process";
> +			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
> +			       p->mm, p->pid, this_nid, selected_nid,
> +			       this_cpu, selected_cpu,
> +			       s_w_other, s_w_nid, s_w_this_nid,
> +			       w_type_str);
> +		}

Can these be made tracepoints and get rid of the autonuma_debug() check?
I recognise there is a risk that some tool might grow to depend on
implementation details but in this case it seems very unlikely.
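
For what it's worth, a rough sketch of what such a tracepoint could look
like (all names below are invented, nothing here is from the series):

	TRACE_EVENT(autonuma_task_exchange,

		TP_PROTO(struct task_struct *p, int src_nid, int dst_nid,
			 int src_cpu, int dst_cpu),

		TP_ARGS(p, src_nid, dst_nid, src_cpu, dst_cpu),

		TP_STRUCT__entry(
			__field(pid_t, pid)
			__field(int, src_nid)
			__field(int, dst_nid)
			__field(int, src_cpu)
			__field(int, dst_cpu)
		),

		TP_fast_assign(
			__entry->pid	 = p->pid;
			__entry->src_nid = src_nid;
			__entry->dst_nid = dst_nid;
			__entry->src_cpu = src_cpu;
			__entry->dst_cpu = dst_cpu;
		),

		TP_printk("pid=%d nid %d->%d cpu %d->%d",
			  __entry->pid, __entry->src_nid, __entry->dst_nid,
			  __entry->src_cpu, __entry->dst_cpu)
	);

with the printk above replaced by a trace_autonuma_task_exchange(p,
this_nid, selected_nid, this_cpu, selected_cpu) call.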

> +		BUG_ON(this_nid == selected_nid);
> +		goto found;
> +	}
> +
> +	return;
> +
> +found:
> +	rq = cpu_rq(this_cpu);
> +
> +	/*
> +	 * autonuma_balance synchronizes accesses to
> +	 * autonuma_balance_work. After set, it's cleared by the
> +	 * callback once the migration work is finished.
> +	 */
> +	raw_spin_lock_irq(&rq->lock);
> +	if (rq->autonuma_balance) {
> +		raw_spin_unlock_irq(&rq->lock);
> +		return;
> +	}
> +	rq->autonuma_balance = true;
> +	raw_spin_unlock_irq(&rq->lock);
> +
> +	rq->autonuma_balance_dst_cpu = selected_cpu;
> +	rq->autonuma_balance_task = p;
> +	get_task_struct(p);
> +
> +	/* Do the actual migration. */
> +	stop_one_cpu_nowait(this_cpu,
> +			    autonuma_balance_cpu_stop, rq,
> +			    &rq->autonuma_balance_work);
> +
> +	BUG_ON(!other_task);
> +	rq = cpu_rq(selected_cpu);
> +
> +	/*
> +	 * autonuma_balance synchronizes accesses to
> +	 * autonuma_balance_work. After set, it's cleared by the
> +	 * callback once the migration work is finished.
> +	 */
> +	raw_spin_lock_irq(&rq->lock);
> +	/*
> +	 * The chance of other_task having quit in the meanwhile
> +	 * and another task having reused its previous task struct is
> +	 * tiny. Even if it happens the kernel will be stable.
> +	 */
> +	if (rq->autonuma_balance || rq->curr != other_task) {
> +		raw_spin_unlock_irq(&rq->lock);
> +		return;
> +	}
> +	rq->autonuma_balance = true;
> +	/* take the pin on the task struct before dropping the lock */
> +	get_task_struct(other_task);
> +	raw_spin_unlock_irq(&rq->lock);
> +
> +	rq->autonuma_balance_dst_cpu = this_cpu;
> +	rq->autonuma_balance_task = other_task;
> +
> +	/* Do the actual migration. */
> +	stop_one_cpu_nowait(selected_cpu,
> +			    autonuma_balance_cpu_stop, rq,
> +			    &rq->autonuma_balance_work);
> +#ifdef __ia64__
> +#error "NOTE: tlb_migrate_finish won't run here, review before deleting"
> +#endif
> +}
> +

Ok, so I confess I did not work out if the weights and calculations really
make sense or not but at a glance they seem reasonable and I spotted no
obvious flaws. The function is pretty heavy though and may be doing more
work around locking than is really necessary. That said, there will be
workloads where the cost is justified and offset by the performance gains
from improved NUMA locality. I just don't expect it to be a universal win so
we'll need to keep an eye on the system CPU usage and incrementally optimise
where possible. I suspect there will be a time when an incremental
optimisation just does not cut it any more but by then I would also
expect there will be more data on how autonuma behaves in practice and a
new algorithm might be more obvious at that point.

> +/*
> + * The function sched_autonuma_can_migrate_task is called by CFS
> + * can_migrate_task() to prioritize on the task's
> + * task_selected_nid. It is called during load_balancing, idle
> + * balancing and in general before any task CPU migration event
> + * happens.
> + *

That's a lot of events but other than a cpu_to_node() lookup this part
does not seem too heavy.

> + * The caller first scans the CFS migration candidate tasks passing a
> + * not zero numa parameter, to skip tasks without AutoNUMA affinity
> + * (according to the tasks's task_selected_nid). If no task can be
> + * migrated in the first scan, a second scan is run with a zero numa
> + * parameter.
> + *
> + * If the numa parameter is not zero, this function allows the task
> + * migration only if the dst_cpu of the migration is in the node
> + * selected by AutoNUMA or if it's an idle load balancing event.
> + *
> + * If load_balance_strict is enabled, AutoNUMA will only allow
> + * migration of tasks for idle balancing purposes (the idle balancing
> + * of CFS is never altered by AutoNUMA). In the not strict mode the
> + * load balancing is not altered and the AutoNUMA affinity is
> + * disregarded in favor of higher fairness. The load_balance_strict
> + * knob is runtime tunable in sysfs.
> + *
> + * If load_balance_strict is enabled, it tends to partition the
> + * system. In turn it may reduce the scheduler fairness across NUMA
> + * nodes, but it should deliver higher global performance.
> + */
> +bool sched_autonuma_can_migrate_task(struct task_struct *p,
> +				     int strict_numa, int dst_cpu,
> +				     enum cpu_idle_type idle)
> +{
> +	if (task_autonuma_cpu(p, dst_cpu))
> +		return true;
> +
> +	/* NUMA affinity is set - and to a different NUMA node */
> +
> +	/*
> +	 * If strict_numa is not zero, it means our caller is in the
> +	 * first pass so be strict and only allow migration of tasks
> +	 * that passed the NUMA affinity test. If our caller finds
> +	 * none in the first pass, it'll normally retry a second pass
> +	 * with a zero "strict_numa" parameter.
> +	 */
> +	if (strict_numa)
> +		return false;
> +
> +	/*
> +	 * The idle load balancing always has higher priority than the
> +	 * NUMA affinity.
> +	 */
> +	if (idle == CPU_NEWLY_IDLE || idle == CPU_IDLE)
> +		return true;
> +
> +	if (autonuma_sched_load_balance_strict())
> +		return false;
> +	else
> +		return true;
> +}
> +
> +/*
> + * sched_autonuma_dump_mm is a purely debugging function called at
> + * regular intervals when /sys/kernel/mm/autonuma/debug is
> + * enabled. This prints in the kernel logs how the threads and
> + * processes are distributed in all NUMA nodes to easily check if the
> + * threads of the same processes are converging in the same
> + * nodes. This won't take into account kernel threads and because it
> + * runs itself from a kernel thread it won't show what was running in
> + * the current CPU, but it's simple and good enough to get what we
> + * need in the debug logs. This function can be disabled or deleted
> + * later.
> + */
> +void sched_autonuma_dump_mm(void)
> +{
> +	int nid, cpu;
> +	cpumask_var_t x;
> +
> +	if (!alloc_cpumask_var(&x, GFP_KERNEL))
> +		return;
> +	cpumask_setall(x);
> +	for_each_online_node(nid) {
> +		for_each_cpu(cpu, cpumask_of_node(nid)) {
> +			struct rq *rq = cpu_rq(cpu);
> +			struct mm_struct *mm = rq->curr->mm;
> +			int nr = 0, cpux;
> +			if (!cpumask_test_cpu(cpu, x))
> +				continue;
> +			for_each_cpu(cpux, cpumask_of_node(nid)) {
> +				struct rq *rqx = cpu_rq(cpux);
> +				if (rqx->curr->mm == mm) {
> +					nr++;
> +					cpumask_clear_cpu(cpux, x);
> +				}
> +			}
> +			printk("nid %d process %p nr_threads %d\n", nid, mm, nr);
> +		}
> +	}
> +	free_cpumask_var(x);
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0848fa3..9ce8151 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -467,6 +467,25 @@ struct rq {
>  #ifdef CONFIG_SMP
>  	struct llist_head wake_list;
>  #endif
> +#ifdef CONFIG_AUTONUMA
> +	/* stop_one_cpu_nowait() data used by autonuma_balance_cpu_stop() */
> +	bool autonuma_balance;
> +	int autonuma_balance_dst_cpu;
> +	struct task_struct *autonuma_balance_task;
> +	struct cpu_stop_work autonuma_balance_work;
> +	/*
> +	 * Per-cpu arrays used to compute the per-thread and
> +	 * per-process NUMA affinity weights (per nid) for the current
> +	 * process. Allocated statically to avoid overflowing the
> +	 * stack with large MAX_NUMNODES values.
> +	 *
> +	 * FIXME: allocate with dynamic num_possible_nodes() array
> +	 * sizes and only if autonuma is possible, to save some dozen
> +	 * KB of RAM when booting on non NUMA (or small NUMA) systems.
> +	 */
> +	long task_numa_weight[MAX_NUMNODES];
> +	long mm_numa_weight[MAX_NUMNODES];
> +#endif
>  };
>  
>  static inline int cpu_of(struct rq *rq)
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 12:28   ` Mel Gorman
@ 2012-10-11 15:24     ` Rik van Riel
  2012-10-11 15:57       ` Mel Gorman
  2012-10-12  0:23       ` Christoph Lameter
  2012-10-11 17:15       ` Andrea Arcangeli
  1 sibling, 2 replies; 148+ messages in thread
From: Rik van Riel @ 2012-10-11 15:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/11/2012 08:28 AM, Mel Gorman wrote:

>> +	/* link for knuma_scand's list of mm structures to scan */
>> +	struct list_head mm_node;
>> +	/* Pointer to associated mm structure */
>> +	struct mm_struct *mm;
>> +
>> +	/*
>> +	 * Zeroed from here during allocation, check
>> +	 * mm_autonuma_reset() if you alter the below.
>> +	 */
>> +
>> +	/*
>> +	 * Pass counter for this mm. This exist only to be able to
>> +	 * tell when it's time to apply the exponential backoff on the
>> +	 * task_autonuma statistics.
>> +	 */
>> +	unsigned long mm_numa_fault_pass;
>> +	/* Total number of pages that will trigger NUMA faults for this mm */
>> +	unsigned long mm_numa_fault_tot;
>> +	/* Number of pages that will trigger NUMA faults for each [nid] */
>> +	unsigned long mm_numa_fault[0];
>> +	/* do not add more variables here, the above array size is dynamic */
>> +};
>
> How cache hot is this structure? nodes are sharing counters in the same
> cache lines so if updates are frequent this will bounce like a mad yoke.
> Profiles will tell for sure but it's possible that some sort of per-cpu
> hilarity will be necessary here in the future.

These statistics are updated at page fault time, I
believe while holding the page table lock.

In other words, they are in code paths where updating
the stats should not cause issues.


>> +/*
>> + * Per-task (thread) structure that contains the NUMA memory placement
>> + * statistics generated by the knuma scan daemon. This structure is
>> + * dynamically allocated only if AutoNUMA is possible on this
>> + * system. They are linked togehter in a list headed within the
>> + * knumad_scan structure.
>> + */
>> +struct task_autonuma {

>> +	unsigned long task_numa_fault[0];
>> +	/* do not add more variables here, the above array size is dynamic */
>> +};
>> +
>
> Same question about cache hotness.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 14:56     ` Andrea Arcangeli
@ 2012-10-11 15:35       ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 15:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > Benchmark on a 4-node machine and the following fell out.
> > 
> >                                      3.6.0                 3.6.0
> >                                    vanilla        autonuma-v33r6
> > User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> > User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> > System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> > System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> > Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> > Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> > CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> > CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)
> 
> Thanks a lot for the help and for looking into it!
> 
> Just curious, why are you running only numa02_SMT and
> numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> without _suffix)
> 

Bug in the testing script on my end. Each of them is run separately and it
looks like in retrospect that a THREAD_ALLOC test actually ran numa01 then
numa01_THREAD_ALLOC. The intention was to allow additional stats to be
gathered independently of what start_bench.sh collects. Will improve it
in the future.

> > 
> > The performance improvements are certainly there for this basic test but
> > I note the System CPU usage is very high.
> 
> Yes, migrate is expensive, but after convergence has been reached the
> system time should be the same as upstream.
> 

Ok.

> btw, I improved things further in autonuma28 (new branch in aa.git).
> 

Ok.

> > 
> > The vmstats showed up this
> > 
> > THP fault alloc               81376       86070
> > THP collapse alloc               14       40423
> > THP splits                        8       41792
> > 
> > So we're doing a lot of splits and collapses for THP there. There is a
> > possibility that khugepaged and the autonuma kernel thread are doing some
> > busy work. Not a show-stopper, just interesting.
> > 
> > I've done no analysis at all and this was just to have something to look
> > at before looking at the code closer.
> 
> Sure, the idea is to have THP native migration, then we'll do zero
> collapse/splits.
> 

Seems reasonable. It should be obvious to measure when/if that happens.

> > > The objective of AutoNUMA is to provide out-of-the-box performance as
> > > close as possible to (and potentially faster than) manual NUMA hard
> > > bindings.
> > > 
> > > It is not very intrusive into the kernel core and is well structured
> > > into separate source modules.
> > > 
> > > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > > NUMA placement algorithms such as numad (in userland through cpusets)
> > > and schednuma (in kernel too) and was found superior in all cases.
> > > 
> > > Most important: not a single benchmark showed a regression yet when
> > > compared to vanilla kernels. Not even on the 2 node systems where the
> > > NUMA effects are less significant.
> > > 
> > 
> > Ok, I have not run a general regression test and won't get the chance to
> > soon but hopefully others will. One thing they might want to watch out
> > for is System CPU time. It's possible that your AutoNUMA benchmark
> > triggers a worst-case but it's worth keeping an eye on because any cost
> > from that has to be offset by gains from better NUMA placements.
> 
> Good idea to monitor it indeed.
> 

If System CPU time really does go down as this converges then that
should be obvious from monitoring vmstat over time for a test. Early on
- high usage with that dropping as it converges. If that doesn't happen
  then the tasks are not converging, the phases change constantly or
something unexpected happened that needs to be identified.

> > Is STREAM really a good benchmark in this case? Unless you also ran it in
> > parallel mode, it basically operates against three arrays and is not really
> > NUMA friendly once the total size is greater than a NUMA node. I guess
> > it makes sense to run it just to see does autonuma break it :)
> 
> The way this is run is that there is 1 stream, then 4 stream, then 8
> until we max out all CPUs.
> 

Ok. Are they separate STREAM instances or threads running on the same
arrays? 

> I think we could run "memhog" instead of "stream" and it'd be the
> same. stream probably better resembles real life computations.
> 
> The upstream scheduler lacks any notion of affinity so eventually
> during the 5 min run, on process changes node, it doesn't notice its
> memory was elsewhere so it stays there, and the memory can't follow
> the cpu either. So then it runs much slower.
> 
> So it's the simplest test of all to get right, all it requires is some
> notion of node affinity.
> 

Ok.

> It's also the only workload that the home node design in schednuma in
> tip.git can get right (schednuma post current tip.git introduced
> cpu-follow-memory design of AutoNUMA so schednuma will have a chance
> to get right more stuff than just the stream multi instance
> benchmark).
> 
> So it's just a verification that the simple stuff (single threaded
> process computing) is ok and the upstream regression vs hard NUMA
> bindings is fixed.
> 

Verification of the simple stuff makes sense.

> stream is also one case where we have to perform identical to the hard
> NUMA bindings. No migration of CPU or memory must ever happen with
> AutoNUMA in the stream benchmark. AutoNUMA will just monitor it and
> find that it is already in the best place and it will leave it alone.
> 
> With the autonuma-benchmark it's impossible to reach identical
> performance of the _HARD_BIND case because _HARD_BIND doesn't need to
> do any memory migration (I'm 3 seconds away from hard bindings in a
> 198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).
> 
> > 
> > > 
> > > == iozone ==
> > > 
> > >                      ALL  INIT   RE             RE   RANDOM RANDOM BACKWD  RECRE STRIDE  F      FRE     F      FRE
> > > FILE     TYPE (KB)  IOS  WRITE  WRITE   READ   READ   READ  WRITE   READ  WRITE   READ  WRITE  WRITE   READ   READ
> > > ====--------------------------------------------------------------------------------------------------------------
> > > noautonuma ALL      2492   1224   1874   2699   3669   3724   2327   2638   4091   3525   1142   1692   2668   3696
> > > autonuma   ALL      2531   1221   1886   2732   3757   3760   2380   2650   4192   3599   1150   1731   2712   3825
> > > 
> > > AutoNUMA can't help much for I/O loads but you can see it seems a
> > > small improvement there too. The important thing for I/O loads, is to
> > > verify that there is no regression.
> > > 
> > 
> > It probably is unreasonable to expect autonuma to handle the case where
> > a file-based workload has not been tuned for NUMA. In too many cases
> > it's going to be read/write based so you're not going to get the
> > statistics you need.
> 
> Agreed. Some statistic may still accumulate and it's still better than
> nothing but unless the workload is CPU and memory bound, we can't
> expect to see any difference.
> 
> This is meant as a verification that we're not introducing regression
> to I/O bound load.
> 

Ok, that's more or less what I had guessed but nice to know for sure.
Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data
  2012-10-03 23:50 ` [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data Andrea Arcangeli
@ 2012-10-11 15:43   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 15:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:54AM +0200, Andrea Arcangeli wrote:
> This defines the per node data used by Migrate On Fault in order to
> rate limit the migration. The rate limiting is applied independently
> to each destination node.
> 

How is it determined what the rate limit should be?

I assume it's something to do with memory bandwidth between nodes. Is
that automatically determined or something else?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 14/33] autonuma: call autonuma_setup_new_exec()
  2012-10-03 23:50 ` [PATCH 14/33] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
@ 2012-10-11 15:47   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 15:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:56AM +0200, Andrea Arcangeli wrote:
> This resets all per-thread and per-process statistics across exec
> syscalls or after kernel threads detach from the mm. The past
> statistical NUMA information is unlikely to be relevant for the future
> in these cases.
> 

Unlikely is an understatement.

> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/exec.c        |    7 +++++++
>  mm/mmu_context.c |    3 +++
>  2 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 574cf4d..1d55077 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
>  #include <linux/pipe_fs_i.h>
>  #include <linux/oom.h>
>  #include <linux/compat.h>
> +#include <linux/autonuma.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1172,6 +1173,12 @@ void setup_new_exec(struct linux_binprm * bprm)
>  			
>  	flush_signal_handlers(current, 0);
>  	flush_old_files(current->files);
> +
> +	/*
> +	 * Reset autonuma counters, as past NUMA information
> +	 * is unlikely to be relevant for the future.
> +	 */
> +	autonuma_setup_new_exec(current);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
>  
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 3dcfaf4..e6fff1c 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -7,6 +7,7 @@
>  #include <linux/mmu_context.h>
>  #include <linux/export.h>
>  #include <linux/sched.h>
> +#include <linux/autonuma.h>
>  
>  #include <asm/mmu_context.h>
>  
> @@ -52,6 +53,8 @@ void unuse_mm(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk = current;
>  
> +	autonuma_setup_new_exec(tsk);
> +

Why are the stats discarded in unuse_mm? That does not seem necessary at
all. Why would AIO being completed cause the stats to reset?

>  	task_lock(tsk);
>  	sync_mm_rss(mm);
>  	tsk->mm = NULL;
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 15/33] autonuma: alloc/free/init task_autonuma
  2012-10-03 23:50 ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
@ 2012-10-11 15:53   ` Mel Gorman
  2012-10-11 17:34     ` Rik van Riel
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 15:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:57AM +0200, Andrea Arcangeli wrote:
> This is where the dynamically allocated task_autonuma structure is
> being handled.
> 
> This is the structure holding the per-thread NUMA statistics generated
> by the NUMA hinting page faults. This per-thread NUMA statistical
> information is needed by sched_autonuma_balance to make optimal NUMA
> balancing decisions.
> 
> It also contains the task_selected_nid which hints the stock CPU
> scheduler on the best NUMA node to schedule this thread on (as decided
> by sched_autonuma_balance).
> 
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So the non NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).
> 
> If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
> is allocated on the kernel stack of course.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

There is a possibility that someone will complain about the extra
kmalloc() during fork that is now necessary for the autonuma structure.
Microbenchmarks will howl but who cares -- autonuma only makes sense for
long-lived processes anyway. It may be necessary in the future to defer
this allocation until the process has consumed a few CPU seconds and is
likely to hang around for a while. Overkill for now though, so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 15:24     ` Rik van Riel
@ 2012-10-11 15:57       ` Mel Gorman
  2012-10-12  0:23       ` Christoph Lameter
  1 sibling, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 15:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 11, 2012 at 11:24:34AM -0400, Rik van Riel wrote:
> On 10/11/2012 08:28 AM, Mel Gorman wrote:
> 
> >>+	/* link for knuma_scand's list of mm structures to scan */
> >>+	struct list_head mm_node;
> >>+	/* Pointer to associated mm structure */
> >>+	struct mm_struct *mm;
> >>+
> >>+	/*
> >>+	 * Zeroed from here during allocation, check
> >>+	 * mm_autonuma_reset() if you alter the below.
> >>+	 */
> >>+
> >>+	/*
> >>+	 * Pass counter for this mm. This exist only to be able to
> >>+	 * tell when it's time to apply the exponential backoff on the
> >>+	 * task_autonuma statistics.
> >>+	 */
> >>+	unsigned long mm_numa_fault_pass;
> >>+	/* Total number of pages that will trigger NUMA faults for this mm */
> >>+	unsigned long mm_numa_fault_tot;
> >>+	/* Number of pages that will trigger NUMA faults for each [nid] */
> >>+	unsigned long mm_numa_fault[0];
> >>+	/* do not add more variables here, the above array size is dynamic */
> >>+};
> >
> >How cache hot is this structure? nodes are sharing counters in the same
> >cache lines so if updates are frequent this will bounce like a mad yoke.
> >Profiles will tell for sure but it's possible that some sort of per-cpu
> >hilarity will be necessary here in the future.
> 
> These statistics are updated at page fault time, I
> believe while holding the page table lock.
> 
> In other words, they are in code paths where updating
> the stats should not cause issues.
> 

Ordinarily I would agree but in this case the updates are taking place for
NUMA hinting faults so there is a new source of page faults and the
page table lock is going to be hotter than it was in the past.  It may not
be a problem as it'll be related to how short a knuma_scan cycle is but it's
something else to keep an eye on. Profiles will tell for sure at the end of
the day and it can be incrementally improved.
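
Purely as an illustration of the kind of per-cpu split alluded to above
(names invented, not code from the series):

	/* hypothetical per-cpu backing for one of the fault counters */
	struct mm_numa_counter {
		unsigned long __percpu *count;	/* from alloc_percpu(unsigned long) */
	};

	static inline void mm_numa_counter_inc(struct mm_numa_counter *c)
	{
		this_cpu_inc(*c->count);	/* no shared cacheline to bounce */
	}

	static unsigned long mm_numa_counter_read(struct mm_numa_counter *c)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += *per_cpu_ptr(c->count, cpu);
		return sum;	/* slower, but only the balance path reads */
	}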

> >>+/*
> >>+ * Per-task (thread) structure that contains the NUMA memory placement
> >>+ * statistics generated by the knuma scan daemon. This structure is
> >>+ * dynamically allocated only if AutoNUMA is possible on this
> >>+ * system. They are linked together in a list headed within the
> >>+ * knumad_scan structure.
> >>+ */
> >>+struct task_autonuma {
> 
> >>+	unsigned long task_numa_fault[0];
> >>+	/* do not add more variables here, the above array size is dynamic */
> >>+};
> >>+
> >
> >Same question about cache hotness.
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
  2012-10-11 10:50   ` Mel Gorman
@ 2012-10-11 16:07       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi,

On Thu, Oct 11, 2012 at 11:50:36AM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> > +The AutoNUMA logic is a chain reaction resulting from the actions of
> > +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically
> 
> s/knum_scand/knuma_scand/

Applied.

> > +scans the mm structures of all active processes. It gathers the
> > +AutoNUMA mm statistics for each "anon" page in the process's working
> 
> Ok, so this will not make a different to file-based workloads but as I
> mentioned in the leader this would be a difficult proposition anyway
> because if it's read/write based, you'll have no statistics.

Oops sorry for the confusion but the doc is wrong on this one: it
actually tracks anything with a page_mapcount == 1, even if that is
pagecache or even .text as long as it's only mapped in a single
process. So if you've a threaded database doing a gigantic MAP_SHARED,
it'll track and move around the whole MAP_SHARED as well as anonymous
memory or anything that can be moved.

Changed to:

+AutoNUMA mm statistics for each not shared page in the process's

> > +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> > +present bit in each pte or pmd that was counted. This triggers NUMA
> > +hinting page faults described next.
> > +
> > +The mm statistics are exponentially decayed by dividing the total memory
> > +in half and adding the new totals to the decayed values for each
> > +knuma_scand pass. This causes the mm statistics to resemble a simple
> > +forecasting model, taking into account some past working set data.
> > +
> > +=== NUMA hinting fault ===
> > +
> > +A NUMA hinting fault occurs when a task running on a CPU thread
> > +accesses a vma whose pte or pmd is not present and the NUMA bit is
> > +set. The NUMA hinting page fault handler returns the pte or pmd back
> > +to its present state and counts the fault's occurrence in the
> > +task_autonuma structure.
> > +
> 
> So, minimally one source of System CPU overhead will be increased traps.

Correct.

It takes down 128M every 100msec, and once it has finished taking
down everything it sleeps 10sec, then increases the pass_counter and
restarts. It's not measurable: even if I do a kernel build with -j128
in tmpfs, the performance is identical with autonuma running or not.

> I haven't seen the code yet obviously but I wonder if this gets accounted
> for as a minor fault? If it does, how can we distinguish between minor
> faults and numa hinting faults? If not, is it possible to get any idea of
> how many numa hinting faults were incurred? Mention it here.

Yes, it's surely accounted as a minor fault. To monitor it normally I
use:

perf probe numa_hinting_fault
perf record -e probe:numa_hinting_fault -aR -g sleep 10
perf report -g

# Samples: 345  of event 'probe:numa_hinting_fault'
# Event count (approx.): 345
#
# Overhead  Command      Shared Object                  Symbol
# ........  .......  .................  ......................
#
    64.64%     perf  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--57.40%-- sig_handler
                  |          |          
                  |          |--62.50%-- run_builtin
                  |          |          main
                  |          |          __libc_start_main
                  |          |          
                  |           --37.50%-- 0x7f47f7c6cba0
                  |                     run_builtin
                  |                     main
                  |                     __libc_start_main
                  |          
                  |--16.59%-- __poll
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                  |--9.87%-- 0x7f47f7c6cba0
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                  |--9.42%-- save_i387_xstate
                  |          do_signal
                  |          do_notify_resume
                  |          int_signal
                  |          __poll
                  |          run_builtin
                  |          main
                  |          __libc_start_main
                  |          
                   --6.73%-- sys_poll
                             system_call_fastpath
                             __poll

    21.45%     ntpd  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--66.22%-- 0x42b910
                  |          0x0
                  |          
                  |--24.32%-- __select
                  |          0x0
                  |          
                  |--4.05%-- do_signal
                  |          do_notify_resume
                  |          int_signal
                  |          __select
                  |          0x0
                  |          
                  |--2.70%-- 0x7f88827b3ba0
                  |          0x0
                  |          
                   --2.70%-- clock_gettime
                             0x1a1eb808

     7.83%     init  [kernel.kallsyms]  [k] numa_hinting_fault
               |
               --- numa_hinting_fault
                   handle_mm_fault
                   do_page_fault
                   page_fault
                  |          
                  |--33.33%-- __select
                  |          0x0
                  |          
                  |--29.63%-- 0x404e0c
                  |          0x0
                  |          
                  |--18.52%-- 0x405820
                  |          
                  |--11.11%-- sys_select
                  |          system_call_fastpath
                  |          __select
                  |          0x0
                  |          
                   --7.41%-- 0x402528

     6.09%    sleep  [kernel.kallsyms]  [k] numa_hinting_fault
              |
              --- numa_hinting_fault
                  handle_mm_fault
                  do_page_fault
                  page_fault
                 |          
                 |--42.86%-- 0x7f0f67847fe0
                 |          0x7fff4cd6d42b
                 |          
                 |--28.57%-- 0x404007
                 |          
                 |--19.05%-- nanosleep
                 |          
                  --9.52%-- 0x4016d0
                            0x7fff4cd6d42b


Chances are we want to add more vmstat counters for this event.

> > +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> > +
> > +- Increments the total number of pages faulted for this task
> > +
> > +- Increments the number of pages faulted on the current NUMA node
> > +
> 
> So, am I correct in assuming that the rate of NUMA hinting faults will be
> related to the scan rate of knuma_scand?

This is correct. They're identical.

There's a slight chance that two threads hit the fault on the same
pte/pmd_numa concurrently, but just one of the two will actually
invoke the numa_hinting_fault() function.
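
As a concrete illustration, the per-fault accounting described in the
quoted documentation boils down to something like the sketch below. The
field names follow the task_autonuma layout quoted later in this
thread; the helper name and the way the pass counter is passed in are
placeholders of mine, not necessarily the exact code in the series:

	/* sketch: account one NUMA hinting fault against a task */
	static void task_numa_account_fault(struct task_autonuma *ta,
					    unsigned long current_pass,
					    int nid, long nr_pages)
	{
		int n;

		/* new knuma_scand pass: decay the old statistics by half */
		if (ta->task_numa_fault_pass != current_pass) {
			ta->task_numa_fault_pass = current_pass;
			ta->task_numa_fault_tot >>= 1;
			for_each_node(n)
				ta->task_numa_fault[n] >>= 1;
		}

		/* nr_pages is 1 for a pte fault, HPAGE_PMD_NR for a hugepage */
		ta->task_numa_fault_tot += nr_pages;
		ta->task_numa_fault[nid] += nr_pages;
	}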

> > +- If the fault was for an hugepage, the number of subpages represented
> > +  by an hugepage is added to the task statistics above
> > +
> > +- Each time the NUMA hinting page fault discoveres that another
> 
> s/discoveres/discovers/

Fixed.

> 
> > +  knuma_scand pass has occurred, it divides the total number of pages
> > +  and the pages for each NUMA node in half. This causes the task
> > +  statistics to be exponentially decayed, just as the mm statistics
> > +  are. Thus, the task statistics also resemble a simple forcasting

Also noticed forecasting ;).

> > +  model, taking into account some past NUMA hinting fault data.
> > +
> > +If the page being accessed is on the current NUMA node (same as the
> > +task), the NUMA hinting fault handler only records the nid of the
> > +current NUMA node in the page_autonuma structure field last_nid and
> > +then it'd done.
> > +
> > +Othewise, it checks if the nid of the current NUMA node matches the
> > +last_nid in the page_autonuma structure. If it matches it means it's
> > +the second NUMA hinting fault for the page occurring (on a subsequent
> > +pass of the knuma_scand daemon) from the current NUMA node.
> 
> You don't spell it out, but this is effectively a migration threshold N
> where N is the number of remote NUMA hinting faults that must be
> incurred before migration happens. The default value of this threshold
> is 2.
> 
> Is that accurate? If so, why 2?

More like 1. It needs one confirmation that the migrate request comes
from the same node again (note: the confirmation is allowed to come
from a different thread, as long as it's the same node, and that is
very important).

Why only 1 confirmation? It's the same as page aging. We could record
the number of pagecache lookup hits instead of keeping just a single
bit as a reference count, but then, if the workload radically changes,
it takes too much time to adapt to the new configuration, so I usually
don't like counting.

Plus I avoided fixed magic numbers as much as possible. I can explain
why 0 or 1, but I can't as easily explain why 5 or 8, so if I can't
explain it, I avoid it.
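
In pseudo-C, the last_nid check described in the quoted documentation
is roughly the following (the page_autonuma dereferencing and the
migration helper are placeholders of mine, not the series' exact
functions):

	int nid = numa_node_id();	/* node of the faulting CPU */

	if (nid == page_to_nid(page)) {
		/* local access: just remember which node touched it last */
		page_autonuma->last_nid = nid;
	} else if (page_autonuma->last_nid == nid) {
		/* second remote fault from the same node: migrate the page */
		numa_migrate_page(page, nid);	/* hypothetical helper */
	} else {
		/* first remote fault from this node: record it and wait */
		page_autonuma->last_nid = nid;
	}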

> I don't have a better suggestion, it's just an obvious source of an
> adverse workload that could force a lot of migrations by faulting once
> per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.

Correct; for certain workloads like single-instance specjbb that
wasn't enough, but it is fixed in autonuma28, which is now faster even
in the single-instance case.

> I'm assuming it must be async migration then. IO in progress would be
> a bit of a surprise though! It would have to be a mapped anonymous page
> being written to swap.

It's all migrate-on-fault now, but I'm using all the methods you
implemented to keep compaction from blocking in migrate_pages.

> > +=== Task exchange ===
> > +
> > +The following defines "weight" in the AutoNUMA balance routine's
> > +algorithm.
> > +
> > +If the tasks are threads of the same process:
> > +
> > +    weight = task weight for the NUMA node (since memory weights are
> > +             the same)
> > +
> > +If the tasks are not threads of the same process:
> > +
> > +    weight = memory weight for the NUMA node (prefer to move the task
> > +             to the memory)
> > +
> > +The following algorithm determines if the current task will be
> > +exchanged with a running task on a remote NUMA node:
> > +
> > +    this_diff: Weight of the current task on the remote NUMA node
> > +               minus its weight on the current NUMA node (only used if
> > +               a positive value). How much does the current task
> > +               prefer to run on the remote NUMA node.
> > +
> > +    other_diff: Weight of the current task on the remote NUMA node
> > +                minus the weight of the other task on the same remote
> > +                NUMA node (only used if a positive value). How much
> > +                does the current task prefer to run on the remote NUMA
> > +                node compared to the other task.
> > +
> > +    total_weight_diff = this_diff + other_diff
> > +
> > +    total_weight_diff: How favorable it is to exchange the two tasks.
> > +                       The pair of tasks with the highest
> > +                       total_weight_diff (if any) are selected for
> > +                       exchange.
> > +
> > +As mentioned above, if the two tasks are threads of the same process,
> > +the AutoNUMA balance routine uses the task_autonuma statistics. By
> > +using the task_autonuma statistics, each thread follows its own memory
> > +locality and they will not necessarily converge on the same node. This
> > +is often very desirable for processes with more threads than CPUs on
> > +each NUMA node.
> > +
> 
> What about the case where two threads on different CPUs are accessing

I assume you mean on different nodes (with different CPUs in the same
node, the above won't kick in).

> separate structures that are not page-aligned (base or huge page but huge
> page would be obviously worse). Does this cause a ping-pong effect or
> otherwise mess up the statistics?

Very good point! This is exactly what I call NUMA false sharing and
it's the biggest nightmare in this whole effort.

So if there's a huge amount of this over time, the statistics will be
around 50/50 (the statistics just record the working set of the
thread).

So if there's another process (note: a process, not a thread) heavily
computing, the 50/50 won't be used and the mm statistics will be used
instead to balance the two threads against the other process. The two
threads will then converge on the same node, and their thread
statistics will change from 50/50 to 0/100, matching the mm statistics.

If there are just threads and they're all doing what you describe
above with all their memory, then the problem has no real solution,
but the new stuff in autonuma28 will deal with that case too.

Ideally we should do MADV_INTERLEAVE; I didn't get that far yet, but I
probably could now.

Even without the new stuff it wasn't too bad, but there were a bit too
many spurious migrations in that load with autonuma27 and earlier. It
was less spurious on bigger systems with many nodes because last_nid is
implicitly more accurate there (last_nid has more possible values than
just 0|1). With autonuma28 it's perfectly fine even on 2 nodes.

If it's just 1 page false sharing and all the rest is thread-local,
the statistics will be 99/1 and the false sharing will be lost in the
noise.

The false-sharing spillover caused by alignment is minor if the
threads are really computing on a lot of local memory, so it's not a
concern, and it will be optimized away by last_nid plus the new stuff.

> Ok, very obviously this will never be an RT feature but that is hardly
> a surprise and anyone who tries to enable this for RT needs their head
> examined. I'm not suggesting you do it but people running detailed
> performance analysis on scheduler-intensive workloads might want to keep
> an eye on their latency and jitter figures and how they are affected by
> this exchanging. Does ftrace show a noticable increase in wakeup latencies
> for example?

If you do:

echo 1 >/sys/kernel/mm/autonuma/debug

you will get 1 printk every single time sched_autonuma_balance
triggers a task exchange.

With autonuma28 I resolved a lot of the jittering, and now there are
only 6 or 7 printks for the whole 198 seconds of numa01. CFS runs on
autopilot all the time.

With specjbb at x2 overcommit, the active balancing events are reduced
to one every few seconds (vs several per second with autonuma27). In
fact the specjbb x2 overcommit load jumped ahead too with autonuma28.

About tracing events, the git branch already has tracing events to
monitor all page and task migrations, shown in an awesome "perf script
numatop" from Andrew. We likely need one more tracing event to see the
task exchanges generated specifically by the autonuma balancing code
(we're running short of event columns to show it in numatop though ;).
Right now that is only available as the printk above.

> > +=== task_autonuma - per task AutoNUMA data ===
> > +
> > +The task_autonuma structure is used to hold AutoNUMA data required for
> > +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> > +nodes.
> > +
> > +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> > +                scheduler balancing code, -1 if none (2 bytes)
> > +
> > +- Task NUMA statistics for this thread/process:
> > +
> > +    Total number of NUMA hinting page faults in this pass of
> > +    knuma_scand (8 bytes)
> > +
> > +    Per NUMA node number of NUMA hinting page faults in this pass of
> > +    knuma_scand (8 bytes * # of NUMA nodes)
> > +
> 
> It might be possible to put a coarse ping-pong detection counter in here
> as well by recording a declaying average of number of pages migrated
> over a number of knuma_scand passes instead of just the last one.  If the
> value is too high, you're ping-ponging and the process should be ignored,
> possibly forever. It's not a requirement and it would be more memory
> overhead obviously but I'm throwing it out there as a suggestion if it
> ever turns out the ping-pong problem is real.

Yes, this is a problem where we have an enormous degree of freedom in
trying things, so your suggestions are very much appreciated :).

About CPU ping-ponging, I've never seen it yet (even if the stats are
550/450, they rarely switch over to 450/550, and even if they do, it
doesn't really change anything because it's a fairly rare event and
one node is not more right than the other anyway).

Thanks a lot for the help!
Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
  2012-10-11 11:01   ` Mel Gorman
@ 2012-10-11 16:43       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 12:01:37PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> > The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> > faults to identify the per NUMA node working set of the thread at
> > runtime.
> > 
> > Arming the NUMA hinting page fault mechanism works similarly to
> > setting up a mprotect(PROT_NONE) virtual range: the present bit is
> > cleared at the same time that _PAGE_NUMA is set, so when the fault
> > triggers we can identify it as a NUMA hinting page fault.
> > 
> 
> That implies that there is an atomic update requirement or at least
> an ordering requirement -- present bit must be cleared before setting
> NUMA bit. No doubt it'll be clear later in the series how this is
> accomplished. What you propose seems ok but it all depends how it's
> implemented so I'm leaving my ack off this particular patch for now.

Correct. The switch is done atomically (_PAGE_PRESENT is cleared at
the same time _PAGE_NUMA is set). The TLB flush is deferred (it's
batched to avoid firing an IPI for every pte/pmd_numa we establish).

It's still similar to setting a range PROT_NONE (except that
_PAGE_PROTNONE and _PAGE_NUMA work in the opposite way, and they are
mutually exclusive, so they can easily share the same pte/pmd
bitflag), and except that PROT_NONE must be set synchronously while
_PAGE_NUMA is set lazily.

The NUMA hinting page fault also won't require any TLB flush ever.

So the whole process (establish/teardown) has an incredibly low TLB
flushing cost.

The only fixed cost is in knuma_scand and the kernel enter/exit for
every not-shared page every 10 sec (or whatever you set the
knuma_scand pass duration to in sysfs).

Furthermore, if the pmd_scan mode is activated, I guarantee there's at
most 1 NUMA hinting page fault per 2MB virtual region (even if some
accuracy is lost). You can try setting scan_pmd = 0 in sysfs and also
disabling THP (echo never >enabled) to measure the exact cost per 4k
page. It's hardly measurable here. With THP the fault is also 1 per
2MB virtual region, but no accuracy is lost in that case (or more
precisely, there's no way to get more accuracy than that, since we
deal with a pmd).
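
For illustration, the pte transition performed by knuma_scand is
roughly the following sketch (pte_mknuma() is the helper introduced
later in the series; the page table locking and the batched-flush
bookkeeping are omitted, so this is not the literal patch code):

	pte_t pteval = *pte;

	if (pte_present(pteval) && !pte_numa(pteval)) {
		/* set _PAGE_NUMA and clear _PAGE_PRESENT in one go */
		set_pte_at(mm, addr, pte, pte_mknuma(pteval));
		/*
		 * No IPI here: the TLB flush for the whole range is
		 * deferred and batched at the end of the scan.
		 */
	}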

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
  2012-10-11 11:15   ` Mel Gorman
@ 2012-10-11 16:58       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 16:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 12:15:45PM +0100, Mel Gorman wrote:
> huh?
> 
> #define _PAGE_NUMA     _PAGE_PROTNONE
> 
> so this is effective _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PROTNONE
> 
> I suspect you are doing this because there is no requirement for
> _PAGE_NUMA == _PAGE_PROTNONE for other architectures and it was best to
> describe your intent. Is that really the case or did I miss something
> stupid?

Exactly.

It's a reminder that we need to return true in pte_present when the
NUMA hinting page fault is armed.

Hardwiring _PAGE_NUMA to _PAGE_PROTNONE is conceptually not necessary
and is actually an artificial restriction. Other archs without a
bitflag for _PAGE_PROTNONE may want to use something else, and they'll
have to deal with pte_present too, somehow. So this is a reminder for
them as well.

> >  static inline int pte_hidden(pte_t pte)
> > @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
> >  	 * the _PAGE_PSE flag will remain set at all times while the
> >  	 * _PAGE_PRESENT bit is clear).
> >  	 */
> > -	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> > +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> > +				 _PAGE_NUMA);
> > +}
> > +
> > +#ifdef CONFIG_AUTONUMA
> > +/*
> > + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> > + * same bit too). It's set only when _PAGE_PRESET is not set and it's
> 
> same bit on x86, not necessarily anywhere else.

Yep. In fact, before using the _PAGE_PRESENT trick the two bits were
different even on x86, but I unified them. If I ever make them differ
again they will become _PAGE_PTE_NUMA/_PAGE_PMD_NUMA, and the above
will then fail to build, so there's no risk of silent errors.

> 
> _PAGE_PRESENT?

good eye ;) corrected.

> > +/*
> > + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> > + * because they're called by the NUMA hinting minor page fault.
> 
> automatically or atomically?
> 
> I assume you meant atomically but what stops two threads faulting at the
> same time and doing to the same update? mmap_sem will be insufficient in
> that case so what is guaranteeing the atomicity. PTL?

I meant automatically. I explained myself badly, and automatically may
be the wrong word anyway. It is of course also atomic, but that wasn't
the point here.

So the thing is: the numa hinting page fault hooking point is this:

	if (pte_numa(entry))
		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);

It won't get this far:

	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {

So if I don't set _PAGE_ACCESSED in pte/pmd_mknuma, the TLB miss
handler will have to set _PAGE_ACCESSED itself with an additional
write to the pte/pmd later, when userland touches the page. And that
would slow us down for no good reason.

Because mknuma is only called in the numa hinting page fault context,
it's optimal to set _PAGE_ACCESSED too, not only _PAGE_PRESENT (and to
clear _PAGE_NUMA, of course).

The basic idea is that the numa hinting page fault can only trigger
when userland touches the page, and after such an access
_PAGE_ACCESSED would be set by the hardware whether or not there is a
NUMA hinting page fault (so we can optimize away that hardware action
when the NUMA hinting page fault triggers).

I tried to reword it:

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cf1d3f0..3dc6a9b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -449,12 +449,12 @@ static inline int pmd_numa(pmd_t pmd)
 #endif
 
 /*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag too because they're
+ * only called by the NUMA hinting minor page fault. If we wouldn't
+ * set the _PAGE_ACCESSED bitflag here, the TLB miss handler would be
+ * forced to set it later while filling the TLB after we return to
+ * userland. That would trigger a second write to memory that we
+ * optimize away by setting _PAGE_ACCESSED here.
  */
 static inline pte_t pte_mknonnuma(pte_t pte)
 {
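
For completeness, given the comment above, the x86 body of
pte_mknonnuma() would look roughly like the sketch below (using the
existing pte_set_flags()/pte_clear_flags() helpers; the actual patch
may differ in detail):

	static inline pte_t pte_mknonnuma(pte_t pte)
	{
		/* drop the NUMA marker and make the pte present again... */
		pte = pte_clear_flags(pte, _PAGE_NUMA);
		/* ...setting _PAGE_ACCESSED now saves the later hardware write */
		return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
	}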


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
  2012-10-11 12:22   ` Mel Gorman
@ 2012-10-11 17:05       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 01:22:55PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:48AM +0200, Andrea Arcangeli wrote:
> > In the special "pmd" mode of knuma_scand
> > (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> > type (_PAGE_PRESENT not set), however the pte might be
> > present. Therefore, gup_pmd_range() must return 0 in this case to
> > avoid losing a NUMA hinting page fault during gup_fast.
> > 
> 
> So if gup_fast fails, presumably we fall back to taking the mmap_sem and
> calling get_user_pages(). This is a heavier operation and I wonder if the
> cost is justified. i.e. Is the performance loss from using get_user_pages()
> offset by improved NUMA placement? I ask because we always incur the cost of
> taking mmap_sem but only sometimes get it back from improved NUMA placement.
> How bad would it be if gup_fast lost some of the NUMA hinting information?

Good question indeed. Now, I agree it wouldn't be bad to skip NUMA
hinting page faults in gup_fast for no-virt usage like
O_DIRECT/ptrace, but the only problem is that we'd lose AutoNUMA on
the memory touched by the KVM vcpus.

I've also been asked if the vhost-net kernel thread (the KVM in-kernel
virtio backend) will be controlled by autonuma in between
use_mm/unuse_mm, and the answer is yes, but to do that it also needs
this. (See also the flush of task_autonuma_nid and the mm/task
statistics in unuse_mm to reset it back to regular kernel-thread
status, uncontrolled by autonuma.)

$ git grep get_user_pages
tcm_vhost.c:            ret = get_user_pages_fast((unsigned long)ptr, 1, write, &page);
vhost.c:        r = get_user_pages_fast(log, 1, 1, &page);
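
For reference, the gup_fast change being described amounts to an early
bail-out along these lines (a sketch of the check from the changelog,
not the full gup_pmd_range() loop):

	pmd_t pmd = *pmdp;

	/*
	 * Bail out so the slow path (get_user_pages) takes the NUMA
	 * hinting fault instead of silently bypassing it.
	 */
	if (pmd_none(pmd) || pmd_numa(pmd))
		return 0;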

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 12:28   ` Mel Gorman
@ 2012-10-11 17:15       ` Andrea Arcangeli
  2012-10-11 17:15       ` Andrea Arcangeli
  1 sibling, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 01:28:27PM +0100, Mel Gorman wrote:
> s/togehter/together/

Fixed.

> 
> > + * knumad_scan structure.
> > + */
> > +struct mm_autonuma {
> 
> Nit but this is very similar in principle to mm_slot for transparent
> huge pages. It might be worth renaming both to mm_thp_slot and
> mm_autonuma_slot to set the expectation they are very similar in nature.
> Could potentially be made generic but probably overkill.

Agreed. A plain rename to mm_autonuma_slot would have the only
downside of making some code spill over 80 columns ;).

> > +	/* link for knuma_scand's list of mm structures to scan */
> > +	struct list_head mm_node;
> > +	/* Pointer to associated mm structure */
> > +	struct mm_struct *mm;
> > +
> > +	/*
> > +	 * Zeroed from here during allocation, check
> > +	 * mm_autonuma_reset() if you alter the below.
> > +	 */
> > +
> > +	/*
> > +	 * Pass counter for this mm. This exist only to be able to
> > +	 * tell when it's time to apply the exponential backoff on the
> > +	 * task_autonuma statistics.
> > +	 */
> > +	unsigned long mm_numa_fault_pass;
> > +	/* Total number of pages that will trigger NUMA faults for this mm */
> > +	unsigned long mm_numa_fault_tot;
> > +	/* Number of pages that will trigger NUMA faults for each [nid] */
> > +	unsigned long mm_numa_fault[0];
> > +	/* do not add more variables here, the above array size is dynamic */
> > +};
> 
> How cache hot is this structure? nodes are sharing counters in the same
> cache lines so if updates are frequent this will bounce like a mad yoke.
> Profiles will tell for sure but it's possible that some sort of per-cpu
> hilarity will be necessary here in the future.

On autonuma27 this is only written by knuma_scand, so it isn't at risk
of bouncing.

On autonuma28, however, it's updated locklessly by the numa hinting
page fault, so your concern is very real and the cacheline bounces
will materialize. It'll also cause more interconnect traffic before
the workload converges. I thought about that, but I wanted mm_autonuma
updated in real time as migration happens; otherwise it converges more
slowly, because we have to wait until the next pass to bring the
mm_autonuma statistical data in sync with the migration activity.
Converging more slowly looked worse than paying for more cacheline
bounces.

It's a tradeoff. If it turns out not to be a good one, we can go back
to the autonuma27 mm_autonuma stat-gathering method and converge more
slowly, but without any cacheline bouncing in the NUMA hinting page
faults. At least it's lockless.
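
(In other words, the lockless update is presumably just a plain, racy
increment of the shared counters quoted above, something like

	/* tolerate an occasional lost increment, no lock taken */
	mm_autonuma->mm_numa_fault_tot++;
	mm_autonuma->mm_numa_fault[page_nid]++;

so the cost is purely cacheline traffic, never lock contention. The
exact autonuma28 code may differ.)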

> > +	unsigned long task_numa_fault_pass;
> > +	/* Total number of eligible pages that triggered NUMA faults */
> > +	unsigned long task_numa_fault_tot;
> > +	/* Number of pages that triggered NUMA faults for each [nid] */
> > +	unsigned long task_numa_fault[0];
> > +	/* do not add more variables here, the above array size is dynamic */
> > +};
> > +
> 
> Same question about cache hotness.

Here it's per-thread, so there's no risk of accesses being interleaved
by different CPUs.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 15/33] autonuma: alloc/free/init task_autonuma
  2012-10-11 15:53   ` Mel Gorman
@ 2012-10-11 17:34     ` Rik van Riel
       [not found]       ` <20121011175953.GT1818@redhat.com>
  0 siblings, 1 reply; 148+ messages in thread
From: Rik van Riel @ 2012-10-11 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/11/2012 11:53 AM, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:57AM +0200, Andrea Arcangeli wrote:
>> This is where the dynamically allocated task_autonuma structure is
>> being handled.
>>
>> This is the structure holding the per-thread NUMA statistics generated
>> by the NUMA hinting page faults. This per-thread NUMA statistical
>> information is needed by sched_autonuma_balance to make optimal NUMA
>> balancing decisions.
>>
>> It also contains the task_selected_nid which hints the stock CPU
>> scheduler on the best NUMA node to schedule this thread on (as decided
>> by sched_autonuma_balance).
>>
>> The reason for keeping this outside of the task_struct besides not
>> using too much kernel stack, is to only allocate it on NUMA
>> hardware. So the non NUMA hardware only pays the memory of a pointer
>> in the kernel stack (which remains NULL at all times in that case).
>>
>> If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
>> is allocated on the kernel stack of course.
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
> There is a possibility that someone will complain about the extra
> kmalloc() during fork that is now necessary for the autonuma structure.
> Microbenchmarks will howl but who cares -- autonuma only makes sense for
> long-lived processes anyway. It may be necessary in the future to defer
> this allocation until the process has consumed a few CPU seconds and
> likely to hang around for a while. Overkill for now though so

That is indeed a future optimization I have suggested
in the past. Allocation of this struct could be deferred
until the first time knuma_scand unmaps pages from the
process to generate NUMA page faults.
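
Roughly something like the untested sketch below, where the structure
is allocated the first time the NUMA hinting fault path actually needs
it instead of unconditionally at fork() time (error handling in the
caller left out):

/*
 * Sketch only: lazy allocation of task_autonuma. The fault path
 * simply skips the statistics until this succeeds.
 */
static int task_autonuma_lazy_alloc(struct task_struct *p)
{
	struct task_autonuma *ta;

	if (likely(p->task_autonuma))
		return 0;

	ta = kmem_cache_zalloc(task_autonuma_cachep, GFP_KERNEL);
	if (!ta)
		return -ENOMEM;

	ta->task_selected_nid = -1;
	p->task_autonuma = ta;
	return 0;
}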

> Acked-by: Mel Gorman <mgorman@suse.de>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 08/33] autonuma: define the autonuma flags
  2012-10-11 13:46   ` Mel Gorman
@ 2012-10-11 17:34       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-11 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 02:46:43PM +0100, Mel Gorman wrote:
> Should this be a SCHED_FEATURE flag?

I guess it could. It is only used by kernel/sched/numa.c, which isn't
even built unless CONFIG_AUTONUMA is set. So it would require a
CONFIG_AUTONUMA ifdef in the sched feature flags unless we want to
expose non-operational bits. I'm not sure what the preferred way is.
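
For reference it would boil down to something like this in
kernel/sched/features.h (sketch, the feature name is made up), with
kernel/sched/numa.c testing it through sched_feat():

#ifdef CONFIG_AUTONUMA
/* sketch only: the strict load balance knob as a sched feature */
SCHED_FEAT(AUTONUMA_LOAD_BALANCE_STRICT, false)
#endif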

> Have you ever identified a case where it's a good idea to set that flag?

It's currently set by default, but no, I haven't done enough
experiments to tell whether it's worth copying or resetting the data.

> A child that closely shared data with its parent is not likely to also
> want to migrate to separate nodes. It just seems unnecessary to have and

Agreed, this is why the task_selected_nid is always inherited by
default (that is the CFS autopilot driver).

The question is whether the full statistics should also be inherited
across fork/clone or not. I don't know the answer yet and that's why
that knob exists.

If we retain them, autonuma_balance may decide to move the task
before the child has built up a full set of statistics of its own.

The current way is to reset the data and wait for it to build up in
the child, while we keep CFS on autopilot with task_selected_nid
(which is always inherited). I thought the current one was a good
tradeoff, but copying all the data isn't a horrible idea either.

> impossible to suggest to an administrator how the flag might be used.

Agreed. This in fact is a debug-only flag; it won't ever show up to the admin.

#ifdef CONFIG_DEBUG_VM
SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
SYSFS_ENTRY(migrate_allow_first_fault,
	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
#endif /* CONFIG_DEBUG_VM */

> 
> > +	/*
> > +	 * If set, this tells knuma_scand to trigger NUMA hinting page
> > +	 * faults at the pmd level instead of the pte level. This
> > +	 * reduces the number of NUMA hinting faults potentially
> > +	 * saving CPU time. It reduces the accuracy of the
> > +	 * task_autonuma statistics (but does not change the accuracy
> > +	 * of the mm_autonuma statistics). This flag can be toggled
> > +	 * through sysfs as runtime.
> > +	 *
> > +	 * This flag does not affect AutoNUMA with transparent
> > +	 * hugepages (THP). With THP the NUMA hinting page faults
> > +	 * always happen at the pmd level, regardless of the setting
> > +	 * of this flag. Note: there is no reduction in accuracy of
> > +	 * task_autonuma statistics with THP.
> > +	 *
> > +	 * Default set.
> > +	 */
> > +	AUTONUMA_SCAN_PMD_FLAG,
> 
> This flag and the other flags make sense. Early on we just are not going
> to know what the correct choice is. My gut says that ultimately we'll

Agreed. This is why I left these knobs in, even if I've been asked to
drop them a few times (they were perceived as adding complexity). But
for things we're not sure about, these really help to benchmark quickly
one way or the other.

scan_pmd is actually not under DEBUG_VM, as it looked like a more fundamental thing.

> default to PMD level *but* fall back to PTE level on a per-task basis if
> ping-pong migrations are detected. This will catch ping-pongs on data
> that is not PMD aligned although obviously data that is not page aligned
> will also suffer. Eventually I think this flag will go away but the
> behaviour will be;
> 
> default, AUTONUMA_SCAN_PMD
> if ping-pong, fallback to AUTONUMA_SCAN_PTE
> if ping-pong again, AUTONUMA_SCAN_NONE

That would be ideal, good idea indeed.

> so there is a graceful degradation if autonuma is doing the wrong thing.

Makes perfect sense to me if we figure out how to reliably detect when
to make the switch.
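
Just to make it concrete, the degradation itself could be as simple as
the sketch below (the scan_mode/nr_pingpong fields and the threshold
are made up); the hard part really is deciding what counts as a
ping-pong:

/*
 * Sketch only: per-task scan granularity that degrades when the
 * task's pages keep bouncing between nodes.
 */
enum autonuma_scan_mode {
	AUTONUMA_SCAN_PMD,	/* default: cheap, pmd granularity */
	AUTONUMA_SCAN_PTE,	/* fallback: precise, pte granularity */
	AUTONUMA_SCAN_NONE,	/* give up on this task */
};

#define AUTONUMA_PINGPONG_THRESH	16

/* called when one of this task's pages migrates back to a node it
 * was migrated away from during the same knuma_scand pass */
static void autonuma_account_pingpong(struct task_autonuma *ta)
{
	if (++ta->nr_pingpong < AUTONUMA_PINGPONG_THRESH)
		return;

	ta->nr_pingpong = 0;
	if (ta->scan_mode != AUTONUMA_SCAN_NONE)
		ta->scan_mode++;	/* PMD -> PTE -> NONE */
}

/* at the start of each knuma_scand pass let a task that has settled
 * down climb back towards the cheaper PMD scanning */
static void autonuma_pingpong_new_pass(struct task_autonuma *ta)
{
	ta->nr_pingpong = 0;
	if (ta->scan_mode != AUTONUMA_SCAN_PMD)
		ta->scan_mode--;
}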

thanks!
Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
  2012-10-10 22:01   ` Rik van Riel
@ 2012-10-11 18:28   ` Mel Gorman
  2012-10-13 18:06   ` Srikar Dronamraju
  2 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:01AM +0200, Andrea Arcangeli wrote:
> This implements the following parts of autonuma:
> 
> o knuma_scand: daemon for setting pte_numa and pmd_numa while
>   gathering NUMA mm stats
> 
> o NUMA hinting page fault handler: Migrate on Fault and gathers NUMA
>   task stats
> 
> o Migrate On Fault: in the context of the NUMA hinting page faults we
>   migrate memory from remote nodes to the local node
> 
> o The rest of autonuma core logic: false sharing detection, sysfs and
>   initialization routines
> 
> The AutoNUMA algorithm when knuma_scand is not running is fully
> bypassed and it will not alter the runtime of memory management or the
> scheduler.
> 
> The whole AutoNUMA logic is a chain reaction as a result of the
> actions of the knuma_scand. Various parts of the code can be described
> like different gears (gears as in glxgears).
> 
> knuma_scand is the first gear and it collects the mm_autonuma
> per-process statistics and at the same time it sets the ptes and pmds
> it scans respectively as pte_numa and pmd_numa.
> 
> The second gear are the numa hinting page faults. These are triggered
> by the pte_numa/pmd_numa pmd/ptes. They collect the task_autonuma
> per-thread statistics. They also implement the memory follow CPU logic
> where we track if pages are repeatedly accessed by remote nodes. The
> memory follow CPU logic can decide to migrate pages across different
> NUMA nodes using Migrate On Fault.
> 
> The third gear is Migrate On Fault. Pages pending for migration are
> migrated in the context of the NUMA hinting page faults. Each
> destination node has a migration rate limit configurable with sysfs.
> 

Ok, all that is understandable.

> The fourth gear is the NUMA scheduler balancing code. That computes
> the statistical information collected in mm->mm_autonuma and
> p->task_autonuma and evaluates the status of all CPUs to decide if
> tasks should be migrated to CPUs in remote nodes.
> 

I imagine this is where all the real complexity lies.

> The only "input" information of the AutoNUMA algorithm that isn't
> collected through NUMA hinting page faults are the per-process
> mm->mm_autonuma statistics. Those mm_autonuma statistics are collected
> by the knuma_scand pmd/pte scans that are also responsible for setting
> pte_numa/pmd_numa to activate the NUMA hinting page faults.
> 
> knuma_scand -> NUMA hinting page faults
>   |                       |
>  \|/                     \|/
> mm_autonuma  <->  task_autonuma (CPU follow memory, this is mm_autonuma too)
>                   page last_nid  (false thread sharing/thread shared memory detection )
>                   queue or cancel page migration (memory follow CPU)
> 
> The code includes some fixes from Hillf Danton <dhillf@gmail.com>.
> 
> Math documentation on autonuma_last_nid in the header of
> last_nid_set() reworked from sched-numa code by Peter Zijlstra
> <a.p.zijlstra@chello.nl>.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Hillf Danton <dhillf@gmail.com>
> ---
>  mm/autonuma.c    | 1365 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c |   34 ++
>  2 files changed, 1399 insertions(+), 0 deletions(-)
>  create mode 100644 mm/autonuma.c
> 
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> new file mode 100644
> index 0000000..1b2530c
> --- /dev/null
> +++ b/mm/autonuma.c
> @@ -0,0 +1,1365 @@
> +/*
> + *  Copyright (C) 2012  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + *
> + *  Boot with "numa=fake=2" to test on non NUMA systems.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/rmap.h>
> +#include <linux/kthread.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/freezer.h>
> +#include <linux/mm_inline.h>
> +#include <linux/migrate.h>
> +#include <linux/swap.h>
> +#include <linux/autonuma.h>
> +#include <asm/tlbflush.h>
> +#include <asm/pgtable.h>
> +
> +unsigned long autonuma_flags __read_mostly =
> +	(1<<AUTONUMA_POSSIBLE_FLAG)
> +#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> +	|(1<<AUTONUMA_ENABLED_FLAG)
> +#endif
> +	|(1<<AUTONUMA_SCAN_PMD_FLAG);
> +
> +static DEFINE_MUTEX(knumad_mm_mutex);
> +

/* Protects the list of mm's being considered by knuma_scan */

> +/* knuma_scand */
> +static unsigned int scan_sleep_millisecs __read_mostly = 100;
> +static unsigned int scan_sleep_pass_millisecs __read_mostly = 10000;
> +static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
> +static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
> +static unsigned long full_scans;
> +static unsigned long pages_scanned;
> +
> +/* page migration rate limiting control */
> +static unsigned int migrate_sleep_millisecs __read_mostly = 100;
> +static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
> +static volatile unsigned long pages_migrated;
> +

Ok, so it's rate limited to a small value. Explain the values better

/*
 * page migration rate limiting control
 * AutoNUMA will not migrate more than pages_to_migrate pages per
 * migrate_sleep_millisecs 
 */

If that's right, the names could be a bit more helpful

/*
 * page migration rate limiting control
 * During a given time window of migrate_ratelimit_window_ms
 * milliseconds, no more than migrate_ratelimit_nr_pages will
 * be migrated. The default values will not migrate more than
 * 1.2G/sec
 */
migrate_ratelimit_window_ms = 100
migrate_ratelimit_nr_pages  = 128 << 20 >> PAGE_SHIFT

or something?
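
i.e. the check in autonuma_migrate_page() would then read something
like this (same logic as your patch, just renamed, and ignoring the
debug printk bookkeeping):

	if (time_after(jiffies, pgdat->autonuma_migrate_last_jiffies +
		       msecs_to_jiffies(migrate_ratelimit_window_ms))) {
		/* new window, reset the per-node budget */
		pgdat->autonuma_migrate_nr_pages = 0;
		pgdat->autonuma_migrate_last_jiffies = jiffies;
	}
	if (pgdat->autonuma_migrate_nr_pages >= migrate_ratelimit_nr_pages) {
		/* budget for this window already spent */
		autonuma_migrate_unlock(dst_nid);
		goto out;
	}
	pgdat->autonuma_migrate_nr_pages += nr_pages;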


> +static struct knuma_scand_data {
> +	struct list_head mm_head; /* entry: mm->mm_autonuma->mm_node */
> +	struct mm_struct *mm;
> +	unsigned long address;
> +	unsigned long *mm_numa_fault_tmp;
> +} knuma_scand_data = {
> +	.mm_head = LIST_HEAD_INIT(knuma_scand_data.mm_head),
> +};
> +
> +/* caller already holds the compound_lock */
> +void autonuma_migrate_split_huge_page(struct page *page,
> +				      struct page *page_tail)
> +{
> +	int last_nid;
> +
> +	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	if (last_nid >= 0)
> +		page_tail->autonuma_last_nid = last_nid;
> +}
> +
> +static int sync_isolate_migratepages(struct list_head *migratepages,
> +				     struct page *page,
> +				     struct pglist_data *pgdat,
> +				     bool *migrated)
> +{

So, why did this thing not do a

struct compact_control cc = {
        .nr_freepages = 0,
        .nr_migratepages = 0,
        .zone = zone,
        .sync = false,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);

start_pfn = page_to_pfn(page);
end_pfn = start_pfn + 1;

if (PageTransHuge(page)) {
	end_pfn = start_pfn + HPAGE_PMD_NR;
	VM_BUG_ON(!PageAnon(page));
	if (unlikely(split_huge_page(page))) {
		autonuma_printk("autonuma migrate THP free\n");
		goto out;
	}
}
isolate_migratepages_range(page_zone(page), &cc, start_pfn, end_pfn);


I know that this looks really clumsy and you will need to shortcut the
migrate_async_suitable() check in isolate_migratepages_range(), but it
looks like something that could be a simple helper function in
compaction.c that creates the compact_control and returns the
migratepages to you.

That way you would also get the lock contention detection and the like.
Otherwise you're going to end up implementing your own version of
compact_checklock_irqsave when that LRU lock (taken for single pages
ouchies) gets too hot. You might be able to more easily implement a
batching mechanism too of some sort (dunno what that would look like
yet)
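
i.e. something along these lines in compaction.c (rough sketch, the
helper name is made up and the migrate_async_suitable() shortcut is
hand-waved):

/*
 * Sketch only: isolate the LRU page(s) backing @page for migration,
 * reusing the compaction machinery so autonuma gets the contention
 * checks for free.
 */
unsigned long isolate_page_for_migration(struct page *page,
					 struct list_head *migratepages)
{
	struct zone *zone = page_zone(page);
	unsigned long start_pfn = page_to_pfn(page);
	unsigned long end_pfn = start_pfn + 1;
	struct compact_control cc = {
		.zone = zone,
		.sync = false,
	};

	INIT_LIST_HEAD(&cc.freepages);
	INIT_LIST_HEAD(&cc.migratepages);

	if (PageTransHuge(page)) {
		end_pfn = start_pfn + HPAGE_PMD_NR;
		if (unlikely(split_huge_page(page)))
			return 0;
	}

	isolate_migratepages_range(zone, &cc, start_pfn, end_pfn);
	list_splice(&cc.migratepages, migratepages);
	return cc.nr_migratepages;
}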


> +	struct zone *zone;
> +	struct lruvec *lruvec;
> +	int nr_subpages;
> +	struct page *subpage;
> +	int ret = 0;
> +
> +	nr_subpages = 1;
> +	if (PageTransHuge(page)) {
> +		nr_subpages = HPAGE_PMD_NR;
> +		VM_BUG_ON(!PageAnon(page));
> +		/* FIXME: remove split_huge_page */
> +		if (unlikely(split_huge_page(page))) {
> +			autonuma_printk("autonuma migrate THP free\n");
> +			goto out;
> +		}
> +	}
> +
> +	/* All THP subpages are guaranteed to be in the same zone */
> +	zone = page_zone(page);
> +
> +	for (subpage = page; subpage < page+nr_subpages; subpage++) {
> +		spin_lock_irq(&zone->lru_lock);
> +
> +		/* Must run under the lru_lock and before page isolation */
> +		lruvec = mem_cgroup_page_lruvec(subpage, zone);
> +
> +		if (!__isolate_lru_page(subpage, ISOLATE_ASYNC_MIGRATE)) {
> +			VM_BUG_ON(PageTransCompound(subpage));
> +			del_page_from_lru_list(subpage, lruvec,
> +					       page_lru(subpage));
> +			inc_zone_state(zone, page_is_file_cache(subpage) ?
> +				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
> +			spin_unlock_irq(&zone->lru_lock);
> +
> +			list_add(&subpage->lru, migratepages);
> +			ret++;
> +		} else {
> +			/* losing page */
> +			spin_unlock_irq(&zone->lru_lock);
> +		}
> +	}
> +
> +	/*
> +	 * Pin the head subpage at least until the first
> +	 * __isolate_lru_page succeeds (__isolate_lru_page pins it
> +	 * again when it succeeds). If we unpin before
> +	 * __isolate_lru_page successd, the page could be freed and
> +	 * reallocated out from under us. Thus our previous checks on
> +	 * the page, and the split_huge_page, would be worthless.
> +	 *
> +	 * We really only need to do this if "ret > 0" but it doesn't
> +	 * hurt to do it unconditionally as nobody can reference
> +	 * "page" anymore after this and so we can avoid an "if (ret >
> +	 * 0)" branch here.
> +	 */
> +	put_page(page);
> +	/*
> +	 * Tell the caller we already released its pin, to avoid a
> +	 * double free.
> +	 */
> +	*migrated = true;
> +
> +out:
> +	return ret;
> +}
> +
> +static bool autonuma_balance_pgdat(struct pglist_data *pgdat,
> +				   int nr_migrate_pages)
> +{

This is actually a true/false check function, whereas page reclaim's
balance_pgdat() function actually does the balancing. This made my
brain skip a little. Rename it to

should_autonuma_balance_pgdat()

Document what a returning true means

/* Returns true if a zone within the pgdat is below watermarks and may
 * need some of its pages migrated to another node to balance across
 * multiple NUMA nodes
 */
static bool should_autonuma_balance_pgdat(....)

> +	/* FIXME: this only check the wmarks, make it move
> +	 * "unused" memory or pagecache by queuing it to
> +	 * pgdat->autonuma_migrate_head[pgdat->node_id].
> +	 */
> +	int z;
> +	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
> +		struct zone *zone = pgdat->node_zones + z;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (zone->all_unreclaimable)
> +			continue;
> +
> +		/*
> +		 * FIXME: in theory we're ok if we can obtain
> +		 * pages_to_migrate pages from all zones, it doesn't
> +		 * need to be all in a single zone. We care about the
> +		 * pgdat, not the zone.
> +		 */
> +
> +		/*
> +		 * Try not to wakeup kswapd by allocating
> +		 * pages_to_migrate pages.
> +		 */
> +		if (!zone_watermark_ok(zone, 0,
> +				       high_wmark_pages(zone) +
> +				       nr_migrate_pages,
> +				       0, 0))
> +			continue;
> +		return true;
> +	}
> +	return false;
> +}

prepare_kswapd_sleep() has a check very similar to this for
all_zones_ok. I think it could be split out and mostly shared, minus
the all_unreclaimable bit?

> +
> +static struct page *alloc_migrate_dst_page(struct page *page,
> +					   unsigned long data,
> +					   int **result)
> +{
> +	int nid = (int) data;
> +	struct page *newpage;
> +	newpage = alloc_pages_exact_node(nid,
> +					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
> +					  __GFP_NOMEMALLOC | __GFP_NORETRY |
> +					  __GFP_NOWARN | __GFP_NO_KSWAPD) &

__GFP_NO_KSWAPD will be gone so that flag will go away. When it does,
this thing will wake kswapd which you probably don't want. Will you need
to bring the flag back or make this an atomic allocation?

> +					 ~GFP_IOFS, 0);
> +	if (newpage)
> +		newpage->autonuma_last_nid = page->autonuma_last_nid;
> +	return newpage;
> +}
> +
> +static inline void autonuma_migrate_lock(int nid)
> +{
> +	spin_lock(&NODE_DATA(nid)->autonuma_migrate_lock);
> +}
> +
> +static inline void autonuma_migrate_unlock(int nid)
> +{
> +	spin_unlock(&NODE_DATA(nid)->autonuma_migrate_lock);
> +}
> +
> +static bool autonuma_migrate_page(struct page *page, int dst_nid,
> +				  int page_nid, bool *migrated)
> +{
> +	int isolated = 0;
> +	LIST_HEAD(migratepages);
> +	struct pglist_data *pgdat = NODE_DATA(dst_nid);
> +	int nr_pages = hpage_nr_pages(page);
> +	unsigned long autonuma_migrate_nr_pages = 0;
> +
> +	autonuma_migrate_lock(dst_nid);

Why is this lock necessary?

It's necessary because multiple processes can be migrating on fault at
the same time and the counters must be protected.

/*
 * autonuma_migrate_lock protects against multiple processes performing
 * migrate-on-fault at the same time. The counter statistics must be
 * kept accurate.
 */

Severely doubt this is a hot lock.

> +	if (time_after(jiffies, pgdat->autonuma_migrate_last_jiffies +
> +		       msecs_to_jiffies(migrate_sleep_millisecs))) {
> +		autonuma_migrate_nr_pages = pgdat->autonuma_migrate_nr_pages;
> +		pgdat->autonuma_migrate_nr_pages = 0;
> +		pgdat->autonuma_migrate_last_jiffies = jiffies;
> +	}
> +	if (pgdat->autonuma_migrate_nr_pages >= pages_to_migrate) {
> +		autonuma_migrate_unlock(dst_nid);
> +		goto out;
> +	}
> +	pgdat->autonuma_migrate_nr_pages += nr_pages;
> +	autonuma_migrate_unlock(dst_nid);
> +
> +	if (autonuma_migrate_nr_pages)
> +		autonuma_printk("migrated %lu pages to node %d\n",
> +				autonuma_migrate_nr_pages, dst_nid);
> +
> +	if (autonuma_balance_pgdat(pgdat, nr_pages))
> +		isolated = sync_isolate_migratepages(&migratepages,
> +						     page, pgdat,
> +						     migrated);
> +

At some point you will want to batch that a bit - either 1 THP or
SWAP_CLUSTER_MAX

We should have either tracepoints or vmstats available that allow us to
track how many pages are being scanned, isolated and migrated. The
vmstats would give a coarse view of the overall system. The tracepoints
may allow someone to figure out at what rate data is moving between
nodes due to autonuma.
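
For the coarse view, something as small as this would already do
(sketch, the event names are invented; count_vm_event() and
count_vm_events() are the existing vmstat helpers):

/*
 * Sketch only: hypothetical additions to enum vm_event_item so
 * /proc/vmstat shows coarse autonuma activity.
 */
static inline void autonuma_vmstat_fault(void)
{
	count_vm_event(NUMA_HINT_FAULTS);
}

static inline void autonuma_vmstat_migrated(long nr_pages)
{
	count_vm_events(NUMA_PAGES_MIGRATED, nr_pages);
}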

> +	if (isolated) {
> +		int err;
> +		pages_migrated += isolated; /* FIXME: per node */
> +		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
> +				    pgdat->node_id, false, MIGRATE_ASYNC);

Another option would be to push some of the stats down here. The current
stats are from compaction and they suck. compact_pages_moved and
compact_pagemigrate_failed should be replaced and pushed down into
migration. It'll be insufficient to tell which are due to autonuma and
which are due to compaction, but again tracepoints could be used to
answer that question if someone cared.

> +		if (err)
> +			putback_lru_pages(&migratepages);
> +	}
> +	BUG_ON(!list_empty(&migratepages));
> +out:
> +	return isolated;
> +}
> +
> +static void cpu_follow_memory_pass(struct task_struct *p,
> +				   struct task_autonuma *task_autonuma,
> +				   unsigned long *task_numa_fault)
> +{
> +	int nid;
> +	/* If a new pass started, degrade the stats by a factor of 2 */
> +	for_each_node(nid)
> +		task_numa_fault[nid] >>= 1;
> +	task_autonuma->task_numa_fault_tot >>= 1;
> +}
> +
> +static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
> +						 int access_nid,
> +						 int numpages,
> +						 bool new_pass)
> +{
> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
> +
> +	/* prevent sched_autonuma_balance() to run on top of us */
> +	local_bh_disable();
> +
> +	if (unlikely(new_pass))
> +		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
> +	task_numa_fault[access_nid] += numpages;
> +	task_autonuma->task_numa_fault_tot += numpages;
> +
> +	local_bh_enable();
> +}
> +
> +/*
> + * In this function we build a temporal CPU_node<->page relation by
> + * using a two-stage autonuma_last_nid filter to remove short/unlikely
> + * relations.
> + *
> + * Using P(p) ~ n_p / n_t as per frequentest probability, we can
> + * equate a node's CPU usage of a particular page (n_p) per total
> + * usage of this page (n_t) (in a given time-span) to a probability.
> + *
> + * Our periodic faults will then sample this probability and getting
> + * the same result twice in a row, given these samples are fully
> + * independent, is then given by P(n)^2, provided our sample period
> + * is sufficiently short compared to the usage pattern.
> + *
> + * This quadric squishes small probabilities, making it less likely
> + * we act on an unlikely CPU_node<->page relation.
> + */
> +static inline bool last_nid_set(struct page *page, int this_nid)
> +{
> +	bool ret = true;
> +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	VM_BUG_ON(this_nid < 0);
> +	VM_BUG_ON(this_nid >= MAX_NUMNODES);
> +	if (autonuma_last_nid != this_nid) {
> +		if (autonuma_last_nid >= 0)
> +			ret = false;
> +		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
> +	}
> +	return ret;
> +}
> +

Ok, you could include your blurb from another mail about how this is
functionally very similar to how pages are aged - two LRU scans for
page aging versus two knuma_scans for autonuma. It does mean that small
processes may have a tendency to move between nodes faster because the
knuma_scan completes faster. Not many workloads will notice though.

> +static int numa_hinting_fault_memory_follow_cpu(struct page *page,
> +						int this_nid, int page_nid,
> +						bool new_pass,
> +						bool *migrated)
> +{
> +	if (!last_nid_set(page, this_nid))
> +		goto out;
> +	if (!PageLRU(page))
> +		goto out;
> +	if (this_nid != page_nid) {
> +		if (autonuma_migrate_page(page, this_nid, page_nid,
> +					  migrated))
> +			return this_nid;
> +	}
> +out:
> +	return page_nid;
> +}
> +
> +bool numa_hinting_fault(struct page *page, int numpages)
> +{
> +	bool migrated = false;
> +
> +	/*
> +	 * "current->mm" could be different from the "mm" where the
> +	 * NUMA hinting page fault happened, if get_user_pages()
> +	 * triggered the fault on some other process "mm". That is ok,
> +	 * all we care about is to count the "page_nid" access on the
> +	 * current->task_autonuma, even if the page belongs to a
> +	 * different "mm".
> +	 */
> +	WARN_ON_ONCE(!current->mm);
> +	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
> +		struct task_struct *p = current;
> +		int this_nid, page_nid, access_nid;
> +		bool new_pass;
> +
> +		/*
> +		 * new_pass is only true the first time the thread
> +		 * faults on this pass of knuma_scand.
> +		 */
> +		new_pass = p->task_autonuma->task_numa_fault_pass !=
> +			p->mm->mm_autonuma->mm_numa_fault_pass;
> +		page_nid = page_to_nid(page);
> +		this_nid = numa_node_id();
> +		VM_BUG_ON(this_nid < 0);
> +		VM_BUG_ON(this_nid >= MAX_NUMNODES);
> +		access_nid = numa_hinting_fault_memory_follow_cpu(page,
> +								  this_nid,
> +								  page_nid,
> +								  new_pass,
> +								  &migrated);
> +		/* "page" has been already freed if "migrated" is true */
> +		numa_hinting_fault_cpu_follow_memory(p, access_nid,
> +						     numpages, new_pass);
> +		if (unlikely(new_pass))
> +			/*
> +			 * Set the task's fault_pass equal to the new
> +			 * mm's fault_pass, so new_pass will be false
> +			 * on the next fault by this thread in this
> +			 * same pass.
> +			 */
> +			p->task_autonuma->task_numa_fault_pass =
> +				p->mm->mm_autonuma->mm_numa_fault_pass;
> +	}
> +
> +	return migrated;
> +}

ok.

> +
> +/* NUMA hinting page fault entry point for ptes */
> +int pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
> +		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
> +{

Naming. We're not "fixing" anything; it's not broken, so the entry
point should not be named as a fixup. It should follow the same
pattern as other fault handlers and be called do_numahint_page();

handle_pte_fault -> do_numahint_page
handle_pte_fault -> do_anonymous_page

etc.

Functionally I did not spot a problem.

> +	struct page *page;
> +	spinlock_t *ptl;
> +	bool migrated;
> +
> +	/*
> +	 * The "pte" at this point cannot be used safely without
> +	 * validation through pte_unmap_same(). It's of NUMA type but
> +	 * the pfn may be screwed if the read is non atomic.
> +	 */
> +
> +	ptl = pte_lockptr(mm, pmd);
> +	spin_lock(ptl);
> +	if (unlikely(!pte_same(*ptep, pte)))
> +		goto out_unlock;
> +	pte = pte_mknonnuma(pte);
> +	set_pte_at(mm, addr, ptep, pte);
> +	page = vm_normal_page(vma, addr, pte);
> +	BUG_ON(!page);
> +	if (unlikely(page_mapcount(page) != 1))
> +		goto out_unlock;
> +	get_page(page);
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	migrated = numa_hinting_fault(page, 1);
> +	if (!migrated)
> +		put_page(page);
> +out:
> +	return 0;
> +
> +out_unlock:
> +	pte_unmap_unlock(ptep, ptl);
> +	goto out;
> +}
> +
> +/* NUMA hinting page fault entry point for regular pmds */
> +int pmd_numa_fixup(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp)

do_numahint_pmd

> +{
> +	pmd_t pmd;
> +	pte_t *pte, *orig_pte;
> +	unsigned long _addr = addr & PMD_MASK;
> +	unsigned long offset;
> +	spinlock_t *ptl;
> +	bool numa = false;
> +	struct vm_area_struct *vma;
> +	bool migrated;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = *pmdp;
> +	if (pmd_numa(pmd)) {
> +		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
> +		numa = true;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	if (!numa)
> +		return 0;
> +
> +	vma = find_vma(mm, _addr);
> +	/* we're in a page fault so some vma must be in the range */
> +	BUG_ON(!vma);
> +	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
> +	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
> +	VM_BUG_ON(offset >= PMD_SIZE);
> +	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
> +	pte += offset >> PAGE_SHIFT;
> +	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {

Pity you can't use walk_pmd_range() but it's unsuitable.

> +		pte_t pteval = *pte;
> +		struct page * page;
> +		if (!pte_present(pteval))
> +			continue;
> +		if (addr >= vma->vm_end) {
> +			vma = find_vma(mm, addr);
> +			/* there's a pte present so there must be a vma */
> +			BUG_ON(!vma);
> +			BUG_ON(addr < vma->vm_start);
> +		}
> +		if (pte_numa(pteval)) {
> +			pteval = pte_mknonnuma(pteval);
> +			set_pte_at(mm, addr, pte, pteval);
> +		}
> +		page = vm_normal_page(vma, addr, pteval);
> +		if (unlikely(!page))
> +			continue;
> +		/* only check non-shared pages */
> +		if (unlikely(page_mapcount(page) != 1))
> +			continue;
> +		get_page(page);
> +		pte_unmap_unlock(pte, ptl);
> +
> +		migrated = numa_hinting_fault(page, 1);
> +		if (!migrated)
> +			put_page(page);
> +
> +		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	}
> +	pte_unmap_unlock(orig_pte, ptl);
> +	return 0;
> +}
> +
> +static inline int task_autonuma_size(void)
> +{
> +	return sizeof(struct task_autonuma) +
> +		nr_node_ids * sizeof(unsigned long);
> +}
> +
> +static inline int task_autonuma_reset_size(void)
> +{
> +	struct task_autonuma *task_autonuma = NULL;
> +	return task_autonuma_size() -
> +		(int)((char *)(&task_autonuma->task_numa_fault_pass) -
> +		      (char *)task_autonuma);
> +}
> +
> +static void __task_autonuma_reset(struct task_autonuma *task_autonuma)
> +{
> +	memset(&task_autonuma->task_numa_fault_pass, 0,
> +	       task_autonuma_reset_size());
> +}
> +
> +static void task_autonuma_reset(struct task_autonuma *task_autonuma)
> +{
> +	task_autonuma->task_selected_nid = -1;
> +	__task_autonuma_reset(task_autonuma);
> +}
> +
> +static inline int mm_autonuma_fault_size(void)
> +{
> +	return nr_node_ids * sizeof(unsigned long);
> +}
> +
> +static inline int mm_autonuma_size(void)
> +{
> +	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size();
> +}
> +
> +static inline int mm_autonuma_reset_size(void)
> +{
> +	struct mm_autonuma *mm_autonuma = NULL;
> +	return mm_autonuma_size() -
> +		(int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
> +		      (char *)mm_autonuma);
> +}
> +
> +static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
> +{
> +	memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
> +}
> +
> +void autonuma_setup_new_exec(struct task_struct *p)
> +{
> +	if (p->task_autonuma)
> +		task_autonuma_reset(p->task_autonuma);
> +	if (p->mm && p->mm->mm_autonuma)
> +		mm_autonuma_reset(p->mm->mm_autonuma);
> +}
> +
> +static inline int knumad_test_exit(struct mm_struct *mm)
> +{
> +	return atomic_read(&mm->mm_users) == 0;
> +}
> +
> +/*
> + * Here we search for not shared page mappings (mapcount == 1) and we
> + * set up the pmd/pte_numa on those mappings so the very next access
> + * will fire a NUMA hinting page fault. We also collect the
> + * mm_autonuma statistics for this process mm at the same time.
> + */
> +static int knuma_scand_pmd(struct mm_struct *mm,
> +			   struct vm_area_struct *vma,
> +			   unsigned long address)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte, *_pte;
> +	struct page *page;
> +	unsigned long _address, end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	VM_BUG_ON(address & ~PAGE_MASK);
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +
> +	if (pmd_trans_huge_lock(pmd, vma) == 1) {
> +		int page_nid;
> +		unsigned long *fault_tmp;
> +		ret = HPAGE_PMD_NR;
> +
> +		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +		page = pmd_page(*pmd);
> +
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1) {
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}
> +
> +		page_nid = page_to_nid(page);
> +		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
> +		fault_tmp[page_nid] += ret;
> +
> +		if (pmd_numa(*pmd)) {
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}
> +
> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		/* defer TLB flush to lower the overhead */
> +		spin_unlock(&mm->page_table_lock);
> +		goto out;
> +	}
> +

Ok, so it's another area to watch out for, but the competition on
mmap_sem means that knuma_scan can increase the latency of mmap() and
basically anything else that requires down_write(mmap_sem). If the
lock is contended, knuma_scan will spin on it whereas it probably
should back off, and this is a real possibility as at this level you
are dealing with mm->page_table_lock and not the fine-grained ptl.

That said, similar concerns probably exist for khugepaged and so far
that scanning has not been a major problem. Something to keep an eye
on, but not necessarily get into a twist over either. As with so many
other places, it's a contributor to system CPU usage that must be
watched and incrementally improved.
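
A trivial way to back off would be something like this in
knumad_do_scan() (sketch only, reusing the existing bookkeeping so the
same mm is simply retried on the next wakeup):

	/*
	 * Sketch only: don't sleep inside down_read() while a writer
	 * holds mmap_sem, retry this mm on the next knuma_scand pass
	 * through the loop instead.
	 */
	if (!down_read_trylock(&mm->mmap_sem)) {
		mutex_lock(&knumad_mm_mutex);
		knuma_scand_data.address = address;
		return progress;
	}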

> +	if (pmd_trans_unstable(pmd))
> +		goto out;
> +	VM_BUG_ON(!pmd_present(*pmd));
> +
> +	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
> +	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	for (_address = address, _pte = pte; _address < end;
> +	     _pte++, _address += PAGE_SIZE) {
> +		pte_t pteval = *_pte;
> +		unsigned long *fault_tmp;
> +		if (!pte_present(pteval))
> +			continue;
> +		page = vm_normal_page(vma, _address, pteval);
> +		if (unlikely(!page))
> +			continue;
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
> +		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
> +		fault_tmp[page_to_nid(page)]++;
> +
> +		if (pte_numa(pteval))
> +			continue;
> +
> +		if (!autonuma_scan_pmd())
> +			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
> +
> +		/* defer TLB flush to lower the overhead */
> +		ret++;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +
> +	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
> +		/*
> +		 * Mark the page table pmd as numa if "autonuma scan
> +		 * pmd" mode is enabled.
> +		 */
> +		spin_lock(&mm->page_table_lock);
> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		spin_unlock(&mm->page_table_lock);
> +		/* defer TLB flush to lower the overhead */
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
> +{
> +	int nid;
> +	struct mm_autonuma *mma = mm->mm_autonuma;
> +	unsigned long tot;
> +	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
> +
> +	/* FIXME: would be better protected with write_seqlock_bh() */
> +	local_bh_disable();
> +
> +	tot = 0;
> +	for_each_node(nid) {
> +		unsigned long faults = fault_tmp[nid];
> +		fault_tmp[nid] = 0;
> +		mma->mm_numa_fault[nid] = faults;
> +		tot += faults;
> +	}
> +	mma->mm_numa_fault_tot = tot;
> +
> +	local_bh_enable();
> +}
> +
> +static void mm_numa_fault_tmp_reset(void)
> +{
> +	memset(knuma_scand_data.mm_numa_fault_tmp, 0,
> +	       mm_autonuma_fault_size());
> +}
> +
> +static inline void validate_mm_numa_fault_tmp(unsigned long address)
> +{
> +#ifdef CONFIG_DEBUG_VM
> +	int nid;
> +	if (address)
> +		return;
> +	for_each_node(nid)
> +		BUG_ON(knuma_scand_data.mm_numa_fault_tmp[nid]);
> +#endif
> +}
> +
> +/*
> + * Scan the next part of the mm. Keep track of the progress made and
> + * return it.
> + */
> +static int knumad_do_scan(void)
> +{
> +	struct mm_struct *mm;
> +	struct mm_autonuma *mm_autonuma;
> +	unsigned long address;
> +	struct vm_area_struct *vma;
> +	int progress = 0;
> +
> +	mm = knuma_scand_data.mm;
> +	/*
> +	 * knuma_scand_data.mm is NULL after the end of each
> +	 * knuma_scand pass. So when it's NULL we've start from
> +	 * scratch from the very first mm in the list.
> +	 */
> +	if (!mm) {
> +		if (unlikely(list_empty(&knuma_scand_data.mm_head)))
> +			return pages_to_scan;
> +		mm_autonuma = list_entry(knuma_scand_data.mm_head.next,
> +					 struct mm_autonuma, mm_node);
> +		mm = mm_autonuma->mm;
> +		knuma_scand_data.address = 0;
> +		knuma_scand_data.mm = mm;
> +		atomic_inc(&mm->mm_count);
> +		mm_autonuma->mm_numa_fault_pass++;
> +	}
> +	address = knuma_scand_data.address;
> +
> +	validate_mm_numa_fault_tmp(address);
> +
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	down_read(&mm->mmap_sem);
> +	if (unlikely(knumad_test_exit(mm)))
> +		vma = NULL;
> +	else
> +		vma = find_vma(mm, address);
> +
> +	progress++;
> +	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
> +		unsigned long start_addr, end_addr;
> +		cond_resched();
> +		if (unlikely(knumad_test_exit(mm))) {
> +			progress++;
> +			break;
> +		}
> +
> +		if (!vma->anon_vma || vma_policy(vma)) {
> +			progress++;
> +			continue;
> +		}
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> +			progress++;
> +			continue;
> +		}
> +		/*
> +		 * Skip regions mprotected with PROT_NONE. It would be
> +		 * safe to scan them too, but it's worthless because
> +		 * NUMA hinting page faults can't run on those.
> +		 */
> +		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) {
> +			progress++;
> +			continue;
> +		}
> +		if (is_vma_temporary_stack(vma)) {
> +			progress++;
> +			continue;
> +		}
> +
> +		VM_BUG_ON(address & ~PAGE_MASK);
> +		if (address < vma->vm_start)
> +			address = vma->vm_start;
> +
> +		start_addr = address;
> +		while (address < vma->vm_end) {
> +			cond_resched();
> +			if (unlikely(knumad_test_exit(mm)))
> +				break;
> +
> +			VM_BUG_ON(address < vma->vm_start ||
> +				  address + PAGE_SIZE > vma->vm_end);
> +			progress += knuma_scand_pmd(mm, vma, address);
> +			/* move to next address */
> +			address = (address + PMD_SIZE) & PMD_MASK;
> +			if (progress >= pages_to_scan)
> +				break;
> +		}
> +		end_addr = min(address, vma->vm_end);
> +
> +		/*
> +		 * Flush the TLB for the mm to start the NUMA hinting
> +		 * page faults after we finish scanning this vma part.
> +		 */
> +		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
> +						    end_addr);
> +		flush_tlb_range(vma, start_addr, end_addr);

It's possible that the full range you checked was already set PAGE_NUMA
and no updates were necessary, in which case the flush and notification
could be avoided.

> +		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
> +						  end_addr);
> +	}
> +	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
> +
> +	mutex_lock(&knumad_mm_mutex);
> +	VM_BUG_ON(knuma_scand_data.mm != mm);
> +	knuma_scand_data.address = address;
> +	/*
> +	 * Change the current mm if this mm is about to die, or if we
> +	 * scanned all vmas of this mm.
> +	 */
> +	if (knumad_test_exit(mm) || !vma) {
> +		mm_autonuma = mm->mm_autonuma;
> +		if (mm_autonuma->mm_node.next != &knuma_scand_data.mm_head) {
> +			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
> +						 struct mm_autonuma, mm_node);
> +			knuma_scand_data.mm = mm_autonuma->mm;
> +			atomic_inc(&knuma_scand_data.mm->mm_count);
> +			knuma_scand_data.address = 0;
> +			knuma_scand_data.mm->mm_autonuma->mm_numa_fault_pass++;
> +		} else
> +			knuma_scand_data.mm = NULL;
> +
> +		if (knumad_test_exit(mm)) {
> +			list_del(&mm->mm_autonuma->mm_node);
> +			/* tell autonuma_exit not to list_del */
> +			VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +			mm->mm_autonuma->mm = NULL;
> +			mm_numa_fault_tmp_reset();
> +		} else
> +			mm_numa_fault_tmp_flush(mm);
> +
> +		mmdrop(mm);
> +	}
> +
> +	return progress;
> +}
> +
> +static void knuma_scand_disabled(void)
> +{
> +	if (!autonuma_enabled())
> +		wait_event_freezable(knuma_scand_wait,
> +				     autonuma_enabled() ||
> +				     kthread_should_stop());
> +}
> +
> +static int knuma_scand(void *none)
> +{
> +	struct mm_struct *mm = NULL;
> +	int progress = 0, _progress;
> +	unsigned long total_progress = 0;
> +
> +	set_freezable();
> +
> +	knuma_scand_disabled();
> +
> +	/*
> +	 * Serialize the knuma_scand_data against
> +	 * autonuma_enter/exit().
> +	 */
> +	mutex_lock(&knumad_mm_mutex);
> +
> +	for (;;) {
> +		if (unlikely(kthread_should_stop()))
> +			break;
> +
> +		/* Do one loop of scanning, keeping track of the progress */
> +		_progress = knumad_do_scan();
> +		progress += _progress;
> +		total_progress += _progress;
> +		mutex_unlock(&knumad_mm_mutex);
> +
> +		/* Check if we completed one full scan pass */
> +		if (unlikely(!knuma_scand_data.mm)) {
> +			autonuma_printk("knuma_scand %lu\n", total_progress);
> +			pages_scanned += total_progress;
> +			total_progress = 0;
> +			full_scans++;
> +
> +			wait_event_freezable_timeout(knuma_scand_wait,
> +						     kthread_should_stop(),
> +						     msecs_to_jiffies(
> +						     scan_sleep_pass_millisecs));
> +
> +			if (autonuma_debug()) {
> +				extern void sched_autonuma_dump_mm(void);
> +				sched_autonuma_dump_mm();
> +			}
> +
> +			/* wait while there is no pinned mm */
> +			knuma_scand_disabled();
> +		}
> +		if (progress > pages_to_scan) {
> +			progress = 0;
> +			wait_event_freezable_timeout(knuma_scand_wait,
> +						     kthread_should_stop(),
> +						     msecs_to_jiffies(
> +						     scan_sleep_millisecs));
> +		}
> +		cond_resched();
> +		mutex_lock(&knumad_mm_mutex);
> +	}
> +
> +	mm = knuma_scand_data.mm;
> +	knuma_scand_data.mm = NULL;
> +	if (mm && knumad_test_exit(mm)) {
> +		list_del(&mm->mm_autonuma->mm_node);
> +		/* tell autonuma_exit not to list_del */
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL;
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (mm)
> +		mmdrop(mm);
> +	mm_numa_fault_tmp_reset();
> +
> +	return 0;
> +}
> +
> +void autonuma_enter(struct mm_struct *mm)
> +{
> +	if (!autonuma_possible())
> +		return;
> +
> +	mutex_lock(&knumad_mm_mutex);
> +	list_add_tail(&mm->mm_autonuma->mm_node, &knuma_scand_data.mm_head);
> +	mutex_unlock(&knumad_mm_mutex);
> +}
> +
> +void autonuma_exit(struct mm_struct *mm)
> +{
> +	bool serialize;
> +
> +	if (!autonuma_possible())
> +		return;
> +
> +	serialize = false;
> +	mutex_lock(&knumad_mm_mutex);
> +	if (knuma_scand_data.mm == mm)
> +		serialize = true;
> +	else if (mm->mm_autonuma->mm) {
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL; /* debug */
> +		list_del(&mm->mm_autonuma->mm_node);
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (serialize) {
> +		/* prevent the mm to go away under knumad_do_scan main loop */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
> +}
> +
> +static int start_knuma_scand(void)
> +{
> +	int err = 0;
> +	struct task_struct *knumad_thread;
> +
> +	knuma_scand_data.mm_numa_fault_tmp = kzalloc(mm_autonuma_fault_size(),
> +						     GFP_KERNEL);
> +	if (!knuma_scand_data.mm_numa_fault_tmp)
> +		return -ENOMEM;
> +
> +	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
> +	if (unlikely(IS_ERR(knumad_thread))) {
> +		autonuma_printk(KERN_ERR
> +				"knumad: kthread_run(knuma_scand) failed\n");
> +		err = PTR_ERR(knumad_thread);
> +	}
> +	return err;
> +}
> +

Ok, I ignored everything after this because it mostly looks like
boiler-plate code.

Overall, this is pretty heavy but there is no getting away from it
either for the autonuma concept. There are places where it could be
improved but that's ok because profiles will point to the areas to
improve over time, just as has happened with compaction. Ultimately
it'll come down to whether the ends justify the means -- I think in
many cases they will and the better NUMA placement will offset the
amount of work this thing has to do, but that is hard to detect. It's
easy to see autonuma's costs at runtime but harder to see the benefit
without doing comparisons with it on and off. I guess perf measuring
NUMA traffic would do it though. There will be corner cases -- for
example, I suspect that small processes will move too quickly because
they can be scanned quickly but bounce around a lot.

All that said, an administrator can just turn it off and on and decide
for themselves where the system CPU overhead is worth it or not.

Eventually I think it will be necessary to reduce the scanning rate
once autonuma identifies that the workload has converged or almost
converged, to cut the overhead and lock contention. I did not see
anything that makes it go quiet when the system is fully idle, so it
may be consuming CPU scanning for no reason on idle machines. Maybe it
does this already and I'll spot it after thinking about it. We will
also need to keep an eye on lock contention as it does make a few
locks a bit hotter.
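
The simplest thing that could possibly work is probably something like
this (sketch, the helper and thresholds are invented), stretching the
inter-pass sleep while full passes migrate almost nothing and snapping
back as soon as activity resumes:

/*
 * Sketch only: adaptive inter-pass sleep for knuma_scand.
 */
static unsigned int autonuma_next_pass_sleep(unsigned long migrated_this_pass,
					     unsigned int cur_sleep_ms)
{
	if (migrated_this_pass < pages_to_migrate / 16)
		return min(cur_sleep_ms * 2,
			   16U * scan_sleep_pass_millisecs);
	return scan_sleep_pass_millisecs;
}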

I did not see any fundamental problems! Users will decide for themselves
whether they would prefer autonuma to spend CPU figuring out the best
placement or whether they want to spend the tuning and development time
doing it by hand. Where possible (e.g. HPC) they will but I suspect Java
workloads and anything virtualisation-based will prefer this and pay the
CPU cost.

> +
> +#ifdef CONFIG_SYSFS
> +
> +static ssize_t flag_show(struct kobject *kobj,
> +			 struct kobj_attribute *attr, char *buf,
> +			 enum autonuma_flag flag)
> +{
> +	return sprintf(buf, "%d\n",
> +		       !!test_bit(flag, &autonuma_flags));
> +}
> +static ssize_t flag_store(struct kobject *kobj,
> +			  struct kobj_attribute *attr,
> +			  const char *buf, size_t count,
> +			  enum autonuma_flag flag)
> +{
> +	unsigned long value;
> +	int ret;
> +
> +	ret = kstrtoul(buf, 10, &value);
> +	if (ret < 0)
> +		return ret;
> +	if (value > 1)
> +		return -EINVAL;
> +
> +	if (value)
> +		set_bit(flag, &autonuma_flags);
> +	else
> +		clear_bit(flag, &autonuma_flags);
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return flag_show(kobj, attr, buf, AUTONUMA_ENABLED_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	ssize_t ret;
> +
> +	ret = flag_store(kobj, attr, buf, count, AUTONUMA_ENABLED_FLAG);
> +
> +	if (ret > 0 && autonuma_enabled())
> +		wake_up_interruptible(&knuma_scand_wait);
> +
> +	return ret;
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +#define SYSFS_ENTRY(NAME, FLAG)						\
> +static ssize_t NAME ## _show(struct kobject *kobj,			\
> +			     struct kobj_attribute *attr, char *buf)	\
> +{									\
> +	return flag_show(kobj, attr, buf, FLAG);			\
> +}									\
> +									\
> +static ssize_t NAME ## _store(struct kobject *kobj,			\
> +			      struct kobj_attribute *attr,		\
> +			      const char *buf, size_t count)		\
> +{									\
> +	return flag_store(kobj, attr, buf, count, FLAG);		\
> +}									\
> +static struct kobj_attribute NAME ## _attr =				\
> +	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
> +
> +SYSFS_ENTRY(scan_pmd, AUTONUMA_SCAN_PMD_FLAG);
> +SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
> +#ifdef CONFIG_DEBUG_VM
> +SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
> +SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +#undef SYSFS_ENTRY
> +
> +enum {
> +	SYSFS_SCAN_SLEEP_ENTRY,
> +	SYSFS_SCAN_PAGES_ENTRY,
> +	SYSFS_MIGRATE_SLEEP_ENTRY,
> +	SYSFS_MIGRATE_PAGES_ENTRY,
> +};
> +
> +#define SYSFS_ENTRY(NAME, SYSFS_TYPE)					\
> +	static ssize_t NAME ## _show(struct kobject *kobj,		\
> +				     struct kobj_attribute *attr,	\
> +				     char *buf)				\
> +	{								\
> +		return sprintf(buf, "%u\n", NAME);			\
> +	}								\
> +	static ssize_t NAME ## _store(struct kobject *kobj,		\
> +				      struct kobj_attribute *attr,	\
> +				      const char *buf, size_t count)	\
> +	{								\
> +		unsigned long val;					\
> +		int err;						\
> +									\
> +		err = strict_strtoul(buf, 10, &val);			\
> +		if (err || val > UINT_MAX)				\
> +			return -EINVAL;					\
> +		switch (SYSFS_TYPE) {					\
> +		case SYSFS_SCAN_PAGES_ENTRY:				\
> +		case SYSFS_MIGRATE_PAGES_ENTRY:				\
> +			if (!val)					\
> +				return -EINVAL;				\
> +			break;						\
> +		}							\
> +									\
> +		NAME = val;						\
> +		switch (SYSFS_TYPE) {					\
> +		case SYSFS_SCAN_SLEEP_ENTRY:				\
> +			wake_up_interruptible(&knuma_scand_wait);	\
> +			break;						\
> +		}							\
> +									\
> +		return count;						\
> +	}								\
> +	static struct kobj_attribute NAME ## _attr =			\
> +		__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
> +
> +SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_SCAN_SLEEP_ENTRY);
> +SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_SCAN_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_scan, SYSFS_SCAN_PAGES_ENTRY);
> +
> +SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_MIGRATE_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_migrate, SYSFS_MIGRATE_PAGES_ENTRY);
> +
> +#undef SYSFS_ENTRY
> +
> +#define SYSFS_ENTRY(NAME)					\
> +static ssize_t NAME ## _show(struct kobject *kobj,		\
> +			     struct kobj_attribute *attr,	\
> +			     char *buf)				\
> +{								\
> +	return sprintf(buf, "%lu\n", NAME);			\
> +}								\
> +static struct kobj_attribute NAME ## _attr =			\
> +	__ATTR_RO(NAME);
> +
> +SYSFS_ENTRY(full_scans);
> +SYSFS_ENTRY(pages_scanned);
> +SYSFS_ENTRY(pages_migrated);
> +
> +#undef SYSFS_ENTRY
> +
> +static struct attribute *autonuma_attr[] = {
> +	&enabled_attr.attr,
> +
> +	&debug_attr.attr,
> +
> +	/* migrate start */
> +	&migrate_sleep_millisecs_attr.attr,
> +	&pages_to_migrate_attr.attr,
> +	&pages_migrated_attr.attr,
> +	/* migrate end */
> +
> +	/* scan start */
> +	&scan_sleep_millisecs_attr.attr,
> +	&scan_sleep_pass_millisecs_attr.attr,
> +	&pages_to_scan_attr.attr,
> +	&pages_scanned_attr.attr,
> +	&full_scans_attr.attr,
> +	&scan_pmd_attr.attr,
> +	/* scan end */
> +
> +#ifdef CONFIG_DEBUG_VM
> +	&sched_load_balance_strict_attr.attr,
> +	&child_inheritance_attr.attr,
> +#endif
> +
> +	NULL,
> +};
> +static struct attribute_group autonuma_attr_group = {
> +	.attrs = autonuma_attr,
> +};
> +
> +static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
> +{
> +	int err;
> +
> +	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
> +	if (unlikely(!*autonuma_kobj)) {
> +		printk(KERN_ERR "autonuma: failed kobject create\n");
> +		return -ENOMEM;
> +	}
> +
> +	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
> +	if (err) {
> +		printk(KERN_ERR "autonuma: failed register autonuma group\n");
> +		goto delete_obj;
> +	}
> +
> +	return 0;
> +
> +delete_obj:
> +	kobject_put(*autonuma_kobj);
> +	return err;
> +}
> +
> +static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
> +{
> +	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
> +	kobject_put(autonuma_kobj);
> +}
> +#else
> +static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
> +{
> +	return 0;
> +}
> +
> +static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
> +{
> +}
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init noautonuma_setup(char *str)
> +{
> +	if (autonuma_possible()) {
> +		printk("AutoNUMA permanently disabled\n");
> +		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
> +		WARN_ON(autonuma_possible()); /* avoid early crash */
> +	}
> +	return 1;
> +}
> +__setup("noautonuma", noautonuma_setup);
> +
> +static bool autonuma_init_checks_failed(void)
> +{
> +	/* safety checks on nr_node_ids */
> +	int last_nid = find_last_bit(node_states[N_POSSIBLE].bits, MAX_NUMNODES);
> +	if (last_nid + 1 != nr_node_ids) {
> +		WARN_ON(1);
> +		return true;
> +	}
> +	if (num_possible_nodes() > nr_node_ids) {
> +		WARN_ON(1);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static int __init autonuma_init(void)
> +{
> +	int err;
> +	struct kobject *autonuma_kobj;
> +
> +	VM_BUG_ON(num_possible_nodes() < 1);
> +	if (num_possible_nodes() <= 1 || !autonuma_possible()) {
> +		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
> +		return -EINVAL;
> +	} else if (autonuma_init_checks_failed()) {
> +		printk("autonuma disengaged: init checks failed\n");
> +		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
> +		return -EINVAL;
> +	}
> +
> +	err = autonuma_init_sysfs(&autonuma_kobj);
> +	if (err)
> +		return err;
> +
> +	err = start_knuma_scand();
> +	if (err) {
> +		printk("failed to start knuma_scand\n");
> +		goto out;
> +	}
> +
> +	printk("AutoNUMA initialized successfully\n");
> +	return err;
> +
> +out:
> +	autonuma_exit_sysfs(autonuma_kobj);
> +	return err;
> +}
> +module_init(autonuma_init)
> +
> +static struct kmem_cache *task_autonuma_cachep;
> +
> +int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
> +			 int node)
> +{
> +	int err = 1;
> +	struct task_autonuma *task_autonuma;
> +
> +	if (!autonuma_possible())
> +		goto no_numa;
> +	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
> +					      GFP_KERNEL, node);
> +	if (!task_autonuma)
> +		goto out;
> +	if (!autonuma_child_inheritance()) {
> +		/*
> +		 * Only reset the task NUMA stats, Always inherit the
> +		 * task_selected_nid. It's certainly better to start
> +		 * the child in the same NUMA node of the parent, if
> +		 * idle/load balancing permits. If they don't permit,
> +		 * task_selected_nid is a transient entity and it'll
> +		 * be updated accordingly.
> +		 */
> +		task_autonuma->task_selected_nid =
> +			orig->task_autonuma->task_selected_nid;
> +		__task_autonuma_reset(task_autonuma);
> +	} else
> +		memcpy(task_autonuma, orig->task_autonuma,
> +		       task_autonuma_size());
> +	VM_BUG_ON(task_autonuma->task_selected_nid < -1);
> +	VM_BUG_ON(task_autonuma->task_selected_nid >= nr_node_ids);
> +	tsk->task_autonuma = task_autonuma;
> +no_numa:
> +	err = 0;
> +out:
> +	return err;
> +}
> +
> +void free_task_autonuma(struct task_struct *tsk)
> +{
> +	if (!autonuma_possible()) {
> +		BUG_ON(tsk->task_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!tsk->task_autonuma);
> +	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
> +	tsk->task_autonuma = NULL;
> +}
> +
> +void __init task_autonuma_init(void)
> +{
> +	struct task_autonuma *task_autonuma;
> +
> +	BUG_ON(current != &init_task);
> +
> +	if (!autonuma_possible())
> +		return;
> +
> +	task_autonuma_cachep =
> +		kmem_cache_create("task_autonuma",
> +				  task_autonuma_size(), 0,
> +				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
> +
> +	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
> +					      GFP_KERNEL, numa_node_id());
> +	BUG_ON(!task_autonuma);
> +	task_autonuma_reset(task_autonuma);
> +	BUG_ON(current->task_autonuma);
> +	current->task_autonuma = task_autonuma;
> +}
> +
> +static struct kmem_cache *mm_autonuma_cachep;
> +
> +int alloc_mm_autonuma(struct mm_struct *mm)
> +{
> +	int err = 1;
> +	struct mm_autonuma *mm_autonuma;
> +
> +	if (!autonuma_possible())
> +		goto no_numa;
> +	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
> +	if (!mm_autonuma)
> +		goto out;
> +	if (!autonuma_child_inheritance() || !mm->mm_autonuma)
> +		mm_autonuma_reset(mm_autonuma);
> +	else
> +		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
> +
> +	/*
> +	 * We're not leaking memory here, if mm->mm_autonuma is not
> +	 * zero it's a not refcounted copy of the parent's
> +	 * mm->mm_autonuma pointer.
> +	 */
> +	mm->mm_autonuma = mm_autonuma;
> +	mm_autonuma->mm = mm;
> +no_numa:
> +	err = 0;
> +out:
> +	return err;
> +}
> +
> +void free_mm_autonuma(struct mm_struct *mm)
> +{
> +	if (!autonuma_possible()) {
> +		BUG_ON(mm->mm_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!mm->mm_autonuma);
> +	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
> +	mm->mm_autonuma = NULL;
> +}
> +
> +void __init mm_autonuma_init(void)
> +{
> +	BUG_ON(current != &init_task);
> +	BUG_ON(current->mm);
> +
> +	if (!autonuma_possible())
> +		return;
> +
> +	mm_autonuma_cachep =
> +		kmem_cache_create("mm_autonuma",
> +				  mm_autonuma_size(), 0,
> +				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
> +}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 25e262a..edee54d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1036,6 +1036,40 @@ out:
>  	return page;
>  }
>  
> +#ifdef CONFIG_AUTONUMA
> +/* NUMA hinting page fault entry point for trans huge pmds */
> +int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
> +			pmd_t pmd, pmd_t *pmdp)
> +{
> +	struct page *page;
> +	bool migrated;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(pmd, *pmdp)))
> +		goto out_unlock;
> +
> +	page = pmd_page(pmd);
> +	pmd = pmd_mknonnuma(pmd);
> +	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
> +	VM_BUG_ON(pmd_numa(*pmdp));
> +	if (unlikely(page_mapcount(page) != 1))
> +		goto out_unlock;
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	migrated = numa_hinting_fault(page, HPAGE_PMD_NR);
> +	if (!migrated)
> +		put_page(page);
> +
> +out:
> +	return 0;
> +
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +	goto out;
> +}
> +#endif
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 pmd_t *pmd, unsigned long addr)
>  {
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA
  2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
  2012-10-04 20:03   ` KOSAKI Motohiro
@ 2012-10-11 18:32   ` Mel Gorman
  1 sibling, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:02AM +0200, Andrea Arcangeli wrote:
> If an task_selected_nid has already been selected for the task, try to
> allocate memory from it even if it's temporarily not the local
> node. Chances are it's where most of its memory is already located and
> where it will run in the future.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 21/33] autonuma: call autonuma_split_huge_page()
  2012-10-03 23:51 ` [PATCH 21/33] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
@ 2012-10-11 18:33   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:03AM +0200, Andrea Arcangeli wrote:
> This transfers the autonuma_last_nid information to all tail pages
> during split_huge_page.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 22/33] autonuma: make khugepaged pte_numa aware
  2012-10-03 23:51 ` [PATCH 22/33] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
@ 2012-10-11 18:36   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:04AM +0200, Andrea Arcangeli wrote:
> If any of the ptes that khugepaged is collapsing was a pte_numa, the
> resulting trans huge pmd will be a pmd_numa too.
> 
> See the comment inline for why we require just one pte_numa pte to
> make a pmd_numa pmd. If needed later we could change the number of
> pte_numa ptes required to create a pmd_numa and make it tunable with
> sysfs too.
> 

It does increase the number of NUMA hinting faults that are incurred though,
potentially offsetting the gains from using THP. Is this something that
would just go away when THP pages are natively migrated by autonuma?
Does it make a measurable improvement now?

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/huge_memory.c |   33 +++++++++++++++++++++++++++++++--
>  1 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 152d4dd..1023e67 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1833,12 +1833,19 @@ out:
>  	return isolated;
>  }
>  
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> +/*
> + * Do the actual data copy for mapped ptes and release the mapped
> + * pages, or alternatively zero out the transparent hugepage in the
> + * mapping holes. Transfer the page_autonuma information in the
> + * process. Return true if any of the mapped ptes was of numa type.
> + */
> +static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  				      struct vm_area_struct *vma,
>  				      unsigned long address,
>  				      spinlock_t *ptl)
>  {
>  	pte_t *_pte;
> +	bool mknuma = false;
>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
>  		pte_t pteval = *_pte;
>  		struct page *src_page;
> @@ -1865,11 +1872,29 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			page_remove_rmap(src_page);
>  			spin_unlock(ptl);
>  			free_page_and_swap_cache(src_page);
> +
> +			/*
> +			 * Only require one pte_numa mapped by a pmd
> +			 * to make it a pmd_numa, too. To avoid the
> +			 * risk of losing NUMA hinting page faults, it
> +			 * is better to overestimate the NUMA node
> +			 * affinity with a node where we just
> +			 * collapsed a hugepage, rather than
> +			 * underestimate it.
> +			 *
> +			 * Note: if AUTONUMA_SCAN_PMD_FLAG is set, we
> +			 * won't find any pte_numa ptes since we're
> +			 * only setting NUMA hinting at the pmd
> +			 * level.
> +			 */
> +			mknuma |= pte_numa(pteval);
>  		}
>  
>  		address += PAGE_SIZE;
>  		page++;
>  	}
> +
> +	return mknuma;
>  }
>  
>  static void collapse_huge_page(struct mm_struct *mm,
> @@ -1887,6 +1912,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	spinlock_t *ptl;
>  	int isolated;
>  	unsigned long hstart, hend;
> +	bool mknuma = false;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  #ifndef CONFIG_NUMA
> @@ -2005,7 +2031,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	anon_vma_unlock(vma->anon_vma);
>  
> -	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
> +	mknuma = pmd_numa(_pmd);
> +	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
>  	pte_unmap(pte);
>  	__SetPageUptodate(new_page);
>  	pgtable = pmd_pgtable(_pmd);
> @@ -2015,6 +2042,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	_pmd = mk_pmd(new_page, vma->vm_page_prot);
>  	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>  	_pmd = pmd_mkhuge(_pmd);
> +	if (mknuma)
> +		_pmd = pmd_mknuma(_pmd);
>  
>  	/*
>  	 * spin_lock() below is not the equivalent of smp_wmb(), so
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 23/33] autonuma: retain page last_nid information in khugepaged
  2012-10-03 23:51 ` [PATCH 23/33] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
@ 2012-10-11 18:44   ` Mel Gorman
  2012-10-12 11:37     ` Rik van Riel
  0 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:05AM +0200, Andrea Arcangeli wrote:
> When pages are collapsed try to keep the last_nid information from one
> of the original pages.
> 

If two pages within a THP disagree on the node, should the collapsing be
aborted? I would expect that the cost of a remote access exceeds the
gain from reduced TLB overhead.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/huge_memory.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1023e67..78b2851 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1846,6 +1846,9 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  {
>  	pte_t *_pte;
>  	bool mknuma = false;
> +#ifdef CONFIG_AUTONUMA
> +	int autonuma_last_nid = -1;
> +#endif
>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
>  		pte_t pteval = *_pte;
>  		struct page *src_page;
> @@ -1855,6 +1858,17 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>  		} else {
>  			src_page = pte_page(pteval);
> +#ifdef CONFIG_AUTONUMA
> +			/* pick the first one, better than nothing */
> +			if (autonuma_last_nid < 0) {
> +				autonuma_last_nid =
> +					ACCESS_ONCE(src_page->
> +						    autonuma_last_nid);
> +				if (autonuma_last_nid >= 0)
> +					ACCESS_ONCE(page->autonuma_last_nid) =
> +						autonuma_last_nid;
> +			}
> +#endif
>  			copy_user_highpage(page, src_page, address, vma);
>  			VM_BUG_ON(page_mapcount(src_page) != 1);
>  			release_pte_page(src_page);
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-10-03 23:51 ` [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte Andrea Arcangeli
@ 2012-10-11 18:45   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:06AM +0200, Andrea Arcangeli wrote:
> When we split a transparent hugepage, transfer the NUMA type from the
> pmd to the pte if needed.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 25/33] autonuma: numa hinting page faults entry points
  2012-10-03 23:51 ` [PATCH 25/33] autonuma: numa hinting page faults entry points Andrea Arcangeli
@ 2012-10-11 18:47   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:07AM +0200, Andrea Arcangeli wrote:
> This is where the numa hinting page faults are detected and are passed
> over to the AutoNUMA core logic.
> 

So other than the naming of the entry points which I whinged about
already

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-10-03 23:51 ` [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
@ 2012-10-11 18:50   ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 18:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:51:10AM +0200, Andrea Arcangeli wrote:
> Add the config options to allow building the kernel with AutoNUMA.
> 
> If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
> /sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
> be enabled automatically at boot.
> 

And it's disabled by default for now. I'd prefer it it was default y for
building if CONFIG_NUMA but disabled at runtime for now but otherwise

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
  2012-10-11 16:07       ` Andrea Arcangeli
@ 2012-10-11 19:37         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 19:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:07:02PM +0200, Andrea Arcangeli wrote:
> Hi,
> 
> On Thu, Oct 11, 2012 at 11:50:36AM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> > > +The AutoNUMA logic is a chain reaction resulting from the actions of
> > > +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically
> > 
> > s/knum_scand/knuma_scand/
> 
> Applied.
> 
> > > +scans the mm structures of all active processes. It gathers the
> > > +AutoNUMA mm statistics for each "anon" page in the process's working
> > 
> > Ok, so this will not make a difference to file-based workloads but as I
> > mentioned in the leader this would be a difficult proposition anyway
> > because if it's read/write based, you'll have no statistics.
> 
> Oops sorry for the confusion but the doc is wrong on this one: it
> actually tracks anything with a page_mapcount == 1, even if that is
> pagecache or even .text as long as it's only mapped in a single
> process. So if you've a threaded database doing a gigantic MAP_SHARED,
> it'll track and move around the whole MAP_SHARED as well as anonymous
> memory or anything that can be moved.
> 

Ok, I would have expected MAP_PRIVATE in this case but I get your point.

> Changed to:
> 
> +AutoNUMA mm statistics for each not shared page in the process's
> 

Better.

> > > +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> > > +present bit in each pte or pmd that was counted. This triggers NUMA
> > > +hinting page faults described next.
> > > +
> > > +The mm statistics are expentially decayed by dividing the total memory
> > > +in half and adding the new totals to the decayed values for each
> > > +knuma_scand pass. This causes the mm statistics to resemble a simple
> > > +forecasting model, taking into account some past working set data.
> > > +
> > > +=== NUMA hinting fault ===
> > > +
> > > +A NUMA hinting fault occurs when a task running on a CPU thread
> > > +accesses a vma whose pte or pmd is not present and the NUMA bit is
> > > +set. The NUMA hinting page fault handler returns the pte or pmd back
> > > +to its present state and counts the fault's occurance in the
> > > +task_autonuma structure.
> > > +
> > 
> > So, minimally one source of System CPU overhead will be increased traps.
> 
> Correct.
> 
> It takes down 128M every 100msec, and then when it has finished taking
> down everything it sleeps 10sec, then increases the pass_counter and
> restarts. It's not measurable, even if I do a kernel build with -j128
> in tmpfs the performance is identical with autonuma running or not.
> 

Ok, I see it clearly now, particularly after reading the series. It does
mean a CPU spike every 10 seconds but it'll be detectable if it's a problem.
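
As a rough mental model (a sketch only, not the code from the series;
scan_chunk() and full_pass_done() are hypothetical helpers), the pacing
described above maps onto the sysfs tunables from the earlier hunk
roughly like this:

	for (;;) {
		/* clear present / set NUMA on ~128M worth of memory */
		scan_chunk(pages_to_scan);
		if (full_pass_done()) {
			full_scans++;
			/* long sleep (~10s) between full passes */
			msleep(scan_sleep_pass_millisecs);
		} else {
			/* short sleep (~100ms) between chunks */
			msleep(scan_sleep_millisecs);
		}
	}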

> > I haven't seen the code yet obviously but I wonder if this gets accounted
> > for as a minor fault? If it does, how can we distinguish between minor
> > faults and numa hinting faults? If not, is it possible to get any idea of
> > how many numa hinting faults were incurred? Mention it here.
> 
> Yes, it's surely accounted as minor fault. To monitor it normally I
> use:
> 
> perf probe numa_hinting_fault
> perf record -e probe:numa_hinting_fault -aR -g sleep 10
> perf report -g
> 

Ok, straight-forward. Also can be recorded with trace-cmd obviously once
the probe is in place.

> # Samples: 345  of event 'probe:numa_hinting_fault'
> # Event count (approx.): 345
> #
> # Overhead  Command      Shared Object                  Symbol
> # ........  .......  .................  ......................
> #
>     64.64%     perf  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--57.40%-- sig_handler
>                   |          |          
>                   |          |--62.50%-- run_builtin
>                   |          |          main
>                   |          |          __libc_start_main
>                   |          |          
>                   |           --37.50%-- 0x7f47f7c6cba0
>                   |                     run_builtin
>                   |                     main
>                   |                     __libc_start_main
>                   |          
>                   |--16.59%-- __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.87%-- 0x7f47f7c6cba0
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.42%-- save_i387_xstate
>                   |          do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                    --6.73%-- sys_poll
>                              system_call_fastpath
>                              __poll
> 
>     21.45%     ntpd  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--66.22%-- 0x42b910
>                   |          0x0
>                   |          
>                   |--24.32%-- __select
>                   |          0x0
>                   |          
>                   |--4.05%-- do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __select
>                   |          0x0
>                   |          
>                   |--2.70%-- 0x7f88827b3ba0
>                   |          0x0
>                   |          
>                    --2.70%-- clock_gettime
>                              0x1a1eb808
> 
>      7.83%     init  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--33.33%-- __select
>                   |          0x0
>                   |          
>                   |--29.63%-- 0x404e0c
>                   |          0x0
>                   |          
>                   |--18.52%-- 0x405820
>                   |          
>                   |--11.11%-- sys_select
>                   |          system_call_fastpath
>                   |          __select
>                   |          0x0
>                   |          
>                    --7.41%-- 0x402528
> 
>      6.09%    sleep  [kernel.kallsyms]  [k] numa_hinting_fault
>               |
>               --- numa_hinting_fault
>                   handle_mm_fault
>                   do_page_fault
>                   page_fault
>                  |          
>                  |--42.86%-- 0x7f0f67847fe0
>                  |          0x7fff4cd6d42b
>                  |          
>                  |--28.57%-- 0x404007
>                  |          
>                  |--19.05%-- nanosleep
>                  |          
>                   --9.52%-- 0x4016d0
>                             0x7fff4cd6d42b
> 
> 
> Chances are we want to add more vmstat for this event.
> 
> > > +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> > > +
> > > +- Increments the total number of pages faulted for this task
> > > +
> > > +- Increments the number of pages faulted on the current NUMA node
> > > +
> > 
> > So, am I correct in assuming that the rate of NUMA hinting faults will be
> > related to the scan rate of knuma_scand?
> 
> This is correct. They're identical.
> 
> There's a slight chance that two threads hit the fault on the same
> pte/pmd_numa concurrently, but just one of the two will actually
> invoke the numa_hinting_fault() function.
> 

Ok, I see they'll be serialised by the PTL anyway.

> > > +- If the fault was for an hugepage, the number of subpages represented
> > > +  by an hugepage is added to the task statistics above
> > > +
> > > +- Each time the NUMA hinting page fault discoveres that another
> > 
> > s/discoveres/discovers/
> 
> Fixed.
> 
> > 
> > > +  knuma_scand pass has occurred, it divides the total number of pages
> > > +  and the pages for each NUMA node in half. This causes the task
> > > +  statistics to be exponentially decayed, just as the mm statistics
> > > +  are. Thus, the task statistics also resemble a simple forcasting
> 
> Also noticed forecasting ;).
> 
> > > +  model, taking into account some past NUMA hinting fault data.
> > > +
> > > +If the page being accessed is on the current NUMA node (same as the
> > > +task), the NUMA hinting fault handler only records the nid of the
> > > +current NUMA node in the page_autonuma structure field last_nid and
> > > +then it'd done.
> > > +
> > > +Othewise, it checks if the nid of the current NUMA node matches the
> > > +last_nid in the page_autonuma structure. If it matches it means it's
> > > +the second NUMA hinting fault for the page occurring (on a subsequent
> > > +pass of the knuma_scand daemon) from the current NUMA node.
> > 
> > You don't spell it out, but this is effectively a migration threshold N
> > where N is the number of remote NUMA hinting faults that must be
> > incurred before migration happens. The default value of this threshold
> > is 2.
> > 
> > Is that accurate? If so, why 2?
> 
> More like 1. It needs one confirmation that the migrate request comes from
> the same node again (note: it is allowed to come from a different
> thread as long as it's the same node and that is very important).
> 
> Why only 1 confirmation? It's the same as page aging. We could record
> the number of pagecache lookup hits, and not just have a single bit as
> reference count. But doing so, if the workload radically changes it
> takes too much time to adapt to the new configuration and so I usually
> don't like counting.
> 
> Plus I avoided as much as possible fixed numbers. I can explain why 0
> or 1, but I can't as easily explain why 5 or 8, so if I can't explain
> it, I avoid it.
> 

That's fair enough. Expressing in terms of page aging is reasonable. One
could argue for any number but ultimately it'll be related to the
workload. It's all part of the "ping-pong" detection problem. Include
this blurb in the docs.
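
Expressed as a sketch (not the exact code from the series, and
should_migrate() is only an illustrative name), the last_nid rule being
discussed amounts to roughly:

	static bool should_migrate(struct page *page, int this_nid)
	{
		/* first remote fault from this node: only record it */
		if (ACCESS_ONCE(page->autonuma_last_nid) != this_nid) {
			ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
			return false;
		}
		/* repeated fault from the same node: confirmed, migrate */
		return true;
	}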

> > I don't have a better suggestion, it's just an obvious source of an
> > adverse workload that could force a lot of migrations by faulting once
> > per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.
> 
> Correct, for certain workloads like single instance specjbb that
> wasn't enough, but it is fixed in autonuma28, now it's faster even on
> single instance.
> 

Ok.

> > I'm assuming it must be async migration then. IO in progress would be
> > a bit of a surprise though! It would have to be a mapped anonymous page
> > being written to swap.
> 
> It's all migrate on fault now, but I'm using all methods you implemented to
> avoid compaction to block in migrate_pages.
> 

Excellent. There are other places where autonuma may need to back off if
contention is detected but it can be incrementally addressed.

> > > +=== Task exchange ===
> > > +
> > > +The following defines "weight" in the AutoNUMA balance routine's
> > > +algorithm.
> > > +
> > > +If the tasks are threads of the same process:
> > > +
> > > +    weight = task weight for the NUMA node (since memory weights are
> > > +             the same)
> > > +
> > > +If the tasks are not threads of the same process:
> > > +
> > > +    weight = memory weight for the NUMA node (prefer to move the task
> > > +             to the memory)
> > > +
> > > +The following algorithm determines if the current task will be
> > > +exchanged with a running task on a remote NUMA node:
> > > +
> > > +    this_diff: Weight of the current task on the remote NUMA node
> > > +               minus its weight on the current NUMA node (only used if
> > > +               a positive value). How much does the current task
> > > +               prefer to run on the remote NUMA node.
> > > +
> > > +    other_diff: Weight of the current task on the remote NUMA node
> > > +                minus the weight of the other task on the same remote
> > > +                NUMA node (only used if a positive value). How much
> > > +                does the current task prefer to run on the remote NUMA
> > > +                node compared to the other task.
> > > +
> > > +    total_weight_diff = this_diff + other_diff
> > > +
> > > +    total_weight_diff: How favorable it is to exchange the two tasks.
> > > +                       The pair of tasks with the highest
> > > +                       total_weight_diff (if any) are selected for
> > > +                       exchange.
> > > +
> > > +As mentioned above, if the two tasks are threads of the same process,
> > > +the AutoNUMA balance routine uses the task_autonuma statistics. By
> > > +using the task_autonuma statistics, each thread follows its own memory
> > > +locality and they will not necessarily converge on the same node. This
> > > +is often very desirable for processes with more threads than CPUs on
> > > +each NUMA node.
> > > +
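
One literal reading of the two diffs above, as a sketch (the weight_*
names are illustrative stand-ins for the task/memory weights just
described):

	/* "only used if a positive value": negative diffs count as zero */
	this_diff = weight_current_on_remote - weight_current_on_local;
	if (this_diff < 0)
		this_diff = 0;
	other_diff = weight_current_on_remote - weight_other_on_remote;
	if (other_diff < 0)
		other_diff = 0;
	total_weight_diff = this_diff + other_diff;
	/* the candidate pair with the highest total_weight_diff wins */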
> > 
> > What about the case where two threads on different CPUs are accessing
> 
> I assume on different nodes (if the cpus are in the same node, the
> above won't kick in).
> 

Yes, on different nodes is the case I care about here.

> > separate structures that are not page-aligned (base or huge page but huge
> > page would be obviously worse). Does this cause a ping-pong effect or
> > otherwise mess up the statistics?
> 
> Very good point! This is exactly what I call NUMA false sharing and
> it's the biggest nightmare in this whole effort.
> 
> So if there's a huge amount of this over time the statistics will be
> around 50/50 (the statistics just record the working set of the
> thread).
> 
> So if there's another process (a process, not a thread) heavily computing, the
> 50/50 won't be used and the mm statistics will be used instead to
> balance the two threads against the other process. And the two threads
> will converge in the same node, and then their thread statistics will
> change from 50/50 to 0/100 matching the mm statistics.
> 
> If there are just threads and they're all doing what you describe
> above with all their memory, well then the problem has no solution,
> and the new stuff in autonuma28 will deal with that too.
> 

Ok, I'll keep an eye out for it. 

> Ideally we should do MADV_INTERLEAVE, I didn't get that far yet but I
> probably could now.
> 

I like the idea that if autonuma was able to detect that it was
ping-ponging it would set MADV_INTERLEAVE and remove the task from the
list of address spaces to scan entirely. It would require more state in
the task struct.

> Even without the new stuff it wasn't too bad but there were a bit too
> many spurious migrations in that load with autonuma27 and previous. It
> was less spurious on bigger systems with many nodes because last_nid
> is implicitly more accurate there (as last_nid will have more possible
> values than 0|1). With autonuma28 even on 2 nodes it's perfectly fine.
> 
> If it's just 1 page false sharing and all the rest is thread-local,
> the statistics will be 99/1 and the false sharing will be lost in the
> noise.
> 
> The false sharing spillover caused by alignments is minor if the
> threads are really computing on a lot of local memory so it's not a
> concern and it will be optimized away by the last_nid plus the new
> stuff.
> 

Ok, so there will be examples of when this works and counter-examples
and that is unavoidable. At least the cases where it goes wrong are
known in advance. Ideally there would be a few statistics to help track
when that is going wrong.

pages_migrate_success		migrate.c
pages_migrate_fail		migrate.c
  (tracepoint to distinguish between compaction and autonuma, delete
   the equivalent counters in compaction.c. Monitoring the success
   counter over time will allow an estimate of how much inter-node
   traffic is due to autonuma)

thread_migrate_numa		autonuma.c
  (rapidly increasing count implies ping-pong. A count that is static
  indicates that it has converged. Potentially could be aggregated
  for a whole address space to see if an overall application has
  converged or not.)

PMU for remote numa accesses
  (should decrease with autonuma if working perfectly. Will increase it
   when ping-pong is occurring)

perf probe do_numahint_page() / do_numahint_pmd()
  (tracks overhead incurred from the faults, also shows up in profile)

I'm not suggesting this all has to be implemented but it'd be nice to know
in advance how we'll identify both when autonuma is getting things right
and when it's getting it wrong.
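
A minimal sketch of the first pair of counters (the event names are
hypothetical and would need matching vm_event_item entries;
count_vm_event() is the existing vmstat interface):

	/* in the migrate-on-fault path */
	if (migrated)
		count_vm_event(AUTONUMA_MIGRATE_SUCCESS);
	else
		count_vm_event(AUTONUMA_MIGRATE_FAIL);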

> > Ok, very obviously this will never be an RT feature but that is hardly
> > a surprise and anyone who tries to enable this for RT needs their head
> > examined. I'm not suggesting you do it but people running detailed
> > performance analysis on scheduler-intensive workloads might want to keep
> > an eye on their latency and jitter figures and how they are affected by
> > this exchanging. Does ftrace show a noticeable increase in wakeup latencies
> > for example?
> 
> If you do:
> 
> echo 1 >/sys/kernel/mm/autonuma/debug
> 
> you will get 1 printk every single time sched_autonuma_balance
> triggers a task exchange.
> 

Yeah, ideally we would get away from that over time though and move to
tracepoints so it can be gathered by trace-cmd or perf depending on the
exact scenario.

> With autonuma28 I resolved a lot of the jittering and now there are
> 6/7 printk for the whole 198 seconds of numa01. CFS runs in autopilot
> all the time.
> 

Good. trace-cmd with the wakeup plugin might also be able to detect
interference in scheduling latencies.

> With specjbb x2 overcommit, the active balancing events are reduced to
> one every few sec (vs several per sec with autonuma27). In fact the
> specjbb x2 overcommit load jumped ahead too with autonuma28.
> 
> About tracing events, the git branch already has tracing events to
> monitor all page and task migrations showed in an awesome "perf script
> numatop" from Andrew.

Cool!

> Likely we need one tracing event to see the task
> exchange generated specifically by the autonuma balancing event (we're
> running short in event columns to show it in numatop though ;). Right
> now that is only available as the printk above.
> 

Ok.
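
The task-exchange tracing event mentioned above could be a standard
TRACE_EVENT; a sketch with an illustrative name and fields:

	TRACE_EVENT(autonuma_task_exchange,
		TP_PROTO(pid_t pid, int src_nid, int dst_nid),
		TP_ARGS(pid, src_nid, dst_nid),
		TP_STRUCT__entry(
			__field(pid_t, pid)
			__field(int, src_nid)
			__field(int, dst_nid)
		),
		TP_fast_assign(
			__entry->pid = pid;
			__entry->src_nid = src_nid;
			__entry->dst_nid = dst_nid;
		),
		TP_printk("pid=%d src_nid=%d dst_nid=%d",
			  __entry->pid, __entry->src_nid, __entry->dst_nid)
	);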

> > > +=== task_autonuma - per task AutoNUMA data ===
> > > +
> > > +The task_autonuma structure is used to hold AutoNUMA data required for
> > > +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> > > +nodes.
> > > +
> > > +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> > > +                scheduler balancing code, -1 if none (2 bytes)
> > > +
> > > +- Task NUMA statistics for this thread/process:
> > > +
> > > +    Total number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes)
> > > +
> > > +    Per NUMA node number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes * # of NUMA nodes)
> > > +
> > 
> > It might be possible to put a coarse ping-pong detection counter in here
> > as well by recording a declaying average of number of pages migrated
> > over a number of knuma_scand passes instead of just the last one.  If the
> > value is too high, you're ping-ponging and the process should be ignored,
> > possibly forever. It's not a requirement and it would be more memory
> > overhead obviously but I'm throwing it out there as a suggestion if it
> > ever turns out the ping-pong problem is real.
> 
> Yes, this is a problem where we've an enormous degree in trying
> things, so your suggestions are very appreciated :).
> 
> About ping ponging of CPU I never seen it yet (even if it's 550/450,
> it rarely switches over from 450/550, and even it does, it doesn't
> really change anything because it's a fairly rare event and one node
> is not more right than the other anyway).
> 

I expect in practice that it's very rare that there is a workload that
does not or cannot align its data structures. It'll happen at least once
so it'd be nice to be able to identify that quickly when it does.
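
For what it's worth, the decaying-average counter suggested earlier
could be as small as the sketch below (migrate_avg and the threshold
are illustrative, not fields from the series):

	static bool mm_ping_ponging(struct mm_autonuma *mm_autonuma,
				    unsigned long migrated_this_pass)
	{
		/* halve the old average each knuma_scand pass, then add */
		mm_autonuma->migrate_avg =
			(mm_autonuma->migrate_avg >> 1) + migrated_this_pass;
		return mm_autonuma->migrate_avg > PING_PONG_THRESHOLD;
	}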

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt
@ 2012-10-11 19:37         ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 19:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:07:02PM +0200, Andrea Arcangeli wrote:
> Hi,
> 
> On Thu, Oct 11, 2012 at 11:50:36AM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:43AM +0200, Andrea Arcangeli wrote:
> > > +The AutoNUMA logic is a chain reaction resulting from the actions of
> > > +the AutoNUMA daemon, knum_scand. The knuma_scand daemon periodically
> > 
> > s/knum_scand/knuma_scand/
> 
> Applied.
> 
> > > +scans the mm structures of all active processes. It gathers the
> > > +AutoNUMA mm statistics for each "anon" page in the process's working
> > 
> > Ok, so this will not make a different to file-based workloads but as I
> > mentioned in the leader this would be a difficult proposition anyway
> > because if it's read/write based, you'll have no statistics.
> 
> Oops sorry for the confusion but the the doc is wrong on this one: it
> actually tracks anything with a page_mapcount == 1, even if that is
> pagecache or even .text as long as it's only mapped in a single
> process. So if you've a threaded database doing a gigantic MAP_SHARED,
> it'll track and move around the whole MAP_SHARED as well as anonymous
> memory or anything that can be moved.
> 

Ok, I would have expected MAP_PRIVATE in this case but I get your point.

> Changed to:
> 
> +AutoNUMA mm statistics for each not shared page in the process's
> 

Better.

> > > +set. While scanning, knuma_scand also sets the NUMA bit and clears the
> > > +present bit in each pte or pmd that was counted. This triggers NUMA
> > > +hinting page faults described next.
> > > +
> > > +The mm statistics are expentially decayed by dividing the total memory
> > > +in half and adding the new totals to the decayed values for each
> > > +knuma_scand pass. This causes the mm statistics to resemble a simple
> > > +forecasting model, taking into account some past working set data.
> > > +
> > > +=== NUMA hinting fault ===
> > > +
> > > +A NUMA hinting fault occurs when a task running on a CPU thread
> > > +accesses a vma whose pte or pmd is not present and the NUMA bit is
> > > +set. The NUMA hinting page fault handler returns the pte or pmd back
> > > +to its present state and counts the fault's occurance in the
> > > +task_autonuma structure.
> > > +
> > 
> > So, minimally one source of System CPU overhead will be increased traps.
> 
> Correct.
> 
> It takes down 128M every 100msec, and then when it finished taking
> down everything it sleeps 10sec, then increases the pass_counter and
> restarts. It's not measurable, even if I do a kernel build with -j128
> in tmpfs the performance is identical with autonuma running or not.
> 

Ok, I see it clearly now, particularly after reading the series. It does
mean a CPU spike every 10 seconds but it'll be detectable if it's a problem.

> > I haven't seen the code yet obviously but I wonder if this gets accounted
> > for as a minor fault? If it does, how can we distinguish between minor
> > faults and numa hinting faults? If not, is it possible to get any idea of
> > how many numa hinting faults were incurred? Mention it here.
> 
> Yes, it's surely accounted as minor fault. To monitor it normally I
> use:
> 
> perf probe numa_hinting_fault
> perf record -e probe:numa_hinting_fault -aR -g sleep 10
> perf report -g
> 

Ok, straight-forward. Also can be recorded with trace-cmd obviously once
the probe is in place.

> # Samples: 345  of event 'probe:numa_hinting_fault'
> # Event count (approx.): 345
> #
> # Overhead  Command      Shared Object                  Symbol
> # ........  .......  .................  ......................
> #
>     64.64%     perf  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--57.40%-- sig_handler
>                   |          |          
>                   |          |--62.50%-- run_builtin
>                   |          |          main
>                   |          |          __libc_start_main
>                   |          |          
>                   |           --37.50%-- 0x7f47f7c6cba0
>                   |                     run_builtin
>                   |                     main
>                   |                     __libc_start_main
>                   |          
>                   |--16.59%-- __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.87%-- 0x7f47f7c6cba0
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                   |--9.42%-- save_i387_xstate
>                   |          do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __poll
>                   |          run_builtin
>                   |          main
>                   |          __libc_start_main
>                   |          
>                    --6.73%-- sys_poll
>                              system_call_fastpath
>                              __poll
> 
>     21.45%     ntpd  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--66.22%-- 0x42b910
>                   |          0x0
>                   |          
>                   |--24.32%-- __select
>                   |          0x0
>                   |          
>                   |--4.05%-- do_signal
>                   |          do_notify_resume
>                   |          int_signal
>                   |          __select
>                   |          0x0
>                   |          
>                   |--2.70%-- 0x7f88827b3ba0
>                   |          0x0
>                   |          
>                    --2.70%-- clock_gettime
>                              0x1a1eb808
> 
>      7.83%     init  [kernel.kallsyms]  [k] numa_hinting_fault
>                |
>                --- numa_hinting_fault
>                    handle_mm_fault
>                    do_page_fault
>                    page_fault
>                   |          
>                   |--33.33%-- __select
>                   |          0x0
>                   |          
>                   |--29.63%-- 0x404e0c
>                   |          0x0
>                   |          
>                   |--18.52%-- 0x405820
>                   |          
>                   |--11.11%-- sys_select
>                   |          system_call_fastpath
>                   |          __select
>                   |          0x0
>                   |          
>                    --7.41%-- 0x402528
> 
>      6.09%    sleep  [kernel.kallsyms]  [k] numa_hinting_fault
>               |
>               --- numa_hinting_fault
>                   handle_mm_fault
>                   do_page_fault
>                   page_fault
>                  |          
>                  |--42.86%-- 0x7f0f67847fe0
>                  |          0x7fff4cd6d42b
>                  |          
>                  |--28.57%-- 0x404007
>                  |          
>                  |--19.05%-- nanosleep
>                  |          
>                   --9.52%-- 0x4016d0
>                             0x7fff4cd6d42b
> 
> 
> Chances are we want to add more vmstat for this event.
> 
> > > +The NUMA hinting fault gathers the AutoNUMA task statistics as follows:
> > > +
> > > +- Increments the total number of pages faulted for this task
> > > +
> > > +- Increments the number of pages faulted on the current NUMA node
> > > +
> > 
> > So, am I correct in assuming that the rate of NUMA hinting faults will be
> > related to the scan rate of knuma_scand?
> 
> This is correct. They're identical.
> 
> There's a slight chance that two threads hit the fault on the same
> pte/pmd_numa concurrently, but just one of the two will actually
> invoke the numa_hinting_fault() function.
> 

Ok, I see they'll be serialised by the PTL anyway.

> > > +- If the fault was for an hugepage, the number of subpages represented
> > > +  by an hugepage is added to the task statistics above
> > > +
> > > +- Each time the NUMA hinting page fault discoveres that another
> > 
> > s/discoveres/discovers/
> 
> Fixed.
> 
> > 
> > > +  knuma_scand pass has occurred, it divides the total number of pages
> > > +  and the pages for each NUMA node in half. This causes the task
> > > +  statistics to be exponentially decayed, just as the mm statistics
> > > +  are. Thus, the task statistics also resemble a simple forcasting
> 
> Also noticed forecasting ;).
> 
> > > +  model, taking into account some past NUMA hinting fault data.
> > > +
> > > +If the page being accessed is on the current NUMA node (same as the
> > > +task), the NUMA hinting fault handler only records the nid of the
> > > +current NUMA node in the page_autonuma structure field last_nid and
> > > +then it'd done.
> > > +
> > > +Othewise, it checks if the nid of the current NUMA node matches the
> > > +last_nid in the page_autonuma structure. If it matches it means it's
> > > +the second NUMA hinting fault for the page occurring (on a subsequent
> > > +pass of the knuma_scand daemon) from the current NUMA node.
> > 
> > You don't spell it out, but this is effectively a migration threshold N
> > where N is the number of remote NUMA hinting faults that must be
> > incurred before migration happens. The default value of this threshold
> > is 2.
> > 
> > Is that accurate? If so, why 2?
> 
> More like 1. It needs one confirmation the migrate request come from
> the same node again (note: it is allowed to come from a different
> threads as long as it's the same node and that is very important).
> 
> Why only 1 confirmation? It's the same as page aging. We could record
> the number of pagecache lookup hits, and not just have a single bit as
> reference count. But doing so, if the workload radically changes it
> takes too much time to adapt to the new configuration and so I usually
> don't like counting.
> 
> Plus I avoided as much as possible fixed numbers. I can explain why 0
> or 1, but I can't as easily explain why 5 or 8, so if I can't explain
> it, I avoid it.
> 

That's fair enough. Expressing in terms of page aging is reasonable. One
could argue for any number but ultimately it'll be related to the
workload. It's all part of the "ping-pong" detection problem. Include
this blurb in the docs.

> > I don't have a better suggestion, it's just an obvious source of an
> > adverse workload that could force a lot of migrations by faulting once
> > per knuma_scand cycle and scheduling itself on a remote CPU every 2 cycles.
> 
> Correct, for certain workloads like single instance specjbb that
> wasn't enough, but it is fixed in autonuma28, now it's faster even on
> single instance.
> 

Ok.

> > I'm assuming it must be async migration then. IO in progress would be
> > a bit of a surprise though! It would have to be a mapped anonymous page
> > being written to swap.
> 
> It's all migrate on fault now, but I'm using all methods you implemented to
> avoid compaction to block in migrate_pages.
> 

Excellent. There are other places where autonuma may need to backoff if
contention is detected but it can be incrementally addressed.

> > > +=== Task exchange ===
> > > +
> > > +The following defines "weight" in the AutoNUMA balance routine's
> > > +algorithm.
> > > +
> > > +If the tasks are threads of the same process:
> > > +
> > > +    weight = task weight for the NUMA node (since memory weights are
> > > +             the same)
> > > +
> > > +If the tasks are not threads of the same process:
> > > +
> > > +    weight = memory weight for the NUMA node (prefer to move the task
> > > +             to the memory)
> > > +
> > > +The following algorithm determines if the current task will be
> > > +exchanged with a running task on a remote NUMA node:
> > > +
> > > +    this_diff: Weight of the current task on the remote NUMA node
> > > +               minus its weight on the current NUMA node (only used if
> > > +               a positive value). How much does the current task
> > > +               prefer to run on the remote NUMA node.
> > > +
> > > +    other_diff: Weight of the current task on the remote NUMA node
> > > +                minus the weight of the other task on the same remote
> > > +                NUMA node (only used if a positive value). How much
> > > +                does the current task prefer to run on the remote NUMA
> > > +                node compared to the other task.
> > > +
> > > +    total_weight_diff = this_diff + other_diff
> > > +
> > > +    total_weight_diff: How favorable it is to exchange the two tasks.
> > > +                       The pair of tasks with the highest
> > > +                       total_weight_diff (if any) are selected for
> > > +                       exchange.
> > > +
> > > +As mentioned above, if the two tasks are threads of the same process,
> > > +the AutoNUMA balance routine uses the task_autonuma statistics. By
> > > +using the task_autonuma statistics, each thread follows its own memory
> > > +locality and they will not necessarily converge on the same node. This
> > > +is often very desirable for processes with more threads than CPUs on
> > > +each NUMA node.
> > > +
> > 
> > What about the case where two threads on different CPUs are accessing
> 
> I assume on different nodes (different cpus if in the same node, the
> above won't kick in).
> 

Yes, on different nodes is the case I care about here.

> > separate structures that are not page-aligned (base or huge page but huge
> > page would be obviously worse). Does this cause a ping-pong effect or
> > otherwise mess up the statistics?
> 
> Very good point! This is exactly what I call NUMA false sharing and
> it's the biggest nightmare in this whole effort.
> 
> So if there's an huge amount of this over time the statistics will be
> around 50/50 (the statistics just record the working set of the
> thread).
> 
> So if there's another process (note: a process, not a thread) heavily
> computing, the 50/50 won't be used and the mm statistics will be used
> instead to
> balance the two threads against the other process. And the two threads
> will converge in the same node, and then their thread statistics will
> change from 50/50 to 0/100 matching the mm statistics.
> 
> If there are just threads and they're all doing what you describe
> above with all their memory, well then the problem has no solution,
> and the new stuff in autonuma28 will deal with that too.
> 

Ok, I'll keep an eye out for it. 

> Ideally we should do MADV_INTERLEAVE, I didn't get that far yet but I
> probably could now.
> 

I like the idea that if autonuma was able to detect that it was
ping-ponging it would set MADV_INTERLEAVE and remove its address space
from the list of address spaces to scan entirely. It would require more
state in
the task struct.

> Even without the new stuff it wasn't too bad but there were a bit too
> many spurious migrations in that load with autonuma27 and previous. It
> was less spurious on bigger systems with many nodes because last_nid
> is implicitly more accurate there (as last_nid will have more possible
> values than 0|1). With autonuma28 even on 2 nodes it's perfectly fine.
> 
> If it's just 1 page false sharing and all the rest is thread-local,
> the statistics will be 99/1 and the false sharing will be lost in the
> noise.
> 
> The false sharing spillover caused by alignments is minor if the
> threads are really computing on a lot of local memory so it's not a
> concern and it will be optimized away by the last_nid plus the new
> stuff.
> 

Ok, so there will be examples of when this works and counter-examples
and that is unavoidable. At least the cases where it goes wrong are
known in advance. Ideally there would be a few statistics to help track
when that is going wrong.

pages_migrate_success		migrate.c
pages_migrate_fail		migrate.c
  (tracepoint to distinguish between compaction and autonuma, delete
   the equivalent counters in compaction.c. Monitoring the success
   counter over time will allow an estimate of how much inter-node
   traffic is due to autonuma)

thread_migrate_numa		autonuma.c
  (rapidly increasing count implies ping-pong. A count that is static
  indicates that it has converged. Potentially could be aggregated
  for a whole address space to see if an overall application has
  converged or not.)

PMU for remote numa accesses
  (should decrease with autonuma if working perfectly. Will increase
   when ping-pong is occurring)

perf probe do_numahint_page() / do_numahint_pmd()
  (tracks overhead incurred from the faults, also shows up in profile)

I'm not suggesting this all has to be implemented but it'd be nice to know
in advance how we'll identify both when autonuma is getting things right
and when it's getting it wrong.
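
A minimal sketch of the first two counters, purely for illustration: the
event names here are invented, only count_vm_events() is the existing
vmstat hook, and the real accounting would live in the migration path:

#include <linux/vmstat.h>

/*
 * Sketch only: PGMIGRATE_AUTONUMA_SUCCESS/_FAIL are hypothetical
 * vm_event_item entries, counted after a migration attempt so that
 * /proc/vmstat exposes the success rate over time.
 */
static void account_autonuma_migrate(int nr_succeeded, int nr_failed)
{
	count_vm_events(PGMIGRATE_AUTONUMA_SUCCESS, nr_succeeded);
	count_vm_events(PGMIGRATE_AUTONUMA_FAIL, nr_failed);
}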

> > Ok, very obviously this will never be an RT feature but that is hardly
> > a surprise and anyone who tries to enable this for RT needs their head
> > examined. I'm not suggesting you do it but people running detailed
> > performance analysis on scheduler-intensive workloads might want to keep
> > an eye on their latency and jitter figures and how they are affected by
> > this exchanging. Does ftrace show a noticable increase in wakeup latencies
> > for example?
> 
> If you do:
> 
> echo 1 >/sys/kernel/mm/autonuma/debug
> 
> you will get 1 printk every single time sched_autonuma_balance
> triggers a task exchange.
> 

Yeah, ideally we would get away from that over time though and move to
tracepoints so it can be gathered by trace-cmd or perf depending on the
exact scenario.

> With autonuma28 I resolved a lot of the jittering and now there are
> 6/7 printk for the whole 198 seconds of numa01. CFS runs in autopilot
> all the time.
> 

Good. trace-cmd with the wakeup plugin might also be able to detect
interference in scheduling latencies.

> With specjbb x2 overcommit, the active balancing events are reduced to
> one every few sec (vs several per sec with autonuma27). In fact the
> specjbb x2 overcommit load jumped ahead too with autonuma28.
> 
> About tracing events, the git branch already has tracing events to
> monitor all page and task migrations showed in an awesome "perf script
> numatop" from Andrew.

Cool!

> Likely we need one tracing event to see the task
> exchange generated specifically by the autonuma balancing event (we're
> running short in event columns to show it in numatop though ;). Right
> now that is only available as the printk above.
> 

Ok.

> > > +=== task_autonuma - per task AutoNUMA data ===
> > > +
> > > +The task_autonuma structure is used to hold AutoNUMA data required for
> > > +each mm task (process/thread). Total size: 10 bytes + 8 * # of NUMA
> > > +nodes.
> > > +
> > > +- selected_nid: preferred NUMA node as determined by the AutoNUMA
> > > +                scheduler balancing code, -1 if none (2 bytes)
> > > +
> > > +- Task NUMA statistics for this thread/process:
> > > +
> > > +    Total number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes)
> > > +
> > > +    Per NUMA node number of NUMA hinting page faults in this pass of
> > > +    knuma_scand (8 bytes * # of NUMA nodes)
> > > +
> > 
> > It might be possible to put a coarse ping-pong detection counter in here
> > as well by recording a decaying average of the number of pages migrated
> > over a number of knuma_scand passes instead of just the last one.  If the
> > value is too high, you're ping-ponging and the process should be ignored,
> > possibly forever. It's not a requirement and it would be more memory
> > overhead obviously but I'm throwing it out there as a suggestion if it
> > ever turns out the ping-pong problem is real.
> 
> Yes, this is a problem where we've an enormous degree of freedom in
> trying things, so your suggestions are very appreciated :).
> 
> About ping-ponging of the CPU, I've never seen it yet (even if it's
> 550/450, it rarely switches over to 450/550, and even if it does, it
> doesn't really change anything because it's a fairly rare event and
> one node is not more right than the other anyway).
> 

I expect in practice that it's very rare that there is a workload that
does not or cannot align its data structures. It'll happen at least once
so it'd be nice to be able to identify that quickly when it does.
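
For the record, the coarse detector I floated above could be as simple as
the following; the structure, field names and threshold are all made up
and it is not part of the posted series:

/*
 * Hypothetical sketch: a decaying average of pages migrated per
 * knuma_scand pass.  If it stays high across passes the task is
 * ping-ponging and could be dropped from scanning.
 */
#define NUMA_PINGPONG_THRESHOLD	256	/* pages per pass, made up */

struct numa_pingpong {
	unsigned long migrate_avg;	/* decaying average */
	int ignore_scan;		/* stop scanning this task/mm */
};

static void numa_pingpong_update(struct numa_pingpong *pp,
				 unsigned long migrated_this_pass)
{
	/* halve the old average and blend in the new sample */
	pp->migrate_avg = (pp->migrate_avg + migrated_this_pass) / 2;
	if (pp->migrate_avg > NUMA_PINGPONG_THRESHOLD)
		pp->ignore_scan = 1;
}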

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
  2012-10-11 16:43       ` Andrea Arcangeli
@ 2012-10-11 19:48         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 19:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:43:00PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 12:01:37PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> > > The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> > > faults to identify the per NUMA node working set of the thread at
> > > runtime.
> > > 
> > > Arming the NUMA hinting page fault mechanism works similarly to
> > > setting up a mprotect(PROT_NONE) virtual range: the present bit is
> > > cleared at the same time that _PAGE_NUMA is set, so when the fault
> > > triggers we can identify it as a NUMA hinting page fault.
> > > 
> > 
> > That implies that there is an atomic update requirement or at least
> > an ordering requirement -- present bit must be cleared before setting
> > NUMA bit. No doubt it'll be clear later in the series how this is
> > accomplished. What you propose seems ok but it all depends how it's
> > implemented so I'm leaving my ack off this particular patch for now.
> 
> Correct. The switch is done atomically (clear _PAGE_PRESENT at the
> same time _PAGE_NUMA is set). The tlb flush is deferred (it's batched
> to avoid firing an IPI for every pte/pmd_numa we establish).
> 

Good. I think you might still be flushing more than you need to but
commented on the patch itself.
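
In other words, the arming step is effectively this transformation (a
sketch using x86-style helpers, not the exact helper from your series):

/*
 * Sketch: turn a present pte into a NUMA hinting pte in a single value,
 * clearing _PAGE_PRESENT and setting _PAGE_NUMA together; the TLB flush
 * is left to the deferred batch in knuma_scand.
 */
static inline pte_t pte_mknuma_sketch(pte_t pte)
{
	pteval_t val = pte_val(pte);

	val &= ~_PAGE_PRESENT;
	val |= _PAGE_NUMA;
	return __pte(val);
}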

> It's still similar to setting a range PROT_NONE (except the way
> _PAGE_PROTNONE and _PAGE_NUMA works is the opposite, and they are
> mutually exclusive, so they can easily share the same pte/pmd
> bitflag). Except PROT_NONE must be synchronous, _PAGE_NUMA is set lazily.
> 
> The NUMA hinting page fault also won't require any TLB flush ever.
> 

It sort of can. The fault itself is still a heavy operation that can do
things like this

numa_hinting_fault
 -> numa_hinting_fault_memory_follow_cpu
    -> autonuma_migrate_page
      -> sync_isolate_migratepages
	 (lru lock for single page)
      -> migrate_pages

and buried down there where it unmaps the page and makes a migration PTE
is a TLB flush due to calling ptep_clear_flush_notify(). That's a bad case
obviously, but the expectation is that once the threads converge to a node
it stops being a problem. While they are converging, though, it will be a
heavy cost.

Tracking how often a numa_hinting_fault results in a migration should be
enough to keep an eye on it.

> So the whole process (establish/teardown) has an incredibly low TLB
> flushing cost.
> 
> The only fixed cost is in knuma_scand and the enter/exit kernel for
> every not-shared page every 10 sec (or whatever you set the duration
> of a knuma_scand pass in sysfs).
> 

A 10 second interval should keep that cost sufficiently low. The interval
itself might need to adapt in the future but at least a 10 second default
will not stomp too heavily for now.

> Furthermore, if the pmd_scan mode is activated, I guarantee there's at
> max 1 NUMA hinting page fault every 2m virtual region (even if some
> accuracy is lost). You can try to set scan_pmd = 0 in sysfs and also
> to disable THP (echo never >enabled) to measure the exact cost per 4k
> page. It's hardly measurable here. With THP the fault is also 1 every
> 2m virtual region but no accuracy is lost in that case (or more
> precisely, there's no way to get more accuracy than that as we deal
> with a pmd).
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 05/33] autonuma: pte_numa() and pmd_numa()
  2012-10-11 16:58       ` Andrea Arcangeli
@ 2012-10-11 19:54         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 06:58:47PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 12:15:45PM +0100, Mel Gorman wrote:
> > huh?
> > 
> > #define _PAGE_NUMA     _PAGE_PROTNONE
> > 
> > so this is effective _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PROTNONE
> > 
> > I suspect you are doing this because there is no requirement for
> > _PAGE_NUMA == _PAGE_PROTNONE for other architectures and it was best to
> > describe your intent. Is that really the case or did I miss something
> > stupid?
> 
> Exactly.
> 
> It reminds that we need to return true in pte_present when the NUMA
> hinting page fault is on.
> 
> Hardwiring _PAGE_NUMA to _PAGE_PROTNONE conceptually is not necessary
> and it's actually an artificial restriction. Other archs without a
> bitflag for _PAGE_PROTNONE may want to use something else and they'll
> have to deal with pte_present too, somehow. So this is a reminder for
> them as well.
> 

That's all very reasonable.

> > >  static inline int pte_hidden(pte_t pte)
> > > @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
> > >  	 * the _PAGE_PSE flag will remain set at all times while the
> > >  	 * _PAGE_PRESENT bit is clear).
> > >  	 */
> > > -	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> > > +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> > > +				 _PAGE_NUMA);
> > > +}
> > > +
> > > +#ifdef CONFIG_AUTONUMA
> > > +/*
> > > + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> > > + * same bit too). It's set only when _PAGE_PRESET is not set and it's
> > 
> > same bit on x86, not necessarily anywhere else.
> 
> Yep. In fact before using _PAGE_PRESENT the two bits were different
> even on x86. But I unified them. If I vary them then they will become
> _PAGE_PTE_NUMA/_PAGE_PMD_NUMA and the above will fail to build without
> risk of errors.
> 

Ok.

> > 
> > _PAGE_PRESENT?
> 
> good eye ;) corrected.
> 
> > > +/*
> > > + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> > > + * because they're called by the NUMA hinting minor page fault.
> > 
> > automatically or atomically?
> > 
> > I assume you meant atomically but what stops two threads faulting at the
> > same time and doing to the same update? mmap_sem will be insufficient in
> > that case so what is guaranteeing the atomicity. PTL?
> 
> I meant automatically. I explained myself wrong and automatically may
> be the wrong word. It also is atomic of course but it wasn't about the
> atomic part.
> 
> So the thing is: the numa hinting page fault hooking point is this:
> 
> 	if (pte_numa(entry))
> 		return pte_numa_fixup(mm, vma, address, entry, pte, pmd);
> 
> It won't get this far:
> 
> 	entry = pte_mkyoung(entry);
> 	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
> 
> So if I don't set _PAGE_ACCESSED in pte/pmd_mknuma, the TLB miss
> handler will have to set _PAGE_ACCESSED itself with an additional
> write on the pte/pmd later when userland touches the page. And that
> will slow us down for no good.
> 

All clear now. Letting it fall through to reach that point would be
convoluted and messy. This is a better option.
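
For reference, my understanding of the fixup side is roughly the sketch
below (again x86-style helpers and an invented name, not your exact code):

/*
 * Sketch: the NUMA hinting fault turns the pte back into a normal
 * present pte and marks it accessed in the same write, so the hardware
 * doesn't need a second write to set the accessed bit afterwards.
 */
static inline pte_t pte_mknonnuma_sketch(pte_t pte)
{
	pteval_t val = pte_val(pte);

	val &= ~_PAGE_NUMA;
	val |= _PAGE_PRESENT | _PAGE_ACCESSED;
	return __pte(val);
}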

> Because mknuma is only called in the numa hinting page fault context,
> it's optimal to set _PAGE_ACCESSED too, not only _PAGE_PRESENT (and
> clearing _PAGE_NUMA of course).
> 
> The basic idea, is that the numa hinting page fault can only trigger
> if userland touches the page, and after such an event, _PAGE_ACCESSED
> would be set by the hardware no matter if there is a NUMA hinting page
> fault or not (so we can optimize away the hardware action when the NUMA
> hinting page fault triggers).
> 
> I tried to reword it:
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index cf1d3f0..3dc6a9b 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -449,12 +449,12 @@ static inline int pmd_numa(pmd_t pmd)
>  #endif
>  
>  /*
> - * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
> - * because they're called by the NUMA hinting minor page fault. If we
> - * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
> - * would be forced to set it later while filling the TLB after we
> - * return to userland. That would trigger a second write to memory
> - * that we optimize away by setting _PAGE_ACCESSED here.
> + * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag too because they're
> + * only called by the NUMA hinting minor page fault. If we wouldn't
> + * set the _PAGE_ACCESSED bitflag here, the TLB miss handler would be
> + * forced to set it later while filling the TLB after we return to
> + * userland. That would trigger a second write to memory that we
> + * optimize away by setting _PAGE_ACCESSED here.
>   */
>  static inline pte_t pte_mknonnuma(pte_t pte)
>  {
> 

Much better.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 06/33] autonuma: teach gup_fast about pmd_numa
  2012-10-11 17:05       ` Andrea Arcangeli
@ 2012-10-11 20:01         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 20:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:05:33PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:22:55PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:48AM +0200, Andrea Arcangeli wrote:
> > > In the special "pmd" mode of knuma_scand
> > > (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> > > type (_PAGE_PRESENT not set), however the pte might be
> > > present. Therefore, gup_pmd_range() must return 0 in this case to
> > > avoid losing a NUMA hinting page fault during gup_fast.
> > > 
> > 
> > So if gup_fast fails, presumably we fall back to taking the mmap_sem and
> > calling get_user_pages(). This is a heavier operation and I wonder if the
> > cost is justified. i.e. Is the performance loss from using get_user_pages()
> > offset by improved NUMA placement? I ask because we always incur the cost of
> > taking mmap_sem but only sometimes get it back from improved NUMA placement.
> > How bad would it be if gup_fast lost some of the NUMA hinting information?
> 
> Good question indeed. Now, I agree it wouldn't be bad to skip NUMA
> hinting page faults in gup_fast for no-virt usage like
> O_DIRECT/ptrace, but the only problem is that we'd lose AutoNUMA on
> the memory touched by the KVM vcpus.
> 

Ok I see, that could be in the changelog because it's not immediately
obvious. At least, it's not as obvious as the potential downside (more GUP
fallbacks). In this context there is no way to guess what type of access
it is. AFAIK, there is no way from here to tell if it's KVM calling gup
or if it's due to O_DIRECT.
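
For clarity, the rule the changelog describes in gup_pmd_range() is
essentially this (a sketch only, not the actual hunk):

/*
 * Sketch: a pmd armed for NUMA hinting has _PAGE_PRESENT clear, so the
 * lockless fast path must give up and let the caller fall back to
 * get_user_pages(), which will take the hinting fault properly.
 */
static inline int gup_fast_must_fallback(pmd_t pmd)
{
	return pmd_none(pmd) || pmd_numa(pmd);
}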

> I've been also asked if the vhost-net kernel thread (KVM in kernel
> virtio backend) will be controlled by autonuma in between
> use_mm/unuse_mm and the answer is yes, but to do that, it also needs
> this. (see also the flush to task_autonuma_nid and mm/task statistics in
> unuse_mm to reset it back to regular kernel thread status,
> uncontrolled by autonuma)

I can understand why it needs this now. The clearing of the statistics is
still not clear to me but I asked that question in the thread that adjusts
unuse_mm already.

> 
> $ git grep get_user_pages
> tcm_vhost.c:            ret = get_user_pages_fast((unsigned long)ptr, 1, write, &page);
> vhost.c:        r = get_user_pages_fast(log, 1, 1, &page);
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 17:15       ` Andrea Arcangeli
@ 2012-10-11 20:06         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 20:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:15:20PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:28:27PM +0100, Mel Gorman wrote:
> > s/togehter/together/
> 
> Fixed.
> 
> > 
> > > + * knumad_scan structure.
> > > + */
> > > +struct mm_autonuma {
> > 
> > Nit but this is very similar in principle to mm_slot for transparent
> > huge pages. It might be worth renaming both to mm_thp_slot and
> > mm_autonuma_slot to set the expectation they are very similar in nature.
> > Could potentially be made generic but probably overkill.
> 
> Agreed. A plain rename to mm_autonuma_slot would have the only cons of
> making some code spill over 80 col ;).
> 

Fair enough :)

> > > +	/* link for knuma_scand's list of mm structures to scan */
> > > +	struct list_head mm_node;
> > > +	/* Pointer to associated mm structure */
> > > +	struct mm_struct *mm;
> > > +
> > > +	/*
> > > +	 * Zeroed from here during allocation, check
> > > +	 * mm_autonuma_reset() if you alter the below.
> > > +	 */
> > > +
> > > +	/*
> > > +	 * Pass counter for this mm. This exist only to be able to
> > > +	 * tell when it's time to apply the exponential backoff on the
> > > +	 * task_autonuma statistics.
> > > +	 */
> > > +	unsigned long mm_numa_fault_pass;
> > > +	/* Total number of pages that will trigger NUMA faults for this mm */
> > > +	unsigned long mm_numa_fault_tot;
> > > +	/* Number of pages that will trigger NUMA faults for each [nid] */
> > > +	unsigned long mm_numa_fault[0];
> > > +	/* do not add more variables here, the above array size is dynamic */
> > > +};
> > 
> > How cache hot is this structure? nodes are sharing counters in the same
> > cache lines so if updates are frequent this will bounce like a mad yoke.
> > Profiles will tell for sure but it's possible that some sort of per-cpu
> > hilarity will be necessary here in the future.
> 
> On autonuma27 this is only written by knuma_scand so it won't risk to
> bounce.
> 
> On autonuma28 however it's updated by the numa hinting page fault
> locklessly and so your concern is very real, and the cacheline bounces
> will materialize.

It will be tied to the knuma_scand pass though, so once every 10 seconds
we might see a sudden spike in cache conflicts. Is that accurate?
Something like perf top might detect when this happens, but it can also
be inferred using a perf probe on the fault handler.

> It'll cause more interconnect traffic before the
> workload converges too. I thought about that, but I wanted the
> mm_autonuma updated in real time as migration happens otherwise it
> converges more slowly if we have to wait until the next pass to bring
> mm_autonuma statistical data in sync with the migration
> activities. Converging more slowly looked worse than paying more
> cacheline bounces.
> 

You could argue that slower converging also means more cross-node
traffic so it costs either way.

> It's a tradeoff. And if it's not a good one, we can go back to
> autonuma27 mm_autonuma stat gathering method and converge slower but
> without any cacheline bouncing in the NUMA hinting page faults. At
> least it's lockless.
> 

Yep.

> > > +	unsigned long task_numa_fault_pass;
> > > +	/* Total number of eligible pages that triggered NUMA faults */
> > > +	unsigned long task_numa_fault_tot;
> > > +	/* Number of pages that triggered NUMA faults for each [nid] */
> > > +	unsigned long task_numa_fault[0];
> > > +	/* do not add more variables here, the above array size is dynamic */
> > > +};
> > > +
> > 
> > Same question about cache hotness.
> 
> Here it's per-thread, so there won't be risk of accesses interleaved
> by different CPUs.
> 

Ok thanks. With that clarification

Acked-by: Mel Gorman <mgorman@suse.de>

While I still have concerns about the cache behaviour of this, the basic
intent of the structure will not change no matter how the problem is
addressed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 08/33] autonuma: define the autonuma flags
  2012-10-11 17:34       ` Andrea Arcangeli
@ 2012-10-11 20:17         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 20:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 07:34:42PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 02:46:43PM +0100, Mel Gorman wrote:
> > Should this be a SCHED_FEATURE flag?
> 
> I guess it could. It is only used by kernel/sched/numa.c which isn't
> even built unless CONFIG_AUTONUMA is set. So it would require a
> CONFIG_AUTONUMA in the sched feature flags unless we want to expose
> no-operational bits. I'm not sure what the preferred way is.
> 

It's fine this way for now. It just felt that it was bolted onto the
side a bit and didn't quite belong there but it could be argued either
way so just leave it alone.

> > Have you ever identified a case where it's a good idea to set that flag?
> 
> It's currently set by default but no, I didn't do enough experiments
> to tell whether it's worth copying or resetting the data.
> 

Ok, if it was something that was going to be regularly used there would
be more justification for SCHED_FEATURE.

> > A child that closely shared data with its parent is not likely to also
> > want to migrate to separate nodes. It just seems unnecessary to have and
> 
> Agreed, this is why the task_selected_nid is always inherited by
> default (that is the CFS autopilot driver).
> 
> The question is if the full statistics also should be inherited across
> fork/clone or not. I don't know the answer yet and that's why that
> knob exists.
> 

I very strongly suspect the answer is "no".

> If we retain them, the autonuma_balance may decide to move the
> task before a full statistics buildup has executed in the child.
> 
> The current way is to reset the data and wait for the data to build up
> in the child, while we keep CFS on autopilot with task_selected_nid
> (which is always inherited). I thought the current one to be a good
> tradeoff, but copying all data isn't a horrible idea either.
> 
> > impossible to suggest to an administrator how the flag might be used.
> 
> Agreed. This in fact is a debug flag only; it won't ever show up to the admin.
> 
> #ifdef CONFIG_DEBUG_VM
> SYSFS_ENTRY(sched_load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
> SYSFS_ENTRY(child_inheritance, AUTONUMA_CHILD_INHERITANCE_FLAG);
> SYSFS_ENTRY(migrate_allow_first_fault,
> 	    AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
> #endif /* CONFIG_DEBUG_VM */
> 

Good. Nice to have just in case even if I think it'll never be used :)

> > 
> > > +	/*
> > > +	 * If set, this tells knuma_scand to trigger NUMA hinting page
> > > +	 * faults at the pmd level instead of the pte level. This
> > > +	 * reduces the number of NUMA hinting faults potentially
> > > +	 * saving CPU time. It reduces the accuracy of the
> > > +	 * task_autonuma statistics (but does not change the accuracy
> > > +	 * of the mm_autonuma statistics). This flag can be toggled
> > > +	 * through sysfs as runtime.
> > > +	 *
> > > +	 * This flag does not affect AutoNUMA with transparent
> > > +	 * hugepages (THP). With THP the NUMA hinting page faults
> > > +	 * always happen at the pmd level, regardless of the setting
> > > +	 * of this flag. Note: there is no reduction in accuracy of
> > > +	 * task_autonuma statistics with THP.
> > > +	 *
> > > +	 * Default set.
> > > +	 */
> > > +	AUTONUMA_SCAN_PMD_FLAG,
> > 
> > This flag and the other flags make sense. Early on we just are not going
> > to know what the correct choice is. My gut says that ultimately we'll
> 
> Agreed. This is why I left these knobs in, even if I've been asked to
> drop them a few times (they were perceived as adding complexity). But
> for things we're not sure about, these really help to benchmark quickly
> one way or another.
> 

I don't mind them being left in for now. They at least forced me to
consider the cases where they might be required and consider if that is
realistic or not. From that perspective alone it was worth it :)

> scan_pmd is actually not under DEBUG_VM as it looked a more fundamental thing.
> 
> > default to PMD level *but* fall back to PTE level on a per-task basis if
> > ping-pong migrations are detected. This will catch ping-pongs on data
> > that is not PMD aligned although obviously data that is not page aligned
> > will also suffer. Eventually I think this flag will go away but the
> > behaviour will be;
> > 
> > default, AUTONUMA_SCAN_PMD
> > if ping-pong, fallback to AUTONUMA_SCAN_PTE
> > if ping-ping, AUTONUMA_SCAN_NONE
> 
> That would be ideal, good idea indeed.
> 
> > so there is a graceful degradation if autonuma is doing the wrong thing.
> 
> Makes perfect sense to me if we figure out how to reliably detect when
> to make the switch.
> 

The "reliable" part is the mess. I think it potentially would be possible
to detect it based on the number of times numa_hinting_fault() migrated
pages and decay that at each knuma_scan but that could take too long
to detect with the 10 second delays so there is no obvious good answer.
WIth some experience on a few different workloads, it might be a bit
more obvious. Right now what you have is good enough and we can just
keep the potential problem in mind so we'll recognise it when we see it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (34 preceding siblings ...)
  2012-10-11 10:19 ` Mel Gorman
@ 2012-10-11 21:34 ` Mel Gorman
  2012-10-12  1:45     ` Andrea Arcangeli
  2012-10-13 18:40   ` Srikar Dronamraju
  2012-10-16 13:48 ` Mel Gorman
  37 siblings, 1 reply; 148+ messages in thread
From: Mel Gorman @ 2012-10-11 21:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:42AM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is a new AutoNUMA27 release for Linux v3.6.
> 

So after getting through the full review of it, there wasn't anything
I could not stand. I think it's *very* heavy on some of the paths, like
the idle balancer, which I was not keen on, and the fault paths are also
quite heavy.  I think the weight on some of these paths can be reduced,
but not to 0 if the objectives of autonuma are to be met.

I'm not fully convinced that the task exchange is actually necessary or
beneficial because it somewhat assumes that there is a symmetry between CPU
and memory balancing that may not be true. The fact that it only considers
tasks that are currently running feels a bit random but examining all tasks
that recently ran on the node would be far too expensive, so there is no
good answer. You are caught between a rock and a hard place and either
direction you go is wrong for different reasons. You need something more
frequent than scans (because it'll converge too slowly) but doing it from
the balancer misses some tasks and may run too frequently and it's unclear
how it affects the current load balancer decisions. I don't have a good
alternative solution for this but ideally it would be better integrated with
the existing scheduler when there is more data on what those scheduling
decisions should be. That will only come from a wide range of testing and
the inevitable bug reports.

That said, this is concentrating on the problems without considering the
situations where it would work very well.  I think it'll come down to HPC
and anything jitter-sensitive will hate this while workloads like JVM,
virtualisation or anything that uses a lot of memory without caring about
placement will love it. It's not perfect but it's better than incurring
the cost of remote access unconditionally.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-11 15:24     ` Rik van Riel
  2012-10-11 15:57       ` Mel Gorman
@ 2012-10-12  0:23       ` Christoph Lameter
  2012-10-12  0:52           ` Andrea Arcangeli
  1 sibling, 1 reply; 148+ messages in thread
From: Christoph Lameter @ 2012-10-12  0:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Andrea Arcangeli, linux-kernel, linux-mm,
	Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Hugh Dickins, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 11 Oct 2012, Rik van Riel wrote:

> These statistics are updated at page fault time, I
> believe while holding the page table lock.
>
> In other words, they are in code paths where updating
> the stats should not cause issues.

The per cpu counters in the VM were introduced because of
counter contention caused at page fault time. This is the same code path
where you think that there cannot be contention.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 10/33] autonuma: CPU follows memory algorithm
  2012-10-11 14:58   ` Mel Gorman
@ 2012-10-12  0:25       ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > This algorithm takes as input the statistical information filled by the
> > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > (p->task_autonuma), evaluates it for the current scheduled task, and
> > compares it against every other running process to see if it should
> > move the current task to another NUMA node.
> > 
> 
> That sounds expensive if there are a lot of running processes in the
> system. How often does this happen? Mention it here even though I
> realised much later that it's obvious from the patch itself.

Ok I added:

==
This algorithm will run once every ~100msec, and can be easily slowed
down further. Its computational complexity is O(nr_cpus) and it's
executed by all CPUs. The number of running threads and processes is
not going to alter the cost of this algorithm, only the online number
of CPUs is. However, in practice this will very rarely hit all CPUs'
runqueues. Most of the time it will only compute on local data in the
task_autonuma struct (for example if convergence has been
reached). Even if no convergence has been reached yet, it'll only scan
the CPUs in the NUMA nodes where the local task_autonuma data is
showing that they are worth migrating to.
==

It's configurable through sysfs; 100msec is the default.

> > + * there is no affinity set for the task).
> > + */
> > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > +{
> 
> nit, but elsewhere you have
> 
> static inline TYPE and here you have
> static TYPE inline

Fixed.

> 
> > +	int task_selected_nid;
> > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > +
> > +	if (!task_autonuma)
> > +		return true;
> > +
> > +	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > +	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > +		return true;
> > +	else
> > +		return false;
> > +}
> 
> no need for else.

Removed.
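
For reference, that's just the quoted helper with both fixes folded in
(nothing new):

static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
	int task_selected_nid;
	struct task_autonuma *task_autonuma = p->task_autonuma;

	if (!task_autonuma)
		return true;

	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
	return task_selected_nid < 0 ||
	       task_selected_nid == cpu_to_node(cpu);
}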

> 
> > +
> > +static inline void sched_autonuma_balance(void)
> > +{
> > +	struct task_autonuma *ta = current->task_autonuma;
> > +
> > +	if (ta && current->mm)
> > +		__sched_autonuma_balance();
> > +}
> > +
> 
> Ok, so this could do with a comment explaining where it is called from.
> It is called during idle balancing at least so potentially this is every
> scheduler tick. It'll be run from softirq context so the cost will not
> be obvious to a process but the overhead will be there. What happens if
> this takes longer than a scheduler tick to run? Is that possible?

softirqs can run for a huge amount of time so it won't harm.

Nested IRQs could even run on top of the softirq, and they could take
milliseconds too if they're hyper inefficient; we must still run
perfectly rock solid (with horrible latency, but still stable).

I added:

/*
 * This is called in the context of the SCHED_SOFTIRQ from
 * run_rebalance_domains().
 */

> > +/*
> > + * This function __sched_autonuma_balance() is responsible for
> 
> This function is far too short and could do with another few pages :P

:) I tried to split it once already but gave up in the middle.

> > + * "Full convergence" is achieved when all memory accesses by a task
> > + * are 100% local to the CPU it is running on. A task's "best node" is
> 
> I think this is the first time you defined convergence in the series.
> The explanation should be included in the documentation.

Ok. It's not an easy concept to explain in words. Here's a try:

 *
 * A workload converges when all the memory of a thread or a process
 * has been placed in the NUMA node of the CPU that the process or
 * thread is running on.
 *
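
In terms of the per-task stats, the same definition can be written as
(illustration only, such a helper is not in the patch):

static inline bool task_fully_converged(struct task_autonuma *ta, int nid)
{
	/* every NUMA hinting fault of the task was local to nid */
	return ta->task_numa_fault_tot &&
	       ta->task_numa_fault[nid] == ta->task_numa_fault_tot;
}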

> > + * other_diff: how much the current task is closer to fully converge
> > + * on the node of the other CPU than the other task that is currently
> > + * running in the other CPU.
> 
> In the changelog you talked about comparing a process with every other
> running process but here it looks like you intend to examine every
> process that is *currently running* on a remote node and compare that.
> What if the best process to swap with is not currently running? Do we
> miss it?

Correct, only currently running processes are being checked. If a task
in R state goes to sleep immediately, it's not relevant where it
runs. We focus on "long running" compute tasks, i.e. tasks that are in R
state most of the time.

> > + * If both checks succeed it guarantees that we found a way to
> > + * multilaterally improve the system wide NUMA
> > + * convergence. Multilateral here means that the same checks will not
> > + * succeed again on those same two tasks, after the task exchange, so
> > + * there is no risk of ping-pong.
> > + *
> 
> At least not in that instance of time. A new CPU binding or change in
> behaviour (such as a computation finishing and a reduce step starting)
> might change that scoring.

Yes.

> > + * If a task exchange can happen because the two checks succeed, we
> > + * select the destination CPU that will give us the biggest increase
> > + * in system wide convergence (i.e. biggest "weight", in the above
> > + * quoted code).
> > + *
> 
> So there is a bit of luck that the best task to exchange is currently
> running. How bad is that? It depends really on the number of tasks
> running on that node and the priority. There is a chance that it doesn't
> matter as such because if all the wrong tasks are currently running then
> no exchange will take place - it was just wasted CPU. It does imply that
> AutoNUMA works best if CPUs are not over-subscribed with processes. Is
> that fair?

It seems to work fine with overcommit as well. specjbb x2 is
converging fine, as well as numa01 in parallel with numa02. It's
actually pretty cool to watch.

Try to run this:

while :; do ./nmstat -n numa; sleep 1; done

nmstat is a binary in autonuma benchmark.

Then run:

time (./numa01 & ./numa02 & wait)

The thing is, we work together with CFS: CFS on autopilot works fine,
we only need to correct the occasional error.

It works the same as active idle balancing, which corrects the
occasional error for HT cores left idle; then CFS takes over.

> Again, I have no suggestions at all on how this might be improved and
> these comments are hand-waving towards where we *might* see problems in
> the future. If problems are actually identified in practice for
> workloads then autonuma can be turned off until the relevant problem
> area is fixed.

Exactly, it's enough to run:

echo 0 >/sys/kernel/mm/autonuma/enabled

If you want to get rid of the 2 bytes per page too, passing
"noautonuma" at boot will do it (but then /sys/kernel/mm/autonuma
disappears and you can't enable it anymore).

Plus if there's any issue with the cost of sched_autonuma_balance it's
more than enough to run "perf top" to find out.

> I would fully expect that there are parallel workloads that work on
> different portions of a large set of data and it would be perfectly
> reasonable for threads using the same address space to converge on
> different nodes.

Agreed. Even if they can't converge fully they could have stats like
70/30 and 30/70, with the 30 being NUMA false sharing, and we'll still
schedule them right, so they run faster than upstream. That 30% will
also tend to slowly distribute better over time.
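
(In terms of the quoted weights: a thread with task_numa_fault =
{70, 30} and task_numa_fault_tot = 100 gets task_numa_weight of
70*AUTONUMA_BALANCE_SCALE/100 vs 30*AUTONUMA_BALANCE_SCALE/100, so the
balancer still prefers the 70% node even though the thread never fully
converges.)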

> I would hope we manage to figure out a way to examine fewer processes,
> not more :)

8)))

> > +void __sched_autonuma_balance(void)
> > +{
> > +	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
> > +	int this_nid = numa_node_id();
> > +	int this_cpu = smp_processor_id();
> > +	unsigned long task_fault, task_tot, mm_fault, mm_tot;
> > +	unsigned long task_max, mm_max;
> > +	unsigned long weight_diff_max;
> > +	long uninitialized_var(s_w_nid);
> > +	long uninitialized_var(s_w_this_nid);
> > +	long uninitialized_var(s_w_other);
> > +	bool uninitialized_var(s_w_type_thread);
> > +	struct cpumask *allowed;
> > +	struct task_struct *p = current, *other_task;
> 
> So the task in question is current but this is called by the idle
> balancer. I'm missing something obvious here but it's not clear to me why
> that process is necessarily relevant. What guarantee is there that all
> tasks will eventually run this code? Maybe it doesn't matter because the
> most CPU intensive tasks are also the most likely to end up in here but
> a clarification would be nice.

Exactly. We only focus on tasks that are doing significant
computation. If a task runs for 1msec we can't possibly care where it
runs and where its memory is. If it keeps running for longer, over
time even that task will be migrated to the right place.

> > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > +	struct mm_autonuma *mm_autonuma;
> > +	struct rq *rq;
> > +
> > +	/* per-cpu statically allocated in runqueues */
> > +	long *task_numa_weight;
> > +	long *mm_numa_weight;
> > +
> > +	if (!task_autonuma || !p->mm)
> > +		return;
> > +
> > +	if (!autonuma_enabled()) {
> > +		if (task_autonuma->task_selected_nid != -1)
> > +			task_autonuma->task_selected_nid = -1;
> > +		return;
> > +	}
> > +
> > +	allowed = tsk_cpus_allowed(p);
> > +	mm_autonuma = p->mm->mm_autonuma;
> > +
> > +	/*
> > +	 * If the task has no NUMA hinting page faults or if the mm
> > +	 * hasn't been fully scanned by knuma_scand yet, set task
> > +	 * selected nid to the current nid, to avoid the task bounce
> > +	 * around randomly.
> > +	 */
> > +	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
> 
> Why ACCESS_ONCE?

mm variables are altered by other threads too. Only task_autonuma is
local to this task and cannot change from under us.

I did it all lockless; I don't care if we're off once in a while.

> > +	if (!mm_tot) {
> > +		if (task_autonuma->task_selected_nid != this_nid)
> > +			task_autonuma->task_selected_nid = this_nid;
> > +		return;
> > +	}
> > +	task_tot = task_autonuma->task_numa_fault_tot;
> > +	if (!task_tot) {
> > +		if (task_autonuma->task_selected_nid != this_nid)
> > +			task_autonuma->task_selected_nid = this_nid;
> > +		return;
> > +	}
> > +
> > +	rq = cpu_rq(this_cpu);
> > +
> > +	/*
> > +	 * Verify that we can migrate the current task, otherwise try
> > +	 * again later.
> > +	 */
> > +	if (ACCESS_ONCE(rq->autonuma_balance))
> > +		return;
> > +
> > +	/*
> > +	 * The following two arrays will hold the NUMA affinity weight
> > +	 * information for the current process if scheduled on the
> > +	 * given NUMA node.
> > +	 *
> > +	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
> > +	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
> > +	 */
> > +	task_numa_weight = rq->task_numa_weight;
> > +	mm_numa_weight = rq->mm_numa_weight;
> > +
> > +	/*
> > +	 * Identify the NUMA node where this thread (task_struct), and
> > +	 * the process (mm_struct) as a whole, has the largest number
> > +	 * of NUMA faults.
> > +	 */
> > +	task_max = mm_max = 0;
> > +	selected_nid = mm_selected_nid = -1;
> > +	for_each_online_node(nid) {
> > +		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
> > +		task_fault = task_autonuma->task_numa_fault[nid];
> > +		if (mm_fault > mm_tot)
> > +			/* could be removed with a seqlock */
> > +			mm_tot = mm_fault;
> > +		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
> > +		if (task_fault > task_tot) {
> > +			task_tot = task_fault;
> > +			WARN_ON(1);
> > +		}
> > +		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
> > +		if (mm_numa_weight[nid] > mm_max) {
> > +			mm_max = mm_numa_weight[nid];
> > +			mm_selected_nid = nid;
> > +		}
> > +		if (task_numa_weight[nid] > task_max) {
> > +			task_max = task_numa_weight[nid];
> > +			selected_nid = nid;
> > +		}
> > +	}
> 
> Ok, so this is a big walk to take every time and as this happens every
> scheduler tick, it seems unlikely that the workload would be changing
> phases that often in terms of NUMA behaviour. Would it be possible for
> this to be sampled less frequently and cache the result?

Even if there are 8 nodes, this is fairly quick and only requires 2
cachelines. At 16 nodes we're at 4 cachelines. The cacheline of
task_autonuma is fully local. The one of mm_autonuma can be shared
(modulo NUMA hinting page faults with autonuma28; in autonuma27 it was
shareable even despite NUMA hinting page faults).
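
(Arithmetic, assuming 8-byte longs and 64-byte cachelines: with 8
nodes each of task_numa_fault[] and mm_numa_fault[] is 64 bytes, i.e.
one cacheline each, hence the 2; with 16 nodes it's 128 bytes each,
hence the 4.)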

> > +			/*
> > +			 * Grab the fault/tot of the processes running
> > +			 * in the other CPUs to compute w_other.
> > +			 */
> > +			raw_spin_lock_irq(&rq->lock);
> > +			_other_task = rq->curr;
> > +			/* recheck after implicit barrier() */
> > +			mm = _other_task->mm;
> > +			if (!mm) {
> > +				raw_spin_unlock_irq(&rq->lock);
> > +				continue;
> > +			}
> > +
> 
> Is it really critical to pin those values using the lock? That seems *really*
> heavy. If the results have to be exactly stable then is there any chance
> the values could be encoded in the high and low bits of a single unsigned
> long and read without the lock?  Updates would be more expensive but that's
> in a trap anyway. This on the other hand is a scheduler path.

The reason for the lock is to prevent rq->curr, mm, etc. from being
freed from under us.

> > +			/*
> > +			 * Check if the _other_task is allowed to be
> > +			 * migrated to this_cpu.
> > +			 */
> > +			if (!cpumask_test_cpu(this_cpu,
> > +					      tsk_cpus_allowed(_other_task))) {
> > +				raw_spin_unlock_irq(&rq->lock);
> > +				continue;
> > +			}
> > +
> 
> Would it not make sense to check this *before* we take the lock and
> grab all its counters? It probably will not make much of a difference in
> practice as I expect it's rare that the target CPU is running a task
> that can't migrate but it still feels the wrong way around.

It's a micro optimization to do it here. It's very rare that the
above check fails, while tot may be zero much more frequently (like if
the task has just been started).

> > +	if (selected_cpu != this_cpu) {
> > +		if (autonuma_debug()) {
> > +			char *w_type_str;
> > +			w_type_str = s_w_type_thread ? "thread" : "process";
> > +			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
> > +			       p->mm, p->pid, this_nid, selected_nid,
> > +			       this_cpu, selected_cpu,
> > +			       s_w_other, s_w_nid, s_w_this_nid,
> > +			       w_type_str);
> > +		}
> 
> Can these be made tracepoints and get rid of the autonuma_debug() check?
> I recognise there is a risk that some tool might grow to depend on
> implementation details but in this case it seems very unlikely.

The debug mode also provides me with a racy dump of all mms; I
wouldn't know how to do that with tracing.

So I wouldn't remove the printk until we can replace everything with
tracing, but I'd welcome adding a tracepoint too. There are already
other proper tracepoints driving "perf script numatop".

> Ok, so I confess I did not work out if the weights and calculations really
> make sense or not but at a glance they seem reasonable and I spotted no
> obvious flaws. The function is pretty heavy though and may be doing more
> work around locking than is really necessary. That said, there will be
> workloads where the cost is justified and offset by the performance gains
> from improved NUMA locality. I just don't expect it to be a universal win so
> we'll need to keep an eye on the system CPU usage and incrementally optimise
> where possible. I suspect there will be a time when an incremental
> optimisation just does not cut it any more but by then I would also
> expect there will be more data on how autonuma behaves in practice and a
> new algorithm might be more obvious at that point.

Agreed. Chances are I can replace all this already with RCU and a
rcu_dereference or ACCESS_ONCE to grab the rq->curr->task_autonuma and
rq->curr->mm->mm_autonuma data. I haven't tried it yet. The task
struct shouldn't go away from under us after rcu_read_lock; the mm may
be more tricky, I haven't checked that yet. Optimizations welcome ;)
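
Roughly, the untested idea is along these lines (sketch only, local
variable names made up; the mm lifetime is the open question):

	rcu_read_lock();
	other = ACCESS_ONCE(rq->curr);
	mm = ACCESS_ONCE(other->mm);
	if (mm && other->task_autonuma) {
		/*
		 * The task struct shouldn't be freed while we hold
		 * rcu_read_lock(); whether mm->mm_autonuma can be
		 * dereferenced without pinning the mm is what still
		 * needs to be verified.
		 */
		other_fault = ACCESS_ONCE(other->task_autonuma->task_numa_fault[nid]);
		other_tot = ACCESS_ONCE(other->task_autonuma->task_numa_fault_tot);
		mm_fault = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
		mm_tot = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
	}
	rcu_read_unlock();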

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 15:35       ` Mel Gorman
@ 2012-10-12  0:41         ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> If System CPU time really does go down as this converges then that
> should be obvious from monitoring vmstat over time for a test. Early on
> - high usage with that dropping as it converges. If that doesn't happen
>   then the tasks are not converging, the phases change constantly or
> something unexpected happened that needs to be identified.

Yes, all measurable kernel cost should be in the memory copies
(migration and khugepaged, the latter is going to be optimized away).

The migrations must stop after the workload converges. Either
migrations are used to reach convergence or they shouldn't happen in
the first place (not in any measurable amount).

> Ok. Are they separate STREAM instances or threads running on the same
> arrays? 

My understanding is separate instances. I think it's a single-threaded
benchmark and you run many copies. It was modified to run for 5min
(otherwise upstream doesn't have enough time to get it wrong as a
result of background scheduling jitter).

Thanks!

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures
  2012-10-12  0:23       ` Christoph Lameter
@ 2012-10-12  0:52           ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  0:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney

Hi Christoph,

On Fri, Oct 12, 2012 at 12:23:17AM +0000, Christoph Lameter wrote:
> On Thu, 11 Oct 2012, Rik van Riel wrote:
> 
> > These statistics are updated at page fault time, I
> > believe while holding the page table lock.
> >
> > In other words, they are in code paths where updating
> > the stats should not cause issues.
> 
> The per cpu counters in the VM were introduced because of
> counter contention caused at page fault time. This is the same code path
> where you think that there cannot be contention.

There's no contention at all in autonuma27.

I changed it in autonuma28, to get real time updates in mm_autonuma
from migration events.

There is no lock taken though (the spinlock below is taken once every
pass, very rarely). It's a few-line change, shown in detail below. The
only contention point is this:

+	ACCESS_ONCE(mm_numa_fault[access_nid]) += numpages;
+	ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot) += numpages;

autonuma28 is much more experimental than autonuma27 :)

I wouldn't focus on >1024 CPU systems for this though. The bigger the
system the more costly any automatic placement logic will become, no
matter which algorithm and what computational complexity it has, and
chances are those systems will use NUMA hard bindings anyway,
considering how expensive they are to set up and maintain.

The diff looks like this; I can consider undoing it. Comments
welcome. (But with real-time stats updates, autonuma28 converges
faster.)

--- a/mm/autonuma.c
+++ b/mm/autonuma.c
 
 static struct knuma_scand_data {
 	struct list_head mm_head; /* entry: mm->mm_autonuma->mm_node */
 	struct mm_struct *mm;
 	unsigned long address;
-	unsigned long *mm_numa_fault_tmp;
 } knuma_scand_data = {
 	.mm_head = LIST_HEAD_INIT(knuma_scand_data.mm_head),
 };






+	unsigned long tot;
+
+	/*
+	 * Set the task's fault_pass equal to the new
+	 * mm's fault_pass, so new_pass will be false
+	 * on the next fault by this thread in this
+	 * same pass.
+	 */
+	p->task_autonuma->task_numa_fault_pass = mm_numa_fault_pass;
+
 	/* If a new pass started, degrade the stats by a factor of 2 */
 	for_each_node(nid)
 		task_numa_fault[nid] >>= 1;
 	task_autonuma->task_numa_fault_tot >>= 1;
+
+	if (mm_numa_fault_pass ==
+	    ACCESS_ONCE(mm_autonuma->mm_numa_fault_last_pass))
+		return;
+
+	spin_lock(&mm_autonuma->mm_numa_fault_lock);
+	if (unlikely(mm_numa_fault_pass ==
+		     mm_autonuma->mm_numa_fault_last_pass)) {
+		spin_unlock(&mm_autonuma->mm_numa_fault_lock);
+		return;
+	}
+	mm_autonuma->mm_numa_fault_last_pass = mm_numa_fault_pass;
+
+	tot = 0;
+	for_each_node(nid) {
+		unsigned long fault = ACCESS_ONCE(mm_numa_fault[nid]);
+		fault >>= 1;
+		ACCESS_ONCE(mm_numa_fault[nid]) = fault;
+		tot += fault;
+	}
+	mm_autonuma->mm_numa_fault_tot = tot;
+	spin_unlock(&mm_autonuma->mm_numa_fault_lock);
 }






 	task_numa_fault[access_nid] += numpages;
 	task_autonuma->task_numa_fault_tot += numpages;
 
+	ACCESS_ONCE(mm_numa_fault[access_nid]) += numpages;
+	ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot) += numpages;
+
 	local_bh_enable();
 }
 
@@ -310,28 +355,35 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
@@ -593,35 +628,26 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		goto out;
 
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		int page_nid;
-		unsigned long *fault_tmp;
 		ret = HPAGE_PMD_NR;
 
 		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-		if (autonuma_mm_working_set() && pmd_numa(*pmd)) {
+		if (pmd_numa(*pmd)) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
 		page = pmd_page(*pmd);
-
 		/* only check non-shared pages */
 		if (page_mapcount(page) != 1) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
-		page_nid = page_to_nid(page);
-		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-		fault_tmp[page_nid] += ret;
-
 		if (pmd_numa(*pmd)) {
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-
 		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+
 		/* defer TLB flush to lower the overhead */
 		spin_unlock(&mm->page_table_lock);
 		goto out;
@@ -636,10 +662,9 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 	for (_address = address, _pte = pte; _address < end;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
-		unsigned long *fault_tmp;
 		if (!pte_present(pteval))
 			continue;
-		if (autonuma_mm_working_set() && pte_numa(pteval))
+		if (pte_numa(pteval))
 			continue;
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
@@ -647,13 +672,8 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		/* only check non-shared pages */
 		if (page_mapcount(page) != 1)
 			continue;
-
-		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-		fault_tmp[page_to_nid(page)]++;
-
 		if (pte_numa(pteval))
 			continue;
-
 		if (!autonuma_scan_pmd())
 			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
 
@@ -677,56 +697,6 @@ out:
 	return ret;
 }
 
-static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
-{
-	int nid;
-	struct mm_autonuma *mma = mm->mm_autonuma;
-	unsigned long tot;
-	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
-
-	if (autonuma_mm_working_set()) {
-		for_each_node(nid) {
-			tot = fault_tmp[nid];
-			if (tot)
-				break;
-		}
-		if (!tot)
-			/* process was idle, keep the old data */
-			return;
-	}
-
-	/* FIXME: would be better protected with write_seqlock_bh() */
-	local_bh_disable();
-
-	tot = 0;
-	for_each_node(nid) {
-		unsigned long faults = fault_tmp[nid];
-		fault_tmp[nid] = 0;
-		mma->mm_numa_fault[nid] = faults;
-		tot += faults;
-	}
-	mma->mm_numa_fault_tot = tot;
-
-	local_bh_enable();
-}
-
-static void mm_numa_fault_tmp_reset(void)
-{
-	memset(knuma_scand_data.mm_numa_fault_tmp, 0,
-	       mm_autonuma_fault_size());
-}
-
-static inline void validate_mm_numa_fault_tmp(unsigned long address)
-{
-#ifdef CONFIG_DEBUG_VM
-	int nid;
-	if (address)
-		return;
-	for_each_node(nid)
-		BUG_ON(knuma_scand_data.mm_numa_fault_tmp[nid]);
-#endif
-}
-
 /*
  * Scan the next part of the mm. Keep track of the progress made and
  * return it.
@@ -758,8 +728,6 @@ static int knumad_do_scan(void)
 	}
 	address = knuma_scand_data.address;
 
-	validate_mm_numa_fault_tmp(address);
-
 	mutex_unlock(&knumad_mm_mutex);
 
 	down_read(&mm->mmap_sem);
@@ -855,9 +824,7 @@ static int knumad_do_scan(void)
 			/* tell autonuma_exit not to list_del */
 			VM_BUG_ON(mm->mm_autonuma->mm != mm);
 			mm->mm_autonuma->mm = NULL;
-			mm_numa_fault_tmp_reset();
-		} else
-			mm_numa_fault_tmp_flush(mm);
+		}
 
 		mmdrop(mm);
 	}
@@ -942,7 +916,6 @@ static int knuma_scand(void *none)
 
 	if (mm)
 		mmdrop(mm);
-	mm_numa_fault_tmp_reset();
 
 	return 0;
 }
@@ -987,11 +960,6 @@ static int start_knuma_scand(void)
 	int err = 0;
 	struct task_struct *knumad_thread;
 
-	knuma_scand_data.mm_numa_fault_tmp = kzalloc(mm_autonuma_fault_size(),
-						     GFP_KERNEL);
-	if (!knuma_scand_data.mm_numa_fault_tmp)
-		return -ENOMEM;
-
 	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
 	if (unlikely(IS_ERR(knumad_thread))) {
 		autonuma_printk(KERN_ERR

Thanks!

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 21:34 ` Mel Gorman
@ 2012-10-12  1:45     ` Andrea Arcangeli
  0 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-12  1:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

Hi Mel,

On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> So after getting through the full review of it, there wasn't anything
> I could not stand. I think it's *very* heavy on some of the paths like
> the idle balancer which I was not keen on and the fault paths are also
> quite heavy.  I think the weight on some of these paths can be reduced
> but not to 0 if the objectives to autonuma are to be met.
> 
> I'm not fully convinced that the task exchange is actually necessary or
> beneficial because it somewhat assumes that there is a symmetry between CPU
> and memory balancing that may not be true. The fact that it only considers

The problem is that without an active task exchange and without an
explicit call to stop_one_cpu*, there's no way to migrate a currently
running task, and clearly we need that. We could wait indefinitely,
hoping the task goes to sleep and leaves the CPU idle, or that a
couple of other tasks start and trigger load balance events.

We must move tasks even if all cpus are in a steady rq->nr_running ==
1 state and there's no other scheduler balance event that could
possibly attempt to move tasks around in such a steady state.
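
For the record, the mechanism is the same trick that
active_load_balance_cpu_stop uses: kick the stopper thread on the
target CPU so it preempts whatever is running there. Simplified sketch
(the callback name is made up; the real code also has to set up the
bidirectional exchange first):

	/*
	 * The stopper runs above every other scheduling class, so
	 * this can move a task that is currently running even when
	 * every runqueue sits at rq->nr_running == 1 and no load
	 * balance event would ever fire.
	 */
	stop_one_cpu(other_cpu, autonuma_task_exchange_cpu_stop, rq);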

Of course one could hack the active idle balancing so that it does the
active NUMA balancing action, but that would be a purely artificial
complication: it would add unnecessary delay and it would provide no
benefit whatsoever.

Why don't we dump the active idle balancing too, and we hack the load
balancing to do the active idle balancing as well? Of course then the
two will be more integrated. But it'll be a mess and slower and
there's a good reason why they exist as totally separated pieces of
code working in parallel.

We can integrate it more, but in my view the result would be worse and
more complicated. Last but not least, messing with the idle balancing
code to do an active NUMA balancing action (somehow invoking
stop_one_cpu* in the steady state described above) would force even
cellphones and UP kernels to deal with NUMA code somehow.

> tasks that are currently running feels a bit random but examining all tasks
> that recently ran on the node would be far too expensive so there is no

So far this seems a good tradeoff. Nothing prevents us from scanning
deeper into the runqueues later if we find a way to do that efficiently.

> good answer. You are caught between a rock and a hard place and either
> direction you go is wrong for different reasons. You need something more

I think you described the problem perfectly ;).

> frequent than scans (because it'll converge too slowly) but doing it from
> the balancer misses some tasks and may run too frequently and it's unclear
> how it affects the current load balancer decisions. I don't have a good
> alternative solution for this but ideally it would be better integrated with
> the existing scheduler when there is more data on what those scheduling
> decisions should be. That will only come from a wide range of testing and
> the inevitable bug reports.
> 
> That said, this is concentrating on the problems without considering the
> situations where it would work very well.  I think it'll come down to HPC
> and anything jitter-sensitive will hate this while workloads like JVM,
> virtualisation or anything that uses a lot of memory without caring about
> placement will love it. It's not perfect but it's better than incurring
> the cost of remote access unconditionally.

Full agreement.

Your detailed full review was very appreciated, thanks!

Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread


* Re: [PATCH 10/33] autonuma: CPU follows memory algorithm
  2012-10-12  0:25       ` Andrea Arcangeli
@ 2012-10-12  8:29         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-12  8:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Fri, Oct 12, 2012 at 02:25:13AM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > > This algorithm takes as input the statistical information filled by the
> > > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > > (p->task_autonuma), evaluates it for the current scheduled task, and
> > > compares it against every other running process to see if it should
> > > move the current task to another NUMA node.
> > > 
> > 
> > That sounds expensive if there are a lot of running processes in the
> > system. How often does this happen? Mention it here even though I
> > realised much later that it's obvious from the patch itself.
> 
> Ok I added:
> 
> ==
> This algorithm will run once every ~100msec,

~100msec (depending on the scheduler tick)

> and can be easily slowed
> down further

using the sysfs tunable ....

>. Its computational complexity is O(nr_cpus) and it's
> executed by all CPUs. The number of running threads and processes is
> not going to alter the cost of this algorithm, only the online number
> of CPUs is. However practically this will very rarely hit on all CPUs
> runqueues. Most of the time it will only compute on local data in the
> task_autonuma struct (for example if convergence has been
> reached). Even if no convergence has been reached yet, it'll only scan
> the CPUs in the NUMA nodes where the local task_autonuma data is
> showing that they are worth migrating to.

Ok, this explains how things currently are, which is better.

> ==
> 
> It's configurable through sysfs; 100msec is the default.
> 
> > > + * there is no affinity set for the task).
> > > + */
> > > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > > +{
> > 
> > nit, but elsewhere you have
> > 
> > static inline TYPE and here you have
> > static TYPE inline
> 
> Fixed.
> 
> > 
> > > +	int task_selected_nid;
> > > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > > +
> > > +	if (!task_autonuma)
> > > +		return true;
> > > +
> > > +	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > > +	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > > +		return true;
> > > +	else
> > > +		return false;
> > > +}
> > 
> > no need for else.
> 
> Removed.
> 
> > 
> > > +
> > > +static inline void sched_autonuma_balance(void)
> > > +{
> > > +	struct task_autonuma *ta = current->task_autonuma;
> > > +
> > > +	if (ta && current->mm)
> > > +		__sched_autonuma_balance();
> > > +}
> > > +
> > 
> > Ok, so this could do with a comment explaining where it is called from.
> > It is called during idle balancing at least so potentially this is every
> > scheduler tick. It'll be run from softirq context so the cost will not
> > be obvious to a process but the overhead will be there. What happens if
> > this takes longer than a scheduler tick to run? Is that possible?
> 
> softirqs can run for huge amount of time so it won't harm.
> 

They're allowed, but it's not free. It's not a stopper but eventually
we'll want to get away from it.

> Nested IRQs could even run on top of the softirq, and they could take
> milliseconds too if they're hyper inefficient and we must still run
> perfectly rock solid (with horrible latency, but still stable).
> 
> I added:
> 
> /*
>  * This is called in the context of the SCHED_SOFTIRQ from
>  * run_rebalance_domains().
>  */
> 

Ok. A vague idea occurred to me while mulling this over that would avoid the
walk. I did not flesh this out at all so there will be major inaccuracies
but hopefully you'll get the general idea.

The scheduler already caches some information about domains such as
sd_llc storing a per-cpu basis a pointer to the highest shared domain
with the same lowest level cache.

It should be possible to cache, on a per-NUMA node domain basis, the
highest mm_numafault and task_mmfault and the PID within that domain
in sd_numa_mostconverged, with one entry per NUMA node. At a scheduling
tick, the current task does the for_each_online_node() walk, calculates
its values, compares them to sd_numa_mostconverged and updates the
cache if necessary.

With a view to integrating this with CFS better, this update should happen
in kernel/sched/fair.c in a function called update_convergence_stats()
or possibly even be integrated within one of the existing CPU walkers
like nohz_idle_balance or maybe in idle_balance itself and moved out of
kernel/sched/numa.c.  It shouldn't migrate tasks at this point, which
would reduce the overhead in the idle balancer.

This should integrate the whole of the following block into CFS.

        /*
         * Identify the NUMA node where this thread (task_struct), and
         * the process (mm_struct) as a whole, has the largest number
         * of NUMA faults.
         */

It then later considers doing the task exchange but only the
sd_numa_mostconverged values for each node are considered.
This gets rid of the
for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) loop with the obvious
caveat that there is no guarantee that cached PID is eligible for exchange
but I expect that's rare (it could be mitigated by never caching pids
that are bound to a single node for example). This would make this block

        /*
         * Check the other NUMA nodes to see if there is a task we
         * should exchange places with.
         */

O(num_online_nodes()) instead of O(num_online_cpus()) and reduce
the cost of that path. It will converge slower, but only slightly slower,
as you only ever consider one task per node anyway after deciding which
one is the best.

Again, in the interest of integrating with CFS further, this whole block
should then move to kernel/sched/fair.c, possibly within load_balance(),
so they are working closer together.

That just leaves the task exchange part, which can remain separate and
just be called from load_balance() when autonuma is in use.

This is not a patch obviously but I think it's important to have some
sort of integration path with CFS in mind.
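Very roughly, as a stand-alone user-space sketch (sd_numa_mostconverged
and update_convergence_stats() are just the names invented above, not
existing kernel symbols), the per-node cache and its update rule could
look like this; the kernel version would hang off the scheduler domains:

#include <stdio.h>

#define MAX_NODES 4

struct numa_converge_cache {
	long best_weight;	/* highest convergence weight seen so far */
	int best_pid;		/* task that reported it */
};

static struct numa_converge_cache sd_numa_mostconverged[MAX_NODES];

/*
 * Called (conceptually) from the scheduler tick: the current task offers
 * its per-node weight; the cache keeps only the best candidate per node,
 * so a later exchange pass is O(nr_nodes) instead of O(nr_cpus).
 */
static void update_convergence_stats(int pid, int nid, long weight)
{
	if (weight > sd_numa_mostconverged[nid].best_weight) {
		sd_numa_mostconverged[nid].best_weight = weight;
		sd_numa_mostconverged[nid].best_pid = pid;
	}
}

int main(void)
{
	update_convergence_stats(101, 0, 700);
	update_convergence_stats(102, 0, 900);	/* replaces pid 101 */
	update_convergence_stats(103, 1, 300);

	for (int nid = 0; nid < MAX_NODES; nid++)
		printf("node %d: pid %d weight %ld\n", nid,
		       sd_numa_mostconverged[nid].best_pid,
		       sd_numa_mostconverged[nid].best_weight);
	return 0;
}

The exchange pass would then only look at the one cached candidate per
node rather than walking every CPU's runqueue.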

> > > +/*
> > > + * This function __sched_autonuma_balance() is responsible for
> > 
> > This function is far too short and could do with another few pages :P
> 
> :) I tried to split it once already but gave up in the middle.
> 

FWIW, the blocks are at least clear and it was easier to follow than I
expected.

> > > + * "Full convergence" is achieved when all memory accesses by a task
> > > + * are 100% local to the CPU it is running on. A task's "best node" is
> > 
> > I think this is the first time you defined convergence in the series.
> > The explanation should be included in the documentation.
> 
> Ok. It's not an easy concept to explain in words.  Here's a try:
> 
>  *
>  * A workload converges when all the memory of a thread or a process
>  * has been placed in the NUMA node of the CPU where the process or
>  * thread is running on.
>  *
> 

Sounds right to me.

> > > + * other_diff: how much the current task is closer to fully converge
> > > + * on the node of the other CPU than the other task that is currently
> > > + * running in the other CPU.
> > 
> > In the changelog you talked about comparing a process with every other
> > running process but here it looks like you intend to examine every
> > process that is *currently running* on a remote node and compare that.
> > What if the best process to swap with is not currently running? Do we
> > miss it?
> 
> Correct, only currently running processes are being checked. If a task
> in R state goes to sleep immediately, it's not relevant where it
> runs. We focus on "long running" compute tasks, so tasks that are in R
> state most frequently.
> 

Ok, so it can still miss some things but we're trying to reduce the
overhead, not increase it. If the best and worst PIDs were cached as I
described above they could be updated either on the idle balancing (and
potentially miss tasks like this does) or, if high granularity were ever
required, it could be done on every reschedule. It's one call for a
relatively light function. I don't think it's necessary to have this
fine granularity though.

> > > + * If both checks succeed it guarantees that we found a way to
> > > + * multilaterally improve the system wide NUMA
> > > + * convergence. Multilateral here means that the same checks will not
> > > + * succeed again on those same two tasks, after the task exchange, so
> > > + * there is no risk of ping-pong.
> > > + *
> > 
> > At least not in that instance of time. A new CPU binding or change in
> > behaviour (such as a computation finishing and a reduce step starting)
> > might change that scoring.
> 
> Yes.
> 
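To illustrate why the symmetry of the check rules out ping-pong, here is
a stand-alone toy (a simplification of mine: the real code does separate
thread and process checks rather than one summed gain). The gain of the
reverse exchange is exactly the negation of the original gain, so the
same pair can never be swapped straight back.

#include <stdio.h>

struct task { int cur_nid; long weight[2]; };

/* Combined convergence gain of exchanging the two tasks' nodes. */
static long exchange_gain(const struct task *a, const struct task *b)
{
	long gain_a = a->weight[b->cur_nid] - a->weight[a->cur_nid];
	long gain_b = b->weight[a->cur_nid] - b->weight[b->cur_nid];
	return gain_a + gain_b;
}

int main(void)
{
	struct task a = { 0, { 200, 800 } };	/* most faults on node 1 */
	struct task b = { 1, { 700, 300 } };	/* most faults on node 0 */

	long gain = exchange_gain(&a, &b);
	printf("gain of exchanging A and B: %ld\n", gain);	/* +1000 > 0: swap */
	if (gain > 0) {
		int tmp = a.cur_nid;
		a.cur_nid = b.cur_nid;
		b.cur_nid = tmp;
	}
	/* Re-running the same check now yields the negated gain: no ping-pong. */
	printf("gain of swapping back: %ld\n", exchange_gain(&a, &b));
	return 0;
}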
> > > + * If a task exchange can happen because the two checks succeed, we
> > > + * select the destination CPU that will give us the biggest increase
> > > + * in system wide convergence (i.e. biggest "weight", in the above
> > > + * quoted code).
> > > + *
> > 
> > So there is a bit of luck that the best task to exchange is currently
> > running. How bad is that? It depends really on the number of tasks
> > running on that node and the priority. There is a chance that it doesn't
> > matter as such because if all the wrong tasks are currently running then
> > no exchange will take place - it was just wasted CPU. It does imply that
> > AutoNUMA works best if CPUs are not over-subscribed with processes. Is
> > that fair?
> 
> It seems to work fine with overcommit as well. specjbb x2 is
> converging fine, as well as numa01 in parallel with numa02. It's
> actually pretty cool to watch.
> 
> Try to run this:
> 
> while :; do ./nmstat -n numa; sleep 1; done
> 
> nmstat is a binary in autonuma benchmark.
> 
> Then run:
> 
> time (./numa01 & ./numa02 & wait)
> 
> The thing is, we work together with CFS, CFS in autopilot works fine,
> we only need to correct the occasional error.
> 
> It works the same as the active idle balancing, that corrects the
> occasional error for HT cores left idle, then CFS takes over.
> 

Ok.

> > Again, I have no suggestions at all on how this might be improved and
> > these comments are hand-waving towards where we *might* see problems in
> > the future. If problems are actually identified in practice for
> > workloads then autonuma can be turned off until the relevant problem
> > area is fixed.
> 
> Exactly, it's enough to run:
> 
> echo 1 >/sys/kernel/mm/autonuma/enabled
> 
> If you want to get rid of the 2 bytes per page too, passing
> "noautonuma" at boot will do it (but then /sys/kernel/mm/autonuma
> disappears and you can't enable it anymore).
> 
> Plus if there's any issue with the cost of sched_autonuma_balance it's
> more than enough to run "perf top" to find out.
> 

Yep. I'm just trying to anticipate what the problems might be so when/if
I see a problem profile I'll have a rough idea what it might be due to.

> > I would fully expect that there are parallel workloads that work on
> > different portions of a large set of data and it would be perfectly
> > reasonable for threads using the same address space to converge on
> > different nodes.
> 
> Agreed. Even if they can't converge fully they could have stats like
> 70/30, 30/70, with 30 being numa-false-shared and we'll schedule them
> right, so running faster than upstream. That 30% will also tend to
> slowly distribute better over time.
> 

Ok

> > I would hope we manage to figure out a way to examine fewer processes,
> > not more :)
> 
> 8)))
> 
> > > +void __sched_autonuma_balance(void)
> > > +{
> > > +	int cpu, nid, selected_cpu, selected_nid, mm_selected_nid;
> > > +	int this_nid = numa_node_id();
> > > +	int this_cpu = smp_processor_id();
> > > +	unsigned long task_fault, task_tot, mm_fault, mm_tot;
> > > +	unsigned long task_max, mm_max;
> > > +	unsigned long weight_diff_max;
> > > +	long uninitialized_var(s_w_nid);
> > > +	long uninitialized_var(s_w_this_nid);
> > > +	long uninitialized_var(s_w_other);
> > > +	bool uninitialized_var(s_w_type_thread);
> > > +	struct cpumask *allowed;
> > > +	struct task_struct *p = current, *other_task;
> > 
> > So the task in question is current but this is called by the idle
> > balancer. I'm missing something obvious here but it's not clear to me why
> > that process is necessarily relevant. What guarantee is there that all
> > tasks will eventually run this code? Maybe it doesn't matter because the
> > most CPU intensive tasks are also the most likely to end up in here but
> > a clarification would be nice.
> 
> Exactly. We only focus on who is significantly computing. If a task
> runs for 1msec we can't possibly care where it runs and where the
> memory is. If it keeps running for 1msec, over time even that task
> will be migrated right.
> 

This limitation is fine, but it should be mentioned in a comment above
__sched_autonuma_balance() for the next person that reviews this in the
future.

> > > +	struct task_autonuma *task_autonuma = p->task_autonuma;
> > > +	struct mm_autonuma *mm_autonuma;
> > > +	struct rq *rq;
> > > +
> > > +	/* per-cpu statically allocated in runqueues */
> > > +	long *task_numa_weight;
> > > +	long *mm_numa_weight;
> > > +
> > > +	if (!task_autonuma || !p->mm)
> > > +		return;
> > > +
> > > +	if (!autonuma_enabled()) {
> > > +		if (task_autonuma->task_selected_nid != -1)
> > > +			task_autonuma->task_selected_nid = -1;
> > > +		return;
> > > +	}
> > > +
> > > +	allowed = tsk_cpus_allowed(p);
> > > +	mm_autonuma = p->mm->mm_autonuma;
> > > +
> > > +	/*
> > > +	 * If the task has no NUMA hinting page faults or if the mm
> > > +	 * hasn't been fully scanned by knuma_scand yet, set task
> > > +	 * selected nid to the current nid, to avoid the task bounce
> > > +	 * around randomly.
> > > +	 */
> > > +	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
> > 
> > Why ACCESS_ONCE?
> 
> mm variables are altered by other threads too. Only task_autonuma is
> local to this task and cannot change from under us.
> 
> I did it all lockless, I don't care if we're off once in a while.
> 

Mention why ACCESS_ONCE is used in a comment the first time it appears
in kernel/sched/numa.c. It's not necessary to mention it after that.
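For reference, the idiom such a comment would describe looks like this as
a stand-alone GCC example (the variable name is made up; this shows the
read-once idiom, not a concurrency test). ACCESS_ONCE() keeps the compiler
from re-reading the variable, so every later use in the function sees the
same snapshot even while other threads update it.

#include <stdio.h>

/* Same shape as the (pre-READ_ONCE) kernel macro. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static unsigned long mm_numa_fault_tot;	/* illustrative: updated by other threads */

int main(void)
{
	/* Read once into a local; all checks below use the same value. */
	unsigned long tot = ACCESS_ONCE(mm_numa_fault_tot);

	if (!tot)
		printf("no stats yet, keep the task on the current node\n");
	else
		printf("tot=%lu\n", tot);
	return 0;
}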

> > > +	if (!mm_tot) {
> > > +		if (task_autonuma->task_selected_nid != this_nid)
> > > +			task_autonuma->task_selected_nid = this_nid;
> > > +		return;
> > > +	}
> > > +	task_tot = task_autonuma->task_numa_fault_tot;
> > > +	if (!task_tot) {
> > > +		if (task_autonuma->task_selected_nid != this_nid)
> > > +			task_autonuma->task_selected_nid = this_nid;
> > > +		return;
> > > +	}
> > > +
> > > +	rq = cpu_rq(this_cpu);
> > > +
> > > +	/*
> > > +	 * Verify that we can migrate the current task, otherwise try
> > > +	 * again later.
> > > +	 */
> > > +	if (ACCESS_ONCE(rq->autonuma_balance))
> > > +		return;
> > > +
> > > +	/*
> > > +	 * The following two arrays will hold the NUMA affinity weight
> > > +	 * information for the current process if scheduled on the
> > > +	 * given NUMA node.
> > > +	 *
> > > +	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
> > > +	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
> > > +	 */
> > > +	task_numa_weight = rq->task_numa_weight;
> > > +	mm_numa_weight = rq->mm_numa_weight;
> > > +
> > > +	/*
> > > +	 * Identify the NUMA node where this thread (task_struct), and
> > > +	 * the process (mm_struct) as a whole, has the largest number
> > > +	 * of NUMA faults.
> > > +	 */
> > > +	task_max = mm_max = 0;
> > > +	selected_nid = mm_selected_nid = -1;
> > > +	for_each_online_node(nid) {
> > > +		mm_fault = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
> > > +		task_fault = task_autonuma->task_numa_fault[nid];
> > > +		if (mm_fault > mm_tot)
> > > +			/* could be removed with a seqlock */
> > > +			mm_tot = mm_fault;
> > > +		mm_numa_weight[nid] = mm_fault*AUTONUMA_BALANCE_SCALE/mm_tot;
> > > +		if (task_fault > task_tot) {
> > > +			task_tot = task_fault;
> > > +			WARN_ON(1);
> > > +		}
> > > +		task_numa_weight[nid] = task_fault*AUTONUMA_BALANCE_SCALE/task_tot;
> > > +		if (mm_numa_weight[nid] > mm_max) {
> > > +			mm_max = mm_numa_weight[nid];
> > > +			mm_selected_nid = nid;
> > > +		}
> > > +		if (task_numa_weight[nid] > task_max) {
> > > +			task_max = task_numa_weight[nid];
> > > +			selected_nid = nid;
> > > +		}
> > > +	}
> > 
> > Ok, so this is a big walk to take every time and as this happens every
> > scheduler tick, it seems unlikely that the workload would be changing
> > phases that often in terms of NUMA behaviour. Would it be possible for
> > this to be sampled less frequently and cache the result?
> 
> Even if there are 8 nodes, this is fairly quick and only requires 2
> cachelines. At 16 nodes we're at 4 cachelines. The cacheline of
> task_autonuma is fully local. The one of mm_autonuma can be shared
> (modulo numa hinting page faults with autonuma28, in autonuma27 it was
> also sharable even despite numa hinting page faults).
> 

Two cachelines that bounce though because of writes. I still don't
really like it but it can be lived with for now I guess, it's not my call
really. However, I'd like you to consider the suggestion above on how we
might create a per-NUMA scheduling domain cache of this information that
is only updated by a task if it scores "better" or "worse" than the current
cached value.
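For reference, the arithmetic in the quoted loop boils down to the
following stand-alone example (the fault counts are made up): each node's
weight is its share of the NUMA hinting faults scaled by a constant, and
the preferred node is simply the maximum.

#include <stdio.h>

#define AUTONUMA_BALANCE_SCALE 1000	/* same role as the kernel constant */
#define NR_NODES 4

int main(void)
{
	/* Hypothetical per-node NUMA hinting fault counts for one task. */
	unsigned long task_fault[NR_NODES] = { 120, 40, 820, 20 };
	unsigned long task_tot = 0;
	long task_weight[NR_NODES], task_max = 0;
	int nid, selected_nid = -1;

	for (nid = 0; nid < NR_NODES; nid++)
		task_tot += task_fault[nid];

	for (nid = 0; nid < NR_NODES; nid++) {
		task_weight[nid] = task_fault[nid] *
				   AUTONUMA_BALANCE_SCALE / task_tot;
		if (task_weight[nid] > task_max) {
			task_max = task_weight[nid];
			selected_nid = nid;
		}
		printf("node %d: weight %ld\n", nid, task_weight[nid]);
	}
	printf("preferred node: %d (weight %ld/%d)\n",
	       selected_nid, task_max, AUTONUMA_BALANCE_SCALE);
	return 0;
}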

> > > +			/*
> > > +			 * Grab the fault/tot of the processes running
> > > +			 * in the other CPUs to compute w_other.
> > > +			 */
> > > +			raw_spin_lock_irq(&rq->lock);
> > > +			_other_task = rq->curr;
> > > +			/* recheck after implicit barrier() */
> > > +			mm = _other_task->mm;
> > > +			if (!mm) {
> > > +				raw_spin_unlock_irq(&rq->lock);
> > > +				continue;
> > > +			}
> > > +
> > 
> > Is it really critical to pin those values using the lock? That seems *really*
> > heavy. If the results have to be exactly stable then is there any chance
> > the values could be encoded in the high and low bits of a single unsigned
> > long and read without the lock?  Updates would be more expensive but that's
> > in a trap anyway. This on the other hand is a scheduler path.
> 
> The reason for the lock is to prevent rq->curr, mm etc. from being freed
> from under us.
> 

Crap, yes.

> > > +			/*
> > > +			 * Check if the _other_task is allowed to be
> > > +			 * migrated to this_cpu.
> > > +			 */
> > > +			if (!cpumask_test_cpu(this_cpu,
> > > +					      tsk_cpus_allowed(_other_task))) {
> > > +				raw_spin_unlock_irq(&rq->lock);
> > > +				continue;
> > > +			}
> > > +
> > 
> > Would it not make sense to check this *before* we take the lock and
> > grab all its counters? It probably will not make much of a difference in
> > practice as I expect it's rare that the target CPU is running a task
> > that can't migrate but it still feels the wrong way around.
> 
> It's a micro optimization to do it here. It's too rare that the above
> fails, while tot may be zero much more frequently (like if the task
> has just been started).
> 

Ok.

> > > +	if (selected_cpu != this_cpu) {
> > > +		if (autonuma_debug()) {
> > > +			char *w_type_str;
> > > +			w_type_str = s_w_type_thread ? "thread" : "process";
> > > +			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
> > > +			       p->mm, p->pid, this_nid, selected_nid,
> > > +			       this_cpu, selected_cpu,
> > > +			       s_w_other, s_w_nid, s_w_this_nid,
> > > +			       w_type_str);
> > > +		}
> > 
> > Can these be made tracepoints and get rid of the autonuma_debug() check?
> > I recognise there is a risk that some tool might grow to depend on
> > implementation details but in this case it seems very unlikely.
> 
> The debug mode also provides me with a (racy) dump of all mms; I wouldn't
> know how to do that with tracing.
> 

For live reporting on a terminal:

$ trace-cmd start -e autonuma:some_event_whatever_you_called_it
$ cat /sys/kernel/debug/tracing/trace_pipe
$ trace-cmd stop -e autonuma:some_event_whatever_you_called_it

you can record the trace using trace-cmd record but I suspect in this
case you want live reporting and I think this is the best way of doing
it.

> So I wouldn't remove the printk until we can replace everything with
> tracing, but I'd welcome adding a tracepoint too. There are already
> other proper tracepoints driving "perf script numatop".
> 

Good.

> > Ok, so I confess I did not work out if the weights and calculations really
> > make sense or not but at a glance they seem reasonable and I spotted no
> > obvious flaws. The function is pretty heavy though and may be doing more
> > work around locking than is really necessary. That said, there will be
> > workloads where the cost is justified and offset by the performance gains
> > from improved NUMA locality. I just don't expect it to be a universal win so
> > we'll need to keep an eye on the system CPU usage and incrementally optimise
> > where possible. I suspect there will be a time when an incremental
> > optimisation just does not cut it any more but by then I would also
> > expect there will be more data on how autonuma behaves in practice and a
> > new algorithm might be more obvious at that point.
> 
> Agreed. Chances are I can replace all this already with RCU and a
> rcu_dereference or ACCESS_ONCE to grab the rq->curr->task_autonuma and
> rq->curr->mm->mm_autonuma data. I didn't try yet. The task struct
> shouldn't go away from under us after rcu_read_lock, the mm may be more
> tricky, I haven't checked this yet. Optimizations welcome ;)
> 

Optimizations are limited to hand waving and no patches for the moment
:)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread


* Re: [PATCH 00/33] AutoNUMA27
  2012-10-12  1:45     ` Andrea Arcangeli
@ 2012-10-12  8:46       ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-12  8:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Fri, Oct 12, 2012 at 03:45:53AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> > So after getting through the full review of it, there wasn't anything
> > I could not stand. I think it's *very* heavy on some of the paths like
> > the idle balancer which I was not keen on and the fault paths are also
> > quite heavy.  I think the weight on some of these paths can be reduced
> > but not to 0 if the objectives of autonuma are to be met.
> > 
> > I'm not fully convinced that the task exchange is actually necessary or
> > beneficial because it somewhat assumes that there is a symmetry between CPU
> > and memory balancing that may not be true. The fact that it only considers
> 
> The problem is that without an active task exchange and an explicit
> call to stop_one_cpu*, there's no way to migrate a currently running
> task, and clearly we need that. We could wait indefinitely, hoping the
> task goes to sleep and leaves the CPU idle, or that a couple of other
> tasks start and trigger load balance events.
> 

Stick that in a comment although I still don't fully see why the actual
exchange is necessary and why you cannot just move the current task to
the remote CPU's runqueue. Maybe it's something to do with them converging
faster if you do an exchange. I'll figure it out eventually.

> We must move tasks even if all cpus are in a steady rq->nr_running ==
> 1 state and there's no other scheduler balance event that could
> possibly attempt to move tasks around in such a steady state.
> 

I see: just because there is a 1:1 mapping between tasks and
CPUs does not mean that the system has converged from a NUMA perspective. The
idle balancer could be moving a task to an idle CPU that is poor from a
NUMA point of view. Better integration with the load balancer, and caching
both the best and worst converged processes on a per-NUMA basis, might help
but I'm hand-waving.

> Of course one could hack the active idle balancing so that it does the
> active NUMA balancing action, but that would be a purely artificial
> complication: it would add unnecessary delay and it would provide no
> benefit whatsoever.
> 
> Why don't we dump the active idle balancing too, and hack the load
> balancing to do the active idle balancing as well? Of course then the
> two will be more integrated. But it'll be a mess and slower and
> there's a good reason why they exist as totally separated pieces of
> code working in parallel.
> 

I'm not 100% convinced they have to be separate but you have thought about
this a hell of a lot more than I have and I'm a scheduling dummy.

For example, to me it seems that if the load balancer was going to move a
task to an idle CPU on a remote node, it could also check if it would be
more or less converged before moving and reject the balancing if it would
be less converged after the move. This increases the search cost in the
load balancer but is not necessarily any worse than what happens currently.
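A minimal sketch of that check (not existing kernel code, names invented
here): reject an idle-balance move whose destination node scores worse
for the task's NUMA weights than the node it is currently on.

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 2

struct task_numa { long weight[NR_NODES]; };

/* Allow the move only if the destination node scores at least as well. */
static bool numa_allows_move(const struct task_numa *t, int src_nid, int dst_nid)
{
	return t->weight[dst_nid] >= t->weight[src_nid];
}

int main(void)
{
	struct task_numa t = { .weight = { 900, 100 } };	/* converged on node 0 */

	printf("move node 0 -> 1 allowed? %s\n",
	       numa_allows_move(&t, 0, 1) ? "yes" : "no");	/* no  */
	printf("move node 1 -> 0 allowed? %s\n",
	       numa_allows_move(&t, 1, 0) ? "yes" : "no");	/* yes */
	return 0;
}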

> We can integrate it more, but in my view the result would be worse and
> more complicated. Last but not least, messing with the idle balancing
> code to do an active NUMA balancing action (somehow invoking
> stop_one_cpu* in the steady state described above) would force even
> cellphones and UP kernels to deal with NUMA code somehow.
> 

hmm...

> > tasks that are currently running feels a bit random but examining all tasks
> > that recently ran on the node would be far too expensive so there is no
> 
> So far this seems a good tradeoff. Nothing prevents us from scanning
> deeper into the runqueues later if we find a way to do that efficiently.
> 

I don't think there is an efficient way to do that but I'm hoping
caching an exchange candidate on a per-NUMA basis could reduce the cost
while still converging reasonably quickly.

> > good answer. You are caught between a rock and a hard place and either
> > direction you go is wrong for different reasons. You need something more
> 
> I think you described the problem perfectly ;).
> 
> > frequent than scans (because it'll converge too slowly) but doing it from
> > the balancer misses some tasks and may run too frequently and it's unclear
> > how it affects the current load balancer decisions. I don't have a good
> > alternative solution for this but ideally it would be better integrated with
> > the existing scheduler when there is more data on what those scheduling
> > decisions should be. That will only come from a wide range of testing and
> > the inevitable bug reports.
> > 
> > That said, this is concentrating on the problems without considering the
> > situations where it would work very well.  I think it'll come down to HPC
> > and anything jitter-sensitive will hate this while workloads like JVM,
> > virtualisation or anything that uses a lot of memory without caring about
> > placement will love it. It's not perfect but it's better than incurring
> > the cost of remote access unconditionally.
> 
> Full agreement.
> 
> Your detailed full review was very much appreciated, thanks!
> 

You're welcome.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread


* Re: [PATCH 23/33] autonuma: retain page last_nid information in khugepaged
  2012-10-11 18:44   ` Mel Gorman
@ 2012-10-12 11:37     ` Rik van Riel
  2012-10-12 12:35       ` Mel Gorman
  0 siblings, 1 reply; 148+ messages in thread
From: Rik van Riel @ 2012-10-12 11:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 10/11/2012 02:44 PM, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:51:05AM +0200, Andrea Arcangeli wrote:
>> When pages are collapsed try to keep the last_nid information from one
>> of the original pages.
>>
>
> If two pages within a THP disagree on the node, should the collapsing be
> aborted? I would expect that the code of a remote access exceeds the
> gain from reduced TLB overhead.

Hard to predict.  The gains from THP seem to be on the same
order as the gains from NUMA locality, both between 5-15%
typically.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 23/33] autonuma: retain page last_nid information in khugepaged
  2012-10-12 11:37     ` Rik van Riel
@ 2012-10-12 12:35       ` Mel Gorman
  0 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-12 12:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Oct 12, 2012 at 07:37:50AM -0400, Rik van Riel wrote:
> On 10/11/2012 02:44 PM, Mel Gorman wrote:
> >On Thu, Oct 04, 2012 at 01:51:05AM +0200, Andrea Arcangeli wrote:
> >>When pages are collapsed try to keep the last_nid information from one
> >>of the original pages.
> >>
> >
> >If two pages within a THP disagree on the node, should the collapsing be
> >aborted? I would expect that the cost of a remote access exceeds the
> >gain from reduced TLB overhead.
> 
> Hard to predict.  The gains from THP seem to be on the same
> order as the gains from NUMA locality, both between 5-15%
> typically.
> 

Usually yes, but in this case you know that at least 50% of those accesses
are going to be remote, and as autonuma will be attempting to get hints at
the PMD level there is going to be a struggle between khugepaged collapsing
the page and autonuma splitting it again for NUMA migration. It feels to me
that the best decision in this case is to leave the pages split and let
NUMA hinting take place at the PTE level.
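
For illustration only, a sketch of that policy as it might look on the
khugepaged side; this is not code from the series, and page_last_nid() is a
hypothetical stand-in for however the patches expose a page's last_nid:

/*
 * Abort the collapse when the candidate subpages disagree on last_nid,
 * so the range stays split and NUMA hinting keeps working per PTE.
 */
static bool thp_subpages_agree_on_nid(struct page **pages, int nr)
{
	int i, nid = page_last_nid(pages[0]);	/* hypothetical accessor */

	for (i = 1; i < nr; i++)
		if (page_last_nid(pages[i]) != nid)
			return false;		/* mixed nodes: skip collapse */
	return true;
}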

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 15/33] autonuma: alloc/free/init task_autonuma
       [not found]       ` <20121011175953.GT1818@redhat.com>
@ 2012-10-12 14:03           ` Rik van Riel
  0 siblings, 0 replies; 148+ messages in thread
From: Rik van Riel @ 2012-10-12 14:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Ingo Molnar, Hugh Dickins,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On 10/11/2012 01:59 PM, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 01:34:12PM -0400, Rik van Riel wrote:

>> That is indeed a future optimization I have suggested
>> in the past. Allocation of this struct could be deferred
>> until the first time knuma_scand unmaps pages from the
>> process to generate NUMA page faults.
>
> I already tried this, and quickly noticed that for mm_autonuma we
> can't, or we wouldn't have memory to queue the "mm" into knuma_scand
> in the first place.
>
> For task_autonuma we could, but then we wouldn't be able to inherit
> the task_autonuma->task_autonuma_nid across clone/fork which kind of
> makes sense to me (and it's done by default without knob at the
> moment). It's actually more important for clone than for fork but it
> might be good for fork too if it doesn't exec immediately.
>
> Another option is to move task_autonuma_nid in the task_structure
> (it's in the stack so it won't cost RAM). Then I probably can defer
> the task_autonuma if I remove the child_inheritance knob.
>
> In knuma_scand we don't have the task pointer, so task_autonuma would
> need to be allocated in the NUMA page faults, the first time it fires.

One thing that could be done is to keep the (few) mm-
and task-specific bits directly in the mm and task
structs, and put the sized-by-number-of-nodes statistics
in a separate numa_stats struct.

At that point, the numa_stats struct could be lazily
allocated, reducing the memory allocations at fork
time by two (and the frees at exit time, for short-lived
processes).
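
A rough sketch of that split, purely for illustration (the struct and field
names below are made up, not from the series): the small fixed-size fields
stay embedded, and only the per-node counters are allocated on first use.

struct numa_stats {
	unsigned long faults[];		/* nr_node_ids entries */
};

struct task_numa {			/* embedded in task_struct */
	int selected_nid;		/* i.e. task_autonuma_nid */
	u64 fault_pass;
	struct numa_stats *stats;	/* NULL until the first hinting fault */
};

static struct numa_stats *task_numa_stats(struct task_numa *tn)
{
	/* lazy allocation: short-lived tasks that never fault skip it */
	if (!tn->stats)
		tn->stats = kzalloc(nr_node_ids * sizeof(unsigned long),
				    GFP_KERNEL);
	return tn->stats;		/* may be NULL on allocation failure */
}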

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-11 15:35       ` Mel Gorman
@ 2012-10-12 14:54         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-12 14:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> > 
> > On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > > Benchmark on a 4-node machine and the following fell out.
> > > 
> > >                                      3.6.0                 3.6.0
> > >                                    vanilla        autonuma-v33r6
> > > User    SMT             82851.82 (  0.00%)    33084.03 ( 60.07%)
> > > User    THREAD_ALLOC   142723.90 (  0.00%)    47707.38 ( 66.57%)
> > > System  SMT               396.68 (  0.00%)      621.46 (-56.67%)
> > > System  THREAD_ALLOC      675.22 (  0.00%)      836.96 (-23.95%)
> > > Elapsed SMT              1987.08 (  0.00%)      828.57 ( 58.30%)
> > > Elapsed THREAD_ALLOC     3222.99 (  0.00%)     1101.31 ( 65.83%)
> > > CPU     SMT              4189.00 (  0.00%)     4067.00 (  2.91%)
> > > CPU     THREAD_ALLOC     4449.00 (  0.00%)     4407.00 (  0.94%)
> > 
> > Thanks a lot for the help and for looking into it!
> > 
> > Just curious, why are you running only numa02_SMT and
> > numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> > without _suffix)
> > 
> 
> Bug in the testing script on my end. Each of them is run separately and it

Ok, MMTests 0.06 (released a few minutes ago) patches autonumabench so
it can run the tests individually. I know start_bench.sh can run all the
tests itself, but in time I'll want mmtests to collect additional stats
that can also be applied to other benchmarks consistently. The revised
results look like this:

AUTONUMA BENCH
                                          3.6.0                 3.6.0
                                        vanilla        autonuma-v33r6
User    NUMA01               66395.58 (  0.00%)    32000.83 ( 51.80%)
User    NUMA01_THEADLOCAL    55952.48 (  0.00%)    16950.48 ( 69.71%)
User    NUMA02                6988.51 (  0.00%)     2150.56 ( 69.23%)
User    NUMA02_SMT            2914.25 (  0.00%)     1013.11 ( 65.24%)
System  NUMA01                 319.12 (  0.00%)      483.60 (-51.54%)
System  NUMA01_THEADLOCAL       40.60 (  0.00%)      184.39 (-354.16%)
System  NUMA02                   1.62 (  0.00%)       23.92 (-1376.54%)
System  NUMA02_SMT               0.90 (  0.00%)       16.20 (-1700.00%)
Elapsed NUMA01                1519.53 (  0.00%)      757.40 ( 50.16%)
Elapsed NUMA01_THEADLOCAL     1269.49 (  0.00%)      398.63 ( 68.60%)
Elapsed NUMA02                 181.12 (  0.00%)       57.09 ( 68.48%)
Elapsed NUMA02_SMT             164.18 (  0.00%)       53.16 ( 67.62%)
CPU     NUMA01                4390.00 (  0.00%)     4288.00 (  2.32%)
CPU     NUMA01_THEADLOCAL     4410.00 (  0.00%)     4298.00 (  2.54%)
CPU     NUMA02                3859.00 (  0.00%)     3808.00 (  1.32%)
CPU     NUMA02_SMT            1775.00 (  0.00%)     1935.00 ( -9.01%)

MMTests Statistics: duration
               3.6.0       3.6.0
             vanilla autonuma-v33r6
User       132257.44    52121.30
System        362.79      708.62
Elapsed      3142.66     1275.72

MMTests Statistics: vmstat
                              3.6.0       3.6.0
                            vanilla autonuma-v33r6
THP fault alloc               17660       19927
THP collapse alloc               10       12399
THP splits                        4       12637

The System CPU usage is higher but is compensated for by the reduced User
and Elapsed times in this particular case.
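
To put rough numbers on that from the duration table above: System CPU grows
by about 708.62 - 362.79 = 346 seconds, while User CPU drops by about
132257.44 - 52121.30 = 80136 seconds and Elapsed time by about
3142.66 - 1275.72 = 1867 seconds, so the extra system time is well under 1%
of the user time saved.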

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
  2012-10-10 22:01   ` Rik van Riel
  2012-10-11 18:28   ` Mel Gorman
@ 2012-10-13 18:06   ` Srikar Dronamraju
  2012-10-15  8:24       ` Srikar Dronamraju
  2 siblings, 1 reply; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-13 18:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

> +
> +bool numa_hinting_fault(struct page *page, int numpages)
> +{
> +	bool migrated = false;
> +
> +	/*
> +	 * "current->mm" could be different from the "mm" where the
> +	 * NUMA hinting page fault happened, if get_user_pages()
> +	 * triggered the fault on some other process "mm". That is ok,
> +	 * all we care about is to count the "page_nid" access on the
> +	 * current->task_autonuma, even if the page belongs to a
> +	 * different "mm".
> +	 */
> +	WARN_ON_ONCE(!current->mm);

Given the above comment, do we really need this WARN_ON?
I think I have seen this warning when using autonuma.

> +	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
> +		struct task_struct *p = current;
> +		int this_nid, page_nid, access_nid;
> +		bool new_pass;
> +
> +		/*
> +		 * new_pass is only true the first time the thread
> +		 * faults on this pass of knuma_scand.
> +		 */
> +		new_pass = p->task_autonuma->task_numa_fault_pass !=
> +			p->mm->mm_autonuma->mm_numa_fault_pass;
> +		page_nid = page_to_nid(page);
> +		this_nid = numa_node_id();
> +		VM_BUG_ON(this_nid < 0);
> +		VM_BUG_ON(this_nid >= MAX_NUMNODES);
> +		access_nid = numa_hinting_fault_memory_follow_cpu(page,
> +								  this_nid,
> +								  page_nid,
> +								  new_pass,
> +								  &migrated);
> +		/* "page" has been already freed if "migrated" is true */
> +		numa_hinting_fault_cpu_follow_memory(p, access_nid,
> +						     numpages, new_pass);
> +		if (unlikely(new_pass))
> +			/*
> +			 * Set the task's fault_pass equal to the new
> +			 * mm's fault_pass, so new_pass will be false
> +			 * on the next fault by this thread in this
> +			 * same pass.
> +			 */
> +			p->task_autonuma->task_numa_fault_pass =
> +				p->mm->mm_autonuma->mm_numa_fault_pass;
> +	}
> +
> +	return migrated;
> +}
> +
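
To make that concrete (a sketch only, not the author's fix): a kernel thread
such as ksmd faulting through get_user_pages() runs with current->mm == NULL,
which is exactly what the trace later in this thread shows, so the existing
guard already skips the accounting and the WARN_ON_ONCE() could arguably be
dropped or replaced by an early return:

bool numa_hinting_fault(struct page *page, int numpages)
{
	bool migrated = false;

	/*
	 * A kernel thread (e.g. ksmd) faulting via get_user_pages() has
	 * no current->mm to account against, so bail out quietly instead
	 * of warning.
	 */
	if (unlikely(!current->mm))
		return migrated;

	if (likely(!current->mempolicy && autonuma_enabled())) {
		/* ... page_nid/access_nid accounting as in the patch ... */
	}

	return migrated;
}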

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
@ 2012-10-13 18:40   ` Srikar Dronamraju
  2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
                     ` (36 subsequent siblings)
  37 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-13 18:40 UTC (permalink / raw)
  To: aarcange
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

* Andrea Arcangeli <aarcange@redhat.com> [2012-10-04 01:50:42]:

> Hello everyone,
> 
> This is a new AutoNUMA27 release for Linux v3.6.
> 


Here are the results of autonumabenchmark on a 328GB, 64-core machine with
HT disabled, comparing v3.6 with autonuma27.

$ numactl -H 
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31689 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31917 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31928 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31926 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31913 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63952 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64230 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  20  20  20  20  20  20  20 
  1:  20  10  20  20  20  20  20  20 
  2:  20  20  10  20  20  20  20  20 
  3:  20  20  20  10  20  20  20  20 
  4:  20  20  20  20  10  20  20  20 
  5:  20  20  20  20  20  10  20  20 
  6:  20  20  20  20  20  20  10  20 
  7:  20  20  20  20  20  20  20  10 



          KernelVersion:                 3.6.0-mainline_v36
                        Testcase:     Min      Max      Avg
                          numa01: 1509.14  2098.75  1793.90
                numa01_HARD_BIND:  865.43  1826.40  1334.85
             numa01_INVERSE_BIND: 3242.76  3496.71  3345.12
             numa01_THREAD_ALLOC:  944.28  1418.78  1214.32
   numa01_THREAD_ALLOC_HARD_BIND:  696.33  1004.99   825.63
numa01_THREAD_ALLOC_INVERSE_BIND: 2072.88  2301.27  2186.33
                          numa02:  129.87   146.10   136.88
                numa02_HARD_BIND:   25.81    26.18    25.97
             numa02_INVERSE_BIND:  341.96   354.73   345.59
                      numa02_SMT:  160.77   246.66   186.85
            numa02_SMT_HARD_BIND:   25.77    38.86    33.57
         numa02_SMT_INVERSE_BIND:  282.61   326.76   296.44

          KernelVersion:               3.6.0-autonuma27+                            
                        Testcase:     Min      Max      Avg  %Change   
                          numa01: 1805.19  1907.11  1866.39    -3.88%  
                numa01_HARD_BIND:  953.33  2050.23  1603.29   -16.74%  
             numa01_INVERSE_BIND: 3515.14  3882.10  3715.28    -9.96%  
             numa01_THREAD_ALLOC:  323.50   362.17   348.81   248.13%  
   numa01_THREAD_ALLOC_HARD_BIND:  841.08  1205.80   977.43   -15.53%  
numa01_THREAD_ALLOC_INVERSE_BIND: 2268.35  2654.89  2439.51   -10.38%  
                          numa02:   51.64    73.35    58.88   132.47%  
                numa02_HARD_BIND:   25.23    26.31    25.93     0.15%  
             numa02_INVERSE_BIND:  338.39   355.70   344.82     0.22%  
                      numa02_SMT:   51.76    66.78    58.63   218.69%  
            numa02_SMT_HARD_BIND:   34.95    45.39    39.24   -14.45%  
         numa02_SMT_INVERSE_BIND:  287.85   300.82   295.80     0.22%  
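
(Judging from the numbers, %Change appears to be computed as
v3.6 avg / autonuma27 avg - 1; for example numa01_THREAD_ALLOC:
1214.32 / 348.81 - 1 = +248.13%, and numa01: 1793.90 / 1866.39 - 1 = -3.88%.
Positive values are therefore speedups and negative values slowdowns.)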


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-13 18:40   ` Srikar Dronamraju
@ 2012-10-14  4:57     ` Andrea Arcangeli
  -1 siblings, 0 replies; 148+ messages in thread
From: Andrea Arcangeli @ 2012-10-14  4:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

Hi Srikar,

On Sun, Oct 14, 2012 at 12:10:19AM +0530, Srikar Dronamraju wrote:
> * Andrea Arcangeli <aarcange@redhat.com> [2012-10-04 01:50:42]:
> 
> > Hello everyone,
> > 
> > This is a new AutoNUMA27 release for Linux v3.6.
> > 
> 
> 
> Here results of autonumabenchmark on a 328GB 64 core with ht disabled
> comparing v3.6 with autonuma27.

*snip*

>                           numa01: 1805.19  1907.11  1866.39    -3.88%  

Interesting. So numa01 should be improved in autonuma28fast. Not sure
why the hard binds show any difference, but I'm more concerned with
optimizing numa01. I get the same results from hard bindings on
upstream or autonuma, which is strange.

Could you repeat only numa01 with the origin/autonuma28fast branch?
Also if you could post the two pdf convergence chart generated by
numa01 on autonuma27 and autonuma28fast, I think that would be
interesting to see the full effect and why it is faster.

I only had time for a quick push after getting the idea added in
autonuma28fast (which is further improved compared to autonuma28), but
I've already been told that it deals with numa01 on the 8 node system
very well, as expected.

numa01 on the 8 node system is a workload without a perfect solution
(other than MADV_INTERLEAVE). Full convergence that prevents cross-node
traffic is impossible because there are 2 processes spanning 8 nodes and
all process memory is touched by all threads constantly. Yet
autonuma28fast should deal optimally with that scenario too.

As a side note: numa01 on the 2 node system instead converges fully (2
processes + 2 nodes = full convergence). numa01 on 2 nodes and on >2
nodes are very different kinds of test.
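
Rough arithmetic for why, assuming the scheduler splits the two processes
evenly across 4 nodes each: since every thread touches all of its process's
memory, wherever that memory ends up, on average only about 1/4 of the
accesses can be local and roughly 3 out of 4 stay remote even with ideal
placement. With 2 processes on 2 nodes each process can instead be confined
to a single node, so every access can become local.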

I'll release an autonuma29 behaving like 28fast if there are no
surprises. The new algorithm change in 28fast will also save memory
once I rewrite it properly.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-14  4:57     ` Andrea Arcangeli
@ 2012-10-15  8:16       ` Srikar Dronamraju
  -1 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-15  8:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

> 
> Interesting. So numa01 should be improved in autonuma28fast. Not sure
> why the hard binds show any difference, but I'm more concerned in
> optimizing numa01. I get the same results from hard bindings on
> upstream or autonuma, strange.
> 
> Could you repeat only numa01 with the origin/autonuma28fast branch?

Okay, will try to get the numbers on autonuma28 soon.

> Also if you could post the two pdf convergence chart generated by
> numa01 on autonuma27 and autonuma28fast, I think that would be
> interesting to see the full effect and why it is faster.

I have attached the chart for autonuma27 in a private email.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-13 18:06   ` Srikar Dronamraju
@ 2012-10-15  8:24       ` Srikar Dronamraju
  0 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-15  8:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:

> > +
> > +bool numa_hinting_fault(struct page *page, int numpages)
> > +{
> > +	bool migrated = false;
> > +
> > +	/*
> > +	 * "current->mm" could be different from the "mm" where the
> > +	 * NUMA hinting page fault happened, if get_user_pages()
> > +	 * triggered the fault on some other process "mm". That is ok,
> > +	 * all we care about is to count the "page_nid" access on the
> > +	 * current->task_autonuma, even if the page belongs to a
> > +	 * different "mm".
> > +	 */
> > +	WARN_ON_ONCE(!current->mm);
> 
> Given the above comment, Do we really need this warn_on?
> I think I have seen this warning when using autonuma.
> 

------------[ cut here ]------------
WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
Hardware name: BladeCenter HS22V -[7871AC1]-
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3
Call Trace:
 [<ffffffff8105194f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff810519aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81153f0d>] numa_hinting_fault+0x60d/0x7c0
 [<ffffffff8104ae90>] ? flush_tlb_mm_range+0x250/0x250
 [<ffffffff8103b82e>] ? physflat_send_IPI_mask+0xe/0x10
 [<ffffffff81036db5>] ? native_send_call_func_ipi+0xa5/0xd0
 [<ffffffff81154255>] pmd_numa_fixup+0x195/0x350
 [<ffffffff81135ef4>] handle_mm_fault+0x2c4/0x3d0
 [<ffffffff8113139c>] ? follow_page+0x2fc/0x4f0
 [<ffffffff81156364>] break_ksm+0x74/0xa0
 [<ffffffff81156562>] break_cow+0xa2/0xb0
 [<ffffffff81158444>] ksm_scan_thread+0xb54/0xd50
 [<ffffffff81075cf0>] ? wake_up_bit+0x40/0x40
 [<ffffffff811578f0>] ? run_store+0x340/0x340
 [<ffffffff8107563e>] kthread+0x9e/0xb0
 [<ffffffff814e8c44>] kernel_thread_helper+0x4/0x10
 [<ffffffff810755a0>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814e8c40>] ? gs_change+0x13/0x13
---[ end trace 8f50820d1887cf93 ]---


This happened while running specjbb on a 2 node box. It seems pretty easy to reproduce.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-15  8:24       ` Srikar Dronamraju
@ 2012-10-15  9:20         ` Mel Gorman
  -1 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-15  9:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, torvalds, akpm,
	pzijlstr, mingo, hughd, riel, hannes, dhillf, drjones, tglx, pjt,
	cl, suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

On Mon, Oct 15, 2012 at 01:54:13PM +0530, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:
> 
> > > +
> > > +bool numa_hinting_fault(struct page *page, int numpages)
> > > +{
> > > +	bool migrated = false;
> > > +
> > > +	/*
> > > +	 * "current->mm" could be different from the "mm" where the
> > > +	 * NUMA hinting page fault happened, if get_user_pages()
> > > +	 * triggered the fault on some other process "mm". That is ok,
> > > +	 * all we care about is to count the "page_nid" access on the
> > > +	 * current->task_autonuma, even if the page belongs to a
> > > +	 * different "mm".
> > > +	 */
> > > +	WARN_ON_ONCE(!current->mm);
> > 
> > Given the above comment, Do we really need this warn_on?
> > I think I have seen this warning when using autonuma.
> > 
> 
> ------------[ cut here ]------------
> WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
> Hardware name: BladeCenter HS22V -[7871AC1]-
> Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3

The kernel is tainted "D" which implies that it has already oopsed
before this warning was triggered. What was the other oops?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-10-15  9:20         ` Mel Gorman
@ 2012-10-15 10:00           ` Srikar Dronamraju
  -1 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-15 10:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, torvalds, akpm,
	pzijlstr, mingo, hughd, riel, hannes, dhillf, drjones, tglx, pjt,
	cl, suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

* Mel Gorman <mel@csn.ul.ie> [2012-10-15 10:20:44]:

> On Mon, Oct 15, 2012 at 01:54:13PM +0530, Srikar Dronamraju wrote:
> > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:
> > 
> > > > +
> > > > +bool numa_hinting_fault(struct page *page, int numpages)
> > > > +{
> > > > +	bool migrated = false;
> > > > +
> > > > +	/*
> > > > +	 * "current->mm" could be different from the "mm" where the
> > > > +	 * NUMA hinting page fault happened, if get_user_pages()
> > > > +	 * triggered the fault on some other process "mm". That is ok,
> > > > +	 * all we care about is to count the "page_nid" access on the
> > > > +	 * current->task_autonuma, even if the page belongs to a
> > > > +	 * different "mm".
> > > > +	 */
> > > > +	WARN_ON_ONCE(!current->mm);
> > > 
> > > Given the above comment, Do we really need this warn_on?
> > > I think I have seen this warning when using autonuma.
> > > 
> > 
> > ------------[ cut here ]------------
> > WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
> > Hardware name: BladeCenter HS22V -[7871AC1]-
> > Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> > Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3
> 
> The kernel is tainted "D" which implies that it has already oopsed
> before this warning was triggered. What was the other oops?
> 

Yes, but this oops shows up even with the v3.6 kernel and is not related to the autonuma changes.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000dc
IP: [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
PGD 671ce4067 PUD 671257067 PMD 0 
Oops: 0000 [#3] SMP 
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
CPU 1 
Pid: 10833, comm: tar Tainted: G      D      3.6.0-autonuma27+ #2 IBM BladeCenter HS22V -[7871AC1]-/81Y5995     
RIP: 0010:[<ffffffffa0015543>]  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
RSP: 0018:ffff88033a10fe68  EFLAGS: 00010286
RAX: ffff880371bd5000 RBX: ffffffffa0018880 RCX: ffffffffa0015530
RDX: 0000000000000000 RSI: ffffffffa0018880 RDI: ffff88036f0af000
RBP: ffff88033a10fe68 R08: ffff88036f0af010 R09: ffffffff8152a140
R10: 0000000000002de7 R11: 0000000000000246 R12: ffff88033a10ff48
R13: 0000000000001000 R14: 0000000000ccc600 R15: ffff88036f233e40
FS:  00007f57c07c47a0(0000) GS:ffff88037fc20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000dc CR3: 0000000671e12000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 10833, threadinfo ffff88033a10e000, task ffff88036e45e7f0)
Stack:
 ffff88033a10fe98 ffffffff8132b1e7 ffff88033a10fe88 ffffffff81110b5e
 ffff88033a10fe98 ffff88036f233e60 ffff88033a10fef8 ffffffff811d2d1e
 0000000000001000 ffff88036f0af010 ffffffff8152a140 ffff88036d875e48
Call Trace:
 [<ffffffff8132b1e7>] dev_attr_show+0x27/0x50
 [<ffffffff81110b5e>] ? __get_free_pages+0xe/0x50
 [<ffffffff811d2d1e>] sysfs_read_file+0xce/0x1c0
 [<ffffffff81162ed5>] vfs_read+0xc5/0x190
 [<ffffffff811630a1>] sys_read+0x51/0x90
 [<ffffffff814e29e9>] system_call_fastpath+0x16/0x1b
Code: 89 c7 48 c7 c6 64 79 01 a0 31 c0 e8 18 8d 23 e1 c9 48 98 c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 48 89 d0 48 8b 97 c0 03 00 00 <8b> 92 dc 00 00 00 85 d2 78 1b 48 89 c7 48 c7 c6 69 79 01 a0 31 
RIP  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
 RSP <ffff88033a10fe68>
CR2: 00000000000000dc
---[ end trace f0a3a4c8c85ff69f ]---

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
@ 2012-10-15 10:00           ` Srikar Dronamraju
  0 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-15 10:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, torvalds, akpm,
	pzijlstr, mingo, hughd, riel, hannes, dhillf, drjones, tglx, pjt,
	cl, suresh.b.siddha, efault, paulmck, laijs, Lee.Schermerhorn,
	alex.shi, benh

* Mel Gorman <mel@csn.ul.ie> [2012-10-15 10:20:44]:

> On Mon, Oct 15, 2012 at 01:54:13PM +0530, Srikar Dronamraju wrote:
> > * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-10-13 23:36:18]:
> > 
> > > > +
> > > > +bool numa_hinting_fault(struct page *page, int numpages)
> > > > +{
> > > > +	bool migrated = false;
> > > > +
> > > > +	/*
> > > > +	 * "current->mm" could be different from the "mm" where the
> > > > +	 * NUMA hinting page fault happened, if get_user_pages()
> > > > +	 * triggered the fault on some other process "mm". That is ok,
> > > > +	 * all we care about is to count the "page_nid" access on the
> > > > +	 * current->task_autonuma, even if the page belongs to a
> > > > +	 * different "mm".
> > > > +	 */
> > > > +	WARN_ON_ONCE(!current->mm);
> > > 
> > > Given the above comment, Do we really need this warn_on?
> > > I think I have seen this warning when using autonuma.
> > > 
> > 
> > ------------[ cut here ]------------
> > WARNING: at ../mm/autonuma.c:359 numa_hinting_fault+0x60d/0x7c0()
> > Hardware name: BladeCenter HS22V -[7871AC1]-
> > Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw lpc_ich mfd_core i2c_i801 i2c_core shpchp ioatdma i7core_edac edac_core bnx2 ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> > Pid: 116, comm: ksmd Tainted: G      D      3.6.0-autonuma27+ #3
> 
> The kernel is tainted "D" which implies that it has already oopsed
> before this warning was triggered. What was the other oops?
> 

Yes, But this oops shows up even with v3.6 kernel and not related to autonuma changes.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000dc
IP: [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
PGD 671ce4067 PUD 671257067 PMD 0 
Oops: 0000 [#3] SMP 
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii kvm_intel kvm microcode serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
CPU 1 
Pid: 10833, comm: tar Tainted: G      D      3.6.0-autonuma27+ #2 IBM BladeCenter HS22V -[7871AC1]-/81Y5995     
RIP: 0010:[<ffffffffa0015543>]  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
RSP: 0018:ffff88033a10fe68  EFLAGS: 00010286
RAX: ffff880371bd5000 RBX: ffffffffa0018880 RCX: ffffffffa0015530
RDX: 0000000000000000 RSI: ffffffffa0018880 RDI: ffff88036f0af000
RBP: ffff88033a10fe68 R08: ffff88036f0af010 R09: ffffffff8152a140
R10: 0000000000002de7 R11: 0000000000000246 R12: ffff88033a10ff48
R13: 0000000000001000 R14: 0000000000ccc600 R15: ffff88036f233e40
FS:  00007f57c07c47a0(0000) GS:ffff88037fc20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000dc CR3: 0000000671e12000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 10833, threadinfo ffff88033a10e000, task ffff88036e45e7f0)
Stack:
 ffff88033a10fe98 ffffffff8132b1e7 ffff88033a10fe88 ffffffff81110b5e
 ffff88033a10fe98 ffff88036f233e60 ffff88033a10fef8 ffffffff811d2d1e
 0000000000001000 ffff88036f0af010 ffffffff8152a140 ffff88036d875e48
Call Trace:
 [<ffffffff8132b1e7>] dev_attr_show+0x27/0x50
 [<ffffffff81110b5e>] ? __get_free_pages+0xe/0x50
 [<ffffffff811d2d1e>] sysfs_read_file+0xce/0x1c0
 [<ffffffff81162ed5>] vfs_read+0xc5/0x190
 [<ffffffff811630a1>] sys_read+0x51/0x90
 [<ffffffff814e29e9>] system_call_fastpath+0x16/0x1b
Code: 89 c7 48 c7 c6 64 79 01 a0 31 c0 e8 18 8d 23 e1 c9 48 98 c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 48 89 d0 48 8b 97 c0 03 00 00 <8b> 92 dc 00 00 00 85 d2 78 1b 48 89 c7 48 c7 c6 69 79 01 a0 31 
RIP  [<ffffffffa0015543>] i7core_inject_show_col+0x13/0x50 [i7core_edac]
 RSP <ffff88033a10fe68>
CR2: 00000000000000dc
---[ end trace f0a3a4c8c85ff69f ]---
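
Coming back to the WARN_ON_ONCE(!current->mm) question quoted at the top
of this mail: a minimal, purely hypothetical userspace model of the
alternative (dropping the warning in favour of an early bail-out when the
faulting task has no mm, as happens when a kernel thread like ksmd reaches
the fault through get_user_pages()) could look like the sketch below. The
struct and function names here are made up for illustration; this is not
the actual autonuma code.

/*
 * Hypothetical model only: bail out instead of warning when there is no
 * current->mm, e.g. for kernel threads doing GUP on another process' mm.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct mm_struct { int unused; };

struct task_struct {
	const char *comm;
	struct mm_struct *mm;	/* NULL for kernel threads */
};

static bool numa_hinting_fault_sketch(const struct task_struct *tsk)
{
	if (!tsk->mm)
		return false;	/* nothing to account, no migration attempted */

	/* ... account the page_nid access and possibly migrate (omitted) ... */
	return true;
}

int main(void)
{
	struct mm_struct mm = { 0 };
	const struct task_struct tar  = { .comm = "tar",  .mm = &mm };
	const struct task_struct ksmd = { .comm = "ksmd", .mm = NULL };

	printf("%s:  handled=%d\n", tar.comm,  numa_hinting_fault_sketch(&tar));
	printf("%s: handled=%d\n", ksmd.comm, numa_hinting_fault_sketch(&ksmd));
	return 0;
}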

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
                   ` (36 preceding siblings ...)
  2012-10-13 18:40   ` Srikar Dronamraju
@ 2012-10-16 13:48 ` Mel Gorman
  37 siblings, 0 replies; 148+ messages in thread
From: Mel Gorman @ 2012-10-16 13:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Hillf Danton, Andrew Jones, Dan Smith,
	Thomas Gleixner, Paul Turner, Christoph Lameter, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Oct 04, 2012 at 01:50:42AM +0200, Andrea Arcangeli wrote:
> Key to the kernels used in the testing:
> 
> - 3.6.0         = upstream 3.6.0 kernel
> - 3.6.0numactl  = 3.6.0 kernel with numactl hard NUMA bindings
> - autonuma26MoF = previous autonuma version based 3.6.0-rc7 kernel
> 
> == specjbb multi instance, 4 nodes, 4 instances ==
> 
> autonuma26MoF outperform 3.6.0 by 11% while 3.6.0numactl provides an
> additional 9% increase.
> 
> 3.6.0numactl:
> Per-node process memory usage (in MBs):
>              PID             N0             N1             N2             N3
>       ----------     ----------     ----------     ----------     ----------
>            38901        3075.56           0.54           0.07           7.53
>            38902           1.31           0.54        3065.37           7.53
>            38903           1.31           0.54           0.07        3070.10
>            38904           1.31        3064.56           0.07           7.53
> 
> autonuma26MoF:
> Per-node process memory usage (in MBs):
>              PID             N0             N1             N2             N3
>       ----------     ----------     ----------     ----------     ----------
>             9704          94.85        2862.37          50.86         139.35
>             9705          61.51          20.05        2963.78          40.62
>             9706        2941.80          11.68         104.12           7.70
>             9707          35.02          10.62           9.57        3042.25
> 

This is a somewhat opaque view of what specjbb measures. You mention that
it outperforms, but that hides useful information: specjbb only reports on
a range of measurements around the "expected peak", and this expected peak
may or may not be related to the actual peak.

In the interest of making fair comparisons, I automated specjbb in MMTests
(it will be in 0.07) and compared just vanilla with autonuma - no
comparison with hard-binding. Mean values are taken across the JVM
instances, which is one per node, i.e. 4 instances in this particular case.
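
As a rough illustration of that aggregation (only a sketch with made-up
per-instance numbers, not MMTests' actual code), the Mean, Stddev and TPut
rows in the report further down can be read as follows:

#include <math.h>
#include <stdio.h>

#define INSTANCES 4	/* one JVM instance per node */

static void report(int warehouses, const double bops[INSTANCES])
{
	double sum = 0.0, mean, var = 0.0;
	int i;

	for (i = 0; i < INSTANCES; i++)
		sum += bops[i];
	mean = sum / INSTANCES;

	for (i = 0; i < INSTANCES; i++)
		var += (bops[i] - mean) * (bops[i] - mean);

	/* sample standard deviation; MMTests may divide by N instead */
	printf("Mean   %d  %10.2f\n", warehouses, mean);
	printf("Stddev %d  %10.2f\n", warehouses, sqrt(var / (INSTANCES - 1)));
	printf("TPut   %d  %10.2f\n", warehouses, sum);
}

int main(void)
{
	/* made-up per-instance Bops for a single warehouse count */
	const double bops_w1[INSTANCES] = { 25000.0, 26200.0, 25800.0, 26471.0 };

	report(1, bops_w1);
	return 0;
}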

SPECJBB PEAKS
                                       3.6.0                      3.6.0
                                     vanilla             autonuma-v33r6
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               448606.00 (  0.00%)               596993.00 ( 33.08%)
 Actual Warehouse                    6.00 (  0.00%)                    8.00 ( 33.33%)
 Actual Peak Bops               551074.00 (  0.00%)               640830.00 ( 16.29%)

The expected number of warehouses at which the workload should peak was 12
in both cases, as it is derived from the number of CPUs. autonuma peaked
with more warehouses, although both kernels fall far short of the expected
peak. Whether you look at the expected or the actual peak values, autonuma
performed better.

I've truncated the following report. It goes up to 48 warehouses but
I'll cut it off at 12.
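
As a cross-check on the peak figures, here is how I assume they relate to
the TPut column of the truncated report below (using the vanilla column as
an example): the expected peak is simply the TPut at the expected
warehouse count, while the actual peak is whichever warehouse count
produced the highest TPut. This is my reading of the output, not
necessarily MMTests' exact implementation.

#include <stdio.h>

int main(void)
{
	/* TPut for the vanilla kernel, warehouses 1..12, from the report below */
	const double tput[] = {
		103471, 214117, 308871, 398181, 483714, 551074,
		550833, 540842, 532133, 498948, 490856, 448606,
	};
	const int n = sizeof(tput) / sizeof(tput[0]);
	const int expected_wh = 12;	/* tied to the CPU count */
	int i, actual_wh = 1;

	for (i = 1; i < n; i++)
		if (tput[i] > tput[actual_wh - 1])
			actual_wh = i + 1;

	printf("Expctd Warehouse %d, Expctd Peak Bops %.0f\n",
	       expected_wh, tput[expected_wh - 1]);
	printf("Actual Warehouse %d, Actual Peak Bops %.0f\n",
	       actual_wh, tput[actual_wh - 1]);
	return 0;
}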

SPECJBB BOPS
                          3.6.0                 3.6.0
                        vanilla        autonuma-v33r6
Mean   1      25867.75 (  0.00%)     25373.00 ( -1.91%)
Mean   2      53529.25 (  0.00%)     56647.25 (  5.82%)
Mean   3      77217.75 (  0.00%)     82738.75 (  7.15%)
Mean   4      99545.25 (  0.00%)    107591.25 (  8.08%)
Mean   5     120928.50 (  0.00%)    131507.75 (  8.75%)
Mean   6     137768.50 (  0.00%)    152805.25 ( 10.91%)
Mean   7     137708.25 (  0.00%)    158663.50 ( 15.22%)
Mean   8     135210.50 (  0.00%)    160207.50 ( 18.49%)
Mean   9     133033.25 (  0.00%)    159569.50 ( 19.95%)
Mean   10    124737.00 (  0.00%)    158120.50 ( 26.76%)
Mean   11    122714.00 (  0.00%)    154189.50 ( 25.65%)
Mean   12    112151.50 (  0.00%)    149248.25 ( 33.08%)
Stddev 1        636.78 (  0.00%)      1476.21 (-131.82%)
Stddev 2        718.08 (  0.00%)      1141.74 (-59.00%)
Stddev 3        780.06 (  0.00%)       913.81 (-17.15%)
Stddev 4        755.54 (  0.00%)      1128.75 (-49.40%)
Stddev 5        825.39 (  0.00%)      1346.97 (-63.19%)
Stddev 6        563.58 (  0.00%)      1283.66 (-127.77%)
Stddev 7        848.47 (  0.00%)       715.98 ( 15.62%)
Stddev 8       1361.77 (  0.00%)      1020.32 ( 25.07%)
Stddev 9       5559.53 (  0.00%)       120.52 ( 97.83%)
Stddev 10      5128.25 (  0.00%)      2245.96 ( 56.20%)
Stddev 11      4086.70 (  0.00%)      3452.71 ( 15.51%)
Stddev 12      4410.86 (  0.00%)      9030.55 (-104.73%)
TPut   1     103471.00 (  0.00%)    101492.00 ( -1.91%)
TPut   2     214117.00 (  0.00%)    226589.00 (  5.82%)
TPut   3     308871.00 (  0.00%)    330955.00 (  7.15%)
TPut   4     398181.00 (  0.00%)    430365.00 (  8.08%)
TPut   5     483714.00 (  0.00%)    526031.00 (  8.75%)
TPut   6     551074.00 (  0.00%)    611221.00 ( 10.91%)
TPut   7     550833.00 (  0.00%)    634654.00 ( 15.22%)
TPut   8     540842.00 (  0.00%)    640830.00 ( 18.49%)
TPut   9     532133.00 (  0.00%)    638278.00 ( 19.95%)
TPut   10    498948.00 (  0.00%)    632482.00 ( 26.76%)
TPut   11    490856.00 (  0.00%)    616758.00 ( 25.65%)
TPut   12    448606.00 (  0.00%)    596993.00 ( 33.08%)

The average Bops per JVM instance and the overall throughput are higher
with autonuma, but note that the standard deviations are higher too. I do
not have an explanation for this; it could be due to any number of factors.

MMTests Statistics: duration
               3.6.0       3.6.0
             vanilla autonuma-v33r6
User       481036.95   478932.80
System        185.86      824.27
Elapsed     10385.16    10356.73

Time to complete is unchanged, which is expected as the benchmark runs for
a fixed length of time. Again, the System CPU usage is very high with
autonuma, which matches what was seen with the autonuma benchmark.

MMTests Statistics: vmstat
                              3.6.0       3.6.0
                            vanilla autonuma-v33r6
THP fault alloc                   0           0
THP collapse alloc                0           0
THP splits                        0           2
THP fault fallback                0           0
THP collapse fail                 0           0
Compaction stalls                 0           0
Compaction success                0           0
Compaction failures               0           0
Compaction pages moved            0           0
Compaction move failure           0           0

There is no THP activity at all, which is actually suspiciously little,
but it implies that native migration of THP pages would make no difference
to JVMs (or at least to this JVM).

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/33] AutoNUMA27
  2012-10-14  4:57     ` Andrea Arcangeli
@ 2012-10-23 16:32       ` Srikar Dronamraju
  -1 siblings, 0 replies; 148+ messages in thread
From: Srikar Dronamraju @ 2012-10-23 16:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, torvalds, akpm, pzijlstr, mingo, mel,
	hughd, riel, hannes, dhillf, drjones, tglx, pjt, cl,
	suresh.b.siddha, efault, paulmck, alex.shi, konrad.wilk, benh

* Andrea Arcangeli <aarcange@redhat.com> [2012-10-14 06:57:16]:

> I'll release an autonuma29 behaving like 28fast if there are no
> surprises. The new algorithm change in 28fast will also save memory
> once I rewrite it properly.
> 

Here are my specjbb2005 results on a 2-node box (still on autonuma27, but
I plan to rerun on a newer release soon).


---------------------------------------------------------------------------------------------------
|          kernel|      vm|                              nofit|                                fit|
-                -        -------------------------------------------------------------------------
|                |        |            noksm|              ksm|            noksm|              ksm|
-                -        -------------------------------------------------------------------------
|                |        |   nothp|     thp|   nothp|     thp|   nothp|     thp|   nothp|     thp|
---------------------------------------------------------------------------------------------------
|    mainline_v36|    vm_1|  136085|  188500|  133871|  163638|  133540|  178159|  132460|  164763|
|                |    vm_2|   61549|   80496|   61420|   74864|   63777|   80573|   60479|   73416|
|                |    vm_3|   60688|   79349|   62244|   73289|   64394|   80803|   61040|   74258|
---------------------------------------------------------------------------------------------------
|     autonuma27_|    vm_1|  143261|  186080|  127420|  178505|  141080|  201436|  143216|  183710|
|                |    vm_2|   72224|   94368|   71309|   89576|   59098|   83750|   63813|   90862|
|                |    vm_3|   61215|   94213|   71539|   89594|   76269|   99637|   72412|   91191|
---------------------------------------------------------------------------------------------------
| improvement    |    vm_1|   5.27%|  -1.28%|  -4.82%|   9.09%|   5.65%|  13.07%|   8.12%|  11.50%|
|   from         |    vm_2|  17.34%|  17.23%|  16.10%|  19.65%|  -7.34%|   3.94%|   5.51%|  23.76%|
|  mainline      |    vm_3|   0.87%|  18.73%|  14.93%|  22.25%|  18.44%|  23.31%|  18.63%|  22.80%|
---------------------------------------------------------------------------------------------------


(Results with suggested tweaks from Andrea)

echo 0 > /sys/kernel/mm/autonuma/knuma_scand/pmd

echo 15000 > /sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs 
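
(For completeness, the same two knobs can also be applied from C; this is
just a convenience sketch around the sysfs paths quoted above, the echo
commands are all that is actually needed.)

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	write_knob("/sys/kernel/mm/autonuma/knuma_scand/pmd", "0");
	write_knob("/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs",
		   "15000");
	return 0;
}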

----------------------------------------------------------------------------------------------------
|          kernel|      vm|                               nofit|                                fit|
-                -        --------------------------------------------------------------------------
|                |        |             noksm|              ksm|            noksm|              ksm|
-                -        --------------------------------------------------------------------------
|                |        |    nothp|     thp|   nothp|     thp|   nothp|     thp|   nothp|     thp|
----------------------------------------------------------------------------------------------------
|    mainline_v36|    vm_1|   136142|  178362|  132493|  166169|  131774|  179340|  133058|  164637|
|                |    vm_2|    61143|   81943|   60998|   74195|   63725|   79530|   61916|   73183|
|                |    vm_3|    61599|   79058|   61448|   73248|   62563|   80815|   61381|   74669|
----------------------------------------------------------------------------------------------------
|     autonuma27_|    vm_1|   142023|      na|  142808|  177880|      na|  197244|  145165|  174175|
|                |    vm_2|    61071|      na|   61008|   91184|      na|   78893|   71675|   80471|
|                |    vm_3|    72646|      na|   72855|   92167|      na|   99080|   64758|   91831|
----------------------------------------------------------------------------------------------------
| improvement    |    vm_1|    4.32%|      na|   7.79%|   7.05%|      na|   9.98%|   9.10%|   5.79%|
|  from          |    vm_2|   -0.12%|      na|   0.02%|  22.90%|      na|  -0.80%|  15.76%|   9.96%|
|  mainline      |    vm_3|   17.93%|      na|  18.56%|  25.83%|      na|  22.60%|   5.50%|  22.98%|
----------------------------------------------------------------------------------------------------

Host:

    Enterprise Linux Distro
    2 NUMA nodes. 6 cores + 6 hyperthreads/node, 12 GB RAM/node.
        (total of 24 logical CPUs and 24 GB RAM) 

VMs:

    Enterprise Linux Distro
    Distro Kernel
        Main VM (VM1) -- relevant benchmark score.
            12 vCPUs

            Either 12 GB (for '< 1 Node' configuration, i.e. the fit case)
                 or 14 GB (for '> 1 Node', i.e. the nofit case)
        Noise VMs (VM2 and VM3)
            each noise VM has half of the remaining resources.
            6 vCPUs

            Either 4 GB (for '< 1 Node' configuration) or 3 GB ('> 1 Node ')
                (to sum 20 GB w/ Main VM + 4 GB for host = total 24 GB) 

Settings:

    Swapping disabled on host and VMs.
    Memory Overcommit enabled on host and VMs.
    THP on host is a variable. THP disabled on VMs.
    KSM on host is a variable. KSM disabled on VMs. 

na: refers to runs where I wasn't able to collect the results.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2012-10-23 16:32 UTC | newest]

Thread overview: 148+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
2012-10-11 10:50   ` Mel Gorman
2012-10-11 16:07     ` Andrea Arcangeli
2012-10-11 16:07       ` Andrea Arcangeli
2012-10-11 19:37       ` Mel Gorman
2012-10-11 19:37         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-10-11 10:54   ` Mel Gorman
2012-10-03 23:50 ` [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
2012-10-11 10:54   ` Mel Gorman
2012-10-03 23:50 ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
2012-10-11 11:01   ` Mel Gorman
2012-10-11 16:43     ` Andrea Arcangeli
2012-10-11 16:43       ` Andrea Arcangeli
2012-10-11 19:48       ` Mel Gorman
2012-10-11 19:48         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
2012-10-11 11:15   ` Mel Gorman
2012-10-11 16:58     ` Andrea Arcangeli
2012-10-11 16:58       ` Andrea Arcangeli
2012-10-11 19:54       ` Mel Gorman
2012-10-11 19:54         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
2012-10-11 12:22   ` Mel Gorman
2012-10-11 17:05     ` Andrea Arcangeli
2012-10-11 17:05       ` Andrea Arcangeli
2012-10-11 20:01       ` Mel Gorman
2012-10-11 20:01         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
2012-10-11 12:28   ` Mel Gorman
2012-10-11 15:24     ` Rik van Riel
2012-10-11 15:57       ` Mel Gorman
2012-10-12  0:23       ` Christoph Lameter
2012-10-12  0:52         ` Andrea Arcangeli
2012-10-12  0:52           ` Andrea Arcangeli
2012-10-11 17:15     ` Andrea Arcangeli
2012-10-11 17:15       ` Andrea Arcangeli
2012-10-11 20:06       ` Mel Gorman
2012-10-11 20:06         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
2012-10-11 13:46   ` Mel Gorman
2012-10-11 17:34     ` Andrea Arcangeli
2012-10-11 17:34       ` Andrea Arcangeli
2012-10-11 20:17       ` Mel Gorman
2012-10-11 20:17         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 09/33] autonuma: core autonuma.h header Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
2012-10-11 14:58   ` Mel Gorman
2012-10-12  0:25     ` Andrea Arcangeli
2012-10-12  0:25       ` Andrea Arcangeli
2012-10-12  8:29       ` Mel Gorman
2012-10-12  8:29         ` Mel Gorman
2012-10-03 23:50 ` [PATCH 11/33] autonuma: add the autonuma_last_nid in the page structure Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data Andrea Arcangeli
2012-10-11 15:43   ` Mel Gorman
2012-10-03 23:50 ` [PATCH 13/33] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-10-11 13:50   ` Mel Gorman
2012-10-03 23:50 ` [PATCH 14/33] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-10-11 15:47   ` Mel Gorman
2012-10-03 23:50 ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
2012-10-11 15:53   ` Mel Gorman
2012-10-11 17:34     ` Rik van Riel
     [not found]       ` <20121011175953.GT1818@redhat.com>
2012-10-12 14:03         ` Rik van Riel
2012-10-12 14:03           ` Rik van Riel
2012-10-03 23:50 ` [PATCH 16/33] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 17/33] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 18/33] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-10-05  6:41   ` Mike Galbraith
2012-10-05 11:54     ` Andrea Arcangeli
2012-10-06  2:39       ` Mike Galbraith
2012-10-06 12:34         ` Andrea Arcangeli
2012-10-07  6:07           ` Mike Galbraith
2012-10-08  7:03             ` Mike Galbraith
2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
2012-10-10 22:01   ` Rik van Riel
2012-10-10 22:36     ` Andrea Arcangeli
2012-10-11 18:28   ` Mel Gorman
2012-10-13 18:06   ` Srikar Dronamraju
2012-10-15  8:24     ` Srikar Dronamraju
2012-10-15  8:24       ` Srikar Dronamraju
2012-10-15  9:20       ` Mel Gorman
2012-10-15  9:20         ` Mel Gorman
2012-10-15 10:00         ` Srikar Dronamraju
2012-10-15 10:00           ` Srikar Dronamraju
2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-10-04 20:03   ` KOSAKI Motohiro
2012-10-11 18:32   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 21/33] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-10-11 18:33   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 22/33] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-10-11 18:36   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 23/33] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-10-11 18:44   ` Mel Gorman
2012-10-12 11:37     ` Rik van Riel
2012-10-12 12:35       ` Mel Gorman
2012-10-03 23:51 ` [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte Andrea Arcangeli
2012-10-11 18:45   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 25/33] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-10-11 18:47   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 26/33] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 27/33] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-10-11 18:50   ` Mel Gorman
2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
2012-10-04 14:16   ` Christoph Lameter
2012-10-04 20:09   ` KOSAKI Motohiro
2012-10-05 11:31     ` Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 30/33] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 31/33] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 32/33] autonuma: add migrate_allow_first_fault knob in sysfs Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 33/33] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
2012-10-04 20:49   ` Rik van Riel
2012-10-05 23:08   ` Rik van Riel
2012-10-05 23:14   ` Andi Kleen
2012-10-05 23:14     ` Andi Kleen
2012-10-05 23:57     ` Tim Chen
2012-10-05 23:57       ` Tim Chen
2012-10-06  0:11       ` Andi Kleen
2012-10-06  0:11         ` Andi Kleen
2012-10-08 13:44         ` Don Morris
2012-10-08 13:44           ` Don Morris
2012-10-08 20:34     ` Rik van Riel
2012-10-08 20:34       ` Rik van Riel
2012-10-11 10:19 ` Mel Gorman
2012-10-11 14:56   ` Andrea Arcangeli
2012-10-11 14:56     ` Andrea Arcangeli
2012-10-11 15:35     ` Mel Gorman
2012-10-11 15:35       ` Mel Gorman
2012-10-12  0:41       ` Andrea Arcangeli
2012-10-12  0:41         ` Andrea Arcangeli
2012-10-12 14:54       ` Mel Gorman
2012-10-12 14:54         ` Mel Gorman
2012-10-11 21:34 ` Mel Gorman
2012-10-12  1:45   ` Andrea Arcangeli
2012-10-12  1:45     ` Andrea Arcangeli
2012-10-12  8:46     ` Mel Gorman
2012-10-12  8:46       ` Mel Gorman
2012-10-13 18:40 ` Srikar Dronamraju
2012-10-13 18:40   ` Srikar Dronamraju
2012-10-14  4:57   ` Andrea Arcangeli
2012-10-14  4:57     ` Andrea Arcangeli
2012-10-15  8:16     ` Srikar Dronamraju
2012-10-15  8:16       ` Srikar Dronamraju
2012-10-23 16:32     ` Srikar Dronamraju
2012-10-23 16:32       ` Srikar Dronamraju
2012-10-16 13:48 ` Mel Gorman
