* [GIT PULL] Automatic NUMA Balancing V11
@ 2012-12-12 10:03 Mel Gorman
  2012-12-12 21:27 ` Stephen Rothwell
  2012-12-16 23:19 ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Mel Gorman @ 2012-12-12 10:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

Hi Linus,

This is a pull request for "Automatic NUMA Balancing V11". The list
of changes since commit f4a75d2eb7b1e2206094b901be09adb31ba63681:

  Linux 3.7-rc6 (2012-11-16 17:42:40 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git balancenuma-v11

for you to fetch changes up to 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6:

  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable (2012-12-11 14:43:00 +0000)

There are three implementations for NUMA balancing: this tree (balancenuma),
numacore, which has been developed in tip/master, and autonuma, which is in
aa.git. In almost all respects balancenuma is the dumbest of the three
because its main impact is on the VM side with no attempt to be smart
about scheduling.  In the interest of getting the ball rolling, it would
be desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.

The most recent set of comparisons available from different people are

mel:    https://lkml.org/lkml/2012/12/9/108
mingo:  https://lkml.org/lkml/2012/12/7/331
tglx:   https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397

The results are a mixed bag. In my own tests, balancenuma does reasonably
well. It's dumb as rocks and does not regress against mainline. On the
other hand, Ingo's tests show that balancenuma is incapable of converging
for the workloads driven by perf, which is bad but is potentially explained
by the lack of scheduler smarts. Thomas' results show balancenuma improves
on mainline but falls far short of numacore or autonuma. Srikar's results
indicate we all suffer on a large machine with imbalanced node sizes.

My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.  We've
butted heads heavily on system CPU usage and high levels of migration even
when it shows that overall performance is better. There are also cases
where it regresses. Of interest is that for specjbb in some configurations
it will regress for lower numbers of warehouses and show gains for higher
numbers, which is not reported by the tool by default and is sometimes missed
in reports. Recently I reported for numacore that the JVM was crashing
with NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch handles
PTEs but I no longer think this is the case. It's possible numacore is
just able to trigger it due to higher rates of migration.

These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks.

Thanks.

Andrea Arcangeli (5):
      mm: numa: define _PAGE_NUMA
      mm: numa: pte_numa() and pmd_numa()
      mm: numa: Support NUMA hinting page faults from gup/gup_fast
      mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
      mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting

Hillf Danton (2):
      mm: numa: split_huge_page: Transfer last_nid on tail page
      mm: numa: migrate: Set last_nid on newly allocated page

Ingo Molnar (3):
      mm: Optimize the TLB flush of sys_mprotect() and change_protection() users
      mm/rmap: Convert the struct anon_vma::mutex to an rwsem
      mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable

Lee Schermerhorn (3):
      mm: mempolicy: Add MPOL_NOOP
      mm: mempolicy: Check for misplaced page
      mm: mempolicy: Add MPOL_MF_LAZY

Mel Gorman (26):
      mm: Check if PTE is already allocated during page fault
      mm: compaction: Move migration fail/success stats to migrate.c
      mm: migrate: Add a tracepoint for migrate_pages
      mm: compaction: Add scanned and isolated counters for compaction
      mm: numa: Create basic numa page hinting infrastructure
      mm: migrate: Drop the misplaced pages reference count if the target node is full
      mm: mempolicy: Use _PAGE_NUMA to migrate pages
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection()
      mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
      sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
      mm: numa: Add pte updates, hinting and migration stats
      mm: numa: Migrate on reference policy
      mm: numa: Migrate pages handled during a pmd_numa hinting fault
      mm: numa: Rate limit the amount of memory that is migrated between nodes
      mm: numa: Rate limit setting of pte_numa if node is saturated
      sched: numa: Slowly increase the scanning period as NUMA faults are handled
      mm: numa: Introduce last_nid to the page frame
      mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
      mm: sched: numa: Control enabling and disabling of NUMA balancing
      mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
      mm: numa: Add THP migration for the NUMA working set scanning fault case.
      mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
      mm: numa: Account for failed allocations and isolations as migration failures
      mm: migrate: Account a transhuge page properly when rate limiting

Peter Zijlstra (6):
      mm: Count the number of pages affected in change_protection()
      mm: mempolicy: Make MPOL_LOCAL a real policy
      mm: migrate: Introduce migrate_misplaced_page()
      mm: numa: Add fault driven placement and migration
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
      mm: sched: numa: Implement slow start for working set sampling

Rik van Riel (5):
      x86: mm: only do a local tlb flush in ptep_set_access_flags()
      x86: mm: drop TLB flush from ptep_set_access_flags
      mm,generic: only flush the local TLB in ptep_set_access_flags
      x86/mm: Introduce pte_accessible()
      mm: Only flush the TLB when clearing an accessible pte

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 ++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 +++++++++++
 include/linux/huge_mm.h              |   16 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   47 ++++-
 include/linux/mm.h                   |   39 ++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/rmap.h                 |   33 ++--
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 +++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   45 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 +++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |  108 ++++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/ksm.c                             |    6 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    7 +-
 mm/memory.c                          |  199 +++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 +++++++++++++++++++++++++---
 mm/migrate.c                         |  337 +++++++++++++++++++++++++++++++++-
 mm/mmap.c                            |   10 +-
 mm/mprotect.c                        |  135 +++++++++++---
 mm/mremap.c                          |    2 +-
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/rmap.c                            |   66 +++----
 mm/vmstat.c                          |   16 +-
 45 files changed, 1940 insertions(+), 173 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
Mel Gorman
SUSE Labs

* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-12 10:03 [GIT PULL] Automatic NUMA Balancing V11 Mel Gorman
@ 2012-12-12 21:27 ` Stephen Rothwell
  2012-12-12 22:17   ` Mel Gorman
  2012-12-16 23:19 ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Stephen Rothwell @ 2012-12-12 21:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

Hi,

On Wed, 12 Dec 2012 10:03:38 +0000 Mel Gorman <mgorman@suse.de> wrote:
>
> This is a pull request for "Automatic NUMA Balancing V11". The list
> of changes since commit f4a75d2eb7b1e2206094b901be09adb31ba63681:
> 
>   Linux 3.7-rc6 (2012-11-16 17:42:40 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git balancenuma-v11
> 
> for you to fetch changes up to 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6:
> 
>   mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable (2012-12-11 14:43:00 +0000)
> 
> There are three implementations for NUMA balancing: this tree (balancenuma),
> numacore, which has been developed in tip/master, and autonuma, which is in
> aa.git. In almost all respects balancenuma is the dumbest of the three
> because its main impact is on the VM side with no attempt to be smart
> about scheduling.  In the interest of getting the ball rolling, it would
> be desirable to see this much merged for 3.8 with the view to building
> scheduler smarts on top and adapting the VM where required for 3.9.
> 
> The most recent set of comparisons available from different people are
> 
> mel:    https://lkml.org/lkml/2012/12/9/108
> mingo:  https://lkml.org/lkml/2012/12/7/331
> tglx:   https://lkml.org/lkml/2012/12/10/437
> srikar: https://lkml.org/lkml/2012/12/10/397
> 
> The results are a mixed bag. In my own tests, balancenuma does reasonably
> well. It's dumb as rocks and does not regress against mainline. On the
> other hand, Ingo's tests show that balancenuma is incapable of converging
> for the workloads driven by perf, which is bad but is potentially explained
> by the lack of scheduler smarts. Thomas' results show balancenuma improves
> on mainline but falls far short of numacore or autonuma. Srikar's results
> indicate we all suffer on a large machine with imbalanced node sizes.
> 
> My own testing showed that recent numacore results have improved
> dramatically, particularly in the last week but not universally.  We've
> butted heads heavily on system CPU usage and high levels of migration even
> when it shows that overall performance is better. There are also cases
> where it regresses. Of interest is that for specjbb in some configurations
> it will regress for lower numbers of warehouses and show gains for higher
> numbers, which is not reported by the tool by default and is sometimes missed
> in reports. Recently I reported for numacore that the JVM was crashing
> with NullPointerExceptions but currently it's unclear what the source of
> this problem is. Initially I thought it was in how numacore batch handles
> PTEs but I no longer think this is the case. It's possible numacore is
> just able to trigger it due to higher rates of migration.
> 
> These reports were quite late in the cycle so I/we would like to start
> with this tree as it contains much of the code we can agree on and has
> not changed significantly over the last 2-3 weeks.

It has, however, all been rebased from what still exists in the linux-next
tree (as part of the tip tree).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au


* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-12 21:27 ` Stephen Rothwell
@ 2012-12-12 22:17   ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2012-12-12 22:17 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

On Thu, Dec 13, 2012 at 08:27:41AM +1100, Stephen Rothwell wrote:
> Hi,
> 
> On Wed, 12 Dec 2012 10:03:38 +0000 Mel Gorman <mgorman@suse.de> wrote:
> >
> > This is a pull request for "Automatic NUMA Balancing V11". The list
> > of changes since commit f4a75d2eb7b1e2206094b901be09adb31ba63681:
> > 
> >   Linux 3.7-rc6 (2012-11-16 17:42:40 -0800)
> > 
> > are available in the git repository at:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git balancenuma-v11
> > 
> > for you to fetch changes up to 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6:
> > 
> >   mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable (2012-12-11 14:43:00 +0000)
> > 
> > <SNIP>
> > These reports were quite late in the cycle so I/we would like to start
> > with this tree as it contains much of the code we can agree on and has
> > not changed significantly over the last 2-3 weeks.
> 
> It has, however, all been rebased from what still exists in the linux-next
> tree (as part of the tip tree).
> 

What's in the tip tree is not the same even though there are
similarities. I know that bypassing linux-next like this is not the done
thing but it was not possible to have this tree in linux-next before
now. After this mail https://lkml.org/lkml/2012/12/11/62 my expectation
is that the numacore bits from tip that were included in linux-next
will not be pulled this time. However, due to some of the similarities
I'm hoping that the collisions due to pulling this tree will not be too
severe.

-- 
Mel Gorman
SUSE Labs

* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-12 10:03 [GIT PULL] Automatic NUMA Balancing V11 Mel Gorman
  2012-12-12 21:27 ` Stephen Rothwell
@ 2012-12-16 23:19 ` Linus Torvalds
  2012-12-17  2:53   ` Hugh Dickins
                     ` (5 more replies)
  1 sibling, 6 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-12-16 23:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

On Wed, Dec 12, 2012 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> This is a pull request for "Automatic NUMA Balancing V11". The list

Ok, guys, I've pulled this and pushed out. There were some conflicts
with both the VM changes and with the scheduler tree, but they were
pretty small and looked simple, so I fixed them up and hope they all
work.

Has anybody tested the impact on single-node systems? If distros
enable this by default (and it does have 'default y', which is a big
no-no for new features - I undid that part) then there will be tons of
people running this without actually having multiple sockets. Does it
gracefully avoid pointless overheads for this case?

Anyway, hopefully we'll have a more real numa balancing for 3.9, and
this is still considered a reasonable base for that work.

                  Linus

* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-16 23:19 ` Linus Torvalds
@ 2012-12-17  2:53   ` Hugh Dickins
  2012-12-17  2:56     ` [PATCH] mm: fix kernel BUG at huge_memory.c:1474! Hugh Dickins
  2012-12-17 10:10   ` [GIT PULL] Automatic NUMA Balancing V11 Ingo Molnar
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2012-12-17  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Andrew Morton,
	Kirill A. Shutemov, LKML

On Sun, 16 Dec 2012, Linus Torvalds wrote:
> On Wed, Dec 12, 2012 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> > This is a pull request for "Automatic NUMA Balancing V11". The list
> 
> Ok, guys, I've pulled this and pushed out. There were some conflicts
> with both the VM changes and with the scheduler tree, but they were
> pretty small and looked simple, so I fixed them up and hope they all
> work.

Great! Thank you. Rejoicing on all sides.
One small merge fixup follows under new subject.

Hugh

> 
> Has anybody tested the impact on single-node systems? If distros
> enable this by default (and it does have 'default y', which is a big
> no-no for new features - I undid that part) then there will be tons of
> people running this without actually having multiple sockets. Does it
> gracefully avoid pointless overheads for this case?
> 
> Anyway, hopefully we'll have a more real numa balancing for 3.9, and
> this is still considered a reasonable base for that work.
> 
>                   Linus

* [PATCH] mm: fix kernel BUG at huge_memory.c:1474!
  2012-12-17  2:53   ` Hugh Dickins
@ 2012-12-17  2:56     ` Hugh Dickins
  2012-12-17  3:00       ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2012-12-17  2:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton,
	Kirill A. Shutemov, LKML

Andrea's autonuma-benchmark numa01 hits kernel BUG at huge_memory.c:1474!
in change_huge_pmd called from change_protection from change_prot_numa
from task_numa_work.

That BUG, introduced in the huge zero page commit cad7f613c4d0 ("thp:
change_huge_pmd(): make sure we don't try to make a page writable")
was trying to verify that newprot never adds write permission to an
anonymous huge page; but Automatic NUMA Balancing's 4b10e7d562c9 ("mm:
mempolicy: Implement change_prot_numa() in terms of change_protection()")
adds a new prot_numa path into change_huge_pmd(), which makes no use of
the newprot provided, and may retain the write bit in the pmd.

Just move the BUG_ON(pmd_write(entry)) up into the !prot_numa block.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/huge_memory.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- 2a74dbb9/mm/huge_memory.c	2012-12-16 16:35:08.752441527 -0800
+++ linux/mm/huge_memory.c	2012-12-16 18:21:24.308156970 -0800
@@ -1460,9 +1460,10 @@ int change_huge_pmd(struct vm_area_struc
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
-		if (!prot_numa)
+		if (!prot_numa) {
 			entry = pmd_modify(entry, newprot);
-		else {
+			BUG_ON(pmd_write(entry));
+		} else {
 			struct page *page = pmd_page(*pmd);
 
 			/* only check non-shared pages */
@@ -1471,7 +1472,6 @@ int change_huge_pmd(struct vm_area_struc
 				entry = pmd_mknuma(entry);
 			}
 		}
-		BUG_ON(pmd_write(entry));
 		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
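
The effect of moving the check can be illustrated with a toy userspace
model. This is only a sketch: the flag bits and toy_* helpers below are
hypothetical stand-ins, not the real x86 pmd layout or kernel functions,
and the non-shared page test from the prot_numa path is left out for
brevity.

```c
#include <stdbool.h>

/* Hypothetical flag bits for a toy pmd; not the real x86 layout. */
#define TOY_PMD_PRESENT 0x1u
#define TOY_PMD_WRITE   0x2u
#define TOY_PMD_NUMA    0x4u

/* Toy pmd_modify(): replace the permission bits with those in newprot. */
static unsigned toy_pmd_modify(unsigned entry, unsigned newprot)
{
	return (entry & ~(TOY_PMD_PRESENT | TOY_PMD_WRITE)) | newprot;
}

/* Toy pmd_mknuma(): mark the entry for hinting, leave the write bit alone. */
static unsigned toy_pmd_mknuma(unsigned entry)
{
	return (entry | TOY_PMD_NUMA) & ~TOY_PMD_PRESENT;
}

/*
 * Returns true if the "entry must not be writable" check would fire.
 * check_numa_path == true models the check sitting after both branches
 * (the placement that triggered the BUG); false models the check living
 * only in the !prot_numa branch, as in the patch above.
 */
static bool toy_write_check_fires(unsigned entry, unsigned newprot,
				  bool prot_numa, bool check_numa_path)
{
	if (!prot_numa) {
		entry = toy_pmd_modify(entry, newprot);
		return (entry & TOY_PMD_WRITE) != 0;
	}

	/* The prot_numa path ignores newprot and may keep the write bit. */
	entry = toy_pmd_mknuma(entry);
	return check_numa_path && (entry & TOY_PMD_WRITE) != 0;
}
```

In this model a writable entry taken through the prot_numa path trips the
check only when it is applied to both branches, which matches the failure
described in the changelog.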

* Re: [PATCH] mm: fix kernel BUG at huge_memory.c:1474!
  2012-12-17  2:56     ` [PATCH] mm: fix kernel BUG at huge_memory.c:1474! Hugh Dickins
@ 2012-12-17  3:00       ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-12-17  3:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton,
	Kirill A. Shutemov, LKML

On Sun, Dec 16, 2012 at 6:56 PM, Hugh Dickins <hughd@google.com> wrote:
> Andrea's autonuma-benchmark numa01 hits kernel BUG at huge_memory.c:1474!
> in change_huge_pmd called from change_protection from change_prot_numa
> from task_numa_work.
>
> That BUG, introduced in the huge zero page commit cad7f613c4d0 ("thp:
> change_huge_pmd(): make sure we don't try to make a page writable")
> was trying to verify that newprot never adds write permission to an
> anonymous huge page; but Automatic NUMA Balancing's 4b10e7d562c9 ("mm:
> mempolicy: Implement change_prot_numa() in terms of change_protection()")
> adds a new prot_numa path into change_huge_pmd(), which makes no use of
> the newprot provided, and may retain the write bit in the pmd.

Ok. I did wonder about that particular conflict, but it looked like
neither case was writable, so I resolved it wrongly, and it worked for
me, but then I don't have any numa setups, nor do I even enable it..

Thanks,

                Linus

* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-16 23:19 ` Linus Torvalds
  2012-12-17  2:53   ` Hugh Dickins
@ 2012-12-17 10:10   ` Ingo Molnar
  2012-12-17 11:12   ` Mel Gorman
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2012-12-17 10:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Dec 12, 2012 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> > This is a pull request for "Automatic NUMA Balancing V11". The list
> 
> Ok, guys, I've pulled this and pushed out. There were some 
> conflicts with both the VM changes and with the scheduler 
> tree, but they were pretty small and looked simple, so I fixed 
> them up and hope they all work.

Cool, thanks Linus!

> Has anybody tested the impact on single-node systems? If 
> distros enable this by default (and it does have 'default y', 
> which is a big no-no for new features - I undid that part) 

Yes, that was for easy testing, leaving it in was an oversight.

> then there will be tons of people running this without 
> actually having multiple sockets. Does it gracefully avoid 
> pointless overheads for this case?

Yes. We have:

+       bool numabalancing_default = false;
+
+       if (IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED))
+               numabalancing_default = true;
+
+       if (nr_node_ids > 1 && !numabalancing_override) {
+               printk(KERN_INFO "Enabling automatic NUMA balancing. "
+                       "Configure with numa_balancing= or sysctl");
+               set_numabalancing_state(numabalancing_default);
+       }

The nr_node_ids check makes sure that on single-node systems we 
don't enable the feature.
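
As a rough userspace illustration of that gating (a sketch only, not the
kernel code: the struct and function names are hypothetical, and the way
an explicit numa_balancing= boot parameter sets the state directly is an
assumption made for the model):

```c
#include <stdbool.h>

/* Hypothetical model of the inputs to the boot-time check; not kernel code. */
struct toy_boot_config {
	int nr_node_ids;       /* number of possible NUMA nodes */
	bool default_enabled;  /* stands in for CONFIG_NUMA_BALANCING_DEFAULT_ENABLED */
	bool override_given;   /* assumption: numa_balancing= was passed at boot */
	bool override_value;   /* assumption: state requested by that parameter */
};

/* Returns whether balancing ends up enabled after boot in this model. */
static bool toy_numabalancing_enabled(const struct toy_boot_config *cfg)
{
	bool enabled = false;	/* the feature starts out disabled */

	/* Assumption: an explicit boot parameter sets the state directly. */
	if (cfg->override_given)
		enabled = cfg->override_value;

	/* Mirrors the quoted check: only multi-node systems get the default. */
	if (cfg->nr_node_ids > 1 && !cfg->override_given)
		enabled = cfg->default_enabled;

	return enabled;
}
```

With nr_node_ids == 1 and no override the model never leaves the disabled
state, whatever the config default says.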

At that point it will be some extra passive code in the kernel - 
last I measured it was around +20K to the kernel image plus a 
couple of extra branches in a couple of generic paths - but no 
measurable runtime overhead.

Any other negative impact would either come from preparatory or 
scalability patches attached to the NUMA balancing feature, 
which would be a regression we want to fix.

> Anyway, hopefully we'll have a more real numa balancing for 
> 3.9, and this is still considered a reasonable base for that 
> work.

We are working on it ;-)

Thanks,

	Ingo

* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-16 23:19 ` Linus Torvalds
  2012-12-17  2:53   ` Hugh Dickins
  2012-12-17 10:10   ` [GIT PULL] Automatic NUMA Balancing V11 Ingo Molnar
@ 2012-12-17 11:12   ` Mel Gorman
  2012-12-17 14:05   ` [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE Mel Gorman
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2012-12-17 11:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

On Sun, Dec 16, 2012 at 03:19:20PM -0800, Linus Torvalds wrote:
> On Wed, Dec 12, 2012 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
> > This is a pull request for "Automatic NUMA Balancing V11". The list
> 
> Ok, guys, I've pulled this and pushed out. There were some conflicts
> with both the VM changes and with the scheduler tree, but they were
> pretty small and looked simple, so I fixed them up and hope they all
> work.
> 

Thanks very much.

> Has anybody tested the impact on single-node systems?

Not as much as I'd like. I'll be queueing a full set of tests to run against
3.8-rc1 when it's released and I should have the latest -stable kernel results
to compare against.

> If distros
> enable this by default (and it does have 'default y', which is a big
> no-no for new features - I undid that part)

My bad. That switch to default y was a last-minute change by me when I
was taking a final look through. I switched it to default y based on the
distribution and upstream discussion at the last kernel summit. I expected
that distributions, particularly the enterprise ones, would be enabling
this by default and I thought that the upstream default should be the same.

> then there will be tons of
> people running this without actually having multiple sockets. Does it
> gracefully avoid pointless overheads for this case?
> 

Good question. I'm expecting the impact to be low for two reasons.

First, commit 1a687c2e (mm: sched: numa: Control enabling and disabling of
NUMA balancing) disables the feature by default and it is only enabled by
check_numabalancing_enable() if nr_node_ids > 1. It would have been even
better if the check in task_tick_numa was based on numabalancing_enabled
because that would save a small cost if !CONFIG_SCHED_DEBUG.

Second, even if it is enabled by numa_balancing=enable on UMA then commit
5bca2303 (mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node) comes into play. On single socket systems it should never
be possible to schedule on a new node and so the PTE scanner should stay
inactive unless the user uses the scheduler debugging feature to enable
NUMA_FORCE.

Either commit should prevent UMA systems scanning PTEs, marking them pte_numa
and incurring NUMA hinting faults, which avoids the vast bulk of the cost.
I'm currently guessing that if there is a visible impact from the series
on UMA it'll be due to anon_vma mutex changing to a rwsem. I consider a
regression due to this change to be very unlikely as compaction and THP
migrate far less than automatic NUMA balancing potentially does. If a bug
of this type is reported then I'm more likely to consider the real bug to
be that compaction is migrating excessively and the locking change just
made the bug more obvious.
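
The two gates described above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions: the toy_* names and
sentinel values are hypothetical and only approximate the first-node
tracking idea, they are not the definitions used in the tree.

```c
#include <stdbool.h>

/* Hypothetical sentinel values for the toy model; not the kernel's. */
#define TOY_SCAN_INIT   (-1)	/* no node recorded yet */
#define TOY_SCAN_ACTIVE (-2)	/* scanning has been switched on */

struct toy_mm {
	int first_nid;	/* first node the task ran on, or a sentinel above */
};

/*
 * Decide whether the PTE scanner would run on this tick.
 * balancing_enabled models the first gate (feature state), the
 * first_nid tracking models the second (delay until a new node), and
 * numa_force models the NUMA_FORCE scheduler debugging feature.
 */
static bool toy_should_scan(struct toy_mm *mm, bool balancing_enabled,
			    int current_nid, bool numa_force)
{
	if (!balancing_enabled)
		return false;

	/* Remember the first node this address space was seen running on. */
	if (mm->first_nid == TOY_SCAN_INIT)
		mm->first_nid = current_nid;

	if (mm->first_nid != TOY_SCAN_ACTIVE) {
		/* Still on the first node and not forced: stay inactive. */
		if (current_nid == mm->first_nid && !numa_force)
			return false;
		mm->first_nid = TOY_SCAN_ACTIVE;
	}
	return true;
}
```

On a single-node machine current_nid never changes, so the model keeps
returning false unless numa_force is set.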

> Anyway, hopefully we'll have a more real numa balancing for 3.9, and
> this is still considered a reasonable base for that work.
> 

That is what I'm hoping!

-- 
Mel Gorman
SUSE Labs

* [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE
  2012-12-16 23:19 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2012-12-17 11:12   ` Mel Gorman
@ 2012-12-17 14:05   ` Mel Gorman
  2012-12-18  7:55     ` David Rientjes
  2012-12-18  8:03   ` [patch] x86, paravirt: fix build error when thp is disabled David Rientjes
  2012-12-20 13:50   ` [GIT PULL] Automatic NUMA Balancing V11 Alex Shi
  5 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2012-12-17 14:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, Michal Hocko,
	LKML

Michal Hocko reported that the following build error occurs if
CONFIG_NUMA_BALANCING is set without THP support

kernel/sched/fair.c: In function ‘task_numa_work’:
kernel/sched/fair.c:932:55: error: call to ‘__build_bug_failed’ declared with attribute error: BUILD_BUG failed

The problem is that HPAGE_PMD_SHIFT triggers a BUILD_BUG() on
!CONFIG_TRANSPARENT_HUGEPAGE. This patch addresses the problem by
comparing the VMA size against HPAGE_SIZE instead.

Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af5af9..4603d6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,7 +929,7 @@ void task_numa_work(struct callback_head *work)
 			continue;
 
 		/* Skip small VMAs. They are not likely to be of relevance */
-		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
 			continue;
 
 		do {
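
For reference, the old and new forms of the small-VMA check select the
same VMAs when the THP and hugetlb huge page sizes match. The sketch
below checks this in userspace, assuming x86-64 values (4 KiB base pages,
2 MiB huge pages); the TOY_* constants and helper names are hypothetical,
and other architectures may define HPAGE_SIZE differently.

```c
#include <stdbool.h>

/* Assumed x86-64 values: 4 KiB base pages and 2 MiB huge pages. */
#define TOY_PAGE_SHIFT   12
#define TOY_HPAGE_SHIFT  21
#define TOY_HPAGE_SIZE   (1UL << TOY_HPAGE_SHIFT)
#define TOY_HPAGE_PMD_NR (1UL << (TOY_HPAGE_SHIFT - TOY_PAGE_SHIFT))

/* Old form: count base pages and compare with pages per huge page. */
static bool toy_small_vma_old(unsigned long start, unsigned long end)
{
	return ((end - start) >> TOY_PAGE_SHIFT) < TOY_HPAGE_PMD_NR;
}

/* New form: compare the byte length with the huge page size. */
static bool toy_small_vma_new(unsigned long start, unsigned long end)
{
	return end - start < TOY_HPAGE_SIZE;
}
```

Under these assumptions a VMA one base page short of 2 MiB is skipped by
both forms and a VMA of exactly 2 MiB is scanned by both.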


* Re: [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE
  2012-12-17 14:05   ` [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE Mel Gorman
@ 2012-12-18  7:55     ` David Rientjes
  0 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2012-12-18  7:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, Michal Hocko,
	LKML

On Mon, 17 Dec 2012, Mel Gorman wrote:

> Michal Hocko reported that the following build error occurs if
> CONFIG_NUMA_BALANCING is set without THP support
> 
> kernel/sched/fair.c: In function ‘task_numa_work’:
> kernel/sched/fair.c:932:55: error: call to ‘__build_bug_failed’ declared with attribute error: BUILD_BUG failed
> 
> The problem is that HPAGE_PMD_SHIFT triggers a BUILD_BUG() on
> !CONFIG_TRANSPARENT_HUGEPAGE. This patch addresses the problem.
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

Fixes the build issue for me, thanks.


* [patch] x86, paravirt: fix build error when thp is disabled
  2012-12-16 23:19 ` Linus Torvalds
                     ` (3 preceding siblings ...)
  2012-12-17 14:05   ` [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE Mel Gorman
@ 2012-12-18  8:03   ` David Rientjes
  2012-12-20 13:50   ` [GIT PULL] Automatic NUMA Balancing V11 Alex Shi
  5 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2012-12-18  8:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

With CONFIG_PARAVIRT=y and CONFIG_TRANSPARENT_HUGEPAGE=n, the build breaks 
because set_pmd_at() is undeclared:

mm/memory.c: In function 'do_pmd_numa_page':
mm/memory.c:3520: error: implicit declaration of function 'set_pmd_at'
mm/mprotect.c: In function 'change_pmd_protnuma':
mm/mprotect.c:120: error: implicit declaration of function 'set_pmd_at'

This is because paravirt defines set_pmd_at() only when 
CONFIG_TRANSPARENT_HUGEPAGE=y and such a restriction is unneeded.  The fix 
is to define it for all CONFIG_PARAVIRT configurations.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -528,7 +528,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -539,7 +538,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
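
The failure mode described in the changelog above is a helper that exists only under one config option while its callers are compiled regardless of that option. The following is a minimal userspace sketch of the post-fix arrangement, not the kernel code. SKETCH_PARAVIRT and SKETCH_THP are hypothetical stand-ins for the Kconfig symbols, and struct sketch_pmd, sketch_set_pmd_at() and sketch_numa_fixup() are invented for illustration.

```c
/* Hypothetical stand-ins for the Kconfig symbols (assumption). */
#define SKETCH_PARAVIRT 1
/* SKETCH_THP is deliberately left undefined to model THP=n. */

struct sketch_pmd {
	unsigned long val;
};

/*
 * After the fix: the helper is available in every SKETCH_PARAVIRT
 * build. Before the fix, the equivalent definition was additionally
 * wrapped in a THP guard, so with SKETCH_THP undefined the caller
 * below would have had no declaration to use.
 */
#ifdef SKETCH_PARAVIRT
static inline void sketch_set_pmd_at(struct sketch_pmd *pmdp,
				     struct sketch_pmd pmd)
{
	*pmdp = pmd;
}
#endif

/*
 * Caller that is built whether or not SKETCH_THP is set, in the same
 * way the changelog reports do_pmd_numa_page() and
 * change_pmd_protnuma() calling set_pmd_at().
 */
static unsigned long sketch_numa_fixup(unsigned long newval)
{
	struct sketch_pmd slot = { 0 };
	struct sketch_pmd pmd = { newval };

	sketch_set_pmd_at(&slot, pmd);
	return slot.val;
}
```

Wrapping sketch_set_pmd_at() in an additional `#ifdef SKETCH_THP` would be expected to reproduce the same class of "implicit declaration of function" diagnostic quoted in the changelog, which is why the patch drops the inner guard rather than guarding the callers.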


* Re: [GIT PULL] Automatic NUMA Balancing V11
  2012-12-16 23:19 ` Linus Torvalds
                     ` (4 preceding siblings ...)
  2012-12-18  8:03   ` [patch] x86, paravirt: fix build error when thp is disabled David Rientjes
@ 2012-12-20 13:50   ` Alex Shi
  5 siblings, 0 replies; 13+ messages in thread
From: Alex Shi @ 2012-12-20 13:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Srikar Dronamraju, Aneesh Kumar, Andrew Morton, LKML

On Mon, Dec 17, 2012 at 7:19 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Dec 12, 2012 at 2:03 AM, Mel Gorman <mgorman@suse.de> wrote:
>> This is a pull request for "Automatic NUMA Balancing V11". The list
>
> Ok, guys, I've pulled this and pushed out. There were some conflicts
> with both the VM changes and with the scheduler tree, but they were
> pretty small and looked simple, so I fixed them up and hope they all
> work.
>
> Has anybody tested the impact on single-node systems? If distros

I tested your tree up to this patch set under our lkp testing system,
with the benchmarks kbuild, aim9-multitask, specjbb2005 -openjdk/jrockit,
hackbench-process/thread, sysbench -fileio-cfq, and multiple loopback
netperf, on 2 laptops: an SNB i7 and a WSM i5.
Only aim9-multitask-nl (2000 loads, increment 100) shows about a 2%
performance drop, on both machines.
All the others show no clear performance change.


> enable this by default (and it does have 'default y', which is a big
> no-no for new features - I undid that part) then there will be tons of
> people running this without actually having multiple sockets. Does it
> gracefully avoid pointless overheads for this case?
>


Thread overview: 13+ messages
2012-12-12 10:03 [GIT PULL] Automatic NUMA Balancing V11 Mel Gorman
2012-12-12 21:27 ` Stephen Rothwell
2012-12-12 22:17   ` Mel Gorman
2012-12-16 23:19 ` Linus Torvalds
2012-12-17  2:53   ` Hugh Dickins
2012-12-17  2:56     ` [PATCH] mm: fix kernel BUG at huge_memory.c:1474! Hugh Dickins
2012-12-17  3:00       ` Linus Torvalds
2012-12-17 10:10   ` [GIT PULL] Automatic NUMA Balancing V11 Ingo Molnar
2012-12-17 11:12   ` Mel Gorman
2012-12-17 14:05   ` [PATCH] sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE Mel Gorman
2012-12-18  7:55     ` David Rientjes
2012-12-18  8:03   ` [patch] x86, paravirt: fix build error when thp is disabled David Rientjes
2012-12-20 13:50   ` [GIT PULL] Automatic NUMA Balancing V11 Alex Shi
