linux-kernel.vger.kernel.org archive mirror
* BUG: mm, numa: test segfaults, only when NUMA balancing is on
@ 2013-10-16 15:54 Alex Thorlton
  2013-10-17 11:30 ` Bob Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Alex Thorlton @ 2013-10-16 15:54 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Hi guys,

I ran into a bug a week or so ago, that I believe has something to do
with NUMA balancing, but I'm having a tough time tracking down exactly
what is causing it.  When running with the following configuration
options set:

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA_BALANCING=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set

I get intermittent segfaults when running the memscale test that we've
been using to test some of the THP changes.  Here's a link to the test:

ftp://shell.sgi.com/collect/memscale/

I typically run the test with a line similar to this:

./thp_memscale -C 0 -m 0 -c <cores> -b <memory>

Where <cores> is the number of cores to spawn threads on, and <memory>
is the amount of memory to reserve from each core.  The <memory> field
can accept values like 512m or 1g, etc.  I typically run 256 cores and
512m, though I think the problem should be reproducible on anything with
128+ cores.
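
For example, a typical 256-core run reserving 512m per core looks like:

./thp_memscale -C 0 -m 0 -c 256 -b 512m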

The test never seems to have any problems when running with hugetlbfs
on and NUMA balancing off, but it segfaults every once in a while with
the config options above.  It seems to occur more frequently, the more
cores you run on.  It segfaults on about 50% of the runs at 256 cores,
and on almost every run at 512 cores.  The fewest number of cores I've
seen a segfault on has been 128, though it seems to be rare on this many
cores.

At this point, I'm not familiar enough with NUMA balancing code to know
what could be causing this, and we don't typically run with NUMA
balancing on, so I don't see this in my everyday testing, but I felt
that it was definitely worth bringing up.
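
As a side note, with CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y the balancing
can also be switched off without a rebuild, which may help narrow things
down (illustrative, assuming the numa_balancing= boot parameter is available
on these kernels):

numa_balancing=disable        (appended to the kernel command line)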

If anybody has any ideas of where I could poke around to find a
solution, please let me know.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-10-16 15:54 BUG: mm, numa: test segfaults, only when NUMA balancing is on Alex Thorlton
@ 2013-10-17 11:30 ` Bob Liu
  2013-10-18  0:33   ` Alex Thorlton
  2013-11-04 14:58 ` Mel Gorman
  2013-11-07 21:52 ` Alex Thorlton
  2 siblings, 1 reply; 18+ messages in thread
From: Bob Liu @ 2013-10-17 11:30 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Linux-MM, Linux-Kernel

Hi Alex,

On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> Hi guys,
>
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it.  When running with the following configuration
> options set:
>
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> CONFIG_NUMA_BALANCING=y
> # CONFIG_HUGETLBFS is not set
> # CONFIG_HUGETLB_PAGE is not set
>

What's your kernel version?
And did you enable CONFIG_TRANSPARENT_HUGEPAGE ?

> I get intermittent segfaults when running the memscale test that we've
> been using to test some of the THP changes.  Here's a link to the test:
>
> ftp://shell.sgi.com/collect/memscale/
>
> I typically run the test with a line similar to this:
>
> ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
>
> Where <cores> is the number of cores to spawn threads on, and <memory>
> is the amount of memory to reserve from each core.  The <memory> field
> can accept values like 512m or 1g, etc.  I typically run 256 cores and
> 512m, though I think the problem should be reproducible on anything with
> 128+ cores.
>
> The test never seems to have any problems when running with hugetlbfs
> on and NUMA balancing off, but it segfaults every once in a while with
> the config options above.  It seems to occur more frequently, the more
> cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> and on almost every run at 512 cores.  The fewest number of cores I've
> seen a segfault on has been 128, though it seems to be rare on this many
> cores.
>

Could you please attach some logs?

> At this point, I'm not familiar enough with NUMA balancing code to know
> what could be causing this, and we don't typically run with NUMA
> balancing on, so I don't see this in my everyday testing, but I felt
> that it was definitely worth bringing up.
>
> If anybody has any ideas of where I could poke around to find a
> solution, please let me know.
>

-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-10-17 11:30 ` Bob Liu
@ 2013-10-18  0:33   ` Alex Thorlton
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Thorlton @ 2013-10-18  0:33 UTC (permalink / raw)
  To: Bob Liu; +Cc: Linux-MM, Linux-Kernel

On Thu, Oct 17, 2013 at 07:30:58PM +0800, Bob Liu wrote:
> Hi Alex,
> 
> On Wed, Oct 16, 2013 at 11:54 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> > Hi guys,
> >
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it.  When running with the following configuration
> > options set:
> >
> > CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> > CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> > CONFIG_NUMA_BALANCING=y
> > # CONFIG_HUGETLBFS is not set
> > # CONFIG_HUGETLB_PAGE is not set
> >
> 
> What's your kernel version?
> And did you enable CONFIG_TRANSPARENT_HUGEPAGE ?

Ah, two important things that I forgot to include!  The kernel I
originally spotted the problem on was 3.11 and it continued to be an
issue up through 3.12-rc4, but after a 30-trial run of the
test today, it appears that the issue must have cleared up after the
3.12-rc5 release on Monday.  I'll still include the requested
information, but I guess this is no longer an issue.

I rolled all the way back to 3.7 while researching the issue, and that
appears to be the last kernel where the problem didn't show up.  3.8
is the first kernel where the bug appears; I believe this is also the
kernel where NUMA balancing was officially introduced.

Here are my settings related to THP:

CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set

I did most of my testing with THP set to "always", although the problem
still occurs with THP set to "never".  
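
For reference, the "always"/"never" switch here is the standard sysfs knob:

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled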

> 
> > I get intermittent segfaults when running the memscale test that we've
> > been using to test some of the THP changes.  Here's a link to the test:
> >
> > ftp://shell.sgi.com/collect/memscale/
> >
> > I typically run the test with a line similar to this:
> >
> > ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
> >
> > Where <cores> is the number of cores to spawn threads on, and <memory>
> > is the amount of memory to reserve from each core.  The <memory> field
> > can accept values like 512m or 1g, etc.  I typically run 256 cores and
> > 512m, though I think the problem should be reproducible on anything with
> > 128+ cores.
> >
> > The test never seems to have any problems when running with hugetlbfs
> > on and NUMA balancing off, but it segfaults every once in a while with
> > the config options above.  It seems to occur more frequently, the more
> > cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> > and on almost every run at 512 cores.  The fewest number of cores I've
> > seen a segfault on has been 128, though it seems to be rare on this many
> > cores.
> >
> 
> Could you please attach some logs?

Here are the relevant chunks from the syslog for a 10-shot run at 256
cores; each chunk is from a separate run.  4 out of 10 failed with
segfaults:

Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21566]: segfault at 0 ip           (null) sp 00007ff8531fcdc0 error 14 in thp_memscale[400000+5000]
Oct 17 11:36:41 harp83-sys kernel: thp_memscale[21565]: segfault at 0 ip           (null) sp 00007ff8539fddc0 error 14 in thp_memscale[400000+5000]
---
Oct 17 12:08:14 harp83-sys kernel: thp_memscale[22893]: segfault at 0 ip           (null) sp 00007ff69cffddc0 error 14 in thp_memscale[400000+5000]
---
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23995]: segfault at 0 ip           (null) sp 00007fe7af1fcdc0 error 14 in thp_memscale[400000+5000]
Oct 17 12:26:30 harp83-sys kernel: thp_memscale[23994]: segfault at 0 ip           (null) sp 00007fe7af9fddc0 error 14 in thp_memscale[400000+5000]
---
Oct 17 12:32:29 harp83-sys kernel: thp_memscale[24116]: segfault at 0 ip           (null) sp 00007ff77a9fcdc0 error 14 in thp_memscale[400000+5000]

Since this has cleared up in the latest release, I won't be pursuing the
issue any further (though I'll keep an eye out to make sure that it
doesn't show back up).  I am, however, still curious as to what the
cause of the problem was...

> 
> > At this point, I'm not familiar enough with NUMA balancing code to know
> > what could be causing this, and we don't typically run with NUMA
> > balancing on, so I don't see this in my everyday testing, but I felt
> > that it was definitely worth bringing up.
> >
> > If anybody has any ideas of where I could poke around to find a
> > solution, please let me know.
> >
> 
> -- 
> Regards,
> --Bob

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-10-16 15:54 BUG: mm, numa: test segfaults, only when NUMA balancing is on Alex Thorlton
  2013-10-17 11:30 ` Bob Liu
@ 2013-11-04 14:58 ` Mel Gorman
  2013-11-04 20:03   ` Alex Thorlton
  2013-11-07 21:52 ` Alex Thorlton
  2 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-04 14:58 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: linux-mm, linux-kernel

On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> Hi guys,
> 
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it.  When running with the following configuration
> options set:
> 

Can you test with patches
cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
applied? They fix some known memory corruption problems, were merged for
3.12 (so alternatively just test 3.12) and have been tagged for -stable.
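
(For reference, that range can be listed with a plain git range query, e.g.:

git log --oneline cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
)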

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-04 14:58 ` Mel Gorman
@ 2013-11-04 20:03   ` Alex Thorlton
  2013-11-06 13:10     ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Thorlton @ 2013-11-04 20:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

On Mon, Nov 04, 2013 at 02:58:28PM +0000, Mel Gorman wrote:
> On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > Hi guys,
> > 
> > I ran into a bug a week or so ago, that I believe has something to do
> > with NUMA balancing, but I'm having a tough time tracking down exactly
> > what is causing it.  When running with the following configuration
> > options set:
> > 
> 
> Can you test with patches
> cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> applied? They fix some known memory corruption problems, were merged for
> 3.12 (so alternatively just test 3.12) and have been tagged for -stable.

I just finished testing with 3.12, and I'm still seeing the same issue.
This is actually a bit strange to me, because, when I tested with
3.12-rc5 a while back, everything seemed to be ok (see previous e-mail
in this thread, to Bob Liu).  I guess, embarrassingly enough, I must have
been playing with a screwed up config that day, and managed to somehow
avoid the problem...  Either way, it appears that we still have a
problem here.

I'll poke around a bit more on this in the next few days and see if I
can come up with any more information.  In the meantime, let me know if
you have any other suggestions.

Thanks,

- Alex

> 
> Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-04 20:03   ` Alex Thorlton
@ 2013-11-06 13:10     ` Mel Gorman
  2013-11-07 21:48       ` Alex Thorlton
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-06 13:10 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: linux-mm, linux-kernel

On Mon, Nov 04, 2013 at 02:03:46PM -0600, Alex Thorlton wrote:
> On Mon, Nov 04, 2013 at 02:58:28PM +0000, Mel Gorman wrote:
> > On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > > Hi guys,
> > > 
> > > I ran into a bug a week or so ago, that I believe has something to do
> > > with NUMA balancing, but I'm having a tough time tracking down exactly
> > > what is causing it.  When running with the following configuration
> > > options set:
> > > 
> > 
> > Can you test with patches
> > cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> > applied? They fix some known memory corruption problems, were merged for
> > 3.12 (so alternatively just test 3.12) and have been tagged for -stable.
> 
> I just finished testing with 3.12, and I'm still seeing the same issue.

Ok, I plugged your test into mmtests and ran it a few times but was not
able to reproduce the same issue. It's a much smaller machine which
might be a factor.

> I'll poke around a bit more on this in the next few days and see if I
> can come up with any more information.  In the meantime, let me know if
> you have any other suggestions.
> 

Try the following patch on top of 3.12. It's a patch that is expected to
be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
-stable but corruption trumps performance and the full series is not
going to be considered acceptable for -stable

---8<---
mm: numa: Do not batch handle PMD pages

With the THP migration races closed it is still possible to occasionally
see corruption. The problem is related to handling PMD pages in batch.
When a page fault is handled it can be assumed that the page being
faulted will also be flushed from the TLB. The same flushing does not
happen when handling PMD pages in batch. Fixing it is straightforward but
there are a number of reasons not to:

1. Multiple TLB flushes may have to be sent depending on what pages get
   migrated
2. The handling of PMDs in batch means that faults get accounted to
   the task that is handling the fault. While care is taken to only
   mark PMDs where the last CPU and PID match, it can still have problems
   due to PID truncation when matching PIDs.
3. Batching on the PMD level may reduce faults but setting pmd_numa
   requires taking a heavy lock that can contend with THP migration
   and handling the fault requires the release/acquisition of the PTL
   for every page migrated. It's still pretty heavy.

PMD batch handling is not something that people have ever been happy
with. This patch removes it and later patches will deal with the
additional fault overhead using more intelligent migration rate adaptation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c   | 91 ++---------------------------------------------------------
 mm/mprotect.c | 39 ++-----------------------
 2 files changed, 4 insertions(+), 126 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d176154..f453384 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3586,93 +3586,6 @@ out:
 	return 0;
 }
 
-/* NUMA hinting page fault entry point for regular pmds */
-#ifdef CONFIG_NUMA_BALANCING
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	pte_t *pte, *orig_pte;
-	unsigned long _addr = addr & PMD_MASK;
-	unsigned long offset;
-	spinlock_t *ptl;
-	bool numa = false;
-
-	spin_lock(&mm->page_table_lock);
-	pmd = *pmdp;
-	if (pmd_numa(pmd)) {
-		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
-		numa = true;
-	}
-	spin_unlock(&mm->page_table_lock);
-
-	if (!numa)
-		return 0;
-
-	/* we're in a page fault so some vma must be in the range */
-	BUG_ON(!vma);
-	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
-	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
-	VM_BUG_ON(offset >= PMD_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
-	pte += offset >> PAGE_SHIFT;
-	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
-		pte_t pteval = *pte;
-		struct page *page;
-		int page_nid = -1;
-		int target_nid;
-		bool migrated = false;
-
-		if (!pte_present(pteval))
-			continue;
-		if (!pte_numa(pteval))
-			continue;
-		if (addr >= vma->vm_end) {
-			vma = find_vma(mm, addr);
-			/* there's a pte present so there must be a vma */
-			BUG_ON(!vma);
-			BUG_ON(addr < vma->vm_start);
-		}
-		if (pte_numa(pteval)) {
-			pteval = pte_mknonnuma(pteval);
-			set_pte_at(mm, addr, pte, pteval);
-		}
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page))
-			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
-
-		page_nid = page_to_nid(page);
-		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-		pte_unmap_unlock(pte, ptl);
-		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
-			if (migrated)
-				page_nid = target_nid;
-		} else {
-			put_page(page);
-		}
-
-		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
-
-		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-	}
-	pte_unmap_unlock(orig_pte, ptl);
-
-	return 0;
-}
-#else
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	BUG();
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3811,8 +3724,8 @@ retry:
 		}
 	}
 
-	if (pmd_numa(*pmd))
-		return do_pmd_numa_page(mm, vma, address, pmd);
+	/* THP should already have been handled */
+	BUG_ON(pmd_numa(*pmd));
 
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 412ba2b..4c9d733 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,12 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
-	int last_nid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -63,12 +61,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int this_nid = page_to_nid(page);
-					if (last_nid == -1)
-						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
-
 					/* only check non-shared pages */
 					if (!pte_numa(oldpte) &&
 					    page_mapcount(page) == 1) {
@@ -111,26 +103,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
 	return pages;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	spin_lock(&mm->page_table_lock);
-	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-	spin_unlock(&mm->page_table_lock);
-}
-#else
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	BUG();
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -138,7 +113,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -156,16 +130,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
-
-		/*
-		 * If we are changing protections for NUMA hinting faults then
-		 * set pmd_numa if the examined pages were all on the same
-		 * node. This allows a regular PMD to be handled as one fault
-		 * and effectively batches the taking of the PTL
-		 */
-		if (prot_numa && all_same_node)
-			change_pmd_protnuma(vma->vm_mm, addr, pmd);
+				 dirty_accountable, prot_numa);
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-06 13:10     ` Mel Gorman
@ 2013-11-07 21:48       ` Alex Thorlton
  2013-11-08 11:20         ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Thorlton @ 2013-11-07 21:48 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

On Wed, Nov 06, 2013 at 01:10:48PM +0000, Mel Gorman wrote:
> On Mon, Nov 04, 2013 at 02:03:46PM -0600, Alex Thorlton wrote:
> > On Mon, Nov 04, 2013 at 02:58:28PM +0000, Mel Gorman wrote:
> > > On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> > > > Hi guys,
> > > > 
> > > > I ran into a bug a week or so ago, that I believe has something to do
> > > > with NUMA balancing, but I'm having a tough time tracking down exactly
> > > > what is causing it.  When running with the following configuration
> > > > options set:
> > > > 
> > > 
> > > Can you test with patches
> > > cd65718712469ad844467250e8fad20a5838baae..0255d491848032f6c601b6410c3b8ebded3a37b1
> > > applied? They fix some known memory corruption problems, were merged for
> > > 3.12 (so alternatively just test 3.12) and have been tagged for -stable.
> > 
> > I just finished testing with 3.12, and I'm still seeing the same issue.
> 
> Ok, I plugged your test into mmtests and ran it a few times but was not
> able to reproduce the same issue. It's a much smaller machine which
> might be a factor.
> 
> > I'll poke around a bit more on this in the next few days and see if I
> > can come up with any more information.  In the meantime, let me know if
> > you have any other suggestions.
> > 
> 
> Try the following patch on top of 3.12. It's a patch that is expected to
> be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> -stable but corruption trumps performance and the full series is not
> going to be considered acceptable for -stable

I gave this patch a shot, and it didn't seem to solve the problem.
Actually I'm running into what appear to be *worse* problems on the 3.12
kernel.  Here're a couple stack traces of what I get when I run the test
on 3.12, 512 cores:

(These are just two of the CPUs, obviously, but most of the memscale
processes appeared to be in one of these two spots)

Nov  7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
Nov  7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
Nov  7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov  7 13:54:39 uvpsw1 kernel: task: ffff8810647e0300 ti: ffff88106413e000 task.ti: ffff88106413e000
Nov  7 13:54:39 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
Nov  7 13:54:39 uvpsw1 kernel: RSP: 0018:ffff88106413fd38  EFLAGS: 00000283
Nov  7 13:54:39 uvpsw1 kernel: RAX: 00000000a1a9a0fe RBX: 0000000000000206 RCX: ffff880000000000
Nov  7 13:54:41 uvpsw1 kernel: RDX: 000000000000a1a9 RSI: 00003ffffffff000 RDI: ffff8907ded35494
Nov  7 13:54:41 uvpsw1 kernel: RBP: ffff88106413fd38 R08: 0000000000000006 R09: 0000000000000002
Nov  7 13:54:41 uvpsw1 kernel: R10: 0000000000000007 R11: ffff88106413ff40 R12: ffff8907ded35494
Nov  7 13:54:42 uvpsw1 kernel: R13: ffff88106413fe1c R14: ffff8810637a05f0 R15: 0000000000000206
Nov  7 13:54:42 uvpsw1 kernel: FS:  00007fffd5def700(0000) GS:ffff88107d980000(0000) knlGS:0000000000000000
Nov  7 13:54:42 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  7 13:54:42 uvpsw1 kernel: CR2: 00007fffd5ded000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
Nov  7 13:54:42 uvpsw1 kernel: Stack:
Nov  7 13:54:42 uvpsw1 kernel:  ffff88106413fda8 ffffffff810d670a 0000000000000002 0000000000000006
Nov  7 13:54:42 uvpsw1 kernel:  00007fff57dde000 ffff8810640e1cc0 000002006413fe10 ffff8907ded35440
Nov  7 13:54:45 uvpsw1 kernel:  ffff88106413fda8 0000000000000206 0000000000000002 0000000000000000
Nov  7 13:54:45 uvpsw1 kernel: Call Trace:
Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d670a>] follow_page_mask+0x123/0x3f1
Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7c4e>] __get_user_pages+0x3e3/0x488
Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7d90>] get_user_pages+0x4d/0x4f
Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810ec869>] SyS_get_mempolicy+0x1a9/0x3e0
Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff8151d422>] system_call_fastpath+0x16/0x1b
Nov  7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00

Nov  7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
Nov  7 13:55:59 uvpsw1 kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
Nov  7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
Nov  7 13:56:04 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
Nov  7 13:56:04 uvpsw1 kernel: task: ffff881063c56380 ti: ffff8810621b8000 task.ti: ffff8810621b8000
Nov  7 13:56:04 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
Nov  7 13:56:04 uvpsw1 kernel: RSP: 0018:ffff8810621b9c98  EFLAGS: 00000283
Nov  7 13:56:04 uvpsw1 kernel: RAX: 00000000a20aa0ff RBX: ffff8810621002b0 RCX: 8000000000000025
Nov  7 13:56:04 uvpsw1 kernel: RDX: 000000000000a20a RSI: ffff8810621002b0 RDI: ffff8907ded35494
Nov  7 13:56:04 uvpsw1 kernel: RBP: ffff8810621b9c98 R08: 0000000000000001 R09: 0000000000000001
Nov  7 13:56:04 uvpsw1 kernel: R10: 000000000000000a R11: 0000000000000246 R12: ffff881062f726b8
Nov  7 13:56:04 uvpsw1 kernel: R13: 0000000000000001 R14: ffff8810621002b0 R15: ffff881062f726b8
Nov  7 13:56:09 uvpsw1 kernel: FS:  00007fff79512700(0000) GS:ffff88107da00000(0000) knlGS:0000000000000000
Nov  7 13:56:09 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  7 13:56:09 uvpsw1 kernel: CR2: 00007fff79510000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
Nov  7 13:56:09 uvpsw1 kernel: Stack:
Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9cb8 ffffffff810f3e57 8000000000000025 ffff881062f726b8
Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9ce8 ffffffff810f3edb 80000187dd73e166 00007fe2dae00000
Nov  7 13:56:09 uvpsw1 kernel:  ffff881063708ff8 00007fe2db000000 ffff8810621b9dc8 ffffffff810def2c
Nov  7 13:56:09 uvpsw1 kernel: Call Trace:
Nov  7 13:56:09 uvpsw1 kernel:  [<ffffffff810f3e57>] __pmd_trans_huge_lock+0x1a/0x7c
Nov  7 13:56:10 uvpsw1 kernel:  [<ffffffff810f3edb>] change_huge_pmd+0x22/0xcc
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810def2c>] change_protection+0x200/0x591
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ecb07>] change_prot_numa+0x16/0x2c
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8106c247>] task_numa_work+0x224/0x29a
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810551b1>] task_work_run+0x81/0x99
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810025e1>] do_notify_resume+0x539/0x54b
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810c3ce9>] ? put_page+0x10/0x24
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ec9fa>] ? SyS_get_mempolicy+0x33a/0x3e0
Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8151d6aa>] int_signal+0x12/0x17
Nov  7 13:56:14 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00

I managed to bisect the issue down to this commit:

0255d491848032f6c601b6410c3b8ebded3a37b1 is the first bad commit
commit 0255d491848032f6c601b6410c3b8ebded3a37b1
Author: Mel Gorman <mgorman@suse.de>
Date:   Mon Oct 7 11:28:47 2013 +0100

    mm: Account for a THP NUMA hinting update as one PTE update

    A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
    large difference when estimating the cost of automatic NUMA balancing and
    can be misleading when comparing results that had collapsed versus split
    THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Cc: <stable@kernel.org>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 e5a44a1f0eea2f41d2cccbdf07eafee4e171b1e2 ef030a7c78ef346095ac991c3e3aa139498ed8e7 M      mm

I haven't had a chance yet to dig into the code for this commit to see
what might be causing the crashes, but I have confirmed that this is
where the new problem started (checked the commit before this, and we
don't get the crash, just segfaults like we were getting before).  So,
in summary, we still have the segfault issue, but this new issue seems
to be a bit more serious, so I'm going to try and chase this one down
first.
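
(For anyone wanting to repeat the bisect, it followed the usual procedure;
the good/bad endpoints below are illustrative rather than the exact ones used:

git bisect start
git bisect bad v3.12-rc7
git bisect good v3.11
# build and boot each candidate, run thp_memscale at 512 cores, then mark
# the result with "git bisect good" or "git bisect bad" until git reports
# the first bad commit
)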

Let me know if you'd like any more information from me and I'll be glad
to provide it.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-10-16 15:54 BUG: mm, numa: test segfaults, only when NUMA balancing is on Alex Thorlton
  2013-10-17 11:30 ` Bob Liu
  2013-11-04 14:58 ` Mel Gorman
@ 2013-11-07 21:52 ` Alex Thorlton
  2 siblings, 0 replies; 18+ messages in thread
From: Alex Thorlton @ 2013-11-07 21:52 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

On Wed, Oct 16, 2013 at 10:54:29AM -0500, Alex Thorlton wrote:
> Hi guys,
> 
> I ran into a bug a week or so ago, that I believe has something to do
> with NUMA balancing, but I'm having a tough time tracking down exactly
> what is causing it.  When running with the following configuration
> options set:
> 
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
> CONFIG_NUMA_BALANCING=y
> # CONFIG_HUGETLBFS is not set
> # CONFIG_HUGETLB_PAGE is not set
> 
> I get intermittent segfaults when running the memscale test that we've
> been using to test some of the THP changes.  Here's a link to the test:
> 
> ftp://shell.sgi.com/collect/memscale/

For anyone who's interested, this test has been moved to:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

It should remain there permanently.
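
To grab and unpack it:

wget http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz
tar xzf thp_memscale.tar.gz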

> 
> I typically run the test with a line similar to this:
> 
> ./thp_memscale -C 0 -m 0 -c <cores> -b <memory>
> 
> Where <cores> is the number of cores to spawn threads on, and <memory>
> is the amount of memory to reserve from each core.  The <memory> field
> can accept values like 512m or 1g, etc.  I typically run 256 cores and
> 512m, though I think the problem should be reproducible on anything with
> 128+ cores.
> 
> The test never seems to have any problems when running with hugetlbfs
> on and NUMA balancing off, but it segfaults every once in a while with
> the config options above.  It seems to occur more frequently, the more
> cores you run on.  It segfaults on about 50% of the runs at 256 cores,
> and on almost every run at 512 cores.  The fewest number of cores I've
> seen a segfault on has been 128, though it seems to be rare on this many
> cores.
> 
> At this point, I'm not familiar enough with NUMA balancing code to know
> what could be causing this, and we don't typically run with NUMA
> balancing on, so I don't see this in my everyday testing, but I felt
> that it was definitely worth bringing up.
> 
> If anybody has any ideas of where I could poke around to find a
> solution, please let me know.
> 
> - Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-07 21:48       ` Alex Thorlton
@ 2013-11-08 11:20         ` Mel Gorman
  2013-11-08 14:08           ` Mel Gorman
  2013-11-08 22:13           ` Alex Thorlton
  0 siblings, 2 replies; 18+ messages in thread
From: Mel Gorman @ 2013-11-08 11:20 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > Try the following patch on top of 3.12. It's a patch that is expected to
> > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > -stable but corruption trumps performance and the full series is not
> > going to be considered acceptable for -stable
> 
> I gave this patch a shot, and it didn't seem to solve the problem.
> Actually I'm running into what appear to be *worse* problems on the 3.12
> kernel.  Here're a couple stack traces of what I get when I run the test
> on 3.12, 512 cores:
> 

Ok, so there are two issues at least. Whatever is causing your
corruption (which I still cannot reproduce) and the fact that 3.12 is
worse. The largest machine I've tested with is 40 cores. I'm trying to
get time on a 60 core machine to see if has a better chance. I will not
be able to get access to anything resembling 512 cores.

> (These are just two of the CPUs, obviously, but most of the memscale
> processes appeared to be in one of these two spots)
> 
> Nov  7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
> Nov  7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
> Nov  7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
> Nov  7 13:54:39 uvpsw1 kernel: task: ffff8810647e0300 ti: ffff88106413e000 task.ti: ffff88106413e000
> Nov  7 13:54:39 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
> Nov  7 13:54:39 uvpsw1 kernel: RSP: 0018:ffff88106413fd38  EFLAGS: 00000283
> Nov  7 13:54:39 uvpsw1 kernel: RAX: 00000000a1a9a0fe RBX: 0000000000000206 RCX: ffff880000000000
> Nov  7 13:54:41 uvpsw1 kernel: RDX: 000000000000a1a9 RSI: 00003ffffffff000 RDI: ffff8907ded35494
> Nov  7 13:54:41 uvpsw1 kernel: RBP: ffff88106413fd38 R08: 0000000000000006 R09: 0000000000000002
> Nov  7 13:54:41 uvpsw1 kernel: R10: 0000000000000007 R11: ffff88106413ff40 R12: ffff8907ded35494
> Nov  7 13:54:42 uvpsw1 kernel: R13: ffff88106413fe1c R14: ffff8810637a05f0 R15: 0000000000000206
> Nov  7 13:54:42 uvpsw1 kernel: FS:  00007fffd5def700(0000) GS:ffff88107d980000(0000) knlGS:0000000000000000
> Nov  7 13:54:42 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Nov  7 13:54:42 uvpsw1 kernel: CR2: 00007fffd5ded000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
> Nov  7 13:54:42 uvpsw1 kernel: Stack:
> Nov  7 13:54:42 uvpsw1 kernel:  ffff88106413fda8 ffffffff810d670a 0000000000000002 0000000000000006
> Nov  7 13:54:42 uvpsw1 kernel:  00007fff57dde000 ffff8810640e1cc0 000002006413fe10 ffff8907ded35440
> Nov  7 13:54:45 uvpsw1 kernel:  ffff88106413fda8 0000000000000206 0000000000000002 0000000000000000
> Nov  7 13:54:45 uvpsw1 kernel: Call Trace:
> Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d670a>] follow_page_mask+0x123/0x3f1
> Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7c4e>] __get_user_pages+0x3e3/0x488
> Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7d90>] get_user_pages+0x4d/0x4f
> Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810ec869>] SyS_get_mempolicy+0x1a9/0x3e0
> Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff8151d422>] system_call_fastpath+0x16/0x1b
> Nov  7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00
> 

That is probably the mm->page_table_lock being contended on. Kirill has
patches that split this lock, which will help scalability when THP is
enabled. They should be merged for 3.13-rc1. In itself the contention should
not cause bugs other than maybe soft lockups.

> Nov  7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
> Nov  7 13:55:59 uvpsw1 kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
> Nov  7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
> Nov  7 13:56:04 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
> Nov  7 13:56:04 uvpsw1 kernel: task: ffff881063c56380 ti: ffff8810621b8000 task.ti: ffff8810621b8000
> Nov  7 13:56:04 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
> Nov  7 13:56:04 uvpsw1 kernel: RSP: 0018:ffff8810621b9c98  EFLAGS: 00000283
> Nov  7 13:56:04 uvpsw1 kernel: RAX: 00000000a20aa0ff RBX: ffff8810621002b0 RCX: 8000000000000025
> Nov  7 13:56:04 uvpsw1 kernel: RDX: 000000000000a20a RSI: ffff8810621002b0 RDI: ffff8907ded35494
> Nov  7 13:56:04 uvpsw1 kernel: RBP: ffff8810621b9c98 R08: 0000000000000001 R09: 0000000000000001
> Nov  7 13:56:04 uvpsw1 kernel: R10: 000000000000000a R11: 0000000000000246 R12: ffff881062f726b8
> Nov  7 13:56:04 uvpsw1 kernel: R13: 0000000000000001 R14: ffff8810621002b0 R15: ffff881062f726b8
> Nov  7 13:56:09 uvpsw1 kernel: FS:  00007fff79512700(0000) GS:ffff88107da00000(0000) knlGS:0000000000000000
> Nov  7 13:56:09 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Nov  7 13:56:09 uvpsw1 kernel: CR2: 00007fff79510000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
> Nov  7 13:56:09 uvpsw1 kernel: Stack:
> Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9cb8 ffffffff810f3e57 8000000000000025 ffff881062f726b8
> Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9ce8 ffffffff810f3edb 80000187dd73e166 00007fe2dae00000
> Nov  7 13:56:09 uvpsw1 kernel:  ffff881063708ff8 00007fe2db000000 ffff8810621b9dc8 ffffffff810def2c
> Nov  7 13:56:09 uvpsw1 kernel: Call Trace:
> Nov  7 13:56:09 uvpsw1 kernel:  [<ffffffff810f3e57>] __pmd_trans_huge_lock+0x1a/0x7c
> Nov  7 13:56:10 uvpsw1 kernel:  [<ffffffff810f3edb>] change_huge_pmd+0x22/0xcc
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810def2c>] change_protection+0x200/0x591
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ecb07>] change_prot_numa+0x16/0x2c
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8106c247>] task_numa_work+0x224/0x29a
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810551b1>] task_work_run+0x81/0x99
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810025e1>] do_notify_resume+0x539/0x54b
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810c3ce9>] ? put_page+0x10/0x24
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ec9fa>] ? SyS_get_mempolicy+0x33a/0x3e0
> Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8151d6aa>] int_signal+0x12/0x17
> Nov  7 13:56:14 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00
> 

And this is indicating that NUMA balancing is making contention on that lock
much worse. I would have expected it to run slower, but not cause corruption.

There is no indication from these partial logs what the crash might be
due to, unfortunately. Is your machine configured to do anything like
panic on soft lockups? By any chance have you booted this machine to
use 1G pages by default for hugetlbfs?
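
(Illustrative ways to check, assuming the usual proc interfaces are present:

cat /proc/sys/kernel/softlockup_panic
cat /proc/cmdline    # look for hugepagesz=1G / default_hugepagesz=1G
)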

> I managed to bisect the issue down to this commit:
> 
> 0255d491848032f6c601b6410c3b8ebded3a37b1 is the first bad commit
> commit 0255d491848032f6c601b6410c3b8ebded3a37b1
> Author: Mel Gorman <mgorman@suse.de>
> Date:   Mon Oct 7 11:28:47 2013 +0100
> 
>     mm: Account for a THP NUMA hinting update as one PTE update
> 
>     A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
>     large difference when estimating the cost of automatic NUMA balancing and
>     can be misleading when comparing results that had collapsed versus split
>     THP. This patch addresses the accounting issue.
> 
>     Signed-off-by: Mel Gorman <mgorman@suse.de>
>     Reviewed-by: Rik van Riel <riel@redhat.com>
>     Cc: Andrea Arcangeli <aarcange@redhat.com>
>     Cc: Johannes Weiner <hannes@cmpxchg.org>
>     Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>     Cc: <stable@kernel.org>
>     Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>     Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> :040000 040000 e5a44a1f0eea2f41d2cccbdf07eafee4e171b1e2 ef030a7c78ef346095ac991c3e3aa139498ed8e7 M      mm
> 
> I haven't had a chance yet to dig into the code for this commit to see
> what might be causing the crashes, but I have confirmed that this is
> where the new problem started (checked the commit before this, and we
> don't get the crash, just segfaults like we were getting before). 

One consequence of this patch is that it adjusts the speed at which task_numa_work
scans the virtual address space. This was an oversight that needs to be
corrected. Can you test if the following patch on top of 3.12 brings you
back to "just" segfaulting? It is compile-tested only because my own tests
will not even be able to start with this patch for another 3-4 hours.

---8<---
mm: numa: Return the number of base pages altered by protection changes

Commit 0255d491 (mm: Account for a THP NUMA hinting update as one PTE
update) was added to account for the number of PTE updates when marking
pages prot_numa. task_numa_work was using the old return value to track
how much address space had been updated. Altering the return value causes
the scanner to do more work than it is configured to in a single pass.
The temptation is just to revert 0255d491 but that means the vmstat value
is once again impossible to interpret. This patch restores change_protection
to returning the number of pages updated but it will also return how many
base PTEs were skipped because THP pages were encountered. The impact of this
patch is that the NUMA PTE scanner will scan more slowly when THP is enabled.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..3b510c4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -32,7 +32,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			bool prot_numa);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..a1cc4bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1088,7 +1088,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks);
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable, int prot_numa);
+			      int dirty_accountable,
+			      int *prot_numa_pte_skipped);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..e554318 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1459,7 +1459,7 @@ out:
 }
 
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot, bool prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..7b27125 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -626,11 +626,14 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long addr, unsigned long end)
 {
 	int nr_updated;
+	int nr_pte_skipped = 0;
 	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot,
+			0, &nr_pte_skipped);
 	if (nr_updated)
-		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+		count_vm_numa_events(NUMA_PTE_UPDATES,
+					nr_updated - nr_pte_skipped);
 
 	return nr_updated;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 412ba2b..f74b848 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -133,12 +133,14 @@ static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable,
+		int *prot_numa_pte_skipped)
 {
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
 	bool all_same_node;
+	bool prot_numa = (prot_numa_pte_skipped != NULL);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -148,7 +150,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages++;
+				pages += HPAGE_PMD_NR;
+				if (prot_numa_pte_skipped)
+					(*prot_numa_pte_skipped) += (HPAGE_PMD_NR - 1);
 				continue;
 			}
 			/* fall through */
@@ -173,7 +177,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable,
+		int *prot_numa_pte_skipped)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -185,7 +190,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa_pte_skipped);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -193,7 +198,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int *prot_numa_pte_skipped)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -209,7 +214,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa_pte_skipped);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -219,9 +224,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	return pages;
 }
 
+/*
+ * Returns the number of base pages affected by the protection change. Huge
+ * pages are accounted as pages << huge_order. The number of THP pages updated
+ * is returned (prot_numa_pte_skipped) for NUMA hinting faults updates to
+ * account for the number of PTEs updated in vmstat.
+ */
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable, int prot_numa)
+		       int dirty_accountable, int *prot_numa_pte_skipped)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long pages;
@@ -230,7 +241,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+				dirty_accountable, prot_numa_pte_skipped);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
 	return pages;
@@ -309,7 +321,7 @@ success:
 	}
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-			  dirty_accountable, 0);
+			  dirty_accountable, NULL);
 
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-08 11:20         ` Mel Gorman
@ 2013-11-08 14:08           ` Mel Gorman
  2013-11-08 22:13           ` Alex Thorlton
  1 sibling, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2013-11-08 14:08 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 08, 2013 at 11:20:54AM +0000, Mel Gorman wrote:
> > <SNIP>
> > 0255d491848032f6c601b6410c3b8ebded3a37b1 is the first bad commit
> > commit 0255d491848032f6c601b6410c3b8ebded3a37b1
> > Author: Mel Gorman <mgorman@suse.de>
> > Date:   Mon Oct 7 11:28:47 2013 +0100
> > 
> >     mm: Account for a THP NUMA hinting update as one PTE update
> > 
> >     A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> >     large difference when estimating the cost of automatic NUMA balancing and
> >     can be misleading when comparing results that had collapsed versus split
> >     THP. This patch addresses the accounting issue.
> > 
> >     Signed-off-by: Mel Gorman <mgorman@suse.de>
> >     Reviewed-by: Rik van Riel <riel@redhat.com>
> >     Cc: Andrea Arcangeli <aarcange@redhat.com>
> >     Cc: Johannes Weiner <hannes@cmpxchg.org>
> >     Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >     Cc: <stable@kernel.org>
> >     Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >     Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
> >     Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > 
> > :040000 040000 e5a44a1f0eea2f41d2cccbdf07eafee4e171b1e2 ef030a7c78ef346095ac991c3e3aa139498ed8e7 M      mm
> > 
> > I haven't had a chance yet to dig into the code for this commit to see
> > what might be causing the crashes, but I have confirmed that this is
> > where the new problem started (checked the commit before this, and we
> > don't get the crash, just segfaults like we were getting before). 
> 
> One consequence of this patch is that it adjusts the speed at which task_numa_work
> scans the virtual address space. This was an oversight that needs to be
> corrected. Can you test if the following patch on top of 3.12 brings you
> back to "just" segfaulting? It is compile-tested only because my own tests
> will not even be able to start with this patch for another 3-4 hours.
> 

This is a version that is less likely to stab reviewers in the eye. It
moves responsibility for interpreting stats to userspace but is a lot more
readable and maintainable.

---8<---
mm: numa: Return the number of base pages altered by protection changes

Commit 0255d491 (mm: Account for a THP NUMA hinting update as one PTE
update) was added to account for the number of PTE updates when marking
pages prot_numa. task_numa_work was using the old return value to track
how much address space had been updated. Altering the return value causes
the scanner to do more work than it is configured or documented to in a
single unit of work.

This patch reverts 0255d491 and accounts for the number of THP updates
separately in vmstat. It is up to the administrator to interpret the pair
of values correctly. This is a straightforward operation and likely to
only be of interest when actively debugging NUMA balancing problems.

The impact of this patch is that the NUMA PTE scanner will scan more slowly
when THP is enabled. Workloads may converge more slowly as a result. On the
flip side, system CPU usage should be lower than in recent tests. This is an
illustrative example of a short single JVM specjbb test

specjbb
                       3.12.0                3.12.0
                      vanilla      acctupdates
TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

              3.12.0      3.12.0
              vanilla  acctupdates
User         5169.64     5184.14
System        100.45       80.02
Elapsed       252.75      251.85

Performance is roughly comparable but note the reduction in system CPU
time. While this showed a performance gain, it will not be universal but at
least it'll be behaving as documented. The vmstats are obviously different
but here is an obvious interpretation of them

                                3.12.0      3.12.0
                               vanilla  acctupdates
NUMA page range updates        1408326    11043064
NUMA huge PMD updates                0       21040
NUMA PTE updates               1408326      291624

"NUMA page range updates" == nr_pte_updates and is the value returned to
the NUMA pte scanner. NUMA huge PMD updates were the number of THP updates
which in combination can be used to calculate how many ptes were updated
from userspace.
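
As an illustration (assuming 2M THP on x86-64, i.e. HPAGE_PMD_NR = 512), the
figures above are consistent with each huge PMD update adding HPAGE_PMD_NR
pages to the range total while appearing as a single entry in
numa_pte_updates: 291624 + 21040 * 511 = 11043064.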

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/vm_event_item.h | 1 +
 mm/mprotect.c                 | 6 +++++-
 mm/vmstat.c                   | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..c557c6d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -39,6 +39,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
+		NUMA_HUGE_PTE_UPDATES,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 412ba2b..f94d2bd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,6 +138,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
+	unsigned long nr_huge_updates = 0;
 	bool all_same_node;
 
 	pmd = pmd_offset(pud, addr);
@@ -148,7 +149,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages++;
+				pages += HPAGE_PMD_NR;
+				nr_huge_updates++;
 				continue;
 			}
 			/* fall through */
@@ -168,6 +170,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
+	if (nr_huge_updates)
+		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb3145..5a442a7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -812,6 +812,7 @@ const char * const vmstat_text[] = {
 
 #ifdef CONFIG_NUMA_BALANCING
 	"numa_pte_updates",
+	"numa_huge_pte_updates",
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-08 11:20         ` Mel Gorman
  2013-11-08 14:08           ` Mel Gorman
@ 2013-11-08 22:13           ` Alex Thorlton
  2013-11-12 21:29             ` Alex Thorlton
  1 sibling, 1 reply; 18+ messages in thread
From: Alex Thorlton @ 2013-11-08 22:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 08, 2013 at 11:20:54AM +0000, Mel Gorman wrote:
> On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > -stable but corruption trumps performance and the full series is not
> > > going to be considered acceptable for -stable
> > 
> > I gave this patch a shot, and it didn't seem to solve the problem.
> > Actually I'm running into what appear to be *worse* problems on the 3.12
> > kernel.  Here're a couple stack traces of what I get when I run the test
> > on 3.12, 512 cores:
> > 
> 
> Ok, so there are two issues at least. Whatever is causing your
> corruption (which I still cannot reproduce) and the fact that 3.12 is
> worse. The largest machine I've tested with is 40 cores. I'm trying to
> get time on a 60 core machine to see if it has a better chance. I will not
> be able to get access to anything resembling 512 cores.

At this point, the smallest machine I've been able to recreate this
issue on has been a 128 core, but it's rare on a machine that small.
I'll kick off a really long run on a 64 core over the weekend and see if
I can hit it on there at all, but I haven't been able to previously.

> 
> > (These are just two of the CPUs, obviously, but most of the memscale
> > processes appeared to be in one of these two spots)
> > 
> > Nov  7 13:54:39 uvpsw1 kernel: NMI backtrace for cpu 6
> > Nov  7 13:54:39 uvpsw1 kernel: CPU: 6 PID: 17759 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
> > Nov  7 13:54:39 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
> > Nov  7 13:54:39 uvpsw1 kernel: task: ffff8810647e0300 ti: ffff88106413e000 task.ti: ffff88106413e000
> > Nov  7 13:54:39 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
> > Nov  7 13:54:39 uvpsw1 kernel: RSP: 0018:ffff88106413fd38  EFLAGS: 00000283
> > Nov  7 13:54:39 uvpsw1 kernel: RAX: 00000000a1a9a0fe RBX: 0000000000000206 RCX: ffff880000000000
> > Nov  7 13:54:41 uvpsw1 kernel: RDX: 000000000000a1a9 RSI: 00003ffffffff000 RDI: ffff8907ded35494
> > Nov  7 13:54:41 uvpsw1 kernel: RBP: ffff88106413fd38 R08: 0000000000000006 R09: 0000000000000002
> > Nov  7 13:54:41 uvpsw1 kernel: R10: 0000000000000007 R11: ffff88106413ff40 R12: ffff8907ded35494
> > Nov  7 13:54:42 uvpsw1 kernel: R13: ffff88106413fe1c R14: ffff8810637a05f0 R15: 0000000000000206
> > Nov  7 13:54:42 uvpsw1 kernel: FS:  00007fffd5def700(0000) GS:ffff88107d980000(0000) knlGS:0000000000000000
> > Nov  7 13:54:42 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > Nov  7 13:54:42 uvpsw1 kernel: CR2: 00007fffd5ded000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
> > Nov  7 13:54:42 uvpsw1 kernel: Stack:
> > Nov  7 13:54:42 uvpsw1 kernel:  ffff88106413fda8 ffffffff810d670a 0000000000000002 0000000000000006
> > Nov  7 13:54:42 uvpsw1 kernel:  00007fff57dde000 ffff8810640e1cc0 000002006413fe10 ffff8907ded35440
> > Nov  7 13:54:45 uvpsw1 kernel:  ffff88106413fda8 0000000000000206 0000000000000002 0000000000000000
> > Nov  7 13:54:45 uvpsw1 kernel: Call Trace:
> > Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d670a>] follow_page_mask+0x123/0x3f1
> > Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7c4e>] __get_user_pages+0x3e3/0x488
> > Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810d7d90>] get_user_pages+0x4d/0x4f
> > Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff810ec869>] SyS_get_mempolicy+0x1a9/0x3e0
> > Nov  7 13:54:45 uvpsw1 kernel:  [<ffffffff8151d422>] system_call_fastpath+0x16/0x1b
> > Nov  7 13:54:46 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00
> > 
> 
> That is probably the mm->page_table_lock being contended on. Kirill has
> patches that split this which will help the scalability when THP is
> enabled. They should be merged for 3.13-rc1. In itself it should not
> cause bugs other than maybe soft lockups.
> 
> > Nov  7 13:55:59 uvpsw1 kernel: NMI backtrace for cpu 8
> > Nov  7 13:55:59 uvpsw1 kernel: INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.099 msecs
> > Nov  7 13:56:04 uvpsw1 kernel: CPU: 8 PID: 17761 Comm: thp_memscale Not tainted 3.12.0-rc7-medusa-00006-g0255d49 #381
> > Nov  7 13:56:04 uvpsw1 kernel: Hardware name: Intel Corp. Stoutland Platform, BIOS 2.20 UEFI2.10 PI1.0 X64 2013-09-20
> > Nov  7 13:56:04 uvpsw1 kernel: task: ffff881063c56380 ti: ffff8810621b8000 task.ti: ffff8810621b8000
> > Nov  7 13:56:04 uvpsw1 kernel: RIP: 0010:[<ffffffff8151c7d5>]  [<ffffffff8151c7d5>] _raw_spin_lock+0x1a/0x25
> > Nov  7 13:56:04 uvpsw1 kernel: RSP: 0018:ffff8810621b9c98  EFLAGS: 00000283
> > Nov  7 13:56:04 uvpsw1 kernel: RAX: 00000000a20aa0ff RBX: ffff8810621002b0 RCX: 8000000000000025
> > Nov  7 13:56:04 uvpsw1 kernel: RDX: 000000000000a20a RSI: ffff8810621002b0 RDI: ffff8907ded35494
> > Nov  7 13:56:04 uvpsw1 kernel: RBP: ffff8810621b9c98 R08: 0000000000000001 R09: 0000000000000001
> > Nov  7 13:56:04 uvpsw1 kernel: R10: 000000000000000a R11: 0000000000000246 R12: ffff881062f726b8
> > Nov  7 13:56:04 uvpsw1 kernel: R13: 0000000000000001 R14: ffff8810621002b0 R15: ffff881062f726b8
> > Nov  7 13:56:09 uvpsw1 kernel: FS:  00007fff79512700(0000) GS:ffff88107da00000(0000) knlGS:0000000000000000
> > Nov  7 13:56:09 uvpsw1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > Nov  7 13:56:09 uvpsw1 kernel: CR2: 00007fff79510000 CR3: 00000107dfbcf000 CR4: 00000000000007e0
> > Nov  7 13:56:09 uvpsw1 kernel: Stack:
> > Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9cb8 ffffffff810f3e57 8000000000000025 ffff881062f726b8
> > Nov  7 13:56:09 uvpsw1 kernel:  ffff8810621b9ce8 ffffffff810f3edb 80000187dd73e166 00007fe2dae00000
> > Nov  7 13:56:09 uvpsw1 kernel:  ffff881063708ff8 00007fe2db000000 ffff8810621b9dc8 ffffffff810def2c
> > Nov  7 13:56:09 uvpsw1 kernel: Call Trace:
> > Nov  7 13:56:09 uvpsw1 kernel:  [<ffffffff810f3e57>] __pmd_trans_huge_lock+0x1a/0x7c
> > Nov  7 13:56:10 uvpsw1 kernel:  [<ffffffff810f3edb>] change_huge_pmd+0x22/0xcc
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810def2c>] change_protection+0x200/0x591
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ecb07>] change_prot_numa+0x16/0x2c
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8106c247>] task_numa_work+0x224/0x29a
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810551b1>] task_work_run+0x81/0x99
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810025e1>] do_notify_resume+0x539/0x54b
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810c3ce9>] ? put_page+0x10/0x24
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff810ec9fa>] ? SyS_get_mempolicy+0x33a/0x3e0
> > Nov  7 13:56:14 uvpsw1 kernel:  [<ffffffff8151d6aa>] int_signal+0x12/0x17
> > Nov  7 13:56:14 uvpsw1 kernel: Code: b1 17 39 c8 ba 01 00 00 00 74 02 31 d2 89 d0 c9 c3 55 48 89 e5 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 d0 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c9 c3 55 48 89 e5 9c 59 fa b8 00 00
> > 
> 
> And this is indicating that NUMA balancing is making contention on that lock
> much worse. I would have expected it to run slower, but not cause corruption.
> 
> There is no indication from these partial logs what the crash might be
> due to, unfortunately. Is your machine configured to do anything like
> panic on soft lockups? By any chance have you booted this machine to
> use 1G pages by default for hugetlbfs?

The machine isn't configured to panic on soft lockups, and, as far as
hugetlbfs goes, I have it disabled altogether in my config.  It seems
that you don't hit these segfaults with hugetlbfs turned on.  I've only
seen the segfaults with hugetlbfs off, and NUMA balancing on.

> 
> > I managed to bisect the issue down to this commit:
> > 
> > 0255d491848032f6c601b6410c3b8ebded3a37b1 is the first bad commit
> > commit 0255d491848032f6c601b6410c3b8ebded3a37b1
> > Author: Mel Gorman <mgorman@suse.de>
> > Date:   Mon Oct 7 11:28:47 2013 +0100
> > 
> >     mm: Account for a THP NUMA hinting update as one PTE update
> > 
> >     A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> >     a large difference when estimating the cost of automatic NUMA balancing and
> >     can be misleading when comparing results that had collapsed versus split
> >     THP. This patch addresses the accounting issue.
> > 
> >     Signed-off-by: Mel Gorman <mgorman@suse.de>
> >     Reviewed-by: Rik van Riel <riel@redhat.com>
> >     Cc: Andrea Arcangeli <aarcange@redhat.com>
> >     Cc: Johannes Weiner <hannes@cmpxchg.org>
> >     Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >     Cc: <stable@kernel.org>
> >     Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >     Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
> >     Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > 
> > :040000 040000 e5a44a1f0eea2f41d2cccbdf07eafee4e171b1e2 ef030a7c78ef346095ac991c3e3aa139498ed8e7 M      mm
> > 
> > I haven't had a chance yet to dig into the code for this commit to see
> > what might be causing the crashes, but I have confirmed that this is
> > where the new problem started (checked the commit before this, and we
> > don't get the crash, just segfaults like we were getting before). 
> 
> One consequence of this patch is that it adjusts the speed at which task_numa_work
> scans the virtual address space. This was an oversight that needs to be
> corrected. Can you test if the following patch on top of 3.12 brings you
> back to "just" segfaulting? It is compile-tested only because my own tests
> will not even be able to start with this patch for another 3-4 hours.

This patch seems to have fixed things; I just got one of our 640 core
machines to survive 20 runs of the memscale test (512 cores, 512m per
core).

Now that it seems like we've got things back to normal (or back to just
segfaulting, anyways), I'm going to start digging into the segfault
issue a bit more.

I'll let you know if I can manage to cause this problem on anything
smaller than 128 cores.  Let me know if you have any other ideas/need
anything from me in the meantime.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-08 22:13           ` Alex Thorlton
@ 2013-11-12 21:29             ` Alex Thorlton
  2013-11-15  0:09               ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Thorlton @ 2013-11-12 21:29 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 08, 2013 at 04:13:29PM -0600, Alex Thorlton wrote:
> On Fri, Nov 08, 2013 at 11:20:54AM +0000, Mel Gorman wrote:
> > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > > -stable but corruption trumps performance and the full series is not
> > > > going to be considered acceptable for -stable
> > > 
> > > I gave this patch a shot, and it didn't seem to solve the problem.
> > > Actually I'm running into what appear to be *worse* problems on the 3.12
> > > kernel.  Here're a couple stack traces of what I get when I run the test
> > > on 3.12, 512 cores:
> > > 
> > 
> > Ok, so there are two issues at least. Whatever is causing your
> > corruption (which I still cannot reproduce) and the fact that 3.12 is
> > worse. The largest machine I've tested with is 40 cores. I'm trying to
> > get time on a 60 core machine to see if it has a better chance. I will not
> > be able to get access to anything resembling 512 cores.
> 
> At this point, the smallest machine I've been able to recreate this
> issue on has been a 128 core, but it's rare on a machine that small.
> I'll kick off a really long run on a 64 core over the weekend and see if
> I can hit it on there at all, but I haven't been able to previously.

Just a quick update, I ran this test 500 times on 64 cores, allocating
512m per core, and every single test completed successfully.  At this
point, it looks like you definitely need at least 128 cores to reproduce
the issue.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-12 21:29             ` Alex Thorlton
@ 2013-11-15  0:09               ` Mel Gorman
  2013-11-15 14:45                 ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-15  0:09 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Tue, Nov 12, 2013 at 03:29:02PM -0600, Alex Thorlton wrote:
> On Fri, Nov 08, 2013 at 04:13:29PM -0600, Alex Thorlton wrote:
> > On Fri, Nov 08, 2013 at 11:20:54AM +0000, Mel Gorman wrote:
> > > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > > > -stable but corruption trumps performance and the full series is not
> > > > > going to be considered acceptable for -stable
> > > > 
> > > > I gave this patch a shot, and it didn't seem to solve the problem.
> > > > Actually I'm running into what appear to be *worse* problems on the 3.12
> > > > kernel.  Here're a couple stack traces of what I get when I run the test
> > > > on 3.12, 512 cores:
> > > > 
> > > 
> > > Ok, so there are two issues at least. Whatever is causing your
> > > corruption (which I still cannot reproduce) and the fact that 3.12 is
> > > worse. The largest machine I've tested with is 40 cores. I'm trying to
> > > get time on a 60 core machine to see if it has a better chance. I will not
> > > be able to get access to anything resembling 512 cores.
> > 
> > At this point, the smallest machine I've been able to recreate this
> > issue on has been a 128 core, but it's rare on a machine that small.
> > I'll kick off a really long run on a 64 core over the weekend and see if
> > I can hit it on there at all, but I haven't been able to previously.
> 
> Just a quick update, I ran this test 500 times on 64 cores, allocating
> 512m per core, and every single test completed successfully.  At this
> point, it looks like you definitely need at least 128 cores to reproduce
> the issue.
> 

Awesome. I checked behind the couch but did not find a 128 core machine there
so it took a while to do this the harder way instead. Try the following
patch against 3.12 on top of the pmd batch handling backport and the scan
rate fix please. The scan rate fix is optional because it should make the
bug easier to trigger but it's very important that the pmd batch handling
removal patch is applied.

Thanks.

---8<---
mm: numa: Serialise parallel get_user_page against THP migration

Base pages are unmapped and flushed from cache and TLB during normal page
migration and replaced with a migration entry that causes any parallel NUMA
hinting fault or gup to block until migration completes. THP does not unmap
pages due to
a lack of support for migration entries at a PMD level. This allows races
with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm:
Close races between THP migration and PMD numa clearing") made worse by
introducing a pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side, the TLB is flushed after the page table update and the
page table lock taken for each update. A barrier is introduced to guarantee
the page table update is visible when waiters on THP migration wake up.

---
 arch/x86/mm/gup.c |  7 +++++++
 mm/huge_memory.c  | 30 +++++++++++++++++++++---------
 mm/migrate.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/mprotect.c     |  2 +-
 4 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..4b36edb 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -167,6 +167,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
 		} else {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..7f91be1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1240,6 +1240,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto out;
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!PageHead(page));
 	if (flags & FOLL_TOUCH) {
@@ -1306,26 +1310,33 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
 			goto clear_pmdnuma;
+	}
 
-		/*
-		 * Otherwise wait for potential migrations and retry. We do
-		 * relock and check_same as the page may no longer be mapped.
-		 * As the fault is being retried, do not account for it.
-		 */
+	/*
+	 * If there are potential migrations, wait for completion and retry. We
+	 * do not relock and check_same as the page may no longer be mapped.
+	 * Furthermore, even if the page is currently misplaced, there is no
+	 * guarantee it is still misplaced after the migration completes.
+	 */
+	if (!page_locked) {
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
 		page_nid = -1;
+
+		/* Matches barrier in migrate_misplaced_transhuge_page */
+		smp_rmb();
 		goto out;
 	}
 
-	/* Page is misplaced, serialise migrations and parallel THP splits */
+	/*
+	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
+	 * to serialise splits
+	 */
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked)
-		lock_page(page);
 	anon_vma = page_lock_anon_vma_read(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PTE did not change while unlocked */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
@@ -1350,6 +1361,7 @@ clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
+	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
 out_unlock:
diff --git a/mm/migrate.c b/mm/migrate.c
index c4f9819..535212b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1703,9 +1703,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 
-	/* Recheck the target PMD */
+	/*
+	 * Recheck the target PMD. A parallel GUP should be impossible due to
+	 * it going through handle_mm_fault, stalling on the page lock and
+	 * retrying. Catch the situation and warn if it happens
+	 */
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, entry))) {
+	if (!pmd_same(*pmd, entry) || WARN_ON_ONCE(page_count(page) != 2)) {
 		spin_unlock(&mm->page_table_lock);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1736,15 +1740,23 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
 	entry = mk_pmd(new_page, vma->vm_page_prot);
-	entry = pmd_mknonnuma(entry);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
-	pmdp_clear_flush(vma, haddr, pmd);
-	set_pmd_at(mm, haddr, pmd, entry);
+	/*
+	 * Clear the old entry under pagetable lock and establish the new PTE.
+	 * Any parallel GUP will either observe the old page blocking on the
+	 * page lock, block on the page table lock or observe the new page.
+	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
+	 * guarantee the copy is visible before the pagetable update.
+	 */
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	page_add_new_anon_rmap(new_page, vma, haddr);
+	set_pmd_at(mm, haddr, pmd, entry);
+	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
+
 	/*
 	 * Finish the charge transaction under the page table lock to
 	 * prevent split_huge_page() from dividing up the charge
@@ -1753,6 +1765,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
+	/*
+	 * unlock_page is not a barrier. Guarantee that the page table update
+	 * is visible to any process waiting on the page lock. Matches the
+	 * smp_rmb in do_huge_pmd_numa_page.
+	 */
+	smp_wmb();
+
 	unlock_page(new_page);
 	unlock_page(page);
 	put_page(page);			/* Drop the rmap reference */
@@ -1769,9 +1788,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
-	entry = pmd_mknonnuma(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(*pmd, entry)) {
+		entry = pmd_mknonnuma(entry);
+		set_pmd_at(mm, haddr, pmd, entry);
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		update_mmu_cache_pmd(vma, address, &entry);
+	}
+	spin_unlock(&mm->page_table_lock);
 
 	unlock_page(page);
 	put_page(page);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-15  0:09               ` Mel Gorman
@ 2013-11-15 14:45                 ` Mel Gorman
  2013-11-22 21:28                   ` Alex Thorlton
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-15 14:45 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 15, 2013 at 12:09:01AM +0000, Mel Gorman wrote:
> On Tue, Nov 12, 2013 at 03:29:02PM -0600, Alex Thorlton wrote:
> > On Fri, Nov 08, 2013 at 04:13:29PM -0600, Alex Thorlton wrote:
> > > On Fri, Nov 08, 2013 at 11:20:54AM +0000, Mel Gorman wrote:
> > > > On Thu, Nov 07, 2013 at 03:48:38PM -0600, Alex Thorlton wrote:
> > > > > > Try the following patch on top of 3.12. It's a patch that is expected to
> > > > > > be merged for 3.13. On its own it'll hurt automatic NUMA balancing in
> > > > > > -stable but corruption trumps performance and the full series is not
> > > > > > going to be considered acceptable for -stable
> > > > > 
> > > > > I gave this patch a shot, and it didn't seem to solve the problem.
> > > > > Actually I'm running into what appear to be *worse* problems on the 3.12
> > > > > kernel.  Here're a couple stack traces of what I get when I run the test
> > > > > on 3.12, 512 cores:
> > > > > 
> > > > 
> > > > Ok, so there are two issues at least. Whatever is causing your
> > > > corruption (which I still cannot reproduce) and the fact that 3.12 is
> > > > worse. The largest machine I've tested with is 40 cores. I'm trying to
> > > > get time on a 60 core machine to see if it has a better chance. I will not
> > > > be able to get access to anything resembling 512 cores.
> > > 
> > > At this point, the smallest machine I've been able to recreate this
> > > issue on has been a 128 core, but it's rare on a machine that small.
> > > I'll kick off a really long run on a 64 core over the weekend and see if
> > > I can hit it on there at all, but I haven't been able to previously.
> > 
> > Just a quick update, I ran this test 500 times on 64 cores, allocating
> > 512m per core, and every single test completed successfully.  At this
> > point, it looks like you definitely need at least 128 cores to reproduce
> > the issue.
> > 
> 
> Awesome. I checked behind the couch but did not find a 128 core machine there
> so it took a while to do this the harder way instead. Try the following
> patch against 3.12 on top of the pmd batch handling backport and the scan
> rate fix please. The scan rate fix is optional because it should make the
> bug easier to trigger but it's very important that the pmd batch handling
> removal patch is applied.
> 

If the warning added by that patch does *not* trigger then can you also
test this patch? It removes the barriers which should not be necessary
and takes a reference to the page before waiting on the lock. The
previous version did not take the reference because otherwise the
WARN_ON could not distinguish between a migration waiter and a surprise
gup.

Thanks

---8<---
mm: numa: Serialise parallel get_user_page against THP migration

Base pages are unmapped and flushed from cache and TLB during normal page
migration and replaced with a migration entry that causes any parallel NUMA
hinting fault or gup to block until migration completes. THP does not unmap
pages due to
a lack of support for migration entries at a PMD level. This allows races
with get_user_pages and get_user_pages_fast which commit 3f926ab94 ("mm:
Close races between THP migration and PMD numa clearing") made worse by
introducing a pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against THP
migration and properly account for the NUMA hinting fault. On the migration
side, the TLB is flushed after the page table update and the page table
lock taken for each update. A barrier is introduced to guarantee the page
table update is visible when waiters on THP migration wake up.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/gup.c |  7 +++++++
 mm/huge_memory.c  | 30 ++++++++++++++++++++----------
 mm/migrate.c      | 29 +++++++++++++++++++++--------
 3 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..4b36edb 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -167,6 +167,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
 		} else {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..5af2b86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1240,6 +1240,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto out;
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!PageHead(page));
 	if (flags & FOLL_TOUCH) {
@@ -1306,26 +1310,31 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
 			goto clear_pmdnuma;
+	}
 
-		/*
-		 * Otherwise wait for potential migrations and retry. We do
-		 * relock and check_same as the page may no longer be mapped.
-		 * As the fault is being retried, do not account for it.
-		 */
+	/*
+	 * If there are potential migrations, wait for completion and retry. We
+	 * do not relock and check_same as the page may no longer be mapped.
+	 * Furthermore, even if the page is currently misplaced, there is no
+	 * guarantee it is still misplaced after the migration completes.
+	 */
+	get_page(page);
+	if (!page_locked) {
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
+		put_page(page);
 		page_nid = -1;
 		goto out;
 	}
 
-	/* Page is misplaced, serialise migrations and parallel THP splits */
-	get_page(page);
+	/*
+	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
+	 * to serialise splits
+	 */
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked)
-		lock_page(page);
 	anon_vma = page_lock_anon_vma_read(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PTE did not change while unlocked */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
@@ -1350,6 +1359,7 @@ clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
+	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
 out_unlock:
diff --git a/mm/migrate.c b/mm/migrate.c
index c4f9819..9167b22 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1703,7 +1703,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 
-	/* Recheck the target PMD */
+	/* Recheck the target PMD. */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
 		spin_unlock(&mm->page_table_lock);
@@ -1736,15 +1736,23 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
 	entry = mk_pmd(new_page, vma->vm_page_prot);
-	entry = pmd_mknonnuma(entry);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
-	pmdp_clear_flush(vma, haddr, pmd);
-	set_pmd_at(mm, haddr, pmd, entry);
+	/*
+	 * Clear the old entry under pagetable lock and establish the new PTE.
+	 * Any parallel GUP will either observe the old page blocking on the
+	 * page lock, block on the page table lock or observe the new page.
+	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
+	 * guarantee the copy is visible before the pagetable update.
+	 */
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	page_add_new_anon_rmap(new_page, vma, haddr);
+	set_pmd_at(mm, haddr, pmd, entry);
+	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
+
 	/*
 	 * Finish the charge transaction under the page table lock to
 	 * prevent split_huge_page() from dividing up the charge
@@ -1769,9 +1777,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
-	entry = pmd_mknonnuma(entry);
-	set_pmd_at(mm, haddr, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(*pmd, entry)) {
+		entry = pmd_mknonnuma(entry);
+		set_pmd_at(mm, haddr, pmd, entry);
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		update_mmu_cache_pmd(vma, address, &entry);
+	}
+	spin_unlock(&mm->page_table_lock);
 
 	unlock_page(page);
 	put_page(page);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-15 14:45                 ` Mel Gorman
@ 2013-11-22 21:28                   ` Alex Thorlton
  2013-11-22 23:05                     ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Thorlton @ 2013-11-22 21:28 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, linux-mm, linux-kernel

> If the warning added by that patch does *not* trigger then can you also
> test this patch? It removes the barriers which should not be necessary
> and takes a reference to the page before waiting on the lock. The
> previous version did not take the reference because otherwise the
> WARN_ON could not distinguish between a migration waiter and a surprise
> gup.

Sorry for the delay; been a bit busy.  I tested both of these patches on
top of this one (separately, of course):

http://www.spinics.net/lists/linux-mm/msg63919.html

I think that's the one you were referring to, if not send me a pointer
to the correct one and I'll give it another shot.  Both patches still
segfaulted, so it doesn't appear that either of these solved the
problem.  If you have anything else you'd like for me to try let me
know.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-22 21:28                   ` Alex Thorlton
@ 2013-11-22 23:05                     ` Mel Gorman
  2013-11-23  0:09                       ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-22 23:05 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 22, 2013 at 03:28:07PM -0600, Alex Thorlton wrote:
> > If the warning added by that patch does *not* trigger then can you also
> > test this patch? It removes the barriers which should not be necessary
> > and takes a reference to the page before waiting on the lock. The
> > previous version did not take the reference because otherwise the
> > WARN_ON could not distinguish between a migration waiter and a surprise
> > gup.
> 
> Sorry for the delay; been a bit busy.  I tested both of these patches on
> top of this one (separately, of course):
> 
> http://www.spinics.net/lists/linux-mm/msg63919.html
> 
> I think that's the one you were referring to, if not send me a pointer
> to the correct one and I'll give it another shot.  Both patches still
> segfaulted, so it doesn't appear that either of these solved the
> problem. 

I see. Does THP have to be enabled or does it segfault even with THP
disabled?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-22 23:05                     ` Mel Gorman
@ 2013-11-23  0:09                       ` Mel Gorman
  2013-11-27 23:58                         ` Alex Thorlton
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2013-11-23  0:09 UTC (permalink / raw)
  To: Alex Thorlton; +Cc: Rik van Riel, linux-mm, linux-kernel

On Fri, Nov 22, 2013 at 11:05:24PM +0000, Mel Gorman wrote:
> On Fri, Nov 22, 2013 at 03:28:07PM -0600, Alex Thorlton wrote:
> > > If the warning added by that patch does *not* trigger then can you also
> > > test this patch? It removes the barriers which should not be necessary
> > > and takes a reference to the page before waiting on the lock. The
> > > previous version did not take the reference because otherwise the
> > > WARN_ON could not distinguish between a migration waiter and a surprise
> > > gup.
> > 
> > Sorry for the delay; been a bit busy.  I tested both of these patches on
> > top of this one (separately, of course):
> > 
> > http://www.spinics.net/lists/linux-mm/msg63919.html
> > 
> > I think that's the one you were referring to, if not send me a pointer
> > to the correct one and I'll give it another shot.  Both patches still
> > segfaulted, so it doesn't appear that either of these solved the
> > problem. 
> 
> I see. Does THP have to be enabled or does it segfault even with THP
> disabled?
> 

On a semi-related note, is the large machine doing anything with xpmem
or anything that depends on MMU notifiers to work properly? I noted
while looking at this that THP migration is not invalidating pages which
might be confusing a driver depending on it.
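
For illustration only, here is a minimal sketch (not part of either patch in
this thread, and assuming the 3.12-era mmu_notifier range calls) of where
such an invalidation would sit in migrate_misplaced_transhuge_page() if a
driver did depend on it:

	unsigned long mmun_start = haddr;
	unsigned long mmun_end = haddr + HPAGE_PMD_SIZE;

	/* Tell registered notifiers the huge page range is about to change */
	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
	/* ... clear the old pmd, copy the data, set_pmd_at() the new page ... */
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);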

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG: mm, numa: test segfaults, only when NUMA balancing is on
  2013-11-23  0:09                       ` Mel Gorman
@ 2013-11-27 23:58                         ` Alex Thorlton
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Thorlton @ 2013-11-27 23:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, linux-mm, linux-kernel

On Sat, Nov 23, 2013 at 12:09:24AM +0000, Mel Gorman wrote:
> On Fri, Nov 22, 2013 at 11:05:24PM +0000, Mel Gorman wrote:
> > On Fri, Nov 22, 2013 at 03:28:07PM -0600, Alex Thorlton wrote:
> > > > If the warning added by that patch does *not* trigger then can you also
> > > > test this patch? It removes the barriers which should not be necessary
> > > > and takes a reference to the page before waiting on the lock. The
> > > > previous version did not take the reference because otherwise the
> > > > WARN_ON could not distinguish between a migration waiter and a surprise
> > > > gup.
> > > 
> > > Sorry for the delay; been a bit busy.  I tested both of these patches on
> > > top of this one (separately, of course):
> > > 
> > > http://www.spinics.net/lists/linux-mm/msg63919.html
> > > 
> > > I think that's the one you were referring to, if not send me a pointer
> > > to the correct one and I'll give it another shot.  Both patches still
> > > segfaulted, so it doesn't appear that either of these solved the
> > > problem. 
> > 
> > I see. Does THP have to be enabled or does it segfault even with THP
> > disabled?

It occurs with THP both on and off.  I get RCU stalls with THP on, though
that's probably related to not having Kirill/Naoya's patches applied.

> > 
> 
> On a semi-related note, is the large machine doing anything with xpmem
> or anything that depends on MMU notifiers to work properly? I noted
> while looking at this that THP migration is not invalidating pages which
> might be confusing a driver depending on it.

I'm not using xpmem on any of the machines that I've been testing on,
and I don't think that anything should be using MMU notifiers.

- Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-11-27 23:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-16 15:54 BUG: mm, numa: test segfaults, only when NUMA balancing is on Alex Thorlton
2013-10-17 11:30 ` Bob Liu
2013-10-18  0:33   ` Alex Thorlton
2013-11-04 14:58 ` Mel Gorman
2013-11-04 20:03   ` Alex Thorlton
2013-11-06 13:10     ` Mel Gorman
2013-11-07 21:48       ` Alex Thorlton
2013-11-08 11:20         ` Mel Gorman
2013-11-08 14:08           ` Mel Gorman
2013-11-08 22:13           ` Alex Thorlton
2013-11-12 21:29             ` Alex Thorlton
2013-11-15  0:09               ` Mel Gorman
2013-11-15 14:45                 ` Mel Gorman
2013-11-22 21:28                   ` Alex Thorlton
2013-11-22 23:05                     ` Mel Gorman
2013-11-23  0:09                       ` Mel Gorman
2013-11-27 23:58                         ` Alex Thorlton
2013-11-07 21:52 ` Alex Thorlton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).