* [RFC][PATCH] mm: hugetlb: add stub-like do_hugetlb_numa()
@ 2015-03-30  9:40 Naoya Horiguchi
  2015-03-30 10:28 ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Naoya Horiguchi @ 2015-03-30  9:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Kirill A. Shutemov,
	David Rientjes, Rik van Riel, linux-kernel

hugetlb doesn't support NUMA balancing yet, but that doesn't mean the
hugetlb code can ignore PROTNONE entries. In the current kernel, when a
process accesses a hugetlb range protected with PROTNONE, unexpected COWs
are triggered, which eventually put the hugetlb subsystem into a
broken/uncontrollable state, where for example h->resv_huge_pages is
decremented too far and wraps around to a very large number, and the free
hugepage pool is no longer maintainable.
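
For reference, this is easy to trigger on a NUMA machine with
CONFIG_NUMA_BALANCING enabled: any long-lived, frequently written hugetlb
mapping will do. A minimal reproducer sketch (hypothetical test program,
assuming a preallocated 2MB hugepage pool):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(2UL << 20)	/* one 2MB hugepage */

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * Keep writing to the mapping; once the NUMA balancing scanner
	 * marks the range PROTNONE, the next write takes the broken
	 * hugetlb fault path described above.
	 */
	for (;;)
		memset(p, 1, LEN);
	return 0;
}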

This patch simply clears PROTNONE when such an entry is encountered. Real
NUMA balancing code for hugetlb is not implemented yet (and I'm not sure
how much it would be worth doing.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/asm-generic/hugetlb.h | 13 +++++++++++++
 mm/hugetlb.c                  | 24 ++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git v4.0-rc4.orig/include/asm-generic/hugetlb.h v4.0-rc4/include/asm-generic/hugetlb.h
index 99b490b4d05a..7e73cc9e57b1 100644
--- v4.0-rc4.orig/include/asm-generic/hugetlb.h
+++ v4.0-rc4/include/asm-generic/hugetlb.h
@@ -37,4 +37,17 @@ static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 	pte_clear(mm, addr, ptep);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline int huge_pte_protnone(pte_t pte)
+{
+	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
+		== _PAGE_PROTNONE;
+}
+#else
+static inline int huge_pte_protnone(pte_t pte)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #endif /* _ASM_GENERIC_HUGETLB_H */
diff --git v4.0-rc4.orig/mm/hugetlb.c v4.0-rc4/mm/hugetlb.c
index cbb0bbc6662a..18c169674ee4 100644
--- v4.0-rc4.orig/mm/hugetlb.c
+++ v4.0-rc4/mm/hugetlb.c
@@ -3090,6 +3090,28 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	goto out;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * NUMA balancing code for hugetlb is not implemented yet. For now we just
+ * clear PROTNONE to avoid destabilizing the hugetlb subsystem.
+ */
+static int do_hugetlb_numa(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long address, pte_t *ptep, pte_t pte)
+{
+	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
+
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto unlock;
+	pte = pte_mkhuge(huge_pte_modify(pte, vma->vm_page_prot));
+	pte = pte_mkyoung(pte);
+	set_huge_pte_at(mm, address, ptep, pte);
+unlock:
+	spin_unlock(ptl);
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SMP
 static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 			    struct vm_area_struct *vma,
@@ -3144,6 +3166,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
+		if (huge_pte_protnone(entry))
+			return do_hugetlb_numa(mm, vma, address, ptep, entry);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
 			return 0;
-- 
1.9.3


* Re: [RFC][PATCH] mm: hugetlb: add stub-like do_hugetlb_numa()
  2015-03-30  9:40 [RFC][PATCH] mm: hugetlb: add stub-like do_hugetlb_numa() Naoya Horiguchi
@ 2015-03-30 10:28 ` Mel Gorman
  2015-03-30 10:42   ` Naoya Horiguchi
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2015-03-30 10:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A. Shutemov,
	David Rientjes, Rik van Riel, linux-kernel

On Mon, Mar 30, 2015 at 09:40:54AM +0000, Naoya Horiguchi wrote:
> hugetlb doesn't support NUMA balancing yet, but that doesn't mean the
> hugetlb code can ignore PROTNONE entries. In the current kernel, when a
> process accesses a hugetlb range protected with PROTNONE, unexpected COWs
> are triggered, which eventually put the hugetlb subsystem into a
> broken/uncontrollable state, where for example h->resv_huge_pages is
> decremented too far and wraps around to a very large number, and the free
> hugepage pool is no longer maintainable.
> 

Ouch!

> This patch simply clears PROTNONE when such an entry is encountered. Real
> NUMA balancing code for hugetlb is not implemented yet (and I'm not sure
> how much it would be worth doing.)
> 

It's not worth doing at all. Furthermore, an application that took the
effort to allocate and use hugetlb pages is not going to appreciate the
minor faults incurred by automatic balancing for no gain. Why not something
like the following untested patch? It simply avoids doing protection updates
on hugetlb VMAs. If it works for you, feel free to take it and reuse most
of the same changelog for it. I'll only be intermittently online for the
next few days and would rather not unnecessarily delay a fix.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..74bfde50fd4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2161,8 +2161,10 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma) || !vma_policy_mof(vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+						is_vm_hugetlb_page(vma)) {
 			continue;
+		}
 
 		/*
 		 * Shared library pages mapped by multiple processes are not


* Re: [RFC][PATCH] mm: hugetlb: add stub-like do_hugetlb_numa()
  2015-03-30 10:28 ` Mel Gorman
@ 2015-03-30 10:42   ` Naoya Horiguchi
  2015-03-30 11:59     ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Naoya Horiguchi @ 2015-03-30 10:42 UTC (permalink / raw)
  To: Mel Gorman, Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A. Shutemov,
	David Rientjes, Rik van Riel, linux-kernel

On 03/30/2015 07:28 PM, Mel Gorman wrote:
> On Mon, Mar 30, 2015 at 09:40:54AM +0000, Naoya Horiguchi wrote:
>> hugetlb doesn't support NUMA balancing yet, but that doesn't mean the
>> hugetlb code can ignore PROTNONE entries. In the current kernel, when a
>> process accesses a hugetlb range protected with PROTNONE, unexpected COWs
>> are triggered, which eventually put the hugetlb subsystem into a
>> broken/uncontrollable state, where for example h->resv_huge_pages is
>> decremented too far and wraps around to a very large number, and the free
>> hugepage pool is no longer maintainable.
>>
>
> Ouch!
>
>> This patch simply clears PROTNONE when such an entry is encountered. Real
>> NUMA balancing code for hugetlb is not implemented yet (and I'm not sure
>> how much it would be worth doing.)
>>
>
> It's not worth doing at all. Furthermore, an application that took the
> effort to allocate and use hugetlb pages is not going to appreciate the
> minor faults incurred by automatic balancing for no gain.

OK,

> Why not something
> like the following untested patch?

I'll test this tomorrow.
Thank you very much for the comment.

Naoya Horiguchi

> It simply avoids doing protection updates
> on hugetlb VMAs. If it works for you, feel free to take it and reuse most
> of the same changelog for it. I'll only be intermittently online for the
> next few days and would rather not unnecessarily delay a fix.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..74bfde50fd4e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2161,8 +2161,10 @@ void task_numa_work(struct callback_head *work)
>   		vma = mm->mmap;
>   	}
>   	for (; vma; vma = vma->vm_next) {
> -		if (!vma_migratable(vma) || !vma_policy_mof(vma))
> +		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> +						is_vm_hugetlb_page(vma)) {
>   			continue;
> +		}
>
>   		/*
>   		 * Shared library pages mapped by multiple processes are not
>



* Re: [RFC][PATCH] mm: hugetlb: add stub-like do_hugetlb_numa()
  2015-03-30 10:42   ` Naoya Horiguchi
@ 2015-03-30 11:59     ` Mel Gorman
  2015-03-31  1:45       ` [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB) Naoya Horiguchi
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2015-03-30 11:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Hugh Dickins,
	Kirill A. Shutemov, David Rientjes, Rik van Riel, linux-kernel

On Mon, Mar 30, 2015 at 07:42:13PM +0900, Naoya Horiguchi wrote:
> On 03/30/2015 07:28 PM, Mel Gorman wrote:
> >On Mon, Mar 30, 2015 at 09:40:54AM +0000, Naoya Horiguchi wrote:
> >>hugetlb doesn't support NUMA balancing yet, but that doesn't mean the
> >>hugetlb code can ignore PROTNONE entries. In the current kernel, when a
> >>process accesses a hugetlb range protected with PROTNONE, unexpected COWs
> >>are triggered, which eventually put the hugetlb subsystem into a
> >>broken/uncontrollable state, where for example h->resv_huge_pages is
> >>decremented too far and wraps around to a very large number, and the free
> >>hugepage pool is no longer maintainable.
> >>
> >
> >Ouch!
> >
> >>This patch simply clears PROTNONE when such an entry is encountered. Real
> >>NUMA balancing code for hugetlb is not implemented yet (and I'm not sure
> >>how much it would be worth doing.)
> >>
> >
> >It's not worth doing at all. Furthermore, an application that took the
> >effort to allocate and use hugetlb pages is not going to appreciate the
> >minor faults incurred by automatic balancing for no gain.
> 
> OK,
> 
> >Why not something
> >like the following untested patch?
> 
> I'll test this tomorrow.
> Thank you very much for the comment.
> 

I note now that the patch was too hasty. By rights, that check should be
covered by vma_migratable(), but the VM_HUGETLB check there is compiled
out when CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION is set, and that option is
currently x86-only. If you are seeing this problem on any other arch then
a more correct fix might be to remove the
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION ifdef in vma_migratable().
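
For reference, the relevant part of vma_migratable() looks roughly like
this (trimmed sketch of include/linux/mempolicy.h, not verbatim):

static inline int vma_migratable(struct vm_area_struct *vma)
{
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return 0;

#ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
	/*
	 * hugetlb VMAs are only filtered out when the arch cannot
	 * migrate hugepages, so on x86 they fall through and
	 * task_numa_work() happily updates their protection.
	 */
	if (vma->vm_flags & VM_HUGETLB)
		return 0;
#endif
	/* file-backed mapping/zone checks omitted */
	return 1;
}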

-- 
Mel Gorman
SUSE Labs


* [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB)
  2015-03-30 11:59     ` Mel Gorman
@ 2015-03-31  1:45       ` Naoya Horiguchi
  2015-03-31 21:35         ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Naoya Horiguchi @ 2015-03-31  1:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Hugh Dickins,
	Kirill A. Shutemov, David Rientjes, Rik van Riel, linux-kernel

On Mon, Mar 30, 2015 at 12:59:01PM +0100, Mel Gorman wrote:
> On Mon, Mar 30, 2015 at 07:42:13PM +0900, Naoya Horiguchi wrote:
...
> 
> I note now that the patch was too hasty. By rights, that check should be
> covered by vma_migratable(), but the VM_HUGETLB check there is compiled
> out when CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION is set, and that option is
> currently x86-only. If you are seeing this problem on any other arch then
> a more correct fix might be to remove the
> CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION ifdef in vma_migratable().

Changing vma_migratable() affects other use cases of hugepage migration
such as mbind(), so simply removing the ifdef doesn't work for them.
I didn't test other archs, but I guess this problem could happen on any
arch that enables NUMA balancing, whether or not it supports
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION.
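
For illustration, roughly how the mbind() path consults it (simplified
sketch, not verbatim code; queue_pages_in_range() is a stand-in for the
real page-queueing walk):

	/*
	 * mbind() only isolates pages for migration when the VMA passes
	 * vma_migratable(), so widening vma_migratable() would silently
	 * change mbind() behaviour on archs without hugepage migration.
	 */
	if ((flags & MPOL_MF_STRICT) ||
	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
	     vma_migratable(vma)))
		queue_pages_in_range(vma, start, endvma);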

So I'd like to pick up and push your first suggestion. It passed my testing.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB)

Currently, when a process accesses a hugetlb range protected with PROTNONE,
unexpected COWs are triggered, which eventually put the hugetlb subsystem
into a broken/uncontrollable state, where for example h->resv_huge_pages is
decremented too far and wraps around to a very large number, and the free
hugepage pool is no longer maintainable.

This patch simply stops changing protection on vma(VM_HUGETLB) to fix the
problem. It also spares applications the useless overhead of the resulting
minor faults.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..6ad0d570f38e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2161,8 +2161,10 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma) || !vma_policy_mof(vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+			is_vm_hugetlb_page(vma)) {
 			continue;
+		}
 
 		/*
 		 * Shared library pages mapped by multiple processes are not
-- 
1.9.3


* Re: [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB)
  2015-03-31  1:45       ` [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB) Naoya Horiguchi
@ 2015-03-31 21:35         ` Andrew Morton
  2015-04-01  4:14           ` Naoya Horiguchi
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2015-03-31 21:35 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mel Gorman, Naoya Horiguchi, linux-mm, Hugh Dickins,
	Kirill A. Shutemov, David Rientjes, Rik van Riel, linux-kernel

On Tue, 31 Mar 2015 01:45:55 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently, when a process accesses a hugetlb range protected with PROTNONE,
> unexpected COWs are triggered, which eventually put the hugetlb subsystem
> into a broken/uncontrollable state, where for example h->resv_huge_pages is
> decremented too far and wraps around to a very large number, and the free
> hugepage pool is no longer maintainable.
> 
> This patch simply stops changing protection on vma(VM_HUGETLB) to fix the
> problem. It also spares applications the useless overhead of the resulting
> minor faults.
> 
> ...
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2161,8 +2161,10 @@ void task_numa_work(struct callback_head *work)
>  		vma = mm->mmap;
>  	}
>  	for (; vma; vma = vma->vm_next) {
> -		if (!vma_migratable(vma) || !vma_policy_mof(vma))
> +		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> +			is_vm_hugetlb_page(vma)) {
>  			continue;
> +		}
>  
>  		/*
>  		 * Shared library pages mapped by multiple processes are not

Which kernel version(s) need this patch?


* Re: [PATCH] mm: numa: disable change protection for vma(VM_HUGETLB)
  2015-03-31 21:35         ` Andrew Morton
@ 2015-04-01  4:14           ` Naoya Horiguchi
  0 siblings, 0 replies; 7+ messages in thread
From: Naoya Horiguchi @ 2015-04-01  4:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Naoya Horiguchi, linux-mm, Hugh Dickins,
	Kirill A. Shutemov, David Rientjes, Rik van Riel, linux-kernel

On Tue, Mar 31, 2015 at 02:35:21PM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2015 01:45:55 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Currently, when a process accesses a hugetlb range protected with PROTNONE,
> > unexpected COWs are triggered, which eventually put the hugetlb subsystem
> > into a broken/uncontrollable state, where for example h->resv_huge_pages is
> > decremented too far and wraps around to a very large number, and the free
> > hugepage pool is no longer maintainable.
> > 
> > This patch simply stops changing protection on vma(VM_HUGETLB) to fix the
> > problem. It also spares applications the useless overhead of the resulting
> > minor faults.
> > 
> > ...
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2161,8 +2161,10 @@ void task_numa_work(struct callback_head *work)
> >  		vma = mm->mmap;
> >  	}
> >  	for (; vma; vma = vma->vm_next) {
> > -		if (!vma_migratable(vma) || !vma_policy_mof(vma))
> > +		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> > +			is_vm_hugetlb_page(vma)) {
> >  			continue;
> > +		}
> >  
> >  		/*
> >  		 * Shared library pages mapped by multiple processes are not
> 
> Which kernel version(s) need this patch?

I haven't bisected it completely, but the problem this patch addresses has
been visible since v4.0-rc1 (it is not reproducible on v3.19).

Thanks,
Naoya Horiguchi

