* [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
@ 2017-06-02 11:27 ` Ard Biesheuvel
  0 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2017-06-02 11:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arm-kernel, akpm, mhocko, mingo, labbott, catalin.marinas,
	will.deacon, mark.rutland, zhongjiang, guohanjun, tanxiaojun,
	Ard Biesheuvel

While vmalloc() itself strictly uses page mappings only on all
architectures, some of the support routines are aware of the possible
existence of PMD or PUD size mappings inside the VMALLOC region.
This is necessary given that ioremap() shares this region and the
unmap routines with vmalloc(), and ioremap() may use huge pages on
some architectures.

On arm64 running with 4 KB pages, VM_MAP mappings will exist in the
VMALLOC region that are mapped to some extent using PMD size mappings.
As reported by Zhong Jiang, this confuses the kcore code, given that
vread() does not expect to have to deal with PMD mappings, resulting
in oopses.

Even though we could work around this by special-casing the kcore or
vmalloc code for the VM_MAP mappings used by the arm64 kernel, there
is already a precedent for dealing with PMD/PUD mappings in the
VMALLOC region, so we can update the vmalloc_to_page() routine to
deal with such mappings as well. This solves the problem, and brings us
a step closer to huge page support in vmalloc/vmap, which may well be
in our future anyway.

Reported-by: Zhong Jiang <zhongjiang@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 mm/vmalloc.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 34a1c3e46ed7..cd79e62f8011 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/highmem.h>
+#include <linux/hugetlb.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -263,6 +264,38 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
+ * Some architectures (such as arm64) allow vmap() mappings in the
+ * vmalloc region that may consist of PMD block mappings.
+ */
+static struct page *vmalloc_to_pud_page(unsigned long addr, pud_t *pud)
+{
+	struct page *page = NULL;
+#ifdef CONFIG_HUGETLB_PAGE
+	pte_t pte = huge_ptep_get((pte_t *)pud);
+
+	if (pte_present(pte))
+		page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+#else
+	VIRTUAL_BUG_ON(1);
+#endif
+	return page;
+}
+
+static struct page *vmalloc_to_pmd_page(unsigned long addr, pmd_t *pmd)
+{
+	struct page *page = NULL;
+#ifdef CONFIG_HUGETLB_PAGE
+	pte_t pte = huge_ptep_get((pte_t *)pmd);
+
+	if (pte_present(pte))
+		page = pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+#else
+	VIRTUAL_BUG_ON(1);
+#endif
+	return page;
+}
+
+/*
  * Walk a vmap address to the struct page it maps.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
@@ -289,9 +322,13 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	pud = pud_offset(p4d, addr);
 	if (pud_none(*pud))
 		return NULL;
+	if (pud_huge(*pud))
+		return vmalloc_to_pud_page(addr, pud);
 	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return NULL;
+	if (pmd_huge(*pmd))
+		return vmalloc_to_pmd_page(addr, pmd);
 
 	ptep = pte_offset_map(pmd, addr);
 	pte = *ptep;
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 11:27 ` Ard Biesheuvel
@ 2017-06-02 14:29   ` Dave Hansen
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2017-06-02 14:29 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-mm
  Cc: linux-arm-kernel, akpm, mhocko, mingo, labbott, catalin.marinas,
	will.deacon, mark.rutland, zhongjiang, guohanjun, tanxiaojun

On 06/02/2017 04:27 AM, Ard Biesheuvel wrote:
> +static struct page *vmalloc_to_pud_page(unsigned long addr, pud_t *pud)
> +{
> +	struct page *page = NULL;
> +#ifdef CONFIG_HUGETLB_PAGE

Do we really want this based on hugetlbfs?  Won't this be dead code on x86?

Also, don't we discourage #ifdefs in .c files?

> +	pte_t pte = huge_ptep_get((pte_t *)pud);
> +
> +	if (pte_present(pte))
> +		page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);

x86 has pmd/pud_page().  Seems a bit silly to open-code it here.

> +#else
> +	VIRTUAL_BUG_ON(1);
> +#endif
> +	return page;
> +}

So if somebody manages to call this function on a huge page table entry,
but doesn't have hugetlbfs configured on, we kill the machine?


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 14:29   ` Dave Hansen
@ 2017-06-02 15:11     ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2017-06-02 15:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-arm-kernel, Andrew Morton, Michal Hocko,
	Ingo Molnar, Laura Abbott, Catalin Marinas, Will Deacon,
	Mark Rutland, Zhong Jiang, Hanjun Guo, Tanxiaojun

On 2 June 2017 at 14:29, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/02/2017 04:27 AM, Ard Biesheuvel wrote:
>> +static struct page *vmalloc_to_pud_page(unsigned long addr, pud_t *pud)
>> +{
>> +     struct page *page = NULL;
>> +#ifdef CONFIG_HUGETLB_PAGE
>
> Do we really want this based on hugetlbfs?  Won't this be dead code on x86?
>

I don't see why one would follow from the other, but perhaps it is
better to make this depend on CONFIG_HAVE_ARCH_HUGE_VMAP instead,
which is already meant to imply that huge mappings may exist in the
vmalloc region.

> Also, don't we discourage #ifdefs in .c files?
>

Yes. But the alternative is to define a dummy huge_ptep_get(), which
is undefined otherwise. I am not sure that is better in this case. I
will try to address this more elegantly though, perhaps by folding the
huge_ptep_get() into the VIRTUAL_BUG_ON().

>> +     pte_t pte = huge_ptep_get((pte_t *)pud);
>> +
>> +     if (pte_present(pte))
>> +             page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>
> x86 has pmd/pud_page().  Seems a bit silly to open-code it here.
>

So I should replace pud_page() with what exactly?

>> +#else
>> +     VIRTUAL_BUG_ON(1);
>> +#endif
>> +     return page;
>> +}
>
> So if somebody manages to call this function on a huge page table entry,
> but doesn't have hugetlbfs configured on, we kill the machine?

Yes. But only if you have CONFIG_DEBUG_VIRTUAL defined, in which case
it seems appropriate to signal a failure rather than proceed with
dereferencing the huge PMD entry as if it were a table entry.

-- 
Ard.


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 15:11     ` Ard Biesheuvel
@ 2017-06-02 16:03       ` Dave Hansen
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2017-06-02 16:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-mm, linux-arm-kernel, Andrew Morton, Michal Hocko,
	Ingo Molnar, Laura Abbott, Catalin Marinas, Will Deacon,
	Mark Rutland, Zhong Jiang, Hanjun Guo, Tanxiaojun, Shutemov,
	Kirill

On 06/02/2017 08:11 AM, Ard Biesheuvel wrote:
>>> +     pte_t pte = huge_ptep_get((pte_t *)pud);
>>> +
>>> +     if (pte_present(pte))
>>> +             page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>> x86 has pmd/pud_page().  Seems a bit silly to open-code it here.
>>
> So I should replace pud_page() with what exactly?

Whoops, I was reading that wrong.

So, the pud in this case is a huge pud pointing to data.  pud_page()
gives us the head page, but not the base (tail) page.  The 'addr' math
gets us that.

First of all, this math isn't guaranteed to work.  We don't guarantee
virtual contiguity for all mem_map[]s.  I think you need to go to a pfn
or paddr first, add the pud offset, then convert to a 'struct page'.

But, what *is* the right thing to return here?  Do the users here want
the head page or the tail page?

BTW, _are_ your huge vmalloc pages compound?

>>> +#else
>>> +     VIRTUAL_BUG_ON(1);
>>> +#endif
>>> +     return page;
>>> +}
>> So if somebody manages to call this function on a huge page table entry,
>> but doesn't have hugetlbfs configured on, we kill the machine?
> Yes. But only if you have CONFIG_DEBUG_VIRTUAL defined, in which case
> it seems appropriate to signal a failure rather than proceed with
> dereferencing the huge PMD entry as if it were a table entry.

Why kill the machine rather than just warning and returning NULL?


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 16:03       ` Dave Hansen
@ 2017-06-02 16:21         ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2017-06-02 16:21 UTC (permalink / raw)
  To: Dave Hansen, Steve Capper
  Cc: linux-mm, linux-arm-kernel, Andrew Morton, Michal Hocko,
	Ingo Molnar, Laura Abbott, Catalin Marinas, Will Deacon,
	Mark Rutland, Zhong Jiang, Hanjun Guo, Tanxiaojun, Shutemov,
	Kirill

(apologies for the lack of patience in sending out my v2)

(+ Steve)

On 2 June 2017 at 16:03, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/02/2017 08:11 AM, Ard Biesheuvel wrote:
>>>> +     pte_t pte = huge_ptep_get((pte_t *)pud);
>>>> +
>>>> +     if (pte_present(pte))
>>>> +             page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>> x86 has pmd/pud_page().  Seems a bit silly to open-code it here.
>>>
>> So I should replace pud_page() with what exactly?
>
> Whoops, I was reading that wrong.
>
> So, the pud in this case is a huge pud pointing to data.  pud_page()
> gives us the head page, but not the base (tail) page.  The 'addr' math
> gets us that.
>
> First of all, this math isn't guaranteed to work.  We don't guarantee
> virtual contiguity for all mem_map[]s.  I think you need to go to a pfn
> or paddr first, add the pud offset, then convert to a 'struct page'.
>

OK, so you are saying the slice of the struct page array covering the
range could be discontiguous even though the physical range it
describes is contiguous (which is guaranteed due to the nature of a
PMD mapping, IIUC)? In that case, going via the pfn makes sense.

> But, what *is* the right thing to return here?  Do the users here want
> the head page or the tail page?
>

Hmm, I see what you mean. The vread() code that I am trying to fix
simply kmaps the returned page, copies from it and unmaps it, so it is
after the tail page. But I guess code that is aware of compound pages
is after the head page instead.

> BTW, _are_ your huge vmalloc pages compound?
>

Not in the case that I am trying to solve, no. They are simply VM_MAP
mappings of sequences of pages that are occupied by the kernel itself,
and not allocated by the page allocator.


>>>> +#else
>>>> +     VIRTUAL_BUG_ON(1);
>>>> +#endif
>>>> +     return page;
>>>> +}
>>> So if somebody manages to call this function on a huge page table entry,
>>> but doesn't have hugetlbfs configured on, we kill the machine?
>> Yes. But only if you have CONFIG_DEBUG_VIRTUAL defined, in which case
>> it seems appropriate to signal a failure rather than proceed with
>> dereferencing the huge PMD entry as if it were a table entry.
>
> Why kill the machine rather than just warning and returning NULL?

I know this is generally a bad thing, but in this case, when a debug
option has been enabled exactly for this purpose, I think it is not
inappropriate to BUG() when encountering such a mapping. That said, I
am happy to relax it to a WARN() and return NULL instead; in that
case, though, it should be unconditional IMO and not based on
CONFIG_DEBUG_VIRTUAL or the like.


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 11:27 ` Ard Biesheuvel
@ 2017-06-02 16:34   ` kbuild test robot
  -1 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2017-06-02 16:34 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: kbuild-all, linux-mm, linux-arm-kernel, akpm, mhocko, mingo,
	labbott, catalin.marinas, will.deacon, mark.rutland, zhongjiang,
	guohanjun, tanxiaojun

[-- Attachment #1: Type: text/plain, Size: 1761 bytes --]

Hi Ard,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ard-Biesheuvel/mm-vmalloc-make-vmalloc_to_page-deal-with-PMD-PUD-mappings/20170602-210236
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: sh-sh7785lcr_32bit_defconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All error/warnings (new ones prefixed by >>):

   mm/vmalloc.c: In function 'vmalloc_to_pud_page':
>> mm/vmalloc.c:277:10: error: implicit declaration of function 'pud_page' [-Werror=implicit-function-declaration]
      page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
             ^~~~~~~~
>> mm/vmalloc.c:277:8: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
           ^
   cc1: some warnings being treated as errors

vim +/pud_page +277 mm/vmalloc.c

   271	{
   272		struct page *page = NULL;
   273	#ifdef CONFIG_HUGETLB_PAGE
   274		pte_t pte = huge_ptep_get((pte_t *)pud);
   275	
   276		if (pte_present(pte))
 > 277			page = pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
   278	#else
   279		VIRTUAL_BUG_ON(1);
   280	#endif

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 16941 bytes --]


* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 16:21         ` Ard Biesheuvel
@ 2017-06-02 18:18           ` Dave Hansen
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2017-06-02 18:18 UTC (permalink / raw)
  To: Ard Biesheuvel, Steve Capper
  Cc: linux-mm, linux-arm-kernel, Andrew Morton, Michal Hocko,
	Ingo Molnar, Laura Abbott, Catalin Marinas, Will Deacon,
	Mark Rutland, Zhong Jiang, Hanjun Guo, Tanxiaojun, Shutemov,
	Kirill

On 06/02/2017 09:21 AM, Ard Biesheuvel wrote:
>> First of all, this math isn't guaranteed to work.  We don't guarantee
>> virtual contiguity for all mem_map[]s.  I think you need to go to a pfn
>> or paddr first, add the pud offset, then convert to a 'struct page'.
> 
> OK, so you are saying the slice of the struct page array covering the
> range could be discontiguous even though the physical range it
> describes is contiguous? (which is guaranteed due to the nature of a
> PMD mapping IIUC) In that case,

Yes.

>> But, what *is* the right thing to return here?  Do the users here want
>> the head page or the tail page?
> 
> Hmm, I see what you mean. The vread() code that I am trying to fix
> simply kmaps the returned page, copies from it and unmaps it, so it is
> after the tail page. But I guess code that is aware of compound pages
> is after the head page instead.

Yeah, and some operations happen on tail pages while others get
redirected to the head page.

>> BTW, _are_ your huge vmalloc pages compound?
> 
> Not in the case that I am trying to solve, no. They are simply VM_MAP
> mappings of sequences of pages that are occupied by the kernel itself,
> and not allocated by the page allocator.

Huh, so what are they?  Are they system RAM that was bootmem allocated
or something?

>>>>> +#else
>>>>> +     VIRTUAL_BUG_ON(1);
>>>>> +#endif
>>>>> +     return page;
>>>>> +}
>>>> So if somebody manages to call this function on a huge page table entry,
>>>> but doesn't have hugetlbfs configured on, we kill the machine?
>>> Yes. But only if you have CONFIG_DEBUG_VIRTUAL defined, in which case
>>> it seems appropriate to signal a failure rather than proceed with
>>> dereferencing the huge PMD entry as if it were a table entry.
>>
>> Why kill the machine rather than just warning and returning NULL?
> 
> I know this is generally a bad thing, but in this case, when a debug
> option has been enabled exactly for this purpose, I think it is not
> inappropriate to BUG() when encountering such a mapping. But I am
> happy to relax it to a WARN() and return NULL instead, but in that
> case, it should be unconditional imo and not based on
> CONFIG_DEBUG_VIRTUAL or the likes.

Sounds sane to me.


Sounds sane to me.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings
  2017-06-02 18:18           ` Dave Hansen
@ 2017-06-05 12:35             ` Ard Biesheuvel
  -1 siblings, 0 replies; 16+ messages in thread
From: Ard Biesheuvel @ 2017-06-05 12:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Steve Capper, linux-mm, linux-arm-kernel, Andrew Morton,
	Michal Hocko, Ingo Molnar, Laura Abbott, Catalin Marinas,
	Will Deacon, Mark Rutland, Zhong Jiang, Hanjun Guo, Tanxiaojun,
	Shutemov, Kirill

On 2 June 2017 at 18:18, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/02/2017 09:21 AM, Ard Biesheuvel wrote:
>>> First of all, this math isn't guaranteed to work.  We don't guarantee
>>> virtual contiguity for all mem_map[]s.  I think you need to go to a pfn
>>> or paddr first, add the pud offset, then convert to a 'struct page'.
>>
>> OK, so you are saying the slice of the struct page array covering the
>> range could be discontiguous even though the physical range it
>> describes is contiguous? (which is guaranteed due to the nature of a
>> PMD mapping IIUC) In that case,
>
> Yes.
>
>>> But, what *is* the right thing to return here?  Do the users here want
>>> the head page or the tail page?
>>
>> Hmm, I see what you mean. The vread() code that I am trying to fix
>> simply kmaps the returned page, copies from it and unmaps it, so it is
>> after the tail page. But I guess code that is aware of compound pages
>> is after the head page instead.
>
> Yeah, and some operations happen on tail pages while others get
> redirected to the head page.
>

OK. So given that vmalloc() never allocates compound pages, and vmap()
does not deal with them at all, we should be able to safely assume
that vmalloc_to_page() callers are interested in the tail page only.

>>> BTW, _are_ your huge vmalloc pages compound?
>>
>> Not in the case that I am trying to solve, no. They are simply VM_MAP
>> mappings of sequences of pages that are occupied by the kernel itself,
>> and not allocated by the page allocator.
>
> Huh, so what are they?  Are they system RAM that was bootmem allocated
> or something?
>

They are static mappings of vmlinux segments. E.g., on my system I have:

vmalloc : 0xffff000008000000 - 0xffff7dffbfff0000   (129022 GB)
  .text : 0xffff2125f4ce0000 - 0xffff2125f5670000   (  9792 KB)
.rodata : 0xffff2125f5670000 - 0xffff2125f5a30000   (  3840 KB)
  .init : 0xffff2125f5a30000 - 0xffff2125f5e50000   (  4224 KB)
  .data : 0xffff2125f5e50000 - 0xffff2125f5f8ba00   (  1263 KB)
   .bss : 0xffff2125f5f8ba00 - 0xffff2125f609692c   (  1068 KB)

where KASLR may place these segments anywhere in the VMALLOC region.
Mark has suggested that these regions should not intersect, but in my
opinion, given that the VMALLOC region already contains executable
code and associated data (for kernel modules), and may already contain
huge mappings (for HUGE_VMAP), it is reasonable to expect shared code
to at least tolerate such mappings.

As Mark pointed out, pmd_huge()/pud_huge() may not work as expected
depending on the kernel configuration, so I will respin the patch to
take HUGE_VMAP into account for those definitions as well.

-- 
Ard.



>>>>>> +#else
>>>>>> +     VIRTUAL_BUG_ON(1);
>>>>>> +#endif
>>>>>> +     return page;
>>>>>> +}
>>>>> So if somebody manages to call this function on a huge page table entry,
>>>>> but doesn't have hugetlbfs configured on, we kill the machine?
>>>> Yes. But only if you have CONFIG_DEBUG_VIRTUAL defined, in which case
>>>> it seems appropriate to signal a failure rather than proceed with
>>>> dereferencing the huge PMD entry as if it were a table entry.
>>>
>>> Why kill the machine rather than just warning and returning NULL?
>>
>> I know this is generally a bad thing, but in this case, when a debug
>> option has been enabled exactly for this purpose, I think it is not
>> inappropriate to BUG() when encountering such a mapping. But I am
>> happy to relax it to a WARN() and return NULL instead, but in that
>> case, it should be unconditional imo and not based on
>> CONFIG_DEBUG_VIRTUAL or the likes.
>
> Sounds sane to me.

end of thread, other threads:[~2017-06-05 12:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-02 11:27 [PATCH] mm: vmalloc: make vmalloc_to_page() deal with PMD/PUD mappings Ard Biesheuvel
2017-06-02 14:29 ` Dave Hansen
2017-06-02 15:11   ` Ard Biesheuvel
2017-06-02 16:03     ` Dave Hansen
2017-06-02 16:21       ` Ard Biesheuvel
2017-06-02 18:18         ` Dave Hansen
2017-06-05 12:35           ` Ard Biesheuvel
2017-06-02 16:34 ` kbuild test robot
