linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page
@ 2019-02-28  8:35 Aneesh Kumar K.V
  2019-02-28  8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
  2019-02-28  9:21 ` [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Jan Kara
  0 siblings, 2 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-28  8:35 UTC (permalink / raw)
  To: akpm, Kirill A . Shutemov, Jan Kara, mpe, Ross Zwisler,
	Oliver O'Halloran
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

Architectures like ppc64 use the deposited page table to store hardware
page table slot information. Make sure we deposit a page table when
using the zero page at the PMD level for hash.

Without this we hit

Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc000000000082a74
Oops: Kernel access of bad area, sig: 11 [#1]
....

NIP [c000000000082a74] __hash_page_thp+0x224/0x5b0
LR [c0000000000829a4] __hash_page_thp+0x154/0x5b0
Call Trace:
 hash_page_mm+0x43c/0x740
 do_hash_page+0x2c/0x3c
 copy_from_iter_flushcache+0xa4/0x4a0
 pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
 dax_copy_from_iter+0x40/0x70
 dax_iomap_actor+0x134/0x360
 iomap_apply+0xfc/0x1b0
 dax_iomap_rw+0xac/0x130
 ext4_file_write_iter+0x254/0x460 [ext4]
 __vfs_write+0x120/0x1e0
 vfs_write+0xd8/0x220
 SyS_write+0x6c/0x110
 system_call+0x3c/0x130

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
TODO:
* Add Fixes: tag

 fs/dax.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 6959837cc465..01bfb2ac34f9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -33,6 +33,7 @@
 #include <linux/sizes.h>
 #include <linux/mmu_notifier.h>
 #include <linux/iomap.h>
+#include <asm/pgalloc.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1410,7 +1411,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = mapping->host;
+	pgtable_t pgtable = NULL;
 	struct page *zero_page;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
@@ -1425,12 +1428,22 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
 			DAX_PMD | DAX_ZERO_PAGE, false);
 
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable)
+			return VM_FAULT_OOM;
+	}
+
 	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (!pmd_none(*(vmf->pmd))) {
 		spin_unlock(ptl);
 		goto fallback;
 	}
 
+	if (pgtable) {
+		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		mm_inc_nr_ptes(vma->vm_mm);
+	}
 	pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
 	pmd_entry = pmd_mkhuge(pmd_entry);
 	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
@@ -1439,6 +1452,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE;
 
 fallback:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
 	return VM_FAULT_FALLBACK;
 }
-- 
2.20.1
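
For context, the deposit/withdraw protocol used above is the same one
the anonymous THP fault path follows: allocate a PTE page up front,
deposit it under the PMD lock when the huge mapping is installed, and
withdraw and free it when the mapping is torn down (see zap_huge_pmd()).
A minimal sketch of the pattern, reusing the helpers from the patch
above with the error handling abridged:

	if (arch_needs_pgtable_deposit()) {
		/* pte_alloc_one() may sleep; call it before pmd_lock(). */
		pgtable = pte_alloc_one(vma->vm_mm);
		if (!pgtable)
			return VM_FAULT_OOM;
	}

	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (!pmd_none(*vmf->pmd)) {
		/* Raced with another fault; give the page back. */
		spin_unlock(ptl);
		if (pgtable)
			pte_free(vma->vm_mm, pgtable);
		return VM_FAULT_FALLBACK;
	}
	if (pgtable) {
		/* ppc64 hash stores hardware slot details in this page. */
		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
		mm_inc_nr_ptes(vma->vm_mm);
	}
	set_pmd_at(vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
	spin_unlock(ptl);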



* [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  8:35 [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Aneesh Kumar K.V
@ 2019-02-28  8:35 ` Aneesh Kumar K.V
  2019-02-28  9:40   ` Jan Kara
  2019-02-28  9:40   ` Oliver
  2019-02-28  9:21 ` [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Jan Kara
  1 sibling, 2 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-28  8:35 UTC (permalink / raw)
  To: akpm, Kirill A . Shutemov, Jan Kara, mpe, Ross Zwisler,
	Oliver O'Halloran
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

Add a flag to indicate the ability to do huge page dax mapping. On architectures
like ppc64, the hypervisor can disable huge page support in the guest. In
such a case, we should not enable huge page dax mapping. This patch adds
a flag which the architecture code will update to indicate huge page
dax mapping support.

Architectures mostly do transparent_hugepage_flags = 0; if they can't
do hugepages. With this change, that also takes care of disabling dax
hugepage mapping.

Without this patch we get the below error with kvm on ppc64.

[  118.849975] lpar: Failed hash pte insert with error -4

NOTE: With the patch,

echo never > /sys/kernel/mm/transparent_hugepage/enabled

also disables dax huge page mapping.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
TODO:
* Add Fixes: tag

 include/linux/huge_mm.h | 4 +++-
 mm/huge_memory.c        | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..01ad5258545e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_DAX_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
@@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
 		return true;
 
-	if (vma_is_dax(vma))
+	if (vma_is_dax(vma) &&
+	    (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
 		return true;
 
 	if (transparent_hugepage_flags &
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357eaf0ce..43d742fe0341 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
+	(1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
@@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
 			  &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
+			  &transparent_hugepage_flags);
 		ret = 1;
 	}
 out:
@@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmd);
+	/* should we check for none here again? */
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
-- 
2.20.1
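
For illustration, an architecture that learns at boot (e.g. from the
hypervisor) that huge pages are unavailable could clear the new flag
roughly as below. This is a hypothetical sketch: arch_thp_dax_init()
and platform_supports_hugepage() are made-up names; only the flag
itself comes from this patch.

	#include <linux/init.h>
	#include <linux/huge_mm.h>

	static int __init arch_thp_dax_init(void)
	{
		/* platform_supports_hugepage() is a stand-in for the
		 * firmware/device-tree capability check. */
		if (!platform_supports_hugepage())
			clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
				  &transparent_hugepage_flags);
		return 0;
	}
	arch_initcall(arch_thp_dax_init);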



* Re: [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page
  2019-02-28  8:35 [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Aneesh Kumar K.V
  2019-02-28  8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
@ 2019-02-28  9:21 ` Jan Kara
  2019-02-28 12:34   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Kara @ 2019-02-28  9:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, Kirill A . Shutemov, Jan Kara, mpe, Ross Zwisler,
	Oliver O'Halloran, linux-mm, linux-kernel, linuxppc-dev

On Thu 28-02-19 14:05:21, Aneesh Kumar K.V wrote:
> Architectures like ppc64 use the deposited page table to store hardware
> page table slot information. Make sure we deposit a page table when
> using zero page at the pmd level for hash.
> 
> Without this we hit
> 
> Unable to handle kernel paging request for data at address 0x00000000
> Faulting instruction address: 0xc000000000082a74
> Oops: Kernel access of bad area, sig: 11 [#1]
> ....
> 
> NIP [c000000000082a74] __hash_page_thp+0x224/0x5b0
> LR [c0000000000829a4] __hash_page_thp+0x154/0x5b0
> Call Trace:
>  hash_page_mm+0x43c/0x740
>  do_hash_page+0x2c/0x3c
>  copy_from_iter_flushcache+0xa4/0x4a0
>  pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
>  dax_copy_from_iter+0x40/0x70
>  dax_iomap_actor+0x134/0x360
>  iomap_apply+0xfc/0x1b0
>  dax_iomap_rw+0xac/0x130
>  ext4_file_write_iter+0x254/0x460 [ext4]
>  __vfs_write+0x120/0x1e0
>  vfs_write+0xd8/0x220
>  SyS_write+0x6c/0x110
>  system_call+0x3c/0x130
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Thanks for the patch. It looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

> ---
> TODO:
> * Add fixes tag 

Probably this is a problem since initial PPC PMEM support, isn't it?

								Honza

> 
>  fs/dax.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 6959837cc465..01bfb2ac34f9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -33,6 +33,7 @@
>  #include <linux/sizes.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/iomap.h>
> +#include <asm/pgalloc.h>
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -1410,7 +1411,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  {
>  	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>  	unsigned long pmd_addr = vmf->address & PMD_MASK;
> +	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = mapping->host;
> +	pgtable_t pgtable = NULL;
>  	struct page *zero_page;
>  	spinlock_t *ptl;
>  	pmd_t pmd_entry;
> @@ -1425,12 +1428,22 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
>  			DAX_PMD | DAX_ZERO_PAGE, false);
>  
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (!pgtable)
> +			return VM_FAULT_OOM;
> +	}
> +
>  	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (!pmd_none(*(vmf->pmd))) {
>  		spin_unlock(ptl);
>  		goto fallback;
>  	}
>  
> +	if (pgtable) {
> +		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> +		mm_inc_nr_ptes(vma->vm_mm);
> +	}
>  	pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
>  	pmd_entry = pmd_mkhuge(pmd_entry);
>  	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> @@ -1439,6 +1452,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  	return VM_FAULT_NOPAGE;
>  
>  fallback:
> +	if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
>  	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
>  	return VM_FAULT_FALLBACK;
>  }
> -- 
> 2.20.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
@ 2019-02-28  9:40   ` Jan Kara
  2019-02-28 12:32     ` Aneesh Kumar K.V
  2019-02-28  9:40   ` Oliver
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Kara @ 2019-02-28  9:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, Kirill A . Shutemov, Jan Kara, mpe, Ross Zwisler,
	Oliver O'Halloran, linux-mm, linux-kernel, linuxppc-dev,
	Dan Williams

On Thu 28-02-19 14:05:22, Aneesh Kumar K.V wrote:
> Add a flag to indicate the ability to do huge page dax mapping. On architecture
> like ppc64, the hypervisor can disable huge page support in the guest. In
> such a case, we should not enable huge page dax mapping. This patch adds
> a flag which the architecture code will update to indicate huge page
> dax mapping support.
> 
> Architectures mostly do transparent_hugepage_flag = 0; if they can't
> do hugepages. That also takes care of disabling dax hugepage mapping
> with this change.
> 
> Without this patch we get the below error with kvm on ppc64.
> 
> [  118.849975] lpar: Failed hash pte insert with error -4
> 
> NOTE: The patch also use
> 
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> to disable dax huge page mapping.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Added Dan to CC for opinion. I kind of fail to see why you don't use
TRANSPARENT_HUGEPAGE_FLAG for this. I know that technically DAX huge pages
and normal THPs are different things but so far we've tried to avoid making
that distinction visible to userspace.

								Honza
> ---
> TODO:
> * Add Fixes: tag
> 
>  include/linux/huge_mm.h | 4 +++-
>  mm/huge_memory.c        | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..01ad5258545e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>  			pud_t *pud, pfn_t pfn, bool write);
>  enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_DAX_FLAG,
>  	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>  	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>  		return true;
>  
> -	if (vma_is_dax(vma))
> +	if (vma_is_dax(vma) &&
> +	    (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>  		return true;
>  
>  	if (transparent_hugepage_flags &
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faf357eaf0ce..43d742fe0341 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>  	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>  #endif
> +	(1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>  			  &transparent_hugepage_flags);
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> +			  &transparent_hugepage_flags);
>  		ret = 1;
>  	}
>  out:
> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	spinlock_t *ptl;
>  
>  	ptl = pmd_lock(mm, pmd);
> +	/* should we check for none here again? */
>  	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>  	if (pfn_t_devmap(pfn))
>  		entry = pmd_mkdevmap(entry);
> -- 
> 2.20.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
  2019-02-28  9:40   ` Jan Kara
@ 2019-02-28  9:40   ` Oliver
  2019-02-28 12:43     ` Aneesh Kumar K.V
  2019-02-28 16:45     ` Dan Williams
  1 sibling, 2 replies; 25+ messages in thread
From: Oliver @ 2019-02-28  9:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Dan Williams
  Cc: Andrew Morton, Kirill A . Shutemov, Jan Kara, Michael Ellerman,
	Ross Zwisler, Linux MM, Linux Kernel Mailing List, linuxppc-dev

On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Add a flag to indicate the ability to do huge page dax mapping. On architecture
> like ppc64, the hypervisor can disable huge page support in the guest. In
> such a case, we should not enable huge page dax mapping. This patch adds
> a flag which the architecture code will update to indicate huge page
> dax mapping support.

*groan*

> Architectures mostly do transparent_hugepage_flag = 0; if they can't
> do hugepages. That also takes care of disabling dax hugepage mapping
> with this change.
>
> Without this patch we get the below error with kvm on ppc64.
>
> [  118.849975] lpar: Failed hash pte insert with error -4
>
> NOTE: The patch also use
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> to disable dax huge page mapping.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> TODO:
> * Add Fixes: tag
>
>  include/linux/huge_mm.h | 4 +++-
>  mm/huge_memory.c        | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..01ad5258545e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>                         pud_t *pud, pfn_t pfn, bool write);
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>                 return true;
>
> -       if (vma_is_dax(vma))
> +       if (vma_is_dax(vma) &&
> +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>                 return true;

Forcing PTE sized faults should be fine for fsdax, but it'll break
devdax. The devdax driver requires the fault size be >= the namespace
alignment since devdax tries to guarantee hugepage mappings will be
used and PMD alignment is the default. We can probably have devdax
fall back to the largest size the hypervisor has made available, but
it does run contrary to the design. Ah well, I suppose it's better off
being degraded rather than unusable.
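
For reference, the devdax constraint described here comes from the
device-dax fault handlers, which refuse to service a fault smaller than
the configured alignment rather than degrading it. A condensed
paraphrase of the checks in __dev_dax_pte_fault()/__dev_dax_pmd_fault()
(simplified, not the literal code):

	/* PTE fault: refuse if the namespace demands a larger mapping. */
	if (dax_region->align > PAGE_SIZE)
		return VM_FAULT_SIGBUS;

	/* PMD fault: likewise refuse if align exceeds the fault size. */
	if (dax_region->align > PMD_SIZE)
		return VM_FAULT_SIGBUS;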

>         if (transparent_hugepage_flags &
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faf357eaf0ce..43d742fe0341 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>         (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>  #endif
> +       (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>                           &transparent_hugepage_flags);
>                 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>                           &transparent_hugepage_flags);
> +               clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> +                         &transparent_hugepage_flags);
>                 ret = 1;
>         }
>  out:

> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>         spinlock_t *ptl;
>
>         ptl = pmd_lock(mm, pmd);
> +       /* should we check for none here again? */

VM_WARN_ON() maybe? If THP is disabled and we're here then something
has gone wrong.
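
A sketch of that suggestion (untested; see also Aneesh's question
further down about whether a racing fault could legitimately find the
PMD populated):

	ptl = pmd_lock(mm, pmd);
	/* insert_pfn_pmd() expects to install into an empty PMD. */
	VM_WARN_ON(!pmd_none(*pmd));
	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));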

>         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>         if (pfn_t_devmap(pfn))
>                 entry = pmd_mkdevmap(entry);
> --
> 2.20.1
>


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  9:40   ` Jan Kara
@ 2019-02-28 12:32     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-28 12:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, Kirill A . Shutemov, mpe, Ross Zwisler,
	Oliver O'Halloran, linux-mm, linux-kernel, linuxppc-dev,
	Dan Williams

On 2/28/19 3:10 PM, Jan Kara wrote:
> On Thu 28-02-19 14:05:22, Aneesh Kumar K.V wrote:
>> Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> like ppc64, the hypervisor can disable huge page support in the guest. In
>> such a case, we should not enable huge page dax mapping. This patch adds
>> a flag which the architecture code will update to indicate huge page
>> dax mapping support.
>>
>> Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> do hugepages. That also takes care of disabling dax hugepage mapping
>> with this change.
>>
>> Without this patch we get the below error with kvm on ppc64.
>>
>> [  118.849975] lpar: Failed hash pte insert with error -4
>>
>> NOTE: The patch also use
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> to disable dax huge page mapping.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> Added Dan to CC for opinion. I kind of fail to see why you don't use
> TRANSPARENT_HUGEPAGE_FLAG for this. I know that technically DAX huge pages
> and normal THPs are different things but so far we've tried to avoid making
> that distinction visible to userspace.


I would also like to use the same flag; I was not sure whether it was
OK. In fact, that is one of the reasons I hooked this to
/sys/kernel/mm/transparent_hugepage/enabled. If we are OK with using the
same flag, we can kill the vma_is_dax() check completely.


-aneesh



* Re: [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page
  2019-02-28  9:21 ` [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Jan Kara
@ 2019-02-28 12:34   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-28 12:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, Kirill A . Shutemov, mpe, Ross Zwisler,
	Oliver O'Halloran, linux-mm, linux-kernel, linuxppc-dev

On 2/28/19 2:51 PM, Jan Kara wrote:
> On Thu 28-02-19 14:05:21, Aneesh Kumar K.V wrote:
>> Architectures like ppc64 use the deposited page table to store hardware
>> page table slot information. Make sure we deposit a page table when
>> using zero page at the pmd level for hash.
>>
>> Without this we hit
>>
>> Unable to handle kernel paging request for data at address 0x00000000
>> Faulting instruction address: 0xc000000000082a74
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> ....
>>
>> NIP [c000000000082a74] __hash_page_thp+0x224/0x5b0
>> LR [c0000000000829a4] __hash_page_thp+0x154/0x5b0
>> Call Trace:
>>   hash_page_mm+0x43c/0x740
>>   do_hash_page+0x2c/0x3c
>>   copy_from_iter_flushcache+0xa4/0x4a0
>>   pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
>>   dax_copy_from_iter+0x40/0x70
>>   dax_iomap_actor+0x134/0x360
>>   iomap_apply+0xfc/0x1b0
>>   dax_iomap_rw+0xac/0x130
>>   ext4_file_write_iter+0x254/0x460 [ext4]
>>   __vfs_write+0x120/0x1e0
>>   vfs_write+0xd8/0x220
>>   SyS_write+0x6c/0x110
>>   system_call+0x3c/0x130
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> Thanks for the patch. It looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
>> ---
>> TODO:
>> * Add fixes tag
> 
> Probably this is a problem since initial PPC PMEM support, isn't it?
> 

Considering ppc64 is the only broken architecture here, I guess I will
use the commit that enabled PPC PMEM support.

-aneesh



* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  9:40   ` Oliver
@ 2019-02-28 12:43     ` Aneesh Kumar K.V
  2019-02-28 16:45     ` Dan Williams
  1 sibling, 0 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-28 12:43 UTC (permalink / raw)
  To: Oliver, Dan Williams
  Cc: Andrew Morton, Kirill A . Shutemov, Jan Kara, Michael Ellerman,
	Ross Zwisler, Linux MM, Linux Kernel Mailing List, linuxppc-dev

On 2/28/19 3:10 PM, Oliver wrote:
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> like ppc64, the hypervisor can disable huge page support in the guest. In
>> such a case, we should not enable huge page dax mapping. This patch adds
>> a flag which the architecture code will update to indicate huge page
>> dax mapping support.
> 
> *groan*
> 
>> Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> do hugepages. That also takes care of disabling dax hugepage mapping
>> with this change.
>>
>> Without this patch we get the below error with kvm on ppc64.
>>
>> [  118.849975] lpar: Failed hash pte insert with error -4
>>
>> NOTE: The patch also use
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> to disable dax huge page mapping.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> TODO:
>> * Add Fixes: tag
>>
>>   include/linux/huge_mm.h | 4 +++-
>>   mm/huge_memory.c        | 4 ++++
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 381e872bfde0..01ad5258545e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>>                          pud_t *pud, pfn_t pfn, bool write);
>>   enum transparent_hugepage_flag {
>>          TRANSPARENT_HUGEPAGE_FLAG,
>> +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>>          TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>>          TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>>          TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>>          if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>>                  return true;
>>
>> -       if (vma_is_dax(vma))
>> +       if (vma_is_dax(vma) &&
>> +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>>                  return true;
> 
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.
> 

Will fix that. I will make PFN_DEFAULT_ALIGNMENT arch specific.


>>          if (transparent_hugepage_flags &
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index faf357eaf0ce..43d742fe0341 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>>          (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>>   #endif
>> +       (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>>          (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>>          (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>>          (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>>                            &transparent_hugepage_flags);
>>                  clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>>                            &transparent_hugepage_flags);
>> +               clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> +                         &transparent_hugepage_flags);
>>                  ret = 1;
>>          }
>>   out:
> 
>> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>          spinlock_t *ptl;
>>
>>          ptl = pmd_lock(mm, pmd);
>> +       /* should we check for none here again? */
> 
> VM_WARN_ON() maybe? If THP is disabled and we're here then something
> has gone wrong.

I was wondering whether we can end up calling insert_pfn_pmd() in
parallel and hence find a pmd entry already present here. Usually we
check if (!pmd_none(pmd)) after taking the pmd_lock. I was not sure
whether there is anything preventing a parallel fault in the case of
dax.
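
The usual pattern being referred to, and what dax_pmd_load_hole() in
patch 1 does, looks like this:

	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (!pmd_none(*vmf->pmd)) {
		/* Another fault installed an entry first; back off. */
		spin_unlock(ptl);
		goto fallback;
	}
	/* ... install the new PMD entry under the lock ... */
	spin_unlock(ptl);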


> 
>>          entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>>          if (pfn_t_devmap(pfn))
>>                  entry = pmd_mkdevmap(entry);
>> --
>> 2.20.1
>>
> 



* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28  9:40   ` Oliver
  2019-02-28 12:43     ` Aneesh Kumar K.V
@ 2019-02-28 16:45     ` Dan Williams
  2019-03-06  9:17       ` Aneesh Kumar K.V
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-02-28 16:45 UTC (permalink / raw)
  To: Oliver
  Cc: Aneesh Kumar K.V, Andrew Morton, Kirill A . Shutemov, Jan Kara,
	Michael Ellerman, Ross Zwisler, Linux MM,
	Linux Kernel Mailing List, linuxppc-dev

On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
>
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > Add a flag to indicate the ability to do huge page dax mapping. On architecture
> > like ppc64, the hypervisor can disable huge page support in the guest. In
> > such a case, we should not enable huge page dax mapping. This patch adds
> > a flag which the architecture code will update to indicate huge page
> > dax mapping support.
>
> *groan*
>
> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
> > do hugepages. That also takes care of disabling dax hugepage mapping
> > with this change.
> >
> > Without this patch we get the below error with kvm on ppc64.
> >
> > [  118.849975] lpar: Failed hash pte insert with error -4
> >
> > NOTE: The patch also use
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> > to disable dax huge page mapping.
> >
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> > TODO:
> > * Add Fixes: tag
> >
> >  include/linux/huge_mm.h | 4 +++-
> >  mm/huge_memory.c        | 4 ++++
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 381e872bfde0..01ad5258545e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> >                         pud_t *pud, pfn_t pfn, bool write);
> >  enum transparent_hugepage_flag {
> >         TRANSPARENT_HUGEPAGE_FLAG,
> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
> >                 return true;
> >
> > -       if (vma_is_dax(vma))
> > +       if (vma_is_dax(vma) &&
> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
> >                 return true;
>
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.

Given this is an explicit setting I think device-dax should explicitly
fail to enable in the presence of this flag to preserve the
application visible behavior.

I.e. if device-dax was enabled after this setting was made then I
think future faults should fail as well.


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-02-28 16:45     ` Dan Williams
@ 2019-03-06  9:17       ` Aneesh Kumar K.V
  2019-03-06 11:44         ` Michal Suchánek
  2019-03-13 16:02         ` Dan Williams
  0 siblings, 2 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-03-06  9:17 UTC (permalink / raw)
  To: Dan Williams, Oliver
  Cc: Andrew Morton, Kirill A . Shutemov, Jan Kara, Michael Ellerman,
	Ross Zwisler, Linux MM, Linux Kernel Mailing List, linuxppc-dev,
	linux-nvdimm

Dan Williams <dan.j.williams@intel.com> writes:

> On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
>>
>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
>> <aneesh.kumar@linux.ibm.com> wrote:
>> >
>> > Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> > like ppc64, the hypervisor can disable huge page support in the guest. In
>> > such a case, we should not enable huge page dax mapping. This patch adds
>> > a flag which the architecture code will update to indicate huge page
>> > dax mapping support.
>>
>> *groan*
>>
>> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> > do hugepages. That also takes care of disabling dax hugepage mapping
>> > with this change.
>> >
>> > Without this patch we get the below error with kvm on ppc64.
>> >
>> > [  118.849975] lpar: Failed hash pte insert with error -4
>> >
>> > NOTE: The patch also use
>> >
>> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> > to disable dax huge page mapping.
>> >
>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> > ---
>> > TODO:
>> > * Add Fixes: tag
>> >
>> >  include/linux/huge_mm.h | 4 +++-
>> >  mm/huge_memory.c        | 4 ++++
>> >  2 files changed, 7 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > index 381e872bfde0..01ad5258545e 100644
>> > --- a/include/linux/huge_mm.h
>> > +++ b/include/linux/huge_mm.h
>> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> >                         pud_t *pud, pfn_t pfn, bool write);
>> >  enum transparent_hugepage_flag {
>> >         TRANSPARENT_HUGEPAGE_FLAG,
>> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>> >                 return true;
>> >
>> > -       if (vma_is_dax(vma))
>> > +       if (vma_is_dax(vma) &&
>> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>> >                 return true;
>>
>> Forcing PTE sized faults should be fine for fsdax, but it'll break
>> devdax. The devdax driver requires the fault size be >= the namespace
>> alignment since devdax tries to guarantee hugepage mappings will be
>> used and PMD alignment is the default. We can probably have devdax
>> fall back to the largest size the hypervisor has made available, but
>> it does run contrary to the design. Ah well, I suppose it's better off
>> being degraded rather than unusable.
>
> Given this is an explicit setting I think device-dax should explicitly
> fail to enable in the presence of this flag to preserve the
> application visible behavior.
>
> I.e. if device-dax was enabled after this setting was made then I
> think future faults should fail as well.

I am not sure I understood that. Here we are disabling the ability to
map pages as huge pages. I am now considering that this should not be
user configurable. I.e., this is something the platform can use to avoid
dax forcing huge page mappings, but if the architecture can enable huge
dax mappings, we should always default to using that.

Now, w.r.t. failures, can device-dax use huge pages opportunistically?
I haven't looked at the device-dax details fully yet. Do we make an
assumption about the mapping page size as a format w.r.t. device-dax?
Is that derived from the nd_pfn->align value?

Here is what I am working on:
1) If the platform doesn't support huge pages and the device superblock
indicated that it was created with huge page support, we fail the device
init (a rough sketch of this check follows below).

2) Now if we are creating a new namespace without huge page support in
the platform, then we force the align details to PAGE_SIZE. In such a
configuration, when handling a dax fault, even with THP enabled during
the build, we should not try to use hugepages. This I think we can
achieve by using TRANSPARENT_HUGEPAGE_DAX_FLAG.
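
A minimal sketch of the init-time check in (1), assuming it is done
during namespace/device setup (hypothetical placement; the error code
and message are illustrative, and has_transparent_hugepage() stands in
for the platform capability test):

	/* Reject a namespace formatted for an alignment we cannot map. */
	if (nd_pfn->align > PAGE_SIZE && !has_transparent_hugepage()) {
		dev_err(&nd_pfn->dev,
			"unsupported alignment %#lx: no huge page support\n",
			(unsigned long)nd_pfn->align);
		return -EOPNOTSUPP;
	}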

Also even if the user decided to not use THP, by
echo "never" > transparent_hugepage/enabled , we should continue to map
dax fault using huge page on platforms that can support huge pages.

This still doesn't cover the details of a device-dax created with
PAGE_SIZE align later booted with a kernel that can do hugepage dax.How
should we handle that? That makes me think, this should be a VMA flag
which got derived from device config? May be use VM_HUGEPAGE to indicate
if device should use a hugepage mapping or not?

-aneesh



* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-06  9:17       ` Aneesh Kumar K.V
@ 2019-03-06 11:44         ` Michal Suchánek
  2019-03-06 12:45           ` Aneesh Kumar K.V
  2019-03-13 16:02         ` Dan Williams
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Suchánek @ 2019-03-06 11:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Dan Williams, Oliver, Jan Kara, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, 06 Mar 2019 14:47:33 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:  
> >>
> >> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> >> <aneesh.kumar@linux.ibm.com> wrote:  
 
> Also even if the user decided to not use THP, by
> echo "never" > transparent_hugepage/enabled , we should continue to map
> dax fault using huge page on platforms that can support huge pages.

Is this a good idea?

This knob is there for a reason. In some situations having huge pages
can severely impact performance of the system (due to host-guest
interaction or whatever) and the ability to really turn off all THP
would be important in those cases, right?

Thanks

Michal


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-06 11:44         ` Michal Suchánek
@ 2019-03-06 12:45           ` Aneesh Kumar K.V
  2019-03-06 13:06             ` Kirill A. Shutemov
  2019-03-13 16:07             ` Dan Williams
  0 siblings, 2 replies; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-03-06 12:45 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Dan Williams, Oliver, Jan Kara, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On 3/6/19 5:14 PM, Michal Suchánek wrote:
> On Wed, 06 Mar 2019 14:47:33 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> 
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
>>>>
>>>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
>>>> <aneesh.kumar@linux.ibm.com> wrote:
>   
>> Also even if the user decided to not use THP, by
>> echo "never" > transparent_hugepage/enabled , we should continue to map
>> dax fault using huge page on platforms that can support huge pages.
> 
> Is this a good idea?
> 
> This knob is there for a reason. In some situations having huge pages
> can severely impact performance of the system (due to host-guest
> interaction or whatever) and the ability to really turn off all THP
> would be important in those cases, right?
> 

My understanding was that this is not true for dax pages. These are not
regular memory that got allocated; they are allocated out of /dev/dax/
or /dev/pmem*. Do we have a reason not to use hugepages for mapping
pages in that case?

-aneesh



* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-06 12:45           ` Aneesh Kumar K.V
@ 2019-03-06 13:06             ` Kirill A. Shutemov
  2019-03-13 16:07             ` Dan Williams
  1 sibling, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2019-03-06 13:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michal Suchánek, Dan Williams, Oliver, Jan Kara,
	linux-nvdimm, Linux Kernel Mailing List, Linux MM, Ross Zwisler,
	Andrew Morton, linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 06, 2019 at 06:15:25PM +0530, Aneesh Kumar K.V wrote:
> On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > On Wed, 06 Mar 2019 14:47:33 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > > 
> > > > On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
> > > > > 
> > > > > On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > Also even if the user decided to not use THP, by
> > > echo "never" > transparent_hugepage/enabled , we should continue to map
> > > dax fault using huge page on platforms that can support huge pages.
> > 
> > Is this a good idea?
> > 
> > This knob is there for a reason. In some situations having huge pages
> > can severely impact performance of the system (due to host-guest
> > interaction or whatever) and the ability to really turn off all THP
> > would be important in those cases, right?
> > 
> 
> My understanding was that is not true for dax pages? These are not regular
> memory that got allocated. They are allocated out of /dev/dax/ or
> /dev/pmem*. Do we have a reason not to use hugepages for mapping pages in
> that case?

Yes. For example, when you don't want dax to compete for TLB entries
with a mission-critical application (which uses hugetlb, for instance).

-- 
 Kirill A. Shutemov


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-06  9:17       ` Aneesh Kumar K.V
  2019-03-06 11:44         ` Michal Suchánek
@ 2019-03-13 16:02         ` Dan Williams
  2019-03-14  3:45           ` Aneesh Kumar K.V
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-03-13 16:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Oliver, Andrew Morton, Kirill A . Shutemov, Jan Kara,
	Michael Ellerman, Ross Zwisler, Linux MM,
	Linux Kernel Mailing List, linuxppc-dev, linux-nvdimm

On Wed, Mar 6, 2019 at 1:18 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
> >>
> >> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> >> <aneesh.kumar@linux.ibm.com> wrote:
> >> >
> >> > Add a flag to indicate the ability to do huge page dax mapping. On architecture
> >> > like ppc64, the hypervisor can disable huge page support in the guest. In
> >> > such a case, we should not enable huge page dax mapping. This patch adds
> >> > a flag which the architecture code will update to indicate huge page
> >> > dax mapping support.
> >>
> >> *groan*
> >>
> >> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
> >> > do hugepages. That also takes care of disabling dax hugepage mapping
> >> > with this change.
> >> >
> >> > Without this patch we get the below error with kvm on ppc64.
> >> >
> >> > [  118.849975] lpar: Failed hash pte insert with error -4
> >> >
> >> > NOTE: The patch also use
> >> >
> >> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >> > to disable dax huge page mapping.
> >> >
> >> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> > ---
> >> > TODO:
> >> > * Add Fixes: tag
> >> >
> >> >  include/linux/huge_mm.h | 4 +++-
> >> >  mm/huge_memory.c        | 4 ++++
> >> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> > index 381e872bfde0..01ad5258545e 100644
> >> > --- a/include/linux/huge_mm.h
> >> > +++ b/include/linux/huge_mm.h
> >> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> >> >                         pud_t *pud, pfn_t pfn, bool write);
> >> >  enum transparent_hugepage_flag {
> >> >         TRANSPARENT_HUGEPAGE_FLAG,
> >> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
> >> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> >> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> >> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> >> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
> >> >                 return true;
> >> >
> >> > -       if (vma_is_dax(vma))
> >> > +       if (vma_is_dax(vma) &&
> >> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
> >> >                 return true;
> >>
> >> Forcing PTE sized faults should be fine for fsdax, but it'll break
> >> devdax. The devdax driver requires the fault size be >= the namespace
> >> alignment since devdax tries to guarantee hugepage mappings will be
> >> used and PMD alignment is the default. We can probably have devdax
> >> fall back to the largest size the hypervisor has made available, but
> >> it does run contrary to the design. Ah well, I suppose it's better off
> >> being degraded rather than unusable.
> >
> > Given this is an explicit setting I think device-dax should explicitly
> > fail to enable in the presence of this flag to preserve the
> > application visible behavior.
> >
> > I.e. if device-dax was enabled after this setting was made then I
> > think future faults should fail as well.
>
> Not sure I understood that. Now we are disabling the ability to map
> pages as huge pages. I am now considering that this should not be
> user configurable. Ie, this is something that platform can use to avoid
> dax forcing huge page mapping, but if the architecture can enable huge
> dax mapping, we should always default to using that.

No, that's an application visible behavior regression. The side effect
of this setting is that all huge-page configured device-dax instances
must be disabled.

> Now w.r.t to failures, can device-dax do an opportunistic huge page
> usage?

device-dax explicitly disclaims the ability to do opportunistic mappings.

> I haven't looked at the device-dax details fully yet. Do we make the
> assumption of the mapping page size as a format w.r.t device-dax? Is that
> derived from nd_pfn->align value?

Correct.

>
> Here is what I am working on:
> 1) If the platform doesn't support huge page and if the device superblock
> indicated that it was created with huge page support, we fail the device
> init.

Ok.

> 2) Now if we are creating a new namespace without huge page support in
> the platform, then we force the align details to PAGE_SIZE. In such a
> configuration when handling dax fault even with THP enabled during
> the build, we should not try to use hugepage. This I think we can
> achieve by using TRANSPARENT_HUGEPAEG_DAX_FLAG.

How is this dynamic property communicated to the guest?

>
> Also even if the user decided to not use THP, by
> echo "never" > transparent_hugepage/enabled , we should continue to map
> dax fault using huge page on platforms that can support huge pages.
>
> This still doesn't cover the details of a device-dax created with
> PAGE_SIZE align later booted with a kernel that can do hugepage dax.How
> should we handle that? That makes me think, this should be a VMA flag
> which got derived from device config? May be use VM_HUGEPAGE to indicate
> if device should use a hugepage mapping or not?

device-dax configured with PAGE_SIZE always gets PAGE_SIZE mappings.


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-06 12:45           ` Aneesh Kumar K.V
  2019-03-06 13:06             ` Kirill A. Shutemov
@ 2019-03-13 16:07             ` Dan Williams
  2019-03-19  8:44               ` Kirill A. Shutemov
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-03-13 16:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michal Suchánek, Oliver, Jan Kara, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 6, 2019 at 4:46 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > On Wed, 06 Mar 2019 14:47:33 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >>> On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
> >>>>
> >>>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> >>>> <aneesh.kumar@linux.ibm.com> wrote:
> >
> >> Also even if the user decided to not use THP, by
> >> echo "never" > transparent_hugepage/enabled , we should continue to map
> >> dax fault using huge page on platforms that can support huge pages.
> >
> > Is this a good idea?
> >
> > This knob is there for a reason. In some situations having huge pages
> > can severely impact performance of the system (due to host-guest
> > interaction or whatever) and the ability to really turn off all THP
> > would be important in those cases, right?
> >
>
> My understanding was that is not true for dax pages? These are not
> regular memory that got allocated. They are allocated out of /dev/dax/
> or /dev/pmem*. Do we have a reason not to use hugepages for mapping
> pages in that case?

The problem with the transparent_hugepage/enabled interface is that it
conflates performing compaction work to produce THP-pages with the
ability to map huge pages at all. The compaction is a nop for dax
because the memory is already statically allocated. If the
administrator does not want dax to consume huge TLB entries then don't
configure huge-page dax. If a hypervisor wants to force disable
huge-page-configured device-dax instances after the fact it seems we
need an explicit interface for that and not overload
transparent_hugepage/enabled.


* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-13 16:02         ` Dan Williams
@ 2019-03-14  3:45           ` Aneesh Kumar K.V
  2019-03-14  4:02             ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-03-14  3:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

Dan Williams <dan.j.williams@intel.com> writes:

> On Wed, Mar 6, 2019 at 1:18 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> > On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
>> >>
>> >> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
>> >> <aneesh.kumar@linux.ibm.com> wrote:
>> >> >
>> >> > Add a flag to indicate the ability to do huge page dax mapping. On architecture
>> >> > like ppc64, the hypervisor can disable huge page support in the guest. In
>> >> > such a case, we should not enable huge page dax mapping. This patch adds
>> >> > a flag which the architecture code will update to indicate huge page
>> >> > dax mapping support.
>> >>
>> >> *groan*
>> >>
>> >> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> >> > do hugepages. That also takes care of disabling dax hugepage mapping
>> >> > with this change.
>> >> >
>> >> > Without this patch we get the below error with kvm on ppc64.
>> >> >
>> >> > [  118.849975] lpar: Failed hash pte insert with error -4
>> >> >
>> >> > NOTE: The patch also use
>> >> >
>> >> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> >> > to disable dax huge page mapping.
>> >> >
>> >> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> >> > ---
>> >> > TODO:
>> >> > * Add Fixes: tag
>> >> >
>> >> >  include/linux/huge_mm.h | 4 +++-
>> >> >  mm/huge_memory.c        | 4 ++++
>> >> >  2 files changed, 7 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> >> > index 381e872bfde0..01ad5258545e 100644
>> >> > --- a/include/linux/huge_mm.h
>> >> > +++ b/include/linux/huge_mm.h
>> >> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> >> >                         pud_t *pud, pfn_t pfn, bool write);
>> >> >  enum transparent_hugepage_flag {
>> >> >         TRANSPARENT_HUGEPAGE_FLAG,
>> >> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> >> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> >> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>> >> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> >> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>> >> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>> >> >                 return true;
>> >> >
>> >> > -       if (vma_is_dax(vma))
>> >> > +       if (vma_is_dax(vma) &&
>> >> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>> >> >                 return true;
>> >>
>> >> Forcing PTE sized faults should be fine for fsdax, but it'll break
>> >> devdax. The devdax driver requires the fault size be >= the namespace
>> >> alignment since devdax tries to guarantee hugepage mappings will be
>> >> used and PMD alignment is the default. We can probably have devdax
>> >> fall back to the largest size the hypervisor has made available, but
>> >> it does run contrary to the design. Ah well, I suppose it's better off
>> >> being degraded rather than unusable.
>> >
>> > Given this is an explicit setting I think device-dax should explicitly
>> > fail to enable in the presence of this flag to preserve the
>> > application visible behavior.
>> >
>> > I.e. if device-dax was enabled after this setting was made then I
>> > think future faults should fail as well.
>>
>> Not sure I understood that. Now we are disabling the ability to map
>> pages as huge pages. I am now considering that this should not be
>> user configurable. Ie, this is something that platform can use to avoid
>> dax forcing huge page mapping, but if the architecture can enable huge
>> dax mapping, we should always default to using that.
>
> No, that's an application visible behavior regression. The side effect
> of this setting is that all huge-page configured device-dax instances
> must be disabled.

So if the device was created with an nd_pfn->align value of PMD_SIZE,
that is an indication that we would map the pages at PMD_SIZE?

OK, with that understanding: if the align value is not a supported
mapping size, we fail initializing the device.


>
>> Now w.r.t to failures, can device-dax do an opportunistic huge page
>> usage?
>
> device-dax explicitly disclaims the ability to do opportunistic mappings.
>
>> I haven't looked at the device-dax details fully yet. Do we make the
>> assumption of the mapping page size as a format w.r.t device-dax? Is that
>> derived from nd_pfn->align value?
>
> Correct.
>
>>
>> Here is what I am working on:
>> 1) If the platform doesn't support huge page and if the device superblock
>> indicated that it was created with huge page support, we fail the device
>> init.
>
> Ok.
>
>> 2) Now if we are creating a new namespace without huge page support in
>> the platform, then we force the align details to PAGE_SIZE. In such a
>> configuration when handling dax fault even with THP enabled during
>> the build, we should not try to use hugepage. This I think we can
>> achieve by using TRANSPARENT_HUGEPAEG_DAX_FLAG.
>
> How is this dynamic property communicated to the guest?

via device tree on powerpc. We have a device tree node indicating
supported page sizes.

>
>>
>> Also even if the user decided to not use THP, by
>> echo "never" > transparent_hugepage/enabled , we should continue to map
>> dax fault using huge page on platforms that can support huge pages.
>>
>> This still doesn't cover the details of a device-dax created with
>> PAGE_SIZE align later booted with a kernel that can do hugepage dax.How
>> should we handle that? That makes me think, this should be a VMA flag
>> which got derived from device config? May be use VM_HUGEPAGE to indicate
>> if device should use a hugepage mapping or not?
>
> device-dax configured with PAGE_SIZE always gets PAGE_SIZE mappings.

Now what will be the page size used for mapping the vmemmap?
Architectures will possibly use a PMD_SIZE mapping for the vmemmap if
supported. In the above example, a device-dax with struct page stored
in the device will then have its pfn reserve area aligned only to
PAGE_SIZE. We can't map that using a PMD_SIZE page size, can we?

-aneesh


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-14  3:45           ` Aneesh Kumar K.V
@ 2019-03-14  4:02             ` Dan Williams
  2019-03-20  8:06               ` Aneesh Kumar K.V
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-03-14  4:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 13, 2019 at 8:45 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
[..]
> >> Now, w.r.t. failures, can device-dax do opportunistic huge page
> >> usage?
> >
> > device-dax explicitly disclaims the ability to do opportunistic mappings.
> >
> >> I haven't looked at the device-dax details fully yet. Do we treat the
> >> mapping page size as part of the on-device format for device-dax? Is that
> >> derived from the nd_pfn->align value?
> >
> > Correct.
> >
> >>
> >> Here is what I am working on:
> >> 1) If the platform doesn't support huge pages and the device superblock
> >> indicates that it was created with huge page support, we fail the device
> >> init.
> >
> > Ok.
> >
> >> 2) Now if we are creating a new namespace without huge page support in
> >> the platform, then we force the align value to PAGE_SIZE. In such a
> >> configuration, when handling a dax fault, even with THP enabled during
> >> the build, we should not try to use hugepages. This I think we can
> >> achieve by using TRANSPARENT_HUGEPAGE_DAX_FLAG.
> >
> > How is this dynamic property communicated to the guest?
>
> via device tree on powerpc. We have a device tree node indicating
> supported page sizes.

Ah, ok, yeah let's plumb that straight to the device-dax driver and
leave out the interaction / interpretation of the thp-enabled flags.

>
> >
> >>
> >> Also, even if the user decided not to use THP, via
> >> echo "never" > transparent_hugepage/enabled, we should continue to map
> >> dax faults using huge pages on platforms that can support huge pages.
> >>
> >> This still doesn't cover the case of a device-dax created with
> >> PAGE_SIZE align later booted with a kernel that can do hugepage dax. How
> >> should we handle that? That makes me think this should be a VMA flag
> >> derived from the device config. Maybe use VM_HUGEPAGE to indicate
> >> whether the device should use a hugepage mapping or not?
> >
> > device-dax configured with PAGE_SIZE always gets PAGE_SIZE mappings.
>
> Now, what page size will be used for mapping vmemmap?

That's up to the architecture's vmemmap_populate() implementation.

> Architectures
> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
> above example, a device-dax with struct page in the device will have its
> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
> PMD_SIZE page size?

IIUC, that's a different alignment. Currently that's handled by
padding the reservation area up to a section (128MB on x86) boundary,
but I'm working on patches to allow sub-section sized ranges to be
mapped.

Now, that said, I expect there may be bugs lurking in the
implementation if PAGE_SIZE changes from one boot to the next simply
because I've never tested that.

I think this also indicates that the section padding logic can't be
removed until all arch vmemmap_populate() implementations understand
the sub-section case.
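For reference, today's padding amounts to roughly this (illustrative;
not the exact __nvdimm_setup_pfn() code):

	/* keep the reserved metadata inside whole 128MB sections so
	 * the memmap never lands in a partially populated section */
	start_pad = ALIGN(start, SZ_128M) - start;
	end_trunc = end - ALIGN_DOWN(end, SZ_128M);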

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-13 16:07             ` Dan Williams
@ 2019-03-19  8:44               ` Kirill A. Shutemov
  2019-03-19 15:36                 ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2019-03-19  8:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Aneesh Kumar K.V, Michal Suchánek, Oliver, Jan Kara,
	linux-nvdimm, Linux Kernel Mailing List, Linux MM, Ross Zwisler,
	Andrew Morton, linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 13, 2019 at 09:07:13AM -0700, Dan Williams wrote:
> On Wed, Mar 6, 2019 at 4:46 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > > On Wed, 06 Mar 2019 14:47:33 +0530
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > >> Dan Williams <dan.j.williams@intel.com> writes:
> > >>
> > >>> On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
> > >>>>
> > >>>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > >>>> <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > >> Also, even if the user decided not to use THP, via
> > >> echo "never" > transparent_hugepage/enabled, we should continue to map
> > >> dax faults using huge pages on platforms that can support huge pages.
> > >
> > > Is this a good idea?
> > >
> > > This knob is there for a reason. In some situations having huge pages
> > > can severely impact performance of the system (due to host-guest
> > > interaction or whatever) and the ability to really turn off all THP
> > > would be important in those cases, right?
> > >
> >
> > My understanding was that this is not true for dax pages. These are not
> > regular memory that got allocated; they are allocated out of /dev/dax/
> > or /dev/pmem*. Do we have a reason not to use hugepages for mapping
> > pages in that case?
> 
> The problem with the transparent_hugepage/enabled interface is that it
> conflates performing compaction work to produce THP-pages with the
> ability to map huge pages at all.

That's not [entirely] true. transparent_hugepage/defrag gates heavy-duty
compaction. We do only very limited compaction if it's not advised by
transparent_hugepage/defrag.

I believe DAX has to respect transparent_hugepage/enabled, or not
advertise its huge pages as THP. It's confusing for users.
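I.e. something like this in the dax pmd fault path (a sketch, reusing
the helper touched by patch 2/2):

	/* honor the global THP policy before trying a PMD mapping */
	if (!transparent_hugepage_enabled(vmf->vma))
		return VM_FAULT_FALLBACK;	/* service as PTE faults */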

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-19  8:44               ` Kirill A. Shutemov
@ 2019-03-19 15:36                 ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2019-03-19 15:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Aneesh Kumar K.V, Michal Suchánek, Oliver, Jan Kara,
	linux-nvdimm, Linux Kernel Mailing List, Linux MM, Ross Zwisler,
	Andrew Morton, linuxppc-dev, Kirill A . Shutemov

On Tue, Mar 19, 2019 at 1:45 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, Mar 13, 2019 at 09:07:13AM -0700, Dan Williams wrote:
> > On Wed, Mar 6, 2019 at 4:46 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > > > On Wed, 06 Mar 2019 14:47:33 +0530
> > > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> > > >
> > > >> Dan Williams <dan.j.williams@intel.com> writes:
> > > >>
> > > >>> On Thu, Feb 28, 2019 at 1:40 AM Oliver <oohall@gmail.com> wrote:
> > > >>>>
> > > >>>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > >>>> <aneesh.kumar@linux.ibm.com> wrote:
> > > >
> > > >> Also, even if the user decided not to use THP, via
> > > >> echo "never" > transparent_hugepage/enabled, we should continue to map
> > > >> dax faults using huge pages on platforms that can support huge pages.
> > > >
> > > > Is this a good idea?
> > > >
> > > > This knob is there for a reason. In some situations having huge pages
> > > > can severely impact performance of the system (due to host-guest
> > > > interaction or whatever) and the ability to really turn off all THP
> > > > would be important in those cases, right?
> > > >
> > >
> > > My understanding was that this is not true for dax pages. These are not
> > > regular memory that got allocated; they are allocated out of /dev/dax/
> > > or /dev/pmem*. Do we have a reason not to use hugepages for mapping
> > > pages in that case?
> >
> > The problem with the transparent_hugepage/enabled interface is that it
> > conflates performing compaction work to produce THP-pages with the
> > ability to map huge pages at all.
>
> That's not [entirely] true. transparent_hugepage/defrag gates heavy-duty
> compaction. We do only very limited compaction if it's not advised by
> transparent_hugepage/defrag.
>
> I believe DAX has to respect transparent_hugepage/enabled, or not
> advertise its huge pages as THP. It's confusing for users.

What does "advertise its huge pages as THP" mean in practice? I think
it's confusing that DAX, a facility that bypasses System RAM, is
affected by a transparent_hugepage flag which is a feature for
combining System RAM pages into larger pages. For the same reason that
transparent_hugepage does not gate / control hugetlb operation,
transparent_hugepage should not gate / control DAX. A
global setting to disable opportunistic large page mappings of
System-RAM makes sense, but I don't see why that should read on DAX?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-14  4:02             ` Dan Williams
@ 2019-03-20  8:06               ` Aneesh Kumar K.V
  2019-03-20  8:09                 ` Aneesh Kumar K.V
  0 siblings, 1 reply; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-03-20  8:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

Dan Williams <dan.j.williams@intel.com> writes:

>
>> Now, what page size will be used for mapping vmemmap?
>
> That's up to the architecture's vmemmap_populate() implementation.
>
>> Architectures
>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
>> above example, a device-dax with struct page in the device will have its
>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
>> PMD_SIZE page size?
>
> IIUC, that's a different alignment. Currently that's handled by
> padding the reservation area up to a section (128MB on x86) boundary,
> but I'm working on patches to allow sub-section sized ranges to be
> mapped.

I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:

	if (nd_pfn->mode == PFN_MODE_PMEM) {
		unsigned long memmap_size;

		/*
		 * vmemmap_populate_hugepages() allocates the memmap array in
		 * HPAGE_SIZE chunks.
		 */
		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
				nd_pfn->align) - start;
	}

IIUC that is finding the offset where to put the vmemmap start. And that has
to be aligned to the page size with which we may end up mapping the vmemmap
area, right?

Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
is to compute how many pfns we should map for this pfn device, right?
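To make the arithmetic concrete (numbers are mine, assuming a 16G
PFN_MODE_PMEM namespace, 4K base pages, a 2M align, no label reserve,
and a section-aligned start):

	npfns       = SZ_16G / SZ_4K;                 /* 4,194,304 pfns */
	memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);  /* 256M of memmap */
	offset      = ALIGN(start + SZ_8K + memmap_size,
			    nd_pfn->align) - start;   /* 258M here */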
	
-aneesh


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-20  8:06               ` Aneesh Kumar K.V
@ 2019-03-20  8:09                 ` Aneesh Kumar K.V
  2019-03-20 15:34                   ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Aneesh Kumar K.V @ 2019-03-20  8:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>>
>>> Now, what page size will be used for mapping vmemmap?
>>
>> That's up to the architecture's vmemmap_populate() implementation.
>>
>>> Architectures
>>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
>>> above example, a device-dax with struct page in the device will have its
>>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
>>> PMD_SIZE page size?
>>
>> IIUC, that's a different alignment. Currently that's handled by
>> padding the reservation area up to a section (128MB on x86) boundary,
>> but I'm working on patches to allow sub-section sized ranges to be
>> mapped.
>
> I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:
>
> 	if (nd_pfn->mode == PFN_MODE_PMEM) {
> 		unsigned long memmap_size;
>
> 		/*
> 		 * vmemmap_populate_hugepages() allocates the memmap array in
> 		 * HPAGE_SIZE chunks.
> 		 */
> 		memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> 		offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> 				nd_pfn->align) - start;
>       }
>
> IIUC that is finding the offset where to put the vmemmap start. And that has
> to be aligned to the page size with which we may end up mapping the vmemmap
> area, right?
>
> Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
> is to compute how many pfns we should map for this pfn device, right?
> 	

Also, I guess those 4K assumptions there are wrong?

modified   drivers/nvdimm/pfn_devs.c
@@ -783,7 +783,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		return -ENXIO;
 	}
 
-	npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+	npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
 	pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
 	pfn_sb->dataoff = cpu_to_le64(offset);
 	pfn_sb->npfns = cpu_to_le64(npfns);


-aneesh


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-20  8:09                 ` Aneesh Kumar K.V
@ 2019-03-20 15:34                   ` Dan Williams
  2019-03-20 20:57                     ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-03-20 15:34 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
>
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> >>
> >>> Now, what page size will be used for mapping vmemmap?
> >>
> >> That's up to the architecture's vmemmap_populate() implementation.
> >>
> >>> Architectures
> >>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
> >>> above example, a device-dax with struct page in the device will have its
> >>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
> >>> PMD_SIZE page size?
> >>
> >> IIUC, that's a different alignment. Currently that's handled by
> >> padding the reservation area up to a section (128MB on x86) boundary,
> >> but I'm working on patches to allow sub-section sized ranges to be
> >> mapped.
> >
> > I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:
> >
> >       if (nd_pfn->mode == PFN_MODE_PMEM) {
> >               unsigned long memmap_size;
> >
> >               /*
> >                * vmemmap_populate_hugepages() allocates the memmap array in
> >                * HPAGE_SIZE chunks.
> >                */
> >               memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> >               offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> >                               nd_pfn->align) - start;
> >       }
> >
> > IIUC that is finding the offset where to put the vmemmap start. And that has
> > to be aligned to the page size with which we may end up mapping the vmemmap
> > area, right?

Right, that's the physical offset of where the vmemmap ends, and the
memory to be mapped begins.

> > Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > is to compute how many pfns we should map for this pfn device, right?
> >
>
> Also, I guess those 4K assumptions there are wrong?

Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
needs to be revved and the PAGE_SIZE needs to be recorded in the
info-block.
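Strawman for that rev (the field name and placement are illustrative,
not a shipped format):

	struct nd_pfn_sb {
		...
		__le32 page_size;	/* PAGE_SIZE of the kernel that wrote this */
	};

	/* at setup time: refuse rather than misinterpret npfns */
	if (le32_to_cpu(pfn_sb->page_size) != PAGE_SIZE)
		return -EOPNOTSUPP;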

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-20 15:34                   ` Dan Williams
@ 2019-03-20 20:57                     ` Dan Williams
  2019-03-21  3:08                       ` Oliver
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-03-20 20:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 20, 2019 at 8:34 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
> >
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > >
> > >>
> > >>> Now, what page size will be used for mapping vmemmap?
> > >>
> > >> That's up to the architecture's vmemmap_populate() implementation.
> > >>
> > >>> Architectures
> > >>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
> > >>> above example, a device-dax with struct page in the device will have its
> > >>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
> > >>> PMD_SIZE page size?
> > >>
> > >> IIUC, that's a different alignment. Currently that's handled by
> > >> padding the reservation area up to a section (128MB on x86) boundary,
> > >> but I'm working on patches to allow sub-section sized ranges to be
> > >> mapped.
> > >
> > > I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:
> > >
> > >       if (nd_pfn->mode == PFN_MODE_PMEM) {
> > >               unsigned long memmap_size;
> > >
> > >               /*
> > >                * vmemmap_populate_hugepages() allocates the memmap array in
> > >                * HPAGE_SIZE chunks.
> > >                */
> > >               memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > >               offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > >                               nd_pfn->align) - start;
> > >       }
> > >
> > > IIUC that is finding the offset where to put the vmemmap start. And that has
> > > to be aligned to the page size with which we may end up mapping the vmemmap
> > > area, right?
>
> Right, that's the physical offset of where the vmemmap ends, and the
> memory to be mapped begins.
>
> > > Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > is to compute how many pfns we should map for this pfn device, right?
> > >
> >
> > Also, I guess those 4K assumptions there are wrong?
>
> Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> needs to be revved and the PAGE_SIZE needs to be recorded in the
> info-block.

How often does a system change page size? Is it fixed, or do
environments change it from one boot to the next? I'm thinking through
the behavior of what to do when the recorded PAGE_SIZE in the info-block
does not match the current system page size. The simplest option is to
just fail the device and require it to be reconfigured. Is that
acceptable?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-20 20:57                     ` Dan Williams
@ 2019-03-21  3:08                       ` Oliver
  2019-03-21  3:12                         ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Oliver @ 2019-03-21  3:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Aneesh Kumar K.V, Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Thu, Mar 21, 2019 at 7:57 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Mar 20, 2019 at 8:34 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
> > >
> > > > Dan Williams <dan.j.williams@intel.com> writes:
> > > >
> > > >>
> > > >>> Now, what page size will be used for mapping vmemmap?
> > > >>
> > > >> That's up to the architecture's vmemmap_populate() implementation.
> > > >>
> > > >>> Architectures
> > > >>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
> > > >>> above example, a device-dax with struct page in the device will have its
> > > >>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
> > > >>> PMD_SIZE page size?
> > > >>
> > > >> IIUC, that's a different alignment. Currently that's handled by
> > > >> padding the reservation area up to a section (128MB on x86) boundary,
> > > >> but I'm working on patches to allow sub-section sized ranges to be
> > > >> mapped.
> > > >
> > > > I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:
> > > >
> > > >       if (nd_pfn->mode == PFN_MODE_PMEM) {
> > > >               unsigned long memmap_size;
> > > >
> > > >               /*
> > > >                * vmemmap_populate_hugepages() allocates the memmap array in
> > > >                * HPAGE_SIZE chunks.
> > > >                */
> > > >               memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > > >               offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > > >                               nd_pfn->align) - start;
> > > >       }
> > > >
> > > > IIUC that is finding the offset where to put the vmemmap start. And that has
> > > > to be aligned to the page size with which we may end up mapping the vmemmap
> > > > area, right?
> >
> > Right, that's the physical offset of where the vmemmap ends, and the
> > memory to be mapped begins.
> >
> > > > Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > > is to compute how many pfns we should map for this pfn device, right?
> > > >
> > >
> > > Also, I guess those 4K assumptions there are wrong?
> >
> > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> > needs to be revved and the PAGE_SIZE needs to be recorded in the
> > info-block.
>
> > How often does a system change page size? Is it fixed, or do
> > environments change it from one boot to the next? I'm thinking through
> > the behavior of what to do when the recorded PAGE_SIZE in the info-block
> > does not match the current system page size. The simplest option is to
> > just fail the device and require it to be reconfigured. Is that
> > acceptable?

The kernel page size is set at build time, and as far as I know every
distro configures its ppc64(le) kernel for 64K. I've used 4K kernels
a few times in the past to debug PAGE_SIZE-dependent problems, but I'd
be surprised if anyone is using 4K in production.

Anyway, my view is that using 4K here isn't really a problem, since
it's just the accounting unit of the pfn superblock format. The kernel
reading from it should understand that and scale it to whatever
accounting unit it wants to use internally. Currently we don't, so that
should probably be fixed, but it doesn't seem to cause any real
issues. As far as I can tell, the only user of npfns is
__nvdimm_setup_pfn(), which prints the "number of pfns truncated"
message.

Am I missing something?
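The scaling I have in mind is just (a sketch, assuming npfns stays
4K-based in the superblock):

	/* convert the info-block's 4K-based pfn count into the
	 * running kernel's PAGE_SIZE accounting unit */
	u64 npfns_4k = le64_to_cpu(pfn_sb->npfns);
	u64 npfns    = npfns_4k / (PAGE_SIZE / SZ_4K);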

> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
  2019-03-21  3:08                       ` Oliver
@ 2019-03-21  3:12                         ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2019-03-21  3:12 UTC (permalink / raw)
  To: Oliver
  Cc: Aneesh Kumar K.V, Jan Kara, linux-nvdimm, Michael Ellerman,
	Linux Kernel Mailing List, Linux MM, Ross Zwisler, Andrew Morton,
	linuxppc-dev, Kirill A . Shutemov

On Wed, Mar 20, 2019 at 8:09 PM Oliver <oohall@gmail.com> wrote:
>
> On Thu, Mar 21, 2019 at 7:57 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Mar 20, 2019 at 8:34 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Wed, Mar 20, 2019 at 1:09 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > >
> > > > Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:
> > > >
> > > > > Dan Williams <dan.j.williams@intel.com> writes:
> > > > >
> > > > >>
> > > > >>> Now, what page size will be used for mapping vmemmap?
> > > > >>
> > > > >> That's up to the architecture's vmemmap_populate() implementation.
> > > > >>
> > > > >>> Architectures
> > > > >>> will possibly use a PMD_SIZE mapping for vmemmap if supported. With the
> > > > >>> above example, a device-dax with struct page in the device will have its
> > > > >>> pfn reserve area aligned to PAGE_SIZE, so we can't map that using a
> > > > >>> PMD_SIZE page size?
> > > > >>
> > > > >> IIUC, that's a different alignment. Currently that's handled by
> > > > >> padding the reservation area up to a section (128MB on x86) boundary,
> > > > >> but I'm working on patches to allow sub-section sized ranges to be
> > > > >> mapped.
> > > > >
> > > > > I am missing something w.r.t. the code. The below code aligns that using nd_pfn->align:
> > > > >
> > > > >       if (nd_pfn->mode == PFN_MODE_PMEM) {
> > > > >               unsigned long memmap_size;
> > > > >
> > > > >               /*
> > > > >                * vmemmap_populate_hugepages() allocates the memmap array in
> > > > >                * HPAGE_SIZE chunks.
> > > > >                */
> > > > >               memmap_size = ALIGN(64 * npfns, HPAGE_SIZE);
> > > > >               offset = ALIGN(start + SZ_8K + memmap_size + dax_label_reserve,
> > > > >                               nd_pfn->align) - start;
> > > > >       }
> > > > >
> > > > > IIUC that is finding the offset where to put the vmemmap start. And that has
> > > > > to be aligned to the page size with which we may end up mapping the vmemmap
> > > > > area, right?
> > >
> > > Right, that's the physical offset of where the vmemmap ends, and the
> > > memory to be mapped begins.
> > >
> > > > > Yes, we find the npfns by aligning up using PAGES_PER_SECTION. But that
> > > > > is to compute how many pfns we should map for this pfn device, right?
> > > > >
> > > >
> > > > Also, I guess those 4K assumptions there are wrong?
> > >
> > > Yes, I think to support non-4K-PAGE_SIZE systems the 'pfn' metadata
> > > needs to be revved and the PAGE_SIZE needs to be recorded in the
> > > info-block.
> >
> > > How often does a system change page size? Is it fixed, or do
> > > environments change it from one boot to the next? I'm thinking through
> > > the behavior of what to do when the recorded PAGE_SIZE in the info-block
> > > does not match the current system page size. The simplest option is to
> > > just fail the device and require it to be reconfigured. Is that
> > > acceptable?
>
> The kernel page size is set at build time, and as far as I know every
> distro configures its ppc64(le) kernel for 64K. I've used 4K kernels
> a few times in the past to debug PAGE_SIZE-dependent problems, but I'd
> be surprised if anyone is using 4K in production.

Ah, ok.

> Anyway, my view is that using 4K here isn't really a problem, since
> it's just the accounting unit of the pfn superblock format. The kernel
> reading from it should understand that and scale it to whatever
> accounting unit it wants to use internally. Currently we don't, so that
> should probably be fixed, but it doesn't seem to cause any real
> issues. As far as I can tell, the only user of npfns is
> __nvdimm_setup_pfn(), which prints the "number of pfns truncated"
> message.
>
> Am I missing something?

No, I don't think so. The only time it would break is if a system with
64K page size laid down an info-block with not enough reserved
capacity when the page size is 4K (npfns too small). However, that
sounds like an exceptional case which is why no problems have been
reported to date.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-03-21  3:12 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-28  8:35 [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Aneesh Kumar K.V
2019-02-28  8:35 ` [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default Aneesh Kumar K.V
2019-02-28  9:40   ` Jan Kara
2019-02-28 12:32     ` Aneesh Kumar K.V
2019-02-28  9:40   ` Oliver
2019-02-28 12:43     ` Aneesh Kumar K.V
2019-02-28 16:45     ` Dan Williams
2019-03-06  9:17       ` Aneesh Kumar K.V
2019-03-06 11:44         ` Michal Suchánek
2019-03-06 12:45           ` Aneesh Kumar K.V
2019-03-06 13:06             ` Kirill A. Shutemov
2019-03-13 16:07             ` Dan Williams
2019-03-19  8:44               ` Kirill A. Shutemov
2019-03-19 15:36                 ` Dan Williams
2019-03-13 16:02         ` Dan Williams
2019-03-14  3:45           ` Aneesh Kumar K.V
2019-03-14  4:02             ` Dan Williams
2019-03-20  8:06               ` Aneesh Kumar K.V
2019-03-20  8:09                 ` Aneesh Kumar K.V
2019-03-20 15:34                   ` Dan Williams
2019-03-20 20:57                     ` Dan Williams
2019-03-21  3:08                       ` Oliver
2019-03-21  3:12                         ` Dan Williams
2019-02-28  9:21 ` [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page Jan Kara
2019-02-28 12:34   ` Aneesh Kumar K.V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).