linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files
@ 2019-12-11 21:32 Barret Rhoden
  2019-12-11 21:32 ` [PATCH v4 1/2] mm: make dev_pagemap_mapping_shift() externally visible Barret Rhoden
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-11 21:32 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

This patchset allows KVM to map huge pages for DAX-backed files.

I held previous versions in limbo while people were sorting out whether
or not DAX pages were going to remain PageReserved and how that relates
to KVM.

Now that that is sorted out (DAX pages are PageReserved, but they are
not kvm_is_reserved_pfn(), and DAX pages are considered on a
case-by-case basis for KVM), I can repost this.
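
For reference, this is roughly how that distinction looks in
virt/kvm/kvm_main.c around this time (a simplified sketch, not the
verbatim upstream code):

bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
	/*
	 * ZONE_DEVICE metadata is only trustworthy while the page is
	 * pinned, so warn if the page count is zero.
	 */
	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
		return false;

	return is_zone_device_page(pfn_to_page(pfn));
}

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
	/* DAX pages are PageReserved but are not treated as reserved PFNs. */
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn)) &&
		       !kvm_is_zone_device_pfn(pfn);

	return true;
}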

v3 -> v4:
v3: https://lore.kernel.org/lkml/20190404202345.133553-1-brho@google.com/
- Rebased onto linus/master

v2 -> v3:
v2: https://lore.kernel.org/lkml/20181114215155.259978-1-brho@google.com/
- Updated Acks/Reviewed-by
- Rebased onto linux-next

v1 -> v2:
https://lore.kernel.org/lkml/20181109203921.178363-1-brho@google.com/
- Updated Acks/Reviewed-by
- Minor touchups
- Added patch to remove redundant PageReserved() check
- Rebased onto linux-next

RFC/discussion thread:
https://lore.kernel.org/lkml/20181029210716.212159-1-brho@google.com/

Barret Rhoden (2):
  mm: make dev_pagemap_mapping_shift() externally visible
  kvm: Use huge pages for DAX-backed files

 arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
 include/linux/mm.h     |  3 +++
 mm/memory-failure.c    | 38 +++-----------------------------------
 mm/util.c              | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+), 39 deletions(-)

-- 
2.24.0.525.g8f36a354ae-goog



* [PATCH v4 1/2] mm: make dev_pagemap_mapping_shift() externally visible
  2019-12-11 21:32 [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
@ 2019-12-11 21:32 ` Barret Rhoden
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
  2019-12-12  0:22 ` [PATCH v4 0/2] " Paolo Bonzini
  2 siblings, 0 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-11 21:32 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

KVM has a use case for determining the size of a dax mapping.  The KVM
code has easy access to the address and the mm; hence the change in
parameters.

Signed-off-by: Barret Rhoden <brho@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h  |  3 +++
 mm/memory-failure.c | 38 +++-----------------------------------
 mm/util.c           | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b0ef04b6d15..f88bcc6a3bd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -998,6 +998,9 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
 
+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+					struct mm_struct *mm);
+
 static inline void get_page(struct page *page)
 {
 	page = compound_head(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 41c634f45d45..9dc487e73d9b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -261,40 +261,6 @@ void shake_page(struct page *p, int access)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-		struct vm_area_struct *vma)
-{
-	unsigned long address = vma_address(page, vma);
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-
-	pgd = pgd_offset(vma->vm_mm, address);
-	if (!pgd_present(*pgd))
-		return 0;
-	p4d = p4d_offset(pgd, address);
-	if (!p4d_present(*p4d))
-		return 0;
-	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
-		return 0;
-	if (pud_devmap(*pud))
-		return PUD_SHIFT;
-	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
-		return 0;
-	if (pmd_devmap(*pmd))
-		return PMD_SHIFT;
-	pte = pte_offset_map(pmd, address);
-	if (!pte_present(*pte))
-		return 0;
-	if (pte_devmap(*pte))
-		return PAGE_SHIFT;
-	return 0;
-}
-
 /*
  * Failure handling: if we can't find or can't kill a process there's
  * not much we can do.	We just print a message and ignore otherwise.
@@ -318,7 +284,9 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 
 	tk->addr = page_address_in_vma(p, vma);
 	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+		tk->size_shift =
+			dev_pagemap_mapping_shift(vma_address(p, vma),
+						  vma->vm_mm);
 	else
 		tk->size_shift = page_shift(compound_head(p));
 
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..553fbe1692ed 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -911,3 +911,37 @@ int memcmp_pages(struct page *page1, struct page *page2)
 	kunmap_atomic(addr1);
 	return ret;
 }
+
+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+					struct mm_struct *mm)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return 0;
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return 0;
+	pud = pud_offset(p4d, address);
+	if (!pud_present(*pud))
+		return 0;
+	if (pud_devmap(*pud))
+		return PUD_SHIFT;
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+	if (pmd_devmap(*pmd))
+		return PMD_SHIFT;
+	pte = pte_offset_map(pmd, address);
+	if (!pte_present(*pte))
+		return 0;
+	if (pte_devmap(*pte))
+		return PAGE_SHIFT;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_mapping_shift);
-- 
2.24.0.525.g8f36a354ae-goog



* [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
  2019-12-11 21:32 ` [PATCH v4 1/2] mm: make dev_pagemap_mapping_shift() externally visible Barret Rhoden
@ 2019-12-11 21:32 ` Barret Rhoden
  2019-12-12  0:21   ` Paolo Bonzini
                     ` (3 more replies)
  2019-12-12  0:22 ` [PATCH v4 0/2] " Paolo Bonzini
  2 siblings, 4 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-11 21:32 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound.  The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.  For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..cd07bc4e595f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long hva;
+
+	if (!is_zone_device_page(page))
+		return PageTransCompoundMap(page);
+
+	/*
+	 * DAX pages do not use compound pages.  The page should have already
+	 * been mapped into the host-side page table during try_async_pf(), so
+	 * we can check the page tables directly.
+	 */
+	hva = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(hva))
+		return false;
+
+	/*
+	 * Our caller grabbed the KVM mmu_lock with a successful
+	 * mmu_notifier_retry, so we're safe to walk the page table.
+	 */
+	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
+	case PMD_SHIFT:
+	case PUD_SIZE:
+		return true;
+	}
+	return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t gfn, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * here.
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	    level == PT_PAGE_TABLE_LEVEL &&
+	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
@@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * mapping if the indirect sp has level = 1.
 		 */
 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageTransCompoundMap(pfn_to_page(pfn))) {
+		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
-- 
2.24.0.525.g8f36a354ae-goog



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
@ 2019-12-12  0:21   ` Paolo Bonzini
  2019-12-12 12:22   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2019-12-12  0:21 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

On 11/12/19 22:32, Barret Rhoden wrote:
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:
> +		return true;
> +	}
> +	return false;

Should this simply be "> PAGE_SHIFT"?

Paolo



* Re: [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
  2019-12-11 21:32 ` [PATCH v4 1/2] mm: make dev_pagemap_mapping_shift() externally visible Barret Rhoden
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
@ 2019-12-12  0:22 ` Paolo Bonzini
  2 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2019-12-12  0:22 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

On 11/12/19 22:32, Barret Rhoden wrote:
> This patchset allows KVM to map huge pages for DAX-backed files.
> 
> I held previous versions in limbo while people were sorting out whether
> or not DAX pages were going to remain PageReserved and how that relates
> to KVM.
> 
> Now that that is sorted out (DAX pages are PageReserved, but they are
> not kvm_is_reserved_pfn(), and DAX pages are considered on a
> case-by-case basis for KVM), I can repost this.
> 
> v3 -> v4:
> v3: https://lore.kernel.org/lkml/20190404202345.133553-1-brho@google.com/
> - Rebased onto linus/master
> 
> v2 -> v3:
> v2: https://lore.kernel.org/lkml/20181114215155.259978-1-brho@google.com/
> - Updated Acks/Reviewed-by
> - Rebased onto linux-next
> 
> v1 -> v2:
> https://lore.kernel.org/lkml/20181109203921.178363-1-brho@google.com/
> - Updated Acks/Reviewed-by
> - Minor touchups
> - Added patch to remove redundant PageReserved() check
> - Rebased onto linux-next
> 
> RFC/discussion thread:
> https://lore.kernel.org/lkml/20181029210716.212159-1-brho@google.com/
> 
> Barret Rhoden (2):
>   mm: make dev_pagemap_mapping_shift() externally visible
>   kvm: Use huge pages for DAX-backed files
> 
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>  include/linux/mm.h     |  3 +++
>  mm/memory-failure.c    | 38 +++-----------------------------------
>  mm/util.c              | 34 ++++++++++++++++++++++++++++++++++
>  4 files changed, 72 insertions(+), 39 deletions(-)
> 

Thanks, with the Acked-by already in place for patch 1 I can pick this up.

Paolo



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
  2019-12-12  0:21   ` Paolo Bonzini
@ 2019-12-12 12:22   ` David Hildenbrand
  2019-12-12 16:31     ` Barret Rhoden
  2019-12-12 12:33   ` Liran Alon
  2019-12-12 17:34   ` Sean Christopherson
  3 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2019-12-12 12:22 UTC (permalink / raw)
  To: Barret Rhoden, Paolo Bonzini, Dan Williams, Dave Jiang, Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

On 11.12.19 22:32, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  	return -EFAULT;
>  }
>  
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:

Shouldn't this be PUD_SHIFT?

But I agree with Paolo, that this is simply

return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

> +		return true;
> +	}
> +	return false;
> +}
> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  					gfn_t gfn, kvm_pfn_t *pfnp,
>  					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 * here.
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 * mapping if the indirect sp has level = 1.
>  		 */
>  		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>  			pte_list_remove(rmap_head, sptep);
>  
>  			if (kvm_available_flush_tlb_with_range())
> 

Patch itself looks good to me (especially, cleans up these two places a
bit). I am not an expert on the locking part, so I can't give my RB.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
  2019-12-12  0:21   ` Paolo Bonzini
  2019-12-12 12:22   ` David Hildenbrand
@ 2019-12-12 12:33   ` Liran Alon
  2019-12-12 16:54     ` Dan Williams
  2019-12-12 17:03     ` Barret Rhoden
  2019-12-12 17:34   ` Sean Christopherson
  3 siblings, 2 replies; 19+ messages in thread
From: Liran Alon @ 2019-12-12 12:33 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, x86, kvm, linux-kernel,
	jason.zeng



> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> 
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.

For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
will return the page size of the hugetlbfs mapping without needing to parse the page tables.
See the vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize() == hugetlb_vm_op_pagesize().
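
For illustration, vma_kernel_pagesize() is roughly the following (a
simplified sketch, not the verbatim mm code):

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->pagesize)
		return vma->vm_ops->pagesize(vma);	/* hugetlbfs: hugetlb_vm_op_pagesize() */

	return PAGE_SIZE;	/* everything else, including DAX vmas */
}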

Only for pages that were originally mapped as small pages and later merged into larger pages by THP is there a need to check PageTransCompound(), again instead of parsing page tables.

Therefore, it seems more logical to me that:
(a) If DAX-backed files are mapped as large pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping, causing kvm_host_page_size() to return the right size without the need to parse the page tables.
(b) If DAX-backed files' small pages can later be merged into large pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files' small pages into large pages when DAX provides a “struct page” for the DAX pages.

> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> 	return -EFAULT;
> }
> 
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {

Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
Was this changed by a commit I missed?

-Liran

> +	case PMD_SHIFT:
> +	case PUD_SIZE:
> +		return true;
> +	}
> +	return false;
> +}
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> 					gfn_t gfn, kvm_pfn_t *pfnp,
> 					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> 	 * here.
> 	 */
> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> 		unsigned long mask;
> 		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> 		 * mapping if the indirect sp has level = 1.
> 		 */
> 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> 			pte_list_remove(rmap_head, sptep);
> 
> 			if (kvm_available_flush_tlb_with_range())
> -- 
> 2.24.0.525.g8f36a354ae-goog
> 



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 12:22   ` David Hildenbrand
@ 2019-12-12 16:31     ` Barret Rhoden
  0 siblings, 0 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-12 16:31 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Dan Williams, Dave Jiang,
	Alexander Duyck
  Cc: linux-nvdimm, x86, kvm, linux-kernel, jason.zeng

On 12/12/19 7:22 AM, David Hildenbrand wrote:
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> +	case PMD_SHIFT:
>> +	case PUD_SIZE:
> 
> Shouldn't this be PUD_SHIFT?
> 
> But I agree with Paolo, that this is simply
> 
> return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

Yes, good call.  I'll fix that in the next version.
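
With that fixed and Paolo's simplification folded in, the helper would
look roughly like this (a sketch of the next version, not the final
patch):

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);
	unsigned long hva;

	if (!is_zone_device_page(page))
		return PageTransCompoundMap(page);

	hva = gfn_to_hva(kvm, gfn);
	if (kvm_is_error_hva(hva))
		return false;

	/* Any devmap mapping larger than a base page (PMD or PUD) is huge. */
	return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;
}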



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 12:33   ` Liran Alon
@ 2019-12-12 16:54     ` Dan Williams
  2019-12-12 17:39       ` Liran Alon
  2019-12-12 17:03     ` Barret Rhoden
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-12-12 16:54 UTC (permalink / raw)
  To: Liran Alon
  Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason

On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> >
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound.  The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not.  For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
>
> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>
> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>
> Therefore, it seems more logical to me that:
> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.

A given dax-mapped vma may have mixed page sizes, so ->page_size()
can't be used reliably to enumerate the mapping size.

> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.

DAX pages do not participate in THP and do not have the
PageTransCompound accounting. The only mechanism that records the
mapping size for dax is the page tables themselves.


>
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <brho@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >       return -EFAULT;
> > }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +     struct page *page = pfn_to_page(pfn);
> > +     unsigned long hva;
> > +
> > +     if (!is_zone_device_page(page))
> > +             return PageTransCompoundMap(page);
> > +
> > +     /*
> > +      * DAX pages do not use compound pages.  The page should have already
> > +      * been mapped into the host-side page table during try_async_pf(), so
> > +      * we can check the page tables directly.
> > +      */
> > +     hva = gfn_to_hva(kvm, gfn);
> > +     if (kvm_is_error_hva(hva))
> > +             return false;
> > +
> > +     /*
> > +      * Our caller grabbed the KVM mmu_lock with a successful
> > +      * mmu_notifier_retry, so we're safe to walk the page table.
> > +      */
> > +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?
>
> -Liran
>
> > +     case PMD_SHIFT:
> > +     case PUD_SIZE:
> > +             return true;
> > +     }
> > +     return false;
> > +}
> > +
> > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >                                       gfn_t gfn, kvm_pfn_t *pfnp,
> >                                       int *levelp)
> > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >        * here.
> >        */
> >       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > -         !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> > -         PageTransCompoundMap(pfn_to_page(pfn)) &&
> > +         level == PT_PAGE_TABLE_LEVEL &&
> > +         pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> >           !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> >               unsigned long mask;
> >               /*
> > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                * mapping if the indirect sp has level = 1.
> >                */
> >               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> > -                 !kvm_is_zone_device_pfn(pfn) &&
> > -                 PageTransCompoundMap(pfn_to_page(pfn))) {
> > +                 pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> >                       pte_list_remove(rmap_head, sptep);
> >
> >                       if (kvm_available_flush_tlb_with_range())
> > --
> > 2.24.0.525.g8f36a354ae-goog
> >
>


* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 12:33   ` Liran Alon
  2019-12-12 16:54     ` Dan Williams
@ 2019-12-12 17:03     ` Barret Rhoden
  1 sibling, 0 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-12 17:03 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, x86, kvm, linux-kernel,
	jason.zeng

On 12/12/19 7:33 AM, Liran Alon wrote:
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?

I changed this in Patch 1.  The place I call it in KVM has the address 
and mm available, which is the only thing dev_pagemap_mapping_shift() 
really needs.  (The first thing it did was convert the page to an address.)

I'll add some more text to patch 1's commit message about that.

Thanks,

Barret




* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-11 21:32 ` [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files Barret Rhoden
                     ` (2 preceding siblings ...)
  2019-12-12 12:33   ` Liran Alon
@ 2019-12-12 17:34   ` Sean Christopherson
  2019-12-12 17:37     ` Dan Williams
  2019-12-12 17:45     ` Liran Alon
  3 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-12-12 17:34 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Paolo Bonzini, Dan Williams, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, x86, kvm, linux-kernel,
	jason.zeng

On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
> 
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
> 
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> 
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  	return -EFAULT;
>  }
>  
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages.  The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> +	case PMD_SHIFT:
> +	case PUD_SIZE:

I assume this means DAX can have 1GB pages?  I ask because KVM's THP logic
has historically relied on THP only supporting 2MB.  I cleaned this up in
a recent series[*], which is in kvm/queue, but I obviously didn't actually
test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

The easiest thing is probably to rebase on kvm/queue.  You'll need to do
that anyways, and I suspect doing so will help shake out any hiccups.

[*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com

> +		return true;
> +	}
> +	return false;
> +}
> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  					gfn_t gfn, kvm_pfn_t *pfnp,
>  					int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 * here.
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    level == PT_PAGE_TABLE_LEVEL &&
> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 * mapping if the indirect sp has level = 1.
>  		 */
>  		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>  			pte_list_remove(rmap_head, sptep);
>  
>  			if (kvm_available_flush_tlb_with_range())
> -- 
> 2.24.0.525.g8f36a354ae-goog
> 


* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 17:34   ` Sean Christopherson
@ 2019-12-12 17:37     ` Dan Williams
  2019-12-12 19:16       ` Barret Rhoden
  2019-12-12 17:45     ` Liran Alon
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-12-12 17:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason

On Thu, Dec 12, 2019 at 9:34 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound.  The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not.  For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <brho@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >       return -EFAULT;
> >  }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +     struct page *page = pfn_to_page(pfn);
> > +     unsigned long hva;
> > +
> > +     if (!is_zone_device_page(page))
> > +             return PageTransCompoundMap(page);
> > +
> > +     /*
> > +      * DAX pages do not use compound pages.  The page should have already
> > +      * been mapped into the host-side page table during try_async_pf(), so
> > +      * we can check the page tables directly.
> > +      */
> > +     hva = gfn_to_hva(kvm, gfn);
> > +     if (kvm_is_error_hva(hva))
> > +             return false;
> > +
> > +     /*
> > +      * Our caller grabbed the KVM mmu_lock with a successful
> > +      * mmu_notifier_retry, so we're safe to walk the page table.
> > +      */
> > +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> > +     case PMD_SHIFT:
> > +     case PUD_SIZE:
>
> I assume this means DAX can have 1GB pages?

Correct, it can. Not in the filesystem-dax case, but device-dax
supports 1GB pages.

> I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB.  I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

Yeah, since device-dax is the only path to support longterm page
pinning for vfio device assignment, testing with device-dax + 1GB
pages would be a useful sanity check.


* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 16:54     ` Dan Williams
@ 2019-12-12 17:39       ` Liran Alon
  2019-12-12 17:59         ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Liran Alon @ 2019-12-12 17:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason



> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
>>> 
>>> This change allows KVM to map DAX-backed files made of huge pages with
>>> huge mappings in the EPT/TDP.
>>> 
>>> DAX pages are not PageTransCompound.  The existing check is trying to
>>> determine if the mapping for the pfn is a huge mapping or not.  For
>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>> For DAX, we can check the page table itself.
>> 
>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>> 
>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>> 
>> Therefore, it seems more logical to me that:
>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
> 
> A given dax-mapped vma may have mixed page sizes so ->page_size()
> can't be used reliably to enumerating the mapping size.

Naive question: why not split the VMA in this case into multiple VMAs with different results for ->page_size()?
What you are describing sounds like DAX is breaking this callback’s semantics in an unpredictable manner.

> 
>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
> 
> DAX pages do not participate in THP and do not have the
> PageTransCompound accounting. The only mechanism that records the
> mapping size for dax is the page tables themselves.

What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?

-Liran

> 
> 
>> 
>>> 
>>> Note that KVM already faulted in the page (or huge page) in the host's
>>> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
>>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>>> 
>>> Signed-off-by: Barret Rhoden <brho@google.com>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>>> 1 file changed, 32 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 6f92b40d798c..cd07bc4e595f 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>>>      return -EFAULT;
>>> }
>>> 
>>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>> +{
>>> +     struct page *page = pfn_to_page(pfn);
>>> +     unsigned long hva;
>>> +
>>> +     if (!is_zone_device_page(page))
>>> +             return PageTransCompoundMap(page);
>>> +
>>> +     /*
>>> +      * DAX pages do not use compound pages.  The page should have already
>>> +      * been mapped into the host-side page table during try_async_pf(), so
>>> +      * we can check the page tables directly.
>>> +      */
>>> +     hva = gfn_to_hva(kvm, gfn);
>>> +     if (kvm_is_error_hva(hva))
>>> +             return false;
>>> +
>>> +     /*
>>> +      * Our caller grabbed the KVM mmu_lock with a successful
>>> +      * mmu_notifier_retry, so we're safe to walk the page table.
>>> +      */
>>> +     switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> 
>> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
>> Was this changed by a commit I missed?
>> 
>> -Liran
>> 
>>> +     case PMD_SHIFT:
>>> +     case PUD_SIZE:
>>> +             return true;
>>> +     }
>>> +     return false;
>>> +}
>>> +
>>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>                                      gfn_t gfn, kvm_pfn_t *pfnp,
>>>                                      int *levelp)
>>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>       * here.
>>>       */
>>>      if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>>> -         !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>>> -         PageTransCompoundMap(pfn_to_page(pfn)) &&
>>> +         level == PT_PAGE_TABLE_LEVEL &&
>>> +         pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>>>          !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>>>              unsigned long mask;
>>>              /*
>>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>>               * mapping if the indirect sp has level = 1.
>>>               */
>>>              if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>>> -                 !kvm_is_zone_device_pfn(pfn) &&
>>> -                 PageTransCompoundMap(pfn_to_page(pfn))) {
>>> +                 pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>>>                      pte_list_remove(rmap_head, sptep);
>>> 
>>>                      if (kvm_available_flush_tlb_with_range())
>>> --
>>> 2.24.0.525.g8f36a354ae-goog
>>> 
>> 



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 17:34   ` Sean Christopherson
  2019-12-12 17:37     ` Dan Williams
@ 2019-12-12 17:45     ` Liran Alon
  1 sibling, 0 replies; 19+ messages in thread
From: Liran Alon @ 2019-12-12 17:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Barret Rhoden, Paolo Bonzini, Dan Williams, David Hildenbrand,
	Dave Jiang, Alexander Duyck, linux-nvdimm, x86, kvm,
	linux-kernel, jason.zeng



> On 12 Dec 2019, at 19:34, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
>> This change allows KVM to map DAX-backed files made of huge pages with
>> huge mappings in the EPT/TDP.
>> 
>> DAX pages are not PageTransCompound.  The existing check is trying to
>> determine if the mapping for the pfn is a huge mapping or not.  For
>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>> For DAX, we can check the page table itself.
>> 
>> Note that KVM already faulted in the page (or huge page) in the host's
>> page table, and we hold the KVM mmu spinlock.  We grabbed that lock in
>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>> 
>> Signed-off-by: Barret Rhoden <brho@google.com>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>> 1 file changed, 32 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 6f92b40d798c..cd07bc4e595f 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>> 	return -EFAULT;
>> }
>> 
>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>> +{
>> +	struct page *page = pfn_to_page(pfn);
>> +	unsigned long hva;
>> +
>> +	if (!is_zone_device_page(page))
>> +		return PageTransCompoundMap(page);
>> +
>> +	/*
>> +	 * DAX pages do not use compound pages.  The page should have already
>> +	 * been mapped into the host-side page table during try_async_pf(), so
>> +	 * we can check the page tables directly.
>> +	 */
>> +	hva = gfn_to_hva(kvm, gfn);
>> +	if (kvm_is_error_hva(hva))
>> +		return false;
>> +
>> +	/*
>> +	 * Our caller grabbed the KVM mmu_lock with a successful
>> +	 * mmu_notifier_retry, so we're safe to walk the page table.
>> +	 */
>> +	switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> +	case PMD_SHIFT:
>> +	case PUD_SIZE:
> 
> I assume this means DAX can have 1GB pages?  I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB.  I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

KVM doesn’t handle 1GB correctly for all types of non-hugetlbfs pages.
One example we have noticed internally but haven’t submitted an upstream patch for yet is
pages without a “struct page”. In this case, hva_to_pfn() will notice that vma->vm_flags has VM_PFNMAP set
and call hva_to_pfn_remapped() -> follow_pfn().
However, follow_pfn() currently just calls follow_pte(), which uses __follow_pte_pmd(), which doesn’t handle a huge PUD entry.
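
Roughly, the path in question looks like this (a heavily trimmed sketch
of virt/kvm/kvm_main.c from this era; the real function takes more
parameters and handles the write-fault cases):

static int hva_to_pfn_remapped(struct vm_area_struct *vma,
			       unsigned long addr, kvm_pfn_t *p_pfn)
{
	unsigned long pfn;
	int r;

	/* No struct page to pin; resolve the pfn from the page tables. */
	r = follow_pfn(vma, addr, &pfn);
	if (r)
		return r;

	*p_pfn = pfn;
	return 0;
}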

> 
> The easiest thing is probably to rebase on kvm/queue.  You'll need to do
> that anyways, and I suspect doing so will help shake out any hiccups.
> 
> [*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com
> 
>> +		return true;
>> +	}
>> +	return false;
>> +}
>> +
>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> 					gfn_t gfn, kvm_pfn_t *pfnp,
>> 					int *levelp)
>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> 	 * here.
>> 	 */
>> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
>> +	    level == PT_PAGE_TABLE_LEVEL &&
>> +	    pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>> 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>> 		unsigned long mask;
>> 		/*
>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> 		 * mapping if the indirect sp has level = 1.
>> 		 */
>> 		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>> -		    !kvm_is_zone_device_pfn(pfn) &&
>> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
>> +		    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>> 			pte_list_remove(rmap_head, sptep);
>> 
>> 			if (kvm_available_flush_tlb_with_range())
>> -- 
>> 2.24.0.525.g8f36a354ae-goog
>> 



* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 17:39       ` Liran Alon
@ 2019-12-12 17:59         ` Dan Williams
  2019-12-12 18:32           ` Liran Alon
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-12-12 17:59 UTC (permalink / raw)
  To: Liran Alon
  Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason

On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
> >>
> >>
> >>
> >>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
> >>>
> >>> This change allows KVM to map DAX-backed files made of huge pages with
> >>> huge mappings in the EPT/TDP.
> >>>
> >>> DAX pages are not PageTransCompound.  The existing check is trying to
> >>> determine if the mapping for the pfn is a huge mapping or not.  For
> >>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> >>> For DAX, we can check the page table itself.
> >>
> >> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> >> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> >> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
> >>
> >> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
> >>
> >> Therefore, it seems more logical to me that:
> >> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
> >
> > A given dax-mapped vma may have mixed page sizes so ->page_size()
> > can't be used reliably to enumerating the mapping size.
>
> Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()?

Filesystems traditionally have not populated ->pagesize() in their
vm_operations, there was no compelling reason to go add it and the
complexity seems prohibitive.

> What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner.

It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE. Huge
pages in the page cache have a similar issue.

> >> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
> >
> > DAX pages do not participate in THP and do not have the
> > PageTransCompound accounting. The only mechanism that records the
> > mapping size for dax is the page tables themselves.
>
> What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?

THP accounting is a function of the page allocator. ZONE_DEVICE pages
are excluded from the page allocator. ZONE_DEVICE is just enough
infrastructure to support pfn_to_page(), page_address(), and
get_user_pages(). Other page allocator services beyond that are not
present.


* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 17:59         ` Dan Williams
@ 2019-12-12 18:32           ` Liran Alon
  0 siblings, 0 replies; 19+ messages in thread
From: Liran Alon @ 2019-12-12 18:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang,
	Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason



> On 12 Dec 2019, at 19:59, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote:
>>>>> 
>>>>> This change allows KVM to map DAX-backed files made of huge pages with
>>>>> huge mappings in the EPT/TDP.
>>>>> 
>>>>> DAX pages are not PageTransCompound.  The existing check is trying to
>>>>> determine if the mapping for the pfn is a huge mapping or not.  For
>>>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>>>> For DAX, we can check the page table itself.
>>>> 
>>>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>>>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>>>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>>>> 
>>>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>>>> 
>>>> Therefore, it seems more logical to me that:
>>>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
>>> 
>>> A given dax-mapped vma may have mixed page sizes so ->page_size()
>>> can't be used reliably to enumerating the mapping size.
>> 
>> Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()?
> 
> Filesystems traditionally have not populated ->pagesize() in their
> vm_operations, there was no compelling reason to go add it and the
> complexity seems prohibitive.

I understand. Though this is technical debt that breaks ->page_size() semantics, which might cause a complex bug some day...

> 
>> What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner.
> 
> It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE.

Of course. :) I meant it may be unexpected by the caller.

> Huge
> pages in the page cache has a similar issue.

Ok. I haven’t known that. Thanks for the explanation.

> 
>>>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
>>> 
>>> DAX pages do not participate in THP and do not have the
>>> PageTransCompound accounting. The only mechanism that records the
>>> mapping size for dax is the page tables themselves.
>> 
>> What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?
> 
> THP accounting is a function of the page allocator. ZONE_DEVICE pages
> are excluded from the page allocator. ZONE_DEVICE is just enough
> infrastructure to support pfn_to_page(), page_address(), and
> get_user_pages(). Other page allocator services beyond that are not
> present.

Ok.




* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 17:37     ` Dan Williams
@ 2019-12-12 19:16       ` Barret Rhoden
  2019-12-12 19:48         ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Barret Rhoden @ 2019-12-12 19:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Dave Jiang, Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason

On 12/12/19 12:37 PM, Dan Williams wrote:
> Yeah, since device-dax is the only path to support longterm page
> pinning for vfio device assignment, testing with device-dax + 1GB
> pages would be a useful sanity check.

What are the issues with fs-dax and page pinning?  Is that limitation 
something that is permanent and unfixable (by me or anyone)?

I'd like to put a lot more in a DAX/pmem region than just a guest's 
memory, and having a mountable filesystem would be extremely convenient.

Thanks,

Barret




* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 19:16       ` Barret Rhoden
@ 2019-12-12 19:48         ` Dan Williams
  2019-12-12 20:08           ` Barret Rhoden
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-12-12 19:48 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Dave Jiang, Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason, Christoph Hellwig

On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote:
>
> On 12/12/19 12:37 PM, Dan Williams wrote:
> > Yeah, since device-dax is the only path to support longterm page
> > pinning for vfio device assignment, testing with device-dax + 1GB
> > pages would be a useful sanity check.
>
> What are the issues with fs-dax and page pinning?  Is that limitation
> something that is permanent and unfixable (by me or anyone)?

It's a surprisingly painful point of contention...

File-backed DAX pages cannot be truncated while the page is pinned,
because the pin may indicate that DMA is ongoing to the file block /
DAX page. When that pin is from RDMA or VFIO, that creates a situation
where filesystem operations are blocked indefinitely. More details
here: 94db151dc892 "vfio: disable filesystem-dax page pinning".

Currently, to prevent the deadlock, RDMA, VFIO, and IO_URING memory
registration is blocked if the mapping is filesystem-dax backed (see
the FOLL_LONGTERM flag to get_user_pages).
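
A minimal sketch of that policy check, using a hypothetical helper name
(the real logic lives in the FOLL_LONGTERM paths of get_user_pages):

/* Hypothetical summary helper, not an actual kernel function. */
static bool vma_allows_longterm_pin(struct vm_area_struct *vma)
{
	/*
	 * Long-term pins are refused for filesystem-dax, since truncate
	 * could block indefinitely behind the pin.  Device-dax and
	 * regular memory are allowed.
	 */
	return !vma_is_fsdax(vma);
}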

One of the proposals to break the impasse was to allow the filesystem
to forcibly revoke the mapping. I.e. to use the IOMMU to forcibly kick
the RDMA device out of its registration. That was rejected by RDMA
folks because RDMA applications are not prepared for this revocation
to happen and the application that performed the registration may not
be the application that uses the registration. There was an attempt to
use a file lease to indicate the presence of a file /
memory-registration that is blocking file-system operations, but that
was still less palatable to filesystem folks than just keeping the
status quo of blocking longterm pinning.

That said, the VFIO use case seems a different situation than RDMA.
There's often a 1:1 relationship between the application performing
the memory registration and the application consuming it, the VMM, and
there is always an IOMMU present that could revoke access and kill the
guest if the mapping got truncated. It seems in theory that VFIO could
tolerate a "revoke pin on truncate" mechanism where RDMA could not.

> I'd like to put a lot more in a DAX/pmem region than just a guest's
> memory, and having a mountable filesystem would be extremely convenient.

Why would page pinning be involved in allowing the guest to mount a
filesystem on guest-pmem? That already works today, it's just the
device-passthrough that causes guest memory to be pinned indefinitely.


* Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
  2019-12-12 19:48         ` Dan Williams
@ 2019-12-12 20:08           ` Barret Rhoden
  0 siblings, 0 replies; 19+ messages in thread
From: Barret Rhoden @ 2019-12-12 20:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Dave Jiang, Alexander Duyck, linux-nvdimm, X86 ML, KVM list,
	Linux Kernel Mailing List, Zeng, Jason, Christoph Hellwig

On 12/12/19 2:48 PM, Dan Williams wrote:
> On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote:
>>
>> On 12/12/19 12:37 PM, Dan Williams wrote:
>>> Yeah, since device-dax is the only path to support longterm page
>>> pinning for vfio device assignment, testing with device-dax + 1GB
>>> pages would be a useful sanity check.
>>
>> What are the issues with fs-dax and page pinning?  Is that limitation
>> something that is permanent and unfixable (by me or anyone)?
> 
> It's a surprisingly painful point of contention...

Thanks for the info; I'll check out those threads.

[snip]

>> I'd like to put a lot more in a DAX/pmem region than just a guest's
>> memory, and having a mountable filesystem would be extremely convenient.
> 
> Why would page pinning be involved in allowing the guest to mount a
> filesystem on guest-pmem? That already works today, it's just the
> device-passthrough that causes guest memory to be pinned indefinitely.

I'd like to mount the pmem filesystem on the *host* and use its files 
for the guest's memory.  So far I've just been making an ext4 FS on 
/dev/pmem0 and creating a bunch of files in the FS.  Some of the files 
are the guest memory: one file for each VM.  Other files are just 
metadata that the host uses.

That all works right now, but I'd also like to use VFIO with the guests.

Thanks,

Barret




