linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages
@ 2021-03-21 18:45 Thomas Hellström (Intel)
  2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
  2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
  0 siblings, 2 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-21 18:45 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström (Intel),
	Christian Koenig, David Airlie, Daniel Vetter, Andrew Morton,
	Jason Gunthorpe, linux-mm, linux-kernel

get_user_pages() to TTM pages is uwanted since TTM assumes it owns
the pages exclusively and / or sets up page-table mappings to io memory.

The first patch make sures we stop fast gup to huge TTM pages using
a trick with pmd_devmap() and pud_devmap() without a backing
dev_pagemap.

The second patch makes sure we block normal gup by setting VM_PFNMAP

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-21 18:45 [RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages Thomas Hellström (Intel)
@ 2021-03-21 18:45 ` Thomas Hellström (Intel)
  2021-03-23 11:34   ` Daniel Vetter
                     ` (2 more replies)
  2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
  1 sibling, 3 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-21 18:45 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström (Intel),
	Christian Koenig, David Airlie, Daniel Vetter, Andrew Morton,
	Jason Gunthorpe, linux-mm, linux-kernel

TTM sets up huge page-table-entries both to system- and device memory,
and we don't want gup to assume there are always valid backing struct
pages for these. For PTEs this is handled by setting the pte_special bit,
but for the huge PUDs and PMDs, we have neither pmd_special nor
pud_special. Normally, huge TTM entries are identified by looking at
vma_is_special_huge(), but fast gup can't do that, so as an alternative
define _devmap entries for which there are no backing dev_pagemap as
special, update documentation and make huge TTM entries _devmap, after
verifying that there is no backing dev_pagemap.

One other alternative would be to block TTM huge page-table-entries
completely, and while currently only vmwgfx use them, they would be
beneficial to other graphis drivers moving forward as well.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 ++++++++++++++++-
 mm/gup.c                        | 21 +++++++++++----------
 mm/memremap.c                   |  5 +++++
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf66744..1c34983480e5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
 	pfn_t pfnt;
 	struct ttm_tt *ttm = bo->ttm;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	struct dev_pagemap *pagemap;
 
 	/* Fault should not cross bo boundary. */
 	page_offset &= ~(fault_page_size - 1);
@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
 	if ((pfn & (fault_page_size - 1)) != 0)
 		goto out_fallback;
 
+	/*
+	 * Huge entries must be special, that is marking them as devmap
+	 * with no backing device map range. If there is a backing
+	 * range, Don't insert a huge entry.
+	 * If this check turns out to be too much of a performance hit,
+	 * we can instead have drivers indicate whether they may have
+	 * backing device map ranges and if not, skip this lookup.
+	 */
+	pagemap = get_dev_pagemap(pfn, NULL);
+	if (pagemap) {
+		put_dev_pagemap(pagemap);
+		goto out_fallback;
+	}
+
 	/* Check that memory is contiguous. */
 	if (!bo->mem.bus.is_iomem) {
 		for (i = 1; i < fault_page_size; ++i) {
@@ -223,7 +238,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
 		}
 	}
 
-	pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
+	pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
 	if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
 		ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..1b6a127f0bdd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 }
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
+ * special. For THP, special huge entries are indicated by xxx_devmap()
+ * returning true, but a corresponding call to get_dev_pagemap() will
+ * return NULL.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * get_user_pages_fast_only implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
@@ -2069,16 +2080,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	return ret;
 }
 #else
-
-/*
- * If we can't determine whether or not a pte is special, then fail immediately
- * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
- * to be special.
- *
- * For a futex to be placed on a THP tail page, get_futex_key requires a
- * get_user_pages_fast_only implementation that can pin pages. Thus it's still
- * useful to have gup_huge_pmd even if we can't operate on ptes.
- */
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
diff --git a/mm/memremap.c b/mm/memremap.c
index 7aa7d6e80ee5..757551cd2a4d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -471,6 +471,11 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
+ *
+ * Return: A referenced pointer to a struct dev_pagemap containing @pfn,
+ * or NULL if there was no such pagemap registered. For interpretion
+ * of NULL returns for pfns extracted from valid huge page table entries,
+ * please see gup_pte_range().
  */
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-21 18:45 [RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages Thomas Hellström (Intel)
  2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
@ 2021-03-21 18:45 ` Thomas Hellström (Intel)
  2021-03-22  7:47   ` Christian König
                     ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-21 18:45 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström (Intel),
	Christian Koenig, David Airlie, Daniel Vetter, Andrew Morton,
	Jason Gunthorpe, linux-mm, linux-kernel

To block fast gup we need to make sure TTM ptes are always special.
With MIXEDMAP we, on architectures that don't support pte_special,
insert normal ptes, but OTOH on those architectures, fast is not
supported.
At the same time, the function documentation to vm_normal_page() suggests
that ptes pointing to system memory pages of MIXEDMAP vmas are always
normal, but that doesn't seem consistent with what's implemented in
vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
needed.

But to make sure and to avoid also normal (non-fast) gup, make all
TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
anymore so make is_cow_mapping() available and use it to reject
COW mappigs at mmap time.

There was previously a comment in the code that WC mappings together
with x86 PAT + PFNMAP was bad for performance. However from looking at
vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
are handled the same for architectures that support pte_special. This
means there should not be a performance difference anymore, but this
needs to be verified.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
 include/linux/mm.h              |  5 +++++
 mm/internal.h                   |  5 -----
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 1c34983480e5..708c6fb9be81 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 		 * at arbitrary times while the data is mmap'ed.
 		 * See vmf_insert_mixed_prot() for a discussion.
 		 */
-		if (vma->vm_flags & VM_MIXEDMAP)
-			ret = vmf_insert_mixed_prot(vma, address,
-						    __pfn_to_pfn_t(pfn, PFN_DEV),
-						    prot);
-		else
-			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
 
 		/* Never error on prefaulted PTEs */
 		if (unlikely((ret & VM_FAULT_ERROR))) {
@@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s
 	 * Note: We're transferring the bo reference to
 	 * vma->vm_private_data here.
 	 */
-
 	vma->vm_private_data = bo;
 
 	/*
-	 * We'd like to use VM_PFNMAP on shared mappings, where
-	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
-	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
-	 * bad for performance. Until that has been sorted out, use
-	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
+	 * PFNMAP forces us to block COW mappings in mmap(),
+	 * and with MIXEDMAP we would incorrectly allow fast gup
+	 * on TTM memory on architectures that don't have pte_special.
 	 */
-	vma->vm_flags |= VM_MIXEDMAP;
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 }
 
 int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
@@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
 	if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
 		return -EINVAL;
 
+	if (unlikely(is_cow_mapping(vma->vm_flags)))
+		return -EINVAL;
+
 	bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
 	if (unlikely(!bo))
 		return -EINVAL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..c6ebf7f9ddbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool is_cow_mapping(vm_flags_t flags)
+{
+	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..1432feec62df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
  */
 #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
 
-static inline bool is_cow_mapping(vm_flags_t flags)
-{
-	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-}
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
@ 2021-03-22  7:47   ` Christian König
  2021-03-22  8:13     ` Thomas Hellström (Intel)
  2021-03-23 11:47   ` Daniel Vetter
  2021-03-23 14:00   ` Jason Gunthorpe
  2 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-22  7:47 UTC (permalink / raw)
  To: Thomas Hellström (Intel), dri-devel
  Cc: David Airlie, Daniel Vetter, Andrew Morton, Jason Gunthorpe,
	linux-mm, linux-kernel

Am 21.03.21 um 19:45 schrieb Thomas Hellström (Intel):
> To block fast gup we need to make sure TTM ptes are always special.
> With MIXEDMAP we, on architectures that don't support pte_special,
> insert normal ptes, but OTOH on those architectures, fast is not
> supported.
> At the same time, the function documentation to vm_normal_page() suggests
> that ptes pointing to system memory pages of MIXEDMAP vmas are always
> normal, but that doesn't seem consistent with what's implemented in
> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
> needed.
>
> But to make sure and to avoid also normal (non-fast) gup, make all
> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
> anymore so make is_cow_mapping() available and use it to reject
> COW mappigs at mmap time.

I would separate the disallowing of COW mapping from the PFN change. I'm 
pretty sure that COW mappings never worked on TTM BOs in the first place.

But either way this patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>.

Thanks,
Christian.

>
> There was previously a comment in the code that WC mappings together
> with x86 PAT + PFNMAP was bad for performance. However from looking at
> vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
> are handled the same for architectures that support pte_special. This
> means there should not be a performance difference anymore, but this
> needs to be verified.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
> ---
>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>   include/linux/mm.h              |  5 +++++
>   mm/internal.h                   |  5 -----
>   3 files changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 1c34983480e5..708c6fb9be81 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>   		 * at arbitrary times while the data is mmap'ed.
>   		 * See vmf_insert_mixed_prot() for a discussion.
>   		 */
> -		if (vma->vm_flags & VM_MIXEDMAP)
> -			ret = vmf_insert_mixed_prot(vma, address,
> -						    __pfn_to_pfn_t(pfn, PFN_DEV),
> -						    prot);
> -		else
> -			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
> +		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>   
>   		/* Never error on prefaulted PTEs */
>   		if (unlikely((ret & VM_FAULT_ERROR))) {
> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s
>   	 * Note: We're transferring the bo reference to
>   	 * vma->vm_private_data here.
>   	 */
> -
>   	vma->vm_private_data = bo;
>   
>   	/*
> -	 * We'd like to use VM_PFNMAP on shared mappings, where
> -	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
> -	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
> -	 * bad for performance. Until that has been sorted out, use
> -	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
> +	 * PFNMAP forces us to block COW mappings in mmap(),
> +	 * and with MIXEDMAP we would incorrectly allow fast gup
> +	 * on TTM memory on architectures that don't have pte_special.
>   	 */
> -	vma->vm_flags |= VM_MIXEDMAP;
> -	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>   }
>   
>   int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>   	if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>   		return -EINVAL;
>   
> +	if (unlikely(is_cow_mapping(vma->vm_flags)))
> +		return -EINVAL;
> +
>   	bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>   	if (unlikely(!bo))
>   		return -EINVAL;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..c6ebf7f9ddbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>   	return vma->vm_flags & VM_ACCESS_FLAGS;
>   }
>   
> +static inline bool is_cow_mapping(vm_flags_t flags)
> +{
> +	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> +}
> +
>   #ifdef CONFIG_SHMEM
>   /*
>    * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/internal.h b/mm/internal.h
> index 9902648f2206..1432feec62df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
>    */
>   #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
>   
> -static inline bool is_cow_mapping(vm_flags_t flags)
> -{
> -	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> -}
> -
>   /*
>    * These three helpers classifies VMAs for virtual memory accounting.
>    */



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-22  7:47   ` Christian König
@ 2021-03-22  8:13     ` Thomas Hellström (Intel)
  2021-03-23 11:57       ` Christian König
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-22  8:13 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: David Airlie, Daniel Vetter, Andrew Morton, Jason Gunthorpe,
	linux-mm, linux-kernel

Hi!

On 3/22/21 8:47 AM, Christian König wrote:
> Am 21.03.21 um 19:45 schrieb Thomas Hellström (Intel):
>> To block fast gup we need to make sure TTM ptes are always special.
>> With MIXEDMAP we, on architectures that don't support pte_special,
>> insert normal ptes, but OTOH on those architectures, fast is not
>> supported.
>> At the same time, the function documentation to vm_normal_page() 
>> suggests
>> that ptes pointing to system memory pages of MIXEDMAP vmas are always
>> normal, but that doesn't seem consistent with what's implemented in
>> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
>> needed.
>>
>> But to make sure and to avoid also normal (non-fast) gup, make all
>> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
>> anymore so make is_cow_mapping() available and use it to reject
>> COW mappigs at mmap time.
>
> I would separate the disallowing of COW mapping from the PFN change. 
> I'm pretty sure that COW mappings never worked on TTM BOs in the first 
> place.

COW doesn't work with PFNMAP together with non-linear maps, so as a 
consequence from moving from MIXEDMAP to PFNMAP we must disallow COW, so 
it seems logical to me to do it in one patch.

And working COW was one of the tests I used for huge PMDs/PUDs, so it 
has indeed been working, but I can't think of any relevant use-cases.

Did you, BTW, have a chance to test this with WC mappings?

Thanks,
/Thomas



>
> But either way this patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>.
>
> Thanks,
> Christian.
>
>>
>> There was previously a comment in the code that WC mappings together
>> with x86 PAT + PFNMAP was bad for performance. However from looking at
>> vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
>> are handled the same for architectures that support pte_special. This
>> means there should not be a performance difference anymore, but this
>> needs to be verified.
>>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@linux.ie>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: linux-mm@kvack.org
>> Cc: dri-devel@lists.freedesktop.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>>   include/linux/mm.h              |  5 +++++
>>   mm/internal.h                   |  5 -----
>>   3 files changed, 13 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 1c34983480e5..708c6fb9be81 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct 
>> vm_fault *vmf,
>>            * at arbitrary times while the data is mmap'ed.
>>            * See vmf_insert_mixed_prot() for a discussion.
>>            */
>> -        if (vma->vm_flags & VM_MIXEDMAP)
>> -            ret = vmf_insert_mixed_prot(vma, address,
>> -                            __pfn_to_pfn_t(pfn, PFN_DEV),
>> -                            prot);
>> -        else
>> -            ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>> +        ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>>             /* Never error on prefaulted PTEs */
>>           if (unlikely((ret & VM_FAULT_ERROR))) {
>> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct 
>> ttm_buffer_object *bo, struct vm_area_s
>>        * Note: We're transferring the bo reference to
>>        * vma->vm_private_data here.
>>        */
>> -
>>       vma->vm_private_data = bo;
>>         /*
>> -     * We'd like to use VM_PFNMAP on shared mappings, where
>> -     * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>> -     * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>> -     * bad for performance. Until that has been sorted out, use
>> -     * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>> +     * PFNMAP forces us to block COW mappings in mmap(),
>> +     * and with MIXEDMAP we would incorrectly allow fast gup
>> +     * on TTM memory on architectures that don't have pte_special.
>>        */
>> -    vma->vm_flags |= VM_MIXEDMAP;
>> -    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>> +    vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>   }
>>     int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct 
>> vm_area_struct *vma,
>>       if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>>           return -EINVAL;
>>   +    if (unlikely(is_cow_mapping(vma->vm_flags)))
>> +        return -EINVAL;
>> +
>>       bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>>       if (unlikely(!bo))
>>           return -EINVAL;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 77e64e3eac80..c6ebf7f9ddbb 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct 
>> vm_area_struct *vma)
>>       return vma->vm_flags & VM_ACCESS_FLAGS;
>>   }
>>   +static inline bool is_cow_mapping(vm_flags_t flags)
>> +{
>> +    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>> +}
>> +
>>   #ifdef CONFIG_SHMEM
>>   /*
>>    * The vma_is_shmem is not inline because it is used only by slow
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9902648f2206..1432feec62df 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct 
>> page *page)
>>    */
>>   #define buddy_order_unsafe(page) READ_ONCE(page_private(page))
>>   -static inline bool is_cow_mapping(vm_flags_t flags)
>> -{
>> -    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>> -}
>> -
>>   /*
>>    * These three helpers classifies VMAs for virtual memory accounting.
>>    */


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
@ 2021-03-23 11:34   ` Daniel Vetter
  2021-03-23 16:34     ` Thomas Hellström (Intel)
  2021-03-23 13:52   ` Jason Gunthorpe
  2021-03-23 19:52   ` Williams, Dan J
  2 siblings, 1 reply; 63+ messages in thread
From: Daniel Vetter @ 2021-03-23 11:34 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, Jason Gunthorpe, linux-mm, linux-kernel

On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:
> TTM sets up huge page-table-entries both to system- and device memory,
> and we don't want gup to assume there are always valid backing struct
> pages for these. For PTEs this is handled by setting the pte_special bit,
> but for the huge PUDs and PMDs, we have neither pmd_special nor
> pud_special. Normally, huge TTM entries are identified by looking at
> vma_is_special_huge(), but fast gup can't do that, so as an alternative
> define _devmap entries for which there are no backing dev_pagemap as
> special, update documentation and make huge TTM entries _devmap, after
> verifying that there is no backing dev_pagemap.
> 
> One other alternative would be to block TTM huge page-table-entries
> completely, and while currently only vmwgfx use them, they would be
> beneficial to other graphis drivers moving forward as well.
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 ++++++++++++++++-
>  mm/gup.c                        | 21 +++++++++++----------
>  mm/memremap.c                   |  5 +++++
>  3 files changed, 32 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 6dc96cf66744..1c34983480e5 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>  	pfn_t pfnt;
>  	struct ttm_tt *ttm = bo->ttm;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +	struct dev_pagemap *pagemap;
>  
>  	/* Fault should not cross bo boundary. */
>  	page_offset &= ~(fault_page_size - 1);
> @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>  	if ((pfn & (fault_page_size - 1)) != 0)
>  		goto out_fallback;
>  
> +	/*
> +	 * Huge entries must be special, that is marking them as devmap
> +	 * with no backing device map range. If there is a backing
> +	 * range, Don't insert a huge entry.
> +	 * If this check turns out to be too much of a performance hit,
> +	 * we can instead have drivers indicate whether they may have
> +	 * backing device map ranges and if not, skip this lookup.
> +	 */

I think we can do this statically:
- if it's system memory we know there's no devmap for it, and we do the
  trick to block gup_fast
- if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
  so might as well not do that

I think that would cover all cases without this check here?
-Daniel

> +	pagemap = get_dev_pagemap(pfn, NULL);
> +	if (pagemap) {
> +		put_dev_pagemap(pagemap);
> +		goto out_fallback;
> +	}
> +
>  	/* Check that memory is contiguous. */
>  	if (!bo->mem.bus.is_iomem) {
>  		for (i = 1; i < fault_page_size; ++i) {
> @@ -223,7 +238,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>  		}
>  	}
>  
> -	pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
> +	pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>  	if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>  		ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..1b6a127f0bdd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +/*
> + * If we can't determine whether or not a pte is special, then fail immediately
> + * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
> + * special. For THP, special huge entries are indicated by xxx_devmap()
> + * returning true, but a corresponding call to get_dev_pagemap() will
> + * return NULL.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
> @@ -2069,16 +2080,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  	return ret;
>  }
>  #else
> -
> -/*
> - * If we can't determine whether or not a pte is special, then fail immediately
> - * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> - * to be special.
> - *
> - * For a futex to be placed on a THP tail page, get_futex_key requires a
> - * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> - * useful to have gup_huge_pmd even if we can't operate on ptes.
> - */
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 7aa7d6e80ee5..757551cd2a4d 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -471,6 +471,11 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
>   *
>   * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
>   * is non-NULL but does not cover @pfn the reference to it will be released.
> + *
> + * Return: A referenced pointer to a struct dev_pagemap containing @pfn,
> + * or NULL if there was no such pagemap registered. For interpretion
> + * of NULL returns for pfns extracted from valid huge page table entries,
> + * please see gup_pte_range().
>   */
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  		struct dev_pagemap *pgmap)
> -- 
> 2.30.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
  2021-03-22  7:47   ` Christian König
@ 2021-03-23 11:47   ` Daniel Vetter
  2021-03-23 14:04     ` Jason Gunthorpe
  2021-03-23 14:00   ` Jason Gunthorpe
  2 siblings, 1 reply; 63+ messages in thread
From: Daniel Vetter @ 2021-03-23 11:47 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, Jason Gunthorpe, linux-mm, linux-kernel

On Sun, Mar 21, 2021 at 07:45:29PM +0100, Thomas Hellström (Intel) wrote:
> To block fast gup we need to make sure TTM ptes are always special.
> With MIXEDMAP we, on architectures that don't support pte_special,
> insert normal ptes, but OTOH on those architectures, fast is not
> supported.
> At the same time, the function documentation to vm_normal_page() suggests
> that ptes pointing to system memory pages of MIXEDMAP vmas are always
> normal, but that doesn't seem consistent with what's implemented in
> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
> needed.
> 
> But to make sure and to avoid also normal (non-fast) gup, make all
> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
> anymore so make is_cow_mapping() available and use it to reject
> COW mappigs at mmap time.
> 
> There was previously a comment in the code that WC mappings together
> with x86 PAT + PFNMAP was bad for performance. However from looking at
> vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
> are handled the same for architectures that support pte_special. This
> means there should not be a performance difference anymore, but this
> needs to be verified.
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>  include/linux/mm.h              |  5 +++++
>  mm/internal.h                   |  5 -----
>  3 files changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 1c34983480e5..708c6fb9be81 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>  		 * at arbitrary times while the data is mmap'ed.
>  		 * See vmf_insert_mixed_prot() for a discussion.
>  		 */
> -		if (vma->vm_flags & VM_MIXEDMAP)
> -			ret = vmf_insert_mixed_prot(vma, address,
> -						    __pfn_to_pfn_t(pfn, PFN_DEV),
> -						    prot);
> -		else
> -			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
> +		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>  
>  		/* Never error on prefaulted PTEs */
>  		if (unlikely((ret & VM_FAULT_ERROR))) {
> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s
>  	 * Note: We're transferring the bo reference to
>  	 * vma->vm_private_data here.
>  	 */
> -
>  	vma->vm_private_data = bo;
>  
>  	/*
> -	 * We'd like to use VM_PFNMAP on shared mappings, where
> -	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
> -	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
> -	 * bad for performance. Until that has been sorted out, use
> -	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
> +	 * PFNMAP forces us to block COW mappings in mmap(),
> +	 * and with MIXEDMAP we would incorrectly allow fast gup
> +	 * on TTM memory on architectures that don't have pte_special.
>  	 */
> -	vma->vm_flags |= VM_MIXEDMAP;
> -	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>  }
>  
>  int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>  	if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>  		return -EINVAL;
>  
> +	if (unlikely(is_cow_mapping(vma->vm_flags)))
> +		return -EINVAL;
> +
>  	bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>  	if (unlikely(!bo))
>  		return -EINVAL;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..c6ebf7f9ddbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_ACCESS_FLAGS;
>  }
>  
> +static inline bool is_cow_mapping(vm_flags_t flags)

Bit a bikeshed, but I wonder whether the public interface shouldn't be
vma_is_cow_mapping. Or whether this shouldn't be rejected somewhere else,
since at least in drivers/gpu we have tons of cases that don't check for
this and get it all kinds of wrong I think.

remap_pfn_range handles this for many cases, but by far not for all.

Anyway patch itself lgtm:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I'll try and find some -mm folks to look at this too.
-Daniel

> +{
> +	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> +}
> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/internal.h b/mm/internal.h
> index 9902648f2206..1432feec62df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
>   */
>  #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
>  
> -static inline bool is_cow_mapping(vm_flags_t flags)
> -{
> -	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> -}
> -
>  /*
>   * These three helpers classifies VMAs for virtual memory accounting.
>   */
> -- 
> 2.30.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-22  8:13     ` Thomas Hellström (Intel)
@ 2021-03-23 11:57       ` Christian König
  0 siblings, 0 replies; 63+ messages in thread
From: Christian König @ 2021-03-23 11:57 UTC (permalink / raw)
  To: Thomas Hellström (Intel), dri-devel
  Cc: David Airlie, Daniel Vetter, Andrew Morton, Jason Gunthorpe,
	linux-mm, linux-kernel



Am 22.03.21 um 09:13 schrieb Thomas Hellström (Intel):
> Hi!
>
> On 3/22/21 8:47 AM, Christian König wrote:
>> Am 21.03.21 um 19:45 schrieb Thomas Hellström (Intel):
>>> To block fast gup we need to make sure TTM ptes are always special.
>>> With MIXEDMAP we, on architectures that don't support pte_special,
>>> insert normal ptes, but OTOH on those architectures, fast is not
>>> supported.
>>> At the same time, the function documentation to vm_normal_page() 
>>> suggests
>>> that ptes pointing to system memory pages of MIXEDMAP vmas are always
>>> normal, but that doesn't seem consistent with what's implemented in
>>> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
>>> needed.
>>>
>>> But to make sure and to avoid also normal (non-fast) gup, make all
>>> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
>>> anymore so make is_cow_mapping() available and use it to reject
>>> COW mappigs at mmap time.
>>
>> I would separate the disallowing of COW mapping from the PFN change. 
>> I'm pretty sure that COW mappings never worked on TTM BOs in the 
>> first place.
>
> COW doesn't work with PFNMAP together with non-linear maps, so as a 
> consequence from moving from MIXEDMAP to PFNMAP we must disallow COW, 
> so it seems logical to me to do it in one patch.
>
> And working COW was one of the tests I used for huge PMDs/PUDs, so it 
> has indeed been working, but I can't think of any relevant use-cases.

Ok, going to keep that in mind. I was assuming COW didn't worked before 
on TTM pages.

> Did you, BTW, have a chance to test this with WC mappings?

I'm going to give this a full piglit round, but currently I'm busy with 
internal testing.

Thanks,
Christian.

>
> Thanks,
> /Thomas
>
>
>
>>
>> But either way this patch is Reviewed-by: Christian König 
>> <christian.koenig@amd.com>.
>>
>> Thanks,
>> Christian.
>>
>>>
>>> There was previously a comment in the code that WC mappings together
>>> with x86 PAT + PFNMAP was bad for performance. However from looking at
>>> vmf_insert_mixed() it looks like in the current code PFNMAP and 
>>> MIXEDMAP
>>> are handled the same for architectures that support pte_special. This
>>> means there should not be a performance difference anymore, but this
>>> needs to be verified.
>>>
>>> Cc: Christian Koenig <christian.koenig@amd.com>
>>> Cc: David Airlie <airlied@linux.ie>
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>> Cc: linux-mm@kvack.org
>>> Cc: dri-devel@lists.freedesktop.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
>>> ---
>>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>>>   include/linux/mm.h              |  5 +++++
>>>   mm/internal.h                   |  5 -----
>>>   3 files changed, 13 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>> index 1c34983480e5..708c6fb9be81 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct 
>>> vm_fault *vmf,
>>>            * at arbitrary times while the data is mmap'ed.
>>>            * See vmf_insert_mixed_prot() for a discussion.
>>>            */
>>> -        if (vma->vm_flags & VM_MIXEDMAP)
>>> -            ret = vmf_insert_mixed_prot(vma, address,
>>> -                            __pfn_to_pfn_t(pfn, PFN_DEV),
>>> -                            prot);
>>> -        else
>>> -            ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>>> +        ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>>>             /* Never error on prefaulted PTEs */
>>>           if (unlikely((ret & VM_FAULT_ERROR))) {
>>> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct 
>>> ttm_buffer_object *bo, struct vm_area_s
>>>        * Note: We're transferring the bo reference to
>>>        * vma->vm_private_data here.
>>>        */
>>> -
>>>       vma->vm_private_data = bo;
>>>         /*
>>> -     * We'd like to use VM_PFNMAP on shared mappings, where
>>> -     * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>>> -     * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>>> -     * bad for performance. Until that has been sorted out, use
>>> -     * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>>> +     * PFNMAP forces us to block COW mappings in mmap(),
>>> +     * and with MIXEDMAP we would incorrectly allow fast gup
>>> +     * on TTM memory on architectures that don't have pte_special.
>>>        */
>>> -    vma->vm_flags |= VM_MIXEDMAP;
>>> -    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>> +    vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>>   }
>>>     int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>>> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct 
>>> vm_area_struct *vma,
>>>       if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>>>           return -EINVAL;
>>>   +    if (unlikely(is_cow_mapping(vma->vm_flags)))
>>> +        return -EINVAL;
>>> +
>>>       bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>>>       if (unlikely(!bo))
>>>           return -EINVAL;
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 77e64e3eac80..c6ebf7f9ddbb 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct 
>>> vm_area_struct *vma)
>>>       return vma->vm_flags & VM_ACCESS_FLAGS;
>>>   }
>>>   +static inline bool is_cow_mapping(vm_flags_t flags)
>>> +{
>>> +    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>>> +}
>>> +
>>>   #ifdef CONFIG_SHMEM
>>>   /*
>>>    * The vma_is_shmem is not inline because it is used only by slow
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 9902648f2206..1432feec62df 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct 
>>> page *page)
>>>    */
>>>   #define buddy_order_unsafe(page) READ_ONCE(page_private(page))
>>>   -static inline bool is_cow_mapping(vm_flags_t flags)
>>> -{
>>> -    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>>> -}
>>> -
>>>   /*
>>>    * These three helpers classifies VMAs for virtual memory accounting.
>>>    */



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
  2021-03-23 11:34   ` Daniel Vetter
@ 2021-03-23 13:52   ` Jason Gunthorpe
  2021-03-23 15:05     ` Thomas Hellström (Intel)
  2021-03-23 19:52   ` Williams, Dan J
  2 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 13:52 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, linux-mm, linux-kernel

On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..1b6a127f0bdd 100644
> +++ b/mm/gup.c
> @@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +/*
> + * If we can't determine whether or not a pte is special, then fail immediately
> + * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
> + * special. For THP, special huge entries are indicated by xxx_devmap()
> + * returning true, but a corresponding call to get_dev_pagemap() will
> + * return NULL.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */

Why move this comment? I think it was correct where it was

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
  2021-03-22  7:47   ` Christian König
  2021-03-23 11:47   ` Daniel Vetter
@ 2021-03-23 14:00   ` Jason Gunthorpe
  2021-03-23 15:46     ` Thomas Hellström (Intel)
  2 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 14:00 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, linux-mm, linux-kernel

On Sun, Mar 21, 2021 at 07:45:29PM +0100, Thomas Hellström (Intel) wrote:
> To block fast gup we need to make sure TTM ptes are always special.
> With MIXEDMAP we, on architectures that don't support pte_special,
> insert normal ptes, but OTOH on those architectures, fast is not
> supported.
> At the same time, the function documentation to vm_normal_page() suggests
> that ptes pointing to system memory pages of MIXEDMAP vmas are always
> normal, but that doesn't seem consistent with what's implemented in
> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
> needed.
> 
> But to make sure and to avoid also normal (non-fast) gup, make all
> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
> anymore so make is_cow_mapping() available and use it to reject
> COW mappigs at mmap time.
> 
> There was previously a comment in the code that WC mappings together
> with x86 PAT + PFNMAP was bad for performance. However from looking at
> vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
> are handled the same for architectures that support pte_special. This
> means there should not be a performance difference anymore, but this
> needs to be verified.
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: linux-mm@kvack.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>  include/linux/mm.h              |  5 +++++
>  mm/internal.h                   |  5 -----
>  3 files changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 1c34983480e5..708c6fb9be81 100644
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>  		 * at arbitrary times while the data is mmap'ed.
>  		 * See vmf_insert_mixed_prot() for a discussion.
>  		 */
> -		if (vma->vm_flags & VM_MIXEDMAP)
> -			ret = vmf_insert_mixed_prot(vma, address,
> -						    __pfn_to_pfn_t(pfn, PFN_DEV),
> -						    prot);
> -		else
> -			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
> +		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>  
>  		/* Never error on prefaulted PTEs */
>  		if (unlikely((ret & VM_FAULT_ERROR))) {
> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s
>  	 * Note: We're transferring the bo reference to
>  	 * vma->vm_private_data here.
>  	 */
> -
>  	vma->vm_private_data = bo;
>  
>  	/*
> -	 * We'd like to use VM_PFNMAP on shared mappings, where
> -	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
> -	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
> -	 * bad for performance. Until that has been sorted out, use
> -	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
> +	 * PFNMAP forces us to block COW mappings in mmap(),
> +	 * and with MIXEDMAP we would incorrectly allow fast gup
> +	 * on TTM memory on architectures that don't have pte_special.
>  	 */
> -	vma->vm_flags |= VM_MIXEDMAP;
> -	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>  }
>  
>  int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>  	if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>  		return -EINVAL;
>  
> +	if (unlikely(is_cow_mapping(vma->vm_flags)))
> +		return -EINVAL;
> +
>  	bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>  	if (unlikely(!bo))
>  		return -EINVAL;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..c6ebf7f9ddbb 100644
> +++ b/include/linux/mm.h
> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_ACCESS_FLAGS;
>  }
>  
> +static inline bool is_cow_mapping(vm_flags_t flags)
> +{
> +	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> +}

Most driver places are just banning VM_SHARED.

I see you copied this from remap_pfn_range(), but that logic is so
special I'm not sure..

Can the user mprotect the write back on with the above logic? Do we
need VM_DENYWRITE too?

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-23 11:47   ` Daniel Vetter
@ 2021-03-23 14:04     ` Jason Gunthorpe
  2021-03-23 15:51       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 14:04 UTC (permalink / raw)
  To: Thomas Hellström (Intel),
	dri-devel, Christian Koenig, David Airlie, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Mar 23, 2021 at 12:47:24PM +0100, Daniel Vetter wrote:

> > +static inline bool is_cow_mapping(vm_flags_t flags)
> 
> Bit a bikeshed, but I wonder whether the public interface shouldn't be
> vma_is_cow_mapping. Or whether this shouldn't be rejected somewhere else,
> since at least in drivers/gpu we have tons of cases that don't check for
> this and get it all kinds of wrong I think.
> 
> remap_pfn_range handles this for many cases, but by far not for all.
> 
> Anyway patch itself lgtm:
> 
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I would like it if io_remap_pfn_range() did not allow shared mappings
at all.

IIRC it doesn't work anyway, the kernel can't reliably copy from IO
pages eg the "_copy_from_user_inatomic()" under cow_user_page() will
not work on s390 that requires all IO memory be accessed with special
instructions.

Unfortunately I have no idea what the long ago special case of
allowing COW'd IO mappings is. :\

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 13:52   ` Jason Gunthorpe
@ 2021-03-23 15:05     ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 15:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, linux-mm, linux-kernel


On 3/23/21 2:52 PM, Jason Gunthorpe wrote:
> On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:
>> diff --git a/mm/gup.c b/mm/gup.c
>> index e40579624f10..1b6a127f0bdd 100644
>> +++ b/mm/gup.c
>> @@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>   }
>>   
>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> +/*
>> + * If we can't determine whether or not a pte is special, then fail immediately
>> + * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
>> + * special. For THP, special huge entries are indicated by xxx_devmap()
>> + * returning true, but a corresponding call to get_dev_pagemap() will
>> + * return NULL.
>> + *
>> + * For a futex to be placed on a THP tail page, get_futex_key requires a
>> + * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>> + * useful to have gup_huge_pmd even if we can't operate on ptes.
>> + */
> Why move this comment? I think it was correct where it was

Yes, you're right. I misread it to refer to the actual code in the 
gup_pte_range function rather than to the empty version. I'll move it back.

/Thomas


>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-23 14:00   ` Jason Gunthorpe
@ 2021-03-23 15:46     ` Thomas Hellström (Intel)
  2021-03-23 16:06       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 15:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, linux-mm, linux-kernel


On 3/23/21 3:00 PM, Jason Gunthorpe wrote:
> On Sun, Mar 21, 2021 at 07:45:29PM +0100, Thomas Hellström (Intel) wrote:
>> To block fast gup we need to make sure TTM ptes are always special.
>> With MIXEDMAP we, on architectures that don't support pte_special,
>> insert normal ptes, but OTOH on those architectures, fast is not
>> supported.
>> At the same time, the function documentation to vm_normal_page() suggests
>> that ptes pointing to system memory pages of MIXEDMAP vmas are always
>> normal, but that doesn't seem consistent with what's implemented in
>> vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
>> needed.
>>
>> But to make sure and to avoid also normal (non-fast) gup, make all
>> TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
>> anymore so make is_cow_mapping() available and use it to reject
>> COW mappigs at mmap time.
>>
>> There was previously a comment in the code that WC mappings together
>> with x86 PAT + PFNMAP was bad for performance. However from looking at
>> vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
>> are handled the same for architectures that support pte_special. This
>> means there should not be a performance difference anymore, but this
>> needs to be verified.
>>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@linux.ie>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: linux-mm@kvack.org
>> Cc: dri-devel@lists.freedesktop.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 ++++++++--------------
>>   include/linux/mm.h              |  5 +++++
>>   mm/internal.h                   |  5 -----
>>   3 files changed, 13 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 1c34983480e5..708c6fb9be81 100644
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>>   		 * at arbitrary times while the data is mmap'ed.
>>   		 * See vmf_insert_mixed_prot() for a discussion.
>>   		 */
>> -		if (vma->vm_flags & VM_MIXEDMAP)
>> -			ret = vmf_insert_mixed_prot(vma, address,
>> -						    __pfn_to_pfn_t(pfn, PFN_DEV),
>> -						    prot);
>> -		else
>> -			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>> +		ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>>   
>>   		/* Never error on prefaulted PTEs */
>>   		if (unlikely((ret & VM_FAULT_ERROR))) {
>> @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s
>>   	 * Note: We're transferring the bo reference to
>>   	 * vma->vm_private_data here.
>>   	 */
>> -
>>   	vma->vm_private_data = bo;
>>   
>>   	/*
>> -	 * We'd like to use VM_PFNMAP on shared mappings, where
>> -	 * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
>> -	 * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
>> -	 * bad for performance. Until that has been sorted out, use
>> -	 * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
>> +	 * PFNMAP forces us to block COW mappings in mmap(),
>> +	 * and with MIXEDMAP we would incorrectly allow fast gup
>> +	 * on TTM memory on architectures that don't have pte_special.
>>   	 */
>> -	vma->vm_flags |= VM_MIXEDMAP;
>> -	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>> +	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>   }
>>   
>>   int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>> @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
>>   	if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
>>   		return -EINVAL;
>>   
>> +	if (unlikely(is_cow_mapping(vma->vm_flags)))
>> +		return -EINVAL;
>> +
>>   	bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
>>   	if (unlikely(!bo))
>>   		return -EINVAL;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 77e64e3eac80..c6ebf7f9ddbb 100644
>> +++ b/include/linux/mm.h
>> @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>>   	return vma->vm_flags & VM_ACCESS_FLAGS;
>>   }
>>   
>> +static inline bool is_cow_mapping(vm_flags_t flags)
>> +{
>> +	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>> +}
> Most driver places are just banning VM_SHARED.
>
> I see you copied this from remap_pfn_range(), but that logic is so
> special I'm not sure..

It's actually used all over the place. Both in drivers and also 
redefined with
CONFIG_MEM_SOFT_DIRTY which makes me think Daniels idea of 
vma_is_cow_mapping() is better since it won't clash and cause 
compilation failures...

>
> Can the user mprotect the write back on with the above logic?
No, it's blocked by mprotect.
> Do we
> need VM_DENYWRITE too?

Seems tied to MAP_DENYWRITE which is nowadays ignored according to man 
mmap().

Thanks,

Thomas

>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-23 14:04     ` Jason Gunthorpe
@ 2021-03-23 15:51       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 15:51 UTC (permalink / raw)
  To: Jason Gunthorpe, dri-devel, Christian Koenig, David Airlie,
	Andrew Morton, linux-mm, linux-kernel


On 3/23/21 3:04 PM, Jason Gunthorpe wrote:
> On Tue, Mar 23, 2021 at 12:47:24PM +0100, Daniel Vetter wrote:
>
>>> +static inline bool is_cow_mapping(vm_flags_t flags)
>> Bit a bikeshed, but I wonder whether the public interface shouldn't be
>> vma_is_cow_mapping. Or whether this shouldn't be rejected somewhere else,
>> since at least in drivers/gpu we have tons of cases that don't check for
>> this and get it all kinds of wrong I think.
>>
>> remap_pfn_range handles this for many cases, but by far not for all.
>>
>> Anyway patch itself lgtm:
>>
>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> I would like it if io_remap_pfn_range() did not allow shared mappings
> at all.

You mean private mappings?

>
> IIRC it doesn't work anyway, the kernel can't reliably copy from IO
> pages eg the "_copy_from_user_inatomic()" under cow_user_page() will
> not work on s390 that requires all IO memory be accessed with special
> instructions.
>
> Unfortunately I have no idea what the long ago special case of
> allowing COW'd IO mappings is. :\

Me neither, but at some point it must have been important enough to 
introduce VM_MIXEDMAP...

/Thomas


> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
  2021-03-23 15:46     ` Thomas Hellström (Intel)
@ 2021-03-23 16:06       ` Jason Gunthorpe
  0 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 16:06 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Daniel Vetter,
	Andrew Morton, linux-mm, linux-kernel

On Tue, Mar 23, 2021 at 04:46:00PM +0100, Thomas Hellström (Intel) wrote:
> > > +static inline bool is_cow_mapping(vm_flags_t flags)
> > > +{
> > > +	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > > +}
> > Most driver places are just banning VM_SHARED.
> > 
> > I see you copied this from remap_pfn_range(), but that logic is so
> > special I'm not sure..
> 
> It's actually used all over the place. Both in drivers and also redefined
> with
> CONFIG_MEM_SOFT_DIRTY which makes me think Daniels idea of
> vma_is_cow_mapping() is better since it won't clash and cause compilation
> failures...

Well, lets update more mmap fops to use this new helper then?
Searching for VM_SHARED gives a good list, there are several in
drivers/infiniband

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 11:34   ` Daniel Vetter
@ 2021-03-23 16:34     ` Thomas Hellström (Intel)
  2021-03-23 16:37       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 16:34 UTC (permalink / raw)
  To: dri-devel, Christian Koenig, David Airlie, Andrew Morton,
	Jason Gunthorpe, linux-mm, linux-kernel

Hi,

On 3/23/21 12:34 PM, Daniel Vetter wrote:
> On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:
>> TTM sets up huge page-table-entries both to system- and device memory,
>> and we don't want gup to assume there are always valid backing struct
>> pages for these. For PTEs this is handled by setting the pte_special bit,
>> but for the huge PUDs and PMDs, we have neither pmd_special nor
>> pud_special. Normally, huge TTM entries are identified by looking at
>> vma_is_special_huge(), but fast gup can't do that, so as an alternative
>> define _devmap entries for which there are no backing dev_pagemap as
>> special, update documentation and make huge TTM entries _devmap, after
>> verifying that there is no backing dev_pagemap.
>>
>> One other alternative would be to block TTM huge page-table-entries
>> completely, and while currently only vmwgfx use them, they would be
>> beneficial to other graphis drivers moving forward as well.
>>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@linux.ie>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: linux-mm@kvack.org
>> Cc: dri-devel@lists.freedesktop.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 ++++++++++++++++-
>>   mm/gup.c                        | 21 +++++++++++----------
>>   mm/memremap.c                   |  5 +++++
>>   3 files changed, 32 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 6dc96cf66744..1c34983480e5 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>   	pfn_t pfnt;
>>   	struct ttm_tt *ttm = bo->ttm;
>>   	bool write = vmf->flags & FAULT_FLAG_WRITE;
>> +	struct dev_pagemap *pagemap;
>>   
>>   	/* Fault should not cross bo boundary. */
>>   	page_offset &= ~(fault_page_size - 1);
>> @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>   	if ((pfn & (fault_page_size - 1)) != 0)
>>   		goto out_fallback;
>>   
>> +	/*
>> +	 * Huge entries must be special, that is marking them as devmap
>> +	 * with no backing device map range. If there is a backing
>> +	 * range, Don't insert a huge entry.
>> +	 * If this check turns out to be too much of a performance hit,
>> +	 * we can instead have drivers indicate whether they may have
>> +	 * backing device map ranges and if not, skip this lookup.
>> +	 */
> I think we can do this statically:
> - if it's system memory we know there's no devmap for it, and we do the
>    trick to block gup_fast
Yes, that should work.
> - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
>    so might as well not do that

I think gup_fast will unfortunately mistake a huge iomem page for an 
ordinary page and try to access a non-existant struct page for it, 
unless we do the devmap trick.

And the lookup would then be for the rare case where a driver would have 
already registered a dev_pagemap for an iomem area which may also be 
mapped through TTM (like the patch from Felix a couple of weeks ago). If 
a driver can promise not to do that, then we can safely remove the lookup.

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 16:34     ` Thomas Hellström (Intel)
@ 2021-03-23 16:37       ` Jason Gunthorpe
  2021-03-23 16:59         ` Christoph Hellwig
  2021-03-23 17:06         ` Thomas Hellström (Intel)
  0 siblings, 2 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 16:37 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: dri-devel, Christian Koenig, David Airlie, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:

> > > @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> > >   	if ((pfn & (fault_page_size - 1)) != 0)
> > >   		goto out_fallback;
> > > +	/*
> > > +	 * Huge entries must be special, that is marking them as devmap
> > > +	 * with no backing device map range. If there is a backing
> > > +	 * range, Don't insert a huge entry.
> > > +	 * If this check turns out to be too much of a performance hit,
> > > +	 * we can instead have drivers indicate whether they may have
> > > +	 * backing device map ranges and if not, skip this lookup.
> > > +	 */
> > I think we can do this statically:
> > - if it's system memory we know there's no devmap for it, and we do the
> >    trick to block gup_fast
> Yes, that should work.
> > - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
> >    so might as well not do that
> 
> I think gup_fast will unfortunately mistake a huge iomem page for an
> ordinary page and try to access a non-existant struct page for it, unless we
> do the devmap trick.
> 
> And the lookup would then be for the rare case where a driver would have
> already registered a dev_pagemap for an iomem area which may also be mapped
> through TTM (like the patch from Felix a couple of weeks ago). If a driver
> can promise not to do that, then we can safely remove the lookup.

Isn't the devmap PTE flag arch optional? Does this fall back to not
using huge pages on arches that don't support it?

Also, I feel like this code to install "pte_special" huge pages does
not belong in the drm subsystem..

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 16:37       ` Jason Gunthorpe
@ 2021-03-23 16:59         ` Christoph Hellwig
  2021-03-23 17:06         ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2021-03-23 16:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellstr??m (Intel),
	dri-devel, Christian Koenig, David Airlie, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Mar 23, 2021 at 01:37:15PM -0300, Jason Gunthorpe wrote:
> Isn't the devmap PTE flag arch optional? Does this fall back to not
> using huge pages on arches that don't support it?
> 
> Also, I feel like this code to install "pte_special" huge pages does
> not belong in the drm subsystem..

It doesn't.  Unfortunately the drm code has a lot of such warts where
it pokes way to deep into VM internals. 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 16:37       ` Jason Gunthorpe
  2021-03-23 16:59         ` Christoph Hellwig
@ 2021-03-23 17:06         ` Thomas Hellström (Intel)
  2021-03-24  9:56           ` Daniel Vetter
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 17:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, Christian Koenig, David Airlie, Andrew Morton,
	linux-mm, linux-kernel


On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
> On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
>
>>>> @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>    	if ((pfn & (fault_page_size - 1)) != 0)
>>>>    		goto out_fallback;
>>>> +	/*
>>>> +	 * Huge entries must be special, that is marking them as devmap
>>>> +	 * with no backing device map range. If there is a backing
>>>> +	 * range, Don't insert a huge entry.
>>>> +	 * If this check turns out to be too much of a performance hit,
>>>> +	 * we can instead have drivers indicate whether they may have
>>>> +	 * backing device map ranges and if not, skip this lookup.
>>>> +	 */
>>> I think we can do this statically:
>>> - if it's system memory we know there's no devmap for it, and we do the
>>>     trick to block gup_fast
>> Yes, that should work.
>>> - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
>>>     so might as well not do that
>> I think gup_fast will unfortunately mistake a huge iomem page for an
>> ordinary page and try to access a non-existant struct page for it, unless we
>> do the devmap trick.
>>
>> And the lookup would then be for the rare case where a driver would have
>> already registered a dev_pagemap for an iomem area which may also be mapped
>> through TTM (like the patch from Felix a couple of weeks ago). If a driver
>> can promise not to do that, then we can safely remove the lookup.
> Isn't the devmap PTE flag arch optional? Does this fall back to not
> using huge pages on arches that don't support it?

Good point. No, currently it's only conditioned on transhuge page support.
Need to condition it on also devmap support.

>
> Also, I feel like this code to install "pte_special" huge pages does
> not belong in the drm subsystem..

I could add helpers in huge_memory.c:

vmf_insert_pfn_pmd_prot_special() and
vmf_insert_pfn_pud_prot_special()

/Thomas

>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
  2021-03-23 11:34   ` Daniel Vetter
  2021-03-23 13:52   ` Jason Gunthorpe
@ 2021-03-23 19:52   ` Williams, Dan J
  2021-03-23 20:42     ` Thomas Hellström (Intel)
  2 siblings, 1 reply; 63+ messages in thread
From: Williams, Dan J @ 2021-03-23 19:52 UTC (permalink / raw)
  To: dri-devel, thomas_os
  Cc: daniel, christian.koenig, jgg, airlied, linux-mm, linux-kernel, akpm

On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:
> TTM sets up huge page-table-entries both to system- and device
> memory,
> and we don't want gup to assume there are always valid backing struct
> pages for these. For PTEs this is handled by setting the pte_special
> bit,
> but for the huge PUDs and PMDs, we have neither pmd_special nor
> pud_special. Normally, huge TTM entries are identified by looking at
> vma_is_special_huge(), but fast gup can't do that, so as an
> alternative
> define _devmap entries for which there are no backing dev_pagemap as
> special, update documentation and make huge TTM entries _devmap,
> after
> verifying that there is no backing dev_pagemap.

Please do not abuse p{m,u}d_devmap like this. I'm in the process of
removing get_devpagemap() from the gup-fast path [1]. Instead there
should be space for p{m,u}d_special in the page table entries (at least
for x86-64). So the fix is to remove that old assumption that huge
pages can never be special.

[1]: 
http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 19:52   ` Williams, Dan J
@ 2021-03-23 20:42     ` Thomas Hellström (Intel)
  2021-03-24  9:58       ` Daniel Vetter
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-23 20:42 UTC (permalink / raw)
  To: Williams, Dan J, dri-devel
  Cc: daniel, christian.koenig, jgg, airlied, linux-mm, linux-kernel, akpm


On 3/23/21 8:52 PM, Williams, Dan J wrote:
> On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:
>> TTM sets up huge page-table-entries both to system- and device
>> memory,
>> and we don't want gup to assume there are always valid backing struct
>> pages for these. For PTEs this is handled by setting the pte_special
>> bit,
>> but for the huge PUDs and PMDs, we have neither pmd_special nor
>> pud_special. Normally, huge TTM entries are identified by looking at
>> vma_is_special_huge(), but fast gup can't do that, so as an
>> alternative
>> define _devmap entries for which there are no backing dev_pagemap as
>> special, update documentation and make huge TTM entries _devmap,
>> after
>> verifying that there is no backing dev_pagemap.
> Please do not abuse p{m,u}d_devmap like this. I'm in the process of
> removing get_devpagemap() from the gup-fast path [1]. Instead there
> should be space for p{m,u}d_special in the page table entries (at least
> for x86-64). So the fix is to remove that old assumption that huge
> pages can never be special.
>
> [1]:
> http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
>
Hmm, yes with that patch it will obviously not work as intended.

Given that, I think we'll need to disable the TTM huge pages for now 
until we can sort out and agree on using a page table entry bit.

Thanks,

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 17:06         ` Thomas Hellström (Intel)
@ 2021-03-24  9:56           ` Daniel Vetter
  2021-03-24 12:24             ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Daniel Vetter @ 2021-03-24  9:56 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Jason Gunthorpe, David Airlie, linux-kernel, dri-devel, linux-mm,
	Andrew Morton, Christian Koenig

On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
> > On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
> > 
> > > > > @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> > > > >    	if ((pfn & (fault_page_size - 1)) != 0)
> > > > >    		goto out_fallback;
> > > > > +	/*
> > > > > +	 * Huge entries must be special, that is marking them as devmap
> > > > > +	 * with no backing device map range. If there is a backing
> > > > > +	 * range, Don't insert a huge entry.
> > > > > +	 * If this check turns out to be too much of a performance hit,
> > > > > +	 * we can instead have drivers indicate whether they may have
> > > > > +	 * backing device map ranges and if not, skip this lookup.
> > > > > +	 */
> > > > I think we can do this statically:
> > > > - if it's system memory we know there's no devmap for it, and we do the
> > > >     trick to block gup_fast
> > > Yes, that should work.
> > > > - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
> > > >     so might as well not do that
> > > I think gup_fast will unfortunately mistake a huge iomem page for an
> > > ordinary page and try to access a non-existant struct page for it, unless we
> > > do the devmap trick.
> > > 
> > > And the lookup would then be for the rare case where a driver would have
> > > already registered a dev_pagemap for an iomem area which may also be mapped
> > > through TTM (like the patch from Felix a couple of weeks ago). If a driver
> > > can promise not to do that, then we can safely remove the lookup.
> > Isn't the devmap PTE flag arch optional? Does this fall back to not
> > using huge pages on arches that don't support it?
> 
> Good point. No, currently it's only conditioned on transhuge page support.
> Need to condition it on also devmap support.
> 
> > 
> > Also, I feel like this code to install "pte_special" huge pages does
> > not belong in the drm subsystem..
> 
> I could add helpers in huge_memory.c:
> 
> vmf_insert_pfn_pmd_prot_special() and
> vmf_insert_pfn_pud_prot_special()

The somewhat annoying thing is that we'd need an error code so we fall
back to pte fault handling. That's at least my understanding of how
pud/pmd fault handling works. Not sure how awkward that is going to be
with the overall fault handling flow.

But aside from that I think this makes tons of sense.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-23 20:42     ` Thomas Hellström (Intel)
@ 2021-03-24  9:58       ` Daniel Vetter
  2021-03-24 10:05         ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Daniel Vetter @ 2021-03-24  9:58 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Williams, Dan J, dri-devel, daniel, christian.koenig, jgg,
	airlied, linux-mm, linux-kernel, akpm

On Tue, Mar 23, 2021 at 09:42:18PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/23/21 8:52 PM, Williams, Dan J wrote:
> > On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:
> > > TTM sets up huge page-table-entries both to system- and device
> > > memory,
> > > and we don't want gup to assume there are always valid backing struct
> > > pages for these. For PTEs this is handled by setting the pte_special
> > > bit,
> > > but for the huge PUDs and PMDs, we have neither pmd_special nor
> > > pud_special. Normally, huge TTM entries are identified by looking at
> > > vma_is_special_huge(), but fast gup can't do that, so as an
> > > alternative
> > > define _devmap entries for which there are no backing dev_pagemap as
> > > special, update documentation and make huge TTM entries _devmap,
> > > after
> > > verifying that there is no backing dev_pagemap.
> > Please do not abuse p{m,u}d_devmap like this. I'm in the process of
> > removing get_devpagemap() from the gup-fast path [1]. Instead there
> > should be space for p{m,u}d_special in the page table entries (at least
> > for x86-64). So the fix is to remove that old assumption that huge
> > pages can never be special.
> > 
> > [1]:
> > http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
> > 
> Hmm, yes with that patch it will obviously not work as intended.
> 
> Given that, I think we'll need to disable the TTM huge pages for now until
> we can sort out and agree on using a page table entry bit.

Yeah :-/

I think going full pud/pmd_mkspecial should then also mesh well with
Jason's request to wrap it all up into a vmf_insert_* helper, so at least
it would all look rather pretty in the end.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24  9:58       ` Daniel Vetter
@ 2021-03-24 10:05         ` Thomas Hellström (Intel)
       [not found]           ` <75423f64-adef-a2c4-8e7d-2cb814127b18@intel.com>
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 10:05 UTC (permalink / raw)
  To: Williams, Dan J, dri-devel, christian.koenig, jgg, airlied,
	linux-mm, linux-kernel, akpm


On 3/24/21 10:58 AM, Daniel Vetter wrote:
> On Tue, Mar 23, 2021 at 09:42:18PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/23/21 8:52 PM, Williams, Dan J wrote:
>>> On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:
>>>> TTM sets up huge page-table-entries both to system- and device
>>>> memory,
>>>> and we don't want gup to assume there are always valid backing struct
>>>> pages for these. For PTEs this is handled by setting the pte_special
>>>> bit,
>>>> but for the huge PUDs and PMDs, we have neither pmd_special nor
>>>> pud_special. Normally, huge TTM entries are identified by looking at
>>>> vma_is_special_huge(), but fast gup can't do that, so as an
>>>> alternative
>>>> define _devmap entries for which there are no backing dev_pagemap as
>>>> special, update documentation and make huge TTM entries _devmap,
>>>> after
>>>> verifying that there is no backing dev_pagemap.
>>> Please do not abuse p{m,u}d_devmap like this. I'm in the process of
>>> removing get_devpagemap() from the gup-fast path [1]. Instead there
>>> should be space for p{m,u}d_special in the page table entries (at least
>>> for x86-64). So the fix is to remove that old assumption that huge
>>> pages can never be special.
>>>
>>> [1]:
>>> http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com
>>>
>> Hmm, yes with that patch it will obviously not work as intended.
>>
>> Given that, I think we'll need to disable the TTM huge pages for now until
>> we can sort out and agree on using a page table entry bit.
> Yeah :-/
>
> I think going full pud/pmd_mkspecial should then also mesh well with
> Jason's request to wrap it all up into a vmf_insert_* helper, so at least
> it would all look rather pretty in the end.

Yes, I agree. Seems like the special (SW1) is available also for huge 
page table entries on x86 AFAICT, although just not implemented. 
Otherwise the SW bits appear completely used up.

The PTE size vmf_insert_pfn__xxx functions either insert one of devmap 
or special.  I think the only users of the huge insert functions apart 
form TTM currently insert devmap so we should probably be able to do the 
same, and then DRM / TTM wouldn't need to care at all about special or not.

/Thomas



> -Daniel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24  9:56           ` Daniel Vetter
@ 2021-03-24 12:24             ` Jason Gunthorpe
  2021-03-24 12:35               ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 12:24 UTC (permalink / raw)
  To: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig

On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:
> On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:
> > 
> > On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
> > > On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
> > > 
> > > > > > @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> > > > > >    	if ((pfn & (fault_page_size - 1)) != 0)
> > > > > >    		goto out_fallback;
> > > > > > +	/*
> > > > > > +	 * Huge entries must be special, that is marking them as devmap
> > > > > > +	 * with no backing device map range. If there is a backing
> > > > > > +	 * range, Don't insert a huge entry.
> > > > > > +	 * If this check turns out to be too much of a performance hit,
> > > > > > +	 * we can instead have drivers indicate whether they may have
> > > > > > +	 * backing device map ranges and if not, skip this lookup.
> > > > > > +	 */
> > > > > I think we can do this statically:
> > > > > - if it's system memory we know there's no devmap for it, and we do the
> > > > >     trick to block gup_fast
> > > > Yes, that should work.
> > > > > - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
> > > > >     so might as well not do that
> > > > I think gup_fast will unfortunately mistake a huge iomem page for an
> > > > ordinary page and try to access a non-existant struct page for it, unless we
> > > > do the devmap trick.
> > > > 
> > > > And the lookup would then be for the rare case where a driver would have
> > > > already registered a dev_pagemap for an iomem area which may also be mapped
> > > > through TTM (like the patch from Felix a couple of weeks ago). If a driver
> > > > can promise not to do that, then we can safely remove the lookup.
> > > Isn't the devmap PTE flag arch optional? Does this fall back to not
> > > using huge pages on arches that don't support it?
> > 
> > Good point. No, currently it's only conditioned on transhuge page support.
> > Need to condition it on also devmap support.
> > 
> > > 
> > > Also, I feel like this code to install "pte_special" huge pages does
> > > not belong in the drm subsystem..
> > 
> > I could add helpers in huge_memory.c:
> > 
> > vmf_insert_pfn_pmd_prot_special() and
> > vmf_insert_pfn_pud_prot_special()
> 
> The somewhat annoying thing is that we'd need an error code so we fall
> back to pte fault handling. That's at least my understanding of how
> pud/pmd fault handling works. Not sure how awkward that is going to be
> with the overall fault handling flow.
> 
> But aside from that I think this makes tons of sense.

Why should the driver be so specific?

vmf_insert_pfn_range_XXX()

And it will figure out the optimal way to build the page tables.

Driver should provide the largest physically contiguous range it can

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 12:24             ` Jason Gunthorpe
@ 2021-03-24 12:35               ` Thomas Hellström (Intel)
  2021-03-24 12:41                 ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 12:35 UTC (permalink / raw)
  To: Jason Gunthorpe, David Airlie, linux-kernel, dri-devel, linux-mm,
	Andrew Morton, Christian Koenig


On 3/24/21 1:24 PM, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:
>> On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:
>>> On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
>>>> On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
>>>>
>>>>>>> @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>>>>     	if ((pfn & (fault_page_size - 1)) != 0)
>>>>>>>     		goto out_fallback;
>>>>>>> +	/*
>>>>>>> +	 * Huge entries must be special, that is marking them as devmap
>>>>>>> +	 * with no backing device map range. If there is a backing
>>>>>>> +	 * range, Don't insert a huge entry.
>>>>>>> +	 * If this check turns out to be too much of a performance hit,
>>>>>>> +	 * we can instead have drivers indicate whether they may have
>>>>>>> +	 * backing device map ranges and if not, skip this lookup.
>>>>>>> +	 */
>>>>>> I think we can do this statically:
>>>>>> - if it's system memory we know there's no devmap for it, and we do the
>>>>>>      trick to block gup_fast
>>>>> Yes, that should work.
>>>>>> - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
>>>>>>      so might as well not do that
>>>>> I think gup_fast will unfortunately mistake a huge iomem page for an
>>>>> ordinary page and try to access a non-existant struct page for it, unless we
>>>>> do the devmap trick.
>>>>>
>>>>> And the lookup would then be for the rare case where a driver would have
>>>>> already registered a dev_pagemap for an iomem area which may also be mapped
>>>>> through TTM (like the patch from Felix a couple of weeks ago). If a driver
>>>>> can promise not to do that, then we can safely remove the lookup.
>>>> Isn't the devmap PTE flag arch optional? Does this fall back to not
>>>> using huge pages on arches that don't support it?
>>> Good point. No, currently it's only conditioned on transhuge page support.
>>> Need to condition it on also devmap support.
>>>
>>>> Also, I feel like this code to install "pte_special" huge pages does
>>>> not belong in the drm subsystem..
>>> I could add helpers in huge_memory.c:
>>>
>>> vmf_insert_pfn_pmd_prot_special() and
>>> vmf_insert_pfn_pud_prot_special()
>> The somewhat annoying thing is that we'd need an error code so we fall
>> back to pte fault handling. That's at least my understanding of how
>> pud/pmd fault handling works. Not sure how awkward that is going to be
>> with the overall fault handling flow.
>>
>> But aside from that I think this makes tons of sense.
> Why should the driver be so specific?
>
> vmf_insert_pfn_range_XXX()
>
> And it will figure out the optimal way to build the page tables.
>
> Driver should provide the largest physically contiguous range it can

I figure that would probably work, but since the huge_fault() interface 
is already providing the size of the fault based on how the pagetable is 
currently populated I figure that would have to move a lot of that logic 
into that helper...

/Thomas


>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 12:35               ` Thomas Hellström (Intel)
@ 2021-03-24 12:41                 ` Jason Gunthorpe
  2021-03-24 13:35                   ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 12:41 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig

On Wed, Mar 24, 2021 at 01:35:17PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/24/21 1:24 PM, Jason Gunthorpe wrote:
> > On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:
> > > On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:
> > > > On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
> > > > > On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
> > > > > 
> > > > > > > > @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
> > > > > > > >     	if ((pfn & (fault_page_size - 1)) != 0)
> > > > > > > >     		goto out_fallback;
> > > > > > > > +	/*
> > > > > > > > +	 * Huge entries must be special, that is marking them as devmap
> > > > > > > > +	 * with no backing device map range. If there is a backing
> > > > > > > > +	 * range, Don't insert a huge entry.
> > > > > > > > +	 * If this check turns out to be too much of a performance hit,
> > > > > > > > +	 * we can instead have drivers indicate whether they may have
> > > > > > > > +	 * backing device map ranges and if not, skip this lookup.
> > > > > > > > +	 */
> > > > > > > I think we can do this statically:
> > > > > > > - if it's system memory we know there's no devmap for it, and we do the
> > > > > > >      trick to block gup_fast
> > > > > > Yes, that should work.
> > > > > > > - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
> > > > > > >      so might as well not do that
> > > > > > I think gup_fast will unfortunately mistake a huge iomem page for an
> > > > > > ordinary page and try to access a non-existant struct page for it, unless we
> > > > > > do the devmap trick.
> > > > > > 
> > > > > > And the lookup would then be for the rare case where a driver would have
> > > > > > already registered a dev_pagemap for an iomem area which may also be mapped
> > > > > > through TTM (like the patch from Felix a couple of weeks ago). If a driver
> > > > > > can promise not to do that, then we can safely remove the lookup.
> > > > > Isn't the devmap PTE flag arch optional? Does this fall back to not
> > > > > using huge pages on arches that don't support it?
> > > > Good point. No, currently it's only conditioned on transhuge page support.
> > > > Need to condition it on also devmap support.
> > > > 
> > > > > Also, I feel like this code to install "pte_special" huge pages does
> > > > > not belong in the drm subsystem..
> > > > I could add helpers in huge_memory.c:
> > > > 
> > > > vmf_insert_pfn_pmd_prot_special() and
> > > > vmf_insert_pfn_pud_prot_special()
> > > The somewhat annoying thing is that we'd need an error code so we fall
> > > back to pte fault handling. That's at least my understanding of how
> > > pud/pmd fault handling works. Not sure how awkward that is going to be
> > > with the overall fault handling flow.
> > > 
> > > But aside from that I think this makes tons of sense.
> > Why should the driver be so specific?
> > 
> > vmf_insert_pfn_range_XXX()
> > 
> > And it will figure out the optimal way to build the page tables.
> > 
> > Driver should provide the largest physically contiguous range it can
> 
> I figure that would probably work, but since the huge_fault() interface is
> already providing the size of the fault based on how the pagetable is
> currently populated I figure that would have to move a lot of that logic
> into that helper...

But we don't really care about the size of the fault when we stuff the
pfns.

The device might use it when handling the fault, but once the fault is
handled the device knows what the contiguous pfn range is that it has
available to stuff into the page tables, it just tells the vmf_insert
what it was able to create, and it creates the necessary page table
structure.

The size of the hole in the page table is really only advisory, the
device may not want to make a 2M or 1G page entry and may prefer to
only create 4k.

In an ideal world the creation/destruction of page table levels would
by dynamic at this point, like THP.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 12:41                 ` Jason Gunthorpe
@ 2021-03-24 13:35                   ` Thomas Hellström (Intel)
  2021-03-24 13:48                     ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig


On 3/24/21 1:41 PM, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 01:35:17PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 1:24 PM, Jason Gunthorpe wrote:
>>> On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:
>>>> On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:
>>>>> On 3/23/21 5:37 PM, Jason Gunthorpe wrote:
>>>>>> On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:
>>>>>>
>>>>>>>>> @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>>>>>>      	if ((pfn & (fault_page_size - 1)) != 0)
>>>>>>>>>      		goto out_fallback;
>>>>>>>>> +	/*
>>>>>>>>> +	 * Huge entries must be special, that is marking them as devmap
>>>>>>>>> +	 * with no backing device map range. If there is a backing
>>>>>>>>> +	 * range, Don't insert a huge entry.
>>>>>>>>> +	 * If this check turns out to be too much of a performance hit,
>>>>>>>>> +	 * we can instead have drivers indicate whether they may have
>>>>>>>>> +	 * backing device map ranges and if not, skip this lookup.
>>>>>>>>> +	 */
>>>>>>>> I think we can do this statically:
>>>>>>>> - if it's system memory we know there's no devmap for it, and we do the
>>>>>>>>       trick to block gup_fast
>>>>>>> Yes, that should work.
>>>>>>>> - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV,
>>>>>>>>       so might as well not do that
>>>>>>> I think gup_fast will unfortunately mistake a huge iomem page for an
>>>>>>> ordinary page and try to access a non-existant struct page for it, unless we
>>>>>>> do the devmap trick.
>>>>>>>
>>>>>>> And the lookup would then be for the rare case where a driver would have
>>>>>>> already registered a dev_pagemap for an iomem area which may also be mapped
>>>>>>> through TTM (like the patch from Felix a couple of weeks ago). If a driver
>>>>>>> can promise not to do that, then we can safely remove the lookup.
>>>>>> Isn't the devmap PTE flag arch optional? Does this fall back to not
>>>>>> using huge pages on arches that don't support it?
>>>>> Good point. No, currently it's only conditioned on transhuge page support.
>>>>> Need to condition it on also devmap support.
>>>>>
>>>>>> Also, I feel like this code to install "pte_special" huge pages does
>>>>>> not belong in the drm subsystem..
>>>>> I could add helpers in huge_memory.c:
>>>>>
>>>>> vmf_insert_pfn_pmd_prot_special() and
>>>>> vmf_insert_pfn_pud_prot_special()
>>>> The somewhat annoying thing is that we'd need an error code so we fall
>>>> back to pte fault handling. That's at least my understanding of how
>>>> pud/pmd fault handling works. Not sure how awkward that is going to be
>>>> with the overall fault handling flow.
>>>>
>>>> But aside from that I think this makes tons of sense.
>>> Why should the driver be so specific?
>>>
>>> vmf_insert_pfn_range_XXX()
>>>
>>> And it will figure out the optimal way to build the page tables.
>>>
>>> Driver should provide the largest physically contiguous range it can
>> I figure that would probably work, but since the huge_fault() interface is
>> already providing the size of the fault based on how the pagetable is
>> currently populated I figure that would have to move a lot of that logic
>> into that helper...
> But we don't really care about the size of the fault when we stuff the
> pfns.
>
> The device might use it when handling the fault, but once the fault is
> handled the device knows what the contiguous pfn range is that it has
> available to stuff into the page tables, it just tells the vmf_insert
> what it was able to create, and it creates the necessary page table
> structure.
>
> The size of the hole in the page table is really only advisory, the
> device may not want to make a 2M or 1G page entry and may prefer to
> only create 4k.
>
> In an ideal world the creation/destruction of page table levels would
> by dynamic at this point, like THP.

Hmm, but I'm not sure what problem we're trying to solve by changing the 
interface in this way?

Currently if the core vm requests a huge pud, we give it one, and if we 
can't or don't want to (because of dirty-tracking, for example, which is 
always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the 
fault is retried at a lower level. Also, determining whether we have a 
contigous range is not free, so we  don't want to do that unnecessarily.

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 13:35                   ` Thomas Hellström (Intel)
@ 2021-03-24 13:48                     ` Jason Gunthorpe
  2021-03-24 15:50                       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 13:48 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig

On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:

> > In an ideal world the creation/destruction of page table levels would
> > by dynamic at this point, like THP.
> 
> Hmm, but I'm not sure what problem we're trying to solve by changing the
> interface in this way?

We are trying to make a sensible driver API to deal with huge pages.
 
> Currently if the core vm requests a huge pud, we give it one, and if we
> can't or don't want to (because of dirty-tracking, for example, which is
> always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the
> fault is retried at a lower level.

Well, my thought would be to move the pte related stuff into
vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.

I don't know if the locking works out, but it feels cleaner that the
driver tells the vmf how big a page it can stuff in, not the vm
telling the driver to stuff in a certain size page which it might not
want to do.

Some devices want to work on a in-between page size like 64k so they
can't form 2M pages but they can stuff 64k of 4K pages in a batch on
every fault.

That idea doesn't fit naturally if the VM is driving the size.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 13:48                     ` Jason Gunthorpe
@ 2021-03-24 15:50                       ` Thomas Hellström (Intel)
  2021-03-24 16:38                         ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 15:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig


On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:
>
>>> In an ideal world the creation/destruction of page table levels would
>>> by dynamic at this point, like THP.
>> Hmm, but I'm not sure what problem we're trying to solve by changing the
>> interface in this way?
> We are trying to make a sensible driver API to deal with huge pages.
>   
>> Currently if the core vm requests a huge pud, we give it one, and if we
>> can't or don't want to (because of dirty-tracking, for example, which is
>> always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the
>> fault is retried at a lower level.
> Well, my thought would be to move the pte related stuff into
> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>
> I don't know if the locking works out, but it feels cleaner that the
> driver tells the vmf how big a page it can stuff in, not the vm
> telling the driver to stuff in a certain size page which it might not
> want to do.
>
> Some devices want to work on a in-between page size like 64k so they
> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
> every fault.

Hmm, yes, but we would in that case be limited anyway to insert ranges 
smaller than and equal to the fault size to avoid extensive and possibly 
unnecessary checks for contigous memory. And then if we can't support 
the full fault size, we'd need to either presume a size and alignment of 
the next level or search for contigous memory in both directions around 
the fault address, perhaps unnecessarily as well. I do think the current 
interface works ok, as we're just acting on what the core vm tells us to do.

/Thomas

>
> That idea doesn't fit naturally if the VM is driving the size.
>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 15:50                       ` Thomas Hellström (Intel)
@ 2021-03-24 16:38                         ` Jason Gunthorpe
  2021-03-24 18:31                           ` Christian König
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 16:38 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton,
	Christian Koenig

On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> > On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:
> > 
> > > > In an ideal world the creation/destruction of page table levels would
> > > > by dynamic at this point, like THP.
> > > Hmm, but I'm not sure what problem we're trying to solve by changing the
> > > interface in this way?
> > We are trying to make a sensible driver API to deal with huge pages.
> > > Currently if the core vm requests a huge pud, we give it one, and if we
> > > can't or don't want to (because of dirty-tracking, for example, which is
> > > always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the
> > > fault is retried at a lower level.
> > Well, my thought would be to move the pte related stuff into
> > vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> > 
> > I don't know if the locking works out, but it feels cleaner that the
> > driver tells the vmf how big a page it can stuff in, not the vm
> > telling the driver to stuff in a certain size page which it might not
> > want to do.
> > 
> > Some devices want to work on a in-between page size like 64k so they
> > can't form 2M pages but they can stuff 64k of 4K pages in a batch on
> > every fault.
> 
> Hmm, yes, but we would in that case be limited anyway to insert ranges
> smaller than and equal to the fault size to avoid extensive and possibly
> unnecessary checks for contigous memory. 

Why? The insert function is walking the page tables, it just updates
things as they are. It learns the arragement for free while doing the
walk.

The device has to always provide consistent data, if it overlaps into
pages that are already populated that is fine so long as it isn't
changing their addresses.

> And then if we can't support the full fault size, we'd need to
> either presume a size and alignment of the next level or search for
> contigous memory in both directions around the fault address,
> perhaps unnecessarily as well.

You don't really need to care about levels, the device should be
faulting in the largest memory regions it can within its efficiency.

If it works on 4M pages then it should be faulting 4M pages. The page
size of the underlying CPU doesn't really matter much other than some
tuning to impact how the device's allocator works.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 16:38                         ` Jason Gunthorpe
@ 2021-03-24 18:31                           ` Christian König
  2021-03-24 20:07                             ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-24 18:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton



Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:
>>>
>>>>> In an ideal world the creation/destruction of page table levels would
>>>>> by dynamic at this point, like THP.
>>>> Hmm, but I'm not sure what problem we're trying to solve by changing the
>>>> interface in this way?
>>> We are trying to make a sensible driver API to deal with huge pages.
>>>> Currently if the core vm requests a huge pud, we give it one, and if we
>>>> can't or don't want to (because of dirty-tracking, for example, which is
>>>> always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the
>>>> fault is retried at a lower level.
>>> Well, my thought would be to move the pte related stuff into
>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>
>>> I don't know if the locking works out, but it feels cleaner that the
>>> driver tells the vmf how big a page it can stuff in, not the vm
>>> telling the driver to stuff in a certain size page which it might not
>>> want to do.
>>>
>>> Some devices want to work on a in-between page size like 64k so they
>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>> every fault.
>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>> smaller than and equal to the fault size to avoid extensive and possibly
>> unnecessary checks for contigous memory.
> Why? The insert function is walking the page tables, it just updates
> things as they are. It learns the arragement for free while doing the
> walk.
>
> The device has to always provide consistent data, if it overlaps into
> pages that are already populated that is fine so long as it isn't
> changing their addresses.
>
>> And then if we can't support the full fault size, we'd need to
>> either presume a size and alignment of the next level or search for
>> contigous memory in both directions around the fault address,
>> perhaps unnecessarily as well.
> You don't really need to care about levels, the device should be
> faulting in the largest memory regions it can within its efficiency.
>
> If it works on 4M pages then it should be faulting 4M pages. The page
> size of the underlying CPU doesn't really matter much other than some
> tuning to impact how the device's allocator works.

I agree with Jason here.

We get the best efficiency when we look at the what the GPU driver 
provides and make sure that we handle one GPU page at once instead of 
looking to much into what the CPU is doing with it's page tables.

At least one AMD GPUs the GPU page size can be anything between 4KiB and 
2GiB and if we will in a 2GiB chunk at once this can in theory be 
handled by just two giant page table entries on the CPU side.

On the other hand I'm not sure how filling in the CPU page tables work 
in detail.

Christian.

>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 18:31                           ` Christian König
@ 2021-03-24 20:07                             ` Thomas Hellström (Intel)
  2021-03-24 23:14                               ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 20:07 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton


On 3/24/21 7:31 PM, Christian König wrote:
>
>
> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) 
>> wrote:
>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) 
>>>> wrote:
>>>>
>>>>>> In an ideal world the creation/destruction of page table levels 
>>>>>> would
>>>>>> by dynamic at this point, like THP.
>>>>> Hmm, but I'm not sure what problem we're trying to solve by 
>>>>> changing the
>>>>> interface in this way?
>>>> We are trying to make a sensible driver API to deal with huge pages.
>>>>> Currently if the core vm requests a huge pud, we give it one, and 
>>>>> if we
>>>>> can't or don't want to (because of dirty-tracking, for example, 
>>>>> which is
>>>>> always done on 4K page-level) we just return VM_FAULT_FALLBACK, 
>>>>> and the
>>>>> fault is retried at a lower level.
>>>> Well, my thought would be to move the pte related stuff into
>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>
>>>> I don't know if the locking works out, but it feels cleaner that the
>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>> telling the driver to stuff in a certain size page which it might not
>>>> want to do.
>>>>
>>>> Some devices want to work on a in-between page size like 64k so they
>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>>> every fault.
>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>>> smaller than and equal to the fault size to avoid extensive and 
>>> possibly
>>> unnecessary checks for contigous memory.
>> Why? The insert function is walking the page tables, it just updates
>> things as they are. It learns the arragement for free while doing the
>> walk.
>>
>> The device has to always provide consistent data, if it overlaps into
>> pages that are already populated that is fine so long as it isn't
>> changing their addresses.
>>
>>> And then if we can't support the full fault size, we'd need to
>>> either presume a size and alignment of the next level or search for
>>> contigous memory in both directions around the fault address,
>>> perhaps unnecessarily as well.
>> You don't really need to care about levels, the device should be
>> faulting in the largest memory regions it can within its efficiency.
>>
>> If it works on 4M pages then it should be faulting 4M pages. The page
>> size of the underlying CPU doesn't really matter much other than some
>> tuning to impact how the device's allocator works.

Yes, but then we'd be adding a lot of complexity into this function that 
is already provided by the current interface for DAX, for little or no 
gain, at least in the drm/ttm setting. Please think of the following 
situation: You get a fault, you do an extensive time-consuming scan of 
your VRAM buffer object into which the fault goes and determine you can 
fault 1GB. Now you hand it to vmf_insert_range() and because the 
user-space address is misaligned, or already partly populated because of 
a previous eviction, you can only fault single pages, and you end up 
faulting a full GB of single pages perhaps for a one-time small update.

On top of this, unless we want to do the walk trying increasingly 
smaller sizes of vmf_insert_xxx(), we'd have to use 
apply_to_page_range() and teach it about transhuge page table entries, 
because pagewalk.c can't be used (It can't populate page tables). That 
also means apply_to_page_range() needs to be complicated with page table 
locks since transhuge pages aren't stable and can be zapped and 
refaulted under us while we do the walk.

On top of this, the user-space address allocator needs to know how large 
gpu pages are aligned in buffer objects to have a reasonable chance of 
aligning with CPU huge page boundaries which is a requirement to be able 
to insert a huge CPU page table entry, so the driver would basically 
need the drm helper that can do this alignment anyway.

All this makes me think we should settle for the current interface for 
now, and if someone feels like refining it, I'm fine with that.  After 
all, this isn't a strange drm/ttm invention, it's a pre-existing 
interface that we reuse.

>
> I agree with Jason here.
>
> We get the best efficiency when we look at the what the GPU driver 
> provides and make sure that we handle one GPU page at once instead of 
> looking to much into what the CPU is doing with it's page tables.
>
> At least one AMD GPUs the GPU page size can be anything between 4KiB 
> and 2GiB and if we will in a 2GiB chunk at once this can in theory be 
> handled by just two giant page table entries on the CPU side.

Yes, but I fail to see why, with the current code, we can't do this 
(save the refcounting bug)?

/Thomas



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
       [not found]           ` <75423f64-adef-a2c4-8e7d-2cb814127b18@intel.com>
@ 2021-03-24 20:22             ` Thomas Hellström (Intel)
  2021-03-24 20:25               ` Dave Hansen
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-24 20:22 UTC (permalink / raw)
  To: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig, jgg,
	airlied, linux-mm, linux-kernel, akpm


On 3/24/21 5:34 PM, Dave Hansen wrote:
> On 3/24/21 3:05 AM, Thomas Hellström (Intel) wrote:
>> Yes, I agree. Seems like the special (SW1) is available also for huge
>> page table entries on x86 AFAICT, although just not implemented.
>> Otherwise the SW bits appear completely used up.
> Although the _PAGE_BIT_SOFTW* bits are used up, there's plenty of room
> in the hardware PTEs.  Bits 52->58 are software-available, and we're
> only using 58 at the moment.
>
> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
> used.  It's quite possible we can encode another use even in the
> existing bits.
>
> Personally, I'd just try:
>
> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>
OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems 
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 20:22             ` Thomas Hellström (Intel)
@ 2021-03-24 20:25               ` Dave Hansen
  2021-03-25 17:51                 ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2021-03-24 20:25 UTC (permalink / raw)
  To: Thomas Hellström (Intel),
	Williams, Dan J, dri-devel, christian.koenig, jgg, airlied,
	linux-mm, linux-kernel, akpm

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>> used.  It's quite possible we can encode another use even in the
>> existing bits.
>>
>> Personally, I'd just try:
>>
>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>
> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
> used in a selftest, but only for PTEs AFAICT.
> 
> Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 20:07                             ` Thomas Hellström (Intel)
@ 2021-03-24 23:14                               ` Jason Gunthorpe
  2021-03-25  7:48                                 ` Thomas Hellström (Intel)
  2021-03-25  7:49                                 ` Christian König
  0 siblings, 2 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 23:14 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, David Airlie, linux-kernel, dri-devel,
	linux-mm, Andrew Morton

On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/24/21 7:31 PM, Christian König wrote:
> > 
> > 
> > Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> > > On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> > > wrote:
> > > > On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> > > > > (Intel) wrote:
> > > > > 
> > > > > > > In an ideal world the creation/destruction of page
> > > > > > > table levels would
> > > > > > > by dynamic at this point, like THP.
> > > > > > Hmm, but I'm not sure what problem we're trying to solve
> > > > > > by changing the
> > > > > > interface in this way?
> > > > > We are trying to make a sensible driver API to deal with huge pages.
> > > > > > Currently if the core vm requests a huge pud, we give it
> > > > > > one, and if we
> > > > > > can't or don't want to (because of dirty-tracking, for
> > > > > > example, which is
> > > > > > always done on 4K page-level) we just return
> > > > > > VM_FAULT_FALLBACK, and the
> > > > > > fault is retried at a lower level.
> > > > > Well, my thought would be to move the pte related stuff into
> > > > > vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> > > > > 
> > > > > I don't know if the locking works out, but it feels cleaner that the
> > > > > driver tells the vmf how big a page it can stuff in, not the vm
> > > > > telling the driver to stuff in a certain size page which it might not
> > > > > want to do.
> > > > > 
> > > > > Some devices want to work on a in-between page size like 64k so they
> > > > > can't form 2M pages but they can stuff 64k of 4K pages in a batch on
> > > > > every fault.
> > > > Hmm, yes, but we would in that case be limited anyway to insert ranges
> > > > smaller than and equal to the fault size to avoid extensive and
> > > > possibly
> > > > unnecessary checks for contigous memory.
> > > Why? The insert function is walking the page tables, it just updates
> > > things as they are. It learns the arragement for free while doing the
> > > walk.
> > > 
> > > The device has to always provide consistent data, if it overlaps into
> > > pages that are already populated that is fine so long as it isn't
> > > changing their addresses.
> > > 
> > > > And then if we can't support the full fault size, we'd need to
> > > > either presume a size and alignment of the next level or search for
> > > > contigous memory in both directions around the fault address,
> > > > perhaps unnecessarily as well.
> > > You don't really need to care about levels, the device should be
> > > faulting in the largest memory regions it can within its efficiency.
> > > 
> > > If it works on 4M pages then it should be faulting 4M pages. The page
> > > size of the underlying CPU doesn't really matter much other than some
> > > tuning to impact how the device's allocator works.
> 
> Yes, but then we'd be adding a lot of complexity into this function that is
> already provided by the current interface for DAX, for little or no gain, at
> least in the drm/ttm setting. Please think of the following situation: You
> get a fault, you do an extensive time-consuming scan of your VRAM buffer
> object into which the fault goes and determine you can fault 1GB. Now you
> hand it to vmf_insert_range() and because the user-space address is
> misaligned, or already partly populated because of a previous eviction, you
> can only fault single pages, and you end up faulting a full GB of single
> pages perhaps for a one-time small update.

Why would "you can only fault single pages" ever be true? If you have
1GB of pages then the vmf_insert_range should allocate enough page
table entries to consume it, regardless of alignment.

And why shouldn't DAX switch to this kind of interface anyhow? It is
basically exactly the same problem. The underlying filesystem block
size is *not* necessarily aligned to the CPU page table sizes and DAX
would benefit from better handling of this mismatch.

> On top of this, unless we want to do the walk trying increasingly smaller
> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach
> it about transhuge page table entries, because pagewalk.c can't be used (It
> can't populate page tables). That also means apply_to_page_range() needs to
> be complicated with page table locks since transhuge pages aren't stable and
> can be zapped and refaulted under us while we do the walk.

I didn't say it would be simple :) But we also need to stop hacking
around the sides of all this huge page stuff and come up with sensible
APIs that drivers can actually implement correctly. Exposing drivers
to specific kinds of page levels really feels like the wrong level of
abstraction.

Once we start doing this we should do it everywhere, the io_remap_pfn
stuff should be able to create huge special IO pages as well, for
instance.
 
> On top of this, the user-space address allocator needs to know how large gpu
> pages are aligned in buffer objects to have a reasonable chance of aligning
> with CPU huge page boundaries which is a requirement to be able to insert a
> huge CPU page table entry, so the driver would basically need the drm helper
> that can do this alignment anyway.

Don't you have this problem anyhow?

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 23:14                               ` Jason Gunthorpe
@ 2021-03-25  7:48                                 ` Thomas Hellström (Intel)
  2021-03-25  8:27                                   ` Christian König
  2021-03-25  7:49                                 ` Christian König
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25  7:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, David Airlie, linux-kernel, dri-devel,
	linux-mm, Andrew Morton


On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 7:31 PM, Christian König wrote:
>>>
>>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
>>>>>> (Intel) wrote:
>>>>>>
>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>> table levels would
>>>>>>>> by dynamic at this point, like THP.
>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>> by changing the
>>>>>>> interface in this way?
>>>>>> We are trying to make a sensible driver API to deal with huge pages.
>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>> one, and if we
>>>>>>> can't or don't want to (because of dirty-tracking, for
>>>>>>> example, which is
>>>>>>> always done on 4K page-level) we just return
>>>>>>> VM_FAULT_FALLBACK, and the
>>>>>>> fault is retried at a lower level.
>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>
>>>>>> I don't know if the locking works out, but it feels cleaner that the
>>>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>>>> telling the driver to stuff in a certain size page which it might not
>>>>>> want to do.
>>>>>>
>>>>>> Some devices want to work on a in-between page size like 64k so they
>>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>>>>> every fault.
>>>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>>>>> smaller than and equal to the fault size to avoid extensive and
>>>>> possibly
>>>>> unnecessary checks for contigous memory.
>>>> Why? The insert function is walking the page tables, it just updates
>>>> things as they are. It learns the arragement for free while doing the
>>>> walk.
>>>>
>>>> The device has to always provide consistent data, if it overlaps into
>>>> pages that are already populated that is fine so long as it isn't
>>>> changing their addresses.
>>>>
>>>>> And then if we can't support the full fault size, we'd need to
>>>>> either presume a size and alignment of the next level or search for
>>>>> contigous memory in both directions around the fault address,
>>>>> perhaps unnecessarily as well.
>>>> You don't really need to care about levels, the device should be
>>>> faulting in the largest memory regions it can within its efficiency.
>>>>
>>>> If it works on 4M pages then it should be faulting 4M pages. The page
>>>> size of the underlying CPU doesn't really matter much other than some
>>>> tuning to impact how the device's allocator works.
>> Yes, but then we'd be adding a lot of complexity into this function that is
>> already provided by the current interface for DAX, for little or no gain, at
>> least in the drm/ttm setting. Please think of the following situation: You
>> get a fault, you do an extensive time-consuming scan of your VRAM buffer
>> object into which the fault goes and determine you can fault 1GB. Now you
>> hand it to vmf_insert_range() and because the user-space address is
>> misaligned, or already partly populated because of a previous eviction, you
>> can only fault single pages, and you end up faulting a full GB of single
>> pages perhaps for a one-time small update.
> Why would "you can only fault single pages" ever be true? If you have
> 1GB of pages then the vmf_insert_range should allocate enough page
> table entries to consume it, regardless of alignment.

Ah yes, What I meant was you can only insert PTE size entries, either 
because of misalignment or because the page-table is alredy 
pre-populated with pmd size page directories, which you can't remove 
with only the read side of the mmap lock held.

>
> And why shouldn't DAX switch to this kind of interface anyhow? It is
> basically exactly the same problem. The underlying filesystem block
> size is *not* necessarily aligned to the CPU page table sizes and DAX
> would benefit from better handling of this mismatch.

First, I think we must sort out what "better handling" means. This is my 
takeout of the discussion so far:

Claimed Pros: of vmf_insert_range()
* We get an interface that doesn't require knowledge of CPU page table 
entry level sizes.
* We get the best efficiency when we look at what the GPU driver 
provides. (I disagree on this one).

Claimed Cons:
* A new implementation that may get complicated particularly if it 
involves modifying all of the DAX code
* The driver would have to know about those sizes anyway to get 
alignment right (Applies to DRM, because we mmap buffer objects, not 
physical address ranges. But not to DAX AFAICT),
* We loose efficiency, because we are prepared to spend an extra effort 
for alignment- and continuity checks when we know we can insert a huge 
page table entry, but not if we know we can't
* We loose efficiency because we might unnecessarily prefault a number 
of PTE size page-table entries (really a special case of the above one).

Now in the context of quickly fixing a critical bug, the choice IMHO 
becomes easy.

>
>> On top of this, unless we want to do the walk trying increasingly smaller
>> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach
>> it about transhuge page table entries, because pagewalk.c can't be used (It
>> can't populate page tables). That also means apply_to_page_range() needs to
>> be complicated with page table locks since transhuge pages aren't stable and
>> can be zapped and refaulted under us while we do the walk.
> I didn't say it would be simple :) But we also need to stop hacking
> around the sides of all this huge page stuff and come up with sensible
> APIs that drivers can actually implement correctly. Exposing drivers
> to specific kinds of page levels really feels like the wrong level of
> abstraction.

I generally agree. But for the last sentence I think the potential gain 
must be carefully weighed against the efficiency arguments.

>
> Once we start doing this we should do it everywhere, the io_remap_pfn
> stuff should be able to create huge special IO pages as well, for
> instance.

I agree here as well. Here we can be more agressive as the contigous 
range is already known and we IIRC hold the mmap lock in write mode.

>   
>> On top of this, the user-space address allocator needs to know how large gpu
>> pages are aligned in buffer objects to have a reasonable chance of aligning
>> with CPU huge page boundaries which is a requirement to be able to insert a
>> huge CPU page table entry, so the driver would basically need the drm helper
>> that can do this alignment anyway.
> Don't you have this problem anyhow?

Yes, but it sort of defeats the simplicity argument of the proposed 
interface change.

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 23:14                               ` Jason Gunthorpe
  2021-03-25  7:48                                 ` Thomas Hellström (Intel)
@ 2021-03-25  7:49                                 ` Christian König
  2021-03-25  9:41                                   ` Daniel Vetter
  1 sibling, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25  7:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Am 25.03.21 um 00:14 schrieb Jason Gunthorpe:
> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 7:31 PM, Christian König wrote:
>>>
>>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
>>>>>> (Intel) wrote:
>>>>>>
>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>> table levels would
>>>>>>>> by dynamic at this point, like THP.
>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>> by changing the
>>>>>>> interface in this way?
>>>>>> We are trying to make a sensible driver API to deal with huge pages.
>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>> one, and if we
>>>>>>> can't or don't want to (because of dirty-tracking, for
>>>>>>> example, which is
>>>>>>> always done on 4K page-level) we just return
>>>>>>> VM_FAULT_FALLBACK, and the
>>>>>>> fault is retried at a lower level.
>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>
>>>>>> I don't know if the locking works out, but it feels cleaner that the
>>>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>>>> telling the driver to stuff in a certain size page which it might not
>>>>>> want to do.
>>>>>>
>>>>>> Some devices want to work on a in-between page size like 64k so they
>>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
>>>>>> every fault.
>>>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
>>>>> smaller than and equal to the fault size to avoid extensive and
>>>>> possibly
>>>>> unnecessary checks for contigous memory.
>>>> Why? The insert function is walking the page tables, it just updates
>>>> things as they are. It learns the arragement for free while doing the
>>>> walk.
>>>>
>>>> The device has to always provide consistent data, if it overlaps into
>>>> pages that are already populated that is fine so long as it isn't
>>>> changing their addresses.
>>>>
>>>>> And then if we can't support the full fault size, we'd need to
>>>>> either presume a size and alignment of the next level or search for
>>>>> contigous memory in both directions around the fault address,
>>>>> perhaps unnecessarily as well.
>>>> You don't really need to care about levels, the device should be
>>>> faulting in the largest memory regions it can within its efficiency.
>>>>
>>>> If it works on 4M pages then it should be faulting 4M pages. The page
>>>> size of the underlying CPU doesn't really matter much other than some
>>>> tuning to impact how the device's allocator works.
>> Yes, but then we'd be adding a lot of complexity into this function that is
>> already provided by the current interface for DAX, for little or no gain, at
>> least in the drm/ttm setting. Please think of the following situation: You
>> get a fault, you do an extensive time-consuming scan of your VRAM buffer
>> object into which the fault goes and determine you can fault 1GB. Now you
>> hand it to vmf_insert_range() and because the user-space address is
>> misaligned, or already partly populated because of a previous eviction, you
>> can only fault single pages, and you end up faulting a full GB of single
>> pages perhaps for a one-time small update.
> Why would "you can only fault single pages" ever be true? If you have
> 1GB of pages then the vmf_insert_range should allocate enough page
> table entries to consume it, regardless of alignment.

Completely agree with Jason. Filling in the CPU page tables is 
relatively cheap if you fill in a large continuous range.

In other words filling in 1GiB of a linear range is *much* less overhead 
than filling in 1<<18 4KiB faults.

I would say that this is always preferable even if the CPU only wants to 
update a single byte.

> And why shouldn't DAX switch to this kind of interface anyhow? It is
> basically exactly the same problem. The underlying filesystem block
> size is *not* necessarily aligned to the CPU page table sizes and DAX
> would benefit from better handling of this mismatch.
>
>> On top of this, unless we want to do the walk trying increasingly smaller
>> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach
>> it about transhuge page table entries, because pagewalk.c can't be used (It
>> can't populate page tables). That also means apply_to_page_range() needs to
>> be complicated with page table locks since transhuge pages aren't stable and
>> can be zapped and refaulted under us while we do the walk.
> I didn't say it would be simple :) But we also need to stop hacking
> around the sides of all this huge page stuff and come up with sensible
> APIs that drivers can actually implement correctly. Exposing drivers
> to specific kinds of page levels really feels like the wrong level of
> abstraction.
>
> Once we start doing this we should do it everywhere, the io_remap_pfn
> stuff should be able to create huge special IO pages as well, for
> instance.

Oh, yes please!

We easily have 16GiB of VRAM which is linear mapped into the kernel 
space for each GPU instance.

Doing that with 1GiB mapping instead of 4KiB would be quite a win.

Regards,
Christian.

>   
>> On top of this, the user-space address allocator needs to know how large gpu
>> pages are aligned in buffer objects to have a reasonable chance of aligning
>> with CPU huge page boundaries which is a requirement to be able to insert a
>> huge CPU page table entry, so the driver would basically need the drm helper
>> that can do this alignment anyway.
> Don't you have this problem anyhow?
>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25  7:48                                 ` Thomas Hellström (Intel)
@ 2021-03-25  8:27                                   ` Christian König
  2021-03-25  9:51                                     ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25  8:27 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Am 25.03.21 um 08:48 schrieb Thomas Hellström (Intel):
>
> On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
>> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) 
>> wrote:
>>> On 3/24/21 7:31 PM, Christian König wrote:
>>>>
>>>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>>>>> wrote:
>>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
>>>>>>> (Intel) wrote:
>>>>>>>
>>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>>> table levels would
>>>>>>>>> by dynamic at this point, like THP.
>>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>>> by changing the
>>>>>>>> interface in this way?
>>>>>>> We are trying to make a sensible driver API to deal with huge 
>>>>>>> pages.
>>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>>> one, and if we
>>>>>>>> can't or don't want to (because of dirty-tracking, for
>>>>>>>> example, which is
>>>>>>>> always done on 4K page-level) we just return
>>>>>>>> VM_FAULT_FALLBACK, and the
>>>>>>>> fault is retried at a lower level.
>>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>>
>>>>>>> I don't know if the locking works out, but it feels cleaner that 
>>>>>>> the
>>>>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>>>>> telling the driver to stuff in a certain size page which it 
>>>>>>> might not
>>>>>>> want to do.
>>>>>>>
>>>>>>> Some devices want to work on a in-between page size like 64k so 
>>>>>>> they
>>>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a 
>>>>>>> batch on
>>>>>>> every fault.
>>>>>> Hmm, yes, but we would in that case be limited anyway to insert 
>>>>>> ranges
>>>>>> smaller than and equal to the fault size to avoid extensive and
>>>>>> possibly
>>>>>> unnecessary checks for contigous memory.
>>>>> Why? The insert function is walking the page tables, it just updates
>>>>> things as they are. It learns the arragement for free while doing the
>>>>> walk.
>>>>>
>>>>> The device has to always provide consistent data, if it overlaps into
>>>>> pages that are already populated that is fine so long as it isn't
>>>>> changing their addresses.
>>>>>
>>>>>> And then if we can't support the full fault size, we'd need to
>>>>>> either presume a size and alignment of the next level or search for
>>>>>> contigous memory in both directions around the fault address,
>>>>>> perhaps unnecessarily as well.
>>>>> You don't really need to care about levels, the device should be
>>>>> faulting in the largest memory regions it can within its efficiency.
>>>>>
>>>>> If it works on 4M pages then it should be faulting 4M pages. The page
>>>>> size of the underlying CPU doesn't really matter much other than some
>>>>> tuning to impact how the device's allocator works.
>>> Yes, but then we'd be adding a lot of complexity into this function 
>>> that is
>>> already provided by the current interface for DAX, for little or no 
>>> gain, at
>>> least in the drm/ttm setting. Please think of the following 
>>> situation: You
>>> get a fault, you do an extensive time-consuming scan of your VRAM 
>>> buffer
>>> object into which the fault goes and determine you can fault 1GB. 
>>> Now you
>>> hand it to vmf_insert_range() and because the user-space address is
>>> misaligned, or already partly populated because of a previous 
>>> eviction, you
>>> can only fault single pages, and you end up faulting a full GB of 
>>> single
>>> pages perhaps for a one-time small update.
>> Why would "you can only fault single pages" ever be true? If you have
>> 1GB of pages then the vmf_insert_range should allocate enough page
>> table entries to consume it, regardless of alignment.
>
> Ah yes, What I meant was you can only insert PTE size entries, either 
> because of misalignment or because the page-table is alredy 
> pre-populated with pmd size page directories, which you can't remove 
> with only the read side of the mmap lock held.

Please explain that further. Why do we need the mmap lock to insert PMDs 
but not when insert PTEs?

>> And why shouldn't DAX switch to this kind of interface anyhow? It is
>> basically exactly the same problem. The underlying filesystem block
>> size is *not* necessarily aligned to the CPU page table sizes and DAX
>> would benefit from better handling of this mismatch.
>
> First, I think we must sort out what "better handling" means. This is 
> my takeout of the discussion so far:
>
> Claimed Pros: of vmf_insert_range()
> * We get an interface that doesn't require knowledge of CPU page table 
> entry level sizes.
> * We get the best efficiency when we look at what the GPU driver 
> provides. (I disagree on this one).
>
> Claimed Cons:
> * A new implementation that may get complicated particularly if it 
> involves modifying all of the DAX code
> * The driver would have to know about those sizes anyway to get 
> alignment right (Applies to DRM, because we mmap buffer objects, not 
> physical address ranges. But not to DAX AFAICT),

I don't think so. We could just align all buffers to their next power of 
two in size. Since we have plenty of offset space that shouldn't matter 
much.

Apart from that I still don't fully get why we need this in the first place.

> * We loose efficiency, because we are prepared to spend an extra 
> effort for alignment- and continuity checks when we know we can insert 
> a huge page table entry, but not if we know we can't

I don't think so either. See with don't need any extra effort for the 
alignment nor the handling, it actually becomes much cheaper as far as I 
can see.

In other words when you have a fault you don't care about the faulting 
address that much, you only use it to determine the memory segment to map.

Then this whole memory segment is mapped into the address space of the 
user application.

If can of course happen that we need to fiddle with addresses and sizes 
because userspace only mmap a fraction of the underlying buffer, but in 
reality we never do this.

> * We loose efficiency because we might unnecessarily prefault a number 
> of PTE size page-table entries (really a special case of the above one).

I really don't see that either. When a buffer is accessed by the CPU it 
is in > 90% of all cases completely accessed. Not faulting in full 
ranges is just optimizing for a really unlikely case here.

>
> Now in the context of quickly fixing a critical bug, the choice IMHO 
> becomes easy.

Well for quick fixing this I would rather disable huge pages for now.

Regards,
Christian.

>
>>
>>> On top of this, unless we want to do the walk trying increasingly 
>>> smaller
>>> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() 
>>> and teach
>>> it about transhuge page table entries, because pagewalk.c can't be 
>>> used (It
>>> can't populate page tables). That also means apply_to_page_range() 
>>> needs to
>>> be complicated with page table locks since transhuge pages aren't 
>>> stable and
>>> can be zapped and refaulted under us while we do the walk.
>> I didn't say it would be simple :) But we also need to stop hacking
>> around the sides of all this huge page stuff and come up with sensible
>> APIs that drivers can actually implement correctly. Exposing drivers
>> to specific kinds of page levels really feels like the wrong level of
>> abstraction.
>
> I generally agree. But for the last sentence I think the potential 
> gain must be carefully weighed against the efficiency arguments.
>
>>
>> Once we start doing this we should do it everywhere, the io_remap_pfn
>> stuff should be able to create huge special IO pages as well, for
>> instance.
>
> I agree here as well. Here we can be more agressive as the contigous 
> range is already known and we IIRC hold the mmap lock in write mode.
>
>>> On top of this, the user-space address allocator needs to know how 
>>> large gpu
>>> pages are aligned in buffer objects to have a reasonable chance of 
>>> aligning
>>> with CPU huge page boundaries which is a requirement to be able to 
>>> insert a
>>> huge CPU page table entry, so the driver would basically need the 
>>> drm helper
>>> that can do this alignment anyway.
>> Don't you have this problem anyhow?
>
> Yes, but it sort of defeats the simplicity argument of the proposed 
> interface change.
>
> /Thomas
>
>



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25  7:49                                 ` Christian König
@ 2021-03-25  9:41                                   ` Daniel Vetter
  0 siblings, 0 replies; 63+ messages in thread
From: Daniel Vetter @ 2021-03-25  9:41 UTC (permalink / raw)
  To: Christian König
  Cc: Jason Gunthorpe, Thomas Hellström (Intel),
	David Airlie, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	dri-devel

On Thu, Mar 25, 2021 at 8:50 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 25.03.21 um 00:14 schrieb Jason Gunthorpe:
> > On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/24/21 7:31 PM, Christian König wrote:
> >>>
> >>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> >>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> >>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> >>>>>> (Intel) wrote:
> >>>>>>
> >>>>>>>> In an ideal world the creation/destruction of page
> >>>>>>>> table levels would
> >>>>>>>> by dynamic at this point, like THP.
> >>>>>>> Hmm, but I'm not sure what problem we're trying to solve
> >>>>>>> by changing the
> >>>>>>> interface in this way?
> >>>>>> We are trying to make a sensible driver API to deal with huge pages.
> >>>>>>> Currently if the core vm requests a huge pud, we give it
> >>>>>>> one, and if we
> >>>>>>> can't or don't want to (because of dirty-tracking, for
> >>>>>>> example, which is
> >>>>>>> always done on 4K page-level) we just return
> >>>>>>> VM_FAULT_FALLBACK, and the
> >>>>>>> fault is retried at a lower level.
> >>>>>> Well, my thought would be to move the pte related stuff into
> >>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> >>>>>>
> >>>>>> I don't know if the locking works out, but it feels cleaner that the
> >>>>>> driver tells the vmf how big a page it can stuff in, not the vm
> >>>>>> telling the driver to stuff in a certain size page which it might not
> >>>>>> want to do.
> >>>>>>
> >>>>>> Some devices want to work on a in-between page size like 64k so they
> >>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
> >>>>>> every fault.
> >>>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
> >>>>> smaller than and equal to the fault size to avoid extensive and
> >>>>> possibly
> >>>>> unnecessary checks for contigous memory.
> >>>> Why? The insert function is walking the page tables, it just updates
> >>>> things as they are. It learns the arragement for free while doing the
> >>>> walk.
> >>>>
> >>>> The device has to always provide consistent data, if it overlaps into
> >>>> pages that are already populated that is fine so long as it isn't
> >>>> changing their addresses.
> >>>>
> >>>>> And then if we can't support the full fault size, we'd need to
> >>>>> either presume a size and alignment of the next level or search for
> >>>>> contigous memory in both directions around the fault address,
> >>>>> perhaps unnecessarily as well.
> >>>> You don't really need to care about levels, the device should be
> >>>> faulting in the largest memory regions it can within its efficiency.
> >>>>
> >>>> If it works on 4M pages then it should be faulting 4M pages. The page
> >>>> size of the underlying CPU doesn't really matter much other than some
> >>>> tuning to impact how the device's allocator works.
> >> Yes, but then we'd be adding a lot of complexity into this function that is
> >> already provided by the current interface for DAX, for little or no gain, at
> >> least in the drm/ttm setting. Please think of the following situation: You
> >> get a fault, you do an extensive time-consuming scan of your VRAM buffer
> >> object into which the fault goes and determine you can fault 1GB. Now you
> >> hand it to vmf_insert_range() and because the user-space address is
> >> misaligned, or already partly populated because of a previous eviction, you
> >> can only fault single pages, and you end up faulting a full GB of single
> >> pages perhaps for a one-time small update.
> > Why would "you can only fault single pages" ever be true? If you have
> > 1GB of pages then the vmf_insert_range should allocate enough page
> > table entries to consume it, regardless of alignment.
>
> Completely agree with Jason. Filling in the CPU page tables is
> relatively cheap if you fill in a large continuous range.
>
> In other words filling in 1GiB of a linear range is *much* less overhead
> than filling in 1<<18 4KiB faults.
>
> I would say that this is always preferable even if the CPU only wants to
> update a single byte.
>
> > And why shouldn't DAX switch to this kind of interface anyhow? It is
> > basically exactly the same problem. The underlying filesystem block
> > size is *not* necessarily aligned to the CPU page table sizes and DAX
> > would benefit from better handling of this mismatch.
> >
> >> On top of this, unless we want to do the walk trying increasingly smaller
> >> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach
> >> it about transhuge page table entries, because pagewalk.c can't be used (It
> >> can't populate page tables). That also means apply_to_page_range() needs to
> >> be complicated with page table locks since transhuge pages aren't stable and
> >> can be zapped and refaulted under us while we do the walk.
> > I didn't say it would be simple :) But we also need to stop hacking
> > around the sides of all this huge page stuff and come up with sensible
> > APIs that drivers can actually implement correctly. Exposing drivers
> > to specific kinds of page levels really feels like the wrong level of
> > abstraction.
> >
> > Once we start doing this we should do it everywhere, the io_remap_pfn
> > stuff should be able to create huge special IO pages as well, for
> > instance.
>
> Oh, yes please!
>
> We easily have 16GiB of VRAM which is linear mapped into the kernel
> space for each GPU instance.
>
> Doing that with 1GiB mapping instead of 4KiB would be quite a win.

io_remap_pfn is for userspace mmaps. Kernel mappings should be as big
as possible already I think for everything.
-Daniel


> Regards,
> Christian.
>
> >
> >> On top of this, the user-space address allocator needs to know how large gpu
> >> pages are aligned in buffer objects to have a reasonable chance of aligning
> >> with CPU huge page boundaries which is a requirement to be able to insert a
> >> huge CPU page table entry, so the driver would basically need the drm helper
> >> that can do this alignment anyway.
> > Don't you have this problem anyhow?
> >
> > Jason
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25  8:27                                   ` Christian König
@ 2021-03-25  9:51                                     ` Thomas Hellström (Intel)
  2021-03-25 11:30                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25  9:51 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton


On 3/25/21 9:27 AM, Christian König wrote:
> Am 25.03.21 um 08:48 schrieb Thomas Hellström (Intel):
>>
>> On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
>>> On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) 
>>> wrote:
>>>> On 3/24/21 7:31 PM, Christian König wrote:
>>>>>
>>>>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
>>>>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>>>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
>>>>>>>> (Intel) wrote:
>>>>>>>>
>>>>>>>>>> In an ideal world the creation/destruction of page
>>>>>>>>>> table levels would
>>>>>>>>>> by dynamic at this point, like THP.
>>>>>>>>> Hmm, but I'm not sure what problem we're trying to solve
>>>>>>>>> by changing the
>>>>>>>>> interface in this way?
>>>>>>>> We are trying to make a sensible driver API to deal with huge 
>>>>>>>> pages.
>>>>>>>>> Currently if the core vm requests a huge pud, we give it
>>>>>>>>> one, and if we
>>>>>>>>> can't or don't want to (because of dirty-tracking, for
>>>>>>>>> example, which is
>>>>>>>>> always done on 4K page-level) we just return
>>>>>>>>> VM_FAULT_FALLBACK, and the
>>>>>>>>> fault is retried at a lower level.
>>>>>>>> Well, my thought would be to move the pte related stuff into
>>>>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>>>>>>
>>>>>>>> I don't know if the locking works out, but it feels cleaner 
>>>>>>>> that the
>>>>>>>> driver tells the vmf how big a page it can stuff in, not the vm
>>>>>>>> telling the driver to stuff in a certain size page which it 
>>>>>>>> might not
>>>>>>>> want to do.
>>>>>>>>
>>>>>>>> Some devices want to work on a in-between page size like 64k so 
>>>>>>>> they
>>>>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a 
>>>>>>>> batch on
>>>>>>>> every fault.
>>>>>>> Hmm, yes, but we would in that case be limited anyway to insert 
>>>>>>> ranges
>>>>>>> smaller than and equal to the fault size to avoid extensive and
>>>>>>> possibly
>>>>>>> unnecessary checks for contigous memory.
>>>>>> Why? The insert function is walking the page tables, it just updates
>>>>>> things as they are. It learns the arragement for free while doing 
>>>>>> the
>>>>>> walk.
>>>>>>
>>>>>> The device has to always provide consistent data, if it overlaps 
>>>>>> into
>>>>>> pages that are already populated that is fine so long as it isn't
>>>>>> changing their addresses.
>>>>>>
>>>>>>> And then if we can't support the full fault size, we'd need to
>>>>>>> either presume a size and alignment of the next level or search for
>>>>>>> contigous memory in both directions around the fault address,
>>>>>>> perhaps unnecessarily as well.
>>>>>> You don't really need to care about levels, the device should be
>>>>>> faulting in the largest memory regions it can within its efficiency.
>>>>>>
>>>>>> If it works on 4M pages then it should be faulting 4M pages. The 
>>>>>> page
>>>>>> size of the underlying CPU doesn't really matter much other than 
>>>>>> some
>>>>>> tuning to impact how the device's allocator works.
>>>> Yes, but then we'd be adding a lot of complexity into this function 
>>>> that is
>>>> already provided by the current interface for DAX, for little or no 
>>>> gain, at
>>>> least in the drm/ttm setting. Please think of the following 
>>>> situation: You
>>>> get a fault, you do an extensive time-consuming scan of your VRAM 
>>>> buffer
>>>> object into which the fault goes and determine you can fault 1GB. 
>>>> Now you
>>>> hand it to vmf_insert_range() and because the user-space address is
>>>> misaligned, or already partly populated because of a previous 
>>>> eviction, you
>>>> can only fault single pages, and you end up faulting a full GB of 
>>>> single
>>>> pages perhaps for a one-time small update.
>>> Why would "you can only fault single pages" ever be true? If you have
>>> 1GB of pages then the vmf_insert_range should allocate enough page
>>> table entries to consume it, regardless of alignment.
>>
>> Ah yes, What I meant was you can only insert PTE size entries, either 
>> because of misalignment or because the page-table is alredy 
>> pre-populated with pmd size page directories, which you can't remove 
>> with only the read side of the mmap lock held.
>
> Please explain that further. Why do we need the mmap lock to insert 
> PMDs but not when insert PTEs?

We don't. But once you've inserted a PMD directory you can't remove it 
unless you have the mmap lock (and probably also the i_mmap_lock in 
write mode). That for example means that if you have a VRAM region 
mapped with huge PMDs, and then it gets evicted, and you happen to read 
a byte from it when it's evicted and therefore populate the full region 
with PTEs pointing to system pages, you can't go back to huge PMDs again 
without a munmap() in between.

>
>>> And why shouldn't DAX switch to this kind of interface anyhow? It is
>>> basically exactly the same problem. The underlying filesystem block
>>> size is *not* necessarily aligned to the CPU page table sizes and DAX
>>> would benefit from better handling of this mismatch.
>>
>> First, I think we must sort out what "better handling" means. This is 
>> my takeout of the discussion so far:
>>
>> Claimed Pros: of vmf_insert_range()
>> * We get an interface that doesn't require knowledge of CPU page 
>> table entry level sizes.
>> * We get the best efficiency when we look at what the GPU driver 
>> provides. (I disagree on this one).
>>
>> Claimed Cons:
>> * A new implementation that may get complicated particularly if it 
>> involves modifying all of the DAX code
>> * The driver would have to know about those sizes anyway to get 
>> alignment right (Applies to DRM, because we mmap buffer objects, not 
>> physical address ranges. But not to DAX AFAICT),
>
> I don't think so. We could just align all buffers to their next power 
> of two in size. Since we have plenty of offset space that shouldn't 
> matter much.
It's not offset space like in drm fake offsets, but virtual address 
space. But I guess we have plenty of that as well.
>
> Apart from that I still don't fully get why we need this in the first 
> place.

Because virtual huge page address boundaries need to be aligned with 
physical huge page address boundaries, and mmap can happen before bos 
are populated so you have no way of knowing how physical huge page 
address boundaries are laid out in the buffer object unless you define a 
rule for how that should be done. Meaning whatever scheme you use for 
the virtual address space you need apply for linear VRAM as well, and 
that's a more scarce resource.

The scheme used today is that buffers that are > PMD size aligns bo VRAM 
on PMD size boundaries if possible (and similar for virtual addresses). 
Buffers that are > PUD size aligns to PUD size boundaries.

>
>> * We loose efficiency, because we are prepared to spend an extra 
>> effort for alignment- and continuity checks when we know we can 
>> insert a huge page table entry, but not if we know we can't
>
> I don't think so either. See with don't need any extra effort for the 
> alignment nor the handling, it actually becomes much cheaper as far as 
> I can see.

We have those checks there today. Think buddy allocator, bos with system 
pages. They are only executed when we know we can insert a huge page 
today. With the new interface we'd do them always.

>
> In other words when you have a fault you don't care about the faulting 
> address that much, you only use it to determine the memory segment to 
> map.
>
> Then this whole memory segment is mapped into the address space of the 
> user application.
>
> If can of course happen that we need to fiddle with addresses and 
> sizes because userspace only mmap a fraction of the underlying buffer, 
> but in reality we never do this.
>
>> * We loose efficiency because we might unnecessarily prefault a 
>> number of PTE size page-table entries (really a special case of the 
>> above one).
>
> I really don't see that either. When a buffer is accessed by the CPU 
> it is in > 90% of all cases completely accessed. Not faulting in full 
> ranges is just optimizing for a really unlikely case here.

It might be that you're right, but are all drivers wanting to use this 
like drm in this respect? Using the interface to fault in a 1G range in 
the hope it could map it to a huge pud may unexpectedly consume and 
populate some 16+ MB of page tables.

To me, keeping the current interface for flexibility and add an optional 
huge-page-table-entry-aware prefaulting helper, perhaps not restricted 
to contigous ranges, for drivers that think its a good idea to prefault 
given the right conditions and unconditionally for stuff like 
remap_pfn_range() sounds reasonable.

/Thomas

>
>>
>> Now in the context of quickly fixing a critical bug, the choice IMHO 
>> becomes easy.
>
> Well for quick fixing this I would rather disable huge pages for now.
>
> Regards,
> Christian.
>
>>
>>>
>>>> On top of this, unless we want to do the walk trying increasingly 
>>>> smaller
>>>> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() 
>>>> and teach
>>>> it about transhuge page table entries, because pagewalk.c can't be 
>>>> used (It
>>>> can't populate page tables). That also means apply_to_page_range() 
>>>> needs to
>>>> be complicated with page table locks since transhuge pages aren't 
>>>> stable and
>>>> can be zapped and refaulted under us while we do the walk.
>>> I didn't say it would be simple :) But we also need to stop hacking
>>> around the sides of all this huge page stuff and come up with sensible
>>> APIs that drivers can actually implement correctly. Exposing drivers
>>> to specific kinds of page levels really feels like the wrong level of
>>> abstraction.
>>
>> I generally agree. But for the last sentence I think the potential 
>> gain must be carefully weighed against the efficiency arguments.
>>
>>>
>>> Once we start doing this we should do it everywhere, the io_remap_pfn
>>> stuff should be able to create huge special IO pages as well, for
>>> instance.
>>
>> I agree here as well. Here we can be more agressive as the contigous 
>> range is already known and we IIRC hold the mmap lock in write mode.
>>
>>>> On top of this, the user-space address allocator needs to know how 
>>>> large gpu
>>>> pages are aligned in buffer objects to have a reasonable chance of 
>>>> aligning
>>>> with CPU huge page boundaries which is a requirement to be able to 
>>>> insert a
>>>> huge CPU page table entry, so the driver would basically need the 
>>>> drm helper
>>>> that can do this alignment anyway.
>>> Don't you have this problem anyhow?
>>
>> Yes, but it sort of defeats the simplicity argument of the proposed 
>> interface change.
>>
>> /Thomas
>>
>>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25  9:51                                     ` Thomas Hellström (Intel)
@ 2021-03-25 11:30                                       ` Jason Gunthorpe
  2021-03-25 11:53                                         ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 11:30 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, David Airlie, linux-kernel, dri-devel,
	linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 10:51:35AM +0100, Thomas Hellström (Intel) wrote:

> > Please explain that further. Why do we need the mmap lock to insert PMDs
> > but not when insert PTEs?
> 
> We don't. But once you've inserted a PMD directory you can't remove it
> unless you have the mmap lock (and probably also the i_mmap_lock in write
> mode). That for example means that if you have a VRAM region mapped with
> huge PMDs, and then it gets evicted, and you happen to read a byte from it
> when it's evicted and therefore populate the full region with PTEs pointing
> to system pages, you can't go back to huge PMDs again without a munmap() in
> between.

This is all basically magic to me still, but THP does this
transformation and I think what it does could work here too. We
probably wouldn't be able to upgrade while handling fault, but at the
same time, this should be quite rare as it would require the driver to
have supplied a small page for this VMA at some point.

> > Apart from that I still don't fully get why we need this in the first
> > place.
> 
> Because virtual huge page address boundaries need to be aligned with
> physical huge page address boundaries, and mmap can happen before bos are
> populated so you have no way of knowing how physical huge page
> address

But this is a mmap-time problem, fault can't fix mmap using the wrong VA.

> > I really don't see that either. When a buffer is accessed by the CPU it
> > is in > 90% of all cases completely accessed. Not faulting in full
> > ranges is just optimizing for a really unlikely case here.
> 
> It might be that you're right, but are all drivers wanting to use this like
> drm in this respect? Using the interface to fault in a 1G range in the hope
> it could map it to a huge pud may unexpectedly consume and populate some 16+
> MB of page tables.

If the underlying device block size is so big then sure, why not? The
"unexpectedly" should be quite rare/non existant anyhow.

Jason
 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 11:30                                       ` Jason Gunthorpe
@ 2021-03-25 11:53                                         ` Thomas Hellström (Intel)
  2021-03-25 12:01                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 11:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, David Airlie, linux-kernel, dri-devel,
	linux-mm, Andrew Morton


On 3/25/21 12:30 PM, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 10:51:35AM +0100, Thomas Hellström (Intel) wrote:
>
>>> Please explain that further. Why do we need the mmap lock to insert PMDs
>>> but not when insert PTEs?
>> We don't. But once you've inserted a PMD directory you can't remove it
>> unless you have the mmap lock (and probably also the i_mmap_lock in write
>> mode). That for example means that if you have a VRAM region mapped with
>> huge PMDs, and then it gets evicted, and you happen to read a byte from it
>> when it's evicted and therefore populate the full region with PTEs pointing
>> to system pages, you can't go back to huge PMDs again without a munmap() in
>> between.
> This is all basically magic to me still, but THP does this
> transformation and I think what it does could work here too. We
> probably wouldn't be able to upgrade while handling fault, but at the
> same time, this should be quite rare as it would require the driver to
> have supplied a small page for this VMA at some point.

IIRC THP handles this using khugepaged, grabbing the lock in write mode 
when coalescing, and yeah, I don't think anything prevents anyone from 
extending khugepaged doing that also for special huge page table entries.

>
>>> Apart from that I still don't fully get why we need this in the first
>>> place.
>> Because virtual huge page address boundaries need to be aligned with
>> physical huge page address boundaries, and mmap can happen before bos are
>> populated so you have no way of knowing how physical huge page
>> address
> But this is a mmap-time problem, fault can't fix mmap using the wrong VA.

Nope. The point here was that in this case, to make sure mmap uses the 
correct VA to give us a reasonable chance of alignement, the driver 
might need to be aware of and do trickery with the huge page-table-entry 
sizes anyway, although I think in most cases a standard helper for this 
can be supplied.

/Thomas


>
>>> I really don't see that either. When a buffer is accessed by the CPU it
>>> is in > 90% of all cases completely accessed. Not faulting in full
>>> ranges is just optimizing for a really unlikely case here.
>> It might be that you're right, but are all drivers wanting to use this like
>> drm in this respect? Using the interface to fault in a 1G range in the hope
>> it could map it to a huge pud may unexpectedly consume and populate some 16+
>> MB of page tables.
> If the underlying device block size is so big then sure, why not? The
> "unexpectedly" should be quite rare/non existant anyhow.
>
> Jason
>   


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 11:53                                         ` Thomas Hellström (Intel)
@ 2021-03-25 12:01                                           ` Jason Gunthorpe
  2021-03-25 12:09                                             ` Christian König
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 12:01 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, David Airlie, linux-kernel, dri-devel,
	linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:

> Nope. The point here was that in this case, to make sure mmap uses the
> correct VA to give us a reasonable chance of alignement, the driver might
> need to be aware of and do trickery with the huge page-table-entry sizes
> anyway, although I think in most cases a standard helper for this can be
> supplied.

Of course the driver needs some way to influence the VA mmap uses,
gernally it should align to the natural page size of the device

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 12:01                                           ` Jason Gunthorpe
@ 2021-03-25 12:09                                             ` Christian König
  2021-03-25 12:36                                               ` Thomas Hellström (Intel)
  2021-03-25 12:42                                               ` Jason Gunthorpe
  0 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2021-03-25 12:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Hellström (Intel)
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
>
>> Nope. The point here was that in this case, to make sure mmap uses the
>> correct VA to give us a reasonable chance of alignement, the driver might
>> need to be aware of and do trickery with the huge page-table-entry sizes
>> anyway, although I think in most cases a standard helper for this can be
>> supplied.
> Of course the driver needs some way to influence the VA mmap uses,
> gernally it should align to the natural page size of the device

Well a mmap() needs to be aligned to the page size of the CPU, but not 
necessarily to the one of the device.

So I'm pretty sure the device driver should not be involved in any way 
the choosing of the VA for the CPU mapping.

Christian.

>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 12:09                                             ` Christian König
@ 2021-03-25 12:36                                               ` Thomas Hellström (Intel)
  2021-03-25 13:02                                                 ` Christian König
  2021-03-25 12:42                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 12:36 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton


On 3/25/21 1:09 PM, Christian König wrote:
> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) 
>> wrote:
>>
>>> Nope. The point here was that in this case, to make sure mmap uses the
>>> correct VA to give us a reasonable chance of alignement, the driver 
>>> might
>>> need to be aware of and do trickery with the huge page-table-entry 
>>> sizes
>>> anyway, although I think in most cases a standard helper for this 
>>> can be
>>> supplied.
>> Of course the driver needs some way to influence the VA mmap uses,
>> gernally it should align to the natural page size of the device
>
> Well a mmap() needs to be aligned to the page size of the CPU, but not 
> necessarily to the one of the device.
>
> So I'm pretty sure the device driver should not be involved in any way 
> the choosing of the VA for the CPU mapping.
>
> Christian.
>
We've had this discussion before and at that time I managed to convince 
you by pointing to the shmem helper for this, shmem_get_umapped_area().

Basically there are two ways to do this. Either use a standard helper 
similar to shmem's, and then the driver needs to align physical (device) 
huge page boundaries to address space offset huge page boundaries. If 
you don't do that you can just as well use a custom function that 
adjusts for you not doing that (drm_get_unmapped_area()). Both require 
driver knowledge of the size of huge pages.

Without a function to adjust, mmap will use it's default (16 byte?) 
alignment and chance of alignment becomes very small.

/Thomas


>>
>> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 12:09                                             ` Christian König
  2021-03-25 12:36                                               ` Thomas Hellström (Intel)
@ 2021-03-25 12:42                                               ` Jason Gunthorpe
  2021-03-25 13:05                                                 ` Christian König
  1 sibling, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 12:42 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
> > On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
> > 
> > > Nope. The point here was that in this case, to make sure mmap uses the
> > > correct VA to give us a reasonable chance of alignement, the driver might
> > > need to be aware of and do trickery with the huge page-table-entry sizes
> > > anyway, although I think in most cases a standard helper for this can be
> > > supplied.
> > Of course the driver needs some way to influence the VA mmap uses,
> > gernally it should align to the natural page size of the device
> 
> Well a mmap() needs to be aligned to the page size of the CPU, but not
> necessarily to the one of the device.
> 
> So I'm pretty sure the device driver should not be involved in any way the
> choosing of the VA for the CPU mapping.

No, if the device wants to use huge pages it must influence the mmap
VA or it can't form huge pgaes.

It is the same reason why mmap returns 2M stuff these days to make THP
possible

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 12:36                                               ` Thomas Hellström (Intel)
@ 2021-03-25 13:02                                                 ` Christian König
  2021-03-25 13:31                                                   ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25 13:02 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton



Am 25.03.21 um 13:36 schrieb Thomas Hellström (Intel):
>
> On 3/25/21 1:09 PM, Christian König wrote:
>> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) 
>>> wrote:
>>>
>>>> Nope. The point here was that in this case, to make sure mmap uses the
>>>> correct VA to give us a reasonable chance of alignement, the driver 
>>>> might
>>>> need to be aware of and do trickery with the huge page-table-entry 
>>>> sizes
>>>> anyway, although I think in most cases a standard helper for this 
>>>> can be
>>>> supplied.
>>> Of course the driver needs some way to influence the VA mmap uses,
>>> gernally it should align to the natural page size of the device
>>
>> Well a mmap() needs to be aligned to the page size of the CPU, but 
>> not necessarily to the one of the device.
>>
>> So I'm pretty sure the device driver should not be involved in any 
>> way the choosing of the VA for the CPU mapping.
>>
>> Christian.
>>
> We've had this discussion before and at that time I managed to 
> convince you by pointing to the shmem helper for this, 
> shmem_get_umapped_area().

No, you didn't convinced me. I was just surprised that this is something 
under driver control.

>
> Basically there are two ways to do this. Either use a standard helper 
> similar to shmem's, and then the driver needs to align physical 
> (device) huge page boundaries to address space offset huge page 
> boundaries. If you don't do that you can just as well use a custom 
> function that adjusts for you not doing that 
> (drm_get_unmapped_area()). Both require driver knowledge of the size 
> of huge pages.

And once more, at least for GPU drivers that looks like the totally 
wrong approach to me.

Aligning the VMA so that huge page allocations become possible is the 
job of the MM subsystem and not that of the drivers.

>
> Without a function to adjust, mmap will use it's default (16 byte?) 
> alignment and chance of alignment becomes very small.

Well it's 4KiB at least.

Regards,
Christian.

>
> /Thomas
>
>
>>>
>>> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 12:42                                               ` Jason Gunthorpe
@ 2021-03-25 13:05                                                 ` Christian König
  2021-03-25 13:17                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25 13:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton



Am 25.03.21 um 13:42 schrieb Jason Gunthorpe:
> On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
>> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
>>>
>>>> Nope. The point here was that in this case, to make sure mmap uses the
>>>> correct VA to give us a reasonable chance of alignement, the driver might
>>>> need to be aware of and do trickery with the huge page-table-entry sizes
>>>> anyway, although I think in most cases a standard helper for this can be
>>>> supplied.
>>> Of course the driver needs some way to influence the VA mmap uses,
>>> gernally it should align to the natural page size of the device
>> Well a mmap() needs to be aligned to the page size of the CPU, but not
>> necessarily to the one of the device.
>>
>> So I'm pretty sure the device driver should not be involved in any way the
>> choosing of the VA for the CPU mapping.
> No, if the device wants to use huge pages it must influence the mmap
> VA or it can't form huge pgaes.

No, that's the job of the core MM and not of the individual driver.

In other words current->mm->get_unmapped_area should already return a 
properly aligned VA.

Messing with that inside file->f_op->get_unmapped_area is utterly 
nonsense as far as I can see.

It happens to be this way currently, but that is not even remotely good 
design.

Christian.

>
> It is the same reason why mmap returns 2M stuff these days to make THP
> possible
>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:05                                                 ` Christian König
@ 2021-03-25 13:17                                                   ` Jason Gunthorpe
  2021-03-25 13:26                                                     ` Christian König
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 13:17 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 02:05:14PM +0100, Christian König wrote:
> 
> 
> Am 25.03.21 um 13:42 schrieb Jason Gunthorpe:
> > On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
> > > Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
> > > > On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
> > > > 
> > > > > Nope. The point here was that in this case, to make sure mmap uses the
> > > > > correct VA to give us a reasonable chance of alignement, the driver might
> > > > > need to be aware of and do trickery with the huge page-table-entry sizes
> > > > > anyway, although I think in most cases a standard helper for this can be
> > > > > supplied.
> > > > Of course the driver needs some way to influence the VA mmap uses,
> > > > gernally it should align to the natural page size of the device
> > > Well a mmap() needs to be aligned to the page size of the CPU, but not
> > > necessarily to the one of the device.
> > > 
> > > So I'm pretty sure the device driver should not be involved in any way the
> > > choosing of the VA for the CPU mapping.
> > No, if the device wants to use huge pages it must influence the mmap
> > VA or it can't form huge pgaes.
> 
> No, that's the job of the core MM and not of the individual driver.

The core mm doesn't know the page size of the device, only which of
several page levels the arch supports. The device must be involevd
here.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:17                                                   ` Jason Gunthorpe
@ 2021-03-25 13:26                                                     ` Christian König
  2021-03-25 13:33                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25 13:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Am 25.03.21 um 14:17 schrieb Jason Gunthorpe:
> On Thu, Mar 25, 2021 at 02:05:14PM +0100, Christian König wrote:
>>
>> Am 25.03.21 um 13:42 schrieb Jason Gunthorpe:
>>> On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
>>>> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>>>>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
>>>>>
>>>>>> Nope. The point here was that in this case, to make sure mmap uses the
>>>>>> correct VA to give us a reasonable chance of alignement, the driver might
>>>>>> need to be aware of and do trickery with the huge page-table-entry sizes
>>>>>> anyway, although I think in most cases a standard helper for this can be
>>>>>> supplied.
>>>>> Of course the driver needs some way to influence the VA mmap uses,
>>>>> gernally it should align to the natural page size of the device
>>>> Well a mmap() needs to be aligned to the page size of the CPU, but not
>>>> necessarily to the one of the device.
>>>>
>>>> So I'm pretty sure the device driver should not be involved in any way the
>>>> choosing of the VA for the CPU mapping.
>>> No, if the device wants to use huge pages it must influence the mmap
>>> VA or it can't form huge pgaes.
>> No, that's the job of the core MM and not of the individual driver.
> The core mm doesn't know the page size of the device, only which of
> several page levels the arch supports. The device must be involevd
> here.

Why? See you can have a device which has for example 256KiB pages, but 
it should perfectly work that the CPU mapping is aligned to only 4KiB.

As long as you don't do things like shared virtual memory between device 
and CPU the VA addresses used on the CPU should be completely irrelevant 
for the device.

Regards,
Christian.

>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:02                                                 ` Christian König
@ 2021-03-25 13:31                                                   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 13:31 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Hi,

On 3/25/21 2:02 PM, Christian König wrote:
>
>
> Am 25.03.21 um 13:36 schrieb Thomas Hellström (Intel):
>>
>> On 3/25/21 1:09 PM, Christian König wrote:
>>> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>>>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) 
>>>> wrote:
>>>>
>>>>> Nope. The point here was that in this case, to make sure mmap uses 
>>>>> the
>>>>> correct VA to give us a reasonable chance of alignement, the 
>>>>> driver might
>>>>> need to be aware of and do trickery with the huge page-table-entry 
>>>>> sizes
>>>>> anyway, although I think in most cases a standard helper for this 
>>>>> can be
>>>>> supplied.
>>>> Of course the driver needs some way to influence the VA mmap uses,
>>>> gernally it should align to the natural page size of the device
>>>
>>> Well a mmap() needs to be aligned to the page size of the CPU, but 
>>> not necessarily to the one of the device.
>>>
>>> So I'm pretty sure the device driver should not be involved in any 
>>> way the choosing of the VA for the CPU mapping.
>>>
>>> Christian.
>>>
>> We've had this discussion before and at that time I managed to 
>> convince you by pointing to the shmem helper for this, 
>> shmem_get_umapped_area().
>
> No, you didn't convinced me. I was just surprised that this is 
> something under driver control.
>
>>
>> Basically there are two ways to do this. Either use a standard helper 
>> similar to shmem's, and then the driver needs to align physical 
>> (device) huge page boundaries to address space offset huge page 
>> boundaries. If you don't do that you can just as well use a custom 
>> function that adjusts for you not doing that 
>> (drm_get_unmapped_area()). Both require driver knowledge of the size 
>> of huge pages.
>
> And once more, at least for GPU drivers that looks like the totally 
> wrong approach to me.
>
> Aligning the VMA so that huge page allocations become possible is the 
> job of the MM subsystem and not that of the drivers.
>
Previous discussion here

https://www.spinics.net/lists/linux-mm/msg205291.html

>>
>> Without a function to adjust, mmap will use it's default (16 byte?) 
>> alignment and chance of alignment becomes very small.
>
> Well it's 4KiB at least.
Yes :/ ...
>
> Regards,
> Christian.
>
Thanks,

Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:26                                                     ` Christian König
@ 2021-03-25 13:33                                                       ` Jason Gunthorpe
  2021-03-25 13:54                                                         ` Christian König
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 13:33 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 02:26:50PM +0100, Christian König wrote:
> Am 25.03.21 um 14:17 schrieb Jason Gunthorpe:
> > On Thu, Mar 25, 2021 at 02:05:14PM +0100, Christian König wrote:
> > > 
> > > Am 25.03.21 um 13:42 schrieb Jason Gunthorpe:
> > > > On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
> > > > > Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
> > > > > > On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
> > > > > > 
> > > > > > > Nope. The point here was that in this case, to make sure mmap uses the
> > > > > > > correct VA to give us a reasonable chance of alignement, the driver might
> > > > > > > need to be aware of and do trickery with the huge page-table-entry sizes
> > > > > > > anyway, although I think in most cases a standard helper for this can be
> > > > > > > supplied.
> > > > > > Of course the driver needs some way to influence the VA mmap uses,
> > > > > > gernally it should align to the natural page size of the device
> > > > > Well a mmap() needs to be aligned to the page size of the CPU, but not
> > > > > necessarily to the one of the device.
> > > > > 
> > > > > So I'm pretty sure the device driver should not be involved in any way the
> > > > > choosing of the VA for the CPU mapping.
> > > > No, if the device wants to use huge pages it must influence the mmap
> > > > VA or it can't form huge pgaes.
> > > No, that's the job of the core MM and not of the individual driver.
> > The core mm doesn't know the page size of the device, only which of
> > several page levels the arch supports. The device must be involevd
> > here.
> 
> Why? See you can have a device which has for example 256KiB pages, but it
> should perfectly work that the CPU mapping is aligned to only 4KiB.

The goal is to optimize large page size usage in the page tables.

There are three critera that impact this:
 1) The possible CPU page table sizes
 2) The useful contiguity the device can create in its iomemory
 3) The VA's alignment, as this sets an upper bound on 1 and 2

If a device has 256k pages and the arch supports 2M and 4k then the VA
should align to somewhere between 4k and 256k. The ideal alignment
would be to optimize PTE usage when stuffing 256k blocks by fully
populating PTEs and depends on the arch's # of PTE's per page.

If a device has 256k pages and the arch supports 256k pages then the
VA should align to 256k.

The device should never be touching any of this, it should simply
inform what its operating page size is and the MM should use that to
align the VA.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:33                                                       ` Jason Gunthorpe
@ 2021-03-25 13:54                                                         ` Christian König
  2021-03-25 13:56                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2021-03-25 13:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

Am 25.03.21 um 14:33 schrieb Jason Gunthorpe:
> On Thu, Mar 25, 2021 at 02:26:50PM +0100, Christian König wrote:
>> Am 25.03.21 um 14:17 schrieb Jason Gunthorpe:
>>> On Thu, Mar 25, 2021 at 02:05:14PM +0100, Christian König wrote:
>>>> Am 25.03.21 um 13:42 schrieb Jason Gunthorpe:
>>>>> On Thu, Mar 25, 2021 at 01:09:14PM +0100, Christian König wrote:
>>>>>> Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
>>>>>>> On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote:
>>>>>>>
>>>>>>>> Nope. The point here was that in this case, to make sure mmap uses the
>>>>>>>> correct VA to give us a reasonable chance of alignement, the driver might
>>>>>>>> need to be aware of and do trickery with the huge page-table-entry sizes
>>>>>>>> anyway, although I think in most cases a standard helper for this can be
>>>>>>>> supplied.
>>>>>>> Of course the driver needs some way to influence the VA mmap uses,
>>>>>>> gernally it should align to the natural page size of the device
>>>>>> Well a mmap() needs to be aligned to the page size of the CPU, but not
>>>>>> necessarily to the one of the device.
>>>>>>
>>>>>> So I'm pretty sure the device driver should not be involved in any way the
>>>>>> choosing of the VA for the CPU mapping.
>>>>> No, if the device wants to use huge pages it must influence the mmap
>>>>> VA or it can't form huge pgaes.
>>>> No, that's the job of the core MM and not of the individual driver.
>>> The core mm doesn't know the page size of the device, only which of
>>> several page levels the arch supports. The device must be involevd
>>> here.
>> Why? See you can have a device which has for example 256KiB pages, but it
>> should perfectly work that the CPU mapping is aligned to only 4KiB.
> The goal is to optimize large page size usage in the page tables.
>
> There are three critera that impact this:
>   1) The possible CPU page table sizes
>   2) The useful contiguity the device can create in its iomemory
>   3) The VA's alignment, as this sets an upper bound on 1 and 2
>
> If a device has 256k pages and the arch supports 2M and 4k then the VA
> should align to somewhere between 4k and 256k. The ideal alignment
> would be to optimize PTE usage when stuffing 256k blocks by fully
> populating PTEs and depends on the arch's # of PTE's per page.

Ah! So you want to also avoid that we only halve populate a PTEs as 
well! That rather nifty.

But you don't need the device page size for this. Just looking at the 
size of the mapping should be enough.

In other words we would align the VA so that it tries to avoid crossing 
page table boundaries.

But to be honest I'm really wondering why the heck we don't already do 
this in vm_unmapped_area(). That should be beneficial for basically 
every slightly larger mapping.

Christian.

>
> If a device has 256k pages and the arch supports 256k pages then the
> VA should align to 256k.
>
> The device should never be touching any of this, it should simply
> inform what its operating page size is and the MM should use that to
> align the VA.
>
> Jason



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 13:54                                                         ` Christian König
@ 2021-03-25 13:56                                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 13:56 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	David Airlie, linux-kernel, dri-devel, linux-mm, Andrew Morton

On Thu, Mar 25, 2021 at 02:54:31PM +0100, Christian König wrote:

> > The goal is to optimize large page size usage in the page tables.
> > 
> > There are three critera that impact this:
> >   1) The possible CPU page table sizes
> >   2) The useful contiguity the device can create in its iomemory
> >   3) The VA's alignment, as this sets an upper bound on 1 and 2
> > 
> > If a device has 256k pages and the arch supports 2M and 4k then the VA
> > should align to somewhere between 4k and 256k. The ideal alignment
> > would be to optimize PTE usage when stuffing 256k blocks by fully
> > populating PTEs and depends on the arch's # of PTE's per page.
> 
> Ah! So you want to also avoid that we only halve populate a PTEs as well!
> That rather nifty.
> 
> But you don't need the device page size for this. Just looking at the size
> of the mapping should be enough.

Well, kind of, at a certain point we start to over-align things which
is a bit harmful too, it is best to cap it at what the device could
actually use, IMHO.

Keep in mind address space is not free, and 32 bit in particular needs
to be efficient.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-24 20:25               ` Dave Hansen
@ 2021-03-25 17:51                 ` Thomas Hellström (Intel)
  2021-03-25 17:55                   ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 17:51 UTC (permalink / raw)
  To: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig, jgg,
	airlied, linux-mm, linux-kernel, akpm


On 3/24/21 9:25 PM, Dave Hansen wrote:
> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>> used.  It's quite possible we can encode another use even in the
>>> existing bits.
>>>
>>> Personally, I'd just try:
>>>
>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>
>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>> used in a selftest, but only for PTEs AFAICT.
>>
>> Oh, and we don't care about 32-bit much anymore?
> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>
> But, yeah, we don't care about 32-bit. :)

Hmm,

Actually it makes some sense to use SW1, to make it end up in the same 
dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t 
on 32-bit PAE is not atomic, so in theory a huge pmd could be modified 
while reading the pmd_t making the dwords inconsistent.... How does that 
work with fast gup anyway?

In any case, what would be the best cause of action here? Use SW1 or 
disable completely for 32-bit?

/Thomas





^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 17:51                 ` Thomas Hellström (Intel)
@ 2021-03-25 17:55                   ` Jason Gunthorpe
  2021-03-25 18:13                     ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 17:55 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm

On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/24/21 9:25 PM, Dave Hansen wrote:
> > On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
> > > > We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
> > > > used.  It's quite possible we can encode another use even in the
> > > > existing bits.
> > > > 
> > > > Personally, I'd just try:
> > > > 
> > > > #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
> > > > 
> > > OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
> > > used in a selftest, but only for PTEs AFAICT.
> > > 
> > > Oh, and we don't care about 32-bit much anymore?
> > On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
> > enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
> > 
> > But, yeah, we don't care about 32-bit. :)
> 
> Hmm,
> 
> Actually it makes some sense to use SW1, to make it end up in the same dword
> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
> PAE is not atomic, so in theory a huge pmd could be modified while reading
> the pmd_t making the dwords inconsistent.... How does that work with fast
> gup anyway?

It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 17:55                   ` Jason Gunthorpe
@ 2021-03-25 18:13                     ` Thomas Hellström (Intel)
  2021-03-25 18:24                       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 18:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm


On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 9:25 PM, Dave Hansen wrote:
>>> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>>>> used.  It's quite possible we can encode another use even in the
>>>>> existing bits.
>>>>>
>>>>> Personally, I'd just try:
>>>>>
>>>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>>>
>>>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>>>> used in a selftest, but only for PTEs AFAICT.
>>>>
>>>> Oh, and we don't care about 32-bit much anymore?
>>> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
>>> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>>>
>>> But, yeah, we don't care about 32-bit. :)
>> Hmm,
>>
>> Actually it makes some sense to use SW1, to make it end up in the same dword
>> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
>> PAE is not atomic, so in theory a huge pmd could be modified while reading
>> the pmd_t making the dwords inconsistent.... How does that work with fast
>> gup anyway?
> It loops to get an atomic 64 bit value if the arch can't provide an
> atomic 64 bit load

Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting 
pmd is dereferenced either in try_grab_compound_head() or 
__gup_device_huge(), before the pmd is compared to the value the pointer 
is currently pointing to. Couldn't those dereferences be on invalid 
pointers?

/Thomas

>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 18:13                     ` Thomas Hellström (Intel)
@ 2021-03-25 18:24                       ` Jason Gunthorpe
  2021-03-25 18:42                         ` Thomas Hellström (Intel)
  2021-03-26  9:08                         ` Thomas Hellström (Intel)
  0 siblings, 2 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 18:24 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm

On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
> > > On 3/24/21 9:25 PM, Dave Hansen wrote:
> > > > On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
> > > > > > We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
> > > > > > used.  It's quite possible we can encode another use even in the
> > > > > > existing bits.
> > > > > > 
> > > > > > Personally, I'd just try:
> > > > > > 
> > > > > > #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
> > > > > > 
> > > > > OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
> > > > > used in a selftest, but only for PTEs AFAICT.
> > > > > 
> > > > > Oh, and we don't care about 32-bit much anymore?
> > > > On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
> > > > enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
> > > > 
> > > > But, yeah, we don't care about 32-bit. :)
> > > Hmm,
> > > 
> > > Actually it makes some sense to use SW1, to make it end up in the same dword
> > > as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
> > > PAE is not atomic, so in theory a huge pmd could be modified while reading
> > > the pmd_t making the dwords inconsistent.... How does that work with fast
> > > gup anyway?
> > It loops to get an atomic 64 bit value if the arch can't provide an
> > atomic 64 bit load
> 
> Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
> is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
> before the pmd is compared to the value the pointer is currently pointing
> to. Couldn't those dereferences be on invalid pointers?

Uhhhhh.. That does look questionable, yes. Unless there is some tricky
reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
has a stable upper 32 bits..

The pte does it with ptep_get_lockless(), we probably need the same
for the other levels too instead of open coding a READ_ONCE?

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 18:24                       ` Jason Gunthorpe
@ 2021-03-25 18:42                         ` Thomas Hellström (Intel)
  2021-03-26  9:08                         ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-25 18:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm


On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
>>> On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/24/21 9:25 PM, Dave Hansen wrote:
>>>>> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>>>>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>>>>>> used.  It's quite possible we can encode another use even in the
>>>>>>> existing bits.
>>>>>>>
>>>>>>> Personally, I'd just try:
>>>>>>>
>>>>>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>>>>>
>>>>>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>>>>>> used in a selftest, but only for PTEs AFAICT.
>>>>>>
>>>>>> Oh, and we don't care about 32-bit much anymore?
>>>>> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
>>>>> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>>>>>
>>>>> But, yeah, we don't care about 32-bit. :)
>>>> Hmm,
>>>>
>>>> Actually it makes some sense to use SW1, to make it end up in the same dword
>>>> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
>>>> PAE is not atomic, so in theory a huge pmd could be modified while reading
>>>> the pmd_t making the dwords inconsistent.... How does that work with fast
>>>> gup anyway?
>>> It loops to get an atomic 64 bit value if the arch can't provide an
>>> atomic 64 bit load
>> Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
>> is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
>> before the pmd is compared to the value the pointer is currently pointing
>> to. Couldn't those dereferences be on invalid pointers?
> Uhhhhh.. That does look questionable, yes. Unless there is some tricky
> reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
> has a stable upper 32 bits..
>
> The pte does it with ptep_get_lockless(), we probably need the same
> for the other levels too instead of open coding a READ_ONCE?
>
> Jason

Yes, unless that comment before local_irq_disable() means some magic is 
done to prevent bad things happening, but I guess if it's needed for 
ptes, it's probably needed for pmds and puds as well.

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-25 18:24                       ` Jason Gunthorpe
  2021-03-25 18:42                         ` Thomas Hellström (Intel)
@ 2021-03-26  9:08                         ` Thomas Hellström (Intel)
  2021-03-26 11:46                           ` Jason Gunthorpe
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-26  9:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm


On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
>>> On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/24/21 9:25 PM, Dave Hansen wrote:
>>>>> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>>>>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>>>>>> used.  It's quite possible we can encode another use even in the
>>>>>>> existing bits.
>>>>>>>
>>>>>>> Personally, I'd just try:
>>>>>>>
>>>>>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>>>>>
>>>>>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>>>>>> used in a selftest, but only for PTEs AFAICT.
>>>>>>
>>>>>> Oh, and we don't care about 32-bit much anymore?
>>>>> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
>>>>> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>>>>>
>>>>> But, yeah, we don't care about 32-bit. :)
>>>> Hmm,
>>>>
>>>> Actually it makes some sense to use SW1, to make it end up in the same dword
>>>> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
>>>> PAE is not atomic, so in theory a huge pmd could be modified while reading
>>>> the pmd_t making the dwords inconsistent.... How does that work with fast
>>>> gup anyway?
>>> It loops to get an atomic 64 bit value if the arch can't provide an
>>> atomic 64 bit load
>> Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
>> is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
>> before the pmd is compared to the value the pointer is currently pointing
>> to. Couldn't those dereferences be on invalid pointers?
> Uhhhhh.. That does look questionable, yes. Unless there is some tricky
> reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
> has a stable upper 32 bits..
>
> The pte does it with ptep_get_lockless(), we probably need the same
> for the other levels too instead of open coding a READ_ONCE?
>
> Jason

TBH, ptep_get_lockless() also looks a bit fishy. it says
"it will not switch to a completely different present page without a TLB 
flush in between".

What if the following happens:

processor 1: Reads lower dword of PTE.
processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
processor 1: Reads upper dword of PTE, which is now zero.
processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a 
new PTE value which happens to be the same as the original one before 
the zap.
processor 1: Reads the newly faulted in lower dword, compares to the old 
one, gives an OK and returns a bogus PTE.

/Thomas




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-26  9:08                         ` Thomas Hellström (Intel)
@ 2021-03-26 11:46                           ` Jason Gunthorpe
  2021-03-26 12:33                             ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2021-03-26 11:46 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm

On Fri, Mar 26, 2021 at 10:08:09AM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
> > > On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
> > > > On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
> > > > > On 3/24/21 9:25 PM, Dave Hansen wrote:
> > > > > > On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
> > > > > > > > We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
> > > > > > > > used.  It's quite possible we can encode another use even in the
> > > > > > > > existing bits.
> > > > > > > > 
> > > > > > > > Personally, I'd just try:
> > > > > > > > 
> > > > > > > > #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
> > > > > > > > 
> > > > > > > OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
> > > > > > > used in a selftest, but only for PTEs AFAICT.
> > > > > > > 
> > > > > > > Oh, and we don't care about 32-bit much anymore?
> > > > > > On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
> > > > > > enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
> > > > > > 
> > > > > > But, yeah, we don't care about 32-bit. :)
> > > > > Hmm,
> > > > > 
> > > > > Actually it makes some sense to use SW1, to make it end up in the same dword
> > > > > as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
> > > > > PAE is not atomic, so in theory a huge pmd could be modified while reading
> > > > > the pmd_t making the dwords inconsistent.... How does that work with fast
> > > > > gup anyway?
> > > > It loops to get an atomic 64 bit value if the arch can't provide an
> > > > atomic 64 bit load
> > > Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
> > > is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
> > > before the pmd is compared to the value the pointer is currently pointing
> > > to. Couldn't those dereferences be on invalid pointers?
> > Uhhhhh.. That does look questionable, yes. Unless there is some tricky
> > reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
> > has a stable upper 32 bits..
> > 
> > The pte does it with ptep_get_lockless(), we probably need the same
> > for the other levels too instead of open coding a READ_ONCE?
> > 
> > Jason
> 
> TBH, ptep_get_lockless() also looks a bit fishy. it says
> "it will not switch to a completely different present page without a TLB
> flush in between".
> 
> What if the following happens:
> 
> processor 1: Reads lower dword of PTE.
> processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
> processor 1: Reads upper dword of PTE, which is now zero.
> processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new
> PTE value which happens to be the same as the original one before the zap.
> processor 1: Reads the newly faulted in lower dword, compares to the old
> one, gives an OK and returns a bogus PTE.

So you are saying that while the zap will wait for the TLB flush to
globally finish once it gets started any other processor can still
write to the pte?

I can't think of any serialization that would cause fault to wait for
the zap/TLB flush, especially if the zap comes from the address_space
and doesn't hold the mmap lock.

Seems worth bringing up in a bigger thread, maybe someone else knows?

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
  2021-03-26 11:46                           ` Jason Gunthorpe
@ 2021-03-26 12:33                             ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-26 12:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, Williams, Dan J, dri-devel, christian.koenig,
	airlied, linux-mm, linux-kernel, akpm, Nick Piggin


On 3/26/21 12:46 PM, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 10:08:09AM +0100, Thomas Hellström (Intel) wrote:
>> On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
>>> On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
>>>>> On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
>>>>>> On 3/24/21 9:25 PM, Dave Hansen wrote:
>>>>>>> On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>>>>>>>>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>>>>>>>>> used.  It's quite possible we can encode another use even in the
>>>>>>>>> existing bits.
>>>>>>>>>
>>>>>>>>> Personally, I'd just try:
>>>>>>>>>
>>>>>>>>> #define _PAGE_BIT_SOFTW5        57      /* available for programmer */
>>>>>>>>>
>>>>>>>> OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
>>>>>>>> used in a selftest, but only for PTEs AFAICT.
>>>>>>>>
>>>>>>>> Oh, and we don't care about 32-bit much anymore?
>>>>>>> On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
>>>>>>> enabled.  IOW, we can handle the majority of 32-bit CPUs out there.
>>>>>>>
>>>>>>> But, yeah, we don't care about 32-bit. :)
>>>>>> Hmm,
>>>>>>
>>>>>> Actually it makes some sense to use SW1, to make it end up in the same dword
>>>>>> as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
>>>>>> PAE is not atomic, so in theory a huge pmd could be modified while reading
>>>>>> the pmd_t making the dwords inconsistent.... How does that work with fast
>>>>>> gup anyway?
>>>>> It loops to get an atomic 64 bit value if the arch can't provide an
>>>>> atomic 64 bit load
>>>> Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
>>>> is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
>>>> before the pmd is compared to the value the pointer is currently pointing
>>>> to. Couldn't those dereferences be on invalid pointers?
>>> Uhhhhh.. That does look questionable, yes. Unless there is some tricky
>>> reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
>>> has a stable upper 32 bits..
>>>
>>> The pte does it with ptep_get_lockless(), we probably need the same
>>> for the other levels too instead of open coding a READ_ONCE?
>>>
>>> Jason
>> TBH, ptep_get_lockless() also looks a bit fishy. it says
>> "it will not switch to a completely different present page without a TLB
>> flush in between".
>>
>> What if the following happens:
>>
>> processor 1: Reads lower dword of PTE.
>> processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
>> processor 1: Reads upper dword of PTE, which is now zero.
>> processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new
>> PTE value which happens to be the same as the original one before the zap.
>> processor 1: Reads the newly faulted in lower dword, compares to the old
>> one, gives an OK and returns a bogus PTE.
> So you are saying that while the zap will wait for the TLB flush to
> globally finish once it gets started any other processor can still
> write to the pte?
>
> I can't think of any serialization that would cause fault to wait for
> the zap/TLB flush, especially if the zap comes from the address_space
> and doesn't hold the mmap lock.

I might of course be completely wrong, but It seems there is an 
assumption made that all potentially affected processors would have a 
valid TLB entry for the PTE. Then the fault would not happen (well 
unless of course the TLB flush completes on some processors before 
getting stuck on the local_irq_disable() on processor 1).

+CC: Nick Piggin

Seems like Nick Piggin is the original author of the comment. Perhaps he 
can can clarify a bit.

/Thomas


>
> Seems worth bringing up in a bigger thread, maybe someone else knows?
>
> Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2021-03-26 12:33 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-21 18:45 [RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages Thomas Hellström (Intel)
2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
2021-03-23 11:34   ` Daniel Vetter
2021-03-23 16:34     ` Thomas Hellström (Intel)
2021-03-23 16:37       ` Jason Gunthorpe
2021-03-23 16:59         ` Christoph Hellwig
2021-03-23 17:06         ` Thomas Hellström (Intel)
2021-03-24  9:56           ` Daniel Vetter
2021-03-24 12:24             ` Jason Gunthorpe
2021-03-24 12:35               ` Thomas Hellström (Intel)
2021-03-24 12:41                 ` Jason Gunthorpe
2021-03-24 13:35                   ` Thomas Hellström (Intel)
2021-03-24 13:48                     ` Jason Gunthorpe
2021-03-24 15:50                       ` Thomas Hellström (Intel)
2021-03-24 16:38                         ` Jason Gunthorpe
2021-03-24 18:31                           ` Christian König
2021-03-24 20:07                             ` Thomas Hellström (Intel)
2021-03-24 23:14                               ` Jason Gunthorpe
2021-03-25  7:48                                 ` Thomas Hellström (Intel)
2021-03-25  8:27                                   ` Christian König
2021-03-25  9:51                                     ` Thomas Hellström (Intel)
2021-03-25 11:30                                       ` Jason Gunthorpe
2021-03-25 11:53                                         ` Thomas Hellström (Intel)
2021-03-25 12:01                                           ` Jason Gunthorpe
2021-03-25 12:09                                             ` Christian König
2021-03-25 12:36                                               ` Thomas Hellström (Intel)
2021-03-25 13:02                                                 ` Christian König
2021-03-25 13:31                                                   ` Thomas Hellström (Intel)
2021-03-25 12:42                                               ` Jason Gunthorpe
2021-03-25 13:05                                                 ` Christian König
2021-03-25 13:17                                                   ` Jason Gunthorpe
2021-03-25 13:26                                                     ` Christian König
2021-03-25 13:33                                                       ` Jason Gunthorpe
2021-03-25 13:54                                                         ` Christian König
2021-03-25 13:56                                                           ` Jason Gunthorpe
2021-03-25  7:49                                 ` Christian König
2021-03-25  9:41                                   ` Daniel Vetter
2021-03-23 13:52   ` Jason Gunthorpe
2021-03-23 15:05     ` Thomas Hellström (Intel)
2021-03-23 19:52   ` Williams, Dan J
2021-03-23 20:42     ` Thomas Hellström (Intel)
2021-03-24  9:58       ` Daniel Vetter
2021-03-24 10:05         ` Thomas Hellström (Intel)
     [not found]           ` <75423f64-adef-a2c4-8e7d-2cb814127b18@intel.com>
2021-03-24 20:22             ` Thomas Hellström (Intel)
2021-03-24 20:25               ` Dave Hansen
2021-03-25 17:51                 ` Thomas Hellström (Intel)
2021-03-25 17:55                   ` Jason Gunthorpe
2021-03-25 18:13                     ` Thomas Hellström (Intel)
2021-03-25 18:24                       ` Jason Gunthorpe
2021-03-25 18:42                         ` Thomas Hellström (Intel)
2021-03-26  9:08                         ` Thomas Hellström (Intel)
2021-03-26 11:46                           ` Jason Gunthorpe
2021-03-26 12:33                             ` Thomas Hellström (Intel)
2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
2021-03-22  7:47   ` Christian König
2021-03-22  8:13     ` Thomas Hellström (Intel)
2021-03-23 11:57       ` Christian König
2021-03-23 11:47   ` Daniel Vetter
2021-03-23 14:04     ` Jason Gunthorpe
2021-03-23 15:51       ` Thomas Hellström (Intel)
2021-03-23 14:00   ` Jason Gunthorpe
2021-03-23 15:46     ` Thomas Hellström (Intel)
2021-03-23 16:06       ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).