linux-mm.kvack.org archive mirror
* [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
@ 2023-06-22  7:27 Vivek Kasireddy
  2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Vivek Kasireddy @ 2023-06-22  7:27 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Mike Kravetz, David Hildenbrand, Gerd Hoffmann,
	Dongwon Kim, Andrew Morton, James Houghton, Jerome Marchand,
	Junxiao Chang, Kirill A . Shutemov, Michal Hocko, Muchun Song

The first patch ensures that the mappings needed for handling the mmap
operation are managed using the pfn instead of struct page.
The second patch restores support for mapping hugetlb pages, where
subpages of a hugepage are no longer used directly (the main reason
for the revert); instead, the hugetlb pages and the relevant offsets
are used to populate the scatterlist for dma-buf export and for the
mmap operation.

Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
were passed to the Host kernel and Qemu was launched with these
relevant options: qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-display gtk,gl=on
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1

Replacing -display gtk,gl=on with -display gtk,gl=off above would
exercise the mmap handler.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: James Houghton <jthoughton@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>

Vivek Kasireddy (2):
  udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
  udmabuf: Add back support for mapping hugetlb pages

 drivers/dma-buf/udmabuf.c | 105 ++++++++++++++++++++++++++++++++------
 1 file changed, 88 insertions(+), 17 deletions(-)

-- 
2.39.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
  2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
@ 2023-06-22  7:27 ` Vivek Kasireddy
  2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
  2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
  2 siblings, 0 replies; 23+ messages in thread
From: Vivek Kasireddy @ 2023-06-22  7:27 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Mike Kravetz, David Hildenbrand, Gerd Hoffmann,
	Dongwon Kim, Andrew Morton, James Houghton, Jerome Marchand,
	Junxiao Chang, Kirill A . Shutemov, Michal Hocko, Muchun Song

Add VM_PFNMAP to vm_flags in the mmap handler to ensure that
the mappings are managed without using struct page.

And, in the vm_fault handler, use vmf_insert_pfn to share the
page's pfn with userspace instead of directly sharing the page
(via struct page *).

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: James Houghton <jthoughton@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/dma-buf/udmabuf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 12cf6bb2e3ce..6de40c51d895 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -35,12 +35,13 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct udmabuf *ubuf = vma->vm_private_data;
 	pgoff_t pgoff = vmf->pgoff;
+	unsigned long pfn;
 
 	if (pgoff >= ubuf->pagecount)
 		return VM_FAULT_SIGBUS;
-	vmf->page = ubuf->pages[pgoff];
-	get_page(vmf->page);
-	return 0;
+
+	pfn = page_to_pfn(ubuf->pages[pgoff]);
+	return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
 static const struct vm_operations_struct udmabuf_vm_ops = {
@@ -58,6 +59,7 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
 
 	vma->vm_ops = &udmabuf_vm_ops;
 	vma->vm_private_data = ubuf;
+	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
  2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
@ 2023-06-22  7:27 ` Vivek Kasireddy
  2023-06-22 22:10   ` kernel test robot
  2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
  2 siblings, 1 reply; 23+ messages in thread
From: Vivek Kasireddy @ 2023-06-22  7:27 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Mike Kravetz, David Hildenbrand, Gerd Hoffmann,
	Dongwon Kim, Andrew Morton, James Houghton, Jerome Marchand,
	Junxiao Chang, Kirill A . Shutemov, Michal Hocko, Muchun Song

A user or admin can configure a VMM (Qemu) Guest's memory to be
backed by hugetlb pages for various reasons. However, a Guest OS
would still allocate (and pin) buffers that are backed by regular
4k sized pages. In order to map these buffers and create dma-bufs
for them on the Host, we first need to find the hugetlb pages where
the buffer allocations are located and then determine the offsets
of individual chunks (within those pages) and use this information
to eventually populate a scatterlist.
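
As an illustration of the chunk/offset calculation described above (purely
hypothetical numbers, not taken from the patch): with 2M hugepages
(huge_page_shift = 21) and a list entry with offset = 0x500000 (5 MiB,
page aligned), the lookup roughly breaks down as:

	hpoff     = 0x500000 >> 21;                    /* = 2, third hugepage in the memfd      */
	chunkoff  = (0x500000 & (0x200000 - 1)) >> 12; /* = 256, 4k chunk within that hugepage  */
	maxchunks = 0x200000 >> 12;                    /* = 512 chunks per 2M hugepage          */

The fault handler then inserts page_to_pfn(hugepage) + chunkoff, and
chunkoff << PAGE_SHIFT is what gets stored in ubuf->offsets[] for the
scatterlist.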

Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
were passed to the Host kernel and Qemu was launched with these
relevant options: qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-display gtk,gl=on
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1

Replacing -display gtk,gl=on with -display gtk,gl=off above would
exercise the mmap handler.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: James Houghton <jthoughton@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/dma-buf/udmabuf.c | 97 +++++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 14 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 6de40c51d895..0ae44bce01e7 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -10,6 +10,7 @@
 #include <linux/miscdevice.h>
 #include <linux/module.h>
 #include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
 #include <linux/slab.h>
 #include <linux/udmabuf.h>
 #include <linux/vmalloc.h>
@@ -28,6 +29,7 @@ struct udmabuf {
 	struct page **pages;
 	struct sg_table *sg;
 	struct miscdevice *device;
+	pgoff_t *offsets;
 };
 
 static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
@@ -41,6 +43,10 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	pfn = page_to_pfn(ubuf->pages[pgoff]);
+	if (ubuf->offsets) {
+		pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
+	}
+
 	return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
@@ -92,23 +98,40 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
 {
 	struct udmabuf *ubuf = buf->priv;
 	struct sg_table *sg;
+	struct scatterlist *sgl;
+	unsigned long i = 0;
 	int ret;
 
 	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
 	if (!sg)
 		return ERR_PTR(-ENOMEM);
-	ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
-					0, ubuf->pagecount << PAGE_SHIFT,
-					GFP_KERNEL);
-	if (ret < 0)
-		goto err;
+
+	if (ubuf->offsets) {
+		ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
+		if (ret < 0)
+			goto err_alloc;
+
+		for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) {
+			sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE,
+				    ubuf->offsets[i]);
+		}
+	} else {
+		ret = sg_alloc_table_from_pages(sg, ubuf->pages,
+						ubuf->pagecount, 0,
+						ubuf->pagecount << PAGE_SHIFT,
+						GFP_KERNEL);
+		if (ret < 0)
+			goto err_alloc;
+
+	}
 	ret = dma_map_sgtable(dev, sg, direction, 0);
 	if (ret < 0)
-		goto err;
+		goto err_map;
 	return sg;
 
-err:
+err_map:
 	sg_free_table(sg);
+err_alloc:
 	kfree(sg);
 	return ERR_PTR(ret);
 }
@@ -145,6 +168,8 @@ static void release_udmabuf(struct dma_buf *buf)
 
 	for (pg = 0; pg < ubuf->pagecount; pg++)
 		put_page(ubuf->pages[pg]);
+	if (ubuf->offsets)
+		kfree(ubuf->offsets);
 	kfree(ubuf->pages);
 	kfree(ubuf);
 }
@@ -208,7 +233,9 @@ static long udmabuf_create(struct miscdevice *device,
 	struct udmabuf *ubuf;
 	struct dma_buf *buf;
 	pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
-	struct page *page;
+	struct page *page, *hpage = NULL;
+	pgoff_t hpoff, chunkoff, maxchunks;
+	struct hstate *hpstate;
 	int seals, ret = -EINVAL;
 	u32 i, flags;
 
@@ -244,7 +271,7 @@ static long udmabuf_create(struct miscdevice *device,
 		if (!memfd)
 			goto err;
 		mapping = memfd->f_mapping;
-		if (!shmem_mapping(mapping))
+		if (!shmem_mapping(mapping) && !is_file_hugepages(memfd))
 			goto err;
 		seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
 		if (seals == -EINVAL)
@@ -255,16 +282,56 @@ static long udmabuf_create(struct miscdevice *device,
 			goto err;
 		pgoff = list[i].offset >> PAGE_SHIFT;
 		pgcnt = list[i].size   >> PAGE_SHIFT;
+		if (is_file_hugepages(memfd)) {
+			if (!ubuf->offsets) {
+				ubuf->offsets = kmalloc_array(ubuf->pagecount,
+							      sizeof(*ubuf->offsets),
+							      GFP_KERNEL);
+				if (!ubuf->offsets) {
+					ret = -ENOMEM;
+					goto err;
+				}
+			}
+			hpstate = hstate_file(memfd);
+			hpoff = list[i].offset >> huge_page_shift(hpstate);
+			chunkoff = (list[i].offset &
+				    ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
+			maxchunks = huge_page_size(hpstate) >> PAGE_SHIFT;
+		}
 		for (pgidx = 0; pgidx < pgcnt; pgidx++) {
-			page = shmem_read_mapping_page(mapping, pgoff + pgidx);
-			if (IS_ERR(page)) {
-				ret = PTR_ERR(page);
-				goto err;
+			if (is_file_hugepages(memfd)) {
+				if (!hpage) {
+					hpage = find_get_page_flags(mapping, hpoff,
+								    FGP_ACCESSED);
+					if (!hpage) {
+						ret = -EINVAL;
+						goto err;
+					}
+				}
+				get_page(hpage);
+				ubuf->pages[pgbuf] = hpage;
+				ubuf->offsets[pgbuf++] = chunkoff << PAGE_SHIFT;
+				if (++chunkoff == maxchunks) {
+					put_page(hpage);
+					hpage = NULL;
+					chunkoff = 0;
+					hpoff++;
+				}
+			} else {
+				page = shmem_read_mapping_page(mapping, pgoff + pgidx);
+				if (IS_ERR(page)) {
+					ret = PTR_ERR(page);
+					goto err;
+				}
+				ubuf->pages[pgbuf++] = page;
 			}
-			ubuf->pages[pgbuf++] = page;
 		}
 		fput(memfd);
 		memfd = NULL;
+		if (hpage) {
+			put_page(hpage);
+			hpage = NULL;
+		}
 	}
 
 	exp_info.ops  = &udmabuf_ops;
@@ -289,6 +356,8 @@ static long udmabuf_create(struct miscdevice *device,
 		put_page(ubuf->pages[--pgbuf]);
 	if (memfd)
 		fput(memfd);
+	if (ubuf->offsets)
+		kfree(ubuf->offsets);
 	kfree(ubuf->pages);
 	kfree(ubuf);
 	return ret;
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
  2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
  2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
@ 2023-06-22  8:25 ` David Hildenbrand
  2023-06-22 21:33   ` Mike Kravetz
                     ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: David Hildenbrand @ 2023-06-22  8:25 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, linux-mm
  Cc: Mike Kravetz, Gerd Hoffmann, Dongwon Kim, Andrew Morton,
	James Houghton, Jerome Marchand, Junxiao Chang,
	Kirill A . Shutemov, Michal Hocko, Muchun Song, Jason Gunthorpe,
	John Hubbard

On 22.06.23 09:27, Vivek Kasireddy wrote:
> The first patch ensures that the mappings needed for handling mmap
> operation would be managed by using the pfn instead of struct page.
> The second patch restores support for mapping hugetlb pages where
> subpages of a hugepage are not directly used anymore (main reason
> for revert) and instead the hugetlb pages and the relevant offsets
> are used to populate the scatterlist for dma-buf export and for
> mmap operation.
> 
> Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
> were passed to the Host kernel and Qemu was launched with these
> relevant options: qemu-system-x86_64 -m 4096m....
> -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> -display gtk,gl=on
> -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> -machine memory-backend=mem1
> 
> Replacing -display gtk,gl=on with -display gtk,gl=off above would
> exercise the mmap handler.
> 

While I think the VM_PFNMAP approach is much better and should fix the 
issue at hand, I thought more about the missing memlock support and realized 
that we might have to fix something else. So I'm going to raise the 
issue here.

I think udmabuf chose the wrong interface to do what it's doing, which 
makes it harder to fix eventually.

Instead of accepting a range in a memfd, it should just have accepted a 
user space address range and then used 
pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages 
"officially".

So what's the issue? udmabuf effectively pins pages longterm ("possibly 
forever") simply by grabbing a reference on them. These pages might 
easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.

So what udmabuf does is break memory hotunplug and CMA, because it turns 
pages that have to remain movable unmovable.

In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these 
pages. See mm/gup.c:check_and_migrate_movable_pages() and especially 
folio_is_longterm_pinnable(). We'd probably have to implement something 
similar for udmabuf, where we detect such unpinnable pages and migrate them.
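
For reference, a minimal sketch of what the longterm-pin path could look
like if udmabuf took a user address range instead (illustrative only;
"uaddr", "size" and the surrounding error handling are assumptions, not
code from any posted patch):

	struct page **pages;
	int nr = size >> PAGE_SHIFT;
	long pinned;

	pages = kvmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_LONGTERM makes GUP migrate pages out of ZONE_MOVABLE/CMA first */
	pinned = pin_user_pages_fast(uaddr, nr,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned != nr) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		kvfree(pages);
		return pinned < 0 ? pinned : -EFAULT;
	}
	/* ... and on teardown: unpin_user_pages(pages, nr); kvfree(pages); */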


For example, pairing udmabuf with vfio (which pins pages using 
pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in 
all cases: if udmabuf longterm pinned the pages "the wrong way", vfio 
will fail to migrate them during FOLL_LONGTERM and consequently fail 
pin_user_pages(). As long as udmabuf holds a reference on these pages, 
that will never succeed.


There are *probably* more issues on the QEMU side when udmabuf is paired 
with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for 
virtio-balloon, virtio-mem, postcopy live migration, ... for example, in 
the vfio/vdpa case we make sure that we disallow most of these, because 
otherwise there can be an accidental "disconnect" between the pages 
mapped into the VM (guest view) and the pages mapped into the IOMMU 
(device view), for example, after a reboot.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
@ 2023-06-22 21:33   ` Mike Kravetz
  2023-06-23  6:13   ` Kasireddy, Vivek
  2023-08-08 16:17   ` Daniel Vetter
  2 siblings, 0 replies; 23+ messages in thread
From: Mike Kravetz @ 2023-06-22 21:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vivek Kasireddy, dri-devel, linux-mm, Gerd Hoffmann, Dongwon Kim,
	Andrew Morton, James Houghton, Jerome Marchand, Junxiao Chang,
	Kirill A . Shutemov, Michal Hocko, Muchun Song, Jason Gunthorpe,
	John Hubbard

[-- Attachment #1: Type: text/plain, Size: 950 bytes --]

On 06/22/23 10:25, David Hildenbrand wrote:
> On 22.06.23 09:27, Vivek Kasireddy wrote:
> 
> There are *probably* more issues on the QEMU side when udmabuf is paired
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for virtio-balloon,
> virtio-mem, postcopy live migration, ... for example, in the vfio/vdpa case
> we make sure that we disallow most of these, because otherwise there can be
> an accidental "disconnect" between the pages mapped into the VM (guest view)
> and the pages mapped into the IOMMU (device view), for example, after a
> reboot.
> 

Yes, this "disconnect" is still possible.  Attached is a test program I
hacked up based on the udmabuf selftest.  You can see different content
in the memfd pages and udma pages.

FYI- I can verify this new udmabuf code is not accessing struct pages of
hugetlb tail pages, as this test program BUG'ed if hugetlb vmemmap
optimization was enabled in the old udmabuf.
-- 
Mike Kravetz

[-- Attachment #2: udmabuf.c --]
[-- Type: text/plain, Size: 3726 bytes --]

// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#define __EXPORTED_HEADERS__

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <malloc.h>
#include <sys/mman.h>

#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/memfd.h>
#include <linux/udmabuf.h>

#define TEST_PREFIX	"drivers/dma-buf/udmabuf"
#define NUM_PAGES       2

static int my_getpagesize(void)
{
	/* huge page size */
	return getpagesize() * 512;
}

#if 0
static int memfd_create(const char *name, unsigned int flags)
{
	return syscall(__NR_memfd_create, name, flags);
}
#endif

int main(int argc, char *argv[])
{
	struct udmabuf_create create;
	int devfd, memfd, buf, ret;
	off_t size;
	void *mem;
	int i;
	char foo;
	int mem_fd, udma_fd;
	void *addr1, *addr2;

	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd < 0) {
		printf("%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
		       TEST_PREFIX);
		exit(77);
	}

	mem_fd = memfd_create("udmabuf-test", MFD_HUGETLB | MFD_ALLOW_SEALING);
	if (mem_fd < 0) {
		printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
		exit(77);
	}

	ret = fcntl(mem_fd, F_ADD_SEALS, F_SEAL_SHRINK);
	if (ret < 0) {
		printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
		exit(77);
	}


	size = my_getpagesize() * NUM_PAGES;
	ret = ftruncate(mem_fd, size);

	if (ret == -1) {
		printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
		exit(1);
	}

	/* touch all pages */
	addr1 = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, mem_fd, 0);
	if (addr1 == MAP_FAILED) {
		printf("%s: [FAIL,mmap]\n", TEST_PREFIX);
		exit(1);
	}

	for (i = 0; i < size / getpagesize(); i++) {
		*((char *)addr1 + (i * getpagesize())) = 'a';
	}

	memset(&create, 0, sizeof(create));

#if 0
	/* should fail (offset not page aligned) */
	create.memfd  = mem_fd;
	create.offset = getpagesize()/2;
	create.size   = getpagesize();
	buf = ioctl(devfd, UDMABUF_CREATE, &create);
	if (buf >= 0) {
		printf("%s: [FAIL,test-1]\n", TEST_PREFIX);
		exit(1);
	}

	/* should fail (size not multiple of page) */
	create.memfd  = mem_fd;
	create.offset = 0;
	create.size   = getpagesize()/2;
	buf = ioctl(devfd, UDMABUF_CREATE, &create);
	if (buf >= 0) {
		printf("%s: [FAIL,test-2]\n", TEST_PREFIX);
		exit(1);
	}

	/* should fail (not memfd) */
	create.memfd  = 0; /* stdin */
	create.offset = 0;
	create.size   = size;
	buf = ioctl(devfd, UDMABUF_CREATE, &create);
	if (buf >= 0) {
		printf("%s: [FAIL,test-3]\n", TEST_PREFIX);
		exit(1);
	}
#endif

	/* should work */
	create.memfd  = mem_fd;
	create.offset = getpagesize() * 256;
	create.size   = getpagesize() * 4;
	udma_fd = ioctl(devfd, UDMABUF_CREATE, &create);
	if (udma_fd < 0) {
		perror("UDMABUF_CREATE");
		printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
		exit(1);
	}

	printf("before hole punch\n");
	(void)getchar();
	ret = fallocate(mem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		0, my_getpagesize());
	if (ret)
		perror("fallocate punch hole");
	printf("after hole punch\n");
	(void)getchar();

	for (i = 0; i < size / getpagesize(); i++) {
		*((char *)addr1 + (i * getpagesize())) = 'b';
	}

	printf("after touch again\n");
	(void)getchar();

	/* touch all pages */
	addr2 = mmap(NULL, getpagesize() * 4, PROT_READ|PROT_WRITE, MAP_SHARED, udma_fd, 0);
	if (addr2 == MAP_FAILED) {
		perror("mmap");
		printf("%s: udma_fd mmap fail\n", TEST_PREFIX);
		exit(1);
	}

	for (i = 0; i < 4; i++) {
		foo = *((char *)addr2 + (i * getpagesize()));
		printf("udmabuf %c\n", foo);
	}

	for (i = 256; i < 260; i++) {
		foo = *((char *)addr1 + (i * getpagesize()));
		printf("memfd %c\n", foo);
	}

	fprintf(stderr, "%s: ok\n", TEST_PREFIX);
	close(udma_fd);
	close(mem_fd);
	close(devfd);
	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
@ 2023-06-22 22:10   ` kernel test robot
  0 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2023-06-22 22:10 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, linux-mm
  Cc: oe-kbuild-all, James Houghton, Jerome Marchand, Dongwon Kim,
	David Hildenbrand, Junxiao Chang, Muchun Song, Vivek Kasireddy,
	Michal Hocko, Gerd Hoffmann, Andrew Morton,
	Linux Memory Management List, Kirill A . Shutemov, Mike Kravetz

Hi Vivek,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20230622]
[cannot apply to drm-misc/drm-misc-next drm-tip/drm-tip v6.4-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vivek-Kasireddy/udmabuf-Use-vmf_insert_pfn-and-VM_PFNMAP-for-handling-mmap/20230622-155402
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230622072710.3707315-3-vivek.kasireddy%40intel.com
patch subject: [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages
config: openrisc-randconfig-c044-20230622 (https://download.01.org/0day-ci/archive/20230623/202306230534.TIq82gef-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230623/202306230534.TIq82gef-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306230534.TIq82gef-lkp@intel.com/

cocci warnings: (new ones prefixed by >>)
>> drivers/dma-buf/udmabuf.c:172:2-7: WARNING: NULL check before some freeing functions is not needed.
   drivers/dma-buf/udmabuf.c:360:2-7: WARNING: NULL check before some freeing functions is not needed.
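
For context, the warning refers to hunks of the form below; since
kfree(NULL) is a no-op, the NULL check can simply be dropped. A suggested
cleanup (not part of the posted series) would be:

-	if (ubuf->offsets)
-		kfree(ubuf->offsets);
+	kfree(ubuf->offsets);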

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
  2023-06-22 21:33   ` Mike Kravetz
@ 2023-06-23  6:13   ` Kasireddy, Vivek
  2023-06-23 16:35     ` Peter Xu
  2023-08-08 16:17   ` Daniel Vetter
  2 siblings, 1 reply; 23+ messages in thread
From: Kasireddy, Vivek @ 2023-06-23  6:13 UTC (permalink / raw)
  To: David Hildenbrand, dri-devel, linux-mm
  Cc: Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, Jason Gunthorpe,
	John Hubbard

Hi David,

> > The first patch ensures that the mappings needed for handling mmap
> > operation would be managed by using the pfn instead of struct page.
> > The second patch restores support for mapping hugetlb pages where
> > subpages of a hugepage are not directly used anymore (main reason
> > for revert) and instead the hugetlb pages and the relevant offsets
> > are used to populate the scatterlist for dma-buf export and for
> > mmap operation.
> >
> > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> options
> > were passed to the Host kernel and Qemu was launched with these
> > relevant options: qemu-system-x86_64 -m 4096m....
> > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > -display gtk,gl=on
> > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > -machine memory-backend=mem1
> >
> > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > exercise the mmap handler.
> >
> 
> While I think the VM_PFNMAP approach is much better and should fix that
> issue at hand, I thought more about missing memlock support and realized
> that we might have to fix something else. SO I'm going to raise the
> issue here.
> 
> I think udmabuf chose the wrong interface to do what it's doing, that
> makes it harder to fix it eventually.
> 
> Instead of accepting a range in a memfd, it should just have accepted a
> user space address range and then used
> pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> "officially".
Udmabuf indeed started off by using a user space address range and GUP, but
the dma-buf subsystem maintainer had concerns with that approach in v2.
It also had support for mlock in that version. Here is v2 and the relevant
conversation:
https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2

> 
> So what's the issue? Udma effectively pins pages longterm ("possibly
> forever") simply by grabbing a reference on them. These pages might
> easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> 
> So what udmabuf does is break memory hotunplug and CMA, because it
> turns
> pages that have to remain movable unmovable.
> 
> In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> these
> pages. See mm/gup.c:check_and_migrate_movable_pages() and especially
> folio_is_longterm_pinnable(). We'd probably have to implement something
> similar for udmabuf, where we detect such unpinnable pages and migrate
> them.
The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu)
resources (or buffers allocated and pinned from shmem via drm GEM). Some
resources are short-lived, and some are long-lived and whenever a resource
gets destroyed, the pages are unpinned. And, not all resources have their pages
pinned. The resource that is pinned for the longest duration is the FB and that's
because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host
as a workaround, but I guess that may not be very efficient given the amount
of churn it would create.

Also, as far as migration or S3/S4 is concerned, my understanding is that all
the Guest resources are destroyed and recreated again. So, wouldn't something
similar happen during memory hotunplug?

> 
> 
> For example, pairing udmabuf with vfio (which pins pages using
> pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> will fail to migrate them during FOLL_LONGTERM and consequently fail
> pin_user_pages(). As long as udmabuf holds a reference on these pages,
> that will never succeed.
Dma-buf rules (for exporters) indicate that the pages only need to be pinned
during the map_attachment phase (and until unmap attachment happens).
In other words, only when the sg_table is created by udmabuf. I guess one
option would be to not hold any references during UDMABUF_CREATE and
only grab references to the pages (as and when it gets used) during this step.
Would this help?
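
A very rough sketch of that idea (hypothetical, and it leaves open what
keeps the pages alive between create and map, as noted in the reply below):
take the page references in the attachment-mapping path instead of at
create time, e.g.:

	static struct sg_table *map_udmabuf(struct dma_buf_attachment *at,
					    enum dma_data_direction direction)
	{
		struct udmabuf *ubuf = at->dmabuf->priv;
		pgoff_t pg;

		/* hold the references only for the lifetime of this mapping */
		for (pg = 0; pg < ubuf->pagecount; pg++)
			get_page(ubuf->pages[pg]);

		return get_sg_table(at->dev, at->dmabuf, direction);
	}

with the matching put_page() loop in the unmap path.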

> 
> 
> There are *probably* more issues on the QEMU side when udmabuf is
> paired
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> the vfio/vdpa case we make sure that we disallow most of these, because
> otherwise there can be an accidental "disconnect" between the pages
> mapped into the VM (guest view) and the pages mapped into the IOMMU
> (device view), for example, after a reboot.
Ok; I am not sure if I can figure out if there is any acceptable way to address
these issues but given the current constraints associated with udmabuf, what
do you suggest is the most reasonable way to deal with these problems you
have identified?

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-23  6:13   ` Kasireddy, Vivek
@ 2023-06-23 16:35     ` Peter Xu
  2023-06-23 16:37       ` Jason Gunthorpe
  2023-06-26  7:45       ` Kasireddy, Vivek
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Xu @ 2023-06-23 16:35 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: David Hildenbrand, dri-devel, linux-mm, Mike Kravetz,
	Gerd Hoffmann, Kim, Dongwon, Andrew Morton, James Houghton,
	Jerome Marchand, Chang, Junxiao, Kirill A . Shutemov, Hocko,
	Michal, Muchun Song, Jason Gunthorpe, John Hubbard

On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> Hi David,
> 
> > > The first patch ensures that the mappings needed for handling mmap
> > > operation would be managed by using the pfn instead of struct page.
> > > The second patch restores support for mapping hugetlb pages where
> > > subpages of a hugepage are not directly used anymore (main reason
> > > for revert) and instead the hugetlb pages and the relevant offsets
> > > are used to populate the scatterlist for dma-buf export and for
> > > mmap operation.
> > >
> > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> > options
> > > were passed to the Host kernel and Qemu was launched with these
> > > relevant options: qemu-system-x86_64 -m 4096m....
> > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > > -display gtk,gl=on
> > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > > -machine memory-backend=mem1
> > >
> > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > exercise the mmap handler.
> > >
> > 
> > While I think the VM_PFNMAP approach is much better and should fix that
> > issue at hand, I thought more about missing memlock support and realized
> > that we might have to fix something else. SO I'm going to raise the
> > issue here.
> > 
> > I think udmabuf chose the wrong interface to do what it's doing, that
> > makes it harder to fix it eventually.
> > 
> > Instead of accepting a range in a memfd, it should just have accepted a
> > user space address range and then used
> > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> > "officially".
> Udmabuf indeed started off by using user space address range and GUP but
> the dma-buf subsystem maintainer had concerns with that approach in v2.
> It also had support for mlock in that version. Here is v2 and the relevant
> conversation:
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> 
> > 
> > So what's the issue? Udma effectively pins pages longterm ("possibly
> > forever") simply by grabbing a reference on them. These pages might
> > easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> > 
> > So what udmabuf does is break memory hotunplug and CMA, because it
> > turns
> > pages that have to remain movable unmovable.
> > 
> > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > these
> > pages. See mm/gup.c:check_and_migrate_movable_pages() and especially
> > folio_is_longterm_pinnable(). We'd probably have to implement something
> > similar for udmabuf, where we detect such unpinnable pages and migrate
> > them.
> The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu)
> resources (or buffers allocated and pinned from shmem via drm GEM). Some
> resources are short-lived, and some are long-lived and whenever a resource
> gets destroyed, the pages are unpinned. And, not all resources have their pages
> pinned. The resource that is pinned for the longest duration is the FB and that's
> because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host
> as a workaround, but I guess that may not be very efficient given the amount
> of churn it would create.
> 
> Also, as far as migration or S3/S4 is concerned, my understanding is that all
> the Guest resources are destroyed and recreated again. So, wouldn't something
> similar happen during memory hotunplug?
> 
> > 
> > 
> > For example, pairing udmabuf with vfio (which pins pages using
> > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > that will never succeed.
> Dma-buf rules (for exporters) indicate that the pages only need to be pinned
> during the map_attachment phase (and until unmap attachment happens).
> In other words, only when the sg_table is created by udmabuf. I guess one
> option would be to not hold any references during UDMABUF_CREATE and
> only grab references to the pages (as and when it gets used) during this step.
> Would this help?

IIUC the refcount is needed, otherwise I don't see what protects the page
from being freed and even reused elsewhere before map_attachment().

It seems the previous concern on using gup was majorly fork(), if this is it:

https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Could it also be guarded by just making sure the pages are MAP_SHARED when
creating the udmabuf, if fork() is a requirement of the feature?
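
(A hypothetical sketch of such a guard in a GUP-based create path, where
"uaddr" is an assumed user-provided address rather than a memfd offset:)

	mmap_read_lock(current->mm);
	vma = vma_lookup(current->mm, uaddr);
	if (!vma || !(vma->vm_flags & VM_SHARED)) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}
	mmap_read_unlock(current->mm);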

I had a feeling that the userspace still needs to always do the right thing
to make it work, even using pure PFN mappings.

For instance, what if the userapp just punches a hole in the shmem/hugetlbfs
file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
at least not F_SEAL_WRITE with the current impl), and faults a new page into
the page cache?

Thanks,

> 
> > 
> > 
> > There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> > the vfio/vdpa case we make sure that we disallow most of these, because
> > otherwise there can be an accidental "disconnect" between the pages
> > mapped into the VM (guest view) and the pages mapped into the IOMMU
> > (device view), for example, after a reboot.
> Ok; I am not sure if I can figure out if there is any acceptable way to address
> these issues but given the current constraints associated with udmabuf, what
> do you suggest is the most reasonable way to deal with these problems you
> have identified?
> 
> Thanks,
> Vivek
> 
> > 
> > --
> > Cheers,
> > 
> > David / dhildenb
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-23 16:35     ` Peter Xu
@ 2023-06-23 16:37       ` Jason Gunthorpe
  2023-06-23 17:28         ` Peter Xu
  2023-06-26  7:45       ` Kasireddy, Vivek
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2023-06-23 16:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kasireddy, Vivek, David Hildenbrand, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Fri, Jun 23, 2023 at 12:35:45PM -0400, Peter Xu wrote:

> It seems the previous concern on using gup was majorly fork(), if this is it:
> 
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Fork and GUP have been fixed since that comment anyhow there is no
longer a problem using GUP and fork together.

> Could it also be guarded by just making sure the pages are MAP_SHARED when
> creating the udmabuf, if fork() is a requirement of the feature?

Also a reasonable thing to do

> For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> at least not F_SEAL_WRITE with current impl), and fault a new page into the
> page cache?

It becomes incoherent just like all other GUP users if userspace
explicitly manipulates the VMAs. It is no different from what happens
with VFIO; qemu should treat this memory the same as it does for VFIO
memory.

Jason


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-23 16:37       ` Jason Gunthorpe
@ 2023-06-23 17:28         ` Peter Xu
  2023-06-26 12:57           ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-06-23 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kasireddy, Vivek, David Hildenbrand, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Fri, Jun 23, 2023 at 01:37:58PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 23, 2023 at 12:35:45PM -0400, Peter Xu wrote:
> 
> > It seems the previous concern on using gup was majorly fork(), if this is it:
> > 
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
> 
> Fork and GUP have been fixed since that comment anyhow there is no
> longer a problem using GUP and fork together.

Ah, I read it previously as a requirement that the child will also be able
to see the same / coherent page when manipulating the dmabuf later after
fork(), e.g., the udmabuf can be created before fork().

> 
> > Could it also be guarded by just making sure the pages are MAP_SHARED when
> > creating the udmabuf, if fork() is a requirement of the feature?
> 
> Also a resaonable thing to do
> 
> > For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> > file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> > at least not F_SEAL_WRITE with current impl), and fault a new page into the
> > page cache?
> 
> It becomes incoherent just like all other GUP users if userspace
> explicitly manipulates the VMAs. It is no different to what happens
> with VFIO, qemu should treat this memory the same as it does for VFIO
> memory.

Agreed.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-23 16:35     ` Peter Xu
  2023-06-23 16:37       ` Jason Gunthorpe
@ 2023-06-26  7:45       ` Kasireddy, Vivek
  2023-06-26 17:52         ` Peter Xu
  1 sibling, 1 reply; 23+ messages in thread
From: Kasireddy, Vivek @ 2023-06-26  7:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, dri-devel, linux-mm, Mike Kravetz,
	Gerd Hoffmann, Kim, Dongwon, Andrew Morton, James Houghton,
	Jerome Marchand, Chang, Junxiao, Kirill A . Shutemov, Hocko,
	Michal, Muchun Song, Jason Gunthorpe, John Hubbard

Hi Peter,

> 
> On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> > Hi David,
> >
> > > > The first patch ensures that the mappings needed for handling mmap
> > > > operation would be managed by using the pfn instead of struct page.
> > > > The second patch restores support for mapping hugetlb pages where
> > > > subpages of a hugepage are not directly used anymore (main reason
> > > > for revert) and instead the hugetlb pages and the relevant offsets
> > > > are used to populate the scatterlist for dma-buf export and for
> > > > mmap operation.
> > > >
> > > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> > > options
> > > > were passed to the Host kernel and Qemu was launched with these
> > > > relevant options: qemu-system-x86_64 -m 4096m....
> > > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > > > -display gtk,gl=on
> > > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > > > -machine memory-backend=mem1
> > > >
> > > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > > exercise the mmap handler.
> > > >
> > >
> > > While I think the VM_PFNMAP approach is much better and should fix
> that
> > > issue at hand, I thought more about missing memlock support and
> realized
> > > that we might have to fix something else. SO I'm going to raise the
> > > issue here.
> > >
> > > I think udmabuf chose the wrong interface to do what it's doing, that
> > > makes it harder to fix it eventually.
> > >
> > > Instead of accepting a range in a memfd, it should just have accepted a
> > > user space address range and then used
> > > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
> pages
> > > "officially".
> > Udmabuf indeed started off by using user space address range and GUP
> but
> > the dma-buf subsystem maintainer had concerns with that approach in v2.
> > It also had support for mlock in that version. Here is v2 and the relevant
> > conversation:
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> >
> > >
> > > So what's the issue? Udma effectively pins pages longterm ("possibly
> > > forever") simply by grabbing a reference on them. These pages might
> > > easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> > >
> > > So what udmabuf does is break memory hotunplug and CMA, because it
> > > turns
> > > pages that have to remain movable unmovable.
> > >
> > > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > > these
> > > pages. See mm/gup.c:check_and_migrate_movable_pages() and
> especially
> > > folio_is_longterm_pinnable(). We'd probably have to implement
> something
> > > similar for udmabuf, where we detect such unpinnable pages and
> migrate
> > > them.
> > The pages udmabuf pins are only those associated with Guest (GPU
> driver/virtio-gpu)
> > resources (or buffers allocated and pinned from shmem via drm GEM).
> Some
> > resources are short-lived, and some are long-lived and whenever a
> resource
> > gets destroyed, the pages are unpinned. And, not all resources have their
> pages
> > pinned. The resource that is pinned for the longest duration is the FB and
> that's
> > because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> > GPU driver. We can certainly pin/unpin the FB after it is accessed on the
> Host
> > as a workaround, but I guess that may not be very efficient given the
> amount
> > of churn it would create.
> >
> > Also, as far as migration or S3/S4 is concerned, my understanding is that all
> > the Guest resources are destroyed and recreated again. So, wouldn't
> something
> > similar happen during memory hotunplug?
> >
> > >
> > >
> > > For example, pairing udmabuf with vfio (which pins pages using
> > > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work
> in
> > > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > > that will never succeed.
> > Dma-buf rules (for exporters) indicate that the pages only need to be
> pinned
> > during the map_attachment phase (and until unmap attachment happens).
> > In other words, only when the sg_table is created by udmabuf. I guess one
> > option would be to not hold any references during UDMABUF_CREATE and
> > only grab references to the pages (as and when it gets used) during this
> step.
> > Would this help?
> 
> IIUC the refcount is needed, otherwise I don't see what to protect the page
> from being freed and even reused elsewhere before map_attachment().
> 
> It seems the previous concern on using gup was majorly fork(), if this is it:
> 
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#co
> mment_414213
> 
> Could it also be guarded by just making sure the pages are MAP_SHARED
> when
> creating the udmabuf, if fork() is a requirement of the feature?
> 
> I had a feeling that the userspace still needs to always do the right thing
> to make it work, even using pure PFN mappings.
> 
> For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> at least not F_SEAL_WRITE with current impl), and fault a new page into the
> page cache?
IIUC, Qemu creates and owns the memfd that is associated with Guest memory.
And if it punches a hole in its own memfd that goes through Guest pinned pages 
associated with buffers/resources, then I think the proper way to fix this is to
somehow notify the Guest driver (virtio-gpu in this case) that the backing store
of the affected resources is no longer valid and that the resources need to be
destroyed and re-created again.

Having said that, one option I could think of is to install an mmu_notifier
associated with the relevant pages in udmabuf; once we get notified about
any invalidation event concerning any of the pages, we'd fail any subsequent
attempt to access these pages and propagate the error across the stack.
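
A rough illustration of that option (assuming hypothetical "notifier" and
"invalid" fields added to struct udmabuf, and noting that an interval
notifier tracks a VA range in a specific mm rather than the memfd itself):

	static bool udmabuf_invalidate(struct mmu_interval_notifier *mni,
				       const struct mmu_notifier_range *range,
				       unsigned long cur_seq)
	{
		struct udmabuf *ubuf = container_of(mni, struct udmabuf, notifier);

		/* mark the buffer invalid; later accesses would return an error */
		WRITE_ONCE(ubuf->invalid, true);
		return true;
	}

	static const struct mmu_interval_notifier_ops udmabuf_notifier_ops = {
		.invalidate = udmabuf_invalidate,
	};

	/* at create time, for a VA range [start, start + len) of current->mm */
	ret = mmu_interval_notifier_insert(&ubuf->notifier, current->mm,
					   start, len, &udmabuf_notifier_ops);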

However, it feels like udmabuf is not the right place to handle this issue because
there are very limited options for taking proper corrective action if Qemu decides
to punch a hole in Guest memory that takes out pages that are pinned.

Thanks,
Vivek

> 
> Thanks,
> 
> >
> > >
> > >
> > > There are *probably* more issues on the QEMU side when udmabuf is
> > > paired
> > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> > > the vfio/vdpa case we make sure that we disallow most of these, because
> > > otherwise there can be an accidental "disconnect" between the pages
> > > mapped into the VM (guest view) and the pages mapped into the IOMMU
> > > (device view), for example, after a reboot.
> > Ok; I am not sure if I can figure out if there is any acceptable way to
> address
> > these issues but given the current constraints associated with udmabuf,
> what
> > do you suggest is the most reasonable way to deal with these problems you
> > have identified?
> >
> > Thanks,
> > Vivek
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> >
> 
> --
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-23 17:28         ` Peter Xu
@ 2023-06-26 12:57           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2023-06-26 12:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Kasireddy, Vivek, David Hildenbrand, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Fri, Jun 23, 2023 at 01:28:24PM -0400, Peter Xu wrote:
> On Fri, Jun 23, 2023 at 01:37:58PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 23, 2023 at 12:35:45PM -0400, Peter Xu wrote:
> > 
> > > It seems the previous concern on using gup was majorly fork(), if this is it:
> > > 
> > > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
> > 
> > Fork and GUP have been fixed since that comment anyhow there is no
> > longer a problem using GUP and fork together.
> 
> Ah, I read it previously as a requirement that the child will also be able
> the see the same / coherent page when manipulating the dmabuf later after
> fork(), e.g., the udmabuf can be created before fork().

That always worked, just use MAP_SHARED.

Jason


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26  7:45       ` Kasireddy, Vivek
@ 2023-06-26 17:52         ` Peter Xu
  2023-06-26 18:14           ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-06-26 17:52 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: David Hildenbrand, dri-devel, linux-mm, Mike Kravetz,
	Gerd Hoffmann, Kim, Dongwon, Andrew Morton, James Houghton,
	Jerome Marchand, Chang, Junxiao, Kirill A . Shutemov, Hocko,
	Michal, Muchun Song, Jason Gunthorpe, John Hubbard

On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote:
> Hi Peter,
> 
> > 
> > On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> > > Hi David,
> > >
> > > > > The first patch ensures that the mappings needed for handling mmap
> > > > > operation would be managed by using the pfn instead of struct page.
> > > > > The second patch restores support for mapping hugetlb pages where
> > > > > subpages of a hugepage are not directly used anymore (main reason
> > > > > for revert) and instead the hugetlb pages and the relevant offsets
> > > > > are used to populate the scatterlist for dma-buf export and for
> > > > > mmap operation.
> > > > >
> > > > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> > > > options
> > > > > were passed to the Host kernel and Qemu was launched with these
> > > > > relevant options: qemu-system-x86_64 -m 4096m....
> > > > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > > > > -display gtk,gl=on
> > > > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > > > > -machine memory-backend=mem1
> > > > >
> > > > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > > > exercise the mmap handler.
> > > > >
> > > >
> > > > While I think the VM_PFNMAP approach is much better and should fix
> > that
> > > > issue at hand, I thought more about missing memlock support and
> > realized
> > > > that we might have to fix something else. SO I'm going to raise the
> > > > issue here.
> > > >
> > > > I think udmabuf chose the wrong interface to do what it's doing, that
> > > > makes it harder to fix it eventually.
> > > >
> > > > Instead of accepting a range in a memfd, it should just have accepted a
> > > > user space address range and then used
> > > > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
> > pages
> > > > "officially".
> > > Udmabuf indeed started off by using user space address range and GUP
> > but
> > > the dma-buf subsystem maintainer had concerns with that approach in v2.
> > > It also had support for mlock in that version. Here is v2 and the relevant
> > > conversation:
> > > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> > >
> > > >
> > > > So what's the issue? Udma effectively pins pages longterm ("possibly
> > > > forever") simply by grabbing a reference on them. These pages might
> > > > easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> > > >
> > > > So what udmabuf does is break memory hotunplug and CMA, because it
> > > > turns
> > > > pages that have to remain movable unmovable.
> > > >
> > > > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > > > these
> > > > pages. See mm/gup.c:check_and_migrate_movable_pages() and
> > especially
> > > > folio_is_longterm_pinnable(). We'd probably have to implement
> > something
> > > > similar for udmabuf, where we detect such unpinnable pages and
> > migrate
> > > > them.
> > > The pages udmabuf pins are only those associated with Guest (GPU
> > driver/virtio-gpu)
> > > resources (or buffers allocated and pinned from shmem via drm GEM).
> > Some
> > > resources are short-lived, and some are long-lived and whenever a
> > resource
> > > gets destroyed, the pages are unpinned. And, not all resources have their
> > pages
> > > pinned. The resource that is pinned for the longest duration is the FB and
> > that's
> > > because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> > > GPU driver. We can certainly pin/unpin the FB after it is accessed on the
> > Host
> > > as a workaround, but I guess that may not be very efficient given the
> > amount
> > > of churn it would create.
> > >
> > > Also, as far as migration or S3/S4 is concerned, my understanding is that all
> > > the Guest resources are destroyed and recreated again. So, wouldn't
> > something
> > > similar happen during memory hotunplug?
> > >
> > > >
> > > >
> > > > For example, pairing udmabuf with vfio (which pins pages using
> > > > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work
> > in
> > > > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > > > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > > > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > > > that will never succeed.
> > > Dma-buf rules (for exporters) indicate that the pages only need to be
> > pinned
> > > during the map_attachment phase (and until unmap attachment happens).
> > > In other words, only when the sg_table is created by udmabuf. I guess one
> > > option would be to not hold any references during UDMABUF_CREATE and
> > > only grab references to the pages (as and when it gets used) during this
> > step.
> > > Would this help?
> > 
> > IIUC the refcount is needed, otherwise I don't see what to protect the page
> > from being freed and even reused elsewhere before map_attachment().
> > 
> > It seems the previous concern on using gup was majorly fork(), if this is it:
> > 
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#co
> > mment_414213
> > 
> > Could it also be guarded by just making sure the pages are MAP_SHARED
> > when
> > creating the udmabuf, if fork() is a requirement of the feature?
> > 
> > I had a feeling that the userspace still needs to always do the right thing
> > to make it work, even using pure PFN mappings.
> > 
> > For instance, what if the userapp just punchs a hole in the shmem/hugetlbfs
> > file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> > at least not F_SEAL_WRITE with current impl), and fault a new page into the
> > page cache?
> IIUC, Qemu creates and owns the memfd that is associated with Guest memory.
> And if it punches a hole in its own memfd that goes through Guest pinned pages 
> associated with buffers/resources, then I think the proper way to fix this is to
> somehow notify the Guest driver (virtio-gpu in this case) that the backing store
> of the affected resources is no longer valid and that the resources need to be
> destroyed and re-created again.
> 
> Having said that, one option I could think of is to probably install a mmu_notifier
> associated with the relevant pages in udmabuf and once we get notified about
> any invalidation event concerning any of the pages, we'd fail any subsequent
> attempt to access these pages and propagate the error across the stack. 

Sounds right, maybe it needs to go back to the old GUP solution, though, as
mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also solve
the movable pages issue on pinning.

> 
> However, it feels like udmabuf is not the right place to handle this issue because
> there are very limited options for taking proper corrective action if Qemu decides
> to punch a hole in Guest memory that takes out pages that are pinned.

I'm not familiar with the use case that much, but IMHO it's fine if the
driver relies on proper behavior of the userapp to work.

IIUC the worst case here is that the udmabuf holds some pages that are no
longer the pages of the guest mem, but that only happens with misbehaving
userspace, so it looks all fine as long as they can at least be properly
released when the udmabuf is released.  It'd be better if the udmabuf could
fail hard when detecting this, but IMHO even that can be optional depending
on the need, while any corrective action would be yet one step further.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26 17:52         ` Peter Xu
@ 2023-06-26 18:14           ` David Hildenbrand
  2023-06-26 18:18             ` Jason Gunthorpe
  2023-06-27  6:37             ` Kasireddy, Vivek
  0 siblings, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2023-06-26 18:14 UTC (permalink / raw)
  To: Peter Xu, Kasireddy, Vivek
  Cc: dri-devel, linux-mm, Mike Kravetz, Gerd Hoffmann, Kim, Dongwon,
	Andrew Morton, James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, Jason Gunthorpe,
	John Hubbard

On 26.06.23 19:52, Peter Xu wrote:
> On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote:
>> Hi Peter,
>>
>>>
>>> On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
>>>> Hi David,
>>>>
>>>>>> The first patch ensures that the mappings needed for handling mmap
>>>>>> operation would be managed by using the pfn instead of struct page.
>>>>>> The second patch restores support for mapping hugetlb pages where
>>>>>> subpages of a hugepage are not directly used anymore (main reason
>>>>>> for revert) and instead the hugetlb pages and the relevant offsets
>>>>>> are used to populate the scatterlist for dma-buf export and for
>>>>>> mmap operation.
>>>>>>
>>>>>> Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
>>>>> options
>>>>>> were passed to the Host kernel and Qemu was launched with these
>>>>>> relevant options: qemu-system-x86_64 -m 4096m....
>>>>>> -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
>>>>>> -display gtk,gl=on
>>>>>> -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
>>>>>> -machine memory-backend=mem1
>>>>>>
>>>>>> Replacing -display gtk,gl=on with -display gtk,gl=off above would
>>>>>> exercise the mmap handler.
>>>>>>
>>>>>
>>>>> While I think the VM_PFNMAP approach is much better and should fix
>>> that
>>>>> issue at hand, I thought more about missing memlock support and
>>> realized
>>>>> that we might have to fix something else. SO I'm going to raise the
>>>>> issue here.
>>>>>
>>>>> I think udmabuf chose the wrong interface to do what it's doing, that
>>>>> makes it harder to fix it eventually.
>>>>>
>>>>> Instead of accepting a range in a memfd, it should just have accepted a
>>>>> user space address range and then used
>>>>> pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
>>> pages
>>>>> "officially".
>>>> Udmabuf indeed started off by using user space address range and GUP
>>> but
>>>> the dma-buf subsystem maintainer had concerns with that approach in v2.
>>>> It also had support for mlock in that version. Here is v2 and the relevant
>>>> conversation:
>>>> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
>>>>
>>>>>
>>>>> So what's the issue? Udma effectively pins pages longterm ("possibly
>>>>> forever") simply by grabbing a reference on them. These pages might
>>>>> easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
>>>>>
>>>>> So what udmabuf does is break memory hotunplug and CMA, because it
>>>>> turns
>>>>> pages that have to remain movable unmovable.
>>>>>
>>>>> In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
>>>>> these
>>>>> pages. See mm/gup.c:check_and_migrate_movable_pages() and
>>> especially
>>>>> folio_is_longterm_pinnable(). We'd probably have to implement
>>> something
>>>>> similar for udmabuf, where we detect such unpinnable pages and
>>> migrate
>>>>> them.
>>>> The pages udmabuf pins are only those associated with Guest (GPU
>>> driver/virtio-gpu)
>>>> resources (or buffers allocated and pinned from shmem via drm GEM).
>>> Some
>>>> resources are short-lived, and some are long-lived and whenever a
>>> resource
>>>> gets destroyed, the pages are unpinned. And, not all resources have their
>>> pages
>>>> pinned. The resource that is pinned for the longest duration is the FB and
>>> that's
>>>> because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
>>>> GPU driver. We can certainly pin/unpin the FB after it is accessed on the
>>> Host
>>>> as a workaround, but I guess that may not be very efficient given the
>>> amount
>>>> of churn it would create.
>>>>
>>>> Also, as far as migration or S3/S4 is concerned, my understanding is that all
>>>> the Guest resources are destroyed and recreated again. So, wouldn't
>>> something
>>>> similar happen during memory hotunplug?
>>>>
>>>>>
>>>>>
>>>>> For example, pairing udmabuf with vfio (which pins pages using
>>>>> pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work
>>> in
>>>>> all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
>>>>> will fail to migrate them during FOLL_LONGTERM and consequently fail
>>>>> pin_user_pages(). As long as udmabuf holds a reference on these pages,
>>>>> that will never succeed.
>>>> Dma-buf rules (for exporters) indicate that the pages only need to be
>>> pinned
>>>> during the map_attachment phase (and until unmap attachment happens).
>>>> In other words, only when the sg_table is created by udmabuf. I guess one
>>>> option would be to not hold any references during UDMABUF_CREATE and
>>>> only grab references to the pages (as and when it gets used) during this
>>> step.
>>>> Would this help?
>>>
>>> IIUC the refcount is needed, otherwise I don't see what to protect the page
>>> from being freed and even reused elsewhere before map_attachment().
>>>
>>> It seems the previous concern on using gup was majorly fork(), if this is it:
>>>
>>> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
>>>
>>> Could it also be guarded by just making sure the pages are MAP_SHARED
>>> when
>>> creating the udmabuf, if fork() is a requirement of the feature?
>>>
>>> I had a feeling that the userspace still needs to always do the right thing
>>> to make it work, even using pure PFN mappings.
>>>
>>> For instance, what if the userapp just punches a hole in the shmem/hugetlbfs
>>> file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
>>> at least not F_SEAL_WRITE with current impl), and faults a new page into the
>>> page cache?
>> IIUC, Qemu creates and owns the memfd that is associated with Guest memory.
>> And if it punches a hole in its own memfd that goes through Guest pinned pages
>> associated with buffers/resources, then I think the proper way to fix this is to
>> somehow notify the Guest driver (virtio-gpu in this case) that the backing store
>> of the affected resources is no longer valid and that the resources need to be
>> destroyed and re-created again.
>>
>> Having said that, one option I could think of is to probably install a mmu_notifier
>> associated with the relevant pages in udmabuf and once we get notified about
>> any invalidation event concerning any of the pages, we'd fail any subsequent
>> attempt to access these pages and propagate the error across the stack.
> 
> Sounds right, maybe it needs to go back to the old GUP solution, though, as
> mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
> it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also solve
> the movable pages issue on pinning.

It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we 
cannot achieve that without breaking the existing kernel interface ...

So we might have to implement the same page migration as gup does on 
FOLL_LONGTERM here ... maybe there are more such cases/drivers that 
actually require that handling when simply taking pages out of the 
memfd, believing they can hold on to them forever.

> 
>>
>> However, it feels like udmabuf is not the right place to handle this issue because
>> there are very limited options for taking proper corrective action if Qemu decides
>> to punch a hole in Guest memory that takes out pages that are pinned.
> 
> I'm not familiar with the use case that much, but IMHO it's fine if the
> driver relies on proper behaviors of userapp to work.
> 
> IIUC the worst case here is the udmabuf holds some pages that are not the
> pages of the guest mem anymore, but it only happens on misbehaved
> userspace, then it looks all fine as long as they can at least be properly
> released when releasing the udmabuf.  It'll be better if the udmabuf can
> fail hard when detecting this, but IMHO even that can be optional depending
> on the need, while any corrective action will be even one step further.

For vfio the issue is, e.g., VM reboots, not a misbehaving guest. If the
old guest kernel inflated the balloon and we reboot the VM, VFIO will
still reference the old pages and the new (unaware) kernel might use
that previously inflated memory to communicate with the devices. Because
of that, we disable balloon inflation when vfio is active in QEMU.

[there are more cases like unloading the balloon driver or deflating the 
balloon, and then using that memory for communicating with the device]

Maybe it's all fine with udmabuf because of the way it is setup/torn 
down by the guest driver. Unfortunately I can't tell.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26 18:14           ` David Hildenbrand
@ 2023-06-26 18:18             ` Jason Gunthorpe
  2023-06-26 19:04               ` Peter Xu
  2023-06-27  6:37             ` Kasireddy, Vivek
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2023-06-26 18:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Kasireddy, Vivek, dri-devel, linux-mm, Mike Kravetz,
	Gerd Hoffmann, Kim, Dongwon, Andrew Morton, James Houghton,
	Jerome Marchand, Chang, Junxiao, Kirill A . Shutemov, Hocko,
	Michal, Muchun Song, John Hubbard

On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:

> So we might have to implement the same page migration as gup does on
> FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> require that handling when simply taking pages out of the memfd, believing
> they can hold on to them forever.

In general I would like to see an interface to FOLL_LONGTERM pin pages
from a memfd. I would quite happily use that in iommufd as well.

It solves some problems we have there with fork/exec/etc if the pages
are not linked to a mm_struct.
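Purely as an illustration of the shape I mean (the name and signature
below are made up -- nothing like this exists in the tree today):

  /*
   * Hypothetical: long-term pin nr_pages pages of a memfd starting at
   * pgoff, migrating anything that must stay movable first, without
   * going through any mm_struct / VMA.
   */
  long memfd_pin_user_pages(struct file *memfd, pgoff_t pgoff,
                            unsigned long nr_pages,
                            struct page **pages);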

Jason


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26 18:18             ` Jason Gunthorpe
@ 2023-06-26 19:04               ` Peter Xu
  2023-06-27 15:52                 ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-06-26 19:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Kasireddy, Vivek, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> 
> > So we might have to implement the same page migration as gup does on
> > FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> > require that handling when simply taking pages out of the memfd, believing
> > they can hold on to them forever.
> 
> In general I would like to see an interface to FOLL_LONGTERM pin pages
> from a memfd. I would quite happily use that in iommufd as well.
> 
> It solves some problems we have there with fork/exec/etc if the pages
> are not linked to a mm_struct.

Afaiu any fd based approach should mean it'll never work with private
memories, while mm-based should be able to work on any kind.  Maybe that's
not a problem - I assume at least udmabuf should just only work with shared
memories; not sure on iommufd, though.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26 18:14           ` David Hildenbrand
  2023-06-26 18:18             ` Jason Gunthorpe
@ 2023-06-27  6:37             ` Kasireddy, Vivek
  2023-06-27  7:10               ` David Hildenbrand
  1 sibling, 1 reply; 23+ messages in thread
From: Kasireddy, Vivek @ 2023-06-27  6:37 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: dri-devel, linux-mm, Mike Kravetz, Gerd Hoffmann, Kim, Dongwon,
	Andrew Morton, James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, Jason Gunthorpe,
	John Hubbard

Hi David,

> On 26.06.23 19:52, Peter Xu wrote:
> > On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote:
> >> Hi Peter,
> >>
> >>>
> >>> On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> >>>> Hi David,
> >>>>
> >>>>>> The first patch ensures that the mappings needed for handling mmap
> >>>>>> operation would be managed by using the pfn instead of struct page.
> >>>>>> The second patch restores support for mapping hugetlb pages where
> >>>>>> subpages of a hugepage are not directly used anymore (main reason
> >>>>>> for revert) and instead the hugetlb pages and the relevant offsets
> >>>>>> are used to populate the scatterlist for dma-buf export and for
> >>>>>> mmap operation.
> >>>>>>
> >>>>>> Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> >>>>> options
> >>>>>> were passed to the Host kernel and Qemu was launched with these
> >>>>>> relevant options: qemu-system-x86_64 -m 4096m....
> >>>>>> -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> >>>>>> -display gtk,gl=on
> >>>>>> -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> >>>>>> -machine memory-backend=mem1
> >>>>>>
> >>>>>> Replacing -display gtk,gl=on with -display gtk,gl=off above would
> >>>>>> exercise the mmap handler.
> >>>>>>
> >>>>>
> >>>>> While I think the VM_PFNMAP approach is much better and should fix
> >>> that
> >>>>> issue at hand, I thought more about missing memlock support and
> >>> realized
> >>>>> that we might have to fix something else. SO I'm going to raise the
> >>>>> issue here.
> >>>>>
> >>>>> I think udmabuf chose the wrong interface to do what it's doing, that
> >>>>> makes it harder to fix it eventually.
> >>>>>
> >>>>> Instead of accepting a range in a memfd, it should just have accepted a
> >>>>> user space address range and then used
> >>>>> pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
> >>> pages
> >>>>> "officially".
> >>>> Udmabuf indeed started off by using user space address range and GUP
> >>> but
> >>>> the dma-buf subsystem maintainer had concerns with that approach in
> v2.
> >>>> It also had support for mlock in that version. Here is v2 and the relevant
> >>>> conversation:
> >>>>
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> >>>>
> >>>>>
> >>>>> So what's the issue? Udma effectively pins pages longterm ("possibly
> >>>>> forever") simply by grabbing a reference on them. These pages might
> >>>>> easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> >>>>>
> >>>>> So what udmabuf does is break memory hotunplug and CMA, because
> it
> >>>>> turns
> >>>>> pages that have to remain movable unmovable.
> >>>>>
> >>>>> In the pin_user_pages(FOLL_LONGTERM) case we make sure to
> migrate
> >>>>> these
> >>>>> pages. See mm/gup.c:check_and_migrate_movable_pages() and
> >>> especially
> >>>>> folio_is_longterm_pinnable(). We'd probably have to implement
> >>> something
> >>>>> similar for udmabuf, where we detect such unpinnable pages and
> >>> migrate
> >>>>> them.
> >>>> The pages udmabuf pins are only those associated with Guest (GPU
> >>> driver/virtio-gpu)
> >>>> resources (or buffers allocated and pinned from shmem via drm GEM).
> >>> Some
> >>>> resources are short-lived, and some are long-lived and whenever a
> >>> resource
> >>>> gets destroyed, the pages are unpinned. And, not all resources have
> their
> >>> pages
> >>>> pinned. The resource that is pinned for the longest duration is the FB
> and
> >>> that's
> >>>> because it is updated every ~16ms (assuming 1920x1080@60) by the
> Guest
> >>>> GPU driver. We can certainly pin/unpin the FB after it is accessed on the
> >>> Host
> >>>> as a workaround, but I guess that may not be very efficient given the
> >>> amount
> >>>> of churn it would create.
> >>>>
> >>>> Also, as far as migration or S3/S4 is concerned, my understanding is
> that all
> >>>> the Guest resources are destroyed and recreated again. So, wouldn't
> >>> something
> >>>> similar happen during memory hotunplug?
> >>>>
> >>>>>
> >>>>>
> >>>>> For example, pairing udmabuf with vfio (which pins pages using
> >>>>> pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not
> work
> >>> in
> >>>>> all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> >>>>> will fail to migrate them during FOLL_LONGTERM and consequently fail
> >>>>> pin_user_pages(). As long as udmabuf holds a reference on these
> pages,
> >>>>> that will never succeed.
> >>>> Dma-buf rules (for exporters) indicate that the pages only need to be
> >>> pinned
> >>>> during the map_attachment phase (and until unmap attachment
> happens).
> >>>> In other words, only when the sg_table is created by udmabuf. I guess
> one
> >>>> option would be to not hold any references during UDMABUF_CREATE
> and
> >>>> only grab references to the pages (as and when it gets used) during this
> >>> step.
> >>>> Would this help?
> >>>
> >>> IIUC the refcount is needed, otherwise I don't see what to protect the
> page
> >>> from being freed and even reused elsewhere before map_attachment().
> >>>
> >>> It seems the previous concern on using gup was majorly fork(), if this is
> it:
> >>>
> >>>
> >>> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
> >>>
> >>> Could it also be guarded by just making sure the pages are MAP_SHARED
> >>> when
> >>> creating the udmabuf, if fork() is a requirement of the feature?
> >>>
> >>> I had a feeling that the userspace still needs to always do the right thing
> >>> to make it work, even using pure PFN mappings.
> >>>
> >>> For instance, what if the userapp just punches a hole in the shmem/hugetlbfs
> >>> file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but
> >>> at least not F_SEAL_WRITE with current impl), and faults a new page into the
> >>> page cache?
> >> IIUC, Qemu creates and owns the memfd that is associated with Guest
> memory.
> >> And if it punches a hole in its own memfd that goes through Guest pinned
> pages
> >> associated with buffers/resources, then I think the proper way to fix this is
> to
> >> somehow notify the Guest driver (virtio-gpu in this case) that the backing
> store
> >> of the affected resources is no longer valid and that the resources need to
> be
> >> destroyed and re-created again.
> >>
> >> Having said that, one option I could think of is to probably install a
> mmu_notifier
> >> associated with the relevant pages in udmabuf and once we get notified
> about
> >> any invalidation event concerning any of the pages, we'd fail any
> subsequent
> >> attempt to access these pages and propagate the error across the stack.
> >
> > Sounds right, maybe it needs to go back to the old GUP solution, though, as
> > mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
> > it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also solve
> > the movable pages issue on pinning.
> 
> It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we
> cannot achieve that without breaking the existing kernel interface ...
Yeah, as you suggest, we unfortunately cannot go back to using GUP
without breaking the udmabuf_create UAPI, which expects memfds and file
offsets.
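For reference, the UAPI is built around (memfd, offset, size) tuples --
roughly the following, paraphrased from include/uapi/linux/udmabuf.h:

  struct udmabuf_create {
          __u32 memfd;   /* the memfd backing the pages */
          __u32 flags;   /* UDMABUF_FLAGS_CLOEXEC */
          __u64 offset;  /* page-aligned offset into the memfd */
          __u64 size;    /* page-aligned length */
  };

  #define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)

so there is simply no userspace address anywhere for GUP to operate on.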

> 
> So we might have to implement the same page migration as gup does on
> FOLL_LONGTERM here ... maybe there are more such cases/drivers that
> actually require that handling when simply taking pages out of the
> memfd, believing they can hold on to them forever.
IIUC, I don't think just handling the page migration in udmabuf is going to
cut it. It might require active cooperation of the Guest GPU driver as well
if this is even feasible.

> 
> >
> >>
> >> However, it feels like udmabuf is not the right place to handle this issue
> because
> >> there are very limited options for taking proper corrective action if Qemu
> decides
> >> to punch a hole in Guest memory that takes out pages that are pinned.
> >
> > I'm not familiar with the use case that much, but IMHO it's fine if the
> > driver relies on proper behaviors of userapp to work.
> >
> > IIUC the worst case here is the udmabuf holds some pages that are not the
> > pages of the guest mem anymore, but it only happens on misbehaved
> > userspace, then it looks all fine as long as they can at least be properly
> > released when releasing the udmabuf.  It'll be better if the udmabuf can
> > fail hard when detecting this, but IMHO even that can be optional
> depending
> > on the need, while any corrective action will be even one step further.
> 
> For vfio the issue is, e.g., VM reboots, not a misbehaving guest. If the
> old guest kernel inflated the balloon and we reboot the VM, VFIO will
> still reference the old pages and the new (unaware) kernel might use
> that previously inflated memory to communicate with the devices. Because
> of that, we disable balloon inflation when vfio is active in QEMU.
> 
> [there are more cases like unloading the balloon driver or deflating the
> balloon, and then using that memory for communicating with the device]
> 
> Maybe it's all fine with udmabuf because of the way it is setup/torn
> down by the guest driver. Unfortunately I can't tell.
Here are the functions used by virtio-gpu (Guest GPU driver) to allocate
pages for its resources:
__drm_gem_shmem_create: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L97
Interestingly, the comment in the above function says that the pages
should not be allocated from the MOVABLE zone.
The pages along with their dma addresses are then extracted and shared
with Qemu using these two functions:
drm_gem_get_pages: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#L534
virtio_gpu_object_shmem_init: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu_object.c#L135
Qemu then translates the dma addresses into file offsets and creates
udmabufs -- but only if blob is set to true, as an optimization to avoid
data copies.
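Roughly, the Qemu-side flow looks like this (simplified sketch only; error
handling and the actual dma-address-to-offset translation are omitted, and
ram_memfd/offsets/sizes here stand in for what the real code computes):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/udmabuf.h>

  static int create_blob_dmabuf(int ram_memfd, const uint64_t *offsets,
                                const uint64_t *sizes, uint32_t count)
  {
          int devfd = open("/dev/udmabuf", O_RDWR | O_CLOEXEC);
          struct udmabuf_create_list *list;
          int dmabuf_fd;
          uint32_t i;

          list = calloc(1, sizeof(*list) + count * sizeof(list->list[0]));
          list->flags = UDMABUF_FLAGS_CLOEXEC;
          list->count = count;
          for (i = 0; i < count; i++) {
                  /* one entry per contiguous chunk of the resource */
                  list->list[i].memfd  = ram_memfd;
                  list->list[i].offset = offsets[i]; /* dma addr -> file offset */
                  list->list[i].size   = sizes[i];
          }
          dmabuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list);

          free(list);
          close(devfd);
          return dmabuf_fd; /* dma-buf fd on success, -1 on error */
  }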

I'll need to run some experiments to understand the impact of memfd page
migration (or replacement) on virtio-gpu driver's resources -- and also on the
rest of the Guest graphics stack.

Thanks,
Vivek
> 
> --
> Cheers,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-27  6:37             ` Kasireddy, Vivek
@ 2023-06-27  7:10               ` David Hildenbrand
  2023-06-28  8:04                 ` Kasireddy, Vivek
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2023-06-27  7:10 UTC (permalink / raw)
  To: Kasireddy, Vivek, Peter Xu
  Cc: dri-devel, linux-mm, Mike Kravetz, Gerd Hoffmann, Kim, Dongwon,
	Andrew Morton, James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, Jason Gunthorpe,
	John Hubbard

On 27.06.23 08:37, Kasireddy, Vivek wrote:
> Hi David,
> 

Hi!

sorry for taking a bit longer to reply lately.

[...]

>>> Sounds right, maybe it needs to go back to the old GUP solution, though, as
>>> mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
>>> it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also solve
>>> the movable pages issue on pinning.
>>
>> It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we
>> cannot achieve that without breaking the existing kernel interface ...
> Yeah, as you suggest, we unfortunately cannot go back to using GUP
> without breaking udmabuf_create UAPI that expects memfds and file
> offsets.
> 
>>
>> So we might have to implement the same page migration as gup does on
>> FOLL_LONGTERM here ... maybe there are more such cases/drivers that
>> actually require that handling when simply taking pages out of the
>> memfd, believing they can hold on to them forever.
> IIUC, I don't think just handling the page migration in udmabuf is going to
> cut it. It might require active cooperation of the Guest GPU driver as well
> if this is even feasible.

The idea is that once you extract the page from the memfd and it
resides somewhere bad (MIGRATE_CMA, ZONE_MOVABLE), you trigger page
migration. Essentially what migrate_longterm_unpinnable_pages() does:
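Conceptually, something like the below (a rough sketch only, not the
actual mm/gup.c code -- check_and_migrate_movable_pages() is static there,
so udmabuf would need its own equivalent after the memfd page lookup):

  LIST_HEAD(movable_folio_list);
  unsigned long i;

  for (i = 0; i < nr_pages; i++) {
          struct folio *folio = page_folio(pages[i]);

          if (folio_is_longterm_pinnable(folio))
                  continue;

          /* isolate it and queue it for migration out of
           * ZONE_MOVABLE / MIGRATE_CMA */
          if (folio_isolate_lru(folio))
                  list_add_tail(&folio->lru, &movable_folio_list);
  }

  /* migrate_pages(&movable_folio_list, ...), then look the
   * replacement pages up in the memfd again */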

Why would the guest driver have to be involved? It shouldn't care about
page migration in the hypervisor.

[...]

>> balloon, and then using that memory for communicating with the device]
>>
>> Maybe it's all fine with udmabuf because of the way it is setup/torn
>> down by the guest driver. Unfortunately I can't tell.
> Here are the functions used by virtio-gpu (Guest GPU driver) to allocate
> pages for its resources:
> __drm_gem_shmem_create: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L97
> Interestingly, the comment in the above function says that the pages
> should not be allocated from the MOVABLE zone.

It doesn't add __GFP_MOVABLE, so pages don't end up in
ZONE_MOVABLE/MIGRATE_CMA *in the guest*. But we care about
ZONE_MOVABLE/MIGRATE_CMA *in the host*. (What the guest does is right,
though.)

IOW, what udmabuf does with guest memory on the hypervisor side, not the 
guest driver on the guest side.

> The pages along with their dma addresses are then extracted and shared
> with Qemu using these two functions:
> drm_gem_get_pages: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#L534
> virtio_gpu_object_shmem_init: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu_object.c#L135

^ so these two target the guest driver as well, right? IOW, there is a 
memfd (shmem) in the guest that the guest driver uses to allocate pages 
from and there is the memfd in the hypervisor to back guest RAM.

The latter gets registered with udmabuf.

> Qemu then translates the dma addresses into file offsets and creates
> udmabufs -- as an optimization to avoid data copies only if blob is set
> to true.

If the guest OS doesn't end up freeing/reallocating that memory while 
it's registered with udmabuf in the hypervisor, then we should be fine.

Because that way, the guest won't end up triggering MADV_DONTNEED by
"accident".

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-26 19:04               ` Peter Xu
@ 2023-06-27 15:52                 ` Jason Gunthorpe
  2023-06-27 16:00                   ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2023-06-27 15:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Kasireddy, Vivek, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Mon, Jun 26, 2023 at 03:04:21PM -0400, Peter Xu wrote:
> On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> > 
> > > So we might have to implement the same page migration as gup does on
> > > FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> > > require that handling when simply taking pages out of the memfd, believing
> > > they can hold on to them forever.
> > 
> > In general I would like to see an interface to FOLL_LONGTERM pin pages
> > from a memfd. I would quite happily use that in iommufd as well.
> > 
> > It solves some problems we have there with fork/exec/etc if the pages
> > are not linked to a mm_struct.
> 
> Afaiu any fd based approach should mean it'll never work with private
> memories, while mm-based should be able to work on any kind.  

Is there a significant use case to open a memfd and then use
MAP_PRIVATE? Why would anyone want to do that instead of just using
normal mmap anonymous memory?

Jason 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-27 15:52                 ` Jason Gunthorpe
@ 2023-06-27 16:00                   ` Peter Xu
  2023-06-27 16:04                     ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2023-06-27 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Kasireddy, Vivek, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Tue, Jun 27, 2023 at 12:52:34PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 26, 2023 at 03:04:21PM -0400, Peter Xu wrote:
> > On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> > > 
> > > > So we might have to implement the same page migration as gup does on
> > > > FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> > > > require that handling when simply taking pages out of the memfd, believing
> > > > they can hold on to them forever.
> > > 
> > > In general I would like to see an interface to FOLL_LONGTERM pin pages
> > > from a memfd. I would quite happily use that in iommufd as well.
> > > 
> > > It solves some problems we have there with fork/exec/etc if the pages
> > > are not linked to a mm_struct.
> > 
> > Afaiu any fd based approach should mean it'll never work with private
> > memories, while mm-based should be able to work on any kind.  
> 
> Is there a significant use case to open a memfd and then use
> MAP_PRIVATE? Why would anyone want to do that instead of just using
> normal mmap anonymous memory?

I remember David Hildenbrand somewhere mentioned the use case where one
wants to snapshot a VM's RAM into a file, then start multiple instances by
loading that VM RAM with MAP_PRIVATE, so it clones a bunch of snapshotted
VMs running with a single RAM file shared as a template.  Not a generic use
case, I guess.
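(Each cloned instance would then do something like the below -- purely
illustrative; the path and ram_size are made up:

  #include <fcntl.h>
  #include <sys/mman.h>

  int fd = open("/var/lib/vm/ram-template.img", O_RDONLY);
  void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_NORESERVE, fd, 0);
  /* writes hit private CoW copies; the shared template stays pristine */

so all instances share the unmodified template pages until they write.)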

My question applies not only memfd but also in general - qemu by default
doesn't use memfd afaict, so it boils down to e.g. whether you'll target
the iommufd project to work in that case, where qemu uses anonymous memory.
Privately mapped file memory is only one of those kinds.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-27 16:00                   ` Peter Xu
@ 2023-06-27 16:04                     ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2023-06-27 16:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Kasireddy, Vivek, dri-devel, linux-mm,
	Mike Kravetz, Gerd Hoffmann, Kim, Dongwon, Andrew Morton,
	James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, John Hubbard

On Tue, Jun 27, 2023 at 12:00:38PM -0400, Peter Xu wrote:
> On Tue, Jun 27, 2023 at 12:52:34PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 26, 2023 at 03:04:21PM -0400, Peter Xu wrote:
> > > On Mon, Jun 26, 2023 at 03:18:48PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Jun 26, 2023 at 08:14:27PM +0200, David Hildenbrand wrote:
> > > > 
> > > > > So we might have to implement the same page migration as gup does on
> > > > > FOLL_LONGTERM here ... maybe there are more such cases/drivers that actually
> > > > > require that handling when simply taking pages out of the memfd, believing
> > > > > they can hold on to them forever.
> > > > 
> > > > In general I would like to see an interface to FOLL_LONGTERM pin pages
> > > > from a memfd. I would quite happily use that in iommufd as well.
> > > > 
> > > > It solves some problems we have there with fork/exec/etc if the pages
> > > > are not linked to a mm_struct.
> > > 
> > > Afaiu any fd based approach should mean it'll never work with private
> > > memories, while mm-based should be able to work on any kind.  
> > 
> > Is there a significant use case to open a memfd and then use
> > MAP_PRIVATE? Why would anyone want to do that instead of just using
> > normal mmap anonymous memory?
> 
> I remember David Hildenbrand somewhere mentioned the use case where one
> wants to snapshot a VM's RAM into a file, then start multiple instances by
> loading that VM RAM with MAP_PRIVATE, so it clones a bunch of snapshotted
> VMs running with a single RAM file shared as a template.  Not a generic use
> case, I guess.

A file I can see, but a file is not a memfd, we are talking
specifically about memfd, aren't we?

> My question applies not only memfd but also in general - qemu by default
> doesn't use memfd afaict, so it boils down to e.g. whether you'll target
> the iommufd project to work in that case, where qemu uses anonymous
> memory.

I think this may change; as I understand it, the approach for
confidential compute is to put the guest memory in a memfd...

> Privately mapped file memory is only one of those kinds.

I think memfd and related shmem-like objects are a reasonable target. We
already know we should not FOLL_LONGTERM pin file backed pages.

Jason


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-27  7:10               ` David Hildenbrand
@ 2023-06-28  8:04                 ` Kasireddy, Vivek
  0 siblings, 0 replies; 23+ messages in thread
From: Kasireddy, Vivek @ 2023-06-28  8:04 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: dri-devel, linux-mm, Mike Kravetz, Gerd Hoffmann, Kim, Dongwon,
	Andrew Morton, James Houghton, Jerome Marchand, Chang, Junxiao,
	Kirill A . Shutemov, Hocko, Michal, Muchun Song, Jason Gunthorpe,
	John Hubbard

Hi David,

> 
> On 27.06.23 08:37, Kasireddy, Vivek wrote:
> > Hi David,
> >
> 
> Hi!
> 
> sorry for taking a bit longer to reply lately.
No problem.

> 
> [...]
> 
> >>> Sounds right, maybe it needs to go back to the old GUP solution, though,
> as
> >>> mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
> >>> it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also
> solve
> >>> the movable pages issue on pinning.
> >>
> >> It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we
> >> cannot achieve that without breaking the existing kernel interface ...
> > Yeah, as you suggest, we unfortunately cannot go back to using GUP
> > without breaking udmabuf_create UAPI that expects memfds and file
> > offsets.
> >
> >>
> >> So we might have to implement the same page migration as gup does on
> >> FOLL_LONGTERM here ... maybe there are more such cases/drivers that
> >> actually require that handling when simply taking pages out of the
> >> memfd, believing they can hold on to them forever.
> > IIUC, I don't think just handling the page migration in udmabuf is going to
> > cut it. It might require active cooperation of the Guest GPU driver as well
> > if this is even feasible.
> 
> The idea is, that once you extract the page from the memfd and it
> resides somewhere bad (MIGRATE_CMA, ZONE_MOVABLE), you trigger page
> migration. Essentially what migrate_longterm_unpinnable_pages() does:
So, IIUC, it looks like calling check_and_migrate_movable_pages() at the time
of creation (udmabuf_create) and when we get notified about something like
FALLOC_FL_PUNCH_HOLE will be all that needs to be done in udmabuf?

> 
> Why would the guest driver have to be involved? It shouldn't care about
> page migration in the hypervisor.
Yeah, it appears that the page migration would be transparent to the Guest
driver.

> 
> [...]
> 
> >> balloon, and then using that memory for communicating with the device]
> >>
> >> Maybe it's all fine with udmabuf because of the way it is setup/torn
> >> down by the guest driver. Unfortunately I can't tell.
> > Here are the functions used by virtio-gpu (Guest GPU driver) to allocate
> > pages for its resources:
> > __drm_gem_shmem_create: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L97
> > Interestingly, the comment in the above function says that the pages
> > should not be allocated from the MOVABLE zone.
> 
> It doesn't add GFP_MOVABLE, so pages don't end up in
> ZONE_MOVABLE/MIGRATE_CMA *in the guest*. But we care about the
> ZONE_MOVABLE /MIGRATE_CMA *in the host*. (what the guest does is
> right,
> though)
> 
> IOW, what udmabuf does with guest memory on the hypervisor side, not the
> guest driver on the guest side.
Ok, got it.

> 
> > The pages along with their dma addresses are then extracted and shared
> > with Qemu using these two functions:
> > drm_gem_get_pages: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#L534
> > virtio_gpu_object_shmem_init: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu_object.c#L135
> 
> ^ so these two target the guest driver as well, right? IOW, there is a
> memfd (shmem) in the guest that the guest driver uses to allocate pages
> from and there is the memfd in the hypervisor to back guest RAM.
> 
> The latter gets registered with udmabuf.
Yes, that's exactly what happens.

> 
> > Qemu then translates the dma addresses into file offsets and creates
> > udmabufs -- as an optimization to avoid data copies only if blob is set
> > to true.
> 
> If the guest OS doesn't end up freeing/reallocating that memory while
> it's registered with udmabuf in the hypervisor, then we should be fine.
IIUC, udmabuf does get notified when something like that happens.

Thanks,
Vivek

> 
> Because that way, the guest won't end up triggering MADV_DONTNEED by
> "accident".
> 
> --
> Cheers,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
  2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
  2023-06-22 21:33   ` Mike Kravetz
  2023-06-23  6:13   ` Kasireddy, Vivek
@ 2023-08-08 16:17   ` Daniel Vetter
  2 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2023-08-08 16:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vivek Kasireddy, dri-devel, linux-mm, James Houghton,
	Jerome Marchand, Dongwon Kim, Junxiao Chang, Muchun Song,
	Michal Hocko, Gerd Hoffmann, Jason Gunthorpe, John Hubbard,
	Andrew Morton, Kirill A . Shutemov, Mike Kravetz

On Thu, Jun 22, 2023 at 10:25:17AM +0200, David Hildenbrand wrote:
> On 22.06.23 09:27, Vivek Kasireddy wrote:
> > The first patch ensures that the mappings needed for handling mmap
> > operation would be managed by using the pfn instead of struct page.
> > The second patch restores support for mapping hugetlb pages where
> > subpages of a hugepage are not directly used anymore (main reason
> > for revert) and instead the hugetlb pages and the relevant offsets
> > are used to populate the scatterlist for dma-buf export and for
> > mmap operation.
> > 
> > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
> > were passed to the Host kernel and Qemu was launched with these
> > relevant options: qemu-system-x86_64 -m 4096m....
> > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > -display gtk,gl=on
> > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > -machine memory-backend=mem1
> > 
> > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > exercise the mmap handler.
> > 
> 
> While I think the VM_PFNMAP approach is much better and should fix that
> issue at hand, I thought more about missing memlock support and realized
> that we might have to fix something else. SO I'm going to raise the issue
> here.
> 
> I think udmabuf chose the wrong interface to do what it's doing, that makes
> it harder to fix it eventually.
> 
> Instead of accepting a range in a memfd, it should just have accepted a user
> space address range and then used pin_user_pages(FOLL_WRITE|FOLL_LONGTERM)
> to longterm-pin the pages "officially".
> 
> So what's the issue? Udma effectively pins pages longterm ("possibly
> forever") simply by grabbing a reference on them. These pages might easily
> reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> 
> So what udmabuf does is break memory hotunplug and CMA, because it turns
> pages that have to remain movable unmovable.
> 
> In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these
> pages. See mm/gup.c:check_and_migrate_movable_pages() and especially
> folio_is_longterm_pinnable(). We'd probably have to implement something
> similar for udmabuf, where we detect such unpinnable pages and migrate them.
> 
> 
> For example, pairing udmabuf with vfio (which pins pages using
> pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in all
> cases: if udmabuf longterm pinned the pages "the wrong way", vfio will fail
> to migrate them during FOLL_LONGTERM and consequently fail pin_user_pages().
> As long as udmabuf holds a reference on these pages, that will never
> succeed.

Uh, this is no good and I totally missed this, because the very first
version of udmabuf used pin_user_pages(FOLL_LONGTERM). I think what we
need here as a first fix is a shmem_pin_mapping_page_longterm that does
the equivalent of pin_user_pages(FOLL_LONGTERM), and to use it in udmabuf.
From a quick look, the folio conversions that already landed should help
there.
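Concretely, something with roughly this shape (name and signature
illustrative only -- no such helper exists today):

  /*
   * Hypothetical: like shmem_read_mapping_page(), but takes a
   * FOLL_LONGTERM-style pin on the page and first migrates it out of
   * ZONE_MOVABLE / MIGRATE_CMA if it is not longterm-pinnable.
   */
  struct page *shmem_pin_mapping_page_longterm(struct address_space *mapping,
                                               pgoff_t index);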

It might also be good if we convert all the GPU driver users of
shmem_read_mapping_page over to that new shmem_pin_mapping_page_longterm,
just for safety. GPU drivers use a private shmem file and adjust the gfp
mask to clear __GFP_MOVABLE, so the biggest issues shouldn't be possible.
But pin(LONGTERM), compared to just getting a page ref, has gained quite a
few other differences in the past years, and it would be good to be
consistent, I think.

Anything other than longterm pins won't work for udmabuf, because the
locking between struct page/gup.c/mmu_notifier and dma_buf is rather
fundamentally (and by design, due to GPU driver requirements) incompatible
with dma_buf locking rules.
 
> There are *probably* more issues on the QEMU side when udmabuf is paired
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for virtio-balloon,
> virtio-mem, postcopy live migration, ... for example, in the vfio/vdpa case
> we make sure that we disallow most of these, because otherwise there can be
> an accidental "disconnect" between the pages mapped into the VM (guest view)
> and the pages mapped into the IOMMU (device view), for example, after a
> reboot.

I think once we have the proper longterm pinning for udmabuf we need to
look into what coherency issues are left, and how best to fix them.
udmabuf already requires that the memfd is size-sealed to avoid some
issues; we might need to require more. Or, on the other side, perhaps
reject or quietly ignore some of the hole punching for longterm-pinned
pages, to maintain coherency.
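For reference, the create path already checks the seals along these lines
(paraphrased from drivers/dma-buf/udmabuf.c, not verbatim):

  seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
  if (seals == -EINVAL)
          goto err;               /* not a sealable memfd */
  if (!(seals & F_SEAL_SHRINK) || (seals & F_SEAL_WRITE))
          goto err;               /* must not shrink, must stay writable */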

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-08-08 16:17 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22 22:10   ` kernel test robot
2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
2023-06-22 21:33   ` Mike Kravetz
2023-06-23  6:13   ` Kasireddy, Vivek
2023-06-23 16:35     ` Peter Xu
2023-06-23 16:37       ` Jason Gunthorpe
2023-06-23 17:28         ` Peter Xu
2023-06-26 12:57           ` Jason Gunthorpe
2023-06-26  7:45       ` Kasireddy, Vivek
2023-06-26 17:52         ` Peter Xu
2023-06-26 18:14           ` David Hildenbrand
2023-06-26 18:18             ` Jason Gunthorpe
2023-06-26 19:04               ` Peter Xu
2023-06-27 15:52                 ` Jason Gunthorpe
2023-06-27 16:00                   ` Peter Xu
2023-06-27 16:04                     ` Jason Gunthorpe
2023-06-27  6:37             ` Kasireddy, Vivek
2023-06-27  7:10               ` David Hildenbrand
2023-06-28  8:04                 ` Kasireddy, Vivek
2023-08-08 16:17   ` Daniel Vetter
