From: Peter Xu <peterx@redhat.com>
To: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"Kim, Dongwon" <dongwon.kim@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	James Houghton <jthoughton@google.com>,
	Jerome Marchand <jmarchan@redhat.com>,
	"Chang, Junxiao" <junxiao.chang@intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Hocko, Michal" <mhocko@suse.com>,
	Muchun Song <muchun.song@linux.dev>,
	Jason Gunthorpe <jgg@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
Date: Fri, 23 Jun 2023 12:35:45 -0400	[thread overview]
Message-ID: <ZJXJ4ULRKVN5JPyv@x1n> (raw)
In-Reply-To: <IA0PR11MB718510999F2C0E2B70914E2AF823A@IA0PR11MB7185.namprd11.prod.outlook.com>

On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> Hi David,
> 
> > > The first patch ensures that the mappings needed for handling the
> > > mmap operation are managed using PFNs instead of struct pages. The
> > > second patch restores support for mapping hugetlb pages: subpages of
> > > a hugetlb page are no longer used directly (the main reason for the
> > > revert); instead, the hugetlb pages and the relevant offsets are
> > > used to populate the scatterlist for dma-buf export and for the
> > > mmap operation.
> > >
> > > Testcase: the options default_hugepagesz=2M hugepagesz=2M
> > > hugepages=2500 were passed to the Host kernel, and QEMU was launched
> > > with these relevant options:
> > >
> > >   qemu-system-x86_64 -m 4096m....
> > >   -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > >   -display gtk,gl=on
> > >   -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > >   -machine memory-backend=mem1
> > >
> > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > exercise the mmap handler.
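(For reference, the VM_PFNMAP approach in patch 1 amounts to roughly the
following; this is a simplified sketch based on the description above, the
actual patch may differ, and the struct udmabuf fields are assumed to match
drivers/dma-buf/udmabuf.c:)

  /* Sketch: insert raw PFNs in the fault handler instead of handing
   * out struct pages, and mark the mapping VM_PFNMAP. */
  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
  {
          struct vm_area_struct *vma = vmf->vma;
          struct udmabuf *ubuf = vma->vm_private_data;

          if (vmf->pgoff >= ubuf->pagecount)
                  return VM_FAULT_SIGBUS;

          return vmf_insert_pfn(vma, vmf->address,
                                page_to_pfn(ubuf->pages[vmf->pgoff]));
  }

  static const struct vm_operations_struct udmabuf_vm_ops = {
          .fault = udmabuf_vm_fault,
  };

  static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
  {
          vma->vm_private_data = buf->priv;
          vma->vm_ops = &udmabuf_vm_ops;
          vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
          return 0;
  }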
> > >
> > 
> > While I think the VM_PFNMAP approach is much better and should fix
> > the issue at hand, I thought more about the missing memlock support
> > and realized that we might have to fix something else. So I'm going
> > to raise the issue here.
> > 
> > I think udmabuf chose the wrong interface for what it's doing, which
> > makes it harder to fix eventually.
> > 
> > Instead of accepting a range in a memfd, it should just have accepted a
> > user space address range and then used
> > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> > "officially".
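(For illustration, a minimal sketch of the interface David describes; the
udmabuf_pin_range() helper is hypothetical, while pin_user_pages_fast(),
unpin_user_pages() and the FOLL_* flags are the real mm APIs, modulo
kernel-version details:)

  #include <linux/mm.h>

  /* Hypothetical helper: longterm-pin a userspace range for udmabuf. */
  static int udmabuf_pin_range(unsigned long start, int npages,
                               struct page **pages)
  {
          int pinned;

          /*
           * FOLL_LONGTERM tells the core mm that the pin may last
           * "forever", so pages in ZONE_MOVABLE or CMA pageblocks are
           * migrated out before being pinned.
           */
          pinned = pin_user_pages_fast(start, npages,
                                       FOLL_WRITE | FOLL_LONGTERM, pages);
          if (pinned < 0)
                  return pinned;
          if (pinned != npages) {
                  unpin_user_pages(pages, pinned);
                  return -EFAULT;
          }
          return 0;
  }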
> Udmabuf indeed started off by using a user space address range and GUP,
> but the dma-buf subsystem maintainer had concerns with that approach in
> v2. It also had support for mlock in that version. Here is v2, with the
> relevant conversation:
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> 
> > 
> > So what's the issue? udmabuf effectively pins pages longterm
> > ("possibly forever") simply by grabbing a reference on them. These
> > pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA
> > pageblocks.
> > 
> > So udmabuf breaks memory hotunplug and CMA, because it turns pages
> > that have to remain movable into unmovable ones.
> > 
> > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > these pages. See mm/gup.c:check_and_migrate_movable_pages() and
> > especially folio_is_longterm_pinnable(). We'd probably have to
> > implement something similar for udmabuf, where we detect such
> > unpinnable pages and migrate them.
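(Roughly, the check David points at looks like this; a simplified
paraphrase of include/linux/mm.h:folio_is_longterm_pinnable(), not the
verbatim kernel code, and the details vary across kernel versions:)

  /* Can this folio be longterm-pinned without breaking anything? */
  static inline bool folio_is_longterm_pinnable(struct folio *folio)
  {
          /* CMA pageblocks are reserved for contiguous allocations. */
          if (folio_migratetype(folio) == MIGRATE_CMA)
                  return false;

          /* Pages in ZONE_MOVABLE must remain movable: don't pin. */
          if (folio_is_zone_movable(folio))
                  return false;

          return true;
  }

A longterm pinner is expected to migrate such folios to regular pageblocks
first and only then take the pin; udmabuf currently skips both steps.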
> The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu)
> resources (or buffers allocated and pinned from shmem via drm GEM). Some
> resources are short-lived, and some are long-lived and whenever a resource
> gets destroyed, the pages are unpinned. And, not all resources have their pages
> pinned. The resource that is pinned for the longest duration is the FB and that's
> because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host
> as a workaround, but I guess that may not be very efficient given the amount
> of churn it would create.
> 
> Also, as far as migration or S3/S4 is concerned, my understanding is
> that all the Guest resources are destroyed and recreated. So, wouldn't
> something similar happen during memory hotunplug?
> 
> > 
> > 
> > For example, pairing udmabuf with vfio (which pins pages using
> > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> > all cases: if udmabuf longterm-pinned the pages "the wrong way", vfio
> > will fail to migrate them during its FOLL_LONGTERM pin and
> > consequently pin_user_pages() will fail. As long as udmabuf holds a
> > reference on these pages, that will never succeed.
> Dma-buf rules (for exporters) indicate that the pages only need to be
> pinned during the map_attachment phase (and until unmap_attachment
> happens). In other words, only when the sg_table is created by udmabuf.
> I guess one option would be to not hold any references during
> UDMABUF_CREATE and only grab references to the pages (as and when they
> get used) during this step. Would this help?
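(For concreteness, a rough sketch of that option; the struct udmabuf
mapping/pgoff fields are assumptions, shmem_read_mapping_page() and
sg_alloc_table_from_pages() are the real APIs, and the dma_map_sgtable()
step is elided:)

  /* Sketch: take page references only while an attachment is mapped. */
  static struct sg_table *udmabuf_map(struct dma_buf_attachment *at,
                                      enum dma_data_direction dir)
  {
          struct udmabuf *ubuf = at->dmabuf->priv;
          struct sg_table *sgt;
          pgoff_t pg;
          int ret;

          /* Grab references now, not at UDMABUF_CREATE time. */
          for (pg = 0; pg < ubuf->pagecount; pg++) {
                  struct page *page =
                          shmem_read_mapping_page(ubuf->mapping,
                                                  ubuf->pgoff + pg);
                  if (IS_ERR(page)) {
                          while (pg--)
                                  put_page(ubuf->pages[pg]);
                          return ERR_CAST(page);
                  }
                  ubuf->pages[pg] = page;
          }

          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
          if (!sgt) {
                  ret = -ENOMEM;
                  goto err_put;
          }
          ret = sg_alloc_table_from_pages(sgt, ubuf->pages, ubuf->pagecount,
                                          0, ubuf->pagecount << PAGE_SHIFT,
                                          GFP_KERNEL);
          if (ret) {
                  kfree(sgt);
                  goto err_put;
          }
          return sgt;     /* references dropped again in unmap */

  err_put:
          for (pg = 0; pg < ubuf->pagecount; pg++)
                  put_page(ubuf->pages[pg]);
          return ERR_PTR(ret);
  }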

IIUC the refcount is needed; otherwise I don't see what protects the page
from being freed, or even reused elsewhere, before map_attachment().

It seems the previous concern with using GUP was mainly fork(); if this is
it:

https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Could that also be guarded against by just making sure the mapping is
MAP_SHARED when creating the udmabuf, if fork() is a requirement of the
feature?
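(A hypothetical guard for such a GUP-based create path could look like
this; only the single-VMA case is handled, and the helper name is made up:)

  /* Refuse private (COW-able) mappings so fork() cannot leave the
   * longterm pin pointing at the wrong page. */
  static int udmabuf_check_vma(struct mm_struct *mm, unsigned long start,
                               unsigned long len)
  {
          struct vm_area_struct *vma;
          int ret = 0;

          mmap_read_lock(mm);
          vma = find_vma(mm, start);
          if (!vma || start < vma->vm_start || start + len > vma->vm_end ||
              !(vma->vm_flags & VM_SHARED))
                  ret = -EINVAL;
          mmap_read_unlock(mm);
          return ret;
  }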

I have a feeling that userspace still always needs to do the right thing
to make this work, even with pure PFN mappings.

For instance, what if the user app just punches a hole in the
shmem/hugetlbfs file after creating the udmabuf (I see that F_SEAL_SHRINK
is required, but at least not F_SEAL_WRITE in the current implementation),
and faults a new page into the page cache?
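(A userspace sketch of that sequence; error checking is elided, and the
scenario assumes the current udmabuf sealing requirements:)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <linux/udmabuf.h>

  int main(void)
  {
          int memfd = memfd_create("buf", MFD_ALLOW_SEALING);
          ftruncate(memfd, 2 << 20);
          /* udmabuf requires this seal... */
          fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

          struct udmabuf_create create = {
                  .memfd = memfd, .offset = 0, .size = 2 << 20,
          };
          int devfd = open("/dev/udmabuf", O_RDWR);
          int buffd = ioctl(devfd, UDMABUF_CREATE, &create);

          /* ...but nothing seals against this: the pages udmabuf
           * captured are dropped from the file. */
          fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    0, 2 << 20);

          /* A write now faults a *new* page into the page cache,
           * disconnected from what the dma-buf importer sees. */
          char *p = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, memfd, 0);
          p[0] = 1;

          close(buffd);
          return 0;
  }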

Thanks,

> 
> > 
> > 
> > There are *probably* more issues on the QEMU side when udmabuf is
> > paired with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example,
> > in the vfio/vdpa case we make sure that we disallow most of these,
> > because otherwise there can be an accidental "disconnect" between the
> > pages mapped into the VM (guest view) and the pages mapped into the
> > IOMMU (device view), for example, after a reboot.
> Ok; I am not sure I can find an acceptable way to address these issues,
> but given the current constraints associated with udmabuf, what do you
> suggest is the most reasonable way to deal with the problems you have
> identified?
> 
> Thanks,
> Vivek
> 
> > 
> > --
> > Cheers,
> > 
> > David / dhildenb
> 

-- 
Peter Xu



Thread overview: 23+ messages
2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22 22:10   ` kernel test robot
2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
2023-06-22 21:33   ` Mike Kravetz
2023-06-23  6:13   ` Kasireddy, Vivek
2023-06-23 16:35     ` Peter Xu [this message]
2023-06-23 16:37       ` Jason Gunthorpe
2023-06-23 17:28         ` Peter Xu
2023-06-26 12:57           ` Jason Gunthorpe
2023-06-26  7:45       ` Kasireddy, Vivek
2023-06-26 17:52         ` Peter Xu
2023-06-26 18:14           ` David Hildenbrand
2023-06-26 18:18             ` Jason Gunthorpe
2023-06-26 19:04               ` Peter Xu
2023-06-27 15:52                 ` Jason Gunthorpe
2023-06-27 16:00                   ` Peter Xu
2023-06-27 16:04                     ` Jason Gunthorpe
2023-06-27  6:37             ` Kasireddy, Vivek
2023-06-27  7:10               ` David Hildenbrand
2023-06-28  8:04                 ` Kasireddy, Vivek
2023-08-08 16:17   ` Daniel Vetter
