From: Peter Xu <peterx@redhat.com>
To: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"Kim, Dongwon" <dongwon.kim@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	James Houghton <jthoughton@google.com>,
	Jerome Marchand <jmarchan@redhat.com>,
	"Chang, Junxiao" <junxiao.chang@intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Hocko, Michal" <mhocko@suse.com>,
	Muchun Song <muchun.song@linux.dev>,
	Jason Gunthorpe <jgg@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages
Date: Fri, 23 Jun 2023 12:35:45 -0400	[thread overview]
Message-ID: <ZJXJ4ULRKVN5JPyv@x1n> (raw)
In-Reply-To: <IA0PR11MB718510999F2C0E2B70914E2AF823A@IA0PR11MB7185.namprd11.prod.outlook.com>

On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> Hi David,
> 
> > > The first patch ensures that the mappings needed for handling the
> > > mmap operation are managed using PFNs instead of struct pages. The
> > > second patch restores support for mapping hugetlb pages: subpages of
> > > a hugetlb page are no longer used directly (the main reason for the
> > > revert); instead, the hugetlb pages and the relevant offsets are
> > > used to populate the scatterlist for dma-buf export and for the
> > > mmap operation.
> > >
> > > Testcase: the options default_hugepagesz=2M hugepagesz=2M
> > > hugepages=2500 were passed to the Host kernel, and QEMU was launched
> > > with these relevant options:
> > >
> > >   qemu-system-x86_64 -m 4096m....
> > >   -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > >   -display gtk,gl=on
> > >   -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > >   -machine memory-backend=mem1
> > >
> > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > exercise the mmap handler.
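(For reference, the VM_PFNMAP approach in patch 1 amounts to roughly the
following; this is a simplified sketch based on the description above, the
actual patch may differ, and the struct udmabuf fields are assumed to match
drivers/dma-buf/udmabuf.c:)

  /* Sketch: insert raw PFNs in the fault handler instead of handing
   * out struct pages, and mark the mapping VM_PFNMAP. */
  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
  {
          struct vm_area_struct *vma = vmf->vma;
          struct udmabuf *ubuf = vma->vm_private_data;

          if (vmf->pgoff >= ubuf->pagecount)
                  return VM_FAULT_SIGBUS;

          return vmf_insert_pfn(vma, vmf->address,
                                page_to_pfn(ubuf->pages[vmf->pgoff]));
  }

  static const struct vm_operations_struct udmabuf_vm_ops = {
          .fault = udmabuf_vm_fault,
  };

  static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
  {
          vma->vm_private_data = buf->priv;
          vma->vm_ops = &udmabuf_vm_ops;
          vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
          return 0;
  }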
> > >
> > 
> > While I think the VM_PFNMAP approach is much better and should fix
> > the issue at hand, I thought more about the missing memlock support
> > and realized that we might have to fix something else. So I'm going
> > to raise the issue here.
> > 
> > I think udmabuf chose the wrong interface for what it's doing, which
> > makes it harder to fix eventually.
> > 
> > Instead of accepting a range in a memfd, it should just have accepted a
> > user space address range and then used
> > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> > "officially".
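(For illustration, a minimal sketch of the interface David describes; the
udmabuf_pin_range() helper is hypothetical, while pin_user_pages_fast(),
unpin_user_pages() and the FOLL_* flags are the real mm APIs, modulo
kernel-version details:)

  #include <linux/mm.h>

  /* Hypothetical helper: longterm-pin a userspace range for udmabuf. */
  static int udmabuf_pin_range(unsigned long start, int npages,
                               struct page **pages)
  {
          int pinned;

          /*
           * FOLL_LONGTERM tells the core mm that the pin may last
           * "forever", so pages in ZONE_MOVABLE or CMA pageblocks are
           * migrated out before being pinned.
           */
          pinned = pin_user_pages_fast(start, npages,
                                       FOLL_WRITE | FOLL_LONGTERM, pages);
          if (pinned < 0)
                  return pinned;
          if (pinned != npages) {
                  unpin_user_pages(pages, pinned);
                  return -EFAULT;
          }
          return 0;
  }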
> Udmabuf indeed started off by using a user space address range and GUP,
> but the dma-buf subsystem maintainer had concerns with that approach in
> v2. It also had support for mlock in that version. Here is v2, with the
> relevant conversation:
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> 
> > 
> > So what's the issue? udmabuf effectively pins pages longterm
> > ("possibly forever") simply by grabbing a reference on them. These
> > pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA
> > pageblocks.
> > 
> > So udmabuf breaks memory hotunplug and CMA, because it turns pages
> > that have to remain movable into unmovable ones.
> > 
> > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > these pages. See mm/gup.c:check_and_migrate_movable_pages() and
> > especially folio_is_longterm_pinnable(). We'd probably have to
> > implement something similar for udmabuf, where we detect such
> > unpinnable pages and migrate them.
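(Roughly, the check David points at looks like this; a simplified
paraphrase of include/linux/mm.h:folio_is_longterm_pinnable(), not the
verbatim kernel code, and the details vary across kernel versions:)

  /* Can this folio be longterm-pinned without breaking anything? */
  static inline bool folio_is_longterm_pinnable(struct folio *folio)
  {
          /* CMA pageblocks are reserved for contiguous allocations. */
          if (folio_migratetype(folio) == MIGRATE_CMA)
                  return false;

          /* Pages in ZONE_MOVABLE must remain movable: don't pin. */
          if (folio_is_zone_movable(folio))
                  return false;

          return true;
  }

A longterm pinner is expected to migrate such folios to regular pageblocks
first and only then take the pin; udmabuf currently skips both steps.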
> The pages udmabuf pins are only those associated with Guest (GPU driver/virtio-gpu)
> resources (or buffers allocated and pinned from shmem via drm GEM). Some
> resources are short-lived, and some are long-lived and whenever a resource
> gets destroyed, the pages are unpinned. And, not all resources have their pages
> pinned. The resource that is pinned for the longest duration is the FB and that's
> because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host
> as a workaround, but I guess that may not be very efficient given the amount
> of churn it would create.
> 
> Also, as far as migration or S3/S4 is concerned, my understanding is
> that all the Guest resources are destroyed and recreated. So, wouldn't
> something similar happen during memory hotunplug?
> 
> > 
> > 
> > For example, pairing udmabuf with vfio (which pins pages using
> > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> > all cases: if udmabuf longterm-pinned the pages "the wrong way", vfio
> > will fail to migrate them during its FOLL_LONGTERM pin and
> > consequently pin_user_pages() will fail. As long as udmabuf holds a
> > reference on these pages, that will never succeed.
> Dma-buf rules (for exporters) indicate that the pages only need to be
> pinned during the map_attachment phase (and until unmap_attachment
> happens). In other words, only when the sg_table is created by udmabuf.
> I guess one option would be to not hold any references during
> UDMABUF_CREATE and only grab references to the pages (as and when they
> get used) during this step. Would this help?
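(For concreteness, a rough sketch of that option; the struct udmabuf
mapping/pgoff fields are assumptions, shmem_read_mapping_page() and
sg_alloc_table_from_pages() are the real APIs, and the dma_map_sgtable()
step is elided:)

  /* Sketch: take page references only while an attachment is mapped. */
  static struct sg_table *udmabuf_map(struct dma_buf_attachment *at,
                                      enum dma_data_direction dir)
  {
          struct udmabuf *ubuf = at->dmabuf->priv;
          struct sg_table *sgt;
          pgoff_t pg;
          int ret;

          /* Grab references now, not at UDMABUF_CREATE time. */
          for (pg = 0; pg < ubuf->pagecount; pg++) {
                  struct page *page =
                          shmem_read_mapping_page(ubuf->mapping,
                                                  ubuf->pgoff + pg);
                  if (IS_ERR(page)) {
                          while (pg--)
                                  put_page(ubuf->pages[pg]);
                          return ERR_CAST(page);
                  }
                  ubuf->pages[pg] = page;
          }

          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
          if (!sgt) {
                  ret = -ENOMEM;
                  goto err_put;
          }
          ret = sg_alloc_table_from_pages(sgt, ubuf->pages, ubuf->pagecount,
                                          0, ubuf->pagecount << PAGE_SHIFT,
                                          GFP_KERNEL);
          if (ret) {
                  kfree(sgt);
                  goto err_put;
          }
          return sgt;     /* references dropped again in unmap */

  err_put:
          for (pg = 0; pg < ubuf->pagecount; pg++)
                  put_page(ubuf->pages[pg]);
          return ERR_PTR(ret);
  }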

IIUC the refcount is needed; otherwise I don't see what protects the page
from being freed, or even reused elsewhere, before map_attachment().

It seems the previous concern with using GUP was mainly fork(); if this is
it:

https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Could that also be guarded against by just making sure the mapping is
MAP_SHARED when creating the udmabuf, if fork() is a requirement of the
feature?
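(A hypothetical guard for such a GUP-based create path could look like
this; only the single-VMA case is handled, and the helper name is made up:)

  /* Refuse private (COW-able) mappings so fork() cannot leave the
   * longterm pin pointing at the wrong page. */
  static int udmabuf_check_vma(struct mm_struct *mm, unsigned long start,
                               unsigned long len)
  {
          struct vm_area_struct *vma;
          int ret = 0;

          mmap_read_lock(mm);
          vma = find_vma(mm, start);
          if (!vma || start < vma->vm_start || start + len > vma->vm_end ||
              !(vma->vm_flags & VM_SHARED))
                  ret = -EINVAL;
          mmap_read_unlock(mm);
          return ret;
  }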

I have a feeling that userspace still always needs to do the right thing
to make this work, even with pure PFN mappings.

For instance, what if the user app just punches a hole in the
shmem/hugetlbfs file after creating the udmabuf (I see that F_SEAL_SHRINK
is required, but at least not F_SEAL_WRITE in the current implementation),
and faults a new page into the page cache?
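(A userspace sketch of that sequence; error checking is elided, and the
scenario assumes the current udmabuf sealing requirements:)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <linux/udmabuf.h>

  int main(void)
  {
          int memfd = memfd_create("buf", MFD_ALLOW_SEALING);
          ftruncate(memfd, 2 << 20);
          /* udmabuf requires this seal... */
          fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

          struct udmabuf_create create = {
                  .memfd = memfd, .offset = 0, .size = 2 << 20,
          };
          int devfd = open("/dev/udmabuf", O_RDWR);
          int buffd = ioctl(devfd, UDMABUF_CREATE, &create);

          /* ...but nothing seals against this: the pages udmabuf
           * captured are dropped from the file. */
          fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    0, 2 << 20);

          /* A write now faults a *new* page into the page cache,
           * disconnected from what the dma-buf importer sees. */
          char *p = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, memfd, 0);
          p[0] = 1;

          close(buffd);
          return 0;
  }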

Thanks,

> 
> > 
> > 
> > There are *probably* more issues on the QEMU side when udmabuf is
> > paired with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example,
> > in the vfio/vdpa case we make sure that we disallow most of these,
> > because otherwise there can be an accidental "disconnect" between the
> > pages mapped into the VM (guest view) and the pages mapped into the
> > IOMMU (device view), for example, after a reboot.
> Ok; I am not sure I can find an acceptable way to address these issues,
> but given the current constraints associated with udmabuf, what do you
> suggest is the most reasonable way to deal with the problems you have
> identified?
> 
> Thanks,
> Vivek
> 
> > 
> > --
> > Cheers,
> > 
> > David / dhildenb
> 

-- 
Peter Xu



Thread overview: 23+ messages
2023-06-22  7:27 [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 1/2] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-06-22  7:27 ` [PATCH v1 2/2] udmabuf: Add back support for mapping hugetlb pages Vivek Kasireddy
2023-06-22 22:10   ` kernel test robot
2023-06-22  8:25 ` [PATCH v1 0/2] " David Hildenbrand
2023-06-22 21:33   ` Mike Kravetz
2023-06-23  6:13   ` Kasireddy, Vivek
2023-06-23 16:35     ` Peter Xu [this message]
2023-06-23 16:37       ` Jason Gunthorpe
2023-06-23 17:28         ` Peter Xu
2023-06-26 12:57           ` Jason Gunthorpe
2023-06-26  7:45       ` Kasireddy, Vivek
2023-06-26 17:52         ` Peter Xu
2023-06-26 18:14           ` David Hildenbrand
2023-06-26 18:18             ` Jason Gunthorpe
2023-06-26 19:04               ` Peter Xu
2023-06-27 15:52                 ` Jason Gunthorpe
2023-06-27 16:00                   ` Peter Xu
2023-06-27 16:04                     ` Jason Gunthorpe
2023-06-27  6:37             ` Kasireddy, Vivek
2023-06-27  7:10               ` David Hildenbrand
2023-06-28  8:04                 ` Kasireddy, Vivek
2023-08-08 16:17   ` Daniel Vetter
