Re: find_get_page() VS pin_user_pages()

From: Jan Kara <jack@suse.cz>
To: "Teterevkov, Ivan" <Ivan.Teterevkov@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
	"jack@suse.cz" <jack@suse.cz>,
	"rppt@linux.ibm.com" <rppt@linux.ibm.com>,
	"jglisse@redhat.com" <jglisse@redhat.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: find_get_page() VS pin_user_pages()
Date: Wed, 12 Apr 2023 12:41:51 +0200
Message-ID: <20230412104151.hkl5navnaoc7l7ob@quack3>
In-Reply-To: <MW5PR12MB55984F39C8CECADDFE7548F2F09B9@MW5PR12MB5598.namprd12.prod.outlook.com>

On Wed 12-04-23 09:04:33, Teterevkov, Ivan wrote:
> From: Alistair Popple <apopple@nvidia.com> 
> 
> > "Teterevkov, Ivan" <Ivan.Teterevkov@amd.com> writes:
> > 
> > > Hello folks,
> > >
> > > I work with an application which aims to share memory in the userspace and
> > > interact with the NIC DMA. The memory allocation workflow begins in the
> > > userspace, which creates a new file backed by 2MiB hugepages with
> > > memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> > > makes an IOCTL to the kernel module with the file descriptor and size so that
> > > the kernel module can get the struct page with find_get_page(). Then the kernel
> > > module calls dma_map_single(page_address(page)) for NIC, which concludes the
> > > datapath. The allocated memory may (significantly) outlive the originating
> > > userspace application. The hugepages stay mapped with NIC, and the kernel
> > > module wants to continue using them and map to other applications that come and
> > > go with vm_mmap().
> > >
> > > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > > outlined workflow requires it. The hugepages do not page out, but they can move
> > > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > > must increment the page reference counter without mapping and prevent it from
> > > moving. In particular, https://docs.kernel.org/mm/page_migration.html:
> > 
> > I'm not super familiar with the memfd_create()/find_get_page() workflow
> > but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> > to get the struct page initially? Your description above sounds exactly
> > like the use case pin_user_pages() was designed for, because it marks
> > the page as being written to by DMA, makes sure it's not in a movable
> > zone, etc.
> > 
> 
> The biggest obstacle with the application workflow is that the memory
> allocation is mostly kernel-driven. The kernel module may want to tell DMA
> about the hugepages before the userspace application maps it into its address
> space, so the kernel module does not have the starting user address at hand.

I'm a bit confused. Above you write that:

"The memory allocation workflow begins in the userspace, which creates a new
file backed by 2MiB hugepages with memfd_create(MFD_HUGETLB, MFD_HUGE_2MB)
and fallocate(). Then the userspace makes an IOCTL to the kernel module
with the file descriptor and size so that the kernel module can get the
struct page with find_get_page()."

So the memory allocation actually does happen in fallocate(2), as far as I
can tell. What the others are suggesting is that instead of passing the
prepared 'fd' to ioctl(2), your application should mmap(2) the file and pass
the address of the mapped area. That's how things are usually done, and it
also gives userspace more freedom over how it prepares buffers for DMA. With
a user address in hand, pin_user_pages() is then the natural API to use in
the driver.
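
A minimal userspace sketch of the flow suggested above, assuming a
hypothetical driver interface: the ioctl number NIC_IOC_REGISTER_BUF and
struct dma_buf_req are made up for illustration and are not part of any real
driver. The memfd_create(2), fallocate(2) and mmap(2) calls are the ones
already used in this thread; the helper names are also illustrative.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match <linux/memfd.h>. */
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MFD_HUGE_2MB
#define MFD_HUGE_2MB (21U << 26)
#endif

#define HUGE_2MB (2UL * 1024 * 1024)

/* HYPOTHETICAL ioctl argument: user address and length instead of an fd. */
struct dma_buf_req {
	uint64_t addr;
	uint64_t len;
};

/* HYPOTHETICAL ioctl number, for illustration only. */
#define NIC_IOC_REGISTER_BUF _IOW('N', 1, struct dma_buf_req)

/* hugetlb files must be sized in multiples of the huge page size. */
size_t round_up_2mb(size_t len)
{
	return (len + HUGE_2MB - 1) & ~(HUGE_2MB - 1);
}

/*
 * Create a 2MiB-hugepage memfd, allocate it, and map it shared.
 * Returns the mapped address (and the fd via fd_out), or NULL on failure,
 * e.g. when no huge pages are reserved on the system.
 */
void *alloc_dma_buffer(size_t len, int *fd_out)
{
	int fd = memfd_create("dma-buf", MFD_HUGETLB | MFD_HUGE_2MB);
	if (fd < 0)
		return NULL;

	if (fallocate(fd, 0, 0, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	*fd_out = fd;
	return addr;
}

/* Hand the mapped address to the (hypothetical) driver; -1 on error. */
int register_dma_buffer(int dev_fd, void *addr, size_t len)
{
	struct dma_buf_req req = {
		.addr = (uint64_t)(uintptr_t)addr,
		.len = len,
	};

	return ioctl(dev_fd, NIC_IOC_REGISTER_BUF, &req);
}
```

A caller would do something like alloc_dma_buffer(round_up_2mb(size), &fd)
followed by register_dma_buffer(dev_fd, addr, len). Because the driver then
receives a user virtual address rather than a file descriptor, it has what
the pin_user_pages*() family expects as its starting address.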

Now, I'm not sure whether changing the ioctl(2) is still an option for you.
If not, then you have to resort to some kind of workaround, as you
mentioned. Even so, pin_user_pages(FOLL_LONGTERM) is definitely the API
you should be using to tell the kernel that you are going to DMA into these
pages and want to hold onto them for a long time.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR