linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* P2PDMA in Userspace RDMA
@ 2024-05-09 17:31 Logan Gunthorpe
  2024-05-09 17:42 ` Jason Gunthorpe
  0 siblings, 1 reply; 3+ messages in thread
From: Logan Gunthorpe @ 2024-05-09 17:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martin Oliveira, Christoph Hellwig, Dan Williams, LKML,
	linux-rdma, linux-mm

Hi Jason,

We've become interested again in enabling P2PDMA transactions with
userspace RDMA and the NVMe CMBs we are already exporting to userspace
from our previous work.

Enabling FOLL_PCI_P2PDMA in ib_umem_get is almost a trivial change, but
there are two issues holding us back.

The biggest issue is that we disallowed FOLL_LONGTERM with
FOLL_PCI_P2PDMA out of concern that P2PDMA had the same problem as
fs-dax. See [1] to review the discussion from 2 years ago. However, in
trying to understand the problem again, I'm not sure that concern was
valid. In P2PDMA, unmap_mapping_range() is strictly only called on
driver unbind when everything is getting torn down[2]. The next thing
that happens immediately after the unmap is the tear down of the pgmap
which drops the elevated reference on the pages and waits for all page's
reference counts to go back to zero. This will effectively wait until
all longterm pins involving the memory have been released. This can
cause a hang on unbind but, in your words, its "annoying not critical".

Even if unmap_mapping_range() was happening outside of teardown, the
P2PDMA code isn't like a regular filesystem in that it fully supports
the elevated reference counts in the pgmap code and pages cannot be
reused until the pin releases its reference counts and the count returns
back to one. So I don't really see how there can be a UAF concern with this.

Unless I'm really missing something, it seems P2PDMA does not have the
same concern as fs-dax and I think it is safe to remove that
restriction. Any objections?

The other issue we hit when enabling this feature is the check for
vma_needs_dirty_tracking() in writable_file_mapping_allowed() during the
gup flow. This hits because the p2pdma code is using the common
sysfs/kernfs infrastructure to create the VMA which installs a
page_mkwrite operator()[4] to change the file update time on write. I
don't think this feature really makes any sense for the P2PDMA sysfs
file which is really operating as an allocator in userspace -- the time
on the file does not really need to reflect the last write of some
process that wrote to memory allocated using it. So I think removing the
operator for P2PDMA makes sense, it's just a matter of creating the
plumbing that allows P2PDMA to indicate this to kernfs. There may be a
certain amount of bikeshedding on how this might be done, but it doesn't
seem terribly complicated.

Thoughts?

Logan

[1] https://lore.kernel.org/all/Yy4Ot5MoOhsgYLTQ@ziepe.ca/T/#u
[2] https://elixir.bootlin.com/linux/v6.8.8/source/drivers/pci/p2pdma.c#L333
[3] https://elixir.bootlin.com/linux/v6.8.8/source/mm/gup.c#L1029
[4] https://elixir.bootlin.com/linux/v6.8.8/source/fs/kernfs/file.c#L435

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: P2PDMA in Userspace RDMA
  2024-05-09 17:31 P2PDMA in Userspace RDMA Logan Gunthorpe
@ 2024-05-09 17:42 ` Jason Gunthorpe
  2024-05-09 20:48   ` Alistair Popple
  0 siblings, 1 reply; 3+ messages in thread
From: Jason Gunthorpe @ 2024-05-09 17:42 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Martin Oliveira, Christoph Hellwig, Dan Williams, LKML,
	linux-rdma, linux-mm

On Thu, May 09, 2024 at 11:31:33AM -0600, Logan Gunthorpe wrote:
> Hi Jason,
> 
> We've become interested again in enabling P2PDMA transactions with
> userspace RDMA and the NVMe CMBs we are already exporting to userspace
> from our previous work.
> 
> Enabling FOLL_PCI_P2PDMA in ib_umem_get is almost a trivial change, but
> there are two issues holding us back.
> 
> The biggest issue is that we disallowed FOLL_LONGTERM with
> FOLL_PCI_P2PDMA out of concern that P2PDMA had the same problem as
> fs-dax.

Yeah, it was not a great outcome of that issue.

> See [1] to review the discussion from 2 years ago. However, in
> trying to understand the problem again, I'm not sure that concern was
> valid. In P2PDMA, unmap_mapping_range() is strictly only called on
> driver unbind when everything is getting torn down[2]. The next thing
> that happens immediately after the unmap is the tear down of the pgmap
> which drops the elevated reference on the pages and waits for all page's
> reference counts to go back to zero. This will effectively wait until
> all longterm pins involving the memory have been released. This can
> cause a hang on unbind but, in your words, its "annoying not critical".

Yes

But you are looking at the code as it is right now, and stuff has been
quitely fixed with the pgmap refcount area since. I think it is
probably good now. IIRC it was pushed over the finish line when the
ZONE_DEVICE/PRIVATE pages were converted to have normal reference
counting.

If p2p is following the new ZONE_DEVICE scheme then it should be fine.

It would be good to read over Alistair's latest series fixing up fsdax
refcounts to see if anything pops out as problematic specifically with
the P2P case.

Otherwise a careful check through is probably all that is needed.

> The other issue we hit when enabling this feature is the check for
> vma_needs_dirty_tracking() in writable_file_mapping_allowed() during the
> gup flow. This hits because the p2pdma code is using the common
> sysfs/kernfs infrastructure to create the VMA which installs a
> page_mkwrite operator()[4] to change the file update time on write. 

Ah.

> I don't think this feature really makes any sense for the P2PDMA
> sysfs file which is really operating as an allocator in userspace --
> the time on the file does not really need to reflect the last write
> of some process that wrote to memory allocated using it. 

Right, you shouldn't have mkwrite for these pages.

Jason

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: P2PDMA in Userspace RDMA
  2024-05-09 17:42 ` Jason Gunthorpe
@ 2024-05-09 20:48   ` Alistair Popple
  0 siblings, 0 replies; 3+ messages in thread
From: Alistair Popple @ 2024-05-09 20:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Martin Oliveira, Christoph Hellwig,
	Dan Williams, LKML, linux-rdma, linux-mm


Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Thu, May 09, 2024 at 11:31:33AM -0600, Logan Gunthorpe wrote:
>> Hi Jason,
>> 
>> We've become interested again in enabling P2PDMA transactions with
>> userspace RDMA and the NVMe CMBs we are already exporting to userspace
>> from our previous work.
>> 
>> Enabling FOLL_PCI_P2PDMA in ib_umem_get is almost a trivial change, but
>> there are two issues holding us back.
>> 
>> The biggest issue is that we disallowed FOLL_LONGTERM with
>> FOLL_PCI_P2PDMA out of concern that P2PDMA had the same problem as
>> fs-dax.
>
> Yeah, it was not a great outcome of that issue.

I recently came across this trying to do something similar with io_uring
to mapped P2PDMA pages for testing purposes. I commented out the
FOLL_PCI_P2PDMA check but it hung/stalled during process teardown so
there might be something there.

>> See [1] to review the discussion from 2 years ago. However, in
>> trying to understand the problem again, I'm not sure that concern was
>> valid. In P2PDMA, unmap_mapping_range() is strictly only called on
>> driver unbind when everything is getting torn down[2]. The next thing
>> that happens immediately after the unmap is the tear down of the pgmap
>> which drops the elevated reference on the pages and waits for all page's
>> reference counts to go back to zero. This will effectively wait until
>> all longterm pins involving the memory have been released. This can
>> cause a hang on unbind but, in your words, its "annoying not critical".
>
> Yes
>
> But you are looking at the code as it is right now, and stuff has been
> quitely fixed with the pgmap refcount area since. I think it is
> probably good now. IIRC it was pushed over the finish line when the
> ZONE_DEVICE/PRIVATE pages were converted to have normal reference
> counting.

But P2P DMA pages are still in the off-by-one reference counting
scheme. My RFR series[1] (aims) to fix that, although the pgmap
refcounting scheme needs a closer look because I think doing the page
refcounting properly allows some cleanups there.

[1] - https://lore.kernel.org/linux-mm/cover.fe275e9819458a4bbb9451b888cafb88af8867d4.1712796818.git-series.apopple@nvidia.com/

> If p2p is following the new ZONE_DEVICE scheme then it should be fine.
>
> It would be good to read over Alistair's latest series fixing up fsdax
> refcounts to see if anything pops out as problematic specifically with
> the P2P case.
>
> Otherwise a careful check through is probably all that is needed.
>
>> The other issue we hit when enabling this feature is the check for
>> vma_needs_dirty_tracking() in writable_file_mapping_allowed() during the
>> gup flow. This hits because the p2pdma code is using the common
>> sysfs/kernfs infrastructure to create the VMA which installs a
>> page_mkwrite operator()[4] to change the file update time on write. 
>
> Ah.
>
>> I don't think this feature really makes any sense for the P2PDMA
>> sysfs file which is really operating as an allocator in userspace --
>> the time on the file does not really need to reflect the last write
>> of some process that wrote to memory allocated using it. 
>
> Right, you shouldn't have mkwrite for these pages.
>
> Jason


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-05-09 20:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-09 17:31 P2PDMA in Userspace RDMA Logan Gunthorpe
2024-05-09 17:42 ` Jason Gunthorpe
2024-05-09 20:48   ` Alistair Popple

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).