linux-block.vger.kernel.org archive mirror
From: Logan Gunthorpe <logang@deltatee.com>
To: Jason Gunthorpe <jgg@ziepe.ca>,
	Alistair Popple <apopple@nvidia.com>,
	Felix Kuehling <Felix.Kuehling@amd.com>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-mm@kvack.org, iommu@lists.linux-foundation.org,
	"Stephen Bates" <sbates@raithlin.com>,
	"Christian König" <christian.koenig@amd.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Don Dutile" <ddutile@redhat.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"Jakowski Andrzej" <andrzej.jakowski@intel.com>,
	"Minturn Dave B" <dave.b.minturn@intel.com>,
	"Jason Ekstrand" <jason@jlekstrand.net>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Xiong Jianxin" <jianxin.xiong@intel.com>,
	"Bjorn Helgaas" <helgaas@kernel.org>,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Robin Murphy" <robin.murphy@arm.com>,
	"Martin Oliveira" <martin.oliveira@eideticom.com>,
	"Chaitanya Kulkarni" <ckulkarnilinux@gmail.com>
Subject: Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
Date: Fri, 1 Oct 2021 11:01:49 -0600
Message-ID: <4fdd337b-fa35-a909-5eee-823bfd1e9dc4@deltatee.com>
In-Reply-To: <20211001134856.GN3544071@ziepe.ca>




On 2021-10-01 7:48 a.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:
> 
>> Why would DAX want to do this in the first place?? This means the
>> address space zap is much more important than just speeding up
>> destruction, it is essential for correctness since the PTEs are not
>> holding refcounts naturally...
> 
> It is not really for this series to fix, but I think the whole thing
> is probably racy once you start allowing pte_special pages to be
> accessed by GUP.
> 
> If we look at unmapping the PTE relative to GUP fast the important
> sequence is how the TLB flushing doesn't decrement the page refcount
> until after it knows any concurrent GUP fast is completed. This is
> arch specific, eg it could be done async through a call_rcu handler.
> 
> This ensures that pages can't cross back into the free pool and be
> reallocated until we know for certain that nobody is walking the PTEs
> and could potentially take an additional reference on it. The scheme
> cannot rely on the page refcount being 0 because once it goes into the
> free pool it could be immediately reallocated back to a non-zero
> refcount.
> 
> A DAX user that simply does an address space invalidation doesn't
> sequence itself with any of this mechanism. So we can race with a
> thread doing GUP fast and another thread re-cycling the page into
> another use - creating a leakage of the page from one security context
> to another.
> 
> This seems to be made worse for the pgmap stuff due to the wonky
> refcount usage - at least if the refcount had dropped to zero gup fast
> would be blocked for a time, but even that doesn't happen.
> 
> In short, I think using pg special for anything that can be returned
> by gup fast (and maybe even gup!) is racy/wrong. We must have the
> normal refcount mechanism work for correctness of the recycling flow.

I'm not quite following all of this. I'm not entirely sure how fs/dax
works in this regard, but for device-dax (and similarly p2pdma) it
doesn't seem as bad as you say.

In device-dax, the refcount is only used to prevent the device, and
therefore the pages, from going away on device unbind. Pages cannot be
recycled in the way you describe, since they are mapped linearly within
the device. The address space invalidation is done only when the device
is unbound.
Before the invalidation, an active flag is cleared to ensure no new
mappings can be created while the unmap is proceeding.
unmap_mapping_range() should sequence itself with the TLB flush and
GUP-fast using the same mechanism it does for regular pages. As far as I
can see, by the time unmap_mapping_range() returns, we should be
confident that there are no pages left in any mapping (since no new
pages could have been added after the flag was cleared). Then, before
finishing the unbind, device-dax decrements the refcount of all pages and
waits for the refcount of all pages to go to zero. Thus, any pages that
were successfully obtained with GUP, during or before
unmap_mapping_range(), should hold a reference, and once all those
references are returned, unbind can finish.
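
To make the ordering above concrete, here is a minimal userspace sketch
of the unbind sequence. It is a single-threaded model, not kernel code:
struct model_dev and every model_*() function are hypothetical names
made up for illustration, the mapped[] array only stands in for PTEs,
and the loop that clears it only stands in for unmap_mapping_range().
It does not model the TLB flush or GUP-fast synchronization at all; it
only shows why a reference taken before the zap holds off the end of
unbind.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4

/* Hypothetical model of a device-dax style device; not a kernel structure. */
struct model_dev {
	bool active;          /* cleared first on unbind to block new mappings */
	int refcount[NPAGES]; /* 1 is the base reference held for the device */
	bool mapped[NPAGES];  /* stands in for a PTE pointing at the page */
};

static void model_init(struct model_dev *d)
{
	d->active = true;
	for (int i = 0; i < NPAGES; i++) {
		d->refcount[i] = 1;
		d->mapped[i] = false;
	}
}

/* Fault path: refuses to create a mapping once the device is inactive. */
static bool model_fault(struct model_dev *d, int i)
{
	if (!d->active)
		return false;
	d->mapped[i] = true;
	return true;
}

/* GUP: can only find a page through an existing mapping, and takes a ref. */
static bool model_gup(struct model_dev *d, int i)
{
	if (!d->mapped[i])
		return false;
	d->refcount[i]++;
	return true;
}

static void model_put(struct model_dev *d, int i)
{
	assert(d->refcount[i] > 0);
	d->refcount[i]--;
}

/*
 * Unbind, steps 1 to 3: clear the flag, zap every mapping (the role played
 * by unmap_mapping_range()), then drop the base reference of each page.
 * Returns how many pages are still pinned by earlier GUP callers.
 */
static int model_unbind_begin(struct model_dev *d)
{
	int busy = 0;

	d->active = false;
	for (int i = 0; i < NPAGES; i++)
		d->mapped[i] = false;
	for (int i = 0; i < NPAGES; i++)
		if (--d->refcount[i] > 0)
			busy++;
	return busy;
}

/* Unbind, step 4: it may finish only once every refcount has reached zero. */
static bool model_unbind_done(const struct model_dev *d)
{
	for (int i = 0; i < NPAGES; i++)
		if (d->refcount[i] != 0)
			return false;
	return true;
}
```

In this model, a page that was mapped and pinned before
model_unbind_begin() keeps model_unbind_done() false until the matching
model_put(), while model_fault() and model_gup() both fail once the
flag is cleared and the mappings are gone.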

P2PDMA follows this pattern, except pages are not mapped linearly and
are returned to the genalloc when their refcount falls to 1. This only
happens after a VMA is closed, which should imply the PTEs have already
been unlinked from the pages. And the same situation occurs on unbind
with a flag preventing new mappings from being created before
unmap_mapping_range(), etc.
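
The "returned when the refcount falls to 1" rule can be sketched the
same way. Again this is a userspace model under stated assumptions:
struct p2p_pool, pool_alloc(), page_get() and page_put() are
hypothetical names, the in_pool[] array only stands in for the
genalloc, and the idle refcount of 1 follows the description above
rather than any particular kernel version.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_PAGES 4

/* Hypothetical stand-in for the genalloc-backed P2PDMA page pool. */
struct p2p_pool {
	bool in_pool[POOL_PAGES];  /* true: owned by the allocator, reusable */
	int refcount[POOL_PAGES];  /* the idle value is 1, not 0 */
};

static void pool_init(struct p2p_pool *p)
{
	for (int i = 0; i < POOL_PAGES; i++) {
		p->in_pool[i] = true;
		p->refcount[i] = 1;
	}
}

/*
 * Hand out the first free page for a new mapping and take the mapping's
 * reference on it. Returns the page index, or -1 if the pool is empty.
 */
static int pool_alloc(struct p2p_pool *p)
{
	for (int i = 0; i < POOL_PAGES; i++) {
		if (p->in_pool[i]) {
			p->in_pool[i] = false;
			p->refcount[i]++;
			return i;
		}
	}
	return -1;
}

/* Extra reference, e.g. one taken by GUP through the mapping. */
static void page_get(struct p2p_pool *p, int i)
{
	p->refcount[i]++;
}

/*
 * Drop one reference. The page goes back to the pool only when the count
 * returns to its idle value of 1, so a page still pinned after the VMA is
 * closed cannot be handed out again.
 */
static void page_put(struct p2p_pool *p, int i)
{
	assert(p->refcount[i] > 1);
	if (--p->refcount[i] == 1)
		p->in_pool[i] = true;
}
```

With this model, closing the VMA while a GUP reference is outstanding
leaves the page out of the pool, and a second pool_alloc() returns a
different page. Only the final page_put() makes the original page
available for reuse.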

Not to say that all this couldn't use a big conceptual cleanup. A
similar question exists with the single find_special_page() user
(xen/gntdev) and it's definitely not clear what the differences are
between the find_special_page() and vmf_insert_mixed() techniques and
when one should be used over the other. Or could they both be merged to
use the same technique?

Logan

