From: Dan Williams <dan.j.williams@intel.com>
To: Barret Rhoden <brho@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>, X86 ML <x86@kernel.org>,
	KVM list <kvm@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Zeng, Jason" <jason.zeng@intel.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
Date: Thu, 12 Dec 2019 11:48:12 -0800
Message-ID: <CAPcyv4h19dKGpz0XzEHz0nOddnRAefE=rOuhGTHEL6FPhqk8GQ@mail.gmail.com>
In-Reply-To: <b50720a2-5358-19ea-a45e-a0c0628c68b0@google.com>

On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote:
>
> On 12/12/19 12:37 PM, Dan Williams wrote:
> > Yeah, since device-dax is the only path to support longterm page
> > pinning for vfio device assignment, testing with device-dax + 1GB
> > pages would be a useful sanity check.
>
> What are the issues with fs-dax and page pinning?  Is that limitation
> something that is permanent and unfixable (by me or anyone)?

It's a surprisingly painful point of contention...

File-backed DAX pages cannot be truncated while the page is pinned
because the pin may indicate that DMA is ongoing to the file block /
DAX page. When that pin is from RDMA or VFIO that creates a situation
where filesystem operations are blocked indefinitely. More details
here: 94db151dc892 "vfio: disable filesystem-dax page pinning".

Currently, to prevent the deadlock, RDMA, VFIO, and IO_URING memory
registration is blocked if the mapping is filesystem-dax backed (see
the FOLL_LONGTERM flag to get_user_pages).
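
As a rough illustration of that rule, here is a small userspace model in
C. It is a sketch under stated assumptions, not kernel code: the struct,
the function name, and the flag value are all hypothetical stand-ins,
and the -EOPNOTSUPP return is my assumption about what the refusal looks
like. The only point is the shape of the check: a long-term pin request
is refused when any mapping in the requested range is filesystem-dax
backed, while a short-term pin of the same range is allowed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; only the fields this model needs. */
struct model_vma {
	unsigned long start;	/* first address covered (inclusive) */
	unsigned long end;	/* one past the last address (exclusive) */
	bool fsdax;		/* true if backed by a filesystem-dax file */
};

/* Placeholder flag value for this model, not the kernel's definition. */
#define MODEL_FOLL_LONGTERM 0x1u

/*
 * Model of the long-term pinning rule for [addr, addr + len).
 * Returns 0 if the pin would be allowed, or -EOPNOTSUPP (an assumed
 * error code) if a long-term pin would land on a filesystem-dax
 * mapping. Assumes addr + len does not overflow.
 */
int model_pin_check(const struct model_vma *vmas, size_t nr,
		    unsigned long addr, unsigned long len,
		    unsigned int flags)
{
	unsigned long end = addr + len;
	size_t i;

	/* Short-term pins are not restricted by this rule. */
	if (!(flags & MODEL_FOLL_LONGTERM))
		return 0;

	for (i = 0; i < nr; i++) {
		/* Does this mapping overlap the requested range? */
		bool overlaps = vmas[i].start < end && addr < vmas[i].end;

		if (overlaps && vmas[i].fsdax)
			return -EOPNOTSUPP;
	}
	return 0;
}
```

With one ordinary mapping followed by one filesystem-dax mapping, a
long-term request that stays inside the ordinary mapping passes, and one
that spills into the filesystem-dax mapping is refused.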

One of the proposals to break the impasse was to allow the filesystem
to forcibly revoke the mapping. I.e. to use the IOMMU to forcibly kick
the RDMA device out of its registration. That was rejected by RDMA
folks because RDMA applications are not prepared for this revocation
to happen and the application that performed the registration may not
be the application that uses the registration. There was an attempt to
use a file lease to indicate the presence of a file /
memory-registration that is blocking file-system operations, but that
was still less palatable to filesystem folks than just keeping the
status quo of blocking longterm pinning.

That said, the VFIO use case seems a different situation than RDMA.
There's often a 1:1 relationship between the application performing
the memory registration and the application consuming it, the VMM, and
there is always an IOMMU present that could revoke access and kill the
guest if the mapping got truncated. It seems in theory that VFIO could
tolerate a "revoke pin on truncate" mechanism where RDMA could not.

> I'd like to put a lot more in a DAX/pmem region than just a guest's
> memory, and having a mountable filesystem would be extremely convenient.

Why would page pinning be involved in allowing the guest to mount a
filesystem on guest-pmem? That already works today; it is only
device passthrough that causes guest memory to be pinned indefinitely.
