From: Laine Stump <laine@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Jason Gunthorpe <jgg@nvidia.com>
Cc: "Eric Auger" <eric.auger@redhat.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Rodel, Jorg" <jroedel@suse.de>,
	"Lu Baolu" <baolu.lu@linux.intel.com>,
	"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
	"Cornelia Huck" <cohuck@redhat.com>,
	"Daniel Jordan" <daniel.m.jordan@oracle.com>,
	"David Gibson" <david@gibson.dropbear.id.au>,
	"Eric Farman" <farman@linux.ibm.com>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	"Jason Wang" <jasowang@redhat.com>,
	"Jean-Philippe Brucker" <jean-philippe@linaro.org>,
	"Martins, Joao" <joao.m.martins@oracle.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Matthew Rosato" <mjrosato@linux.ibm.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Nicolin Chen" <nicolinc@nvidia.com>,
	"Niklas Schnelle" <schnelle@linux.ibm.com>,
	"Shameerali Kolothum Thodi"
	<shameerali.kolothum.thodi@huawei.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"Keqian Zhu" <zhukeqian1@huawei.com>,
	"Steve Sistare" <steven.sistare@oracle.com>,
	"libvir-list@redhat.com" <libvir-list@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>
Subject: Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
Date: Wed, 21 Sep 2022 18:36:27 -0400	[thread overview]
Message-ID: <463d5e09-202f-ca2f-ffb0-c86c8f8b75c9@redhat.com> (raw)
In-Reply-To: <20220921120649.5d2ff778.alex.williamson@redhat.com>

On 9/21/22 2:06 PM, Alex Williamson wrote:
> [Cc+ Steve, libvirt, Daniel, Laine]
> 
> On Tue, 20 Sep 2022 16:56:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
>>> Hi,
>>>
>>> On 9/13/22 03:55, Tian, Kevin wrote:
>>>> We didn't settle on how to get this merged at LPC due to the
>>>> audio issue, so let's use email.
>>>>
>>>> Overall there are three options on the table:
>>>>
>>>> 1) Require vfio-compat to be 100% compatible with vfio-type1
>>>>
>>>>     Probably not a good choice given the amount of work to fix the remaining
>>>>     gaps. And this will block support of new IOMMU features for a longer time.
>>>>
>>>> 2) Leave vfio-compat as what it is in this series
>>>>
>>>>     Treat it as a vehicle to validate the iommufd logic instead of immediately
>>>>     replacing vfio-type1. Functionally, most vfio applications can work w/o
>>>>     change, putting aside the differences in locked memory accounting, p2p, etc.
>>>>
>>>>     Then work on new features and 100% vfio-type1 compat. in parallel.
>>>>
>>>> 3) Focus on iommufd native uAPI first
>>>>
>>>>     Require vfio_device cdev and adoption in Qemu. Only for new vfio apps.
>>>>
>>>>     Then work on new features and vfio-compat in parallel.
>>>>
>>>> I'm fine with either 2) or 3). Per a quick chat with Alex, he prefers 3).
>>>
>>> I am also inclined to pursue 3), as this was Jason's initial guidance
>>> and a prerequisite to integrating new features. In the past we concluded
>>> vfio-compat would mostly be used for testing purposes. Our QEMU
>>> integration is fully based on the device-based API.
>>
>> There are some poor chicken and egg problems here.
>>
>> I had some assumptions:
>>   a - the vfio cdev model is going to be iommufd only
>>   b - any uAPI we add as we go along should be generally useful going
>>       forward
>>   c - we should try to minimize the 'minimally viable iommufd' series
>>
>> The compat as it stands now (eg #2) is threading this needle. Since it
>> can exist without cdev it means (c) is made smaller, to two series.
>>
>> Since we add something useful to some use cases, eg DPDK is deployable
>> that way, (b) is OK.
>>
>> If we focus on a strict path with 3, and avoid adding non-useful code,
>> then we have to have two more (unwritten!) series beyond where we are
>> now - vfio group compartmentalization and cdev integration - and the
>> initial (c) will increase.
>>
>> 3 also has us merging something that currently has no usable
>> userspace, which I also dislike a lot.
>>
>> I still think the compat gaps are small. I've realized that
>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
>> can deadlock the kernel I propose we purge it completely.
> 
> Steve won't be happy to hear that; QEMU support exists but isn't yet
> merged.
>   
>> P2P is ongoing.
>>
>> That really just leaves the accounting, and I'm still not convinced that
>> this must be a critical thing. Linus's latest remarks reported in lwn
>> at the maintainer summit on tracepoints/BPF as ABI seem to support
>> this. Let's see an actual deployed production configuration that would
>> be impacted, and we won't find that unless we move forward.
> 
> I'll try to summarize the proposed change so that we can get better
> advice from libvirt folks, or potentially anyone else managing locked
> memory limits for device assignment VMs.
> 
> Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
> we use the system IOMMU to provide GPA to HPA translation for assigned
> devices. Unlike CPU page tables, we don't generally have a means to
> demand fault these translations, therefore the memory target of the
> translation is pinned so that it cannot be swapped or relocated,
> ie. to guarantee the translation is always valid.
> 
> The issue is where we account these pinned pages, where accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.  Duplicate accounting should be
> resolved by iommufd, but is outside the scope of this discussion.
> 
> Currently, vfio tests against the mm_struct.locked_vm relative to
> rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
> where task is the current process.  This is the same limit set via the
> setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
> 
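To make the task limit concrete, here is a quick Python sketch (illustrative only; the stdlib resource module just wraps getrlimit(2)) that reads the same per-task RLIMIT_MEMLOCK value that 'ulimit -l' reports:

```python
import resource

# Read the calling task's RLIMIT_MEMLOCK -- the same per-process
# limit set by setrlimit()/prlimit(1) and shown by 'ulimit -l'.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(val):
    # RLIM_INFINITY corresponds to "unlimited" in ulimit terms.
    return "unlimited" if val == resource.RLIM_INFINITY else f"{val} bytes"

print("RLIMIT_MEMLOCK soft:", fmt(soft))
print("RLIMIT_MEMLOCK hard:", fmt(hard))
```

Whatever values print depend entirely on what the shell or management tool set for this process, which is exactly the per-task nature being discussed.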
> Note that in both cases above, we're dealing with a task, or process
> limit and both prlimit and ulimit man pages describe them as such.
> 
> iommufd supposes instead, referencing existing kernel
> implementations, that despite the descriptions above these limits are
> actually meant to be user limits, and it therefore charges pinned
> pages against user_struct.locked_vm and also marks them in
> mm_struct.pinned_vm.
> 
> The proposed algorithm is to read the _task_ locked memory limit, then
> attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
> cannot exceed the task locked memory limit.
> 
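A toy model of that algorithm may help (my own naming and a userspace simulation, not kernel code):

```python
from dataclasses import dataclass

@dataclass
class User:                 # stand-in for the kernel's user_struct
    locked_vm: int = 0      # pages pinned across all this user's tasks

@dataclass
class Task:                 # stand-in for a process
    user: User
    memlock_limit: int      # this task's RLIMIT_MEMLOCK, in pages

def try_pin(task: Task, npages: int) -> bool:
    """Charge npages to the *user* counter, bounded by the *task* limit."""
    if task.user.locked_vm + npages > task.memlock_limit:
        return False        # would exceed the task's limit
    task.user.locked_vm += npages
    return True

user_a = User()
vm1 = Task(user_a, memlock_limit=1024)
ok = try_pin(vm1, 1024)     # a single VM can pin up to its own limit
```

With one task per user this behaves like the old per-task accounting; the differences only show up once two tasks share a uid.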
> This obviously has implications.  AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users is
> essentially untenable.  For example, if we launch VM1 under userA and
> set a locked memory limit of 4GB via prlimit to account for an assigned
> device, that works fine, until we launch VM2 from userA as well.  In
> that case we can't simply set a 4GB limit on the VM2 task because
> there's already 4GB charged against user_struct.locked_vm for VM1.  So
> we'd need to set the VM2 task limit to 8GB to be able to launch VM2.
> But not only that, we'd need to go back and also set VM1's task limit
> to 8GB or else it will fail if a DMA mapped memory region is transient
> and needs to be re-mapped.
> 
> Effectively any task under the same user and requiring pinned memory
> needs to have a locked memory limit set, and updated, to account for
> all tasks using pinned memory by that user.
> 
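Spelling that scenario out with numbers (again a toy simulation, 4GB counted in 4KiB pages; all names here are mine):

```python
GB = 1 << 30
PAGE = 4096
FOUR_GB = 4 * GB // PAGE            # 4GB expressed in 4KiB pages

user_locked_vm = 0                  # shared per-user counter

def try_pin(task_limit, npages):
    global user_locked_vm
    if user_locked_vm + npages > task_limit:
        return False
    user_locked_vm += npages
    return True

def unpin(npages):
    global user_locked_vm
    user_locked_vm -= npages

vm1_ok = try_pin(FOUR_GB, FOUR_GB)        # VM1, 4GB task limit: works
vm2_fail = try_pin(FOUR_GB, FOUR_GB)      # VM2, same user, 4GB limit: fails
vm2_ok = try_pin(2 * FOUR_GB, FOUR_GB)    # VM2 only fits with an 8GB limit

# VM1 now transiently unmaps and remaps its 4GB region: the remap
# fails at VM1's old 4GB limit and only works once it is raised to 8GB.
unpin(FOUR_GB)
vm1_remap_old = try_pin(FOUR_GB, FOUR_GB)      # fails: 4GB + 4GB > 4GB
vm1_remap_new = try_pin(2 * FOUR_GB, FOUR_GB)  # works at 8GB
```

So every task's limit ends up tracking the user-wide total, not its own pinning, which is the management headache described above.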
> How does this affect known current use cases of locked memory
> management for assigned device VMs?
> 
> Does qemu:///system by default sandbox VMs into per-VM uids, or do
> they all use the qemu user by default?

Unless it is told otherwise in the XML for the VMs, each qemu process 
uses the same uid (which is usually "qemu", but can be changed in 
systemwide config).
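For reference, that systemwide uid for qemu:///system is set in libvirt's qemu.conf; the snippet below shows the usual location and defaults (verify against your distro's packaging):

```
# /etc/libvirt/qemu.conf (qemu:///system driver config)
# The uid/gid that QEMU processes run as; "qemu" is the common default.
user = "qemu"
group = "qemu"
```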

>  I imagine qemu:///session mode is pretty
> screwed by this, but I also don't know who/where locked limits are
> lifted for such VMs.  Boxes, who I think now supports assigned device
> VMs, could also be affected.

Because qemu:///session runs an unprivileged libvirt (i.e. one unable to 
raise the limits), Boxes sets the limits elsewhere beforehand (not sure 
where, as I'm not familiar with the Boxes source).

>   
>> So, I still like 2 because it yields the smallest next step before we
>> can bring all the parallel work onto the list, and it makes testing
>> and converting non-qemu stuff easier even going forward.
> 
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value.  Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues.  Thanks,
> 
> Alex
> 

