Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

From: Sean Mooney <smooney@redhat.com>
To: Jason Wang <jasowang@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Gibson <david@gibson.dropbear.id.au>,
	Eric Auger <eric.auger@redhat.com>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	"Martins, Joao" <joao.m.martins@oracle.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Matthew Rosato <mjrosato@linux.ibm.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Nicolin Chen <nicolinc@nvidia.com>,
	Shameerali Kolothum Thodi  <shameerali.kolothum.thodi@huawei.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	Keqian Zhu <zhukeqian1@huawei.com>
Subject: Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
Date: Mon, 28 Mar 2022 14:14:26 +0100	[thread overview]
Message-ID: <5accdd9074f20e8fef30984285a23366b7025497.camel@redhat.com> (raw)
In-Reply-To: <CACGkMEtTVMuc-JebEbTrb3vRUVaNJ28FV_VyFRdRquVQN9VeQA@mail.gmail.com>

On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > 
> > > It's simply because we don't want to break existing userspace. [1]
> > 
> > I'm still waiting to hear what exactly breaks in real systems.
> > 
> > As I explained this is not a significant change, but it could break
> > something in a few special scenarios.
> > 
> > Also the one place we do have ABI breaks is security, and ulimit is a
> > security mechanism that isn't working right. So we do clearly need to
> > understand *exactly* what real thing breaks - if anything.
> > 
> > Jason
> > 
> 
> To tell the truth, I don't know. I remember that Openstack may do some
> accounting so adding Sean for more comments. But we really can't image
> openstack is the only userspace that may use this.
sorry there is a lot of context to this discussion i have tried to read back the
thread but i may have missed part of it.

tl;dr openstack does not currently track locked/pinned memory per use or per vm because we have
no idea when libvirt will request it or how much is needed per device. when ulimits are configured
today for nova/openstack its done at teh qemu user level outside of openstack in our installer tooling.
e.g. in tripleo the ulimits woudl be set on the nova_libvirt contaienr to constrain all vms spawned
not per vm/process.

full responce below
-------------------

openstacks history with locked/pinned/unswapable memory is a bit complicated.
we currently only request locked memory explictly in 2 cases directly
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/driver.py#L5769-L5784=
when the adminstartor configure the vm flaovr to requst amd's SEV feature or configures the flavor for realtime scheduling pirorotiy.
i say explictly as libvirt invented a request for locked/pinned pages implictly for sriov VFs and a number of other cases
which we were not aware of implictly. this only became apprent when we went to add vdpa supprot to openstack and libvirt
did not make that implict request and we had to fall back to requesting realtime instances as a workaround.

nova/openstack does have the ablity to generate the libvirt xml element that configure hard and soft limtis 
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/config.py#L2559-L2590
however it is only ever used in our test code
https://github.com/openstack/nova/search?q=LibvirtConfigGuestMemoryTune

the descirption of hard limit in the libvirt docs stongly dicurages its use with a small caveat for locked memory
https://libvirt.org/formatdomain.html#memory-tuning

   hard_limit

       The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users
   of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory
   needed for a process to run is an undecidable problem; that said, if you already set locked in memory backing because your workload demands it, you'll
   have to take into account the specifics of your deployment and figure out a value for hard_limit that is large enough to support the memory
   requirements of your guest, but small enough to protect your host against a malicious guest locking all memory.

we coudl not figure out how to automatically comptue a hard_limit in nova that would work for everyone and we felt exposign this to our
users/operators was  bit of a cop out when they likely cant caluate that properly either. As a result we cant actully account for them today when
schduilign workloads to a host. Im not sure this woudl chagne even if you exposed new user space apis unless we 
had a way to inspect each VF to know how much locked memory that VF woudl need to lock? same for vdpa devices,
mdevs ectra. cloud system dont normaly have quotas on "locked" memory used trasitivly via passthoguh devices so even if we had this info
its not imeditly apperant how we woudl consume it without altering our existing quotas. Openstack is a self service cloud plathform
where enduser can upload there own worload iamge so its basicaly impossibel for the oeprator of the cloud to know how much memroy to set teh ard limit
too without setting it overly large in most cases. from a management applciaton point of view we currently have no insigth into how
memory will be pinned in the kernel or when libvirt will invent addtional request for pinned/locked memeory or how large they are. 

instead of going down that route operators are encuraged to use ulimit to set a global limit on the amount of memory the nova/qemu user can use.
while nova/openstack support multi tenancy we do not expose that multi tenancy to hte underlying hypervior hosts. the agents are typicly
deploy as the nova user which is a member of the libvirt and qemu groups. the vms that are created for our teants are all created as under the qemu
user/group as a result. so the qemu user is gobal ulimit on realtime systems woudl need to be set "to protect your host against a malicious guest
locking all memory" but we do not do this on a per vm or per process basis.

to avoid memory starvation we generally recommend using hugepages when ever you are locking memroy as we at least track those per numa node and
have the memroy trackign in place to know that they are not oversubscibeable. i.e. they cant be swapped so the are effectivly the same as being locked
form a user space point of view. using hugepage memory as a workaround whenever we need to account for memory lockign is not ideal but most of our
user that need sriov or vdpa are telcos so they are alreayd usign hugepages and cpu pinning in most cases so it kind of works.

since we dont currently support per instance hard_limits and dont plan to intoruce them in the future wehter this is track per process(vm) or per
user(qemu) is not going to break openstack today. it may complicate any future use of the memtune element in libvirt but we do not currently have
customer/user asking use to expose this and as a cloud solution this super low level customiation is not really somethign we
want to expose in our api anyway.

regards
sean

> 
> To me, it looks more easier to not answer this question by letting
> userspace know about the change,
> 
> Thanks
>