From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	kvm@vger.kernel.org, Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC v2] vfio: Documentation for the migration region
Date: Mon, 6 Dec 2021 15:15:00 -0400
Message-ID: <20211206191500.GL4670@nvidia.com>
In-Reply-To: <20211203110619.1835e584.alex.williamson@redhat.com>

On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:

> > Yes, not sure how you get to a non-issue though? If the IOMMU map is
> > present the huawei device can be attacked by a hostile VM and forced
> > to exit NDMA. All this takes is any two DMA capable devices to be
> > plugged in?
> 
> I'm confused by your idea of a "hostile VM".  Devices within a VM can
> only DMA to memory owned by the VM.  So the only hostile VM that can
> attack this device is the VM that owns the device in the first place.
> Meanwhile the VMM is transitioning all devices to the NDMA state, so
> the opportunity to poke one device from another device within the same
> VM is pretty limited.  And for what gain?  Is this a DoS opportunity
> that the guest might trigger the migration to abort or fail?

I'm quite worried about what could be possible if the VM is able to
impact the hypervisor. Something like migration is a hypervisor
activity and it should be fully isolated.

If we are certain that the only consequence is that the VM corrupts
itself, then I agree that is OK, but I've seen enough crazy security
stuff to be skeptical we are smart enough to know that :(

> I suggest this might be a non-issue for hns because I understand it to
> not support any sort of p2p, therefore this DMA write to the device
> while it's under mediation would never happen in practice. 

It has nothing to do with the huawei device; any DMA capable device
could be instructed to generate a P2P write to the huawei device.

> > This is a good idea, if we want to make huawei support NDMA then
> > flagging it to never be in the iommu map in the first place is a great
> > solution. Then they can use CPU mmio trapping to get the rest of the way.
> 
> NB, this flag would only be a hint to the VMM, the vfio IOMMU backend
> would need to be involved if we intended to prevent such mappings, but
> that seems like overkill if we're expecting userspace to follow the
> rules to get working migration and migration drivers have sufficient
> robustness to maintain device isolation.

I'm fine with a hint. Secure userspace needs to follow the security
rules; the kernel doesn't need to nanny a VMM, AFAIK?

> > OK, I don't read it like that. It seems OK to have a performance hit
> > in NDMA since it is only a short grace state.
> 
> Is it the length of time we're in the NDMA state or the utility of the
> NDMA state itself?  If we're telling the device "no more outbound DMA,
> no more interrupts", then it doesn't seem super relevant for the
> majority of use cases what happens to the MMIO latency.

Actually, I wrote down that the P2P grace state should have
!VCPU_RUNNING, so we don't even need to worry about anything else
here. Just remove it from the IOMMU map.

PRI is the only wrinkle, but I think we have to insist on NDMA to do
vPRI, and the huawei device doesn't support PRI anyhow.
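
Concretely, "remove it from the IOMMU map" above is just the normal
type1 unmap from the VMM side. A minimal sketch, assuming the VMM
recorded the BAR's iova/size somewhere (the function and parameter
names are invented for this example):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch only: drop a device's MMIO BAR from the container's IOMMU
 * mapping during the P2P grace state. bar_iova/bar_size stand in for
 * wherever the VMM tracked the P2P mapping.
 */
static int unmap_p2p_mmio(int container_fd, uint64_t bar_iova,
			  uint64_t bar_size)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova = bar_iova,
		.size = bar_size,
	};

	return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}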

> > OTOH "supportable" qemu could certainly make the default choice to
> > require devices for simplicity.
> 
> I get a bit lost every time I try to sketch out how QEMU would
> implement it.  Forgive the stream of consciousness and rhetorical
> discussion below...

IMHO the simple solution without ugly edges:

qemu maintains two IOMMU domains on demand:

1) has all the MMIO maps for all PCI devices.
2) has no MMIO maps for any PCI devices.

A device supporting NDMA gets assigned to #1

A device without NDMA gets assigned to #2

Some qemu flag 'enable_p2p=off' causes all devices to use #2 and
optimizes away the 2nd iommu domain.

Thus all combinations of devices can be hotplugged in/out in any order
and nothing breaks.
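
Roughly, as a sketch (none of these helper names exist in qemu today,
they are invented for illustration; VFIODevice is qemu's existing
per-device struct):

/*
 * Illustrative only: pick which of the two on-demand IOMMU domains a
 * device joins. vfio_device_supports_ndma() and the domain ids are
 * invented names for this sketch.
 */
typedef struct VFIODevice VFIODevice;	/* qemu's real definition */
extern bool vfio_device_supports_ndma(VFIODevice *vdev);

enum vfio_domain_id {
	DOMAIN_WITH_MMIO,	/* #1: MMIO maps for all PCI devices */
	DOMAIN_NO_MMIO,		/* #2: no MMIO maps for any PCI device */
};

static enum vfio_domain_id pick_domain(VFIODevice *vdev, bool enable_p2p)
{
	/* enable_p2p=off: everything uses #2 and #1 is never created */
	if (!enable_p2p)
		return DOMAIN_NO_MMIO;
	/* NDMA-capable devices can safely be exposed to P2P */
	if (vfio_device_supports_ndma(vdev))
		return DOMAIN_WITH_MMIO;
	return DOMAIN_NO_MMIO;
}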

>  - Does it make sense that a device itself can opt-out of p2p mappings?

Bit strange, but yes it kind of does make sense. It is more of a
'device strictly requires quiet MMIO when !RUNNING' flag, which means
the userspace process must guarantee not to touch the MMIO through
any means, including CPU, VCPU or IOMMU.

Combined with the above, the special flag would allow such devices to
join the #2 IOMMU domain and reduce the cases where #1 != #2

>  - Can a user know which devices will require enable_p2p=off in order
>    to set migration=on?  "Read the error log" is a poor user experience
>    and difficult hurdle for libvirt.

I'm sure we can add something in sysfs. There is a patch series adding
a cdev for the vfio_device and that comes along with a full sysfs
directory that can hold per-vfio-device information for userspace to
access.

> > Can we go the other way and move more of the uAPI header text here?
> 
> Currently our Documentation/ file describes an overview of vfio, the
> high level theory of groups, devices, and iommus, provides a detailed
> userspace usage example, an overview of device registration and ops,
> and a bunch of random spapr implementation notes largely around
> userspace considerations on power platforms.

Currently, but it doesn't have to stay that way.
 
> So I would generalize that our Documentation/ is currently focused on
> userspace access and interaction with the uAPI.  The uAPI itself
> therefore generally provides the detail regarding how a given interface
> is intended to work from an implementation perspective.  I'd say the
> proposed document here is mixing both of those in Documentation/, some
> aspects relevant to implementation of the uAPI, others to userspace.
> 
> Currently I would clearly say that the uAPI header is the canonical
> source of truth and I think it should continue to be so for describing
> the implementation of the uAPI.

Well, I don't want to have a > 1000 line comment in the uAPI
header.

> > Unfortunately more than half of that describes how the data window
> > works, and half of the rest is kind of obvious statements.
> 
> How does that rationalize forking implementation details to userspace
> Documentation/ rather than resolving the issues you see with the uAPI
> description?

I object to putting an enormous comment in a header file and using poor
tools to write documentation. If there are small things to update and
correct then please point to them and we can fix them. Otherwise let's
keep them split and work to improve the rst discussion for all of the
uapi in general.

> > In your mind you see generality, in our mind we want to know how to
> > write an interoperable driver and there is no documentation saying how
> > to do that.
> 
> I didn't know there was a hive mind over there, maybe that explains
> your quick replies ;)

No :) It's just been a long topic here. You recall the first version
of the series had a very different interpretation of the header file
comment regarding the device_state. Your remark that the device_state
is not a FSM was rather surprising when the header file document has
FSM-like discussions.
 
> This is exactly the sort of "designed for QEMU implementation"
> inter-operability that I want to avoid.  It doesn't take much of a
> crystal ball to guess that gratuitous and redundant device resets
> slow VM instantiation and are a likely target for optimization.

Sorry, but Linus's "don't break userspace" rule forces us into this world.

It does not matter what is written in text files, only what userspace
actually does, and the kernel must accommodate existing userspace. So
once released, qemu becomes a de facto spec whose behavior sets the
guardrails that limit what we can do going forward.

In part this document is also an effort to describe to kernel devs
what these guardrails are.

> I think though that your reference flow priority depends a lot on your
> implementation that resuming state is stored somewhere and only
> processed on the transition of the RESUMING bit.  You're sharing a
> buffer between SAVING and RESUMING.  That's implementation, not
> specification.  A device may choose to validate and incorporate data
> written to the migration region as it's written.  A device may choose
> to expose actual portions of on device memory through the migration
> region during either SAVING or RESUMING.  Therefore setting or
> clearing of these bits may not have the data dependencies that drive
> the priority scheme of mlx5.

I've said this before, it doesn't matter what other devices might
do. mlx5 needs to work this way, so least-common-denominator.

The only other option is to declare multi-bit changes as undefined
behavior - that seems dangerous in general.

This is a good ordering because it matches the ordering userspace
would produce naturally if it made one bit change at a time.

Therefore *any* device can implement this order: it simply decomposes
a multi-bit change into the single-bit transitions it must already
support, applied in the priority listed here.
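
To be concrete, a sketch of the decomposition (the particular order[]
and the single-step callback are illustrative only, not the mandated
priority):

#include <linux/vfio.h>

/* invented callback: perform one already-supported single-bit arc */
extern int do_single_transition(struct vfio_device *vdev, u32 state);

/*
 * Sketch only: decompose a multi-bit device_state change into the
 * single-bit transitions the driver already supports, applied in a
 * fixed priority order.
 */
static int apply_state_change(struct vfio_device *vdev, u32 cur_state,
			      u32 new_state)
{
	static const u32 order[] = {
		VFIO_DEVICE_STATE_RESUMING,
		VFIO_DEVICE_STATE_SAVING,
		VFIO_DEVICE_STATE_RUNNING,
	};
	u32 step = cur_state;
	int i, ret;

	for (i = 0; i < ARRAY_SIZE(order); i++) {
		if (!((cur_state ^ new_state) & order[i]))
			continue;		/* this bit is unchanged */
		step ^= order[i];		/* flip exactly one bit */
		ret = do_single_transition(vdev, step);
		if (ret)
			return ret;
	}
	return 0;
}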

Ideally we'd put this logic in the core code and drivers would just
have callbacks, but I'm not quite sure we see this problem well enough
yet to make the callbacks. Maybe once we have three drivers..

> So why is this the one true implementation?

Because it is what we are going to write down?

> I think implementation and clarification of the actual definition of
> the uAPI should exist in the uAPI header, userspace clarification and
> examples for the consumption of the uAPI should stay here in
> Documentation/, and we should not assume either the QEMU or mlx5
> implementations as the canonical reference for either.

I'm not even sure anymore what you are looking for in order to
advance.

Can you please outline exactly what things you want from us? Like, as
a task list?

> > It is informative for the device driver author to understand what
> > device functionality to map to this.
> 
> Sounds like we're in agreement, so why does this belong in userspace
> Documentation/?

Because this is where documentation should live, not in a header file.

> The uAPI might need some clarification here, but the only viable
> scenario would seem to be that yes, userspace should continue to poll
> the device so long as it remains in SAVING|RUNNING as internal state
> can continue to change asynchronously from userspace polling.  It's
> only when pending_bytes returns zero while !RUNNING that the data
> stream is complete and the device is precluded from deciding any new
> state is available.  Thanks,

I updated along these lines, thanks
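
For reference, my reading of that flow as a userspace loop against the
v1 migration region layout (a sketch only; error handling omitted, and
mig_off, the migration region's offset within the device fd, is assumed
to have been discovered via VFIO_DEVICE_GET_REGION_INFO):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Sketch: poll pending_bytes while SAVING, per the v1 region layout */
static void drain_save_data(int device_fd, off_t mig_off)
{
	uint64_t pending, data_offset, data_size;

	for (;;) {
		pread(device_fd, &pending, sizeof(pending), mig_off +
		      offsetof(struct vfio_device_migration_info,
			       pending_bytes));
		if (!pending)
			break;	/* with !RUNNING the stream is complete */

		/* reading data_offset tells the driver to stage a chunk */
		pread(device_fd, &data_offset, sizeof(data_offset), mig_off +
		      offsetof(struct vfio_device_migration_info,
			       data_offset));
		pread(device_fd, &data_size, sizeof(data_size), mig_off +
		      offsetof(struct vfio_device_migration_info,
			       data_size));

		/* copy data_size bytes at mig_off + data_offset out */
	}
}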

Jason
