Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space

From: Grzegorz Jaszczyk <jaz@semihalf.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Dominik Behr <dbehr@google.com>,
	Dominik Behr <dbehr@chromium.org>,
	linux-kernel@vger.kernel.org, dmy@semihalf.com, tn@semihalf.com,
	upstream@semihalf.com, dtor@google.com, jgg@ziepe.ca,
	kevin.tian@intel.com, cohuck@redhat.com, abhsahu@nvidia.com,
	yishaih@nvidia.com, yi.l.liu@intel.com, kvm@vger.kernel.org,
	libvir-list@redhat.com
Subject: Re: [PATCH] vfio/pci: Propagate ACPI notifications to the user-space
Date: Wed, 22 Mar 2023 10:47:56 +0100	[thread overview]
Message-ID: <CAH76GKMVquX7eLoGYtwMo06ar0dE9J3LtJEnYJKY0-3X5XFwow@mail.gmail.com> (raw)
In-Reply-To: <CAH76GKP+W9JUvQpqvjLHADMeRORPUf0d8vn5gCgE5fjxz0YkPQ@mail.gmail.com>

Hi Alex

Could you please refer to my previous message?

Thank you in advance,
Grzegorz

czw., 9 mar 2023 o 14:41 Grzegorz Jaszczyk <jaz@semihalf.com> napisał(a):
>
> czw., 9 mar 2023 o 00:38 Alex Williamson <alex.williamson@redhat.com>
> napisał(a):
> >
> > On Wed, 8 Mar 2023 14:44:28 -0800
> > Dominik Behr <dbehr@google.com> wrote:
> >
> > > On Wed, Mar 8, 2023 at 12:06 PM Alex Williamson
> > > <alex.williamson@redhat.com> wrote:
> > > >
> > > > On Wed, 8 Mar 2023 10:45:51 -0800
> > > > Dominik Behr <dbehr@chromium.org> wrote:
> > > >
> > > > > It is the same interface as other ACPI events like AC adapter LID etc
> > > > > are forwarded to user-space.
> > > > >  ACPI events are not particularly high frequency like interrupts.
> > > >
> > > > I'm not sure that's relevant, these interfaces don't proclaim to
> > > > provide isolation among host processes which manage behavior relative
> > > > to accessories.  These are effectively system level services.  It's only
> > > > a very, very specialized use case that places a VMM as peers among these
> > > > processes.  Generally we don't want to grant a VMM any privileges beyond
> > > > what it absolutely needs, so letting a VMM managing an assigned NIC
> > > > really ought not to be able to snoop host events related to anything
> > > > other than the NIC.
> > > How is that related to the fact that we are forwarding VFIO-PCI events
> > > to netlink? Kernel does not grant any privileges to VMM.
> > > There are already other ACPI events on netlink. The implementer of the
> > > VMM can choose to allow VMM to snoop them or not.
> > > In our case our VMM (crosvm) does already snoop LID, battery and AC
> > > adapter events so the guest can adjust its behavior accordingly.
> > > This change just adds another class of ACPI events that are forwarded
> > > to netlink.
> >
> > That's true, it is the VMM choice whether to allow snooping netlink,
> > but this is being proposed as THE solution to allow VMMs to receive
> > ACPI events related to vfio assigned devices.  If the solution
> > inherently requires escalating the VMM privileges to see all netlink
> > events, that's a weakness in the proposal.  As noted previously,
> > there's also no introspection here, the VMM can't know whether it
> > should listen to netlink for ACPI events or include AML related to a
> > GPE for the device.  It cannot determine if either the kernel supports
> > this feature or if the device has an ACPI companion that can generate
> > these events.
>
> To be precise the VMM doesn't listen to all netlink events: it listens
> only to "acpi_event" family and acpi related multicast group, which
> means it listens to all events generated through
> acpi_bus_generate_netlink_event.
>
> Before sending this patch I thought about using eventfd instead
> netalink which will actually provide a channel associated with a given
> device and therefore such notifications will be received only by the
> VMM associated with such a device. Nevertheless, it seems like eventfd
> will allow to signalize events happening (notify on a given device)
> but is not capable of sending any payload so in our case there is no
> room for propagating notification value via eventfd. Maybe there is
> other mechanism eventfd-like which will allow to achieve above?
>
> If there is no such mechanism, maybe instead of using existing acpi
> netlink events, which are associated with "acpi_event" netlink family
> and acpi multicast group, we could create per vfio-pci a different
> netlink family or probably reuse "acpi_event" family but use different
> multicast group, so each device will have dedicated netlink family.
> Does it seem reasonable?
>
> >
> > > >
> > > > > > > > What sort of ACPI events are we expecting to see here and what does user space do with them?
> > > > > The use we are looking at right now are D-notifier events about the
> > > > > GPU power available to mobile discrete GPUs.
> > > > > The firmware notifies the GPU driver and resource daemon to
> > > > > dynamically adjust the amount of power that can be used by the GPU.
> > > > >
> > > > > > The proposed interface really has no introspection, how does the VMM
> > > > > > know which devices need ACPI tables added "upfront"?  How do these
> > > > > > events factor into hotplug device support, where we may not be able to
> > > > > > dynamically inject ACPI code into the VM?
> > > > >
> > > > > The VMM can examine PCI IDs and the associated firmware node of the
> > > > > PCI device to figure out what events to expect and what ACPI table to
> > > > > generate to support it but that should not be necessary.
> > > >
> > > > I'm not entirely sure where your VMM is drawing the line between the VM
> > > > and management tools, but I think this is another case where the
> > > > hypervisor itself should not have privileges to examine the host
> > > > firmware tables to build its own.  Something like libvirt would be
> > > > responsible for that.
> > > Yes, but that depends on the design of hypervisor and VMM and is not
> > > related to this patch.
> >
> > It is very much related to this patch if it proposes an interface to
> > solve a problem which is likely not compatible with the security model
> > of other VMMs.  We need a single solution to support all VMMs.
> >
> > > >
> > > > > A generic GPE based ACPI event forwarder as Grzegorz proposed can be
> > > > > injected at VM init time and handle any notification that comes later,
> > > > > even from hotplug devices.
> > > >
> > > > It appears that forwarder is sending the notify to a specific ACPI
> > > > device node, so it's unclear to me how that becomes boilerplate AML
> > > > added to all VMs.  We'll need to notify different devices based on
> > > > different events, right?
> > > Valid point. The notifications have a "scope" ACPI path.
> > > In my experience these events are consumed without looking where they
> > > came from but I believe the patch can be extended to
> > > provide ACPI path, in your example "_SB.PCI0.GPP0.PEGP" instead of
> > > generic vfio_pci which VMM could use to translate an equivalent ACPI
> > > path in the guest and pass it to a generic ACPI GPE based notifier via
> > > shared memory. Grzegorz could you chime in whether that would be
> > > possible?
> >
> > So effectively we're imposing the host ACPI namespace on the VM, or at
> > least a mapping between the host and VM namespace?  The generality of
> > this is not improving.
>
> Yes, in the example VMM implementation we have mapping between the
> host pci device address and guest pci device. Therefore VMM knows,
> based on device name (BDF) sent via netlink, to which guest device
> this notification should be propagated. The boilerplate AML is added
> to each vfio-pci device which belongs to VMM and each vfio-pci device
> has associated pre-allocated GPE so the VMM knows which GPE should be
> triggered to replicate notification for a given device. BTW this is
> only current WIP VMM implementation - this could probably be optimized
> if needed.
>
> Handling hotplug devices is more problematic. I see that the kernel
> provides some runtime ACPI patching mechanism:
> https://www.kernel.org/doc/html/latest/firmware-guide/acpi/method-customizing.html
> (which I never tried) but not even sure how VMM could take advantage
> of it. BTW this realized me that the same problem with hotplug applies
> to other vfio-pci use-cases e.g. runtime PM, which relies on guest
> virtual ACPI method:
> https://patchwork.kernel.org/project/linux-pm/patch/20220829114850.4341-5-abhsahu@nvidia.com/.
> Generating virtual ACPI content for hotplug devices seems like a more
> generic issue.
>
> >
> > > > > > The acpi_bus_generate_netlink_event() below really only seems to form a
> > > > > > u8 event type from the u32 event.  Is this something that could be
> > > > > > provided directly from the vfio device uAPI with an ioeventfd, thus
> > > > > > providing introspection that a device supports ACPI event notifications
> > > > > > and the ability for the VMM to exclusively monitor those events, and
> > > > > > only those events for the device, without additional privileges?
> > > > >
> > > > > From what I can see these events are 8 bit as they come from ACPI.
> > > > > They also do not carry any payload and it is up to the receiving
> > > > > driver to query any additional context/state from the device.
> > > > > This will work the same in the VM where driver can query the same
> > > > > information from the passed through PCI device.
> > > > > There are multiple other netflink based ACPI events forwarders which
> > > > > do exactly the same thing for other devices like AC adapter, lid/power
> > > > > button, ACPI thermal notifications, etc.
> > > > > They all use the same mechanism and can be received by user-space
> > > > > programs whether VMMs or others.
> > > >
> > > > But again, those other receivers are potentially system services, not
> > > > an isolated VM instance operating in a limited privilege environment.
> > > > IMO, it's very different if the host display server has access to lid
> > > > or power events than it is to allow some arbitrary VM that happens to
> > > > have an unrelated assigned device that same privilege.
> > > Therefore these VFIO related ACPI events could be received by a system
> > > service via this netlink event and selectively forwarded to VMM if
> > > such is a desire of whoever implements the userspace.
> > > This is outside the scope of this patch. In our case our VMM does
> > > receive these LID, AC or battery events.
> >
> > But this is backwards, we're presupposing the choice to use netlink
> > based on the convenience of one VMM, which potentially creates
> > obstacles, maybe even security isolation issues for other VMMs.  The
> > method of delivering ACPI events to a VMM is very much within the scope
> > of this proposal.  Thanks,
> >
> > Alex
> >
>
> regarding:
> > > > On my laptop, I see multiple _GPE scopes, each apparently very unique
> > > > to the devices:
> > > >
> > > >    Scope (_GPE)
> > > >    {
> > > >        Method (_L0C, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
> > > >        {
> > > >            Notify (\_SB.PCI0.GPP0.PEGP, 0x81) // Information Change
> > > >        }
> > > >
> > > >        Method (_L0D, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
> > > >        {
> > > >             Notify (\_SB.PCI0.GPP0.PEGP, 0x81) // Information Change
> > > >        }
> > > >
> > > >        Method (_L0F, 0, Serialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
> > > >        {
> > > >             Notify (\_SB.PCI0.GPP0.PEGP, 0x81) // Information Change
> > > >         }
> > > >     }
> > > >
> > > >     Scope (_GPE)
> > > >     {
> > > >         Method (_L19, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
> > > >         {
> > > >             Notify (\_SB.PCI0.GP17, 0x02) // Device Wake
> > > >             Notify (\_SB.PCI0.GP17.XHC0, 0x02) // Device Wake
> > > >             Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake
> > > >             Notify (\_SB.PWRB, 0x02) // Device Wake
> > > >         }
> > > >
> > > >         Method (_L08, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
> > > >         {
> > > >            Notify (\_SB.PCI0.GP18, 0x02) // Device Wake
> > > >             Notify (\_SB.PCI0.GPP0, 0x02) // Device Wake
> > > >             Notify (\_SB.PCI0.GPP1, 0x02) // Device Wake
> > > >             Notify (\_SB.PCI0.GPP5, 0x02) // Device Wake
> > > >         }
> > > >     }
> > > >
> > > > At least one more even significantly more extensive, calling methods
> > > > that interact with OpRegions.  So how does a simple stub of a
> > > > GPE block replicate this sort of behavior in the host AML?  Thanks,
>
> The simple stub of GPE block will work to replicate the ACPI
> notification only: as mentioned earlier GPE handler will be generated
> per-vfio device so in your example if let assume that only:
> - \_SB.PCI0.GPP0.PEGP
> - \_SB.PCI0.GP17.XHC1
> will be pass-through to the guest, the generated AML code for VM will
> look more-less like below:
>
>         Scope (_GPE)
>         {
>             Method (_E00, 0, NotSerialized)
>             {
>                 Local0 =  \_SB.PCI0.GPP0.PEGP.NOTY
>                 Notify ( \_SB.PCI0.GPP0.PEGP, Local0)
>             }
>         }
>         Scope (_GPE)
>         {
>             Method (_E01, 0, NotSerialized)
>             {
>                 Local0 =\_SB.PCI0.GP17.XHC1.NOTY
>                 Notify (\_SB.PCI0.GP17.XHC1, Local0)
>             }
>         }
>
> So each pass-through device will have associated GPE (0 for
> \_SB.PCI0.GPP0.PEGP and 1 for \_SB.PCI0.GP17.XHC1). The path in Notify
> could actually be different and related to guest pci hierarchy (but
> associated to those host devices). Please also note that in this case
> we use GPE in order to "inform" guest about notification coming and we
> do not try to replicate host GPE scope description.
>
> Above we assumed that other devices (like \_SB.PCI0.GPP0/1) are not
> pass-through to the guest and notification are handled in host as
> usual (they are not binded to pci-vfio) and there is no need to
> generate AML code, allocate GPE for them and so on.
>
> Thank you,
> Grzegorz