From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Laszlo Ersek <lersek@redhat.com>
Cc: "Chen, Yingwen" <yingwen.chen@intel.com>,
	"devel@edk2.groups.io" <devel@edk2.groups.io>,
	Phillip Goerl <phillip.goerl@oracle.com>,
	qemu devel list <qemu-devel@nongnu.org>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	edk2-rfc-groups-io <rfc@edk2.groups.io>,
	Joao Marcal Lemos Martins <joao.m.martins@oracle.com>
Subject: Re: [Qemu-devel] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
Date: Sat, 17 Aug 2019 00:20:25 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C503F761B96@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <20190816161933.7d30a881@x1.home>



> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, August 17, 2019 6:20 AM
> To: Laszlo Ersek <lersek@redhat.com>
> Cc: Yao, Jiewen <jiewen.yao@intel.com>; Paolo Bonzini
> <pbonzini@redhat.com>; devel@edk2.groups.io; edk2-rfc-groups-io
> <rfc@edk2.groups.io>; qemu devel list <qemu-devel@nongnu.org>; Igor
> Mammedov <imammedo@redhat.com>; Chen, Yingwen
> <yingwen.chen@intel.com>; Nakajima, Jun <jun.nakajima@intel.com>; Boris
> Ostrovsky <boris.ostrovsky@oracle.com>; Joao Marcal Lemos Martins
> <joao.m.martins@oracle.com>; Phillip Goerl <phillip.goerl@oracle.com>
> Subject: Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
> 
> On Fri, 16 Aug 2019 22:15:15 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
> > +Alex (direct question at the bottom)
> >
> > On 08/16/19 09:49, Yao, Jiewen wrote:
> > > below
> > >
> > >> -----Original Message-----
> > >> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> > >> Sent: Friday, August 16, 2019 3:20 PM
> > >> To: Yao, Jiewen <jiewen.yao@intel.com>; Laszlo Ersek
> > >> <lersek@redhat.com>; devel@edk2.groups.io
> > >> Cc: edk2-rfc-groups-io <rfc@edk2.groups.io>; qemu devel list
> > >> <qemu-devel@nongnu.org>; Igor Mammedov <imammedo@redhat.com>;
> > >> Chen, Yingwen <yingwen.chen@intel.com>; Nakajima, Jun
> > >> <jun.nakajima@intel.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com>;
> > >> Joao Marcal Lemos Martins <joao.m.martins@oracle.com>; Phillip Goerl
> > >> <phillip.goerl@oracle.com>
> > >> Subject: Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
> > >>
> > >> On 16/08/19 04:46, Yao, Jiewen wrote:
> > >>> Comment below:
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> > >>>> Sent: Friday, August 16, 2019 12:21 AM
> > >>>> To: Laszlo Ersek <lersek@redhat.com>; devel@edk2.groups.io; Yao,
> > >>>> Jiewen <jiewen.yao@intel.com>
> > >>>> Cc: edk2-rfc-groups-io <rfc@edk2.groups.io>; qemu devel list
> > >>>> <qemu-devel@nongnu.org>; Igor Mammedov <imammedo@redhat.com>;
> > >>>> Chen, Yingwen <yingwen.chen@intel.com>; Nakajima, Jun
> > >>>> <jun.nakajima@intel.com>; Boris Ostrovsky <boris.ostrovsky@oracle.com>;
> > >>>> Joao Marcal Lemos Martins <joao.m.martins@oracle.com>; Phillip Goerl
> > >>>> <phillip.goerl@oracle.com>
> > >>>> Subject: Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
> > >>>>
> > >>>> On 15/08/19 17:00, Laszlo Ersek wrote:
> > >>>>> On 08/14/19 16:04, Paolo Bonzini wrote:
> > >>>>>> On 14/08/19 15:20, Yao, Jiewen wrote:
> > >>>>>>>> - Does this part require a new branch somewhere in the OVMF
> > >>>>>>>>   SEC code? How do we determine whether the CPU executing SEC
> > >>>>>>>>   is BSP or hot-plugged AP?
> > >>>>>>> [Jiewen] I think this is blocked from the hardware perspective,
> > >>>>>>> since the first instruction.
> > >>>>>>> There are some hardware-specific registers that can be used to
> > >>>>>>> determine whether the CPU is newly added.
> > >>>>>>> I don’t think this must be the same as on real hardware.
> > >>>>>>> You are free to invent some registers in the device model to be
> > >>>>>>> used in the OVMF hot-plug driver.
> > >>>>>>
> > >>>>>> Yes, this would be a new operation mode for QEMU, one that only
> > >>>>>> applies to hot-plugged CPUs.  In this mode the AP doesn't reply
> > >>>>>> to INIT or SMI; in fact it doesn't reply to anything at all.
> > >>>>>>
> > >>>>>>>> - How do we tell the hot-plugged AP where to start execution?
> > >>>>>>>>   (I.e. that it should execute code at a particular pflash
> > >>>>>>>>   location.)
> > >>>>>>> [Jiewen] Same real mode reset vector at FFFF:FFF0.
> > >>>>>>
> > >>>>>> You do not need a reset vector or INIT/SIPI/SIPI sequence at all
> > >>>>>> in QEMU.  The AP does not start execution at all when it is
> > >>>>>> unplugged, so no cache-as-RAM etc.
> > >>>>>>
> > >>>>>> We only need to modify QEMU so that hot-plugged APs do not reply
> > >>>>>> to INIT/SIPI/SMI.
> > >>>>>>
> > >>>>>>> I don’t think there is a problem for real hardware, which
> > >>>>>>> always has CAR.
> > >>>>>>> Can QEMU provide some CPU-specific space, such as an MMIO
> > >>>>>>> region?
> > >>>>>>
> > >>>>>> Why is a CPU-specific region needed if every other processor is
> > >>>>>> in SMM and thus trusted?
> > >>>>>
> > >>>>> I was going through the steps Jiewen and Yingwen recommended.
> > >>>>>
> > >>>>> In step (02), the new CPU is expected to set up RAM access. In step
> > >>>>> (03), the new CPU, executing code from flash, is expected to "send
> > >>>>> board message to tell host CPU (GPIO->SCI) -- I am waiting for
> > >>>>> hot-add message." For that action, the new CPU may need a stack
> > >>>>> (minimally if we want to use C function calls).
> > >>>>>
> > >>>>> Until step (03), there had been no word about any other (=
> > >>>>> pre-plugged) CPUs (more precisely, Jiewen even confirmed "No
> > >>>>> impact to other processors"), so I didn't assume that other CPUs
> > >>>>> had entered SMM.
> > >>>>>
> > >>>>> Paolo, I've attempted to read Jiewen's response, and yours, as
> > >>>>> carefully as I can. I'm still very confused. If you have a better
> > >>>>> understanding, could you please write up the 15-step process from
> > >>>>> the thread starter again, with all QEMU customizations applied?
> > >>>>> Such as, unnecessary steps removed, and platform specifics filled
> > >>>>> in.
> > >>>>
> > >>>> Sure.
> > >>>>
> > >>>> (01a) QEMU: create new CPU.  The CPU already exists, but it does
> > >>>>      not start running code until unparked by the CPU hotplug
> > >>>>      controller.
> > >>>>
> > >>>> (01b) QEMU: trigger SCI
> > >>>>
> > >>>> (02-03) no equivalent
> > >>>>
> > >>>> (04) Host CPU: (OS) execute GPE handler from DSDT
> > >>>>
> > >>>> (05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: new
> > >>>>      CPU will not enter SMM because SMI is disabled for it)
> > >>>>
> > >>>> (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM
> > >>>>      rebase code.
> > >>>>
> > >>>> (07a) Host CPU: (SMM) Write to CPU hotplug controller to enable
> > >>>>      new CPU
> > >>>>
> > >>>> (07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU.
> > >>> [Jiewen] NOTE: INIT/SIPI/SIPI can be sent by a malicious CPU. There
> > >>> is no restriction that INIT/SIPI/SIPI can only be sent in SMM.
> > >>
> > >> All of the CPUs are now in SMM, and INIT/SIPI/SIPI will be discarded
> > >> before 07a, so this is okay.
> > > [Jiewen] May I know why INIT/SIPI/SIPI is discarded before 07a but is
> > > delivered at 07a?
> > > I don’t see any extra step between 06 and 07a.
> > > What is the magic here?
> >
> > The magic is 07a itself, IIUC. The CPU hotplug controller would be
> > accessible only in SMM. And until 07a happens, the new CPU ignores
> > INIT/SIPI/SIPI even if another CPU sends it those, simply because QEMU
> > would implement the new CPU's behavior like that.
[Jiewen] Got it. Looks fine to me.
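
For illustration, the parking behavior agreed on above could look roughly
like this in QEMU-style C. This is a minimal sketch only; the
HotplugCPUState type, the "parked" field, and both helper functions are
invented names, not existing QEMU code:

  /* Minimal sketch, not existing QEMU code: all names below are
   * hypothetical. A hot-plugged CPU starts parked and discards
   * INIT/SIPI/SMI until the hotplug controller (07a) unparks it. */
  #include <stdbool.h>

  typedef struct HotplugCPUState {
      bool parked;              /* true from (01a) until (07a) */
  } HotplugCPUState;

  /* Consulted for every INIT/SIPI/SMI aimed at this CPU. */
  static bool cpu_accepts_ipi(HotplugCPUState *cpu)
  {
      /* A parked CPU ignores INIT/SIPI/SMI entirely, so a malicious
       * IPI sent before step (07a) is simply discarded. */
      return !cpu->parked;
  }

  /* Step (07a): reached only via the hotplug controller register,
   * which is writable exclusively from SMM. */
  static void hotplug_controller_enable_cpu(HotplugCPUState *cpu)
  {
      cpu->parked = false;  /* from now on, INIT/SIPI/SIPI is delivered */
  }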



> > >> However, I do see a problem, because a PCI device's DMA could
> > >> overwrite 0x38000 between (06) and (10) and hijack the code that is
> > >> executed in SMM.  How is this avoided on real hardware?  By the time
> > >> the new CPU enters SMM, it doesn't run off cache-as-RAM anymore.
> > > [Jiewen] Interesting question.
> > > I don’t think the DMA attack is considered in the threat model for
> > > the virtual environment. We only list the adversaries below:
> > > -- Adversary: System Software Attacker, who can control any OS memory
> > > or silicon register from the OS level, or read/write BIOS data.
> > > -- Adversary: Simple hardware attacker, who can hot add or hot remove
> > > a CPU.
> >
> > We do have physical PCI(e) device assignment; sorry for not
> > highlighting that earlier.
[Jiewen] That is OK. Then we MUST add a third adversary:
-- Adversary: Simple hardware attacker, who can use a device to perform a DMA attack in the virtual world.
NOTE: The DMA attack in the real world is out of scope. That is handled by the IOMMU in the real world, such as VT-d. -- Please do clarify if this is TRUE.

In the real world:
#1: the SMM region MUST be a non-DMA-capable region.
#2: the MMIO region MUST be a non-DMA-capable region.
#3: the stolen memory MIGHT be a DMA-capable or a non-DMA-capable region, depending upon the silicon design.
#4: the normal OS-accessible memory -- including ACPI reclaim, ACPI NVS, and reserved memory not covered by #3 -- MUST be a DMA-capable region.
As such, IOMMU protection is NOT required for #1 and #2. IOMMU protection MIGHT be required for #3 and MUST be required for #4.
I assume the virtual environment is designed in the same way. Please correct me if I am wrong.
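
To make #1 concrete for the 0x38000 case discussed above, here is a sketch
of programming the VT-d Protected Low-Memory Region so that the range
becomes non-DMA capable. The register offsets come from the VT-d
specification, but the MMIO helpers and the base/limit values are
illustrative only; real firmware must honor the implementation-specific
alignment of PLMBASE/PLMLIMIT:

  #include <stdint.h>

  /* VT-d DMAR register offsets (per the VT-d specification). */
  #define PMEN_REG       0x64          /* Protected Memory Enable      */
  #define PLMBASE_REG    0x68          /* Protected Low-Memory Base    */
  #define PLMLIMIT_REG   0x6C          /* Protected Low-Memory Limit   */
  #define PMEN_EPM       (1u << 31)    /* Enable Protected Memory      */
  #define PMEN_PRS       (1u << 0)     /* Protected Region Status      */

  static void mmio_write32(volatile uint8_t *bar, uint32_t off, uint32_t v)
  {
      *(volatile uint32_t *)(bar + off) = v;
  }

  static uint32_t mmio_read32(volatile uint8_t *bar, uint32_t off)
  {
      return *(volatile uint32_t *)(bar + off);
  }

  /* Sketch: block DMA to the low range containing 0x38000; 'dmar' is
   * the MMIO base of one DMAR unit (platform-specific). Alignment of
   * base/limit is implementation-defined -- illustrative values only. */
  static void protect_low_ram(volatile uint8_t *dmar)
  {
      mmio_write32(dmar, PLMBASE_REG,  0x00030000);
      mmio_write32(dmar, PLMLIMIT_REG, 0x0003FFFF);
      mmio_write32(dmar, PMEN_REG, PMEN_EPM);
      while (!(mmio_read32(dmar, PMEN_REG) & PMEN_PRS)) {
          ;  /* wait until hardware reports the region as protected */
      }
  }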



> > That feature (VFIO) does rely on the (physical) IOMMU, and
> > it makes sure that the assigned device can only access physical frames
> > that belong to the virtual machine that the device is assigned to.
[Jiewen] Thank you! Good to know.
I found https://www.kernel.org/doc/Documentation/vfio.txt
Is that what you described above?
Anyway, I believe the problem is clear, and the solution in the real world is clear.
I will leave the virtual world discussion to Alex, Paolo, and Laszlo.
If you need any of my input, please let me know.



> > However, as far as I know, VFIO doesn't try to restrict PCI DMA to
> > subsets of guest RAM... I could be wrong about that; I vaguely recall
> > RMRR support, which seems somewhat related.
> >
> > > I agree it is a threat from the real hardware perspective. SMM may
> > > check VT-d to make sure the 38000 region is blocked.
> > > I doubt whether it is a threat in the virtual environment. Do we have
> > > a way to block DMA in the virtual environment?
> >
> > I think that would be a VFIO feature.
> >
> > Alex: if we wanted to block PCI(e) DMA to a specific part of guest RAM
> > (expressed with guest-physical RAM addresses), perhaps permanently,
> > perhaps just for a while -- not sure about coordination though --, could
> > VFIO accommodate that (I guess by "punching holes" in the IOMMU page
> > tables)?
> 
> It depends.  For starters, the vfio mapping API does not allow
> unmapping arbitrary sub-ranges of previous mappings.  So the hole you
> want to punch would need to be independently mapped.  From there you
> get into the issue of whether this range is a potential DMA target.  If
> it is, then this is the path to data corruption.  We cannot interfere
> with the operation of the device and we have little to no visibility of
> active DMA targets.
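
For reference, this is how that constraint shows up in the existing VFIO
type1 userspace API: VFIO_IOMMU_UNMAP_DMA cannot split a prior mapping, so
a hole at 0x38000 is only possible if the surrounding ranges were mapped
as separate entries to begin with. A sketch, with made-up addresses and
with container/group setup omitted:

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Map guest RAM in two pieces so the page at IOVA 0x38000 can later
   * be unmapped on its own. vaddr0/vaddr1 are the process virtual
   * addresses backing the two guest-physical ranges (made-up layout). */
  static int map_around_hole(int container, void *vaddr0, void *vaddr1)
  {
      struct vfio_iommu_type1_dma_map map;

      memset(&map, 0, sizeof(map));
      map.argsz = sizeof(map);
      map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

      map.vaddr = (uintptr_t)vaddr0;            /* [0x0, 0x38000) */
      map.iova  = 0;
      map.size  = 0x38000;
      if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
          return -1;

      map.vaddr = (uintptr_t)vaddr1;            /* [0x39000, 1 MiB) */
      map.iova  = 0x39000;                      /* hole: one 4 KiB page */
      map.size  = 0x100000 - 0x39000;
      return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
  }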
> 
> If we're talking about RAM that is never a DMA target, perhaps e820
> reserved memory, then we can make sure certain MemoryRegions are
> skipped when mapped by QEMU and would expect the guest to never map
> them through a vIOMMU as well.  Maybe then it's a question of where
> we're trying to provide security (it might be more difficult if QEMU
> needs to sanitize vIOMMU mappings to actively prevent mapping
> reserved areas).
> 
> Is there anything unique about the VM case here?  Bare metal SMM needs
> to be concerned about protecting itself from I/O devices that operate
> outside of the realm of SMM mode as well, right?  Is something "simple"
> like an AddressSpace switch necessary here, such that an I/O device
> always has a mapping to a safe guest RAM page while the vCPU
> AddressSpace can switch to some protected page?  The IOMMU and vCPU
> mappings don't need to be the same.  The vCPU is more under our control
> than the assigned device.
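
That split-view idea could be expressed with QEMU's memory API roughly as
follows. The API calls (memory_region_init_alias,
memory_region_add_subregion_overlap, memory_region_set_enabled) are real,
but the region layout and both functions are a hypothetical sketch, not
how QEMU currently handles 0x38000:

  /* Hypothetical sketch against QEMU's memory API (qemu/osdep.h,
   * exec/memory.h): overlay a protected page at 0x38000 in the vCPU
   * view only; the device/IOMMU address space keeps seeing plain RAM. */
  static MemoryRegion smram_at_38000;

  static void install_cpu_only_view(MemoryRegion *cpu_root,
                                    MemoryRegion *protected_backing)
  {
      /* Alias the protected backing store over 0x38000 in the vCPU
       * root region; priority 1 beats the underlying RAM. */
      memory_region_init_alias(&smram_at_38000, NULL, "smram-at-38000",
                               protected_backing, 0, 0x1000);
      memory_region_add_subregion_overlap(cpu_root, 0x38000,
                                          &smram_at_38000, 1);
      memory_region_set_enabled(&smram_at_38000, false);
  }

  /* Toggle: vCPUs see the protected page; assigned devices still see
   * the safe RAM page, so in-flight DMA is never redirected. */
  static void set_cpu_view_protected(bool on)
  {
      memory_region_set_enabled(&smram_at_38000, on);
  }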
> 
> FWIW, RMRRs are a VT-d specific mechanism to define an address range as
> persistently identity-mapped for one or more devices.  IOW, the device
> would always map that range.  I don't think that's what you're after
> here.  RMRRs are also an abomination that I hope we never find a
> requirement for in a VM.  Thanks,
> 
> Alex

