From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:53226)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yl2OP-0002u9-SD
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 17:41:15 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yl2OM-0004E0-K4
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 17:41:13 -0400
Received: from mx1.redhat.com ([209.132.183.28]:39906)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yl2OM-0004Di-BK
	for qemu-devel@nongnu.org; Wed, 22 Apr 2015 17:41:10 -0400
Received: from int-mx10.intmail.prod.int.phx2.redhat.com
	(int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t3MLf91L031790
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256
	verify=FAIL)
	for <qemu-devel@nongnu.org>; Wed, 22 Apr 2015 17:41:09 -0400
Message-ID: <55381571.4000201@redhat.com>
Date: Wed, 22 Apr 2015 23:41:05 +0200
From: Laszlo Ersek <lersek@redhat.com>
MIME-Version: 1.0
References: <1429521560-2743-1-git-send-email-kraxel@redhat.com>	
	<1429521560-2743-5-git-send-email-kraxel@redhat.com>	
	<55365C33.2090101@redhat.com>
	<1429628650.21164.24.camel@nilsson.home.kraxel.org>
In-Reply-To: <1429628650.21164.24.camel@nilsson.home.kraxel.org>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 5/6] [wip] tseg, part1, not (yet) tested
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Gerd Hoffmann <kraxel@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, mst@redhat.com

On 04/21/15 17:04, Gerd Hoffmann wrote:
>>> +static const MemoryRegionOps tseg_blackhole_ops = {
>>> +    .read = tseg_blackhole_read,
>>> +    .write = tseg_blackhole_write,
>>> +    .endianness = DEVICE_NATIVE_ENDIAN,
>>> +    .valid.min_access_size = 1,
>>> +    .valid.max_access_size = 4,
>>> +    .impl.min_access_size = 4,
>>> +    .impl.max_access_size = 4,
>>> +    .endianness = DEVICE_LITTLE_ENDIAN,
>>> +};
>>
>> (a) you've specified .endianness twice (with inconsistent initializers
>> at that, although the value returned by the accessor happens to be
>> unaffected).
> 
> Oops, cut+paste bug, will fix (/me wonders why gcc didn't throw an error
> on that one).
> 
>> (b) Why are 8-byte accesses forbidden? ...
> 
> just a matter of setting .valid.max_access_size = 8, can do that of
> course.
> 
>>> +
>>>  /* PCIe MMCFG */
>>>  static void mch_update_pciexbar(MCHPCIState *mch)
>>>  {
>>> @@ -267,6 +289,7 @@ static void mch_update_smram(MCHPCIState *mch)
>>>  {
>>>      PCIDevice *pd = PCI_DEVICE(mch);
>>>      bool h_smrame = (pd->config[MCH_HOST_BRIDGE_ESMRAMC] & MCH_HOST_BRIDGE_ESMRAMC_H_SMRAME);
>>> +    uint32_t tseg_size;
>>>  
>>>      /* implement SMRAM.D_LCK */
>>>      if (pd->config[MCH_HOST_BRIDGE_SMRAM] & MCH_HOST_BRIDGE_SMRAM_D_LCK) {
>>> @@ -296,6 +319,32 @@ static void mch_update_smram(MCHPCIState *mch)
>>>          memory_region_set_enabled(&mch->high_smram, false);
>>>      }
>>>  
>>> +    if (pd->config[MCH_HOST_BRIDGE_ESMRAMC] & MCH_HOST_BRIDGE_ESMRAMC_T_EN) {
>>> +        switch (pd->config[MCH_HOST_BRIDGE_ESMRAMC] &
>>> +                MCH_HOST_BRIDGE_ESMRAMC_TSEG_SZ_MASK) {
>>> +        case MCH_HOST_BRIDGE_ESMRAMC_TSEG_SZ_1MB:
>>> +            tseg_size = 1024 * 1024;
>>> +            break;
>>> +        case MCH_HOST_BRIDGE_ESMRAMC_TSEG_SZ_2MB:
>>> +            tseg_size = 1024 * 1024 * 2;
>>> +            break;
>>> +        case MCH_HOST_BRIDGE_ESMRAMC_TSEG_SZ_8MB:
>>> +            tseg_size = 1024 * 1024 * 8;
>>> +            break;
>>> +        default:
>>> +            tseg_size = 0;
>>> +            break;
>>> +        }
>>> +    } else {
>>> +        tseg_size = 0;
>>> +    }
>>
>> - so, is the guest supposed to select the tseg size by writing this
>> register?
> 
> Correct.
> 
>> - I assume this register is not reprogrammable once SMRAM is locked --
>> is that correct?
> 
> Correct.
> 
>> - can the guest somehow use this facility to detect, dynamically, the
>> presence of this feature? (For now I'm thinking about a static build
>> flag for OVMF that would enable SMM support, but I'm 99% sure Jordan
>> will object and ask for a dynamic feature detection.)
> 
> Hmm.  I think if we merge all the smm / smram / tseg patches in one go
> this should work.  Without patches reading the ESMRAMC register returns
> zero.  With the patches the three cache-disable bits are hardcoded to
> '1'.  This should work for detecting tseg support.
> 
> You also have to test for smm support.  The current protocol for this is
> that seabios checks whenever smm is already initialized (see
> *_apmc_smm_setup() in seabios/src/fw/smm.c) and if so it skips smm
> initialization.  Right now qemu sets the bit on reset when running on
> kvm, so seabios doesn't try to use smm.  On tcg the bit is clear after
> reset and seabios actually uses smm mode.

I started looking into this. Basically the condition for SMM/SMRAM
support is:

  Q35 && SMRAM support && SMM support           (1)

The first subcondition can be checked via the host bridge devid (and
OVMF already does, for other purposes).

The second one relies on the ESMRAMC, which is in PCI config space.

The third one is messy. It relies on SMI_EN, which is an ioport at
PMBA+0x30. It requires a configured PMBA.

The problem for OVMF is the following: this condition is too complex /
too intrusive to evaluate in order to see whether Q35+SMM+SMRAM are
available.

For that reason, I'd like to ask if it would be possible to introduce a
new fw_cfg file that would simply communicate the end result of the
above expression.

Long version:

If the above condition evaluates to TRUE, we assume that the user wants
"security" -- ie. it wants to prevent a "malicious" runtime guest OS
form messing with OVMF's Secure Boot feature. Good.

However, at the moment, OVMF has a large number of holes that a
malicious runtime guest OS can exploit. The S3 LockBox and the varstore
pflash chip are most frequently mentioned, but there are more; in
particular I'm referring to the other special memory ranges OVMF uses.

(Please refer to the

  A comprehensive memory map of OVMF

section of the OVMF whitepaper
<http://people.redhat.com/~lersek/ovmf-whitepaper-c770f8c.txt> for the
following.  It describes in detail what special memory ranges OVMF has,
and the life cycle of each.)

These ranges are adequately protected against a *benign* guest runtime
OS; the UEFI memory map makes sure that such a runtime OS doesn't
accidentally overwrite stuff that OVMF will either *need* for S3 resume
in pristine form from the first cold boot, or will *overwrite* during S3
resume as scratch space (thereby potentially destroying OS data).

But a malicious guest OS can nonetheless overwrite things in these
special areas that OVMF will run (or use) during S3, before it locks
down SMRAM again.

Some of those holes / inappropriately protected memory ranges are tied
to the SEC phase, which is the earliest and least capable UEFI phase.
For example, refer to:

+--------------------------+ 9216 KB       PcdOvmfDxeMemFvBase
|                          |
|PEIFV from FVMAIN_COMPACT | size: 896 KB  PcdOvmfPeiMemFvSize
|  decompressed firmware   |
| volume with PEI modules  |
|                          |
+--------------------------+ 8320 KB       PcdOvmfPeiMemFvBase

"(6) PEIFV -- decompressed firmware volume with PEI modules".

OVMF is an unconventional UEFI firmware in the sense that it runs *only*
the SEC phase from (unmodifiable, read-only) flash. The PEI and DXE
firmware volumes are decompressed from flash to RAM by SEC, and then PEI
and DXE modules run from RAM.

(It is more conventional to run PEI modules from the flash as well, not
just SEC. The way OVMF does it allows it to save space in the pflash,
because PEI modules are thus allowed to exist only in compressed (as
opposed to directly executable) form in the pflash.)

In order to save some cycles, OVMF's SEC decompresses the PEI and DXE
firmware volumes (in one go) to RAM after a real reset only, and then
"protects" the decompressed PEI firmware volume in RAM via the UEFI
memory map (see above). Then, at S3 resume, SEC simply reuses the
already decompressed (and "protected") PEI firmware volume when handing
off to PEI.

Obviously, if a guest OS pokes some foreign code into this decompressed
PEI firmware volume before S3 suspend, then OVMF will run that foreign
code at S3 resume, with the SMRAM open.

Therefore OVMF has to forego such optimizations even as early as in SEC
when condition (1) evaluates to false -- it must decompress the firmware
volumes from pristine pflash over the possibly breached RAM even during
S3 resume. And for that SEC needs to evaluate (1).

This is where the complexity / intrusiveness of (1) becomes a problem.
Programming the PMBA in SEC just so I can read SMI_EN (for the sake of
evaluation (1)) is something I'd like to avoid. We already have a nice
fw_cfg client library that is usable in SEC; that would be my choice for
learning about condition (1).

I'll audit the rest of the special memory regions too, of course. In any
case, exposing this QEMU capability / configuration (see also Paolo's
-machine smm=on/off idea) via fw_cfg seems both appropriate for fw_cfg
(after all it does configure the firmware) and the least intrusive for OVMF.

What do you think?

Thanks!
Laszlo


>>> +    memory_region_set_enabled(&mch->tseg_blackhole, tseg_size);
>>> +    memory_region_set_size(&mch->tseg_blackhole, tseg_size);
>>> +    memory_region_add_subregion_overlap(mch->system_memory,
>>> +                                        mch->below_4g_mem_size - tseg_size,
>>> +                                        &mch->tseg_blackhole, 1);
>>
>> So, does tseg overlay the last few MB of the RAM below 4GB? (Ie. right
>> before the PCI hole starts?)
> 
> tseg is just normal ram (yes, located at the end of memory), but (once
> tseg is enabled) only cpus in smm mode are allowed to access it.
> Likewise busmaster dma access is rejected, so non-smm code can't use the
> sata controller to access this indirectly.
> 
>>> +    memory_region_init_io(&mch->tseg_blackhole, OBJECT(mch),
>>> +                          &tseg_blackhole_ops, NULL,
>>> +                          "tseg-blackhole", 0);
>>
>> Okay, this answers one of the above, tseg_blackhole is never uninitialized.
>>
>>> +    memory_region_set_enabled(&mch->tseg_blackhole, false);
>>> +    memory_region_add_subregion_overlap(mch->system_memory,
>>> +                                        mch->below_4g_mem_size,
>>> +                                        &mch->tseg_blackhole, 1);
>>
>> The address (2nd argument) is inconsistent with the one in
>> mch_update_smram() -- this function call places the tseg black hole at
>> the beginning of the 32-bit PCI hole, not just below it.
> 
> It's a zero-length hole here, therefore it actually is the same.  But I
> didn't bother to use "mch->below_4g_mem_size - 0" as start address to
> make that more clear ;)
> 
>> (a) place the permanent PEI RAM so that it ends at the top of
>> "below_4g_mem_size"; PublishPeiMemory()
>> (b) create one of the memory HOBs, QemuInitializeRam(),
>> (c) create the MMIO HOB that describes the 32-bit PCI hole,
>>     MemMapInitialization()
>>
>> If the last 1 / 2 / 8 MB of the low RAM can't be used as general purpose
>> RAM, then (a) is affected (it should simply be sunk by the tseg size of
>> choice); (b) is affected similarly (just decrease the top of the memory
>> HOB);
> 
> Yes.
> 
> cheers,
>   Gerd
> 
>