* [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI
@ 2017-01-12 18:24 Laszlo Ersek
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-12 18:24 UTC (permalink / raw)
  To: qemu devel list
  Cc: Michael S. Tsirkin, Eduardo Habkost, Gerd Hoffmann,
	Igor Mammedov, Paolo Bonzini

Previous version (v5):
http://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg01902.html

Changes in v6 (see the individual patches for details):
- pick up feedback tags
- rename "smi_broadcast" to "x-smi-broadcast" [Eduardo]

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>

Thanks!
Laszlo

Laszlo Ersek (3):
  hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  hw/isa/lpc_ich9: add broadcast SMI feature
  hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types

 include/hw/i386/ich9.h | 13 ++++++++
 include/hw/i386/pc.h   |  5 +++
 hw/isa/lpc_ich9.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 108 insertions(+), 1 deletion(-)

-- 
2.9.3


* [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-12 18:24 [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Laszlo Ersek
@ 2017-01-12 18:24 ` Laszlo Ersek
  2017-01-13 10:15   ` Igor Mammedov
  2017-01-13 13:34   ` Igor Mammedov
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature Laszlo Ersek
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-12 18:24 UTC (permalink / raw)
  To: qemu devel list
  Cc: Michael S. Tsirkin, Gerd Hoffmann, Igor Mammedov, Paolo Bonzini

Introduce the following fw_cfg files:

- "etc/smi/supported-features": a little endian uint64_t feature bitmap,
  presenting the features known by the host to the guest. Read-only for
  the guest.

  The content of this file will be determined via bit-granularity ICH9-LPC
  device properties, to be introduced later. For now, the bitmask is left
  zeroed. The bits will be set from machine type compat properties and on
  the QEMU command line, hence this file is not migrated.

- "etc/smi/requested-features": a little endian uint64_t feature bitmap,
  representing the features the guest would like to request. Read-write
  for the guest.

  The guest can freely (re)write this file; doing so has no direct
  consequence. Initial value is zero. A nonzero value causes the SMI-related
  fw_cfg files and fields that are under guest influence to be migrated.

- "etc/smi/features-ok": contains a uint8_t value, and it is read-only for
  the guest. When the guest selects the associated fw_cfg key, the guest
  features are validated against the host features. In case of error, the
  negotiation doesn't proceed, and the "features-ok" file remains zero. In
  case of success, the "features-ok" file becomes (uint8_t)1, and the
  negotiated features are locked down internally (no further changes are
  possible until reset).

  The initial value is zero.  A nonzero value causes the SMI-related
  fw_cfg files and fields that are under guest influence to be migrated.

The C-language fields backing the "supported-features" and
"requested-features" files are uint8_t arrays. This is because they carry
guest-side representation (our choice is little endian), while
VMSTATE_UINT64() assumes / implies host-side endianness for any uint64_t
fields. If we migrate a guest between hosts with different endiannesses
(which is possible with TCG), then the host-side value is preserved, and
the host-side representation is translated. This would be visible to the
guest through fw_cfg, unless we used plain byte arrays. So we do.
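For concreteness, the negotiation step behind "etc/smi/features-ok" can be
modeled as a small stand-alone sketch; the type and helper names below are
illustrative only, not QEMU code:

```c
#include <stdint.h>

/* Hypothetical stand-alone model of the "features-ok" selection step; the
 * field names mirror the patch, but this is not the QEMU implementation. */
typedef struct {
    uint64_t smi_host_features;        /* guest-invisible, host endian */
    uint8_t  smi_guest_features_le[8]; /* guest-written, little endian */
    uint8_t  smi_features_ok;          /* becomes 1 on successful lockdown */
    uint64_t smi_negotiated_features;  /* guest-invisible, host endian */
} SmiNegotiator;

/* Decode a little endian byte array into a host-endian uint64_t,
 * regardless of the host's own endianness. */
static uint64_t le64_to_host(const uint8_t le[8])
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | le[i];
    }
    return v;
}

/* Validate the guest request against the host features, as in
 * smi_features_ok_callback(): any bit the host did not offer leaves
 * features_ok at zero; a valid subset is locked down until reset. */
static void features_ok_select(SmiNegotiator *s)
{
    uint64_t guest_features;

    if (s->smi_features_ok) {
        return; /* negotiation already complete, features locked */
    }
    guest_features = le64_to_host(s->smi_guest_features_le);
    if (guest_features & ~s->smi_host_features) {
        return; /* invalid request, leave features_ok at zero */
    }
    s->smi_negotiated_features = guest_features;
    s->smi_features_ok = 1;
}
```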

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---

Notes:
    v6:
    - no changes, pick up Michael's R-b
    
    v5:
    - rename the "etc/smi/host-features" fw_cfg file to
      "etc/smi/supported-features" [Igor]
    
    - rename the "etc/smi/guest-features" fw_cfg file to
      "etc/smi/requested-features" [Igor]
    
    - suffix the names of the "ICH9LPCState.smi_host_features" and
      "ICH9LPCState.smi_guest_features" array fields with "_le" for
      representing their guest-visible encoding [Igor]
    
    - Replace the "smi_host_features" parameter of ich9_lpc_pm_init() --
      which was meant in v4 to be set by board code -- with a new
      "ICH9LPCState.smi_host_features" field, of type uint64_t.
      Bit-granularity ICH9-LPC device properties will be carved out of this
      field. [Igor]
    
    - Given the "ICH9LPCState.smi_host_features" uint64_t field, we can now
      use that directly for feature validation in
      smi_features_ok_callback(). Converting the (otherwise guest-read-only)
      "ICH9LPCState.smi_host_features_le" array back to CPU endianness just
      for this is no longer necessary.

 include/hw/i386/ich9.h | 10 +++++++
 hw/isa/lpc_ich9.c      | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
index 5fd7e97d2347..da1118727146 100644
--- a/include/hw/i386/ich9.h
+++ b/include/hw/i386/ich9.h
@@ -64,6 +64,16 @@ typedef struct ICH9LPCState {
     uint8_t rst_cnt;
     MemoryRegion rst_cnt_mem;
 
+    /* SMI feature negotiation via fw_cfg */
+    uint64_t smi_host_features;       /* guest-invisible, host endian */
+    uint8_t smi_host_features_le[8];  /* guest-visible, read-only, little
+                                       * endian uint64_t */
+    uint8_t smi_guest_features_le[8]; /* guest-visible, read-write, little
+                                       * endian uint64_t */
+    uint8_t smi_features_ok;          /* guest-visible, read-only; selecting it
+                                       * triggers feature lockdown */
+    uint64_t smi_negotiated_features; /* guest-invisible, host endian */
+
     /* isa bus */
     ISABus *isa_bus;
     MemoryRegion rcrb_mem; /* root complex register block */
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 10d1ee8b9310..376b7801a42c 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -48,6 +48,8 @@
 #include "exec/address-spaces.h"
 #include "sysemu/sysemu.h"
 #include "qom/cpu.h"
+#include "hw/nvram/fw_cfg.h"
+#include "qemu/cutils.h"
 
 /*****************************************************************************/
 /* ICH9 LPC PCI to ISA bridge */
@@ -360,13 +362,62 @@ static void ich9_set_sci(void *opaque, int irq_num, int level)
     }
 }
 
+static void smi_features_ok_callback(void *opaque)
+{
+    ICH9LPCState *lpc = opaque;
+    uint64_t guest_features;
+
+    if (lpc->smi_features_ok) {
+        /* negotiation already complete, features locked */
+        return;
+    }
+
+    memcpy(&guest_features, lpc->smi_guest_features_le, sizeof guest_features);
+    le64_to_cpus(&guest_features);
+    if (guest_features & ~lpc->smi_host_features) {
+        /* guest requests invalid features, leave @features_ok at zero */
+        return;
+    }
+
+    /* valid feature subset requested, lock it down, report success */
+    lpc->smi_negotiated_features = guest_features;
+    lpc->smi_features_ok = 1;
+}
+
 void ich9_lpc_pm_init(PCIDevice *lpc_pci, bool smm_enabled)
 {
     ICH9LPCState *lpc = ICH9_LPC_DEVICE(lpc_pci);
     qemu_irq sci_irq;
+    FWCfgState *fw_cfg = fw_cfg_find();
 
     sci_irq = qemu_allocate_irq(ich9_set_sci, lpc, 0);
     ich9_pm_init(lpc_pci, &lpc->pm, smm_enabled, sci_irq);
+
+    if (lpc->smi_host_features && fw_cfg) {
+        uint64_t host_features_le;
+
+        host_features_le = cpu_to_le64(lpc->smi_host_features);
+        memcpy(lpc->smi_host_features_le, &host_features_le,
+               sizeof host_features_le);
+        fw_cfg_add_file(fw_cfg, "etc/smi/supported-features",
+                        lpc->smi_host_features_le,
+                        sizeof lpc->smi_host_features_le);
+
+        /* The other two guest-visible fields are cleared on device reset, we
+         * just link them into fw_cfg here.
+         */
+        fw_cfg_add_file_callback(fw_cfg, "etc/smi/requested-features",
+                                 NULL, NULL,
+                                 lpc->smi_guest_features_le,
+                                 sizeof lpc->smi_guest_features_le,
+                                 false);
+        fw_cfg_add_file_callback(fw_cfg, "etc/smi/features-ok",
+                                 smi_features_ok_callback, lpc,
+                                 &lpc->smi_features_ok,
+                                 sizeof lpc->smi_features_ok,
+                                 true);
+    }
+
     ich9_lpc_reset(&lpc->d.qdev);
 }
 
@@ -507,6 +558,10 @@ static void ich9_lpc_reset(DeviceState *qdev)
 
     lpc->sci_level = 0;
     lpc->rst_cnt = 0;
+
+    memset(lpc->smi_guest_features_le, 0, sizeof lpc->smi_guest_features_le);
+    lpc->smi_features_ok = 0;
+    lpc->smi_negotiated_features = 0;
 }
 
 /* root complex register block is mapped into memory space */
@@ -668,6 +723,29 @@ static const VMStateDescription vmstate_ich9_rst_cnt = {
     }
 };
 
+static bool ich9_smi_feat_needed(void *opaque)
+{
+    ICH9LPCState *lpc = opaque;
+
+    return !buffer_is_zero(lpc->smi_guest_features_le,
+                           sizeof lpc->smi_guest_features_le) ||
+           lpc->smi_features_ok;
+}
+
+static const VMStateDescription vmstate_ich9_smi_feat = {
+    .name = "ICH9LPC/smi_feat",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = ich9_smi_feat_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT8_ARRAY(smi_guest_features_le, ICH9LPCState,
+                            sizeof(uint64_t)),
+        VMSTATE_UINT8(smi_features_ok, ICH9LPCState),
+        VMSTATE_UINT64(smi_negotiated_features, ICH9LPCState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_ich9_lpc = {
     .name = "ICH9LPC",
     .version_id = 1,
@@ -683,6 +761,7 @@ static const VMStateDescription vmstate_ich9_lpc = {
     },
     .subsections = (const VMStateDescription*[]) {
         &vmstate_ich9_rst_cnt,
+        &vmstate_ich9_smi_feat,
         NULL
     }
 };
-- 
2.9.3


* [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-12 18:24 [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Laszlo Ersek
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
@ 2017-01-12 18:24 ` Laszlo Ersek
  2017-01-13 13:09   ` Igor Mammedov
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types Laszlo Ersek
  2017-01-18 19:02 ` [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Michael S. Tsirkin
  3 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-12 18:24 UTC (permalink / raw)
  To: qemu devel list
  Cc: Michael S. Tsirkin, Gerd Hoffmann, Igor Mammedov, Paolo Bonzini

The generic edk2 SMM infrastructure prefers
EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
Trigger() only brings the current processor into SMM, then edk2 handles it
in the following ways:

(1) If Trigger() is executed by the BSP (which is guaranteed before
    ExitBootServices(), but is not necessarily true at runtime), then:

    (a) If edk2 has been configured for "traditional" SMM synchronization,
        then the BSP sends directed SMIs to the APs with APIC delivery,
        bringing them into SMM individually. Then the BSP runs the SMI
        handler / dispatcher.

    (b) If edk2 has been configured for "relaxed" SMM synchronization,
        then the APs that are not already in SMM are not brought in, and
        the BSP runs the SMI handler / dispatcher.

(2) If Trigger() is executed by an AP (which is possible after
    ExitBootServices(), and can be forced e.g. by "taskset -c 1
    efibootmgr"), then the AP in question brings in the BSP with a
    directed SMI, and the BSP runs the SMI handler / dispatcher.

The smaller problem with (1a) and (2) is that the BSP and AP
synchronization is slow. For example, the "taskset -c 1 efibootmgr"
command from (2) can take more than 3 seconds to complete, because
efibootmgr accesses non-volatile UEFI variables intensively.

The larger problem is that QEMU's current behavior diverges from the
behavior usually seen on physical hardware, and that keeps exposing
obscure corner cases, race conditions and other instabilities in edk2,
which generally expects / prefers a software SMI to affect all CPUs at
once.

Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
the SMI on all VCPUs.

While the original posting of this patch
<http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
only intended to speed up (2), based on our recent "stress testing" of SMM
this patch actually provides functional improvements.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---

Notes:
    v6:
    - no changes, pick up Michael's R-b
    
    v5:
    - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
      ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
      DEFINE_PROP_BIT() in the next patch)

 include/hw/i386/ich9.h |  3 +++
 hw/isa/lpc_ich9.c      | 10 +++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
index da1118727146..18dcca7ebcbf 100644
--- a/include/hw/i386/ich9.h
+++ b/include/hw/i386/ich9.h
@@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
 #define ICH9_SMB_HST_D1                         0x06
 #define ICH9_SMB_HOST_BLOCK_DB                  0x07
 
+/* bit positions used in fw_cfg SMI feature negotiation */
+#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
+
 #endif /* HW_ICH9_H */
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 376b7801a42c..ced6f803a4f2 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
 
     /* SMI_EN = PMBASE + 30. SMI control and enable register */
     if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
-        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
+        if (lpc->smi_negotiated_features &
+            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
+            CPUState *cs;
+            CPU_FOREACH(cs) {
+                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
+            }
+        } else {
+            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
+        }
     }
 }
 
-- 
2.9.3


* [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types
  2017-01-12 18:24 [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Laszlo Ersek
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature Laszlo Ersek
@ 2017-01-12 18:24 ` Laszlo Ersek
  2017-01-13 13:36   ` Igor Mammedov
  2017-01-18 19:02 ` [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Michael S. Tsirkin
  3 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-12 18:24 UTC (permalink / raw)
  To: qemu devel list
  Cc: Michael S. Tsirkin, Eduardo Habkost, Gerd Hoffmann,
	Igor Mammedov, Paolo Bonzini

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---

Notes:
    v6:
    - rename "smi_broadcast" to "x-smi-broadcast" [Eduardo]
    - pick up Eduardo's R-b
    - pick up Michael's R-b
    
    v5:
    - replace the v4 patch "hw/i386/pc_q35: advertise broadcast SMI if VCPU
      hotplug is turned off" with a simple device property and compat
      setting [Igor]

 include/hw/i386/pc.h | 5 +++++
 hw/isa/lpc_ich9.c    | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 230e9e70c504..9d950733abdf 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -376,6 +376,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
 #define PC_COMPAT_2_8 \
     HW_COMPAT_2_8 \
+    {\
+        .driver   = "ICH9-LPC",\
+        .property = "x-smi-broadcast",\
+        .value    = "off",\
+    },\
 
 
 #define PC_COMPAT_2_7 \
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index ced6f803a4f2..59930dd9d09d 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -776,6 +776,8 @@ static const VMStateDescription vmstate_ich9_lpc = {
 
 static Property ich9_lpc_properties[] = {
     DEFINE_PROP_BOOL("noreboot", ICH9LPCState, pin_strap.spkr_hi, true),
+    DEFINE_PROP_BIT64("x-smi-broadcast", ICH9LPCState, smi_host_features,
+                      ICH9_LPC_SMI_F_BROADCAST_BIT, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.9.3


* Re: [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
@ 2017-01-13 10:15   ` Igor Mammedov
  2017-01-13 11:24     ` Laszlo Ersek
  2017-01-13 13:34   ` Igor Mammedov
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-13 10:15 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann, Paolo Bonzini

On Thu, 12 Jan 2017 19:24:44 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> Introduce the following fw_cfg files:
> 
> - "etc/smi/supported-features": a little endian uint64_t feature bitmap,
>   presenting the features known by the host to the guest. Read-only for
>   the guest.
> 
>   The content of this file will be determined via bit-granularity ICH9-LPC
>   device properties, to be introduced later. For now, the bitmask is left
>   zeroed. The bits will be set from machine type compat properties and on
>   the QEMU command line, hence this file is not migrated.
> 
> - "etc/smi/requested-features": a little endian uint64_t feature bitmap,
>   representing the features the guest would like to request. Read-write
>   for the guest.
> 
>   The guest can freely (re)write this file, it has no direct consequence.
>   Initial value is zero. A nonzero value causes the SMI-related fw_cfg
>   files and fields that are under guest influence to be migrated.
> 
> - "etc/smi/features-ok": contains a uint8_t value, and it is read-only for
>   the guest. When the guest selects the associated fw_cfg key, the guest
>   features are validated against the host features. In case of error, the
>   negotiation doesn't proceed, and the "features-ok" file remains zero. In
>   case of success, the "features-ok" file becomes (uint8_t)1, and the
>   negotiated features are locked down internally (no further changes are
>   possible until reset).
> 
>   The initial value is zero.  A nonzero value causes the SMI-related
>   fw_cfg files and fields that are under guest influence to be migrated.
I'm still not quite sure we need all this negotiation machinery, with
all the complexity it brings in, when looking at it from the CPU hotplug
point of view.

Paolo mentioned the following security implications:
 1: The OS could trigger a broadcast SMI and try to hijack the SMI handler
    at the not-yet-relocated default SMI base of a hotplugged CPU.
    That [c|sh]ould be handled by the firmware protecting the default SMI
    base.
 2: Even if the firmware protected the default SMI base, the OS could still
    cause undefined behavior by sending a broadcast SMI when more than
    one CPU has been hotplugged, which would make the unconfigured CPUs
    use the same SMI base simultaneously.
    Paolo suggested that real HW avoids the issue by keeping hotplugged CPUs
    "parked" until the firmware unparks them from its SMI handler.
    So that adds one more piece of runtime state to migrate, and one more
    QEMU-guest ABI knob to maintain, even if we ignore that there is no
    such term as "(un)parked" in the SDM.

How about considering a simpler alternative approach:
 * QEMU provides only the "etc/smi/supported-features" file, with SMI
   broadcast advertised in it (no need to migrate anything)
 * the firmware takes care of #1 by protecting the default SMI base and
   using broadcast SMI if "etc/smi/supported-features" advertises it.
 * and QEMU could deal with issue #2 by just crashing the guest as it tries
   to invoke undefined behavior (i.e. check that there is only one CPU with
   the default SMI base, and crash if there are more).
With this approach there is no need to negotiate or migrate extra state,
and inventing an unSPECed "unpark" knob for CPU hotplug could be avoided
(i.e. less QEMU-guest ABI to maintain).
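The single-unrelocated-CPU check from the last bullet might be sketched like
this; DEFAULT_SMBASE is the architectural reset value, but the function and
the crash hook it implies are purely illustrative, not existing QEMU code:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_SMBASE 0x30000u /* SMBASE reset value per the SDM */

/* Return true when a broadcast SMI would be undefined behavior under the
 * proposed scheme: more than one CPU still sitting at the default SMBASE
 * would execute the same SMM entry code (SMBASE + 0x8000) and clobber one
 * another's save state. A real implementation would crash the guest here. */
static bool broadcast_smi_is_unsafe(const uint32_t *smbase, int ncpus)
{
    int at_default = 0;

    for (int i = 0; i < ncpus; i++) {
        if (smbase[i] == DEFAULT_SMBASE) {
            at_default++;
        }
    }
    return at_default > 1;
}
```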


[snip]


* Re: [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-13 10:15   ` Igor Mammedov
@ 2017-01-13 11:24     ` Laszlo Ersek
  0 siblings, 0 replies; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-13 11:24 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann, Paolo Bonzini

On 01/13/17 11:15, Igor Mammedov wrote:
> On Thu, 12 Jan 2017 19:24:44 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> Introduce the following fw_cfg files:
>>
>> - "etc/smi/supported-features": a little endian uint64_t feature bitmap,
>>   presenting the features known by the host to the guest. Read-only for
>>   the guest.
>>
>>   The content of this file will be determined via bit-granularity ICH9-LPC
>>   device properties, to be introduced later. For now, the bitmask is left
>>   zeroed. The bits will be set from machine type compat properties and on
>>   the QEMU command line, hence this file is not migrated.
>>
>> - "etc/smi/requested-features": a little endian uint64_t feature bitmap,
>>   representing the features the guest would like to request. Read-write
>>   for the guest.
>>
>>   The guest can freely (re)write this file, it has no direct consequence.
>>   Initial value is zero. A nonzero value causes the SMI-related fw_cfg
>>   files and fields that are under guest influence to be migrated.
>>
>> - "etc/smi/features-ok": contains a uint8_t value, and it is read-only for
>>   the guest. When the guest selects the associated fw_cfg key, the guest
>>   features are validated against the host features. In case of error, the
>>   negotiation doesn't proceed, and the "features-ok" file remains zero. In
>>   case of success, the "features-ok" file becomes (uint8_t)1, and the
>>   negotiated features are locked down internally (no further changes
>>   are possible until reset).
>>
>>   The initial value is zero.  A nonzero value causes the SMI-related
>>   fw_cfg files and fields that are under guest influence to be migrated.
> I'm still not quite sure we need all this negotiation machinery, with
> all the complexity it brings in, when looking at it from the CPU hotplug
> point of view.

It's not VCPU hotplug that necessitates feature negotiation at this
point. The broadcast SMI feature is more foundational than VCPU hotplug.
Broadcast SMI is necessary for *generally* improving correctness and
performance of the edk2 SMM stack, as built into OVMF and run on QEMU,
with VCPU hotplug not even in the picture.

Summarizing the feedback from Paolo, Michael and Kevin O'Connor, the
guidance was clearly that
- we needed feature negotiation,
- it should resemble virtio,
- it should not be some ad-hoc IO port hackery, but a reusable method
  for future firmware stuff that needs negotiation.
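To make the shape of that handshake concrete, here is a rough, self-contained C model of the three-file protocol (the function and variable names are illustrative only, not the actual QEMU or firmware symbols, and the fw_cfg selection is simulated by a direct call):

```c
#include <stdint.h>

/* Simulated backing store for the three fw_cfg files. */
static uint64_t host_features;        /* "etc/smi/supported-features" */
static uint8_t  requested_le[8];      /* "etc/smi/requested-features" */
static uint8_t  features_ok;          /* "etc/smi/features-ok" */
static uint64_t negotiated_features;  /* guest-invisible, host side */

static uint64_t load_le64(const uint8_t *buf)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= (uint64_t)buf[i] << (8 * i);
    }
    return v;
}

static void store_le64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        buf[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Host side: runs when the guest selects "etc/smi/features-ok". */
static void features_ok_selected(void)
{
    uint64_t req = load_le64(requested_le);

    if (features_ok) {
        return;                 /* already locked down, no effect */
    }
    if (req & ~host_features) {
        return;                 /* unknown bit requested: stay at 0 */
    }
    negotiated_features = req;
    features_ok = 1;
}

/* Guest side: request the intersection of what it understands and what
 * the host offers, then check whether lock-down succeeded. */
static int guest_negotiate(uint64_t understood)
{
    store_le64(requested_le, host_features & understood);
    features_ok_selected();     /* stands in for selecting the key */
    return features_ok == 1;
}
```

The point of the model is the one-way latch: requests are validated against the advertised set, and once "features-ok" flips to 1, later rewrites of the request file change nothing.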

> 
> Paolo mentioned following security implications:

Frankly, for the scope of this work, I absolutely don't care about VCPU
hotplug specifically. VCPU hotplug is a feature that will certainly
depend on the robustness and reliability of the edk2 SMM stack, as it is
currently available for use in OVMF. These patches improve that
robustness and reliability.

The only consideration for VCPU hotplug at the moment is that, should it
need some negotiable features, the fw_cfg pattern should be able to
accommodate them. That's all.

Discussing any VCPU hotplug specifics at the moment has no merit, in my
opinion. I have not seen, or run, a single line of edk2 SMM code related
to VCPU hotplug. Whatever theories we make up will hang in the air.
Meanwhile the basic SMM stack doesn't work reliably -- it needs these
patches.

>  1: OS could trigger broadcast SMI and try to hijack SMI handler for not
>     yet relocated default SMI base of hotplugged CPU.
>     That [c|sh]ould be handled by firmware protecting default SMI base.
>  2: Even if firmware protected default SMI base, OS still could
>     cause undefined behavior by sending broadcast SMI in case if more than
>     1 CPU has been hotplugged, which would make unconfigured CPUs
>     use the same SMI base simultaneously.
>     Paolo suggested that real HW avoids the issue by making hotplugged CPUs
>     "parked" until firmware unparks it from its SMI handler.
>     So that adds one more runtime state to migrate and one more qemu-guest
>     ABI knob to maintain, even if we ignore that there is no such term as
>     '(un)parked' in the SDM.
> 
> How about considering an alternative simpler approach:
>  * QEMU provides only "etc/smi/supported-features" file with SMI broadcast
>    (no need to migrate)
>  * firmware takes care of #1 by protecting default SMI base and using
>    broadcast SMI if "etc/smi/supported-features" advertises it.
>  * and QEMU could deal with #2 issue by just crashing guest as it tries
>    to invoke undefined behavior (i.e. check that there is only 1 CPU with
>    default SMI base and crash if there are more).
> With this approach there is no need to negotiate or migrate extra state,
> and inventing an unSPECed unpark knob for CPU hotplug could be avoided
> (i.e. less qemu-guest ABI to maintain).

The virtio-like feature negotiation (with host-features, guest-features,
and features-ok) has been part of the design since v3
<http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg03582.html>.

I'm confused why you are raising such concerns now, for v6, when the
v4->v5 iteration was done mainly to address your -- much welcomed! --
feedback. At that point, you seemed to agree with the general design,
and suggested implementation-level improvements, and I was happy to oblige.

The current protocol can accommodate simpler uses; a simpler protocol
might be less future-proof. Virtio, which is the pattern that the
current fw_cfg design follows, is historically proven, and the resultant
fw_cfg code complexity and migration footprint are not high.

Also, when you say "firmware takes care of #1 by protecting default SMI
base", you might be unknowingly suggesting changes for the SMM core
drivers in edk2 that are simply impossible for me to implement (or even
design).

I wish all people finally understood that UEFI / edk2 is not SeaBIOS,
where you just dig down and do whatever's necessary, in whatever way
that you see fit. edk2 is the reference implementation of the PI and
UEFI specs, which are themselves developed by *closed* standards-bodies.
(The specs are released publicly, but the spec development process is
proprietary.)

The specs facilitate interoperability between *binary* modules,
specifying, and thereby ossifying, low-level, internal interfaces
between components. This is the *polar opposite* of what most people are
used to in open source development (notably, Linux, but SeaBIOS too),
where internal interfaces are subject to continuous change and
improvement. (See "Documentation/stable_api_nonsense.txt".)

In brief, the PI and UEFI specs can be called "standardized cruft" that
exist in order to ensure interop between proprietary vendors.

I have reasonable freedom in modifying platform code (which lives under
OvmfPkg), and minimal freedom for core code -- e.g.
UefiCpuPkg/PiSmmCpuDxeSmm (the most critical SMM driver).

Specifically, the OVMF-side code -- which I've written and tested, but
not posted yet -- that will interface with this QEMU feature, lives
entirely in OvmfPkg. That's why this approach is possible at all.

And it is enabled only by the fact that the PI spec delegates the
EFI_SMM_CONTROL2_PROTOCOL.Trigger() method to individual platforms.
That's why I can have a say in the circumstances of SMI injection.

SMBASE handling is entirely out of the platform's hands. I estimate that
my SMM- and multiprocessing-related patches in edk2 (mainly OvmfPkg, but
also some bugfixes in more core modules) could be approaching eighty or
a hundred by now, and I have not once dealt directly with SMBASE
relocation or protection in that time.

Thus, please let's not change the design at this point.

Thanks
Laszlo

> 
> 
>>
>> The C-language fields backing the "supported-features" and
>> "requested-features" files are uint8_t arrays. This is because they carry
>> guest-side representation (our choice is little endian), while
>> VMSTATE_UINT64() assumes / implies host-side endianness for any uint64_t
>> fields. If we migrate a guest between hosts with different endiannesses
>> (which is possible with TCG), then the host-side value is preserved, and
>> the host-side representation is translated. This would be visible to the
>> guest through fw_cfg, unless we used plain byte arrays. So we do.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>
>> Notes:
>>     v6:
>>     - no changes, pick up Michael's R-b
>>     
>>     v5:
>>     - rename the "etc/smi/host-features" fw_cfg file to
>>       "etc/smi/supported-features" [Igor]
>>     
>>     - rename the "etc/smi/guest-features" fw_cfg file to
>>       "etc/smi/requested-features" [Igor]
>>     
>>     - suffix the names of the "ICH9LPCState.smi_host_features" and
>>       "ICH9LPCState.smi_guest_features" array fields with "_le" for
>>       representing their guest-visible encoding [Igor]
>>     
>>     - Replace the "smi_host_features" parameter of ich9_lpc_pm_init() --
>>       which was meant in v4 to be set by  board code -- with a new
>>       "ICH9LPCState.smi_host_features" field, of type uint64_t.
>>       Bit-granularity ICH9-LPC device properties will be carved out of this
>>       field. [Igor]
>>     
>>     - Given the "ICH9LPCState.smi_host_features" uint64_t field, we can now
>>       use that directly for feature validation in
>>       smi_features_ok_callback(). Converting the (otherwise guest-read-only)
>>       "ICH9LPCState.smi_host_features_le" array back to CPU endianness just
>>       for this is no longer necessary.
>>
>>  include/hw/i386/ich9.h | 10 +++++++
>>  hw/isa/lpc_ich9.c      | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 89 insertions(+)
>>
>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>> index 5fd7e97d2347..da1118727146 100644
>> --- a/include/hw/i386/ich9.h
>> +++ b/include/hw/i386/ich9.h
>> @@ -64,6 +64,16 @@ typedef struct ICH9LPCState {
>>      uint8_t rst_cnt;
>>      MemoryRegion rst_cnt_mem;
>>  
>> +    /* SMI feature negotiation via fw_cfg */
>> +    uint64_t smi_host_features;       /* guest-invisible, host endian */
>> +    uint8_t smi_host_features_le[8];  /* guest-visible, read-only, little
>> +                                       * endian uint64_t */
>> +    uint8_t smi_guest_features_le[8]; /* guest-visible, read-write, little
>> +                                       * endian uint64_t */
>> +    uint8_t smi_features_ok;          /* guest-visible, read-only; selecting it
>> +                                       * triggers feature lockdown */
>> +    uint64_t smi_negotiated_features; /* guest-invisible, host endian */
>> +
>>      /* isa bus */
>>      ISABus *isa_bus;
>>      MemoryRegion rcrb_mem; /* root complex register block */
>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>> index 10d1ee8b9310..376b7801a42c 100644
>> --- a/hw/isa/lpc_ich9.c
>> +++ b/hw/isa/lpc_ich9.c
>> @@ -48,6 +48,8 @@
>>  #include "exec/address-spaces.h"
>>  #include "sysemu/sysemu.h"
>>  #include "qom/cpu.h"
>> +#include "hw/nvram/fw_cfg.h"
>> +#include "qemu/cutils.h"
>>  
>>  /*****************************************************************************/
>>  /* ICH9 LPC PCI to ISA bridge */
>> @@ -360,13 +362,62 @@ static void ich9_set_sci(void *opaque, int irq_num, int level)
>>      }
>>  }
>>  
>> +static void smi_features_ok_callback(void *opaque)
>> +{
>> +    ICH9LPCState *lpc = opaque;
>> +    uint64_t guest_features;
>> +
>> +    if (lpc->smi_features_ok) {
>> +        /* negotiation already complete, features locked */
>> +        return;
>> +    }
>> +
>> +    memcpy(&guest_features, lpc->smi_guest_features_le, sizeof guest_features);
>> +    le64_to_cpus(&guest_features);
>> +    if (guest_features & ~lpc->smi_host_features) {
>> +        /* guest requests invalid features, leave @features_ok at zero */
>> +        return;
>> +    }
>> +
>> +    /* valid feature subset requested, lock it down, report success */
>> +    lpc->smi_negotiated_features = guest_features;
>> +    lpc->smi_features_ok = 1;
>> +}
>> +
>>  void ich9_lpc_pm_init(PCIDevice *lpc_pci, bool smm_enabled)
>>  {
>>      ICH9LPCState *lpc = ICH9_LPC_DEVICE(lpc_pci);
>>      qemu_irq sci_irq;
>> +    FWCfgState *fw_cfg = fw_cfg_find();
>>  
>>      sci_irq = qemu_allocate_irq(ich9_set_sci, lpc, 0);
>>      ich9_pm_init(lpc_pci, &lpc->pm, smm_enabled, sci_irq);
>> +
>> +    if (lpc->smi_host_features && fw_cfg) {
>> +        uint64_t host_features_le;
>> +
>> +        host_features_le = cpu_to_le64(lpc->smi_host_features);
>> +        memcpy(lpc->smi_host_features_le, &host_features_le,
>> +               sizeof host_features_le);
>> +        fw_cfg_add_file(fw_cfg, "etc/smi/supported-features",
>> +                        lpc->smi_host_features_le,
>> +                        sizeof lpc->smi_host_features_le);
>> +
>> +        /* The other two guest-visible fields are cleared on device reset, we
>> +         * just link them into fw_cfg here.
>> +         */
>> +        fw_cfg_add_file_callback(fw_cfg, "etc/smi/requested-features",
>> +                                 NULL, NULL,
>> +                                 lpc->smi_guest_features_le,
>> +                                 sizeof lpc->smi_guest_features_le,
>> +                                 false);
>> +        fw_cfg_add_file_callback(fw_cfg, "etc/smi/features-ok",
>> +                                 smi_features_ok_callback, lpc,
>> +                                 &lpc->smi_features_ok,
>> +                                 sizeof lpc->smi_features_ok,
>> +                                 true);
>> +    }
>> +
>>      ich9_lpc_reset(&lpc->d.qdev);
>>  }
>>  
>> @@ -507,6 +558,10 @@ static void ich9_lpc_reset(DeviceState *qdev)
>>  
>>      lpc->sci_level = 0;
>>      lpc->rst_cnt = 0;
>> +
>> +    memset(lpc->smi_guest_features_le, 0, sizeof lpc->smi_guest_features_le);
>> +    lpc->smi_features_ok = 0;
>> +    lpc->smi_negotiated_features = 0;
>>  }
>>  
>>  /* root complex register block is mapped into memory space */
>> @@ -668,6 +723,29 @@ static const VMStateDescription vmstate_ich9_rst_cnt = {
>>      }
>>  };
>>  
>> +static bool ich9_smi_feat_needed(void *opaque)
>> +{
>> +    ICH9LPCState *lpc = opaque;
>> +
>> +    return !buffer_is_zero(lpc->smi_guest_features_le,
>> +                           sizeof lpc->smi_guest_features_le) ||
>> +           lpc->smi_features_ok;
>> +}
>> +
>> +static const VMStateDescription vmstate_ich9_smi_feat = {
>> +    .name = "ICH9LPC/smi_feat",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .needed = ich9_smi_feat_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT8_ARRAY(smi_guest_features_le, ICH9LPCState,
>> +                            sizeof(uint64_t)),
>> +        VMSTATE_UINT8(smi_features_ok, ICH9LPCState),
>> +        VMSTATE_UINT64(smi_negotiated_features, ICH9LPCState),
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static const VMStateDescription vmstate_ich9_lpc = {
>>      .name = "ICH9LPC",
>>      .version_id = 1,
>> @@ -683,6 +761,7 @@ static const VMStateDescription vmstate_ich9_lpc = {
>>      },
>>      .subsections = (const VMStateDescription*[]) {
>>          &vmstate_ich9_rst_cnt,
>> +        &vmstate_ich9_smi_feat,
>>          NULL
>>      }
>>  };
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature Laszlo Ersek
@ 2017-01-13 13:09   ` Igor Mammedov
  2017-01-16 20:46     ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-13 13:09 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann, Paolo Bonzini

On Thu, 12 Jan 2017 19:24:45 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> The generic edk2 SMM infrastructure prefers
> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
> Trigger() only brings the current processor into SMM, then edk2 handles it
> in the following ways:
> 
> (1) If Trigger() is executed by the BSP (which is guaranteed before
>     ExitBootServices(), but is not necessarily true at runtime), then:
> 
>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>         then the BSP sends directed SMIs to the APs with APIC delivery,
>         bringing them into SMM individually. Then the BSP runs the SMI
>         handler / dispatcher.
> 
>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>         then the APs that are not already in SMM are not brought in, and
>         the BSP runs the SMI handler / dispatcher.
> 
> (2) If Trigger() is executed by an AP (which is possible after
>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>     efibootmgr"), then the AP in question brings in the BSP with a
>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> 
> The smaller problem with (1a) and (2) is that the BSP and AP
> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> command from (2) can take more than 3 seconds to complete, because
> efibootmgr accesses non-volatile UEFI variables intensively.
> 
> The larger problem is that QEMU's current behavior diverges from the
> behavior usually seen on physical hardware, and that keeps exposing
> obscure corner cases, race conditions and other instabilities in edk2,
> which generally expects / prefers a software SMI to affect all CPUs at
> once.
> 
> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
> the SMI on all VCPUs.
> 
> While the original posting of this patch
> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> only intended to speed up (2), based on our recent "stress testing" of SMM
> this patch actually provides functional improvements.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Notes:
>     v6:
>     - no changes, pick up Michael's R-b
>     
>     v5:
>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>       DEFINE_PROP_BIT() in the next patch)
> 
>  include/hw/i386/ich9.h |  3 +++
>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> index da1118727146..18dcca7ebcbf 100644
> --- a/include/hw/i386/ich9.h
> +++ b/include/hw/i386/ich9.h
> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>  #define ICH9_SMB_HST_D1                         0x06
>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>  
> +/* bit positions used in fw_cfg SMI feature negotiation */
> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> +
>  #endif /* HW_ICH9_H */
> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> index 376b7801a42c..ced6f803a4f2 100644
> --- a/hw/isa/lpc_ich9.c
> +++ b/hw/isa/lpc_ich9.c
> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>  
>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> +        if (lpc->smi_negotiated_features &
> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> +            CPUState *cs;
> +            CPU_FOREACH(cs) {
> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> +            }
Shouldn't CPUs with default SMI base be excluded from broadcast?

> +        } else {
> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> +        }
>      }
>  }
>  

* Re: [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
  2017-01-13 10:15   ` Igor Mammedov
@ 2017-01-13 13:34   ` Igor Mammedov
  2017-01-16 20:51     ` Laszlo Ersek
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-13 13:34 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Paolo Bonzini, Gerd Hoffmann, Michael S. Tsirkin

On Thu, 12 Jan 2017 19:24:44 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

[...]
> +static void smi_features_ok_callback(void *opaque)
> +{
> +    ICH9LPCState *lpc = opaque;
> +    uint64_t guest_features;
> +
> +    if (lpc->smi_features_ok) {
> +        /* negotiation already complete, features locked */
> +        return;
> +    }
> +
> +    memcpy(&guest_features, lpc->smi_guest_features_le, sizeof guest_features);
> +    le64_to_cpus(&guest_features);
> +    if (guest_features & ~lpc->smi_host_features) {
> +        /* guest requests invalid features, leave @features_ok at zero */
> +        return;
> +    }
> +
> +    /* valid feature subset requested, lock it down, report success */
> +    lpc->smi_negotiated_features = guest_features;
> +    lpc->smi_features_ok = 1;
> +}
[...]

> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT8_ARRAY(smi_guest_features_le, ICH9LPCState,
> +                            sizeof(uint64_t)),
> +        VMSTATE_UINT8(smi_features_ok, ICH9LPCState),
> +        VMSTATE_UINT64(smi_negotiated_features, ICH9LPCState),

Couldn't smi_negotiated_features be derived from
smi_guest_features_le on target side?


> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static const VMStateDescription vmstate_ich9_lpc = {
>      .name = "ICH9LPC",
>      .version_id = 1,
> @@ -683,6 +761,7 @@ static const VMStateDescription vmstate_ich9_lpc = {
>      },
>      .subsections = (const VMStateDescription*[]) {
>          &vmstate_ich9_rst_cnt,
> +        &vmstate_ich9_smi_feat,
>          NULL
>      }
>  };

* Re: [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types Laszlo Ersek
@ 2017-01-13 13:36   ` Igor Mammedov
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Mammedov @ 2017-01-13 13:36 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Eduardo Habkost,
	Gerd Hoffmann, Paolo Bonzini

On Thu, 12 Jan 2017 19:24:46 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Notes:
>     v6:
>     - rename "smi_broadcast" to "x-smi-broadcast" [Eduardo]
>     - pick up Eduardo's R-b
>     - pick up Michael's R-b
>     
>     v5:
>     - replace the v4 patch "hw/i386/pc_q35: advertise broadcast SMI if VCPU
>       hotplug is turned off" with a simple device property and compat
>       setting [Igor]
> 
>  include/hw/i386/pc.h | 5 +++++
>  hw/isa/lpc_ich9.c    | 2 ++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 230e9e70c504..9d950733abdf 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -376,6 +376,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>  
>  #define PC_COMPAT_2_8 \
>      HW_COMPAT_2_8 \
> +    {\
> +        .driver   = "ICH9-LPC",\
> +        .property = "x-smi-broadcast",\
> +        .value    = "off",\
> +    },\
>  
>  
>  #define PC_COMPAT_2_7 \
> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> index ced6f803a4f2..59930dd9d09d 100644
> --- a/hw/isa/lpc_ich9.c
> +++ b/hw/isa/lpc_ich9.c
> @@ -776,6 +776,8 @@ static const VMStateDescription vmstate_ich9_lpc = {
>  
>  static Property ich9_lpc_properties[] = {
>      DEFINE_PROP_BOOL("noreboot", ICH9LPCState, pin_strap.spkr_hi, true),
> +    DEFINE_PROP_BIT64("x-smi-broadcast", ICH9LPCState, smi_host_features,
> +                      ICH9_LPC_SMI_F_BROADCAST_BIT, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-13 13:09   ` Igor Mammedov
@ 2017-01-16 20:46     ` Laszlo Ersek
  2017-01-17 11:21       ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-16 20:46 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On 01/13/17 14:09, Igor Mammedov wrote:
> On Thu, 12 Jan 2017 19:24:45 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> The generic edk2 SMM infrastructure prefers
>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
>> Trigger() only brings the current processor into SMM, then edk2 handles it
>> in the following ways:
>>
>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>
>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>         bringing them into SMM individually. Then the BSP runs the SMI
>>         handler / dispatcher.
>>
>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>         then the APs that are not already in SMM are not brought in, and
>>         the BSP runs the SMI handler / dispatcher.
>>
>> (2) If Trigger() is executed by an AP (which is possible after
>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>     efibootmgr"), then the AP in question brings in the BSP with a
>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>
>> The smaller problem with (1a) and (2) is that the BSP and AP
>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>> command from (2) can take more than 3 seconds to complete, because
>> efibootmgr accesses non-volatile UEFI variables intensively.
>>
>> The larger problem is that QEMU's current behavior diverges from the
>> behavior usually seen on physical hardware, and that keeps exposing
>> obscure corner cases, race conditions and other instabilities in edk2,
>> which generally expects / prefers a software SMI to affect all CPUs at
>> once.
>>
>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
>> the SMI on all VCPUs.
>>
>> While the original posting of this patch
>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>> only intended to speed up (2), based on our recent "stress testing" of SMM
>> this patch actually provides functional improvements.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>
>> Notes:
>>     v6:
>>     - no changes, pick up Michael's R-b
>>     
>>     v5:
>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>       DEFINE_PROP_BIT() in the next patch)
>>
>>  include/hw/i386/ich9.h |  3 +++
>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>> index da1118727146..18dcca7ebcbf 100644
>> --- a/include/hw/i386/ich9.h
>> +++ b/include/hw/i386/ich9.h
>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>  #define ICH9_SMB_HST_D1                         0x06
>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>  
>> +/* bit positions used in fw_cfg SMI feature negotiation */
>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>> +
>>  #endif /* HW_ICH9_H */
>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>> index 376b7801a42c..ced6f803a4f2 100644
>> --- a/hw/isa/lpc_ich9.c
>> +++ b/hw/isa/lpc_ich9.c
>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>  
>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>> +        if (lpc->smi_negotiated_features &
>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>> +            CPUState *cs;
>> +            CPU_FOREACH(cs) {
>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>> +            }
> Shouldn't CPUs with default SMI base be excluded from broadcast?

I don't think so; as far as I recall, in edk2 the initial SMM entry code
-- before SMBASE relocation -- accommodates all VCPUs entering the same
code area for execution. The VCPUs are told apart from each other by
Initial APIC ID (that is, "who am I"), which is used as an index or
search criterion into a global shared array. Once they find their
respective private data blocks, the VCPUs don't step on each other's toes.
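A rough sketch of that "who am I" lookup follows; this is purely illustrative (the layout and names are invented here, not the actual edk2 code), but it shows how concurrent entry into one shared code area stays safe when each CPU only touches the record matching its own Initial APIC ID:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 4

/* Per-CPU records in a global shared array, keyed by Initial APIC ID
 * (hypothetical layout for illustration). */
typedef struct {
    uint32_t  apic_id;
    uintptr_t smbase;   /* relocated SMBASE assigned to this CPU */
} CpuRecord;

static CpuRecord cpu_table[MAX_CPUS] = {
    { 0, 0x30000 }, { 1, 0x31000 }, { 2, 0x32000 }, { 3, 0x33000 },
};

/* Each VCPU entering the shared pre-relocation SMM entry code calls
 * this with its own Initial APIC ID, and from then on only touches the
 * private record it gets back. */
static CpuRecord *find_self(uint32_t my_apic_id)
{
    for (size_t i = 0; i < MAX_CPUS; i++) {
        if (cpu_table[i].apic_id == my_apic_id) {
            return &cpu_table[i];
        }
    }
    return NULL;   /* unknown CPU; a real handler would treat as error */
}
```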

Thanks
Laszlo

> 
>> +        } else {
>> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>> +        }
>>      }
>>  }
>>  
> 

* Re: [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-13 13:34   ` Igor Mammedov
@ 2017-01-16 20:51     ` Laszlo Ersek
  2017-01-17  8:42       ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-16 20:51 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu devel list, Paolo Bonzini, Gerd Hoffmann, Michael S. Tsirkin

On 01/13/17 14:34, Igor Mammedov wrote:
> On Thu, 12 Jan 2017 19:24:44 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
> [...]
>> +static void smi_features_ok_callback(void *opaque)
>> +{
>> +    ICH9LPCState *lpc = opaque;
>> +    uint64_t guest_features;
>> +
>> +    if (lpc->smi_features_ok) {
>> +        /* negotiation already complete, features locked */
>> +        return;
>> +    }
>> +
>> +    memcpy(&guest_features, lpc->smi_guest_features_le, sizeof guest_features);
>> +    le64_to_cpus(&guest_features);
>> +    if (guest_features & ~lpc->smi_host_features) {
>> +        /* guest requests invalid features, leave @features_ok at zero */
>> +        return;
>> +    }
>> +
>> +    /* valid feature subset requested, lock it down, report success */
>> +    lpc->smi_negotiated_features = guest_features;
>> +    lpc->smi_features_ok = 1;
>> +}
> [...]
> 
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT8_ARRAY(smi_guest_features_le, ICH9LPCState,
>> +                            sizeof(uint64_t)),
>> +        VMSTATE_UINT8(smi_features_ok, ICH9LPCState),
>> +        VMSTATE_UINT64(smi_negotiated_features, ICH9LPCState),
> 
> Couldn't smi_negotiated_features be derived from
> smi_guest_features_le on target side?

No, that's something we must expressly not do.

The "guest-features" file remains writeable (and back-readable) to the
guest after feature validation/lockdown (i.e., after selecting
"features-ok"), it just has no effect on QEMU.

And, if the guest rewrites "guest-features" between feature lockdown and
migration, that must have no effect on the target host either. Hence we
must carry forward the negotiated features originally computed on the
source host.
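The hazard is easy to show with a toy model: once lock-down has happened on the source, re-deriving the negotiated set on the target from the (guest-rewritable) request file can disagree with what the source actually locked. The struct and names below are illustrative only, not the real migration code:

```c
#include <stdint.h>

/* Minimal model of the three migrated fields. */
typedef struct {
    uint8_t  guest_features_le[8]; /* guest-writable even after lock */
    uint8_t  features_ok;
    uint64_t negotiated_features;  /* migrated as-is, NOT recomputed */
} SmiState;

static uint64_t load_le64(const uint8_t *b)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= (uint64_t)b[i] << (8 * i);
    }
    return v;
}

/* The tempting-but-wrong approach: recompute negotiated features on
 * the target from the request file. */
static uint64_t derive_on_target(const SmiState *s)
{
    return load_le64(s->guest_features_le);
}
```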

Thanks
Laszlo

> 
> 
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static const VMStateDescription vmstate_ich9_lpc = {
>>      .name = "ICH9LPC",
>>      .version_id = 1,
>> @@ -683,6 +761,7 @@ static const VMStateDescription vmstate_ich9_lpc = {
>>      },
>>      .subsections = (const VMStateDescription*[]) {
>>          &vmstate_ich9_rst_cnt,
>> +        &vmstate_ich9_smi_feat,
>>          NULL
>>      }
>>  };
> 

* Re: [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
  2017-01-16 20:51     ` Laszlo Ersek
@ 2017-01-17  8:42       ` Igor Mammedov
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Mammedov @ 2017-01-17  8:42 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Paolo Bonzini, Michael S. Tsirkin, qemu devel list, Gerd Hoffmann

On Mon, 16 Jan 2017 21:51:34 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/13/17 14:34, Igor Mammedov wrote:
> > On Thu, 12 Jan 2017 19:24:44 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> > 
> > [...]  
> >> +static void smi_features_ok_callback(void *opaque)
> >> +{
> >> +    ICH9LPCState *lpc = opaque;
> >> +    uint64_t guest_features;
> >> +
> >> +    if (lpc->smi_features_ok) {
> >> +        /* negotiation already complete, features locked */
> >> +        return;
> >> +    }
> >> +
> >> +    memcpy(&guest_features, lpc->smi_guest_features_le, sizeof guest_features);
> >> +    le64_to_cpus(&guest_features);
> >> +    if (guest_features & ~lpc->smi_host_features) {
> >> +        /* guest requests invalid features, leave @features_ok at zero */
> >> +        return;
> >> +    }
> >> +
> >> +    /* valid feature subset requested, lock it down, report success */
> >> +    lpc->smi_negotiated_features = guest_features;
> >> +    lpc->smi_features_ok = 1;
> >> +}  
> > [...]
> >   
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT8_ARRAY(smi_guest_features_le, ICH9LPCState,
> >> +                            sizeof(uint64_t)),
> >> +        VMSTATE_UINT8(smi_features_ok, ICH9LPCState),
> >> +        VMSTATE_UINT64(smi_negotiated_features, ICH9LPCState),  
> > 
> > Couldn't smi_negotiated_features be derived from
> > smi_guest_features_le on target side?  
> 
> No, that's something we must expressly not do.
> 
> The "guest-features" file remains writeable (and back-readable) to the
> guest after feature validation/lockdown (i.e., after selecting
> "features-ok"), it just has no effect on QEMU.
> 
> And, if the guest rewrites "guest-features" between feature lockdown and
> migration, that must have no effect on the target host either. Hence we
> must carry forward the negotiated features originally computed on the
> source host.
OK, the patch looks good to me, so

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> Thanks
> Laszlo
> 
> > 
> >   
> >> +        VMSTATE_END_OF_LIST()
> >> +    }
> >> +};
> >> +
> >>  static const VMStateDescription vmstate_ich9_lpc = {
> >>      .name = "ICH9LPC",
> >>      .version_id = 1,
> >> @@ -683,6 +761,7 @@ static const VMStateDescription vmstate_ich9_lpc = {
> >>      },
> >>      .subsections = (const VMStateDescription*[]) {
> >>          &vmstate_ich9_rst_cnt,
> >> +        &vmstate_ich9_smi_feat,
> >>          NULL
> >>      }
> >>  };  
> >   
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
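
The lock-down semantics Laszlo describes above (guest writes to "guest-features" stay readable but stop having any effect once "features-ok" is selected, which is why the negotiated value itself must be migrated rather than re-derived) can be modeled with a small, QEMU-independent sketch; the type and function names below are illustrative, not QEMU's actual ones:

```c
#include <stdint.h>

/* Minimal model of the fw_cfg SMI feature negotiation discussed above.
 * The host advertises a feature bitmap; the guest writes a requested
 * bitmap and then selects "features-ok". Once features_ok is set,
 * further guest writes still land in guest_features, but never affect
 * the negotiated set -- hence the negotiated value must be carried
 * forward in migration, not recomputed on the target. */
typedef struct {
    uint64_t host_features;       /* features the host offers */
    uint64_t guest_features;      /* last value the guest wrote */
    uint64_t negotiated_features; /* locked-in subset */
    uint8_t  features_ok;         /* 1 once negotiation succeeded */
} SmiFeatModel;

void guest_write_features(SmiFeatModel *s, uint64_t req)
{
    s->guest_features = req;      /* remains writable after lock-down */
}

void select_features_ok(SmiFeatModel *s)
{
    if (s->features_ok) {
        return;                   /* negotiation already complete */
    }
    if (s->guest_features & ~s->host_features) {
        return;                   /* invalid request: features_ok stays 0 */
    }
    s->negotiated_features = s->guest_features;
    s->features_ok = 1;
}
```

A guest that rewrites the features after lock-down changes `guest_features` but not `negotiated_features`, matching the migration argument in the message above.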

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-16 20:46     ` Laszlo Ersek
@ 2017-01-17 11:21       ` Igor Mammedov
  2017-01-17 12:06         ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-17 11:21 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On Mon, 16 Jan 2017 21:46:55 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/13/17 14:09, Igor Mammedov wrote:
> > On Thu, 12 Jan 2017 19:24:45 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> The generic edk2 SMM infrastructure prefers
> >> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
> >> Trigger() only brings the current processor into SMM, then edk2 handles it
> >> in the following ways:
> >>
> >> (1) If Trigger() is executed by the BSP (which is guaranteed before
> >>     ExitBootServices(), but is not necessarily true at runtime), then:
> >>
> >>     (a) If edk2 has been configured for "traditional" SMM synchronization,
> >>         then the BSP sends directed SMIs to the APs with APIC delivery,
> >>         bringing them into SMM individually. Then the BSP runs the SMI
> >>         handler / dispatcher.
> >>
> >>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
> >>         then the APs that are not already in SMM are not brought in, and
> >>         the BSP runs the SMI handler / dispatcher.
> >>
> >> (2) If Trigger() is executed by an AP (which is possible after
> >>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>     efibootmgr"), then the AP in question brings in the BSP with a
> >>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> >>
> >> The smaller problem with (1a) and (2) is that the BSP and AP
> >> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> >> command from (2) can take more than 3 seconds to complete, because
> >> efibootmgr accesses non-volatile UEFI variables intensively.
> >>
> >> The larger problem is that QEMU's current behavior diverges from the
> >> behavior usually seen on physical hardware, and that keeps exposing
> >> obscure corner cases, race conditions and other instabilities in edk2,
> >> which generally expects / prefers a software SMI to affect all CPUs at
> >> once.
> >>
> >> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
> >> the SMI on all VCPUs.
> >>
> >> While the original posting of this patch
> >> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >> only intended to speed up (2), based on our recent "stress testing" of SMM
> >> this patch actually provides functional improvements.
> >>
> >> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >> Cc: Igor Mammedov <imammedo@redhat.com>
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> >> ---
> >>
> >> Notes:
> >>     v6:
> >>     - no changes, pick up Michael's R-b
> >>     
> >>     v5:
> >>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>       DEFINE_PROP_BIT() in the next patch)
> >>
> >>  include/hw/i386/ich9.h |  3 +++
> >>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >> index da1118727146..18dcca7ebcbf 100644
> >> --- a/include/hw/i386/ich9.h
> >> +++ b/include/hw/i386/ich9.h
> >> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>  #define ICH9_SMB_HST_D1                         0x06
> >>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
> >>  
> >> +/* bit positions used in fw_cfg SMI feature negotiation */
> >> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> >> +
> >>  #endif /* HW_ICH9_H */
> >> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >> index 376b7801a42c..ced6f803a4f2 100644
> >> --- a/hw/isa/lpc_ich9.c
> >> +++ b/hw/isa/lpc_ich9.c
> >> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
> >>  
> >>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >> +        if (lpc->smi_negotiated_features &
> >> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >> +            CPUState *cs;
> >> +            CPU_FOREACH(cs) {
> >> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >> +            }  
> > Shouldn't CPUs with default SMI base be excluded from broadcast?  
> 
> I don't think so; as far as I recall, in edk2 the initial SMM entry code
> -- before SMBASE relocation -- accommodates all VCPUs entering the same
> code area for execution. The VCPUs are told apart from each other by
> Initial APIC ID (that is, "who am I"), which is used as an index or
> search criterion into a global shared array. Once they find their
> respective private data blocks, the VCPUs don't step on each other's toes.
That's what I'm not sure about.
If a broadcast SMI is sent before relocation, all CPUs will use the
same SMBASE and, as a result, save their state to the same memory
(even if it's the state right after RESET/POWER ON), racing with and
overwriting each other's saved state; then, on exit from SMI, they
will all restore whatever state that race produced, so it seems
plainly wrong to do this.

Are you sure that edk2 uses a broadcast SMI for initialization, rather
than directed SMIs?

> 
> Thanks
> Laszlo
> 
> >   
> >> +        } else {
> >> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >> +        }
> >>      }
> >>  }
> >>    
> >   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
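
The dispatch choice debated above (broadcast when the negotiated bit is set, legacy unicast otherwise) reduces to the following stand-alone sketch; `CPU_FOREACH` and `cpu_interrupt()` are replaced by an illustrative array walk, so this models the patch hunk rather than reproducing QEMU's code:

```c
#include <stddef.h>
#include <stdint.h>

#define SMI_F_BROADCAST_BIT 0  /* mirrors ICH9_LPC_SMI_F_BROADCAST_BIT */

/* Model: mark which CPUs receive the SMI on an APM_CNT write. */
void apm_smi(uint64_t negotiated, int current_cpu,
             int pending_smi[], size_t ncpus)
{
    if (negotiated & (UINT64_C(1) << SMI_F_BROADCAST_BIT)) {
        /* broadcast: every CPU gets the SMI (stands in for CPU_FOREACH) */
        for (size_t i = 0; i < ncpus; i++) {
            pending_smi[i] = 1;
        }
    } else {
        /* legacy unicast behavior: only the writing CPU gets it */
        pending_smi[current_cpu] = 1;
    }
}
```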

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 11:21       ` Igor Mammedov
@ 2017-01-17 12:06         ` Laszlo Ersek
  2017-01-17 13:20           ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-17 12:06 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On 01/17/17 12:21, Igor Mammedov wrote:
> On Mon, 16 Jan 2017 21:46:55 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/13/17 14:09, Igor Mammedov wrote:
>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> The generic edk2 SMM infrastructure prefers
>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
>>>> in the following ways:
>>>>
>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>>>
>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>>>         bringing them into SMM individually. Then the BSP runs the SMI
>>>>         handler / dispatcher.
>>>>
>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>>>         then the APs that are not already in SMM are not brought in, and
>>>>         the BSP runs the SMI handler / dispatcher.
>>>>
>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>
>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>> command from (2) can take more than 3 seconds to complete, because
>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>
>>>> The larger problem is that QEMU's current behavior diverges from the
>>>> behavior usually seen on physical hardware, and that keeps exposing
>>>> obscure corner cases, race conditions and other instabilities in edk2,
>>>> which generally expects / prefers a software SMI to affect all CPUs at
>>>> once.
>>>>
>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
>>>> the SMI on all VCPUs.
>>>>
>>>> While the original posting of this patch
>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
>>>> this patch actually provides functional improvements.
>>>>
>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>> ---
>>>>
>>>> Notes:
>>>>     v6:
>>>>     - no changes, pick up Michael's R-b
>>>>     
>>>>     v5:
>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>
>>>>  include/hw/i386/ich9.h |  3 +++
>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>> index da1118727146..18dcca7ebcbf 100644
>>>> --- a/include/hw/i386/ich9.h
>>>> +++ b/include/hw/i386/ich9.h
>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>  #define ICH9_SMB_HST_D1                         0x06
>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>>>  
>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>>>> +
>>>>  #endif /* HW_ICH9_H */
>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>> --- a/hw/isa/lpc_ich9.c
>>>> +++ b/hw/isa/lpc_ich9.c
>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>  
>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>> +        if (lpc->smi_negotiated_features &
>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>> +            CPUState *cs;
>>>> +            CPU_FOREACH(cs) {
>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>> +            }  
>>> Shouldn't CPUs with default SMI base be excluded from broadcast?  
>>
>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
>> code area for execution. The VCPUs are told apart from each other by
>> Initial APIC ID (that is, "who am I"), which is used as an index or
>> search criterion into a global shared array. Once they find their
>> respective private data blocks, the VCPUs don't step on each other's toes.
> That's what I'm not sure.
> If broadcast SMI is sent before relocation all CPUs will use
> the same SMBASE and as result save their state in the same
> memory (even if it's state after RESET/POWER ON) racing/overwriting
> each other's saved state

Good point!

> and then on exit from SMI they all will restore
> whatever state that race produced so it seems plain wrong thing to do.
> 
> Are you sure that edk2 does broadcast for SMI initialization or does it
> using directed SMIs?

Hmmm, you are right. The SmmRelocateBases() function in
"UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
each individual AP in succession, by sending SMI IPIs to them, one by
one. Then it does the same for the BSP, by sending itself a similar SMI IPI.

The above QEMU code is only exercised much later, after the SMBASE
relocation, with the SMM stack up and running. It is used when a
(preferably broadcast) SMI is triggered for firmware services (for
example, the UEFI variable services).

So, you are right.

Considering edk2 only, the difference shouldn't matter -- when this code
is invoked (via an IO port write), the SMBASEs will all have been relocated.

Also, I've been informed that on real hardware, writing to APM_CNT
triggers an SMI on all CPUs, regardless of the individual SMBASE values.
So I believe the above code should be closest to real hardware, and the
pre-patch code was only written in unicast form for SeaBIOS's sake.

I think the code above is safe. If the guest injects an SMI via APM_CNT
after negotiating SMI broadcast, it should have not left any VCPUs
without an SMI handler by that time. edk2 / OVMF conforms to this. In
the future we can tweak this further, if necessary; we have 63 more
bits, and we'll be able to reject invalid combinations of bits.

Do you feel that we should skip VCPUs whose SMBASE has not been changed
from the default?

If so, doesn't that run the risk of missing a VCPU that has an actual
SMI handler installed, and it just happens to be placed at the default
SMBASE location?

... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
but I think (a) that's not what real hardware does, and (b) it is risky
if a VCPU's actual SMI handler has been installed while keeping the
default SMBASE value.

What do you think?

Thanks!
Laszlo

> 
>>
>> Thanks
>> Laszlo
>>
>>>   
>>>> +        } else {
>>>> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>> +        }
>>>>      }
>>>>  }
>>>>    
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
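
Igor's race concern can be made concrete with a little arithmetic: each CPU saves its state into the SMRAM state save map relative to its own SMBASE, so two CPUs sharing a SMBASE write to the same range. The sketch below is illustrative only -- the exact save state map layout varies by CPU generation, and `SMBASE+0xFC00..SMBASE+0xFFFF` is used here as an assumed range, not a value taken from the SDM or QEMU:

```c
#include <stdint.h>

#define DEFAULT_SMBASE 0x30000u

/* Assumed (illustrative) extent of the SMRAM state save area near the
 * top of the 64 KiB SMBASE window; real layouts differ in detail. */
#define SAVE_LO(smbase) ((smbase) + 0xFC00u)
#define SAVE_HI(smbase) ((smbase) + 0xFFFFu)

/* Returns 1 if two CPUs' save state areas overlap, i.e. a broadcast
 * SMI delivered to both would make them race on the same memory. */
int save_areas_overlap(uint32_t smbase_a, uint32_t smbase_b)
{
    return SAVE_LO(smbase_a) <= SAVE_HI(smbase_b) &&
           SAVE_LO(smbase_b) <= SAVE_HI(smbase_a);
}
```

With both CPUs still at the default SMBASE the areas coincide exactly, which is the pre-relocation race described above; after SMBASE relocation spreads the windows apart, the overlap disappears.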

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 12:06         ` Laszlo Ersek
@ 2017-01-17 13:20           ` Igor Mammedov
  2017-01-17 13:46             ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-17 13:20 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On Tue, 17 Jan 2017 13:06:13 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/17/17 12:21, Igor Mammedov wrote:
> > On Mon, 16 Jan 2017 21:46:55 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> On 01/13/17 14:09, Igor Mammedov wrote:  
> >>> On Thu, 12 Jan 2017 19:24:45 +0100
> >>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>     
> >>>> The generic edk2 SMM infrastructure prefers
> >>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
> >>>> Trigger() only brings the current processor into SMM, then edk2 handles it
> >>>> in the following ways:
> >>>>
> >>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
> >>>>     ExitBootServices(), but is not necessarily true at runtime), then:
> >>>>
> >>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
> >>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
> >>>>         bringing them into SMM individually. Then the BSP runs the SMI
> >>>>         handler / dispatcher.
> >>>>
> >>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
> >>>>         then the APs that are not already in SMM are not brought in, and
> >>>>         the BSP runs the SMI handler / dispatcher.
> >>>>
> >>>> (2) If Trigger() is executed by an AP (which is possible after
> >>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>>>     efibootmgr"), then the AP in question brings in the BSP with a
> >>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> >>>>
> >>>> The smaller problem with (1a) and (2) is that the BSP and AP
> >>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> >>>> command from (2) can take more than 3 seconds to complete, because
> >>>> efibootmgr accesses non-volatile UEFI variables intensively.
> >>>>
> >>>> The larger problem is that QEMU's current behavior diverges from the
> >>>> behavior usually seen on physical hardware, and that keeps exposing
> >>>> obscure corner cases, race conditions and other instabilities in edk2,
> >>>> which generally expects / prefers a software SMI to affect all CPUs at
> >>>> once.
> >>>>
> >>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
> >>>> the SMI on all VCPUs.
> >>>>
> >>>> While the original posting of this patch
> >>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >>>> only intended to speed up (2), based on our recent "stress testing" of SMM
> >>>> this patch actually provides functional improvements.
> >>>>
> >>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >>>> Cc: Igor Mammedov <imammedo@redhat.com>
> >>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> >>>> ---
> >>>>
> >>>> Notes:
> >>>>     v6:
> >>>>     - no changes, pick up Michael's R-b
> >>>>     
> >>>>     v5:
> >>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>>>       DEFINE_PROP_BIT() in the next patch)
> >>>>
> >>>>  include/hw/i386/ich9.h |  3 +++
> >>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >>>> index da1118727146..18dcca7ebcbf 100644
> >>>> --- a/include/hw/i386/ich9.h
> >>>> +++ b/include/hw/i386/ich9.h
> >>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>>>  #define ICH9_SMB_HST_D1                         0x06
> >>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
> >>>>  
> >>>> +/* bit positions used in fw_cfg SMI feature negotiation */
> >>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> >>>> +
> >>>>  #endif /* HW_ICH9_H */
> >>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >>>> index 376b7801a42c..ced6f803a4f2 100644
> >>>> --- a/hw/isa/lpc_ich9.c
> >>>> +++ b/hw/isa/lpc_ich9.c
> >>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
> >>>>  
> >>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>> +        if (lpc->smi_negotiated_features &
> >>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >>>> +            CPUState *cs;
> >>>> +            CPU_FOREACH(cs) {
> >>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>> +            }    
> >>> Shouldn't CPUs with default SMI base be excluded from broadcast?    
> >>
> >> I don't think so; as far as I recall, in edk2 the initial SMM entry code
> >> -- before SMBASE relocation -- accommodates all VCPUs entering the same
> >> code area for execution. The VCPUs are told apart from each other by
> >> Initial APIC ID (that is, "who am I"), which is used as an index or
> >> search criterion into a global shared array. Once they find their
> >> respective private data blocks, the VCPUs don't step on each other's toes.  
> > That's what I'm not sure.
> > If broadcast SMI is sent before relocation all CPUs will use
> > the same SMBASE and as result save their state in the same
> > memory (even if it's state after RESET/POWER ON) racing/overwriting
> > each other's saved state  
> 
> Good point!
> 
> > and then on exit from SMI they all will restore
> > whatever state that race produced so it seems plain wrong thing to do.
> > 
> > Are you sure that edk2 does broadcast for SMI initialization or does it
> > using directed SMIs?  
> 
> Hmmm, you are right. The SmmRelocateBases() function in
> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
> each individual AP in succession, by sending SMI IPIs to them, one by
> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
> 
> The above QEMU code is only exercised much later, after the SMBASE
> relocation, with the SMM stack up and running. It is used when a
> (preferably broadcast) SMI is triggered for firmware services (for
> example, the UEFI variable services).
> 
> So, you are right.
> 
> Considering edk2 only, the difference shouldn't matter -- when this code
> is invoked (via an IO port write), the SMBASEs will all have been relocated.
> 
> Also, I've been informed that on real hardware, writing to APM_CNT
> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
> So I believe the above code should be closest to real hardware, and the
> pre-patch code was only written in unicast form for SeaBIOS's sake.
> 
> I think the code is safe above. If the guest injects an SMI via APM_CNT
> after negotiating SMI broadcast, it should have not left any VCPUs
> without an SMI handler by that time. edk2 / OVMF conforms to this. In
> the future we can tweak this further, if necessary; we have 63 more
> bits, and we'll be able to reject invalid combinations of bits.
> 
> Do you feel that we should skip VCPUs whose SMBASE has not been changed
> from the default?
> 
> If so, doesn't that run the risk of missing a VCPU that has an actual
> SMI handler installed, and it just happens to be placed at the default
> SMBASE location?
> 
> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
> but I think (a) that's not what real hardware does, and (b) it is risky
> if a VCPU's actual SMI handler has been installed while keeping the
> default SMBASE value.
> 
> What do you think?
(a) probably doesn't consider hotplug, so it's fine for the case where
the firmware is the only one that wakes up/initializes CPUs.
However, consider a CPU being hotplugged, and then a broadcast SMI
arriving to serve some other task (if there isn't an 'unpark'
feature). Even if real hardware does it, the behavior is not defined
by the SDM and may have bad consequences.

(b) alone is risky, so I'd skip based on the combination:
 
 if (default SMBASE && CPU is in reset state)
    skip;

that should be safe, and it leaves open the possibility of avoiding an
'unpark' feature for the CPU.

> 
> Thanks!
> Laszlo
> 
> >   
> >>
> >> Thanks
> >> Laszlo
> >>  
> >>>     
> >>>> +        } else {
> >>>> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>> +        }
> >>>>      }
> >>>>  }
> >>>>      
> >>>     
> >>  
> >   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 13:20           ` Igor Mammedov
@ 2017-01-17 13:46             ` Laszlo Ersek
  2017-01-17 14:20               ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-17 13:46 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On 01/17/17 14:20, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 13:06:13 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/17/17 12:21, Igor Mammedov wrote:
>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> On 01/13/17 14:09, Igor Mammedov wrote:  
>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>     
>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
>>>>>> in the following ways:
>>>>>>
>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>>>>>
>>>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
>>>>>>         handler / dispatcher.
>>>>>>
>>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>>>>>         then the APs that are not already in SMM are not brought in, and
>>>>>>         the BSP runs the SMI handler / dispatcher.
>>>>>>
>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>
>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>
>>>>>> The larger problem is that QEMU's current behavior diverges from the
>>>>>> behavior usually seen on physical hardware, and that keeps exposing
>>>>>> obscure corner cases, race conditions and other instabilities in edk2,
>>>>>> which generally expects / prefers a software SMI to affect all CPUs at
>>>>>> once.
>>>>>>
>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
>>>>>> the SMI on all VCPUs.
>>>>>>
>>>>>> While the original posting of this patch
>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
>>>>>> this patch actually provides functional improvements.
>>>>>>
>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>> ---
>>>>>>
>>>>>> Notes:
>>>>>>     v6:
>>>>>>     - no changes, pick up Michael's R-b
>>>>>>     
>>>>>>     v5:
>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>
>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>> --- a/include/hw/i386/ich9.h
>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>  #define ICH9_SMB_HST_D1                         0x06
>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>>>>>  
>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>>>>>> +
>>>>>>  #endif /* HW_ICH9_H */
>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>  
>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>> +            CPUState *cs;
>>>>>> +            CPU_FOREACH(cs) {
>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>> +            }    
>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?    
>>>>
>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
>>>> code area for execution. The VCPUs are told apart from each other by
>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
>>>> search criterion into a global shared array. Once they find their
>>>> respective private data blocks, the VCPUs don't step on each other's toes.  
>>> That's what I'm not sure.
>>> If broadcast SMI is sent before relocation all CPUs will use
>>> the same SMBASE and as result save their state in the same
>>> memory (even if it's state after RESET/POWER ON) racing/overwriting
>>> each other's saved state  
>>
>> Good point!
>>
>>> and then on exit from SMI they all will restore
>>> whatever state that race produced so it seems plain wrong thing to do.
>>>
>>> Are you sure that edk2 does broadcast for SMI initialization or does it
>>> using directed SMIs?  
>>
>> Hmmm, you are right. The SmmRelocateBases() function in
>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>> each individual AP in succession, by sending SMI IPIs to them, one by
>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
>>
>> The above QEMU code is only exercised much later, after the SMBASE
>> relocation, with the SMM stack up and running. It is used when a
>> (preferably broadcast) SMI is triggered for firmware services (for
>> example, the UEFI variable services).
>>
>> So, you are right.
>>
>> Considering edk2 only, the difference shouldn't matter -- when this code
>> is invoked (via an IO port write), the SMBASEs will all have been relocated.
>>
>> Also, I've been informed that on real hardware, writing to APM_CNT
>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
>> So I believe the above code should be closest to real hardware, and the
>> pre-patch code was only written in unicast form for SeaBIOS's sake.
>>
>> I think the code is safe above. If the guest injects an SMI via APM_CNT
>> after negotiating SMI broadcast, it should have not left any VCPUs
>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
>> the future we can tweak this further, if necessary; we have 63 more
>> bits, and we'll be able to reject invalid combinations of bits.
>>
>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
>> from the default?
>>
>> If so, doesn't that run the risk of missing a VCPU that has an actual
>> SMI handler installed, and it just happens to be placed at the default
>> SMBASE location?
>>
>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
>> but I think (a) that's not what real hardware does, and (b) it is risky
>> if a VCPU's actual SMI handler has been installed while keeping the
>> default SMBASE value.
>>
>> What do you think?
> (a) probably doesn't consider hotplug, so it's totally fine for the case
> where firmware is the only one that wakes up/initializes CPUs.
> However, consider 2 CPUs hotplugged, and then a broadcast SMI happening
> to serve some other task (if there isn't an unpark 'feature').
> Even if real hw does it, its behavior is not defined by the SDM, with
> possibly bad consequences.
> 
> (b) alone is risky, so I'd skip based on a combination of
>  
>  if (default SMBASE && CPU is in reset state)
>     skip;
> 
> that should be safe, and it leaves open the possibility of avoiding an
> unpark 'feature' for the CPU.

The above attributes ("SMBASE" and "CPU is in reset state") are specific
to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
X86_CPU() macro?

Can I assume here that the macro will never return NULL? (I think so,
LPC is only used with x86.)

... I guess SMBASE can be accessed as

  X86_CPU(cs)->env.smbase

How do I figure out if the CPU is in reset state ("waiting for first
INIT") though? Note that this state should be distinguished from simply
"halted" (i.e., after a HLT). After a HLT, the SMI should be injected
and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
APs to sleep.

Thanks
Laszlo

> 
>>
>> Thanks!
>> Laszlo
>>
>>>   
>>>>
>>>> Thanks
>>>> Laszlo
>>>>  
>>>>>     
>>>>>> +        } else {
>>>>>> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>> +        }
>>>>>>      }
>>>>>>  }
>>>>>>      
>>>>>     
>>>>  
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 13:46             ` Laszlo Ersek
@ 2017-01-17 14:20               ` Igor Mammedov
  2017-01-17 18:53                 ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-17 14:20 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Michael S. Tsirkin, Gerd Hoffmann,
	Paolo Bonzini, Michael Kinney

On Tue, 17 Jan 2017 14:46:05 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/17/17 14:20, Igor Mammedov wrote:
> > On Tue, 17 Jan 2017 13:06:13 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> On 01/17/17 12:21, Igor Mammedov wrote:  
> >>> On Mon, 16 Jan 2017 21:46:55 +0100
> >>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>     
> >>>> On 01/13/17 14:09, Igor Mammedov wrote:    
> >>>>> On Thu, 12 Jan 2017 19:24:45 +0100
> >>>>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>>       
> >>>>>> The generic edk2 SMM infrastructure prefers
> >>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
> >>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
> >>>>>> in the following ways:
> >>>>>>
> >>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
> >>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
> >>>>>>
> >>>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
> >>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
> >>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
> >>>>>>         handler / dispatcher.
> >>>>>>
> >>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
> >>>>>>         then the APs that are not already in SMM are not brought in, and
> >>>>>>         the BSP runs the SMI handler / dispatcher.
> >>>>>>
> >>>>>> (2) If Trigger() is executed by an AP (which is possible after
> >>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
> >>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> >>>>>>
> >>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
> >>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> >>>>>> command from (2) can take more than 3 seconds to complete, because
> >>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
> >>>>>>
> >>>>>> The larger problem is that QEMU's current behavior diverges from the
> >>>>>> behavior usually seen on physical hardware, and that keeps exposing
> >>>>>> obscure corner cases, race conditions and other instabilities in edk2,
> >>>>>> which generally expects / prefers a software SMI to affect all CPUs at
> >>>>>> once.
> >>>>>>
> >>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
> >>>>>> the SMI on all VCPUs.
> >>>>>>
> >>>>>> While the original posting of this patch
> >>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
> >>>>>> this patch actually provides functional improvements.
> >>>>>>
> >>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
> >>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>> ---
> >>>>>>
> >>>>>> Notes:
> >>>>>>     v6:
> >>>>>>     - no changes, pick up Michael's R-b
> >>>>>>     
> >>>>>>     v5:
> >>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>>>>>       DEFINE_PROP_BIT() in the next patch)
> >>>>>>
> >>>>>>  include/hw/i386/ich9.h |  3 +++
> >>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >>>>>> index da1118727146..18dcca7ebcbf 100644
> >>>>>> --- a/include/hw/i386/ich9.h
> >>>>>> +++ b/include/hw/i386/ich9.h
> >>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>>>>>  #define ICH9_SMB_HST_D1                         0x06
> >>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
> >>>>>>  
> >>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
> >>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> >>>>>> +
> >>>>>>  #endif /* HW_ICH9_H */
> >>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >>>>>> index 376b7801a42c..ced6f803a4f2 100644
> >>>>>> --- a/hw/isa/lpc_ich9.c
> >>>>>> +++ b/hw/isa/lpc_ich9.c
> >>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
> >>>>>>  
> >>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>>>> +        if (lpc->smi_negotiated_features &
> >>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >>>>>> +            CPUState *cs;
> >>>>>> +            CPU_FOREACH(cs) {
> >>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>> +            }      
> >>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?      
> >>>>
> >>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
> >>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
> >>>> code area for execution. The VCPUs are told apart from each other by
> >>>> Initial APIC ID (that is, "who am I"), which is used as an index or
> >>>> search criterion into a global shared array. Once they find their
> >>>> respective private data blocks, the VCPUs don't step on each other's toes.    
> >>> That's what I'm not sure.
> >>> If broadcast SMI is sent before relocation all CPUs will use
> >>> the same SMBASE and as result save their state in the same
> >>> memory (even if it's state after RESET/POWER ON) racing/overwriting
> >>> each other's saved state    
> >>
> >> Good point!
> >>  
> >>> and then on exit from SMI they all will restore
> >>> whatever state that race produced so it seems plain wrong thing to do.
> >>>
> >>> Are you sure that edk2 does broadcast for SMI initialization or does it
> >>> using directed SMIs?    
> >>
> >> Hmmm, you are right. The SmmRelocateBases() function in
> >> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
> >> each individual AP in succession, by sending SMI IPIs to them, one by
> >> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
> >>
> >> The above QEMU code is only exercised much later, after the SMBASE
> >> relocation, with the SMM stack up and running. It is used when a
> >> (preferably broadcast) SMI is triggered for firmware services (for
> >> example, the UEFI variable services).
> >>
> >> So, you are right.
> >>
> >> Considering edk2 only, the difference shouldn't matter -- when this code
> >> is invoked (via an IO port write), the SMBASEs will all have been relocated.
> >>
> >> Also, I've been informed that on real hardware, writing to APM_CNT
> >> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
> >> So I believe the above code should be closest to real hardware, and the
> >> pre-patch code was only written in unicast form for SeaBIOS's sake.
> >>
> >> I think the code is safe above. If the guest injects an SMI via APM_CNT
> >> after negotiating SMI broadcast, it should have not left any VCPUs
> >> without an SMI handler by that time. edk2 / OVMF conforms to this. In
> >> the future we can tweak this further, if necessary; we have 63 more
> >> bits, and we'll be able to reject invalid combinations of bits.
> >>
> >> Do you feel that we should skip VCPUs whose SMBASE has not been changed
> >> from the default?
> >>
> >> If so, doesn't that run the risk of missing a VCPU that has an actual
> >> SMI handler installed, and it just happens to be placed at the default
> >> SMBASE location?
> >>
> >> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
> >> but I think (a) that's not what real hardware does, and (b) it is risky
> >> if a VCPU's actual SMI handler has been installed while keeping the
> >> default SMBASE value.
> >>
> >> What do you think?  
> > (a) probably doesn't consider hotplug, so it's totally fine for the case
> > where firmware is the only one who wakes up/initializes CPUs.
> > however consider 2 CPU hotplugged and then broadcast SMI happens
> > to serve some other task (if there isn't unpark 'feature').
> > Even if real hw does it, it's behavior not defined by SDM with possibly
> > bad consequences.
> > 
> > (b) alone it's risky so I'd skip based on combination of
> >  
> >  if (default SMBASE & CPU is in reset state)
> >     skip;
> > 
> > that should be safe and it leaves open possibility to avoid adding
> > unpark 'feature' to CPU.  
> 
> The above attributes ("SMBASE" and "CPU is in reset state") are specific
> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
> X86_CPU() macro?
> 
> Can I assume here that the macro will never return NULL? (I think so,
> LPC is only used with x86.)
yep, that should work.

> 
> ... I guess SMBASE can be accessed as
> 
>   X86_CPU(cs)->env.smbase
preferred style:
    X86CPU *cpu = X86_CPU(cs);
    cpu->...

> 
> How do I figure out if the CPU is in reset state ("waiting for first
> INIT") though? Note that this state should be distinguished from simply
> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
> APs to sleep.

how about using EIP reset value?
x86_cpu_reset()
   ...
   env->eip = 0xfff0;

> 
> Thanks
> Laszlo
> 
> >   
> >>
> >> Thanks!
> >> Laszlo
> >>  
> >>>     
> >>>>
> >>>> Thanks
> >>>> Laszlo
> >>>>    
> >>>>>       
> >>>>>> +        } else {
> >>>>>> +            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>>>> +        }
> >>>>>>      }
> >>>>>>  }
> >>>>>>        
> >>>>>       
> >>>>    
> >>>     
> >>  
> >   
> 


* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 14:20               ` Igor Mammedov
@ 2017-01-17 18:53                 ` Laszlo Ersek
  2017-01-18 10:03                   ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-17 18:53 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On 01/17/17 15:20, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 14:46:05 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/17/17 14:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> On 01/17/17 12:21, Igor Mammedov wrote:  
>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>     
>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:    
>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>>>       
>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
>>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
>>>>>>>> in the following ways:
>>>>>>>>
>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>>>>>>>
>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>>>>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>>>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
>>>>>>>>         handler / dispatcher.
>>>>>>>>
>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>>>>>>>         then the APs that are not already in SMM are not brought in, and
>>>>>>>>         the BSP runs the SMI handler / dispatcher.
>>>>>>>>
>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>
>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>
>>>>>>>> The larger problem is that QEMU's current behavior diverges from the
>>>>>>>> behavior usually seen on physical hardware, and that keeps exposing
>>>>>>>> obscure corner cases, race conditions and other instabilities in edk2,
>>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at
>>>>>>>> once.
>>>>>>>>
>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
>>>>>>>> the SMI on all VCPUs.
>>>>>>>>
>>>>>>>> While the original posting of this patch
>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
>>>>>>>> this patch actually provides functional improvements.
>>>>>>>>
>>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>>     v6:
>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>     
>>>>>>>>     v5:
>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>
>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>  #define ICH9_SMB_HST_D1                         0x06
>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>>>>>>>  
>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>>>>>>>> +
>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>>>  
>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>> +            CPUState *cs;
>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>> +            }      
>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?      
>>>>>>
>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
>>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
>>>>>> code area for execution. The VCPUs are told apart from each other by
>>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
>>>>>> search criterion into a global shared array. Once they find their
>>>>>> respective private data blocks, the VCPUs don't step on each other's toes.    
>>>>> That's what I'm not sure.
>>>>> If broadcast SMI is sent before relocation all CPUs will use
>>>>> the same SMBASE and as result save their state in the same
>>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting
>>>>> each other's saved state    
>>>>
>>>> Good point!
>>>>  
>>>>> and then on exit from SMI they all will restore
>>>>> whatever state that race produced so it seems plain wrong thing to do.
>>>>>
>>>>> Are you sure that edk2 does broadcast for SMI initialization or does it
>>>>> using directed SMIs?    
>>>>
>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>>>> each individual AP in succession, by sending SMI IPIs to them, one by
>>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
>>>>
>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>> relocation, with the SMM stack up and running. It is used when a
>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>> example, the UEFI variable services).
>>>>
>>>> So, you are right.
>>>>
>>>> Considering edk2 only, the difference shouldn't matter -- when this code
>>>> is invoked (via an IO port write), the SMBASEs will all have been relocated.
>>>>
>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
>>>> So I believe the above code should be closest to real hardware, and the
>>>> pre-patch code was only written in unicast form for SeaBIOS's sake.
>>>>
>>>> I think the code is safe above. If the guest injects an SMI via APM_CNT
>>>> after negotiating SMI broadcast, it should have not left any VCPUs
>>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
>>>> the future we can tweak this further, if necessary; we have 63 more
>>>> bits, and we'll be able to reject invalid combinations of bits.
>>>>
>>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
>>>> from the default?
>>>>
>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>> SMI handler installed, and it just happens to be placed at the default
>>>> SMBASE location?
>>>>
>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
>>>> but I think (a) that's not what real hardware does, and (b) it is risky
>>>> if a VCPU's actual SMI handler has been installed while keeping the
>>>> default SMBASE value.
>>>>
>>>> What do you think?  
>>> (a) probably doesn't consider hotplug, so it's totally fine for the case
>>> where firmware is the only one who wakes up/initializes CPUs.
>>> however consider 2 CPU hotplugged and then broadcast SMI happens
>>> to serve some other task (if there isn't unpark 'feature').
>>> Even if real hw does it, it's behavior not defined by SDM with possibly
>>> bad consequences.
>>>
>>> (b) alone it's risky so I'd skip based on combination of
>>>  
>>>  if (default SMBASE & CPU is in reset state)
>>>     skip;
>>>
>>> that should be safe and it leaves open possibility to avoid adding
>>> unpark 'feature' to CPU.  
>>
>> The above attributes ("SMBASE" and "CPU is in reset state") are specific
>> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
>> X86_CPU() macro?
>>
>> Can I assume here that the macro will never return NULL? (I think so,
>> LPC is only used with x86.)
> yep, that should work.
> 
>>
>> ... I guess SMBASE can be accessed as
>>
>>   X86_CPU(cs)->env.smbase
> preferred style:
>     X86CPU *cpu = X86_CPU(cs);
>     cpu->...
> 
>>
>> How do I figure out if the CPU is in reset state ("waiting for first
>> INIT") though? Note that this state should be distinguished from simply
>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>> APs to sleep.
> 
> how about using EIP reset value?
> x86_cpu_reset()
>    ...
>    env->eip = 0xfff0;

Okay, so I tried this. It doesn't work.

This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
function from ich9_apm_ctrl_changed(), due to size reasons):

> static void ich9_apm_broadcast_smi(void)
> {
>     CPUState *cs;
>
>     cpu_synchronize_all_states(); /* <--------- mark this */
>     CPU_FOREACH(cs) {
>         X86CPU *cpu = X86_CPU(cs);
>         CPUX86State *env = &cpu->env;
>
>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>             CPUClass *k = CPU_GET_CLASS(cs);
>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>
>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>             continue;
>         }
>
>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>     }
> }

There are two cases:

(a) If I do *not* add the cpu_synchronize_all_states() call before the
loop, then the filter condition matches *despite* the APM_CNT write
being performed by OVMF *after* SMBASE relocation. I see the trace
messages in the QEMU log (this is a four VCPU Ia32 guest):

28863@1484677204.715171:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
28863@1484677204.715182:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
28863@1484677204.715184:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
28863@1484677204.715185:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3
28863@1484677204.724084:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
28863@1484677204.724088:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
28863@1484677204.724090:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
28863@1484677204.724091:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3

That is, even though SMBASE relocation succeeds first, and APM_CNT is
only written afterwards -- I proved this from the OVMF debug log -- the
code above does not see an up-to-date CPU state for the KVM VCPUs, and
it filters out the SMI for *all* VCPUs.

Needless to say, this breaks the firmware entirely; it cannot boot.

(b) If I add the cpu_synchronize_all_states() call before the loop, then
the incorrect filter matches go away; QEMU sees the KVM VCPU states
correctly, and the SMI is broadcast.

However, in this case, the boot slows to a *crawl*. It's unbearable. All
VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
with the naked eye, almost line by line.

I didn't expect that cpu_synchronize_all_states() would be so costly,
but it makes the idea of checking VCPU state in the APM_CNT handler a
non-starter.

Can we stick with the current "v6 wave 2" in light of this?

Thanks,
Laszlo


* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-17 18:53                 ` Laszlo Ersek
@ 2017-01-18 10:03                   ` Igor Mammedov
  2017-01-18 10:19                     ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-18 10:03 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On Tue, 17 Jan 2017 19:53:21 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/17/17 15:20, Igor Mammedov wrote:
> > On Tue, 17 Jan 2017 14:46:05 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> On 01/17/17 14:20, Igor Mammedov wrote:  
> >>> On Tue, 17 Jan 2017 13:06:13 +0100
> >>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>     
> >>>> On 01/17/17 12:21, Igor Mammedov wrote:    
> >>>>> On Mon, 16 Jan 2017 21:46:55 +0100
> >>>>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>>       
> >>>>>> On 01/13/17 14:09, Igor Mammedov wrote:      
> >>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
> >>>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>>>>         
> >>>>>>>> The generic edk2 SMM infrastructure prefers
> >>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
> >>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
> >>>>>>>> in the following ways:
> >>>>>>>>
> >>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
> >>>>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
> >>>>>>>>
> >>>>>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
> >>>>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
> >>>>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
> >>>>>>>>         handler / dispatcher.
> >>>>>>>>
> >>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
> >>>>>>>>         then the APs that are not already in SMM are not brought in, and
> >>>>>>>>         the BSP runs the SMI handler / dispatcher.
> >>>>>>>>
> >>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
> >>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
> >>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> >>>>>>>>
> >>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
> >>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> >>>>>>>> command from (2) can take more than 3 seconds to complete, because
> >>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
> >>>>>>>>
> >>>>>>>> The larger problem is that QEMU's current behavior diverges from the
> >>>>>>>> behavior usually seen on physical hardware, and that keeps exposing
> >>>>>>>> obscure corner cases, race conditions and other instabilities in edk2,
> >>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at
> >>>>>>>> once.
> >>>>>>>>
> >>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
> >>>>>>>> the SMI on all VCPUs.
> >>>>>>>>
> >>>>>>>> While the original posting of this patch
> >>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
> >>>>>>>> this patch actually provides functional improvements.
> >>>>>>>>
> >>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
> >>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >>>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Notes:
> >>>>>>>>     v6:
> >>>>>>>>     - no changes, pick up Michael's R-b
> >>>>>>>>     
> >>>>>>>>     v5:
> >>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>>>>>>>       DEFINE_PROP_BIT() in the next patch)
> >>>>>>>>
> >>>>>>>>  include/hw/i386/ich9.h |  3 +++
> >>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >>>>>>>> index da1118727146..18dcca7ebcbf 100644
> >>>>>>>> --- a/include/hw/i386/ich9.h
> >>>>>>>> +++ b/include/hw/i386/ich9.h
> >>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>>>>>>>  #define ICH9_SMB_HST_D1                         0x06
> >>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
> >>>>>>>>  
> >>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
> >>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> >>>>>>>> +
> >>>>>>>>  #endif /* HW_ICH9_H */
> >>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
> >>>>>>>> --- a/hw/isa/lpc_ich9.c
> >>>>>>>> +++ b/hw/isa/lpc_ich9.c
> >>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
> >>>>>>>>  
> >>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>>>>>> +        if (lpc->smi_negotiated_features &
> >>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >>>>>>>> +            CPUState *cs;
> >>>>>>>> +            CPU_FOREACH(cs) {
> >>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>>>> +            }        
> >>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?        
> >>>>>>
> >>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
> >>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
> >>>>>> code area for execution. The VCPUs are told apart from each other by
> >>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
> >>>>>> search criterion into a global shared array. Once they find their
> >>>>>> respective private data blocks, the VCPUs don't step on each other's toes.      
> >>>>> That's what I'm not sure.
> >>>>> If broadcast SMI is sent before relocation all CPUs will use
> >>>>> the same SMBASE and as result save their state in the same
> >>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting
> >>>>> each other's saved state      
> >>>>
> >>>> Good point!
> >>>>    
> >>>>> and then on exit from SMI they all will restore
> >>>>> whatever state that race produced so it seems plain wrong thing to do.
> >>>>>
> >>>>> Are you sure that edk2 does broadcast for SMI initialization or does it
> >>>>> using directed SMIs?      
> >>>>
> >>>> Hmmm, you are right. The SmmRelocateBases() function in
> >>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
> >>>> each individual AP in succession, by sending SMI IPIs to them, one by
> >>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
> >>>>
> >>>> The above QEMU code is only exercised much later, after the SMBASE
> >>>> relocation, with the SMM stack up and running. It is used when a
> >>>> (preferably broadcast) SMI is triggered for firmware services (for
> >>>> example, the UEFI variable services).
> >>>>
> >>>> So, you are right.
> >>>>
> >>>> Considering edk2 only, the difference shouldn't matter -- when this code
> >>>> is invoked (via an IO port write), the SMBASEs will all have been relocated.
> >>>>
> >>>> Also, I've been informed that on real hardware, writing to APM_CNT
> >>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
> >>>> So I believe the above code should be closest to real hardware, and the
> >>>> pre-patch code was only written in unicast form for SeaBIOS's sake.
> >>>>
> >>>> I think the code is safe above. If the guest injects an SMI via APM_CNT
> >>>> after negotiating SMI broadcast, it should have not left any VCPUs
> >>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
> >>>> the future we can tweak this further, if necessary; we have 63 more
> >>>> bits, and we'll be able to reject invalid combinations of bits.
> >>>>
> >>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
> >>>> from the default?
> >>>>
> >>>> If so, doesn't that run the risk of missing a VCPU that has an actual
> >>>> SMI handler installed, and it just happens to be placed at the default
> >>>> SMBASE location?
> >>>>
> >>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
> >>>> but I think (a) that's not what real hardware does, and (b) it is risky
> >>>> if a VCPU's actual SMI handler has been installed while keeping the
> >>>> default SMBASE value.
> >>>>
> >>>> What do you think?    
> >>> (a) probably doesn't consider hotplug, so it's totally fine for the case
> >>> where firmware is the only one who wakes up/initializes CPUs.
> >>> however consider 2 CPU hotplugged and then broadcast SMI happens
> >>> to serve some other task (if there isn't unpark 'feature').
> >>> Even if real hw does it, it's behavior not defined by SDM with possibly
> >>> bad consequences.
> >>>
> >>> (b) alone it's risky so I'd skip based on combination of
> >>>  
> >>>  if (default SMBASE & CPU is in reset state)
> >>>     skip;
> >>>
> >>> that should be safe and it leaves open possibility to avoid adding
> >>> unpark 'feature' to CPU.    
> >>
> >> The above attributes ("SMBASE" and "CPU is in reset state") are specific
> >> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
> >> X86_CPU() macro?
> >>
> >> Can I assume here that the macro will never return NULL? (I think so,
> >> LPC is only used with x86.)  
> > yep, that should work.
> >   
> >>
> >> ... I guess SMBASE can be accessed as
> >>
> >>   X86_CPU(cs)->env.smbase  
> > preferred style:
> >     X86CPU *cpu = X86_CPU(cs);
> >     cpu->...
> >   
> >>
> >> How do I figure out if the CPU is in reset state ("waiting for first
> >> INIT") though? Note that this state should be distinguished from simply
> >> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
> >> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
> >> APs to sleep.  
> > 
> > how about using EIP reset value?
> > x86_cpu_reset()
> >    ...
> >    env->eip = 0xfff0;  
> 
> Okay, so I tried this. It doesn't work.
> 
> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> function from ich9_apm_ctrl_changed(), due to size reasons):
> 
> > static void ich9_apm_broadcast_smi(void)
> > {
> >     CPUState *cs;
> >
> >     cpu_synchronize_all_states(); /* <--------- mark this */
it probably should be executed after pause_all_vcpus(),
maybe it will help.

> >     CPU_FOREACH(cs) {
> >         X86CPU *cpu = X86_CPU(cs);
> >         CPUX86State *env = &cpu->env;
> >
> >         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >             CPUClass *k = CPU_GET_CLASS(cs);
> >             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >
> >             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >             continue;
> >         }
> >
> >         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >     }
> > }  
> 
[...]
> (b) If I add the cpu_synchronize_all_states() call before the loop, then
> the incorrect filter matches go away; QEMU sees the KVM VCPU states
> correctly, and the SMI is broad-cast.
> 
> However, in this case, the boot slows to a *crawl*. It's unbearable. All
> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
> with the naked eye, almost line by line.
> I didn't expect that cpu_synchronize_all_states() would be so costly,
> but it makes the idea of checking VCPU state in the APM_CNT handler a
> non-starter.
I wonder where the bottleneck is that slows the guest down so much.
What is the actual cost of calling cpu_synchronize_all_states()?

Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
would help.

 
> Can we stick with the current "v6 wave 2" in light of this?
> 
> Thanks,
> Laszlo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 10:03                   ` Igor Mammedov
@ 2017-01-18 10:19                     ` Laszlo Ersek
  2017-01-18 10:23                       ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-18 10:19 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On 01/18/17 11:03, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 19:53:21 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/17/17 15:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 14:46:05 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> On 01/17/17 14:20, Igor Mammedov wrote:  
>>>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>     
>>>>>> On 01/17/17 12:21, Igor Mammedov wrote:    
>>>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>>>       
>>>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:      
>>>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>>>>>         
>>>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If
>>>>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it
>>>>>>>>>> in the following ways:
>>>>>>>>>>
>>>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>>>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>>>>>>>>>
>>>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM synchronization,
>>>>>>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>>>>>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
>>>>>>>>>>         handler / dispatcher.
>>>>>>>>>>
>>>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>>>>>>>>>         then the APs that are not already in SMM are not brought in, and
>>>>>>>>>>         the BSP runs the SMI handler / dispatcher.
>>>>>>>>>>
>>>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>>>
>>>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>>>
>>>>>>>>>> The larger problem is that QEMU's current behavior diverges from the
>>>>>>>>>> behavior usually seen on physical hardware, and that keeps exposing
>>>>>>>>>> obscure corner cases, race conditions and other instabilities in edk2,
>>>>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at
>>>>>>>>>> once.
>>>>>>>>>>
>>>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject
>>>>>>>>>> the SMI on all VCPUs.
>>>>>>>>>>
>>>>>>>>>> While the original posting of this patch
>>>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM
>>>>>>>>>> this patch actually provides functional improvements.
>>>>>>>>>>
>>>>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com>
>>>>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>>>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Notes:
>>>>>>>>>>     v6:
>>>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>>>     
>>>>>>>>>>     v5:
>>>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>>>
>>>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>>>  #define ICH9_SMB_HST_D1                         0x06
>>>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>>>>>>>>>  
>>>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>>>>>>>>>> +
>>>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>>>>>  
>>>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>>>> +            CPUState *cs;
>>>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>>>> +            }        
>>>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?        
>>>>>>>>
>>>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
>>>>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
>>>>>>>> code area for execution. The VCPUs are told apart from each other by
>>>>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
>>>>>>>> search criterion into a global shared array. Once they find their
>>>>>>>> respective private data blocks, the VCPUs don't step on each other's toes.      
>>>>>>> That's what I'm not sure.
>>>>>>> If broadcast SMI is sent before relocation all CPUs will use
>>>>>>> the same SMBASE and as result save their state in the same
>>>>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting
>>>>>>> each other's saved state      
>>>>>>
>>>>>> Good point!
>>>>>>    
>>>>>>> and then on exit from SMI they all will restore
>>>>>>> whatever state that race produced so it seems plain wrong thing to do.
>>>>>>>
>>>>>>> Are you sure that edk2 does broadcast for SMI initialization or does it
>>>>>>> using directed SMIs?      
>>>>>>
>>>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>>>>>> each individual AP in succession, by sending SMI IPIs to them, one by
>>>>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI.
>>>>>>
>>>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>>>> relocation, with the SMM stack up and running. It is used when a
>>>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>>>> example, the UEFI variable services).
>>>>>>
>>>>>> So, you are right.
>>>>>>
>>>>>> Considering edk2 only, the difference shouldn't matter -- when this code
>>>>>> is invoked (via an IO port write), the SMBASEs will all have been relocated.
>>>>>>
>>>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
>>>>>> So I believe the above code should be closest to real hardware, and the
>>>>>> pre-patch code was only written in unicast form for SeaBIOS's sake.
>>>>>>
>>>>>> I think the code is safe above. If the guest injects an SMI via APM_CNT
>>>>>> after negotiating SMI broadcast, it should have not left any VCPUs
>>>>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
>>>>>> the future we can tweak this further, if necessary; we have 63 more
>>>>>> bits, and we'll be able to reject invalid combinations of bits.
>>>>>>
>>>>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
>>>>>> from the default?
>>>>>>
>>>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>>>> SMI handler installed, and it just happens to be placed at the default
>>>>>> SMBASE location?
>>>>>>
>>>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
>>>>>> but I think (a) that's not what real hardware does, and (b) it is risky
>>>>>> if a VCPU's actual SMI handler has been installed while keeping the
>>>>>> default SMBASE value.
>>>>>>
>>>>>> What do you think?    
>>>>> (a) probably doesn't consider hotplug, so it's totally fine for the case
>>>>> where firmware is the only one who wakes up/initializes CPUs.
>>>>> however consider 2 CPU hotplugged and then broadcast SMI happens
>>>>> to serve some other task (if there isn't unpark 'feature').
>>>>> Even if real hw does it, it's behavior not defined by SDM with possibly
>>>>> bad consequences.
>>>>>
>>>>> (b) alone it's risky so I'd skip based on combination of
>>>>>  
>>>>>  if (default SMBASE & CPU is in reset state)
>>>>>     skip;
>>>>>
>>>>> that should be safe and it leaves open possibility to avoid adding
>>>>> unpark 'feature' to CPU.    
>>>>
>>>> The above attributes ("SMBASE" and "CPU is in reset state") are specific
>>>> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
>>>> X86_CPU() macro?
>>>>
>>>> Can I assume here that the macro will never return NULL? (I think so,
>>>> LPC is only used with x86.)  
>>> yep, that should work.
>>>   
>>>>
>>>> ... I guess SMBASE can be accessed as
>>>>
>>>>   X86_CPU(cs)->env.smbase  
>>> preferred style:
>>>     X86CPU *cpu = X86_CPU(cs);
>>>     cpu->...
>>>   
>>>>
>>>> How do I figure out if the CPU is in reset state ("waiting for first
>>>> INIT") though? Note that this state should be distinguished from simply
>>>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>>>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>>>> APs to sleep.  
>>>
>>> how about using EIP reset value?
>>> x86_cpu_reset()
>>>    ...
>>>    env->eip = 0xfff0;  
>>
>> Okay, so I tried this. It doesn't work.
>>
>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>
>>> static void ich9_apm_broadcast_smi(void)
>>> {
>>>     CPUState *cs;
>>>
>>>     cpu_synchronize_all_states(); /* <--------- mark this */
> it probably should be executed after pause_all_vcpus(),
> maybe it will help.
> 
>>>     CPU_FOREACH(cs) {
>>>         X86CPU *cpu = X86_CPU(cs);
>>>         CPUX86State *env = &cpu->env;
>>>
>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>
>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>             continue;
>>>         }
>>>
>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>     }
>>> }  
>>
> [...]
>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>> correctly, and the SMI is broad-cast.
>>
>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>> with the naked eye, almost line by line.
>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>> non-starter.
> I wonder were bottleneck is to slow down guest so much.
> What is actual cost of calling cpu_synchronize_all_states()?
> 
> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> would help.

If I prepend just a pause_all_vcpus() function call, at the top, then
the VM freezes (quite understandably) when the first SMI is raised via
APM_CNT.

If I, instead, bracket the function body with pause_all_vcpus() and
resume_all_vcpus(), like this:

static void ich9_apm_broadcast_smi(void)
{
    CPUState *cs;

    pause_all_vcpus();
    cpu_synchronize_all_states();
    CPU_FOREACH(cs) {
        X86CPU *cpu = X86_CPU(cs);
        CPUX86State *env = &cpu->env;

        if (env->smbase == 0x30000 && env->eip == 0xfff0) {
            CPUClass *k = CPU_GET_CLASS(cs);
            uint64_t cpu_arch_id = k->get_arch_id(cs);

            trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
            continue;
        }

        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
    }
    resume_all_vcpus();
}

then I see the following symptoms:
- rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
  100%
- the boot process, as observed from the OVMF log, is just as slow
  (= crawling) as without the VCPU pausing/resuming (i.e., like with
  cpu_synchronize_all_states() only); so ultimately the
  pausing/resuming doesn't help.

Thanks
Laszlo


* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 10:19                     ` Laszlo Ersek
@ 2017-01-18 10:23                       ` Laszlo Ersek
  2017-01-18 12:38                         ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-18 10:23 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On 01/18/17 11:19, Laszlo Ersek wrote:
> On 01/18/17 11:03, Igor Mammedov wrote:
>> On Tue, 17 Jan 2017 19:53:21 +0100
>> Laszlo Ersek <lersek@redhat.com> wrote:
>>

[snip]

>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>>
>>>> static void ich9_apm_broadcast_smi(void)
>>>> {
>>>>     CPUState *cs;
>>>>
>>>>     cpu_synchronize_all_states(); /* <--------- mark this */
>> it probably should be executed after pause_all_vcpus(),
>> maybe it will help.
>>
>>>>     CPU_FOREACH(cs) {
>>>>         X86CPU *cpu = X86_CPU(cs);
>>>>         CPUX86State *env = &cpu->env;
>>>>
>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>>
>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>>             continue;
>>>>         }
>>>>
>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>     }
>>>> }  
>>>
>> [...]
>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>>> correctly, and the SMI is broad-cast.
>>>
>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>>> with the naked eye, almost line by line.
>>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>>> non-starter.
>> I wonder were bottleneck is to slow down guest so much.
>> What is actual cost of calling cpu_synchronize_all_states()?
>>
>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
>> would help.
> 
> If I prepend just a pause_all_vcpus() function call, at the top, then
> the VM freezes (quite understandably) when the first SMI is raised via
> APM_CNT.
> 
> If I, instead, bracket the function body with pause_all_vcpus() and
> resume_all_vcpus(), like this:
> 
> static void ich9_apm_broadcast_smi(void)
> {
>     CPUState *cs;
> 
>     pause_all_vcpus();
>     cpu_synchronize_all_states();
>     CPU_FOREACH(cs) {
>         X86CPU *cpu = X86_CPU(cs);
>         CPUX86State *env = &cpu->env;
> 
>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>             CPUClass *k = CPU_GET_CLASS(cs);
>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> 
>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>             continue;
>         }
> 
>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>     }
>     resume_all_vcpus();
> }
> 
> then I see the following symptoms:
> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
>   100%
> - the boot process, as observed from the OVMF log, is just as slow
>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
>   cpu_synchronize_all_states() only); so ultimately the
>   pausing/resuming doesn't help.

I also tried this variant (bracketing only the sync call with pause/resume):

static void ich9_apm_broadcast_smi(void)
{
    CPUState *cs;

    pause_all_vcpus();
    cpu_synchronize_all_states();
    resume_all_vcpus();
    CPU_FOREACH(cs) {
        X86CPU *cpu = X86_CPU(cs);
        CPUX86State *env = &cpu->env;

        if (env->smbase == 0x30000 && env->eip == 0xfff0) {
            CPUClass *k = CPU_GET_CLASS(cs);
            uint64_t cpu_arch_id = k->get_arch_id(cs);

            trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
            continue;
        }

        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
    }
}

This behaves identically to the version where pause/resume bracket the
entire function body (i.e., 1 VCPU pegged, super-slow boot progress).

Laszlo


* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 10:23                       ` Laszlo Ersek
@ 2017-01-18 12:38                         ` Igor Mammedov
  2017-01-18 15:42                           ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-18 12:38 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On Wed, 18 Jan 2017 11:23:48 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/18/17 11:19, Laszlo Ersek wrote:
> > On 01/18/17 11:03, Igor Mammedov wrote:  
> >> On Tue, 17 Jan 2017 19:53:21 +0100
> >> Laszlo Ersek <lersek@redhat.com> wrote:
> >>  
> 
> [snip]
> 
> >>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> >>> function from ich9_apm_ctrl_changed(), due to size reasons):
> >>>  
> >>>> static void ich9_apm_broadcast_smi(void)
> >>>> {
> >>>>     CPUState *cs;
> >>>>
> >>>>     cpu_synchronize_all_states(); /* <--------- mark this */  
> >> it probably should be executed after pause_all_vcpus(),
> >> maybe it will help.
> >>  
> >>>>     CPU_FOREACH(cs) {
> >>>>         X86CPU *cpu = X86_CPU(cs);
> >>>>         CPUX86State *env = &cpu->env;
> >>>>
> >>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>
> >>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>             continue;
> >>>>         }
> >>>>
> >>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>     }
> >>>> }    
> >>>  
> >> [...]  
> >>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
> >>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
> >>> correctly, and the SMI is broad-cast.
> >>>
> >>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
> >>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
> >>> with the naked eye, almost line by line.
> >>> I didn't expect that cpu_synchronize_all_states() would be so costly,
> >>> but it makes the idea of checking VCPU state in the APM_CNT handler a
> >>> non-starter.  
> >> I wonder were bottleneck is to slow down guest so much.
> >> What is actual cost of calling cpu_synchronize_all_states()?
> >>
> >> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> >> would help.  
> > 
> > If I prepend just a pause_all_vcpus() function call, at the top, then
> > the VM freezes (quite understandably) when the first SMI is raised via
> > APM_CNT.
> > 
> > If I, instead, bracket the function body with pause_all_vcpus() and
> > resume_all_vcpus(), like this:
> > 
> > static void ich9_apm_broadcast_smi(void)
> > {
> >     CPUState *cs;
> > 
> >     pause_all_vcpus();
> >     cpu_synchronize_all_states();
> >     CPU_FOREACH(cs) {
> >         X86CPU *cpu = X86_CPU(cs);
> >         CPUX86State *env = &cpu->env;
> > 
> >         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >             CPUClass *k = CPU_GET_CLASS(cs);
> >             uint64_t cpu_arch_id = k->get_arch_id(cs);
> > 
> >             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >             continue;
> >         }
> > 
> >         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >     }
> >     resume_all_vcpus();
> > }
> > 
> > then I see the following symptoms:
> > - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
> >   100%
> > - the boot process, as observed from the OVMF log, is just as slow
> >   (= crawling) as without the VCPU pausing/resuming (i.e., like with
> >   cpu_synchronize_all_states() only); so ultimately the
> >   pausing/resuming doesn't help.  
> 
> I also tried this variant (bracketing only the sync call with pause/resume):
> 
> static void ich9_apm_broadcast_smi(void)
> {
>     CPUState *cs;
> 
>     pause_all_vcpus();
>     cpu_synchronize_all_states();
>     resume_all_vcpus();
>     CPU_FOREACH(cs) {
>         X86CPU *cpu = X86_CPU(cs);
>         CPUX86State *env = &cpu->env;
> 
>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>             CPUClass *k = CPU_GET_CLASS(cs);
>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> 
>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>             continue;
>         }
> 
>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>     }
> }
> 
> This behaves identically to the version where pause/resume bracket the
> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).
Wrapping the entire function into pause_all_vcpus()/resume_all_vcpus()
looks better, as then cpu_interrupt() would not need to send IPIs to the
CPUs, since pause_all_vcpus() would have done that already.

So the remaining question is what this one CPU does to make things so slow.
Does it send a lot of SMIs, or something else?


* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 12:38                         ` Igor Mammedov
@ 2017-01-18 15:42                           ` Laszlo Ersek
  2017-01-18 16:26                             ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-18 15:42 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On 01/18/17 13:38, Igor Mammedov wrote:
> On Wed, 18 Jan 2017 11:23:48 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/18/17 11:19, Laszlo Ersek wrote:
>>> On 01/18/17 11:03, Igor Mammedov wrote:  
>>>> On Tue, 17 Jan 2017 19:53:21 +0100
>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>  
>>
>> [snip]
>>
>>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>>>>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>>>>  
>>>>>> static void ich9_apm_broadcast_smi(void)
>>>>>> {
>>>>>>     CPUState *cs;
>>>>>>
>>>>>>     cpu_synchronize_all_states(); /* <--------- mark this */  
>>>> it probably should be executed after pause_all_vcpus(),
>>>> maybe it will help.
>>>>  
>>>>>>     CPU_FOREACH(cs) {
>>>>>>         X86CPU *cpu = X86_CPU(cs);
>>>>>>         CPUX86State *env = &cpu->env;
>>>>>>
>>>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>>>>
>>>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>>>>             continue;
>>>>>>         }
>>>>>>
>>>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>     }
>>>>>> }    
>>>>>  
>>>> [...]  
>>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>>>>> correctly, and the SMI is broad-cast.
>>>>>
>>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
>>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>>>>> with the naked eye, almost line by line.
>>>>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>>>>> non-starter.  
>>>> I wonder were bottleneck is to slow down guest so much.
>>>> What is actual cost of calling cpu_synchronize_all_states()?
>>>>
>>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
>>>> would help.  
>>>
>>> If I prepend just a pause_all_vcpus() function call, at the top, then
>>> the VM freezes (quite understandably) when the first SMI is raised via
>>> APM_CNT.
>>>
>>> If I, instead, bracket the function body with pause_all_vcpus() and
>>> resume_all_vcpus(), like this:
>>>
>>> static void ich9_apm_broadcast_smi(void)
>>> {
>>>     CPUState *cs;
>>>
>>>     pause_all_vcpus();
>>>     cpu_synchronize_all_states();
>>>     CPU_FOREACH(cs) {
>>>         X86CPU *cpu = X86_CPU(cs);
>>>         CPUX86State *env = &cpu->env;
>>>
>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>
>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>             continue;
>>>         }
>>>
>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>     }
>>>     resume_all_vcpus();
>>> }
>>>
>>> then I see the following symptoms:
>>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
>>>   100%
>>> - the boot process, as observed from the OVMF log, is just as slow
>>>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
>>>   cpu_synchronize_all_states() only); so ultimately the
>>>   pausing/resuming doesn't help.  
>>
>> I also tried this variant (bracketing only the sync call with pause/resume):
>>
>> static void ich9_apm_broadcast_smi(void)
>> {
>>     CPUState *cs;
>>
>>     pause_all_vcpus();
>>     cpu_synchronize_all_states();
>>     resume_all_vcpus();
>>     CPU_FOREACH(cs) {
>>         X86CPU *cpu = X86_CPU(cs);
>>         CPUX86State *env = &cpu->env;
>>
>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>             CPUClass *k = CPU_GET_CLASS(cs);
>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>
>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>             continue;
>>         }
>>
>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>     }
>> }
>>
>> This behaves identically to the version where pause/resume bracket the
>> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).
> wrapping the entire function in pause_all_vcpus()/resume_all_vcpus()
> looks better, as then cpu_interrupt() would not need to send IPIs to the
> CPUs, since pause_all_vcpus() would have done that already.

I don't understand, why would a pause operation send IPIs?

Do you mean the resume? Even in that case, why would the resume send
precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls)
and not some other kind of interrupt?

> So the remaining question is what this 1 CPU does to make things so slow.
> Does it send a lot of SMIs or something else?

The firmware uses many SMIs / Trigger() calls, for example for the
implementation of UEFI variable services, and SMM lockbox operations.
However, it does not use *masses* of them. Following the OVMF log, I see
that *individual* APM_CNT writes take very long (on the order of a
second, even), which implies that a single cpu_synchronize_all_states()
function call takes very long.

I briefly checked what cpu_synchronize_all_states() does, digging down
its call stack. I don't see anything in it that we should or could
modify for the sake of this work.


I think the current line of investigation is way out of scope for this
work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3",
cpu_interrupt() just works as intended.

Why are we branching out this far from that? Just to avoid the CPU
unpark feature that Paolo suggested (which would also be negotiated)?

What is so wrong with the planned CPU unpark stuff that justifies
blocking this patch?

Honestly I don't understand why we can't approach these features
*incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU
hotplug later. I think the current negotiation pattern is flexible and
future-proofed enough for that. It was you who suggested, for
simplifying the last patch in the series, that we completely ignore VCPU
hotplug for now (and I whole-heartedly agreed).

Let me put it like this: what is in this patch, in your opinion, that
*irrevocably* breaks the upcoming VCPU hotplug feature, or else limits
our options in implementing it? In my opinion, we can later implement
*anything at all*, dependent on new feature bits. We can even decide to
reject guest feature requests that specify *only* bit#0; which
practically means disabling guests (using SMIs) that don't conform to
our VCPU hotplug implementation.

Laszlo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 15:42                           ` Laszlo Ersek
@ 2017-01-18 16:26                             ` Igor Mammedov
  2017-01-18 17:23                               ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-18 16:26 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On Wed, 18 Jan 2017 16:42:27 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/18/17 13:38, Igor Mammedov wrote:
> > On Wed, 18 Jan 2017 11:23:48 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> On 01/18/17 11:19, Laszlo Ersek wrote:  
> >>> On 01/18/17 11:03, Igor Mammedov wrote:    
> >>>> On Tue, 17 Jan 2017 19:53:21 +0100
> >>>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>    
> >>
> >> [snip]
> >>  
> >>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> >>>>> function from ich9_apm_ctrl_changed(), due to size reasons):
> >>>>>    
> >>>>>> static void ich9_apm_broadcast_smi(void)
> >>>>>> {
> >>>>>>     CPUState *cs;
> >>>>>>
> >>>>>>     cpu_synchronize_all_states(); /* <--------- mark this */    
> >>>> it probably should be executed after pause_all_vcpus(),
> >>>> maybe it will help.
> >>>>    
> >>>>>>     CPU_FOREACH(cs) {
> >>>>>>         X86CPU *cpu = X86_CPU(cs);
> >>>>>>         CPUX86State *env = &cpu->env;
> >>>>>>
> >>>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>>>
> >>>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>
> >>>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>>     }
> >>>>>> }      
> >>>>>    
> >>>> [...]    
> >>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
> >>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
> >>>>> correctly, and the SMI is broad-cast.
> >>>>>
> >>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
> >>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
> >>>>> with the naked eye, almost line by line.
> >>>>> I didn't expect that cpu_synchronize_all_states() would be so costly,
> >>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a
> >>>>> non-starter.    
> >>>> I wonder where the bottleneck is that slows the guest down so much.
> >>>> What is the actual cost of calling cpu_synchronize_all_states()?
> >>>>
> >>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> >>>> would help.    
> >>>
> >>> If I prepend just a pause_all_vcpus() function call, at the top, then
> >>> the VM freezes (quite understandably) when the first SMI is raised via
> >>> APM_CNT.
> >>>
> >>> If I, instead, bracket the function body with pause_all_vcpus() and
> >>> resume_all_vcpus(), like this:
> >>>
> >>> static void ich9_apm_broadcast_smi(void)
> >>> {
> >>>     CPUState *cs;
> >>>
> >>>     pause_all_vcpus();
> >>>     cpu_synchronize_all_states();
> >>>     CPU_FOREACH(cs) {
> >>>         X86CPU *cpu = X86_CPU(cs);
> >>>         CPUX86State *env = &cpu->env;
> >>>
> >>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>
> >>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>             continue;
> >>>         }
> >>>
> >>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>     }
> >>>     resume_all_vcpus();
> >>> }
> >>>
> >>> then I see the following symptoms:
> >>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
> >>>   100%
> >>> - the boot process, as observed from the OVMF log, is just as slow
> >>>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
> >>>   cpu_synchronize_all_states() only); so ultimately the
> >>>   pausing/resuming doesn't help.    
> >>
> >> I also tried this variant (bracketing only the sync call with pause/resume):
> >>
> >> static void ich9_apm_broadcast_smi(void)
> >> {
> >>     CPUState *cs;
> >>
> >>     pause_all_vcpus();
> >>     cpu_synchronize_all_states();
> >>     resume_all_vcpus();
> >>     CPU_FOREACH(cs) {
> >>         X86CPU *cpu = X86_CPU(cs);
> >>         CPUX86State *env = &cpu->env;
> >>
> >>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>             CPUClass *k = CPU_GET_CLASS(cs);
> >>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>
> >>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>             continue;
> >>         }
> >>
> >>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>     }
> >> }
> >>
> >> This behaves identically to the version where pause/resume bracket the
> >> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).  
> > wrapping the entire function in pause_all_vcpus()/resume_all_vcpus()
> > looks better, as then cpu_interrupt() would not need to send IPIs to the
> > CPUs, since pause_all_vcpus() would have done that already.
> 
> I don't understand, why would a pause operation send IPIs?
pause_all forces an exit on all CPUs, and cpu_interrupt() does the same
for a single CPU; so if all CPUs have already been kicked out of the
running state, cpu_interrupt() can skip the kick-out step.

> Do you mean the resume? Even in that case, why would the resume send
> precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls)
> and not some other kind of interrupt?
> 
> > So the remaining question is what this 1 CPU does to make things so slow.
> > Does it send a lot of SMIs or something else?
> 
> The firmware uses many SMIs / Trigger() calls, for example for the
> implementation of UEFI variable services, and SMM lockbox operations.
> However, it does not use *masses* of them. Following the OVMF log, I see
> that *individual* APM_CNT writes take very long (on the order of a
> second, even), which implies that a single cpu_synchronize_all_states()
> function call takes very long.
> 
> I briefly checked what cpu_synchronize_all_states() does, digging down
> its call stack. I don't see anything in it that we should or could
> modify for the sake of this work.
> 
> 
> I think the current line of investigation is way out of scope for this
> work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3",
> cpu_interrupt() just works as intended.
It's not that I'm outright against v6 as is;
it just seems that something is fundamentally broken in the
cpu_synchronize_all_states() call chain, and instead of fixing that issue,
the same SMBASE race case would just be ignored, since CPU hotplug is
broken with OVMF anyway.

So I'm fine with applying v6 as is,
with a follow-up fix to cpu_synchronize_all_states(), and then a
follow-up that filters out CPUs still at the reset-state SMBASE,
if possible in the 2.9 merge window.

> Why are we branching out this far from that? Just to avoid the CPU
> unpark feature that Paolo suggested (which would also be negotiated)?
> 
> What is so wrong with the planned CPU unpark stuff that justifies
> blocking this patch?
> 
> Honestly I don't understand why we can't approach these features
> *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU
> hotplug later. I think the current negotiation pattern is flexible and
> future-proofed enough for that. It was you who suggested, for
> simplifying the last patch in the series, that we completely ignore VCPU
> hotplug for now (and I whole-heartedly agreed).
> 
> Let me put it like this: what is in this patch, in your opinion, that
> *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits
> our options in implementing it? In my opinion, we can later implement
> *anything at all*, dependent on new feature bits. We can even decide to
> reject guest feature requests that specify *only* bit#0; which
> practically means disabling guests (using SMIs) that don't conform to
> our VCPU hotplug implementation.
Why do negotiation if hotplug could be done without it? On the hw
(QEMU) side, hotplug seems to be implemented sufficiently (but we are
still missing SMRR support). The only thing needed to make hotplug work
with OVMF might be issuing an SMI, either directly from QEMU or from
AML (a write into APM_CNT), after a CPU is hotplugged.

> 
> Laszlo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 16:26                             ` Igor Mammedov
@ 2017-01-18 17:23                               ` Laszlo Ersek
  2017-01-18 18:06                                 ` Igor Mammedov
  0 siblings, 1 reply; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-18 17:23 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On 01/18/17 17:26, Igor Mammedov wrote:
> On Wed, 18 Jan 2017 16:42:27 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/18/17 13:38, Igor Mammedov wrote:
>>> On Wed, 18 Jan 2017 11:23:48 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> On 01/18/17 11:19, Laszlo Ersek wrote:  
>>>>> On 01/18/17 11:03, Igor Mammedov wrote:    
>>>>>> On Tue, 17 Jan 2017 19:53:21 +0100
>>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>>    
>>>>
>>>> [snip]
>>>>  
>>>>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>>>>>>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>>>>>>    
>>>>>>>> static void ich9_apm_broadcast_smi(void)
>>>>>>>> {
>>>>>>>>     CPUState *cs;
>>>>>>>>
>>>>>>>>     cpu_synchronize_all_states(); /* <--------- mark this */    
>>>>>> it probably should be executed after pause_all_vcpus(),
>>>>>> maybe it will help.
>>>>>>    
>>>>>>>>     CPU_FOREACH(cs) {
>>>>>>>>         X86CPU *cpu = X86_CPU(cs);
>>>>>>>>         CPUX86State *env = &cpu->env;
>>>>>>>>
>>>>>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>>>>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>>>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>>>>>>
>>>>>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>>>>>>             continue;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>>     }
>>>>>>>> }      
>>>>>>>    
>>>>>> [...]    
>>>>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>>>>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>>>>>>> correctly, and the SMI is broad-cast.
>>>>>>>
>>>>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
>>>>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>>>>>>> with the naked eye, almost line by line.
>>>>>>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>>>>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>>>>>>> non-starter.    
> >>>>>> I wonder where the bottleneck is that slows the guest down so much.
> >>>>>> What is the actual cost of calling cpu_synchronize_all_states()?
>>>>>>
>>>>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
>>>>>> would help.    
>>>>>
>>>>> If I prepend just a pause_all_vcpus() function call, at the top, then
>>>>> the VM freezes (quite understandably) when the first SMI is raised via
>>>>> APM_CNT.
>>>>>
>>>>> If I, instead, bracket the function body with pause_all_vcpus() and
>>>>> resume_all_vcpus(), like this:
>>>>>
>>>>> static void ich9_apm_broadcast_smi(void)
>>>>> {
>>>>>     CPUState *cs;
>>>>>
>>>>>     pause_all_vcpus();
>>>>>     cpu_synchronize_all_states();
>>>>>     CPU_FOREACH(cs) {
>>>>>         X86CPU *cpu = X86_CPU(cs);
>>>>>         CPUX86State *env = &cpu->env;
>>>>>
>>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>>>
>>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>>>             continue;
>>>>>         }
>>>>>
>>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>     }
>>>>>     resume_all_vcpus();
>>>>> }
>>>>>
>>>>> then I see the following symptoms:
>>>>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
>>>>>   100%
>>>>> - the boot process, as observed from the OVMF log, is just as slow
>>>>>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
>>>>>   cpu_synchronize_all_states() only); so ultimately the
>>>>>   pausing/resuming doesn't help.    
>>>>
>>>> I also tried this variant (bracketing only the sync call with pause/resume):
>>>>
>>>> static void ich9_apm_broadcast_smi(void)
>>>> {
>>>>     CPUState *cs;
>>>>
>>>>     pause_all_vcpus();
>>>>     cpu_synchronize_all_states();
>>>>     resume_all_vcpus();
>>>>     CPU_FOREACH(cs) {
>>>>         X86CPU *cpu = X86_CPU(cs);
>>>>         CPUX86State *env = &cpu->env;
>>>>
>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>>
>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>>             continue;
>>>>         }
>>>>
>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>     }
>>>> }
>>>>
>>>> This behaves identically to the version where pause/resume bracket the
>>>> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).  
> >>> wrapping the entire function in pause_all_vcpus()/resume_all_vcpus()
> >>> looks better, as then cpu_interrupt() would not need to send IPIs to the
> >>> CPUs, since pause_all_vcpus() would have done that already.
>>
>> I don't understand, why would a pause operation send IPIs?
> pause_all forces an exit on all CPUs, and cpu_interrupt() does the same
> for a single CPU; so if all CPUs have already been kicked out of the
> running state, cpu_interrupt() can skip the kick-out step.

Understood.

> 
>> Do you mean the resume? Even in that case, why would the resume send
>> precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls)
>> and not some other kind of interrupt?
>>
> >>> So the remaining question is what this 1 CPU does to make things so slow.
> >>> Does it send a lot of SMIs or something else?
>>
>> The firmware uses many SMIs / Trigger() calls, for example for the
>> implementation of UEFI variable services, and SMM lockbox operations.
>> However, it does not use *masses* of them. Following the OVMF log, I see
>> that *individual* APM_CNT writes take very long (on the order of a
>> second, even), which implies that a single cpu_synchronize_all_states()
>> function call takes very long.
>>
>> I briefly checked what cpu_synchronize_all_states() does, digging down
>> its call stack. I don't see anything in it that we should or could
>> modify for the sake of this work.
>>
>>
>> I think the current line of investigation is way out of scope for this
>> work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3",
>> cpu_interrupt() just works as intended.

> It's not that I'm outright against v6 as is;
> it just seems that something is fundamentally broken in the
> cpu_synchronize_all_states() call chain

Maybe, but I don't see how fixing that is a prerequisite for this patch.

> and instead of fixing that issue,
> the same SMBASE race case would just be ignored

I don't understand. What SMBASE race? Ignored by what and when? Sorry,
I'm lost.

If you are saying that, for the moment we can ignore any SMBASE races
(whatever they may be) because CPU hotplug is broken with OVMF anyway,
then I agree. I would like to limit the scope of this series as much as
possible.

> since CPU hotplug is
> broken with OVMF anyway.
> 
> So I'm fine with applying v6 as is,
> with a follow-up fix to cpu_synchronize_all_states()

I cannot promise to work on cpu_synchronize_all_states(). It ties into
KVM and it goes over my head.

In fact, if there is a problem with cpu_synchronize_all_states(), it
must be a general one.

The only reason I even thought of calling cpu_synchronize_all_states()
here, after seeing the out-of-date register values, is that I recalled
it from when I worked on dump-guest-memory. dump-guest-memory writes the
register file to the dump, and so it needs cpu_synchronize_all_states().

But, again, cpu_synchronize_all_states() is a general facility, and I
can't promise to work on it.

> and then a
> follow-up that filters out CPUs still at the reset-state SMBASE,
> if possible in the 2.9 merge window.

I'm still not convinced that this filtering is absolutely necessary, in
the face of feature negotiation; *but*, if someone figures out what's
wrong with cpu_synchronize_all_states(), and manages to make its
performance impact negligible, then I am certainly willing to add the
filtering to the SMI broadcast.

At that point, it shouldn't be much work, I hope.

> 
>> Why are we branching out this far from that? Just to avoid the CPU
>> unpark feature that Paolo suggested (which would also be negotiated)?
>>
>> What is so wrong with the planned CPU unpark stuff that justifies
>> blocking this patch?
>>
>> Honestly I don't understand why we can't approach these features
>> *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU
>> hotplug later. I think the current negotiation pattern is flexible and
>> future-proofed enough for that. It was you who suggested, for
>> simplifying the last patch in the series, that we completely ignore VCPU
>> hotplug for now (and I whole-heartedly agreed).
>>
>> Let me put it like this: what is in this patch, in your opinion, that
>> *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits
>> our options in implementing it? In my opinion, we can later implement
>> *anything at all*, dependent on new feature bits. We can even decide to
>> reject guest feature requests that specify *only* bit#0; which
>> practically means disabling guests (using SMIs) that don't conform to
>> our VCPU hotplug implementation.

> Why do negotiation if hotplug could be done without it?

Because it lets us segment the feature set into several small steps,
where I have a chance of grasping the small pieces and maybe even
contributing small, surgical patches?

You and Paolo might be seeing the big picture, and I certainly don't
want to go *against* the big picture. I just want to advance in as few
steps as possible.

There's the historical observation that a compiler has as many stages as
there are sub-teams in the compiler team. (Please excuse me for not
recalling the exact phrasing and the person who quipped it.) That
observation exists for a reason. Feature and module boundaries tend to
mirror human understanding.

> on the hw (QEMU)
> side, hotplug seems to be implemented sufficiently (but we are still
> missing SMRR support). The only thing needed to make hotplug work with
> OVMF might be issuing an SMI, either directly from QEMU or from AML (a
> write into APM_CNT), after a CPU is hotplugged.

Maybe. For me, that looms in the future.

If you agree with my participation as outlined above; that is,
- I care about this exact patch, as posted,
- someone else looks into cpu_synchronize_all_states(),
- and then I'm willing to care about the incremental patch for the
  filtering,

then I propose we go ahead with this patch. It's the last one in the
series that needs your R-b.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 17:23                               ` Laszlo Ersek
@ 2017-01-18 18:06                                 ` Igor Mammedov
  2017-01-18 19:11                                   ` Laszlo Ersek
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Mammedov @ 2017-01-18 18:06 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

On Wed, 18 Jan 2017 18:23:45 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/18/17 17:26, Igor Mammedov wrote:
> > On Wed, 18 Jan 2017 16:42:27 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> On 01/18/17 13:38, Igor Mammedov wrote:  
> >>> On Wed, 18 Jan 2017 11:23:48 +0100
> >>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>     
> >>>> On 01/18/17 11:19, Laszlo Ersek wrote:    
> >>>>> On 01/18/17 11:03, Igor Mammedov wrote:      
> >>>>>> On Tue, 17 Jan 2017 19:53:21 +0100
> >>>>>> Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>>>      
> >>>>
> >>>> [snip]
> >>>>    
> >>>>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> >>>>>>> function from ich9_apm_ctrl_changed(), due to size reasons):
> >>>>>>>      
> >>>>>>>> static void ich9_apm_broadcast_smi(void)
> >>>>>>>> {
> >>>>>>>>     CPUState *cs;
> >>>>>>>>
> >>>>>>>>     cpu_synchronize_all_states(); /* <--------- mark this */      
> >>>>>> it probably should be executed after pause_all_vcpus(),
> >>>>>> maybe it will help.
> >>>>>>      
> >>>>>>>>     CPU_FOREACH(cs) {
> >>>>>>>>         X86CPU *cpu = X86_CPU(cs);
> >>>>>>>>         CPUX86State *env = &cpu->env;
> >>>>>>>>
> >>>>>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>>>>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>>>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>>>>>
> >>>>>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>>>>>             continue;
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>>>>     }
> >>>>>>>> }        
> >>>>>>>      
> >>>>>> [...]      
> >>>>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
> >>>>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
> >>>>>>> correctly, and the SMI is broad-cast.
> >>>>>>>
> >>>>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All
> >>>>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
> >>>>>>> with the naked eye, almost line by line.
> >>>>>>> I didn't expect that cpu_synchronize_all_states() would be so costly,
> >>>>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a
> >>>>>>> non-starter.      
> >>>>>> I wonder where the bottleneck is that slows the guest down so much.
> >>>>>> What is the actual cost of calling cpu_synchronize_all_states()?
> >>>>>>
> >>>>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> >>>>>> would help.      
> >>>>>
> >>>>> If I prepend just a pause_all_vcpus() function call, at the top, then
> >>>>> the VM freezes (quite understandably) when the first SMI is raised via
> >>>>> APM_CNT.
> >>>>>
> >>>>> If I, instead, bracket the function body with pause_all_vcpus() and
> >>>>> resume_all_vcpus(), like this:
> >>>>>
> >>>>> static void ich9_apm_broadcast_smi(void)
> >>>>> {
> >>>>>     CPUState *cs;
> >>>>>
> >>>>>     pause_all_vcpus();
> >>>>>     cpu_synchronize_all_states();
> >>>>>     CPU_FOREACH(cs) {
> >>>>>         X86CPU *cpu = X86_CPU(cs);
> >>>>>         CPUX86State *env = &cpu->env;
> >>>>>
> >>>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>>
> >>>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>>             continue;
> >>>>>         }
> >>>>>
> >>>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>     }
> >>>>>     resume_all_vcpus();
> >>>>> }
> >>>>>
> >>>>> then I see the following symptoms:
> >>>>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
> >>>>>   100%
> >>>>> - the boot process, as observed from the OVMF log, is just as slow
> >>>>>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
> >>>>>   cpu_synchronize_all_states() only); so ultimately the
> >>>>>   pausing/resuming doesn't help.      
> >>>>
> >>>> I also tried this variant (bracketing only the sync call with pause/resume):
> >>>>
> >>>> static void ich9_apm_broadcast_smi(void)
> >>>> {
> >>>>     CPUState *cs;
> >>>>
> >>>>     pause_all_vcpus();
> >>>>     cpu_synchronize_all_states();
> >>>>     resume_all_vcpus();
> >>>>     CPU_FOREACH(cs) {
> >>>>         X86CPU *cpu = X86_CPU(cs);
> >>>>         CPUX86State *env = &cpu->env;
> >>>>
> >>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>             CPUClass *k = CPU_GET_CLASS(cs);
> >>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>
> >>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>             continue;
> >>>>         }
> >>>>
> >>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>     }
> >>>> }
> >>>>
> >>>> This behaves identically to the version where pause/resume bracket the
> >>>> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).    
> >>> wrapping the entire function in pause_all_vcpus()/resume_all_vcpus()
> >>> looks better, as then cpu_interrupt() would not need to send IPIs to the
> >>> CPUs, since pause_all_vcpus() would have done that already.
> >>
> >> I don't understand, why would a pause operation send IPIs?  
> > pause_all forces an exit on all CPUs, and cpu_interrupt() does the same
> > for a single CPU; so if all CPUs have already been kicked out of the
> > running state, cpu_interrupt() can skip the kick-out step.
> 
> Understood.
> 
> >   
> >> Do you mean the resume? Even in that case, why would the resume send
> >> precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls)
> >> and not some other kind of interrupt?
> >>  
> >>> So the remaining question is what this 1 CPU does to make things so slow.
> >>> Does it send a lot of SMIs or something else?
> >>
> >> The firmware uses many SMIs / Trigger() calls, for example for the
> >> implementation of UEFI variable services, and SMM lockbox operations.
> >> However, it does not use *masses* of them. Following the OVMF log, I see
> >> that *individual* APM_CNT writes take very long (on the order of a
> >> second, even), which implies that a single cpu_synchronize_all_states()
> >> function call takes very long.
> >>
> >> I briefly checked what cpu_synchronize_all_states() does, digging down
> >> its call stack. I don't see anything in it that we should or could
> >> modify for the sake of this work.
> >>
> >>
> >> I think the current line of investigation is way out of scope for this
> >> work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3",
> >> cpu_interrupt() just works as intended.  
> 
> > It's not that I'm outright against v6 as is;
> > it just seems that something is fundamentally broken in the
> > cpu_synchronize_all_states() call chain
> 
> Maybe, but I don't see how fixing that is a prerequisite for this patch.
> 
> > and instead of fixing that issue,
> > the same SMBASE race case would just be ignored
> 
> I don't understand. What SMBASE race? Ignored by what and when? Sorry,
> I'm lost.
> 
> If you are saying that, for the moment we can ignore any SMBASE races
> (whatever they may be) because CPU hotplug is broken with OVMF anyway,
> then I agree. I would like to limit the scope of this series as much as
> possible.
> 
> > since CPU hotplug is
> > broken with OVMF anyway.
> > 
> > So I'm fine with applying v6 as is,
> > with a follow-up fix to cpu_synchronize_all_states()
> 
> I cannot promise to work on cpu_synchronize_all_states(). It ties into
> KVM and it goes over my head.
> 
> In fact, if there is a problem with cpu_synchronize_all_states(), it
> must be a general one.
> 
> The only reason I even thought of calling cpu_synchronize_all_states()
> here, after seeing the out-of-date register values, is that I recalled
> it from when I worked on dump-guest-memory. dump-guest-memory writes the
> register file to the dump, and so it needs cpu_synchronize_all_states().
> 
> But, again, cpu_synchronize_all_states() is a general facility, and I
> can't promise to work on it.
> 
> > and then a
> > follow-up that filters out CPUs still at the reset-state SMBASE,
> > if possible in the 2.9 merge window.
> 
> I'm still not convinced that this filtering is absolutely necessary, in
> the face of feature negotiation; *but*, if someone figures out what's
> wrong with cpu_synchronize_all_states(), and manages to make its
> performance impact negligible, then I am certainly willing to add the
> filtering to the SMI broadcast.
> 
> At that point, it shouldn't be much work, I hope.
> 
> >   
> >> Why are we branching out this far from that? Just to avoid the CPU
> >> unpark feature that Paolo suggested (which would also be negotiated)?
> >>
> >> What is so wrong with the planned CPU unpark stuff that justifies
> >> blocking this patch?
> >>
> >> Honestly I don't understand why we can't approach these features
> >> *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU
> >> hotplug later. I think the current negotiation pattern is flexible and
> >> future-proofed enough for that. It was you who suggested, for
> >> simplifying the last patch in the series, that we completely ignore VCPU
> >> hotplug for now (and I whole-heartedly agreed).
> >>
> >> Let me put it like this: what is in this patch, in your opinion, that
> >> *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits
> >> our options in implementing it? In my opinion, we can later implement
> >> *anything at all*, dependent on new feature bits. We can even decide to
> >> reject guest feature requests that specify *only* bit#0; which
> >> practically means disabling guests (using SMIs) that don't conform to
> >> our VCPU hotplug implementation.  
> 
> > Why do negotiation if hotplug could be done without it,  
> 
> Because it lets us segment the feature set into several small steps,
> where I have a chance of grasping the small pieces and maybe even
> contributing small, surgical patches?
> 
> You and Paolo might be seeing the big picture, and I certainly don't
> want to go *against* the big picture. I just want to advance in as
> few steps as possible.
> 
> There's the historical observation that a compiler has as many stages as
> there are sub-teams in the compiler team. (Please excuse me for not
> recalling the exact phrasing and the person who quipped it.) That
> observation exists for a reason. Feature and module boundaries tend to
> mirror human understanding.
> 
> > on hw(QEMU)
> > side hotplug seems to be implemented sufficiently (but we are still
> > missing SMRR support). The only thing needed to make hotplug work with
> > OVMF might be issuing an SMI, either directly from QEMU or from AML (a
> > write into APM_CNT), after a CPU is hotplugged.  
> 
> Maybe. For me, that looms in the future.
> 
> If you agree with my participation as outlined above; that is,
> - I care about this exact patch, as posted,
> - someone else looks into cpu_synchronize_all_states(),
CCing Radim, who graciously agreed to take a look at what's wrong from the KVM side.
Could you give him the steps to reproduce the issue, please?

> - and then I'm willing to care about the incremental patch for the
>   filtering,
ok


> then I propose we go ahead with this patch. It's the last one in the
> series that needs your R-b.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> 
> Thanks
> Laszlo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI
  2017-01-12 18:24 [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Laszlo Ersek
                   ` (2 preceding siblings ...)
  2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types Laszlo Ersek
@ 2017-01-18 19:02 ` Michael S. Tsirkin
  3 siblings, 0 replies; 28+ messages in thread
From: Michael S. Tsirkin @ 2017-01-18 19:02 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: qemu devel list, Eduardo Habkost, Gerd Hoffmann, Igor Mammedov,
	Paolo Bonzini

On Thu, Jan 12, 2017 at 07:24:43PM +0100, Laszlo Ersek wrote:
> Previous version (v5):
> http://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg01902.html
> 
> Changes in v6 (see the individual patches for details):
> - pick up feedback tags
> - rename "smi_broadcast" to "x-smi-broadcast" [Eduardo]
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>

Series:

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

Paolo, I'm merging wave 1, I assume you will want to merge wave 2?

> Thanks!
> Laszlo
> 
> Laszlo Ersek (3):
>   hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg
>   hw/isa/lpc_ich9: add broadcast SMI feature
>   hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types
> 
>  include/hw/i386/ich9.h | 13 ++++++++
>  include/hw/i386/pc.h   |  5 +++
>  hw/isa/lpc_ich9.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 108 insertions(+), 1 deletion(-)
> 
> -- 
> 2.9.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
  2017-01-18 18:06                                 ` Igor Mammedov
@ 2017-01-18 19:11                                   ` Laszlo Ersek
  0 siblings, 0 replies; 28+ messages in thread
From: Laszlo Ersek @ 2017-01-18 19:11 UTC (permalink / raw)
  To: Igor Mammedov, Radim Krčmář
  Cc: Michael Kinney, Paolo Bonzini, Gerd Hoffmann, qemu devel list,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 5532 bytes --]

On 01/18/17 19:06, Igor Mammedov wrote:
> On Wed, 18 Jan 2017 18:23:45 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 

[snip]

>> If you agree with my participation as outlined above; that is,
>> - I care about this exact patch, as posted,
>> - someone else looks into cpu_synchronize_all_states(),
> CCing Radim, who graciously agreed to take a look at what's wrong from the KVM side.

Thank you, Radim!

> Could you give him the steps to reproduce the issue, please?

Absolutely.

(1) My laptop is a ThinkPad W541, with an i7-4810MQ CPU. Quad-core with
HT enabled. It supports "unrestricted_guest" and both EPT and EPTAD. For
completeness, here are my current KVM/Intel module parameters:

> ==> /sys/module/kvm_intel/parameters/emulate_invalid_guest_state <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/enable_apicv <==
> N
>
> ==> /sys/module/kvm_intel/parameters/enable_shadow_vmcs <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/ept <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/eptad <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/fasteoi <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/flexpriority <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/nested <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/ple_gap <==
> 128
>
> ==> /sys/module/kvm_intel/parameters/ple_window <==
> 4096
>
> ==> /sys/module/kvm_intel/parameters/ple_window_grow <==
> 2
>
> ==> /sys/module/kvm_intel/parameters/ple_window_max <==
> 1073741823
>
> ==> /sys/module/kvm_intel/parameters/ple_window_shrink <==
> 0
>
> ==> /sys/module/kvm_intel/parameters/pml <==
> N
>
> ==> /sys/module/kvm_intel/parameters/unrestricted_guest <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/vmm_exclusive <==
> Y
>
> ==> /sys/module/kvm_intel/parameters/vpid <==
> Y

and the CPU flags:

> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>                   mca cmov pat pse36 clflush dts acpi mmx fxsr sse
>                   sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
>                   constant_tsc arch_perfmon pebs bts rep_good nopl
>                   xtopology nonstop_tsc aperfmperf eagerfpu pni
>                   pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>                   ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic
>                   movbe popcnt tsc_deadline_timer aes xsave avx f16c
>                   rdrand lahf_lm abm ida arat epb pln pts dtherm
>                   tpr_shadow vnmi flexpriority ept vpid fsgsbase
>                   tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt

(2) My host kernel is a semi-recent RHEL7 development kernel,
3.10.0-537.el7.x86_64. Other than that, the system is a fairly
un-modified RHEL-7.3.z install.

(3) QEMU was configured with:

  ./configure \
    --target-list=x86_64-softmmu,i386-softmmu \
    --audio-drv-list=alsa \
    --enable-werror \
    --enable-spice \
    --disable-gtk \
    --enable-trace-backends=log \
    --disable-stack-protector \
    --enable-debug

(I have the SDL devel packages installed, so for graphical display, the
above config will select SDL.)

(4) The QEMU tree comes together from the following "layers":

* baseline:
  current-ish master (0f2d17c1a59c9f11e7a874fb56fee3714b101705)

* on top of that:
  [PATCH v6 wave 1 0/4] fw-cfg: support writeable blobs and more files
  [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI
  (please see the mailing list archive links in
  <https://bugzilla.redhat.com/show_bug.cgi?id=1412327#c3>)

* on top of that, apply the attached patch called "filter.patch".

(5) Download the "OVMF_CODE.32.fd" and "OVMF_VARS.fd" files from
<http://people.redhat.com/~lersek/bcast-smi-filter-428b6508-b3d6-4dd4-a329-691aa1eba80e/>.

This is an Ia32, -D SMM_REQUIRE build of OVMF, approximately at current
master, with my pending (not as yet posted) broadcast SMI enablement
patches on top, *plus* a trivial one-line debug patch that logs every
time the Trigger() function is called, and OVMF is about to write the
APM_CNT IO port.

(6) Run QEMU with the following commands:

  cp OVMF_VARS.fd varstore.fd

  qemu-system-i386 \
  \
  -machine pc-q35-2.9,smm=on,accel=kvm \
  -global driver=cfi.pflash01,property=secure,value=on \
  -smp cpus=4 \
  -cpu coreduo,-nx \
  \
  -m 2048 \
  \
  -device qxl-vga \
  -net none \
  \
  -drive if=pflash,readonly,format=raw,file=OVMF_CODE.32.fd \
  -drive if=pflash,format=raw,file=varstore.fd \
  \
  -debugcon file:debug.log \
  -global isa-debugcon.iobase=0x402 \
  \
  -chardev stdio,signal=off,mux=on,id=char0 \
  -mon chardev=char0,mode=readline \
  -serial chardev:char0

Without "filter.patch" applied, you should be reaching the UEFI shell
real quick.

With "filter.patch" applied, the boot will appear hung. However, if you
follow the OVMF debug log, written to the file "debug.log", from another
terminal, for example with "tail -f", you see that the boot is actually
progressing, just extremely slowly.

Every time the firmware is about to raise the SMI via the APM_CNT IO
port write, the debug log will say "SmmControl2DxeTrigger: 111", and
that's when the boot stalls for about one second (with one VCPU pegged
at 100%).

> 
>> - and then I'm willing to care about the incremental patch for the
>>   filtering,
> ok
> 
> 
>> then I propose we go ahead with this patch. It's the last one in the
>> series that needs your R-b.
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>

Thank you very much for working through this with me.

Next question: who can pick this up please? Michael indicated he'd
prefer Paolo. Paolo, do you agree?

Thanks,
Laszlo

[-- Attachment #2: filter.patch --]
[-- Type: text/x-patch, Size: 2137 bytes --]

diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 59930dd9d09d..701a0821705b 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -50,6 +50,7 @@
 #include "qom/cpu.h"
 #include "hw/nvram/fw_cfg.h"
 #include "qemu/cutils.h"
+#include "trace.h"
 
 /*****************************************************************************/
 /* ICH9 LPC PCI to ISA bridge */
@@ -423,6 +424,29 @@ void ich9_lpc_pm_init(PCIDevice *lpc_pci, bool smm_enabled)
 
 /* APM */
 
+static void ich9_apm_broadcast_smi(void)
+{
+    CPUState *cs;
+
+    pause_all_vcpus();
+    cpu_synchronize_all_states();
+    CPU_FOREACH(cs) {
+        X86CPU *cpu = X86_CPU(cs);
+        CPUX86State *env = &cpu->env;
+
+        if (env->smbase == 0x30000 && env->eip == 0xfff0) {
+            CPUClass *k = CPU_GET_CLASS(cs);
+            uint64_t cpu_arch_id = k->get_arch_id(cs);
+
+            trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
+            continue;
+        }
+
+        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
+    }
+    resume_all_vcpus();
+}
+
 static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
 {
     ICH9LPCState *lpc = arg;
@@ -439,10 +463,7 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
     if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
         if (lpc->smi_negotiated_features &
             (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
-            CPUState *cs;
-            CPU_FOREACH(cs) {
-                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
-            }
+            ich9_apm_broadcast_smi();
         } else {
             cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
         }
diff --git a/hw/isa/trace-events b/hw/isa/trace-events
index 9faca41a975d..a0df525d042a 100644
--- a/hw/isa/trace-events
+++ b/hw/isa/trace-events
@@ -7,3 +7,6 @@ pc87312_info_floppy(uint32_t base) "base 0x%x"
 pc87312_info_ide(uint32_t base) "base 0x%x"
 pc87312_info_parallel(uint32_t base, uint32_t irq) "base 0x%x, irq %u"
 pc87312_info_serial(int n, uint32_t base, uint32_t irq) "id=%d, base 0x%x, irq %u"
+
+# hw/isa/lpc_ich9.c
+ich9_apm_broadcast_smi_skip(uint64_t cpu_arch_id) "cpu_arch_id=0x%"PRIx64

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-01-18 19:11 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-12 18:24 [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Laszlo Ersek
2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 1/3] hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg Laszlo Ersek
2017-01-13 10:15   ` Igor Mammedov
2017-01-13 11:24     ` Laszlo Ersek
2017-01-13 13:34   ` Igor Mammedov
2017-01-16 20:51     ` Laszlo Ersek
2017-01-17  8:42       ` Igor Mammedov
2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature Laszlo Ersek
2017-01-13 13:09   ` Igor Mammedov
2017-01-16 20:46     ` Laszlo Ersek
2017-01-17 11:21       ` Igor Mammedov
2017-01-17 12:06         ` Laszlo Ersek
2017-01-17 13:20           ` Igor Mammedov
2017-01-17 13:46             ` Laszlo Ersek
2017-01-17 14:20               ` Igor Mammedov
2017-01-17 18:53                 ` Laszlo Ersek
2017-01-18 10:03                   ` Igor Mammedov
2017-01-18 10:19                     ` Laszlo Ersek
2017-01-18 10:23                       ` Laszlo Ersek
2017-01-18 12:38                         ` Igor Mammedov
2017-01-18 15:42                           ` Laszlo Ersek
2017-01-18 16:26                             ` Igor Mammedov
2017-01-18 17:23                               ` Laszlo Ersek
2017-01-18 18:06                                 ` Igor Mammedov
2017-01-18 19:11                                   ` Laszlo Ersek
2017-01-12 18:24 ` [Qemu-devel] [PATCH v6 wave 2 3/3] hw/isa/lpc_ich9: negotiate SMI broadcast on pc-q35-2.9+ machine types Laszlo Ersek
2017-01-13 13:36   ` Igor Mammedov
2017-01-18 19:02 ` [Qemu-devel] [PATCH v6 wave 2 0/3] q35: add negotiable broadcast SMI Michael S. Tsirkin
