[Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt
@ 2014-12-12 15:58 Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

Addressing Peter's review comments for v3.

Patch #2 is new; I thought recognizing & rejecting the overlapping I/O
ports in case the data port was wider than 1 byte merited a separate
patch.

Other changes are noted per-patch.

Rebased the series and ran checkpatch on all patches.

Thanks
Laszlo

Laszlo Ersek (8):
  fw_cfg: max access size and region size are the same for MMIO data reg
  fw_cfg: generalize overlap check for combining control and data I/O
    ports
  fw_cfg: introduce the "data_memwidth" property
  fw_cfg: expose the "data_memwidth" prop with
    fw_cfg_init_data_memwidth()
  arm: add fw_cfg to "virt" board
  hw/loader: split out load_image_gzipped_buffer()
  hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI
    firmware

 include/hw/arm/arm.h      |  5 +++
 include/hw/loader.h       |  9 +++++
 include/hw/nvram/fw_cfg.h |  3 ++
 hw/arm/boot.c             | 91 ++++++++++++++++++++++++++++++++++++++++++++---
 hw/arm/virt.c             | 22 ++++++++++++
 hw/core/loader.c          | 30 +++++++++++-----
 hw/nvram/fw_cfg.c         | 49 ++++++++++++++++++++-----
 7 files changed, 187 insertions(+), 22 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-16 13:48   ` Andrew Jones
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 2/8] fw_cfg: generalize overlap check for combining control and data I/O ports Laszlo Ersek
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

Make it clear that the maximum access size to the MMIO data register
determines the full size of the memory region.

Currently the max access size is 1. Ensure that if a larger size were used
in "fw_cfg_data_mem_ops.valid.max_access_size", the memory subsystem would
split the access to byte-sized accesses internally, in increasing address
order.

fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
"address" or "size"; they just call the sequential fw_cfg_read() and
fw_cfg_write() functions, correspondingly. Therefore the automatic
splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
native.)

This patch doesn't change behavior.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v4:
    - unchanged
    
    v3:
    - new in v3 [Drew Jones]

 hw/nvram/fw_cfg.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index c4b78ed..7f6031c 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -30,9 +30,8 @@
 #include "qemu/error-report.h"
 #include "qemu/config-file.h"
 
 #define FW_CFG_SIZE 2
-#define FW_CFG_DATA_SIZE 1
 #define TYPE_FW_CFG "fw_cfg"
 #define FW_CFG_NAME "fw_cfg"
 #define FW_CFG_PATH "/machine/" FW_CFG_NAME
 #define FW_CFG(obj) OBJECT_CHECK(FWCfgState, (obj), TYPE_FW_CFG)
@@ -323,8 +322,9 @@ static const MemoryRegionOps fw_cfg_data_mem_ops = {
     .valid = {
         .min_access_size = 1,
         .max_access_size = 1,
     },
+    .impl.max_access_size = 1,
 };
 
 static const MemoryRegionOps fw_cfg_comb_mem_ops = {
     .read = fw_cfg_comb_read,
@@ -608,9 +608,10 @@ static void fw_cfg_initfn(Object *obj)
     memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
                           "fwcfg.ctl", FW_CFG_SIZE);
     sysbus_init_mmio(sbd, &s->ctl_iomem);
     memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
-                          "fwcfg.data", FW_CFG_DATA_SIZE);
+                          "fwcfg.data",
+                          fw_cfg_data_mem_ops.valid.max_access_size);
     sysbus_init_mmio(sbd, &s->data_iomem);
     /* In case ctl and data overlap: */
     memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
                           "fwcfg", FW_CFG_SIZE);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 2/8] fw_cfg: generalize overlap check for combining control and data I/O ports
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property Laszlo Ersek
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

If the board code overlaps the (currently only byte-wide) data I/O port
with the high byte of the selector I/O port, we install the "comb_iomem"
MemoryRegion.

Generalize the check to see if *any* byte of the data I/O port overlaps
with the high byte of the selector I/O port. If that's the case:
- If the data I/O port is just one byte wide, then keep the current
  behavior.
- Otherwise, reject the combination:

qemu-system-target: only a byte-wide data I/O port can be combined
qemu-system-target: Initialization of device fw_cfg failed

The patch doesn't immediately change behavior, because:
- fw_cfg_data_mem_ops.valid.max_access_size == 1
- ctl_io_last == s->ctl_iobase + 1
- data_io_end == s->data_iobase + 1

- The condition

    ctl_io_last >= s->data_iobase && ctl_io_last < data_io_end

  is equivalent to

    ctl_io_last >= s->data_iobase && ctl_io_last < s->data_iobase + 1

  after substituting "data_io_end". Further, the second relation can be
  rewritten as

    ctl_io_last >= s->data_iobase && ctl_io_last <= s->data_iobase

  which gives

    ctl_io_last == s->data_iobase

  After substituting "ctl_io_last", we get

    s->ctl_iobase + 1 == s->data_iobase

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v4:
    - new in v4 [Peter]

 hw/nvram/fw_cfg.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index 7f6031c..eb0ad83 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -620,11 +620,24 @@ static void fw_cfg_initfn(Object *obj)
 static void fw_cfg_realize(DeviceState *dev, Error **errp)
 {
     FWCfgState *s = FW_CFG(dev);
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
+    uint32_t ctl_io_last;
+    uint32_t data_io_end;
 
-    if (s->ctl_iobase + 1 == s->data_iobase) {
-        sysbus_add_io(sbd, s->ctl_iobase, &s->comb_iomem);
+    if (s->ctl_iobase == 0 && s->data_iobase == 0) {
+        return;
+    }
+
+    ctl_io_last = s->ctl_iobase + FW_CFG_SIZE - 1;
+    data_io_end = s->data_iobase + fw_cfg_data_mem_ops.valid.max_access_size;
+    if (ctl_io_last >= s->data_iobase && ctl_io_last < data_io_end) {
+        if (fw_cfg_data_mem_ops.valid.max_access_size == 1) {
+            sysbus_add_io(sbd, s->ctl_iobase, &s->comb_iomem);
+        } else {
+            error_setg(errp, "only a byte-wide data I/O port can be combined");
+            return;
+        }
     } else {
         if (s->ctl_iobase) {
             sysbus_add_io(sbd, s->ctl_iobase, &s->ctl_iomem);
         }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 2/8] fw_cfg: generalize overlap check for combining control and data I/O ports Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-16 12:06   ` Alexander Graf
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 4/8] fw_cfg: expose the "data_memwidth" prop with fw_cfg_init_data_memwidth() Laszlo Ersek
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

The "data_memwidth" property is capable of changing the maximum valid
access size to the MMIO data register, and (corresponding to the previous
patch) resizes the memory region similarly, at device realization time.

(Because "data_iomem" is configured and installed dynamically now, we must
delay those steps to the realize callback.)

The default value of "data_memwidth" is set so that we don't yet diverge
from "fw_cfg_data_mem_ops".

Most of the fw_cfg users will stick with the default, and for them we
should continue using the statically allocated "fw_cfg_data_mem_ops". This
is beneficial for debugging because gdb can resolve pointers referencing
static objects to the names of those objects.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v4:
    - reject I/O port combining if data register is wider than 1 byte
      [Peter]
    
    v3:
    - new in v3 [Drew Jones]

 hw/nvram/fw_cfg.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index eb0ad83..0947136 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -50,8 +50,9 @@ struct FWCfgState {
     /*< public >*/
 
     MemoryRegion ctl_iomem, data_iomem, comb_iomem;
     uint32_t ctl_iobase, data_iobase;
+    uint32_t data_memwidth;
     FWCfgEntry entries[2][FW_CFG_MAX_ENTRY];
     FWCfgFiles *files;
     uint16_t cur_entry;
     uint32_t cur_offset;
@@ -569,8 +570,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 
     dev = qdev_create(NULL, TYPE_FW_CFG);
     qdev_prop_set_uint32(dev, "ctl_iobase", ctl_port);
     qdev_prop_set_uint32(dev, "data_iobase", data_port);
+    qdev_prop_set_uint32(dev, "data_memwidth",
+                         fw_cfg_data_mem_ops.valid.max_access_size);
     d = SYS_BUS_DEVICE(dev);
 
     s = FW_CFG(dev);
 
@@ -607,12 +610,8 @@ static void fw_cfg_initfn(Object *obj)
 
     memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
                           "fwcfg.ctl", FW_CFG_SIZE);
     sysbus_init_mmio(sbd, &s->ctl_iomem);
-    memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
-                          "fwcfg.data",
-                          fw_cfg_data_mem_ops.valid.max_access_size);
-    sysbus_init_mmio(sbd, &s->data_iomem);
     /* In case ctl and data overlap: */
     memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
                           "fwcfg", FW_CFG_SIZE);
 }
@@ -620,19 +619,31 @@ static void fw_cfg_initfn(Object *obj)
 static void fw_cfg_realize(DeviceState *dev, Error **errp)
 {
     FWCfgState *s = FW_CFG(dev);
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
+    const MemoryRegionOps *data_mem_ops = &fw_cfg_data_mem_ops;
     uint32_t ctl_io_last;
     uint32_t data_io_end;
 
+    if (s->data_memwidth > data_mem_ops->valid.max_access_size) {
+        MemoryRegionOps *ops;
+
+        ops = g_memdup(data_mem_ops, sizeof(*data_mem_ops));
+        ops->valid.max_access_size = s->data_memwidth;
+        data_mem_ops = ops;
+    }
+    memory_region_init_io(&s->data_iomem, OBJECT(s), data_mem_ops, s,
+                          "fwcfg.data", data_mem_ops->valid.max_access_size);
+    sysbus_init_mmio(sbd, &s->data_iomem);
+
     if (s->ctl_iobase == 0 && s->data_iobase == 0) {
         return;
     }
 
     ctl_io_last = s->ctl_iobase + FW_CFG_SIZE - 1;
-    data_io_end = s->data_iobase + fw_cfg_data_mem_ops.valid.max_access_size;
+    data_io_end = s->data_iobase + data_mem_ops->valid.max_access_size;
     if (ctl_io_last >= s->data_iobase && ctl_io_last < data_io_end) {
-        if (fw_cfg_data_mem_ops.valid.max_access_size == 1) {
+        if (data_mem_ops->valid.max_access_size == 1) {
             sysbus_add_io(sbd, s->ctl_iobase, &s->comb_iomem);
         } else {
             error_setg(errp, "only a byte-wide data I/O port can be combined");
             return;
@@ -649,8 +660,9 @@ static void fw_cfg_realize(DeviceState *dev, Error **errp)
 
 static Property fw_cfg_properties[] = {
     DEFINE_PROP_UINT32("ctl_iobase", FWCfgState, ctl_iobase, -1),
     DEFINE_PROP_UINT32("data_iobase", FWCfgState, data_iobase, -1),
+    DEFINE_PROP_UINT32("data_memwidth", FWCfgState, data_memwidth, -1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
 FWCfgState *fw_cfg_find(void)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 4/8] fw_cfg: expose the "data_memwidth" prop with fw_cfg_init_data_memwidth()
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
                   ` (2 preceding siblings ...)
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 5/8] arm: add fw_cfg to "virt" board Laszlo Ersek
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

We rebase fw_cfg_init() to the new function for compatibility with current
callers.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v4:
    - unchanged
    
    v3:
    - new in v3 [Drew Jones]

 include/hw/nvram/fw_cfg.h |  3 +++
 hw/nvram/fw_cfg.c         | 15 +++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 56e1ed7..fd7c35c 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -77,8 +77,11 @@ void fw_cfg_add_file_callback(FWCfgState *s, const char *filename,
                               FWCfgReadCallback callback, void *callback_opaque,
                               void *data, size_t len);
 void *fw_cfg_modify_file(FWCfgState *s, const char *filename, void *data,
                          size_t len);
+FWCfgState *fw_cfg_init_data_memwidth(uint32_t ctl_port, uint32_t data_port,
+                                      hwaddr ctl_addr, hwaddr data_addr,
+                                      uint32_t data_memwidth);
 FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
                         hwaddr crl_addr, hwaddr data_addr);
 
 FWCfgState *fw_cfg_find(void);
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index 0947136..081acd2 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -560,20 +560,20 @@ static void fw_cfg_machine_ready(struct Notifier *n, void *data)
     FWCfgState *s = container_of(n, FWCfgState, machine_ready);
     qemu_register_reset(fw_cfg_machine_reset, s);
 }
 
-FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
-                        hwaddr ctl_addr, hwaddr data_addr)
+FWCfgState *fw_cfg_init_data_memwidth(uint32_t ctl_port, uint32_t data_port,
+                                      hwaddr ctl_addr, hwaddr data_addr,
+                                      uint32_t data_memwidth)
 {
     DeviceState *dev;
     SysBusDevice *d;
     FWCfgState *s;
 
     dev = qdev_create(NULL, TYPE_FW_CFG);
     qdev_prop_set_uint32(dev, "ctl_iobase", ctl_port);
     qdev_prop_set_uint32(dev, "data_iobase", data_port);
-    qdev_prop_set_uint32(dev, "data_memwidth",
-                         fw_cfg_data_mem_ops.valid.max_access_size);
+    qdev_prop_set_uint32(dev, "data_memwidth", data_memwidth);
     d = SYS_BUS_DEVICE(dev);
 
     s = FW_CFG(dev);
 
@@ -602,8 +602,15 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 
     return s;
 }
 
+FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
+                        hwaddr ctl_addr, hwaddr data_addr)
+{
+    return fw_cfg_init_data_memwidth(ctl_port, data_port, ctl_addr, data_addr,
+                                     fw_cfg_data_mem_ops.valid.max_access_size);
+}
+
 static void fw_cfg_initfn(Object *obj)
 {
     SysBusDevice *sbd = SYS_BUS_DEVICE(obj);
     FWCfgState *s = FW_CFG(obj);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 5/8] arm: add fw_cfg to "virt" board
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
                   ` (3 preceding siblings ...)
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 4/8] fw_cfg: expose the "data_memwidth" prop with fw_cfg_init_data_memwidth() Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 6/8] hw/loader: split out load_image_gzipped_buffer() Laszlo Ersek
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c,
ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the "virt"
board.

Because MMIO access is slow on ARM KVM, we enable the guest, with
fw_cfg_init_data_memwidth(), to transfer up to 8 bytes with a single
access. This has been measured to speed up transfers up to 7.5-fold,
relative to single byte data access, on both ARM KVM and x86_64 TCG.

The mmio register block of fw_cfg is advertized in the device tree. As
base address we pick 0x09020000, which conforms to the comment preceding
"a15memmap": it falls in the miscellaneous device I/O range 128MB..256MB,
and it is aligned at 64KB. The DTB properties follow the documentation in
the Linux source file "Documentation/devicetree/bindings/arm/fw-cfg.txt".

fw_cfg automatically exports a number of files to the guest; for example,
"bootorder" (see fw_cfg_machine_reset()).

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---

Notes:
    v4:
    - unchanged
    
    v3:
    - enable 8-byte wide access to fw_cfg MMIO data port [Drew]
    - expose the exact size of the fused region in the DTB [Peter]
    - reorder data register and control register so that we can keep the
      region contiguous and still conform to alignment requirements [Laszlo]

 hw/arm/virt.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 314e55b..274aaae 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -67,8 +67,9 @@ enum {
     VIRT_GIC_CPU,
     VIRT_UART,
     VIRT_MMIO,
     VIRT_RTC,
+    VIRT_FW_CFG,
 };
 
 typedef struct MemMapEntry {
     hwaddr base;
@@ -106,8 +107,9 @@ static const MemMapEntry a15memmap[] = {
     [VIRT_GIC_DIST] =   { 0x08000000, 0x00010000 },
     [VIRT_GIC_CPU] =    { 0x08010000, 0x00010000 },
     [VIRT_UART] =       { 0x09000000, 0x00001000 },
     [VIRT_RTC] =        { 0x09010000, 0x00001000 },
+    [VIRT_FW_CFG] =     { 0x09020000, 0x0000000a },
     [VIRT_MMIO] =       { 0x0a000000, 0x00000200 },
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     /* 0x10000000 .. 0x40000000 reserved for PCI */
     [VIRT_MEM] =        { 0x40000000, 30ULL * 1024 * 1024 * 1024 },
@@ -518,8 +520,25 @@ static void create_flash(const VirtBoardInfo *vbi)
     qemu_fdt_setprop_cell(vbi->fdt, nodename, "bank-width", 4);
     g_free(nodename);
 }
 
+static void create_fw_cfg(const VirtBoardInfo *vbi)
+{
+    hwaddr base = vbi->memmap[VIRT_FW_CFG].base;
+    hwaddr size = vbi->memmap[VIRT_FW_CFG].size;
+    char *nodename;
+
+    fw_cfg_init_data_memwidth(0, 0, base + 8, base, 8);
+
+    nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base);
+    qemu_fdt_add_subnode(vbi->fdt, nodename);
+    qemu_fdt_setprop_string(vbi->fdt, nodename,
+                            "compatible", "qemu,fw-cfg-mmio");
+    qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg",
+                                 2, base, 2, size);
+    g_free(nodename);
+}
+
 static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size)
 {
     const VirtBoardInfo *board = (const VirtBoardInfo *)binfo;
 
@@ -603,8 +622,10 @@ static void machvirt_init(MachineState *machine)
      * no backend is created the transport will just sit harmlessly idle.
      */
     create_virtio_devices(vbi, pic);
 
+    create_fw_cfg(vbi);
+
     vbi->bootinfo.ram_size = machine->ram_size;
     vbi->bootinfo.kernel_filename = machine->kernel_filename;
     vbi->bootinfo.kernel_cmdline = machine->kernel_cmdline;
     vbi->bootinfo.initrd_filename = machine->initrd_filename;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 6/8] hw/loader: split out load_image_gzipped_buffer()
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
                   ` (4 preceding siblings ...)
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 5/8] arm: add fw_cfg to "virt" board Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg Laszlo Ersek
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 8/8] hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI firmware Laszlo Ersek
  7 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

In the next patch we'd like to reuse the image decompression facility
without installing the output as a ROM at a specific guest-phys address.

In addition, expose LOAD_IMAGE_MAX_GUNZIP_BYTES, because that's a
straightforward "max_sz" argument for the new load_image_gzipped_buffer().

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---

Notes:
    v4:
    - drop space "between function name and open parenthesis '('" [Peter]
    
    v3:
    - unchanged

 include/hw/loader.h |  9 +++++++++
 hw/core/loader.c    | 30 +++++++++++++++++++++---------
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/hw/loader.h b/include/hw/loader.h
index 6481639..8997620 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -15,8 +15,17 @@ int get_image_size(const char *filename);
 int load_image(const char *filename, uint8_t *addr); /* deprecated */
 ssize_t load_image_size(const char *filename, void *addr, size_t size);
 int load_image_targphys(const char *filename, hwaddr,
                         uint64_t max_sz);
+
+/* This is the limit on the maximum uncompressed image size that
+ * load_image_gzipped_buffer() and load_image_gzipped() will read. It prevents
+ * g_malloc() in those functions from allocating a huge amount of memory.
+ */
+#define LOAD_IMAGE_MAX_GUNZIP_BYTES (256 << 20)
+
+int load_image_gzipped_buffer(const char *filename, uint64_t max_sz,
+                              uint8_t **buffer);
 int load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz);
 
 #define ELF_LOAD_FAILED       -1
 #define ELF_LOAD_NOT_ELF      -2
diff --git a/hw/core/loader.c b/hw/core/loader.c
index 7527fd3..f2b34da 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -613,16 +613,11 @@ int load_ramdisk(const char *filename, hwaddr addr, uint64_t max_sz)
     return load_uboot_image(filename, NULL, &addr, NULL, IH_TYPE_RAMDISK,
                             NULL, NULL);
 }
 
-/* This simply prevents g_malloc in the function below from allocating
- * a huge amount of memory, by placing a limit on the maximum
- * uncompressed image size that load_image_gzipped will read.
- */
-#define LOAD_IMAGE_MAX_GUNZIP_BYTES (256 << 20)
-
-/* Load a gzip-compressed kernel. */
-int load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz)
+/* Load a gzip-compressed kernel to a dynamically allocated buffer. */
+int load_image_gzipped_buffer(const char *filename, uint64_t max_sz,
+                              uint8_t **buffer)
 {
     uint8_t *compressed_data = NULL;
     uint8_t *data = NULL;
     gsize len;
@@ -652,17 +647,34 @@ int load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz)
                 filename);
         goto out;
     }
 
-    rom_add_blob_fixed(filename, data, bytes, addr);
+    /* trim to actual size and return to caller */
+    *buffer = g_realloc(data, bytes);
     ret = bytes;
+    /* ownership has been transferred to caller */
+    data = NULL;
 
  out:
     g_free(compressed_data);
     g_free(data);
     return ret;
 }
 
+/* Load a gzip-compressed kernel. */
+int load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz)
+{
+    int bytes;
+    uint8_t *data;
+
+    bytes = load_image_gzipped_buffer(filename, max_sz, &data);
+    if (bytes != -1) {
+        rom_add_blob_fixed(filename, data, bytes, addr);
+        g_free(data);
+    }
+    return bytes;
+}
+
 /*
  * Functions for reboot-persistent memory regions.
  *  - used for vga bios and option roms.
  *  - also linux kernel (-kernel / -initrd).
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
                   ` (5 preceding siblings ...)
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 6/8] hw/loader: split out load_image_gzipped_buffer() Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  2014-12-16 12:15   ` Alexander Graf
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 8/8] hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI firmware Laszlo Ersek
  7 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

Introduce the new boolean field "arm_boot_info.firmware_loaded". When this
field is set, it means that the portion of guest DRAM that the VCPU
normally starts to execute, or the pflash chip that the VCPU normally
starts to execute, has been populated by board-specific code with
full-fledged guest firmware code, before the board calls
arm_load_kernel().

Simultaneously, "arm_boot_info.firmware_loaded" guarantees that the board
code has set up the global firmware config instance, for arm_load_kernel()
to find with fw_cfg_find().

Guest kernel (-kernel) and guest firmware (-bios, -pflash) has always been
possible to specify independently on the command line. The following cases
should be considered:

nr  -bios    -pflash  -kernel  description
             unit#0
--  -------  -------  -------  -------------------------------------------
1   present  present  absent   Board code rejects this case, -bios and
    present  present  present  -pflash unit#0 are exclusive. Left intact
                               by this patch.

2   absent   absent   present  Traditional kernel loading, with qemu's
                               minimal board firmware. Left intact by this
                               patch.

3   absent   present  absent   Preexistent case for booting guest firmware
    present  absent   absent   loaded with -bios or -pflash. Left intact
                               by this patch.

4   absent   absent   absent   Preexistent case for not loading any
                               firmware or kernel up-front. Left intact by
                               this patch.

5   present  absent   present  New case introduced by this patch: kernel
    absent   present  present  image is passed to externally loaded
                               firmware in unmodified form, using fw_cfg.

An easy way to see that this patch doesn't interfere with existing cases
is to realize that "info->firmware_loaded" is constant zero at this point.
Which makes the "outer" condition unchanged, and the "inner" condition
(with the fw_cfg-related code) dead.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---

Notes:
    v4:
    - drop space "between function name and open parenthesis '('" [Peter]
    
    v3:
    - unchanged

 include/hw/arm/arm.h |  5 +++
 hw/arm/boot.c        | 91 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
index cefc9e6..dd69d66 100644
--- a/include/hw/arm/arm.h
+++ b/include/hw/arm/arm.h
@@ -65,8 +65,13 @@ struct arm_boot_info {
     int is_linux;
     hwaddr initrd_start;
     hwaddr initrd_size;
     hwaddr entry;
+
+    /* Boot firmware has been loaded, typically at address 0, with -bios or
+     * -pflash. It also implies that fw_cfg_find() will succeed.
+     */
+    bool firmware_loaded;
 };
 void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info);
 
 /* Multiplication factor to convert from system clock ticks to qemu timer
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index e6a3c5b..5c65d16 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -477,8 +477,62 @@ static void do_cpu_reset(void *opaque)
         }
     }
 }
 
+/**
+ * load_image_to_fw_cfg() - Load an image file into an fw_cfg entry identified
+ *                          by key.
+ * @fw_cfg:     The firmware config instance to store the data in.
+ * @size_key:   The firmware config key to store the size of the loaded data
+ *              under, with fw_cfg_add_i32().
+ * @data_key:   The firmware config key to store the loaded data under, with
+ *              fw_cfg_add_bytes().
+ * @image_name: The name of the image file to load. If it is NULL, the function
+ *              returns without doing anything.
+ * @decompress: Whether the image should be  decompressed (gunzipped) before
+ *              adding it to fw_cfg.
+ *
+ * In case of failure, the function prints an error message to stderr and the
+ * process exits with status 1.
+ */
+static void load_image_to_fw_cfg(FWCfgState *fw_cfg, uint16_t size_key,
+                                 uint16_t data_key, const char *image_name,
+                                 bool decompress)
+{
+    size_t size;
+    uint8_t *data;
+
+    if (image_name == NULL) {
+        return;
+    }
+
+    if (decompress) {
+        int bytes;
+
+        bytes = load_image_gzipped_buffer(image_name,
+                                          LOAD_IMAGE_MAX_GUNZIP_BYTES, &data);
+        if (bytes == -1) {
+            fprintf(stderr, "failed to load and decompress \"%s\"\n",
+                    image_name);
+            exit(1);
+        }
+        size = bytes;
+    } else {
+        gchar *contents;
+        gsize length;
+
+        if (!g_file_get_contents(image_name, &contents, &length, NULL)) {
+            fprintf(stderr, "failed to load \"%s\"\n", image_name);
+            exit(1);
+        }
+        size = length;
+        data = (uint8_t *)contents;
+    }
+
+    fw_cfg_add_i32(fw_cfg, size_key, size);
+    fw_cfg_add_bytes(fw_cfg, data_key, data, size);
+}
+
 void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
 {
     CPUState *cs;
     int kernel_size;
@@ -499,21 +553,48 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
         qemu_register_reset(do_cpu_reset, ARM_CPU(cs));
     }
 
     /* Load the kernel.  */
-    if (!info->kernel_filename) {
+    if (!info->kernel_filename || info->firmware_loaded) {
 
         if (have_dtb(info)) {
-            /* If we have a device tree blob, but no kernel to supply it to,
-             * copy it to the base of RAM for a bootloader to pick up.
+            /* If we have a device tree blob, but no kernel to supply it to (or
+             * the kernel is supposed to be loaded by the bootloader), copy the
+             * DTB to the base of RAM for the bootloader to pick up.
              */
             if (load_dtb(info->loader_start, info, 0) < 0) {
                 exit(1);
             }
         }
 
-        /* If no kernel specified, do nothing; we will start from address 0
-         * (typically a boot ROM image) in the same way as hardware.
+        if (info->kernel_filename) {
+            FWCfgState *fw_cfg;
+            bool decompress_kernel;
+
+            fw_cfg = fw_cfg_find();
+            decompress_kernel = arm_feature(&cpu->env, ARM_FEATURE_AARCH64);
+
+            /* Expose the kernel, the command line, and the initrd in fw_cfg.
+             * We don't process them here at all, it's all left to the
+             * firmware.
+             */
+            load_image_to_fw_cfg(fw_cfg,
+                                 FW_CFG_KERNEL_SIZE, FW_CFG_KERNEL_DATA,
+                                 info->kernel_filename, decompress_kernel);
+            load_image_to_fw_cfg(fw_cfg,
+                                 FW_CFG_INITRD_SIZE, FW_CFG_INITRD_DATA,
+                                 info->initrd_filename, false);
+
+            if (info->kernel_cmdline) {
+                fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE,
+                               strlen(info->kernel_cmdline) + 1);
+                fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA,
+                                  info->kernel_cmdline);
+            }
+        }
+
+        /* We will start from address 0 (typically a boot ROM image) in the
+         * same way as hardware.
          */
         return;
     }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [Qemu-devel] [PATCH v4 8/8] hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI firmware
  2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
                   ` (6 preceding siblings ...)
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg Laszlo Ersek
@ 2014-12-12 15:58 ` Laszlo Ersek
  7 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-12 15:58 UTC (permalink / raw)
  To: peter.maydell, qemu-devel, rjones, drjones, lersek

The virt board already ensures mutual exclusion between -bios and -pflash
unit#0; we only need to set "bootinfo.firmware_loaded", introduced in the
previous patch, if either of those options was used to load the guest
firmware.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
---

Notes:
    v4:
    - unchanged
    
    v3:
    - unchanged

 hw/arm/virt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 274aaae..0b5ad7c 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -632,8 +632,9 @@ static void machvirt_init(MachineState *machine)
     vbi->bootinfo.nb_cpus = smp_cpus;
     vbi->bootinfo.board_id = -1;
     vbi->bootinfo.loader_start = vbi->memmap[VIRT_MEM].base;
     vbi->bootinfo.get_dtb = machvirt_dtb;
+    vbi->bootinfo.firmware_loaded = bios_name || drive_get(IF_PFLASH, 0, 0);
     arm_load_kernel(ARM_CPU(first_cpu), &vbi->bootinfo);
 }
 
 static QEMUMachine machvirt_a15_machine = {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property Laszlo Ersek
@ 2014-12-16 12:06   ` Alexander Graf
  2014-12-16 12:42     ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2014-12-16 12:06 UTC (permalink / raw)
  To: Laszlo Ersek, peter.maydell, qemu-devel, rjones, drjones



On 12.12.14 16:58, Laszlo Ersek wrote:
> The "data_memwidth" property is capable of changing the maximum valid
> access size to the MMIO data register, and (corresponding to the previous
> patch) resizes the memory region similarly, at device realization time.
> 
> (Because "data_iomem" is configured and installed dynamically now, we must
> delay those steps to the realize callback.)
> 
> The default value of "data_memwidth" is set so that we don't yet diverge
> from "fw_cfg_data_mem_ops".
> 
> Most of the fw_cfg users will stick with the default, and for them we
> should continue using the statically allocated "fw_cfg_data_mem_ops". This
> is beneficial for debugging because gdb can resolve pointers referencing
> static objects to the names of those objects.
> 
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> 
> Notes:
>     v4:
>     - reject I/O port combining if data register is wider than 1 byte
>       [Peter]
>     
>     v3:
>     - new in v3 [Drew Jones]
> 
>  hw/nvram/fw_cfg.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index eb0ad83..0947136 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -50,8 +50,9 @@ struct FWCfgState {
>      /*< public >*/
>  
>      MemoryRegion ctl_iomem, data_iomem, comb_iomem;
>      uint32_t ctl_iobase, data_iobase;
> +    uint32_t data_memwidth;
>      FWCfgEntry entries[2][FW_CFG_MAX_ENTRY];
>      FWCfgFiles *files;
>      uint16_t cur_entry;
>      uint32_t cur_offset;
> @@ -569,8 +570,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
>  
>      dev = qdev_create(NULL, TYPE_FW_CFG);
>      qdev_prop_set_uint32(dev, "ctl_iobase", ctl_port);
>      qdev_prop_set_uint32(dev, "data_iobase", data_port);
> +    qdev_prop_set_uint32(dev, "data_memwidth",
> +                         fw_cfg_data_mem_ops.valid.max_access_size);
>      d = SYS_BUS_DEVICE(dev);
>  
>      s = FW_CFG(dev);
>  
> @@ -607,12 +610,8 @@ static void fw_cfg_initfn(Object *obj)
>  
>      memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
>                            "fwcfg.ctl", FW_CFG_SIZE);
>      sysbus_init_mmio(sbd, &s->ctl_iomem);
> -    memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
> -                          "fwcfg.data",
> -                          fw_cfg_data_mem_ops.valid.max_access_size);
> -    sysbus_init_mmio(sbd, &s->data_iomem);
>      /* In case ctl and data overlap: */
>      memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
>                            "fwcfg", FW_CFG_SIZE);
>  }
> @@ -620,19 +619,31 @@ static void fw_cfg_initfn(Object *obj)
>  static void fw_cfg_realize(DeviceState *dev, Error **errp)
>  {
>      FWCfgState *s = FW_CFG(dev);
>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
> +    const MemoryRegionOps *data_mem_ops = &fw_cfg_data_mem_ops;
>      uint32_t ctl_io_last;
>      uint32_t data_io_end;
>  
> +    if (s->data_memwidth > data_mem_ops->valid.max_access_size) {
> +        MemoryRegionOps *ops;
> +
> +        ops = g_memdup(data_mem_ops, sizeof(*data_mem_ops));

Hrm, this memory will leak if the device gets destroyed after realize,
right?

I see 2 options around this:

  1) Free it on destruction
  2) Add the RegionOps as field into FWCfgState. Then it gets allocated
and free'd automatically

Option 2 is easier (and more failure proof) but will waste a few bytes
of ram for data_memwidth=1 users. I don't think we need to bother about
the few bytes and rather go with safety :).


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg Laszlo Ersek
@ 2014-12-16 12:15   ` Alexander Graf
  2014-12-16 12:18     ` Peter Maydell
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2014-12-16 12:15 UTC (permalink / raw)
  To: Laszlo Ersek, peter.maydell, qemu-devel, rjones, drjones



On 12.12.14 16:58, Laszlo Ersek wrote:
> Introduce the new boolean field "arm_boot_info.firmware_loaded". When this
> field is set, it means that the portion of guest DRAM that the VCPU
> normally starts to execute, or the pflash chip that the VCPU normally
> starts to execute, has been populated by board-specific code with
> full-fledged guest firmware code, before the board calls
> arm_load_kernel().
> 
> Simultaneously, "arm_boot_info.firmware_loaded" guarantees that the board
> code has set up the global firmware config instance, for arm_load_kernel()
> to find with fw_cfg_find().
> 
> Guest kernel (-kernel) and guest firmware (-bios, -pflash) has always been
> possible to specify independently on the command line. The following cases
> should be considered:
> 
> nr  -bios    -pflash  -kernel  description
>              unit#0
> --  -------  -------  -------  -------------------------------------------
> 1   present  present  absent   Board code rejects this case, -bios and
>     present  present  present  -pflash unit#0 are exclusive. Left intact
>                                by this patch.
> 
> 2   absent   absent   present  Traditional kernel loading, with qemu's
>                                minimal board firmware. Left intact by this
>                                patch.
> 
> 3   absent   present  absent   Preexistent case for booting guest firmware
>     present  absent   absent   loaded with -bios or -pflash. Left intact
>                                by this patch.
> 
> 4   absent   absent   absent   Preexistent case for not loading any
>                                firmware or kernel up-front. Left intact by
>                                this patch.
> 
> 5   present  absent   present  New case introduced by this patch: kernel
>     absent   present  present  image is passed to externally loaded
>                                firmware in unmodified form, using fw_cfg.
> 
> An easy way to see that this patch doesn't interfere with existing cases
> is to realize that "info->firmware_loaded" is constant zero at this point.
> Which makes the "outer" condition unchanged, and the "inner" condition
> (with the fw_cfg-related code) dead.
> 
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> 
> Notes:
>     v4:
>     - drop space "between function name and open parenthesis '('" [Peter]
>     
>     v3:
>     - unchanged
> 
>  include/hw/arm/arm.h |  5 +++
>  hw/arm/boot.c        | 91 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 91 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/arm/arm.h b/include/hw/arm/arm.h
> index cefc9e6..dd69d66 100644
> --- a/include/hw/arm/arm.h
> +++ b/include/hw/arm/arm.h
> @@ -65,8 +65,13 @@ struct arm_boot_info {
>      int is_linux;
>      hwaddr initrd_start;
>      hwaddr initrd_size;
>      hwaddr entry;
> +
> +    /* Boot firmware has been loaded, typically at address 0, with -bios or
> +     * -pflash. It also implies that fw_cfg_find() will succeed.
> +     */
> +    bool firmware_loaded;
>  };
>  void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info);
>  
>  /* Multiplication factor to convert from system clock ticks to qemu timer
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index e6a3c5b..5c65d16 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -477,8 +477,62 @@ static void do_cpu_reset(void *opaque)
>          }
>      }
>  }
>  
> +/**
> + * load_image_to_fw_cfg() - Load an image file into an fw_cfg entry identified
> + *                          by key.
> + * @fw_cfg:     The firmware config instance to store the data in.
> + * @size_key:   The firmware config key to store the size of the loaded data
> + *              under, with fw_cfg_add_i32().
> + * @data_key:   The firmware config key to store the loaded data under, with
> + *              fw_cfg_add_bytes().
> + * @image_name: The name of the image file to load. If it is NULL, the function
> + *              returns without doing anything.
> + * @decompress: Whether the image should be  decompressed (gunzipped) before
> + *              adding it to fw_cfg.
> + *
> + * In case of failure, the function prints an error message to stderr and the
> + * process exits with status 1.
> + */
> +static void load_image_to_fw_cfg(FWCfgState *fw_cfg, uint16_t size_key,
> +                                 uint16_t data_key, const char *image_name,
> +                                 bool decompress)
> +{
> +    size_t size;
> +    uint8_t *data;
> +
> +    if (image_name == NULL) {
> +        return;
> +    }
> +
> +    if (decompress) {
> +        int bytes;
> +
> +        bytes = load_image_gzipped_buffer(image_name,
> +                                          LOAD_IMAGE_MAX_GUNZIP_BYTES, &data);
> +        if (bytes == -1) {
> +            fprintf(stderr, "failed to load and decompress \"%s\"\n",
> +                    image_name);
> +            exit(1);
> +        }
> +        size = bytes;
> +    } else {
> +        gchar *contents;
> +        gsize length;
> +
> +        if (!g_file_get_contents(image_name, &contents, &length, NULL)) {
> +            fprintf(stderr, "failed to load \"%s\"\n", image_name);
> +            exit(1);
> +        }
> +        size = length;
> +        data = (uint8_t *)contents;
> +    }
> +
> +    fw_cfg_add_i32(fw_cfg, size_key, size);
> +    fw_cfg_add_bytes(fw_cfg, data_key, data, size);
> +}
> +
>  void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
>  {
>      CPUState *cs;
>      int kernel_size;
> @@ -499,21 +553,48 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
>          qemu_register_reset(do_cpu_reset, ARM_CPU(cs));
>      }
>  
>      /* Load the kernel.  */
> -    if (!info->kernel_filename) {
> +    if (!info->kernel_filename || info->firmware_loaded) {
>  
>          if (have_dtb(info)) {
> -            /* If we have a device tree blob, but no kernel to supply it to,
> -             * copy it to the base of RAM for a bootloader to pick up.
> +            /* If we have a device tree blob, but no kernel to supply it to (or
> +             * the kernel is supposed to be loaded by the bootloader), copy the
> +             * DTB to the base of RAM for the bootloader to pick up.
>               */
>              if (load_dtb(info->loader_start, info, 0) < 0) {
>                  exit(1);
>              }
>          }
>  
> -        /* If no kernel specified, do nothing; we will start from address 0
> -         * (typically a boot ROM image) in the same way as hardware.
> +        if (info->kernel_filename) {
> +            FWCfgState *fw_cfg;
> +            bool decompress_kernel;
> +
> +            fw_cfg = fw_cfg_find();
> +            decompress_kernel = arm_feature(&cpu->env, ARM_FEATURE_AARCH64);

I don't understand this. On AArch64 I can simply do -kernel Image and it
boots without decompressing anything, no?


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-16 12:15   ` Alexander Graf
@ 2014-12-16 12:18     ` Peter Maydell
  2014-12-16 12:20       ` Alexander Graf
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2014-12-16 12:18 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Andrew Jones, Laszlo Ersek, QEMU Developers

On 16 December 2014 at 12:15, Alexander Graf <agraf@suse.de> wrote:
> I don't understand this. On AArch64 I can simply do -kernel Image and it
> boots without decompressing anything, no?

Yes, but if you pass -kernel Image.gz then it won't work unless the
bootloader (ie QEMU) does the decompression for you. AArch64 differs
from AArch32 here (on 32-bit compressed images have their own builtin
decompressor; on 64-bit this is made the job of the boot loader).
We could choose not to support loading compressed images, but that
would be annoying for users (most other bootloaders do). See
commit 6f5d3cbe88 where we added this support.

-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-16 12:18     ` Peter Maydell
@ 2014-12-16 12:20       ` Alexander Graf
  2014-12-16 12:25         ` Peter Maydell
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2014-12-16 12:20 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Andrew Jones, Laszlo Ersek, QEMU Developers



On 16.12.14 13:18, Peter Maydell wrote:
> On 16 December 2014 at 12:15, Alexander Graf <agraf@suse.de> wrote:
>> I don't understand this. On AArch64 I can simply do -kernel Image and it
>> boots without decompressing anything, no?
> 
> Yes, but if you pass -kernel Image.gz then it won't work unless the
> bootloader (ie QEMU) does the decompression for you. AArch64 differs
> from AArch32 here (on 32-bit compressed images have their own builtin
> decompressor; on 64-bit this is made the job of the boot loader).
> We could choose not to support loading compressed images, but that
> would be annoying for users (most other bootloaders do). See
> commit 6f5d3cbe88 where we added this support.

The patch as is assumes that AArch64 images are always gzipped. I don't
think this assumption is correct - if you do "make Image" on a kernel
source tree, you will get an uncompressed Image file.

I think we'd be better off trying to load it as gzip and if it's not
gzipped, fall back to linear load.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-16 12:20       ` Alexander Graf
@ 2014-12-16 12:25         ` Peter Maydell
  2014-12-16 12:42           ` Richard W.M. Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2014-12-16 12:25 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Andrew Jones, Laszlo Ersek, QEMU Developers

On 16 December 2014 at 12:20, Alexander Graf <agraf@suse.de> wrote:
> The patch as is assumes that AArch64 images are always gzipped. I don't
> think this assumption is correct - if you do "make Image" on a kernel
> source tree, you will get an uncompressed Image file.
>
> I think we'd be better off trying to load it as gzip and if it's not
> gzipped, fall back to linear load.

Ah, I see what you mean. Yes, we need to continue to support
loading non-compressed Images as well as compressed ones,
so this patch needs to do the try-and-fall-back, in the
same way the current loader code does.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-16 12:25         ` Peter Maydell
@ 2014-12-16 12:42           ` Richard W.M. Jones
  2014-12-16 12:44             ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Richard W.M. Jones @ 2014-12-16 12:42 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Andrew Jones, Laszlo Ersek, Alexander Graf, QEMU Developers

On Tue, Dec 16, 2014 at 12:25:41PM +0000, Peter Maydell wrote:
> On 16 December 2014 at 12:20, Alexander Graf <agraf@suse.de> wrote:
> > The patch as is assumes that AArch64 images are always gzipped. I don't
> > think this assumption is correct - if you do "make Image" on a kernel
> > source tree, you will get an uncompressed Image file.
> >
> > I think we'd be better off trying to load it as gzip and if it's not
> > gzipped, fall back to linear load.
> 
> Ah, I see what you mean. Yes, we need to continue to support
> loading non-compressed Images as well as compressed ones,
> so this patch needs to do the try-and-fall-back, in the
> same way the current loader code does.

Hang on, I tested the non-compressed case when I originally submitted
the patch.

And indeed it does work (still):

(using qemu dfa9c2a0f4d0a0c8b2c1449ecdbb1297427e1560)
$ cp /boot/vmlinuz-3.18.0-1.fc22.aarch64 /tmp/Image.gz
$ gunzip /tmp/Image.gz 
$ ./aarch64-softmmu/qemu-system-aarch64 -kernel /tmp/Image -append 'console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel'
No machine specified, and there is no default.
Use -machine help to list supported machines!
rjones@mustang:~/d/qemu$ ./aarch64-softmmu/qemu-system-aarch64 -M virt -cpu cortex-a57 -kernel /tmp/Image -append 'console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel'
rjones@mustang:~/d/qemu$ 
rjones@mustang:~/d/qemu$ ./aarch64-softmmu/qemu-system-aarch64 -M virt -cpu cortex-a57 -kernel /tmp/Image -append 'console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel' -serial stdio
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.18.0-1.fc22.aarch64 (mockbuild@apm-mustang-ev3-11.ml3.eng.bos.redhat.com) (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC) ) #1 SMP Thu Dec 11 11:47:50 UTC 2014

[etc]

And for completeness with a compressed kernel:

$ ./aarch64-softmmu/qemu-system-aarch64 -M virt -cpu cortex-a57 -kernel /boot/vmlinuz-3.18.0-1.fc22.aarch64 -append 'console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel' -serial stdio
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.18.0-1.fc22.aarch64 (mockbuild@apm-mustang-ev3-11.ml3.eng.bos.redhat.com) (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC) ) #1 SMP Thu Dec 11 11:47:50 UTC 2014
[    0.000000] CPU: AArch64 Processor [411fd070] revision 0

[etc]

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-16 12:06   ` Alexander Graf
@ 2014-12-16 12:42     ` Laszlo Ersek
  2014-12-16 16:59       ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 12:42 UTC (permalink / raw)
  To: Alexander Graf, peter.maydell, qemu-devel, rjones, drjones

On 12/16/14 13:06, Alexander Graf wrote:
> 
> 
> On 12.12.14 16:58, Laszlo Ersek wrote:
>> The "data_memwidth" property is capable of changing the maximum valid
>> access size to the MMIO data register, and (corresponding to the previous
>> patch) resizes the memory region similarly, at device realization time.
>>
>> (Because "data_iomem" is configured and installed dynamically now, we must
>> delay those steps to the realize callback.)
>>
>> The default value of "data_memwidth" is set so that we don't yet diverge
>> from "fw_cfg_data_mem_ops".
>>
>> Most of the fw_cfg users will stick with the default, and for them we
>> should continue using the statically allocated "fw_cfg_data_mem_ops". This
>> is beneficial for debugging because gdb can resolve pointers referencing
>> static objects to the names of those objects.
>>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>>
>> Notes:
>>     v4:
>>     - reject I/O port combining if data register is wider than 1 byte
>>       [Peter]
>>     
>>     v3:
>>     - new in v3 [Drew Jones]
>>
>>  hw/nvram/fw_cfg.c | 24 ++++++++++++++++++------
>>  1 file changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
>> index eb0ad83..0947136 100644
>> --- a/hw/nvram/fw_cfg.c
>> +++ b/hw/nvram/fw_cfg.c
>> @@ -50,8 +50,9 @@ struct FWCfgState {
>>      /*< public >*/
>>  
>>      MemoryRegion ctl_iomem, data_iomem, comb_iomem;
>>      uint32_t ctl_iobase, data_iobase;
>> +    uint32_t data_memwidth;
>>      FWCfgEntry entries[2][FW_CFG_MAX_ENTRY];
>>      FWCfgFiles *files;
>>      uint16_t cur_entry;
>>      uint32_t cur_offset;
>> @@ -569,8 +570,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
>>  
>>      dev = qdev_create(NULL, TYPE_FW_CFG);
>>      qdev_prop_set_uint32(dev, "ctl_iobase", ctl_port);
>>      qdev_prop_set_uint32(dev, "data_iobase", data_port);
>> +    qdev_prop_set_uint32(dev, "data_memwidth",
>> +                         fw_cfg_data_mem_ops.valid.max_access_size);
>>      d = SYS_BUS_DEVICE(dev);
>>  
>>      s = FW_CFG(dev);
>>  
>> @@ -607,12 +610,8 @@ static void fw_cfg_initfn(Object *obj)
>>  
>>      memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
>>                            "fwcfg.ctl", FW_CFG_SIZE);
>>      sysbus_init_mmio(sbd, &s->ctl_iomem);
>> -    memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
>> -                          "fwcfg.data",
>> -                          fw_cfg_data_mem_ops.valid.max_access_size);
>> -    sysbus_init_mmio(sbd, &s->data_iomem);
>>      /* In case ctl and data overlap: */
>>      memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
>>                            "fwcfg", FW_CFG_SIZE);
>>  }
>> @@ -620,19 +619,31 @@ static void fw_cfg_initfn(Object *obj)
>>  static void fw_cfg_realize(DeviceState *dev, Error **errp)
>>  {
>>      FWCfgState *s = FW_CFG(dev);
>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
>> +    const MemoryRegionOps *data_mem_ops = &fw_cfg_data_mem_ops;
>>      uint32_t ctl_io_last;
>>      uint32_t data_io_end;
>>  
>> +    if (s->data_memwidth > data_mem_ops->valid.max_access_size) {
>> +        MemoryRegionOps *ops;
>> +
>> +        ops = g_memdup(data_mem_ops, sizeof(*data_mem_ops));
> 
> Hrm, this memory will leak if the device gets destroyed after realize,
> right?

How do you destroy the fw_cfg device after it is successfully realized?
I wouldn't introduce such a blatant leak out of oversight.

> I see 2 options around this:
> 
>   1) Free it on destruction

Does that mean an unrealize callback?

>   2) Add the RegionOps as field into FWCfgState. Then it gets allocated
> and free'd automatically
> 
> Option 2 is easier (and more failure proof) but will waste a few bytes
> of ram for data_memwidth=1 users. I don't think we need to bother about
> the few bytes and rather go with safety :).

I wanted to keep the static ops object for the common user, because it
is very convenient when debugging in gdb -- the address is automatically
resolved to the name of the static object. I guess I can do (1) (if that
means an unrealize callback).

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg
  2014-12-16 12:42           ` Richard W.M. Jones
@ 2014-12-16 12:44             ` Laszlo Ersek
  0 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 12:44 UTC (permalink / raw)
  To: Richard W.M. Jones, Peter Maydell
  Cc: Andrew Jones, Alexander Graf, QEMU Developers

On 12/16/14 13:42, Richard W.M. Jones wrote:
> On Tue, Dec 16, 2014 at 12:25:41PM +0000, Peter Maydell wrote:
>> On 16 December 2014 at 12:20, Alexander Graf <agraf@suse.de> wrote:
>>> The patch as is assumes that AArch64 images are always gzipped. I don't
>>> think this assumption is correct - if you do "make Image" on a kernel
>>> source tree, you will get an uncompressed Image file.
>>>
>>> I think we'd be better off trying to load it as gzip and if it's not
>>> gzipped, fall back to linear load.
>>
>> Ah, I see what you mean. Yes, we need to continue to support
>> loading non-compressed Images as well as compressed ones,
>> so this patch needs to do the try-and-fall-back, in the
>> same way the current loader code does.
> 
> Hang on, I tested the non-compressed case when I originally submitted
> the patch.
> 
> And indeed it does work (still):

The bug is not in your patch; it's in mine, because it doesn't follow
the fallback behavior that your patch correctly implements.

I guess I'll get to use the get_image_size() / load_image_size()
functions after all...

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
@ 2014-12-16 13:48   ` Andrew Jones
  2014-12-16 19:00     ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Jones @ 2014-12-16 13:48 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: peter.maydell, qemu-devel, rjones

On Fri, Dec 12, 2014 at 04:58:45PM +0100, Laszlo Ersek wrote:
> Make it clear that the maximum access size to the MMIO data register
> determines the full size of the memory region.
> 
> Currently the max access size is 1. Ensure that if a larger size were used
> in "fw_cfg_data_mem_ops.valid.max_access_size", the memory subsystem would
> split the access to byte-sized accesses internally, in increasing address
> order.
> 
> fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
> "address" or "size"; they just call the sequential fw_cfg_read() and
> fw_cfg_write() functions, correspondingly. Therefore the automatic
> splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
> native.)

This 'is native' caught my eye. Laszlo's
Documentation/devicetree/bindings/arm/fw-cfg.txt patch states that the
selector register is LE and 

"
The data register allows accesses with 8, 16, 32 and 64-bit width.  Accesses
larger than a byte are interpreted as arrays, bundled together only for better
performance. The bytes constituting such a word, in increasing address order,
correspond to the bytes that would have been transferred by byte-wide accesses
in chronological order.
"

I chatted with Laszlo to make sure the host-is-BE case was considered.
It looks like there may be an issue there that Laszlo promised to look
into. Laszlo had already noticed that the selector was defined as native
in qemu as well, but should be LE. Now that we support > 1 byte reads
and writes from the data port, then maybe we should look into changing
that as well.

drew

> 
> This patch doesn't change behavior.
> 
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> 
> Notes:
>     v4:
>     - unchanged
>     
>     v3:
>     - new in v3 [Drew Jones]
> 
>  hw/nvram/fw_cfg.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index c4b78ed..7f6031c 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -30,9 +30,8 @@
>  #include "qemu/error-report.h"
>  #include "qemu/config-file.h"
>  
>  #define FW_CFG_SIZE 2
> -#define FW_CFG_DATA_SIZE 1
>  #define TYPE_FW_CFG "fw_cfg"
>  #define FW_CFG_NAME "fw_cfg"
>  #define FW_CFG_PATH "/machine/" FW_CFG_NAME
>  #define FW_CFG(obj) OBJECT_CHECK(FWCfgState, (obj), TYPE_FW_CFG)
> @@ -323,8 +322,9 @@ static const MemoryRegionOps fw_cfg_data_mem_ops = {
>      .valid = {
>          .min_access_size = 1,
>          .max_access_size = 1,
>      },
> +    .impl.max_access_size = 1,
>  };
>  
>  static const MemoryRegionOps fw_cfg_comb_mem_ops = {
>      .read = fw_cfg_comb_read,
> @@ -608,9 +608,10 @@ static void fw_cfg_initfn(Object *obj)
>      memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
>                            "fwcfg.ctl", FW_CFG_SIZE);
>      sysbus_init_mmio(sbd, &s->ctl_iomem);
>      memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
> -                          "fwcfg.data", FW_CFG_DATA_SIZE);
> +                          "fwcfg.data",
> +                          fw_cfg_data_mem_ops.valid.max_access_size);
>      sysbus_init_mmio(sbd, &s->data_iomem);
>      /* In case ctl and data overlap: */
>      memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
>                            "fwcfg", FW_CFG_SIZE);
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-16 12:42     ` Laszlo Ersek
@ 2014-12-16 16:59       ` Laszlo Ersek
  2014-12-16 17:10         ` Peter Maydell
  0 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 16:59 UTC (permalink / raw)
  To: Alexander Graf; +Cc: peter.maydell, drjones, qemu-devel

On 12/16/14 13:42, Laszlo Ersek wrote:
> On 12/16/14 13:06, Alexander Graf wrote:
>>
>>
>> On 12.12.14 16:58, Laszlo Ersek wrote:
>>> The "data_memwidth" property is capable of changing the maximum valid
>>> access size to the MMIO data register, and (corresponding to the previous
>>> patch) resizes the memory region similarly, at device realization time.
>>>
>>> (Because "data_iomem" is configured and installed dynamically now, we must
>>> delay those steps to the realize callback.)
>>>
>>> The default value of "data_memwidth" is set so that we don't yet diverge
>>> from "fw_cfg_data_mem_ops".
>>>
>>> Most of the fw_cfg users will stick with the default, and for them we
>>> should continue using the statically allocated "fw_cfg_data_mem_ops". This
>>> is beneficial for debugging because gdb can resolve pointers referencing
>>> static objects to the names of those objects.
>>>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>> ---
>>>
>>> Notes:
>>>     v4:
>>>     - reject I/O port combining if data register is wider than 1 byte
>>>       [Peter]
>>>     
>>>     v3:
>>>     - new in v3 [Drew Jones]
>>>
>>>  hw/nvram/fw_cfg.c | 24 ++++++++++++++++++------
>>>  1 file changed, 18 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
>>> index eb0ad83..0947136 100644
>>> --- a/hw/nvram/fw_cfg.c
>>> +++ b/hw/nvram/fw_cfg.c
>>> @@ -50,8 +50,9 @@ struct FWCfgState {
>>>      /*< public >*/
>>>  
>>>      MemoryRegion ctl_iomem, data_iomem, comb_iomem;
>>>      uint32_t ctl_iobase, data_iobase;
>>> +    uint32_t data_memwidth;
>>>      FWCfgEntry entries[2][FW_CFG_MAX_ENTRY];
>>>      FWCfgFiles *files;
>>>      uint16_t cur_entry;
>>>      uint32_t cur_offset;
>>> @@ -569,8 +570,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
>>>  
>>>      dev = qdev_create(NULL, TYPE_FW_CFG);
>>>      qdev_prop_set_uint32(dev, "ctl_iobase", ctl_port);
>>>      qdev_prop_set_uint32(dev, "data_iobase", data_port);
>>> +    qdev_prop_set_uint32(dev, "data_memwidth",
>>> +                         fw_cfg_data_mem_ops.valid.max_access_size);
>>>      d = SYS_BUS_DEVICE(dev);
>>>  
>>>      s = FW_CFG(dev);
>>>  
>>> @@ -607,12 +610,8 @@ static void fw_cfg_initfn(Object *obj)
>>>  
>>>      memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
>>>                            "fwcfg.ctl", FW_CFG_SIZE);
>>>      sysbus_init_mmio(sbd, &s->ctl_iomem);
>>> -    memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
>>> -                          "fwcfg.data",
>>> -                          fw_cfg_data_mem_ops.valid.max_access_size);
>>> -    sysbus_init_mmio(sbd, &s->data_iomem);
>>>      /* In case ctl and data overlap: */
>>>      memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
>>>                            "fwcfg", FW_CFG_SIZE);
>>>  }
>>> @@ -620,19 +619,31 @@ static void fw_cfg_initfn(Object *obj)
>>>  static void fw_cfg_realize(DeviceState *dev, Error **errp)
>>>  {
>>>      FWCfgState *s = FW_CFG(dev);
>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
>>> +    const MemoryRegionOps *data_mem_ops = &fw_cfg_data_mem_ops;
>>>      uint32_t ctl_io_last;
>>>      uint32_t data_io_end;
>>>  
>>> +    if (s->data_memwidth > data_mem_ops->valid.max_access_size) {
>>> +        MemoryRegionOps *ops;
>>> +
>>> +        ops = g_memdup(data_mem_ops, sizeof(*data_mem_ops));
>>
>> Hrm, this memory will leak if the device gets destroyed after realize,
>> right?
> 
> How do you destroy the fw_cfg device after it is successfully realized?
> I wouldn't introduce such a blatant leak out of oversight.
> 
>> I see 2 options around this:
>>
>>   1) Free it on destruction
> 
> Does that mean an unrealize callback?
> 
>>   2) Add the RegionOps as field into FWCfgState. Then it gets allocated
>> and free'd automatically
>>
>> Option 2 is easier (and more failure proof) but will waste a few bytes
>> of ram for data_memwidth=1 users. I don't think we need to bother about
>> the few bytes and rather go with safety :).
> 
> I wanted to keep the static ops object for the common user, because it
> is very convenient when debugging in gdb -- the address is automatically
> resolved to the name of the static object. I guess I can do (1) (if that
> means an unrealize callback).

To elaborate on the above -- the fw_cfg device appears to be
undestructible at the moment. It has no unrealize callback. If it were
destructible, then the above leak would be the smallest of concerns --
it doesn't unmap nor destroy the memory regions that implement the
various registers.

So, I think the above is not an actual leak, because the result of
g_memdup() can never become unreferenced.

Thanks,
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-16 16:59       ` Laszlo Ersek
@ 2014-12-16 17:10         ` Peter Maydell
  2014-12-16 17:20           ` Alexander Graf
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2014-12-16 17:10 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Richard W.M. Jones, Andrew Jones, Alexander Graf, QEMU Developers

On 16 December 2014 at 16:59, Laszlo Ersek <lersek@redhat.com> wrote:
> To elaborate on the above -- the fw_cfg device appears to be
> undestructible at the moment. It has no unrealize callback. If it were
> destructible, then the above leak would be the smallest of concerns --
> it doesn't unmap nor destroy the memory regions that implement the
> various registers.
>
> So, I think the above is not an actual leak, because the result of
> g_memdup() can never become unreferenced.

True, and we have a lot of device that are in this same
category of "can't ever be destroyed". However it is setting
up a minor beartrap for ourselves in future if we have
allocations which aren't tracked via a field in the device's
state structure, because the obvious future implementation of
destruction for a device is "just free/destroy everything
that is in the state struct".

NB: I think what Alex had in mind with his option (2) was
just to have a "MemoryRegionOps ops;" field in the state
struct, and then use "s->ops = data_mem_ops;" rather than
the memdup. That retains the use of the static field for
the non-variable-width case, it's just that instead of
allocating off the heap for the var-width setup we use
an inline lump of memory in a struct we're already allocing.

I don't think I care very much about this, but Alex's
suggestion 2 is slightly nicer I guess. Adding a whole
unrealize callback is definitely vastly overkill.

-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-16 17:10         ` Peter Maydell
@ 2014-12-16 17:20           ` Alexander Graf
  2014-12-16 18:52             ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2014-12-16 17:20 UTC (permalink / raw)
  To: Peter Maydell, Laszlo Ersek; +Cc: Andrew Jones, QEMU Developers

On 12/16/14 18:10, Peter Maydell wrote:
> On 16 December 2014 at 16:59, Laszlo Ersek <lersek@redhat.com> wrote:
>> To elaborate on the above -- the fw_cfg device appears to be
>> undestructible at the moment. It has no unrealize callback. If it were
>> destructible, then the above leak would be the smallest of concerns --
>> it doesn't unmap nor destroy the memory regions that implement the
>> various registers.
>>
>> So, I think the above is not an actual leak, because the result of
>> g_memdup() can never become unreferenced.
> True, and we have a lot of device that are in this same
> category of "can't ever be destroyed". However it is setting
> up a minor beartrap for ourselves in future if we have
> allocations which aren't tracked via a field in the device's
> state structure, because the obvious future implementation of
> destruction for a device is "just free/destroy everything
> that is in the state struct".
>
> NB: I think what Alex had in mind with his option (2) was
> just to have a "MemoryRegionOps ops;" field in the state
> struct, and then use "s->ops = data_mem_ops;" rather than
> the memdup. That retains the use of the static field for
> the non-variable-width case, it's just that instead of
> allocating off the heap for the var-width setup we use
> an inline lump of memory in a struct we're already allocing.
>
> I don't think I care very much about this, but Alex's
> suggestion 2 is slightly nicer I guess. Adding a whole
> unrealize callback is definitely vastly overkill.

Yeah, it's exactly what I meant. Sorry for not being as clear. By moving 
the dynamically created struct into the device struct we're just making 
the whole allocation flow easier.

But if this is the only nitpick, there's no need for a respin just for that.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property
  2014-12-16 17:20           ` Alexander Graf
@ 2014-12-16 18:52             ` Laszlo Ersek
  0 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 18:52 UTC (permalink / raw)
  To: Alexander Graf, Peter Maydell; +Cc: Andrew Jones, QEMU Developers

On 12/16/14 18:20, Alexander Graf wrote:
> On 12/16/14 18:10, Peter Maydell wrote:
>> On 16 December 2014 at 16:59, Laszlo Ersek <lersek@redhat.com> wrote:
>>> To elaborate on the above -- the fw_cfg device appears to be
>>> undestructible at the moment. It has no unrealize callback. If it were
>>> destructible, then the above leak would be the smallest of concerns --
>>> it doesn't unmap nor destroy the memory regions that implement the
>>> various registers.
>>>
>>> So, I think the above is not an actual leak, because the result of
>>> g_memdup() can never become unreferenced.
>> True, and we have a lot of device that are in this same
>> category of "can't ever be destroyed". However it is setting
>> up a minor beartrap for ourselves in future if we have
>> allocations which aren't tracked via a field in the device's
>> state structure, because the obvious future implementation of
>> destruction for a device is "just free/destroy everything
>> that is in the state struct".
>>
>> NB: I think what Alex had in mind with his option (2) was
>> just to have a "MemoryRegionOps ops;" field in the state
>> struct, and then use "s->ops = data_mem_ops;" rather than
>> the memdup. That retains the use of the static field for
>> the non-variable-width case, it's just that instead of
>> allocating off the heap for the var-width setup we use
>> an inline lump of memory in a struct we're already allocing.
>>
>> I don't think I care very much about this, but Alex's
>> suggestion 2 is slightly nicer I guess. Adding a whole
>> unrealize callback is definitely vastly overkill.
> 
> Yeah, it's exactly what I meant. Sorry for not being as clear. By moving
> the dynamically created struct into the device struct we're just making
> the whole allocation flow easier.
> 
> But if this is the only nitpick, there's no need for a respin just for
> that.

Okay, I understand it now. Thanks.

Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 13:48   ` Andrew Jones
@ 2014-12-16 19:00     ` Laszlo Ersek
  2014-12-16 19:49       ` Paolo Bonzini
  2014-12-16 20:41       ` Peter Maydell
  0 siblings, 2 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 19:00 UTC (permalink / raw)
  To: Andrew Jones; +Cc: peter.maydell, Alexander Graf, Paolo Bonzini, qemu-devel

On 12/16/14 14:48, Andrew Jones wrote:
> On Fri, Dec 12, 2014 at 04:58:45PM +0100, Laszlo Ersek wrote:
>> Make it clear that the maximum access size to the MMIO data register
>> determines the full size of the memory region.
>>
>> Currently the max access size is 1. Ensure that if a larger size were
>> used in "fw_cfg_data_mem_ops.valid.max_access_size", the memory
>> subsystem would split the access to byte-sized accesses internally,
>> in increasing address order.
>>
>> fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
>> "address" or "size"; they just call the sequential fw_cfg_read() and
>> fw_cfg_write() functions, correspondingly. Therefore the automatic
>> splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
>> native.)
>
> This 'is native' caught my eye. Laszlo's
> Documentation/devicetree/bindings/arm/fw-cfg.txt patch states that the
> selector register is LE and
>
> "
> The data register allows accesses with 8, 16, 32 and 64-bit width.
> Accesses larger than a byte are interpreted as arrays, bundled
> together only for better performance. The bytes constituting such a
> word, in increasing address order, correspond to the bytes that would
> have been transferred by byte-wide accesses in chronological order.
> "
>
> I chatted with Laszlo to make sure the host-is-BE case was considered.
> It looks like there may be an issue there that Laszlo promised to look
> into. Laszlo had already noticed that the selector was defined as
> native in qemu as well, but should be LE. Now that we support > 1 byte
> reads and writes from the data port, then maybe we should look into
> changing that as well.

Yes.

The root of this question is what each of

enum device_endian {
    DEVICE_NATIVE_ENDIAN,
    DEVICE_BIG_ENDIAN,
    DEVICE_LITTLE_ENDIAN,
};

means.

Consider the following call tree, which implements the splitting /
combining of an MMIO read:

memory_region_dispatch_read() [memory.c]
  memory_region_dispatch_read1()
    access_with_adjusted_size()
      memory_region_big_endian()
      for each byte in the wide access:
        memory_region_read_accessor()
          fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
            fw_cfg_read()
  adjust_endianness()
    memory_region_wrong_endianness()
    bswapXX()

The function access_with_adjusted_size() always iterates over the MMIO
address range in incrementing address order. However, it can calculate
the shift count for memory_region_read_accessor() in two ways.

When memory_region_big_endian() returns "true", the shift count
decreases as the MMIO address increases.

When memory_region_big_endian() returns "false", the shift count
increases as the MMIO address increases.

In memory_region_read_accessor(), the shift count is used for a logical
(ie. numeric) bitwise left shift (<<). That operator works with numeric
values and hides (ie. accounts for) host endianness.

Let's consider
- an array of two bytes, [0xaa, 0xbb],
- a uint16_t access made from the guest,
- and all twelve possible cases.

In the table below, the "host", "guest" and "device_endian" columns are
input variables. The columns to the right are calculated / derived
values. The arrows above the columns show the data dependencies.

After memory_region_dispatch_read1() constructs the value that is
visible in the "host value" column, it is passed to adjust_endianness().
If memory_region_wrong_endianness() returns "true", then the host value
is byte-swapped. The result is then passed to the guest.

              +---------------------------------------------------------------------------------------------------------------+----------+
              |                                                                                                               |          |
            +---- ------+-------------------------------------------------------------------------+                           |          |
            | |         |                                                                         |                           |          |
      +----------------------------------------------------------+---------+                      |                           |          |
      |     | |         |                                        |         |                      |                           |          |
      |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
      |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
      |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
 #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
--  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
 1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
 2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
 3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa

 4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
 5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
 6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa

 7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
 8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
 9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb

10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb


Looking at the two rightmost columns, we must conclude:

(a) The "device_endian" field has absolutely no significance wrt. what
    the guest sees. In each triplet of cases, when we go from
    DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
    the guest sees the exact same value.

    I don't understand this result (it makes me doubt device_endian
    makes any sense). What did I do wrong?

    Just to be sure that I was not seeing ghosts, I actually tested this
    result. On an x86_64 hosts I tested the aarch64 guest firmware
    (using TCG), cycling the "fw_cfg_data_mem_ops.endianness" field
    through all three possible values. That is, I tested cases #1 to #3.

    They *all* worked!

(b) Considering a given host endianness (ie. a group of six cases): when
    the guest goes from little endian to big endian, the *numerical
    value* the guest sees does not change.

    In addition, fixating the guest endianness, and changing the host
    endianness, the *numerical value* that the guest sees (for the
    original host-side array [0xaa, 0xbb]) changes.

    This means that this interface is *value preserving*, not
    representation preserving. In other words: when host and guest
    endiannesses are identical, the *array* is transferred okay (the
    guest array matches the initial host array [0xaa, 0xbb]). When guest
    and host differ in endianness, the guest will see an incorrect
    *array*.

    Which, alas, makes this interface completely unsuitable for the
    purpose at hand (ie. transferring kernel & initrd images in words
    wider than a byte). We couldn't care less as to what numerical value
    the array [0xaa, 0xbb] encodes on the host -- whatever it encodes,
    the guest wants to receive the same *array* (not the same value).
    And the byte order cannot be fixed up in the guest, because it
    depends on the XOR of guest and host endiannesses, and the guest
    doesn't know about the host's endianness.

I apologize for wasting everyone's time, but I think both results are
very non-intuitive. I noticed the use of the bitwise shift in
memory_region_read_accessor() -- which internally accounts for the
host-side byte order -- just today, while discussing this with Drew on
IRC. I had assumed that it would store bytes in the recipient uint64_t
by address, not by numerical order of magnitude.

Looks like the DMA-like interface is the only way forward. :(

Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 19:00     ` Laszlo Ersek
@ 2014-12-16 19:49       ` Paolo Bonzini
  2014-12-16 20:06         ` Laszlo Ersek
  2014-12-16 20:41       ` Peter Maydell
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2014-12-16 19:49 UTC (permalink / raw)
  To: Laszlo Ersek, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 16/12/2014 20:00, Laszlo Ersek wrote:
> Yes.
> 
> The root of this question is what each of
> 
> enum device_endian {
>     DEVICE_NATIVE_ENDIAN,
>     DEVICE_BIG_ENDIAN,
>     DEVICE_LITTLE_ENDIAN,
> };

Actually, I think the root of the answer :) is that fw_cfg_read (and
thus fw_cfg_data_mem_read) is not idempotent.  The split/compose stuff
accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
according to the endianness.

In the case of fw_cfg it just retrieves 8 bytes, but in the case of host
big endian it reads them in the "wrong" order for some reason (sorry, I
haven't looked at this thoroughly).

So the solution is:

1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN

2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call fw_cfg_read
and fw_cfg_write SIZE times and build up a value from the lowest byte up.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 19:49       ` Paolo Bonzini
@ 2014-12-16 20:06         ` Laszlo Ersek
  2014-12-16 20:17           ` Laszlo Ersek
  2014-12-16 20:40           ` Paolo Bonzini
  0 siblings, 2 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 20:06 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 12/16/14 20:49, Paolo Bonzini wrote:
> 
> 
> On 16/12/2014 20:00, Laszlo Ersek wrote:
>> Yes.
>>
>> The root of this question is what each of
>>
>> enum device_endian {
>>     DEVICE_NATIVE_ENDIAN,
>>     DEVICE_BIG_ENDIAN,
>>     DEVICE_LITTLE_ENDIAN,
>> };
> 
> Actually, I think the root of the answer :) is that fw_cfg_read (and
> thus fw_cfg_data_mem_read) is not idempotent.  The split/compose stuff
> accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
> according to the endianness.
> 
> In the case of fw_cfg it just retrieves 8 bytes, but in the case of host
> big endian it reads them in the "wrong" order for some reason (sorry, I
> haven't looked at this thoroughly).

I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
both "addr" and "size", and fw_cfg_read() simply advances the
"cur_offset" member.

> So the solution is:
> 
> 1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN
> 
> 2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call fw_cfg_read
> and fw_cfg_write SIZE times and build up a value from the lowest byte up.

Nonetheless, that's a really nice idea! I got so stuck with the
automatic splitting that I forgot about the possibility to act upon the
"size" parameter in fw_cfg_data_mem_read(). Thanks!

... Another thing that Andrew mentioned but I didn't cover in my other
email -- what about fw_cfg_ctl_mem_ops? It's currently DEVICE_NATIVE_ENDIAN.

You flipped the combined ops to LE in commit 6fdf98f2 (and, apparently,
I reviewed it). Shouldn't we do the same for the standalone selector?

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:06         ` Laszlo Ersek
@ 2014-12-16 20:17           ` Laszlo Ersek
  2014-12-16 21:47             ` Paolo Bonzini
  2014-12-16 20:40           ` Paolo Bonzini
  1 sibling, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-16 20:17 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 12/16/14 21:06, Laszlo Ersek wrote:
> On 12/16/14 20:49, Paolo Bonzini wrote:
>>
>>
>> On 16/12/2014 20:00, Laszlo Ersek wrote:
>>> Yes.
>>>
>>> The root of this question is what each of
>>>
>>> enum device_endian {
>>>     DEVICE_NATIVE_ENDIAN,
>>>     DEVICE_BIG_ENDIAN,
>>>     DEVICE_LITTLE_ENDIAN,
>>> };
>>
>> Actually, I think the root of the answer :) is that fw_cfg_read (and
>> thus fw_cfg_data_mem_read) is not idempotent.  The split/compose stuff
>> accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
>> according to the endianness.
>>
>> In the case of fw_cfg it just retrieves 8 bytes, but in the case of host
>> big endian it reads them in the "wrong" order for some reason (sorry, I
>> haven't looked at this thoroughly).
> 
> I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
> both "addr" and "size", and fw_cfg_read() simply advances the
> "cur_offset" member.

Ah okay, I understand your point now; you're probably saying that
access_with_adjusted_size() traverses the offsets in the wrong order.
... I don't see how; the only difference in the access() param list is
the shift count. (I don't know how it should work by design.)

In any case fw_cfg should be able to handle the larger accesses itself.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:06         ` Laszlo Ersek
  2014-12-16 20:17           ` Laszlo Ersek
@ 2014-12-16 20:40           ` Paolo Bonzini
  2014-12-16 21:47             ` Peter Maydell
  2014-12-17  5:06             ` Laszlo Ersek
  1 sibling, 2 replies; 37+ messages in thread
From: Paolo Bonzini @ 2014-12-16 20:40 UTC (permalink / raw)
  To: Laszlo Ersek, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel



On 16/12/2014 21:06, Laszlo Ersek wrote:
> On 12/16/14 20:49, Paolo Bonzini wrote:
>> fw_cfg_read (and
>> thus fw_cfg_data_mem_read) is not idempotent.  The split/compose stuff
>> accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
>> according to the endianness.
>>
>> In the case of fw_cfg it just retrieves 8 bytes, but in the case of host
>> big endian it reads them in the "wrong" order for some reason (sorry, I
>> haven't looked at this thoroughly).
> 
> I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
> both "addr" and "size", and fw_cfg_read() simply advances the
> "cur_offset" member.

Honestly neither can I.  But still the automatic splitting (which is
even tested by tests/endianness-test.c :)) assumes idempotency of the
components and it's not entirely surprising that it somehow/sometimes
breaks if you don't respect that.

>> So the solution is:
>>
>> 1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN
>>
>> 2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call fw_cfg_read
>> and fw_cfg_write SIZE times and build up a value from the lowest byte up.
> 
> Nonetheless, that's a really nice idea! I got so stuck with the
> automatic splitting that I forgot about the possibility to act upon the
> "size" parameter in fw_cfg_data_mem_read(). Thanks!
> 
> ... Another thing that Andrew mentioned but I didn't cover in my other
> email -- what about fw_cfg_ctl_mem_ops? It's currently DEVICE_NATIVE_ENDIAN.
> 
> You flipped the combined ops to LE in commit 6fdf98f2 (and, apparently,
> I reviewed it). Shouldn't we do the same for the standalone selector?

No.  The standalone selector is used as MMIO, and the BE platforms
expect the platform to be big-endian.  The combined ops are only used on
ISA ports, where the firmware expects them to be little-endian (as
mentioned in the commit message).

That said, the standalone selector is used by BE platforms only, so we
know that the standalone selector is always DEVICE_BIG_ENDIAN.

So if you want, you can make the standalone selector and the standalone
datum BE and swap them in the firmware.  If the suggestion doesn't make
you jump up and down, I understand that. :)

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 19:00     ` Laszlo Ersek
  2014-12-16 19:49       ` Paolo Bonzini
@ 2014-12-16 20:41       ` Peter Maydell
  2014-12-17  7:13         ` Laszlo Ersek
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Maydell @ 2014-12-16 20:41 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Paolo Bonzini, Andrew Jones, QEMU Developers, Alexander Graf

On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
> The root of this question is what each of
>
> enum device_endian {
>     DEVICE_NATIVE_ENDIAN,
>     DEVICE_BIG_ENDIAN,
>     DEVICE_LITTLE_ENDIAN,
> };
>
> means.

An opening remark: endianness is a horribly confusing topic and
support of more than one endianness is even worse. I may have made
some inadvertent errors in this reply; if you think so please
let me know and I'll have another stab at it.

That said: the device_endian options indicate what a device's
internal idea of its endianness is. This is most relevant if
a device accepts accesses at wider than byte width
(for instance, if you can read a 32-bit value from
address offset 0 and also an 8-bit value from offset 3 then
how do those values line up? If you read a 32-bit value then
which way round is it compared to what the device's io read/write
function return?).

NATIVE_ENDIAN means "same order as the CPU's main data bus's
natural representation". (Note that this is not necessarily
the same as "the endianness the CPU currently has"; on ARM
you can flip the CPU between LE and BE at runtime, which
is basically inserting a byte-swizzling step between data
accesses and the CPU's data bus, which is always LE for ARMv7+.)

Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
(because the guest vcpu reads it directly with the h/w cpu).

> Consider the following call tree, which implements the splitting /
> combining of an MMIO read:
>
> memory_region_dispatch_read() [memory.c]
>   memory_region_dispatch_read1()
>     access_with_adjusted_size()
>       memory_region_big_endian()
>       for each byte in the wide access:
>         memory_region_read_accessor()
>           fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>             fw_cfg_read()
>   adjust_endianness()
>     memory_region_wrong_endianness()
>     bswapXX()
>
> The function access_with_adjusted_size() always iterates over the MMIO
> address range in incrementing address order. However, it can calculate
> the shift count for memory_region_read_accessor() in two ways.
>
> When memory_region_big_endian() returns "true", the shift count
> decreases as the MMIO address increases.
>
> When memory_region_big_endian() returns "false", the shift count
> increases as the MMIO address increases.

Yep, because this is how the device has told us that it thinks
of accesses as being put together.

The column in your table "host value" is the 16 bit value read from
the device, ie what we have decided (based on device_endian) that
it would have returned us if it had supported a 16 bit read directly
itself. BE devices compose 16 bit values with the high byte first,
LE devices with the low byte first, and native-endian devices
in the same order as guest-endianness.

> In memory_region_read_accessor(), the shift count is used for a logical
> (ie. numeric) bitwise left shift (<<). That operator works with numeric
> values and hides (ie. accounts for) host endianness.
>
> Let's consider
> - an array of two bytes, [0xaa, 0xbb],
> - a uint16_t access made from the guest,
> - and all twelve possible cases.
>
> In the table below, the "host", "guest" and "device_endian" columns are
> input variables. The columns to the right are calculated / derived
> values. The arrows above the columns show the data dependencies.
>
> After memory_region_dispatch_read1() constructs the value that is
> visible in the "host value" column, it is passed to adjust_endianness().
> If memory_region_wrong_endianness() returns "true", then the host value
> is byte-swapped. The result is then passed to the guest.
>
>               +---------------------------------------------------------------------------------------------------------------+----------+
>               |                                                                                                               |          |
>             +---- ------+-------------------------------------------------------------------------+                           |          |
>             | |         |                                                                         |                           |          |
>       +----------------------------------------------------------+---------+                      |                           |          |
>       |     | |         |                                        |         |                      |                           |          |
>       |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
>       |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
>       |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>
>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>
>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>
> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb

The column you have labelled 'guest repr' here is the returned data
from io_mem_read, whose API contract is "write the data from the
device into this host C uint16_t (or whatever) such that it is the
value returned by the device read as a native host value". It's
not related to the guest order at all.

So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
which passes it a local variable "val". So now "val" has the
"guest repr" column's bytes in it, and (as a host C variable) the
value:
1,2,3 : 0xbbaa
4,5,6 : 0xaabb
7,8,9 : 0xbbaa
10,11,12 : 0xaabb

As you can see, this depends on the "guest endianness" (which is
kind of the endianness of the bus): a BE guest 16 bit access to
this device would return the 16 bit value 0xaabb, and an LE guest
0xbbaa, and we have exactly those values in our host C variable.
If this is TCG, then we'll copy that 16 bit host value into the
CPUState struct field corresponding to the destination guest
register as-is. (TCG CPUState fields are always in host-C-order.)

However, to pass them up to KVM we have to put them into a
buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
in target CPU endianness"], so now buf has the bytes:
1,2,3 : [0xaa, 0xbb]
4,5,6 : [0xaa, 0xbb]
7,8,9 : [0xaa, 0xbb]
10,11,12 : [0xaa, 0xbb]

...which is the same for every case.

This buffer is (for KVM) the run->mmio.data buffer, whose semantics
are "the value as it would appear if the VCPU performed a load or store
of the appropriate width directly to the byte array". Which is what we
want -- your device has two bytes in order 0xaa, 0xbb, and we did
a 16 bit load in the guest CPU, so we should get the same answer as if
we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.

I think the fact that all of these things come out to the same
set of bytes in the mmio.data buffer is the indication that all
QEMU's byte swapping is correct.

> Looking at the two rightmost columns, we must conclude:
>
> (a) The "device_endian" field has absolutely no significance wrt. what
>     the guest sees. In each triplet of cases, when we go from
>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>     the guest sees the exact same value.
>
>     I don't understand this result (it makes me doubt device_endian
>     makes any sense). What did I do wrong?

I think it's because you defined your device as only supporting
byte accesses that you didn't see any difference between the
various device_endian settings. If a device presents itself as
just an array of bytes then it doesn't have an endianness, really.

As Paolo says, if you make your device support wider accesses
directly and build up the value yourself then you'll see that
setting the device_endian to LE vs BE vs native does change
the values you see in the guest (and that you'll need to set it
to LE and interpret the words in the guest as LE to get the
right behaviour).

>     This means that this interface is *value preserving*, not
>     representation preserving. In other words: when host and guest
>     endiannesses are identical, the *array* is transferred okay (the
>     guest array matches the initial host array [0xaa, 0xbb]). When guest
>     and host differ in endianness, the guest will see an incorrect
>     *array*.

Think of a device which supports only byte accesses as being
like a lump of RAM. A big-endian guest CPU which reads 32 bits
at a time will get different values in registers to an LE guest
which does the same.

*However*, if both CPUs just do "read 32 bits; write 32 bits to
RAM" (ie a kind of memcpy but with the source being the mmio
register rather than some other bit of RAM) then you should get
a bytewise-correct copy of the data in RAM.

So I *think* that would be a viable approach: have your QEMU
device code as it is now, and just make sure that if the guest
is doing wider-than-byte accesses it does them as
  do {
     load word from mmio register;
     store word to RAM;
  } while (count);

and doesn't try to interpret the byte order of the values while
they're in the CPU register in the middle of the loop.

Paolo's suggestion would work too, if you prefer that.

> I apologize for wasting everyone's time, but I think both results are
> very non-intuitive.

Everything around endianness handling is non-intuitive --
it's the nature of the problem, I'm afraid. (Some of this is
also QEMU's fault for not having good documentation of the
semantics of each of the layers involved in memory accesses.
I have on my todo list the vague idea of trying to write these
all up as a blog post.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:40           ` Paolo Bonzini
@ 2014-12-16 21:47             ` Peter Maydell
  2014-12-17  5:06             ` Laszlo Ersek
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Maydell @ 2014-12-16 21:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexander Graf, Andrew Jones, Laszlo Ersek, QEMU Developers

On 16 December 2014 at 20:40, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Honestly neither can I.  But still the automatic splitting (which is
> even tested by tests/endianness-test.c :)) assumes idempotency of the
> components and it's not entirely surprising that it somehow/sometimes
> breaks if you don't respect that.

Oh, good point. Yeah, I don't think we want to make any guarantee
about the order in which the N byte accesses hit the device if
we're assembling an N-byte access. Implementing the device as
a proper wide-access-supporting device is definitely the way to go.

-- PMM

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:17           ` Laszlo Ersek
@ 2014-12-16 21:47             ` Paolo Bonzini
  2014-12-17  4:52               ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2014-12-16 21:47 UTC (permalink / raw)
  To: Laszlo Ersek, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 16/12/2014 21:17, Laszlo Ersek wrote:
>> > I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
>> > both "addr" and "size", and fw_cfg_read() simply advances the
>> > "cur_offset" member.
> Ah okay, I understand your point now; you're probably saying that
> access_with_adjusted_size() traverses the offsets in the wrong order.
> ... I don't see how; the only difference in the access() param list is
> the shift count. (I don't know how it should work by design.)

I think I have figured it out.

Guest endianness affects where those bytes are placed, but not the order
in which they are fetched; and the effects of guest endianness are
always cleaned up by adjust_endianness, so ultimately they do not matter.

Think of how you would implement the uint64_t read in fw_cfg:

File bytes         12 34 56 78 9a bc de f0

fw_cfg_data_mem_read should read

size==4 BE host    0x12345678
size==4 LE host    0x78563412
size==8 BE host    0x123456789abcdef0
size==8 LE host    0xf0debc9a78563412

So the implementation of fw_cfg_data_mem_read must depend on host
endianness.  Instead, memory.c always fills in bytes in the same order,
on the assumption that the reads are idempotent.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 21:47             ` Paolo Bonzini
@ 2014-12-17  4:52               ` Laszlo Ersek
  0 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-17  4:52 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 12/16/14 22:47, Paolo Bonzini wrote:
> On 16/12/2014 21:17, Laszlo Ersek wrote:
>>>> I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
>>>> both "addr" and "size", and fw_cfg_read() simply advances the
>>>> "cur_offset" member.
>> Ah okay, I understand your point now; you're probably saying that
>> access_with_adjusted_size() traverses the offsets in the wrong order.
>> ... I don't see how; the only difference in the access() param list is
>> the shift count. (I don't know how it should work by design.)
> 
> I think I have figured it out.
> 
> Guest endianness affects where those bytes are placed, but not the order
> in which they are fetched; and the effects of guest endianness are
> always cleaned up by adjust_endianness, so ultimately they do not matter.
> 
> Think of how you would implement the uint64_t read in fw_cfg:
> 
> File bytes         12 34 56 78 9a bc de f0
> 
> fw_cfg_data_mem_read should read
> 
> size==4 BE host    0x12345678
> size==4 LE host    0x78563412
> size==8 BE host    0x123456789abcdef0
> size==8 LE host    0xf0debc9a78563412
> 
> So the implementation of fw_cfg_data_mem_read must depend on host
> endianness.  Instead, memory.c always fills in bytes in the same order,
> on the assumption that the reads are idempotent.

I see. Thanks!
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:40           ` Paolo Bonzini
  2014-12-16 21:47             ` Peter Maydell
@ 2014-12-17  5:06             ` Laszlo Ersek
  2014-12-17  9:23               ` Paolo Bonzini
  1 sibling, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-17  5:06 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel

On 12/16/14 21:40, Paolo Bonzini wrote:
> On 16/12/2014 21:06, Laszlo Ersek wrote:

>> You flipped the combined ops to LE in commit 6fdf98f2 (and, apparently,
>> I reviewed it). Shouldn't we do the same for the standalone selector?
> 
> No.  The standalone selector is used as MMIO, and the BE platforms
> expect the platform to be big-endian.  The combined ops are only used on
> ISA ports, where the firmware expects them to be little-endian (as
> mentioned in the commit message).
> 
> That said, the standalone selector is used by BE platforms only, so we
> know that the standalone selector is always DEVICE_BIG_ENDIAN.

This series exposes the standalone selector (as MMIO) to ARM guests as
well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the
kernel I'm saying that the selector is little endian.

Therefore I think that the standalone selector is not only (going to be)
used by BE platforms (or I don't understand your above statement
correctly). But, the current (and to be preserved) NATIVE_ENDIAN setting
still matches what I say in
"Documentation/devicetree/bindings/arm/fw-cfg.txt", because, Peter said:

> NATIVE_ENDIAN means "same order as the CPU's main data bus's natural
> representation". (Note that this is not necessarily the same as "the
> endianness the CPU currently has"; on ARM you can flip the CPU between
> LE and BE at runtime, which is basically inserting a byte-swizzling
> step between data accesses and the CPU's data bus, which is always LE
> for ARMv7+.)

In other words, the standalone selector is NATIVE_ENDIAN, but in the
description of the *ARM* bindings, we can simply say that it's little
endian.

Is that right?

Thanks
Laszlo

> 
> So if you want, you can make the standalone selector and the standalone
> datum BE and swap them in the firmware.  If the suggestion doesn't make
> you jump up and down, I understand that. :)
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-16 20:41       ` Peter Maydell
@ 2014-12-17  7:13         ` Laszlo Ersek
  2014-12-17  8:28           ` Alexander Graf
  0 siblings, 1 reply; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-17  7:13 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, Andrew Jones, QEMU Developers, Alexander Graf

On 12/16/14 21:41, Peter Maydell wrote:
> On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
>> The root of this question is what each of
>>
>> enum device_endian {
>>     DEVICE_NATIVE_ENDIAN,
>>     DEVICE_BIG_ENDIAN,
>>     DEVICE_LITTLE_ENDIAN,
>> };
>>
>> means.
> 
> An opening remark: endianness is a horribly confusing topic and
> support of more than one endianness is even worse. I may have made
> some inadvertent errors in this reply; if you think so please
> let me know and I'll have another stab at it.
> 
> That said: the device_endian options indicate what a device's
> internal idea of its endianness is. This is most relevant if
> a device accepts accesses at wider than byte width
> (for instance, if you can read a 32-bit value from
> address offset 0 and also an 8-bit value from offset 3 then
> how do those values line up? If you read a 32-bit value then
> which way round is it compared to what the device's io read/write
> function return?).
> 
> NATIVE_ENDIAN means "same order as the CPU's main data bus's
> natural representation". (Note that this is not necessarily
> the same as "the endianness the CPU currently has"; on ARM
> you can flip the CPU between LE and BE at runtime, which
> is basically inserting a byte-swizzling step between data
> accesses and the CPU's data bus, which is always LE for ARMv7+.)
> 
> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
> (because the guest vcpu reads it directly with the h/w cpu).
> 
>> Consider the following call tree, which implements the splitting /
>> combining of an MMIO read:
>>
>> memory_region_dispatch_read() [memory.c]
>>   memory_region_dispatch_read1()
>>     access_with_adjusted_size()
>>       memory_region_big_endian()
>>       for each byte in the wide access:
>>         memory_region_read_accessor()
>>           fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>             fw_cfg_read()
>>   adjust_endianness()
>>     memory_region_wrong_endianness()
>>     bswapXX()
>>
>> The function access_with_adjusted_size() always iterates over the MMIO
>> address range in incrementing address order. However, it can calculate
>> the shift count for memory_region_read_accessor() in two ways.
>>
>> When memory_region_big_endian() returns "true", the shift count
>> decreases as the MMIO address increases.
>>
>> When memory_region_big_endian() returns "false", the shift count
>> increases as the MMIO address increases.
> 
> Yep, because this is how the device has told us that it thinks
> of accesses as being put together.
> 
> The column in your table "host value" is the 16 bit value read from
> the device, ie what we have decided (based on device_endian) that
> it would have returned us if it had supported a 16 bit read directly
> itself. BE devices compose 16 bit values with the high byte first,
> LE devices with the low byte first, and native-endian devices
> in the same order as guest-endianness.
> 
>> In memory_region_read_accessor(), the shift count is used for a logical
>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>> values and hides (ie. accounts for) host endianness.
>>
>> Let's consider
>> - an array of two bytes, [0xaa, 0xbb],
>> - a uint16_t access made from the guest,
>> - and all twelve possible cases.
>>
>> In the table below, the "host", "guest" and "device_endian" columns are
>> input variables. The columns to the right are calculated / derived
>> values. The arrows above the columns show the data dependencies.
>>
>> After memory_region_dispatch_read1() constructs the value that is
>> visible in the "host value" column, it is passed to adjust_endianness().
>> If memory_region_wrong_endianness() returns "true", then the host value
>> is byte-swapped. The result is then passed to the guest.
>>
>>               +---------------------------------------------------------------------------------------------------------------+----------+
>>               |                                                                                                               |          |
>>             +---- ------+-------------------------------------------------------------------------+                           |          |
>>             | |         |                                                                         |                           |          |
>>       +----------------------------------------------------------+---------+                      |                           |          |
>>       |     | |         |                                        |         |                      |                           |          |
>>       |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
>>       |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
>>       |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
>>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>
>>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>
>>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>
>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
> 
> The column you have labelled 'guest repr' here is the returned data
> from io_mem_read, whose API contract is "write the data from the
> device into this host C uint16_t (or whatever) such that it is the
> value returned by the device read as a native host value". It's
> not related to the guest order at all.
> 
> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
> which passes it a local variable "val". So now "val" has the
> "guest repr" column's bytes in it, and (as a host C variable) the
> value:
> 1,2,3 : 0xbbaa
> 4,5,6 : 0xaabb
> 7,8,9 : 0xbbaa
> 10,11,12 : 0xaabb
> 
> As you can see, this depends on the "guest endianness" (which is
> kind of the endianness of the bus): a BE guest 16 bit access to
> this device would return the 16 bit value 0xaabb, and an LE guest
> 0xbbaa, and we have exactly those values in our host C variable.
> If this is TCG, then we'll copy that 16 bit host value into the
> CPUState struct field corresponding to the destination guest
> register as-is. (TCG CPUState fields are always in host-C-order.)
> 
> However, to pass them up to KVM we have to put them into a
> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
> in target CPU endianness"], so now buf has the bytes:
> 1,2,3 : [0xaa, 0xbb]
> 4,5,6 : [0xaa, 0xbb]
> 7,8,9 : [0xaa, 0xbb]
> 10,11,12 : [0xaa, 0xbb]
> 
> ...which is the same for every case.
> 
> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
> are "the value as it would appear if the VCPU performed a load or store
> of the appropriate width directly to the byte array". Which is what we
> want -- your device has two bytes in order 0xaa, 0xbb, and we did
> a 16 bit load in the guest CPU, so we should get the same answer as if
> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
> 
> I think the fact that all of these things come out to the same
> set of bytes in the mmio.data buffer is the indication that all
> QEMU's byte swapping is correct.
> 
>> Looking at the two rightmost columns, we must conclude:
>>
>> (a) The "device_endian" field has absolutely no significance wrt. what
>>     the guest sees. In each triplet of cases, when we go from
>>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>     the guest sees the exact same value.
>>
>>     I don't understand this result (it makes me doubt device_endian
>>     makes any sense). What did I do wrong?
> 
> I think it's because you defined your device as only supporting
> byte accesses that you didn't see any difference between the
> various device_endian settings. If a device presents itself as
> just an array of bytes then it doesn't have an endianness, really.
> 
> As Paolo says, if you make your device support wider accesses
> directly and build up the value yourself then you'll see that
> setting the device_endian to LE vs BE vs native does change
> the values you see in the guest (and that you'll need to set it
> to LE and interpret the words in the guest as LE to get the
> right behaviour).
> 
>>     This means that this interface is *value preserving*, not
>>     representation preserving. In other words: when host and guest
>>     endiannesses are identical, the *array* is transferred okay (the
>>     guest array matches the initial host array [0xaa, 0xbb]). When guest
>>     and host differ in endianness, the guest will see an incorrect
>>     *array*.
> 
> Think of a device which supports only byte accesses as being
> like a lump of RAM. A big-endian guest CPU which reads 32 bits
> at a time will get different values in registers to an LE guest
> which does the same.
> 
> *However*, if both CPUs just do "read 32 bits; write 32 bits to
> RAM" (ie a kind of memcpy but with the source being the mmio
> register rather than some other bit of RAM) then you should get
> a bytewise-correct copy of the data in RAM.
> 
> So I *think* that would be a viable approach: have your QEMU
> device code as it is now, and just make sure that if the guest
> is doing wider-than-byte accesses it does them as
>   do {
>      load word from mmio register;
>      store word to RAM;
>   } while (count);
> 
> and doesn't try to interpret the byte order of the values while
> they're in the CPU register in the middle of the loop.
> 
> Paolo's suggestion would work too, if you prefer that.
> 
>> I apologize for wasting everyone's time, but I think both results are
>> very non-intuitive.
> 
> Everything around endianness handling is non-intuitive --
> it's the nature of the problem, I'm afraid. (Some of this is
> also QEMU's fault for not having good documentation of the
> semantics of each of the layers involved in memory accesses.
> I have on my todo list the vague idea of trying to write these
> all up as a blog post.)

Thanks for taking the time to write this up. My analysis must have
missed at least two important things then:
- the device's read/write function needs to consider address & size, and
  return values that match host byte order. fw_cfg doesn't conform ATM.
- there's one more layer outside the call tree that I checked that can
  perform endianness conversion.

I'll try to implement Paolo's suggestion (ie. support wide accesses in
fw_cfg internally, not relying on splitting/combining by memory.c).

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-17  7:13         ` Laszlo Ersek
@ 2014-12-17  8:28           ` Alexander Graf
  2014-12-17  8:40             ` Laszlo Ersek
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2014-12-17  8:28 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Peter Maydell, Andrew Jones, Paolo Bonzini, QEMU Developers




> Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <lersek@redhat.com>:
> 
>> On 12/16/14 21:41, Peter Maydell wrote:
>>> On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
>>> The root of this question is what each of
>>> 
>>> enum device_endian {
>>>    DEVICE_NATIVE_ENDIAN,
>>>    DEVICE_BIG_ENDIAN,
>>>    DEVICE_LITTLE_ENDIAN,
>>> };
>>> 
>>> means.
>> 
>> An opening remark: endianness is a horribly confusing topic and
>> support of more than one endianness is even worse. I may have made
>> some inadvertent errors in this reply; if you think so please
>> let me know and I'll have another stab at it.
>> 
>> That said: the device_endian options indicate what a device's
>> internal idea of its endianness is. This is most relevant if
>> a device accepts accesses at wider than byte width
>> (for instance, if you can read a 32-bit value from
>> address offset 0 and also an 8-bit value from offset 3 then
>> how do those values line up? If you read a 32-bit value then
>> which way round is it compared to what the device's io read/write
>> function return?).
>> 
>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>> natural representation". (Note that this is not necessarily
>> the same as "the endianness the CPU currently has"; on ARM
>> you can flip the CPU between LE and BE at runtime, which
>> is basically inserting a byte-swizzling step between data
>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>> 
>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>> (because the guest vcpu reads it directly with the h/w cpu).
>> 
>>> Consider the following call tree, which implements the splitting /
>>> combining of an MMIO read:
>>> 
>>> memory_region_dispatch_read() [memory.c]
>>>  memory_region_dispatch_read1()
>>>    access_with_adjusted_size()
>>>      memory_region_big_endian()
>>>      for each byte in the wide access:
>>>        memory_region_read_accessor()
>>>          fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>>            fw_cfg_read()
>>>  adjust_endianness()
>>>    memory_region_wrong_endianness()
>>>    bswapXX()
>>> 
>>> The function access_with_adjusted_size() always iterates over the MMIO
>>> address range in incrementing address order. However, it can calculate
>>> the shift count for memory_region_read_accessor() in two ways.
>>> 
>>> When memory_region_big_endian() returns "true", the shift count
>>> decreases as the MMIO address increases.
>>> 
>>> When memory_region_big_endian() returns "false", the shift count
>>> increases as the MMIO address increases.
>> 
>> Yep, because this is how the device has told us that it thinks
>> of accesses as being put together.
>> 
>> The column in your table "host value" is the 16 bit value read from
>> the device, ie what we have decided (based on device_endian) that
>> it would have returned us if it had supported a 16 bit read directly
>> itself. BE devices compose 16 bit values with the high byte first,
>> LE devices with the low byte first, and native-endian devices
>> in the same order as guest-endianness.
>> 
>>> In memory_region_read_accessor(), the shift count is used for a logical
>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>> values and hides (ie. accounts for) host endianness.
>>> 
>>> Let's consider
>>> - an array of two bytes, [0xaa, 0xbb],
>>> - a uint16_t access made from the guest,
>>> - and all twelve possible cases.
>>> 
>>> In the table below, the "host", "guest" and "device_endian" columns are
>>> input variables. The columns to the right are calculated / derived
>>> values. The arrows above the columns show the data dependencies.
>>> 
>>> After memory_region_dispatch_read1() constructs the value that is
>>> visible in the "host value" column, it is passed to adjust_endianness().
>>> If memory_region_wrong_endianness() returns "true", then the host value
>>> is byte-swapped. The result is then passed to the guest.
>>> 
>>>              +---------------------------------------------------------------------------------------------------------------+----------+
>>>              |                                                                                                               |          |
>>>            +---- ------+-------------------------------------------------------------------------+                           |          |
>>>            | |         |                                                                         |                           |          |
>>>      +----------------------------------------------------------+---------+                      |                           |          |
>>>      |     | |         |                                        |         |                      |                           |          |
>>>      |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
>>>      |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
>>>      |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
>>> #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>> 1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>> 2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>> 3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>> 
>>> 4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>> 5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>> 6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>> 
>>> 7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>> 8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>> 9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>> 
>>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
>> 
>> The column you have labelled 'guest repr' here is the returned data
>> from io_mem_read, whose API contract is "write the data from the
>> device into this host C uint16_t (or whatever) such that it is the
>> value returned by the device read as a native host value". It's
>> not related to the guest order at all.
>> 
>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>> which passes it a local variable "val". So now "val" has the
>> "guest repr" column's bytes in it, and (as a host C variable) the
>> value:
>> 1,2,3 : 0xbbaa
>> 4,5,6 : 0xaabb
>> 7,8,9 : 0xbbaa
>> 10,11,12 : 0xaabb
>> 
>> As you can see, this depends on the "guest endianness" (which is
>> kind of the endianness of the bus): a BE guest 16 bit access to
>> this device would return the 16 bit value 0xaabb, and an LE guest
>> 0xbbaa, and we have exactly those values in our host C variable.
>> If this is TCG, then we'll copy that 16 bit host value into the
>> CPUState struct field corresponding to the destination guest
>> register as-is. (TCG CPUState fields are always in host-C-order.)
>> 
>> However, to pass them up to KVM we have to put them into a
>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>> in target CPU endianness"], so now buf has the bytes:
>> 1,2,3 : [0xaa, 0xbb]
>> 4,5,6 : [0xaa, 0xbb]
>> 7,8,9 : [0xaa, 0xbb]
>> 10,11,12 : [0xaa, 0xbb]
>> 
>> ...which is the same for every case.
>> 
>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>> are "the value as it would appear if the VCPU performed a load or store
>> of the appropriate width directly to the byte array". Which is what we
>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>> a 16 bit load in the guest CPU, so we should get the same answer as if
>> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
>> 
>> I think the fact that all of these things come out to the same
>> set of bytes in the mmio.data buffer is the indication that all
>> QEMU's byte swapping is correct.
>> 
>>> Looking at the two rightmost columns, we must conclude:
>>> 
>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>>    the guest sees. In each triplet of cases, when we go from
>>>    DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>>    the guest sees the exact same value.
>>> 
>>>    I don't understand this result (it makes me doubt device_endian
>>>    makes any sense). What did I do wrong?
>> 
>> I think it's because you defined your device as only supporting
>> byte accesses that you didn't see any difference between the
>> various device_endian settings. If a device presents itself as
>> just an array of bytes then it doesn't have an endianness, really.
>> 
>> As Paolo says, if you make your device support wider accesses
>> directly and build up the value yourself then you'll see that
>> setting the device_endian to LE vs BE vs native does change
>> the values you see in the guest (and that you'll need to set it
>> to LE and interpret the words in the guest as LE to get the
>> right behaviour).
>> 
>>>    This means that this interface is *value preserving*, not
>>>    representation preserving. In other words: when host and guest
>>>    endiannesses are identical, the *array* is transferred okay (the
>>>    guest array matches the initial host array [0xaa, 0xbb]). When guest
>>>    and host differ in endianness, the guest will see an incorrect
>>>    *array*.
>> 
>> Think of a device which supports only byte accesses as being
>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>> at a time will get different values in registers to an LE guest
>> which does the same.
>> 
>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>> RAM" (ie a kind of memcpy but with the source being the mmio
>> register rather than some other bit of RAM) then you should get
>> a bytewise-correct copy of the data in RAM.
>> 
>> So I *think* that would be a viable approach: have your QEMU
>> device code as it is now, and just make sure that if the guest
>> is doing wider-than-byte accesses it does them as
>>  do {
>>     load word from mmio register;
>>     store word to RAM;
>>  } while (count);
>> 
>> and doesn't try to interpret the byte order of the values while
>> they're in the CPU register in the middle of the loop.
>> 
>> Paolo's suggestion would work too, if you prefer that.
>> 
>>> I apologize for wasting everyone's time, but I think both results are
>>> very non-intuitive.
>> 
>> Everything around endianness handling is non-intuitive --
>> it's the nature of the problem, I'm afraid. (Some of this is
>> also QEMU's fault for not having good documentation of the
>> semantics of each of the layers involved in memory accesses.
>> I have on my todo list the vague idea of trying to write these
>> all up as a blog post.)
> 
> Thanks for taking the time to write this up. My analysis must have
> missed at least two important things then:
> - the device's read/write function needs to consider address & size, and
>  return values that match host byte order. fw_cfg doesn't conform ATM.
> - there's one more layer outside the call tree that I checked that can
>  perform endianness conversion.
> 
> I'll try to implement Paolo's suggestion (ie. support wide accesses in
> fw_cfg internally, not relying on splitting/combining by memory.c).

Awesome :). Please define it as device little endian while you go as well. That should give us fewer headaches if we want to support wide access on ppc.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-17  8:28           ` Alexander Graf
@ 2014-12-17  8:40             ` Laszlo Ersek
  0 siblings, 0 replies; 37+ messages in thread
From: Laszlo Ersek @ 2014-12-17  8:40 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Andrew Jones, Paolo Bonzini, QEMU Developers

On 12/17/14 09:28, Alexander Graf wrote:
> 
> 
> 
>> Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <lersek@redhat.com>:
>>
>>> On 12/16/14 21:41, Peter Maydell wrote:
>>>> On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
>>>> The root of this question is what each of
>>>>
>>>> enum device_endian {
>>>>    DEVICE_NATIVE_ENDIAN,
>>>>    DEVICE_BIG_ENDIAN,
>>>>    DEVICE_LITTLE_ENDIAN,
>>>> };
>>>>
>>>> means.
>>>
>>> An opening remark: endianness is a horribly confusing topic and
>>> support of more than one endianness is even worse. I may have made
>>> some inadvertent errors in this reply; if you think so please
>>> let me know and I'll have another stab at it.
>>>
>>> That said: the device_endian options indicate what a device's
>>> internal idea of its endianness is. This is most relevant if
>>> a device accepts accesses at wider than byte width
>>> (for instance, if you can read a 32-bit value from
>>> address offset 0 and also an 8-bit value from offset 3 then
>>> how do those values line up? If you read a 32-bit value then
>>> which way round is it compared to what the device's io read/write
>>> function return?).
>>>
>>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>>> natural representation". (Note that this is not necessarily
>>> the same as "the endianness the CPU currently has"; on ARM
>>> you can flip the CPU between LE and BE at runtime, which
>>> is basically inserting a byte-swizzling step between data
>>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>>>
>>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>>> (because the guest vcpu reads it directly with the h/w cpu).
>>>
>>>> Consider the following call tree, which implements the splitting /
>>>> combining of an MMIO read:
>>>>
>>>> memory_region_dispatch_read() [memory.c]
>>>>  memory_region_dispatch_read1()
>>>>    access_with_adjusted_size()
>>>>      memory_region_big_endian()
>>>>      for each byte in the wide access:
>>>>        memory_region_read_accessor()
>>>>          fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>>>            fw_cfg_read()
>>>>  adjust_endianness()
>>>>    memory_region_wrong_endianness()
>>>>    bswapXX()
>>>>
>>>> The function access_with_adjusted_size() always iterates over the MMIO
>>>> address range in incrementing address order. However, it can calculate
>>>> the shift count for memory_region_read_accessor() in two ways.
>>>>
>>>> When memory_region_big_endian() returns "true", the shift count
>>>> decreases as the MMIO address increases.
>>>>
>>>> When memory_region_big_endian() returns "false", the shift count
>>>> increases as the MMIO address increases.
>>>
>>> Yep, because this is how the device has told us that it thinks
>>> of accesses as being put together.
>>>
>>> The column in your table "host value" is the 16 bit value read from
>>> the device, ie what we have decided (based on device_endian) that
>>> it would have returned us if it had supported a 16 bit read directly
>>> itself. BE devices compose 16 bit values with the high byte first,
>>> LE devices with the low byte first, and native-endian devices
>>> in the same order as guest-endianness.
>>>
>>>> In memory_region_read_accessor(), the shift count is used for a logical
>>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>>> values and hides (ie. accounts for) host endianness.
>>>>
>>>> Let's consider
>>>> - an array of two bytes, [0xaa, 0xbb],
>>>> - a uint16_t access made from the guest,
>>>> - and all twelve possible cases.
>>>>
>>>> In the table below, the "host", "guest" and "device_endian" columns are
>>>> input variables. The columns to the right are calculated / derived
>>>> values. The arrows above the columns show the data dependencies.
>>>>
>>>> After memory_region_dispatch_read1() constructs the value that is
>>>> visible in the "host value" column, it is passed to adjust_endianness().
>>>> If memory_region_wrong_endianness() returns "true", then the host value
>>>> is byte-swapped. The result is then passed to the guest.
>>>>
>>>>              +---------------------------------------------------------------------------------------------------------------+----------+
>>>>              |                                                                                                               |          |
>>>>            +---- ------+-------------------------------------------------------------------------+                           |          |
>>>>            | |         |                                                                         |                           |          |
>>>>      +----------------------------------------------------------+---------+                      |                           |          |
>>>>      |     | |         |                                        |         |                      |                           |          |
>>>>      |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
>>>>      |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
>>>>      |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
>>>> #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>>> 1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>> 2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>>> 3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>>
>>>> 4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>> 5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>> 6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>>>
>>>> 7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>> 8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>>> 9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>>
>>>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
>>>
>>> The column you have labelled 'guest repr' here is the returned data
>>> from io_mem_read, whose API contract is "write the data from the
>>> device into this host C uint16_t (or whatever) such that it is the
>>> value returned by the device read as a native host value". It's
>>> not related to the guest order at all.
>>>
>>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>>> which passes it a local variable "val". So now "val" has the
>>> "guest repr" column's bytes in it, and (as a host C variable) the
>>> value:
>>> 1,2,3 : 0xbbaa
>>> 4,5,6 : 0xaabb
>>> 7,8,9 : 0xbbaa
>>> 10,11,12 : 0xaabb
>>>
>>> As you can see, this depends on the "guest endianness" (which is
>>> kind of the endianness of the bus): a BE guest 16 bit access to
>>> this device would return the 16 bit value 0xaabb, and an LE guest
>>> 0xbbaa, and we have exactly those values in our host C variable.
>>> If this is TCG, then we'll copy that 16 bit host value into the
>>> CPUState struct field corresponding to the destination guest
>>> register as-is. (TCG CPUState fields are always in host-C-order.)
>>>
>>> However, to pass them up to KVM we have to put them into a
>>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>>> in target CPU endianness"], so now buf has the bytes:
>>> 1,2,3 : [0xaa, 0xbb]
>>> 4,5,6 : [0xaa, 0xbb]
>>> 7,8,9 : [0xaa, 0xbb]
>>> 10,11,12 : [0xaa, 0xbb]
>>>
>>> ...which is the same for every case.
>>>
>>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>>> are "the value as it would appear if the VCPU performed a load or store
>>> of the appropriate width directly to the byte array". Which is what we
>>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>>> a 16 bit load in the guest CPU, so we should get the same answer as if
>>> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
>>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
>>>
>>> I think the fact that all of these things come out to the same
>>> set of bytes in the mmio.data buffer is the indication that all
>>> QEMU's byte swapping is correct.
>>>
>>>> Looking at the two rightmost columns, we must conclude:
>>>>
>>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>>>    the guest sees. In each triplet of cases, when we go from
>>>>    DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>>>    the guest sees the exact same value.
>>>>
>>>>    I don't understand this result (it makes me doubt device_endian
>>>>    makes any sense). What did I do wrong?
>>>
>>> I think it's because you defined your device as only supporting
>>> byte accesses that you didn't see any difference between the
>>> various device_endian settings. If a device presents itself as
>>> just an array of bytes then it doesn't have an endianness, really.
>>>
>>> As Paolo says, if you make your device support wider accesses
>>> directly and build up the value yourself then you'll see that
>>> setting the device_endian to LE vs BE vs native does change
>>> the values you see in the guest (and that you'll need to set it
>>> to LE and interpret the words in the guest as LE to get the
>>> right behaviour).
>>>
>>>>    This means that this interface is *value preserving*, not
>>>>    representation preserving. In other words: when host and guest
>>>>    endiannesses are identical, the *array* is transferred okay (the
>>>>    guest array matches the initial host array [0xaa, 0xbb]). When guest
>>>>    and host differ in endianness, the guest will see an incorrect
>>>>    *array*.
>>>
>>> Think of a device which supports only byte accesses as being
>>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>>> at a time will get different values in registers to an LE guest
>>> which does the same.
>>>
>>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>>> RAM" (ie a kind of memcpy but with the source being the mmio
>>> register rather than some other bit of RAM) then you should get
>>> a bytewise-correct copy of the data in RAM.
>>>
>>> So I *think* that would be a viable approach: have your QEMU
>>> device code as it is now, and just make sure that if the guest
>>> is doing wider-than-byte accesses it does them as
>>>  do {
>>>     load word from mmio register;
>>>     store word to RAM;
>>>  } while (count);
>>>
>>> and doesn't try to interpret the byte order of the values while
>>> they're in the CPU register in the middle of the loop.
>>>
>>> Paolo's suggestion would work too, if you prefer that.
>>>
>>>> I apologize for wasting everyone's time, but I think both results are
>>>> very non-intuitive.
>>>
>>> Everything around endianness handling is non-intuitive --
>>> it's the nature of the problem, I'm afraid. (Some of this is
>>> also QEMU's fault for not having good documentation of the
>>> semantics of each of the layers involved in memory accesses.
>>> I have on my todo list the vague idea of trying to write these
>>> all up as a blog post.)
>>
>> Thanks for taking the time to write this up. My analysis must have
>> missed at least two important things then:
>> - the device's read/write function needs to consider address & size, and
>>  return values that match host byte order. fw_cfg doesn't conform ATM.
>> - there's one more layer outside the call tree that I checked that can
>>  perform endianness conversion.
>>
>> I'll try to implement Paolo's suggestion (ie. support wide accesses in
>> fw_cfg internally, not relying on splitting/combining by memory.c).
> 
> Awesome :). Please define it as device little endian while you go as well. That should give us fewer headaches if we want to support wide access on ppc.

Definitely; that was actually the first part of Paolo's suggestion. :)

Thanks!
Laszlo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-17  5:06             ` Laszlo Ersek
@ 2014-12-17  9:23               ` Paolo Bonzini
  2014-12-17  9:31                 ` Alexander Graf
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2014-12-17  9:23 UTC (permalink / raw)
  To: Laszlo Ersek, Andrew Jones; +Cc: peter.maydell, Alexander Graf, qemu-devel



On 17/12/2014 06:06, Laszlo Ersek wrote:
>> > That said, the standalone selector is used by BE platforms only, so we
>> > know that the standalone selector is always DEVICE_BIG_ENDIAN.
> This series exposes the standalone selector (as MMIO) to ARM guests as
> well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the
> kernel I'm saying that the selector is little endian.
> 
> Therefore I think that the standalone selector is not only (going to be)
> used by BE platforms (or I don't understand your above statement
> correctly).

It's not going to be used only by BE platforms.  But so far it is used
by BE platforms only.  If you want to keep everything consistent, you
can make it (and the wide datum) BE; and swizzle data in the firmware.
Otherwise, NATIVE_ENDIAN is fine.

> In other words, the standalone selector is NATIVE_ENDIAN, but in the
> description of the *ARM* bindings, we can simply say that it's little
> endian.

Yes.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
  2014-12-17  9:23               ` Paolo Bonzini
@ 2014-12-17  9:31                 ` Alexander Graf
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Graf @ 2014-12-17  9:31 UTC (permalink / raw)
  To: Paolo Bonzini, Laszlo Ersek, Andrew Jones; +Cc: peter.maydell, qemu-devel



On 17.12.14 10:23, Paolo Bonzini wrote:
> 
> 
> On 17/12/2014 06:06, Laszlo Ersek wrote:
>>>> That said, the standalone selector is used by BE platforms only, so we
>>>> know that the standalone selector is always DEVICE_BIG_ENDIAN.
>> This series exposes the standalone selector (as MMIO) to ARM guests as
>> well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the
>> kernel I'm saying that the selector is little endian.
>>
>> Therefore I think that the standalone selector is not only (going to be)
>> used by BE platforms (or I don't understand your above statement
>> correctly).
> 
> It's not going to be used only by BE platforms.  But so far it is used
> by BE platforms only.  If you want to keep everything consistent, you
> can make it (and the wide datum) BE; and swizzle data in the firmware.
> Otherwise, NATIVE_ENDIAN is fine.

For the sake of keeping myself sane, I would really prefer not to have
NATIVE_ENDIAN. Once you start thinking about little endian guests on PPC
sharing code with little endian guests on x86 your head will start to
explode otherwise.

If code becomes simpler by making the mmio accessors BIG_ENDIAN, that's
fine with me as well.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-12-17  9:32 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
2014-12-16 13:48   ` Andrew Jones
2014-12-16 19:00     ` Laszlo Ersek
2014-12-16 19:49       ` Paolo Bonzini
2014-12-16 20:06         ` Laszlo Ersek
2014-12-16 20:17           ` Laszlo Ersek
2014-12-16 21:47             ` Paolo Bonzini
2014-12-17  4:52               ` Laszlo Ersek
2014-12-16 20:40           ` Paolo Bonzini
2014-12-16 21:47             ` Peter Maydell
2014-12-17  5:06             ` Laszlo Ersek
2014-12-17  9:23               ` Paolo Bonzini
2014-12-17  9:31                 ` Alexander Graf
2014-12-16 20:41       ` Peter Maydell
2014-12-17  7:13         ` Laszlo Ersek
2014-12-17  8:28           ` Alexander Graf
2014-12-17  8:40             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 2/8] fw_cfg: generalize overlap check for combining control and data I/O ports Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property Laszlo Ersek
2014-12-16 12:06   ` Alexander Graf
2014-12-16 12:42     ` Laszlo Ersek
2014-12-16 16:59       ` Laszlo Ersek
2014-12-16 17:10         ` Peter Maydell
2014-12-16 17:20           ` Alexander Graf
2014-12-16 18:52             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 4/8] fw_cfg: expose the "data_memwidth" prop with fw_cfg_init_data_memwidth() Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 5/8] arm: add fw_cfg to "virt" board Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 6/8] hw/loader: split out load_image_gzipped_buffer() Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg Laszlo Ersek
2014-12-16 12:15   ` Alexander Graf
2014-12-16 12:18     ` Peter Maydell
2014-12-16 12:20       ` Alexander Graf
2014-12-16 12:25         ` Peter Maydell
2014-12-16 12:42           ` Richard W.M. Jones
2014-12-16 12:44             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 8/8] hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI firmware Laszlo Ersek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.