All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH qemu v2] x86: don't let decompressed kernel image clobber setup_data
@ 2022-12-30 18:38 Jason A. Donenfeld
  2022-12-30 21:59 ` Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2022-12-30 18:38 UTC (permalink / raw)
  To: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd
  Cc: Jason A. Donenfeld

The setup_data links are appended to the compressed kernel image. Since
the kernel image is typically loaded at 0x100000, setup_data lives at
`0x100000 + compressed_size`, which does not get relocated during the
kernel's boot process.

The kernel typically decompresses the image starting at address
0x1000000 (note: there's one more zero there than the compressed image
above). This usually is fine for most kernels.

However, if the compressed image is actually quite large, then
setup_data will live at a `0x100000 + compressed_size` that extends into
the decompressed zone at 0x1000000. In other words, if compressed_size
is larger than `0x1000000 - 0x100000`, then the decompression step will
clobber setup_data, resulting in crashes.

Visually, what happens now is that QEMU appends setup_data to the kernel
image:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

The problem is that this decompresses to 0x1000000 (one more zero). So
if l1 is > (0x1000000-0x100000), then this winds up looking like:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

                                 d e c o m p r e s s e d   k e r n e l
                     |-------------------------------------------------------------|
                0x1000000                                                     0x1000000+l3

The decompressed kernel seemingly overwriting the compressed kernel
image isn't a problem, because that gets relocated to a higher address
early on in the boot process, at the end of startup_64. setup_data,
however, stays in the same place, since those links are self referential
and nothing fixes them up.  So the decompressed kernel clobbers it.

Fix this by appending setup_data to the cmdline blob rather than the
kernel image blob, which remains at a lower address that won't get
clobbered.

This could have been done by overwriting the initrd blob instead, but
that poses big difficulties, such as no longer being able to use memory
mapped files for initrd, hurting performance, and, more importantly, the
initrd address calculation is hard coded in qboot, and it always grows
down rather than up, which means lots of brittle semantics would have to
be changed around, incurring more complexity. In contrast, using cmdline
is simple and doesn't interfere with anything.

The microvm machine has a gross hack where it fiddles with fw_cfg data
after the fact. So this hack is updated to account for this appending,
by reserving some bytes.

Cc: x86@kernel.org
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/microvm.c         | 10 ++++----
 hw/i386/x86.c             | 50 +++++++++++++++++++--------------------
 hw/nvram/fw_cfg.c         |  9 +++++++
 include/hw/i386/microvm.h |  5 ++--
 include/hw/nvram/fw_cfg.h |  9 +++++++
 5 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 170a331e3f..1f3686d90d 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
     MicrovmMachineState *mms = MICROVM_MACHINE(machine);
     BusState *bus;
     BusChild *kid;
-    char *cmdline;
+    void *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA);
+    char *cmdline, *insertion_point;
 
     /*
      * Find MMIO transports with attached devices, and add them to the kernel
@@ -387,7 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
      * Yes, this is a hack, but one that heavily improves the UX without
      * introducing any significant issues.
      */
-    cmdline = g_strdup(machine->kernel_cmdline);
+    cmdline = g_strdup(existing_cmdline);
     bus = sysbus_get_default();
     QTAILQ_FOREACH(kid, &bus->children, sibling) {
         DeviceState *dev = kid->child;
@@ -411,9 +412,8 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
         }
     }
 
-    fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(cmdline) + 1);
-    fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline);
-
+    insertion_point = existing_cmdline + strlen(existing_cmdline);
+    memcpy(insertion_point, cmdline, strlen(cmdline) + 1);
     g_free(cmdline);
 }
 
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926..b57a993596 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -50,6 +50,7 @@
 #include "hw/intc/i8259.h"
 #include "hw/rtc/mc146818rtc.h"
 #include "target/i386/sev.h"
+#include "hw/i386/microvm.h"
 
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/irq.h"
@@ -802,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size;
     uint32_t initrd_max;
     uint8_t header[8192], *setup, *kernel;
     hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
@@ -813,12 +814,16 @@ void x86_load_linux(X86MachineState *x86ms,
     const char *kernel_filename = machine->kernel_filename;
     const char *initrd_filename = machine->initrd_filename;
     const char *dtb_filename = machine->dtb;
-    const char *kernel_cmdline = machine->kernel_cmdline;
+    char *kernel_cmdline;
     SevKernelLoaderContext sev_load_ctx = {};
     enum { RNG_SEED_LENGTH = 32 };
 
-    /* Align to 16 bytes as a paranoia measure */
-    cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
+    /* Add the NUL terminator, some padding for the microvm cmdline fiddling
+     * hack, and then align to 16 bytes as a paranoia measure */
+    cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
+                    VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
+    /* Make a copy, since we might append arbitrary bytes to it later. */
+    kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size);
 
     /* load the kernel header */
     f = fopen(kernel_filename, "rb");
@@ -959,12 +964,6 @@ void x86_load_linux(X86MachineState *x86ms,
         initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1;
     }
 
-    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
-    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) + 1);
-    fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
-    sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
-    sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1;
-
     if (protocol >= 0x202) {
         stl_p(header + 0x228, cmdline_addr);
     } else {
@@ -1091,27 +1090,22 @@ void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (SetupData *)(kernel + setup_data_offset);
+        cmdline_size += sizeof(SetupData) + dtb_size;
+        kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
-    if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(SetupData) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (SetupData *)(kernel + setup_data_offset);
+    if (!legacy_no_rng_seed && protocol >= 0x209) {
+        cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
+        kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
@@ -1122,6 +1116,12 @@ void x86_load_linux(X86MachineState *x86ms,
         fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
     }
 
+    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
+    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size);
+    fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline, cmdline_size);
+    sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
+    sev_load_ctx.cmdline_size = cmdline_size;
+
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
     sev_load_ctx.kernel_data = (char *)kernel;
@@ -1134,7 +1134,7 @@ void x86_load_linux(X86MachineState *x86ms,
      * kernel on the other side of the fw_cfg interface matches the hash of the
      * file the user passed in.
      */
-    if (!sev_enabled()) {
+    if (!sev_enabled() && first_setup_data) {
         SetupDataFixup *fixup = g_malloc(sizeof(*fixup));
 
         memcpy(setup, header, MIN(sizeof(header), setup_size));
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index a00881bc64..432754eda4 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
     fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
 }
 
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
+{
+    int arch = !!(key & FW_CFG_ARCH_LOCAL);
+
+    key &= FW_CFG_ENTRY_MASK;
+    assert(key < fw_cfg_max_entry(s));
+    return s->entries[arch][key].data;
+}
+
 void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value)
 {
     size_t sz = strlen(value) + 1;
diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h
index fad97a891d..1bc7133101 100644
--- a/include/hw/i386/microvm.h
+++ b/include/hw/i386/microvm.h
@@ -50,8 +50,9 @@
  */
 
 /* Platform virtio definitions */
-#define VIRTIO_MMIO_BASE      0xfeb00000
-#define VIRTIO_CMDLINE_MAXLEN 64
+#define VIRTIO_MMIO_BASE                0xfeb00000
+#define VIRTIO_CMDLINE_MAXLEN           64
+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN    (VIRTIO_CMDLINE_MAXLEN * 16)
 
 #define GED_MMIO_BASE         0xfea00000
 #define GED_MMIO_BASE_MEMHP   (GED_MMIO_BASE + 0x100)
diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 2e503904dc..990dcdbb2e 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
                                void *data, size_t len,
                                bool read_only);
 
+/**
+ * fw_cfg_read_bytes_ptr:
+ * @s: fw_cfg device being modified
+ * @key: selector key value for new fw_cfg item
+ *
+ * Reads an existing fw_cfg data pointer.
+ */
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
+
 /**
  * fw_cfg_add_string:
  * @s: fw_cfg device being modified
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v2] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 18:38 [PATCH qemu v2] x86: don't let decompressed kernel image clobber setup_data Jason A. Donenfeld
@ 2022-12-30 21:59 ` Jason A. Donenfeld
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2022-12-30 21:59 UTC (permalink / raw)
  To: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd

On Fri, Dec 30, 2022 at 07:38:19PM +0100, Jason A. Donenfeld wrote:
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.

This is a little derpy. I'll send a v3 in a second to clean this up.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 21:59 ` Jason A. Donenfeld
@ 2022-12-30 22:07   ` Jason A. Donenfeld
  2023-01-05  5:16     ` Eric Biggers
                       ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2022-12-30 22:07 UTC (permalink / raw)
  To: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd
  Cc: Jason A. Donenfeld

The setup_data links are appended to the compressed kernel image. Since
the kernel image is typically loaded at 0x100000, setup_data lives at
`0x100000 + compressed_size`, which does not get relocated during the
kernel's boot process.

The kernel typically decompresses the image starting at address
0x1000000 (note: there's one more zero there than the compressed image
above). This usually is fine for most kernels.

However, if the compressed image is actually quite large, then
setup_data will live at a `0x100000 + compressed_size` that extends into
the decompressed zone at 0x1000000. In other words, if compressed_size
is larger than `0x1000000 - 0x100000`, then the decompression step will
clobber setup_data, resulting in crashes.

Visually, what happens now is that QEMU appends setup_data to the kernel
image:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

The problem is that this decompresses to 0x1000000 (one more zero). So
if l1 is > (0x1000000-0x100000), then this winds up looking like:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

                                 d e c o m p r e s s e d   k e r n e l
                     |-------------------------------------------------------------|
                0x1000000                                                     0x1000000+l3

The decompressed kernel seemingly overwriting the compressed kernel
image isn't a problem, because that gets relocated to a higher address
early on in the boot process, at the end of startup_64. setup_data,
however, stays in the same place, since those links are self referential
and nothing fixes them up.  So the decompressed kernel clobbers it.

Fix this by appending setup_data to the cmdline blob rather than the
kernel image blob, which remains at a lower address that won't get
clobbered.

This could have been done by overwriting the initrd blob instead, but
that poses big difficulties, such as no longer being able to use memory
mapped files for initrd, hurting performance, and, more importantly, the
initrd address calculation is hard coded in qboot, and it always grows
down rather than up, which means lots of brittle semantics would have to
be changed around, incurring more complexity. In contrast, using cmdline
is simple and doesn't interfere with anything.

The microvm machine has a gross hack where it fiddles with fw_cfg data
after the fact. So this hack is updated to account for this appending,
by reserving some bytes.

Cc: x86@kernel.org
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v2->v3:
- Fix mistakes in string handling.
Changes v1->v2:
- Append setup_data to cmdline instead of kernel image.

 hw/i386/microvm.c         | 13 ++++++----
 hw/i386/x86.c             | 50 +++++++++++++++++++--------------------
 hw/nvram/fw_cfg.c         |  9 +++++++
 include/hw/i386/microvm.h |  5 ++--
 include/hw/nvram/fw_cfg.h |  9 +++++++
 5 files changed, 54 insertions(+), 32 deletions(-)

diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 170a331e3f..1b19d28c02 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
     MicrovmMachineState *mms = MICROVM_MACHINE(machine);
     BusState *bus;
     BusChild *kid;
-    char *cmdline;
+    char *cmdline, *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA);
+    size_t len;
 
     /*
      * Find MMIO transports with attached devices, and add them to the kernel
@@ -387,7 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
      * Yes, this is a hack, but one that heavily improves the UX without
      * introducing any significant issues.
      */
-    cmdline = g_strdup(machine->kernel_cmdline);
+    cmdline = g_strdup(existing_cmdline);
     bus = sysbus_get_default();
     QTAILQ_FOREACH(kid, &bus->children, sibling) {
         DeviceState *dev = kid->child;
@@ -411,9 +412,11 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
         }
     }
 
-    fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(cmdline) + 1);
-    fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline);
-
+    len = strlen(cmdline);
+    if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline))
+        fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n");
+    else
+        memcpy(existing_cmdline, cmdline, len + 1);
     g_free(cmdline);
 }
 
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926..b57a993596 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -50,6 +50,7 @@
 #include "hw/intc/i8259.h"
 #include "hw/rtc/mc146818rtc.h"
 #include "target/i386/sev.h"
+#include "hw/i386/microvm.h"
 
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/irq.h"
@@ -802,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size;
     uint32_t initrd_max;
     uint8_t header[8192], *setup, *kernel;
     hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
@@ -813,12 +814,16 @@ void x86_load_linux(X86MachineState *x86ms,
     const char *kernel_filename = machine->kernel_filename;
     const char *initrd_filename = machine->initrd_filename;
     const char *dtb_filename = machine->dtb;
-    const char *kernel_cmdline = machine->kernel_cmdline;
+    char *kernel_cmdline;
     SevKernelLoaderContext sev_load_ctx = {};
     enum { RNG_SEED_LENGTH = 32 };
 
-    /* Align to 16 bytes as a paranoia measure */
-    cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
+    /* Add the NUL terminator, some padding for the microvm cmdline fiddling
+     * hack, and then align to 16 bytes as a paranoia measure */
+    cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
+                    VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
+    /* Make a copy, since we might append arbitrary bytes to it later. */
+    kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size);
 
     /* load the kernel header */
     f = fopen(kernel_filename, "rb");
@@ -959,12 +964,6 @@ void x86_load_linux(X86MachineState *x86ms,
         initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1;
     }
 
-    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
-    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) + 1);
-    fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
-    sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
-    sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1;
-
     if (protocol >= 0x202) {
         stl_p(header + 0x228, cmdline_addr);
     } else {
@@ -1091,27 +1090,22 @@ void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (SetupData *)(kernel + setup_data_offset);
+        cmdline_size += sizeof(SetupData) + dtb_size;
+        kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
-    if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(SetupData) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (SetupData *)(kernel + setup_data_offset);
+    if (!legacy_no_rng_seed && protocol >= 0x209) {
+        cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
+        kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
@@ -1122,6 +1116,12 @@ void x86_load_linux(X86MachineState *x86ms,
         fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
     }
 
+    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
+    fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size);
+    fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline, cmdline_size);
+    sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
+    sev_load_ctx.cmdline_size = cmdline_size;
+
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
     sev_load_ctx.kernel_data = (char *)kernel;
@@ -1134,7 +1134,7 @@ void x86_load_linux(X86MachineState *x86ms,
      * kernel on the other side of the fw_cfg interface matches the hash of the
      * file the user passed in.
      */
-    if (!sev_enabled()) {
+    if (!sev_enabled() && first_setup_data) {
         SetupDataFixup *fixup = g_malloc(sizeof(*fixup));
 
         memcpy(setup, header, MIN(sizeof(header), setup_size));
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index a00881bc64..432754eda4 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
     fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
 }
 
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
+{
+    int arch = !!(key & FW_CFG_ARCH_LOCAL);
+
+    key &= FW_CFG_ENTRY_MASK;
+    assert(key < fw_cfg_max_entry(s));
+    return s->entries[arch][key].data;
+}
+
 void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value)
 {
     size_t sz = strlen(value) + 1;
diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h
index fad97a891d..e8af61f194 100644
--- a/include/hw/i386/microvm.h
+++ b/include/hw/i386/microvm.h
@@ -50,8 +50,9 @@
  */
 
 /* Platform virtio definitions */
-#define VIRTIO_MMIO_BASE      0xfeb00000
-#define VIRTIO_CMDLINE_MAXLEN 64
+#define VIRTIO_MMIO_BASE                0xfeb00000
+#define VIRTIO_CMDLINE_MAXLEN           64
+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN    ((VIRTIO_CMDLINE_MAXLEN + 1) * 16)
 
 #define GED_MMIO_BASE         0xfea00000
 #define GED_MMIO_BASE_MEMHP   (GED_MMIO_BASE + 0x100)
diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 2e503904dc..990dcdbb2e 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
                                void *data, size_t len,
                                bool read_only);
 
+/**
+ * fw_cfg_read_bytes_ptr:
+ * @s: fw_cfg device being modified
+ * @key: selector key value for new fw_cfg item
+ *
+ * Reads an existing fw_cfg data pointer.
+ */
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
+
 /**
  * fw_cfg_add_string:
  * @s: fw_cfg device being modified
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
@ 2023-01-05  5:16     ` Eric Biggers
  2023-01-10 12:10     ` Mathias Krause
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Eric Biggers @ 2023-01-05  5:16 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: pbonzini, x86, linux-kernel, qemu-devel, ardb, kraxel, hpa, bp, philmd

On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
>                                  d e c o m p r e s s e d   k e r n e l
>                      |-------------------------------------------------------------|
>                 0x1000000                                                     0x1000000+l3
> 
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up.  So the decompressed kernel clobbers it.
> 
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
> 
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
> 
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
> 
> Cc: x86@kernel.org
> Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

For what it's worth:

Tested-by: Eric Biggers <ebiggers@google.com>

- Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
  2023-01-05  5:16     ` Eric Biggers
@ 2023-01-10 12:10     ` Mathias Krause
  2023-01-10 15:34     ` Jason A. Donenfeld
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Mathias Krause @ 2023-01-10 12:10 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: qemu-devel, x86, linux-kernel, pbonzini, ebiggers, ardb, kraxel,
	hpa, bp, philmd

Hi Jason!

Am 30.12.22 um 23:07 schrieb Jason A. Donenfeld:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
>                                  d e c o m p r e s s e d   k e r n e l
>                      |-------------------------------------------------------------|
>                 0x1000000                                                     0x1000000+l3
> 
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up.  So the decompressed kernel clobbers it.

I just ran into this very issue yesterday when trying to boot a 6.1
kernel. pipacs pointed me to some changes of yours[1] which confirmed,
the issue is related to the additional setup_data entries, as adding,
e.g., '-M pc-i440fx-7.0' to the QEMU command line made the bug vanish
(as QEMU then omits adding the random seed setup_data entries) .

[1] https://github.com/qemu/qemu/commit/67f7e426e538

After digging a while I found this thread and it fixes the issue for me,
thereby:

Tested-by: Mathias Krause <minipli@grsecurity.net>

Thanks,
Mathias

> [snip]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
  2023-01-05  5:16     ` Eric Biggers
  2023-01-10 12:10     ` Mathias Krause
@ 2023-01-10 15:34     ` Jason A. Donenfeld
  2023-01-10 17:50       ` Michael S. Tsirkin
  2023-01-23  8:26     ` Philippe Mathieu-Daudé
  2023-02-08 17:45     ` Nathan Chancellor
  4 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2023-01-10 15:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Marcel Apfelbaum
  Cc: Peter Maydell, philmd, pbonzini, ebiggers, Mathias Krause, qemu-devel

Hi Michael,

Could you queue up this patch and mark it as a fix for 7.2.1? It is a
straight-up bug fix for a 7.2 regression that's now affected several
users.

- It has two Tested-by tags on the thread.
- hpa, the maintainer of the kernel side of this, confirmed on one of
  the various tributary threads that this approach is a correct one.
- It doesn't introduce any new functionality.

For your convenience, you can grab this out of lore here:

  https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/

Or if you want to yolo it:

  curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s

It's now sat silent on the mailing list for a while. So let's please get
this committed and backported so that the bug reports stop coming in.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-10 15:34     ` Jason A. Donenfeld
@ 2023-01-10 17:50       ` Michael S. Tsirkin
  2023-01-23  4:21         ` Eric Biggers
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2023-01-10 17:50 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Marcel Apfelbaum, Peter Maydell, philmd, pbonzini, ebiggers,
	Mathias Krause, qemu-devel

On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> straight-up bug fix for a 7.2 regression that's now affected several
> users.

OK. In the future pls cc me if you want me to merge a patch. Thanks!

> - It has two Tested-by tags on the thread.
> - hpa, the maintainer of the kernel side of this, confirmed on one of
>   the various tributary threads that this approach is a correct one.
> - It doesn't introduce any new functionality.
> 
> For your convenience, you can grab this out of lore here:
> 
>   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> 
> Or if you want to yolo it:
> 
>   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> 
> It's now sat silent on the mailing list for a while. So let's please get
> this committed and backported so that the bug reports stop coming in.
> 
> Thanks,
> Jason
> 
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-10 17:50       ` Michael S. Tsirkin
@ 2023-01-23  4:21         ` Eric Biggers
  2023-01-23 12:12           ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Biggers @ 2023-01-23  4:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason A. Donenfeld, Marcel Apfelbaum, Peter Maydell, philmd,
	pbonzini, Mathias Krause, qemu-devel

Hi Michael,

On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote:
> On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> > Hi Michael,
> > 
> > Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> > straight-up bug fix for a 7.2 regression that's now affected several
> > users.
> 
> OK. In the future pls cc me if you want me to merge a patch. Thanks!
> 
> > - It has two Tested-by tags on the thread.
> > - hpa, the maintainer of the kernel side of this, confirmed on one of
> >   the various tributary threads that this approach is a correct one.
> > - It doesn't introduce any new functionality.
> > 
> > For your convenience, you can grab this out of lore here:
> > 
> >   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> > 
> > Or if you want to yolo it:
> > 
> >   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> > 
> > It's now sat silent on the mailing list for a while. So let's please get
> > this committed and backported so that the bug reports stop coming in.
> > 

This patch still isn't on QEMU's master branch.  What happened to it?

- Eric


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
                       ` (2 preceding siblings ...)
  2023-01-10 15:34     ` Jason A. Donenfeld
@ 2023-01-23  8:26     ` Philippe Mathieu-Daudé
  2023-02-08 17:45     ` Nathan Chancellor
  4 siblings, 0 replies; 17+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-01-23  8:26 UTC (permalink / raw)
  To: Jason A. Donenfeld, pbonzini, ebiggers, x86, linux-kernel,
	qemu-devel, ardb, kraxel, hpa, bp

On 30/12/22 23:07, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
> 
>            kernel image            setup_data
>     |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
> 
>            kernel image            setup_data
>     |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
>                                   d e c o m p r e s s e d   k e r n e l
>                       |-------------------------------------------------------------|
>                  0x1000000                                                     0x1000000+l3
> 
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up.  So the decompressed kernel clobbers it.
> 
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
> 
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
> 
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
> 
> Cc: x86@kernel.org
> Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v2->v3:
> - Fix mistakes in string handling.
> Changes v1->v2:
> - Append setup_data to cmdline instead of kernel image.
> 
>   hw/i386/microvm.c         | 13 ++++++----
>   hw/i386/x86.c             | 50 +++++++++++++++++++--------------------
>   hw/nvram/fw_cfg.c         |  9 +++++++
>   include/hw/i386/microvm.h |  5 ++--
>   include/hw/nvram/fw_cfg.h |  9 +++++++
>   5 files changed, 54 insertions(+), 32 deletions(-)

> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index a00881bc64..432754eda4 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
>       fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
>   }
>   
> +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
> +{
> +    int arch = !!(key & FW_CFG_ARCH_LOCAL);
> +
> +    key &= FW_CFG_ENTRY_MASK;
> +    assert(key < fw_cfg_max_entry(s));
> +    return s->entries[arch][key].data;

Shouldn't it be safer to provide a size argument, and return
NULL if s->entries[arch][key].len < size?

Maybe this API should return a (casted) const pointer, so the
only way to update the key is via fw_cfg_add_bytes().

> +}
> +

> diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
> index 2e503904dc..990dcdbb2e 100644
> --- a/include/hw/nvram/fw_cfg.h
> +++ b/include/hw/nvram/fw_cfg.h
> @@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
>                                  void *data, size_t len,
>                                  bool read_only);
>   
> +/**
> + * fw_cfg_read_bytes_ptr:
> + * @s: fw_cfg device being modified
> + * @key: selector key value for new fw_cfg item
> + *
> + * Reads an existing fw_cfg data pointer.
> + */
> +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
> +
>   /**
>    * fw_cfg_add_string:
>    * @s: fw_cfg device being modified


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-23  4:21         ` Eric Biggers
@ 2023-01-23 12:12           ` Michael S. Tsirkin
  2023-01-23 12:37             ` Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2023-01-23 12:12 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jason A. Donenfeld, Marcel Apfelbaum, Peter Maydell, philmd,
	pbonzini, Mathias Krause, qemu-devel

On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote:
> Hi Michael,
> 
> On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote:
> > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> > > Hi Michael,
> > > 
> > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> > > straight-up bug fix for a 7.2 regression that's now affected several
> > > users.
> > 
> > OK. In the future pls cc me if you want me to merge a patch. Thanks!
> > 
> > > - It has two Tested-by tags on the thread.
> > > - hpa, the maintainer of the kernel side of this, confirmed on one of
> > >   the various tributary threads that this approach is a correct one.
> > > - It doesn't introduce any new functionality.
> > > 
> > > For your convenience, you can grab this out of lore here:
> > > 
> > >   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> > > 
> > > Or if you want to yolo it:
> > > 
> > >   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> > > 
> > > It's now sat silent on the mailing list for a while. So let's please get
> > > this committed and backported so that the bug reports stop coming in.
> > > 
> 
> This patch still isn't on QEMU's master branch.  What happened to it?
> 
> - Eric

Indeed though I remember picking it up. Tagged again now. Thanks!



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-23 12:12           ` Michael S. Tsirkin
@ 2023-01-23 12:37             ` Jason A. Donenfeld
  2023-01-28 11:15               ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2023-01-23 12:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric Biggers, Marcel Apfelbaum, Peter Maydell, philmd, pbonzini,
	Mathias Krause, qemu-devel

On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote:
> > Hi Michael,
> >
> > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> > > > Hi Michael,
> > > >
> > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> > > > straight-up bug fix for a 7.2 regression that's now affected several
> > > > users.
> > >
> > > OK. In the future pls cc me if you want me to merge a patch. Thanks!
> > >
> > > > - It has two Tested-by tags on the thread.
> > > > - hpa, the maintainer of the kernel side of this, confirmed on one of
> > > >   the various tributary threads that this approach is a correct one.
> > > > - It doesn't introduce any new functionality.
> > > >
> > > > For your convenience, you can grab this out of lore here:
> > > >
> > > >   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> > > >
> > > > Or if you want to yolo it:
> > > >
> > > >   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> > > >
> > > > It's now sat silent on the mailing list for a while. So let's please get
> > > > this committed and backported so that the bug reports stop coming in.
> > > >
> >
> > This patch still isn't on QEMU's master branch.  What happened to it?
> >
> > - Eric
>
> Indeed though I remember picking it up. Tagged again now. Thanks!

Thanks. What branch is this in? I didn't see it on:
https://gitlab.com/mstredhat/qemu/-/branches/active
https://github.com/mstsirkin/qemu/branches


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-23 12:37             ` Jason A. Donenfeld
@ 2023-01-28 11:15               ` Michael S. Tsirkin
  2023-01-30  9:31                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2023-01-28 11:15 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Eric Biggers, Marcel Apfelbaum, Peter Maydell, philmd, pbonzini,
	Mathias Krause, qemu-devel

On Mon, Jan 23, 2023 at 06:37:21AM -0600, Jason A. Donenfeld wrote:
> On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote:
> > > Hi Michael,
> > >
> > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> > > > > Hi Michael,
> > > > >
> > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> > > > > straight-up bug fix for a 7.2 regression that's now affected several
> > > > > users.
> > > >
> > > > OK. In the future pls cc me if you want me to merge a patch. Thanks!
> > > >
> > > > > - It has two Tested-by tags on the thread.
> > > > > - hpa, the maintainer of the kernel side of this, confirmed on one of
> > > > >   the various tributary threads that this approach is a correct one.
> > > > > - It doesn't introduce any new functionality.
> > > > >
> > > > > For your convenience, you can grab this out of lore here:
> > > > >
> > > > >   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> > > > >
> > > > > Or if you want to yolo it:
> > > > >
> > > > >   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> > > > >
> > > > > It's now sat silent on the mailing list for a while. So let's please get
> > > > > this committed and backported so that the bug reports stop coming in.
> > > > >
> > >
> > > This patch still isn't on QEMU's master branch.  What happened to it?
> > >
> > > - Eric
> >
> > Indeed though I remember picking it up. Tagged again now. Thanks!
> 
> Thanks. What branch is this in? I didn't see it on:
> https://gitlab.com/mstredhat/qemu/-/branches/active
> https://github.com/mstsirkin/qemu/branches

I don't use github really. And it was not pushed to gitlab as I was
figuring out issues with other patches before starting CI as CI minutes
are limited.  BTW as checkpatch was unhappy I applied a fixup -
making checkpatch happier and in the process the code change a bit
smaller.  If you want to do cleanups on top be my guest but pls
make it pass checkpatch. Thanks!


commit a00d99e04c4481fca3ee2d7c40d42993b7b059c2
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Sat Jan 28 06:08:43 2023 -0500

    fixup! x86: don't let decompressed kernel image clobber setup_data

diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 1b19d28c02..29f30dd6d3 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -378,7 +378,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
     MicrovmMachineState *mms = MICROVM_MACHINE(machine);
     BusState *bus;
     BusChild *kid;
-    char *cmdline, *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA);
+    char *cmdline, *existing_cmdline;
     size_t len;
 
     /*
@@ -388,6 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
      * Yes, this is a hack, but one that heavily improves the UX without
      * introducing any significant issues.
      */
+    existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA);
     cmdline = g_strdup(existing_cmdline);
     bus = sysbus_get_default();
     QTAILQ_FOREACH(kid, &bus->children, sibling) {
@@ -413,10 +414,11 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
     }
 
     len = strlen(cmdline);
-    if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline))
+    if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) {
         fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n");
-    else
+    } else {
         memcpy(existing_cmdline, cmdline, len + 1);
+    }
     g_free(cmdline);
 }
 
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b57a993596..eaff4227bd 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -803,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms,
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size;
+    int dtb_size, setup_data_offset;
     uint32_t initrd_max;
     uint8_t header[8192], *setup, *kernel;
     hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
@@ -818,8 +818,10 @@ void x86_load_linux(X86MachineState *x86ms,
     SevKernelLoaderContext sev_load_ctx = {};
     enum { RNG_SEED_LENGTH = 32 };
 
-    /* Add the NUL terminator, some padding for the microvm cmdline fiddling
-     * hack, and then align to 16 bytes as a paranoia measure */
+    /*
+     * Add the NUL terminator, some padding for the microvm cmdline fiddling
+     * hack, and then align to 16 bytes as a paranoia measure
+     */
     cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
                     VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
     /* Make a copy, since we might append arbitrary bytes to it later. */
@@ -1090,22 +1092,24 @@ void x86_load_linux(X86MachineState *x86ms,
             exit(1);
         }
 
+        setup_data_offset = cmdline_size;
         cmdline_size += sizeof(SetupData) + dtb_size;
         kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
-        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size);
+        setup_data = (void *)kernel_cmdline + setup_data_offset;
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
+        first_setup_data = cmdline_addr + setup_data_offset;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed && protocol >= 0x209) {
+        setup_data_offset = cmdline_size;
         cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
         kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
-        setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH);
+        setup_data = (void *)kernel_cmdline + setup_data_offset;
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
+        first_setup_data = cmdline_addr + setup_data_offset;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-01-28 11:15               ` Michael S. Tsirkin
@ 2023-01-30  9:31                 ` Daniel P. Berrangé
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2023-01-30  9:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason A. Donenfeld, Eric Biggers, Marcel Apfelbaum,
	Peter Maydell, philmd, pbonzini, Mathias Krause, qemu-devel

On Sat, Jan 28, 2023 at 06:15:03AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 23, 2023 at 06:37:21AM -0600, Jason A. Donenfeld wrote:
> > On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote:
> > > > Hi Michael,
> > > >
> > > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote:
> > > > > > Hi Michael,
> > > > > >
> > > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a
> > > > > > straight-up bug fix for a 7.2 regression that's now affected several
> > > > > > users.
> > > > >
> > > > > OK. In the future pls cc me if you want me to merge a patch. Thanks!
> > > > >
> > > > > > - It has two Tested-by tags on the thread.
> > > > > > - hpa, the maintainer of the kernel side of this, confirmed on one of
> > > > > >   the various tributary threads that this approach is a correct one.
> > > > > > - It doesn't introduce any new functionality.
> > > > > >
> > > > > > For your convenience, you can grab this out of lore here:
> > > > > >
> > > > > >   https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
> > > > > >
> > > > > > Or if you want to yolo it:
> > > > > >
> > > > > >   curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s
> > > > > >
> > > > > > It's now sat silent on the mailing list for a while. So let's please get
> > > > > > this committed and backported so that the bug reports stop coming in.
> > > > > >
> > > >
> > > > This patch still isn't on QEMU's master branch.  What happened to it?
> > > >
> > > > - Eric
> > >
> > > Indeed though I remember picking it up. Tagged again now. Thanks!
> > 
> > Thanks. What branch is this in? I didn't see it on:
> > https://gitlab.com/mstredhat/qemu/-/branches/active
> > https://github.com/mstsirkin/qemu/branches
> 
> I don't use github really. And it was not pushed to gitlab as I was
> figuring out issues with other patches before starting CI as CI minutes
> are limited.

QEMU CI config as of a few months ago, pushing to gitlab will *not* start
CI pipelines. You need to explicitly use opt-in to it when pushing by
using 'git push -o ci.variable=QEMU_CI=1' to create a pipeline with all
jobs manual or QEMU_CI=2 to create a pipeline and immediately run all jobs.
Alternatively use the web UI to start the pipeline, again setting QEMU_CI=1|2
So don't let CI minutes concerns dissuade from pushing to gitlab merely to
publish the code.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
                       ` (3 preceding siblings ...)
  2023-01-23  8:26     ` Philippe Mathieu-Daudé
@ 2023-02-08 17:45     ` Nathan Chancellor
  2023-02-08 17:54       ` Jason A. Donenfeld
  4 siblings, 1 reply; 17+ messages in thread
From: Nathan Chancellor @ 2023-02-08 17:45 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd

Hi Jason,

On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
> 
>           kernel image            setup_data
>    |--------------------------||----------------|
> 0x100000                  0x100000+l1     0x100000+l1+l2
> 
>                                  d e c o m p r e s s e d   k e r n e l
>                      |-------------------------------------------------------------|
>                 0x1000000                                                     0x1000000+l3
> 
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up.  So the decompressed kernel clobbers it.
> 
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
> 
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
> 
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
> 
> Cc: x86@kernel.org
> Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

I apologize if this has already been reported/fixed already (I did a
brief search on lore.kernel.org) or if my terminology is not as precise
as it could be, I am a little out of my element here :)

After this change as commit eac7a7791b ("x86: don't let decompressed
kernel image clobber setup_data") in QEMU master, I am no longer able to
boot a kernel directly through OVMF using '-append' + '-initrd' +
'-kernel'. Instead, systemd-boot tries to start the distribution's
kernel, which fails with:

  Error registering initrd: Already started

This can be reproduced with just a defconfig Linux kernel (I used
6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
boot testing [1], and OVMF. Prior to this change, the kernel would start
up but after, I am dumped into the UEFI shell (as there is no
bootloader).

The QEMU command I used was:

$ qemu-system-x86_64 \
    -kernel arch/x86_64/boot/bzImage \
    -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
    -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
    -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci \
    -cpu host \
    -enable-kvm \
    -smp 8 \
    -display none \
    -initrd .../boot-utils/images/x86_64/rootfs.cpio \
    -m 512m \
    -nodefaults \
    -no-reboot \
    -serial mon:stdio

Not sure if the OVMF version will matter but just in case:

$ pacman -Q edk2-ovmf
edk2-ovmf 202211-3

I did see a few fixes on qemu-devel for this patch such as [2] but none
I found fixed the issue for me.

If there is any additional information I can provide or patches I can
test, please let me know.

[1]: https://github.com/ClangBuiltLinux/boot-utils/blob/fec03ef13519e26ac1f226e404e1620a142d0e40/images/x86_64/rootfs.cpio.zst
[2]: https://lore.kernel.org/20230207224847.94429-1-Jason@zx2c4.com/

Cheers,
Nathan

# bad: [969d09c3a6186c0a4bc8a41db0c1aba1c76081fc] Merge tag 'pull-aspeed-20230207' of https://github.com/legoater/qemu into staging
# good: [b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56] Update VERSION for v7.2.0
git bisect start '969d09c3a6186c0a4bc8a41db0c1aba1c76081fc' 'b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56'
# good: [38cb336fe9d47507da5177c3d349532d0af6885e] hw/arm/gumstix: Use the IEC binary prefix definitions
git bisect good 38cb336fe9d47507da5177c3d349532d0af6885e
# bad: [864a3fa439276148b6d7abcf2d43ee8dbe4c4062] monitor: Rename misc.c to hmp-target.c
git bisect bad 864a3fa439276148b6d7abcf2d43ee8dbe4c4062
# good: [e471a8c9850f1af0c1bc5768ca28285348cdd6c5] target/riscv: Trap on writes to stimecmp from VS when hvictl.VTI=1
git bisect good e471a8c9850f1af0c1bc5768ca28285348cdd6c5
# bad: [c1a9ac9bdeea213162a76f7b9e2f876c89a50a94] tests: acpi: cleanup use_uefi argument usage
git bisect bad c1a9ac9bdeea213162a76f7b9e2f876c89a50a94
# good: [dc575b5e0300a7a375b4e4501a17ada21e9a6d10] hw/i2c/bitbang_i2c: Change state calling bitbang_i2c_set_state() helper
git bisect good dc575b5e0300a7a375b4e4501a17ada21e9a6d10
# good: [3b07a936d3bfe97b07ddffcfbb532985a88033dd] target/arm: Look up ARMCPRegInfo at runtime
git bisect good 3b07a936d3bfe97b07ddffcfbb532985a88033dd
# good: [d395b18dce82855d03d934e30a515caf5a10a885] hw/acpi/acpi_dev_interface: Remove unused parameter from AcpiDeviceIfClass::madt_cpu
git bisect good d395b18dce82855d03d934e30a515caf5a10a885
# bad: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data
git bisect bad eac7a7791bb6d719233deed750034042318ffd56
# good: [8a8c9c3a747f77e664fa2288735b45a9d750be75] hw/pci-host: Use register definitions from PCI standard
git bisect good 8a8c9c3a747f77e664fa2288735b45a9d750be75
# good: [8a7c606016d283a1716290c657f6f45bc7c4d817] intel-iommu: Document iova_tree
git bisect good 8a7c606016d283a1716290c657f6f45bc7c4d817
# first bad commit: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-02-08 17:45     ` Nathan Chancellor
@ 2023-02-08 17:54       ` Jason A. Donenfeld
  2023-02-08 18:09         ` Jason A. Donenfeld
  2023-02-08 18:10         ` Michael S. Tsirkin
  0 siblings, 2 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2023-02-08 17:54 UTC (permalink / raw)
  To: Nathan Chancellor, Michael S. Tsirkin
  Cc: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd

Hi Nathan (and MST),

On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> Hi Jason,
>
> On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > The setup_data links are appended to the compressed kernel image. Since
> > the kernel image is typically loaded at 0x100000, setup_data lives at
> > `0x100000 + compressed_size`, which does not get relocated during the
> > kernel's boot process.
> >
> > The kernel typically decompresses the image starting at address
> > 0x1000000 (note: there's one more zero there than the compressed image
> > above). This usually is fine for most kernels.
> >
> > However, if the compressed image is actually quite large, then
> > setup_data will live at a `0x100000 + compressed_size` that extends into
> > the decompressed zone at 0x1000000. In other words, if compressed_size
> > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > clobber setup_data, resulting in crashes.
> >
> > Visually, what happens now is that QEMU appends setup_data to the kernel
> > image:
> >
> >           kernel image            setup_data
> >    |--------------------------||----------------|
> > 0x100000                  0x100000+l1     0x100000+l1+l2
> >
> > The problem is that this decompresses to 0x1000000 (one more zero). So
> > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> >
> >           kernel image            setup_data
> >    |--------------------------||----------------|
> > 0x100000                  0x100000+l1     0x100000+l1+l2
> >
> >                                  d e c o m p r e s s e d   k e r n e l
> >                      |-------------------------------------------------------------|
> >                 0x1000000                                                     0x1000000+l3
> >
> > The decompressed kernel seemingly overwriting the compressed kernel
> > image isn't a problem, because that gets relocated to a higher address
> > early on in the boot process, at the end of startup_64. setup_data,
> > however, stays in the same place, since those links are self referential
> > and nothing fixes them up.  So the decompressed kernel clobbers it.
> >
> > Fix this by appending setup_data to the cmdline blob rather than the
> > kernel image blob, which remains at a lower address that won't get
> > clobbered.
> >
> > This could have been done by overwriting the initrd blob instead, but
> > that poses big difficulties, such as no longer being able to use memory
> > mapped files for initrd, hurting performance, and, more importantly, the
> > initrd address calculation is hard coded in qboot, and it always grows
> > down rather than up, which means lots of brittle semantics would have to
> > be changed around, incurring more complexity. In contrast, using cmdline
> > is simple and doesn't interfere with anything.
> >
> > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > after the fact. So this hack is updated to account for this appending,
> > by reserving some bytes.
> >
> > Cc: x86@kernel.org
> > Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Eric Biggers <ebiggers@kernel.org>
> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>
> I apologize if this has already been reported/fixed already (I did a
> brief search on lore.kernel.org) or if my terminology is not as precise
> as it could be, I am a little out of my element here :)
>
> After this change as commit eac7a7791b ("x86: don't let decompressed
> kernel image clobber setup_data") in QEMU master, I am no longer able to
> boot a kernel directly through OVMF using '-append' + '-initrd' +
> '-kernel'. Instead, systemd-boot tries to start the distribution's
> kernel, which fails with:
>
>   Error registering initrd: Already started
>
> This can be reproduced with just a defconfig Linux kernel (I used
> 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> boot testing [1], and OVMF. Prior to this change, the kernel would start
> up but after, I am dumped into the UEFI shell (as there is no
> bootloader).
>
> The QEMU command I used was:
>
> $ qemu-system-x86_64 \
>     -kernel arch/x86_64/boot/bzImage \
>     -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
>     -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
>     -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \

Oh no... Without jumping into it, at first glance, I have absolutely
no idea. I suppose I could start debugging it and probably come up
with a solution, but...

@mst - I'm beginning to think that this whole setup_data route is
cursed. This is accumulating hacks within hacks within hacks. What
would you think if I just send a patch *removing* all use of
setup_data (the rng seed and the dtb thing), and then we can gradually
add that back with an actual overarching design. For example, it'd
probably make sense to have a separate fwcfg file for setup_data
rather than trying to mangle and existing one, etc. This way, we
unbreak the tree, and let the new approach be reviewed more
reasonably.

Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-02-08 17:54       ` Jason A. Donenfeld
@ 2023-02-08 18:09         ` Jason A. Donenfeld
  2023-02-08 18:10         ` Michael S. Tsirkin
  1 sibling, 0 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2023-02-08 18:09 UTC (permalink / raw)
  To: Nathan Chancellor, Michael S. Tsirkin
  Cc: pbonzini, ebiggers, x86, linux-kernel, qemu-devel, ardb, kraxel,
	hpa, bp, philmd

On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Nathan (and MST),
>
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > >                                  d e c o m p r e s s e d   k e r n e l
> > >                      |-------------------------------------------------------------|
> > >                 0x1000000                                                     0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up.  So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: x86@kernel.org
> > > Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > Cc: Borislav Petkov <bp@alien8.de>
> > > Cc: Eric Biggers <ebiggers@kernel.org>
> > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> >
> > I apologize if this has already been reported/fixed already (I did a
> > brief search on lore.kernel.org) or if my terminology is not as precise
> > as it could be, I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> >   Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> >     -kernel arch/x86_64/boot/bzImage \
> >     -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> >     -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> >     -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
>
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
>
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design. For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle and existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.

Sent as https://lore.kernel.org/qemu-devel/20230208180835.234638-1-Jason@zx2c4.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data
  2023-02-08 17:54       ` Jason A. Donenfeld
  2023-02-08 18:09         ` Jason A. Donenfeld
@ 2023-02-08 18:10         ` Michael S. Tsirkin
  1 sibling, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2023-02-08 18:10 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Nathan Chancellor, pbonzini, ebiggers, x86, linux-kernel,
	qemu-devel, ardb, kraxel, hpa, bp, philmd

On Wed, Feb 08, 2023 at 02:54:41PM -0300, Jason A. Donenfeld wrote:
> Hi Nathan (and MST),
> 
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > >                                  d e c o m p r e s s e d   k e r n e l
> > >                      |-------------------------------------------------------------|
> > >                 0x1000000                                                     0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up.  So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: x86@kernel.org
> > > Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > Cc: Borislav Petkov <bp@alien8.de>
> > > Cc: Eric Biggers <ebiggers@kernel.org>
> > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> >
> > I apologize if this has already been reported/fixed already (I did a
> > brief search on lore.kernel.org) or if my terminology is not as precise
> > as it could be, I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> >   Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> >     -kernel arch/x86_64/boot/bzImage \
> >     -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> >     -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> >     -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
> 
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
> 
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design. For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle and existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.
> 
> Jason

Yea basically this is close to what I proposed. I can't off-hand figure
out whether just dropping all of this is fine or we need some compat
hacks to enable this in some way for 7.2, and it's pretty late here.
Suggestions welcome.

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-02-08 18:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-30 18:38 [PATCH qemu v2] x86: don't let decompressed kernel image clobber setup_data Jason A. Donenfeld
2022-12-30 21:59 ` Jason A. Donenfeld
2022-12-30 22:07   ` [PATCH qemu v3] " Jason A. Donenfeld
2023-01-05  5:16     ` Eric Biggers
2023-01-10 12:10     ` Mathias Krause
2023-01-10 15:34     ` Jason A. Donenfeld
2023-01-10 17:50       ` Michael S. Tsirkin
2023-01-23  4:21         ` Eric Biggers
2023-01-23 12:12           ` Michael S. Tsirkin
2023-01-23 12:37             ` Jason A. Donenfeld
2023-01-28 11:15               ` Michael S. Tsirkin
2023-01-30  9:31                 ` Daniel P. Berrangé
2023-01-23  8:26     ` Philippe Mathieu-Daudé
2023-02-08 17:45     ` Nathan Chancellor
2023-02-08 17:54       ` Jason A. Donenfeld
2023-02-08 18:09         ` Jason A. Donenfeld
2023-02-08 18:10         ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.