* [PATCH RFC 0/5] s390x: initial support for virtio-mem
@ 2020-07-08 18:51 David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 1/5] s390x: move setting of maximum ram size to machine init David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

This wires up the initial, basic version of virtio-mem for s390x. General
information about virtio-mem can be found at [1] and in QEMU commit [2].
Patch #5 contains a short example for s390x.

virtio-mem for x86-64 Linux is part of v5.8-rc1. A branch with a s390x
prototype can be found at:
    git@github.com:davidhildenbrand/linux.git virtio-mem-s390x

Note that the kernel should either be compiled with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or "memhp_default_state=online"
should be passed on the kernel cmdline.
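For example (an illustrative invocation only; the kernel image path and disk
setup are placeholders), the parameter can be appended to the guest kernel
command line when booting via -kernel:

```shell
# Hypothetical example: boot a guest kernel so that hotplugged memory is
# onlined automatically (same effect as CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE).
qemu-system-s390x \
    --enable-kvm \
    -m 2G,maxmem=4G \
    -kernel /path/to/guest/kernel/image \
    -append "root=/dev/vda memhp_default_state=online"
```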

This series can be found at:
    git@github.com:davidhildenbrand/qemu.git virtio-mem-s390x

Related to s390x, we'll have to tackle migration of storage keys and
storage attributes (especially, skipping unplugged parts). I am not sure
if I am missing something else (any ideas?). For virtio-mem in general,
there are a couple of TODOs, e.g., documented in [1] and [2], both in QEMU
and Linux. However, the basics are in place.

I have only tested this with a fairly small amount of RAM in a z/VM
environment so far.

[1] https://virtio-mem.gitlab.io/
[2] 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")

David Hildenbrand (5):
  s390x: move setting of maximum ram size to machine init
  s390x: implement diag260
  s390x: prepare device memory address space
  s390x: implement virtio-mem-ccw
  s390x: initial support for virtio-mem

 hw/s390x/Kconfig                   |   1 +
 hw/s390x/Makefile.objs             |   1 +
 hw/s390x/s390-virtio-ccw.c         | 178 ++++++++++++++++++++++++++++-
 hw/s390x/sclp.c                    |  32 ++----
 hw/s390x/virtio-ccw-mem.c          | 165 ++++++++++++++++++++++++++
 hw/s390x/virtio-ccw.h              |  13 +++
 hw/virtio/virtio-mem.c             |   2 +
 include/hw/s390x/s390-virtio-ccw.h |   3 +
 target/s390x/diag.c                |  57 +++++++++
 target/s390x/internal.h            |   2 +
 target/s390x/kvm.c                 |  11 ++
 target/s390x/misc_helper.c         |   6 +
 target/s390x/translate.c           |   4 +
 13 files changed, 449 insertions(+), 26 deletions(-)
 create mode 100644 hw/s390x/virtio-ccw-mem.c

-- 
2.26.2



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH RFC 1/5] s390x: move setting of maximum ram size to machine init
  2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
@ 2020-07-08 18:51 ` David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 2/5] s390x: implement diag260 David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

As we no longer fixup the maximum ram size in sclp code, let's move
setting the maximum ram size to ccw_init()->s390_memory_init(), which
now looks like a better fit.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/s390x/s390-virtio-ccw.c | 19 ++++++++++++++++---
 hw/s390x/sclp.c            | 20 +-------------------
 2 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 023fd25f2b..2e6d292c23 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -160,13 +160,26 @@ static void virtio_ccw_register_hcalls(void)
                                    virtio_ccw_hcall_early_printk);
 }
 
-static void s390_memory_init(MemoryRegion *ram)
+static void s390_memory_init(MachineState *machine)
 {
     MemoryRegion *sysmem = get_system_memory();
     Error *local_err = NULL;
+    uint64_t hw_limit;
+    int ret;
+
+    /* We have to set the memory limit before adding any regions to sysmem. */
+    ret = s390_set_memory_limit(machine->maxram_size, &hw_limit);
+    if (ret == -E2BIG) {
+        error_report("host supports a maximum of %" PRIu64 " GB",
+                     hw_limit / GiB);
+        exit(EXIT_FAILURE);
+    } else if (ret) {
+        error_report("setting the guest size failed");
+        exit(EXIT_FAILURE);
+    }
 
     /* allocate RAM for core */
-    memory_region_add_subregion(sysmem, 0, ram);
+    memory_region_add_subregion(sysmem, 0, machine->ram);
 
     /*
      * Configure the maximum page size. As no memory devices were created
@@ -249,7 +262,7 @@ static void ccw_init(MachineState *machine)
 
     s390_sclp_init();
     /* init memory + setup max page size. Required for the CPU model */
-    s390_memory_init(machine->ram);
+    s390_memory_init(machine);
 
     /* init CPUs (incl. CPU model) early so s390_has_feature() works */
     s390_init_cpus(machine);
diff --git a/hw/s390x/sclp.c b/hw/s390x/sclp.c
index d39f6d7785..f59195e15a 100644
--- a/hw/s390x/sclp.c
+++ b/hw/s390x/sclp.c
@@ -327,32 +327,14 @@ void s390_sclp_init(void)
 
 static void sclp_realize(DeviceState *dev, Error **errp)
 {
-    MachineState *machine = MACHINE(qdev_get_machine());
     SCLPDevice *sclp = SCLP(dev);
-    Error *err = NULL;
-    uint64_t hw_limit;
-    int ret;
 
     /*
      * qdev_device_add searches the sysbus for TYPE_SCLP_EVENTS_BUS. As long
      * as we can't find a fitting bus via the qom tree, we have to add the
      * event facility to the sysbus, so e.g. a sclp console can be created.
      */
-    sysbus_realize(SYS_BUS_DEVICE(sclp->event_facility), &err);
-    if (err) {
-        goto out;
-    }
-
-    ret = s390_set_memory_limit(machine->maxram_size, &hw_limit);
-    if (ret == -E2BIG) {
-        error_setg(&err, "host supports a maximum of %" PRIu64 " GB",
-                   hw_limit / GiB);
-    } else if (ret) {
-        error_setg(&err, "setting the guest size failed");
-    }
-
-out:
-    error_propagate(errp, err);
+    sysbus_realize(SYS_BUS_DEVICE(sclp->event_facility), errp);
 }
 
 static void sclp_memory_init(SCLPDevice *sclp)
-- 
2.26.2




* [PATCH RFC 2/5] s390x: implement diag260
  2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 1/5] s390x: move setting of maximum ram size to machine init David Hildenbrand
@ 2020-07-08 18:51 ` David Hildenbrand
  2020-07-09 10:37   ` Cornelia Huck
  2020-07-09 10:52   ` Christian Borntraeger
  2020-07-08 18:51 ` [PATCH RFC 3/5] s390x: prepare device memory address space David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

Let's implement the "storage configuration" part of diag260. This diag
is found under z/VM and indicates usable chunks of memory to the guest OS.
As I don't have access to the documentation, I have no clue what the actual
error cases are, or which other information could eventually be queried via
this interface. Somebody with access to the documentation should fix this.
This implementation seems to work with Linux guests just fine.

The Linux kernel has supported diag260 for querying the available memory
since v4.20. Older kernels / kvm-unit-tests will later fail to run in such
a VM (with maxmem defined and bigger than the memory size, e.g., "-m
 2G,maxmem=4G"), just as if support for SCLP storage information were not
implemented. They will fail to detect the actual initial memory size.

This interface allows us to expose the maximum ram size via SCLP and the
initial ram size via diag260, without having to mess with the memory
increment size and without having to align the initial memory size to it.

This is a preparation for memory device support. We'll unlock the
implementation with a new QEMU machine that supports memory devices.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
 target/s390x/internal.h    |  2 ++
 target/s390x/kvm.c         | 11 ++++++++
 target/s390x/misc_helper.c |  6 ++++
 target/s390x/translate.c   |  4 +++
 5 files changed, 80 insertions(+)

diff --git a/target/s390x/diag.c b/target/s390x/diag.c
index 1a48429564..c3b1e24b2c 100644
--- a/target/s390x/diag.c
+++ b/target/s390x/diag.c
@@ -23,6 +23,63 @@
 #include "hw/s390x/pv.h"
 #include "kvm_s390x.h"
 
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    const ram_addr_t initial_ram_size = ms->ram_size;
+    const uint64_t subcode = env->regs[r3];
+    S390CPU *cpu = env_archcpu(env);
+    ram_addr_t addr, length;
+    uint64_t tmp;
+
+    /* TODO: Unlock with new QEMU machine. */
+    if (false) {
+        s390_program_interrupt(env, PGM_OPERATION, ra);
+        return;
+    }
+
+    /*
+     * There also seems to be subcode "0xc", which stores the size of the
+     * first chunk and the total size to r1/r2. It's only used by very old
+     * Linux, so don't implement it.
+     */
+    if ((r1 & 1) || subcode != 0x10) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+    addr = env->regs[r1];
+    length = env->regs[r1 + 1];
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
+        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!length) {
+        setcc(cpu, 3);
+        return;
+    }
+
+    /* FIXME: Somebody with documentation should fix this. */
+    if (!address_space_access_valid(&address_space_memory, addr, length, true,
+                                    MEMTXATTRS_UNSPECIFIED)) {
+        s390_program_interrupt(env, PGM_ADDRESSING, ra);
+        return;
+    }
+
+    /* Indicate our initial memory ([0 .. ram_size - 1]) */
+    tmp = cpu_to_be64(0);
+    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
+    tmp = cpu_to_be64(initial_ram_size - 1);
+    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
+
+    /* Exactly one entry was stored. */
+    env->regs[r3] = 1;
+    setcc(cpu, 0);
+}
+
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
 {
     uint64_t func = env->regs[r1];
diff --git a/target/s390x/internal.h b/target/s390x/internal.h
index b1e0ebf67f..a7a3df9a3b 100644
--- a/target/s390x/internal.h
+++ b/target/s390x/internal.h
@@ -372,6 +372,8 @@ int mmu_translate_real(CPUS390XState *env, target_ulong raddr, int rw,
 
 
 /* misc_helper.c */
+void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3,
+                     uintptr_t ra);
 int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3);
 void handle_diag_308(CPUS390XState *env, uint64_t r1, uint64_t r3,
                      uintptr_t ra);
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index f2f75d2a57..d6de3ad86c 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -1565,6 +1565,14 @@ static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
     return ret;
 }
 
+static void kvm_handle_diag_260(S390CPU *cpu, struct kvm_run *run)
+{
+    const uint64_t r1 = (run->s390_sieic.ipa & 0x00f0) >> 4;
+    const uint64_t r3 = run->s390_sieic.ipa & 0x000f;
+
+    handle_diag_260(&cpu->env, r1, r3, 0);
+}
+
 static void kvm_handle_diag_288(S390CPU *cpu, struct kvm_run *run)
 {
     uint64_t r1, r3;
@@ -1614,6 +1622,9 @@ static int handle_diag(S390CPU *cpu, struct kvm_run *run, uint32_t ipb)
      */
     func_code = decode_basedisp_rs(&cpu->env, ipb, NULL) & DIAG_KVM_CODE_MASK;
     switch (func_code) {
+    case 0x260:
+        kvm_handle_diag_260(cpu, run);
+        break;
     case DIAG_TIMEREVENT:
         kvm_handle_diag_288(cpu, run);
         break;
diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
index 58dbc023eb..d7274eb320 100644
--- a/target/s390x/misc_helper.c
+++ b/target/s390x/misc_helper.c
@@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
     uint64_t r;
 
     switch (num) {
+    case 0x260:
+        qemu_mutex_lock_iothread();
+        handle_diag_260(env, r1, r3, GETPC());
+        qemu_mutex_unlock_iothread();
+        r = 0;
+        break;
     case 0x500:
         /* KVM hypercall */
         qemu_mutex_lock_iothread();
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index 4f6f1e31cd..6bb8b6e513 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -2398,6 +2398,10 @@ static DisasJumpType op_diag(DisasContext *s, DisasOps *o)
     TCGv_i32 func_code = tcg_const_i32(get_field(s, i2));
 
     gen_helper_diag(cpu_env, r1, r3, func_code);
+    /* Only some diags modify the CC. */
+    if (get_field(s, i2) == 0x260) {
+        set_cc_static(s);
+    }
 
     tcg_temp_free_i32(func_code);
     tcg_temp_free_i32(r3);
-- 
2.26.2




* [PATCH RFC 3/5] s390x: prepare device memory address space
  2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 1/5] s390x: move setting of maximum ram size to machine init David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 2/5] s390x: implement diag260 David Hildenbrand
@ 2020-07-08 18:51 ` David Hildenbrand
  2020-07-09 10:59   ` Cornelia Huck
  2020-07-08 18:51 ` [PATCH RFC 4/5] s390x: implement virtio-mem-ccw David Hildenbrand
  2020-07-08 18:51 ` [PATCH RFC 5/5] s390x: initial support for virtio-mem David Hildenbrand
  4 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

Let's allocate the device memory information and set up the device
memory address space. Expose the maximum ram size via SCLP and the actual
initial ram size via diag260.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/s390x/s390-virtio-ccw.c         | 43 ++++++++++++++++++++++++++++++
 hw/s390x/sclp.c                    | 12 +++++++--
 include/hw/s390x/s390-virtio-ccw.h |  3 +++
 target/s390x/diag.c                |  4 +--
 4 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 2e6d292c23..577590e623 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -160,6 +160,35 @@ static void virtio_ccw_register_hcalls(void)
                                    virtio_ccw_hcall_early_printk);
 }
 
+static void s390_device_memory_init(MachineState *machine)
+{
+    MemoryRegion *sysmem = get_system_memory();
+
+    machine->device_memory = g_malloc0(sizeof(*machine->device_memory));
+
+    /* initialize device memory address space */
+    if (machine->ram_size < machine->maxram_size) {
+        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+
+        if (QEMU_ALIGN_UP(machine->maxram_size, MiB) != machine->maxram_size) {
+            error_report("maximum memory size must by aligned to 1 MB");
+            exit(EXIT_FAILURE);
+        }
+
+        machine->device_memory->base = machine->ram_size;
+        if (machine->device_memory->base + device_mem_size < device_mem_size) {
+            error_report("unsupported amount of maximum memory: " RAM_ADDR_FMT,
+                         machine->maxram_size);
+            exit(EXIT_FAILURE);
+        }
+
+        memory_region_init(&machine->device_memory->mr, OBJECT(machine),
+                           "device-memory", device_mem_size);
+        memory_region_add_subregion(sysmem, machine->device_memory->base,
+                                    &machine->device_memory->mr);
+    }
+}
+
 static void s390_memory_init(MachineState *machine)
 {
     MemoryRegion *sysmem = get_system_memory();
@@ -194,6 +223,11 @@ static void s390_memory_init(MachineState *machine)
     s390_skeys_init();
     /* Initialize storage attributes device */
     s390_stattrib_init();
+
+    /* Support for memory devices is glued to compat machines. */
+    if (memory_devices_allowed()) {
+        s390_device_memory_init(machine);
+    }
 }
 
 static void s390_init_ipl_dev(const char *kernel_filename,
@@ -617,6 +651,7 @@ static void ccw_machine_class_init(ObjectClass *oc, void *data)
     s390mc->cpu_model_allowed = true;
     s390mc->css_migration_enabled = true;
     s390mc->hpage_1m_allowed = true;
+    s390mc->memory_devices_allowed = true;
     mc->init = ccw_init;
     mc->reset = s390_machine_reset;
     mc->hot_add_cpu = s390_hot_add_cpu;
@@ -713,6 +748,11 @@ bool hpage_1m_allowed(void)
     return get_machine_class()->hpage_1m_allowed;
 }
 
+bool memory_devices_allowed(void)
+{
+    return get_machine_class()->memory_devices_allowed;
+}
+
 static char *machine_get_loadparm(Object *obj, Error **errp)
 {
     S390CcwMachineState *ms = S390_CCW_MACHINE(obj);
@@ -831,8 +871,11 @@ static void ccw_machine_5_0_instance_options(MachineState *machine)
 
 static void ccw_machine_5_0_class_options(MachineClass *mc)
 {
+    S390CcwMachineClass *s390mc = S390_MACHINE_CLASS(mc);
+
     ccw_machine_5_1_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
+    s390mc->memory_devices_allowed = false;
 }
 DEFINE_CCW_MACHINE(5_0, "5.0", false);
 
diff --git a/hw/s390x/sclp.c b/hw/s390x/sclp.c
index f59195e15a..85d3505597 100644
--- a/hw/s390x/sclp.c
+++ b/hw/s390x/sclp.c
@@ -22,6 +22,7 @@
 #include "hw/s390x/event-facility.h"
 #include "hw/s390x/s390-pci-bus.h"
 #include "hw/s390x/ipl.h"
+#include "hw/s390x/s390-virtio-ccw.h"
 
 static inline SCLPDevice *get_sclp_device(void)
 {
@@ -110,8 +111,15 @@ static void read_SCP_info(SCLPDevice *sclp, SCCB *sccb)
         read_info->rnsize2 = cpu_to_be32(rnsize);
     }
 
-    /* we don't support standby memory, maxram_size is never exposed */
-    rnmax = machine->ram_size >> sclp->increment_size;
+    /*
+     * Support for maxram was added with support for memory devices. The
+     * size of the initial memory is exposed via diag260.
+     */
+    if (memory_devices_allowed()) {
+        rnmax = machine->maxram_size >> sclp->increment_size;
+    } else {
+        rnmax = machine->ram_size >> sclp->increment_size;
+    }
     if (rnmax < 0x10000) {
         read_info->rnmax = cpu_to_be16(rnmax);
     } else {
diff --git a/include/hw/s390x/s390-virtio-ccw.h b/include/hw/s390x/s390-virtio-ccw.h
index cd1dccc6e3..3a1e7e2a6d 100644
--- a/include/hw/s390x/s390-virtio-ccw.h
+++ b/include/hw/s390x/s390-virtio-ccw.h
@@ -41,6 +41,7 @@ typedef struct S390CcwMachineClass {
     bool cpu_model_allowed;
     bool css_migration_enabled;
     bool hpage_1m_allowed;
+    bool memory_devices_allowed;
 } S390CcwMachineClass;
 
 /* runtime-instrumentation allowed by the machine */
@@ -49,6 +50,8 @@ bool ri_allowed(void);
 bool cpu_model_allowed(void);
 /* 1M huge page mappings allowed by the machine */
 bool hpage_1m_allowed(void);
+/* Allow memory devices and diag260. */
+bool memory_devices_allowed(void);
 
 /**
  * Returns true if (vmstate based) migration of the channel subsystem
diff --git a/target/s390x/diag.c b/target/s390x/diag.c
index c3b1e24b2c..6b33eb0efc 100644
--- a/target/s390x/diag.c
+++ b/target/s390x/diag.c
@@ -32,8 +32,8 @@ void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
     ram_addr_t addr, length;
     uint64_t tmp;
 
-    /* TODO: Unlock with new QEMU machine. */
-    if (false) {
+    /* Support for diag260 is glued to support for memory devices. */
+    if (!memory_devices_allowed()) {
         s390_program_interrupt(env, PGM_OPERATION, ra);
         return;
     }
-- 
2.26.2




* [PATCH RFC 4/5] s390x: implement virtio-mem-ccw
  2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
                   ` (2 preceding siblings ...)
  2020-07-08 18:51 ` [PATCH RFC 3/5] s390x: prepare device memory address space David Hildenbrand
@ 2020-07-08 18:51 ` David Hildenbrand
  2020-07-09  9:24   ` Cornelia Huck
  2020-07-08 18:51 ` [PATCH RFC 5/5] s390x: initial support for virtio-mem David Hildenbrand
  4 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

Add a proper CCW proxy device, similar to the PCI variant.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/s390x/virtio-ccw-mem.c | 165 ++++++++++++++++++++++++++++++++++++++
 hw/s390x/virtio-ccw.h     |  13 +++
 2 files changed, 178 insertions(+)
 create mode 100644 hw/s390x/virtio-ccw-mem.c

diff --git a/hw/s390x/virtio-ccw-mem.c b/hw/s390x/virtio-ccw-mem.c
new file mode 100644
index 0000000000..ae856d7ad4
--- /dev/null
+++ b/hw/s390x/virtio-ccw-mem.c
@@ -0,0 +1,165 @@
+/*
+ * Virtio MEM CCW device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/qdev-properties.h"
+#include "hw/virtio/virtio.h"
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "virtio-ccw.h"
+#include "hw/mem/memory-device.h"
+#include "qapi/qapi-events-misc.h"
+
+static void virtio_ccw_mem_realize(VirtioCcwDevice *ccw_dev, Error **errp)
+{
+    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(ccw_dev);
+    DeviceState *vdev = DEVICE(&ccw_mem->vdev);
+
+    qdev_realize(vdev, BUS(&ccw_dev->bus), errp);
+}
+
+static void virtio_ccw_mem_set_addr(MemoryDeviceState *md, uint64_t addr,
+                                    Error **errp)
+{
+    object_property_set_uint(OBJECT(md), addr, VIRTIO_MEM_ADDR_PROP, errp);
+}
+
+static uint64_t virtio_ccw_mem_get_addr(const MemoryDeviceState *md)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_ADDR_PROP,
+                                    &error_abort);
+}
+
+static MemoryRegion *virtio_ccw_mem_get_memory_region(MemoryDeviceState *md,
+                                                      Error **errp)
+{
+    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(md);
+    VirtIOMEM *vmem = VIRTIO_MEM(&ccw_mem->vdev);
+    VirtIOMEMClass *vmc = VIRTIO_MEM_GET_CLASS(vmem);
+
+    return vmc->get_memory_region(vmem, errp);
+}
+
+static uint64_t virtio_ccw_mem_get_plugged_size(const MemoryDeviceState *md,
+                                                Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_SIZE_PROP,
+                                    errp);
+}
+
+static void virtio_ccw_mem_fill_device_info(const MemoryDeviceState *md,
+                                            MemoryDeviceInfo *info)
+{
+    VirtioMEMDeviceInfo *vi = g_new0(VirtioMEMDeviceInfo, 1);
+    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(md);
+    VirtIOMEM *vmem = VIRTIO_MEM(&ccw_mem->vdev);
+    VirtIOMEMClass *vpc = VIRTIO_MEM_GET_CLASS(vmem);
+    DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        vi->has_id = true;
+        vi->id = g_strdup(dev->id);
+    }
+
+    /* let the real device handle everything else */
+    vpc->fill_device_info(vmem, vi);
+
+    info->u.virtio_mem.data = vi;
+    info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
+}
+
+static void virtio_ccw_mem_size_change_notify(Notifier *notifier, void *data)
+{
+    VirtIOMEMCcw *ccw_mem = container_of(notifier, VirtIOMEMCcw,
+                                         size_change_notifier);
+    DeviceState *dev = DEVICE(ccw_mem);
+    const uint64_t * const size_p = data;
+    const char *id = NULL;
+
+    if (dev->id) {
+        id = g_strdup(dev->id);
+    }
+
+    qapi_event_send_memory_device_size_change(!!id, id, *size_p);
+}
+
+static void virtio_ccw_mem_instance_init(Object *obj)
+{
+    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(obj);
+    VirtIOMEMClass *vmc;
+    VirtIOMEM *vmem;
+
+    virtio_instance_init_common(obj, &ccw_mem->vdev, sizeof(ccw_mem->vdev),
+                                TYPE_VIRTIO_MEM);
+
+    ccw_mem->size_change_notifier.notify = virtio_ccw_mem_size_change_notify;
+    vmem = VIRTIO_MEM(&ccw_mem->vdev);
+    vmc = VIRTIO_MEM_GET_CLASS(vmem);
+    /*
+     * We never remove the notifier again, as we expect both devices to
+     * disappear at the same time.
+     */
+    vmc->add_size_change_notifier(vmem, &ccw_mem->size_change_notifier);
+
+    object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
+                              OBJECT(&ccw_mem->vdev),
+                              VIRTIO_MEM_BLOCK_SIZE_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_SIZE_PROP, OBJECT(&ccw_mem->vdev),
+                              VIRTIO_MEM_SIZE_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
+                              OBJECT(&ccw_mem->vdev),
+                              VIRTIO_MEM_REQUESTED_SIZE_PROP);
+}
+
+static Property virtio_ccw_mem_properties[] = {
+    DEFINE_PROP_BIT("ioeventfd", VirtioCcwDevice, flags,
+                    VIRTIO_CCW_FLAG_USE_IOEVENTFD_BIT, true),
+    DEFINE_PROP_UINT32("max_revision", VirtioCcwDevice, max_rev,
+                       VIRTIO_CCW_MAX_REV),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void virtio_ccw_mem_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtIOCCWDeviceClass *k = VIRTIO_CCW_DEVICE_CLASS(klass);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
+
+    k->realize = virtio_ccw_mem_realize;
+    device_class_set_props(dc, virtio_ccw_mem_properties);
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+
+    mdc->get_addr = virtio_ccw_mem_get_addr;
+    mdc->set_addr = virtio_ccw_mem_set_addr;
+    mdc->get_plugged_size = virtio_ccw_mem_get_plugged_size;
+    mdc->get_memory_region = virtio_ccw_mem_get_memory_region;
+    mdc->fill_device_info = virtio_ccw_mem_fill_device_info;
+}
+
+static const TypeInfo virtio_ccw_mem = {
+    .name = TYPE_VIRTIO_MEM_CCW,
+    .parent = TYPE_VIRTIO_CCW_DEVICE,
+    .instance_size = sizeof(VirtIOMEMCcw),
+    .instance_init = virtio_ccw_mem_instance_init,
+    .class_init = virtio_ccw_mem_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
+};
+
+static void virtio_ccw_mem_register(void)
+{
+    type_register_static(&virtio_ccw_mem);
+}
+
+type_init(virtio_ccw_mem_register)
diff --git a/hw/s390x/virtio-ccw.h b/hw/s390x/virtio-ccw.h
index c0e3355248..77aa87c41f 100644
--- a/hw/s390x/virtio-ccw.h
+++ b/hw/s390x/virtio-ccw.h
@@ -29,6 +29,7 @@
 #endif /* CONFIG_VHOST_VSOCK */
 #include "hw/virtio/virtio-gpu.h"
 #include "hw/virtio/virtio-input.h"
+#include "hw/virtio/virtio-mem.h"
 
 #include "hw/s390x/s390_flic.h"
 #include "hw/s390x/css.h"
@@ -256,4 +257,16 @@ typedef struct VirtIOInputHIDCcw {
     VirtIOInputHID vdev;
 } VirtIOInputHIDCcw;
 
+/* virtio-mem-ccw */
+
+#define TYPE_VIRTIO_MEM_CCW "virtio-mem-ccw"
+#define VIRTIO_MEM_CCW(obj) \
+        OBJECT_CHECK(VirtIOMEMCcw, (obj), TYPE_VIRTIO_MEM_CCW)
+
+typedef struct VirtIOMEMCcw {
+    VirtioCcwDevice parent_obj;
+    VirtIOMEM vdev;
+    Notifier size_change_notifier;
+} VirtIOMEMCcw;
+
 #endif
-- 
2.26.2




* [PATCH RFC 5/5] s390x: initial support for virtio-mem
  2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
                   ` (3 preceding siblings ...)
  2020-07-08 18:51 ` [PATCH RFC 4/5] s390x: implement virtio-mem-ccw David Hildenbrand
@ 2020-07-08 18:51 ` David Hildenbrand
  4 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-08 18:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	Heiko Carstens, Halil Pasic, Christian Borntraeger, qemu-s390x,
	David Hildenbrand, Claudio Imbrenda, Richard Henderson

Let's wire up the initial, basic virtio-mem implementation in QEMU. It will
have to see some important extensions (esp., resizeable allocations)
before it can be considered production ready. Also, the focus on the Linux
driver side is currently on memory hotplug; there is a lot to optimize in
the future to improve memory unplug capabilities. However, the basics
are in place.

Block migration for now, as we'll have to take proper care of storage
keys and storage attributes. Also, make sure not to hotplug huge pages
into a setup without huge pages.

With a Linux guest that supports virtio-mem (and has
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE set for now), here is a basic example:

1. Start a VM with 2G initial memory and a virtio-mem device with a maximum
   capacity of 18GB (and an initial size of 300M):
    sudo qemu-system-s390x \
        --enable-kvm \
        -m 2G,maxmem=20G \
        -smp 4 \
        -nographic \
        -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
        -mon chardev=monitor,mode=readline \
        -net nic -net user \
        -hda s390x.cow2 \
        -object memory-backend-ram,id=mem0,size=18G \
        -device virtio-mem-ccw,id=vm0,memdev=mem0,requested-size=300M

2. Query the current size of virtio-mem device:
    (qemu) info memory-devices
    Memory device [virtio-mem]: "vm0"
      memaddr: 0x80000000
      node: 0
      requested-size: 314572800
      size: 314572800
      max-size: 19327352832
      block-size: 1048576
      memdev: /objects/mem0

3. Request to grow it to 8GB:
    (qemu) qom-set vm0 requested-size 8G
    (qemu) info memory-devices
    Memory device [virtio-mem]: "vm0"
      memaddr: 0x80000000
      node: 0
      requested-size: 8589934592
      size: 8589934592
      max-size: 19327352832
      block-size: 1048576
      memdev: /objects/mem0

4. Request to shrink it to 800M (might take a while, might not fully
   succeed, and might not be able to remove memory blocks in Linux):
    (qemu) qom-set vm0 requested-size 800M
    (qemu) info memory-devices
    Memory device [virtio-mem]: "vm0"
      memaddr: 0x80000000
      node: 0
      requested-size: 838860800
      size: 838860800
      max-size: 19327352832
      block-size: 1048576
      memdev: /objects/mem0

Note: Due to the lack of resizeable allocations, we will go ahead and
reserve an 18GB vmalloc area, and size both the QEMU RAM slot and the KVM
memory slot at 18GB. "echo 1 > /proc/sys/vm/overcommit_memory" might be
required for now. In the future, this area will instead grow on actual
demand and shrink when possible.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/s390x/Kconfig           |   1 +
 hw/s390x/Makefile.objs     |   1 +
 hw/s390x/s390-virtio-ccw.c | 116 ++++++++++++++++++++++++++++++++++++-
 hw/virtio/virtio-mem.c     |   2 +
 4 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/hw/s390x/Kconfig b/hw/s390x/Kconfig
index 5e7d8a2bae..b8619c1adc 100644
--- a/hw/s390x/Kconfig
+++ b/hw/s390x/Kconfig
@@ -10,3 +10,4 @@ config S390_CCW_VIRTIO
     select SCLPCONSOLE
     select VIRTIO_CCW
     select MSI_NONBROKEN
+    select VIRTIO_MEM_SUPPORTED
diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs
index a46a1c7894..924775d6f0 100644
--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -20,6 +20,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio-ccw-net.o
 obj-$(CONFIG_VIRTIO_BLK) += virtio-ccw-blk.o
 obj-$(call land,$(CONFIG_VIRTIO_9P),$(CONFIG_VIRTFS)) += virtio-ccw-9p.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-ccw.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio-ccw-mem.o
 endif
 obj-y += css-bridge.o
 obj-y += ccw-device.o
diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 577590e623..e714035077 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -45,6 +45,7 @@
 #include "sysemu/sysemu.h"
 #include "hw/s390x/pv.h"
 #include "migration/blocker.h"
+#include "hw/mem/memory-device.h"
 
 static Error *pv_mig_blocker;
 
@@ -542,11 +543,119 @@ static void s390_machine_reset(MachineState *machine)
     s390_ipl_clear_reset_request();
 }
 
+static void s390_virtio_md_pre_plug(HotplugHandler *hotplug_dev,
+                                    DeviceState *dev, Error **errp)
+{
+    HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+    MemoryDeviceState *md = MEMORY_DEVICE(dev);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+    Error *local_err = NULL;
+
+    if (!hotplug_dev2 && dev->hotplugged) {
+        /*
+         * Without a bus hotplug handler, we cannot control the plug/unplug
+         * order. We should never reach this point when hotplugging, however,
+         * better add a safety net.
+         */
+        error_setg(errp, "hotplug of virtio based memory devices not supported"
+                         " on this bus.");
+        return;
+    }
+
+    /*
+     * KVM does not support device memory with a bigger page size than initial
+     * memory. The new memory backend is not mapped yet, so
+     * qemu_maxrampagesize() won't consider it.
+     */
+    if (kvm_enabled()) {
+        MemoryRegion *mr = mdc->get_memory_region(md, &local_err);
+
+        if (local_err) {
+            goto out;
+        }
+        if (qemu_ram_pagesize(mr->ram_block) > qemu_maxrampagesize()) {
+            error_setg(&local_err, "Device memory has a bigger page size than"
+                       " initial memory");
+            goto out;
+        }
+    }
+
+    /*
+     * First, see if we can plug this memory device at all. If that
+     * succeeds, branch off to the actual hotplug handler.
+     */
+    memory_device_pre_plug(md, MACHINE(hotplug_dev), NULL, &local_err);
+    if (!local_err && hotplug_dev2) {
+        hotplug_handler_pre_plug(hotplug_dev2, dev, &local_err);
+    }
+out:
+    error_propagate(errp, local_err);
+}
+
+static void s390_virtio_md_plug(HotplugHandler *hotplug_dev,
+                                DeviceState *dev, Error **errp)
+{
+    HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+    static Error *migration_blocker;
+    bool add_blocker = !migration_blocker;
+    Error *local_err = NULL;
+
+    /*
+     * Until we support migration of storage keys and storage attributes
+     * for anything that's not initial memory, let's block migration.
+     */
+    if (add_blocker) {
+        error_setg(&migration_blocker, "storage keys/attributes not yet"
+                   " migrated for memory devices");
+        migrate_add_blocker(migration_blocker, &local_err);
+        if (local_err) {
+            error_free_or_abort(&migration_blocker);
+            goto out;
+        }
+    }
+
+    /*
+     * Plug the memory device first and then branch off to the actual
+     * hotplug handler. If that one fails, we can easily undo the memory
+     * device bits.
+     */
+    memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+    if (hotplug_dev2) {
+        hotplug_handler_plug(hotplug_dev2, dev, &local_err);
+        if (local_err) {
+            memory_device_unplug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+            if (add_blocker) {
+                migrate_del_blocker(migration_blocker);
+                error_free_or_abort(&migration_blocker);
+            }
+        }
+    }
+out:
+    error_propagate(errp, local_err);
+}
+
+static void s390_virtio_md_unplug_request(HotplugHandler *hotplug_dev,
+                                          DeviceState *dev, Error **errp)
+{
+    /* We don't support hot unplug of virtio based memory devices */
+    error_setg(errp, "virtio based memory devices cannot be unplugged.");
+}
+
+static void s390_machine_device_pre_plug(HotplugHandler *hotplug_dev,
+                                         DeviceState *dev, Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_CCW)) {
+        s390_virtio_md_pre_plug(hotplug_dev, dev, errp);
+    }
+}
+
 static void s390_machine_device_plug(HotplugHandler *hotplug_dev,
                                      DeviceState *dev, Error **errp)
 {
     if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         s390_cpu_plug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_CCW)) {
+        s390_virtio_md_plug(hotplug_dev, dev, errp);
     }
 }
 
@@ -555,7 +664,8 @@ static void s390_machine_device_unplug_request(HotplugHandler *hotplug_dev,
 {
     if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         error_setg(errp, "CPU hot unplug not supported on this machine");
-        return;
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_CCW)) {
+        s390_virtio_md_unplug_request(hotplug_dev, dev, errp);
     }
 }
 
@@ -596,7 +706,8 @@ static const CPUArchIdList *s390_possible_cpu_arch_ids(MachineState *ms)
 static HotplugHandler *s390_get_hotplug_handler(MachineState *machine,
                                                 DeviceState *dev)
 {
-    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_CCW)) {
         return HOTPLUG_HANDLER(machine);
     }
     return NULL;
@@ -668,6 +779,7 @@ static void ccw_machine_class_init(ObjectClass *oc, void *data)
     mc->possible_cpu_arch_ids = s390_possible_cpu_arch_ids;
     /* it is overridden with 'host' cpu *in kvm_arch_init* */
     mc->default_cpu_type = S390_CPU_TYPE_NAME("qemu");
+    hc->pre_plug = s390_machine_device_pre_plug;
     hc->plug = s390_machine_device_plug;
     hc->unplug_request = s390_machine_device_unplug_request;
     nc->nmi_monitor_handler = s390_nmi;
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 65850530e7..e1b3275089 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -53,6 +53,8 @@
  */
 #if defined(TARGET_X86_64) || defined(TARGET_I386)
 #define VIRTIO_MEM_USABLE_EXTENT (2 * (128 * MiB))
+#elif defined(TARGET_S390X)
+#define VIRTIO_MEM_USABLE_EXTENT (2 * (256 * MiB))
 #else
 #error VIRTIO_MEM_USABLE_EXTENT not defined
 #endif
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 4/5] s390x: implement virtio-mem-ccw
  2020-07-08 18:51 ` [PATCH RFC 4/5] s390x: implement virtio-mem-ccw David Hildenbrand
@ 2020-07-09  9:24   ` Cornelia Huck
  2020-07-09  9:26     ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Cornelia Huck @ 2020-07-09  9:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed,  8 Jul 2020 20:51:34 +0200
David Hildenbrand <david@redhat.com> wrote:

> Add a proper CCW proxy device, similar to the PCI variant.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/s390x/virtio-ccw-mem.c | 165 ++++++++++++++++++++++++++++++++++++++
>  hw/s390x/virtio-ccw.h     |  13 +++
>  2 files changed, 178 insertions(+)
>  create mode 100644 hw/s390x/virtio-ccw-mem.c

(...)

> +static void virtio_ccw_mem_instance_init(Object *obj)
> +{
> +    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(obj);
> +    VirtIOMEMClass *vmc;
> +    VirtIOMEM *vmem;
> +

I think you want

    ccw_dev->force_revision_1 = true;

here (similar to forcing virtio-pci to modern-only.)

> +    virtio_instance_init_common(obj, &ccw_mem->vdev, sizeof(ccw_mem->vdev),
> +                                TYPE_VIRTIO_MEM);
> +
> +    ccw_mem->size_change_notifier.notify = virtio_ccw_mem_size_change_notify;
> +    vmem = VIRTIO_MEM(&ccw_mem->vdev);
> +    vmc = VIRTIO_MEM_GET_CLASS(vmem);
> +    /*
> +     * We never remove the notifier again, as we expect both devices to
> +     * disappear at the same time.
> +     */
> +    vmc->add_size_change_notifier(vmem, &ccw_mem->size_change_notifier);
> +
> +    object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
> +                              OBJECT(&ccw_mem->vdev),
> +                              VIRTIO_MEM_BLOCK_SIZE_PROP);
> +    object_property_add_alias(obj, VIRTIO_MEM_SIZE_PROP, OBJECT(&ccw_mem->vdev),
> +                              VIRTIO_MEM_SIZE_PROP);
> +    object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
> +                              OBJECT(&ccw_mem->vdev),
> +                              VIRTIO_MEM_REQUESTED_SIZE_PROP);
> +}

(...)

(have not looked at the rest yet)




* Re: [PATCH RFC 4/5] s390x: implement virtio-mem-ccw
  2020-07-09  9:24   ` Cornelia Huck
@ 2020-07-09  9:26     ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-09  9:26 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 09.07.20 11:24, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:34 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Add a proper CCW proxy device, similar to the PCI variant.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/s390x/virtio-ccw-mem.c | 165 ++++++++++++++++++++++++++++++++++++++
>>  hw/s390x/virtio-ccw.h     |  13 +++
>>  2 files changed, 178 insertions(+)
>>  create mode 100644 hw/s390x/virtio-ccw-mem.c
> 
> (...)
> 
>> +static void virtio_ccw_mem_instance_init(Object *obj)
>> +{
>> +    VirtIOMEMCcw *ccw_mem = VIRTIO_MEM_CCW(obj);
>> +    VirtIOMEMClass *vmc;
>> +    VirtIOMEM *vmem;
>> +
> 
> I think you want
> 
>     ccw_dev->force_revision_1 = true;
> 
> here (similar to forcing virtio-pci to modern-only.)

Ah, that's the magic bit, was looking for that. Thanks!


-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-08 18:51 ` [PATCH RFC 2/5] s390x: implement diag260 David Hildenbrand
@ 2020-07-09 10:37   ` Cornelia Huck
  2020-07-09 17:54     ` David Hildenbrand
  2020-07-10  8:32     ` David Hildenbrand
  2020-07-09 10:52   ` Christian Borntraeger
  1 sibling, 2 replies; 39+ messages in thread
From: Cornelia Huck @ 2020-07-09 10:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed,  8 Jul 2020 20:51:32 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's implement the "storage configuration" part of diag260. This diag
> is found under z/VM, to indicate usable chunks of memory to the guest OS.
> As I don't have access to documentation, I have no clue what the actual
> error cases are, and which other stuff we could eventually query using this
> interface. Somebody with access to documentation should fix this. This
> implementation seems to work with Linux guests just fine.
> 
> The Linux kernel supports diag260 to query the available memory since
> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
> (with maxmem being defined and bigger than the memory size, e.g., "-m
>  2G,maxmem=4G"), just as if support for SCLP storage information is not
> implemented. They will fail to detect the actual initial memory size.
> 
> This interface allows us to expose the maximum ramsize via sclp
> and the initial ramsize via diag260 - without having to mess with the
> memory increment size and having to align the initial memory size to it.
> 
> This is a preparation for memory device support. We'll unlock the
> implementation with a new QEMU machine that supports memory devices.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>  target/s390x/internal.h    |  2 ++
>  target/s390x/kvm.c         | 11 ++++++++
>  target/s390x/misc_helper.c |  6 ++++
>  target/s390x/translate.c   |  4 +++
>  5 files changed, 80 insertions(+)
> 
> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
> index 1a48429564..c3b1e24b2c 100644
> --- a/target/s390x/diag.c
> +++ b/target/s390x/diag.c
> @@ -23,6 +23,63 @@
>  #include "hw/s390x/pv.h"
>  #include "kvm_s390x.h"
>  
> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    const ram_addr_t initial_ram_size = ms->ram_size;
> +    const uint64_t subcode = env->regs[r3];
> +    S390CPU *cpu = env_archcpu(env);
> +    ram_addr_t addr, length;
> +    uint64_t tmp;
> +
> +    /* TODO: Unlock with new QEMU machine. */
> +    if (false) {
> +        s390_program_interrupt(env, PGM_OPERATION, ra);
> +        return;
> +    }
> +
> +    /*
> +     * There also seems to be subcode "0xc", which stores the size of the
> +     * first chunk and the total size to r1/r2. It's only used by very old
> +     * Linux, so don't implement it.

FWIW,
https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
seems to list the available subcodes. Anything but 0xc and 0x10 is for
24/31 bit only, so we can safely ignore them. Not sure what we want to
do with 0xc: it is supposed to "Return the highest addressable byte of
virtual storage in the host-primary address space, including named
saved systems and saved segments", so returning the end of the address
space should be easy enough, but not very useful.

> +     */
> +    if ((r1 & 1) || subcode != 0x10) {
> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
> +        return;
> +    }
> +    addr = env->regs[r1];
> +    length = env->regs[r1 + 1];
> +
> +    /* FIXME: Somebody with documentation should fix this. */

Doc mentioned above says for specification exception:

"For subcode X'10':
• Rx is not an even-numbered register.
• The address contained in Rx is not on a quadword boundary.
• The length contained in Rx+1 is not a positive multiple of 16."

> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
> +        return;
> +    }
> +
> +    /* FIXME: Somebody with documentation should fix this. */
> +    if (!length) {

Probably specification exception as well?

> +        setcc(cpu, 3);
> +        return;
> +    }
> +
> +    /* FIXME: Somebody with documentation should fix this. */

For access exception:

"For subcode X'10', an error occurred trying to store the extent
information into the guest's output area."

> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
> +                                    MEMTXATTRS_UNSPECIFIED)) {
> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
> +        return;
> +    }
> +
> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
> +    tmp = cpu_to_be64(0);
> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
> +    tmp = cpu_to_be64(initial_ram_size - 1);
> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
> +
> +    /* Exactly one entry was stored. */
> +    env->regs[r3] = 1;
> +    setcc(cpu, 0);
> +}
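
To illustrate the output format the handler above stores — pairs of
big-endian 64-bit [start, end] values, one per usable memory extent — here
is a standalone host-side sketch (not guest or QEMU code; `read_be64` is a
local helper written for this example):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one big-endian 64-bit value, as stored by the hypervisor. */
static uint64_t read_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

int main(void)
{
    /* Simulate the guest's output area after the hypervisor stored one
     * extent for a guest with 2 GiB of initial memory ([0 .. ram_size - 1]),
     * mirroring the two cpu_physical_memory_write() calls above. */
    uint8_t area[16];
    const uint64_t ram_size = 2ULL * 1024 * 1024 * 1024;

    for (int i = 0; i < 8; i++) {
        area[i] = 0;                                           /* start */
        area[8 + i] = (ram_size - 1) >> (8 * (7 - i));         /* end */
    }

    uint64_t start = read_be64(area);
    uint64_t end = read_be64(area + 8);
    assert(start == 0 && end == ram_size - 1);
    printf("extent: 0x%llx .. 0x%llx\n",
           (unsigned long long)start, (unsigned long long)end);
    return 0;
}
```
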
> +
>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>  {
>      uint64_t func = env->regs[r1];

(...)

> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
> index 58dbc023eb..d7274eb320 100644
> --- a/target/s390x/misc_helper.c
> +++ b/target/s390x/misc_helper.c
> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>      uint64_t r;
>  
>      switch (num) {
> +    case 0x260:
> +        qemu_mutex_lock_iothread();
> +        handle_diag_260(env, r1, r3, GETPC());
> +        qemu_mutex_unlock_iothread();
> +        r = 0;
> +        break;
>      case 0x500:
>          /* KVM hypercall */
>          qemu_mutex_lock_iothread();

Looking at the doc referenced above, it seems that we treat every diag
call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
to your patch; maybe I'm misreading.)

> diff --git a/target/s390x/translate.c b/target/s390x/translate.c
> index 4f6f1e31cd..6bb8b6e513 100644
> --- a/target/s390x/translate.c
> +++ b/target/s390x/translate.c
> @@ -2398,6 +2398,10 @@ static DisasJumpType op_diag(DisasContext *s, DisasOps *o)
>      TCGv_i32 func_code = tcg_const_i32(get_field(s, i2));
>  
>      gen_helper_diag(cpu_env, r1, r3, func_code);
> +    /* Only some diags modify the CC. */
> +    if (get_field(s, i2) == 0x260) {
> +        set_cc_static(s);
> +    }
>  
>      tcg_temp_free_i32(func_code);
>      tcg_temp_free_i32(r3);




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-08 18:51 ` [PATCH RFC 2/5] s390x: implement diag260 David Hildenbrand
  2020-07-09 10:37   ` Cornelia Huck
@ 2020-07-09 10:52   ` Christian Borntraeger
  2020-07-09 18:15     ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: Christian Borntraeger @ 2020-07-09 10:52 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	Cornelia Huck, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson


On 08.07.20 20:51, David Hildenbrand wrote:
> Let's implement the "storage configuration" part of diag260. This diag
> is found under z/VM, to indicate usable chunks of memory to the guest OS.
> As I don't have access to documentation, I have no clue what the actual
> error cases are, and which other stuff we could eventually query using this
> interface. Somebody with access to documentation should fix this. This
> implementation seems to work with Linux guests just fine.
> 
> The Linux kernel supports diag260 to query the available memory since
> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
> (with maxmem being defined and bigger than the memory size, e.g., "-m
>  2G,maxmem=4G"), just as if support for SCLP storage information is not
> implemented. They will fail to detect the actual initial memory size.
> 
> This interface allows us to expose the maximum ramsize via sclp
> and the initial ramsize via diag260 - without having to mess with the
> memory increment size and having to align the initial memory size to it.
> 
> This is a preparation for memory device support. We'll unlock the
> implementation with a new QEMU machine that supports memory devices.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I have not looked into this, so this is purely a question. 

Is there a way to hotplug virtio-mem memory beyond the initial size of
the memory as specified by the initial SCLP? Then we could avoid doing
this platform-specific diag260.
the only issue I see is when we need to go beyond 4TB due to the page table
upgrade in the kernel. 

FWIW diag 260 is publicly documented. 



* Re: [PATCH RFC 3/5] s390x: prepare device memory address space
  2020-07-08 18:51 ` [PATCH RFC 3/5] s390x: prepare device memory address space David Hildenbrand
@ 2020-07-09 10:59   ` Cornelia Huck
  2020-07-10  7:46     ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Cornelia Huck @ 2020-07-09 10:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed,  8 Jul 2020 20:51:33 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's allocate the device memory information and setup the device
> memory address space. Expose the maximum ramsize via SCLP and the actual
> initial ramsize via diag260.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/s390x/s390-virtio-ccw.c         | 43 ++++++++++++++++++++++++++++++
>  hw/s390x/sclp.c                    | 12 +++++++--
>  include/hw/s390x/s390-virtio-ccw.h |  3 +++
>  target/s390x/diag.c                |  4 +--
>  4 files changed, 58 insertions(+), 4 deletions(-)

(...)

> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
> index c3b1e24b2c..6b33eb0efc 100644
> --- a/target/s390x/diag.c
> +++ b/target/s390x/diag.c
> @@ -32,8 +32,8 @@ void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>      ram_addr_t addr, length;
>      uint64_t tmp;
>  
> -    /* TODO: Unlock with new QEMU machine. */
> -    if (false) {
> +    /* Support for diag260 is glued to support for memory devices. */

I'm wondering why you need to do this... sure, the availability of a
new diagnose could be perceived as a guest-visible change, but does the
information presented change anything? Without memory devices, it will
just duplicate the information already reported via SCLP, IIUC?

> +    if (!memory_devices_allowed()) {
>          s390_program_interrupt(env, PGM_OPERATION, ra);
>          return;
>      }




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-09 10:37   ` Cornelia Huck
@ 2020-07-09 17:54     ` David Hildenbrand
  2020-07-10  8:32     ` David Hildenbrand
  1 sibling, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-09 17:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 09.07.20 12:37, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:32 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>  target/s390x/internal.h    |  2 ++
>>  target/s390x/kvm.c         | 11 ++++++++
>>  target/s390x/misc_helper.c |  6 ++++
>>  target/s390x/translate.c   |  4 +++
>>  5 files changed, 80 insertions(+)
>>
>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>> index 1a48429564..c3b1e24b2c 100644
>> --- a/target/s390x/diag.c
>> +++ b/target/s390x/diag.c
>> @@ -23,6 +23,63 @@
>>  #include "hw/s390x/pv.h"
>>  #include "kvm_s390x.h"
>>  
>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>> +{
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>> +    const uint64_t subcode = env->regs[r3];
>> +    S390CPU *cpu = env_archcpu(env);
>> +    ram_addr_t addr, length;
>> +    uint64_t tmp;
>> +
>> +    /* TODO: Unlock with new QEMU machine. */
>> +    if (false) {
>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * There also seems to be subcode "0xc", which stores the size of the
>> +     * first chunk and the total size to r1/r2. It's only used by very old
>> +     * Linux, so don't implement it.
> 
> FWIW,
> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
> seems to list the available subcodes. Anything but 0xc and 0x10 is for
> 24/31 bit only, so we can safely ignore them. Not sure what we want to
> do with 0xc: it is supposed to "Return the highest addressable byte of
> virtual storage in the host-primary address space, including named
> saved systems and saved segments", so returning the end of the address
> space should be easy enough, but not very useful.

Thanks for the link to the documentation! Either my google search skills
are bad or that stuff is just hard to find :) I'll have a look and see
how to make sense of 0xc. Smells like "maxram_size - 1" indeed.

> 
>> +     */
>> +    if ((r1 & 1) || subcode != 0x10) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +    addr = env->regs[r1];
>> +    length = env->regs[r1 + 1];
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> Doc mentioned above says for specification exception:
> 
> "For subcode X'10':
> • Rx is not an even-numbered register.
> • The address contained in Rx is not on a quadword boundary.
> • The length contained in Rx+1 is not a positive multiple of 16."
> 
>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
>> +    if (!length) {
> 
> Probably specification exception as well?

Yeah I'll add "|| !length" above.

> 
>> +        setcc(cpu, 3);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> For access exception:
> 
> "For subcode X'10', an error occurred trying to store the extent
> information into the guest's output area."
> 

Okay, looks good then!

>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>> +        return;
>> +    }
>> +
>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>> +    tmp = cpu_to_be64(0);
>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>> +
>> +    /* Exactly one entry was stored. */
>> +    env->regs[r3] = 1;
>> +    setcc(cpu, 0);
>> +}
>> +
>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>  {
>>      uint64_t func = env->regs[r1];
> 
> (...)
> 
>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>> index 58dbc023eb..d7274eb320 100644
>> --- a/target/s390x/misc_helper.c
>> +++ b/target/s390x/misc_helper.c
>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>      uint64_t r;
>>  
>>      switch (num) {
>> +    case 0x260:
>> +        qemu_mutex_lock_iothread();
>> +        handle_diag_260(env, r1, r3, GETPC());
>> +        qemu_mutex_unlock_iothread();
>> +        r = 0;
>> +        break;
>>      case 0x500:
>>          /* KVM hypercall */
>>          qemu_mutex_lock_iothread();
> 
> Looking at the doc referenced above, it seems that we treat every diag
> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> to your patch; maybe I'm misreading.)

Interesting. Adding in onto my todo list.

Thanks again!

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-09 10:52   ` Christian Borntraeger
@ 2020-07-09 18:15     ` David Hildenbrand
  2020-07-10  9:17       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-09 18:15 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	Cornelia Huck, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson

On 09.07.20 12:52, Christian Borntraeger wrote:
> 
> On 08.07.20 20:51, David Hildenbrand wrote:
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I have not looked into this, so this is purely a question. 
> 
> Is there a way to hotplug virtio-mem memory beyond the initial size of
> the memory as specified by the initial SCLP? Then we could avoid doing
> this platform-specific diag260.

We need a way to tell the guest about the maximum possible PFN, so it
can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
tables. On s390x, the only way I see is using a combination of diag260,
without introducing any other new mechanisms.

Currently Linux selects 3 vs. 4 level page tables based on that size (I
think that's what you were referring to with the 4TB limit). I can see
that kasan also does some magic based on the value ("populate kasan
shadow for untracked memory"), but did not look into the details. I
*think* kasan will never be able to track that memory, but am not
completely sure.

I'd like to avoid something like what you propose (that's why I searched
and discovered diag260 after all :) ), especially to not silently break
in the future, when other assumptions based on that value are introduced.

E.g., on my z/VM LinuxONE Community Cloud machine, diag260 gets used by
default, so it does not seem to be a corner-case mechanism nowadays.

> the only issue I see is when we need to go beyond 4TB due to the page table
> upgrade in the kernel. 
> 
> FWIW diag 260 is publicly documented. 

Yeah, Conny pointed me at the doc - makes things easier :)


-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 3/5] s390x: prepare device memory address space
  2020-07-09 10:59   ` Cornelia Huck
@ 2020-07-10  7:46     ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10  7:46 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 09.07.20 12:59, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:33 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's allocate the device memory information and setup the device
>> memory address space. Expose the maximum ramsize via SCLP and the actual
>> initial ramsize via diag260.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/s390x/s390-virtio-ccw.c         | 43 ++++++++++++++++++++++++++++++
>>  hw/s390x/sclp.c                    | 12 +++++++--
>>  include/hw/s390x/s390-virtio-ccw.h |  3 +++
>>  target/s390x/diag.c                |  4 +--
>>  4 files changed, 58 insertions(+), 4 deletions(-)
> 
> (...)
> 
>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>> index c3b1e24b2c..6b33eb0efc 100644
>> --- a/target/s390x/diag.c
>> +++ b/target/s390x/diag.c
>> @@ -32,8 +32,8 @@ void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>>      ram_addr_t addr, length;
>>      uint64_t tmp;
>>  
>> -    /* TODO: Unlock with new QEMU machine. */
>> -    if (false) {
>> +    /* Support for diag260 is glued to support for memory devices. */
> 
> I'm wondering why you need to do this... sure, the availability of a
> new diagnose could be perceived as a guest-visible change, but does the
> information presented change anything? Without memory devices, it will
> just duplicate the information already reported via SCLP, IIUC?

Yes, it's essentially providing redundant information without memory
devices.

One could sense diag260 in the guest and assume it will work on
successive invocations. E.g., issue subcode 0xc while checking for
exceptions, then issue subcode 0x10 without checking for exceptions. If
we migrate in between, we could be in trouble.

Yes, it's somewhat unlikely, I don't have a strong opinion here. Gluing
it to some migration-safe mechanism (here, the machine) felt like the
right thing to do.

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-09 10:37   ` Cornelia Huck
  2020-07-09 17:54     ` David Hildenbrand
@ 2020-07-10  8:32     ` David Hildenbrand
  2020-07-10  8:41       ` David Hildenbrand
  2020-07-13 11:54       ` Christian Borntraeger
  1 sibling, 2 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10  8:32 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 09.07.20 12:37, Cornelia Huck wrote:
> On Wed,  8 Jul 2020 20:51:32 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's implement the "storage configuration" part of diag260. This diag
>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>> As I don't have access to documentation, I have no clue what the actual
>> error cases are, and which other stuff we could eventually query using this
>> interface. Somebody with access to documentation should fix this. This
>> implementation seems to work with Linux guests just fine.
>>
>> The Linux kernel supports diag260 to query the available memory since
>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>> implemented. They will fail to detect the actual initial memory size.
>>
>> This interface allows us to expose the maximum ramsize via sclp
>> and the initial ramsize via diag260 - without having to mess with the
>> memory increment size and having to align the initial memory size to it.
>>
>> This is a preparation for memory device support. We'll unlock the
>> implementation with a new QEMU machine that supports memory devices.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>  target/s390x/internal.h    |  2 ++
>>  target/s390x/kvm.c         | 11 ++++++++
>>  target/s390x/misc_helper.c |  6 ++++
>>  target/s390x/translate.c   |  4 +++
>>  5 files changed, 80 insertions(+)
>>
>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>> index 1a48429564..c3b1e24b2c 100644
>> --- a/target/s390x/diag.c
>> +++ b/target/s390x/diag.c
>> @@ -23,6 +23,63 @@
>>  #include "hw/s390x/pv.h"
>>  #include "kvm_s390x.h"
>>  
>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>> +{
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>> +    const uint64_t subcode = env->regs[r3];
>> +    S390CPU *cpu = env_archcpu(env);
>> +    ram_addr_t addr, length;
>> +    uint64_t tmp;
>> +
>> +    /* TODO: Unlock with new QEMU machine. */
>> +    if (false) {
>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * There also seems to be subcode "0xc", which stores the size of the
>> +     * first chunk and the total size to r1/r2. It's only used by very old
>> +     * Linux, so don't implement it.
> 
> FWIW,
> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
> seems to list the available subcodes. Anything but 0xc and 0x10 is for
> 24/31 bit only, so we can safely ignore them. Not sure what we want to
> do with 0xc: it is supposed to "Return the highest addressable byte of
> virtual storage in the host-primary address space, including named
> saved systems and saved segments", so returning the end of the address
> space should be easy enough, but not very useful.
> 
>> +     */
>> +    if ((r1 & 1) || subcode != 0x10) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +    addr = env->regs[r1];
>> +    length = env->regs[r1 + 1];
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> Doc mentioned above says for specification exception:
> 
> "For subcode X'10':
> • Rx is not an even-numbered register.
> • The address contained in Rx is not on a quadword boundary.
> • The length contained in Rx+1 is not a positive multiple of 16."
> 
>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
>> +    if (!length) {
> 
> Probably specification exception as well?
> 
>> +        setcc(cpu, 3);
>> +        return;
>> +    }
>> +
>> +    /* FIXME: Somebody with documentation should fix this. */
> 
> For access exception:
> 
> "For subcode X'10', an error occurred trying to store the extent
> information into the guest's output area."
> 
>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>> +        return;
>> +    }
>> +
>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>> +    tmp = cpu_to_be64(0);
>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>> +
>> +    /* Exactly one entry was stored. */
>> +    env->regs[r3] = 1;
>> +    setcc(cpu, 0);
>> +}
>> +
>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>  {
>>      uint64_t func = env->regs[r1];
> 
> (...)
> 
>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>> index 58dbc023eb..d7274eb320 100644
>> --- a/target/s390x/misc_helper.c
>> +++ b/target/s390x/misc_helper.c
>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>      uint64_t r;
>>  
>>      switch (num) {
>> +    case 0x260:
>> +        qemu_mutex_lock_iothread();
>> +        handle_diag_260(env, r1, r3, GETPC());
>> +        qemu_mutex_unlock_iothread();
>> +        r = 0;
>> +        break;
>>      case 0x500:
>>          /* KVM hypercall */
>>          qemu_mutex_lock_iothread();
> 
> Looking at the doc referenced above, it seems that we treat every diag
> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> to your patch; maybe I'm misreading.)

That's also a BUG in kvm then?

int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
{
...
	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
...
}

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10  8:32     ` David Hildenbrand
@ 2020-07-10  8:41       ` David Hildenbrand
  2020-07-10  9:19         ` Cornelia Huck
  2020-07-13 11:54       ` Christian Borntraeger
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10  8:41 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 10.07.20 10:32, David Hildenbrand wrote:
> On 09.07.20 12:37, Cornelia Huck wrote:
>> On Wed,  8 Jul 2020 20:51:32 +0200
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> Let's implement the "storage configuration" part of diag260. This diag
>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>> As I don't have access to documentation, I have no clue what the actual
>>> error cases are, and which other stuff we could eventually query using this
>>> interface. Somebody with access to documentation should fix this. This
>>> implementation seems to work with Linux guests just fine.
>>>
>>> The Linux kernel supports diag260 to query the available memory since
>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>> implemented. They will fail to detect the actual initial memory size.
>>>
>>> This interface allows us to expose the maximum ramsize via sclp
>>> and the initial ramsize via diag260 - without having to mess with the
>>> memory increment size and having to align the initial memory size to it.
>>>
>>> This is a preparation for memory device support. We'll unlock the
>>> implementation with a new QEMU machine that supports memory devices.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  target/s390x/diag.c        | 57 ++++++++++++++++++++++++++++++++++++++
>>>  target/s390x/internal.h    |  2 ++
>>>  target/s390x/kvm.c         | 11 ++++++++
>>>  target/s390x/misc_helper.c |  6 ++++
>>>  target/s390x/translate.c   |  4 +++
>>>  5 files changed, 80 insertions(+)
>>>
>>> diff --git a/target/s390x/diag.c b/target/s390x/diag.c
>>> index 1a48429564..c3b1e24b2c 100644
>>> --- a/target/s390x/diag.c
>>> +++ b/target/s390x/diag.c
>>> @@ -23,6 +23,63 @@
>>>  #include "hw/s390x/pv.h"
>>>  #include "kvm_s390x.h"
>>>  
>>> +void handle_diag_260(CPUS390XState *env, uint64_t r1, uint64_t r3, uintptr_t ra)
>>> +{
>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>> +    const ram_addr_t initial_ram_size = ms->ram_size;
>>> +    const uint64_t subcode = env->regs[r3];
>>> +    S390CPU *cpu = env_archcpu(env);
>>> +    ram_addr_t addr, length;
>>> +    uint64_t tmp;
>>> +
>>> +    /* TODO: Unlock with new QEMU machine. */
>>> +    if (false) {
>>> +        s390_program_interrupt(env, PGM_OPERATION, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * There also seems to be subcode "0xc", which stores the size of the
>>> +     * first chunk and the total size to r1/r2. It's only used by very old
>>> +     * Linux, so don't implement it.
>>
>> FWIW,
>> https://www-01.ibm.com/servers/resourcelink/svc0302a.nsf/pages/zVMV7R1sc246272/$file/hcpb4_v7r1.pdf
>> seems to list the available subcodes. Anything but 0xc and 0x10 is for
>> 24/31 bit only, so we can safely ignore them. Not sure what we want to
>> do with 0xc: it is supposed to "Return the highest addressable byte of
>> virtual storage in the host-primary address space, including named
>> saved systems and saved segments", so returning the end of the address
>> space should be easy enough, but not very useful.
>>
>>> +     */
>>> +    if ((r1 & 1) || subcode != 0x10) {
>>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>>> +        return;
>>> +    }
>>> +    addr = env->regs[r1];
>>> +    length = env->regs[r1 + 1];
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>
>> Doc mentioned above says for specification exception:
>>
>> "For subcode X'10':
>> • Rx is not an even-numbered register.
>> • The address contained in Rx is not on a quadword boundary.
>> • The length contained in Rx+1 is not a positive multiple of 16."
>>
>>> +    if (!QEMU_IS_ALIGNED(addr, 16) || !QEMU_IS_ALIGNED(length, 16)) {
>>> +        s390_program_interrupt(env, PGM_SPECIFICATION, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>> +    if (!length) {
>>
>> Probably specification exception as well?
>>
>>> +        setcc(cpu, 3);
>>> +        return;
>>> +    }
>>> +
>>> +    /* FIXME: Somebody with documentation should fix this. */
>>
>> For access exception:
>>
>> "For subcode X'10', an error occurred trying to store the extent
>> information into the guest's output area."
>>
>>> +    if (!address_space_access_valid(&address_space_memory, addr, length, true,
>>> +                                    MEMTXATTRS_UNSPECIFIED)) {
>>> +        s390_program_interrupt(env, PGM_ADDRESSING, ra);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Indicate our initial memory ([0 .. ram_size - 1]) */
>>> +    tmp = cpu_to_be64(0);
>>> +    cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
>>> +    tmp = cpu_to_be64(initial_ram_size - 1);
>>> +    cpu_physical_memory_write(addr + sizeof(tmp), &tmp, sizeof(tmp));
>>> +
>>> +    /* Exactly one entry was stored. */
>>> +    env->regs[r3] = 1;
>>> +    setcc(cpu, 0);
>>> +}
>>> +
>>>  int handle_diag_288(CPUS390XState *env, uint64_t r1, uint64_t r3)
>>>  {
>>>      uint64_t func = env->regs[r1];
>>
>> (...)
>>
>>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
>>> index 58dbc023eb..d7274eb320 100644
>>> --- a/target/s390x/misc_helper.c
>>> +++ b/target/s390x/misc_helper.c
>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>      uint64_t r;
>>>  
>>>      switch (num) {
>>> +    case 0x260:
>>> +        qemu_mutex_lock_iothread();
>>> +        handle_diag_260(env, r1, r3, GETPC());
>>> +        qemu_mutex_unlock_iothread();
>>> +        r = 0;
>>> +        break;
>>>      case 0x500:
>>>          /* KVM hypercall */
>>>          qemu_mutex_lock_iothread();
>>
>> Looking at the doc referenced above, it seems that we treat every diag
>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>> to your patch; maybe I'm misreading.)
> 
> That's also a BUG in kvm then?
> 
> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> {
> ...
> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> ...
> }
> 

But OTOH, it does not sound sane if user space can bypass the OS to
yield the CPU ... so this might just be wrong documentation. All DIAGs
should be privileged IIRC.

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-09 18:15     ` David Hildenbrand
@ 2020-07-10  9:17       ` David Hildenbrand
  2020-07-10 12:12         ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10  9:17 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	Cornelia Huck, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson

On 09.07.20 20:15, David Hildenbrand wrote:
> On 09.07.20 12:52, Christian Borntraeger wrote:
>>
>> On 08.07.20 20:51, David Hildenbrand wrote:
>>> Let's implement the "storage configuration" part of diag260. This diag
>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>> As I don't have access to documentation, I have no clue what the actual
>>> error cases are, and which other stuff we could eventually query using this
>>> interface. Somebody with access to documentation should fix this. This
>>> implementation seems to work with Linux guests just fine.
>>>
>>> The Linux kernel supports diag260 to query the available memory since
>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>> implemented. They will fail to detect the actual initial memory size.
>>>
>>> This interface allows us to expose the maximum ramsize via sclp
>>> and the initial ramsize via diag260 - without having to mess with the
>>> memory increment size and having to align the initial memory size to it.
>>>
>>> This is a preparation for memory device support. We'll unlock the
>>> implementation with a new QEMU machine that supports memory devices.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> I have not looked into this, so this is purely a question. 
>>
>> Is there a way to hotplug virtio-mem memory beyond the initial size of 
>> the memory as specified by the initial SCLP? Then we could avoid doing
>> this platform-specific diag260?
> 
> We need a way to tell the guest about the maximum possible PFN, so it
> can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
> tables. On s390x, the only way I see is using a combination of diag260,
> without introducing any other new mechanisms.
> 
> Currently Linux selects 3- vs. 4-level page tables based on that size (I
> think that's what you were referring to with the 4TB limit). I can see
> that kasan also does some magic based on the value ("populate kasan
> shadow for untracked memory"), but did not look into the details. I
> *think* kasan will never be able to track that memory, but am not
> completely sure.
> 
> I'd like to avoid something as you propose (that's why I searched and
> discovered diag260 after all :) ), especially to not silently break in
> the future, when other assumptions based on that value are introduced.
> 
> E.g., on my z/VM LinuxOne Community Cloud machine, diag260 gets used as
> default, so it does not seem to be a corner case mechanism nowadays.
> 

Note: Reading about diag260 subcode 0xc, we could modify Linux to query
the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
indicating maxram size via SCLP, and keep diag260-unaware OSs
working as before. Thoughts?
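The proposal above can be sketched roughly as follows. Everything here is hypothetical (none of these names are real kernel symbols; `probe` stands in for issuing DIAG 0x260 subcode 0xc, returning false where a real guest would catch the program exception of a hypervisor that does not implement it):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest-side detection flow, not actual kernel code. */
typedef bool (*diag260_probe_fn)(uint64_t *max_addr);

static uint64_t detect_max_possible_addr(diag260_probe_fn probe,
                                         uint64_t sclp_ram_end)
{
    uint64_t diag_end;

    if (probe && probe(&diag_end))
        return diag_end;    /* diag260-aware guest: may exceed SCLP size */
    return sclp_ram_end;    /* diag260-unaware path stays unchanged */
}

/* Stub "hypervisors" for demonstration. */
static bool probe_4tb(uint64_t *max_addr)
{
    *max_addr = 4ULL << 40;
    return true;
}

static bool probe_none(uint64_t *max_addr)
{
    (void)max_addr;
    return false;       /* subcode not implemented */
}
```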


-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10  8:41       ` David Hildenbrand
@ 2020-07-10  9:19         ` Cornelia Huck
  0 siblings, 0 replies; 39+ messages in thread
From: Cornelia Huck @ 2020-07-10  9:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Fri, 10 Jul 2020 10:41:33 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 10.07.20 10:32, David Hildenbrand wrote:
> > On 09.07.20 12:37, Cornelia Huck wrote:  
> >> On Wed,  8 Jul 2020 20:51:32 +0200
> >> David Hildenbrand <david@redhat.com> wrote:

> >>> diff --git a/target/s390x/misc_helper.c b/target/s390x/misc_helper.c
> >>> index 58dbc023eb..d7274eb320 100644
> >>> --- a/target/s390x/misc_helper.c
> >>> +++ b/target/s390x/misc_helper.c
> >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
> >>>      uint64_t r;
> >>>  
> >>>      switch (num) {
> >>> +    case 0x260:
> >>> +        qemu_mutex_lock_iothread();
> >>> +        handle_diag_260(env, r1, r3, GETPC());
> >>> +        qemu_mutex_unlock_iothread();
> >>> +        r = 0;
> >>> +        break;
> >>>      case 0x500:
> >>>          /* KVM hypercall */
> >>>          qemu_mutex_lock_iothread();  
> >>
> >> Looking at the doc referenced above, it seems that we treat every diag
> >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> >> to your patch; maybe I'm misreading.)  
> > 
> > That's also a BUG in kvm then?
> > 
> > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> > {
> > ...
> > 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> > 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> > ...
> > }
> >   
> 
> But OTOH, it does not sound sane if user space can bypass the OS to
> yield the CPU ... so this might just be wrong documentation. All DIAGs
> should be privileged IIRC.

Maybe not all of them, but the diag 0x44 case is indeed odd. No idea
what is documented for its use on LPAR (I don't think that document is
public.)




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10  9:17       ` David Hildenbrand
@ 2020-07-10 12:12         ` David Hildenbrand
  2020-07-10 15:18           ` Heiko Carstens
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10 12:12 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	Cornelia Huck, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson

On 10.07.20 11:17, David Hildenbrand wrote:
> On 09.07.20 20:15, David Hildenbrand wrote:
>> On 09.07.20 12:52, Christian Borntraeger wrote:
>>>
>>> On 08.07.20 20:51, David Hildenbrand wrote:
>>>> Let's implement the "storage configuration" part of diag260. This diag
>>>> is found under z/VM, to indicate usable chunks of memory to the guest OS.
>>>> As I don't have access to documentation, I have no clue what the actual
>>>> error cases are, and which other stuff we could eventually query using this
>>>> interface. Somebody with access to documentation should fix this. This
>>>> implementation seems to work with Linux guests just fine.
>>>>
>>>> The Linux kernel supports diag260 to query the available memory since
>>>> v4.20. Older kernels / kvm-unit-tests will later fail to run in such a VM
>>>> (with maxmem being defined and bigger than the memory size, e.g., "-m
>>>>  2G,maxmem=4G"), just as if support for SCLP storage information is not
>>>> implemented. They will fail to detect the actual initial memory size.
>>>>
>>>> This interface allows us to expose the maximum ramsize via sclp
>>>> and the initial ramsize via diag260 - without having to mess with the
>>>> memory increment size and having to align the initial memory size to it.
>>>>
>>>> This is a preparation for memory device support. We'll unlock the
>>>> implementation with a new QEMU machine that supports memory devices.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> I have not looked into this, so this is purely a question. 
>>>
>>> Is there a way to hotplug virtio-mem memory beyond the initial size of 
>>> the memory as specified by the initial SCLP? Then we could avoid doing
>>> this platform-specific diag260?
>>
>> We need a way to tell the guest about the maximum possible PFN, so it
>> can prepare for that. E.g. on x86-64 this is usually done via ACPI SRAT
>> tables. On s390x, the only way I see is using a combination of diag260,
>> without introducing any other new mechanisms.
>>
>> Currently Linux selects 3- vs. 4-level page tables based on that size (I
>> think that's what you were referring to with the 4TB limit). I can see
>> that kasan also does some magic based on the value ("populate kasan
>> shadow for untracked memory"), but did not look into the details. I
>> *think* kasan will never be able to track that memory, but am not
>> completely sure.
>>
>> I'd like to avoid something as you propose (that's why I searched and
>> discovered diag260 after all :) ), especially to not silently break in
>> the future, when other assumptions based on that value are introduced.
>>
>> E.g., on my z/VM LinuxOne Community Cloud machine, diag260 gets used as
>> default, so it does not seem to be a corner case mechanism nowadays.
>>
> 
> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> indicating maxram size via SCLP, and keep diag260-unaware OSs
> working as before. Thoughts?

Implemented it, seems to work fine.

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10 12:12         ` David Hildenbrand
@ 2020-07-10 15:18           ` Heiko Carstens
  2020-07-10 15:24             ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-10 15:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> > Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> > the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> > indicating maxram size via SCLP, and keep diag260-unaware OSs
> > working as before. Thoughts?
> 
> Implemented it, seems to work fine.

The returned value would not include standby/reserved memory within
z/VM. So this seems not to work.
Also: why do you want to change this?



* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10 15:18           ` Heiko Carstens
@ 2020-07-10 15:24             ` David Hildenbrand
  2020-07-10 15:43               ` Heiko Carstens
  2020-07-13  9:12               ` Heiko Carstens
  0 siblings, 2 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10 15:24 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 10.07.20 17:18, Heiko Carstens wrote:
> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>> working as before. Thoughts?
>>
>> Implemented it, seems to work fine.
> 
> The returned value would not include standby/reserved memory within
> z/VM. So this seems not to work.

Which value exactly are you referencing? diag 0xc returns two values.
One of them seems to do exactly what we need.

See
https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7

for my current Linux approach.

> Also: why do you want to change this?

Which change exactly do you mean?

If we limit the value returned via SCLP to initial memory, we cannot
break any guest (e.g., Linux pre 4.20, kvm-unit-tests). diag260 is then
purely optional.

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10 15:24             ` David Hildenbrand
@ 2020-07-10 15:43               ` Heiko Carstens
  2020-07-10 15:45                 ` David Hildenbrand
  2020-07-13  9:12               ` Heiko Carstens
  1 sibling, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-10 15:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
> On 10.07.20 17:18, Heiko Carstens wrote:
> > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> >>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep
> >>> working as before. Thoughts?
> >>
> >> Implemented it, seems to work fine.
> > 
> > The returned value would not include standby/reserved memory within
> > z/VM. So this seems not to work.
> 
> Which value exactly are you referencing? diag 0xc returns two values.
> One of them seems to do exactly what we need.

Maybe I'm missing something as usual, but to me this
--------
Usage Notes:
...
2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE
command to configure reserved or standby storage for a guest, the
values returned in Rx and Ry will be the current values, but these
values can change dynamically depending on the options specified and
any dynamic storage reconfiguration (DSR) changes initiated by the
guest.
--------
reads like it is not doing what you want. That is: it does *not*
include standby memory and therefore will not return the highest
possible pfn.



* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10 15:43               ` Heiko Carstens
@ 2020-07-10 15:45                 ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-10 15:45 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, David Hildenbrand, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Michael S . Tsirkin, Claudio Imbrenda, Richard Henderson



> Am 10.07.2020 um 17:43 schrieb Heiko Carstens <hca@linux.ibm.com>:
> 
> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs
>>>>> working as before. Thoughts?
>>>> 
>>>> Implemented it, seems to work fine.
>>> 
>>> The returned value would not include standby/reserved memory within
>>> z/VM. So this seems not to work.
>> 
>> Which value exactly are you referencing? diag 0xc returns two values.
>> One of them seems to do exactly what we need.
> 
> Maybe I'm missing something as usual, but to me this
> --------
> Usage Notes:
> ...
> 2. If the RESERVED or STANDBY option was used on the DEFINE STORAGE
> command to configure reserved or standby storage for a guest, the
> values returned in Rx and Ry will be the current values, but these
> values can change dynamically depending on the options specified and
> any dynamic storage reconfiguration (DSR) changes initiated by the
> guest.
> --------
> reads like it is not doing what you want. That is: it does *not*
> include standby memory and therefore will not return the highest
> possible pfn.
> 

Ah, yes. See the kernel patch: I take the max of both values (SCLP, diag260(0xc)).

Anyhow, what would be your recommendation?

Thanks!



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10 15:24             ` David Hildenbrand
  2020-07-10 15:43               ` Heiko Carstens
@ 2020-07-13  9:12               ` Heiko Carstens
  2020-07-13 10:27                 ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-13  9:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
> On 10.07.20 17:18, Heiko Carstens wrote:
> > On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
> >>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
> >>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
> >>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep
> >>> working as before. Thoughts?
> >>
> >> Implemented it, seems to work fine.
> > 
> > The returned value would not include standby/reserved memory within
> > z/VM. So this seems not to work.
> 
> Which value exactly are you referencing? diag 0xc returns two values.
> One of them seems to do exactly what we need.
> 
> See
> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
> 
> for my current Linux approach.
> 
> > Also: why do you want to change this
> 
> Which change exactly do you mean?
> 
> If we limit the value returned via SCLP to initial memory, we cannot
> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then
> purely optional.

Ok, now I see the context. Christian added me just to cc on this
specific patch.
So if I understand you correctly, then you want to use diag 260 in
order to figure out how much memory is _potentially_ available for a
guest?

This does not fit the current semantics, since diag 260 returns the
highest *currently* accessible address. That is: it does explicitly
*not* include standby memory or anything else that might potentially
be there.

So you would need a different interface to tell the guest about your
new hotplug memory interface. If sclp does not work, then maybe a new
diagnose(?).


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-13  9:12               ` Heiko Carstens
@ 2020-07-13 10:27                 ` David Hildenbrand
  2020-07-13 11:08                   ` Christian Borntraeger
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-13 10:27 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, David Hildenbrand, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Michael S . Tsirkin, Claudio Imbrenda, Richard Henderson



> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>:
> 
> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep
>>>>> working as before. Thoughts?
>>>> 
>>>> Implemented it, seems to work fine.
>>> 
>>> The returned value would not include standby/reserved memory within
>>> z/VM. So this seems not to work.
>> 
>> Which value exactly are you referencing? diag 0xc returns two values.
>> One of them seems to do exactly what we need.
>> 
>> See
>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>> 
>> for my current Linux approach.
>> 
>>> Also: why do you want to change this
>> 
>> Which change exactly do you mean?
>> 
>> If we limit the value returned via SCLP to initial memory, we cannot
>> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then
>> purely optional.
> 
> Ok, now I see the context. Christian added my just to cc on this
> specific patch.

I tried to Cc you on all patches, but the mail bounced with an unknown address (maybe I messed up).

> So if I understand you correctly, then you want to use diag 260 in
> order to figure out how much memory is _potentially_ available for a
> guest?

Yes, exactly.

> 
> This does not fit to the current semantics, since diag 260 returns the
> address of the highest *currently* accessible address. That is: it
> does explicitly *not* include standby memory or anything else that
> might potentially be there.

The confusing part is that it talks about „addressable“ and not „accessible“. Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.

I agree that reusing that interface might not be what we want. It just seemed too easy to avoid creating something new :)

> 
> So you would need a different interface to tell the guest about your
> new hotplug memory interface. If sclp does not work, then maybe a new
> diagnose(?).
> 

Yes, I think a new Diagnose makes sense. I'll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-13 10:27                 ` David Hildenbrand
@ 2020-07-13 11:08                   ` Christian Borntraeger
  2020-07-15  9:42                     ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Borntraeger @ 2020-07-13 11:08 UTC (permalink / raw)
  To: David Hildenbrand, Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson

On 13.07.20 12:27, David Hildenbrand wrote:
> 
> 
>> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>:
>>
>> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep
>>>>>> working as before. Thoughts?
>>>>>
>>>>> Implemented it, seems to work fine.
>>>>
>>>> The returned value would not include standby/reserved memory within
>>>> z/VM. So this seems not to work.
>>>
>>> Which value exactly are you referencing? diag 0xc returns two values.
>>> One of them seems to do exactly what we need.
>>>
>>> See
>>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>>>
>>> for my current Linux approach.
>>>
>>>> Also: why do you want to change this
>>>
>>> Which change exactly do you mean?
>>>
>>> If we limit the value returned via SCLP to initial memory, we cannot
>>> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then
>>> purely optional.
>>
>> Ok, now I see the context. Christian added my just to cc on this
>> specific patch.
> 
> I tried to Cc you an all patches but the mail bounced with unknown address (maybe I messed up).
> 
>> So if I understand you correctly, then you want to use diag 260 in
>> order to figure out how much memory is _potentially_ available for a
>> guest?
> 
> Yes, exactly.
> 
>>
>> This does not fit to the current semantics, since diag 260 returns the
>> address of the highest *currently* accessible address. That is: it
>> does explicitly *not* include standby memory or anything else that
>> might potentially be there.
> 
> The confusing part is that it talks about „adressible“ and not „accessible“. Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.
> 
> I agree that reusing that interface might not be what we want. I just seemed too easy to avoid creating something new :)
> 
>>
>> So you would need a different interface to tell the guest about your
>> new hotplug memory interface. If sclp does not work, then maybe a new
>> diagnose(?).
>>
> 
> Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?> 

Wouldn't SCLP be the right thing to provide the max increment number (and thus the max memory address)?
And then (if I got the discussion right) use diag 260 to get the _current_ value.
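For illustration only (a hypothetical sketch, not the actual SCLP or diag260 interface; the names and the 1 MiB increment size are assumptions): the split suggested here would derive the maximum address from the storage increment size times the maximum increment number reported via SCLP, while diag 260 would report only the currently online increments:

```c
#include <stdint.h>

/* Assumed 1 MiB storage increments, as used elsewhere in this thread. */
#define INCREMENT_SIZE (1ULL << 20)

/* Maximum possible address: increment size * maximum increment number
 * (hypothetically reported via SCLP read-info). */
static uint64_t sclp_max_addr(uint64_t max_increment_nr)
{
    return INCREMENT_SIZE * max_increment_nr;
}

/* Current top of memory (hypothetically reported via diag 260):
 * only the increments that are online right now. */
static uint64_t diag260_current_addr(uint64_t online_increment_nr)
{
    return INCREMENT_SIZE * online_increment_nr;
}
```

With this split, the current value can grow at runtime but never exceeds the SCLP maximum.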



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-10  8:32     ` David Hildenbrand
  2020-07-10  8:41       ` David Hildenbrand
@ 2020-07-13 11:54       ` Christian Borntraeger
  2020-07-13 12:11         ` Cornelia Huck
  1 sibling, 1 reply; 39+ messages in thread
From: Christian Borntraeger @ 2020-07-13 11:54 UTC (permalink / raw)
  To: David Hildenbrand, Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Heiko Carstens,
	qemu-devel, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson



On 10.07.20 10:32, David Hildenbrand wrote:

>>> --- a/target/s390x/misc_helper.c
>>> +++ b/target/s390x/misc_helper.c
>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>      uint64_t r;
>>>  
>>>      switch (num) {
>>> +    case 0x260:
>>> +        qemu_mutex_lock_iothread();
>>> +        handle_diag_260(env, r1, r3, GETPC());
>>> +        qemu_mutex_unlock_iothread();
>>> +        r = 0;
>>> +        break;
>>>      case 0x500:
>>>          /* KVM hypercall */
>>>          qemu_mutex_lock_iothread();
>>
>> Looking at the doc referenced above, it seems that we treat every diag
>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>> to your patch; maybe I'm misreading.)
> 
> That's also a BUG in kvm then?
> 
> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> {
> ...
> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> ...
> }

diag 44 gives a PRIVOP on LPAR, so I think this is fine. 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-13 11:54       ` Christian Borntraeger
@ 2020-07-13 12:11         ` Cornelia Huck
  2020-07-13 12:13           ` Christian Borntraeger
  0 siblings, 1 reply; 39+ messages in thread
From: Cornelia Huck @ 2020-07-13 12:11 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Thomas Huth, Janosch Frank, David Hildenbrand,
	Michael S . Tsirkin, Heiko Carstens, qemu-devel, Halil Pasic,
	qemu-s390x, Claudio Imbrenda, Richard Henderson

On Mon, 13 Jul 2020 13:54:41 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 10.07.20 10:32, David Hildenbrand wrote:
> 
> >>> --- a/target/s390x/misc_helper.c
> >>> +++ b/target/s390x/misc_helper.c
> >>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
> >>>      uint64_t r;
> >>>  
> >>>      switch (num) {
> >>> +    case 0x260:
> >>> +        qemu_mutex_lock_iothread();
> >>> +        handle_diag_260(env, r1, r3, GETPC());
> >>> +        qemu_mutex_unlock_iothread();
> >>> +        r = 0;
> >>> +        break;
> >>>      case 0x500:
> >>>          /* KVM hypercall */
> >>>          qemu_mutex_lock_iothread();  
> >>
> >> Looking at the doc referenced above, it seems that we treat every diag
> >> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
> >> to your patch; maybe I'm misreading.)  
> > 
> > That's also a BUG in kvm then?
> > 
> > int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
> > {
> > ...
> > 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> > 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
> > ...
> > }  
> 
> diag 44 gives a PRIVOP on LPAR, so I think this is fine. 
> 

Seems like a bug/inconsistency in CP (or its documentation), then.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-13 12:11         ` Cornelia Huck
@ 2020-07-13 12:13           ` Christian Borntraeger
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Borntraeger @ 2020-07-13 12:13 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Thomas Huth, Janosch Frank, David Hildenbrand,
	Michael S . Tsirkin, Heiko Carstens, qemu-devel, Halil Pasic,
	qemu-s390x, Claudio Imbrenda, Richard Henderson



On 13.07.20 14:11, Cornelia Huck wrote:
> On Mon, 13 Jul 2020 13:54:41 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 10.07.20 10:32, David Hildenbrand wrote:
>>
>>>>> --- a/target/s390x/misc_helper.c
>>>>> +++ b/target/s390x/misc_helper.c
>>>>> @@ -116,6 +116,12 @@ void HELPER(diag)(CPUS390XState *env, uint32_t r1, uint32_t r3, uint32_t num)
>>>>>      uint64_t r;
>>>>>  
>>>>>      switch (num) {
>>>>> +    case 0x260:
>>>>> +        qemu_mutex_lock_iothread();
>>>>> +        handle_diag_260(env, r1, r3, GETPC());
>>>>> +        qemu_mutex_unlock_iothread();
>>>>> +        r = 0;
>>>>> +        break;
>>>>>      case 0x500:
>>>>>          /* KVM hypercall */
>>>>>          qemu_mutex_lock_iothread();  
>>>>
>>>> Looking at the doc referenced above, it seems that we treat every diag
>>>> call as privileged under tcg; but it seems that 0x44 isn't? (Unrelated
>>>> to your patch; maybe I'm misreading.)  
>>>
>>> That's also a BUG in kvm then?
>>>
>>> int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
>>> {
>>> ...
>>> 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
>>> 		return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
>>> ...
>>> }  
>>
>> diag 44 gives a PRIVOP on LPAR, so I think this is fine. 
>>
> 
> Seems like a bug/inconsistency in CP (or its documentation), then.

Yes. 

.globl main
main:
        diag    0,0,0x44        # diagnose 0x44: voluntary time-slice end
        svc     1               # system call 1: exit



also crashes under z/VM with an illegal op. 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-13 11:08                   ` Christian Borntraeger
@ 2020-07-15  9:42                     ` David Hildenbrand
  2020-07-15 10:43                       ` Heiko Carstens
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-15  9:42 UTC (permalink / raw)
  To: Christian Borntraeger, Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, qemu-s390x, Claudio Imbrenda,
	Richard Henderson

On 13.07.20 13:08, Christian Borntraeger wrote:
> On 13.07.20 12:27, David Hildenbrand wrote:
>>
>>
>>> Am 13.07.2020 um 11:12 schrieb Heiko Carstens <hca@linux.ibm.com>:
>>>
>>> On Fri, Jul 10, 2020 at 05:24:07PM +0200, David Hildenbrand wrote:
>>>>> On 10.07.20 17:18, Heiko Carstens wrote:
>>>>> On Fri, Jul 10, 2020 at 02:12:33PM +0200, David Hildenbrand wrote:
>>>>>>> Note: Reading about diag260 subcode 0xc, we could modify Linux to query
>>>>>>> the maximum possible pfn via diag260 0xc. Then, we maybe could avoid
>>>>>>> indicating maxram size via SCLP, and keep diag260-unaware OSs keep
>>>>>>> working as before. Thoughts?
>>>>>>
>>>>>> Implemented it, seems to work fine.
>>>>>
>>>>> The returned value would not include standby/reserved memory within
>>>>> z/VM. So this seems not to work.
>>>>
>>>> Which value exactly are you referencing? diag 0xc returns two values.
>>>> One of them seems to do exactly what we need.
>>>>
>>>> See
>>>> https://github.com/davidhildenbrand/linux/commit/a235f9fb20df7c04ae89bc0d134332d1a01842c7
>>>>
>>>> for my current Linux approach.
>>>>
>>>>> Also: why do you want to change this
>>>>
>>>> Which change exactly do you mean?
>>>>
>>>> If we limit the value returned via SCLP to initial memory, we cannot
>>>> break any guest (e.g., Linux pre 4.2, kvm-unit-tests). diag260 is then
>>>> purely optional.
>>>
>>> Ok, now I see the context. Christian added my just to cc on this
>>> specific patch.
>>
>> I tried to Cc you an all patches but the mail bounced with unknown address (maybe I messed up).
>>
>>> So if I understand you correctly, then you want to use diag 260 in
>>> order to figure out how much memory is _potentially_ available for a
>>> guest?
>>
>> Yes, exactly.
>>
>>>
>>> This does not fit to the current semantics, since diag 260 returns the
>>> address of the highest *currently* accessible address. That is: it
>>> does explicitly *not* include standby memory or anything else that
>>> might potentially be there.
>>
>> The confusing part is that it talks about „adressible“ and not „accessible“. Now that I understood the „DEFINE STORAGE ...“ example, it makes sense that the values change with reserved/standby memory.
>>
>> I agree that reusing that interface might not be what we want. I just seemed too easy to avoid creating something new :)
>>
>>>
>>> So you would need a different interface to tell the guest about your
>>> new hotplug memory interface. If sclp does not work, then maybe a new
>>> diagnose(?).
>>>
>>
>> Yes, I think a new Diagnose makes sense. I‘ll have a look next week to figure out which codes/subcodes we could use. @Christian @Conny any ideas/pointers?> 
> 
> Wouldnt sclp be the right thing to provide the max increment number? (and thus the max memory address)
> And then (when I got the discussion right) use diag 260 to get the _current_ value.

So, in summary, we want to indicate to the guest a memory region that
will be used to place memory devices ("device memory region"). The
region might have holes and the memory within this region might have
different semantics than ordinary system memory. Memory that belongs to
memory devices should only be detected+used if the guest OS has support
for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
(e.g., no virtio-mem driver) should not accidentally make use of such
memory.

We need a way to
a) Tell the guest about boot memory (currently ram_size)
b) Tell the guest about the maximum possible ram address, including
device memory. (We could also indicate the special "device memory
region" explicitly)


AFAIK, we have three options:


1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)

This is what this series (RFCv1) does.

Advantages:
- No need for a new diag. No need for memory sensing kernel changes.
Disadvantages:
- Older guests without support for diag260 (<v4.2, kvm-unit-tests) will
  assume all memory is accessible. Bad.
- The semantics of the value returned in ry via diag260(0xc) are somewhat
  unclear. Should we return the end address of the highest memory
  device? OTOH, an unmodified guest OS (without support for memory
  devices) should not have to care at all about any such memory.
- If we ever want to also support standby memory, we might be in
  trouble. (see below)

2. Indicate ram_size via SCLP, indicate device memory region
   (currently maxram_size) via new DIAG

Advantages:
- Unmodified guests won't use/sense memory belonging to memory devices.
- We can later have standby memory + memory devices co-exist.
Disadvantages:
- Need a new DIAG.

3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
   memory)

I did not look into the details, because -ENODOCUMENTATION. At least we
would run into some alignment issues (again, having to align
ram_size/maxram_size to storage increments - which would no longer be
1MB). We would run into issues later, trying to also support standby memory.



I guess 1) would mostly work; one just has to run a suitable guest
inside the VM. This is no different from running under z/VM, where
querying diag260 is required. The nice thing about 2) would be that
we can easily implement standby memory. Something like:

-m 2G,maxram_size=20G,standbyram_size=4G

[ 2G boot RAM ][ 4G standby RAM ][ 14G device memory ]
                                 ^ via SCLP maximum increment
                                                     ^ via new DIAG
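Purely as a sketch of the example above (the struct and names are made up for illustration, not QEMU code), the three boundaries would be:

```c
#include <stdint.h>

#define GiB (1ULL << 30)

struct layout {
    uint64_t boot_ram_end;     /* end of boot RAM */
    uint64_t standby_ram_end;  /* SCLP maximum increment would point here */
    uint64_t device_mem_end;   /* the new DIAG would report this */
};

static struct layout mem_layout(uint64_t ram, uint64_t standby,
                                uint64_t maxram)
{
    struct layout l;

    l.boot_ram_end = ram;              /* [0, ram) is boot RAM */
    l.standby_ram_end = ram + standby; /* standby RAM follows directly */
    l.device_mem_end = maxram;         /* device memory fills the rest */
    return l;
}
```

For -m 2G,maxram_size=20G,standbyram_size=4G this yields boundaries at 2G, 6G and 20G, matching the diagram.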

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15  9:42                     ` David Hildenbrand
@ 2020-07-15 10:43                       ` Heiko Carstens
  2020-07-15 11:21                         ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-15 10:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote:
> So, in summary, we want to indicate to the guest a memory region that
> will be used to place memory devices ("device memory region"). The
> region might have holes and the memory within this region might have
> different semantics than ordinary system memory. Memory that belongs to
> memory devices should only be detected+used if the guest OS has support
> for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
> (e.g., no virtio-mem driver) should not accidentally make use of such
> memory.
> 
> We need a way to
> a) Tell the guest about boot memory (currently ram_size)
> b) Tell the guest about the maximum possible ram address, including
> device memory. (We could also indicate the special "device memory
> region" explicitly)
> 
> AFAIK, we have three options:
> 
> 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)
> 
> This is what this series (RFCv1 does).
> 
> Advantages:
> - No need for a new diag. No need for memory sensing kernel changes.
> Disadvantages
> - Older guests without support for diag260 (<v4.2, kvm-unit-tests) will
>   assume all memory is accessible. Bad.

Why would old guests assume that?

At least in v4.1 the kernel will calculate the max address by using
increment size * increment number and then test if *each* increment is
available with tprot.
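As a rough sketch of that detection logic (hypothetical and simplified, with a mocked tprot; not the kernel's actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>

#define INCREMENT_SIZE (1ULL << 20)  /* assumed 1 MiB storage increments */

/* Mocked tprot: pretend only the first 2 GiB answer successfully. */
static bool tprot_ok(uint64_t addr)
{
    return addr < (2ULL << 30);
}

/* Pre-4.2 style detection: compute the maximum address from
 * increment size * increment number, then probe each increment. */
static uint64_t detect_memory(uint64_t increment_count)
{
    uint64_t max_addr = INCREMENT_SIZE * increment_count;
    uint64_t addr, online = 0;

    for (addr = 0; addr < max_addr; addr += INCREMENT_SIZE) {
        if (tprot_ok(addr)) {
            online += INCREMENT_SIZE;
        }
    }
    return online;
}
```

The point of the discussion below is that a successful probe alone is not enough for memory belonging to memory devices.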

> - The semantics of the value returned in ry via diag260(0xc) is somewhat
>   unclear. Should we return the end address of the highest memory
>   device? OTOH, an unmodified guest OS (without support for memory
>   devices) should not have to care at all about any such memory.

I'm confused. The kernel currently only uses diag260(0x10). How is
diag260(0xc) relevant here?

> 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
>    memory)
> 
> I did not look into the details, because -ENODOCUMENTATION. At least we
> would run into some alignment issues (again, having to align
> ram_size/maxram_size to storage increments - which would no longer be
> 1MB). We would run into issues later, trying to also support standby memory.

That doesn't make sense to me: either support memory hotplug via
sclp/standby memory, or with your new method. But trying to support
both... what's the use case?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 10:43                       ` Heiko Carstens
@ 2020-07-15 11:21                         ` David Hildenbrand
  2020-07-15 11:34                           ` Heiko Carstens
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-15 11:21 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 15.07.20 12:43, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 11:42:37AM +0200, David Hildenbrand wrote:
>> So, in summary, we want to indicate to the guest a memory region that
>> will be used to place memory devices ("device memory region"). The
>> region might have holes and the memory within this region might have
>> different semantics than ordinary system memory. Memory that belongs to
>> memory devices should only be detected+used if the guest OS has support
>> for them (e.g., virtio-mem, virtio-pmem, ...). An unmodified guest
>> (e.g., no virtio-mem driver) should not accidentally make use of such
>> memory.
>>
>> We need a way to
>> a) Tell the guest about boot memory (currently ram_size)
>> b) Tell the guest about the maximum possible ram address, including
>> device memory. (We could also indicate the special "device memory
>> region" explicitly)
>>
>> AFAIK, we have three options:
>>
>> 1. Indicate maxram_size via SCLP, indicate ram_size via diag260(0x10)
>>
>> This is what this series (RFCv1 does).
>>
>> Advantages:
>> - No need for a new diag. No need for memory sensing kernel changes.
>> Disadvantages
>> - Older guests without support for diag260 (<v4.2, kvm-unit-tests) will
>>   assume all memory is accessible. Bad.
> 
> Why would old guests assume that?
> 
> At least in v4.1 the kernel will calculate the max address by using
> increment size * increment number and then test if *each* increment is
> available with tprot.

Yes, we do the same in kvm-unit-tests. But it's not sufficient for
memory devices.

Just because a tprot succeeds (for memory belonging to a memory device)
does not mean the kernel should silently start to use that memory.

Note: memory devices are not just DIMMs that can be mapped to storage
increments. The memory might have completely different semantics; that's
why they are glued to a managing virtio device.

For example, a tprot might succeed on a memory region provided by
virtio-mem; this does not, however, mean that the memory can (and
should) be used by the guest.

> 
>> - The semantics of the value returned in ry via diag260(0xc) is somewhat
>>   unclear. Should we return the end address of the highest memory
>>   device? OTOH, an unmodified guest OS (without support for memory
>>   devices) should not have to care at all about any such memory.
> 
> I'm confused. The kernel currently only uses diag260(0x10). How is
> diag260(0xc) relevant here?

We have to implement diag260(0x10) if we implement diag260(0xc), no? Or
can we simply throw a specification exception?

> 
>> 3. Indicate maxram_size and ram_size via SCLP (using the SCLP standby
>>    memory)
>>
>> I did not look into the details, because -ENODOCUMENTATION. At least we
>> would run into some alignment issues (again, having to align
>> ram_size/maxram_size to storage increments - which would no longer be
>> 1MB). We would run into issues later, trying to also support standby memory.
> 
> That doesn't make sense to me: either support memory hotplug via
> sclp/standby memory, or with your new method. But trying to support
> both.. what's the use case?

Not sure if there is any; it just feels cleaner to me to separate the
architected (SCLP memory/reserved/standby) bits, which specify a
semantic when used via rnmax+tprot, from QEMU-specific memory ranges
that have special semantics.

virtio-mem is only one type of a virtio-based memory device. In the
future we might want to have virtio-pmem, but there might be more ...

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 11:21                         ` David Hildenbrand
@ 2020-07-15 11:34                           ` Heiko Carstens
  2020-07-15 11:42                             ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-15 11:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote:
> > At least in v4.1 the kernel will calculate the max address by using
> > increment size * increment number and then test if *each* increment is
> > available with tprot.
> 
> Yes, we do the same in kvm-unit-tests. But it's not sufficient for
> memory devices.
> 
> Just because a tprot succeed (for memory belonging to a memory device)
> does not mean the kernel should silently start to use that memory.
> 
> Note: memory devices are not just DIMMs that can be mapped to storage
> increments. The memory might have completely different semantics, that's
> why they are glued to a managing virtio device.
> 
> For example: a tprot might succeed on a memory region provided by
> virtio-mem, this does, however, not mean that the memory can (and
> should) be used by the guest.

So, are you saying that even at IPL time there might already be memory
devices attached to the system? And the kernel should _not_ treat them
as normal memory?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 11:34                           ` Heiko Carstens
@ 2020-07-15 11:42                             ` David Hildenbrand
  2020-07-15 16:14                               ` Heiko Carstens
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-15 11:42 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 15.07.20 13:34, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 01:21:06PM +0200, David Hildenbrand wrote:
>>> At least in v4.1 the kernel will calculate the max address by using
>>> increment size * increment number and then test if *each* increment is
>>> available with tprot.
>>
>> Yes, we do the same in kvm-unit-tests. But it's not sufficient for
>> memory devices.
>>
>> Just because a tprot succeed (for memory belonging to a memory device)
>> does not mean the kernel should silently start to use that memory.
>>
>> Note: memory devices are not just DIMMs that can be mapped to storage
>> increments. The memory might have completely different semantics, that's
>> why they are glued to a managing virtio device.
>>
>> For example: a tprot might succeed on a memory region provided by
>> virtio-mem, this does, however, not mean that the memory can (and
>> should) be used by the guest.
> 
> So, are you saying that even at IPL time there might already be memory
> devices attached to the system? And the kernel should _not_ treat them
> as normal memory?

Sorry if that was unclear. Yes, we can have such devices (including
memory areas) on a cold boot/reboot/kexec. In addition, they might pop
up at runtime (e.g., hotplugging a virtio-mem device). The device is in
charge of exposing that area and deciding what to do with it.

The kernel should never treat them as normal memory (IOW, system RAM).
Not during a cold boot, not during a reboot. The device driver is
responsible for deciding how to use that memory (e.g., add it as system
RAM), and which parts of that memory are actually valid to be used (even
if a tprot might succeed it might not be valid to use just yet - I guess
somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
want to use it like normal memory).

E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
exposed via the e820 map. The only trace that there might be *something*
now/in the future is indicated via ACPI SRAT tables. This takes
currently care of indicating the maximum possible PFN.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 11:42                             ` David Hildenbrand
@ 2020-07-15 16:14                               ` Heiko Carstens
  2020-07-15 17:38                                 ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-15 16:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
> > So, are you saying that even at IPL time there might already be memory
> > devices attached to the system? And the kernel should _not_ treat them
> > as normal memory?
> 
> Sorry if that was unclear. Yes, we can have such devices (including
> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
> charge of exposing that area and deciding what to do with it.
> 
> The kernel should never treat them as normal memory (IOW, system RAM).
> Not during a cold boot, not during a reboot. The device driver is
> responsible for deciding how to use that memory (e.g., add it as system
> RAM), and which parts of that memory are actually valid to be used (even
> if a tprot might succeed it might not be valid to use just yet - I guess
> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
> want to use it like normal memory).
> 
> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
> exposed via the e820 map. The only trace that there might be *something*
> now/in the future is indicated via ACPI SRAT tables. This currently takes
> care of indicating the maximum possible PFN.

Ok, but all of this needs to be documented somewhere. This raises a
couple of questions for me:

What happens on

- IPL Clear with this special memory? Will it be detached/away afterwards?
- IPL Normal? "Obviously" it must stay otherwise kdump would never see
  that memory.

And when you write it's up to the device driver what to do with that
memory: is there any documentation available what all of this is good
for? I would assume _most likely_ this extra memory is going to be
added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But
since it is not normal memory, like you say, I'm wondering how that is
supposed to work.

As far as I can tell there would be a lot of inconsistencies in
userspace interfaces which provide memory / zone information. Or I'm
not getting the point of all of this at all.

So please provide more information, or a pointer to documentation.



* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 16:14                               ` Heiko Carstens
@ 2020-07-15 17:38                                 ` David Hildenbrand
  2020-07-15 17:51                                   ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-15 17:38 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 15.07.20 18:14, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
>>> So, are you saying that even at IPL time there might already be memory
>>> devices attached to the system? And the kernel should _not_ treat them
>>> as normal memory?
>>
>> Sorry if that was unclear. Yes, we can have such devices (including
>> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
>> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
>> charge of exposing that area and deciding what to do with it.
>>
>> The kernel should never treat them as normal memory (IOW, system RAM).
>> Not during a cold boot, not during a reboot. The device driver is
>> responsible for deciding how to use that memory (e.g., add it as system
>> RAM), and which parts of that memory are actually valid to be used (even
>> if a tprot might succeed it might not be valid to use just yet - I guess
>> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
>> want to use it like normal memory).
>>
>> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
>> exposed via the e820 map. The only trace that there might be *something*
>> now/in the future is indicated via ACPI SRAT tables. This currently takes
>> care of indicating the maximum possible PFN.
> 
> Ok, but all of this needs to be documented somewhere. This raises a
> couple of questions for me:

I assume this mostly targets virtio-mem, because the semantics of
virtio-mem provided memory are extra-weird (in contrast to rather static
virtio-pmem, which is essentially just an emulated NVDIMM - a disk
mapped into physical memory).

Regarding documentation (some linked in the cover letter), so far I have
(generic/x86-64)

1. https://virtio-mem.gitlab.io/
2. virtio spec proposal [1]
3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
5. Linux cover letter [2]
6. KVM forum talk [3] [4]

As your questions go quite into technical detail, and I don't feel like
rewriting the doc here :) , I suggest looking at [2], 1, and 5.

> 
> What happens on

I'll stick to virtio-mem when answering regarding "special memory". As I
noted, there might be more in the future.

> 
> - IPL Clear with this special memory? Will it be detached/away afterwards?

A diag308(0x3) - load clear - will usually* zap all virtio-mem provided
memory (discard backing storage in the hypervisor) and logically turn
the state of all virtio-mem memory inside the device-assigned memory
region to "unplugged" - just as during a cold boot. The semantics of
"unplugged" blocks depend on the "usable region" (see the virtio-spec if
you're curious - the memory might still be accessible). Starting "fresh"
with all memory logically unplugged is part of the way virtio-mem works.

* there are corner cases while a VM is getting migrated, where we cannot
perform this (similar to us not being able to clear ordinary memory
during a load clear in QEMU while migrating). In this case, the memory
is left untouched.

> - IPL Normal? "Obviously" it must stay otherwise kdump would never see
>   that memory.

Only diag308(0x3) will mess with virtio-mem memory. For the other types
of resets, it's left untouched. So yes, "obviously" is correct :)

> 
> And when you write it's up to the device driver what to do with that
> memory: is there any documentation available what all of this is good
> for? I would assume _most likely_ this extra memory is going to be
> added to ZONE_MOVABLE _somehow_ so that it can be taken away also. But
> since it is not normal memory, like you say, I'm wondering how that is
> supposed to work.

For now

1. virtio-mem adds all (possible) aligned memory via add_memory() to Linux
2. Requires user space to online the memory blocks / configure a zone.

For 2., only ZONE_NORMAL really works right now and is recommended to
use. As you correctly note, that does not give you any guarantees how
much memory you can unplug again (e.g, fragmentation with unmovable
data), but is good enough for the first version (with focus on memory
hotplug, not unplug). ZONE_MOVABLE support is in the works.
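To illustrate step 2: onlining a memory block from user space boils down to writing to the block's sysfs "state" attribute under /sys/devices/system/memory/ (that path and the "online"/"online_movable" values are the real Linux interface; the helper below and its parameterized base directory are just an illustrative sketch so it can be exercised against a scratch directory instead of a live sysfs):

```c
#include <stdio.h>

/* Sketch: online one memory block by writing "online" to its sysfs
 * "state" file (on a real system: /sys/devices/system/memory/memoryN/state).
 * Writing "online" asks the kernel to add the block as system RAM with a
 * kernel-chosen zone; "online_movable" would request ZONE_MOVABLE instead.
 * The base path is a parameter purely so the function can be tested
 * against a temporary directory. */
static int online_memory_block(const char *sysfs_mem, int block_id)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "%s/memory%d/state", sysfs_mem, block_id);
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs("online", f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Tools like udev rules or daemons typically perform exactly this write automatically when new memory blocks appear.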

However, we cannot blindly expose all memory to ZONE_MOVABLE (zone
imbalances leading to crashes), and sometimes also don't want to (e.g.,
gigantic pages). Without spoilering too much, a mixture would be nice.

> 
> As far as I can tell there would be a lot of inconsistencies in
> userspace interfaces which provide memory / zone information. Or I'm
> not getting the point of all of this at all.

All memory/zone stats are properly fixed up (similar to ballooning). The
only visible inconsistency that *might* happen when unplugging memory /
hotplugging memory in chunks smaller than 256MB on s390x, is that the number of memory
block devices (/sys/devices/system/memory/...) might indicate more
memory than actually available (e.g., via lsmem).


[1]
https://lists.oasis-open.org/archives/virtio-comment/202006/msg00012.html
[2] https://lore.kernel.org/kvm/20200311171422.10484-1-david@redhat.com/
[3]
https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
[4] https://www.youtube.com/watch?v=H65FDUDPu9s

-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 17:38                                 ` David Hildenbrand
@ 2020-07-15 17:51                                   ` David Hildenbrand
  2020-07-20 14:43                                     ` Heiko Carstens
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-07-15 17:51 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 15.07.20 19:38, David Hildenbrand wrote:
> On 15.07.20 18:14, Heiko Carstens wrote:
>> On Wed, Jul 15, 2020 at 01:42:02PM +0200, David Hildenbrand wrote:
>>>> So, are you saying that even at IPL time there might already be memory
>>>> devices attached to the system? And the kernel should _not_ treat them
>>>> as normal memory?
>>>
>>> Sorry if that was unclear. Yes, we can have such devices (including
>>> memory areas) on a cold boot/reboot/kexec. In addition, they might pop
>>> up at runtime (e.g., hotplugging a virtio-mem device). The device is in
>>> charge of exposing that area and deciding what to do with it.
>>>
>>> The kernel should never treat them as normal memory (IOW, system RAM).
>>> Not during a cold boot, not during a reboot. The device driver is
>>> responsible for deciding how to use that memory (e.g., add it as system
>>> RAM), and which parts of that memory are actually valid to be used (even
>>> if a tprot might succeed it might not be valid to use just yet - I guess
>>> somewhat similar to doing a tprot on a dcss area - AFAIK, you also don't
>>> want to use it like normal memory).
>>>
>>> E.g., on x86-64, memory exposed via virtio-mem or virtio-pmem is never
>>> exposed via the e820 map. The only trace that there might be *something*
>>> now/in the future is indicated via ACPI SRAT tables. This currently takes
>>> care of indicating the maximum possible PFN.
>>
>> Ok, but all of this needs to be documented somewhere. This raises a
>> couple of questions for me:
> 
> I assume this mostly targets virtio-mem, because the semantics of
> virtio-mem provided memory are extra-weird (in contrast to rather static
> virtio-pmem, which is essentially just an emulated NVDIMM - a disk
> mapped into physical memory).
> 
> Regarding documentation (some linked in the cover letter), so far I have
> (generic/x86-64)
> 
> 1. https://virtio-mem.gitlab.io/
> 2. virtio spec proposal [1]
> 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> 5. Linux cover letter [2]
> 6. KVM forum talk [3] [4]
> 
> As your questions go quite into technical detail, and I don't feel like
> rewriting the doc here :) , I suggest looking at [2], 1, and 5.

Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
comparison to memory ballooning (and DIMM-based memory hotplug).


> [3]
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
> [4] https://www.youtube.com/watch?v=H65FDUDPu9s
> 


-- 
Thanks,

David / dhildenb




* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-15 17:51                                   ` David Hildenbrand
@ 2020-07-20 14:43                                     ` Heiko Carstens
  2020-07-20 15:43                                       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Heiko Carstens @ 2020-07-20 14:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
> > Regarding documentation (some linked in the cover letter), so far I have
> > (generic/x86-64)
> > 
> > 1. https://virtio-mem.gitlab.io/
> > 2. virtio spec proposal [1]
> > 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
> > 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
> > 5. Linux cover letter [2]
> > 6. KVM forum talk [3] [4]
> > 
> > As your questions go quite into technical detail, and I don't feel like
> > rewriting the doc here :) , I suggest looking at [2], 1, and 5.
> 
> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
> comparison to memory ballooning (and DIMM-based memory hotplug).

Ok, thanks for the pointers!

So I would go for what you suggested with option 2: provide a new
diagnose which tells the kernel where the memory device area is
(probably just start + size?), and leave all other interfaces alone.

This looks to me like by far the "cleanest" solution, which does not
add semantics to existing interfaces, where it is questionable if this
wouldn't cause problems in the future.



* Re: [PATCH RFC 2/5] s390x: implement diag260
  2020-07-20 14:43                                     ` Heiko Carstens
@ 2020-07-20 15:43                                       ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-07-20 15:43 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Huth, Janosch Frank, Michael S . Tsirkin, Cornelia Huck,
	qemu-devel, Halil Pasic, Christian Borntraeger, qemu-s390x,
	Claudio Imbrenda, Richard Henderson

On 20.07.20 16:43, Heiko Carstens wrote:
> On Wed, Jul 15, 2020 at 07:51:27PM +0200, David Hildenbrand wrote:
>>> Regarding documentation (some linked in the cover letter), so far I have
>>> (generic/x86-64)
>>>
>>> 1. https://virtio-mem.gitlab.io/
>>> 2. virtio spec proposal [1]
>>> 3. QEMU 910b25766b33 ("virtio-mem: Paravirtualized memory hot(un)plug")
>>> 4. Linux 5f1f79bbc9 ("virtio-mem: Paravirtualized memory hotplug")
>>> 5. Linux cover letter [2]
>>> 6. KVM forum talk [3] [4]
>>>
>>> As your questions go quite into technical detail, and I don't feel like
>>> rewriting the doc here :) , I suggest looking at [2], 1, and 5.
>>
>> Sorry, I suggest looking at [3] (not [2]) first. Includes pictures and a
>> comparison to memory ballooning (and DIMM-based memory hotplug).
> 
> Ok, thanks for the pointers!

Thanks for having a look. Once the s390x part is in good shape, I'll add
proper documentation (+spec updates regarding exact system reset
handling on s390x).

> 
> So I would go for what you suggested with option 2: provide a new
> diagnose which tells the kernel where the memory device area is
> (probably just start + size?), and leave all other interfaces alone.

Ha, that's precisely what I hacked together earlier today :) I have a new
diag500 ("KVM hypercall") subcode (4) to give the start+size of the area
reserved for memory devices. Will send a new RFC this week to showcase
what it would look like.
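As a purely hypothetical sketch (the RFC with the actual diag500 subcode 4 ABI had not been posted yet, so the structure name and fields below are assumptions, not the real interface): the guest would learn a single start+size pair describing the area reserved for memory devices, and early memory detection could then use it to avoid treating that range as normal system RAM:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation of what a diag500 subcode 4 query might
 * return: the guest-physical area reserved for memory devices. Names
 * and layout are illustrative only. */
struct mem_dev_area {
    uint64_t start; /* guest-physical start of the device memory area */
    uint64_t size;  /* size of the area in bytes */
};

/* True if addr lies within the reserved device memory area, i.e. a range
 * that must not be treated as normal system RAM by early detection code,
 * even if a tprot on it succeeds. Written to avoid overflow for areas
 * ending at the top of the address space. */
static bool addr_is_device_memory(const struct mem_dev_area *a, uint64_t addr)
{
    return addr >= a->start && addr - a->start < a->size;
}
```

The point of the single start+size design is exactly this simplicity: one containment check, no new semantics added to existing memory-detection interfaces.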

> 
> This looks to me like by far the "cleanest" solution, which does not
> add semantics to existing interfaces, where it is questionable if this
> wouldn't cause problems in the future.

Yes, same thoughts over here!

-- 
Thanks,

David / dhildenb




end of thread, other threads:[~2020-07-20 15:45 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-08 18:51 [PATCH RFC 0/5] s390x: initial support for virtio-mem David Hildenbrand
2020-07-08 18:51 ` [PATCH RFC 1/5] s390x: move setting of maximum ram size to machine init David Hildenbrand
2020-07-08 18:51 ` [PATCH RFC 2/5] s390x: implement diag260 David Hildenbrand
2020-07-09 10:37   ` Cornelia Huck
2020-07-09 17:54     ` David Hildenbrand
2020-07-10  8:32     ` David Hildenbrand
2020-07-10  8:41       ` David Hildenbrand
2020-07-10  9:19         ` Cornelia Huck
2020-07-13 11:54       ` Christian Borntraeger
2020-07-13 12:11         ` Cornelia Huck
2020-07-13 12:13           ` Christian Borntraeger
2020-07-09 10:52   ` Christian Borntraeger
2020-07-09 18:15     ` David Hildenbrand
2020-07-10  9:17       ` David Hildenbrand
2020-07-10 12:12         ` David Hildenbrand
2020-07-10 15:18           ` Heiko Carstens
2020-07-10 15:24             ` David Hildenbrand
2020-07-10 15:43               ` Heiko Carstens
2020-07-10 15:45                 ` David Hildenbrand
2020-07-13  9:12               ` Heiko Carstens
2020-07-13 10:27                 ` David Hildenbrand
2020-07-13 11:08                   ` Christian Borntraeger
2020-07-15  9:42                     ` David Hildenbrand
2020-07-15 10:43                       ` Heiko Carstens
2020-07-15 11:21                         ` David Hildenbrand
2020-07-15 11:34                           ` Heiko Carstens
2020-07-15 11:42                             ` David Hildenbrand
2020-07-15 16:14                               ` Heiko Carstens
2020-07-15 17:38                                 ` David Hildenbrand
2020-07-15 17:51                                   ` David Hildenbrand
2020-07-20 14:43                                     ` Heiko Carstens
2020-07-20 15:43                                       ` David Hildenbrand
2020-07-08 18:51 ` [PATCH RFC 3/5] s390x: prepare device memory address space David Hildenbrand
2020-07-09 10:59   ` Cornelia Huck
2020-07-10  7:46     ` David Hildenbrand
2020-07-08 18:51 ` [PATCH RFC 4/5] s390x: implement virtio-mem-ccw David Hildenbrand
2020-07-09  9:24   ` Cornelia Huck
2020-07-09  9:26     ` David Hildenbrand
2020-07-08 18:51 ` [PATCH RFC 5/5] s390x: initial support for virtio-mem David Hildenbrand
