* [PATCH v3 0/2] hw/arm/virt: Fix qemu booting failure on device-tree @ 2021-10-13 4:58 Gavin Shan 2021-10-13 4:58 ` [PATCH v3 1/2] numa: Require distance map when empty node exists Gavin Shan 2021-10-13 4:58 ` [PATCH v3 2/2] hw/arm/virt: Don't create device-tree node for empty NUMA node Gavin Shan 0 siblings, 2 replies; 14+ messages in thread From: Gavin Shan @ 2021-10-13 4:58 UTC (permalink / raw) To: qemu-arm Cc: robh, drjones, ehabkost, peter.maydell, qemu-devel, shan.gavin, imammedo The empty NUMA nodes, where no memory resides, are allowed on ARM64 virt platform. However, QEMU fails to boot because the device-tree can't be populated due to the conflicting device-tree node names of these empty NUMA nodes. For example, QEMU fails to boot and the following error message reported when below command line is used. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host \ -cpu host -smp 4,sockets=2,cores=2,threads=1 \ -m 1024M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=512M \ -object memory-backend-ram,id=mem1,size=512M \ -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ -numa node,nodeid=2 \ -numa node,nodeid=3 \ : qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS The lastest device-tree specification doesn't indicate how the device-tree nodes should be populated for these empty NUMA nodes. The proposed way to handle this is documented in linux kernel. The linux kernel patches have been acknoledged and merged to upstream pretty soon. https://lkml.org/lkml/2021/9/27/31 This series follows the suggestion, which is included in linux kernel patches, to resolve the QEMU boot failure issue: The corresponding device-tree nodes aren't created for the empty NUMA nodes, but their distance map matrix should be provided by users so that the empty NUMA node IDs can be parsed properly. Changelog ========= v3: * Require users to provide distance map matrix when empty NUMA node is included. The default distance map won't be generated any more (Igor/Drew) v2: * Amend PATCH[01/02]'s changelog to explain why we needn't switch to disable generating the default distance map (Drew) Gavin Shan (2): numa: Require distance map when empty node exists hw/arm/virt: Don't create device-tree node for empty NUMA node hw/arm/boot.c | 4 ++++ hw/core/machine.c | 4 ++++ hw/core/numa.c | 24 ++++++++++++++++++++++++ include/sysemu/numa.h | 1 + 4 files changed, 33 insertions(+) -- 2.23.0 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 4:58 [PATCH v3 0/2] hw/arm/virt: Fix qemu booting failure on device-tree Gavin Shan @ 2021-10-13 4:58 ` Gavin Shan 2021-10-13 6:35 ` Andrew Jones 2021-10-13 11:30 ` Igor Mammedov 2021-10-13 4:58 ` [PATCH v3 2/2] hw/arm/virt: Don't create device-tree node for empty NUMA node Gavin Shan 1 sibling, 2 replies; 14+ messages in thread From: Gavin Shan @ 2021-10-13 4:58 UTC (permalink / raw) To: qemu-arm Cc: robh, drjones, ehabkost, peter.maydell, qemu-devel, shan.gavin, imammedo The following option is used to specify the distance map. It's possible the option isn't provided by user. In this case, the distance map isn't populated and exposed to platform. On the other hand, the empty NUMA node, where no memory resides, is allowed on platforms like ARM64 virt. For these empty NUMA nodes, their corresponding device-tree nodes aren't populated, but their NUMA IDs should be included in the "/distance-map" device-tree node, so that kernel can probe them properly if device-tree is used. -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> This adds extra check after the machine is initialized, to ask for the distance map from user when empty nodes exist in device-tree. Signed-off-by: Gavin Shan <gshan@redhat.com> --- hw/core/machine.c | 4 ++++ hw/core/numa.c | 24 ++++++++++++++++++++++++ include/sysemu/numa.h | 1 + 3 files changed, 29 insertions(+) diff --git a/hw/core/machine.c b/hw/core/machine.c index b8d95eec32..c0765ad973 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); machine_class->init(machine); phase_advance(PHASE_MACHINE_INITIALIZED); + + if (machine->numa_state) { + numa_complete_validation(machine); + } } static NotifierList machine_init_done_notifiers = diff --git a/hw/core/numa.c b/hw/core/numa.c index 510d096a88..7404b7dd38 100644 --- a/hw/core/numa.c +++ b/hw/core/numa.c @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) } } +/* + * When device-tree is used by the machine, the empty node IDs should + * be included in the distance map. So we need provide pairs of distances + * in this case. + */ +void numa_complete_validation(MachineState *ms) +{ + NodeInfo *numa_info = ms->numa_state->nodes; + int nb_numa_nodes = ms->numa_state->num_nodes; + int i; + + if (!ms->fdt || ms->numa_state->have_numa_distance) { + return; + } + + for (i = 0; i < nb_numa_nodes; i++) { + if (numa_info[i].present && !numa_info[i].node_mem) { + error_report("Empty node %d found, please provide " + "distance map.", i); + exit(EXIT_FAILURE); + } + } +} + void parse_numa_opts(MachineState *ms) { qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h index 4173ef2afa..80f25ab830 100644 --- a/include/sysemu/numa.h +++ b/include/sysemu/numa.h @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, Error **errp); void numa_complete_configuration(MachineState *ms); +void numa_complete_validation(MachineState *ms); void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); extern QemuOptsList qemu_numa_opts; void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, -- 2.23.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 4:58 ` [PATCH v3 1/2] numa: Require distance map when empty node exists Gavin Shan @ 2021-10-13 6:35 ` Andrew Jones 2021-10-13 11:30 ` Igor Mammedov 1 sibling, 0 replies; 14+ messages in thread From: Andrew Jones @ 2021-10-13 6:35 UTC (permalink / raw) To: Gavin Shan Cc: robh, ehabkost, peter.maydell, qemu-devel, qemu-arm, shan.gavin, imammedo On Wed, Oct 13, 2021 at 12:58:04PM +0800, Gavin Shan wrote: > The following option is used to specify the distance map. It's > possible the option isn't provided by user. In this case, the > distance map isn't populated and exposed to platform. On the > other hand, the empty NUMA node, where no memory resides, is > allowed on platforms like ARM64 virt. For these empty NUMA > nodes, their corresponding device-tree nodes aren't populated, > but their NUMA IDs should be included in the "/distance-map" > device-tree node, so that kernel can probe them properly if > device-tree is used. > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > This adds extra check after the machine is initialized, to > ask for the distance map from user when empty nodes exist in > device-tree. > > Signed-off-by: Gavin Shan <gshan@redhat.com> > --- > hw/core/machine.c | 4 ++++ > hw/core/numa.c | 24 ++++++++++++++++++++++++ > include/sysemu/numa.h | 1 + > 3 files changed, 29 insertions(+) > Reviewed-by: Andrew Jones <drjones@redhat.com> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 4:58 ` [PATCH v3 1/2] numa: Require distance map when empty node exists Gavin Shan 2021-10-13 6:35 ` Andrew Jones @ 2021-10-13 11:30 ` Igor Mammedov 2021-10-13 11:35 ` Andrew Jones 1 sibling, 1 reply; 14+ messages in thread From: Igor Mammedov @ 2021-10-13 11:30 UTC (permalink / raw) To: Gavin Shan Cc: peter.maydell, drjones, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin On Wed, 13 Oct 2021 12:58:04 +0800 Gavin Shan <gshan@redhat.com> wrote: > The following option is used to specify the distance map. It's > possible the option isn't provided by user. In this case, the > distance map isn't populated and exposed to platform. On the > other hand, the empty NUMA node, where no memory resides, is > allowed on platforms like ARM64 virt. For these empty NUMA > nodes, their corresponding device-tree nodes aren't populated, > but their NUMA IDs should be included in the "/distance-map" > device-tree node, so that kernel can probe them properly if > device-tree is used. > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > This adds extra check after the machine is initialized, to > ask for the distance map from user when empty nodes exist in > device-tree. > > Signed-off-by: Gavin Shan <gshan@redhat.com> > --- > hw/core/machine.c | 4 ++++ > hw/core/numa.c | 24 ++++++++++++++++++++++++ > include/sysemu/numa.h | 1 + > 3 files changed, 29 insertions(+) > > diff --git a/hw/core/machine.c b/hw/core/machine.c > index b8d95eec32..c0765ad973 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > machine_class->init(machine); > phase_advance(PHASE_MACHINE_INITIALIZED); > + > + if (machine->numa_state) { > + numa_complete_validation(machine); > + } > } > > static NotifierList machine_init_done_notifiers = > diff --git a/hw/core/numa.c b/hw/core/numa.c > index 510d096a88..7404b7dd38 100644 > --- a/hw/core/numa.c > +++ b/hw/core/numa.c > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > } > } > > +/* > + * When device-tree is used by the machine, the empty node IDs should > + * be included in the distance map. So we need provide pairs of distances > + * in this case. > + */ > +void numa_complete_validation(MachineState *ms) > +{ > + NodeInfo *numa_info = ms->numa_state->nodes; > + int nb_numa_nodes = ms->numa_state->num_nodes; > + int i; > + > + if (!ms->fdt || ms->numa_state->have_numa_distance) { also skip check/limitation when VM is launched with ACPI enabled? > + return; > + } > + > + for (i = 0; i < nb_numa_nodes; i++) { > + if (numa_info[i].present && !numa_info[i].node_mem) { > + error_report("Empty node %d found, please provide " > + "distance map.", i); > + exit(EXIT_FAILURE); > + } > + } > +} > + > void parse_numa_opts(MachineState *ms) > { > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > index 4173ef2afa..80f25ab830 100644 > --- a/include/sysemu/numa.h > +++ b/include/sysemu/numa.h > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > Error **errp); > void numa_complete_configuration(MachineState *ms); > +void numa_complete_validation(MachineState *ms); > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > extern QemuOptsList qemu_numa_opts; > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 11:30 ` Igor Mammedov @ 2021-10-13 11:35 ` Andrew Jones 2021-10-13 11:53 ` Igor Mammedov 0 siblings, 1 reply; 14+ messages in thread From: Andrew Jones @ 2021-10-13 11:35 UTC (permalink / raw) To: Igor Mammedov Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > On Wed, 13 Oct 2021 12:58:04 +0800 > Gavin Shan <gshan@redhat.com> wrote: > > > The following option is used to specify the distance map. It's > > possible the option isn't provided by user. In this case, the > > distance map isn't populated and exposed to platform. On the > > other hand, the empty NUMA node, where no memory resides, is > > allowed on platforms like ARM64 virt. For these empty NUMA > > nodes, their corresponding device-tree nodes aren't populated, > > but their NUMA IDs should be included in the "/distance-map" > > device-tree node, so that kernel can probe them properly if > > device-tree is used. > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > This adds extra check after the machine is initialized, to > > ask for the distance map from user when empty nodes exist in > > device-tree. > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > --- > > hw/core/machine.c | 4 ++++ > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > include/sysemu/numa.h | 1 + > > 3 files changed, 29 insertions(+) > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > index b8d95eec32..c0765ad973 100644 > > --- a/hw/core/machine.c > > +++ b/hw/core/machine.c > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > machine_class->init(machine); > > phase_advance(PHASE_MACHINE_INITIALIZED); > > + > > + if (machine->numa_state) { > > + numa_complete_validation(machine); > > + } > > } > > > > static NotifierList machine_init_done_notifiers = > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > index 510d096a88..7404b7dd38 100644 > > --- a/hw/core/numa.c > > +++ b/hw/core/numa.c > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > } > > } > > > > +/* > > + * When device-tree is used by the machine, the empty node IDs should > > + * be included in the distance map. So we need provide pairs of distances > > + * in this case. > > + */ > > +void numa_complete_validation(MachineState *ms) > > +{ > > + NodeInfo *numa_info = ms->numa_state->nodes; > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > + int i; > > + > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > also skip check/limitation when VM is launched with ACPI enabled? Even with ACPI enabled we generate a DT and would like that DT to be as complete as possible. Of course we should generate a SLIT table with the distance information the user provides on the command line in order to satisfy the check, and we will, since we already have that code in place. Thanks, drew > > > + return; > > + } > > + > > + for (i = 0; i < nb_numa_nodes; i++) { > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > + error_report("Empty node %d found, please provide " > > + "distance map.", i); > > + exit(EXIT_FAILURE); > > + } > > + } > > +} > > + > > void parse_numa_opts(MachineState *ms) > > { > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > index 4173ef2afa..80f25ab830 100644 > > --- a/include/sysemu/numa.h > > +++ b/include/sysemu/numa.h > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > Error **errp); > > void numa_complete_configuration(MachineState *ms); > > +void numa_complete_validation(MachineState *ms); > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > extern QemuOptsList qemu_numa_opts; > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 11:35 ` Andrew Jones @ 2021-10-13 11:53 ` Igor Mammedov 2021-10-13 12:11 ` Andrew Jones 0 siblings, 1 reply; 14+ messages in thread From: Igor Mammedov @ 2021-10-13 11:53 UTC (permalink / raw) To: Andrew Jones Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin On Wed, 13 Oct 2021 13:35:44 +0200 Andrew Jones <drjones@redhat.com> wrote: > On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > > On Wed, 13 Oct 2021 12:58:04 +0800 > > Gavin Shan <gshan@redhat.com> wrote: > > > > > The following option is used to specify the distance map. It's > > > possible the option isn't provided by user. In this case, the > > > distance map isn't populated and exposed to platform. On the > > > other hand, the empty NUMA node, where no memory resides, is > > > allowed on platforms like ARM64 virt. For these empty NUMA > > > nodes, their corresponding device-tree nodes aren't populated, > > > but their NUMA IDs should be included in the "/distance-map" > > > device-tree node, so that kernel can probe them properly if > > > device-tree is used. > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > This adds extra check after the machine is initialized, to > > > ask for the distance map from user when empty nodes exist in > > > device-tree. > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > > --- > > > hw/core/machine.c | 4 ++++ > > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > > include/sysemu/numa.h | 1 + > > > 3 files changed, 29 insertions(+) > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > index b8d95eec32..c0765ad973 100644 > > > --- a/hw/core/machine.c > > > +++ b/hw/core/machine.c > > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > > machine_class->init(machine); > > > phase_advance(PHASE_MACHINE_INITIALIZED); > > > + > > > + if (machine->numa_state) { > > > + numa_complete_validation(machine); > > > + } > > > } > > > > > > static NotifierList machine_init_done_notifiers = > > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > > index 510d096a88..7404b7dd38 100644 > > > --- a/hw/core/numa.c > > > +++ b/hw/core/numa.c > > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > > } > > > } > > > > > > +/* > > > + * When device-tree is used by the machine, the empty node IDs should > > > + * be included in the distance map. So we need provide pairs of distances > > > + * in this case. > > > + */ > > > +void numa_complete_validation(MachineState *ms) > > > +{ > > > + NodeInfo *numa_info = ms->numa_state->nodes; > > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > > + int i; > > > + > > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > > > also skip check/limitation when VM is launched with ACPI enabled? > > Even with ACPI enabled we generate a DT and would like that DT to be as > complete as possible. Of course we should generate a SLIT table with Guest will work just fine without distance map as SRAT describes all numa nodes. You are forcing VM to have SLIT just for the sake of 'completeness' that's practically unused. I'm still unsure about pushing all of this in generic numa code, as this will be used only by ARM for now. It's better to keep it ARM specific, and when RISCV machine will start using this, it could be moved to generic code. > the distance information the user provides on the command line in order > to satisfy the check, and we will, since we already have that code in > place. > > Thanks, > drew > > > > > > + return; > > > + } > > > + > > > + for (i = 0; i < nb_numa_nodes; i++) { > > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > > + error_report("Empty node %d found, please provide " > > > + "distance map.", i); > > > + exit(EXIT_FAILURE); > > > + } > > > + } > > > +} > > > + > > > void parse_numa_opts(MachineState *ms) > > > { > > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > > index 4173ef2afa..80f25ab830 100644 > > > --- a/include/sysemu/numa.h > > > +++ b/include/sysemu/numa.h > > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > > Error **errp); > > > void numa_complete_configuration(MachineState *ms); > > > +void numa_complete_validation(MachineState *ms); > > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > > extern QemuOptsList qemu_numa_opts; > > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 11:53 ` Igor Mammedov @ 2021-10-13 12:11 ` Andrew Jones 2021-10-13 12:28 ` Andrew Jones 0 siblings, 1 reply; 14+ messages in thread From: Andrew Jones @ 2021-10-13 12:11 UTC (permalink / raw) To: Igor Mammedov Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin On Wed, Oct 13, 2021 at 01:53:46PM +0200, Igor Mammedov wrote: > On Wed, 13 Oct 2021 13:35:44 +0200 > Andrew Jones <drjones@redhat.com> wrote: > > > On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > > > On Wed, 13 Oct 2021 12:58:04 +0800 > > > Gavin Shan <gshan@redhat.com> wrote: > > > > > > > The following option is used to specify the distance map. It's > > > > possible the option isn't provided by user. In this case, the > > > > distance map isn't populated and exposed to platform. On the > > > > other hand, the empty NUMA node, where no memory resides, is > > > > allowed on platforms like ARM64 virt. For these empty NUMA > > > > nodes, their corresponding device-tree nodes aren't populated, > > > > but their NUMA IDs should be included in the "/distance-map" > > > > device-tree node, so that kernel can probe them properly if > > > > device-tree is used. > > > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > > > This adds extra check after the machine is initialized, to > > > > ask for the distance map from user when empty nodes exist in > > > > device-tree. > > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > > > --- > > > > hw/core/machine.c | 4 ++++ > > > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > > > include/sysemu/numa.h | 1 + > > > > 3 files changed, 29 insertions(+) > > > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > > index b8d95eec32..c0765ad973 100644 > > > > --- a/hw/core/machine.c > > > > +++ b/hw/core/machine.c > > > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > > > machine_class->init(machine); > > > > phase_advance(PHASE_MACHINE_INITIALIZED); > > > > + > > > > + if (machine->numa_state) { > > > > + numa_complete_validation(machine); > > > > + } > > > > } > > > > > > > > static NotifierList machine_init_done_notifiers = > > > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > > > index 510d096a88..7404b7dd38 100644 > > > > --- a/hw/core/numa.c > > > > +++ b/hw/core/numa.c > > > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > > > } > > > > } > > > > > > > > +/* > > > > + * When device-tree is used by the machine, the empty node IDs should > > > > + * be included in the distance map. So we need provide pairs of distances > > > > + * in this case. > > > > + */ > > > > +void numa_complete_validation(MachineState *ms) > > > > +{ > > > > + NodeInfo *numa_info = ms->numa_state->nodes; > > > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > > > + int i; > > > > + > > > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > > > > > also skip check/limitation when VM is launched with ACPI enabled? > > > > Even with ACPI enabled we generate a DT and would like that DT to be as > > complete as possible. Of course we should generate a SLIT table with > > Guest will work just fine without distance map as SRAT describes > all numa nodes. > You are forcing VM to have SLIT just for the sake of 'completeness' > that's practically unused. > > I'm still unsure about pushing all of this in generic numa code, > as this will be used only by ARM for now. It's better to keep it > ARM specific, and when RISCV machine will start using this, it > could be moved to generic code. I don't see a problem in providing this DT node / ACPI table to guests with empty NUMA nodes. I don't even see a problem with providing the distance information unconditionally. Can you explain why we should prefer not to provide optional HW descriptions? Thanks, drew > > > the distance information the user provides on the command line in order > > to satisfy the check, and we will, since we already have that code in > > place. > > > > > > Thanks, > > drew > > > > > > > > > + return; > > > > + } > > > > + > > > > + for (i = 0; i < nb_numa_nodes; i++) { > > > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > > > + error_report("Empty node %d found, please provide " > > > > + "distance map.", i); > > > > + exit(EXIT_FAILURE); > > > > + } > > > > + } > > > > +} > > > > + > > > > void parse_numa_opts(MachineState *ms) > > > > { > > > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > > > index 4173ef2afa..80f25ab830 100644 > > > > --- a/include/sysemu/numa.h > > > > +++ b/include/sysemu/numa.h > > > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > > > Error **errp); > > > > void numa_complete_configuration(MachineState *ms); > > > > +void numa_complete_validation(MachineState *ms); > > > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > > > extern QemuOptsList qemu_numa_opts; > > > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 12:11 ` Andrew Jones @ 2021-10-13 12:28 ` Andrew Jones 2021-10-14 15:14 ` Igor Mammedov 0 siblings, 1 reply; 14+ messages in thread From: Andrew Jones @ 2021-10-13 12:28 UTC (permalink / raw) To: Igor Mammedov Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin On Wed, Oct 13, 2021 at 02:11:25PM +0200, Andrew Jones wrote: > On Wed, Oct 13, 2021 at 01:53:46PM +0200, Igor Mammedov wrote: > > On Wed, 13 Oct 2021 13:35:44 +0200 > > Andrew Jones <drjones@redhat.com> wrote: > > > > > On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > > > > On Wed, 13 Oct 2021 12:58:04 +0800 > > > > Gavin Shan <gshan@redhat.com> wrote: > > > > > > > > > The following option is used to specify the distance map. It's > > > > > possible the option isn't provided by user. In this case, the > > > > > distance map isn't populated and exposed to platform. On the > > > > > other hand, the empty NUMA node, where no memory resides, is > > > > > allowed on platforms like ARM64 virt. For these empty NUMA > > > > > nodes, their corresponding device-tree nodes aren't populated, > > > > > but their NUMA IDs should be included in the "/distance-map" > > > > > device-tree node, so that kernel can probe them properly if > > > > > device-tree is used. > > > > > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > > > > > This adds extra check after the machine is initialized, to > > > > > ask for the distance map from user when empty nodes exist in > > > > > device-tree. > > > > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > > > > --- > > > > > hw/core/machine.c | 4 ++++ > > > > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > > > > include/sysemu/numa.h | 1 + > > > > > 3 files changed, 29 insertions(+) > > > > > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > > > index b8d95eec32..c0765ad973 100644 > > > > > --- a/hw/core/machine.c > > > > > +++ b/hw/core/machine.c > > > > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > > > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > > > > machine_class->init(machine); > > > > > phase_advance(PHASE_MACHINE_INITIALIZED); > > > > > + > > > > > + if (machine->numa_state) { > > > > > + numa_complete_validation(machine); > > > > > + } > > > > > } > > > > > > > > > > static NotifierList machine_init_done_notifiers = > > > > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > > > > index 510d096a88..7404b7dd38 100644 > > > > > --- a/hw/core/numa.c > > > > > +++ b/hw/core/numa.c > > > > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > > > > } > > > > > } > > > > > > > > > > +/* > > > > > + * When device-tree is used by the machine, the empty node IDs should > > > > > + * be included in the distance map. So we need provide pairs of distances > > > > > + * in this case. > > > > > + */ > > > > > +void numa_complete_validation(MachineState *ms) > > > > > +{ > > > > > + NodeInfo *numa_info = ms->numa_state->nodes; > > > > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > > > > + int i; > > > > > + > > > > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > > > > > > > also skip check/limitation when VM is launched with ACPI enabled? On second thought, I guess it's a good idea to not error out when ACPI is enabled. There's no point in burdening the ACPI user. > > > > > > Even with ACPI enabled we generate a DT and would like that DT to be as > > > complete as possible. Of course we should generate a SLIT table with > > > > Guest will work just fine without distance map as SRAT describes > > all numa nodes. > > You are forcing VM to have SLIT just for the sake of 'completeness' > > that's practically unused. > > > > I'm still unsure about pushing all of this in generic numa code, > > as this will be used only by ARM for now. It's better to keep it > > ARM specific, and when RISCV machine will start using this, it > > could be moved to generic code. I think RISCV could start using it now. Linux shares the mem/numa DT parsing code between Arm and RISCV. If RISCV QEMU wanted to start exposing NUMA nodes, then they might as well support empty ones from the start. > > I don't see a problem in providing this DT node / ACPI table to guests > with empty NUMA nodes. I don't even see a problem with providing the > distance information unconditionally. Can you explain why we should > prefer not to provide optional HW descriptions? I'm still curious why we should be so reluctant to add HW descriptions. I agree we should be reluctant to error out, though. So, when ACPI is enabled, let's skip the enforcement of the SLIT table generation, even if there are empty nodes, as you suggest. Thanks, drew > > Thanks, > drew > > > > > > the distance information the user provides on the command line in order > > > to satisfy the check, and we will, since we already have that code in > > > place. > > > > > > > > > > Thanks, > > > drew > > > > > > > > > > > > + return; > > > > > + } > > > > > + > > > > > + for (i = 0; i < nb_numa_nodes; i++) { > > > > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > > > > + error_report("Empty node %d found, please provide " > > > > > + "distance map.", i); > > > > > + exit(EXIT_FAILURE); > > > > > + } > > > > > + } > > > > > +} > > > > > + > > > > > void parse_numa_opts(MachineState *ms) > > > > > { > > > > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > > > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > > > > index 4173ef2afa..80f25ab830 100644 > > > > > --- a/include/sysemu/numa.h > > > > > +++ b/include/sysemu/numa.h > > > > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > > > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > > > > Error **errp); > > > > > void numa_complete_configuration(MachineState *ms); > > > > > +void numa_complete_validation(MachineState *ms); > > > > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > > > > extern QemuOptsList qemu_numa_opts; > > > > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > > > > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-13 12:28 ` Andrew Jones @ 2021-10-14 15:14 ` Igor Mammedov 2021-10-14 15:36 ` Andrew Jones 0 siblings, 1 reply; 14+ messages in thread From: Igor Mammedov @ 2021-10-14 15:14 UTC (permalink / raw) To: Andrew Jones Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin, qemu-riscv On Wed, 13 Oct 2021 14:28:40 +0200 Andrew Jones <drjones@redhat.com> wrote: > On Wed, Oct 13, 2021 at 02:11:25PM +0200, Andrew Jones wrote: > > On Wed, Oct 13, 2021 at 01:53:46PM +0200, Igor Mammedov wrote: > > > On Wed, 13 Oct 2021 13:35:44 +0200 > > > Andrew Jones <drjones@redhat.com> wrote: > > > > > > > On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > > > > > On Wed, 13 Oct 2021 12:58:04 +0800 > > > > > Gavin Shan <gshan@redhat.com> wrote: > > > > > > > > > > > The following option is used to specify the distance map. It's > > > > > > possible the option isn't provided by user. In this case, the > > > > > > distance map isn't populated and exposed to platform. On the > > > > > > other hand, the empty NUMA node, where no memory resides, is > > > > > > allowed on platforms like ARM64 virt. For these empty NUMA > > > > > > nodes, their corresponding device-tree nodes aren't populated, > > > > > > but their NUMA IDs should be included in the "/distance-map" > > > > > > device-tree node, so that kernel can probe them properly if > > > > > > device-tree is used. > > > > > > > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > > > > > > > This adds extra check after the machine is initialized, to > > > > > > ask for the distance map from user when empty nodes exist in > > > > > > device-tree. > > > > > > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > > > > > --- > > > > > > hw/core/machine.c | 4 ++++ > > > > > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > > > > > include/sysemu/numa.h | 1 + > > > > > > 3 files changed, 29 insertions(+) > > > > > > > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > > > > index b8d95eec32..c0765ad973 100644 > > > > > > --- a/hw/core/machine.c > > > > > > +++ b/hw/core/machine.c > > > > > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > > > > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > > > > > machine_class->init(machine); > > > > > > phase_advance(PHASE_MACHINE_INITIALIZED); > > > > > > + > > > > > > + if (machine->numa_state) { > > > > > > + numa_complete_validation(machine); > > > > > > + } > > > > > > } > > > > > > > > > > > > static NotifierList machine_init_done_notifiers = > > > > > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > > > > > index 510d096a88..7404b7dd38 100644 > > > > > > --- a/hw/core/numa.c > > > > > > +++ b/hw/core/numa.c > > > > > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > > > > > } > > > > > > } > > > > > > > > > > > > +/* > > > > > > + * When device-tree is used by the machine, the empty node IDs should > > > > > > + * be included in the distance map. So we need provide pairs of distances > > > > > > + * in this case. > > > > > > + */ > > > > > > +void numa_complete_validation(MachineState *ms) > > > > > > +{ > > > > > > + NodeInfo *numa_info = ms->numa_state->nodes; > > > > > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > > > > > + int i; > > > > > > + > > > > > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > > > > > > > > > also skip check/limitation when VM is launched with ACPI enabled? > > On second thought, I guess it's a good idea to not error out when ACPI is > enabled. There's no point in burdening the ACPI user. > > > > > > > > > Even with ACPI enabled we generate a DT and would like that DT to be as > > > > complete as possible. Of course we should generate a SLIT table with > > > > > > Guest will work just fine without distance map as SRAT describes > > > all numa nodes. > > > You are forcing VM to have SLIT just for the sake of 'completeness' > > > that's practically unused. > > > > > > I'm still unsure about pushing all of this in generic numa code, > > > as this will be used only by ARM for now. It's better to keep it > > > ARM specific, and when RISCV machine will start using this, it > > > could be moved to generic code. > > I think RISCV could start using it now. Linux shares the mem/numa DT > parsing code between Arm and RISCV. If RISCV QEMU wanted to start > exposing NUMA nodes, then they might as well support empty ones from > the start. When RISCV will start using memory-less nodes, we can move this check to generic code. So far 2/2 takes care of ARM only, as result I don't see a good reason to generalize it now. (CCing RISCV folks to see if there is any plans to impl. that) > > I don't see a problem in providing this DT node / ACPI table to guests > > with empty NUMA nodes. I don't even see a problem with providing the > > distance information unconditionally. Can you explain why we should > > prefer not to provide optional HW descriptions? > > I'm still curious why we should be so reluctant to add HW descriptions. > I agree we should be reluctant to error out, though. If distance map were the only way to figure out present node-ids, I wouldn't mind making it non-optional. However guest can already get node-ids from mem/cpu/pci nodes with current device-tree that QEMU generates and forcing optional distance-map (with duplicate info) on user creates non must-have CLI requirements/inconvenience that will eventually propagate to higher layers, I'd rather avoid that if possible. So question is if an idea of fetching numa-ids from cpu nodes in addition to memory nodes (which should cover QEMU needs) have been considered and why it didn't work out? > So, when ACPI is > enabled, let's skip the enforcement of the SLIT table generation, even > if there are empty nodes, as you suggest. > > Thanks, > drew > > > > > Thanks, > > drew > > > > > > > > > the distance information the user provides on the command line in order > > > > to satisfy the check, and we will, since we already have that code in > > > > place. > > > > > > > > > > > > > > Thanks, > > > > drew > > > > > > > > > > > > > > > + return; > > > > > > + } > > > > > > + > > > > > > + for (i = 0; i < nb_numa_nodes; i++) { > > > > > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > > > > > + error_report("Empty node %d found, please provide " > > > > > > + "distance map.", i); > > > > > > + exit(EXIT_FAILURE); > > > > > > + } > > > > > > + } > > > > > > +} > > > > > > + > > > > > > void parse_numa_opts(MachineState *ms) > > > > > > { > > > > > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > > > > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > > > > > index 4173ef2afa..80f25ab830 100644 > > > > > > --- a/include/sysemu/numa.h > > > > > > +++ b/include/sysemu/numa.h > > > > > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > > > > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > > > > > Error **errp); > > > > > > void numa_complete_configuration(MachineState *ms); > > > > > > +void numa_complete_validation(MachineState *ms); > > > > > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > > > > > extern QemuOptsList qemu_numa_opts; > > > > > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-14 15:14 ` Igor Mammedov @ 2021-10-14 15:36 ` Andrew Jones 2021-10-15 8:22 ` Gavin Shan 0 siblings, 1 reply; 14+ messages in thread From: Andrew Jones @ 2021-10-14 15:36 UTC (permalink / raw) To: Igor Mammedov Cc: peter.maydell, Gavin Shan, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin, qemu-riscv On Thu, Oct 14, 2021 at 05:14:17PM +0200, Igor Mammedov wrote: > On Wed, 13 Oct 2021 14:28:40 +0200 > Andrew Jones <drjones@redhat.com> wrote: > > > On Wed, Oct 13, 2021 at 02:11:25PM +0200, Andrew Jones wrote: > > > On Wed, Oct 13, 2021 at 01:53:46PM +0200, Igor Mammedov wrote: > > > > On Wed, 13 Oct 2021 13:35:44 +0200 > > > > Andrew Jones <drjones@redhat.com> wrote: > > > > > > > > > On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: > > > > > > On Wed, 13 Oct 2021 12:58:04 +0800 > > > > > > Gavin Shan <gshan@redhat.com> wrote: > > > > > > > > > > > > > The following option is used to specify the distance map. It's > > > > > > > possible the option isn't provided by user. In this case, the > > > > > > > distance map isn't populated and exposed to platform. On the > > > > > > > other hand, the empty NUMA node, where no memory resides, is > > > > > > > allowed on platforms like ARM64 virt. For these empty NUMA > > > > > > > nodes, their corresponding device-tree nodes aren't populated, > > > > > > > but their NUMA IDs should be included in the "/distance-map" > > > > > > > device-tree node, so that kernel can probe them properly if > > > > > > > device-tree is used. > > > > > > > > > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > > > > > > > > > This adds extra check after the machine is initialized, to > > > > > > > ask for the distance map from user when empty nodes exist in > > > > > > > device-tree. > > > > > > > > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > > > > > > --- > > > > > > > hw/core/machine.c | 4 ++++ > > > > > > > hw/core/numa.c | 24 ++++++++++++++++++++++++ > > > > > > > include/sysemu/numa.h | 1 + > > > > > > > 3 files changed, 29 insertions(+) > > > > > > > > > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > > > > > index b8d95eec32..c0765ad973 100644 > > > > > > > --- a/hw/core/machine.c > > > > > > > +++ b/hw/core/machine.c > > > > > > > @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) > > > > > > > accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); > > > > > > > machine_class->init(machine); > > > > > > > phase_advance(PHASE_MACHINE_INITIALIZED); > > > > > > > + > > > > > > > + if (machine->numa_state) { > > > > > > > + numa_complete_validation(machine); > > > > > > > + } > > > > > > > } > > > > > > > > > > > > > > static NotifierList machine_init_done_notifiers = > > > > > > > diff --git a/hw/core/numa.c b/hw/core/numa.c > > > > > > > index 510d096a88..7404b7dd38 100644 > > > > > > > --- a/hw/core/numa.c > > > > > > > +++ b/hw/core/numa.c > > > > > > > @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > +/* > > > > > > > + * When device-tree is used by the machine, the empty node IDs should > > > > > > > + * be included in the distance map. So we need provide pairs of distances > > > > > > > + * in this case. > > > > > > > + */ > > > > > > > +void numa_complete_validation(MachineState *ms) > > > > > > > +{ > > > > > > > + NodeInfo *numa_info = ms->numa_state->nodes; > > > > > > > + int nb_numa_nodes = ms->numa_state->num_nodes; > > > > > > > + int i; > > > > > > > + > > > > > > > + if (!ms->fdt || ms->numa_state->have_numa_distance) { > > > > > > > > > > > > also skip check/limitation when VM is launched with ACPI enabled? > > > > On second thought, I guess it's a good idea to not error out when ACPI is > > enabled. There's no point in burdening the ACPI user. > > > > > > > > > > > > Even with ACPI enabled we generate a DT and would like that DT to be as > > > > > complete as possible. Of course we should generate a SLIT table with > > > > > > > > Guest will work just fine without distance map as SRAT describes > > > > all numa nodes. > > > > You are forcing VM to have SLIT just for the sake of 'completeness' > > > > that's practically unused. > > > > > > > > I'm still unsure about pushing all of this in generic numa code, > > > > as this will be used only by ARM for now. It's better to keep it > > > > ARM specific, and when RISCV machine will start using this, it > > > > could be moved to generic code. > > > > I think RISCV could start using it now. Linux shares the mem/numa DT > > parsing code between Arm and RISCV. If RISCV QEMU wanted to start > > exposing NUMA nodes, then they might as well support empty ones from > > the start. > > When RISCV will start using memory-less nodes, we can move this check to > generic code. > So far 2/2 takes care of ARM only, as result I don't see > a good reason to generalize it now. (CCing RISCV folks to see if > there is any plans to impl. that) > > > > > I don't see a problem in providing this DT node / ACPI table to guests > > > with empty NUMA nodes. I don't even see a problem with providing the > > > distance information unconditionally. Can you explain why we should > > > prefer not to provide optional HW descriptions? > > > > I'm still curious why we should be so reluctant to add HW descriptions. > > I agree we should be reluctant to error out, though. > > If distance map were the only way to figure out present node-ids, > I wouldn't mind making it non-optional. > > However guest can already get node-ids from mem/cpu/pci nodes > with current device-tree that QEMU generates and > forcing optional distance-map (with duplicate info) on user > creates non must-have CLI requirements/inconvenience that > will eventually propagate to higher layers, I'd rather avoid > that if possible. > > So question is if an idea of fetching numa-ids from cpu nodes > in addition to memory nodes (which should cover QEMU needs) > have been considered and why it didn't work out? Thinking about it some more. I think we only need patch 2/2 of this series. We should double check, but if when Linux generates its default distance-map it already generates a map using the numa-ids in the cpu nodes of the DT, then I guess the resulting Linux generated distance-map will be sufficient to represent the empty NUMA nodes and could be used for DT memory hotplug if Linux ever tried to implement that. Gavin, can you check if we can just drop patch 1/2 and then when booting a DT guest with empty numa nodes we get the distance map we wanted to enforce anyway? Thanks, drew > > > So, when ACPI is > > enabled, let's skip the enforcement of the SLIT table generation, even > > if there are empty nodes, as you suggest. > > > > Thanks, > > drew > > > > > > > > Thanks, > > > drew > > > > > > > > > > > > the distance information the user provides on the command line in order > > > > > to satisfy the check, and we will, since we already have that code in > > > > > place. > > > > > > > > > > > > > > > > > > Thanks, > > > > > drew > > > > > > > > > > > > > > > > > > + return; > > > > > > > + } > > > > > > > + > > > > > > > + for (i = 0; i < nb_numa_nodes; i++) { > > > > > > > + if (numa_info[i].present && !numa_info[i].node_mem) { > > > > > > > + error_report("Empty node %d found, please provide " > > > > > > > + "distance map.", i); > > > > > > > + exit(EXIT_FAILURE); > > > > > > > + } > > > > > > > + } > > > > > > > +} > > > > > > > + > > > > > > > void parse_numa_opts(MachineState *ms) > > > > > > > { > > > > > > > qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); > > > > > > > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h > > > > > > > index 4173ef2afa..80f25ab830 100644 > > > > > > > --- a/include/sysemu/numa.h > > > > > > > +++ b/include/sysemu/numa.h > > > > > > > @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, > > > > > > > void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, > > > > > > > Error **errp); > > > > > > > void numa_complete_configuration(MachineState *ms); > > > > > > > +void numa_complete_validation(MachineState *ms); > > > > > > > void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); > > > > > > > extern QemuOptsList qemu_numa_opts; > > > > > > > void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-14 15:36 ` Andrew Jones @ 2021-10-15 8:22 ` Gavin Shan 2021-10-15 8:33 ` Andrew Jones 0 siblings, 1 reply; 14+ messages in thread From: Gavin Shan @ 2021-10-15 8:22 UTC (permalink / raw) To: Andrew Jones, Igor Mammedov Cc: peter.maydell, qemu-riscv, ehabkost, robh, qemu-devel, qemu-arm, shan.gavin Hi Drew, On 10/15/21 2:36 AM, Andrew Jones wrote: > On Thu, Oct 14, 2021 at 05:14:17PM +0200, Igor Mammedov wrote: >> On Wed, 13 Oct 2021 14:28:40 +0200 >> Andrew Jones <drjones@redhat.com> wrote: >>> On Wed, Oct 13, 2021 at 02:11:25PM +0200, Andrew Jones wrote: >>>> On Wed, Oct 13, 2021 at 01:53:46PM +0200, Igor Mammedov wrote: >>>>> On Wed, 13 Oct 2021 13:35:44 +0200 >>>>> Andrew Jones <drjones@redhat.com> wrote: >>>>>> On Wed, Oct 13, 2021 at 01:30:11PM +0200, Igor Mammedov wrote: >>>>>>> On Wed, 13 Oct 2021 12:58:04 +0800 >>>>>>> Gavin Shan <gshan@redhat.com> wrote: >>>>>>> >>>>>>>> The following option is used to specify the distance map. It's >>>>>>>> possible the option isn't provided by user. In this case, the >>>>>>>> distance map isn't populated and exposed to platform. On the >>>>>>>> other hand, the empty NUMA node, where no memory resides, is >>>>>>>> allowed on platforms like ARM64 virt. For these empty NUMA >>>>>>>> nodes, their corresponding device-tree nodes aren't populated, >>>>>>>> but their NUMA IDs should be included in the "/distance-map" >>>>>>>> device-tree node, so that kernel can probe them properly if >>>>>>>> device-tree is used. >>>>>>>> >>>>>>>> -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> >>>>>>>> >>>>>>>> This adds extra check after the machine is initialized, to >>>>>>>> ask for the distance map from user when empty nodes exist in >>>>>>>> device-tree. >>>>>>>> >>>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com> >>>>>>>> --- >>>>>>>> hw/core/machine.c | 4 ++++ >>>>>>>> hw/core/numa.c | 24 ++++++++++++++++++++++++ >>>>>>>> include/sysemu/numa.h | 1 + >>>>>>>> 3 files changed, 29 insertions(+) >>>>>>>> >>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c >>>>>>>> index b8d95eec32..c0765ad973 100644 >>>>>>>> --- a/hw/core/machine.c >>>>>>>> +++ b/hw/core/machine.c >>>>>>>> @@ -1355,6 +1355,10 @@ void machine_run_board_init(MachineState *machine) >>>>>>>> accel_init_interfaces(ACCEL_GET_CLASS(machine->accelerator)); >>>>>>>> machine_class->init(machine); >>>>>>>> phase_advance(PHASE_MACHINE_INITIALIZED); >>>>>>>> + >>>>>>>> + if (machine->numa_state) { >>>>>>>> + numa_complete_validation(machine); >>>>>>>> + } >>>>>>>> } >>>>>>>> >>>>>>>> static NotifierList machine_init_done_notifiers = >>>>>>>> diff --git a/hw/core/numa.c b/hw/core/numa.c >>>>>>>> index 510d096a88..7404b7dd38 100644 >>>>>>>> --- a/hw/core/numa.c >>>>>>>> +++ b/hw/core/numa.c >>>>>>>> @@ -727,6 +727,30 @@ void numa_complete_configuration(MachineState *ms) >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> +/* >>>>>>>> + * When device-tree is used by the machine, the empty node IDs should >>>>>>>> + * be included in the distance map. So we need provide pairs of distances >>>>>>>> + * in this case. >>>>>>>> + */ >>>>>>>> +void numa_complete_validation(MachineState *ms) >>>>>>>> +{ >>>>>>>> + NodeInfo *numa_info = ms->numa_state->nodes; >>>>>>>> + int nb_numa_nodes = ms->numa_state->num_nodes; >>>>>>>> + int i; >>>>>>>> + >>>>>>>> + if (!ms->fdt || ms->numa_state->have_numa_distance) { >>>>>>> >>>>>>> also skip check/limitation when VM is launched with ACPI enabled? >>> >>> On second thought, I guess it's a good idea to not error out when ACPI is >>> enabled. There's no point in burdening the ACPI user. >>> Yup, for platform like hw/arm/virt, the device-tree and ACPI table are used at same time can't bypass the check/limitation. Instead, x86 won't be affected as device-tree isn't used there. >>>>>> >>>>>> Even with ACPI enabled we generate a DT and would like that DT to be as >>>>>> complete as possible. Of course we should generate a SLIT table with >>>>> >>>>> Guest will work just fine without distance map as SRAT describes >>>>> all numa nodes. >>>>> You are forcing VM to have SLIT just for the sake of 'completeness' >>>>> that's practically unused. >>>>> >>>>> I'm still unsure about pushing all of this in generic numa code, >>>>> as this will be used only by ARM for now. It's better to keep it >>>>> ARM specific, and when RISCV machine will start using this, it >>>>> could be moved to generic code. >>> >>> I think RISCV could start using it now. Linux shares the mem/numa DT >>> parsing code between Arm and RISCV. If RISCV QEMU wanted to start >>> exposing NUMA nodes, then they might as well support empty ones from >>> the start. >> >> When RISCV will start using memory-less nodes, we can move this check to >> generic code. >> So far 2/2 takes care of ARM only, as result I don't see >> a good reason to generalize it now. (CCing RISCV folks to see if >> there is any plans to impl. that) >> >>>> I don't see a problem in providing this DT node / ACPI table to guests >>>> with empty NUMA nodes. I don't even see a problem with providing the >>>> distance information unconditionally. Can you explain why we should >>>> prefer not to provide optional HW descriptions? >>> >>> I'm still curious why we should be so reluctant to add HW descriptions. >>> I agree we should be reluctant to error out, though. >> >> If distance map were the only way to figure out present node-ids, >> I wouldn't mind making it non-optional. >> >> However guest can already get node-ids from mem/cpu/pci nodes >> with current device-tree that QEMU generates and >> forcing optional distance-map (with duplicate info) on user >> creates non must-have CLI requirements/inconvenience that >> will eventually propagate to higher layers, I'd rather avoid >> that if possible. >> >> So question is if an idea of fetching numa-ids from cpu nodes >> in addition to memory nodes (which should cover QEMU needs) >> have been considered and why it didn't work out? > > Thinking about it some more. I think we only need patch 2/2 of this > series. We should double check, but if when Linux generates its > default distance-map it already generates a map using the numa-ids > in the cpu nodes of the DT, then I guess the resulting Linux generated > distance-map will be sufficient to represent the empty NUMA nodes and > could be used for DT memory hotplug if Linux ever tried to implement > that. > > Gavin, can you check if we can just drop patch 1/2 and then when > booting a DT guest with empty numa nodes we get the distance > map we wanted to enforce anyway? > It's possible that the empty NUMA nodes aren't referred by any CPUs, as the following command line indicate. In this case, the empty NUMA node IDs aren't existing in device-tree CPU nodes. So we still need the distance-map. In the following example, The empty NUMA node #2 and #3 aren't referred by any CPUs. So their NUMA node IDs won't be included to device-tree CPU nodes. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host \ -cpu host -smp 4,sockets=2,cores=2,threads=1 \ -m 1024M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=512M \ -object memory-backend-ram,id=mem1,size=512M \ -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ -numa node,nodeid=2 \ -numa node,nodeid=3 : However, I would like to drop this (PATCH[01/02]) in v4. The memory hotplug through device-tree on ARM64 isn't existing. We might have alternative ways to present the empty NUMA nodes when it's supported, or we needn't it for ever because ACPI is enough to support the memory hotplug. Thanks, Gavin >>>>>>> >>>>>>>> + return; >>>>>>>> + } >>>>>>>> + >>>>>>>> + for (i = 0; i < nb_numa_nodes; i++) { >>>>>>>> + if (numa_info[i].present && !numa_info[i].node_mem) { >>>>>>>> + error_report("Empty node %d found, please provide " >>>>>>>> + "distance map.", i); >>>>>>>> + exit(EXIT_FAILURE); >>>>>>>> + } >>>>>>>> + } >>>>>>>> +} >>>>>>>> + >>>>>>>> void parse_numa_opts(MachineState *ms) >>>>>>>> { >>>>>>>> qemu_opts_foreach(qemu_find_opts("numa"), parse_numa, ms, &error_fatal); >>>>>>>> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h >>>>>>>> index 4173ef2afa..80f25ab830 100644 >>>>>>>> --- a/include/sysemu/numa.h >>>>>>>> +++ b/include/sysemu/numa.h >>>>>>>> @@ -104,6 +104,7 @@ void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node, >>>>>>>> void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node, >>>>>>>> Error **errp); >>>>>>>> void numa_complete_configuration(MachineState *ms); >>>>>>>> +void numa_complete_validation(MachineState *ms); >>>>>>>> void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms); >>>>>>>> extern QemuOptsList qemu_numa_opts; >>>>>>>> void numa_cpu_pre_plug(const struct CPUArchId *slot, DeviceState *dev, >>>>>>> >>>>>> >>>>> >>> >>> >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-15 8:22 ` Gavin Shan @ 2021-10-15 8:33 ` Andrew Jones 2021-10-15 10:51 ` Gavin Shan 0 siblings, 1 reply; 14+ messages in thread From: Andrew Jones @ 2021-10-15 8:33 UTC (permalink / raw) To: Gavin Shan Cc: robh, qemu-riscv, ehabkost, peter.maydell, qemu-devel, qemu-arm, shan.gavin, Igor Mammedov On Fri, Oct 15, 2021 at 07:22:05PM +1100, Gavin Shan wrote: > It's possible that the empty NUMA nodes aren't referred by any CPUs, > as the following command line indicate. In this case, the empty NUMA > node IDs aren't existing in device-tree CPU nodes. So we still need > the distance-map. Ah, indeed. > > However, I would like to drop this (PATCH[01/02]) in v4. The memory > hotplug through device-tree on ARM64 isn't existing. We might have > alternative ways to present the empty NUMA nodes when it's supported, > or we needn't it for ever because ACPI is enough to support the > memory hotplug. Agreed. Please update the commit message of 2/2 for v4 and also add a comment above the new code with the rationale for dropping those memory nodes. Thanks, drew ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v3 1/2] numa: Require distance map when empty node exists 2021-10-15 8:33 ` Andrew Jones @ 2021-10-15 10:51 ` Gavin Shan 0 siblings, 0 replies; 14+ messages in thread From: Gavin Shan @ 2021-10-15 10:51 UTC (permalink / raw) To: Andrew Jones Cc: robh, qemu-riscv, ehabkost, peter.maydell, qemu-devel, qemu-arm, shan.gavin, Igor Mammedov Hi Drew, On 10/15/21 7:33 PM, Andrew Jones wrote: > On Fri, Oct 15, 2021 at 07:22:05PM +1100, Gavin Shan wrote: >> It's possible that the empty NUMA nodes aren't referred by any CPUs, >> as the following command line indicate. In this case, the empty NUMA >> node IDs aren't existing in device-tree CPU nodes. So we still need >> the distance-map. > > Ah, indeed. > >> >> However, I would like to drop this (PATCH[01/02]) in v4. The memory >> hotplug through device-tree on ARM64 isn't existing. We might have >> alternative ways to present the empty NUMA nodes when it's supported, >> or we needn't it for ever because ACPI is enough to support the >> memory hotplug. > > Agreed. Please update the commit message of 2/2 for v4 and also add > a comment above the new code with the rationale for dropping those > memory nodes. > v4 was posted for review. Your r-b tag was dropped as the additional comments are added. Please take a look on v4 and sorry for having taken so much time from you :) Thanks, Gavin ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 2/2] hw/arm/virt: Don't create device-tree node for empty NUMA node 2021-10-13 4:58 [PATCH v3 0/2] hw/arm/virt: Fix qemu booting failure on device-tree Gavin Shan 2021-10-13 4:58 ` [PATCH v3 1/2] numa: Require distance map when empty node exists Gavin Shan @ 2021-10-13 4:58 ` Gavin Shan 1 sibling, 0 replies; 14+ messages in thread From: Gavin Shan @ 2021-10-13 4:58 UTC (permalink / raw) To: qemu-arm Cc: robh, drjones, ehabkost, peter.maydell, qemu-devel, shan.gavin, imammedo The empty NUMA node, where no memory resides, are allowed. For example, the following command line specifies two empty NUMA nodes. With this, QEMU fails to boot because of the conflicting device-tree node names, as the following error message indicates. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host \ -cpu host -smp 4,sockets=2,cores=2,threads=1 \ -m 1024M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=512M \ -object memory-backend-ram,id=mem1,size=512M \ -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ -numa node,nodeid=2 \ -numa node,nodeid=3 : qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS As specified by linux device-tree binding document, the device-tree nodes for these empty NUMA nodes shouldn't be generated. However, the corresponding NUMA node IDs should be included in the distance map device-tree node. This skips populating the device-tree nodes for these empty NUMA nodes to avoid the error, so that QEMU can be started successfully. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Andrew Jones <drjones@redhat.com> --- hw/arm/boot.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/hw/arm/boot.c b/hw/arm/boot.c index 57efb61ee4..4e5898fcdc 100644 --- a/hw/arm/boot.c +++ b/hw/arm/boot.c @@ -603,6 +603,10 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo, mem_base = binfo->loader_start; for (i = 0; i < ms->numa_state->num_nodes; i++) { mem_len = ms->numa_state->nodes[i].node_mem; + if (!mem_len) { + continue; + } + rc = fdt_add_memory_node(fdt, acells, mem_base, scells, mem_len, i); if (rc < 0) { -- 2.23.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-10-15 10:52 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-10-13 4:58 [PATCH v3 0/2] hw/arm/virt: Fix qemu booting failure on device-tree Gavin Shan 2021-10-13 4:58 ` [PATCH v3 1/2] numa: Require distance map when empty node exists Gavin Shan 2021-10-13 6:35 ` Andrew Jones 2021-10-13 11:30 ` Igor Mammedov 2021-10-13 11:35 ` Andrew Jones 2021-10-13 11:53 ` Igor Mammedov 2021-10-13 12:11 ` Andrew Jones 2021-10-13 12:28 ` Andrew Jones 2021-10-14 15:14 ` Igor Mammedov 2021-10-14 15:36 ` Andrew Jones 2021-10-15 8:22 ` Gavin Shan 2021-10-15 8:33 ` Andrew Jones 2021-10-15 10:51 ` Gavin Shan 2021-10-13 4:58 ` [PATCH v3 2/2] hw/arm/virt: Don't create device-tree node for empty NUMA node Gavin Shan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).