* [PATCH] Documentation, dt, numa: Add note to empty NUMA node @ 2021-09-06 4:14 Gavin Shan 2021-09-21 19:44 ` Rob Herring 0 siblings, 1 reply; 6+ messages in thread From: Gavin Shan @ 2021-09-06 4:14 UTC (permalink / raw) To: devicetree Cc: linux-kernel, linux-arm-kernel, linux-efi, kvmarm, rdunlap, robh, drjones, ardb, will, maz, catalin.marinas, shan.gavin The empty memory nodes, where no memory resides in, are allowed. For these empty memory nodes, the 'len' of 'reg' property is zero. The NUMA node IDs are still valid and parsed, but memory may be added to them through hotplug afterwards. Currently, QEMU fails to boot when multiple empty memory nodes are specified. It's caused by device-tree population failure and duplicated memory node names. As device-tree specification indicates, the 'unit-address' of these empty memory nodes, part of their names, are the equivalents to 'base-address'. Unfortunately, I finds difficulty to get where the assignment of 'base-address' is properly documented for these empty memory nodes. So lets add a section for empty memory nodes to cover this in NUMA binding document. The 'unit-address', equivalent to 'base-address' in the 'reg' property of these empty memory nodes is specified to be the summation of highest memory address plus the NUMA node ID. Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> --- Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- 1 file changed, 59 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt index 21b35053ca5a..82f047bc8dd6 100644 --- a/Documentation/devicetree/bindings/numa.txt +++ b/Documentation/devicetree/bindings/numa.txt @@ -103,7 +103,65 @@ Example: }; ============================================================================== -4 - Example dts +4 - Empty memory nodes +============================================================================== + +Empty memory nodes, which no memory resides in, are allowed. The 'length' +field of the 'reg' property is zero. However, the 'base-address' is a +dummy and invalid address, which is the summation of highest memory address +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid +and memory may be added into them through hotplug afterwards. + +Example: + + memory@0 { + device_type = "memory"; + reg = <0x0 0x0 0x0 0x80000000>; + numa-node-id = <0>; + }; + + memory@80000000 { + device_type = "memory"; + reg = <0x0 0x80000000 0x0 0x80000000>; + numa-node-id = <1>; + }; + + /* Empty memory node */ + memory@100000002 { + device_type = "memory"; + reg = <0x1 0x2 0x0 0x0>; + numa-node-id = <2>; + }; + + /* Empty memory node */ + memory@100000003 { + device_type = "memory"; + reg = <0x1 0x3 0x0 0x0>; + numa-node-id = <3>; + }; + + distance-map { + compatible = "numa-distance-map-v1"; + distance-matrix = <0 0 10>, + <0 1 20>, + <0 2 40>, + <0 3 20>, + <1 0 20>, + <1 1 10>, + <1 2 20>, + <1 3 40>, + <2 0 40>, + <2 1 20>, + <2 2 10>, + <2 3 20>, + <3 0 20>, + <3 1 40>, + <3 2 20>, + <3 3 10>; + }; + +============================================================================== +5 - Example dts ============================================================================== Dual socket system consists of 2 boards connected through ccn bus and -- 2.23.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation, dt, numa: Add note to empty NUMA node 2021-09-06 4:14 [PATCH] Documentation, dt, numa: Add note to empty NUMA node Gavin Shan @ 2021-09-21 19:44 ` Rob Herring 2021-09-22 11:05 ` Ard Biesheuvel 0 siblings, 1 reply; 6+ messages in thread From: Rob Herring @ 2021-09-21 19:44 UTC (permalink / raw) To: Gavin Shan Cc: devicetree, linux-kernel, linux-arm-kernel, linux-efi, open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64), Randy Dunlap, drjones, Ard Biesheuvel, Will Deacon, Marc Zyngier, Catalin Marinas, shan.gavin On Sun, Sep 5, 2021 at 11:16 PM Gavin Shan <gshan@redhat.com> wrote: > > The empty memory nodes, where no memory resides in, are allowed. > For these empty memory nodes, the 'len' of 'reg' property is zero. > The NUMA node IDs are still valid and parsed, but memory may be > added to them through hotplug afterwards. Currently, QEMU fails > to boot when multiple empty memory nodes are specified. It's > caused by device-tree population failure and duplicated memory > node names. I still don't like the fake addresses. I can't really give suggestions on alternative ways to fix this with you just presenting a solution. What is the failure you see? Can we relax the kernel's expectations? What about UEFI boot as the memory nodes aren't used (or maybe they are for NUMA?) How does this work with ACPI? > As device-tree specification indicates, the 'unit-address' of > these empty memory nodes, part of their names, are the equivalents > to 'base-address'. Unfortunately, I finds difficulty to get where > the assignment of 'base-address' is properly documented for these > empty memory nodes. So lets add a section for empty memory nodes > to cover this in NUMA binding document. The 'unit-address', > equivalent to 'base-address' in the 'reg' property of these empty > memory nodes is specified to be the summation of highest memory > address plus the NUMA node ID. > > Signed-off-by: Gavin Shan <gshan@redhat.com> > Acked-by: Randy Dunlap <rdunlap@infradead.org> > --- > Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- > 1 file changed, 59 insertions(+), 1 deletion(-) > > diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt > index 21b35053ca5a..82f047bc8dd6 100644 > --- a/Documentation/devicetree/bindings/numa.txt > +++ b/Documentation/devicetree/bindings/numa.txt > @@ -103,7 +103,65 @@ Example: > }; > > ============================================================================== > -4 - Example dts > +4 - Empty memory nodes > +============================================================================== > + > +Empty memory nodes, which no memory resides in, are allowed. The 'length' > +field of the 'reg' property is zero. However, the 'base-address' is a > +dummy and invalid address, which is the summation of highest memory address > +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid > +and memory may be added into them through hotplug afterwards. > + > +Example: > + > + memory@0 { > + device_type = "memory"; > + reg = <0x0 0x0 0x0 0x80000000>; > + numa-node-id = <0>; > + }; > + > + memory@80000000 { > + device_type = "memory"; > + reg = <0x0 0x80000000 0x0 0x80000000>; > + numa-node-id = <1>; > + }; > + > + /* Empty memory node */ > + memory@100000002 { > + device_type = "memory"; > + reg = <0x1 0x2 0x0 0x0>; > + numa-node-id = <2>; > + }; > + > + /* Empty memory node */ > + memory@100000003 { > + device_type = "memory"; > + reg = <0x1 0x3 0x0 0x0>; > + numa-node-id = <3>; > + }; Do you really need the memory nodes here or just some way to define numa node id's 2 and 3 as valid? > + > + distance-map { > + compatible = "numa-distance-map-v1"; > + distance-matrix = <0 0 10>, > + <0 1 20>, > + <0 2 40>, > + <0 3 20>, > + <1 0 20>, > + <1 1 10>, > + <1 2 20>, > + <1 3 40>, > + <2 0 40>, > + <2 1 20>, > + <2 2 10>, > + <2 3 20>, > + <3 0 20>, > + <3 1 40>, > + <3 2 20>, > + <3 3 10>; > + }; > + > +============================================================================== > +5 - Example dts > ============================================================================== > > Dual socket system consists of 2 boards connected through ccn bus and > -- > 2.23.0 > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation, dt, numa: Add note to empty NUMA node 2021-09-21 19:44 ` Rob Herring @ 2021-09-22 11:05 ` Ard Biesheuvel 2021-09-23 6:32 ` Gavin Shan 0 siblings, 1 reply; 6+ messages in thread From: Ard Biesheuvel @ 2021-09-22 11:05 UTC (permalink / raw) To: Rob Herring Cc: Gavin Shan, open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, linux-kernel, linux-arm-kernel, linux-efi, open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64), Randy Dunlap, Andrew Jones, Will Deacon, Marc Zyngier, Catalin Marinas, shan.gavin On Tue, 21 Sept 2021 at 21:45, Rob Herring <robh@kernel.org> wrote: > > On Sun, Sep 5, 2021 at 11:16 PM Gavin Shan <gshan@redhat.com> wrote: > > > > The empty memory nodes, where no memory resides in, are allowed. > > For these empty memory nodes, the 'len' of 'reg' property is zero. > > The NUMA node IDs are still valid and parsed, but memory may be > > added to them through hotplug afterwards. Currently, QEMU fails > > to boot when multiple empty memory nodes are specified. It's > > caused by device-tree population failure and duplicated memory > > node names. Those memory regions are known in advance, right? So wouldn't it be better to use something like 'status = "disabled"' here? > > I still don't like the fake addresses. I can't really give suggestions > on alternative ways to fix this with you just presenting a solution. > Agreed. Please try to explain what the problem is, and why this is the best way to solve it. Please include other solutions that were considered and rejected if any exist. > What is the failure you see? Can we relax the kernel's expectations? > What about UEFI boot as the memory nodes aren't used (or maybe they > are for NUMA?) How does this work with ACPI? > The EFI memory map only needs to describe the memory that was present at boot. More memory can be represented as ACPI objects, including coldplugged memory that is already present at boot. None of this involves the memory nodes in DT. > > As device-tree specification indicates, the 'unit-address' of > > these empty memory nodes, part of their names, are the equivalents > > to 'base-address'. Unfortunately, I finds difficulty to get where > > the assignment of 'base-address' is properly documented for these > > empty memory nodes. So lets add a section for empty memory nodes > > to cover this in NUMA binding document. The 'unit-address', > > equivalent to 'base-address' in the 'reg' property of these empty > > memory nodes is specified to be the summation of highest memory > > address plus the NUMA node ID. > > > > Signed-off-by: Gavin Shan <gshan@redhat.com> > > Acked-by: Randy Dunlap <rdunlap@infradead.org> > > --- > > Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- > > 1 file changed, 59 insertions(+), 1 deletion(-) > > > > diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt > > index 21b35053ca5a..82f047bc8dd6 100644 > > --- a/Documentation/devicetree/bindings/numa.txt > > +++ b/Documentation/devicetree/bindings/numa.txt > > @@ -103,7 +103,65 @@ Example: > > }; > > > > ============================================================================== > > -4 - Example dts > > +4 - Empty memory nodes > > +============================================================================== > > + > > +Empty memory nodes, which no memory resides in, are allowed. The 'length' > > +field of the 'reg' property is zero. However, the 'base-address' is a > > +dummy and invalid address, which is the summation of highest memory address > > +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid > > +and memory may be added into them through hotplug afterwards. > > + > > +Example: > > + > > + memory@0 { > > + device_type = "memory"; > > + reg = <0x0 0x0 0x0 0x80000000>; > > + numa-node-id = <0>; > > + }; > > + > > + memory@80000000 { > > + device_type = "memory"; > > + reg = <0x0 0x80000000 0x0 0x80000000>; > > + numa-node-id = <1>; > > + }; > > + > > + /* Empty memory node */ > > + memory@100000002 { > > + device_type = "memory"; > > + reg = <0x1 0x2 0x0 0x0>; > > + numa-node-id = <2>; > > + }; > > + > > + /* Empty memory node */ > > + memory@100000003 { > > + device_type = "memory"; > > + reg = <0x1 0x3 0x0 0x0>; > > + numa-node-id = <3>; > > + }; > > Do you really need the memory nodes here or just some way to define > numa node id's 2 and 3 as valid? > > > > + > > + distance-map { > > + compatible = "numa-distance-map-v1"; > > + distance-matrix = <0 0 10>, > > + <0 1 20>, > > + <0 2 40>, > > + <0 3 20>, > > + <1 0 20>, > > + <1 1 10>, > > + <1 2 20>, > > + <1 3 40>, > > + <2 0 40>, > > + <2 1 20>, > > + <2 2 10>, > > + <2 3 20>, > > + <3 0 20>, > > + <3 1 40>, > > + <3 2 20>, > > + <3 3 10>; > > + }; > > + > > +============================================================================== > > +5 - Example dts > > ============================================================================== > > > > Dual socket system consists of 2 boards connected through ccn bus and > > -- > > 2.23.0 > > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation, dt, numa: Add note to empty NUMA node 2021-09-22 11:05 ` Ard Biesheuvel @ 2021-09-23 6:32 ` Gavin Shan 2021-09-23 15:17 ` Rob Herring 0 siblings, 1 reply; 6+ messages in thread From: Gavin Shan @ 2021-09-23 6:32 UTC (permalink / raw) To: Ard Biesheuvel, Rob Herring Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, linux-kernel, linux-arm-kernel, linux-efi, open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64), Randy Dunlap, Andrew Jones, Will Deacon, Marc Zyngier, Catalin Marinas, shan.gavin Hi Rob and Ard, On 9/22/21 9:05 PM, Ard Biesheuvel wrote: > On Tue, 21 Sept 2021 at 21:45, Rob Herring <robh@kernel.org> wrote: >> On Sun, Sep 5, 2021 at 11:16 PM Gavin Shan <gshan@redhat.com> wrote: >>> >>> The empty memory nodes, where no memory resides in, are allowed. >>> For these empty memory nodes, the 'len' of 'reg' property is zero. >>> The NUMA node IDs are still valid and parsed, but memory may be >>> added to them through hotplug afterwards. Currently, QEMU fails >>> to boot when multiple empty memory nodes are specified. It's >>> caused by device-tree population failure and duplicated memory >>> node names. > > Those memory regions are known in advance, right? So wouldn't it be > better to use something like 'status = "disabled"' here? > Yes, these memory regions are known in advance. For the empty nodes, their 'len' property is zero and it's equal to disabled state. >> >> I still don't like the fake addresses. I can't really give suggestions >> on alternative ways to fix this with you just presenting a solution. >> > > Agreed. Please try to explain what the problem is, and why this is the > best way to solve it. Please include other solutions that were > considered and rejected if any exist. > >> What is the failure you see? Can we relax the kernel's expectations? >> What about UEFI boot as the memory nodes aren't used (or maybe they >> are for NUMA?) How does this work with ACPI? >> > > The EFI memory map only needs to describe the memory that was present > at boot. More memory can be represented as ACPI objects, including > coldplugged memory that is already present at boot. None of this > involves the memory nodes in DT. > I'm using the following command line to start a virtual machine (VM). There are 4 NUMA nodes specified, but the last two are empty. In QEMU, the device-tree nodes are populated to represent these 4 NUMA nodes. Unfortunately, QEMU fails to start because of the conflicting names for the empty node are found, as the following error message indicates. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host \ -cpu host -smp 4,sockets=2,cores=2,threads=1 \ -m 1024M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=512M \ -object memory-backend-ram,id=mem1,size=512M \ -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ -numa node,nodeid=2 \ -numa node,nodeid=3 \ : -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes : : qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS According to device-tree specification, the memory device-tree node's name is following the format 'memory@base-address'. For the empty NUMA nodes, their base addresses aren't determined. The device-tree specification doesn't indicate what 'base-address' should be assigned for the empty nodes. So I proposed this patch because I think the linux device-tree binding documentation is best place to get this documented. ACPI is different story. The NUMA nodes are represented by SRAT (System Resource Affinity Table). In the above example, there are 4 SRATs. We needn't assign names to the tables and we don't have the conflicting names as we do in device-tree case. By the way, QEMU currently prevents to expose SRATs for empty NUMA nodes. I need submit QEMU patch to break the limitation in future. With the limitation, the hot-added memory is always put into the last NUMA node and it's not exactly customer wants. >>> As device-tree specification indicates, the 'unit-address' of >>> these empty memory nodes, part of their names, are the equivalents >>> to 'base-address'. Unfortunately, I finds difficulty to get where >>> the assignment of 'base-address' is properly documented for these >>> empty memory nodes. So lets add a section for empty memory nodes >>> to cover this in NUMA binding document. The 'unit-address', >>> equivalent to 'base-address' in the 'reg' property of these empty >>> memory nodes is specified to be the summation of highest memory >>> address plus the NUMA node ID. >>> >>> Signed-off-by: Gavin Shan <gshan@redhat.com> >>> Acked-by: Randy Dunlap <rdunlap@infradead.org> >>> --- >>> Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- >>> 1 file changed, 59 insertions(+), 1 deletion(-) >>> >>> diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt >>> index 21b35053ca5a..82f047bc8dd6 100644 >>> --- a/Documentation/devicetree/bindings/numa.txt >>> +++ b/Documentation/devicetree/bindings/numa.txt >>> @@ -103,7 +103,65 @@ Example: >>> }; >>> >>> ============================================================================== >>> -4 - Example dts >>> +4 - Empty memory nodes >>> +============================================================================== >>> + >>> +Empty memory nodes, which no memory resides in, are allowed. The 'length' >>> +field of the 'reg' property is zero. However, the 'base-address' is a >>> +dummy and invalid address, which is the summation of highest memory address >>> +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid >>> +and memory may be added into them through hotplug afterwards. >>> + >>> +Example: >>> + >>> + memory@0 { >>> + device_type = "memory"; >>> + reg = <0x0 0x0 0x0 0x80000000>; >>> + numa-node-id = <0>; >>> + }; >>> + >>> + memory@80000000 { >>> + device_type = "memory"; >>> + reg = <0x0 0x80000000 0x0 0x80000000>; >>> + numa-node-id = <1>; >>> + }; >>> + >>> + /* Empty memory node */ >>> + memory@100000002 { >>> + device_type = "memory"; >>> + reg = <0x1 0x2 0x0 0x0>; >>> + numa-node-id = <2>; >>> + }; >>> + >>> + /* Empty memory node */ >>> + memory@100000003 { >>> + device_type = "memory"; >>> + reg = <0x1 0x3 0x0 0x0>; >>> + numa-node-id = <3>; >>> + }; >> >> Do you really need the memory nodes here or just some way to define >> numa node id's 2 and 3 as valid? >> It's the way to define NUMA node IDs are valid. Besides, the 'reg' property provides 'base-address', which is part of the device-tree node's name, as described in this patch. >> >>> + >>> + distance-map { >>> + compatible = "numa-distance-map-v1"; >>> + distance-matrix = <0 0 10>, >>> + <0 1 20>, >>> + <0 2 40>, >>> + <0 3 20>, >>> + <1 0 20>, >>> + <1 1 10>, >>> + <1 2 20>, >>> + <1 3 40>, >>> + <2 0 40>, >>> + <2 1 20>, >>> + <2 2 10>, >>> + <2 3 20>, >>> + <3 0 20>, >>> + <3 1 40>, >>> + <3 2 20>, >>> + <3 3 10>; >>> + }; >>> + >>> +============================================================================== >>> +5 - Example dts >>> ============================================================================== >>> Thanks, Gavin ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation, dt, numa: Add note to empty NUMA node 2021-09-23 6:32 ` Gavin Shan @ 2021-09-23 15:17 ` Rob Herring 2021-09-27 1:16 ` Gavin Shan 0 siblings, 1 reply; 6+ messages in thread From: Rob Herring @ 2021-09-23 15:17 UTC (permalink / raw) To: Gavin Shan Cc: Ard Biesheuvel, open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, linux-kernel, linux-arm-kernel, linux-efi, open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64), Randy Dunlap, Andrew Jones, Will Deacon, Marc Zyngier, Catalin Marinas, shan.gavin On Thu, Sep 23, 2021 at 1:32 AM Gavin Shan <gshan@redhat.com> wrote: > > Hi Rob and Ard, > > On 9/22/21 9:05 PM, Ard Biesheuvel wrote: > > On Tue, 21 Sept 2021 at 21:45, Rob Herring <robh@kernel.org> wrote: > >> On Sun, Sep 5, 2021 at 11:16 PM Gavin Shan <gshan@redhat.com> wrote: > >>> > >>> The empty memory nodes, where no memory resides in, are allowed. > >>> For these empty memory nodes, the 'len' of 'reg' property is zero. > >>> The NUMA node IDs are still valid and parsed, but memory may be > >>> added to them through hotplug afterwards. Currently, QEMU fails > >>> to boot when multiple empty memory nodes are specified. It's > >>> caused by device-tree population failure and duplicated memory > >>> node names. > > > > Those memory regions are known in advance, right? So wouldn't it be > > better to use something like 'status = "disabled"' here? > > > > Yes, these memory regions are known in advance. For the empty nodes, > their 'len' property is zero and it's equal to disabled state. > > >> > >> I still don't like the fake addresses. I can't really give suggestions > >> on alternative ways to fix this with you just presenting a solution. > >> > > > > Agreed. Please try to explain what the problem is, and why this is the > > best way to solve it. Please include other solutions that were > > considered and rejected if any exist. > > > >> What is the failure you see? Can we relax the kernel's expectations? > >> What about UEFI boot as the memory nodes aren't used (or maybe they > >> are for NUMA?) How does this work with ACPI? > >> > > > > The EFI memory map only needs to describe the memory that was present > > at boot. More memory can be represented as ACPI objects, including > > coldplugged memory that is already present at boot. None of this > > involves the memory nodes in DT. > > > > I'm using the following command line to start a virtual machine (VM). > There are 4 NUMA nodes specified, but the last two are empty. In QEMU, > the device-tree nodes are populated to represent these 4 NUMA nodes. > Unfortunately, QEMU fails to start because of the conflicting names > for the empty node are found, as the following error message indicates. > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ > -accel kvm -machine virt,gic-version=host \ > -cpu host -smp 4,sockets=2,cores=2,threads=1 \ > -m 1024M,slots=16,maxmem=64G \ > -object memory-backend-ram,id=mem0,size=512M \ > -object memory-backend-ram,id=mem1,size=512M \ > -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ > -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ > -numa node,nodeid=2 \ > -numa node,nodeid=3 \ > : > -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes > : > : > qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS > > According to device-tree specification, the memory device-tree node's > name is following the format 'memory@base-address'. For the empty > NUMA nodes, their base addresses aren't determined. The device-tree > specification doesn't indicate what 'base-address' should be assigned > for the empty nodes. So I proposed this patch because I think the > linux device-tree binding documentation is best place to get this > documented. Why even create the node? What does IBM pSeries do here. AIUI, those platforms create/remove nodes for hotplug. That's the reason CONFIG_OF_DYNAMIC existed originally. Unfortunately, that's the extent of my knowledge on that. > ACPI is different story. The NUMA nodes are represented by SRAT > (System Resource Affinity Table). In the above example, there are > 4 SRATs. We needn't assign names to the tables and we don't have > the conflicting names as we do in device-tree case. > > By the way, QEMU currently prevents to expose SRATs for empty NUMA > nodes. I need submit QEMU patch to break the limitation in future. > With the limitation, the hot-added memory is always put into the > last NUMA node and it's not exactly customer wants. > > >>> As device-tree specification indicates, the 'unit-address' of > >>> these empty memory nodes, part of their names, are the equivalents > >>> to 'base-address'. Unfortunately, I finds difficulty to get where > >>> the assignment of 'base-address' is properly documented for these > >>> empty memory nodes. So lets add a section for empty memory nodes > >>> to cover this in NUMA binding document. The 'unit-address', > >>> equivalent to 'base-address' in the 'reg' property of these empty > >>> memory nodes is specified to be the summation of highest memory > >>> address plus the NUMA node ID. > >>> > >>> Signed-off-by: Gavin Shan <gshan@redhat.com> > >>> Acked-by: Randy Dunlap <rdunlap@infradead.org> > >>> --- > >>> Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- > >>> 1 file changed, 59 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt > >>> index 21b35053ca5a..82f047bc8dd6 100644 > >>> --- a/Documentation/devicetree/bindings/numa.txt > >>> +++ b/Documentation/devicetree/bindings/numa.txt > >>> @@ -103,7 +103,65 @@ Example: > >>> }; > >>> > >>> ============================================================================== > >>> -4 - Example dts > >>> +4 - Empty memory nodes > >>> +============================================================================== > >>> + > >>> +Empty memory nodes, which no memory resides in, are allowed. The 'length' > >>> +field of the 'reg' property is zero. However, the 'base-address' is a > >>> +dummy and invalid address, which is the summation of highest memory address > >>> +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid > >>> +and memory may be added into them through hotplug afterwards. > >>> + > >>> +Example: > >>> + > >>> + memory@0 { > >>> + device_type = "memory"; > >>> + reg = <0x0 0x0 0x0 0x80000000>; > >>> + numa-node-id = <0>; > >>> + }; > >>> + > >>> + memory@80000000 { > >>> + device_type = "memory"; > >>> + reg = <0x0 0x80000000 0x0 0x80000000>; > >>> + numa-node-id = <1>; > >>> + }; > >>> + > >>> + /* Empty memory node */ > >>> + memory@100000002 { > >>> + device_type = "memory"; > >>> + reg = <0x1 0x2 0x0 0x0>; > >>> + numa-node-id = <2>; > >>> + }; > >>> + > >>> + /* Empty memory node */ > >>> + memory@100000003 { > >>> + device_type = "memory"; > >>> + reg = <0x1 0x3 0x0 0x0>; > >>> + numa-node-id = <3>; > >>> + }; > >> > >> Do you really need the memory nodes here or just some way to define > >> numa node id's 2 and 3 as valid? > >> > > It's the way to define NUMA node IDs are valid. Besides, the 'reg' > property provides 'base-address', which is part of the device-tree > node's name, as described in this patch. The distance-matrix already lists all possible NUMA node IDs. That should be enough information for the kernel. If not, fix the kernel. Rob ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Documentation, dt, numa: Add note to empty NUMA node 2021-09-23 15:17 ` Rob Herring @ 2021-09-27 1:16 ` Gavin Shan 0 siblings, 0 replies; 6+ messages in thread From: Gavin Shan @ 2021-09-27 1:16 UTC (permalink / raw) To: Rob Herring Cc: Ard Biesheuvel, open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, linux-kernel, linux-arm-kernel, linux-efi, open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64), Randy Dunlap, Andrew Jones, Will Deacon, Marc Zyngier, Catalin Marinas, shan.gavin Hi Rob, > On 9/24/21 1:17 AM, Rob Herring wrote: > On Thu, Sep 23, 2021 at 1:32 AM Gavin Shan <gshan@redhat.com> wrote: >> On 9/22/21 9:05 PM, Ard Biesheuvel wrote: >>> On Tue, 21 Sept 2021 at 21:45, Rob Herring <robh@kernel.org> wrote: >>>> On Sun, Sep 5, 2021 at 11:16 PM Gavin Shan <gshan@redhat.com> wrote: >>>>> >>>>> The empty memory nodes, where no memory resides in, are allowed. >>>>> For these empty memory nodes, the 'len' of 'reg' property is zero. >>>>> The NUMA node IDs are still valid and parsed, but memory may be >>>>> added to them through hotplug afterwards. Currently, QEMU fails >>>>> to boot when multiple empty memory nodes are specified. It's >>>>> caused by device-tree population failure and duplicated memory >>>>> node names. >>> >>> Those memory regions are known in advance, right? So wouldn't it be >>> better to use something like 'status = "disabled"' here? >>> >> >> Yes, these memory regions are known in advance. For the empty nodes, >> their 'len' property is zero and it's equal to disabled state. >> >>>> >>>> I still don't like the fake addresses. I can't really give suggestions >>>> on alternative ways to fix this with you just presenting a solution. >>>> >>> >>> Agreed. Please try to explain what the problem is, and why this is the >>> best way to solve it. Please include other solutions that were >>> considered and rejected if any exist. >>> >>>> What is the failure you see? Can we relax the kernel's expectations? >>>> What about UEFI boot as the memory nodes aren't used (or maybe they >>>> are for NUMA?) How does this work with ACPI? >>>> >>> >>> The EFI memory map only needs to describe the memory that was present >>> at boot. More memory can be represented as ACPI objects, including >>> coldplugged memory that is already present at boot. None of this >>> involves the memory nodes in DT. >>> >> >> I'm using the following command line to start a virtual machine (VM). >> There are 4 NUMA nodes specified, but the last two are empty. In QEMU, >> the device-tree nodes are populated to represent these 4 NUMA nodes. >> Unfortunately, QEMU fails to start because of the conflicting names >> for the empty node are found, as the following error message indicates. >> >> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ >> -accel kvm -machine virt,gic-version=host \ >> -cpu host -smp 4,sockets=2,cores=2,threads=1 \ >> -m 1024M,slots=16,maxmem=64G \ >> -object memory-backend-ram,id=mem0,size=512M \ >> -object memory-backend-ram,id=mem1,size=512M \ >> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \ >> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \ >> -numa node,nodeid=2 \ >> -numa node,nodeid=3 \ >> : >> -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes >> : >> : >> qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS >> >> According to device-tree specification, the memory device-tree node's >> name is following the format 'memory@base-address'. For the empty >> NUMA nodes, their base addresses aren't determined. The device-tree >> specification doesn't indicate what 'base-address' should be assigned >> for the empty nodes. So I proposed this patch because I think the >> linux device-tree binding documentation is best place to get this >> documented. > > Why even create the node? > > What does IBM pSeries do here. AIUI, those platforms create/remove > nodes for hotplug. That's the reason CONFIG_OF_DYNAMIC existed > originally. Unfortunately, that's the extent of my knowledge on that. > It has been long time that I didn't read pSeries related code. I spent some time on that and you're correct. On pSeries, the device-tree node is added dynamically. However, the IBM private nodes or properties, whose names start with "ibm", are used for the memory hotplug. So ARM64 can't follow without adaption. I agree on your suggestion, which will be reflected in v2: (a) the memory device-tree nodes are added and removed on memory hot add and removal. (b) the supported NUMA nodes, including the empty ones, are identified through "numa-distance-map" compatible device-tree node. It's exactly same to what you suggested below. In this way, we won't have the issue of the conflicting memory node names, introduced by the empty NUMA nodes. >> ACPI is different story. The NUMA nodes are represented by SRAT >> (System Resource Affinity Table). In the above example, there are >> 4 SRATs. We needn't assign names to the tables and we don't have >> the conflicting names as we do in device-tree case. >> >> By the way, QEMU currently prevents to expose SRATs for empty NUMA >> nodes. I need submit QEMU patch to break the limitation in future. >> With the limitation, the hot-added memory is always put into the >> last NUMA node and it's not exactly customer wants. >> >>>>> As device-tree specification indicates, the 'unit-address' of >>>>> these empty memory nodes, part of their names, are the equivalents >>>>> to 'base-address'. Unfortunately, I finds difficulty to get where >>>>> the assignment of 'base-address' is properly documented for these >>>>> empty memory nodes. So lets add a section for empty memory nodes >>>>> to cover this in NUMA binding document. The 'unit-address', >>>>> equivalent to 'base-address' in the 'reg' property of these empty >>>>> memory nodes is specified to be the summation of highest memory >>>>> address plus the NUMA node ID. >>>>> >>>>> Signed-off-by: Gavin Shan <gshan@redhat.com> >>>>> Acked-by: Randy Dunlap <rdunlap@infradead.org> >>>>> --- >>>>> Documentation/devicetree/bindings/numa.txt | 60 +++++++++++++++++++++- >>>>> 1 file changed, 59 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt >>>>> index 21b35053ca5a..82f047bc8dd6 100644 >>>>> --- a/Documentation/devicetree/bindings/numa.txt >>>>> +++ b/Documentation/devicetree/bindings/numa.txt >>>>> @@ -103,7 +103,65 @@ Example: >>>>> }; >>>>> >>>>> ============================================================================== >>>>> -4 - Example dts >>>>> +4 - Empty memory nodes >>>>> +============================================================================== >>>>> + >>>>> +Empty memory nodes, which no memory resides in, are allowed. The 'length' >>>>> +field of the 'reg' property is zero. However, the 'base-address' is a >>>>> +dummy and invalid address, which is the summation of highest memory address >>>>> +plus the NUMA node ID. The NUMA node IDs and distance maps are still valid >>>>> +and memory may be added into them through hotplug afterwards. >>>>> + >>>>> +Example: >>>>> + >>>>> + memory@0 { >>>>> + device_type = "memory"; >>>>> + reg = <0x0 0x0 0x0 0x80000000>; >>>>> + numa-node-id = <0>; >>>>> + }; >>>>> + >>>>> + memory@80000000 { >>>>> + device_type = "memory"; >>>>> + reg = <0x0 0x80000000 0x0 0x80000000>; >>>>> + numa-node-id = <1>; >>>>> + }; >>>>> + >>>>> + /* Empty memory node */ >>>>> + memory@100000002 { >>>>> + device_type = "memory"; >>>>> + reg = <0x1 0x2 0x0 0x0>; >>>>> + numa-node-id = <2>; >>>>> + }; >>>>> + >>>>> + /* Empty memory node */ >>>>> + memory@100000003 { >>>>> + device_type = "memory"; >>>>> + reg = <0x1 0x3 0x0 0x0>; >>>>> + numa-node-id = <3>; >>>>> + }; >>>> >>>> Do you really need the memory nodes here or just some way to define >>>> numa node id's 2 and 3 as valid? >>>> >> >> It's the way to define NUMA node IDs are valid. Besides, the 'reg' >> property provides 'base-address', which is part of the device-tree >> node's name, as described in this patch. > > The distance-matrix already lists all possible NUMA node IDs. That > should be enough information for the kernel. If not, fix the kernel. > Thanks, Rob. An extra patch will address the issue in v2. Thanks, Gavin ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-09-27 1:16 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-06 4:14 [PATCH] Documentation, dt, numa: Add note to empty NUMA node Gavin Shan 2021-09-21 19:44 ` Rob Herring 2021-09-22 11:05 ` Ard Biesheuvel 2021-09-23 6:32 ` Gavin Shan 2021-09-23 15:17 ` Rob Herring 2021-09-27 1:16 ` Gavin Shan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).