* qemu cxl memory expander shows numa_node -1
From: Sajjan Rao @ 2023-08-18  9:38 UTC (permalink / raw)
  To: linux-cxl

Hello,

I have a QEMU + CXL configuration coming up with one configured Type 3
device. My goal is to generate some CXL.mem traffic in this
configuration. However, the device's numa_node always shows -1. I have
tried various QEMU command-line parameters, including explicitly
setting a NUMA node for the CXL device.

Here is my QEMU command line:
--------
qemu-system-x86_64 \
 -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
 -machine type=q35,accel=kvm,cxl=on \
 -m 10G \
 -smp cpus=8 \
 -object memory-backend-ram,size=4G,id=m0 \
 -object memory-backend-ram,size=4G,id=m1 \
 -object memory-backend-ram,size=2G,id=cxl-mem1 \
 -numa node,memdev=m0,cpus=0-1,nodeid=0 \
 -numa node,memdev=m1,cpus=2-3,nodeid=1 \
 -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
 -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
 -device virtio-net-pci,netdev=net0 \
 -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
 -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
 -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
 -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
 -enable-kvm \
 -nographic
-----

I see that the CXL device is listed in lspci, but its verbose output
shows no NUMA information:
------
#lspci | grep -i cxl
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)

#lspci -s 0d:00.0 -vvv | grep -i numa
#

-------

sysfs output
----------
#cat /sys/bus/cxl/devices/mem0/numa_node
-1
--------

numactl output

------------------
#numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3910 MB
node 0 free: 3776 MB
node 1 cpus: 2 3
node 1 size: 4031 MB
node 1 free: 3927 MB
node 2 cpus: 4 5 6 7
node 2 size: 2011 MB
node 2 free: 1785 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
-------------------

NUMA node 2 is expected to map to the CXL device. I do see some
activity in numastat output, but it's unclear whether that traffic
really hits the CXL device, since the device itself reports numa_node
-1 (I expected it to show 2).

Has anybody seen this behavior? Any help will be greatly appreciated.

Thanks,
Sajjan


* Re: qemu cxl memory expander shows numa_node -1
From: Dimitrios Palyvos @ 2023-08-18 15:01 UTC (permalink / raw)
  To: Sajjan Rao; +Cc: linux-cxl

Hi,

I am not an expert (and not 100% sure if that's what you want to do),
but here's one way to get your configuration to work:
1. Disable KVM.
2. Remove the CXL NUMA node from the QEMU command.
3. Use the ndctl utilities in the guest to initialize your CXL memory
and associated NUMA node.

More specifically, I changed your QEMU command as follows:

    qemu-system-x86_64 \
     -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
     -machine type=q35,cxl=on \
     -m 8G \
     -smp cpus=8 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=2G,id=cxl-mem1 \
     -numa node,memdev=m0,cpus=0-3,nodeid=0 \
     -numa node,memdev=m1,cpus=4-7,nodeid=1 \
     -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
     -device virtio-net-pci,netdev=net0 \
     -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
     -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
     -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
     -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
     -nographic

In the guest, install ndctl: https://github.com/pmem/ndctl

After that, you should be able to see the CXL memory:
    root@cxl-img:~# cxl list
    [
      {
        "memdev":"mem0",
        "ram_size":2147483648,
        "serial":0,
        "host":"0000:0d:00.0"
      }
    ]
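As a quick sanity check (my arithmetic, not part of the tool output):
ram_size is reported in bytes, and 2147483648 B is exactly the 2G
configured for the cxl-mem1 backend:

```shell
# ram_size from `cxl list` is in bytes; shifting right by 30 converts to GiB.
echo $((2147483648 >> 30))   # prints 2, matching size=2G of cxl-mem1
```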

And initialize it as RAM:
    root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
    ...

    root@cxl-img:~# lsmem --output-all
    RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE  ZONES
    0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0   None
    0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15    0  DMA32
    0x0000000100000000-0x000000017fffffff    2G online       yes 32-47    0 Normal
    0x0000000180000000-0x000000027fffffff    4G online       yes 48-79    1 Normal
    0x0000000290000000-0x000000030fffffff    2G online       yes 82-97    2 Normal

    Memory block size:       128M
    Total online memory:      10G
    Total offline memory:      0B


    root@cxl-img:~# cat /proc/iomem
    ...
    290000000-38fffffff : CXL Window 0
      290000000-30fffffff : region0
        290000000-30fffffff : dax0.0
          290000000-30fffffff : System RAM (kmem)
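
For reference (my arithmetic, based on the ranges printed above):
/proc/iomem ranges are inclusive, so region0 spans exactly the 2 GiB of
the device, and CXL Window 0 spans the 4G set by cxl-fmw.0.size:

```shell
# Inclusive ranges: size = end - start + 1; >> 30 converts bytes to GiB.
echo $(( (0x30fffffff - 0x290000000 + 1) >> 30 ))   # prints 2 (region0)
echo $(( (0x38fffffff - 0x290000000 + 1) >> 30 ))   # prints 4 (CXL Window 0)
```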


Then you can generate traffic in the CXL NUMA node, for example:

    root@cxl-img:~# numactl --membind 2 ls

Note: the above is with Linux v6.4.11.

Hope that helps!

Kind regards,
Dimitris


* Re: qemu cxl memory expander shows numa_node -1
From: Sajjan Rao @ 2023-08-21 10:00 UTC (permalink / raw)
  To: Dimitrios Palyvos; +Cc: linux-cxl

Hello Dimitrios,

Thank you for the pointers. I am on the 6.4.10 kernel and have modified
the QEMU options as you suggested, but now I see an error when creating
the region. Is there anything else I missed?

[root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
[ 4144.982608] cxl region0: Failed to synchronize CPU cache state
cxl region: create_region: region0: failed to commit decode: No such
device or address

Thanks,
Sajjan

-- qemu

qemu-system-x86_64 \
 -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
 -machine type=q35,cxl=on \
 -m 4G \
 -smp cpus=2 \
 -accel tcg,thread=single \
 -object memory-backend-ram,size=4G,id=m0 \
 -object memory-backend-ram,size=256M,id=cxl-mem1 \
 -numa node,memdev=m0,cpus=0-1,nodeid=0 \
 -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
 -device virtio-net-pci,netdev=net0 \
 -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
 -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
 -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
 -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
 -nographic

-----

[root@cxl-test /]# uname -r
6.4.10-200.fc38.x86_64
[root@cxl-test /]# cxl list
[
  {
    "memdev":"mem0",
    "ram_size":268435456,
    "serial":0,
    "host":"0000:0d:00.0"
  }
]
[root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
[ 4144.982608] cxl region0: Failed to synchronize CPU cache state
cxl region: create_region: region0: failed to commit decode: No such
device or address

[root@cxl-test /]#



* Re: qemu cxl memory expander shows numa_node -1
From: Dimitrios Palyvos @ 2023-08-21 10:53 UTC (permalink / raw)
  To: Sajjan Rao; +Cc: linux-cxl

Hi,

Ah yes, I believe you need to enable the kernel config option
CONFIG_CXL_REGION_INVALIDATION_TEST for region creation to work in
QEMU. The help text of that config option explains why in more detail.
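
For reference, a .config fragment along these lines is what I mean (a
sketch; exact option dependencies may differ between kernel versions):

```
# QEMU-emulated CXL memory is not flushed like real hardware, so region
# creation needs this test/debug override to bypass CPU cache invalidation:
CONFIG_CXL_REGION=y
CONFIG_CXL_REGION_INVALIDATION_TEST=y
```

You can check the running kernel with `grep CXL_REGION /boot/config-$(uname -r)`.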

Hope that helps!

Kind regards,
Dimitris




-- 
Dimitrios Palyvos-Giannas, PhD
Software Engineer
ZeroPoint Technologies


* Re: qemu cxl memory expander shows numa_node -1
From: Sajjan Rao @ 2023-08-23 11:13 UTC (permalink / raw)
  To: Dimitrios Palyvos; +Cc: linux-cxl

Thank you Dimitrios. That worked!

On Mon, Aug 21, 2023 at 4:23 PM Dimitrios Palyvos
<dimitrios.palyvos@zptcorp.com> wrote:
>
> Hi,
>
> Ah yes, I believe you need to enable the kernel config option
> CONFIG_CXL_REGION_INVALIDATION_TEST for the region creation to work in
> QEMU. The help entry of that config option gives more info on the why.
>
> Hope that helps!
>
> Kind regards,
> Dimitris
>
>
> On Mon, Aug 21, 2023 at 12:01 PM Sajjan Rao <sajjanr@gmail.com> wrote:
> >
> > Hello Dimitrios,
> >
> > Thank you for the pointers. I have the 6.4.10 kernel and modified the
> > qemu options, but now I see an error creating the region.
> > Is there anything else I missed?
> >
> > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > cxl region: create_region: region0: failed to commit decode: No such
> > device or address
> >
> > Thanks,
> > Sajjan
> >
> > -- qemu
> >
> > qemu-system-x86_64 \
> >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> >  -machine type=q35,cxl=on \
> >  -m 4G \
> >  -smp cpus=2 \
> >  -accel tcg,thread=single \
> >  -object memory-backend-ram,size=4G,id=m0 \
> >  -object memory-backend-ram,size=256M,id=cxl-mem1 \
> >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> >  -device virtio-net-pci,netdev=net0 \
> >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> >  -nographic
> >
> > -----
> >
> > [root@cxl-test /]# uname -r
> > 6.4.10-200.fc38.x86_64
> > [root@cxl-test /]# cxl list
> > [
> >   {
> >     "memdev":"mem0",
> >     "ram_size":268435456,
> >     "serial":0,
> >     "host":"0000:0d:00.0"
> >   }
> > ]
> > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > cxl region: create_region: region0: failed to commit decode: No such
> > device or address
> >
> > [root@cxl-test /]#
> >
> > On Fri, Aug 18, 2023 at 8:31 PM Dimitrios Palyvos
> > <dimitrios.palyvos@zptcorp.com> wrote:
> > >
> > > Hi,
> > >
> > > I am not an expert (and not 100% sure if that's what you want to do),
> > > but here's one way to get your configuration to work:
> > > 1. Disable KVM.
> > > 2. Remove the CXL NUMA node from the QEMU command.
> > > 3. Use the ndctl utilities in the guest to initialize your CXL memory
> > > and associated NUMA node.
> > >
> > > More specifically, I changed your QEMU command as follows:
> > >
> > >     qemu-system-x86_64 \
> > >      -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > >      -machine type=q35,cxl=on \
> > >      -m 8G \
> > >      -smp cpus=8 \
> > >      -object memory-backend-ram,size=4G,id=m0 \
> > >      -object memory-backend-ram,size=4G,id=m1 \
> > >      -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > >      -numa node,memdev=m0,cpus=0-3,nodeid=0 \
> > >      -numa node,memdev=m1,cpus=4-7,nodeid=1 \
> > >      -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > >      -device virtio-net-pci,netdev=net0 \
> > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > >      -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > >      -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > >      -nographic
> > >
> > > In the guest, install ndctl: https://github.com/pmem/ndctl
> > >
> > > After that, you should be able to see the CXL memory:
> > >     root@cxl-img:~# cxl list
> > >     [
> > >       {
> > >         "memdev":"mem0",
> > >         "ram_size":2147483648,
> > >         "serial":0,
> > >         "host":"0000:0d:00.0"
> > >       }
> > >     ]
> > >
> > > And initialize it as RAM:
> > >     root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
> > >     ...
> > >
> > > root@cxl-img:~# lsmem --output-all
> > >     RANGE                                  SIZE  STATE REMOVABLE BLOCK
> > > NODE  ZONES
> > >     0x0000000000000000-0x0000000007ffffff  128M online       yes     0
> > >    0   None
> > >     0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15
> > >    0  DMA32
> > >     0x0000000100000000-0x000000017fffffff    2G online       yes 32-47
> > >    0 Normal
> > >     0x0000000180000000-0x000000027fffffff    4G online       yes 48-79
> > >    1 Normal
> > >     0x0000000290000000-0x000000030fffffff    2G online       yes 82-97
> > >    2 Normal
> > >
> > >     Memory block size:       128M
> > >     Total online memory:      10G
> > >     Total offline memory:      0B
> > >
> > >
> > >     root@cxl-img:~# cat /proc/iomem
> > >     ...
> > >     290000000-38fffffff : CXL Window 0
> > >       290000000-30fffffff : region0
> > >         290000000-30fffffff : dax0.0
> > >           290000000-30fffffff : System RAM (kmem)
> > >
> > >
> > > Then you can generate traffic in the CXL NUMA node, for example:
> > >
> > >     root@cxl-img:~# numactl --membind 2 ls
> > >
> > > Note: The above is with linux v6.4.11.
> > >
> > > Hope that helps!
> > >
> > > Kind regards,
> > > Dimitris
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Aug 18, 2023 at 11:39 AM Sajjan Rao <sajjanr@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have a qemu + cxl configuration coming up with one configured type 3
> > > > device. My goal is to generate some cxl.mem traffic in this
> > > > configuration.
> > > > However the numa_node is always showing as -1. I have tried various
> > > > qemu command line parameters including to explicitly set numa_node for
> > > > cxl devices.
> > > >
> > > > Here is my qemu command line
> > > > --------
> > > > qemu-system-x86_64 \
> > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > >  -machine type=q35,accel=kvm,cxl=on \
> > > >  -m 10G \
> > > >  -smp cpus=8 \
> > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > >  -object memory-backend-ram,size=4G,id=m1 \
> > > >  -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > >  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
> > > >  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
> > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > >  -device virtio-net-pci,netdev=net0 \
> > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > >  -enable-kvm \
> > > >  -nographic
> > > > -----
> > > >
> > > > I see that the cxl device is listed in lspci output
> > > > ------
> > > > #lspci | grep -i cxl
> > > > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > > >
> > > > #lspci -s 0d:00.0 -vvv | grep -i numa
> > > > #
> > > >
> > > > -------
> > > >
> > > > sysfs output
> > > > ----------
> > > > #cat /sys/bus/cxl/devices/mem0/numa_node
> > > > -1
> > > > --------
> > > >
> > > > numactl output
> > > >
> > > > ------------------
> > > > #numactl -H
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: 3910 MB
> > > > node 0 free: 3776 MB
> > > > node 1 cpus: 2 3
> > > > node 1 size: 4031 MB
> > > > node 1 free: 3927 MB
> > > > node 2 cpus: 4 5 6 7
> > > > node 2 size: 2011 MB
> > > > node 2 free: 1785 MB
> > > > node distances:
> > > > node   0   1   2
> > > >   0:  10  20  20
> > > >   1:  20  10  20
> > > >   2:  20  20  10
> > > > -------------------
> > > >
> > > > The numa_node 2 is expected to be mapped to a CXL device, I do see
> > > > some activity in numastat output, but it's unclear if this is really
> > > > mapped to the CXL device since the device itself says numa_node is -1
> > > > (expected to show 2).
> > > >
> > > > Has anybody seen this behavior? Any help will be greatly appreciated.
> > > >
> > > > Thanks,
> > > > Sajjan
> > >
> > > --
> > > *CONFIDENTIALITY NOTICE:*
> > >
> > > The contents of this email message and any
> > > attachments are intended solely for the addressee(s) and may contain
> > > confidential and/or privileged information and may be legally protected
> > > from disclosure. If you are not the intended recipient of this message or
> > > their agent, or if this message has been addressed to you in error, please
> > > immediately alert the sender by reply email and then delete this message
> > > and any attachments. If you are not the intended recipient, you are hereby
> > > notified that any use, dissemination, copying, or storage of this message
> > > or its attachments is strictly prohibited.
>
>
>
> --
>
> Dimitrios Payvos-Giannas, PhD
>
> Software Engineer
>
> ZeroPoint Technologies
>
> Remove the waste.
>
> Release the power.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2023-08-23 11:13       ` Sajjan Rao
@ 2023-08-23 16:50         ` Jonathan Cameron
  2023-08-24  6:26           ` Sajjan Rao
  0 siblings, 1 reply; 50+ messages in thread
From: Jonathan Cameron @ 2023-08-23 16:50 UTC (permalink / raw)
  To: Sajjan Rao; +Cc: Dimitrios Palyvos, linux-cxl

On Wed, 23 Aug 2023 16:43:13 +0530
Sajjan Rao <sajjanr@gmail.com> wrote:

> Thank you Dimitrios. That worked!
> 
> On Mon, Aug 21, 2023 at 4:23 PM Dimitrios Palyvos
> <dimitrios.palyvos@zptcorp.com> wrote:
> >
> > Hi,
> >
> > Ah yes, I believe you need to enable the kernel config option
> > CONFIG_CXL_REGION_INVALIDATION_TEST for the region creation to work in
> > QEMU. The help entry of that config option gives more info on the why.
> >
> > Hope that helps!
> >
> > Kind regards,
> > Dimitris
> >
> >
> > On Mon, Aug 21, 2023 at 12:01 PM Sajjan Rao <sajjanr@gmail.com> wrote:  
> > >
> > > Hello Dimitrios,
> > >
> > > Thank you for the pointers. I have the 6.4.10 kernel and modified the
> > > qemu options, but now I see an error creating the region.
> > > Is there anything else I missed?
> > >
> > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > cxl region: create_region: region0: failed to commit decode: No such
> > > device or address
> > >
> > > Thanks,
> > > Sajjan
> > >
> > > -- qemu
> > >
> > > qemu-system-x86_64 \
> > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > >  -machine type=q35,cxl=on \
> > >  -m 4G \
> > >  -smp cpus=2 \
> > >  -accel tcg,thread=single \
> > >  -object memory-backend-ram,size=4G,id=m0 \
> > >  -object memory-backend-ram,size=256M,id=cxl-mem1 \
> > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > >  -device virtio-net-pci,netdev=net0 \
> > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > >  -nographic
> > >
> > > -----
> > >
> > > [root@cxl-test /]# uname -r
> > > 6.4.10-200.fc38.x86_64
> > > [root@cxl-test /]# cxl list
> > > [
> > >   {
> > >     "memdev":"mem0",
> > >     "ram_size":268435456,
> > >     "serial":0,
> > >     "host":"0000:0d:00.0"
> > >   }
> > > ]
> > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > cxl region: create_region: region0: failed to commit decode: No such
> > > device or address
> > >
> > > [root@cxl-test /]#
> > >
> > > On Fri, Aug 18, 2023 at 8:31 PM Dimitrios Palyvos
> > > <dimitrios.palyvos@zptcorp.com> wrote:  
> > > >
> > > > Hi,
> > > >
> > > > I am not an expert (and not 100% sure if that's what you want to do),
> > > > but here's one way to get your configuration to work:
> > > > 1. Disable KVM.

Just to second this - don't use KVM and expect it to work with CXL emulation
if you are trying to use kmem to present it as normal memory - it should be fine
as long as you never run instructions resident in that memory.

It will crash in nasty ways due to various issues with instruction emulation
where it is running out of memory behind the emulated interleave decoders.

So far we haven't cared enough to add the complexity that would be needed
to make that work.

TCG is the way to go for now.

Jonathan
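For anyone following along, once you have relaunched under TCG and committed the region, a quick way to confirm things worked is to read each memdev's NUMA node back from sysfs (a hedged sketch; the paths below are the standard CXL sysfs layout, and the loop simply prints nothing on a machine with no CXL devices):

```shell
# List each CXL memdev and the NUMA node it maps to; -1 means no
# region has been committed/onlined for that device yet.
for dev in /sys/bus/cxl/devices/mem*; do
    # Skip the unexpanded glob when no CXL memdevs are present.
    [ -e "$dev/numa_node" ] || continue
    printf '%s -> numa_node %s\n' "$(basename "$dev")" "$(cat "$dev/numa_node")"
done
```

On the configuration discussed above, mem0 should report the node that `lsmem`/`numactl -H` show backing the CXL region, rather than -1.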

> > > > 2. Remove the CXL NUMA node from the QEMU command.
> > > > 3. Use the ndctl utilities in the guest to initialize your CXL memory
> > > > and associated NUMA node.
> > > >
> > > > More specifically, I changed your QEMU command as follows:
> > > >
> > > >     qemu-system-x86_64 \
> > > >      -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > >      -machine type=q35,cxl=on \
> > > >      -m 8G \
> > > >      -smp cpus=8 \
> > > >      -object memory-backend-ram,size=4G,id=m0 \
> > > >      -object memory-backend-ram,size=4G,id=m1 \
> > > >      -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > >      -numa node,memdev=m0,cpus=0-3,nodeid=0 \
> > > >      -numa node,memdev=m1,cpus=4-7,nodeid=1 \
> > > >      -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > >      -device virtio-net-pci,netdev=net0 \
> > > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > >      -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > >      -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > >      -nographic
> > > >
> > > > In the guest, install ndctl: https://github.com/pmem/ndctl
> > > >
> > > > After that, you should be able to see the CXL memory:
> > > >     root@cxl-img:~# cxl list
> > > >     [
> > > >       {
> > > >         "memdev":"mem0",
> > > >         "ram_size":2147483648,
> > > >         "serial":0,
> > > >         "host":"0000:0d:00.0"
> > > >       }
> > > >     ]
> > > >
> > > > And initialize it as RAM:
> > > >     root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
> > > >     ...
> > > >
> > > > root@cxl-img:~# lsmem --output-all
> > > >     RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE  ZONES
> > > >     0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0   None
> > > >     0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15    0  DMA32
> > > >     0x0000000100000000-0x000000017fffffff    2G online       yes 32-47    0 Normal
> > > >     0x0000000180000000-0x000000027fffffff    4G online       yes 48-79    1 Normal
> > > >     0x0000000290000000-0x000000030fffffff    2G online       yes 82-97    2 Normal
> > > >
> > > >     Memory block size:       128M
> > > >     Total online memory:      10G
> > > >     Total offline memory:      0B
> > > >
> > > >
> > > >     root@cxl-img:~# cat /proc/iomem
> > > >     ...
> > > >     290000000-38fffffff : CXL Window 0
> > > >       290000000-30fffffff : region0
> > > >         290000000-30fffffff : dax0.0
> > > >           290000000-30fffffff : System RAM (kmem)
> > > >
> > > >
> > > > Then you can generate traffic in the CXL NUMA node, for example:
> > > >
> > > >     root@cxl-img:~# numactl --membind 2 ls
> > > >
> > > > Note: The above is with linux v6.4.11.
> > > >
> > > > Hope that helps!
> > > >
> > > > Kind regards,
> > > > Dimitris
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Aug 18, 2023 at 11:39 AM Sajjan Rao <sajjanr@gmail.com> wrote:  
> > > > >
> > > > > Hello,
> > > > >
> > > > > I have a qemu + cxl configuration coming up with one configured type 3
> > > > > device. My goal is to generate some cxl.mem traffic in this
> > > > > configuration.
> > > > > However the numa_node is always showing as -1. I have tried various
> > > > > qemu command line parameters including to explicitly set numa_node for
> > > > > cxl devices.
> > > > >
> > > > > Here is my qemu command line
> > > > > --------
> > > > > qemu-system-x86_64 \
> > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > >  -machine type=q35,accel=kvm,cxl=on \
> > > > >  -m 10G \
> > > > >  -smp cpus=8 \
> > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > >  -object memory-backend-ram,size=4G,id=m1 \
> > > > >  -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > >  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
> > > > >  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
> > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > >  -device virtio-net-pci,netdev=net0 \
> > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > >  -enable-kvm \
> > > > >  -nographic
> > > > > -----
> > > > >
> > > > > I see that the cxl device is listed in lspci output
> > > > > ------
> > > > > #lspci | grep -i cxl
> > > > > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > > > >
> > > > > #lspci -s 0d:00.0 -vvv | grep -i numa
> > > > > #
> > > > >
> > > > > -------
> > > > >
> > > > > sysfs output
> > > > > ----------
> > > > > #cat /sys/bus/cxl/devices/mem0/numa_node
> > > > > -1
> > > > > --------
> > > > >
> > > > > numactl output
> > > > >
> > > > > ------------------
> > > > > #numactl -H
> > > > > available: 3 nodes (0-2)
> > > > > node 0 cpus: 0 1
> > > > > node 0 size: 3910 MB
> > > > > node 0 free: 3776 MB
> > > > > node 1 cpus: 2 3
> > > > > node 1 size: 4031 MB
> > > > > node 1 free: 3927 MB
> > > > > node 2 cpus: 4 5 6 7
> > > > > node 2 size: 2011 MB
> > > > > node 2 free: 1785 MB
> > > > > node distances:
> > > > > node   0   1   2
> > > > >   0:  10  20  20
> > > > >   1:  20  10  20
> > > > >   2:  20  20  10
> > > > > -------------------
> > > > >
> > > > > The numa_node 2 is expected to be mapped to a CXL device, I do see
> > > > > some activity in numastat output, but it's unclear if this is really
> > > > > mapped to the CXL device since the device itself says numa_node is -1
> > > > > (expected to show 2).
> > > > >
> > > > > Has anybody seen this behavior? Any help will be greatly appreciated.
> > > > >
> > > > > Thanks,
> > > > > Sajjan  
> > > >
> >
> >
> >


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2023-08-23 16:50         ` Jonathan Cameron
@ 2023-08-24  6:26           ` Sajjan Rao
  2024-01-25  8:15             ` Sajjan Rao
  0 siblings, 1 reply; 50+ messages in thread
From: Sajjan Rao @ 2023-08-24  6:26 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dimitrios Palyvos, linux-cxl

Understood. Thank you Jonathan.

On Wed, Aug 23, 2023 at 10:21 PM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 23 Aug 2023 16:43:13 +0530
> Sajjan Rao <sajjanr@gmail.com> wrote:
>
> > Thank you Dimitrios. That worked!
> >
> > On Mon, Aug 21, 2023 at 4:23 PM Dimitrios Palyvos
> > <dimitrios.palyvos@zptcorp.com> wrote:
> > >
> > > Hi,
> > >
> > > Ah yes, I believe you need to enable the kernel config option
> > > CONFIG_CXL_REGION_INVALIDATION_TEST for the region creation to work in
> > > QEMU. The help entry of that config option gives more info on the why.
> > >
> > > Hope that helps!
> > >
> > > Kind regards,
> > > Dimitris
> > >
> > >
> > > On Mon, Aug 21, 2023 at 12:01 PM Sajjan Rao <sajjanr@gmail.com> wrote:
> > > >
> > > > Hello Dimitrios,
> > > >
> > > > Thank you for the pointers. I have the 6.4.10 kernel and modified the
> > > > qemu options, but now I see an error creating the region.
> > > > Is there anything else I missed?
> > > >
> > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > device or address
> > > >
> > > > Thanks,
> > > > Sajjan
> > > >
> > > > -- qemu
> > > >
> > > > qemu-system-x86_64 \
> > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > >  -machine type=q35,cxl=on \
> > > >  -m 4G \
> > > >  -smp cpus=2 \
> > > >  -accel tcg,thread=single \
> > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > >  -object memory-backend-ram,size=256M,id=cxl-mem1 \
> > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > >  -device virtio-net-pci,netdev=net0 \
> > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > >  -nographic
> > > >
> > > > -----
> > > >
> > > > [root@cxl-test /]# uname -r
> > > > 6.4.10-200.fc38.x86_64
> > > > [root@cxl-test /]# cxl list
> > > > [
> > > >   {
> > > >     "memdev":"mem0",
> > > >     "ram_size":268435456,
> > > >     "serial":0,
> > > >     "host":"0000:0d:00.0"
> > > >   }
> > > > ]
> > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > device or address
> > > >
> > > > [root@cxl-test /]#
> > > >
> > > > On Fri, Aug 18, 2023 at 8:31 PM Dimitrios Palyvos
> > > > <dimitrios.palyvos@zptcorp.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am not an expert (and not 100% sure if that's what you want to do),
> > > > > but here's one way to get your configuration to work:
> > > > > 1. Disable KVM.
>
> Just to second this - don't use KVM and expect it to work with CXL emulation
> if you are trying to use kmem to present it as normal memory - it should be fine
> as long as you never run instructions resident in that memory.
>
> It will crash in nasty ways due to various issues with instruction emulation
> where it is running out of memory behind the emulated interleave decoders.
>
> So far we haven't cared enough to add the complexity that would be needed
> to make that work.
>
> TCG is the way to go for now.
>
> Jonathan
>
> > > > > 2. Remove the CXL NUMA node from the QEMU command.
> > > > > 3. Use the ndctl utilities in the guest to initialize your CXL memory
> > > > > and associated NUMA node.
> > > > >
> > > > > More specifically, I changed your QEMU command as follows:
> > > > >
> > > > >     qemu-system-x86_64 \
> > > > >      -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > >      -machine type=q35,cxl=on \
> > > > >      -m 8G \
> > > > >      -smp cpus=8 \
> > > > >      -object memory-backend-ram,size=4G,id=m0 \
> > > > >      -object memory-backend-ram,size=4G,id=m1 \
> > > > >      -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > >      -numa node,memdev=m0,cpus=0-3,nodeid=0 \
> > > > >      -numa node,memdev=m1,cpus=4-7,nodeid=1 \
> > > > >      -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > >      -device virtio-net-pci,netdev=net0 \
> > > > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > >      -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > >      -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > >      -nographic
> > > > >
> > > > > In the guest, install ndctl: https://github.com/pmem/ndctl
> > > > >
> > > > > After that, you should be able to see the CXL memory:
> > > > >     root@cxl-img:~# cxl list
> > > > >     [
> > > > >       {
> > > > >         "memdev":"mem0",
> > > > >         "ram_size":2147483648,
> > > > >         "serial":0,
> > > > >         "host":"0000:0d:00.0"
> > > > >       }
> > > > >     ]
> > > > >
> > > > > And initialize it as RAM:
> > > > >     root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
> > > > >     ...
> > > > >
> > > > > root@cxl-img:~# lsmem --output-all
> > > > >     RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE  ZONES
> > > > >     0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0   None
> > > > >     0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15    0  DMA32
> > > > >     0x0000000100000000-0x000000017fffffff    2G online       yes 32-47    0 Normal
> > > > >     0x0000000180000000-0x000000027fffffff    4G online       yes 48-79    1 Normal
> > > > >     0x0000000290000000-0x000000030fffffff    2G online       yes 82-97    2 Normal
> > > > >
> > > > >     Memory block size:       128M
> > > > >     Total online memory:      10G
> > > > >     Total offline memory:      0B
> > > > >
> > > > >
> > > > >     root@cxl-img:~# cat /proc/iomem
> > > > >     ...
> > > > >     290000000-38fffffff : CXL Window 0
> > > > >       290000000-30fffffff : region0
> > > > >         290000000-30fffffff : dax0.0
> > > > >           290000000-30fffffff : System RAM (kmem)
> > > > >
> > > > >
> > > > > Then you can generate traffic in the CXL NUMA node, for example:
> > > > >
> > > > >     root@cxl-img:~# numactl --membind 2 ls
> > > > >
> > > > > Note: The above is with linux v6.4.11.
> > > > >
> > > > > Hope that helps!
> > > > >
> > > > > Kind regards,
> > > > > Dimitris
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 18, 2023 at 11:39 AM Sajjan Rao <sajjanr@gmail.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I have a qemu + cxl configuration coming up with one configured type 3
> > > > > > device. My goal is to generate some cxl.mem traffic in this
> > > > > > configuration.
> > > > > > However the numa_node is always showing as -1. I have tried various
> > > > > > qemu command line parameters including to explicitly set numa_node for
> > > > > > cxl devices.
> > > > > >
> > > > > > Here is my qemu command line
> > > > > > --------
> > > > > > qemu-system-x86_64 \
> > > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > >  -machine type=q35,accel=kvm,cxl=on \
> > > > > >  -m 10G \
> > > > > >  -smp cpus=8 \
> > > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > > >  -object memory-backend-ram,size=4G,id=m1 \
> > > > > >  -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > > >  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
> > > > > >  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
> > > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > >  -device virtio-net-pci,netdev=net0 \
> > > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > >  -enable-kvm \
> > > > > >  -nographic
> > > > > > -----
> > > > > >
> > > > > > I see that the cxl device is listed in lspci output
> > > > > > ------
> > > > > > #lspci | grep -i cxl
> > > > > > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > > > > >
> > > > > > #lspci -s 0d:00.0 -vvv | grep -i numa
> > > > > > #
> > > > > >
> > > > > > -------
> > > > > >
> > > > > > sysfs output
> > > > > > ----------
> > > > > > #cat /sys/bus/cxl/devices/mem0/numa_node
> > > > > > -1
> > > > > > --------
> > > > > >
> > > > > > numactl output
> > > > > >
> > > > > > ------------------
> > > > > > #numactl -H
> > > > > > available: 3 nodes (0-2)
> > > > > > node 0 cpus: 0 1
> > > > > > node 0 size: 3910 MB
> > > > > > node 0 free: 3776 MB
> > > > > > node 1 cpus: 2 3
> > > > > > node 1 size: 4031 MB
> > > > > > node 1 free: 3927 MB
> > > > > > node 2 cpus: 4 5 6 7
> > > > > > node 2 size: 2011 MB
> > > > > > node 2 free: 1785 MB
> > > > > > node distances:
> > > > > > node   0   1   2
> > > > > >   0:  10  20  20
> > > > > >   1:  20  10  20
> > > > > >   2:  20  20  10
> > > > > > -------------------
> > > > > >
> > > > > > The numa_node 2 is expected to be mapped to a CXL device, I do see
> > > > > > some activity in numastat output, but it's unclear if this is really
> > > > > > mapped to the CXL device since the device itself says numa_node is -1
> > > > > > (expected to show 2).
> > > > > >
> > > > > > Has anybody seen this behavior? Any help will be greatly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Sajjan
> > > > >
> > >
> > >
> > >
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2023-08-24  6:26           ` Sajjan Rao
@ 2024-01-25  8:15             ` Sajjan Rao
  2024-01-26 12:39               ` Jonathan Cameron
  0 siblings, 1 reply; 50+ messages in thread
From: Sajjan Rao @ 2024-01-25  8:15 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dimitrios Palyvos, linux-cxl

Looks like something changed in QEMU 8.2 that broke running code out
of CXL memory with KVM disabled.
I used "numactl --membind 2 ls" as suggested by Dimitrios earlier;
this worked for me until I updated to the latest QEMU.

Is this a known issue? Or am I missing something?

Thanks,
Sajjan


On Thu, Aug 24, 2023 at 11:56 AM Sajjan Rao <sajjanr@gmail.com> wrote:
>
> Understood. Thank you Jonathan.
>
> On Wed, Aug 23, 2023 at 10:21 PM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Wed, 23 Aug 2023 16:43:13 +0530
> > Sajjan Rao <sajjanr@gmail.com> wrote:
> >
> > > Thank you Dimitrios. That worked!
> > >
> > > On Mon, Aug 21, 2023 at 4:23 PM Dimitrios Palyvos
> > > <dimitrios.palyvos@zptcorp.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Ah yes, I believe you need to enable the kernel config option
> > > > CONFIG_CXL_REGION_INVALIDATION_TEST for the region creation to work in
> > > > QEMU. The help entry of that config option gives more info on the why.
> > > >
> > > > Hope that helps!
> > > >
> > > > Kind regards,
> > > > Dimitris
> > > >
> > > >
> > > > On Mon, Aug 21, 2023 at 12:01 PM Sajjan Rao <sajjanr@gmail.com> wrote:
> > > > >
> > > > > Hello Dimitrios,
> > > > >
> > > > > Thank you for the pointers. I have the 6.4.10 kernel and modified the
> > > > > qemu options, but now I see an error creating the region.
> > > > > Is there anything else I missed?
> > > > >
> > > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > > device or address
> > > > >
> > > > > Thanks,
> > > > > Sajjan
> > > > >
> > > > > -- qemu
> > > > >
> > > > > qemu-system-x86_64 \
> > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > >  -machine type=q35,cxl=on \
> > > > >  -m 4G \
> > > > >  -smp cpus=2 \
> > > > >  -accel tcg,thread=single \
> > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > >  -object memory-backend-ram,size=256M,id=cxl-mem1 \
> > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > >  -device virtio-net-pci,netdev=net0 \
> > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > >  -nographic
> > > > >
> > > > > -----
> > > > >
> > > > > [root@cxl-test /]# uname -r
> > > > > 6.4.10-200.fc38.x86_64
> > > > > [root@cxl-test /]# cxl list
> > > > > [
> > > > >   {
> > > > >     "memdev":"mem0",
> > > > >     "ram_size":268435456,
> > > > >     "serial":0,
> > > > >     "host":"0000:0d:00.0"
> > > > >   }
> > > > > ]
> > > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > > device or address
> > > > >
> > > > > [root@cxl-test /]#
> > > > >
> > > > > On Fri, Aug 18, 2023 at 8:31 PM Dimitrios Palyvos
> > > > > <dimitrios.palyvos@zptcorp.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am not an expert (and not 100% sure if that's what you want to do),
> > > > > > but here's one way to get your configuration to work:
> > > > > > 1. Disable KVM.
> >
> > Just to second this - don't use KVM and expect it to work with CXL emulation
> > if you are trying to use kmem to present it as normal memory - it should be fine
> > as long as you never run instructions resident in that memory.
> >
> > It will crash in nasty ways due to various issues with instruction emulation
> > where it is running out of memory behind the emulated interleave decoders.
> >
> > So far we haven't cared enough to add the complexity that would be needed
> > to make that work.
> >
> > TCG is the way to go for now.
> >
> > Jonathan
> >
> > > > > > 2. Remove the CXL NUMA node from the QEMU command.
> > > > > > 3. Use the ndctl utilities in the guest to initialize your CXL memory
> > > > > > and associated NUMA node.
> > > > > >
> > > > > > More specifically, I changed your QEMU command as follows:
> > > > > >
> > > > > >     qemu-system-x86_64 \
> > > > > >      -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > >      -machine type=q35,cxl=on \
> > > > > >      -m 8G \
> > > > > >      -smp cpus=8 \
> > > > > >      -object memory-backend-ram,size=4G,id=m0 \
> > > > > >      -object memory-backend-ram,size=4G,id=m1 \
> > > > > >      -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > > >      -numa node,memdev=m0,cpus=0-3,nodeid=0 \
> > > > > >      -numa node,memdev=m1,cpus=4-7,nodeid=1 \
> > > > > >      -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > >      -device virtio-net-pci,netdev=net0 \
> > > > > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > >      -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > >      -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > >      -nographic
> > > > > >
> > > > > > In the guest, install ndctl: https://github.com/pmem/ndctl
> > > > > >
> > > > > > After that, you should be able to see the CXL memory:
> > > > > >     root@cxl-img:~# cxl list
> > > > > >     [
> > > > > >       {
> > > > > >         "memdev":"mem0",
> > > > > >         "ram_size":2147483648,
> > > > > >         "serial":0,
> > > > > >         "host":"0000:0d:00.0"
> > > > > >       }
> > > > > >     ]
> > > > > >
> > > > > > And initialize it as RAM:
> > > > > >     root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
> > > > > >     ...
> > > > > >
> > > > > > root@cxl-img:~# lsmem --output-all
> > > > > >     RANGE                                  SIZE  STATE REMOVABLE BLOCK
> > > > > > NODE  ZONES
> > > > > >     0x0000000000000000-0x0000000007ffffff  128M online       yes     0
> > > > > >    0   None
> > > > > >     0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15
> > > > > >    0  DMA32
> > > > > >     0x0000000100000000-0x000000017fffffff    2G online       yes 32-47
> > > > > >    0 Normal
> > > > > >     0x0000000180000000-0x000000027fffffff    4G online       yes 48-79
> > > > > >    1 Normal
> > > > > >     0x0000000290000000-0x000000030fffffff    2G online       yes 82-97
> > > > > >    2 Normal
> > > > > >
> > > > > >     Memory block size:       128M
> > > > > >     Total online memory:      10G
> > > > > >     Total offline memory:      0B
> > > > > >
> > > > > >
> > > > > >     root@cxl-img:~# cat /proc/iomem
> > > > > >     ...
> > > > > >     290000000-38fffffff : CXL Window 0
> > > > > >       290000000-30fffffff : region0
> > > > > >         290000000-30fffffff : dax0.0
> > > > > >           290000000-30fffffff : System RAM (kmem)
> > > > > >
> > > > > >
> > > > > > Then you can generate traffic in the CXL NUMA node, for example:
> > > > > >
> > > > > >     root@cxl-img:~# numactl --membind 2 ls
> > > > > >
> > > > > > Note: The above is with linux v6.4.11.
> > > > > >
> > > > > > Hope that helps!
> > > > > >
> > > > > > Kind regards,
> > > > > > Dimitris
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 18, 2023 at 11:39 AM Sajjan Rao <sajjanr@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I have a qemu + cxl configuration coming up with one configured type 3
> > > > > > > device. My goal is to generate some cxl.mem traffic in this
> > > > > > > configuration.
> > > > > > > However the numa_node is always showing as -1. I have tried various
> > > > > > > qemu command line parameters including to explicitly set numa_node for
> > > > > > > cxl devices.
> > > > > > >
> > > > > > > Here is my qemu command line
> > > > > > > --------
> > > > > > > qemu-system-x86_64 \
> > > > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > > >  -machine type=q35,accel=kvm,cxl=on \
> > > > > > >  -m 10G \
> > > > > > >  -smp cpus=8 \
> > > > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > > > >  -object memory-backend-ram,size=4G,id=m1 \
> > > > > > >  -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > > > >  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
> > > > > > >  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
> > > > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > > >  -device virtio-net-pci,netdev=net0 \
> > > > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > > >  -enable-kvm \
> > > > > > >  -nographic
> > > > > > > -----
> > > > > > >
> > > > > > > I see that the cxl device is listed in lspci output
> > > > > > > ------
> > > > > > > #lspci | grep -i cxl
> > > > > > > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > > > > > >
> > > > > > > #lspci -s 0d:00.0 -vvv | grep -i numa
> > > > > > > #
> > > > > > >
> > > > > > > -------
> > > > > > >
> > > > > > > sysfs output
> > > > > > > ----------
> > > > > > > #cat /sys/bus/cxl/devices/mem0/numa_node
> > > > > > > -1
> > > > > > > --------
> > > > > > >
> > > > > > > numactl output
> > > > > > >
> > > > > > > ------------------
> > > > > > > #numactl -H
> > > > > > > available: 3 nodes (0-2)
> > > > > > > node 0 cpus: 0 1
> > > > > > > node 0 size: 3910 MB
> > > > > > > node 0 free: 3776 MB
> > > > > > > node 1 cpus: 2 3
> > > > > > > node 1 size: 4031 MB
> > > > > > > node 1 free: 3927 MB
> > > > > > > node 2 cpus: 4 5 6 7
> > > > > > > node 2 size: 2011 MB
> > > > > > > node 2 free: 1785 MB
> > > > > > > node distances:
> > > > > > > node   0   1   2
> > > > > > >   0:  10  20  20
> > > > > > >   1:  20  10  20
> > > > > > >   2:  20  20  10
> > > > > > > -------------------
> > > > > > >
> > > > > > > The numa_node 2 is expected to be mapped to a CXL device, I do see
> > > > > > > some activity in numastat output, but it's unclear if this is really
> > > > > > > mapped to the CXL device since the device itself says numa_node is -1
> > > > > > > (expected to show 2).
> > > > > > >
> > > > > > > Has anybody seen this behavior? Any help will be greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sajjan
> > > > > >
> > > > > > --
> > > > > > CONFIDENTIALITY NOTICE: The contents of this email message and any
> > > > > > attachments are intended solely for the addressee(s) and may contain
> > > > > > confidential and/or privileged information and may be legally protected
> > > > > > from disclosure. If you are not the intended recipient of this message or
> > > > > > their agent, or if this message has been addressed to you in error, please
> > > > > > immediately alert the sender by reply email and then delete this message
> > > > > > and any attachments. If you are not the intended recipient, you are hereby
> > > > > > notified that any use, dissemination, copying, or storage of this message
> > > > > > or its attachments is strictly prohibited.
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Dimitrios Palyvos-Giannas, PhD
> > > >
> > > > Software Engineer
> > > >
> > > > ZeroPoint Technologies
> > > >
> > > > Remove the waste.
> > > >
> > > > Release the power.
> > > >
> >

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2024-01-25  8:15             ` Sajjan Rao
@ 2024-01-26 12:39               ` Jonathan Cameron
  2024-01-26 15:43                 ` Gregory Price
  0 siblings, 1 reply; 50+ messages in thread
From: Jonathan Cameron @ 2024-01-26 12:39 UTC (permalink / raw)
  To: Sajjan Rao; +Cc: Dimitrios Palyvos, linux-cxl

On Thu, 25 Jan 2024 13:45:09 +0530
Sajjan Rao <sajjanr@gmail.com> wrote:

> Looks like something changed in QEMU 8.2 that broke running code out
> of CXL memory with KVM disabled.
> I used "numactl --membind 2 ls" as suggested by Dimitrios earlier,
> this worked for me until I updated to the latest QEMU.
> 
> Is this a known issue? Or am I missing something?

I'm confused as to how the description below ever worked.
Assigning the underlying memdev=cxl-mem1 to a NUMA node isn't going
to correctly build the connections to the CFMWS PA range.

I think you are mapping the same memory backend twice - once via
the normal NUMA node configuration as normal RAM (part of the -m 10G)
and once via the CXL type3 device, which then ends up connected behind
the CFMWS.  This is not a good idea as there are two paths to the same
memory.  CXL memory should not be part of the size provided via the -m
parameter.

The NUMA configuration for CXL memory in QEMU (which assumes OS-first
setup today) does not use the ACPI tables (SRAT/SLIT/HMAT) that result
from -numa entries on the QEMU command line.  Instead, the kernel
creates a NUMA node per CFMWS entry, and any devices connected to that
window end up in the appropriate NUMA node.
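To make the distinction concrete, here is a minimal sketch of the layout described above (IDs and sizes borrowed from earlier in the thread; illustrative only, not a verified command line). The CXL backend appears only behind the type3 device: it is absent from every -numa entry and is not counted in -m.

```shell
# Illustrative sketch only -- IDs/sizes reused from the thread.
qemu-system-x86_64 \
  -machine type=q35,cxl=on \
  -accel tcg,thread=single \
  -m 4G \
  -smp cpus=4 \
  -object memory-backend-ram,size=4G,id=m0 \
  -object memory-backend-ram,size=256M,id=cxl-mem1 \
  -numa node,memdev=m0,cpus=0-3,nodeid=0 \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
  -nographic
# The guest kernel creates a NUMA node for the CFMWS window once a region
# is committed, e.g.: cxl create-region -d decoder0.0 -s 268435456 -t ram
```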

Jonathan

> 
> Thanks,
> Sajjan
> 
> 
> On Thu, Aug 24, 2023 at 11:56 AM Sajjan Rao <sajjanr@gmail.com> wrote:
> >
> > Understood. Thank you Jonathan.
> >
> > On Wed, Aug 23, 2023 at 10:21 PM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:  
> > >
> > > On Wed, 23 Aug 2023 16:43:13 +0530
> > > Sajjan Rao <sajjanr@gmail.com> wrote:
> > >  
> > > > Thank you Dimitrios. That worked!
> > > >
> > > > On Mon, Aug 21, 2023 at 4:23 PM Dimitrios Palyvos
> > > > <dimitrios.palyvos@zptcorp.com> wrote:  
> > > > >
> > > > > Hi,
> > > > >
> > > > > Ah yes, I believe you need to enable the kernel config option
> > > > > CONFIG_CXL_REGION_INVALIDATION_TEST for the region creation to work in
> > > > > QEMU. The help entry of that config option gives more info on the why.
> > > > >
> > > > > Hope that helps!
> > > > >
> > > > > Kind regards,
> > > > > Dimitris
> > > > >
> > > > >
> > > > > On Mon, Aug 21, 2023 at 12:01 PM Sajjan Rao <sajjanr@gmail.com> wrote:  
> > > > > >
> > > > > > Hello Dimitrios,
> > > > > >
> > > > > > Thank you for the pointers. I have the 6.4.10 kernel and modified the
> > > > > > qemu options, but now I see an error creating the region.
> > > > > > Is there anything else I missed?
> > > > > >
> > > > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > > > device or address
> > > > > >
> > > > > > Thanks,
> > > > > > Sajjan
> > > > > >
> > > > > > -- qemu
> > > > > >
> > > > > > qemu-system-x86_64 \
> > > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > >  -machine type=q35,cxl=on \
> > > > > >  -m 4G \
> > > > > >  -smp cpus=2 \
> > > > > >  -accel tcg,thread=single \
> > > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > > >  -object memory-backend-ram,size=256M,id=cxl-mem1 \
> > > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > >  -device virtio-net-pci,netdev=net0 \
> > > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > >  -nographic
> > > > > >
> > > > > > -----
> > > > > >
> > > > > > [root@cxl-test /]# uname -r
> > > > > > 6.4.10-200.fc38.x86_64
> > > > > > [root@cxl-test /]# cxl list
> > > > > > [
> > > > > >   {
> > > > > >     "memdev":"mem0",
> > > > > >     "ram_size":268435456,
> > > > > >     "serial":0,
> > > > > >     "host":"0000:0d:00.0"
> > > > > >   }
> > > > > > ]
> > > > > > [root@cxl-test /]# cxl create-region -d decoder0.0 -s 268435456 -t ram
> > > > > > [ 4144.982608] cxl region0: Failed to synchronize CPU cache state
> > > > > > cxl region: create_region: region0: failed to commit decode: No such
> > > > > > device or address
> > > > > >
> > > > > > [root@cxl-test /]#
> > > > > >
> > > > > > On Fri, Aug 18, 2023 at 8:31 PM Dimitrios Palyvos
> > > > > > <dimitrios.palyvos@zptcorp.com> wrote:  
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am not an expert (and not 100% sure if that's what you want to do),
> > > > > > > but here's one way to get your configuration to work:
> > > > > > > 1. Disable KVM.  
> > >
> > > Just to second this - don't use KVM and expect it to work with CXL emulation
> > > if you are trying to use kmem to present it as normal memory - it should be fine
> > > as long as you never run instructions resident in that memory.
> > >
> > > It will crash in nasty ways due to various issues with instruction emulation
> > > when it is executing out of memory that sits behind the emulated interleave decoders.
> > >
> > > So far we haven't cared enough to add the complexity that would be needed
> > > to make that work.
> > >
> > > TCG is the way to go for now.
> > >
> > > Jonathan
> > >  
> > > > > > > 2. Remove the CXL NUMA node from the QEMU command.
> > > > > > > 3. Use the ndctl utilities in the guest to initialize your CXL memory
> > > > > > > and associated NUMA node.
> > > > > > >
> > > > > > > More specifically, I changed your QEMU command as follows:
> > > > > > >
> > > > > > >     qemu-system-x86_64 \
> > > > > > >      -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > > >      -machine type=q35,cxl=on \
> > > > > > >      -m 8G \
> > > > > > >      -smp cpus=8 \
> > > > > > >      -object memory-backend-ram,size=4G,id=m0 \
> > > > > > >      -object memory-backend-ram,size=4G,id=m1 \
> > > > > > >      -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > > > >      -numa node,memdev=m0,cpus=0-3,nodeid=0 \
> > > > > > >      -numa node,memdev=m1,cpus=4-7,nodeid=1 \
> > > > > > >      -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > > >      -device virtio-net-pci,netdev=net0 \
> > > > > > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > > >      -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > > >      -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > > >      -nographic
> > > > > > >
> > > > > > > In the guest, install ndctl: https://github.com/pmem/ndctl
> > > > > > >
> > > > > > > After that, you should be able to see the CXL memory:
> > > > > > >     root@cxl-img:~# cxl list
> > > > > > >     [
> > > > > > >       {
> > > > > > >         "memdev":"mem0",
> > > > > > >         "ram_size":2147483648,
> > > > > > >         "serial":0,
> > > > > > >         "host":"0000:0d:00.0"
> > > > > > >       }
> > > > > > >     ]
> > > > > > >
> > > > > > > And initialize it as RAM:
> > > > > > >     root@cxl-img:~# cxl create-region -d decoder0.0 -s 2147483648 -t ram
> > > > > > >     ...
> > > > > > >
> > > > > > > root@cxl-img:~# lsmem --output-all
> > > > > > >     RANGE                                  SIZE  STATE REMOVABLE BLOCK
> > > > > > > NODE  ZONES
> > > > > > >     0x0000000000000000-0x0000000007ffffff  128M online       yes     0
> > > > > > >    0   None
> > > > > > >     0x0000000008000000-0x000000007fffffff  1.9G online       yes  1-15
> > > > > > >    0  DMA32
> > > > > > >     0x0000000100000000-0x000000017fffffff    2G online       yes 32-47
> > > > > > >    0 Normal
> > > > > > >     0x0000000180000000-0x000000027fffffff    4G online       yes 48-79
> > > > > > >    1 Normal
> > > > > > >     0x0000000290000000-0x000000030fffffff    2G online       yes 82-97
> > > > > > >    2 Normal
> > > > > > >
> > > > > > >     Memory block size:       128M
> > > > > > >     Total online memory:      10G
> > > > > > >     Total offline memory:      0B
> > > > > > >
> > > > > > >
> > > > > > >     root@cxl-img:~# cat /proc/iomem
> > > > > > >     ...
> > > > > > >     290000000-38fffffff : CXL Window 0
> > > > > > >       290000000-30fffffff : region0
> > > > > > >         290000000-30fffffff : dax0.0
> > > > > > >           290000000-30fffffff : System RAM (kmem)
> > > > > > >
> > > > > > >
> > > > > > > Then you can generate traffic in the CXL NUMA node, for example:
> > > > > > >
> > > > > > >     root@cxl-img:~# numactl --membind 2 ls
> > > > > > >
> > > > > > > Note: The above is with linux v6.4.11.
> > > > > > >
> > > > > > > Hope that helps!
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Dimitris
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 18, 2023 at 11:39 AM Sajjan Rao <sajjanr@gmail.com> wrote:  
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I have a qemu + cxl configuration coming up with one configured type 3
> > > > > > > > device. My goal is to generate some cxl.mem traffic in this
> > > > > > > > configuration.
> > > > > > > > However the numa_node is always showing as -1. I have tried various
> > > > > > > > qemu command line parameters including to explicitly set numa_node for
> > > > > > > > cxl devices.
> > > > > > > >
> > > > > > > > Here is my qemu command line
> > > > > > > > --------
> > > > > > > > qemu-system-x86_64 \
> > > > > > > >  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
> > > > > > > >  -machine type=q35,accel=kvm,cxl=on \
> > > > > > > >  -m 10G \
> > > > > > > >  -smp cpus=8 \
> > > > > > > >  -object memory-backend-ram,size=4G,id=m0 \
> > > > > > > >  -object memory-backend-ram,size=4G,id=m1 \
> > > > > > > >  -object memory-backend-ram,size=2G,id=cxl-mem1 \
> > > > > > > >  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
> > > > > > > >  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
> > > > > > > >  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
> > > > > > > >  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9  \
> > > > > > > >  -device virtio-net-pci,netdev=net0 \
> > > > > > > >  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > > > >  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> > > > > > > >  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-mem1 \
> > > > > > > >  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> > > > > > > >  -enable-kvm \
> > > > > > > >  -nographic
> > > > > > > > -----
> > > > > > > >
> > > > > > > > I see that the cxl device is listed in lspci output
> > > > > > > > ------
> > > > > > > > #lspci | grep -i cxl
> > > > > > > > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > > > > > > >
> > > > > > > > #lspci -s 0d:00.0 -vvv | grep -i numa
> > > > > > > > #
> > > > > > > >
> > > > > > > > -------
> > > > > > > >
> > > > > > > > sysfs output
> > > > > > > > ----------
> > > > > > > > #cat /sys/bus/cxl/devices/mem0/numa_node
> > > > > > > > -1
> > > > > > > > --------
> > > > > > > >
> > > > > > > > numactl output
> > > > > > > >
> > > > > > > > ------------------
> > > > > > > > #numactl -H
> > > > > > > > available: 3 nodes (0-2)
> > > > > > > > node 0 cpus: 0 1
> > > > > > > > node 0 size: 3910 MB
> > > > > > > > node 0 free: 3776 MB
> > > > > > > > node 1 cpus: 2 3
> > > > > > > > node 1 size: 4031 MB
> > > > > > > > node 1 free: 3927 MB
> > > > > > > > node 2 cpus: 4 5 6 7
> > > > > > > > node 2 size: 2011 MB
> > > > > > > > node 2 free: 1785 MB
> > > > > > > > node distances:
> > > > > > > > node   0   1   2
> > > > > > > >   0:  10  20  20
> > > > > > > >   1:  20  10  20
> > > > > > > >   2:  20  20  10
> > > > > > > > -------------------
> > > > > > > >
> > > > > > > > The numa_node 2 is expected to be mapped to a CXL device, I do see
> > > > > > > > some activity in numastat output, but it's unclear if this is really
> > > > > > > > mapped to the CXL device since the device itself says numa_node is -1
> > > > > > > > (expected to show 2).
> > > > > > > >
> > > > > > > > Has anybody seen this behavior? Any help will be greatly appreciated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sajjan  
> > > > > > >
> > > > >
> > > > >
> > > > >
> > >  


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2024-01-26 12:39               ` Jonathan Cameron
@ 2024-01-26 15:43                 ` Gregory Price
  2024-01-26 17:12                   ` Jonathan Cameron
  0 siblings, 1 reply; 50+ messages in thread
From: Gregory Price @ 2024-01-26 15:43 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Sajjan Rao, Dimitrios Palyvos, linux-cxl

On Fri, Jan 26, 2024 at 12:39:26PM +0000, Jonathan Cameron wrote:
> On Thu, 25 Jan 2024 13:45:09 +0530
> Sajjan Rao <sajjanr@gmail.com> wrote:
> 
> > Looks like something changed in QEMU 8.2 that broke running code out
> > of CXL memory with KVM disabled.
> > I used "numactl --membind 2 ls" as suggested by Dimitrios earlier,
> > this worked for me until I updated to the latest QEMU.
> > 
> > Is this a known issue? Or am I missing something?
> 
> I'm confused on how the description below ever worked.
> Assigning the underlying memdev=cxl-mem1 to a numa node isn't going
> to correctly build the connections to the CFMWS PA range.
> 

I've now seen 3-4 occasions where people have done this and run into
trouble (for obvious reasons).  Is there anything we can do to disallow
the double-registering of a single memdev to both a numa node and a cxl
device?

~Gregory

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2024-01-26 15:43                 ` Gregory Price
@ 2024-01-26 17:12                   ` Jonathan Cameron
  2024-01-30  8:20                     ` Sajjan Rao
  0 siblings, 1 reply; 50+ messages in thread
From: Jonathan Cameron @ 2024-01-26 17:12 UTC (permalink / raw)
  To: Gregory Price; +Cc: Sajjan Rao, Dimitrios Palyvos, linux-cxl

On Fri, 26 Jan 2024 10:43:43 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Fri, Jan 26, 2024 at 12:39:26PM +0000, Jonathan Cameron wrote:
> > On Thu, 25 Jan 2024 13:45:09 +0530
> > Sajjan Rao <sajjanr@gmail.com> wrote:
> >   
> > > Looks like something changed in QEMU 8.2 that broke running code out
> > > of CXL memory with KVM disabled.
> > > I used "numactl --membind 2 ls" as suggested by Dimitrios earlier,
> > > this worked for me until I updated to the latest QEMU.
> > > 
> > > Is this a known issue? Or am I missing something?  
> > 
> > I'm confused on how the description below ever worked.
> > Assigning the underlying memdev=cxl-mem1 to a numa node isn't going
> > to correctly build the connections to the CFMWS PA range.
> >   
> 
> I've now seen 3-4 occasions where people have done this and run into
> trouble (for obvious reasons).  Is there anything we can do to disallow
> the double-registering of a single memdev to both a numa node and a cxl
> device?
> 
It would be novel for us to prevent people shooting themselves
in the foot ;) but I guess this should be fairly easy: the NUMA node
logic already prevents the same backend being used multiple times, so
we can copy how that is done.

This should do the trick (very lightly tested).
It's end of day Friday here so a formal patch can wait for next week.


diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index f29346fae7..d4194bb757 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -827,6 +827,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
             error_setg(errp, "volatile memdev must have backing device");
             return false;
         }
+        if (host_memory_backend_is_mapped(ct3d->hostvmem)) {
+            error_setg(errp, "memory backend %s can't be used multiple times.",
+               object_get_canonical_path_component(OBJECT(ct3d->hostvmem)));
+            return false;
+        }
         memory_region_set_nonvolatile(vmr, false);
         memory_region_set_enabled(vmr, true);
         host_memory_backend_set_mapped(ct3d->hostvmem, true);
@@ -850,6 +855,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
             error_setg(errp, "persistent memdev must have backing device");
             return false;
         }
+        if (host_memory_backend_is_mapped(ct3d->hostpmem)) {
+            error_setg(errp, "memory backend %s can't be used multiple times.",
+               object_get_canonical_path_component(OBJECT(ct3d->hostpmem)));
+            return false;
+        }
         memory_region_set_nonvolatile(pmr, true);
         memory_region_set_enabled(pmr, true);
         host_memory_backend_set_mapped(ct3d->hostpmem, true);
@@ -880,6 +890,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
             error_setg(errp, "dynamic capacity must have backing device");
             return false;
         }
+        if (host_memory_backend_is_mapped(ct3d->dc.host_dc)) {
+            error_setg(errp, "memory backend %s can't be used multiple times.",
+               object_get_canonical_path_component(OBJECT(ct3d->dc.host_dc)));
+            return false;
+        }
         /* FIXME: set dc as nonvolatile for now */
         memory_region_set_nonvolatile(dc_mr, true);
         memory_region_set_enabled(dc_mr, true);
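With a check like the above in place, mapping one backend both ways - as in the command line that started this thread - should fail at startup instead of booting with two paths to the same RAM. A hypothetical reproducer (the exact error wording depends on the final patch; this just follows the error_setg text above):

```shell
# Hypothetical reproducer: cxl-mem1 is mapped both as NUMA node 2 RAM and
# as the type3 device's volatile memdev. With the check above, QEMU should
# refuse to start rather than expose two paths to the same backend.
qemu-system-x86_64 \
  -machine type=q35,cxl=on \
  -m 10G \
  -object memory-backend-ram,size=4G,id=m0 \
  -object memory-backend-ram,size=4G,id=m1 \
  -object memory-backend-ram,size=2G,id=cxl-mem1 \
  -numa node,memdev=m0,cpus=0-1,nodeid=0 \
  -numa node,memdev=m1,cpus=2-3,nodeid=1 \
  -numa node,memdev=cxl-mem1,cpus=4-7,nodeid=2 \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
  -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
# QEMU should now exit with an error along the lines of:
#   memory backend cxl-mem1 can't be used multiple times.
```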





> ~Gregory


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: qemu cxl memory expander shows numa_node -1
  2024-01-26 17:12                   ` Jonathan Cameron
@ 2024-01-30  8:20                     ` Sajjan Rao
  2024-02-01 13:04                         ` Jonathan Cameron
  0 siblings, 1 reply; 50+ messages in thread
From: Sajjan Rao @ 2024-01-30  8:20 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Gregory Price, Dimitrios Palyvos, linux-cxl

Hi Jonathan,

The QEMU command line in the original email was corrected back in
August 2023 based on the subsequent responses.

My current QEMU command line reads as below. As you can see, I am not
assigning a NUMA node to the CXL memory objects.

qemu-system-x86_64 \
 -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
 -machine type=q35,nvdimm=on,cxl=on \
 -accel tcg,thread=single \
 -m 4G \
 -smp cpus=4 \
 -object memory-backend-ram,size=4G,id=m0 \
 -object memory-backend-ram,size=256M,id=cxl-mem1 \
 -object memory-backend-ram,size=256M,id=cxl-mem2 \
 -numa node,memdev=m0,cpus=0-3,nodeid=0 \
 -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9,hostfwd=tcp::2222-:22
\
 -device virtio-net-pci,netdev=net0 \
 -device pxb-cxl,bus_nr=2,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true \
 -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
 -device cxl-upstream,bus=cxl_rp_port0,id=us0,addr=0.0,multifunction=on, \
 -device cxl-switch-mailbox-cci,bus=cxl_rp_port0,addr=0.2,target=us0 \
 -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
 -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
 -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
 -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
 -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=512M,cxl-fmw.0.interleave-granularity=2k
\
 -D /tmp/qemu.log \
 -nographic

Until I recently moved to QEMU version 8.2, I was able to create
regions and run native Linux commands out of CXL memory using
#numactl --membind <cxl NUMA#> top

You had advised me to turn off KVM and use TCG, since the membind
command runs code out of CXL memory, which is not supported under KVM.
With KVM disabled, the membind command worked fine.
However, with QEMU version 8.2 the same membind command results in a
hard kernel crash.
I wanted to check whether this is a known issue with 8.2 and whether
there is a way around it.

Thanks,
Sajjan

On Fri, Jan 26, 2024 at 10:42 PM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Fri, 26 Jan 2024 10:43:43 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
>
> > On Fri, Jan 26, 2024 at 12:39:26PM +0000, Jonathan Cameron wrote:
> > > On Thu, 25 Jan 2024 13:45:09 +0530
> > > Sajjan Rao <sajjanr@gmail.com> wrote:
> > >
> > > > Looks like something changed in QEMU 8.2 that broke running code out
> > > > of CXL memory with KVM disabled.
> > > > I used "numactl --membind 2 ls" as suggested by Dimitrios earlier,
> > > > this worked for me until I updated to the latest QEMU.
> > > >
> > > > Is this a known issue? Or am I missing something?
> > >
> > > I'm confused as to how the description below ever worked.
> > > Assigning the underlying memdev=cxl-mem1 to a numa node isn't going
> > > to correctly build the connections to the CFMWS PA range.
> > >
> >
> > I've now seen 3-4 occasions where people have done this and run into
> > trouble (for obvious reasons).  Is there anything we can do to disallow
> > the double-registering of a single memdev to both a numa node and a cxl
> > device?
> >
> It would be novel for us to prevent people shooting themselves
> in the foot ;) but I guess this should be fairly easy as the
> numa node logic prevents the same one being used multiple times so can
> copy how that is done.
>
> This should do the trick (very lightly tested).
> It's end of day Friday here so a formal patch can wait for next week.
>
>
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index f29346fae7..d4194bb757 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -827,6 +827,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
>              error_setg(errp, "volatile memdev must have backing device");
>              return false;
>          }
> +        if (host_memory_backend_is_mapped(ct3d->hostvmem)) {
> +            error_setg(errp, "memory backend %s can't be used multiple times.",
> +               object_get_canonical_path_component(OBJECT(ct3d->hostvmem)));
> +            return false;
> +        }
>          memory_region_set_nonvolatile(vmr, false);
>          memory_region_set_enabled(vmr, true);
>          host_memory_backend_set_mapped(ct3d->hostvmem, true);
> @@ -850,6 +855,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
>              error_setg(errp, "persistent memdev must have backing device");
>              return false;
>          }
> +        if (host_memory_backend_is_mapped(ct3d->hostpmem)) {
> +            error_setg(errp, "memory backend %s can't be used multiple times.",
> +               object_get_canonical_path_component(OBJECT(ct3d->hostpmem)));
> +            return false;
> +        }
>          memory_region_set_nonvolatile(pmr, true);
>          memory_region_set_enabled(pmr, true);
>          host_memory_backend_set_mapped(ct3d->hostpmem, true);
> @@ -880,6 +890,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
>              error_setg(errp, "dynamic capacity must have backing device");
>              return false;
>          }
> +        if (host_memory_backend_is_mapped(ct3d->dc.host_dc)) {
> +            error_setg(errp, "memory backend %s can't be used multiple times.",
> +               object_get_canonical_path_component(OBJECT(ct3d->dc.host_dc)));
> +            return false;
> +        }
>          /* FIXME: set dc as nonvolatile for now */
>          memory_region_set_nonvolatile(dc_mr, true);
>          memory_region_set_enabled(dc_mr, true);
>
>
>
>
>
> > ~Gregory
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-01-30  8:20                     ` Sajjan Rao
@ 2024-02-01 13:04                         ` Jonathan Cameron
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron via @ 2024-02-01 13:04 UTC (permalink / raw)
  To: Sajjan Rao
  Cc: Gregory Price, Dimitrios Palyvos, linux-cxl, qemu-devel,
	richard.henderson

On Tue, 30 Jan 2024 13:50:18 +0530
Sajjan Rao <sajjanr@gmail.com> wrote:

> Hi Jonathan,
> 
> The QEMU command line in the original email was corrected back in
> August 2023 based on the subsequent responses.
> 
> My current QEMU command line is below. As you can see, I am not
> assigning a NUMA node to the CXL memory object.
> 
> qemu-system-x86_64 \
>  -hda /var/lib/libvirt/images/CXL-Test_1.qcow2 \
>  -machine type=q35,nvdimm=on,cxl=on \
>  -accel tcg,thread=single \
>  -m 4G \
>  -smp cpus=4 \
>  -object memory-backend-ram,size=4G,id=m0 \
>  -object memory-backend-ram,size=256M,id=cxl-mem1 \
>  -object memory-backend-ram,size=256M,id=cxl-mem2 \
>  -numa node,memdev=m0,cpus=0-3,nodeid=0 \
>  -netdev user,id=net0,net=192.168.0.0/24,dhcpstart=192.168.0.9,hostfwd=tcp::2222-:22
> \
>  -device virtio-net-pci,netdev=net0 \
>  -device pxb-cxl,bus_nr=2,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true \
>  -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
>  -device cxl-upstream,bus=cxl_rp_port0,id=us0,addr=0.0,multifunction=on, \
>  -device cxl-switch-mailbox-cci,bus=cxl_rp_port0,addr=0.2,target=us0 \
>  -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
>  -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
>  -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
>  -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
>  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=512M,cxl-fmw.0.interleave-granularity=2k
> \
>  -D /tmp/qemu.log \
>  -nographic
> 
> Until I moved to QEMU 8.2 recently, I was able to create
> regions and run native Linux commands on CXL memory using
> #numactl --membind <cxl NUMA#> top
> 
> You had advised me to turn off KVM and use TCG, since the membind
> command will run code out of CXL memory, which is not supported under
> KVM. With KVM disabled, the membind command worked fine.
> However, with QEMU 8.2 the same membind command results in a hard
> kernel crash.

Just to check, kernel crashes, or qemu crashes?

I've probably replicated it, and it seems to be QEMU that is going down with a TCG issue.

Bisection underway.

This may take a while.
Our use of TCG is unusual in that we execute code out of what QEMU thinks
of as I/O memory, so we tend to run into corners no one else cares about.

Richard, +CC on the off chance you can guess what has happened and save
me a bisection run.

x86 machine pretty much as described above

root@localhost:~/devmem2# numactl --membind=1 touch a
qemu: fatal: cpu_io_recompile: could not find TB for pc=(nil)
RAX=00d6b969c0000000 RBX=ff294696c0044440 RCX=0000000000000028 RDX=0000000000000000
RSI=0000000000000275 RDI=0000000000000000 RBP=0000000490000000 RSP=ff4f8767805d3d20
R8 =0000000000000000 R9 =ff4f8767805d3cdc R10=0000000000000000 R11=0000000000000040
R12=ff294696c0044980 R13=0000000000000000 R14=ff294696c51d0000 R15=0000000000000000
RIP=ffffffff9d270fed RFL=00000007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 0000000000000000 00000000 00000000
GS =0000 ff2946973bc00000 00000000 00000000
LDT=0000 0000000000000000 00000000 00008200 DPL=0 LDT
TR =0040 fffffe37d29e7000 00004087 00008900 DPL=0 TSS64-avl
GDT=     fffffe37d29e5000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00007f2972bdc450 CR3=0000000490000000 CR4=00751ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00d6b969c0000000 CCD=0000000490000000 CCO=ADDQ
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
YMM00=0000000000000000 0000000000000000 3a3a3a3a3a3a3a3a 3a3a3a3a3a3a3a3a
YMM01=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM02=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM03=0000000000000000 0000000000000000 00ff0000000000ff 0000000000000000
YMM04=0000000000000000 0000000000000000 5f796c7261655f63 62696c5f5f004554
YMM05=0000000000000000 0000000000000000 0000000000000000 0000000000000060
YMM06=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM07=0000000000000000 0000000000000000 0909090909090909 0909090909090909
YMM08=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM09=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM10=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM11=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM12=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM13=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM14=0000000000000000 0000000000000000 0000000000000000 0000000000000000
YMM15=0000000000000000 0000000000000000 0000000000000000 0000000000000000

Jonathan



> I wanted to check if this is a known issue with 8.2 and is there a way
> around it.
> 
> Thanks,
> Sajjan
> 
> On Fri, Jan 26, 2024 at 10:42 PM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Fri, 26 Jan 2024 10:43:43 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> >  
> > > On Fri, Jan 26, 2024 at 12:39:26PM +0000, Jonathan Cameron wrote:  
> > > > On Thu, 25 Jan 2024 13:45:09 +0530
> > > > Sajjan Rao <sajjanr@gmail.com> wrote:
> > > >  
> > > > > Looks like something changed in QEMU 8.2 that broke running code out
> > > > > of CXL memory with KVM disabled.
> > > > > I used "numactl --membind 2 ls" as suggested by Dimitrios earlier,
> > > > > this worked for me until I updated to the latest QEMU.
> > > > >
> > > > > Is this a known issue? Or am I missing something?  
> > > >
> > > > I'm confused on how the description below ever worked.
> > > > Assigning the underlying memdev=cxl-mem1 to a numa node isn't going
> > > > to correctly build the connections to the CFMWS PA range.
> > > >  
> > >
> > > I've now seen 3-4 occasions where people have done this and run into
> > > trouble (for obvious reasons).  Is there anything we can do to disallow
> > > the double-registering of a single memdev to both a numa node and a cxl
> > > device?
> > >  
> > It would be novel for us to prevent people shooting themselves
> > in the foot ;) but I guess this should be fairly easy as the
> > numa node logic prevents the same one being used multiple times, so we can
> > copy how that is done.
> >
> > This should do the trick (very lightly tested).
> > It's end of day Friday here so a formal patch can wait for next week.
> >
> >
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index f29346fae7..d4194bb757 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -827,6 +827,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> >              error_setg(errp, "volatile memdev must have backing device");
> >              return false;
> >          }
> > +        if (host_memory_backend_is_mapped(ct3d->hostvmem)) {
> > +            error_setg(errp, "memory backend %s can't be used multiple times.",
> > +               object_get_canonical_path_component(OBJECT(ct3d->hostvmem)));
> > +            return false;
> > +        }
> >          memory_region_set_nonvolatile(vmr, false);
> >          memory_region_set_enabled(vmr, true);
> >          host_memory_backend_set_mapped(ct3d->hostvmem, true);
> > @@ -850,6 +855,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> >              error_setg(errp, "persistent memdev must have backing device");
> >              return false;
> >          }
> > +        if (host_memory_backend_is_mapped(ct3d->hostpmem)) {
> > +            error_setg(errp, "memory backend %s can't be used multiple times.",
> > +               object_get_canonical_path_component(OBJECT(ct3d->hostpmem)));
> > +            return false;
> > +        }
> >          memory_region_set_nonvolatile(pmr, true);
> >          memory_region_set_enabled(pmr, true);
> >          host_memory_backend_set_mapped(ct3d->hostpmem, true);
> > @@ -880,6 +890,11 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> >              error_setg(errp, "dynamic capacity must have backing device");
> >              return false;
> >          }
> > +        if (host_memory_backend_is_mapped(ct3d->dc.host_dc)) {
> > +            error_setg(errp, "memory backend %s can't be used multiple times.",
> > +               object_get_canonical_path_component(OBJECT(ct3d->dc.host_dc)));
> > +            return false;
> > +        }
> >          /* FIXME: set dc as nonvolatile for now */
> >          memory_region_set_nonvolatile(dc_mr, true);
> >          memory_region_set_enabled(dc_mr, true);
> >
> >
> >
> >
> >  
> > > ~Gregory  
> >  



^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 13:04                         ` Jonathan Cameron
  (?)
@ 2024-02-01 13:12                         ` Peter Maydell
  2024-02-01 14:01                             ` Jonathan Cameron via
  -1 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 13:12 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Sajjan Rao, Gregory Price, Dimitrios Palyvos, linux-cxl,
	qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 13:04, Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
>




> root@localhost:~/devmem2# numactl --membind=1 touch a
> qemu: fatal: cpu_io_recompile: could not find TB for pc=(nil)

Can you run QEMU under gdb and give the backtrace when it stops
on the abort()? That will probably have a helpful clue. I
suspect something is failing to pass a valid retaddr in
when it calls a load/store function.

thanks
-- PMM
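For anyone following along, capturing that backtrace looks roughly like the session below. This is an illustrative sketch only — the QEMU binary path and command line are placeholders for whatever configuration reproduces the crash:

```
$ gdb --args ./qemu-system-x86_64 <your CXL/TCG command line>
(gdb) run
  ... in the guest, trigger the fault, e.g.:  numactl --membind=1 touch a
  ... QEMU hits abort() and gdb stops ...
(gdb) bt
```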

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 13:12                         ` Peter Maydell
@ 2024-02-01 14:01                             ` Jonathan Cameron via
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-01 14:01 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Sajjan Rao, Gregory Price, Dimitrios Palyvos, linux-cxl,
	qemu-devel, richard.henderson

On Thu, 1 Feb 2024 13:12:23 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 1 Feb 2024 at 13:04, Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> >  
> 
> 
> 
> 
> > root@localhost:~/devmem2# numactl --membind=1 touch a
> > qemu: fatal: cpu_io_recompile: could not find TB for pc=(nil)  
> 
> Can you run QEMU under gdb and give the backtrace when it stops
> on the abort() ? That will probably have a helpful clue. I
> suspect something is failing to pass a valid retaddr in
> when it calls a load/store function.
> 
> thanks
> -- PMM

[Switching to Thread 0x7ffff56ff6c0 (LWP 21916)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000555555c0d4ce in cpu_abort
    (cpu=cpu@entry=0x555556fd9000, fmt=fmt@entry=0x555555fe3378 "cpu_io_recompile: could not find TB for pc=%p")
    at ../../cpu-target.c:359
#6  0x0000555555c59435 in cpu_io_recompile (cpu=cpu@entry=0x555556fd9000, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
#7  0x0000555555c5c956 in io_prepare
    (retaddr=0, addr=19595792376, attrs=..., xlat=<optimized out>, cpu=0x555556fd9000, out_offset=<synthetic pointer>)
    at ../../accel/tcg/cputlb.c:1339
#8  do_ld_mmio_beN
    (cpu=0x555556fd9000, full=0x7fffee0d96e0, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
#9  0x0000555555c5dfad in do_ld_8
    (cpu=cpu@entry=0x555556fd9000, p=p@entry=0x7ffff56fddc0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
#10 0x0000555555c6026f in do_ld8_mmu
    (cpu=cpu@entry=0x555556fd9000, addr=addr@entry=19595792376, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
#11 0x0000555555c629f9 in cpu_ldq_mmu (ra=0, oi=52, addr=19595792376, env=0x555556fdb7c0) at ../../accel/tcg/ldst_common.c.inc:169
#12 cpu_ldq_le_mmuidx_ra (env=0x555556fdb7c0, addr=19595792376, mmu_idx=<optimized out>, ra=ra@entry=0)
    at ../../accel/tcg/ldst_common.c.inc:301
#13 0x0000555555b18ede in ptw_ldq (in=0x7ffff56fdf00) at ../../target/i386/tcg/sysemu/excp_helper.c:98
#14 ptw_ldq (in=0x7ffff56fdf00) at ../../target/i386/tcg/sysemu/excp_helper.c:93
#15 mmu_translate (env=env@entry=0x555556fdb7c0, in=in@entry=0x7ffff56fdfa0, out=out@entry=0x7ffff56fdf70, err=err@entry=0x7ffff56fdf80)
    at ../../target/i386/tcg/sysemu/excp_helper.c:173
#16 0x0000555555b19c95 in get_physical_address
    (err=0x7ffff56fdf80, out=0x7ffff56fdf70, mmu_idx=0, access_type=MMU_INST_FETCH, addr=18446744072116178925, env=0x555556fdb7c0)
    at ../../target/i386/tcg/sysemu/excp_helper.c:578
#17 x86_cpu_tlb_fill
    (cs=0x555556fd9000, addr=18446744072116178925, size=<optimized out>, access_type=MMU_INST_FETCH, mmu_idx=0, probe=<optimized out>, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:604
#18 0x0000555555c5dd2b in probe_access_internal
    (cpu=<optimized out>, addr=18446744072116178925, fault_size=fault_size@entry=1, access_type=access_type@entry=MMU_INST_FETCH, mmu_idx=0, nonfault=nonfault@entry=false, phost=0x7ffff56fe0d0, pfull=0x7ffff56fe0c8, retaddr=0, check_mem_cbs=false)
    at ../../accel/tcg/cputlb.c:1432
#19 0x0000555555c61ff8 in get_page_addr_code_hostp (env=<optimized out>, addr=addr@entry=18446744072116178925, hostp=hostp@entry=0x0)
    at ../../accel/tcg/cputlb.c:1603
#20 0x0000555555c50a2b in get_page_addr_code (addr=18446744072116178925, env=<optimized out>)
    at /home/jic23/src/qemu/include/exec/exec-all.h:594
#21 tb_htable_lookup (cpu=<optimized out>, pc=pc@entry=18446744072116178925, cs_base=0, flags=415285936, cflags=4278353920)
    at ../../accel/tcg/cpu-exec.c:231
#22 0x0000555555c50c08 in tb_lookup
    (cpu=cpu@entry=0x555556fd9000, pc=pc@entry=18446744072116178925, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=<optimized out>) at ../../accel/tcg/cpu-exec.c:267
#23 0x0000555555c51e23 in helper_lookup_tb_ptr (env=0x555556fdb7c0) at ../../accel/tcg/cpu-exec.c:423
#24 0x00007fffa9076ead in code_gen_buffer ()
#25 0x0000555555c50fab in cpu_tb_exec (cpu=cpu@entry=0x555556fd9000, itb=<optimized out>, tb_exit=tb_exit@entry=0x7ffff56fe708)
    at ../../accel/tcg/cpu-exec.c:458
#26 0x0000555555c51492 in cpu_loop_exec_tb
    (tb_exit=0x7ffff56fe708, last_tb=<synthetic pointer>, pc=18446744072116179169, tb=<optimized out>, cpu=0x555556fd9000)
    at ../../accel/tcg/cpu-exec.c:920
#27 cpu_exec_loop (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1041
#28 0x0000555555c51d11 in cpu_exec_setjmp (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1058
#29 0x0000555555c523b4 in cpu_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/cpu-exec.c:1084
#30 0x0000555555c74053 in tcg_cpus_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops.c:76
#31 0x0000555555c741a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
#32 0x0000555555dfb580 in qemu_thread_start (args=0x55555703c3e0) at ../../util/qemu-thread-posix.c:541
#33 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#34 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 14:01                             ` Jonathan Cameron via
@ 2024-02-01 14:35                             ` Peter Maydell
  2024-02-01 15:17                               ` Alex Bennée
  -1 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 14:35 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Sajjan Rao, Gregory Price, Dimitrios Palyvos, linux-cxl,
	qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 14:01, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> > Can you run QEMU under gdb and give the backtrace when it stops
> > on the abort() ? That will probably have a helpful clue. I
> > suspect something is failing to pass a valid retaddr in
> > when it calls a load/store function.

> [Switching to Thread 0x7ffff56ff6c0 (LWP 21916)]
> __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> 44      ./nptl/pthread_kill.c: No such file or directory.
> (gdb) bt
> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> #5  0x0000555555c0d4ce in cpu_abort
>     (cpu=cpu@entry=0x555556fd9000, fmt=fmt@entry=0x555555fe3378 "cpu_io_recompile: could not find TB for pc=%p")
>     at ../../cpu-target.c:359
> #6  0x0000555555c59435 in cpu_io_recompile (cpu=cpu@entry=0x555556fd9000, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> #7  0x0000555555c5c956 in io_prepare
>     (retaddr=0, addr=19595792376, attrs=..., xlat=<optimized out>, cpu=0x555556fd9000, out_offset=<synthetic pointer>)
>     at ../../accel/tcg/cputlb.c:1339
> #8  do_ld_mmio_beN
>     (cpu=0x555556fd9000, full=0x7fffee0d96e0, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> #9  0x0000555555c5dfad in do_ld_8
>     (cpu=cpu@entry=0x555556fd9000, p=p@entry=0x7ffff56fddc0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> #10 0x0000555555c6026f in do_ld8_mmu
>     (cpu=cpu@entry=0x555556fd9000, addr=addr@entry=19595792376, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> #11 0x0000555555c629f9 in cpu_ldq_mmu (ra=0, oi=52, addr=19595792376, env=0x555556fdb7c0) at ../../accel/tcg/ldst_common.c.inc:169
> #12 cpu_ldq_le_mmuidx_ra (env=0x555556fdb7c0, addr=19595792376, mmu_idx=<optimized out>, ra=ra@entry=0)
>     at ../../accel/tcg/ldst_common.c.inc:301
> #13 0x0000555555b18ede in ptw_ldq (in=0x7ffff56fdf00) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> #14 ptw_ldq (in=0x7ffff56fdf00) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> #15 mmu_translate (env=env@entry=0x555556fdb7c0, in=in@entry=0x7ffff56fdfa0, out=out@entry=0x7ffff56fdf70, err=err@entry=0x7ffff56fdf80)
>     at ../../target/i386/tcg/sysemu/excp_helper.c:173
> #16 0x0000555555b19c95 in get_physical_address
>     (err=0x7ffff56fdf80, out=0x7ffff56fdf70, mmu_idx=0, access_type=MMU_INST_FETCH, addr=18446744072116178925, env=0x555556fdb7c0)
>     at ../../target/i386/tcg/sysemu/excp_helper.c:578
> #17 x86_cpu_tlb_fill
>     (cs=0x555556fd9000, addr=18446744072116178925, size=<optimized out>, access_type=MMU_INST_FETCH, mmu_idx=0, probe=<optimized out>, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> #18 0x0000555555c5dd2b in probe_access_internal
>     (cpu=<optimized out>, addr=18446744072116178925, fault_size=fault_size@entry=1, access_type=access_type@entry=MMU_INST_FETCH, mmu_idx=0, nonfault=nonfault@entry=false, phost=0x7ffff56fe0d0, pfull=0x7ffff56fe0c8, retaddr=0, check_mem_cbs=false)
>     at ../../accel/tcg/cputlb.c:1432
> #19 0x0000555555c61ff8 in get_page_addr_code_hostp (env=<optimized out>, addr=addr@entry=18446744072116178925, hostp=hostp@entry=0x0)
>     at ../../accel/tcg/cputlb.c:1603
> #20 0x0000555555c50a2b in get_page_addr_code (addr=18446744072116178925, env=<optimized out>)
>     at /home/jic23/src/qemu/include/exec/exec-all.h:594
> #21 tb_htable_lookup (cpu=<optimized out>, pc=pc@entry=18446744072116178925, cs_base=0, flags=415285936, cflags=4278353920)
>     at ../../accel/tcg/cpu-exec.c:231
> #22 0x0000555555c50c08 in tb_lookup
>     (cpu=cpu@entry=0x555556fd9000, pc=pc@entry=18446744072116178925, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=<optimized out>) at ../../accel/tcg/cpu-exec.c:267
> #23 0x0000555555c51e23 in helper_lookup_tb_ptr (env=0x555556fdb7c0) at ../../accel/tcg/cpu-exec.c:423
> #24 0x00007fffa9076ead in code_gen_buffer ()
> #25 0x0000555555c50fab in cpu_tb_exec (cpu=cpu@entry=0x555556fd9000, itb=<optimized out>, tb_exit=tb_exit@entry=0x7ffff56fe708)
>     at ../../accel/tcg/cpu-exec.c:458
> #26 0x0000555555c51492 in cpu_loop_exec_tb
>     (tb_exit=0x7ffff56fe708, last_tb=<synthetic pointer>, pc=18446744072116179169, tb=<optimized out>, cpu=0x555556fd9000)
>     at ../../accel/tcg/cpu-exec.c:920
> #27 cpu_exec_loop (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1041
> #28 0x0000555555c51d11 in cpu_exec_setjmp (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1058
> #29 0x0000555555c523b4 in cpu_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/cpu-exec.c:1084
> #30 0x0000555555c74053 in tcg_cpus_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops.c:76
> #31 0x0000555555c741a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> #32 0x0000555555dfb580 in qemu_thread_start (args=0x55555703c3e0) at ../../util/qemu-thread-posix.c:541
> #33 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> #34 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

So, that looks like:
 * we call cpu_tb_exec(), which executes some generated code
 * that generated code calls the lookup_tb_ptr helper to see
   if we have a generated TB already for the address we're going
   to execute next
 * lookup_tb_ptr probes the TLB to see if we know the host RAM
   address for the guest address
 * this results in a TLB walk for an instruction fetch
 * the page table descriptor load is to IO memory
 * io_prepare assumes it needs to do a TLB recompile, because
   can_do_io is clear

I am not surprised that the corner case of "the guest put its
page tables in an MMIO device" has not yet come up :-)

I'm really not sure how the icount handling should interact
with that...

-- PMM


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 14:35                             ` Peter Maydell
@ 2024-02-01 15:17                               ` Alex Bennée
  2024-02-01 15:29                                   ` Jonathan Cameron via
  2024-02-01 16:00                                 ` Peter Maydell
  0 siblings, 2 replies; 50+ messages in thread
From: Alex Bennée @ 2024-02-01 15:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Jonathan Cameron, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

Peter Maydell <peter.maydell@linaro.org> writes:

> On Thu, 1 Feb 2024 at 14:01, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
>> > Can you run QEMU under gdb and give the backtrace when it stops
>> > on the abort() ? That will probably have a helpful clue. I
>> > suspect something is failing to pass a valid retaddr in
>> > when it calls a load/store function.
>
>> [Switching to Thread 0x7ffff56ff6c0 (LWP 21916)]
>> __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>> 44      ./nptl/pthread_kill.c: No such file or directory.
>> (gdb) bt
>> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>> #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>> #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
>> #5  0x0000555555c0d4ce in cpu_abort
>>     (cpu=cpu@entry=0x555556fd9000, fmt=fmt@entry=0x555555fe3378 "cpu_io_recompile: could not find TB for pc=%p")
>>     at ../../cpu-target.c:359
>> #6  0x0000555555c59435 in cpu_io_recompile (cpu=cpu@entry=0x555556fd9000, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
>> #7  0x0000555555c5c956 in io_prepare
>>     (retaddr=0, addr=19595792376, attrs=..., xlat=<optimized out>, cpu=0x555556fd9000, out_offset=<synthetic pointer>)
>>     at ../../accel/tcg/cputlb.c:1339
<snip>
>> #21 tb_htable_lookup (cpu=<optimized out>, pc=pc@entry=18446744072116178925, cs_base=0, flags=415285936, cflags=4278353920)
>>     at ../../accel/tcg/cpu-exec.c:231
>> #22 0x0000555555c50c08 in tb_lookup
>>     (cpu=cpu@entry=0x555556fd9000, pc=pc@entry=18446744072116178925, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=<optimized out>) at ../../accel/tcg/cpu-exec.c:267
>> #23 0x0000555555c51e23 in helper_lookup_tb_ptr (env=0x555556fdb7c0) at ../../accel/tcg/cpu-exec.c:423
>> #24 0x00007fffa9076ead in code_gen_buffer ()
>> #25 0x0000555555c50fab in cpu_tb_exec (cpu=cpu@entry=0x555556fd9000, itb=<optimized out>, tb_exit=tb_exit@entry=0x7ffff56fe708)
>>     at ../../accel/tcg/cpu-exec.c:458
>> #26 0x0000555555c51492 in cpu_loop_exec_tb
>>     (tb_exit=0x7ffff56fe708, last_tb=<synthetic pointer>, pc=18446744072116179169, tb=<optimized out>, cpu=0x555556fd9000)
>>     at ../../accel/tcg/cpu-exec.c:920
>> #27 cpu_exec_loop (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1041
>> #28 0x0000555555c51d11 in cpu_exec_setjmp (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1058
>> #29 0x0000555555c523b4 in cpu_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/cpu-exec.c:1084
>> #30 0x0000555555c74053 in tcg_cpus_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops.c:76
>> #31 0x0000555555c741a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
>> #32 0x0000555555dfb580 in qemu_thread_start (args=0x55555703c3e0) at ../../util/qemu-thread-posix.c:541
>> #33 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
>> #34 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
> So, that looks like:
>  * we call cpu_tb_exec(), which executes some generated code
>  * that generated code calls the lookup_tb_ptr helper to see
>    if we have a generated TB already for the address we're going
>    to execute next
>  * lookup_tb_ptr probes the TLB to see if we know the host RAM
>    address for the guest address
>  * this results in a TLB walk for an instruction fetch
>  * the page table descriptor load is to IO memory
>  * io_prepare assumes it needs to do a TLB recompile, because
>    can_do_io is clear
>
> I am not surprised that the corner case of "the guest put its
> page tables in an MMIO device" has not yet come up :-)
>
> I'm really not sure how the icount handling should interact
> with that...

It's not just icount - we need to handle it for all modes now. That said,
seeing as we are at the end of a block, shouldn't can_do_io be set?

Does:

modified   accel/tcg/translator.c
@@ -201,6 +201,8 @@ void translator_loop(CPUState *cpu, TranslationBlock *tb, int *max_insns,
         }
     }
 
+    set_can_do_io(db, true);
+
     /* Emit code to exit the TB, as indicated by db->is_jmp.  */
     ops->tb_stop(db, cpu);
     gen_tb_end(tb, cflags, icount_start_insn, db->num_insns);

do the trick?

>
> -- PMM

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 15:17                               ` Alex Bennée
@ 2024-02-01 15:29                                   ` Jonathan Cameron via
  2024-02-01 16:00                                 ` Peter Maydell
  1 sibling, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-01 15:29 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Maydell, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 01 Feb 2024 15:17:53 +0000
Alex Bennée <alex.bennee@linaro.org> wrote:

> Peter Maydell <peter.maydell@linaro.org> writes:
> 
> > On Thu, 1 Feb 2024 at 14:01, Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:  
> >> > Can you run QEMU under gdb and give the backtrace when it stops
> >> > on the abort() ? That will probably have a helpful clue. I
> >> > suspect something is failing to pass a valid retaddr in
> >> > when it calls a load/store function.  
> >  
> >> [Switching to Thread 0x7ffff56ff6c0 (LWP 21916)]
> >> __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> >> 44      ./nptl/pthread_kill.c: No such file or directory.
> >> (gdb) bt
> >> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> >> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> >> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> >> #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> >> #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> >> #5  0x0000555555c0d4ce in cpu_abort
> >>     (cpu=cpu@entry=0x555556fd9000, fmt=fmt@entry=0x555555fe3378 "cpu_io_recompile: could not find TB for pc=%p")
> >>     at ../../cpu-target.c:359
> >> #6  0x0000555555c59435 in cpu_io_recompile (cpu=cpu@entry=0x555556fd9000, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> >> #7  0x0000555555c5c956 in io_prepare
> >>     (retaddr=0, addr=19595792376, attrs=..., xlat=<optimized out>, cpu=0x555556fd9000, out_offset=<synthetic pointer>)
> >>     at ../../accel/tcg/cputlb.c:1339  
> <snip>
> >> #21 tb_htable_lookup (cpu=<optimized out>, pc=pc@entry=18446744072116178925, cs_base=0, flags=415285936, cflags=4278353920)
> >>     at ../../accel/tcg/cpu-exec.c:231
> >> #22 0x0000555555c50c08 in tb_lookup
> >>     (cpu=cpu@entry=0x555556fd9000, pc=pc@entry=18446744072116178925, cs_base=cs_base@entry=0, flags=<optimized out>, cflags=<optimized out>) at ../../accel/tcg/cpu-exec.c:267
> >> #23 0x0000555555c51e23 in helper_lookup_tb_ptr (env=0x555556fdb7c0) at ../../accel/tcg/cpu-exec.c:423
> >> #24 0x00007fffa9076ead in code_gen_buffer ()
> >> #25 0x0000555555c50fab in cpu_tb_exec (cpu=cpu@entry=0x555556fd9000, itb=<optimized out>, tb_exit=tb_exit@entry=0x7ffff56fe708)
> >>     at ../../accel/tcg/cpu-exec.c:458
> >> #26 0x0000555555c51492 in cpu_loop_exec_tb
> >>     (tb_exit=0x7ffff56fe708, last_tb=<synthetic pointer>, pc=18446744072116179169, tb=<optimized out>, cpu=0x555556fd9000)
> >>     at ../../accel/tcg/cpu-exec.c:920
> >> #27 cpu_exec_loop (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1041
> >> #28 0x0000555555c51d11 in cpu_exec_setjmp (cpu=cpu@entry=0x555556fd9000, sc=sc@entry=0x7ffff56fe7a0) at ../../accel/tcg/cpu-exec.c:1058
> >> #29 0x0000555555c523b4 in cpu_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/cpu-exec.c:1084
> >> #30 0x0000555555c74053 in tcg_cpus_exec (cpu=cpu@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops.c:76
> >> #31 0x0000555555c741a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x555556fd9000) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> >> #32 0x0000555555dfb580 in qemu_thread_start (args=0x55555703c3e0) at ../../util/qemu-thread-posix.c:541
> >> #33 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> >> #34 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81  
> >
> > So, that looks like:
> >  * we call cpu_tb_exec(), which executes some generated code
> >  * that generated code calls the lookup_tb_ptr helper to see
> >    if we have a generated TB already for the address we're going
> >    to execute next
> >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> >    address for the guest address
> >  * this results in a TLB walk for an instruction fetch
> >  * the page table descriptor load is to IO memory
> >  * io_prepare assumes it needs to do a TLB recompile, because
> >    can_do_io is clear
> >
> > I am not surprised that the corner case of "the guest put its
> > page tables in an MMIO device" has not yet come up :-)
> >
> > I'm really not sure how the icount handling should interact
> > with that...  
> 
> Its not just icount - we need to handle it for all modes now. That said
> seeing as we are at the end of a block shouldn't can_do_io be set?
> 
> Does:
> 
> modified   accel/tcg/translator.c
> @@ -201,6 +201,8 @@ void translator_loop(CPUState *cpu, TranslationBlock *tb, int *max_insns,
>          }
>      }
>  
> +    set_can_do_io(db, true);
> +
>      /* Emit code to exit the TB, as indicated by db->is_jmp.  */
>      ops->tb_stop(db, cpu);
>      gen_tb_end(tb, cflags, icount_start_insn, db->num_insns);
> 
> do the trick?

no :(

> 
> >
> > -- PMM  
> 




* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 15:17                               ` Alex Bennée
  2024-02-01 15:29                                   ` Jonathan Cameron via
@ 2024-02-01 16:00                                 ` Peter Maydell
  2024-02-01 16:21                                     ` Jonathan Cameron via
  2024-02-15 15:04                                     ` Jonathan Cameron
  1 sibling, 2 replies; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 16:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Jonathan Cameron, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Peter Maydell <peter.maydell@linaro.org> writes:
> > So, that looks like:
> >  * we call cpu_tb_exec(), which executes some generated code
> >  * that generated code calls the lookup_tb_ptr helper to see
> >    if we have a generated TB already for the address we're going
> >    to execute next
> >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> >    address for the guest address
> >  * this results in a TLB walk for an instruction fetch
> >  * the page table descriptor load is to IO memory
> >  * io_prepare assumes it needs to do a TLB recompile, because
> >    can_do_io is clear
> >
> > I am not surprised that the corner case of "the guest put its
> > page tables in an MMIO device" has not yet come up :-)
> >
> > I'm really not sure how the icount handling should interact
> > with that...
>
> Its not just icount - we need to handle it for all modes now. That said
> seeing as we are at the end of a block shouldn't can_do_io be set?

The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
which happens earlier than the tb_stop callback (it can
happen in the trans function for branch etc insns, for
example).

I think it should be OK to set can_do_io at the start
of the lookup_tb_ptr helper, something like:
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 977576ca143..7818537f318 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     uint64_t cs_base;
     uint32_t flags, cflags;

+    /*
+     * By definition we've just finished a TB, so I/O is OK.
+     * Avoid the possibility of calling cpu_io_recompile() if
+     * a page table walk triggered by tb_lookup() calling
+     * probe_access_internal() happens to touch an MMIO device.
+     * The next TB, if we chain to it, will clear the flag again.
+     */
+    cpu->neg.can_do_io = true;
+
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);

     cflags = curr_cflags(cpu);

-- PMM


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 16:00                                 ` Peter Maydell
@ 2024-02-01 16:21                                     ` Jonathan Cameron via
  2024-02-15 15:04                                     ` Jonathan Cameron
  1 sibling, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-01 16:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 16:00:56 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Peter Maydell <peter.maydell@linaro.org> writes:  
> > > So, that looks like:
> > >  * we call cpu_tb_exec(), which executes some generated code
> > >  * that generated code calls the lookup_tb_ptr helper to see
> > >    if we have a generated TB already for the address we're going
> > >    to execute next
> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> > >    address for the guest address
> > >  * this results in a TLB walk for an instruction fetch
> > >  * the page table descriptor load is to IO memory
> > >  * io_prepare assumes it needs to do a TLB recompile, because
> > >    can_do_io is clear
> > >
> > > I am not surprised that the corner case of "the guest put its
> > > page tables in an MMIO device" has not yet come up :-)
> > >
> > > I'm really not sure how the icount handling should interact
> > > with that...  
> >
> > Its not just icount - we need to handle it for all modes now. That said
> > seeing as we are at the end of a block shouldn't can_do_io be set?  
> 
> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> which happens earlier than the tb_stop callback (it can
> happen in the trans function for branch etc insns, for
> example).
> 
> I think it should be OK to clear can_do_io at the start
> of the lookup_tb_ptr helper, something like:
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 977576ca143..7818537f318 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
>      uint64_t cs_base;
>      uint32_t flags, cflags;
> 
> +    /*
> +     * By definition we've just finished a TB, so I/O is OK.
> +     * Avoid the possibility of calling cpu_io_recompile() if
> +     * a page table walk triggered by tb_lookup() calling
> +     * probe_access_internal() happens to touch an MMIO device.
> +     * The next TB, if we chain to it, will clear the flag again.
> +     */
> +    cpu->neg.can_do_io = true;
> +
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> 
>      cflags = curr_cflags(cpu);
> 
> -- PMM

No joy.  Seems like a very similar backtrace.

Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
#6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
#7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
#8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
#9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
#10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
#11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
#12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
#13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
#14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
#15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
#16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
#17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
#18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
#19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
#20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
#21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
#22 0x00007fffa9107c63 in code_gen_buffer ()
#23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442
#24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
#25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
#26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
#27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
#28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
#29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
#30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
#31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
@ 2024-02-01 16:21                                     ` Jonathan Cameron via
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron via @ 2024-02-01 16:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 16:00:56 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Peter Maydell <peter.maydell@linaro.org> writes:  
> > > So, that looks like:
> > >  * we call cpu_tb_exec(), which executes some generated code
> > >  * that generated code calls the lookup_tb_ptr helper to see
> > >    if we have a generated TB already for the address we're going
> > >    to execute next
> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> > >    address for the guest address
> > >  * this results in a TLB walk for an instruction fetch
> > >  * the page table descriptor load is to IO memory
> > >  * io_prepare assumes it needs to do an I/O recompile, because
> > >    can_do_io is clear
> > >
> > > I am not surprised that the corner case of "the guest put its
> > > page tables in an MMIO device" has not yet come up :-)
> > >
> > > I'm really not sure how the icount handling should interact
> > > with that...  
> >
> > It's not just icount - we need to handle it for all modes now. That said
> > seeing as we are at the end of a block shouldn't can_do_io be set?  
> 
> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> which happens earlier than the tb_stop callback (it can
> happen in the trans function for branch etc insns, for
> example).
> 
> I think it should be OK to set can_do_io at the start
> of the lookup_tb_ptr helper, something like:
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 977576ca143..7818537f318 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
>      uint64_t cs_base;
>      uint32_t flags, cflags;
> 
> +    /*
> +     * By definition we've just finished a TB, so I/O is OK.
> +     * Avoid the possibility of calling cpu_io_recompile() if
> +     * a page table walk triggered by tb_lookup() calling
> +     * probe_access_internal() happens to touch an MMIO device.
> +     * The next TB, if we chain to it, will clear the flag again.
> +     */
> +    cpu->neg.can_do_io = true;
> +
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> 
>      cflags = curr_cflags(cpu);
> 
> -- PMM

No joy.  Seems like a very similar backtrace.

Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
#6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
#7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
#8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
#9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
#10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
#11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
#12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
#13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
#14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
#15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
#16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
#17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
#18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
#19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
#20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
#21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
#22 0x00007fffa9107c63 in code_gen_buffer ()
#23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442
#24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
#25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
#26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
#27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
#28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
#29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
#30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
#31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81



* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 16:21                                     ` Jonathan Cameron via
@ 2024-02-01 16:45                                     ` Alex Bennée
  2024-02-01 17:04                                       ` Gregory Price
  2024-02-01 17:08                                         ` Jonathan Cameron via
  -1 siblings, 2 replies; 50+ messages in thread
From: Alex Bennée @ 2024-02-01 16:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Peter Maydell, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:

> On Thu, 1 Feb 2024 16:00:56 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
>
>> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> > Peter Maydell <peter.maydell@linaro.org> writes:  
>> > > So, that looks like:
>> > >  * we call cpu_tb_exec(), which executes some generated code
>> > >  * that generated code calls the lookup_tb_ptr helper to see
>> > >    if we have a generated TB already for the address we're going
>> > >    to execute next
>> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
>> > >    address for the guest address
>> > >  * this results in a TLB walk for an instruction fetch
>> > >  * the page table descriptor load is to IO memory
>> > >  * io_prepare assumes it needs to do an I/O recompile, because
>> > >    can_do_io is clear
>> > >
>> > > I am not surprised that the corner case of "the guest put its
>> > > page tables in an MMIO device" has not yet come up :-)
>> > >
>> > > I'm really not sure how the icount handling should interact
>> > > with that...  
>> >
>> > It's not just icount - we need to handle it for all modes now. That said
>> > seeing as we are at the end of a block shouldn't can_do_io be set?  
>> 
>> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
>> which happens earlier than the tb_stop callback (it can
>> happen in the trans function for branch etc insns, for
>> example).
>> 
>> I think it should be OK to set can_do_io at the start
>> of the lookup_tb_ptr helper, something like:
>> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
>> index 977576ca143..7818537f318 100644
>> --- a/accel/tcg/cpu-exec.c
>> +++ b/accel/tcg/cpu-exec.c
>> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
>>      uint64_t cs_base;
>>      uint32_t flags, cflags;
>> 
>> +    /*
>> +     * By definition we've just finished a TB, so I/O is OK.
>> +     * Avoid the possibility of calling cpu_io_recompile() if
>> +     * a page table walk triggered by tb_lookup() calling
>> +     * probe_access_internal() happens to touch an MMIO device.
>> +     * The next TB, if we chain to it, will clear the flag again.
>> +     */
>> +    cpu->neg.can_do_io = true;
>> +
>>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>> 
>>      cflags = curr_cflags(cpu);
>> 
>> -- PMM
>
> No joy.  Seems like a very similar backtrace.
>
> Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
> __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> 44      ./nptl/pthread_kill.c: No such file or directory.
> (gdb) bt
> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
> #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
> #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
> #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
> #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
> #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
> #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
> #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> #22 0x00007fffa9107c63 in code_gen_buffer ()

No, that's different - we are actually writing to the MMIO region here.
But the fact that we hit cpu_abort because we can't find the TB we are
executing is a little problematic.

Does ra properly point to the code buffer here?

> #23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442
> #24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
> #25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
> #26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
> #27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
> #28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
> #29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> #30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
> #31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> #32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 16:45                                     ` Alex Bennée
@ 2024-02-01 17:04                                       ` Gregory Price
  2024-02-01 17:07                                         ` Peter Maydell
  2024-02-01 17:08                                         ` Jonathan Cameron via
  1 sibling, 1 reply; 50+ messages in thread
From: Gregory Price @ 2024-02-01 17:04 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Jonathan Cameron, Peter Maydell, Sajjan Rao, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, Feb 01, 2024 at 04:45:30PM +0000, Alex Bennée wrote:
> Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
> 
> > On Thu, 1 Feb 2024 16:00:56 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> >> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> >> >
> >> > Peter Maydell <peter.maydell@linaro.org> writes:  
> >> > > So, that looks like:
> >> > >  * we call cpu_tb_exec(), which executes some generated code
> >> > >  * that generated code calls the lookup_tb_ptr helper to see
> >> > >    if we have a generated TB already for the address we're going
> >> > >    to execute next
> >> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> >> > >    address for the guest address
> >> > >  * this results in a TLB walk for an instruction fetch
> >> > >  * the page table descriptor load is to IO memory
> >> > >  * io_prepare assumes it needs to do an I/O recompile, because
> >> > >    can_do_io is clear
> >> > >
> >> > > I am not surprised that the corner case of "the guest put its
> >> > > page tables in an MMIO device" has not yet come up :-)
> >> > >
> >> > > I'm really not sure how the icount handling should interact
> >> > > with that...  
> >> >
> >> > It's not just icount - we need to handle it for all modes now. That said
> >> > seeing as we are at the end of a block shouldn't can_do_io be set?  
> >> 
> >> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> >> which happens earlier than the tb_stop callback (it can
> >> happen in the trans function for branch etc insns, for
> >> example).
> >> 
> >> I think it should be OK to set can_do_io at the start
> >> of the lookup_tb_ptr helper, something like:
> >> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> >> index 977576ca143..7818537f318 100644
> >> --- a/accel/tcg/cpu-exec.c
> >> +++ b/accel/tcg/cpu-exec.c
> >> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
> >>      uint64_t cs_base;
> >>      uint32_t flags, cflags;
> >> 
> >> +    /*
> >> +     * By definition we've just finished a TB, so I/O is OK.
> >> +     * Avoid the possibility of calling cpu_io_recompile() if
> >> +     * a page table walk triggered by tb_lookup() calling
> >> +     * probe_access_internal() happens to touch an MMIO device.
> >> +     * The next TB, if we chain to it, will clear the flag again.
> >> +     */
> >> +    cpu->neg.can_do_io = true;
> >> +
> >>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> >> 
> >>      cflags = curr_cflags(cpu);
> >> 
> >> -- PMM
> >
> > No joy.  Seems like a very similar backtrace.
> >
> > Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> > [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
> > __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > 44      ./nptl/pthread_kill.c: No such file or directory.
> > (gdb) bt
> > #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> > #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> > #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> > #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> > #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
> > #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> > #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
> > #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> > #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> > #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> > #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> > #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> > #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> > #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> > #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
> > #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
> > #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> > #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
> > #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
> > #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > #22 0x00007fffa9107c63 in code_gen_buffer ()
> 
> No, that's different - we are actually writing to the MMIO region here.
> But the fact that we hit cpu_abort because we can't find the TB we are
> executing is a little problematic.
> 
> Does ra properly point to the code buffer here?
> 

What if the code block is ALSO in CXL (MMIO)? :D

tb_gen_code():

531     /*
532      * If the TB is not associated with a physical RAM page then it must be
533      * a temporary one-insn TB, and we have nothing left to do. Return early
534      * before attempting to link to other TBs or add to the lookup table.
535      */
536     if (tb_page_addr0(tb) == -1) {
537         assert_no_pages_locked();
538         return tb;
539     }
540
541     /*
542      * Insert TB into the corresponding region tree before publishing it
543      * through QHT. Otherwise rewinding happened in the TB might fail to
544      * lookup itself using host PC.
545      */
546     tcg_tb_insert(tb);


> > #23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442
> > #24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
> > #25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
> > #26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
> > #27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
> > #28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
> > #29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> > #30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
> > #31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> > #32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> 
> -- 
> Alex Bennée
> Virtualisation Tech Lead @ Linaro


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:04                                       ` Gregory Price
@ 2024-02-01 17:07                                         ` Peter Maydell
  2024-02-01 17:29                                           ` Gregory Price
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 17:07 UTC (permalink / raw)
  To: Gregory Price
  Cc: Alex Bennée, Jonathan Cameron, Sajjan Rao,
	Dimitrios Palyvos, linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 17:04, Gregory Price <gregory.price@memverge.com> wrote:
>
> On Thu, Feb 01, 2024 at 04:45:30PM +0000, Alex Bennée wrote:

> > No thats different - we are actually writing to the MMIO region here.
> > But the fact we hit cpu_abort because we can't find the TB we are
> > executing is a little problematic.
> >
> > Does ra properly point to the code buffer here?
> >
>
> What if the code block is ALSO in CXL (MMIO)? :D

In that case the TB is supposed to be a single insn,
so the insn will by definition be the last one in its
TB, and IO should be OK for it -- so can_do_io ought
to be true and we shouldn't get into the io_recompile.

-- PMM


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 16:45                                     ` Alex Bennée
@ 2024-02-01 17:08                                         ` Jonathan Cameron via
  2024-02-01 17:08                                         ` Jonathan Cameron via
  1 sibling, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-01 17:08 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Maydell, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 01 Feb 2024 16:45:30 +0000
Alex Bennée <alex.bennee@linaro.org> wrote:

> Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
> 
> > On Thu, 1 Feb 2024 16:00:56 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> >  
> >> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:  
> >> >
> >> > Peter Maydell <peter.maydell@linaro.org> writes:    
> >> > > So, that looks like:
> >> > >  * we call cpu_tb_exec(), which executes some generated code
> >> > >  * that generated code calls the lookup_tb_ptr helper to see
> >> > >    if we have a generated TB already for the address we're going
> >> > >    to execute next
> >> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> >> > >    address for the guest address
> >> > >  * this results in a TLB walk for an instruction fetch
> >> > >  * the page table descriptor load is to IO memory
>> > >  * io_prepare assumes it needs to do an I/O recompile, because
> >> > >    can_do_io is clear
> >> > >
> >> > > I am not surprised that the corner case of "the guest put its
> >> > > page tables in an MMIO device" has not yet come up :-)
> >> > >
> >> > > I'm really not sure how the icount handling should interact
> >> > > with that...    
> >> >
> >> > It's not just icount - we need to handle it for all modes now. That said
> >> > seeing as we are at the end of a block shouldn't can_do_io be set?    
> >> 
> >> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> >> which happens earlier than the tb_stop callback (it can
> >> happen in the trans function for branch etc insns, for
> >> example).
> >> 
> >> I think it should be OK to set can_do_io at the start
> >> of the lookup_tb_ptr helper, something like:
> >> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> >> index 977576ca143..7818537f318 100644
> >> --- a/accel/tcg/cpu-exec.c
> >> +++ b/accel/tcg/cpu-exec.c
> >> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
> >>      uint64_t cs_base;
> >>      uint32_t flags, cflags;
> >> 
> >> +    /*
> >> +     * By definition we've just finished a TB, so I/O is OK.
> >> +     * Avoid the possibility of calling cpu_io_recompile() if
> >> +     * a page table walk triggered by tb_lookup() calling
> >> +     * probe_access_internal() happens to touch an MMIO device.
> >> +     * The next TB, if we chain to it, will clear the flag again.
> >> +     */
> >> +    cpu->neg.can_do_io = true;
> >> +
> >>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> >> 
> >>      cflags = curr_cflags(cpu);
> >> 
> >> -- PMM  
> >
> > No joy.  Seems like a very similar backtrace.
> >
> > Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> > [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
> > __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > 44      ./nptl/pthread_kill.c: No such file or directory.
> > (gdb) bt
> > #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> > #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> > #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> > #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> > #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
> > #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> > #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
> > #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> > #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> > #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> > #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> > #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> > #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> > #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> > #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
> > #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
> > #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> > #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
> > #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
> > #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > #22 0x00007fffa9107c63 in code_gen_buffer ()  
> 
> No, that's different - we are actually writing to the MMIO region here.
> But the fact we hit cpu_abort because we can't find the TB we are
> executing is a little problematic.
> 
> Does ra properly point to the code buffer here?

Err.  How would I tell?

I'll confess I have almost no idea what is going on in TCG :(

I can learn, but it won't be quick.

J

> 
> > #23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442
> > #24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
> > #25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
> > #26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
> > #27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
> > #28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
> > #29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> > #30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
> > #31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> > #32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81  
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:08                                         ` Jonathan Cameron via
  (?)
@ 2024-02-01 17:21                                         ` Peter Maydell
  2024-02-01 17:41                                             ` Jonathan Cameron via
  -1 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 17:21 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Alex Bennée, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 17:08, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 01 Feb 2024 16:45:30 +0000
> Alex Bennée <alex.bennee@linaro.org> wrote:
>
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
> >
> > > On Thu, 1 Feb 2024 16:00:56 +0000
> > > Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > >> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> > >> >
> > >> > Peter Maydell <peter.maydell@linaro.org> writes:
> > >> > > So, that looks like:
> > >> > >  * we call cpu_tb_exec(), which executes some generated code
> > >> > >  * that generated code calls the lookup_tb_ptr helper to see
> > >> > >    if we have a generated TB already for the address we're going
> > >> > >    to execute next
> > >> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> > >> > >    address for the guest address
> > >> > >  * this results in a TLB walk for an instruction fetch
> > >> > >  * the page table descriptor load is to IO memory
> > >> > > * io_prepare assumes it needs to do a TB recompile, because
> > >> > >    can_do_io is clear
> > >> > >
> > >> > > I am not surprised that the corner case of "the guest put its
> > >> > > page tables in an MMIO device" has not yet come up :-)
> > >> > >
> > >> > > I'm really not sure how the icount handling should interact
> > >> > > with that...
> > >> >
> > >> > It's not just icount - we need to handle it for all modes now. That said
> > >> > seeing as we are at the end of a block shouldn't can_do_io be set?
> > >>
> > >> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> > >> which happens earlier than the tb_stop callback (it can
> > >> happen in the trans function for branch etc insns, for
> > >> example).
> > >>
> > >> I think it should be OK to set can_do_io at the start
> > >> of the lookup_tb_ptr helper, something like:
> > >> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> > >> index 977576ca143..7818537f318 100644
> > >> --- a/accel/tcg/cpu-exec.c
> > >> +++ b/accel/tcg/cpu-exec.c
> > >> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
> > >>      uint64_t cs_base;
> > >>      uint32_t flags, cflags;
> > >>
> > >> +    /*
> > >> +     * By definition we've just finished a TB, so I/O is OK.
> > >> +     * Avoid the possibility of calling cpu_io_recompile() if
> > >> +     * a page table walk triggered by tb_lookup() calling
> > >> +     * probe_access_internal() happens to touch an MMIO device.
> > >> +     * The next TB, if we chain to it, will clear the flag again.
> > >> +     */
> > >> +    cpu->neg.can_do_io = true;
> > >> +
> > >>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> > >>
> > >>      cflags = curr_cflags(cpu);
> > >>
> > >> -- PMM
> > >
> > > No joy.  Seems like a very similar backtrace.
> > >
> > > Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> > > [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
> > > __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > > 44      ./nptl/pthread_kill.c: No such file or directory.
> > > (gdb) bt
> > > #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > > #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> > > #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> > > #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> > > #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> > > #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
> > > #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> > > #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
> > > #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> > > #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> > > #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> > > #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> > > #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> > > #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> > > #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> > > #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
> > > #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
> > > #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> > > #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
> > > #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
> > > #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
> > > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > > #22 0x00007fffa9107c63 in code_gen_buffer ()
> >
> > No, that's different - we are actually writing to the MMIO region here.
> > But the fact we hit cpu_abort because we can't find the TB we are
> > executing is a little problematic.
> >
> > Does ra properly point to the code buffer here?
>
> Err.  How would I tell?

Well, a NULL pointer for the return address is definitely not in
the codegen buffer :-)

This is again a TLB fill case, but this time it's a data
access from a guest store insn. We had the correct ra when we
did the do_st8_mmu() down in frame 21: ra=140736029817822,
but as we go through the page table walk, we leave the ra
behind in x86_cpu_tlb_fill(), and so the ptw_ldq()
passes a zero ra down to the cpu_ldq_mmuidx_ra() (which
is generally meant to mean "I am not being called from
translated code and so can_do_io should be false").

I think that none of the page-table-walking handling
(either in target code or in general) has been designed
with the idea in mind that it might need to do something
for icount if the ptw touches an MMIO device. This is probably
not as simple as merely "plumb the ra value down through the
ptw code" -- somebody needs to think about whether doing
an io_recompile() is the right response to that situation.
And any "do an address translation for me" system insns
might or might not need to be dealt with differently.

If you can at all arrange for your workload not to put
page tables into MMIO device regions then your life will
be a lot simpler.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:08                                         ` Jonathan Cameron via
  (?)
  (?)
@ 2024-02-01 17:25                                         ` Alex Bennée
  2024-02-01 18:04                                           ` Peter Maydell
  -1 siblings, 1 reply; 50+ messages in thread
From: Alex Bennée @ 2024-02-01 17:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Peter Maydell, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Thu, 01 Feb 2024 16:45:30 +0000
> Alex Bennée <alex.bennee@linaro.org> wrote:
>
>> Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
>> 
>> > On Thu, 1 Feb 2024 16:00:56 +0000
>> > Peter Maydell <peter.maydell@linaro.org> wrote:
>> >  
>> >> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:  
>> >> >
>> >> > Peter Maydell <peter.maydell@linaro.org> writes:    
>> >> > > So, that looks like:
>> >> > >  * we call cpu_tb_exec(), which executes some generated code
>> >> > >  * that generated code calls the lookup_tb_ptr helper to see
>> >> > >    if we have a generated TB already for the address we're going
>> >> > >    to execute next
>> >> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
>> >> > >    address for the guest address
>> >> > >  * this results in a TLB walk for an instruction fetch
>> >> > >  * the page table descriptor load is to IO memory
>> >> > >  * io_prepare assumes it needs to do a TLB recompile, because
>> >> > >    can_do_io is clear
>> >> > >
>> >> > > I am not surprised that the corner case of "the guest put its
>> >> > > page tables in an MMIO device" has not yet come up :-)
>> >> > >
>> >> > > I'm really not sure how the icount handling should interact
>> >> > > with that...    
>> >> >
>> >> > It's not just icount - we need to handle it for all modes now. That said
>> >> > seeing as we are at the end of a block shouldn't can_do_io be set?    
>> >> 
>> >> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
>> >> which happens earlier than the tb_stop callback (it can
>> >> happen in the trans function for branch etc insns, for
>> >> example).
>> >> 
>> >> I think it should be OK to set can_do_io at the start
>> >> of the lookup_tb_ptr helper, something like:
>> >> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
>> >> index 977576ca143..7818537f318 100644
>> >> --- a/accel/tcg/cpu-exec.c
>> >> +++ b/accel/tcg/cpu-exec.c
>> >> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
>> >>      uint64_t cs_base;
>> >>      uint32_t flags, cflags;
>> >> 
>> >> +    /*
>> >> +     * By definition we've just finished a TB, so I/O is OK.
>> >> +     * Avoid the possibility of calling cpu_io_recompile() if
>> >> +     * a page table walk triggered by tb_lookup() calling
>> >> +     * probe_access_internal() happens to touch an MMIO device.
>> >> +     * The next TB, if we chain to it, will clear the flag again.
>> >> +     */
>> >> +    cpu->neg.can_do_io = true;
>> >> +
>> >>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>> >> 
>> >>      cflags = curr_cflags(cpu);
>> >> 
>> >> -- PMM  
>> >
>> > No joy.  Seems like a very similar backtrace.
>> >
>> > Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
>> > [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
>> > __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>> > 44      ./nptl/pthread_kill.c: No such file or directory.
>> > (gdb) bt
>> > #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>> > #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>> > #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>> > #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>> > #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
>> > #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
>> > #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
>> > #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
>> > #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
>> > #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
>> > #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
>> > #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
>> > #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
>> > #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
>> > #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
>> > #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
>> > #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
>> > #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
>> > #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
>> > #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
>> > #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
>> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
>> > #22 0x00007fffa9107c63 in code_gen_buffer ()  
>> 
>> No, that's different - we are actually writing to the MMIO region here.
>> But the fact we hit cpu_abort because we can't find the TB we are
>> executing is a little problematic.
>> 
>> Does ra properly point to the code buffer here?
>
> Err.  How would I tell?

(gdb) p/x 140736029817822
$1 = 0x7fffa9107bde

seems off because code_gen_buffer starts at 0x00007fffa9107c63

>
> I'll confess I have almost no idea what is going on in TCG :(
>
> Can learn but it won't be quick.
>
> J
>
>> 
>> > #23 0x0000555555c9395b in cpu_tb_exec (cpu=cpu@entry=0x5555578e0cb0, itb=itb@entry=0x7fffa9107980 <code_gen_buffer+17856851>, tb_exit=tb_exit@entry=0x7ffff4efd718) at ../../accel/tcg/cpu-exec.c:442

At this point I'd run under rr and set a bp at cpu_tb_exec,
reverse-continue and then see how we ended up off-piste with a retaddr
that can't be resolved.
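That workflow would look roughly like the following session (rr's `record`/`replay` and gdb's `reverse-continue` are real commands; the qemu invocation is abbreviated):

```
$ rr record qemu-system-x86_64 -machine q35,cxl=on ...   # reproduce the abort under rr
$ rr replay                                              # replays the recording under gdb
(gdb) break cpu_tb_exec
(gdb) continue             # run forward until the cpu_abort fires
(gdb) reverse-continue     # then step backwards through the preceding TB executions
```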

>> > #24 0x0000555555c93ec0 in cpu_loop_exec_tb (tb_exit=0x7ffff4efd718, last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffa9107980 <code_gen_buffer+17856851>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:897
>> > #25 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1012
>> > #26 0x0000555555c946d1 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
>> > #27 0x0000555555c94ebc in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
>> > #28 0x0000555555cb8f53 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
>> > #29 0x0000555555cb90b0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
>> > #30 0x0000555555e57180 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
>> > #31 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
>> > #32 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81  
>> 

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:07                                         ` Peter Maydell
@ 2024-02-01 17:29                                           ` Gregory Price
  0 siblings, 0 replies; 50+ messages in thread
From: Gregory Price @ 2024-02-01 17:29 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Jonathan Cameron, Sajjan Rao,
	Dimitrios Palyvos, linux-cxl, qemu-devel, richard.henderson

On Thu, Feb 01, 2024 at 05:07:31PM +0000, Peter Maydell wrote:
> On Thu, 1 Feb 2024 at 17:04, Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Thu, Feb 01, 2024 at 04:45:30PM +0000, Alex Bennée wrote:
> 
> > > No, that's different - we are actually writing to the MMIO region here.
> > > But the fact we hit cpu_abort because we can't find the TB we are
> > > executing is a little problematic.
> > >
> > > Does ra properly point to the code buffer here?
> > >
> >
> > What if the code block is ALSO in CXL (MMIO)? :D
> 
> In that case the TB is supposed to be a single insn,
> so the insn will by definition be the last one in its
> TB, and IO should be OK for it -- so can_do_io ought
> to be true and we shouldn't get into the io_recompile.
> 
> -- PMM

We saw a bug early on in CXL emulation with instructions hosted on CXL
that split a page boundary (e.g. 0xEB|0xFE). I'm wondering about a
code block that splits a page boundary and whether there's a similar
corner case.

~Gregory

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:21                                         ` Peter Maydell
@ 2024-02-01 17:41                                             ` Jonathan Cameron via
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-01 17:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 17:21:49 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 1 Feb 2024 at 17:08, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Thu, 01 Feb 2024 16:45:30 +0000
> > Alex Bennée <alex.bennee@linaro.org> wrote:
> >  
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> writes:
> > >  
> > > > On Thu, 1 Feb 2024 16:00:56 +0000
> > > > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >  
> > > >> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:  
> > > >> >
> > > >> > Peter Maydell <peter.maydell@linaro.org> writes:  
> > > >> > > So, that looks like:
> > > >> > >  * we call cpu_tb_exec(), which executes some generated code
> > > >> > >  * that generated code calls the lookup_tb_ptr helper to see
> > > >> > >    if we have a generated TB already for the address we're going
> > > >> > >    to execute next
> > > >> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> > > >> > >    address for the guest address
> > > >> > >  * this results in a TLB walk for an instruction fetch
> > > >> > >  * the page table descriptor load is to IO memory
> > > >> > >  * io_prepare assumes it needs to do a TLB recompile, because
> > > >> > >    can_do_io is clear
> > > >> > >
> > > >> > > I am not surprised that the corner case of "the guest put its
> > > >> > > page tables in an MMIO device" has not yet come up :-)
> > > >> > >
> > > >> > > I'm really not sure how the icount handling should interact
> > > >> > > with that...  
> > > >> >
> > > >> > It's not just icount - we need to handle it for all modes now. That said
> > > >> > seeing as we are at the end of a block shouldn't can_do_io be set?  
> > > >>
> > > >> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> > > >> which happens earlier than the tb_stop callback (it can
> > > >> happen in the trans function for branch etc insns, for
> > > >> example).
> > > >>
> > > >> I think it should be OK to clear can_do_io at the start
> > > >> of the lookup_tb_ptr helper, something like:
> > > >> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> > > >> index 977576ca143..7818537f318 100644
> > > >> --- a/accel/tcg/cpu-exec.c
> > > >> +++ b/accel/tcg/cpu-exec.c
> > > >> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
> > > >>      uint64_t cs_base;
> > > >>      uint32_t flags, cflags;
> > > >>
> > > >> +    /*
> > > >> +     * By definition we've just finished a TB, so I/O is OK.
> > > >> +     * Avoid the possibility of calling cpu_io_recompile() if
> > > >> +     * a page table walk triggered by tb_lookup() calling
> > > >> +     * probe_access_internal() happens to touch an MMIO device.
> > > >> +     * The next TB, if we chain to it, will clear the flag again.
> > > >> +     */
> > > >> +    cpu->neg.can_do_io = true;
> > > >> +
> > > >>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> > > >>
> > > >>      cflags = curr_cflags(cpu);
> > > >>
> > > >> -- PMM  
> > > >
> > > > No joy.  Seems like a very similar backtrace.
> > > >
> > > > Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> > > > [Switching to Thread 0x7ffff4efe6c0 (LWP 23937)]
> > > > __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > > > 44      ./nptl/pthread_kill.c: No such file or directory.
> > > > (gdb) bt
> > > > #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> > > > #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> > > > #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> > > > #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> > > > #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> > > > #5  0x0000555555c4d19e in cpu_abort (cpu=cpu@entry=0x5555578e0cb0, fmt=fmt@entry=0x555556048ee8 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:373
> > > > #6  0x0000555555c9cb25 in cpu_io_recompile (cpu=cpu@entry=0x5555578e0cb0, retaddr=retaddr@entry=0) at ../../accel/tcg/translate-all.c:611
> > > > #7  0x0000555555c9f744 in io_prepare (retaddr=0, addr=19595790664, attrs=..., xlat=<optimized out>, cpu=0x5555578e0cb0, out_offset=<synthetic pointer>) at ../../accel/tcg/cputlb.c:1339
> > > > #8  do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012890, ret_be=ret_be@entry=0, addr=19595790664, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
> > > > #9  0x0000555555ca0ecd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efcdd0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> > > > #10 0x0000555555ca332f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595790664, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> > > > #11 0x0000555555ca5e69 in cpu_ldq_mmu (ra=0, oi=52, addr=19595790664, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> > > > #12 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595790664, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> > > > #13 0x0000555555b4b5de in ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> > > > #14 ptw_ldq (in=0x7ffff4efcf10) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> > > > #15 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efcfd0, out=0x7ffff4efcfa0, err=err@entry=0x7ffff4efcfb0) at ../../target/i386/tcg/sysemu/excp_helper.c:173
> > > > #16 0x0000555555b4c3f3 in get_physical_address (err=0x7ffff4efcfb0, out=0x7ffff4efcfa0, mmu_idx=0, access_type=MMU_DATA_STORE, addr=18386491786698339392, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:578
> > > > #17 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18386491786698339392, size=<optimized out>, access_type=MMU_DATA_STORE, mmu_idx=0, probe=<optimized out>, retaddr=140736029817822) at ../../target/i386/tcg/sysemu/excp_helper.c:604
> > > > #18 0x0000555555ca0df9 in tlb_fill (retaddr=140736029817822, mmu_idx=0, access_type=MMU_DATA_STORE, size=<optimized out>, addr=18386491786698339392, cpu=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1315
> > > > #19 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd120, mmu_idx=0, access_type=access_type@entry=MMU_DATA_STORE, ra=ra@entry=140736029817822) at ../../accel/tcg/cputlb.c:1713
> > > > #20 0x0000555555ca2b71 in mmu_lookup (cpu=0x5555578e0cb0, addr=18386491786698339392, oi=<optimized out>, ra=140736029817822, type=MMU_DATA_STORE, l=0x7ffff4efd120) at ../../accel/tcg/cputlb.c:1803
> > > > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > > > #22 0x00007fffa9107c63 in code_gen_buffer ()  
> > >
> > > No, that's different - we are actually writing to the MMIO region here.
> > > But the fact we hit cpu_abort because we can't find the TB we are
> > > executing is a little problematic.
> > >
> > > Does ra properly point to the code buffer here?  
> >
> > Err.  How would I tell?  
> 
> Well, a NULL pointer for the return address is definitely not in
> the codegen buffer :-)
> 
> This is again a TLB fill case, but this time it's a data
> access from a guest store insn. We had the correct ra when we
> did the do_st8_mmu() down in frame 21: ra=140736029817822,
> but as we go through the page table walk, we leave the ra
> behind in x86_cpu_tlb_fill(), and so the ptw_ldq()
> passes a zero ra down to the cpu_ldq_mmuidx_ra() (which
> is generally meant to mean "I am not being called from
> translated code and so can_do_io should be false").
> 
> I think that none of the page-table-walking handling
> (either in target code or in general) has been designed
> with the idea in mind that it might need to do something
> for icount if the ptw touches an MMIO device. This is probably
> not as simple as merely "plumb the ra value down through the
> ptw code" -- somebody needs to think about whether doing
> an io_recompile() is the right response to that situation.
> And any "do an address translation for me" system insns
> might or might not need to be dealt with differently.
> 
> If you can at all arrange for your workload not to put
> page tables into MMIO device regions then your life will
> be a lot simpler.

The only way we can do that is to abandon supporting emulation
of fine-grained interleave (or at least document that if you
are using this 'mode', you shouldn't use it as normal memory).

Whilst I do plan to add a performance option that just
locks out those settings so that we can avoid using MMIO
regions, we are going to keep getting people assuming this
case will work  :( 

We need the high-perf restricted option for virtualization
use cases where it's reasonable to pretend that there is
no interleave going on (it will be happening in hardware).

I guess doing that goes up in priority :(

Jonathan

> 
> thanks
> -- PMM


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 17:25                                         ` Alex Bennée
@ 2024-02-01 18:04                                           ` Peter Maydell
  2024-02-01 18:56                                             ` Gregory Price
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-01 18:04 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Jonathan Cameron, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 at 17:25, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:
> >> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> >> > #22 0x00007fffa9107c63 in code_gen_buffer ()
> >>
> > >> No, that's different - we are actually writing to the MMIO region here.
> >> But the fact we hit cpu_abort because we can't find the TB we are
> >> executing is a little problematic.
> >>
> >> Does ra properly point to the code buffer here?
> >
> > Err.  How would I tell?
>
> (gdb) p/x 140736029817822
> $1 = 0x7fffa9107bde
>
> seems off because code_gen_buffer starts at 0x00007fffa9107c63

The code_gen_buffer doesn't *start* at 0x00007fffa9107c63 --
that is our return address into it, which is to say it's just
after the call insn to the do_st8_mmu helper. The 'ra' argument
to the helper function is going to be a number slightly lower
than that, because it points within the main lump of generated
code for the TB, whereas the helper call is done as part of
an out-of-line lump of common code at the end of the TB.

The 'ra' here is fine -- the problem is that we don't
pass it all the way down the callstack and instead end
up using 0 as a 'ra' within the ptw code.

-- PMM

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 18:04                                           ` Peter Maydell
@ 2024-02-01 18:56                                             ` Gregory Price
  2024-02-02 16:26                                                 ` Jonathan Cameron
  0 siblings, 1 reply; 50+ messages in thread
From: Gregory Price @ 2024-02-01 18:56 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Jonathan Cameron, Sajjan Rao,
	Dimitrios Palyvos, linux-cxl, qemu-devel, richard.henderson

On Thu, Feb 01, 2024 at 06:04:26PM +0000, Peter Maydell wrote:
> On Thu, 1 Feb 2024 at 17:25, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:
> > >> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > >> > #22 0x00007fffa9107c63 in code_gen_buffer ()
> > >>
> > >> No, that's different - we are actually writing to the MMIO region here.
> > >> But the fact we hit cpu_abort because we can't find the TB we are
> > >> executing is a little problematic.
> > >>
> > >> Does ra properly point to the code buffer here?
> > >
> > > Err.  How would I tell?
> >
> > (gdb) p/x 140736029817822
> > $1 = 0x7fffa9107bde
> >
> > seems off because code_gen_buffer starts at 0x00007fffa9107c63
> 
> The code_gen_buffer doesn't *start* at 0x00007fffa9107c63 --
> that is our return address into it, which is to say it's just
> after the call insn to the do_st8_mmu helper. The 'ra' argument
> to the helper function is going to be a number slightly lower
> than that, because it points within the main lump of generated
> code for the TB, whereas the helper call is done as part of
> an out-of-line lump of common code at the end of the TB.
> 
> The 'ra' here is fine -- the problem is that we don't
> pass it all the way down the callstack and instead end
> up using 0 as a 'ra' within the ptw code.
> 
> -- PMM

Is there any particular reason not to, as below?
~Gregory


diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 5b86f439ad..2f581b9bfb 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -59,14 +59,14 @@ typedef struct PTETranslate {
     hwaddr gaddr;
 } PTETranslate;

-static bool ptw_translate(PTETranslate *inout, hwaddr addr)
+static bool ptw_translate(PTETranslate *inout, hwaddr addr, uint64_t ra)
 {
     CPUTLBEntryFull *full;
     int flags;

     inout->gaddr = addr;
     flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
-                              inout->ptw_idx, true, &inout->haddr, &full, 0);
+                              inout->ptw_idx, true, &inout->haddr, &full, ra);

     if (unlikely(flags & TLB_INVALID_MASK)) {
         TranslateFault *err = inout->err;
@@ -82,20 +82,20 @@ static bool ptw_translate(PTETranslate *inout, hwaddr addr)
     return true;
 }

-static inline uint32_t ptw_ldl(const PTETranslate *in)
+static inline uint32_t ptw_ldl(const PTETranslate *in, uint64_t ra)
 {
     if (likely(in->haddr)) {
         return ldl_p(in->haddr);
     }
-    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }

-static inline uint64_t ptw_ldq(const PTETranslate *in)
+static inline uint64_t ptw_ldq(const PTETranslate *in, uint64_t ra)
 {
     if (likely(in->haddr)) {
         return ldq_p(in->haddr);
     }
-    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }

 /*
@@ -132,7 +132,8 @@ static inline bool ptw_setl(const PTETranslate *in, uint32_t old, uint32_t set)
 }

 static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
-                          TranslateResult *out, TranslateFault *err)
+                          TranslateResult *out, TranslateFault *err,
+                          uint64_t ra)
 {
     const int32_t a20_mask = x86_get_a20_mask(env);
     const target_ulong addr = in->addr;
@@ -166,11 +167,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
                  */
                 pte_addr = ((in->cr3 & ~0xfff) +
                             (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
-                if (!ptw_translate(&pte_trans, pte_addr)) {
+                if (!ptw_translate(&pte_trans, pte_addr, ra)) {
                     return false;
                 }
             restart_5:
-                pte = ptw_ldq(&pte_trans);
+                pte = ptw_ldq(&pte_trans, ra);
                 if (!(pte & PG_PRESENT_MASK)) {
                     goto do_fault;
                 }
@@ -191,11 +192,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
              */
             pte_addr = ((pte & PG_ADDRESS_MASK) +
                         (((addr >> 39) & 0x1ff) << 3)) & a20_mask;
-            if (!ptw_translate(&pte_trans, pte_addr)) {
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
                 return false;
             }
         restart_4:
-            pte = ptw_ldq(&pte_trans);
+            pte = ptw_ldq(&pte_trans, ra);
             if (!(pte & PG_PRESENT_MASK)) {
                 goto do_fault;
             }
@@ -212,11 +213,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
              */
             pte_addr = ((pte & PG_ADDRESS_MASK) +
                         (((addr >> 30) & 0x1ff) << 3)) & a20_mask;
-            if (!ptw_translate(&pte_trans, pte_addr)) {
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
                 return false;
             }
         restart_3_lma:
-            pte = ptw_ldq(&pte_trans);
+            pte = ptw_ldq(&pte_trans, ra);
             if (!(pte & PG_PRESENT_MASK)) {
                 goto do_fault;
             }
@@ -239,12 +240,12 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
              * Page table level 3
              */
             pte_addr = ((in->cr3 & ~0x1f) + ((addr >> 27) & 0x18)) & a20_mask;
-            if (!ptw_translate(&pte_trans, pte_addr)) {
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
                 return false;
             }
             rsvd_mask |= PG_HI_USER_MASK;
         restart_3_nolma:
-            pte = ptw_ldq(&pte_trans);
+            pte = ptw_ldq(&pte_trans, ra);
             if (!(pte & PG_PRESENT_MASK)) {
                 goto do_fault;
             }
@@ -262,11 +263,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
          */
         pte_addr = ((pte & PG_ADDRESS_MASK) +
                     (((addr >> 21) & 0x1ff) << 3)) & a20_mask;
-        if (!ptw_translate(&pte_trans, pte_addr)) {
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
             return false;
         }
     restart_2_pae:
-        pte = ptw_ldq(&pte_trans);
+        pte = ptw_ldq(&pte_trans, ra);
         if (!(pte & PG_PRESENT_MASK)) {
             goto do_fault;
         }
@@ -289,10 +290,10 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
          */
         pte_addr = ((pte & PG_ADDRESS_MASK) +
                     (((addr >> 12) & 0x1ff) << 3)) & a20_mask;
-        if (!ptw_translate(&pte_trans, pte_addr)) {
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
             return false;
         }
-        pte = ptw_ldq(&pte_trans);
+        pte = ptw_ldq(&pte_trans, ra);
         if (!(pte & PG_PRESENT_MASK)) {
             goto do_fault;
         }
@@ -307,11 +308,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
          * Page table level 2
          */
         pte_addr = ((in->cr3 & ~0xfff) + ((addr >> 20) & 0xffc)) & a20_mask;
-        if (!ptw_translate(&pte_trans, pte_addr)) {
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
             return false;
         }
     restart_2_nopae:
-        pte = ptw_ldl(&pte_trans);
+        pte = ptw_ldl(&pte_trans, ra);
         if (!(pte & PG_PRESENT_MASK)) {
             goto do_fault;
         }
@@ -336,10 +337,10 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
          * Page table level 1
          */
         pte_addr = ((pte & ~0xfffu) + ((addr >> 10) & 0xffc)) & a20_mask;
-        if (!ptw_translate(&pte_trans, pte_addr)) {
+        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
             return false;
         }
-        pte = ptw_ldl(&pte_trans);
+        pte = ptw_ldl(&pte_trans, ra);
         if (!(pte & PG_PRESENT_MASK)) {
             goto do_fault;
         }
@@ -529,7 +530,8 @@ static G_NORETURN void raise_stage2(CPUX86State *env, TranslateFault *err,

 static bool get_physical_address(CPUX86State *env, vaddr addr,
                                  MMUAccessType access_type, int mmu_idx,
-                                 TranslateResult *out, TranslateFault *err)
+                                 TranslateResult *out, TranslateFault *err,
+                                 uint64_t ra)
 {
     TranslateParams in;
     bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
@@ -548,7 +550,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
             in.mmu_idx = MMU_USER_IDX;
             in.ptw_idx = MMU_PHYS_IDX;

-            if (!mmu_translate(env, &in, out, err)) {
+            if (!mmu_translate(env, &in, out, err, ra)) {
                 err->stage2 = S2_GPA;
                 return false;
             }
@@ -575,7 +577,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
                     return false;
                 }
             }
-            return mmu_translate(env, &in, out, err);
+            return mmu_translate(env, &in, out, err, ra);
         }
         break;
     }
@@ -601,7 +603,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
     TranslateResult out;
     TranslateFault err;

-    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err)) {
+    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err, retaddr)) {
         /*
          * Even if 4MB pages, we map only one 4KB page in the cache to
          * avoid filling it too fast.

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 18:56                                             ` Gregory Price
@ 2024-02-02 16:26                                                 ` Jonathan Cameron
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron via @ 2024-02-02 16:26 UTC (permalink / raw)
  To: Gregory Price
  Cc: Peter Maydell, Alex Bennée, Sajjan Rao, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 13:56:09 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Feb 01, 2024 at 06:04:26PM +0000, Peter Maydell wrote:
> > On Thu, 1 Feb 2024 at 17:25, Alex Bennée <alex.bennee@linaro.org> wrote:  
> > >
> > > Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:  
> > > >> > #21 0x0000555555ca3e5d in do_st8_mmu (cpu=0x5555578e0cb0, addr=23937, val=18386491784638059520, oi=6, ra=140736029817822) at ../../accel/tcg/cputlb.c:2853
> > > >> > #22 0x00007fffa9107c63 in code_gen_buffer ()  
> > > >>
> > > >> No, that's different - we are actually writing to the MMIO region here.
> > > >> But the fact we hit cpu_abort because we can't find the TB we are
> > > >> executing is a little problematic.
> > > >>
> > > >> Does ra properly point to the code buffer here?  
> > > >
> > > > Err.  How would I tell?  
> > >
> > > (gdb) p/x 140736029817822
> > > $1 = 0x7fffa9107bde
> > >
> > > seems off because code_gen_buffer starts at 0x00007fffa9107c63  
> > 
> > The code_gen_buffer doesn't *start* at 0x00007fffa9107c63 --
> > that is our return address into it, which is to say it's just
> > after the call insn to the do_st8_mmu helper. The 'ra' argument
> > to the helper function is going to be a number slightly lower
> > than that, because it points within the main lump of generated
> > code for the TB, whereas the helper call is done as part of
> > an out-of-line lump of common code at the end of the TB.
> > 
> > The 'ra' here is fine -- the problem is that we don't
> > pass it all the way down the callstack and instead end
> > up using 0 as a 'ra' within the ptw code.
> > 
> > -- PMM  
> 
> Is there any particular reason not to, as below?
> ~Gregory
> 
One patch blindly applied. 
New exciting trace...
Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4efe6c0 (LWP 16503)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7b2ed1e in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x00007ffff7b9622e in g_assertion_message_expr () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x0000555555ab1929 in bql_lock_impl (file=0x555556049122 "../../accel/tcg/cputlb.c", line=2033) at ../../system/cpus.c:524
#8  bql_lock_impl (file=file@entry=0x555556049122 "../../accel/tcg/cputlb.c", line=line@entry=2033) at ../../system/cpus.c:520
#9  0x0000555555c9f7d6 in do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012950, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2033
#10 0x0000555555ca0fbd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efd1d0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
#11 0x0000555555ca341f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595792376, oi=oi@entry=52, ra=0, ra@entry=52, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
#12 0x0000555555ca5f59 in cpu_ldq_mmu (ra=52, oi=52, addr=19595792376, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
#13 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595792376, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
#14 0x0000555555b4b5fc in ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:98
#15 ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:93
#16 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efd3e0, out=0x7ffff4efd3b0, err=err@entry=0x7ffff4efd3c0, ra=ra@entry=0) at ../../target/i386/tcg/sysemu/excp_helper.c:174
#17 0x0000555555b4c4b3 in get_physical_address (ra=0, err=0x7ffff4efd3c0, out=0x7ffff4efd3b0, mmu_idx=0, access_type=MMU_DATA_LOAD, addr=18446741874686299840, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:580
#18 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18446741874686299840, size=<optimized out>, access_type=MMU_DATA_LOAD, mmu_idx=0, probe=<optimized out>, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:606
#19 0x0000555555ca0ee9 in tlb_fill (retaddr=0, mmu_idx=0, access_type=MMU_DATA_LOAD, size=<optimized out>, addr=18446741874686299840, cpu=0x7ffff4efd540) at ../../accel/tcg/cputlb.c:1315
#20 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd540, mmu_idx=0, access_type=access_type@entry=MMU_DATA_LOAD, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:1713
#21 0x0000555555ca2c61 in mmu_lookup (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=18446741874686299840, oi=oi@entry=32, ra=ra@entry=0, type=type@entry=MMU_DATA_LOAD, l=l@entry=0x7ffff4efd540) at ../../accel/tcg/cputlb.c:1803
#22 0x0000555555ca3165 in do_ld4_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=18446741874686299840, oi=oi@entry=32, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2416
#23 0x0000555555ca5ef9 in cpu_ldl_mmu (ra=0, oi=32, addr=18446741874686299840, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:158
#24 cpu_ldl_le_mmuidx_ra (env=env@entry=0x5555578e3470, addr=addr@entry=18446741874686299840, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:294
#25 0x0000555555bb6cdd in do_interrupt64 (is_hw=1, next_eip=18446744072399775809, error_code=0, is_int=0, intno=236, env=0x5555578e3470) at ../../target/i386/tcg/seg_helper.c:889
#26 do_interrupt_all (cpu=cpu@entry=0x5555578e0cb0, intno=236, is_int=is_int@entry=0, error_code=error_code@entry=0, next_eip=next_eip@entry=0, is_hw=is_hw@entry=1) at ../../target/i386/tcg/seg_helper.c:1130
#27 0x0000555555bb87da in do_interrupt_x86_hardirq (env=env@entry=0x5555578e3470, intno=<optimized out>, is_hw=is_hw@entry=1) at ../../target/i386/tcg/seg_helper.c:1162
#28 0x0000555555b5039c in x86_cpu_exec_interrupt (cs=0x5555578e0cb0, interrupt_request=<optimized out>) at ../../target/i386/tcg/sysemu/seg_helper.c:197
#29 0x0000555555c94480 in cpu_handle_interrupt (last_tb=<synthetic pointer>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:844
#30 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:951
#31 0x0000555555c94791 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
#32 0x0000555555c94f7c in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
#33 0x0000555555cb9043 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
#34 0x0000555555cb91a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
#35 0x0000555555e57270 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
#36 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#37 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

> 
> diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
> index 5b86f439ad..2f581b9bfb 100644
> --- a/target/i386/tcg/sysemu/excp_helper.c
> +++ b/target/i386/tcg/sysemu/excp_helper.c
> @@ -59,14 +59,14 @@ typedef struct PTETranslate {
>      hwaddr gaddr;
>  } PTETranslate;
> 
> -static bool ptw_translate(PTETranslate *inout, hwaddr addr)
> +static bool ptw_translate(PTETranslate *inout, hwaddr addr, uint64_t ra)
>  {
>      CPUTLBEntryFull *full;
>      int flags;
> 
>      inout->gaddr = addr;
>      flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
> -                              inout->ptw_idx, true, &inout->haddr, &full, 0);
> +                              inout->ptw_idx, true, &inout->haddr, &full, ra);
> 
>      if (unlikely(flags & TLB_INVALID_MASK)) {
>          TranslateFault *err = inout->err;
> @@ -82,20 +82,20 @@ static bool ptw_translate(PTETranslate *inout, hwaddr addr)
>      return true;
>  }
> 
> -static inline uint32_t ptw_ldl(const PTETranslate *in)
> +static inline uint32_t ptw_ldl(const PTETranslate *in, uint64_t ra)
>  {
>      if (likely(in->haddr)) {
>          return ldl_p(in->haddr);
>      }
> -    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
> +    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
>  }
> 
> -static inline uint64_t ptw_ldq(const PTETranslate *in)
> +static inline uint64_t ptw_ldq(const PTETranslate *in, uint64_t ra)
>  {
>      if (likely(in->haddr)) {
>          return ldq_p(in->haddr);
>      }
> -    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
> +    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
>  }
> 
>  /*
> @@ -132,7 +132,8 @@ static inline bool ptw_setl(const PTETranslate *in, uint32_t old, uint32_t set)
>  }
> 
>  static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
> -                          TranslateResult *out, TranslateFault *err)
> +                          TranslateResult *out, TranslateFault *err,
> +                          uint64_t ra)
>  {
>      const int32_t a20_mask = x86_get_a20_mask(env);
>      const target_ulong addr = in->addr;
> @@ -166,11 +167,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>                   */
>                  pte_addr = ((in->cr3 & ~0xfff) +
>                              (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
> -                if (!ptw_translate(&pte_trans, pte_addr)) {
> +                if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>                      return false;
>                  }
>              restart_5:
> -                pte = ptw_ldq(&pte_trans);
> +                pte = ptw_ldq(&pte_trans, ra);
>                  if (!(pte & PG_PRESENT_MASK)) {
>                      goto do_fault;
>                  }
> @@ -191,11 +192,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>               */
>              pte_addr = ((pte & PG_ADDRESS_MASK) +
>                          (((addr >> 39) & 0x1ff) << 3)) & a20_mask;
> -            if (!ptw_translate(&pte_trans, pte_addr)) {
> +            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>                  return false;
>              }
>          restart_4:
> -            pte = ptw_ldq(&pte_trans);
> +            pte = ptw_ldq(&pte_trans, ra);
>              if (!(pte & PG_PRESENT_MASK)) {
>                  goto do_fault;
>              }
> @@ -212,11 +213,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>               */
>              pte_addr = ((pte & PG_ADDRESS_MASK) +
>                          (((addr >> 30) & 0x1ff) << 3)) & a20_mask;
> -            if (!ptw_translate(&pte_trans, pte_addr)) {
> +            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>                  return false;
>              }
>          restart_3_lma:
> -            pte = ptw_ldq(&pte_trans);
> +            pte = ptw_ldq(&pte_trans, ra);
>              if (!(pte & PG_PRESENT_MASK)) {
>                  goto do_fault;
>              }
> @@ -239,12 +240,12 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>               * Page table level 3
>               */
>              pte_addr = ((in->cr3 & ~0x1f) + ((addr >> 27) & 0x18)) & a20_mask;
> -            if (!ptw_translate(&pte_trans, pte_addr)) {
> +            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>                  return false;
>              }
>              rsvd_mask |= PG_HI_USER_MASK;
>          restart_3_nolma:
> -            pte = ptw_ldq(&pte_trans);
> +            pte = ptw_ldq(&pte_trans, ra);
>              if (!(pte & PG_PRESENT_MASK)) {
>                  goto do_fault;
>              }
> @@ -262,11 +263,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>           */
>          pte_addr = ((pte & PG_ADDRESS_MASK) +
>                      (((addr >> 21) & 0x1ff) << 3)) & a20_mask;
> -        if (!ptw_translate(&pte_trans, pte_addr)) {
> +        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>              return false;
>          }
>      restart_2_pae:
> -        pte = ptw_ldq(&pte_trans);
> +        pte = ptw_ldq(&pte_trans, ra);
>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -289,10 +290,10 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>           */
>          pte_addr = ((pte & PG_ADDRESS_MASK) +
>                      (((addr >> 12) & 0x1ff) << 3)) & a20_mask;
> -        if (!ptw_translate(&pte_trans, pte_addr)) {
> +        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>              return false;
>          }
> -        pte = ptw_ldq(&pte_trans);
> +        pte = ptw_ldq(&pte_trans, ra);
>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -307,11 +308,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>           * Page table level 2
>           */
>          pte_addr = ((in->cr3 & ~0xfff) + ((addr >> 20) & 0xffc)) & a20_mask;
> -        if (!ptw_translate(&pte_trans, pte_addr)) {
> +        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>              return false;
>          }
>      restart_2_nopae:
> -        pte = ptw_ldl(&pte_trans);
> +        pte = ptw_ldl(&pte_trans, ra);
>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -336,10 +337,10 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>           * Page table level 1
>           */
>          pte_addr = ((pte & ~0xfffu) + ((addr >> 10) & 0xffc)) & a20_mask;
> -        if (!ptw_translate(&pte_trans, pte_addr)) {
> +        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>              return false;
>          }
> -        pte = ptw_ldl(&pte_trans);
> +        pte = ptw_ldl(&pte_trans, ra);
>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -529,7 +530,8 @@ static G_NORETURN void raise_stage2(CPUX86State *env, TranslateFault *err,
> 
>  static bool get_physical_address(CPUX86State *env, vaddr addr,
>                                   MMUAccessType access_type, int mmu_idx,
> -                                 TranslateResult *out, TranslateFault *err)
> +                                 TranslateResult *out, TranslateFault *err,
> +                                 uint64_t ra)
>  {
>      TranslateParams in;
>      bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
> @@ -548,7 +550,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>              in.mmu_idx = MMU_USER_IDX;
>              in.ptw_idx = MMU_PHYS_IDX;
> 
> -            if (!mmu_translate(env, &in, out, err)) {
> +            if (!mmu_translate(env, &in, out, err, ra)) {
>                  err->stage2 = S2_GPA;
>                  return false;
>              }
> @@ -575,7 +577,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>                      return false;
>                  }
>              }
> -            return mmu_translate(env, &in, out, err);
> +            return mmu_translate(env, &in, out, err, ra);
>          }
>          break;
>      }
> @@ -601,7 +603,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
>      TranslateResult out;
>      TranslateFault err;
> 
> -    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err)) {
> +    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err, retaddr)) {
>          /*
>           * Even if 4MB pages, we map only one 4KB page in the cache to
>           * avoid filling it too fast.



^ permalink raw reply	[flat|nested] 50+ messages in thread

>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -336,10 +337,10 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
>           * Page table level 1
>           */
>          pte_addr = ((pte & ~0xfffu) + ((addr >> 10) & 0xffc)) & a20_mask;
> -        if (!ptw_translate(&pte_trans, pte_addr)) {
> +        if (!ptw_translate(&pte_trans, pte_addr, ra)) {
>              return false;
>          }
> -        pte = ptw_ldl(&pte_trans);
> +        pte = ptw_ldl(&pte_trans, ra);
>          if (!(pte & PG_PRESENT_MASK)) {
>              goto do_fault;
>          }
> @@ -529,7 +530,8 @@ static G_NORETURN void raise_stage2(CPUX86State *env, TranslateFault *err,
> 
>  static bool get_physical_address(CPUX86State *env, vaddr addr,
>                                   MMUAccessType access_type, int mmu_idx,
> -                                 TranslateResult *out, TranslateFault *err)
> +                                 TranslateResult *out, TranslateFault *err,
> +                                 uint64_t ra)
>  {
>      TranslateParams in;
>      bool use_stage2 = env->hflags2 & HF2_NPT_MASK;
> @@ -548,7 +550,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>              in.mmu_idx = MMU_USER_IDX;
>              in.ptw_idx = MMU_PHYS_IDX;
> 
> -            if (!mmu_translate(env, &in, out, err)) {
> +            if (!mmu_translate(env, &in, out, err, ra)) {
>                  err->stage2 = S2_GPA;
>                  return false;
>              }
> @@ -575,7 +577,7 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>                      return false;
>                  }
>              }
> -            return mmu_translate(env, &in, out, err);
> +            return mmu_translate(env, &in, out, err, ra);
>          }
>          break;
>      }
> @@ -601,7 +603,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
>      TranslateResult out;
>      TranslateFault err;
> 
> -    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err)) {
> +    if (get_physical_address(env, addr, access_type, mmu_idx, &out, &err, retaddr)) {
>          /*
>           * Even if 4MB pages, we map only one 4KB page in the cache to
>           * avoid filling it too fast.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-02 16:26                                                 ` Jonathan Cameron
  (?)
@ 2024-02-02 16:33                                                 ` Peter Maydell
  2024-02-02 16:50                                                   ` Gregory Price
  -1 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-02 16:33 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Gregory Price, Alex Bennée, Sajjan Rao, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Fri, 2 Feb 2024 at 16:26, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> New exciting trace...
> Thread 5 "qemu-system-x86" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7ffff4efe6c0 (LWP 16503)]
> __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
> 44      ./nptl/pthread_kill.c: No such file or directory.
> (gdb) bt
> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x00007ffff77c43b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #4  0x00007ffff77aa87c in __GI_abort () at ./stdlib/abort.c:79
> #5  0x00007ffff7b2ed1e in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #6  0x00007ffff7b9622e in g_assertion_message_expr () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #7  0x0000555555ab1929 in bql_lock_impl (file=0x555556049122 "../../accel/tcg/cputlb.c", line=2033) at ../../system/cpus.c:524
> #8  bql_lock_impl (file=file@entry=0x555556049122 "../../accel/tcg/cputlb.c", line=line@entry=2033) at ../../system/cpus.c:520
> #9  0x0000555555c9f7d6 in do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012950, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2033
> #10 0x0000555555ca0fbd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efd1d0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> #11 0x0000555555ca341f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595792376, oi=oi@entry=52, ra=0, ra@entry=52, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> #12 0x0000555555ca5f59 in cpu_ldq_mmu (ra=52, oi=52, addr=19595792376, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> #13 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595792376, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> #14 0x0000555555b4b5fc in ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> #15 ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> #16 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efd3e0, out=0x7ffff4efd3b0, err=err@entry=0x7ffff4efd3c0, ra=ra@entry=0) at ../../target/i386/tcg/sysemu/excp_helper.c:174
> #17 0x0000555555b4c4b3 in get_physical_address (ra=0, err=0x7ffff4efd3c0, out=0x7ffff4efd3b0, mmu_idx=0, access_type=MMU_DATA_LOAD, addr=18446741874686299840, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:580
> #18 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18446741874686299840, size=<optimized out>, access_type=MMU_DATA_LOAD, mmu_idx=0, probe=<optimized out>, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:606
> #19 0x0000555555ca0ee9 in tlb_fill (retaddr=0, mmu_idx=0, access_type=MMU_DATA_LOAD, size=<optimized out>, addr=18446741874686299840, cpu=0x7ffff4efd540) at ../../accel/tcg/cputlb.c:1315
> #20 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd540, mmu_idx=0, access_type=access_type@entry=MMU_DATA_LOAD, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:1713
> #21 0x0000555555ca2c61 in mmu_lookup (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=18446741874686299840, oi=oi@entry=32, ra=ra@entry=0, type=type@entry=MMU_DATA_LOAD, l=l@entry=0x7ffff4efd540) at ../../accel/tcg/cputlb.c:1803
> #22 0x0000555555ca3165 in do_ld4_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=18446741874686299840, oi=oi@entry=32, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2416
> #23 0x0000555555ca5ef9 in cpu_ldl_mmu (ra=0, oi=32, addr=18446741874686299840, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:158
> #24 cpu_ldl_le_mmuidx_ra (env=env@entry=0x5555578e3470, addr=addr@entry=18446741874686299840, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:294
> #25 0x0000555555bb6cdd in do_interrupt64 (is_hw=1, next_eip=18446744072399775809, error_code=0, is_int=0, intno=236, env=0x5555578e3470) at ../../target/i386/tcg/seg_helper.c:889
> #26 do_interrupt_all (cpu=cpu@entry=0x5555578e0cb0, intno=236, is_int=is_int@entry=0, error_code=error_code@entry=0, next_eip=next_eip@entry=0, is_hw=is_hw@entry=1) at ../../target/i386/tcg/seg_helper.c:1130
> #27 0x0000555555bb87da in do_interrupt_x86_hardirq (env=env@entry=0x5555578e3470, intno=<optimized out>, is_hw=is_hw@entry=1) at ../../target/i386/tcg/seg_helper.c:1162
> #28 0x0000555555b5039c in x86_cpu_exec_interrupt (cs=0x5555578e0cb0, interrupt_request=<optimized out>) at ../../target/i386/tcg/sysemu/seg_helper.c:197
> #29 0x0000555555c94480 in cpu_handle_interrupt (last_tb=<synthetic pointer>, cpu=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:844
> #30 cpu_exec_loop (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:951
> #31 0x0000555555c94791 in cpu_exec_setjmp (cpu=cpu@entry=0x5555578e0cb0, sc=sc@entry=0x7ffff4efd7b0) at ../../accel/tcg/cpu-exec.c:1029
> #32 0x0000555555c94f7c in cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/cpu-exec.c:1055
> #33 0x0000555555cb9043 in tcg_cpu_exec (cpu=cpu@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops.c:76
> #34 0x0000555555cb91a0 in mttcg_cpu_thread_fn (arg=arg@entry=0x5555578e0cb0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:95
> #35 0x0000555555e57270 in qemu_thread_start (args=0x555557956000) at ../../util/qemu-thread-posix.c:541
> #36 0x00007ffff78176ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
> #37 0x00007ffff78a60d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>

Here we are trying to take an interrupt. This isn't related to the
other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
it's called with the BQL not held, but in fact there are some
situations where we call into the memory subsystem and we do
already have the BQL.

-- PMM

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-02 16:33                                                 ` Peter Maydell
@ 2024-02-02 16:50                                                   ` Gregory Price
  2024-02-02 16:56                                                     ` Peter Maydell
  0 siblings, 1 reply; 50+ messages in thread
From: Gregory Price @ 2024-02-02 16:50 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Jonathan Cameron, Alex Bennée, Sajjan Rao,
	Dimitrios Palyvos, linux-cxl, qemu-devel, richard.henderson

On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:
> On Fri, 2 Feb 2024 at 16:26, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> > #7  0x0000555555ab1929 in bql_lock_impl (file=0x555556049122 "../../accel/tcg/cputlb.c", line=2033) at ../../system/cpus.c:524
> > #8  bql_lock_impl (file=file@entry=0x555556049122 "../../accel/tcg/cputlb.c", line=line@entry=2033) at ../../system/cpus.c:520
> > #9  0x0000555555c9f7d6 in do_ld_mmio_beN (cpu=0x5555578e0cb0, full=0x7ffe88012950, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2033
> > #10 0x0000555555ca0fbd in do_ld_8 (cpu=cpu@entry=0x5555578e0cb0, p=p@entry=0x7ffff4efd1d0, mmu_idx=<optimized out>, type=type@entry=MMU_DATA_LOAD, memop=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
> > #11 0x0000555555ca341f in do_ld8_mmu (cpu=cpu@entry=0x5555578e0cb0, addr=addr@entry=19595792376, oi=oi@entry=52, ra=0, ra@entry=52, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
> > #12 0x0000555555ca5f59 in cpu_ldq_mmu (ra=52, oi=52, addr=19595792376, env=0x5555578e3470) at ../../accel/tcg/ldst_common.c.inc:169
> > #13 cpu_ldq_le_mmuidx_ra (env=0x5555578e3470, addr=19595792376, mmu_idx=<optimized out>, ra=ra@entry=0) at ../../accel/tcg/ldst_common.c.inc:301
> > #14 0x0000555555b4b5fc in ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:98
> > #15 ptw_ldq (ra=0, in=0x7ffff4efd320) at ../../target/i386/tcg/sysemu/excp_helper.c:93
> > #16 mmu_translate (env=env@entry=0x5555578e3470, in=0x7ffff4efd3e0, out=0x7ffff4efd3b0, err=err@entry=0x7ffff4efd3c0, ra=ra@entry=0) at ../../target/i386/tcg/sysemu/excp_helper.c:174
> > #17 0x0000555555b4c4b3 in get_physical_address (ra=0, err=0x7ffff4efd3c0, out=0x7ffff4efd3b0, mmu_idx=0, access_type=MMU_DATA_LOAD, addr=18446741874686299840, env=0x5555578e3470) at ../../target/i386/tcg/sysemu/excp_helper.c:580
> > #18 x86_cpu_tlb_fill (cs=0x5555578e0cb0, addr=18446741874686299840, size=<optimized out>, access_type=MMU_DATA_LOAD, mmu_idx=0, probe=<optimized out>, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:606
> > #19 0x0000555555ca0ee9 in tlb_fill (retaddr=0, mmu_idx=0, access_type=MMU_DATA_LOAD, size=<optimized out>, addr=18446741874686299840, cpu=0x7ffff4efd540) at ../../accel/tcg/cputlb.c:1315
> > #20 mmu_lookup1 (cpu=cpu@entry=0x5555578e0cb0, data=data@entry=0x7ffff4efd540, mmu_idx=0, access_type=access_type@entry=MMU_DATA_LOAD, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:1713
> 
> Here we are trying to take an interrupt. This isn't related to the
> other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> it's called with the BQL not held, but in fact there are some
> situations where we call into the memory subsystem and we do
> already have the BQL.
> 
> -- PMM

It's bugs all the way down as usual!
https://xkcd.com/1416/

I'll dig in a little next week to see if there's an easy fix. We can see
the return address is already 0 going into mmu_translate, so it does
look unrelated to the patch I threw out - but probably still has to do
with things being on IO.

~Gregory

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-02 16:50                                                   ` Gregory Price
@ 2024-02-02 16:56                                                     ` Peter Maydell
  2024-02-07 17:34                                                         ` Jonathan Cameron via
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Maydell @ 2024-02-02 16:56 UTC (permalink / raw)
  To: Gregory Price
  Cc: Jonathan Cameron, Alex Bennée, Sajjan Rao,
	Dimitrios Palyvos, linux-cxl, qemu-devel, richard.henderson

On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.price@memverge.com> wrote:
>
> On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:
> > Here we are trying to take an interrupt. This isn't related to the
> > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > it's called with the BQL not held, but in fact there are some
> > situations where we call into the memory subsystem and we do
> > already have the BQL.

> It's bugs all the way down as usual!
> https://xkcd.com/1416/
>
> I'll dig in a little next week to see if there's an easy fix. We can see
> the return address is already 0 going into mmu_translate, so it does
> look unrelated to the patch I threw out - but probably still has to do
> with things being on IO.

Yes, the low level memory accessors only need to take the BQL if the thing
being accessed is an MMIO device. Probably what is wanted is for those
functions to do "take the lock if we don't already have it", something
like hw/core/cpu-common.c:cpu_reset_interrupt() does.

-- PMM

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-02 16:56                                                     ` Peter Maydell
@ 2024-02-07 17:34                                                         ` Jonathan Cameron via
  0 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-07 17:34 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Gregory Price, Alex Bennée, Sajjan Rao, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson, mst, linuxarm, david

On Fri, 2 Feb 2024 16:56:18 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:  
> > > Here we are trying to take an interrupt. This isn't related to the
> > > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > > it's called with the BQL not held, but in fact there are some
> > > situations where we call into the memory subsystem and we do
> > > already have the BQL.  
> 
> > It's bugs all the way down as usual!
> > https://xkcd.com/1416/
> >
> > I'll dig in a little next week to see if there's an easy fix. We can see
> > the return address is already 0 going into mmu_translate, so it does
> > look unrelated to the patch I threw out - but probably still has to do
> > with things being on IO.  
> 
> Yes, the low level memory accessors only need to take the BQL if the thing
> being accessed is an MMIO device. Probably what is wanted is for those
> functions to do "take the lock if we don't already have it", something
> like hw/core/cpu-common.c:cpu_reset_interrupt() does.
> 
> -- PMM

Still a work in progress but I thought I'd give an update on some of the fun...

I have a set of somewhat dubious workarounds that sort of do the job (where
the aim is to be able to safely run any workload on top of any valid
emulated CXL device setup).

To recap, the issue is that for CXL memory interleaving we need to have
fine-grained routing to each device (16k max granularity).  That was fine whilst
pretty much all the testing was DAX based so software wasn't running out
of it.  Now the kernel is rather more aggressive in defaulting any volatile
CXL memory it finds to being normal memory (in some configs anyway), so people
started hitting problems. Given one of the most important functions of the
emulation is to check data ends up in the right backing stores, I'm not
keen to drop that feature unless we absolutely have to.

1) For the simple case of no interleave I have working code that just
   shoves the MemoryRegion in directly and all works fine.  That was always
   on the todo list for virtualization cases anyway, where we pretend the
   underlying devices aren't interleaved and frig the reported perf numbers
   to present aggregate performance etc.  I'll tidy this up and post it.
   We may want a config parameter to 'reject' address decoder programming
   that would result in interleave - it's not remotely spec compliant, but
   meh, it will make it easier to understand.  For virt case we'll probably
   present locked down decoders (as if a FW has set them up) but for emulation
   that limits usefulness too much.
   
2) Unfortunately, for the interleaved case we can't just add a lot of memory
   regions, because even at the finest granularity (16k) and minimum region
   size (512MiB) it takes forever and eventually runs into an assert in
   phys_section_add with the comment:
   "The physical section number is ORed with a page-aligned
    pointer to produce the iotlb entries.  Thus it should
    never overflow into the page-aligned value."
    That sounds hard to 'fix' though I've not looked into it.

So back to plan (A) papering over the cracks with TCG.

I've focused on arm64 which seems a bit easier than x86 (and is arguably
part of my day job)

Challenges
1) The atomic updates of accessed and dirty bits in
   arm_casq_ptw() fail because we don't have a proper address to do them
   on.  However, there is precedent for non-atomic updates in there
   already (used when the host system doesn't support a big enough CAS).
   I think we can do something similar under the BQL for this case.
   Not 100% sure I'm writing to the correct address but a simple frig
   superficially appears to work.
2) Emulated devices try to do DMA to buffers in the CXL emulated interleave
   memory (virtio_blk for example).  Can't do that because there is no
   actual translation available - just read and write functions.

   So should be easy to avoid as we know how to handle DMA limitations.
   Just set the max dma address width to 40 bits (so below the CXL Fixed Memory
   Windows and rely on Linux to bounce buffer with swiotlb). For a while
   I couldn't work out why changing IORT to provide this didn't work and
   I saw errors for virtio-pci-blk. So digging ensued.
   Virtio devices by default (sort of) bypass the DMA API in Linux; see
   vring_use_dma_api(). That is reasonable from the translation
   point of view, but not the DMA limits (and resulting need to use bounce
   buffers).  Maybe could put a sanity check in linux on no iommu +
   a DMA restriction to below 64 bits but I'm not 100% sure we wouldn't
   break other platforms.
   Alternatively, just use an emulated real device and all seems fine
   - I've tested with nvme.

3) I need to fix the kernel handling for CXL CDAT table originated
   NUMA nodes on ARM64. For now I have a hack in place so I can make
   sure I hit the memory I intend to when testing. I suspect we need
   some significant work to sort 

Suggestions for other approaches would definitely be welcome!

Jonathan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-07 17:34                                                         ` Jonathan Cameron via
@ 2024-02-08 14:50                                                           ` Jonathan Cameron via
  -1 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-08 14:50 UTC (permalink / raw)
  To: Peter Maydell, linuxarm
  Cc: Gregory Price, Alex Bennée, Sajjan Rao, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson, mst, david

On Wed, 7 Feb 2024 17:34:15 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Fri, 2 Feb 2024 16:56:18 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> > On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.price@memverge.com> wrote:  
> > >
> > > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:    
> > > > Here we are trying to take an interrupt. This isn't related to the
> > > > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > > > it's called with the BQL not held, but in fact there are some
> > > > situations where we call into the memory subsystem and we do
> > > > already have the BQL.    
> >   
> > > It's bugs all the way down as usual!
> > > https://xkcd.com/1416/
> > >
> > > I'll dig in a little next week to see if there's an easy fix. We can see
> > > the return address is already 0 going into mmu_translate, so it does
> > > look unrelated to the patch I threw out - but probably still has to do
> > > with things being on IO.    
> > 
> > Yes, the low level memory accessors only need to take the BQL if the thing
> > being accessed is an MMIO device. Probably what is wanted is for those
> > functions to do "take the lock if we don't already have it", something
> > like hw/core/cpu-common.c:cpu_reset_interrupt() does.

Got back to x86 testing and indeed not taking the lock in that one path
does get things running (with all Gregory's earlier hacks + DMA limits as
described below).  Guess it's time to roll some cleaned up patches and
see how much everyone screams :)

Jonathan


> > 
> > -- PMM  
> 
> Still a work in progress but I thought I'd give an update on some of the fun...
> 
> I have a set of somewhat dubious workarounds that sort of do the job (where
> the aim is to be able to safely run any workload on top of any valid
> emulated CXL device setup).
> 
> To recap, the issue is that for CXL memory interleaving we need to have
> find grained routing to each device (16k Max Gran).  That was fine whilst
> pretty much all the testing was DAX based so software wasn't running out
> of it.  Now the kernel is rather more aggressive in defaulting any volatile
> CXL memory it finds to being normal memory (in some configs anyway) people
> started hitting problems. Given one of the most important functions of the
> emulation is to check data ends up in the right backing stores, I'm not
> keen to drop that feature unless we absolutely have to.
> 
> 1) For the simple case of no interleave I have working code that just
>    shoves the MemoryRegion in directly and all works fine.  That was always
>    on the todo list for virtualization cases anyway were we pretend the
>    underlying devices aren't interleaved and frig the reported perf numbers
>    to present aggregate performance etc.  I'll tidy this up and post it.
>    We may want a config parameter to 'reject' address decoder programming
>    that would result in interleave - it's not remotely spec compliant, but
>    meh, it will make it easier to understand.  For virt case we'll probably
>    present locked down decoders (as if a FW has set them up) but for emulation
>    that limits usefulness too much.
>    
> 2) Unfortunately, for the interleaved case we can't just add a lot of memory
>    regions, because even at the highest granularity (16k) and minimum size
>    512MiB it takes forever to eventually run into an assert in
>    phys_section_add with the comment:
>    "The physical section number is ORed with a page-aligned
>     pointer to produce the iotlb entries.  Thus it should
>     never overflow into the page-aligned value."
>     That sounds hard to 'fix' though I've not looked into it.
> 
> So back to plan (A): papering over the cracks with TCG.
> 
> I've focused on arm64 which seems a bit easier than x86 (and is arguably
> part of my day job)
> 
> Challenges
> 1) The atomic updates of accessed and dirty bits in
>    arm_casq_ptw() fail because we don't have a proper address to do them
>    on.  However, there is precedent for non-atomic updates in there
>    already (used when the host system doesn't support a big enough CAS).
>    I think we can do something similar under the BQL for this case.
>    Not 100% sure I'm writing to the correct address but a simple frig
>    superficially appears to work.
> 2) Emulated devices try to do DMA to buffers in the CXL emulated interleave
>    memory (virtio_blk for example).  Can't do that because there is no
>    actual translation available - just read and write functions.
> 
>    So this should be easy to avoid, as we know how to handle DMA limitations.
>    Just set the max DMA address width to 40 bits (so below the CXL Fixed Memory
>    Windows) and rely on Linux to bounce buffer with swiotlb. For a while
>    I couldn't work out why changing the IORT to provide this didn't work and
>    I saw errors for virtio-pci-blk. So digging ensued.
>    Virtio devices by default (sort of) bypass the DMA API in Linux; see
>    vring_use_dma_api(). That is reasonable from the translation
>    point of view, but not for the DMA limits (and the resulting need to use
>    bounce buffers).  Maybe we could put a sanity check in Linux on no IOMMU +
>    a DMA restriction to below 64 bits, but I'm not 100% sure we wouldn't
>    break other platforms.
>    Alternatively, just use an emulated real device and all seems fine
>    - I've tested with nvme.
> 
> 3) I need to fix the kernel handling for CXL CDAT table originated
>    NUMA nodes on ARM64. For now I have a hack in place so I can make
>    sure I hit the memory I intend to when testing. I suspect we need
>    some significant work to sort this out properly.
> 
> Suggestions for other approaches would definitely be welcome!
> 
> Jonathan
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-01 16:00                                 ` Peter Maydell
@ 2024-02-15 15:04                                     ` Jonathan Cameron
  2024-02-15 15:04                                     ` Jonathan Cameron
  1 sibling, 0 replies; 50+ messages in thread
From: Jonathan Cameron via @ 2024-02-15 15:04 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alex Bennée, Sajjan Rao, Gregory Price, Dimitrios Palyvos,
	linux-cxl, qemu-devel, richard.henderson

On Thu, 1 Feb 2024 16:00:56 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 1 Feb 2024 at 15:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Peter Maydell <peter.maydell@linaro.org> writes:  
> > > So, that looks like:
> > >  * we call cpu_tb_exec(), which executes some generated code
> > >  * that generated code calls the lookup_tb_ptr helper to see
> > >    if we have a generated TB already for the address we're going
> > >    to execute next
> > >  * lookup_tb_ptr probes the TLB to see if we know the host RAM
> > >    address for the guest address
> > >  * this results in a TLB walk for an instruction fetch
> > >  * the page table descriptor load is to IO memory
> > >  * io_prepare assumes it needs to do a TLB recompile, because
> > >    can_do_io is clear
> > >
> > > I am not surprised that the corner case of "the guest put its
> > > page tables in an MMIO device" has not yet come up :-)
> > >
> > > I'm really not sure how the icount handling should interact
> > > with that...  
> >
> > It's not just icount - we need to handle it for all modes now. That said
> > seeing as we are at the end of a block shouldn't can_do_io be set?  
> 
> The lookup_tb_ptr helper gets called from tcg_gen_goto_tb(),
> which happens earlier than the tb_stop callback (it can
> happen in the trans function for branch etc insns, for
> example).
> 
> I think it should be OK to set can_do_io at the start
> of the lookup_tb_ptr helper, something like:
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 977576ca143..7818537f318 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -396,6 +396,15 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
>      uint64_t cs_base;
>      uint32_t flags, cflags;
> 
> +    /*
> +     * By definition we've just finished a TB, so I/O is OK.
> +     * Avoid the possibility of calling cpu_io_recompile() if
> +     * a page table walk triggered by tb_lookup() calling
> +     * probe_access_internal() happens to touch an MMIO device.
> +     * The next TB, if we chain to it, will clear the flag again.
> +     */
> +    cpu->neg.can_do_io = true;
> +
>      cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> 
>      cflags = curr_cflags(cpu);
> 
> -- PMM

Hi Peter,

I've included this in the series I just sent out:
https://lore.kernel.org/qemu-devel/20240215150133.2088-1-Jonathan.Cameron@huawei.com/T/#t

Could you add your Signed-off-by if you are happy doing so?

Jonathan


^ permalink raw reply	[flat|nested] 50+ messages in thread


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-08 14:50                                                           ` Jonathan Cameron via
@ 2024-02-15 15:29                                                             ` Jonathan Cameron via
  -1 siblings, 0 replies; 50+ messages in thread
From: Jonathan Cameron @ 2024-02-15 15:29 UTC (permalink / raw)
  To: Peter Maydell, linuxarm, Dimitrios Palyvos, linux-cxl, qemu-devel
  Cc: Gregory Price, Alex Bennée, Sajjan Rao, richard.henderson,
	mst, david, Mattias Nissler

On Thu, 8 Feb 2024 14:50:59 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Wed, 7 Feb 2024 17:34:15 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Fri, 2 Feb 2024 16:56:18 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> >   
> > > On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.price@memverge.com> wrote:    
> > > >
> > > > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:      
> > > > > Here we are trying to take an interrupt. This isn't related to the
> > > > > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > > > > it's called with the BQL not held, but in fact there are some
> > > > > situations where we call into the memory subsystem and we do
> > > > > already have the BQL.      
> > >     
> > > > It's bugs all the way down as usual!
> > > > https://xkcd.com/1416/
> > > >
> > > > I'll dig in a little next week to see if there's an easy fix. We can see
> > > > the return address is already 0 going into mmu_translate, so it does
> > > > look unrelated to the patch I threw out - but probably still has to do
> > > > with things being on IO.      
> > > 
> > > Yes, the low level memory accessors only need to take the BQL if the thing
> > > being accessed is an MMIO device. Probably what is wanted is for those
> > > functions to do "take the lock if we don't already have it", something
> > > like hw/core/cpu-common.c:cpu_reset_interrupt() does.  
> 
> Got back to x86 testing and indeed not taking the lock in that one path
> does get things running (with all Gregory's earlier hacks + DMA limits as
> described below).  Guess it's time to roll some cleaned up patches and
> see how much everyone screams :)
> 

3 series sent out:
(all also on gitlab.com/jic23/qemu cxl-2024-02-15 though I updated patch descriptions
a little after pushing that out)

Main set of fixes (x86 'works' under my light testing after this one)
https://lore.kernel.org/qemu-devel/20240215150133.2088-1-Jonathan.Cameron@huawei.com/

ARM FEAT_HAFDBS (access and dirty bit updating in PTW) workaround for missing atomic CAS
https://lore.kernel.org/qemu-devel/20240215151804.2426-1-Jonathan.Cameron@huawei.com/T/#t

DMA / virtio fix:
https://lore.kernel.org/qemu-devel/20240215142817.1904-1-Jonathan.Cameron@huawei.com/

Last thing I need to do is propose a suitable flag to make
Mattias' bounce buffering size parameter apply to the "memory" address space.
Currently I'm carrying this (I've no idea how much is needed, but it's somewhere
between 4k and 1G):

diff --git a/system/physmem.c b/system/physmem.c
index 43b37942cf..49b961c7a5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2557,6 +2557,7 @@ static void memory_map_init(void)
     memory_region_init(system_memory, NULL, "system", UINT64_MAX);
     address_space_init(&address_space_memory, system_memory, "memory");

+    address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
     system_io = g_malloc(sizeof(*system_io));
     memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
                           65536);

Please take a look. These are all in areas of QEMU I'm not particularly
confident about, so I'm relying on nice people giving feedback even more than
normal!

Thanks to all those who helped with debugging and suggestions.

Thanks,

Jonathan

> Jonathan
> 
> 
> > > 
> > > -- PMM    
> > 
> > Still a work in progress but I thought I'd give an update on some of the fun...
> > 
> > I have a set of somewhat dubious workarounds that sort of do the job (where
> > the aim is to be able to safely run any workload on top of any valid
> > emulated CXL device setup).
> > 
> > To recap, the issue is that for CXL memory interleaving we need to have
> > fine-grained routing to each device (16k max granularity).  That was fine
> > whilst pretty much all the testing was DAX based, so software wasn't running
> > code out of it.  Now that the kernel is rather more aggressive in defaulting
> > any volatile CXL memory it finds to being normal memory (in some configs
> > anyway), people started hitting problems. Given one of the most important functions of the
> > emulation is to check data ends up in the right backing stores, I'm not
> > keen to drop that feature unless we absolutely have to.
> > 
> > 1) For the simple case of no interleave I have working code that just
> >    shoves the MemoryRegion in directly and all works fine.  That was always
> >    on the todo list for virtualization cases anyway, where we pretend the
> >    underlying devices aren't interleaved and frig the reported perf numbers
> >    to present aggregate performance etc.  I'll tidy this up and post it.
> >    We may want a config parameter to 'reject' address decoder programming
> >    that would result in interleave - it's not remotely spec compliant, but
> >    meh, it will make it easier to understand.  For virt case we'll probably
> >    present locked down decoders (as if a FW has set them up) but for emulation
> >    that limits usefulness too much.
> >    
> > 2) Unfortunately, for the interleaved case we can't just add a lot of memory
> >    regions, because even at the highest granularity (16k) and minimum size
> >    512MiB it takes forever to eventually run into an assert in
> >    phys_section_add with the comment:
> >    "The physical section number is ORed with a page-aligned
> >     pointer to produce the iotlb entries.  Thus it should
> >     never overflow into the page-aligned value."
> >     That sounds hard to 'fix' though I've not looked into it.
> > 
> > So back to plan (A): papering over the cracks with TCG.
> > 
> > I've focused on arm64 which seems a bit easier than x86 (and is arguably
> > part of my day job)
> > 
> > Challenges
> > 1) The atomic updates of accessed and dirty bits in
> >    arm_casq_ptw() fail because we don't have a proper address to do them
> >    on.  However, there is precedent for non-atomic updates in there
> >    already (used when the host system doesn't support a big enough CAS).
> >    I think we can do something similar under the BQL for this case.
> >    Not 100% sure I'm writing to the correct address but a simple frig
> >    superficially appears to work.
> > 2) Emulated devices try to do DMA to buffers in the CXL emulated interleave
> >    memory (virtio_blk for example).  Can't do that because there is no
> >    actual translation available - just read and write functions.
> > 
> >    So this should be easy to avoid, as we know how to handle DMA limitations.
> >    Just set the max DMA address width to 40 bits (so below the CXL Fixed Memory
> >    Windows) and rely on Linux to bounce buffer with swiotlb. For a while
> >    I couldn't work out why changing the IORT to provide this didn't work and
> >    I saw errors for virtio-pci-blk. So digging ensued.
> >    Virtio devices by default (sort of) bypass the DMA API in Linux; see
> >    vring_use_dma_api(). That is reasonable from the translation
> >    point of view, but not for the DMA limits (and the resulting need to use
> >    bounce buffers).  Maybe we could put a sanity check in Linux on no IOMMU +
> >    a DMA restriction to below 64 bits, but I'm not 100% sure we wouldn't
> >    break other platforms.
> >    Alternatively, just use an emulated real device and all seems fine
> >    - I've tested with nvme.
> > 
> > 3) I need to fix the kernel handling for CXL CDAT table originated
> >    NUMA nodes on ARM64. For now I have a hack in place so I can make
> >    sure I hit the memory I intend to when testing. I suspect we need
> >    some significant work to sort this out properly.
> > 
> > Suggestions for other approaches would definitely be welcome!
> > 
> > Jonathan
> >   
> 
> 


^ permalink raw reply related	[flat|nested] 50+ messages in thread


* Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1
  2024-02-15 15:29                                                             ` Jonathan Cameron via
  (?)
@ 2024-02-19  7:55                                                             ` Mattias Nissler
  -1 siblings, 0 replies; 50+ messages in thread
From: Mattias Nissler @ 2024-02-19  7:55 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Peter Maydell, linuxarm, Dimitrios Palyvos, linux-cxl,
	qemu-devel, Gregory Price, Alex Bennée, Sajjan Rao,
	richard.henderson, mst, david

[-- Attachment #1: Type: text/plain, Size: 8697 bytes --]

On Thu, Feb 15, 2024 at 4:29 PM Jonathan Cameron <
Jonathan.Cameron@huawei.com> wrote:

> On Thu, 8 Feb 2024 14:50:59 +0000
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>
> > On Wed, 7 Feb 2024 17:34:15 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >
> > > On Fri, 2 Feb 2024 16:56:18 +0000
> > > Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > > On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.price@memverge.com> wrote:
> > > > >
> > > > > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:
>
> > > > > > Here we are trying to take an interrupt. This isn't related to the
> > > > > > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > > > > > it's called with the BQL not held, but in fact there are some
> > > > > > situations where we call into the memory subsystem and we do
> > > > > > already have the BQL.
> > > >
> > > > > It's bugs all the way down as usual!
> > > > > https://xkcd.com/1416/
> > > > >
> > > > > I'll dig in a little next week to see if there's an easy fix. We can see
> > > > > the return address is already 0 going into mmu_translate, so it does
> > > > > look unrelated to the patch I threw out - but probably still has to do
> > > > > with things being on IO.
> > > >
> > > > Yes, the low level memory accessors only need to take the BQL if the thing
> > > > being accessed is an MMIO device. Probably what is wanted is for those
> > > > functions to do "take the lock if we don't already have it", something
> > > > like hw/core/cpu-common.c:cpu_reset_interrupt() does.
> >
> > Got back to x86 testing and indeed not taking the lock in that one path
> > does get things running (with all Gregory's earlier hacks + DMA limits as
> > described below).  Guess it's time to roll some cleaned up patches and
> > see how much everyone screams :)
> >
>
> 3 series sent out:
> (all also on gitlab.com/jic23/qemu cxl-2024-02-15 though I updated patch
> descriptions a little after pushing that out)
>
> Main set of fixes (x86 'works' under my light testing after this one)
>
> https://lore.kernel.org/qemu-devel/20240215150133.2088-1-Jonathan.Cameron@huawei.com/
>
> ARM FEAT_HADFS (access and dirty bit updating in PTW) workaround for
> missing atomic CAS
>
> https://lore.kernel.org/qemu-devel/20240215151804.2426-1-Jonathan.Cameron@huawei.com/T/#t
>
> DMA / virtio fix:
>
> https://lore.kernel.org/qemu-devel/20240215142817.1904-1-Jonathan.Cameron@huawei.com/
>
> Last thing I need to do is propose a suitable flag to make
> Mattias' bounce buffering size parameter apply to "memory" address space.


For background, I actually had a global bounce buffer size parameter apply
to all address spaces in an earlier version of my series. After discussion
on the list, we settled on an address-space specific parameter so it can be
configured per device. I haven't looked into where the memory accesses in
your context originate from - can they be attributed to a specific entity
to house the parameter?


> Currently I'm carrying this: (I've no idea how much is needed but it's
> somewhere between 4k and 1G)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 43b37942cf..49b961c7a5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2557,6 +2557,7 @@ static void memory_map_init(void)
>      memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>      address_space_init(&address_space_memory, system_memory, "memory");
>
> +    address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
>      system_io = g_malloc(sizeof(*system_io));
>      memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
>                            65536);
>
> Please take a look. These are all in areas of QEMU I'm not particularly
> confident about so relying on nice people giving feedback even more than
> normal!
>
> Thanks to all those who helped with debugging and suggestions.
>
> Thanks,
>
> Jonathan
>
> > Jonathan
> >
> >
> > > >
> > > > -- PMM
> > >
> > > Still a work in progress but I thought I'd give an update on some of the fun...
> > >
> > > I have a set of somewhat dubious workarounds that sort of do the job (where
> > > the aim is to be able to safely run any workload on top of any valid
> > > emulated CXL device setup).
> > >
> > > To recap, the issue is that for CXL memory interleaving we need to have
> > > fine-grained routing to each device (16k Max Gran).  That was fine whilst
> > > pretty much all the testing was DAX based so software wasn't running out
> > > of it.  Now the kernel is rather more aggressive in defaulting any volatile
> > > CXL memory it finds to being normal memory (in some configs anyway) people
> > > started hitting problems. Given one of the most important functions of the
> > > emulation is to check data ends up in the right backing stores, I'm not
> > > keen to drop that feature unless we absolutely have to.
> > >
> > > 1) For the simple case of no interleave I have working code that just
> > >    shoves the MemoryRegion in directly and all works fine.  That was always
> > >    on the todo list for virtualization cases anyway where we pretend the
> > >    underlying devices aren't interleaved and frig the reported perf numbers
> > >    to present aggregate performance etc.  I'll tidy this up and post it.
> > >    We may want a config parameter to 'reject' address decoder programming
> > >    that would result in interleave - it's not remotely spec compliant, but
> > >    meh, it will make it easier to understand.  For virt case we'll probably
> > >    present locked down decoders (as if a FW has set them up) but for emulation
> > >    that limits usefulness too much.
> > >
> > > 2) Unfortunately, for the interleaved case can't just add a lot of memory
> > >    regions because even at highest granularity (16k) and minimum size
> > >    512MiB it takes forever to eventually run into an assert in
> > >    phys_section_add with the comment:
> > >    "The physical section number is ORed with a page-aligned
> > >     pointer to produce the iotlb entries.  Thus it should
> > >     never overflow into the page-aligned value."
> > >     That sounds hard to 'fix' though I've not looked into it.
> > >
> > > So back to plan (A) papering over the cracks with TCG.
> > >
> > > I've focused on arm64 which seems a bit easier than x86 (and is arguably
> > > part of my day job)
> > >
> > > Challenges
> > > 1) The atomic updates of accessed and dirty bits in
> > >    arm_casq_ptw() fail because we don't have a proper address to do them
> > >    on.  However, there is precedent for non-atomic updates in there
> > >    already (used when the host system doesn't support big enough cas)
> > >    I think we can do something similar under the bql for this case.
> > >    Not 100% sure I'm writing to the correct address but a simple frig
> > >    superficially appears to work.
> > > 2) Emulated devices try to do DMA to buffers in the CXL emulated interleave
> > >    memory (virtio_blk for example).  Can't do that because there is no
> > >    actual translation available - just read and write functions.
> > >
> > >    So should be easy to avoid as we know how to handle DMA limitations.
> > >    Just set the max dma address width to 40 bits (so below the CXL Fixed Memory
> > >    Windows and rely on Linux to bounce buffer with swiotlb). For a while
> > >    I couldn't work out why changing IORT to provide this didn't work and
> > >    I saw errors for virtio-pci-blk. So digging ensued.
> > >    Virtio devices by default (sort of) bypass the dma-api in linux.
> > >    vring_use_dma_api() in Linux. That is reasonable from the translation
> > >    point of view, but not the DMA limits (and resulting need to use bounce
> > >    buffers).  Maybe could put a sanity check in linux on no iommu +
> > >    a DMA restriction to below 64 bits but I'm not 100% sure we wouldn't
> > >    break other platforms.
> > >    Alternatively just use an emulated real device and all seems fine
> > >    - I've tested with nvme.
> > >
> > > 3) I need to fix the kernel handling for CXL CDAT table originated
> > >    NUMA nodes on ARM64. For now I have a hack in place so I can make
> > >    sure I hit the memory I intend to when testing. I suspect we need
> > >    some significant work to sort
> > >
> > > Suggestions for other approaches would definitely be welcome!
> > >
> > > Jonathan
> > >
> >
> >
>
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-02-19  7:56 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-18  9:38 qemu cxl memory expander shows numa_node -1 Sajjan Rao
2023-08-18 15:01 ` Dimitrios Palyvos
2023-08-21 10:00   ` Sajjan Rao
2023-08-21 10:53     ` Dimitrios Palyvos
2023-08-23 11:13       ` Sajjan Rao
2023-08-23 16:50         ` Jonathan Cameron
2023-08-24  6:26           ` Sajjan Rao
2024-01-25  8:15             ` Sajjan Rao
2024-01-26 12:39               ` Jonathan Cameron
2024-01-26 15:43                 ` Gregory Price
2024-01-26 17:12                   ` Jonathan Cameron
2024-01-30  8:20                     ` Sajjan Rao
2024-02-01 13:04                       ` Crash with CXL + TCG on 8.2: Was " Jonathan Cameron via
2024-02-01 13:04                         ` Jonathan Cameron
2024-02-01 13:12                         ` Peter Maydell
2024-02-01 14:01                           ` Jonathan Cameron
2024-02-01 14:01                             ` Jonathan Cameron via
2024-02-01 14:35                             ` Peter Maydell
2024-02-01 15:17                               ` Alex Bennée
2024-02-01 15:29                                 ` Jonathan Cameron
2024-02-01 15:29                                   ` Jonathan Cameron via
2024-02-01 16:00                                 ` Peter Maydell
2024-02-01 16:21                                   ` Jonathan Cameron
2024-02-01 16:21                                     ` Jonathan Cameron via
2024-02-01 16:45                                     ` Alex Bennée
2024-02-01 17:04                                       ` Gregory Price
2024-02-01 17:07                                         ` Peter Maydell
2024-02-01 17:29                                           ` Gregory Price
2024-02-01 17:08                                       ` Jonathan Cameron
2024-02-01 17:08                                         ` Jonathan Cameron via
2024-02-01 17:21                                         ` Peter Maydell
2024-02-01 17:41                                           ` Jonathan Cameron
2024-02-01 17:41                                             ` Jonathan Cameron via
2024-02-01 17:25                                         ` Alex Bennée
2024-02-01 18:04                                           ` Peter Maydell
2024-02-01 18:56                                             ` Gregory Price
2024-02-02 16:26                                               ` Jonathan Cameron via
2024-02-02 16:26                                                 ` Jonathan Cameron
2024-02-02 16:33                                                 ` Peter Maydell
2024-02-02 16:50                                                   ` Gregory Price
2024-02-02 16:56                                                     ` Peter Maydell
2024-02-07 17:34                                                       ` Jonathan Cameron
2024-02-07 17:34                                                         ` Jonathan Cameron via
2024-02-08 14:50                                                         ` Jonathan Cameron
2024-02-08 14:50                                                           ` Jonathan Cameron via
2024-02-15 15:29                                                           ` Jonathan Cameron
2024-02-15 15:29                                                             ` Jonathan Cameron via
2024-02-19  7:55                                                             ` Mattias Nissler
2024-02-15 15:04                                   ` Jonathan Cameron via
2024-02-15 15:04                                     ` Jonathan Cameron
