From: Gregory Price <gregory.price@memverge.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-cxl@vger.kernel.org, Ira Weiny <ira.weiny@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Kees Cook <keescook@chromium.org>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>, Fan Ni <fan.ni@samsung.com>,
	linux-mm@kvack.org, linux-acpi@vger.kernel.org
Subject: Re: [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default
Date: Mon, 13 Feb 2023 13:22:17 -0500
Message-ID: <Y+p/2S4rnrOOyZ8w@memverge.com>
In-Reply-To: <167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com>

On Fri, Feb 10, 2023 at 01:05:21AM -0800, Dan Williams wrote:
> Changes since v1: [1]
> [... snip ...]

For a single attached device, I have generally had success.

For multiple attached devices, I'm seeing some strange behaviors.

With multiple root ports, I got some stack traces before deciding
I needed multiple CFMWs to do this "correctly", so I just attached
multiple pxb-cxl devices to the root bus.

Obviously this configuration is "not great", and some form of
"impossible in the real world", but it's worth examining, I think.

/opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/data/qemu/images/pool/pool1.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 4G,slots=4,maxmem=16G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:56 \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
-device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
-device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
-object memory-backend-ram,id=mem0,size=4G \
-object memory-backend-ram,id=mem1,size=4G \
-object memory-backend-ram,id=mem2,size=4G \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1 \
-device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G

The goal here is to have 3 different memory expanders each get a region
created and mapped to its own NUMA node.

One piece I'm not confident about is my CFMWs
(listed more readably):

-M cxl-fmw.0.targets.0=cxl.0,
   cxl-fmw.0.size=4G,
   cxl-fmw.1.targets.0=cxl.1,
   cxl-fmw.1.size=4G,
   cxl-fmw.2.targets.0=cxl.2,
   cxl-fmw.2.size=4G
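
For anyone reproducing this with a different number of windows, the option
string can be generated rather than typed by hand. This is only a minimal
sketch, assuming one non-interleaved window per host bridge with every window
using just targets.0 (the form used in the command line above); the function
name cxl_fmw_opts is made up for illustration and is not part of QEMU.

```shell
#!/bin/sh
# Hypothetical helper: emit "cxl-fmw.N.targets.0=cxl.N,cxl-fmw.N.size=SIZE"
# for N = 0..count-1, joined by commas.
# Assumption: one target per window, and host bridge ids named cxl.N.
cxl_fmw_opts() {
    count=$1
    size=$2
    i=0
    out=""
    while [ "$i" -lt "$count" ]; do
        # ${out:+,} adds a separating comma only once out is non-empty.
        out="${out}${out:+,}cxl-fmw.$i.targets.0=cxl.$i,cxl-fmw.$i.size=$size"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Reproduces the string passed to -M in the command line above.
cxl_fmw_opts 3 4G
```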

Should the targets in this case be targets.0/1/2, or should all of them be
targets.0?

Either way, I would expect 3 root decoders and 3 memory devices:

[root@fedora ~]# ls /sys/bus/cxl/devices/
decoder0.0  decoder1.0  decoder4.0  endpoint4  mem0  nvdimm-bridge0  port3
decoder0.1  decoder2.0  decoder5.0  endpoint5  mem1  port1           root0
decoder0.2  decoder3.0  decoder6.0  endpoint6  mem2  port2

I see the devices I expect, but I would have expected the following mappings
(cxl list output is at the bottom):

decoder0.0 -> mem0
decoder0.1 -> mem1
decoder0.2 -> mem2

root0 -> [decoder0.0, 0.1, 0.2]
root0 -> [port1, 2, 3]
port1 -> mem0
port2 -> mem1
port3 -> mem2

In reality, I see these port and device mappings set up:
port1 -> mem2
port2 -> mem1
port3 -> mem0

Therefore I should expect
decoder0.0 -> mem2
decoder0.1 -> mem1
decoder0.2 -> mem0

This bears out: attempting to use any other combination produces ndctl errors.
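
When cross-checking the ordering against the cxl list output at the bottom,
it helps that the dport ids reported there (52, 191, 230) are the bus_nr
values from the QEMU command line, and the pci0000:XX host bridge names are
those same numbers in hex. A small sketch of that conversion, assuming PCI
domain 0000 as in the listing; bus_nr_to_bridge is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: map a pxb-cxl bus_nr (decimal) to the host bridge
# name format seen in the cxl list output (pci<domain>:<bus in hex>).
# Assumption: PCI domain is 0000, as in the listing in this mail.
bus_nr_to_bridge() {
    printf 'pci0000:%02x\n' "$1"
}

# The three bus_nr values used for cxl.0, cxl.1 and cxl.2 above.
for nr in 52 191 230; do
    printf 'bus_nr=%s -> %s\n' "$nr" "$(bus_nr_to_bridge "$nr")"
done
```

This gives pci0000:34, pci0000:bf and pci0000:e6, which are the three dports
listed under root0.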

So the numbering is backwards; maybe that's relevant, maybe it's not.
The devices are otherwise identical, so for the most part
everything might "just work".  Let's keep testing.


[root@fedora ~]# cat create_region.sh
./ndctl/build/cxl/cxl \
  create-region \
  -m \
  -t ram \
  -d decoder0.$1 \
  -w 1 \
  -g 4096 \
  mem$2

[root@fedora ~]# ./create_region.sh 2 0
[   34.424931] cxl_region region2: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region2",
  "resource":"0x790000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder4.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

[   34.486668] Fallback order for Node 3: 3 0
[   34.487568] Built 1 zonelists, mobility grouping on.  Total pages: 979669
[   34.488206] Policy zone: Normal
[   34.501938] Fallback order for Node 0: 0 3
[   34.502405] Fallback order for Node 1: 1 3 0
[   34.502832] Fallback order for Node 2: 2 3 0
[   34.503251] Fallback order for Node 3: 3 0
[   34.503649] Built 2 zonelists, mobility grouping on.  Total pages: 1012437
[   34.504296] Policy zone: Normal



Cool, looks good.  Let's try mem1.



[root@fedora ~]# ./create_region.sh 1 1

[   98.787029] Fallback order for Node 2: 2 3 0
[   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
[   98.788483] Policy zone: Normal
[  128.301580] Fallback order for Node 0: 0 2 3
[  128.302084] Fallback order for Node 1: 1 3 2 0
[  128.302547] Fallback order for Node 2: 2 3 0
[  128.303009] Fallback order for Node 3: 3 2 0
[  128.303436] Built 3 zonelists, mobility grouping on.  Total pages: 2052566
[  128.304071] Policy zone: Normal
[ .... wait 20-30 more seconds .... ]
{
  "region":"region1",
  "resource":"0x690000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem1",
      "decoder":"decoder5.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


This takes a LONG time to complete. Maybe that's expected, I don't know.
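
To put a number on "LONG": the kernel log timestamps above jump from
98.788483 to 128.301580 between the two zonelist rebuilds, and the command
then sits for another 20-30 seconds before printing the region. A rough
sketch for pulling the largest gap out of a captured log, assuming the usual
"[ seconds.micros]" dmesg prefix; max_dmesg_gap is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg-style lines on stdin and print the largest
# gap (in seconds) between consecutive "[ seconds.micros]" timestamps.
max_dmesg_gap() {
    awk '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            # Drop the brackets; awk ignores the leading padding spaces
            # when converting the remainder to a number.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (seen && ts - prev > max) max = ts - prev
            prev = ts
            seen = 1
        }
        END { printf "%.6f\n", max }
    '
}

# Lines copied from the log above.
max_dmesg_gap <<'EOF'
[   98.787029] Fallback order for Node 2: 2 3 0
[   98.787630] Built 2 zonelists, mobility grouping on.  Total pages: 2019798
[   98.788483] Policy zone: Normal
[  128.301580] Fallback order for Node 0: 0 2 3
EOF
```

For those four lines the result is 29.513097, i.e. roughly 29.5 seconds with
nothing logged.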


Let's online mem2.


[root@fedora ~]# ./create_region.sh 0 2
extra data[7]: 0x0000000000000000
emulation failure
RAX=0000000000000000 RBX=ffff8a6f90006800 RCX=0000000000100001 RDX=0000000080100010
RSI=ffffca291a400000 RDI=0000000040000000 RBP=ffff9684c0017a60 RSP=ffff9684c0017a30
R8 =ffff8a6f90006800 R9 =0000000000100001 R10=0000000000000000 R11=0000000000000001
R12=ffffca291a400000 R13=0000000000100001 R14=0000000000000000 R15=0000000080100010
RIP=ffffffffb71c5831 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007fd03025db40 ffffffff 00c00000
GS =0000 ffff8a6a7bd00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe46e6e25000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe46e6e23000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00005604371ab0c8 CR3=0000000102ece000 CR4=000006e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=83 ec 08 81 e7 00 00 00 40 74 2c 48 89 d0 48 89 ca 4c 89 c9 <f0> 48 0f c7 4e 20 0f 84 85 00 00 00 f3 90 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d


Well that seems bad lol.  I'm not sure what to make of this since my
scrollback cuts off and the machine completely locks up.  I have never
seen "emulation failure" before.


Reboot and attempt to online that region by itself:

[root@fedora ~]# ./create_region.sh 0 2
[   21.292598] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
[   21.341753] Fallback order for Node 1: 1 0
[   21.342462] Built 1 zonelists, mobility grouping on.  Total pages: 979670
[   21.343085] Policy zone: Normal
[   21.355166] Fallback order for Node 0: 0 1
[   21.355613] Fallback order for Node 1: 1 0
[   21.356009] Fallback order for Node 2: 2 1 0
[   21.356441] Fallback order for Node 3: 3 1 0
[   21.356874] Built 2 zonelists, mobility grouping on.  Total pages: 1012438
[   21.357501] Policy zone: Normal
{
  "region":"region0",
  "resource":"0x590000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem2",
      "decoder":"decoder6.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region


That works fine, and works just like onlining the first region (2,0).

This suggests the issue is actually with creating multiple regions in this
topology.
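
One sanity check that each region landed in the window of the decoder it was
created on: the region "resource" values are printed in hex, while the root
decoder "resource" values in the cxl list output at the bottom are in
decimal. A trivial conversion sketch (to_hex is just an illustrative name):

```shell
#!/bin/sh
# Convert the decimal "resource" values reported for the root decoders into
# hex so they can be compared with the "resource" of each created region.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 23890755584   # decoder0.0
to_hex 28185722880   # decoder0.1
to_hex 32480690176   # decoder0.2
```

These come out as 0x590000000, 0x690000000 and 0x790000000, which match
region0, region1 and region2 respectively, so at least the address ranges
line up with the decoder that was requested each time.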



Bonus round: booting with memhp_default_state=offline

All regions successfully get created without error.
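
With memhp_default_state=offline the hot-added memory blocks stay offline
after region creation, so they can be onlined by hand afterwards. Stepping
through them one block at a time might help show whether the hang in the
default case is in the onlining step. This is only a sketch against the
memory hotplug sysfs interface; the function name and the choice of
online_movable as the default are my assumptions, not something from the
patch set.

```shell
#!/bin/sh
# Hypothetical helper: online every offline memory block under a sysfs
# memory directory, one block at a time, logging each step.
# $1: directory to scan (default: /sys/devices/system/memory)
# $2: value to write to "state" (default: online_movable, an assumption;
#     "online" and "online_kernel" are the other accepted online values)
online_offline_blocks() {
    dir=${1:-/sys/devices/system/memory}
    mode=${2:-online_movable}
    for state in "$dir"/memory*/state; do
        # Skip the unexpanded glob if no memory blocks are present.
        [ -e "$state" ] || continue
        if [ "$(cat "$state")" = "offline" ]; then
            echo "onlining ${state%/state}"
            echo "$mode" > "$state" || return 1
        fi
    done
}

# Example (as root, inside the guest): online_offline_blocks
```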



I have a few guesses, but haven't dived in yet:

1) There's a QEMU error in the way this configuration routes to various
   components of the CXL structure, and/or multiple pxb-cxl's do bad
   things and I should feel bad for doing this configuration.

2) There's something going on when creating the topology, leading to the
   inverted [decoder0.2, mem0], [decoder0.1, mem1], [decoder0.0, mem2]
   mappings, leading to inconsistent device control.  Or I'm making a
   bad assumption and this is expected behavior.

3) The memory block creation / online code is getting hung up somewhere.
   Why does the second region take forever to online?

4) Something else completely.


My gut at the moment tells me my configuration is bad, but I have no
idea why.  Anyone with an idea on what I should look for, let me know.


cxl list output for completeness:

[root@fedora ~]# ./ndctl/build/cxl/cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":3,
    "dports":[
      {
        "dport":"pci0000:e6",
        "alias":"ACPI0016:00",
        "id":230
      },
      {
        "dport":"pci0000:bf",
        "alias":"ACPI0016:01",
        "id":191
      },
      {
        "dport":"pci0000:34",
        "alias":"ACPI0016:02",
        "id":52
      }
    ],
    "ports:root0":[
      {
        "port":"port1",
        "host":"pci0000:e6",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:e6:00.0",
            "id":2
          }
        ],
        "endpoints:port1":[
          {
            "endpoint":"endpoint5",
            "host":"mem1",
            "depth":2,
            "memdev":{
              "memdev":"mem1",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:e7:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint5":[
              {
                "decoder":"decoder5.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port1":[
          {
            "decoder":"decoder1.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:e6:00.0",
                "position":0,
                "id":2
              }
            ]
          }
        ]
      },
      {
        "port":"port3",
        "host":"pci0000:34",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:34:00.0",
            "id":0
          }
        ],
        "endpoints:port3":[
          {
            "endpoint":"endpoint4",
            "host":"mem0",
            "depth":2,
            "memdev":{
              "memdev":"mem0",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:35:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint4":[
              {
                "decoder":"decoder4.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port3":[
          {
            "decoder":"decoder3.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:34:00.0",
                "position":0,
                "id":0
              }
            ]
          }
        ]
      },
      {
        "port":"port2",
        "host":"pci0000:bf",
        "depth":1,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:bf:00.0",
            "id":1
          }
        ],
        "endpoints:port2":[
          {
            "endpoint":"endpoint6",
            "host":"mem2",
            "depth":2,
            "memdev":{
              "memdev":"mem2",
              "ram_size":4294967296,
              "serial":0,
              "host":"0000:c0:00.0",
              "partition_info":{
                "total_size":4294967296,
                "volatile_only_size":4294967296,
                "persistent_only_size":0,
                "partition_alignment_size":0
              }
            },
            "decoders:endpoint6":[
              {
                "decoder":"decoder6.0",
                "interleave_ways":1,
                "state":"disabled"
              }
            ]
          }
        ],
        "decoders:port2":[
          {
            "decoder":"decoder2.0",
            "interleave_ways":1,
            "state":"disabled",
            "nr_targets":1,
            "targets":[
              {
                "target":"0000:bf:00.0",
                "position":0,
                "id":1
              }
            ]
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":23890755584,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:34",
            "alias":"ACPI0016:02",
            "position":0,
            "id":52
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "resource":28185722880,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:bf",
            "alias":"ACPI0016:01",
            "position":0,
            "id":191
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "resource":32480690176,
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:e6",
            "alias":"ACPI0016:00",
            "position":0,
            "id":230
          }
        ]
      }
    ]
  }
]
