All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT preview] for-6.3/cxl-ram-region
@ 2023-01-26  6:25 Dan Williams
  2023-01-26  6:29 ` Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Dan Williams @ 2023-01-26  6:25 UTC (permalink / raw)
  To: linux-cxl

There are still some sharp edges on this patchset, like the missing
device-dax hookup, but it is likely enough to show the direction and
unblock other testing. Specifically I want to see how this fares with
Greg's recent volatile region provisioning in QEMU.

I am hoping to have those last bits ironed out before the end of the
week. Note that this topic branch will rebase so do not base any
work beyond proof-of-concept on top of it.

https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26  6:25 [GIT preview] for-6.3/cxl-ram-region Dan Williams
@ 2023-01-26  6:29 ` Dan Williams
  2023-01-26 18:50   ` Jonathan Cameron
  2023-01-26 22:05 ` Gregory Price
  2023-02-04  2:36 ` Dan Williams
  2 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2023-01-26  6:29 UTC (permalink / raw)
  To: Dan Williams, linux-cxl

Dan Williams wrote:
> There are still some sharp edges on this patchset, like the missing
> device-dax hookup, but it is likely enough to show the direction and
> unblock other testing. Specifically I want to see how this fares with
> Greg's recent volatile region provisioning in QEMU.
> 
> I am hoping to have those last bits ironed out before the end of the
> week. Note that this topic branch will rebase so do not base any
> work beyond proof-of-concept on top of it.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

I also spotted at least one bug-fix that needs to be broken out and
submitted separately, so reviewer-beware at this stage.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26  6:29 ` Dan Williams
@ 2023-01-26 18:50   ` Jonathan Cameron
  2023-01-26 19:34     ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2023-01-26 18:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl

On Wed, 25 Jan 2023 22:29:15 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Dan Williams wrote:
> > There are still some sharp edges on this patchset, like the missing
> > device-dax hookup, but it is likely enough to show the direction and
> > unblock other testing. Specifically I want to see how this fares with
> > Greg's recent volatile region provisioning in QEMU.
> > 
> > I am hoping to have those last bits ironed out before the end of the
> > week. Note that this topic branch will rebase so do not base any
> > work beyond proof-of-concept on top of it.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region  
> 
> I also spotted at least one bug-fix that needs to be broken out and
> submitted separately, so reviewer-beware at this stage.

So far I'm failing to set target0 via echo which is wonderfully getting
an error of SUCCESS.  So I think you are returning rc == 0 somewhere.


Ah. There is a shadowed int rc variable in the else branch of store_targetN().
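
The pattern, as a stand-alone illustration (hypothetical names, not the
actual kernel code):

#include <stdio.h>

/* The inner 'rc' shadows the outer one, so the error assigned in the
 * else branch never escapes; the caller always sees rc == 0 and ends
 * up reporting an "error" of SUCCESS. */
static int store_target(int should_fail)
{
	int rc = 0;

	if (!should_fail) {
		rc = 0;			/* happy path */
	} else {
		int rc = -22;		/* meant to be -EINVAL... */

		printf("inner rc = %d\n", rc);
	}

	return rc;			/* ...but this is still 0 */
}

int main(void)
{
	printf("store_target(1) = %d\n", store_target(1));
	return 0;
}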

Now to figure out why attach_target() is failing.  I probably have
the config sequence wrong as I've just bodged an existing one I had
for pmem.

Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26 18:50   ` Jonathan Cameron
@ 2023-01-26 19:34     ` Jonathan Cameron
  2023-01-30 14:16       ` Gregory Price
  2023-01-30 14:23       ` Gregory Price
  0 siblings, 2 replies; 34+ messages in thread
From: Jonathan Cameron @ 2023-01-26 19:34 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl

On Thu, 26 Jan 2023 18:50:25 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Wed, 25 Jan 2023 22:29:15 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Dan Williams wrote:  
> > > There are still some sharp edges on this patchset, like the missing
> > > device-dax hookup, but it is likely enough to show the direction and
> > > unblock other testing. Specifically I want to see how this fares with
> > > Greg's recent volatile region provisioning in QEMU.
> > > 
> > > I am hoping to have those last bits ironed out before the end of the
> > > week. Note that this topic branch will rebase so do not base any
> > > work beyond proof-of-concept on top of it.
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region    
> > 
> > I also spotted at least one bug-fix that needs to be broken out and
> > submitted separately, so reviewer-beware at this stage.  
> 
> So far I'm failing to set target0 via echo which is wonderfully getting
> an error of SUCCESS.  So I think you are returning rc == 0 somewhere.
> 
> 
> Ah. There is a shadowed int rc variable in the else branch of store_targetN().
> 
> Now to figure out why attach_target() is failing.  I probably have
> the config sequence wrong as I've just bodged an existing one I had
> for pmem.
> 
I was trying to add it to a pmem region. Oops.

Can successfully bind the region driver. But not sure how to test
beyond that.

Basically I got to the helpful 'TODO hook up devdax' as you mention above.
Looks like decoders are programmed correctly as I can read and write from
the HPA using devmem2.

So far I've just tested a single direct-connected device.

This is against http://gitlab.com/jic23/qemu cxl-2023-01-26 which has been
there for a good 30 seconds. Mostly unrelated changes with respect to this work, but
it includes a few trivial tweaks to Gregory's patches as discussed on list.

Thanks for sharing this early version. It unsticks Gregory's series as far as
I'm concerned (if anything is broken in more complex tests, it won't be
related to Gregory's stuff, which only affects type 3 devices).

Jonathan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26  6:25 [GIT preview] for-6.3/cxl-ram-region Dan Williams
  2023-01-26  6:29 ` Dan Williams
@ 2023-01-26 22:05 ` Gregory Price
  2023-01-26 22:20   ` Dan Williams
  2023-02-04  2:36 ` Dan Williams
  2 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-26 22:05 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl

On Wed, Jan 25, 2023 at 10:25:42PM -0800, Dan Williams wrote:
> There are still some sharp edges on this patchset, like the missing
> device-dax hookup, but it is likely enough to show the direction and
> unblock other testing. Specifically I want to see how this fares with
> Greg's recent volatile region provisioning in QEMU.
> 
> I am hoping to have those last bits ironed out before the end of the
> week. Note that this topic branch will rebase so do not base any
> work beyond proof-of-concept on top of it.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region


Testing this out now.  Pulled it on top of the recent DOE updates so
this build error may be related (or not).  But it does look like it's
just a naming collision.

lib/stackinit_kunit.c:34:13: error: conflicting types for ‘range_contains’; have ‘bool(char *, size_t,  char *, size_t)’ {aka ‘_Bool(char *, long unsigned int,  char *, long unsigned int)’}
   34 | static bool range_contains(char *haystack_start, size_t haystack_size,
      |             ^~~~~~~~~~~~~~
In file included from ./arch/x86/include/asm/page.h:21,
                 from ./arch/x86/include/asm/thread_info.h:12,
                 from ./include/linux/thread_info.h:60,
                 from ./arch/x86/include/asm/preempt.h:9,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:56,
                 from ./include/linux/kref.h:16,
                 from ./include/kunit/test.h:21,
                 from lib/stackinit_kunit.c:14:
./include/linux/range.h:16:20: note: previous definition of ‘range_contains’ with type ‘bool(struct range *, struct range *)’ {aka ‘_Bool(struct range *, struct range *)’}
   16 | static inline bool range_contains(struct range *r1, struct range *r2)
      |


Was just going to recommend renaming range_contains to range_envelops as a
quick hack to fix the build issue.

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 62d5bafd02d1..f781e46a825e 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -230,7 +230,7 @@ static int dvsec_range_allowed(struct device *dev, void *arg)
        if (!(cxld->flags & CXL_DECODER_F_RAM))
                return 0;

-       return range_contains(&cxld->hpa_range, dev_range);
+       return range_envelops(&cxld->hpa_range, dev_range);
 }

 static void disable_hdm(void *_cxlhdm)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 73d4d451386b..c0cb828b86e3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1381,7 +1381,7 @@ static int decoder_match_range(struct device *dev, void *data)
                return 0;

        cxlsd = to_cxl_switch_decoder(dev);
-       return range_contains(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
+       return range_envelops(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
 }

 static void find_positions(const struct cxl_switch_decoder *cxlsd,
@@ -2350,7 +2350,7 @@ static int match_decoder_by_range(struct device *dev, void *data)

        cxlrd = to_cxl_root_decoder(dev);
        r1 = &cxlrd->cxlsd.cxld.hpa_range;
-       return range_contains(r1, r2);
+       return range_envelops(r1, r2);
 }

 static int match_region_by_range(struct device *dev, void *data)
diff --git a/include/linux/range.h b/include/linux/range.h
index 7efb6a9b069b..8ad8c036d027 100644
--- a/include/linux/range.h
+++ b/include/linux/range.h
@@ -13,7 +13,7 @@ static inline u64 range_len(const struct range *range)
        return range->end - range->start + 1;
 }

-static inline bool range_contains(struct range *r1, struct range *r2)
+static inline bool range_envelops(struct range *r1, struct range *r2)
 {
        return r1->start <= r2->start && r1->end >= r2->end;
 }

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26 22:05 ` Gregory Price
@ 2023-01-26 22:20   ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2023-01-26 22:20 UTC (permalink / raw)
  To: Gregory Price, Dan Williams; +Cc: linux-cxl

Gregory Price wrote:
> On Wed, Jan 25, 2023 at 10:25:42PM -0800, Dan Williams wrote:
> > There are still some sharp edges on this patchset, like the missing
> > device-dax hookup, but it is likely enough to show the direction and
> > unblock other testing. Specifically I want to see how this fares with
> > Greg's recent volatile region provisioning in QEMU.
> > 
> > I am hoping to have those last bits ironed out before the end of the
> > week. Note that this topic branch will rebase so do not base any
> > work beyond proof-of-concept on top of it.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> 
> Testing this out now.  Pulled it on top of the recent DOE updates so
> this build error may be related (or not).  But it does look like it's
> just a naming collision.
> 
> lib/stackinit_kunit.c:34:13: error: conflicting types for ‘range_contains’; have ‘bool(char *, size_t,  char *, size_t)’ {aka ‘_Bool(char *, long unsigned int,  char *, long unsigned int)’}
>    34 | static bool range_contains(char *haystack_start, size_t haystack_size,
>       |             ^~~~~~~~~~~~~~
> In file included from ./arch/x86/include/asm/page.h:21,
>                  from ./arch/x86/include/asm/thread_info.h:12,
>                  from ./include/linux/thread_info.h:60,
>                  from ./arch/x86/include/asm/preempt.h:9,
>                  from ./include/linux/preempt.h:78,
>                  from ./include/linux/spinlock.h:56,
>                  from ./include/linux/kref.h:16,
>                  from ./include/kunit/test.h:21,
>                  from lib/stackinit_kunit.c:14:
> ./include/linux/range.h:16:20: note: previous definition of ‘range_contains’ with type ‘bool(struct range *, struct range *)’ {aka ‘_Bool(struct range *, struct range *)’}
>    16 | static inline bool range_contains(struct range *r1, struct range *r2)
>       |
> 
> 
> Was just going to recommend renaming range_contains to range_envelops as a
> quick hack to fix the build issue.

I think it would be more appropriate to adjust the kunit test to use a less
generic name, since 'struct range' is a kernel-wide data structure.

I'll propose this in the patch when I send it out.
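
For illustration, something along these lines (hypothetical rename; the
actual patch may pick a different name, with callers updated to match):

-static bool range_contains(char *haystack_start, size_t haystack_size,
-			   char *needle_start, size_t needle_size)
+static bool stackinit_range_contains(char *haystack_start, size_t haystack_size,
+				     char *needle_start, size_t needle_size)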

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26 19:34     ` Jonathan Cameron
@ 2023-01-30 14:16       ` Gregory Price
  2023-01-30 20:10         ` Dan Williams
  2023-01-30 14:23       ` Gregory Price
  1 sibling, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-30 14:16 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dan Williams, linux-cxl

On Thu, Jan 26, 2023 at 07:34:24PM +0000, Jonathan Cameron wrote:
> On Thu, 26 Jan 2023 18:50:25 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Wed, 25 Jan 2023 22:29:15 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > Dan Williams wrote:  
> > > > There are still some sharp edges on this patchset, like the missing
> > > > device-dax hookup, but it is likely enough to show the direction and
> > > > unblock other testing. Specifically I want to see how this fares with
> > > > Greg's recent volatile region provisioning in QEMU.
> > > > 
> > > > I am hoping to have those last bits ironed out before the end of the
> > > > week. Note that this topic branch will rebase so do not base any
> > > > work beyond proof-of-concept on top of it.
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region    
> > > 
> > > I also spotted at least one bug-fix that needs to be broken out and
> > > submitted separately, so reviewer-beware at this stage.  
> > 
> > So far I'm failing to set target0 via echo which is wonderfully getting
> > an error of SUCCESS.  So I think you are returning rc == 0 somewhere.
> > 
> > 
> > Ah. There is a shadowed int rc variable in the else branch of store_targetN().
> > 
> > Now to figure out why attach_target() is failing.  I probably have
> > the config sequence wrong as I've just bodged an existing one I had
> > for pmem.
> > 
> I was trying to add it to a pmem region. Oops.
> 
> Can successfully bind the region driver. But not sure how to test
> beyond that.
> 
> Basically I got to the helpful 'TODO hook up devdax' as you mention above.
> Looks like decoders are programmed correctly as I can read and write from
> the HPA using devmem2.
> 
> So far I've just tested a single direct-connected device.
> 
> This is against http://gitlab.com/jic23/qemu cxl-2023-01-26 which has been
> there for a good 30 seconds. Mostly unrelated changes with respect to this work, but
> it includes a few trivial tweaks to Gregory's patches as discussed on list.
> 
> Thanks for sharing this early version. It unsticks Gregory's series as far as
> I'm concerned (if anything is broken in more complex tests, it won't be
> related to Gregory's stuff, which only affects type 3 devices).
> 
> Jonathan
> 

I found the same results.

Reference command and config for list readers:

sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-ram,id=mem0,size=1G,share=on \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G


echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 0x40000000 > /sys/bus/cxl/devices/region0/size
echo mem0 > /sys/bus/cxl/devices/region0/target0

Not sure if bug/missing feature, but after attaching a device to the
target, you get no output when reading target0

```
[root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0

[root@fedora ~]#
```

Would be nice for the sake of easier topology reporting if either this
reported the configured target or we added a link to the targets into
the directory.

But this looks good to me so far, excited to see the devdax patch, I
think I can whip up a sample DCD device (for command testing and proof
of concept) pretty quickly after this.


One question re: auto-online of the devdax hookup - is the intent for
auto-online to follow /sys/devices/system/memory/auto_online_blocks
settings or should we consider controlling auto-online more granularly?

It's a bit of a catch-22 if we follow auto_online_blocks:
  1) for local memory expanders, if off, this is annoying
  2) for statically configured remote-pools (remote expanders),
     this is annoying for the same reason
  3) for early DCDs (multi-headed expander, no-switch), the pattern /
     expectation I'm seeing is that the device expects hosts to see all
     memory blocks when the device is hooked up, and then expects hosts
     to "play nice" by only onlining blocks that have been allocated
     (there are some device-side exclusion features to enforce security).

Basically early DCDs will look like remote expanders with some
exclusivity controls (configured via the DCD commands).

So with the pattern above, let's say you have a 1TB pool attached to 4
hosts.  Each host would produce the following commands:

echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 0x10000000000 > /sys/bus/cxl/devices/region0/size
echo mem0 > /sys/bus/cxl/devices/region0/target0

and mem0 would get 4096 memory# blocks (presumably under region/devdax?)

A provisioning command would be sent via the device interface

ioctl(DCD(N blocks)) -> /sys/bus/cxl/devices/mem0/dev 
return: DCD return structure with extents[blocks[a,b,c],...]

Then the final action would be
echo online > /sys/bus/cxl/devices/region0/devdax/memory[a,b,c...]

or online_movable, or probably some other special zone to make sure
the memory is not used by the kernel (so it can be later released)


So to me, it feels like we might want more granular auto-online control,
but I don't know how possible that is.
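
For reference, the only knobs today are the global policy plus per-block
state; a sketch of the existing memory-hotplug sysfs ABI (the block number
below is made up):

# global policy applied to newly hotplugged memory blocks
cat /sys/devices/system/memory/auto_online_blocks

# per-block manual control
echo online_movable > /sys/devices/system/memory/memory42/state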


Note: This is me relaying what I've seen/heard from some device vendors
in terms of what they think the control scheme will be, so if something
is wildly off-base, it would be good to address the expectations.


Either way: This is awesome, thank you for sharing the preview Dan.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26 19:34     ` Jonathan Cameron
  2023-01-30 14:16       ` Gregory Price
@ 2023-01-30 14:23       ` Gregory Price
  2023-01-31 14:56         ` Jonathan Cameron
  1 sibling, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-30 14:23 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dan Williams, linux-cxl

On Thu, Jan 26, 2023 at 07:34:24PM +0000, Jonathan Cameron wrote:
> Looks like decoders are programmed correctly as I can read and write from
> the HPA using devmem2.
> 
> This is against http://gitlab.com/jic23/qemu cxl-2023-01-26 which has been
> 

Jonathan, can you explain how you're accessing the memory? I don't
quite follow.

Extra Notes
1) Tested against Jonathan's QEMU 01-26 branch above
2) Tested w/ the Linux preview branch rebased on top of the DOE fixes
   that were causing other bugs.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 14:16       ` Gregory Price
@ 2023-01-30 20:10         ` Dan Williams
  2023-01-30 20:58           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2023-01-30 20:10 UTC (permalink / raw)
  To: Gregory Price, Jonathan Cameron; +Cc: Dan Williams, linux-cxl

Gregory Price wrote:
[..]
> I found the same results.
> 
> Reference command and config for list readers:
> 
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -object memory-backend-ram,id=mem0,size=1G,share=on \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> 
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> echo mem0 > /sys/bus/cxl/devices/region0/target0
> 
> Not sure if bug/missing feature, but after attaching a device to the
> target, you get no output when reading target0
> 
> ```
> [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> 
> [root@fedora ~]#

Hmm, did you not get:

"-bash: echo: write error: Invalid argument"

...at that step? Because targetX expects an endpoint decoder, not a
memdev.


> ```
> 
> Would be nice for the sake of easier topology reporting if either this
> reported the configured target or we added a link to the targets into
> the directory.
> 
> But this looks good to me so far, excited to see the devdax patch, I
> think I can whip up a sample DCD device (for command testing and proof
> of concept) pretty quickly after this.
> 
> 
> One question re: auto-online of the devdax hookup - is the intent for
> auto-online to follow /sys/devices/system/memory/auto_online_blocks
> settings or should we consider controlling auto-online more granularly?
> 
> It's a bit of a catch-22 if we follow auto_online_blocks:
>   1) for local memory expanders, if off, this is annoying
>   2) for statically configured remote-pools (remote expanders),
>      this is annoying for the same reason
>   3) for early DCDs (multi-headed expander, no-switch), the pattern /
>      expectation I'm seeing is that the device expects hosts to see all
>      memory blocks when the device is hooked up, and then expects hosts
>      to "play nice" by only onlining blocks that have been allocated
>      (there are some device-side exclusion features to enforce security).
> 
> Basically early DCDs will look like remote expanders with some
> exclusivity controls (configured via the DCD commands).
> 
> So with the pattern above, let's say you have a 1TB pool attached to 4
> hosts.  Each host would produce the following commands:
> 
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 0x10000000000 > /sys/bus/cxl/devices/region0/size
> echo mem0 > /sys/bus/cxl/devices/region0/target0

> and mem0 would get 4096 memory# blocks (presumably under region/devdax?)

At 1T of size, mem0 would be hosting 4294967296 256-byte blocks.

> A provisioning command would be sent via the device interface
> 
> ioctl(DCD(N blocks)) -> /sys/bus/cxl/devices/mem0/dev 
> return: DCD return structure with extents[blocks[a,b,c],...]

In the DCD case the CXL-region would be instantiated ahead of time and
associated with a DAX-region. Upon each capacity addition event a new
devdax instance would appear in that region.

> Then the final action would be
> echo online > /sys/bus/cxl/devices/region0/devdax/memory[a,b,c...]

Something like that, yes.

> or online_movable, or probably some other special zone to make sure
> the memory is not used by the kernel (so it can be later released)
> 
> 
> So to me, it feels like we might want more granular auto-online control,
> but I don't know how possible that is.

Yes, I think it also needs to coordinate with the existing udev rules
and policy around auto-onlining new memory blocks.

> Note: This is me relaying what I've seen/heard from some device vendors
> in terms of what they think the control scheme will be, so if something
> is wildly off-base, it would be good to address the expectations.
> 
> 
> Either way: This is awesome, thank you for sharing the preview Dan.

Thanks for testing!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 20:10         ` Dan Williams
@ 2023-01-30 20:58           ` Gregory Price
  2023-01-30 23:18             ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-30 20:58 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jonathan Cameron, linux-cxl

On Mon, Jan 30, 2023 at 12:10:08PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> [..]
> > I found the same results.
> > echo mem0 > /sys/bus/cxl/devices/region0/target0
> > 
> > Not sure if bug/missing feature, but after attaching a device to the
> > target, you get no output when reading target0
> > 
> > ```
> > [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> > 
> > [root@fedora ~]#
> 
> Hmm, did you not get:
> 
> "-bash: echo: write error: Invalid argument"
> 
> ...at that step? Because targetX expects an endpoint decoder, not a
> memdev.
> 

¯\_(ツ)_/¯ bug?

I went through the kernel code and thought it was looking for a memdev
but i guess i was wrong

// ... snip ...
    dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
    if (!dev)
      return -ENODEV;

    if (!is_endpoint_decoder(dev)) {
      rc = -EINVAL;
      goto out;
    }
// ... snip ...

bool is_endpoint_decoder(struct device *dev)
{
  return dev->type == &cxl_decoder_endpoint_type;
}
EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);


notably, memdevs do not expose their devtype, maybe something overwrote
it to be a different type?  Seems unlikely, but I'd need to attach gdb.

[root@fedora mem0]# ls
dev     firmware_version    numa_node    pmem  serial     uevent
driver  label_storage_size  payload_max  ram   subsystem



> > ```
> > echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> > echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> > echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
> > echo 0x10000000000 > /sys/bus/cxl/devices/region0/size
> > echo mem0 > /sys/bus/cxl/devices/region0/target0
> 
> > and mem0 would get 4096 memory# blocks (presumably under region/devdax?)
> 
> At 1T of size, mem0 would be hosting 4294967296 256-byte blocks.
>

Ah, I thought granularity was in MB, derp derp.  Note - that's what the
root decoder interleave granularity is set to, so I just mirrored that.
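
(For the math: 1 TiB / 256 B = 2^40 / 2^8 = 2^32 = 4294967296 blocks,
which is where that number comes from.)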

[root@fedora decoder0.0]# cat interleave_granularity
256

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 23:18             ` Dan Williams
@ 2023-01-30 22:00               ` Gregory Price
  2023-01-31  2:00               ` Gregory Price
  1 sibling, 0 replies; 34+ messages in thread
From: Gregory Price @ 2023-01-30 22:00 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jonathan Cameron, linux-cxl

On Mon, Jan 30, 2023 at 03:18:38PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Mon, Jan 30, 2023 at 12:10:08PM -0800, Dan Williams wrote:
> > > Gregory Price wrote:
> > > [..]
> > > > I found the same results.
> > > > echo mem0 > /sys/bus/cxl/devices/region0/target0
> > > > 
> > > > Not sure if bug/missing feature, but after attaching a device to the
> > > > target, you get no output when reading target0
> > > > 
> > > > ```
> > > > [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> > > > 
> > > > [root@fedora ~]#
> > > 
> > > Hmm, did you not get:
> > > 
> > > "-bash: echo: write error: Invalid argument"
> > > 
> > > ...at that step? Because targetX expects an endpoint decoder, not a
> > > memdev.
> > > 
> > 
> > ¯\_(ツ)_/¯ bug?
> > 
> > I went through the kernel code and thought it was looking for a memdev
> > but i guess i was wrong
> > 
> > // ... snip ...
> >     dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >     if (!dev)
> >       return -ENODEV;
> > 
> >     if (!is_endpoint_decoder(dev)) {
> >       rc = -EINVAL;
> >       goto out;
> >     }
> > // ... snip ...
> 
> Oh, yes, it is a bug. I fixed it up when I address this 0day report:
> 
> https://lore.kernel.org/linux-cxl/202301281313.kVrIreUj-lkp@intel.com/
> 
> ...I posted another spin of that branch with that fixed up.

gotta love human race conditions, i'll try to be slower next time :D

pulling and testing, and yeah I think what i was seeing was that i just
could not set the target at all but wasn't getting a pass/fail
regardless.

Anyway, will report back when built and installed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 20:58           ` Gregory Price
@ 2023-01-30 23:18             ` Dan Williams
  2023-01-30 22:00               ` Gregory Price
  2023-01-31  2:00               ` Gregory Price
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Williams @ 2023-01-30 23:18 UTC (permalink / raw)
  To: Gregory Price, Dan Williams; +Cc: Jonathan Cameron, linux-cxl

Gregory Price wrote:
> On Mon, Jan 30, 2023 at 12:10:08PM -0800, Dan Williams wrote:
> > Gregory Price wrote:
> > [..]
> > > I found the same results.
> > > echo mem0 > /sys/bus/cxl/devices/region0/target0
> > > 
> > > Not sure if bug/missing feature, but after attaching a device to the
> > > target, you get no output when reading target0
> > > 
> > > ```
> > > [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> > > 
> > > [root@fedora ~]#
> > 
> > Hmm, did you not get:
> > 
> > "-bash: echo: write error: Invalid argument"
> > 
> > ...at that step? Because targetX expects an endpoint decoder, not a
> > memdev.
> > 
> 
> ¯\_(ツ)_/¯ bug?
> 
> I went through the kernel code and thought it was looking for a memdev
> but i guess i was wrong
> 
> // ... snip ...
>     dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
>     if (!dev)
>       return -ENODEV;
> 
>     if (!is_endpoint_decoder(dev)) {
>       rc = -EINVAL;
>       goto out;
>     }
> // ... snip ...

Oh, yes, it is a bug. I fixed it up when I address this 0day report:

https://lore.kernel.org/linux-cxl/202301281313.kVrIreUj-lkp@intel.com/

...I posted another spin of that branch with that fixed up.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 23:18             ` Dan Williams
  2023-01-30 22:00               ` Gregory Price
@ 2023-01-31  2:00               ` Gregory Price
  2023-01-31 16:56                 ` Dan Williams
  2023-01-31 17:59                 ` Verma, Vishal L
  1 sibling, 2 replies; 34+ messages in thread
From: Gregory Price @ 2023-01-31  2:00 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jonathan Cameron, linux-cxl

On Mon, Jan 30, 2023 at 03:18:38PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Mon, Jan 30, 2023 at 12:10:08PM -0800, Dan Williams wrote:
> > > Gregory Price wrote:
> > > [..]
> > > > I found the same results.
> > > > echo mem0 > /sys/bus/cxl/devices/region0/target0
> > > > 
> > > > Not sure if bug/missing feature, but after attaching a device to the
> > > > target, you get no output when reading target0
> > > > 
> > > > ```
> > > > [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> > > > 
> > > > [root@fedora ~]#
> > > 
> > > Hmm, did you not get:
> > > 
> > > "-bash: echo: write error: Invalid argument"
> > > 
> > > ...at that step? Because targetX expects an endpoint decoder, not a
> > > memdev.
> > > 
> > 
> > ¯\_(ツ)_/¯ bug?
> > 
> > I went through the kernel code and thought it was looking for a memdev
> > but i guess i was wrong
> > 
> > // ... snip ...
> >     dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> >     if (!dev)
> >       return -ENODEV;
> > 
> >     if (!is_endpoint_decoder(dev)) {
> >       rc = -EINVAL;
> >       goto out;
> >     }
> > // ... snip ...
> 
> Oh, yes, it is a bug. I fixed it up when I address this 0day report:
> 
> https://lore.kernel.org/linux-cxl/202301281313.kVrIreUj-lkp@intel.com/
> 
> ...I posted another spin of that branch with that fixed up.


This worked for me, though it took me a bit to figure out how to wire
everything up - still not sure this is entirely correct but this is the
string of commands that were required to successfully attach an endpoint
decoder to the root decoder.

Note: The root decoder has interleave_(granularity=256, ways=1), and the
region code appears to require the same granularity?  Does that mean
we're stuck to 256b granularity? (unless i misread the code, which i'm
about 75% sure i am).


Command string:

# Program the endpoint decoder for ram of size 1GB
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size

# Create a region in the root decoder
echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region

# Configure that region with the same interleave granularity
# and ways as the root and endpoint decoders
echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 0x40000000 > /sys/bus/cxl/devices/region0/size

# Link the endpoint decoder as a target in the region
echo decoder2.0 > /sys/bus/cxl/devices/region0/target0

# Commit the changes
echo 1 > /sys/bus/cxl/devices/region0/commit

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-30 14:23       ` Gregory Price
@ 2023-01-31 14:56         ` Jonathan Cameron
  2023-01-31 17:34           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2023-01-31 14:56 UTC (permalink / raw)
  To: Gregory Price; +Cc: Dan Williams, linux-cxl

On Mon, 30 Jan 2023 09:23:54 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Jan 26, 2023 at 07:34:24PM +0000, Jonathan Cameron wrote:
> > Looks like decoders are programmed correctly as I can read and write from
> > the HPA using devmem2.
> > 
> > This is against http://gitlab.com/jic23/qemu cxl-2023-01-26 which has been
> >   
> 
> Jonathan, can you explain how you're accessing the memory? I don't
> quite follow.

Command sequence is pretty similar to yours (not checked it's identical)
and once commit is done, using a version of devmem2 (not sure it was this one)
https://github.com/hackndev/tools/blob/master/devmem2.c 
(if you want 64 bit read / write add the obvious additional parameter ;)

with appropriately loose kernel configuration that /dev/mem works

Then run that against addresses starting at the bottom of the CFMWS
HPA range to read and write.  It's a hack but proves the routing etc.
is all set up correctly.
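
For example, picking an address inside the CFMWS HPA window (0x290000000
lands in the window for the QEMU configs in this thread):

./devmem2 0x290000000 w
./devmem2 0x290000000 w 0x12345678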

I had this lying around from testing the region setup code when it
was in a similar partial state to this support.

Jonathan


> 
> Extra Notes
> 1) Tested against Jonathan's QEMU 01-26 branch above
> 2) Tested w/ the Linux preview branch rebased on top of the DOE fixes
>    that were causing other bugs.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31  2:00               ` Gregory Price
@ 2023-01-31 16:56                 ` Dan Williams
  2023-01-31 17:59                 ` Verma, Vishal L
  1 sibling, 0 replies; 34+ messages in thread
From: Dan Williams @ 2023-01-31 16:56 UTC (permalink / raw)
  To: Gregory Price, Dan Williams; +Cc: Jonathan Cameron, linux-cxl

Gregory Price wrote:
> On Mon, Jan 30, 2023 at 03:18:38PM -0800, Dan Williams wrote:
> > Gregory Price wrote:
> > > On Mon, Jan 30, 2023 at 12:10:08PM -0800, Dan Williams wrote:
> > > > Gregory Price wrote:
> > > > [..]
> > > > > I found the same results.
> > > > > echo mem0 > /sys/bus/cxl/devices/region0/target0
> > > > > 
> > > > > Not sure if bug/missing feature, but after attaching a device to the
> > > > > target, you get no output when reading target0
> > > > > 
> > > > > ```
> > > > > [root@fedora ~]# cat /sys/bus/cxl/devices/region0/target0
> > > > > 
> > > > > [root@fedora ~]#
> > > > 
> > > > Hmm, did you not get:
> > > > 
> > > > "-bash: echo: write error: Invalid argument"
> > > > 
> > > > ...at that step? Because targetX expects an endpoint decoder, not a
> > > > memdev.
> > > > 
> > > 
> > > ¯\_(ツ)_/¯ bug?
> > > 
> > > I went through the kernel code and thought it was looking for a memdev
> > > but i guess i was wrong
> > > 
> > > // ... snip ...
> > >     dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> > >     if (!dev)
> > >       return -ENODEV;
> > > 
> > >     if (!is_endpoint_decoder(dev)) {
> > >       rc = -EINVAL;
> > >       goto out;
> > >     }
> > > // ... snip ...
> > 
> > Oh, yes, it is a bug. I fixed it up when I address this 0day report:
> > 
> > https://lore.kernel.org/linux-cxl/202301281313.kVrIreUj-lkp@intel.com/
> > 
> > ...I posted another spin of that branch with that fixed up.
> 
> 
> This worked for me, though it took me a bit to figure out how to wire
> everything up - still not sure this is entirely correct but this is the
> string of commands that were required to successfully attach an endpoint
> decoder to the root decoder.
> 
> Note: The root decoder has interleave_(granularity=256, ways=1), and the
> region code appears to require the same granularity?  Does that mean
> we're stuck to 256b granularity? (unless i misread the code, which i'm
> about 75% sure i am).

You should be able to pick any CXL valid interleave-granularity when the
root-decoder is not interleaved like your configuration. If it is
interleaved then that interleave-setting dominates and is enforced for
downstream regions.
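
For example, with the non-interleaved root decoder above, a flow like
Gregory's could also use:

echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity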

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 14:56         ` Jonathan Cameron
@ 2023-01-31 17:34           ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2023-01-31 17:34 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dan Williams, linux-cxl

On Tue, Jan 31, 2023 at 02:56:16PM +0000, Jonathan Cameron wrote:
> On Mon, 30 Jan 2023 09:23:54 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Thu, Jan 26, 2023 at 07:34:24PM +0000, Jonathan Cameron wrote:
> > > Looks like decoders are programmed correctly as I can read and write from
> > > the HPA using devmem2.
> > > 
> > > This is against http://gitlab.com/jic23/qemu cxl-2023-01-26 which has been
> > >   
> > 
> > Jonathan, can you explain how you're accessing the memory? I don't
> > quite follow.
> 
> Command sequence is pretty similar to yours (not checked it's identical)
> and once commit is done, using a version of devmem2 (not sure it was this one)
> https://github.com/hackndev/tools/blob/master/devmem2.c 
> (if you want 64 bit read / write add the obvious additional parameter ;)
> 
> with appropriately loose kernel configuration that /dev/mem works
> 

Very cool, i have confirmed this all works *and* that we can actually
back the memory with a file and see changes

For readers, and sake of completeness:


Kernel:
https://github.com/l1k/linux/commits/doe
+
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
(merge cxl-ram-region into doe)


QEMU:
https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
(might need to revert last commit if you're not on arch)


QEMU Config: file_memexp.sh

sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G


Mapping the memory in the host (still not confident about the interleave
settings, but it works)

map_memory.sh
# Program the endpoint decoder for ram of size 1GB
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size

# Create a region in the root decoder
echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region

# Configure that region's interleave and size
echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 0x40000000 > /sys/bus/cxl/devices/region0/size

# Link the endpoint decoder as a target in the region
echo decoder2.0 > /sys/bus/cxl/devices/region0/target0

# Commit the changes
echo 1 > /sys/bus/cxl/devices/region0/commit



Hack up devmem2 (it uses strtoul, it needs to use strtoull)
https://github.com/hackndev/tools/blob/master/devmem2.c
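
The change is roughly this, though the exact line may differ by version:

-	target = strtoul(argv[1], 0, 0);
+	target = strtoull(argv[1], 0, 0);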

Then run it

[root@fedora ~]# ./devmem2 0x290000000 w 0x87654321
/dev/mem opened.
Memory mapped at address 0x7fb8697fb000.
Value at address 0x290000000 (0x7fb8697fb000): 0x00000000
Written 0x87654321; readback 0x87654321


Validate the memory actually got written to the backing file:
[gourry@fedora build]$ xxd /tmp/mem0 | head
00000000: 2143 6587 0000 0000 0000 0000 0000 0000  !Ce.............



:]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31  2:00               ` Gregory Price
  2023-01-31 16:56                 ` Dan Williams
@ 2023-01-31 17:59                 ` Verma, Vishal L
  2023-01-31 19:03                   ` Gregory Price
  1 sibling, 1 reply; 34+ messages in thread
From: Verma, Vishal L @ 2023-01-31 17:59 UTC (permalink / raw)
  To: Williams, Dan J, gregory.price; +Cc: Jonathan.Cameron, linux-cxl

On Mon, 2023-01-30 at 21:00 -0500, Gregory Price wrote:
> On Mon, Jan 30, 2023 at 03:18:38PM -0800, Dan Williams wrote:
> 
> 
> This worked for me, though it took me a bit to figure out how to wire
> everything up - still not sure this is entirely correct but this is the
> string of commands that were required to successfully attach an endpoint
> decoder to the root decoder.
> 
> Note: The root decoder has interleave_(granularity=256, ways=1), and the
> region code appears to require the same granularity?  Does that mean
> we're stuck to 256b granularity? (unless i misread the code, which i'm
> about 75% sure i am).
> 
> 
> Command string:
> 
> # Program the endpoint decoder for ram of size 1GB
> echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
> 
> # Create a region in the root decoder
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> 
> # Configure that region with the same interleave granularity
> # and ways as the root and endpoint decoders
> echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> 
> # Link the endpoint decoder as a target in the region
> echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
> 
> # Commit the changes
> echo 1 > /sys/bus/cxl/devices/region0/commit

I've pushed a cxl-cli branch[1] that incorporates this flow, and allows
for: cxl create-region -t ram <other options>

Feel free to give it a spin!

[1]: https://github.com/pmem/ndctl/commits/vv/volatile-regions
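
For example, mirroring the manual flow above:

cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0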

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 17:59                 ` Verma, Vishal L
@ 2023-01-31 19:03                   ` Gregory Price
  2023-01-31 19:46                     ` Verma, Vishal L
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-31 19:03 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: Williams, Dan J, Jonathan.Cameron, linux-cxl

On Tue, Jan 31, 2023 at 05:59:26PM +0000, Verma, Vishal L wrote:
> On Mon, 2023-01-30 at 21:00 -0500, Gregory Price wrote:
> > On Mon, Jan 30, 2023 at 03:18:38PM -0800, Dan Williams wrote:
> > 
> > 
> > This worked for me, though it took me a bit to figure out how to wire
> > everything up - still not sure this is entirely correct but this is the
> > string of commands that were required to successfully attach an endpoint
> > decoder to the root decoder.
> > 
> > Note: The root decoder has interleave_(granularity=256, ways=1), and the
> > region code appears to require the same granularity?  Does that mean
> > we're stuck to 256b granularity? (unless i misread the code, which i'm
> > about 75% sure i am).
> > 
> > 
> > Command string:
> > 
> > # Program the endpoint decoder for ram of size 1GB
> > echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> > echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
> > 
> > # Create a region in the root decoder
> > echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> > 
> > # Configure that region with the same interleave granularity
> > # and ways as the root and endpoint decoders
> > echo 256 > /sys/bus/cxl/devices/region0/interleave_granularity
> > echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> > echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> > 
> > # Link the endpoint decoder as a target in the region
> > echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
> > 
> > # Commit the changes
> > echo 1 > /sys/bus/cxl/devices/region0/commit
> 
> I've pushed a cxl-cli branch[1] that incorporates this flow, and allows
> for: cxl create-region -t ram <other options>
> 
> Feel free to give it a spin!
> 
> [1]: https://github.com/pmem/ndctl/commits/vv/volatile-regions

Right now I believe this is failing due to the interleave and size not
having default values

./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
cxl region: create_region: create_region: unable to determine region size
cxl region: cmd_create_region: created 0 regions


appears to be due to this code
static int create_region(struct cxl_ctx *ctx, int *count,
             struct parsed_params *p)
{
// ... snip ...
    rc = create_region_validate_config(ctx, p);
    if (rc)
        return rc;

    if (p->size) {
        size = p->size;
        default_size = false;
    } else if (p->ep_min_size) {
        size = p->ep_min_size * p->ways;
**    } else {
**        log_err(&rl, "%s: unable to determine region size\n", __func__);
**        return -ENXIO;
**    }

So both size and ep_min_size are 0 here

echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
cat /sys/bus/cxl/devices/region0/interleave_ways
0
cat /sys/bus/cxl/devices/region0/interleave_granularity
0
cat /sys/bus/cxl/devices/region0/size
0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 19:03                   ` Gregory Price
@ 2023-01-31 19:46                     ` Verma, Vishal L
  2023-01-31 20:24                       ` Verma, Vishal L
  0 siblings, 1 reply; 34+ messages in thread
From: Verma, Vishal L @ 2023-01-31 19:46 UTC (permalink / raw)
  To: gregory.price; +Cc: Williams, Dan J, Jonathan.Cameron, linux-cxl

On Tue, 2023-01-31 at 14:03 -0500, Gregory Price wrote:
> > 
> > I've pushed a cxl-cli branch[1] that incorporates this flow, and allows
> > for: cxl create-region -t ram <other options>
> > 
> > Feel free to give it a spin!
> > 
> > [1]: https://github.com/pmem/ndctl/commits/vv/volatile-regions
> 
> Right now I believe this is failing due to the interleave and size not
> having default values
> 
> ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> cxl region: create_region: create_region: unable to determine region size
> cxl region: cmd_create_region: created 0 regions
> 
> 
> appears to be due to this code
> static int create_region(struct cxl_ctx *ctx, int *count,
>              struct parsed_params *p)
> {
> // ... snip ...
>     rc = create_region_validate_config(ctx, p);
>     if (rc)
>         return rc;
> 
>     if (p->size) {
>         size = p->size;
>         default_size = false;
>     } else if (p->ep_min_size) {
>         size = p->ep_min_size * p->ways;
> **    } else {
> **        log_err(&rl, "%s: unable to determine region size\n", __func__);
> **        return -ENXIO;
> **    }
> 
> So both size and ep_min_size are 0 here
> 
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> cat /sys/bus/cxl/devices/region0/interleave_ways
> 0
> cat /sys/bus/cxl/devices/region0/interleave_granularity
> 0
> cat /sys/bus/cxl/devices/region0/size
> 0

Ah - this revealed an actual bug in these commits - the size and
ep_min_size don't refer to the region's size; they are the capacity of the
component memdevs. Right after create_ram_region, the region size is
expected to be zero.

However the bug here was a pmem assumption I had missed. When
determining sizes, we only look at pmem capacity, which is wrong. It
happened to work in my testing because the memdevs I used had both pmem
and ram capacity. I'll update with a fix shortly. Thanks for trying it
out and reporting this!
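
The gist of the fix, as a sketch rather than the literal commit ('p->mode'
here stands in for the parsed -t option):

	/* size by the capacity matching the requested region mode,
	 * instead of assuming pmem */
	if (p->mode == CXL_DECODER_MODE_RAM)
		size = cxl_memdev_get_ram_size(memdev);
	else
		size = cxl_memdev_get_pmem_size(memdev);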

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 19:46                     ` Verma, Vishal L
@ 2023-01-31 20:24                       ` Verma, Vishal L
  2023-01-31 23:03                         ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Verma, Vishal L @ 2023-01-31 20:24 UTC (permalink / raw)
  To: gregory.price; +Cc: Williams, Dan J, Jonathan.Cameron, linux-cxl

On Tue, 2023-01-31 at 19:46 +0000, Verma, Vishal L wrote:
> On Tue, 2023-01-31 at 14:03 -0500, Gregory Price wrote:
> > 
> > 
> > Right now I believe this is failing due to the interleave and size not
> > having default values
> > 
> > ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> > cxl region: create_region: create_region: unable to determine region size
> > cxl region: cmd_create_region: created 0 regions
> > 
> > 
> > appears to be due to this code
> > static int create_region(struct cxl_ctx *ctx, int *count,
> >              struct parsed_params *p)
> > {
> > // ... snip ...
> >     rc = create_region_validate_config(ctx, p);
> >     if (rc)
> >         return rc;
> > 
> >     if (p->size) {
> >         size = p->size;
> >         default_size = false;
> >     } else if (p->ep_min_size) {
> >         size = p->ep_min_size * p->ways;
> > **    } else {
> > **        log_err(&rl, "%s: unable to determine region size\n", __func__);
> > **        return -ENXIO;
> > **    }
> > 
> > So both size and ep_min_size are 0 here
> > 
> > echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> > cat /sys/bus/cxl/devices/region0/interleave_ways
> > 0
> > cat /sys/bus/cxl/devices/region0/interleave_granularity
> > 0
> > cat /sys/bus/cxl/devices/region0/size
> > 0
> 
> Ah - this revealed an actual bug in these commits - the size and
> ep_min_size don't refer to the region's size; they are the capacity of the
> component memdevs. Right after create_ram_region, the region size is
> expected to be zero.
> 
> However the bug here was a pmem assumption I had missed. When
> determining sizes, we only look at pmem capacity, which is wrong. It
> happened to work in my testing because the memdevs I used had both pmem
> and ram capacity. I'll update with a fix shortly. Thanks for trying it
> out and reporting this!

I've updated the branch now with a fix for this.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 20:24                       ` Verma, Vishal L
@ 2023-01-31 23:03                         ` Gregory Price
  2023-01-31 23:17                           ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-31 23:03 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: Williams, Dan J, Jonathan.Cameron, linux-cxl

On Tue, Jan 31, 2023 at 08:24:19PM +0000, Verma, Vishal L wrote:
> On Tue, 2023-01-31 at 19:46 +0000, Verma, Vishal L wrote:
> > On Tue, 2023-01-31 at 14:03 -0500, Gregory Price wrote:
> > > 
> > > 
> > > Right now I believe this is failing due to the interleave and size not
> > > having default values
> > > 
> > > ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> > > cxl region: create_region: create_region: unable to determine region size
> > > cxl region: cmd_create_region: created 0 regions
> > > 
> > > 
> > > appears to be due to this code
> > > static int create_region(struct cxl_ctx *ctx, int *count,
> > >              struct parsed_params *p)
> > > {
> > > // ... snip ...
> > >     rc = create_region_validate_config(ctx, p);
> > >     if (rc)
> > >         return rc;
> > > 
> > >     if (p->size) {
> > >         size = p->size;
> > >         default_size = false;
> > >     } else if (p->ep_min_size) {
> > >         size = p->ep_min_size * p->ways;
> > > **    } else {
> > > **        log_err(&rl, "%s: unable to determine region size\n", __func__);
> > > **        return -ENXIO;
> > > **    }
> > > 
> > > So both size and ep_min_size are 0 here
> > > 
> > > echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> > > cat /sys/bus/cxl/devices/region0/interleave_ways
> > > 0
> > > cat /sys/bus/cxl/devices/region0/interleave_granularity
> > > 0
> > > cat /sys/bus/cxl/devices/region0/size
> > > 0
> > 
> > Ah - this revealed an actual bug in these commits - the size and
> > ep_min_size don't refer to the region's size; they are the capacity of the
> > component memdevs. Right after create_ram_region, the region size is
> > expected to be zero.
> > 
> > However the bug here was a pmem assumption I had missed. When
> > determining sizes, we only look at pmem capacity, which is wrong. It
> > happened to work in my testing because the memdevs I used had both pmem
> > and ram capacity. I'll update with a fix shortly. Thanks for trying it
> > out and reporting this!
> 
> I've updated the branch now with a fix for this.

Progress! But now i've found a kernel segfault :D
(sorry about the jumble here, looks like multiple issues)

[root@fedora cxl]# ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
[  170.675334] cxl_region region0: Failed to synchronize CPU cache state
libcxl: cxl_region_enable: region0: failed to enable
cxl region: create_region: region0: failed to enable: No such device or address
[  170.682496] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  170.691163] #PF: supervisor instruction fetch in kernel mode
[  170.703916] #PF: error_code(0x0010) - not-present page
[  170.719709] PGD 800000004d25d067 P4D 800000004d25d067 PUD 4cdf3067 PMD 0
[  170.725436] Oops: 0010 [#1] PREEMPT SMP PTI
[  170.734510] CPU: 0 PID: 717 Comm: cxl Not tainted 6.2.0-rc2+ #19
[  170.739750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[  170.747119] RIP: 0010:0x0
[  170.751110] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[  170.757699] RSP: 0018:ffffb9a3c0e97c60 EFLAGS: 00010296
[  170.766091] RAX: 0000000000000000 RBX: ffff9c38e459de60 RCX: 0000000000000000
[  170.772499] RDX: 0000000000000000 RSI: ffff9c38e42ecdb0 RDI: ffff9c390f11d400
[  170.778300] RBP: ffff9c38eed38000 R08: 0000000000000001 R09: ffffb9a3c0e97b38
[  170.783787] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c393d8c8c00
[  170.788009] R13: ffff9c390f141c00 R14: ffff9c38eed38340 R15: ffff9c38c1a01400
[  170.795938] FS:  00007ff89ca037c0(0000) GS:ffff9c393dc00000(0000) knlGS:0000000000000000
[  170.802891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  170.806705] CR2: ffffffffffffffd6 CR3: 0000000024c8e000 CR4: 00000000000006f0
[  170.817025] Call Trace:
[  170.818831]  <TASK>
[  170.820589]  cxl_region_decode_reset+0xb8/0x110
[  170.823893]  cxl_region_detach+0xda/0x1e0
[  170.829457]  detach_target.part.0+0x29/0x80
[  170.833503]  unregister_region+0x42/0x90
[  170.836813]  devm_release_action+0x3d/0x70
[  170.840128]  ? __pfx_unregister_region+0x10/0x10
[  170.843899]  delete_region_store+0x69/0x80
[  170.847680]  kernfs_fop_write_iter+0x11e/0x200
[  170.851217]  vfs_write+0x222/0x3e0
[  170.854141]  ksys_write+0x5b/0xd0
[  170.856695]  do_syscall_64+0x5b/0x80
[  170.859678]  ? kmem_cache_free+0x15/0x3b0
[  170.862234]  ? do_sys_openat2+0x77/0x150
[  170.865560]  ? syscall_exit_to_user_mode+0x17/0x40
[  170.870920]  ? do_syscall_64+0x67/0x80
[  170.874726]  ? syscall_exit_to_user_mode+0x17/0x40
[  170.879464]  ? do_syscall_64+0x67/0x80
[  170.881634]  ? __irq_exit_rcu+0x3d/0x140
[  170.884720]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  170.888810] RIP: 0033:0x7ff89c901c37
[  170.891435] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 4
[  170.905803] RSP: 002b:00007fff0e843a68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  170.913373] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff89c901c37
[  170.920868] RDX: 0000000000000008 RSI: 0000000001290ee6 RDI: 0000000000000003
[  170.931402] RBP: 00007fff0e843aa0 R08: 000000000000fee0 R09: 0000000000000073
[  170.936639] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  170.942484] R13: 00007fff0e844000 R14: 000000000041fdc8 R15: 00007ff89cbdf000
[  170.954794]  </TASK>
[  170.957649] Modules linked in: rfkill vfat fat snd_pcm iTCO_wdt snd_timer intel_pmc_bxt ppdev iTCO_vendor_support snd cxl_pmem soundcore bochg
[  170.980623] CR2: 0000000000000000
[  170.984137] ---[ end trace 0000000000000000 ]---
[  170.989062] RIP: 0010:0x0
[  170.991505] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[  170.996401] RSP: 0018:ffffb9a3c0e97c60 EFLAGS: 00010296
[  170.999716] RAX: 0000000000000000 RBX: ffff9c38e459de60 RCX: 0000000000000000
[  171.006146] RDX: 0000000000000000 RSI: ffff9c38e42ecdb0 RDI: ffff9c390f11d400
[  171.018226] RBP: ffff9c38eed38000 R08: 0000000000000001 R09: ffffb9a3c0e97b38
[  171.024812] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c393d8c8c00
[  171.036512] R13: ffff9c390f141c00 R14: ffff9c38eed38340 R15: ffff9c38c1a01400
[  171.042400] FS:  00007ff89ca037c0(0000) GS:ffff9c393dc00000(0000) knlGS:0000000000000000
[  171.050182] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.055740] CR2: ffffffffffffffd6 CR3: 0000000024c8e000 CR4: 00000000000006f0
Killed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 23:03                         ` Gregory Price
@ 2023-01-31 23:17                           ` Gregory Price
       [not found]                             ` <CGME20230131235012uscas1p11573de234af67d70a882d4ca0f3ebaab@uscas1p1.samsung.com>
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-01-31 23:17 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: Williams, Dan J, Jonathan.Cameron, linux-cxl

On Tue, Jan 31, 2023 at 06:03:53PM -0500, Gregory Price wrote:
> On Tue, Jan 31, 2023 at 08:24:19PM +0000, Verma, Vishal L wrote:
> > [snip: create-region failure report and diagnosis, quoted in full above]
> > 
> > I've updated the branch now with a fix for this.
> 
> Progress! But now I've found a kernel segfault :D
> (sorry about the jumble here; looks like multiple issues)
> 
> [root@fedora cxl]# ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> [snip: oops trace, quoted in full above]
> Killed


Looks like some error is still occurring; this happens when attempting to
delete a region after it has been configured:

[root@fedora ~]# echo region0 > /sys/bus/cxl/devices/decoder0.0/delete_region
[   97.186328] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   97.190754] #PF: supervisor instruction fetch in kernel mode
[   97.196754] #PF: error_code(0x0010) - not-present page
[   97.201108] PGD 0 P4D 0
[   97.202585] Oops: 0010 [#1] PREEMPT SMP PTI
[   97.206085] CPU: 1 PID: 688 Comm: bash Not tainted 6.2.0-rc2+ #19
[   97.215215] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[   97.224247] RIP: 0010:0x0
[   97.228516] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   97.233852] RSP: 0018:ffffa30000d23d20 EFLAGS: 00010292
[   97.236704] RAX: 0000000000000000 RBX: ffff8a2fe44fb120 RCX: 0000000000000000
[   97.242904] RDX: 0000000000000000 RSI: ffff8a2fc2c5f6c0 RDI: ffff8a2fc1f29000
[   97.250537] RBP: ffff8a2fc3395c00 R08: 0000000000000001 R09: ffffa30000d23bf8
[   97.260478] R10: ffff8a2fc35adc00 R11: 0000000000000000 R12: ffff8a2fc1272000
[   97.276617] R13: ffff8a2fc3329c00 R14: ffff8a2fc3395f40 R15: ffff8a2fc1f29800
[   97.285277] FS:  00007f195be24740(0000) GS:ffff8a303dc80000(0000) knlGS:0000000000000000
[   97.295175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.300566] CR2: ffffffffffffffd6 CR3: 00000000249e6000 CR4: 00000000000006e0
[   97.308856] Call Trace:
[   97.312137]  <TASK>
[   97.314095]  cxl_region_decode_reset+0xb8/0x110
[   97.317937]  cxl_region_detach+0xda/0x1e0
[   97.320694]  detach_target.part.0+0x29/0x80
[   97.326437]  unregister_region+0x42/0x90
[   97.331169]  devm_release_action+0x3d/0x70
[   97.334957]  ? __pfx_unregister_region+0x10/0x10
[   97.338434]  delete_region_store+0x69/0x80
[   97.343526]  kernfs_fop_write_iter+0x11e/0x200
[   97.348950]  vfs_write+0x222/0x3e0
[   97.352273]  ksys_write+0x5b/0xd0
[   97.354592]  do_syscall_64+0x5b/0x80
[   97.358291]  ? exc_page_fault+0x70/0x170
[   97.362739]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   97.367268] RIP: 0033:0x7f195bd01c37
[   97.369719] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 4
[   97.386808] RSP: 002b:00007fff8a2320d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   97.394208] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f195bd01c37
[   97.401547] RDX: 0000000000000008 RSI: 0000555fd741b550 RDI: 0000000000000001
[   97.409231] RBP: 0000555fd741b550 R08: 0000000000001000 R09: 0000000000000000
[   97.416149] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
[   97.420784] R13: 00007f195bdf9780 R14: 0000000000000008 R15: 00007f195bdf49e0
[   97.429142]  </TASK>
[   97.431741] Modules linked in: rfkill vfat fat snd_pcm snd_timer iTCO_wdt snd intel_pmc_bxt iTCO_vendor_support ppdev cxl_pmem soundcore libng
[   97.456189] CR2: 0000000000000000
[   97.461271] ---[ end trace 0000000000000000 ]---
[   97.466464] RIP: 0010:0x0
[   97.468599] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   97.476231] RSP: 0018:ffffa30000d23d20 EFLAGS: 00010292
[   97.482910] RAX: 0000000000000000 RBX: ffff8a2fe44fb120 RCX: 0000000000000000
[   97.488445] RDX: 0000000000000000 RSI: ffff8a2fc2c5f6c0 RDI: ffff8a2fc1f29000
[   97.496227] RBP: ffff8a2fc3395c00 R08: 0000000000000001 R09: ffffa30000d23bf8
[   97.502543] R10: ffff8a2fc35adc00 R11: 0000000000000000 R12: ffff8a2fc1272000
[   97.512213] R13: ffff8a2fc3329c00 R14: ffff8a2fc3395f40 R15: ffff8a2fc1f29800
[   97.518303] FS:  00007f195be24740(0000) GS:ffff8a303dc80000(0000) knlGS:0000000000000000
[   97.526884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.533142] CR2: ffffffffffffffd6 CR3: 00000000249e6000 CR4: 00000000000006e0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
       [not found]                             ` <CGME20230131235012uscas1p11573de234af67d70a882d4ca0f3ebaab@uscas1p1.samsung.com>
@ 2023-01-31 23:50                               ` Fan Ni
  2023-02-01  5:29                                 ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Fan Ni @ 2023-01-31 23:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Verma, Vishal L, Williams, Dan J, Jonathan.Cameron, linux-cxl,
	Adam Manzanares, dave

On Tue, Jan 31, 2023 at 06:17:15PM -0500, Gregory Price wrote:
> [snip: quoted context and first oops trace, see above]
> 
> Looks like some error is still occurring; this happens when attempting to
> delete a region after it has been configured:
> 
> [root@fedora ~]# echo region0 > /sys/bus/cxl/devices/decoder0.0/delete_region
> 
> [snip: delete_region oops trace, quoted in full above]
> 
Are you using a single root port configuration? If so, the following
patch should fix the issue:
https://lore.kernel.org/linux-cxl/20221215170909.2650271-1-fan.ni@samsung.com/
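
(The change there is essentially a guard for decoders that do not
implement a reset callback - calling a NULL cxld->reset() is what
produces the RIP: 0x0 above. A rough sketch of the shape, from memory
of that patch; see the link for the authoritative version:)

	/* in cxl_region_decode_reset(): a pass-through decoder may have
	 * nothing to tear down, so only invoke ->reset() when it exists */
	if (cxld->reset)
		rc = cxld->reset(cxld);
	if (rc)
		return rc;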


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-01-31 23:50                               ` Fan Ni
@ 2023-02-01  5:29                                 ` Gregory Price
  2023-02-01 21:16                                   ` Gregory Price
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-02-01  5:29 UTC (permalink / raw)
  To: Fan Ni
  Cc: Verma, Vishal L, Williams, Dan J, Jonathan.Cameron, linux-cxl,
	Adam Manzanares, dave

On Tue, Jan 31, 2023 at 11:50:11PM +0000, Fan Ni wrote:
> [snip: quoted context and oops traces, see above]
> 
> Are you using a single root port configuration? If so, the following
> patch should fix the issue:
> https://lore.kernel.org/linux-cxl/20221215170909.2650271-1-fan.ni@samsung.com/

I did not have this patch.  This should definitely make its way up.

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-01  5:29                                 ` Gregory Price
@ 2023-02-01 21:16                                   ` Gregory Price
  2023-02-02  1:06                                     ` Gregory Price
  2023-02-02 16:03                                     ` Jonathan Cameron
  0 siblings, 2 replies; 34+ messages in thread
From: Gregory Price @ 2023-02-01 21:16 UTC (permalink / raw)
  To: Fan Ni
  Cc: Verma, Vishal L, Williams, Dan J, Jonathan.Cameron, linux-cxl,
	Adam Manzanares, dave

On Wed, Feb 01, 2023 at 12:29:50AM -0500, Gregory Price wrote:
> > Are you using a single root port configuration? If so, the following
> > patch should fix the issue:
> > https://lore.kernel.org/linux-cxl/20221215170909.2650271-1-fan.ni@samsung.com/
> > 
> 
> I did not have this patch.  This should definitely make its way up.
> 
> ~Gregory


This fixed up the stack trace for me, but region creation is still
failing to complete.

It looks like the configuration and decoder commit complete, but then
cxl-cli bails out because writing

echo region0 > /sys/bus/cxl/drivers/cxl_region/bind

results in "Failed to synchronize CPU cache state"

I presume this is due either to a logic error or to the memory not
actually being hooked up yet, but this is the relevant code:


static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
{
  if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
    return 0;

  if (!cpu_cache_has_invalidate_memregion()) {
    if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
      dev_warn_once(
        &cxlr->dev,
        "Bypassing cpu_cache_invalidate_memregion() for testing!\n");
      clear_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags);
      return 0;
    } else {
      dev_err(&cxlr->dev,
        "Failed to synchronize CPU cache state\n");
      return -ENXIO;
    }
  }

  cpu_cache_invalidate_memregion(IORES_DESC_CXL);
  clear_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags);
  return 0;
}



Looks like I can bypass this with CONFIG_CXL_REGION_INVALIDATION_TEST,
but I just wanted to report back in case this is not intended.

On x86, cpu_cache_has_invalidate_memregion() reduces to the hypervisor
feature bit not being set:

bool cpu_cache_has_invalidate_memregion(void)
{
  return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
}
EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);



I presume this will work if I enable CONFIG_CXL_REGION_INVALIDATION_TEST
in my config, but if anyone can confirm that this is the expected
behavior without it, that would be great.

Thanks!
~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-02 16:03                                     ` Jonathan Cameron
@ 2023-02-01 22:05                                       ` Gregory Price
  2023-02-02 18:13                                         ` Jonathan Cameron
  2023-02-02 18:18                                         ` Dan Williams
  0 siblings, 2 replies; 34+ messages in thread
From: Gregory Price @ 2023-02-01 22:05 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Fan Ni, Verma, Vishal L, Williams, Dan J, linux-cxl,
	Adam Manzanares, dave

On Thu, Feb 02, 2023 at 04:03:14PM +0000, Jonathan Cameron wrote:
> 
> > Note that there is another QEMU issue that needs resolving if you intend
> > to use this as normal memory, and it's worse under KVM. It affects the
> > corner case where an instruction crosses the boundary from normal memory
> > into CXL memory.
> 
> Thanks to the various QEMU folk who are helping us figure out what to do
> about this for the explanations that follow!
> 
> > We currently handle the region as MMIO - in QEMU terms, no actual relationship
> > to what the OS sees (as we need to mess with the address
> > mappings on each access for interleaving). That's a problem for KVM
> > (which may not cope with sub-page granularity remapping under the hood).
> 
> https://lore.kernel.org/qemu-devel/ff3f25ee-1c98-242b-905e-0b01d9f0948d@linaro.org/#r
> 
> > It is also a problem in TCG because the handling of executing out of MMIO takes
> > a shortcut.  That is fine (though very slow, as it uses a fallback path)
> > for regions entirely in MMIO, but not for the corner case where the start of the instruction
> > is in normal RAM (with all the related fast paths and instruction caching) and
> the end of the instruction is in the CXL MMIO region (a CFMWS window).
> 
> Currently looks like the fix will be to use the slow path for this case.
> Patches welcome!
> 
> Anyhow, in meantime beware.
> 
> Jonathan
> 

This all tracks, and is similar to what I've seen on other hypervisor
platforms when attempting to execute out of MMIO.

The reality is that CXL is not MMIO and not RAM or ROM or any of that
and is intended (eventually) to even be shared between QEMU instances.

That means it's likely going to require its own MemoryRegion model and
some deep dark corners of TCG and friends are going to require some
updates to make that work.

Whether it's worth the effort when the intent is to just let the
hardware handle that in the future, i don't really know.



Some speculation here:

The crux of the issue, as I understand it, is the invalidation path.
MMIO doesn't traditionally have a mechanism to tell the caches "hey,
I got updated, boot this cache line", so whenever your compiler accesses
MMIO it - at best - does a fetch-and-discard, meaning that instruction
translations can never be cached.  That's the source of the slowdown on
the QEMU side: you're constantly re-compiling the translations.

On the KVM side, it likely requires a VMExit to handle the MMIO, and
when it sees that it's an instruction fetch it probably just falls back
to emulator mode to execute the instruction before re-entering.  Maybe
there's a mild optimization where it continues executing until it leaves
that MMIO region, but you're still getting QEMU emulation performance
rather than KVM performance.

So that all makes sense to me.

To me, the solution here isn't to change QEMU, it's to change the kernel
to aggressively keep executable regions out of CXL by placing CXL
regions in a new zone type that essentially says "use this only as a
last resort for executable pages".  But that would likely require
adding migration code to the likes of mprotect and friends.

In the meantime, it sure would be nice to have a userland program that
grooms running software to detect this problem and migrate executable
pages to DRAM.


~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-02 18:13                                         ` Jonathan Cameron
@ 2023-02-02  0:43                                           ` Gregory Price
  0 siblings, 0 replies; 34+ messages in thread
From: Gregory Price @ 2023-02-02  0:43 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Fan Ni, Verma, Vishal L, Williams, Dan J, linux-cxl,
	Adam Manzanares, dave

On Thu, Feb 02, 2023 at 06:13:57PM +0000, Jonathan Cameron wrote:
> On Wed, 1 Feb 2023 17:05:51 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > The reality is that CXL is not MMIO and not RAM or ROM or any of that
> > and is intended (eventually) to even be shared between QEMU instances.
> 
> Indeed it's an oddity - but this is all smoke and mirrors anyway
> as far as the OS is concerned. The OS / firmware doesn't need to know
> anything about how we are modeling things in QEMU - we need to make
> QEMU functionally correct.
> 
> > 
> > That means it's likely going to require its own MemoryRegion model and
> > some deep dark corners of TCG and friends are going to require some
> > updates to make that work.
> > 
> > Some speculation here:
> > 
> > The crux of the issue, as I understand it, is the invalidation path.
> > MMIO doesn't traditionally have a mechanism to tell the caches "hey,
> > I got updated, boot this cache line", so whenever your compiler accesses
> > MMIO it - at best - does a fetch-and-discard, meaning that instruction
> > translations can never be cached.  That's the source of the slowdown on
> > the QEMU side: you're constantly re-compiling the translations.
> 
> If we actually care, I think we could do some tricks with creating
> a cache of pages in between the interleaved underlying element and TCG
> so that it could behave as if it were dealing with a normal page when
> executing, but when writing it would drop such a cache so the writes
> reach the interleave elements. I don't think we do care for now though.
> 
> Slow down on TCG is fine (if possibly rather painful). We just need to
> work around the currently broken corner.
> 

At that point you're just wrapping a memory region inside what amounts
to another memory region.  Might as well make the full-blown fork and
turn the region into an explicit CXL memory region with caches that can
be invalidated.  That models the hardware anyway (though you don't have
to go as crazy as simulating the full snoop protocol).

~Gregory

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-02 18:18                                         ` Dan Williams
@ 2023-02-02  0:44                                           ` Gregory Price
  2023-02-07 16:31                                             ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Gregory Price @ 2023-02-02  0:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Fan Ni, Verma, Vishal L, linux-cxl,
	Adam Manzanares, dave

On Thu, Feb 02, 2023 at 10:18:33AM -0800, Dan Williams wrote:
> Gregory Price wrote:
> [..]
> > To me, the solution here isn't to change QEMU, it's to change the kernel
> > to aggressively keep executable regions out of CXL by placing CXL
> > regions in a new zone type that essentially says "use this only as a
> > last resort for executable pages".  But that would likely require
> > adding migration code to the likes of mprotect and friends.
> 
> No, this can't be the path forward as far as I can see. QEMU is a test
> vehicle for CXL enabling, there's no expectation that QEMU is running
> CXL emulation in production. The quirks of how the QEMU-CXL memory
> behaves are not something the kernel should worry about mitigating. CXL
> is "System RAM" especially in the case when it is mapped by
> platform-firmware. If it's not suitable to be treated as "System RAM"
> then the onus is on platform-firmware to keep it out of the general
> purpose pool.
> 

Eh, you're right, just spitballing.  On real hardware this isn't an
issue so there's no reason to change the kernel.  QEMU should just model
the hardware.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-01 21:16                                   ` Gregory Price
@ 2023-02-02  1:06                                     ` Gregory Price
  2023-02-02 16:03                                     ` Jonathan Cameron
  1 sibling, 0 replies; 34+ messages in thread
From: Gregory Price @ 2023-02-02  1:06 UTC (permalink / raw)
  To: Fan Ni
  Cc: Verma, Vishal L, Williams, Dan J, Jonathan.Cameron, linux-cxl,
	Adam Manzanares, dave

On Wed, Feb 01, 2023 at 04:16:34PM -0500, Gregory Price wrote:
> [snip: CONFIG_CXL_REGION_INVALIDATION_TEST discussion, quoted above]

For the sake of completeness and lurking readers, here is my complete
setup.  I was able to successfully test cxl-cli onlining a region, and
to confirm that the memory is accessible to the guest via /dev/mem.

I was also able to verify the routing through the decoders via gdb.

Looks like the last step is to set up a dax device and then wire up
the memory blocks and attach them to a NUMA node :].
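
(Presumably, once the device-dax hookup lands, that last step looks like
the existing dax flow - a guess at the eventual commands, with the
device name hypothetical:)

daxctl list -D
daxctl reconfigure-device --mode=system-ram dax0.0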



Linux kernel configurations required:
CONFIG_CXL_REGION_INVALIDATION_TEST=y

linux branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
merged into
https://github.com/l1k/linux/commits/doe
plus this additional commit
https://lore.kernel.org/linux-cxl/20221215170909.2650271-1-fan.ni@samsung.com/


ndctl branch
https://github.com/pmem/ndctl/commits/vv/volatile-regions

devmem2
https://github.com/hackndev/tools/blob/master/devmem2.c
you'll need to modify the strtoul to strtoull to write to the correct
addresses; right now this is 32-bit bound, and the CXL physical
addresses are always above 4GB.
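
(The change is a one-liner; a sketch against devmem2.c, assuming its
existing target variable:)

	/* parse the full 64-bit physical address instead of truncating
	 * to a 32-bit unsigned long */
	target = strtoull(argv[1], 0, 0);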


qemu config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G


[root@fedora cxl]# ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
[  128.790228] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region0",
  "resource":"0x290000000",
  "size":"1024.00 MiB (1073.74 MB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
[root@fedora cxl]# ./cxl list
[
  {
    "memdevs":[
      {
        "memdev":"mem0",
        "ram_size":1073741824,
        "serial":0,
        "host":"0000:35:00.0"
      }
    ]
  },
  {
    "regions":[
      {
        "region":"region0",
        "resource":11005853696,
        "size":1073741824,
        "type":"ram",
        "interleave_ways":1,
        "interleave_granularity":4096,
        "decode_state":"commit"
      }
    ]
  }
]



[root@fedora ~]# ./devmem2 0x290000000 w 0x11111111
/dev/mem opened.
Memory mapped at address 0x7fa9997db000.
Value at address 0x290000000 (0x7fa9997db000): 0xAAAAAAAA
Written 0x11111111; readback 0x11111111


on host:
[gourry@fedora linux]$ xxd /tmp/mem0 | head
00000000: 1111 1111 0000 0000 0000 0000 0000 0000  ................

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-01 21:16                                   ` Gregory Price
  2023-02-02  1:06                                     ` Gregory Price
@ 2023-02-02 16:03                                     ` Jonathan Cameron
  2023-02-01 22:05                                       ` Gregory Price
  1 sibling, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2023-02-02 16:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: Fan Ni, Verma, Vishal L, Williams, Dan J, linux-cxl,
	Adam Manzanares, dave

On Wed, 1 Feb 2023 16:16:34 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> [snip: quoted context, see above]
> 
> On x86, cpu_cache_has_invalidate_memregion() reduces to the hypervisor
> feature bit not being set:
> 
> bool cpu_cache_has_invalidate_memregion(void)
> {
>   return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> }
> EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);

Yup. I don't think we currently support this properly even when
doing TCG mode on QEMU rather than KVM.

Note that there is another QEMU issue that needs resolving if you intend
to use this as normal memory, and it's worse under KVM. It affects the
corner case where an instruction crosses the boundary from normal memory
into CXL memory.

Thanks to the various QEMU folk who are helping us figure out what to do
about this for the explanations that follow!

We currently handle the region as MMIO - in QEMU terms, no actual relationship
to what the OS sees (as we need to mess with the address
mappings on each access for interleaving). That's a problem for KVM
(which may not cope with sub-page granularity remapping under the hood).

https://lore.kernel.org/qemu-devel/ff3f25ee-1c98-242b-905e-0b01d9f0948d@linaro.org/#r

It is also a problem in TCG because the handling of executing out of MMIO takes
a shortcut.  That is fine (though very slow, as it uses a fallback path)
for regions entirely in MMIO, but not for the corner case where the start of the instruction
is in normal RAM (with all the related fast paths and instruction caching) and
the end of the instruction is in the CXL MMIO region (a CFMWS window).

Currently looks like the fix will be to use the slow path for this case.
Patches welcome!

Anyhow, in meantime beware.

Jonathan




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-01 22:05                                       ` Gregory Price
@ 2023-02-02 18:13                                         ` Jonathan Cameron
  2023-02-02  0:43                                           ` Gregory Price
  2023-02-02 18:18                                         ` Dan Williams
  1 sibling, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2023-02-02 18:13 UTC (permalink / raw)
  To: Gregory Price
  Cc: Fan Ni, Verma, Vishal L, Williams, Dan J, linux-cxl,
	Adam Manzanares, dave

On Wed, 1 Feb 2023 17:05:51 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Feb 02, 2023 at 04:03:14PM +0000, Jonathan Cameron wrote:
> > 
> > Note that there is another QEMU issue that needs resolving if you intend
> > to use this as normal memory, and it's worse under KVM. It affects the
> > corner case where an instruction crosses the boundary from normal memory
> > into CXL memory.
> > 
> > Thanks to the various QEMU folk, who are helping us figure out what to do
> > about this, for the explanations that follow!
> > 
> > We currently handle the region as MMIO - in QEMU terms it has no actual
> > relationship to what the OS sees (as we need to mess with the address
> > mappings on each access for interleaving). That's a problem for KVM
> > (which may not cope with sub-page granularity remapping under the hood).
> > 
> > https://lore.kernel.org/qemu-devel/ff3f25ee-1c98-242b-905e-0b01d9f0948d@linaro.org/#r
> > 
> > It is also a problem in TCG because the handling of executing out of MMIO
> > takes a shortcut.  That is fine (though very low performance, as it uses a
> > fallback path) for instructions that sit fully within MMIO regions, but not
> > for the corner case where the start of the instruction is in normal RAM
> > (with all the related fast paths and instruction caching) and the end of
> > the instruction is in the CXL MMIO region (a CFMWS window).
> > 
> > Currently it looks like the fix will be to use the slow path for this case.
> > Patches welcome!
> > 
> > Anyhow, in the meantime, beware.
> > 
> > Jonathan
> >   
> 
> This all tracks, and is similar to what I've seen on other hypervisor
> platforms when attempting to execute out of MMIO.
> 
> The reality is that CXL memory is not MMIO, RAM, ROM, or any of that,
> and it is intended (eventually) to even be shared between QEMU instances.

Indeed it's an oddity - but this is all smoke and mirrors anyway
as far as the OS is concerned. The OS / firmware doesn't need to know
anything about how we are modeling things in QEMU - we need to make
QEMU functionally correct.

> 
> That means it's likely going to require its own MemoryRegion model, and
> some deep dark corners of TCG and friends are going to need updates to
> make that work.
> 
> Whether it's worth the effort when the intent is to just let the
> hardware handle that in the future, I don't really know.

I think we need a fix, though perf can be terrible ;)
> 
> 
> 
> Some speculation here:
> 
> The crux of the issue, as I understand it, is the invalidation path.
> MMIO doesn't traditionally have a mechanism to tell the caches "hey,
> I got updated, boot this cache line", so whenever the TCG compiler fetches
> from MMIO it - at best - does a fetch-and-discard, meaning that instruction
> translations can never be cached.  That's the source of the slowdown on
> the QEMU side: you're constantly re-compiling the translations.

If we actually care, I think we could do some tricks with creating a
cache of pages in between the underlying interleave elements and TCG, so
that TCG could behave as if it were dealing with a normal page when
executing; on a write it would drop that cache so the writes reach the
interleave elements. I don't think we care enough for now, though.

A slowdown on TCG is fine (if possibly rather painful). We just need to
work around the currently broken corner case.

> 
> On the KVM side, it likely requires a VMExit to handle the MMIO, and
> when it sees that it's an instruction fetch it probably just falls back
> to emulator mode to execute the instruction before re-entering.  Maybe
> there's a mild optimization where it continues executing until it leaves
> that MMIO region, but you're still getting emulated performance rather
> than KVM performance.

There might be ways to improve that perf, but TCG is good enough for
testing so I'm not sure we care.  I tend to be doing cross-architecture
work anyway, so I don't care much about KVM :)

> 
> So that all makes sense to me.
> 
> To me, the solution here isn't to change QEMU, it's to change the kernel
> to try to get it to aggressively keep executable regions out of CXL by
> marking CXL regions into a new zone type that essentially says "Use this
> as a last resort only for X pages".  But that would likely require
> adding migration code to the likes of mprotect and friends.

No. This is a QEMU emulation issue, not a real hardware one - so we shouldn't
touch the kernel. On real hardware (with the exception of the shared case) it
should be fine to execute from CXL memory.

> 
> In the meantime, it sure would be nice to have a userland program that
> detects this problem and migrates executable (X) pages to DRAM.

Lots of program memory is exceedingly cold (typically things like init and
exit code that is only touched once per program run), so we very much do not want
to do this.  In some cases we want to push executable memory to the CXL memory.

Jonathan

> 
> 
> ~Gregory


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-01 22:05                                       ` Gregory Price
  2023-02-02 18:13                                         ` Jonathan Cameron
@ 2023-02-02 18:18                                         ` Dan Williams
  2023-02-02  0:44                                           ` Gregory Price
  1 sibling, 1 reply; 34+ messages in thread
From: Dan Williams @ 2023-02-02 18:18 UTC (permalink / raw)
  To: Gregory Price, Jonathan Cameron
  Cc: Fan Ni, Verma, Vishal L, Williams, Dan J, linux-cxl,
	Adam Manzanares, dave

Gregory Price wrote:
[..]
> Some speculation here:
> 
> The crux of the issue, as I understand it, is the invalidation path.
> MMIO doesn't traditionally have a mechanism to tell the caches "hey,
> I got updated, boot this cache line", so whenever the TCG compiler fetches
> from MMIO it - at best - does a fetch-and-discard, meaning that instruction
> translations can never be cached.  That's the source of the slowdown on
> the QEMU side: you're constantly re-compiling the translations.
> 
> On the KVM side, it likely requires a VMExit to handle the MMIO, and
> when it sees that it's an instruction fetch it probably just falls back
> to emulator mode to execute the instruction before re-entering.  Maybe
> there's a mild optimization where it continues executing until it leaves
> that MMIO region, but you're still getting emulated performance rather
> than KVM performance.
> 
> So that all makes sense to me.
> 
> To me, the solution here isn't to change QEMU, it's to change the kernel
> to try to get it to aggressively keep executable regions out of CXL by
> marking CXL regions into a new zone type that essentially says "Use this
> as a last resort only for X pages".  But that would likely require
> adding migration code to the likes of mprotect and friends.

No, this can't be the path forward as far as I can see. QEMU is a test
vehicle for CXL enabling, there's no expectation that QEMU is running
CXL emulation in production. The quirks of how the QEMU-CXL memory
behaves are not something the kernel should worry about mitigating. CXL
is "System RAM" especially in the case when it is mapped by
platform-firmware. If it's not suitable to be treated as "System RAM"
then the onus is on platform-firmware to keep it out of the general
purpose pool.

> In the meantime, it sure would be nice to have a userland program that
> detects this problem and migrates executable (X) pages to DRAM.

As far as software is concerned, it may not even be able to discern
whether DRAM is DDR- or CXL-attached. Something is wrong if that
distinction matters in practice outside of NUMA policy or dedicated
device access.
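
Purely as a hypothetical illustration of the NUMA-policy point (the node
number below is an assumption): steering a task's allocations toward a
CXL-backed node is ordinary memory-policy code, nothing CXL-specific.
Link with -lnuma:

/*
 * Hypothetical illustration only: node 1 standing in for a CXL-backed
 * node is an assumption.
 */
#include <numaif.h>

/* Make this task's future allocations prefer the (assumed) CXL node. */
static long prefer_cxl_node(void)
{
	unsigned long nodemask = 1UL << 1;

	return set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8);
}

int main(void)
{
	return prefer_cxl_node() ? 1 : 0;
}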

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [GIT preview] for-6.3/cxl-ram-region
  2023-01-26  6:25 [GIT preview] for-6.3/cxl-ram-region Dan Williams
  2023-01-26  6:29 ` Dan Williams
  2023-01-26 22:05 ` Gregory Price
@ 2023-02-04  2:36 ` Dan Williams
  2 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2023-02-04  2:36 UTC (permalink / raw)
  To: Dan Williams, linux-cxl

Dan Williams wrote:
> There are still some sharp edges on this patchset, like the missing
> device-dax hookup, but it is likely enough to show the direction and
> unblock other testing. Specifically I want to see how this fares with
> Greg's recent volatile region provisioning in QEMU.
> 
> I am hoping to have those last bits ironed out before the end of the
> week. Note that this topic branch will rebase so do not base any
> work beyond proof-of-concept on top of it.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

I have updated this branch (14acf2db2613) with support for routing CXL
regions to System RAM. Will send the patches out once the build robot
finishes chewing on it.

With cxl_test and booting with "memhp_default_state=offline" (since
cxl_test is not real memory), it produces a device-dax instance that is
ready to be onlined via 'daxctl online-memory dax4.0':

# cxl list -RXu
{
  "region":"region4",
  "resource":"0xf010000000",
  "size":"512.00 MiB (536.87 MB)",
  "interleave_ways":2,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "daxregion":{
    "id":4,
    "size":"512.00 MiB (536.87 MB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax4.0",
        "size":"512.00 MiB (536.87 MB)",
        "target_node":0,
        "align":2097152,
        "mode":"system-ram",
        "online_memblocks":0,
        "total_memblocks":4
      }
    ]
  }
}
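
For anyone without daxctl to hand, the onlining step is just the generic
memory-hotplug sysfs ABI.  A rough, untested C equivalent is below -
node0 matches target_node in the listing above, and real daxctl is more
careful about which memory blocks actually belong to the dax device:

/*
 * Rough, untested equivalent of 'daxctl online-memory dax4.0': write
 * "online_movable" ("online" also works; movable keeps the blocks
 * removable) to the state file of each memory block on the target node.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *node = "/sys/devices/system/node/node0";
	struct dirent *de;
	DIR *d = opendir(node);

	if (!d)
		return 1;
	while ((de = readdir(d)) != NULL) {
		char path[512];
		FILE *f;

		if (strncmp(de->d_name, "memory", 6) != 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s/state", node, de->d_name);
		f = fopen(path, "w");
		if (!f)
			continue;
		/* An error here usually means the block is already online. */
		fputs("online_movable", f);
		fclose(f);
	}
	closedir(d);
	return 0;
}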

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [GIT preview] for-6.3/cxl-ram-region
  2023-02-02  0:44                                           ` Gregory Price
@ 2023-02-07 16:31                                             ` Jonathan Cameron
  0 siblings, 0 replies; 34+ messages in thread
From: Jonathan Cameron @ 2023-02-07 16:31 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, Fan Ni, Verma, Vishal L, linux-cxl,
	Adam Manzanares, dave, Richard Henderson

On Wed, 1 Feb 2023 19:44:47 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Feb 02, 2023 at 10:18:33AM -0800, Dan Williams wrote:
> > Gregory Price wrote:
> > [..]  
> > > To me, the solution here isn't to change QEMU, it's to change the kernel
> > > to try to get it to aggressively keep executable regions out of CXL by
> > > marking CXL regions into a new zone type that essentially says "Use this
> > > as a last resort only for X pages".  But that would likely require
> > > adding migration code to the likes of mprotect and friends.  
> > 
> > No, this can't be the path forward as far as I can see. QEMU is a test
> > vehicle for CXL enabling, there's no expectation that QEMU is running
> > CXL emulation in production. The quirks of how the QEMU-CXL memory
> > behaves are not something the kernel should worry about mitigating. CXL
> > is "System RAM" especially in the case when it is mapped by
> > platform-firmware. If it's not suitable to be treated as "System RAM"
> > then the onus is on platform-firmware to keep it out of the general
> > purpose pool.
> >   
> 
> Eh, you're right, just spitballing.  On real hardware this isn't an
> issue so there's no reason to change the kernel.  QEMU should just model
> the hardware.

FYI.  Richard Henderson has posted a fix.
We should be fine going forward on QEMU TCG. Waiting on confirmation
that it fixes the reported case of an instruction straddling the QEMU
normal-to-MMIO boundary, but it looks right to me (and it's confirmed to
fix a different issue that hit the same restriction).  Thanks Richard!

https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/

I hadn't even gotten a reproducer up yet, so I very much appreciate that
I might not have to ;)

Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-02-07 16:31 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-26  6:25 [GIT preview] for-6.3/cxl-ram-region Dan Williams
2023-01-26  6:29 ` Dan Williams
2023-01-26 18:50   ` Jonathan Cameron
2023-01-26 19:34     ` Jonathan Cameron
2023-01-30 14:16       ` Gregory Price
2023-01-30 20:10         ` Dan Williams
2023-01-30 20:58           ` Gregory Price
2023-01-30 23:18             ` Dan Williams
2023-01-30 22:00               ` Gregory Price
2023-01-31  2:00               ` Gregory Price
2023-01-31 16:56                 ` Dan Williams
2023-01-31 17:59                 ` Verma, Vishal L
2023-01-31 19:03                   ` Gregory Price
2023-01-31 19:46                     ` Verma, Vishal L
2023-01-31 20:24                       ` Verma, Vishal L
2023-01-31 23:03                         ` Gregory Price
2023-01-31 23:17                           ` Gregory Price
     [not found]                             ` <CGME20230131235012uscas1p11573de234af67d70a882d4ca0f3ebaab@uscas1p1.samsung.com>
2023-01-31 23:50                               ` Fan Ni
2023-02-01  5:29                                 ` Gregory Price
2023-02-01 21:16                                   ` Gregory Price
2023-02-02  1:06                                     ` Gregory Price
2023-02-02 16:03                                     ` Jonathan Cameron
2023-02-01 22:05                                       ` Gregory Price
2023-02-02 18:13                                         ` Jonathan Cameron
2023-02-02  0:43                                           ` Gregory Price
2023-02-02 18:18                                         ` Dan Williams
2023-02-02  0:44                                           ` Gregory Price
2023-02-07 16:31                                             ` Jonathan Cameron
2023-01-30 14:23       ` Gregory Price
2023-01-31 14:56         ` Jonathan Cameron
2023-01-31 17:34           ` Gregory Price
2023-01-26 22:05 ` Gregory Price
2023-01-26 22:20   ` Dan Williams
2023-02-04  2:36 ` Dan Williams
