From: Gregory Price <gregory.price@memverge.com>
To: Fan Ni <fan.ni@samsung.com>
Cc: "Verma, Vishal L" <vishal.l.verma@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	Adam Manzanares <a.manzanares@samsung.com>,
	"dave@stgolabs.net" <dave@stgolabs.net>
Subject: Re: [GIT preview] for-6.3/cxl-ram-region
Date: Wed, 1 Feb 2023 00:29:50 -0500
Message-ID: <Y9n4zjtwf+w6xnmW@memverge.com>
In-Reply-To: <20230131235003.GA336751@bgt-140510-bm03>

On Tue, Jan 31, 2023 at 11:50:11PM +0000, Fan Ni wrote:
> On Tue, Jan 31, 2023 at 06:17:15PM -0500, Gregory Price wrote:
> > On Tue, Jan 31, 2023 at 06:03:53PM -0500, Gregory Price wrote:
> > > On Tue, Jan 31, 2023 at 08:24:19PM +0000, Verma, Vishal L wrote:
> > > > On Tue, 2023-01-31 at 19:46 +0000, Verma, Vishal L wrote:
> > > > > On Tue, 2023-01-31 at 14:03 -0500, Gregory Price wrote:
> > > > > > 
> > > > > > 
> > > > > > Right now I believe this is failing due to the interleave and size not
> > > > > > having default values
> > > > > > 
> > > > > > ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> > > > > > cxl region: create_region: create_region: unable to determine region size
> > > > > > cxl region: cmd_create_region: created 0 regions
> > > > > > 
> > > > > > 
> > > > > > appears to be due to this code
> > > > > > static int create_region(struct cxl_ctx *ctx, int *count,
> > > > > >              struct parsed_params *p)
> > > > > > {
> > > > > > // ... snip ...
> > > > > >     rc = create_region_validate_config(ctx, p);
> > > > > >     if (rc)
> > > > > >         return rc;
> > > > > > 
> > > > > >     if (p->size) {
> > > > > >         size = p->size;
> > > > > >         default_size = false;
> > > > > >     } else if (p->ep_min_size) {
> > > > > >         size = p->ep_min_size * p->ways;
> > > > > > **    } else {
> > > > > > **        log_err(&rl, "%s: unable to determine region size\n", __func__);
> > > > > > **        return -ENXIO;
> > > > > > **    }
> > > > > > 
> > > > > > So both size and ep_min_size are 0 here
> > > > > > 
> > > > > > echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> > > > > > cat /sys/bus/cxl/devices/region0/interleave_ways
> > > > > > 0
> > > > > > cat /sys/bus/cxl/devices/region0/interleave_granularity
> > > > > > 0
> > > > > > cat /sys/bus/cxl/devices/region0/size
> > > > > > 0
> > > > > 
> > > > > Ah - this revealed an actual bug in these commits - the size and
> > > > > ep_min_size don't refer to the region's size; they are the capacity of the
> > > > > component memdevs. Right after create_ram_region, the region size is
> > > > > expected to be zero.
> > > > > 
> > > > > However the bug here was a pmem assumption I had missed. When
> > > > > determining sizes, we only look at pmem capacity, which is wrong. It
> > > > > happened to work in my testing because the memdevs I used had both pmem
> > > > > and ram capacity. I'll update with a fix shortly. Thanks for trying it
> > > > > out and reporting this!
> > > > 
> > > > I've updated the branch now with a fix for this.
> > > 
> > > Progress! But now I've found a kernel segfault :D
> > > (sorry about the jumble here, looks like multiple issues)
> > > 
> > > [root@fedora cxl]# ./cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
> > > [  170.675334] cxl_region region0: Failed to synchronize CPU cache state
> > > libcxl: [c x l1_7r0e.68249g6i] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > [  170.691163] #PF: supervisor instruction fetch in kernel mode
> > > [o n 1_70.70e3n9a1b6l]e :# rPeF: error_code(0gixo0010) - not-present page
> > > n0[:  fai led1 7to 0e.7n19709] PGD 800000004d25d067 P4D 800000004d25d067 PUD 4cdf3067 PMD 0
> > > [  170.725436] Oops: 0010 [#1] PREEMPT SMP PTI
> > > 1b[l e
> > > 7c0x.l734 510r]e giConPU: 0 PID: 717 Comm: cxl Not tainted 6.2.0-rc2+ #19
> > > [  170.739750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > :[  170.747119] R IP: 0c0r1e0:at0ex_0r
> > > egi[o n: 170.751110] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > > [  170.757699] RSP: 0018:ffffb9a3c0e97c60 EFLAGS: 00010296
> > > [   17r0e.g7ion0:6 f6a0i9l1e]d RAX: 0000000000000000 RBX: ffff9c38e459de60 RCX: 0000000000000000
> > > [  170.772499] RDX: 0000000000000000 RSI: ffff9c38e42ecdb0 RDI: ffff9c390f11d400
> > >  [  t170o.77 8e3nab0l0e] RBP: fff:f 9Nco3 8seed38000 R08: 0000000000000001 R09: ffffb9a3c0e97b38
> > > [  170.783787] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c393d8c8c00
> > > uch d[ev i 1ce7 0o.7r8 800a9]d R13: ffff9c390f141c00 R14: ffff9c38eed38340 R15: ffff9c38c1a01400
> > > dr[e  s1s7
> > > 0.795938] FS:  00007ff89ca037c0(0000) GS:ffff9c393dc00000(0000) knlGS:0000000000000000
> > > [  170.802891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  170.806705] CR2: ffffffffffffffd6 CR3: 0000000024c8e000 CR4: 00000000000006f0
> > > [  170.817025] Call Trace:
> > > [  170.818831]  <TASK>
> > > [  170.820589]  cxl_region_decode_reset+0xb8/0x110
> > > [  170.823893]  cxl_region_detach+0xda/0x1e0
> > > [  170.829457]  detach_target.part.0+0x29/0x80
> > > [  170.833503]  unregister_region+0x42/0x90
> > > [  170.836813]  devm_release_action+0x3d/0x70
> > > [  170.840128]  ? __pfx_unregister_region+0x10/0x10
> > > [  170.843899]  delete_region_store+0x69/0x80
> > > [  170.847680]  kernfs_fop_write_iter+0x11e/0x200
> > > [  170.851217]  vfs_write+0x222/0x3e0
> > > [  170.854141]  ksys_write+0x5b/0xd0
> > > [  170.856695]  do_syscall_64+0x5b/0x80
> > > [  170.859678]  ? kmem_cache_free+0x15/0x3b0
> > > [  170.862234]  ? do_sys_openat2+0x77/0x150
> > > [  170.865560]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [  170.870920]  ? do_syscall_64+0x67/0x80
> > > [  170.874726]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [  170.879464]  ? do_syscall_64+0x67/0x80
> > > [  170.881634]  ? __irq_exit_rcu+0x3d/0x140
> > > [  170.884720]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > [  170.888810] RIP: 0033:0x7ff89c901c37
> > > [  170.891435] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 4
> > > [  170.905803] RSP: 002b:00007fff0e843a68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > > [  170.913373] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff89c901c37
> > > [  170.920868] RDX: 0000000000000008 RSI: 0000000001290ee6 RDI: 0000000000000003
> > > [  170.931402] RBP: 00007fff0e843aa0 R08: 000000000000fee0 R09: 0000000000000073
> > > [  170.936639] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > > [  170.942484] R13: 00007fff0e844000 R14: 000000000041fdc8 R15: 00007ff89cbdf000
> > > [  170.954794]  </TASK>
> > > [  170.957649] Modules linked in: rfkill vfat fat snd_pcm iTCO_wdt snd_timer intel_pmc_bxt ppdev iTCO_vendor_support snd cxl_pmem soundcore bochg
> > > [  170.980623] CR2: 0000000000000000
> > > [  170.984137] ---[ end trace 0000000000000000 ]---
> > > [  170.989062] RIP: 0010:0x0
> > > [  170.991505] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > > [  170.996401] RSP: 0018:ffffb9a3c0e97c60 EFLAGS: 00010296
> > > [  170.999716] RAX: 0000000000000000 RBX: ffff9c38e459de60 RCX: 0000000000000000
> > > [  171.006146] RDX: 0000000000000000 RSI: ffff9c38e42ecdb0 RDI: ffff9c390f11d400
> > > [  171.018226] RBP: ffff9c38eed38000 R08: 0000000000000001 R09: ffffb9a3c0e97b38
> > > [  171.024812] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c393d8c8c00
> > > [  171.036512] R13: ffff9c390f141c00 R14: ffff9c38eed38340 R15: ffff9c38c1a01400
> > > [  171.042400] FS:  00007ff89ca037c0(0000) GS:ffff9c393dc00000(0000) knlGS:0000000000000000
> > > [  171.050182] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  171.055740] CR2: ffffffffffffffd6 CR3: 0000000024c8e000 CR4: 00000000000006f0
> > > Killed
> > 
> > 
> > Looks like some error is still occurring; this happens when attempting to
> > delete a region after it has been configured
> > 
> > [root@fedora ~]# echo region0 > /sys/bus/cxl/devices/decoder0.0/delete_region
> > [   97.186328] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [   97.190754] #PF: supervisor instruction fetch in kernel mode
> > [   97.196754] #PF: error_code(0x0010) - not-present page
> > [   97.201108] PGD 0 P4D 0
> > [   97.202585] Oops: 0010 [#1] PREEMPT SMP PTI
> > [   97.206085] CPU: 1 PID: 688 Comm: bash Not tainted 6.2.0-rc2+ #19
> > [   97.215215] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > [   97.224247] RIP: 0010:0x0
> > [   97.228516] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > [   97.233852] RSP: 0018:ffffa30000d23d20 EFLAGS: 00010292
> > [   97.236704] RAX: 0000000000000000 RBX: ffff8a2fe44fb120 RCX: 0000000000000000
> > [   97.242904] RDX: 0000000000000000 RSI: ffff8a2fc2c5f6c0 RDI: ffff8a2fc1f29000
> > [   97.250537] RBP: ffff8a2fc3395c00 R08: 0000000000000001 R09: ffffa30000d23bf8
> > [   97.260478] R10: ffff8a2fc35adc00 R11: 0000000000000000 R12: ffff8a2fc1272000
> > [   97.276617] R13: ffff8a2fc3329c00 R14: ffff8a2fc3395f40 R15: ffff8a2fc1f29800
> > [   97.285277] FS:  00007f195be24740(0000) GS:ffff8a303dc80000(0000) knlGS:0000000000000000
> > [   97.295175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   97.300566] CR2: ffffffffffffffd6 CR3: 00000000249e6000 CR4: 00000000000006e0
> > [   97.308856] Call Trace:
> > [   97.312137]  <TASK>
> > [   97.314095]  cxl_region_decode_reset+0xb8/0x110
> > [   97.317937]  cxl_region_detach+0xda/0x1e0
> > [   97.320694]  detach_target.part.0+0x29/0x80
> > [   97.326437]  unregister_region+0x42/0x90
> > [   97.331169]  devm_release_action+0x3d/0x70
> > [   97.334957]  ? __pfx_unregister_region+0x10/0x10
> > [   97.338434]  delete_region_store+0x69/0x80
> > [   97.343526]  kernfs_fop_write_iter+0x11e/0x200
> > [   97.348950]  vfs_write+0x222/0x3e0
> > [   97.352273]  ksys_write+0x5b/0xd0
> > [   97.354592]  do_syscall_64+0x5b/0x80
> > [   97.358291]  ? exc_page_fault+0x70/0x170
> > [   97.362739]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [   97.367268] RIP: 0033:0x7f195bd01c37
> > [   97.369719] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 4
> > [   97.386808] RSP: 002b:00007fff8a2320d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [   97.394208] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f195bd01c37
> > [   97.401547] RDX: 0000000000000008 RSI: 0000555fd741b550 RDI: 0000000000000001
> > [   97.409231] RBP: 0000555fd741b550 R08: 0000000000001000 R09: 0000000000000000
> > [   97.416149] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
> > [   97.420784] R13: 00007f195bdf9780 R14: 0000000000000008 R15: 00007f195bdf49e0
> > [   97.429142]  </TASK>
> > [   97.431741] Modules linked in: rfkill vfat fat snd_pcm snd_timer iTCO_wdt snd intel_pmc_bxt iTCO_vendor_support ppdev cxl_pmem soundcore libng
> > [   97.456189] CR2: 0000000000000000
> > [   97.461271] ---[ end trace 0000000000000000 ]---
> > [   97.466464] RIP: 0010:0x0
> > [   97.468599] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > [   97.476231] RSP: 0018:ffffa30000d23d20 EFLAGS: 00010292
> > [   97.482910] RAX: 0000000000000000 RBX: ffff8a2fe44fb120 RCX: 0000000000000000
> > [   97.488445] RDX: 0000000000000000 RSI: ffff8a2fc2c5f6c0 RDI: ffff8a2fc1f29000
> > [   97.496227] RBP: ffff8a2fc3395c00 R08: 0000000000000001 R09: ffffa30000d23bf8
> > [   97.502543] R10: ffff8a2fc35adc00 R11: 0000000000000000 R12: ffff8a2fc1272000
> > [   97.512213] R13: ffff8a2fc3329c00 R14: ffff8a2fc3395f40 R15: ffff8a2fc1f29800
> > [   97.518303] FS:  00007f195be24740(0000) GS:ffff8a303dc80000(0000) knlGS:0000000000000000
> > [   97.526884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   97.533142] CR2: ffffffffffffffd6 CR3: 00000000249e6000 CR4: 00000000000006e0
> > 
> Are you using a single root port configuration? If yes, the following
> patch should have fixed the issue:
> https://lore.kernel.org/linux-cxl/20221215170909.2650271-1-fan.ni@samsung.com/
> 
> > [   97.476231] RSP: 0018:ffffa30000d23d20 EFLAGS: 00010292                    

I did not have this patch.  This should definitely make its way up.

~Gregory
