From: Fan Ni <fan.ni@samsung.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "alison.schofield@intel.com" <alison.schofield@intel.com>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"bwidawsk@kernel.org" <bwidawsk@kernel.org>,
	"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	Adam Manzanares <a.manzanares@samsung.com>,
	"dave@stgolabs.net" <dave@stgolabs.net>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] cxl/hdm: Fix hdm decoder init by adding COMMIT field check
Date: Fri, 3 Mar 2023 21:54:54 +0000
Message-ID: <20230303215446.GA1479551@bgt-140510-bm03>
In-Reply-To: <64025f6219d2d_71138294e5@dwillia2-xfh.jf.intel.com.notmuch>

On Fri, Mar 03, 2023 at 12:58:10PM -0800, Dan Williams wrote:

> Fan Ni wrote:
> > Add a COMMIT field check alongside the existing COMMITTED field check during
> > hdm decoder initialization to avoid a system crash during module removal
> > after destroying a region, which leaves the COMMIT field reset while
> > the COMMITTED field is still set.
> > 
> > In the current kernel implementation, when destroying a region (cxl
> > destroy-region), the decoders associated with the region are reset
> > in cxl_decoder_reset, where the COMMIT field is cleared.
> > However, resetting the COMMIT field does not automatically reset the
> > COMMITTED field, causing a situation where COMMIT is reset (0) while
> > COMMITTED is set (1) after the region is destroyed. Later, when
> > init_hdm_decoder is called (during modprobe), the current code only checks
> > COMMITTED to decide whether the decoder is enabled. Since
> > COMMITTED is 1, the code treats the decoder as enabled,
> > which causes unexpected behaviour.
> > 
> > Before the fix, a system crash was observed when performing the following
> > steps:
> > 1. modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem
> > 2. cxl create-region -m -d decoder0.0 -w 1 mem0 -s 256M
> > 3. cxl destroy-region region0 -f
> > 4. rmmod cxl_acpi cxl_pci cxl_port cxl_mem cxl_pmem cxl_core
> > 5. modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem (showing
> > "no CXL window for range 0x0:0xffffffffffffffff" error message)
> > 6. rmmod cxl_acpi cxl_pci cxl_port cxl_mem cxl_pmem cxl_core (kernel
> > crash at cxl_dpa_release because dpa_res has been freed when destroying
> > the region).
> 
> I think a separate fix for that crash is needed, can you send the
> backtrace? I.e. I worry that crash can be triggered by other means.

Hi Dan,

See the backtrace below.

[  130.299394] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  130.299907] #PF: supervisor read access in kernel mode
[  130.299907] #PF: error_code(0x0000) - not-present page
[  130.299907] PGD 0 P4D 0 
[  130.299907] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  130.299907] CPU: 13 PID: 467 Comm: rmmod Not tainted 6.2.0-rc6-00024-g3ea761ec9dd5 #58
[  130.299907] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[  130.299907] RIP: 0010:__cxl_dpa_release+0x3c/0xb0 [cxl_core]
[  130.299907] Code: ff ff 48 8b 7d 40 4c 8b a8 d8 02 00 00 e8 5c a6 ff ff 4c 8b a5 28 03 00 00 48 89 c3 48 8b 85 20 03 00 00 4d 8b ad 40 03 00 00 <48> 8b 50 08 4c 8b 30 49 81 c5 90 00 00 00 4c 89 ef 48 83 c2 01 4c
[  130.299907] RSP: 0018:ffffc9000075fae0 EFLAGS: 00000246
[  130.299907] RAX: 0000000000000000 RBX: ffff88810250cc00 RCX: 0000000000000000
[  130.299907] RDX: 0000000000000001 RSI: ffff8881008d25e8 RDI: ffff88810250cc00
[  130.299907] RBP: ffff88810250d000 R08: 0000000000000001 R09: ffffffff8182b400
[  130.299907] R10: ffff888101fd7238 R11: ffff888201c1f406 R12: 0000000000000000
[  130.299907] R13: 0000000000000000 R14: ffff88810250ce90 R15: ffff88810250ce8c
[  130.299907] FS:  00007f53b3884c40(0000) GS:ffff888277d40000(0000) knlGS:0000000000000000
[  130.299907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  130.299907] CR2: 0000000000000008 CR3: 000000010285c000 CR4: 00000000000006e0
[  130.299907] Call Trace:
[  130.299907]  <TASK>
[  130.299907]  cxl_dpa_release+0x18/0x30 [cxl_core]
[  130.299907]  release_nodes+0x40/0x70
[  130.299907]  devres_release_all+0x86/0xc0
[  130.299907]  device_unbind_cleanup+0x9/0x70
[  130.299907]  device_release_driver_internal+0xe9/0x160
[  130.299907]  bus_remove_device+0xd3/0x140
[  130.299907]  device_del+0x186/0x3d0
[  130.299907]  ? _raw_spin_unlock_irqrestore+0x16/0x30
[  130.299907]  ? devres_remove+0xcb/0xf0
[  130.299907]  device_unregister+0xe/0x60
[  130.299907]  ? __pfx_devm_action_release+0x10/0x10
[  130.299907]  devres_release+0x22/0x50
[  130.299907]  devm_release_action+0x33/0x60
[  130.299907]  ? __pfx_unregister_port+0x10/0x10 [cxl_core]
[  130.299907]  delete_endpoint+0x7a/0x80 [cxl_core]
[  130.299907]  release_nodes+0x40/0x70
[  130.299907]  devres_release_all+0x86/0xc0
[  130.299907]  device_unbind_cleanup+0x9/0x70
[  130.299907]  device_release_driver_internal+0xe9/0x160
[  130.299907]  bus_remove_device+0xd3/0x140
[  130.299907]  device_del+0x186/0x3d0
[  130.299907]  cdev_device_del+0x10/0x30
[  130.299907]  cxl_memdev_unregister+0x36/0x40 [cxl_core]
[  130.299907]  release_nodes+0x40/0x70
[  130.299907]  devres_release_all+0x86/0xc0
[  130.299907]  device_unbind_cleanup+0x9/0x70
[  130.299907]  device_release_driver_internal+0xe9/0x160
[  130.299907]  driver_detach+0x3f/0x80
[  130.299907]  bus_remove_driver+0x50/0xd0
[  130.299907]  pci_unregister_driver+0x36/0x80
[  130.299907]  __x64_sys_delete_module+0x191/0x270
[  130.299907]  ? fpregs_assert_state_consistent+0x1d/0x50
[  130.299907]  ? exit_to_user_mode_prepare+0x36/0x120
[  130.299907]  do_syscall_64+0x3b/0x90
[  130.299907]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  130.299907] RIP: 0033:0x7f53b3126c9b
[  130.299907] Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
[  130.299907] RSP: 002b:00007fff5a72c558 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  130.299907] RAX: ffffffffffffffda RBX: 000056037a16e790 RCX: 00007f53b3126c9b
[  130.299907] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000056037a16e7f8
[  130.299907] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  130.299907] R10: 00007f53b31beac0 R11: 0000000000000206 R12: 00007fff5a72c7b0
[  130.299907] R13: 000056037a16d2a0 R14: 00007fff5a72cdb7 R15: 000056037a16e790
[  130.299907]  </TASK>
[  130.299907] Modules linked in: cxl_mem cxl_pmem cxl_port cxl_pci(-) cxl_acpi cxl_core dax_pmem nd_pmem nd_btt [last unloaded: cxl_core]
[  130.299907] CR2: 0000000000000008
[  130.357813] ---[ end trace 0000000000000000 ]---
[  130.358811] RIP: 0010:__cxl_dpa_release+0x3c/0xb0 [cxl_core]
[  130.360039] Code: ff ff 48 8b 7d 40 4c 8b a8 d8 02 00 00 e8 5c a6 ff ff 4c 8b a5 28 03 00 00 48 89 c3 48 8b 85 20 03 00 00 4d 8b ad 40 03 00 00 <48> 8b 50 08 4c 8b 30 49 81 c5 90 00 00 00 4c 89 ef 48 83 c2 01 4c
[  130.363227] RSP: 0018:ffffc9000075fae0 EFLAGS: 00000246
[  130.364292] RAX: 0000000000000000 RBX: ffff88810250cc00 RCX: 0000000000000000
[  130.365400] RDX: 0000000000000001 RSI: ffff8881008d25e8 RDI: ffff88810250cc00
[  130.366645] RBP: ffff88810250d000 R08: 0000000000000001 R09: ffffffff8182b400
[  130.368025] R10: ffff888101fd7238 R11: ffff888201c1f406 R12: 0000000000000000
[  130.369337] R13: 0000000000000000 R14: ffff88810250ce90 R15: ffff88810250ce8c
[  130.370531] FS:  00007f53b3884c40(0000) GS:ffff888277d40000(0000) knlGS:0000000000000000
[  130.372515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  130.373567] CR2: 0000000000000008 CR3: 000000010285c000 CR4: 00000000000006e0
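
For anyone decoding the oops: the faulting instruction bytes "<48> 8b 50 08" are a 64-bit load from offset 8 of RAX, and RAX is 0, which matches CR2 = 0x8. That is consistent with reading a field at offset 8 through a NULL pointer. The snippet below is only a minimal sketch of that failure mode, not a proposed fix. The struct and helper names are hypothetical, and the two-field layout is an assumption chosen so that the second field lands at offset 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a start/end range descriptor; the layout
 * is an assumption for illustration only. */
struct range_sketch {
	uint64_t start;	/* offset 0 */
	uint64_t end;	/* offset 8 */
};

/* The second field sits at offset 8, so res->end on a NULL pointer
 * would be a read from address 0x8. */
static_assert(offsetof(struct range_sketch, end) == 8,
	      "end is expected at offset 8 in this sketch");

/* Returns the inclusive size of the range, or 0 when no range is
 * attached, instead of dereferencing a NULL pointer. */
static uint64_t range_size_or_zero(const struct range_sketch *res)
{
	if (!res)
		return 0;
	return res->end - res->start + 1;
}
```

Whether a guard like this belongs in the release path, or whether the stale decoder state should never reach it, is exactly the open question in this thread.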

> 
> > 
> > The patch fixes the above issue and was tested on top of the following patch series:
> > 
> > [PATCH 00/18] CXL RAM and the 'Soft Reserved' => 'System RAM' default
> > Message-ID: 167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com
> > 
> > Signed-off-by: Fan Ni <fan.ni@samsung.com>
> > ---
> >  drivers/cxl/core/hdm.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 80eccae6ba9e..6cf854c949f0 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -695,6 +695,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
> >  	struct cxl_endpoint_decoder *cxled = NULL;
> >  	u64 size, base, skip, dpa_size;
> >  	bool committed;
> > +	bool should_commit;
> >  	u32 remainder;
> >  	int i, rc;
> >  	u32 ctrl;
> > @@ -710,10 +711,11 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
> >  	base = ioread64_hi_lo(hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(which));
> >  	size = ioread64_hi_lo(hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(which));
> >  	committed = !!(ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED);
> > +	should_commit = !!(ctrl & CXL_HDM_DECODER0_CTRL_COMMIT);
> 
> This change looks like a good idea in general given the ambiguity of
> 'committed'. However just combine the two checks into the @committed
> variable with something like this:
> 
> commit_mask = CXL_HDM_DECODER0_CTRL_COMMITTED|CXL_HDM_DECODER0_CTRL_COMMIT;
> committed = (ctrl & commit_mask) == commit_mask;
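
To make sure I read the suggestion correctly, here is a minimal standalone sketch of the combined check. The bit positions in the two defines are assumptions for illustration (the driver's own header definitions are what init_hdm_decoder would use), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only; the real definitions
 * come from the driver's headers. */
#define CXL_HDM_DECODER0_CTRL_COMMIT    (UINT32_C(1) << 9)
#define CXL_HDM_DECODER0_CTRL_COMMITTED (UINT32_C(1) << 10)

/* Hypothetical helper: treat a decoder as committed only when both
 * COMMIT and COMMITTED are set in the control register value. */
static bool hdm_ctrl_is_committed(uint32_t ctrl)
{
	const uint32_t commit_mask = CXL_HDM_DECODER0_CTRL_COMMITTED |
				     CXL_HDM_DECODER0_CTRL_COMMIT;

	return (ctrl & commit_mask) == commit_mask;
}

/* Hypothetical self-check walking the four COMMIT/COMMITTED states. */
static void hdm_ctrl_selfcheck(void)
{
	/* Both bits set: decoder is treated as enabled. */
	assert(hdm_ctrl_is_committed(CXL_HDM_DECODER0_CTRL_COMMIT |
				     CXL_HDM_DECODER0_CTRL_COMMITTED));
	/* State left behind by destroy-region: COMMITTED without COMMIT. */
	assert(!hdm_ctrl_is_committed(CXL_HDM_DECODER0_CTRL_COMMITTED));
	/* COMMIT requested but not yet COMMITTED. */
	assert(!hdm_ctrl_is_committed(CXL_HDM_DECODER0_CTRL_COMMIT));
	/* Neither bit set. */
	assert(!hdm_ctrl_is_committed(0));
}
```

With this form, the stale state from the reproduction steps (COMMIT = 0, COMMITTED = 1) is no longer treated as an enabled decoder, and no extra should_commit variable is needed.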

