* [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
@ 2020-01-08 11:34 Luo Jiaxing
  2020-01-08 11:53 ` John Garry
  2020-01-08 12:26 ` Greg KH
  0 siblings, 2 replies; 14+ messages in thread
From: Luo Jiaxing @ 2020-01-08 11:34 UTC (permalink / raw)
  To: gregkh, saravanak, jejb, James.Bottomley, James.Bottomley, john.garry
  Cc: linux-kernel, luojiaxing, linuxarm

We found that with the kernel compilation options CONFIG_SCSI_ENCLOSURE and
CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and deletion of the
same SCSI device causes a system panic, as follows:
[72.425705] Unable to handle kernel paging request at virtual address
dead000000000108
...
[72.595093] Call trace:
[72.597532] device_del + 0x194 / 0x3a0
[72.601012] enclosure_remove_device + 0xbc / 0xf8
[72.605445] ses_intf_remove + 0x9c / 0xd8
[72.609185] device_del + 0xf8 / 0x3a0
[72.612576] device_unregister + 0x14 / 0x30
[72.616489] __scsi_remove_device + 0xf4 / 0x140
[72.620747] scsi_remove_device + 0x28 / 0x40
[72.624745] scsi_remove_target + 0x1c8 / 0x220

After analysis, we see that in the error scenario, the ses module has the
following calling sequence:
device_register() -> device_del() -> device_add() -> device_del().
The first call to device_del() is fine, but the second call to device_del()
will cause a system panic.

Through disassembly, we located that the panic happens when
device_links_purge() calls list_del() to remove device_links.needs_suppliers
from its list, and list_del() sets this list entry's prev and next pointers
to poison values. So if INIT_LIST_HEAD() is not re-executed before the next
list_del(), the system will access a poisoned memory address.
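
For reference, the two list helpers differ only in what they leave behind in
the removed entry; a simplified sketch of include/linux/list.h, with the
poison values as they come out on this arm64 config:

static inline void list_del(struct list_head *entry)
{
	__list_del_entry(entry);
	entry->next = LIST_POISON1;	/* 0xdead000000000100 here */
	entry->prev = LIST_POISON2;	/* 0xdead000000000122 here */
}

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);
	INIT_LIST_HEAD(entry);		/* entry points back at itself */
}

A second list_del() on the same entry then writes entry->next->prev, i.e.
LIST_POISON1 plus the offset of the prev pointer, which matches the faulting
address dead000000000108 above.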

Therefore, replacing list_del() with list_del_init() avoids this issue.

Fixes: e2ae9bcc4aaa ("driver core: Add support for linking devices during device addition")
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
---
 drivers/base/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 42a6724..7b9b0d6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1103,7 +1103,7 @@ static void device_links_purge(struct device *dev)
 	struct device_link *link, *ln;
 
 	mutex_lock(&wfs_lock);
-	list_del(&dev->links.needs_suppliers);
+	list_del_init(&dev->links.needs_suppliers);
 	mutex_unlock(&wfs_lock);
 
 	/*
-- 
2.7.4



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 11:34 [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge() Luo Jiaxing
@ 2020-01-08 11:53 ` John Garry
  2020-01-08 12:26 ` Greg KH
  1 sibling, 0 replies; 14+ messages in thread
From: John Garry @ 2020-01-08 11:53 UTC (permalink / raw)
  To: Luo Jiaxing, gregkh, saravanak, jejb, James.Bottomley, James.Bottomley
  Cc: linux-kernel, linuxarm, linux-scsi, Martin K . Petersen

On 08/01/2020 11:34, Luo Jiaxing wrote:

+ linux-scsi, Martin

> We found that with the kernel compilation options CONFIG_SCSI_ENCLOSURE and
> CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and deletion of the
> same SCSI device causes a system panic, as follows:
> [72.425705] Unable to handle kernel paging request at virtual address
> dead000000000108
> ...
> [72.595093] Call trace:
> [72.597532] device_del + 0x194 / 0x3a0
> [72.601012] enclosure_remove_device + 0xbc / 0xf8
> [72.605445] ses_intf_remove + 0x9c / 0xd8
> [72.609185] device_del + 0xf8 / 0x3a0
> [72.612576] device_unregister + 0x14 / 0x30
> [72.616489] __scsi_remove_device + 0xf4 / 0x140
> [72.620747] scsi_remove_device + 0x28 / 0x40
> [72.624745] scsi_remove_target + 0x1c8 / 0x220

Please share the full crash stack trace and the commands used to trigger
it. Some people also prefer the timestamps removed.

> 
> After analysis, we see that in the error scenario, the ses module has the
> following calling sequence:
> device_register() -> device_del() -> device_add() -> device_del().
> The first call to device_del() is fine, but the second call to device_del()
> will cause a system panic.
> 
> Through disassembly, we located that the panic happens when
> device_links_purge() calls list_del() to remove device_links.needs_suppliers
> from its list, and list_del() sets this list entry's prev and next pointers
> to poison values. So if INIT_LIST_HEAD() is not re-executed before the next
> list_del(), the system will access a poisoned memory address.
> 
> Therefore, replacing list_del() with list_del_init() avoids this issue.
> 
> Fixes: e2ae9bcc4aaa ("driver core: Add support for linking devices during device addition")
> Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
> Reviewed-by: John Garry <john.garry@huawei.com>

This tag was only implicitly granted, but I thought that the fix looked 
ok, so:

Reviewed-by: John Garry <john.garry@huawei.com>

> ---
>   drivers/base/core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 42a6724..7b9b0d6 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1103,7 +1103,7 @@ static void device_links_purge(struct device *dev)
>   	struct device_link *link, *ln;
>   
>   	mutex_lock(&wfs_lock);
> -	list_del(&dev->links.needs_suppliers);
> +	list_del_init(&dev->links.needs_suppliers);
>   	mutex_unlock(&wfs_lock);
>   
>   	/*
> 



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 11:34 [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge() Luo Jiaxing
  2020-01-08 11:53 ` John Garry
@ 2020-01-08 12:26 ` Greg KH
  2020-01-08 14:50   ` John Garry
  1 sibling, 1 reply; 14+ messages in thread
From: Greg KH @ 2020-01-08 12:26 UTC (permalink / raw)
  To: Luo Jiaxing
  Cc: saravanak, jejb, James.Bottomley, James.Bottomley, john.garry,
	linux-kernel, linuxarm

On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> We found that with the kernel compilation options CONFIG_SCSI_ENCLOSURE and
> CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and deletion of the
> same SCSI device causes a system panic, as follows:
> [72.425705] Unable to handle kernel paging request at virtual address
> dead000000000108
> ...
> [72.595093] Call trace:
> [72.597532] device_del + 0x194 / 0x3a0
> [72.601012] enclosure_remove_device + 0xbc / 0xf8
> [72.605445] ses_intf_remove + 0x9c / 0xd8
> [72.609185] device_del + 0xf8 / 0x3a0
> [72.612576] device_unregister + 0x14 / 0x30
> [72.616489] __scsi_remove_device + 0xf4 / 0x140
> [72.620747] scsi_remove_device + 0x28 / 0x40
> [72.624745] scsi_remove_target + 0x1c8 / 0x220
> 
> After analysis, we see that in the error scenario, the ses module has the
> following calling sequence:
> device_register() -> device_del() -> device_add() -> device_del().
> The first call to device_del() is fine, but the second call to device_del()
> will cause a system panic.

Is this all on the same device structure?  If so, that's not ok, you
can't do that, once device_del() is called on the memory location, you
can not call device_add() on it again.
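
Roughly, the only supported lifecycle for a dynamically allocated struct
device looks like this (a sketch of the driver model rules, not code from
this thread):

	struct device *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

	device_initialize(dev);
	/* set dev->release, dev_set_name(), parent, bus, ... */
	if (device_add(dev)) {		/* device goes live here */
		put_device(dev);	/* add failed: drop the initial ref */
		return;
	}
	...
	device_del(dev);		/* tear the device down */
	put_device(dev);		/* final ref drop frees it via ->release */

Once device_del() has run, the only legal thing left to do with the
structure is put_device(); feeding it back into device_add() is not
supported.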

How are you triggering this from userspace?

thanks,

greg k-h


* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 12:26 ` Greg KH
@ 2020-01-08 14:50   ` John Garry
  2020-01-08 15:44     ` Greg KH
  2020-01-08 15:51     ` James Bottomley
  0 siblings, 2 replies; 14+ messages in thread
From: John Garry @ 2020-01-08 14:50 UTC (permalink / raw)
  To: Greg KH, Luo Jiaxing
  Cc: saravanak, jejb, James.Bottomley, linux-kernel, linuxarm,
	linux-scsi, Martin K . Petersen

On 08/01/2020 12:26, Greg KH wrote:
> On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> > We found that with the kernel compilation options CONFIG_SCSI_ENCLOSURE and
> > CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and deletion of the
> > same SCSI device causes a system panic, as follows:
>> [72.425705] Unable to handle kernel paging request at virtual address
>> dead000000000108
>> ...
>> [72.595093] Call trace:
>> [72.597532] device_del + 0x194 / 0x3a0
>> [72.601012] enclosure_remove_device + 0xbc / 0xf8
>> [72.605445] ses_intf_remove + 0x9c / 0xd8
>> [72.609185] device_del + 0xf8 / 0x3a0
>> [72.612576] device_unregister + 0x14 / 0x30
>> [72.616489] __scsi_remove_device + 0xf4 / 0x140
>> [72.620747] scsi_remove_device + 0x28 / 0x40
>> [72.624745] scsi_remove_target + 0x1c8 / 0x220
>>
>> After analysis, we see that in the error scenario, the ses module has the
>> following calling sequence:
>> device_register() -> device_del() -> device_add() -> device_del().
>> The first call to device_del() is fine, but the second call to device_del()
>> will cause a system panic.
> 
> Is this all on the same device structure?  If so, that's not ok, you
> can't do that, once device_del() is called on the memory location, you
> can not call device_add() on it again.
> 
> How are you triggering this from userspace?

This can be triggered by causing the SCSI device to be lost, found, and 
lost again:

root@(none)$ pwd
/sys/class/sas_phy/phy-0:0:2
root@(none)$ echo 0 > enable
[   48.828139] sas: smp_execute_task_sg: task to dev 500e004aaaaaaa1f 
response: 0x0 status 0x2
root@(none)$
[   48.837040] sas: ex 500e004aaaaaaa1f phy02 change count has changed
[   48.846961] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   48.852120] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: 
hostbyte=0x04 driverbyte=0x00
[   48.898111] hisi_sas_v3_hw 0000:74:02.0: dev[2:1] is gone

root@(none)$ echo 1 > enable
root@(none)$
[   51.967416] sas: ex 500e004aaaaaaa1f phy02 change count has changed
[   51.974022] hisi_sas_v3_hw 0000:74:02.0: dev[7:1] found
[   51.991305] scsi 0:0:5:0: Direct-Access     SEAGATE  ST2000NM0045 
N004 PQ: 0 ANSI: 6
[   52.003609] sd 0:0:5:0: [sda] 3907029168 512-byte logical blocks: 
(2.00 TB/1.82 TiB)
[   52.012010] sd 0:0:5:0: [sda] Write Protect is off
[   52.022643] sd 0:0:5:0: [sda] Write cache: enabled, read cache: 
enabled, supports DPO and FUA
[   52.052429]  sda: sda1
[   52.064439] sd 0:0:5:0: [sda] Attached SCSI disk

root@(none)$ echo 0 > enable
[   54.112100] sas: smp_execute_task_sg: task to dev 500e004aaaaaaa1f 
response: 0x0 status 0x2
root@(none)$ [   54.120909] sas: ex 500e004aaaaaaa1f phy02 change count 
has changed
[   54.130202] Unable to handle kernel paging request at virtual address 
dead000000000108
[   54.138110] Mem abort info:
[   54.140892]   ESR = 0x96000044
[   54.143936]   EC = 0x25: DABT (current EL), IL = 32 bits
[   54.149236]   SET = 0, FnV = 0
[   54.152278]   EA = 0, S1PTW = 0
[   54.155408] Data abort info:
[   54.158275]   ISV = 0, ISS = 0x00000044
[   54.162098]   CM = 0, WnR = 1
[   54.165055] [dead000000000108] address between user and kernel 
address ranges
[   54.172179] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[   54.177737] Modules linked in:
[   54.180780] CPU: 5 PID: 741 Comm: kworker/u192:2 Not tainted 
5.5.0-rc5-dirty #1535
[   54.188334] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[   54.196847] Workqueue: 0000:74:02.0_disco_q sas_revalidate_domain
[   54.202927] pstate: 60c00009 (nZCv daif +PAN +UAO)
[   54.207705] pc : device_del+0x194/0x398
[   54.211527] lr : device_del+0x190/0x398
[   54.215349] sp : ffff80001cc7bb20
[   54.218650] x29: ffff80001cc7bb20 x28: ffff0023be042188
[   54.223948] x27: ffff0023c04c0000 x26: ffff0023be042000
[   54.229246] x25: ffff8000119f0f30 x24: ffff0023be268a30
[   54.234544] x23: ffff0023be268018 x22: ffff800011879000
[   54.239842] x21: ffff8000119f0000 x20: ffff8000119f06e0
[   54.245140] x19: ffff0023be268990 x18: 0000000000000004
[   54.250438] x17: 0000000000000007 x16: 0000000000000001
[   54.255736] x15: ffff0023eac13610 x14: ffff0023eb74a7f8
[   54.261034] x13: 0000000000000000 x12: ffff0023eac13610
[   54.266332] x11: ffff0023eb74a6c8 x10: 0000000000000000
[   54.271630] x9 : ffff0023eac13618 x8 : 0000000040040000
[   54.276928] x7 : 0000000000000000 x6 : ffff0023be268a90
[   54.282226] x5 : ffff0023be74aa00 x4 : 0000000000000000
[   54.287524] x3 : ffff8000119f0f30 x2 : dead000000000100
[   54.292821] x1 : dead000000000122 x0 : 0000000000000000
[   54.298119] Call trace:
[   54.300553]  device_del+0x194/0x398
[   54.304030]  enclosure_remove_device+0xb4/0x100
[   54.308548]  ses_intf_remove+0x98/0xd8
[   54.312283]  device_del+0xfc/0x398
[   54.315671]  device_unregister+0x14/0x30
[   54.319580]  __scsi_remove_device+0xf0/0x130
[   54.323836]  scsi_remove_device+0x28/0x40
[   54.327832]  scsi_remove_target+0x1bc/0x250
[   54.332002]  sas_rphy_remove+0x5c/0x60
[   54.335738]  sas_rphy_delete+0x14/0x28
[   54.339473]  sas_destruct_devices+0x5c/0x98
[   54.343642]  sas_revalidate_domain+0xa0/0x178
[   54.347986]  process_one_work+0x1e0/0x358
[   54.351982]  worker_thread+0x40/0x488
[   54.355631]  kthread+0x118/0x120
[   54.358846]  ret_from_fork+0x10/0x18
[   54.362410] Code: 91028278 aa1903e0 9415f01f a94c0662 (f9000441)
[   54.368489] ---[ end trace 38c672fcf89c95f7 ]---

I tested on v5.4 and no such issue, but maybe the driver core changes 
have exposed a ses/enclosure issue.

Checking:

int enclosure_remove_device(struct enclosure_device *edev, struct device 
*dev)
{
	struct enclosure_component *cdev;
	int i;

	if (!edev || !dev)
		return -EINVAL;

	for (i = 0; i < edev->components; i++) {
		cdev = &edev->component[i];
		if (cdev->dev == dev) {
			enclosure_remove_links(cdev);
			device_del(&cdev->cdev);
			put_device(dev);
			cdev->dev = NULL;
			return device_add(&cdev->cdev);
		}
	}
	return -ENODEV;
}

This has device_del(&cdev->cdev) followed by device_add(&cdev->cdev).

This cdev.dev memory looks to be dynamically allocated for the lifetime 
of the enclosure_device.
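
For reference, enclosure_register() appears to allocate the whole component
array inline with the enclosure_device, something like:

	edev = kzalloc(sizeof(struct enclosure_device) +
		       sizeof(struct enclosure_component) * components,
		       GFP_KERNEL);

so each component's embedded struct device (cdev->cdev here) lives exactly
as long as the enclosure itself.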

John


* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 14:50   ` John Garry
@ 2020-01-08 15:44     ` Greg KH
  2020-01-08 15:51     ` James Bottomley
  1 sibling, 0 replies; 14+ messages in thread
From: Greg KH @ 2020-01-08 15:44 UTC (permalink / raw)
  To: John Garry
  Cc: Luo Jiaxing, saravanak, jejb, James.Bottomley, linux-kernel,
	linuxarm, linux-scsi, Martin K . Petersen

On Wed, Jan 08, 2020 at 02:50:54PM +0000, John Garry wrote:
> On 08/01/2020 12:26, Greg KH wrote:
> > On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> > > We found that with the kernel compilation options CONFIG_SCSI_ENCLOSURE and
> > > CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and deletion of
> > > the same SCSI device causes a system panic, as follows:
> > > [72.425705] Unable to handle kernel paging request at virtual address
> > > dead000000000108
> > > ...
> > > [72.595093] Call trace:
> > > [72.597532] device_del + 0x194 / 0x3a0
> > > [72.601012] enclosure_remove_device + 0xbc / 0xf8
> > > [72.605445] ses_intf_remove + 0x9c / 0xd8
> > > [72.609185] device_del + 0xf8 / 0x3a0
> > > [72.612576] device_unregister + 0x14 / 0x30
> > > [72.616489] __scsi_remove_device + 0xf4 / 0x140
> > > [72.620747] scsi_remove_device + 0x28 / 0x40
> > > [72.624745] scsi_remove_target + 0x1c8 / 0x220
> > > 
> > > After analysis, we see that in the error scenario, the ses module has the
> > > following calling sequence:
> > > device_register() -> device_del() -> device_add() -> device_del().
> > > The first call to device_del() is fine, but the second call to device_del()
> > > will cause a system panic.
> > 
> > Is this all on the same device structure?  If so, that's not ok, you
> > can't do that, once device_del() is called on the memory location, you
> > can not call device_add() on it again.
> > 
> > How are you triggering this from userspace?
> 
> This can be triggered by causing the SCSI device to be lost, found, and lost
> again:
> 
> root@(none)$ pwd
> /sys/class/sas_phy/phy-0:0:2
> root@(none)$ echo 0 > enable
> [   48.828139] sas: smp_execute_task_sg: task to dev 500e004aaaaaaa1f
> response: 0x0 status 0x2
> root@(none)$
> [   48.837040] sas: ex 500e004aaaaaaa1f phy02 change count has changed
> [   48.846961] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   48.852120] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result:
> hostbyte=0x04 driverbyte=0x00
> [   48.898111] hisi_sas_v3_hw 0000:74:02.0: dev[2:1] is gone
> 
> root@(none)$ echo 1 > enable
> root@(none)$
> [   51.967416] sas: ex 500e004aaaaaaa1f phy02 change count has changed
> [   51.974022] hisi_sas_v3_hw 0000:74:02.0: dev[7:1] found
> [   51.991305] scsi 0:0:5:0: Direct-Access     SEAGATE  ST2000NM0045 N004
> PQ: 0 ANSI: 6
> [   52.003609] sd 0:0:5:0: [sda] 3907029168 512-byte logical blocks: (2.00
> TB/1.82 TiB)
> [   52.012010] sd 0:0:5:0: [sda] Write Protect is off
> [   52.022643] sd 0:0:5:0: [sda] Write cache: enabled, read cache: enabled,
> supports DPO and FUA
> [   52.052429]  sda: sda1
> [   52.064439] sd 0:0:5:0: [sda] Attached SCSI disk
> 
> root@(none)$ echo 0 > enable
> [   54.112100] sas: smp_execute_task_sg: task to dev 500e004aaaaaaa1f
> response: 0x0 status 0x2
> root@(none)$ [   54.120909] sas: ex 500e004aaaaaaa1f phy02 change count has
> changed
> [   54.130202] Unable to handle kernel paging request at virtual address
> dead000000000108
> [   54.138110] Mem abort info:
> [   54.140892]   ESR = 0x96000044
> [   54.143936]   EC = 0x25: DABT (current EL), IL = 32 bits
> [   54.149236]   SET = 0, FnV = 0
> [   54.152278]   EA = 0, S1PTW = 0
> [   54.155408] Data abort info:
> [   54.158275]   ISV = 0, ISS = 0x00000044
> [   54.162098]   CM = 0, WnR = 1
> [   54.165055] [dead000000000108] address between user and kernel address
> ranges
> [   54.172179] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [   54.177737] Modules linked in:
> [   54.180780] CPU: 5 PID: 741 Comm: kworker/u192:2 Not tainted
> 5.5.0-rc5-dirty #1535
> [   54.188334] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> V1.16.01 03/15/2019
> [   54.196847] Workqueue: 0000:74:02.0_disco_q sas_revalidate_domain
> [   54.202927] pstate: 60c00009 (nZCv daif +PAN +UAO)
> [   54.207705] pc : device_del+0x194/0x398
> [   54.211527] lr : device_del+0x190/0x398
> [   54.215349] sp : ffff80001cc7bb20
> [   54.218650] x29: ffff80001cc7bb20 x28: ffff0023be042188
> [   54.223948] x27: ffff0023c04c0000 x26: ffff0023be042000
> [   54.229246] x25: ffff8000119f0f30 x24: ffff0023be268a30
> [   54.234544] x23: ffff0023be268018 x22: ffff800011879000
> [   54.239842] x21: ffff8000119f0000 x20: ffff8000119f06e0
> [   54.245140] x19: ffff0023be268990 x18: 0000000000000004
> [   54.250438] x17: 0000000000000007 x16: 0000000000000001
> [   54.255736] x15: ffff0023eac13610 x14: ffff0023eb74a7f8
> [   54.261034] x13: 0000000000000000 x12: ffff0023eac13610
> [   54.266332] x11: ffff0023eb74a6c8 x10: 0000000000000000
> [   54.271630] x9 : ffff0023eac13618 x8 : 0000000040040000
> [   54.276928] x7 : 0000000000000000 x6 : ffff0023be268a90
> [   54.282226] x5 : ffff0023be74aa00 x4 : 0000000000000000
> [   54.287524] x3 : ffff8000119f0f30 x2 : dead000000000100
> [   54.292821] x1 : dead000000000122 x0 : 0000000000000000
> [   54.298119] Call trace:
> [   54.300553]  device_del+0x194/0x398
> [   54.304030]  enclosure_remove_device+0xb4/0x100
> [   54.308548]  ses_intf_remove+0x98/0xd8
> [   54.312283]  device_del+0xfc/0x398
> [   54.315671]  device_unregister+0x14/0x30
> [   54.319580]  __scsi_remove_device+0xf0/0x130
> [   54.323836]  scsi_remove_device+0x28/0x40
> [   54.327832]  scsi_remove_target+0x1bc/0x250
> [   54.332002]  sas_rphy_remove+0x5c/0x60
> [   54.335738]  sas_rphy_delete+0x14/0x28
> [   54.339473]  sas_destruct_devices+0x5c/0x98
> [   54.343642]  sas_revalidate_domain+0xa0/0x178
> [   54.347986]  process_one_work+0x1e0/0x358
> [   54.351982]  worker_thread+0x40/0x488
> [   54.355631]  kthread+0x118/0x120
> [   54.358846]  ret_from_fork+0x10/0x18
> [   54.362410] Code: 91028278 aa1903e0 9415f01f a94c0662 (f9000441)
> [   54.368489] ---[ end trace 38c672fcf89c95f7 ]---
> 
> I tested on v5.4 and no such issue, but maybe the driver core changes have
> exposed a ses/enclosure issue.
> 
> Checking:
> 
> int enclosure_remove_device(struct enclosure_device *edev, struct device
> *dev)
> {
> 	struct enclosure_component *cdev;
> 	int i;
> 
> 	if (!edev || !dev)
> 		return -EINVAL;
> 
> 	for (i = 0; i < edev->components; i++) {
> 		cdev = &edev->component[i];
> 		if (cdev->dev == dev) {
> 			enclosure_remove_links(cdev);
> 			device_del(&cdev->cdev);
> 			put_device(dev);
> 			cdev->dev = NULL;
> 			return device_add(&cdev->cdev);
> 		}
> 	}
> 	return -ENODEV;
> }
> 
> This has device_del(&cdev->cdev) followed by device_add(&cdev->cdev).

Ugh, that's ripe for problems, as you found.

Yes, your patch will fix this pattern, but the larger problem is that
this sequence might not really work as something else could have had a
reference to the structure (rare, but could happen.)
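
Something like this interleaving, with a hypothetical holder, illustrates
the problem:

	/* some other code path pins the component device */
	struct device *d = get_device(&cdev->cdev);
	...
	device_del(&cdev->cdev);	/* enclosure_remove_device() */
	device_add(&cdev->cdev);	/* same struct, re-registered */
	...
	put_device(d);			/* holder never saw the churn */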

> This cdev.dev memory looks to be dynamically allocated for the lifetime of
> the enclosure_device.

ick.

SCSI people, what do you think?  This "enclosure" code was yours...

thanks,

greg k-h


* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 14:50   ` John Garry
  2020-01-08 15:44     ` Greg KH
@ 2020-01-08 15:51     ` James Bottomley
  2020-01-08 15:57       ` Greg KH
  1 sibling, 1 reply; 14+ messages in thread
From: James Bottomley @ 2020-01-08 15:51 UTC (permalink / raw)
  To: John Garry, Greg KH, Luo Jiaxing
  Cc: saravanak, linux-kernel, linuxarm, linux-scsi, Martin K . Petersen

On Wed, 2020-01-08 at 14:50 +0000, John Garry wrote:
> On 08/01/2020 12:26, Greg KH wrote:
> > On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> > > We found that with the kernel compilation options
> > > CONFIG_SCSI_ENCLOSURE and
> > > CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and
> > > deletion of the same
> > > SCSI device causes a system panic, as follows:
> > > [72.425705] Unable to handle kernel paging request at virtual
> > > address
> > > dead000000000108
> > > ...
> > > [72.595093] Call trace:
> > > [72.597532] device_del + 0x194 / 0x3a0
> > > [72.601012] enclosure_remove_device + 0xbc / 0xf8
> > > [72.605445] ses_intf_remove + 0x9c / 0xd8
> > > [72.609185] device_del + 0xf8 / 0x3a0
> > > [72.612576] device_unregister + 0x14 / 0x30
> > > [72.616489] __scsi_remove_device + 0xf4 / 0x140
> > > [72.620747] scsi_remove_device + 0x28 / 0x40
> > > [72.624745] scsi_remove_target + 0x1c8 / 0x220
> > > 
> > > After analysis, we see that in the error scenario, the ses module
> > > has the
> > > following calling sequence:
> > > device_register() -> device_del() -> device_add() ->
> > > device_del().
> > > The first call to device_del() is fine, but the second call to
> > > device_del()
> > > will cause a system panic.
> > 
> > Is this all on the same device structure?  If so, that's not ok,
> > you
> > can't do that, once device_del() is called on the memory location,
> > you
> > can not call device_add() on it again.
> > 
> > How are you triggering this from userspace?
> 
> This can be triggered by causing the SCSI device to be lost, found,
> and 
> lost again:
> 
> root@(none)$ pwd
> /sys/class/sas_phy/phy-0:0:2
> root@(none)$ echo 0 > enable
> [   48.828139] sas: smp_execute_task_sg: task to dev
> 500e004aaaaaaa1f 
> response: 0x0 status 0x2
> root@(none)$
> [   48.837040] sas: ex 500e004aaaaaaa1f phy02 change count has
> changed
> [   48.846961] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   48.852120] sd 0:0:0:0: [sda] Synchronize Cache(10) failed:
> Result: 
> hostbyte=0x04 driverbyte=0x00
> [   48.898111] hisi_sas_v3_hw 0000:74:02.0: dev[2:1] is gone
> 
> root@(none)$ echo 1 > enable
> root@(none)$
> [   51.967416] sas: ex 500e004aaaaaaa1f phy02 change count has
> changed
> [   51.974022] hisi_sas_v3_hw 0000:74:02.0: dev[7:1] found
> [   51.991305] scsi 0:0:5:0: Direct-Access     SEAGATE  ST2000NM0045 
> N004 PQ: 0 ANSI: 6
> [   52.003609] sd 0:0:5:0: [sda] 3907029168 512-byte logical blocks: 
> (2.00 TB/1.82 TiB)
> [   52.012010] sd 0:0:5:0: [sda] Write Protect is off
> [   52.022643] sd 0:0:5:0: [sda] Write cache: enabled, read cache: 
> enabled, supports DPO and FUA
> [   52.052429]  sda: sda1
> [   52.064439] sd 0:0:5:0: [sda] Attached SCSI disk
> 
> root@(none)$ echo 0 > enable
> [   54.112100] sas: smp_execute_task_sg: task to dev
> 500e004aaaaaaa1f 
> response: 0x0 status 0x2
> root@(none)$ [   54.120909] sas: ex 500e004aaaaaaa1f phy02 change
> count 
> has changed
> [   54.130202] Unable to handle kernel paging request at virtual
> address 
> dead000000000108
> [   54.138110] Mem abort info:
> [   54.140892]   ESR = 0x96000044
> [   54.143936]   EC = 0x25: DABT (current EL), IL = 32 bits
> [   54.149236]   SET = 0, FnV = 0
> [   54.152278]   EA = 0, S1PTW = 0
> [   54.155408] Data abort info:
> [   54.158275]   ISV = 0, ISS = 0x00000044
> [   54.162098]   CM = 0, WnR = 1
> [   54.165055] [dead000000000108] address between user and kernel 
> address ranges
> [   54.172179] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [   54.177737] Modules linked in:
> [   54.180780] CPU: 5 PID: 741 Comm: kworker/u192:2 Not tainted 
> 5.5.0-rc5-dirty #1535
> [   54.188334] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06
> UEFI 
> RC0 - V1.16.01 03/15/2019
> [   54.196847] Workqueue: 0000:74:02.0_disco_q sas_revalidate_domain
> [   54.202927] pstate: 60c00009 (nZCv daif +PAN +UAO)
> [   54.207705] pc : device_del+0x194/0x398
> [   54.211527] lr : device_del+0x190/0x398
> [   54.215349] sp : ffff80001cc7bb20
> [   54.218650] x29: ffff80001cc7bb20 x28: ffff0023be042188
> [   54.223948] x27: ffff0023c04c0000 x26: ffff0023be042000
> [   54.229246] x25: ffff8000119f0f30 x24: ffff0023be268a30
> [   54.234544] x23: ffff0023be268018 x22: ffff800011879000
> [   54.239842] x21: ffff8000119f0000 x20: ffff8000119f06e0
> [   54.245140] x19: ffff0023be268990 x18: 0000000000000004
> [   54.250438] x17: 0000000000000007 x16: 0000000000000001
> [   54.255736] x15: ffff0023eac13610 x14: ffff0023eb74a7f8
> [   54.261034] x13: 0000000000000000 x12: ffff0023eac13610
> [   54.266332] x11: ffff0023eb74a6c8 x10: 0000000000000000
> [   54.271630] x9 : ffff0023eac13618 x8 : 0000000040040000
> [   54.276928] x7 : 0000000000000000 x6 : ffff0023be268a90
> [   54.282226] x5 : ffff0023be74aa00 x4 : 0000000000000000
> [   54.287524] x3 : ffff8000119f0f30 x2 : dead000000000100
> [   54.292821] x1 : dead000000000122 x0 : 0000000000000000
> [   54.298119] Call trace:
> [   54.300553]  device_del+0x194/0x398
> [   54.304030]  enclosure_remove_device+0xb4/0x100
> [   54.308548]  ses_intf_remove+0x98/0xd8
> [   54.312283]  device_del+0xfc/0x398
> [   54.315671]  device_unregister+0x14/0x30
> [   54.319580]  __scsi_remove_device+0xf0/0x130
> [   54.323836]  scsi_remove_device+0x28/0x40
> [   54.327832]  scsi_remove_target+0x1bc/0x250
> [   54.332002]  sas_rphy_remove+0x5c/0x60
> [   54.335738]  sas_rphy_delete+0x14/0x28
> [   54.339473]  sas_destruct_devices+0x5c/0x98
> [   54.343642]  sas_revalidate_domain+0xa0/0x178
> [   54.347986]  process_one_work+0x1e0/0x358
> [   54.351982]  worker_thread+0x40/0x488
> [   54.355631]  kthread+0x118/0x120
> [   54.358846]  ret_from_fork+0x10/0x18
> [   54.362410] Code: 91028278 aa1903e0 9415f01f a94c0662 (f9000441)
> [   54.368489] ---[ end trace 38c672fcf89c95f7 ]---
> 
> I tested on v5.4 and no such issue, but maybe the driver core
> changes 
> have exposed a ses/enclosure issue.
> 
> Checking:
> 
> int enclosure_remove_device(struct enclosure_device *edev, struct
> device 
> *dev)
> {
> 	struct enclosure_component *cdev;
> 	int i;
> 
> 	if (!edev || !dev)
> 		return -EINVAL;
> 
> 	for (i = 0; i < edev->components; i++) {
> 		cdev = &edev->component[i];
> 		if (cdev->dev == dev) {
> 			enclosure_remove_links(cdev);
> 			device_del(&cdev->cdev);
> 			put_device(dev);
> 			cdev->dev = NULL;
> 			return device_add(&cdev->cdev);
> 		}
> 	}
> 	return -ENODEV;
> }

The design of the code is simply to remove the link to the inserted
device which has been removed.

I *think* this means the calls to device_del and device_add are
unnecessary and should go.  enclosure_remove_links and the put of the
enclosed device should be sufficient.

James



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 15:51     ` James Bottomley
@ 2020-01-08 15:57       ` Greg KH
  2020-01-08 16:01         ` James Bottomley
  0 siblings, 1 reply; 14+ messages in thread
From: Greg KH @ 2020-01-08 15:57 UTC (permalink / raw)
  To: James Bottomley
  Cc: John Garry, Luo Jiaxing, saravanak, linux-kernel, linuxarm,
	linux-scsi, Martin K . Petersen

On Wed, Jan 08, 2020 at 07:51:35AM -0800, James Bottomley wrote:
> On Wed, 2020-01-08 at 14:50 +0000, John Garry wrote:
> > On 08/01/2020 12:26, Greg KH wrote:
> > > On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> > > > We found that with the kernel compilation options
> > > > CONFIG_SCSI_ENCLOSURE and
> > > > CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization and
> > > > deletion of the same
> > > > SCSI device causes a system panic, as follows:
> > > > [72.425705] Unable to handle kernel paging request at virtual
> > > > address
> > > > dead000000000108
> > > > ...
> > > > [72.595093] Call trace:
> > > > [72.597532] device_del + 0x194 / 0x3a0
> > > > [72.601012] enclosure_remove_device + 0xbc / 0xf8
> > > > [72.605445] ses_intf_remove + 0x9c / 0xd8
> > > > [72.609185] device_del + 0xf8 / 0x3a0
> > > > [72.612576] device_unregister + 0x14 / 0x30
> > > > [72.616489] __scsi_remove_device + 0xf4 / 0x140
> > > > [72.620747] scsi_remove_device + 0x28 / 0x40
> > > > [72.624745] scsi_remove_target + 0x1c8 / 0x220
> > > > 
> > > > After analysis, we see that in the error scenario, the ses module
> > > > has the
> > > > following calling sequence:
> > > > device_register() -> device_del() -> device_add() ->
> > > > device_del().
> > > > The first call to device_del() is fine, but the second call to
> > > > device_del()
> > > > will cause a system panic.
> > > 
> > > Is this all on the same device structure?  If so, that's not ok,
> > > you
> > > can't do that, once device_del() is called on the memory location,
> > > you
> > > can not call device_add() on it again.
> > > 
> > > How are you triggering this from userspace?
> > 
> > This can be triggered by causing the SCSI device to be lost, found,
> > and 
> > lost again:
> > 
> > root@(none)$ pwd
> > /sys/class/sas_phy/phy-0:0:2
> > root@(none)$ echo 0 > enable
> > [   48.828139] sas: smp_execute_task_sg: task to dev
> > 500e004aaaaaaa1f 
> > response: 0x0 status 0x2
> > root@(none)$
> > [   48.837040] sas: ex 500e004aaaaaaa1f phy02 change count has
> > changed
> > [   48.846961] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > [   48.852120] sd 0:0:0:0: [sda] Synchronize Cache(10) failed:
> > Result: 
> > hostbyte=0x04 driverbyte=0x00
> > [   48.898111] hisi_sas_v3_hw 0000:74:02.0: dev[2:1] is gone
> > 
> > root@(none)$ echo 1 > enable
> > root@(none)$
> > [   51.967416] sas: ex 500e004aaaaaaa1f phy02 change count has
> > changed
> > [   51.974022] hisi_sas_v3_hw 0000:74:02.0: dev[7:1] found
> > [   51.991305] scsi 0:0:5:0: Direct-Access     SEAGATE  ST2000NM0045 
> > N004 PQ: 0 ANSI: 6
> > [   52.003609] sd 0:0:5:0: [sda] 3907029168 512-byte logical blocks: 
> > (2.00 TB/1.82 TiB)
> > [   52.012010] sd 0:0:5:0: [sda] Write Protect is off
> > [   52.022643] sd 0:0:5:0: [sda] Write cache: enabled, read cache: 
> > enabled, supports DPO and FUA
> > [   52.052429]  sda: sda1
> > [   52.064439] sd 0:0:5:0: [sda] Attached SCSI disk
> > 
> > root@(none)$ echo 0 > enable
> > [   54.112100] sas: smp_execute_task_sg: task to dev
> > 500e004aaaaaaa1f 
> > response: 0x0 status 0x2
> > root@(none)$ [   54.120909] sas: ex 500e004aaaaaaa1f phy02 change
> > count 
> > has changed
> > [   54.130202] Unable to handle kernel paging request at virtual
> > address 
> > dead000000000108
> > [   54.138110] Mem abort info:
> > [   54.140892]   ESR = 0x96000044
> > [   54.143936]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [   54.149236]   SET = 0, FnV = 0
> > [   54.152278]   EA = 0, S1PTW = 0
> > [   54.155408] Data abort info:
> > [   54.158275]   ISV = 0, ISS = 0x00000044
> > [   54.162098]   CM = 0, WnR = 1
> > [   54.165055] [dead000000000108] address between user and kernel 
> > address ranges
> > [   54.172179] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> > [   54.177737] Modules linked in:
> > [   54.180780] CPU: 5 PID: 741 Comm: kworker/u192:2 Not tainted 
> > 5.5.0-rc5-dirty #1535
> > [   54.188334] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06
> > UEFI 
> > RC0 - V1.16.01 03/15/2019
> > [   54.196847] Workqueue: 0000:74:02.0_disco_q sas_revalidate_domain
> > [   54.202927] pstate: 60c00009 (nZCv daif +PAN +UAO)
> > [   54.207705] pc : device_del+0x194/0x398
> > [   54.211527] lr : device_del+0x190/0x398
> > [   54.215349] sp : ffff80001cc7bb20
> > [   54.218650] x29: ffff80001cc7bb20 x28: ffff0023be042188
> > [   54.223948] x27: ffff0023c04c0000 x26: ffff0023be042000
> > [   54.229246] x25: ffff8000119f0f30 x24: ffff0023be268a30
> > [   54.234544] x23: ffff0023be268018 x22: ffff800011879000
> > [   54.239842] x21: ffff8000119f0000 x20: ffff8000119f06e0
> > [   54.245140] x19: ffff0023be268990 x18: 0000000000000004
> > [   54.250438] x17: 0000000000000007 x16: 0000000000000001
> > [   54.255736] x15: ffff0023eac13610 x14: ffff0023eb74a7f8
> > [   54.261034] x13: 0000000000000000 x12: ffff0023eac13610
> > [   54.266332] x11: ffff0023eb74a6c8 x10: 0000000000000000
> > [   54.271630] x9 : ffff0023eac13618 x8 : 0000000040040000
> > [   54.276928] x7 : 0000000000000000 x6 : ffff0023be268a90
> > [   54.282226] x5 : ffff0023be74aa00 x4 : 0000000000000000
> > [   54.287524] x3 : ffff8000119f0f30 x2 : dead000000000100
> > [   54.292821] x1 : dead000000000122 x0 : 0000000000000000
> > [   54.298119] Call trace:
> > [   54.300553]  device_del+0x194/0x398
> > [   54.304030]  enclosure_remove_device+0xb4/0x100
> > [   54.308548]  ses_intf_remove+0x98/0xd8
> > [   54.312283]  device_del+0xfc/0x398
> > [   54.315671]  device_unregister+0x14/0x30
> > [   54.319580]  __scsi_remove_device+0xf0/0x130
> > [   54.323836]  scsi_remove_device+0x28/0x40
> > [   54.327832]  scsi_remove_target+0x1bc/0x250
> > [   54.332002]  sas_rphy_remove+0x5c/0x60
> > [   54.335738]  sas_rphy_delete+0x14/0x28
> > [   54.339473]  sas_destruct_devices+0x5c/0x98
> > [   54.343642]  sas_revalidate_domain+0xa0/0x178
> > [   54.347986]  process_one_work+0x1e0/0x358
> > [   54.351982]  worker_thread+0x40/0x488
> > [   54.355631]  kthread+0x118/0x120
> > [   54.358846]  ret_from_fork+0x10/0x18
> > [   54.362410] Code: 91028278 aa1903e0 9415f01f a94c0662 (f9000441)
> > [   54.368489] ---[ end trace 38c672fcf89c95f7 ]---
> > 
> > I tested on v5.4 and no such issue, but maybe the driver core
> > changes 
> > have exposed a ses/enclosure issue.
> > 
> > Checking:
> > 
> > int enclosure_remove_device(struct enclosure_device *edev, struct
> > device 
> > *dev)
> > {
> > 	struct enclosure_component *cdev;
> > 	int i;
> > 
> > 	if (!edev || !dev)
> > 		return -EINVAL;
> > 
> > 	for (i = 0; i < edev->components; i++) {
> > 		cdev = &edev->component[i];
> > 		if (cdev->dev == dev) {
> > 			enclosure_remove_links(cdev);
> > 			device_del(&cdev->cdev);
> > 			put_device(dev);
> > 			cdev->dev = NULL;
> > 			return device_add(&cdev->cdev);
> > 		}
> > 	}
> > 	return -ENODEV;
> > }
> 
> The design of the code is simply to remove the link to the inserted
> device which has been removed.
> 
> I *think* this means the calls to device_del and device_add are
> unnecessary and should go.  enclosure_remove_links and the put of the
> enclosed device should be sufficient.

That would make more sense than trying to "reuse" the device structure
here by tearing it down and adding it back.

thanks,

greg k-h


* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 15:57       ` Greg KH
@ 2020-01-08 16:01         ` James Bottomley
  2020-01-08 16:08           ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: James Bottomley @ 2020-01-08 16:01 UTC (permalink / raw)
  To: Greg KH
  Cc: John Garry, Luo Jiaxing, saravanak, linux-kernel, linuxarm,
	linux-scsi, Martin K . Petersen

On Wed, 2020-01-08 at 16:57 +0100, Greg KH wrote:
> On Wed, Jan 08, 2020 at 07:51:35AM -0800, James Bottomley wrote:
> > On Wed, 2020-01-08 at 14:50 +0000, John Garry wrote:
> > > On 08/01/2020 12:26, Greg KH wrote:
> > > > On Wed, Jan 08, 2020 at 07:34:04PM +0800, Luo Jiaxing wrote:
> > > > > We found that with the kernel compilation options
> > > > > CONFIG_SCSI_ENCLOSURE and
> > > > > CONFIG_ENCLOSURE_SERVICES enabled, repeated initialization
> > > > > and deletion
> > > > > of the same
> > > > > SCSI device causes a system panic, as follows:
> > > > > [72.425705] Unable to handle kernel paging request at virtual
> > > > > address
> > > > > dead000000000108
> > > > > ...
> > > > > [72.595093] Call trace:
> > > > > [72.597532] device_del + 0x194 / 0x3a0
> > > > > [72.601012] enclosure_remove_device + 0xbc / 0xf8
> > > > > [72.605445] ses_intf_remove + 0x9c / 0xd8
> > > > > [72.609185] device_del + 0xf8 / 0x3a0
> > > > > [72.612576] device_unregister + 0x14 / 0x30
> > > > > [72.616489] __scsi_remove_device + 0xf4 / 0x140
> > > > > [72.620747] scsi_remove_device + 0x28 / 0x40
> > > > > [72.624745] scsi_remove_target + 0x1c8 / 0x220
> > > > > 
> > > > > After analysis, we see that in the error scenario, the ses
> > > > > module
> > > > > has the
> > > > > following calling sequence:
> > > > > device_register() -> device_del() -> device_add() ->
> > > > > device_del().
> > > > > The first call to device_del() is fine, but the second call
> > > > > to
> > > > > device_del()
> > > > > will cause a system panic.
> > > > 
> > > > Is this all on the same device structure?  If so, that's not
> > > > ok,
> > > > you
> > > > can't do that, once device_del() is called on the memory
> > > > location,
> > > > you
> > > > can not call device_add() on it again.
> > > > 
> > > > How are you triggering this from userspace?
> > > 
> > > This can be triggered by causing the SCSI device to be lost,
> > > found,
> > > and 
> > > lost again:
> > > 
> > > root@(none)$ pwd
> > > /sys/class/sas_phy/phy-0:0:2
> > > root@(none)$ echo 0 > enable
> > > [   48.828139] sas: smp_execute_task_sg: task to dev
> > > 500e004aaaaaaa1f 
> > > response: 0x0 status 0x2
> > > root@(none)$
> > > [   48.837040] sas: ex 500e004aaaaaaa1f phy02 change count has
> > > changed
> > > [   48.846961] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > > [   48.852120] sd 0:0:0:0: [sda] Synchronize Cache(10) failed:
> > > Result: 
> > > hostbyte=0x04 driverbyte=0x00
> > > [   48.898111] hisi_sas_v3_hw 0000:74:02.0: dev[2:1] is gone
> > > 
> > > root@(none)$ echo 1 > enable
> > > root@(none)$
> > > [   51.967416] sas: ex 500e004aaaaaaa1f phy02 change count has
> > > changed
> > > [   51.974022] hisi_sas_v3_hw 0000:74:02.0: dev[7:1] found
> > > [   51.991305] scsi 0:0:5:0: Direct-
> > > Access     SEAGATE  ST2000NM0045 
> > > N004 PQ: 0 ANSI: 6
> > > [   52.003609] sd 0:0:5:0: [sda] 3907029168 512-byte logical
> > > blocks: 
> > > (2.00 TB/1.82 TiB)
> > > [   52.012010] sd 0:0:5:0: [sda] Write Protect is off
> > > [   52.022643] sd 0:0:5:0: [sda] Write cache: enabled, read
> > > cache: 
> > > enabled, supports DPO and FUA
> > > [   52.052429]  sda: sda1
> > > [   52.064439] sd 0:0:5:0: [sda] Attached SCSI disk
> > > 
> > > root@(none)$ echo 0 > enable
> > > [   54.112100] sas: smp_execute_task_sg: task to dev
> > > 500e004aaaaaaa1f 
> > > response: 0x0 status 0x2
> > > root@(none)$ [   54.120909] sas: ex 500e004aaaaaaa1f phy02 change
> > > count 
> > > has changed
> > > [   54.130202] Unable to handle kernel paging request at virtual
> > > address 
> > > dead000000000108
> > > [   54.138110] Mem abort info:
> > > [   54.140892]   ESR = 0x96000044
> > > [   54.143936]   EC = 0x25: DABT (current EL), IL = 32 bits
> > > [   54.149236]   SET = 0, FnV = 0
> > > [   54.152278]   EA = 0, S1PTW = 0
> > > [   54.155408] Data abort info:
> > > [   54.158275]   ISV = 0, ISS = 0x00000044
> > > [   54.162098]   CM = 0, WnR = 1
> > > [   54.165055] [dead000000000108] address between user and
> > > kernel 
> > > address ranges
> > > [   54.172179] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> > > [   54.177737] Modules linked in:
> > > [   54.180780] CPU: 5 PID: 741 Comm: kworker/u192:2 Not tainted 
> > > 5.5.0-rc5-dirty #1535
> > > [   54.188334] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06
> > > UEFI 
> > > RC0 - V1.16.01 03/15/2019
> > > [   54.196847] Workqueue: 0000:74:02.0_disco_q
> > > sas_revalidate_domain
> > > [   54.202927] pstate: 60c00009 (nZCv daif +PAN +UAO)
> > > [   54.207705] pc : device_del+0x194/0x398
> > > [   54.211527] lr : device_del+0x190/0x398
> > > [   54.215349] sp : ffff80001cc7bb20
> > > [   54.218650] x29: ffff80001cc7bb20 x28: ffff0023be042188
> > > [   54.223948] x27: ffff0023c04c0000 x26: ffff0023be042000
> > > [   54.229246] x25: ffff8000119f0f30 x24: ffff0023be268a30
> > > [   54.234544] x23: ffff0023be268018 x22: ffff800011879000
> > > [   54.239842] x21: ffff8000119f0000 x20: ffff8000119f06e0
> > > [   54.245140] x19: ffff0023be268990 x18: 0000000000000004
> > > [   54.250438] x17: 0000000000000007 x16: 0000000000000001
> > > [   54.255736] x15: ffff0023eac13610 x14: ffff0023eb74a7f8
> > > [   54.261034] x13: 0000000000000000 x12: ffff0023eac13610
> > > [   54.266332] x11: ffff0023eb74a6c8 x10: 0000000000000000
> > > [   54.271630] x9 : ffff0023eac13618 x8 : 0000000040040000
> > > [   54.276928] x7 : 0000000000000000 x6 : ffff0023be268a90
> > > [   54.282226] x5 : ffff0023be74aa00 x4 : 0000000000000000
> > > [   54.287524] x3 : ffff8000119f0f30 x2 : dead000000000100
> > > [   54.292821] x1 : dead000000000122 x0 : 0000000000000000
> > > [   54.298119] Call trace:
> > > [   54.300553]  device_del+0x194/0x398
> > > [   54.304030]  enclosure_remove_device+0xb4/0x100
> > > [   54.308548]  ses_intf_remove+0x98/0xd8
> > > [   54.312283]  device_del+0xfc/0x398
> > > [   54.315671]  device_unregister+0x14/0x30
> > > [   54.319580]  __scsi_remove_device+0xf0/0x130
> > > [   54.323836]  scsi_remove_device+0x28/0x40
> > > [   54.327832]  scsi_remove_target+0x1bc/0x250
> > > [   54.332002]  sas_rphy_remove+0x5c/0x60
> > > [   54.335738]  sas_rphy_delete+0x14/0x28
> > > [   54.339473]  sas_destruct_devices+0x5c/0x98
> > > [   54.343642]  sas_revalidate_domain+0xa0/0x178
> > > [   54.347986]  process_one_work+0x1e0/0x358
> > > [   54.351982]  worker_thread+0x40/0x488
> > > [   54.355631]  kthread+0x118/0x120
> > > [   54.358846]  ret_from_fork+0x10/0x18
> > > [   54.362410] Code: 91028278 aa1903e0 9415f01f a94c0662
> > > (f9000441)
> > > [   54.368489] ---[ end trace 38c672fcf89c95f7 ]---
> > > 
> > > I tested on v5.4 and no such issue, but maybe the driver core
> > > changes 
> > > have exposed a ses/enclosure issue.
> > > 
> > > Checking:
> > > 
> > > int enclosure_remove_device(struct enclosure_device *edev, struct
> > > device 
> > > *dev)
> > > {
> > > 	struct enclosure_component *cdev;
> > > 	int i;
> > > 
> > > 	if (!edev || !dev)
> > > 		return -EINVAL;
> > > 
> > > 	for (i = 0; i < edev->components; i++) {
> > > 		cdev = &edev->component[i];
> > > 		if (cdev->dev == dev) {
> > > 			enclosure_remove_links(cdev);
> > > 			device_del(&cdev->cdev);
> > > 			put_device(dev);
> > > 			cdev->dev = NULL;
> > > 			return device_add(&cdev->cdev);
> > > 		}
> > > 	}
> > > 	return -ENODEV;
> > > }
> > 
> > The design of the code is simply to remove the link to the inserted
> > device which has been removed.
> > 
> > I *think* this means the calls to device_del and device_add are
> > unnecessary and should go.  enclosure_remove_links and the put of
> > the
> > enclosed device should be sufficient.
> 
> That would make more sense than trying to "reuse" the device
> structure
> here by tearing it down and adding it back.

OK, let's try that.  This should be the patch if someone can try it
(I've compile tested it, but the enclosure system is under a heap of
stuff in the garage).

James

---

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 6d27ccfe0680..3c2d405bc79b 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -406,10 +406,9 @@ int enclosure_remove_device(struct enclosure_device *edev, struct device *dev)
 		cdev = &edev->component[i];
 		if (cdev->dev == dev) {
 			enclosure_remove_links(cdev);
-			device_del(&cdev->cdev);
 			put_device(dev);
 			cdev->dev = NULL;
-			return device_add(&cdev->cdev);
+			return 0;
 		}
 	}
 	return -ENODEV;


* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 16:01         ` James Bottomley
@ 2020-01-08 16:08           ` John Garry
  2020-01-08 17:10             ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2020-01-08 16:08 UTC (permalink / raw)
  To: James Bottomley, Greg KH
  Cc: luojiaxing, saravanak, linux-kernel, Linuxarm, linux-scsi,
	Martin K . Petersen, Arnd Bergmann

On 08/01/2020 16:01, James Bottomley wrote:
>>>> 	cdev->dev = NULL;
>>>> 			return device_add(&cdev->cdev);
>>>> 		}
>>>> 	}
>>>> 	return -ENODEV;
>>>> }
>>> The design of the code is simply to remove the link to the inserted
>>> device which has been removed.
>>>
>>> I*think*  this means the calls to device_del and device_add are
>>> unnecessary and should go.  enclosure_remove_links and the put of
>>> the
>>> enclosed device should be sufficient.
>> That would make more sense than trying to "reuse" the device
>> structure
>> here by tearing it down and adding it back.
> OK, let's try that.  This should be the patch if someone can try it
> (I've compile tested it, but the enclosure system is under a heap of
> stuff in the garage).

I can test it now.

But it is a bit suspicious that we had the device_del() and device_add() 
at all, especially since the code change makes it look a bit more like 
pre-43d8eb9cfd0 ("ses: add support for enclosure component hot removal")

John

> 
> James
> 
> ---
> 
> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
> index 6d27ccfe0680..3c2d405bc79b 100644
> --- a/drivers/misc/enclosure.c
> +++ b/drivers/misc/enclosure.c
> @@ -406,10 +406,9 @@ int enclosure_remove_device(struct enclosure_device *edev, struct device *dev)
>   		cdev = &edev->component[i];
>   		if (cdev->dev == dev) {
>   			enclosure_remove_links(cdev);
> -			device_del(&cdev->cdev);
>   			put_device(dev);
>   			cdev->dev = NULL;
> -			return device_add(&cdev->cdev);
> +			return 0;
>   		}
>   	}
>   	return -ENODEV;



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 16:08           ` John Garry
@ 2020-01-08 17:10             ` John Garry
  2020-01-09  1:04               ` James Bottomley
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2020-01-08 17:10 UTC (permalink / raw)
  To: James Bottomley, Greg KH
  Cc: Martin K . Petersen, linux-scsi, Linuxarm, linux-kernel,
	saravanak, Arnd Bergmann

On 08/01/2020 16:08, John Garry wrote:
> On 08/01/2020 16:01, James Bottomley wrote:
>>>>>     cdev->dev = NULL;
>>>>>             return device_add(&cdev->cdev);
>>>>>         }
>>>>>     }
>>>>>     return -ENODEV;
>>>>> }
>>>> The design of the code is simply to remove the link to the inserted
>>>> device which has been removed.
>>>>
>>>> I*think*  this means the calls to device_del and device_add are
>>>> unnecessary and should go.  enclosure_remove_links and the put of
>>>> the
>>>> enclosed device should be sufficient.
>>> That would make more sense than trying to "reuse" the device
>>> structure
>>> here by tearing it down and adding it back.
>> OK, let's try that.  This should be the patch if someone can try it
>> (I've compile tested it, but the enclosure system is under a heap of
>> stuff in the garage).
> 
> I can test it now.
> 

Yeah, that looks to have worked ok. SES disk locate was also fine after 
losing and rediscovering the disk.

Thanks,
John

> But it is a bit suspicious that we had the device_del() and device_add() 
> at all, especially since the code change makes it look a bit more like 
> pre-43d8eb9cfd0 ("ses: add support for enclosure component hot removal")
> 
> John
> 
>>
>> James
>>
>> ---
>>
>> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
>> index 6d27ccfe0680..3c2d405bc79b 100644
>> --- a/drivers/misc/enclosure.c
>> +++ b/drivers/misc/enclosure.c
>> @@ -406,10 +406,9 @@ int enclosure_remove_device(struct 
>> enclosure_device *edev, struct device *dev)
>>           cdev = &edev->component[i];
>>           if (cdev->dev == dev) {
>>               enclosure_remove_links(cdev);
>> -            device_del(&cdev->cdev);
>>               put_device(dev);
>>               cdev->dev = NULL;
>> -            return device_add(&cdev->cdev);
>> +            return 0;
>>           }
>>       }
>>       return -ENODEV;
> 



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-08 17:10             ` John Garry
@ 2020-01-09  1:04               ` James Bottomley
  2020-01-14 15:07                 ` Greg KH
  0 siblings, 1 reply; 14+ messages in thread
From: James Bottomley @ 2020-01-09  1:04 UTC (permalink / raw)
  To: John Garry, Greg KH
  Cc: Martin K . Petersen, linux-scsi, Linuxarm, linux-kernel,
	saravanak, Arnd Bergmann

On Wed, 2020-01-08 at 17:10 +0000, John Garry wrote:
> On 08/01/2020 16:08, John Garry wrote:
> > On 08/01/2020 16:01, James Bottomley wrote:
> > > > > >     cdev->dev = NULL;
> > > > > >             return device_add(&cdev->cdev);
> > > > > >         }
> > > > > >     }
> > > > > >     return -ENODEV;
> > > > > > }
> > > > > 
> > > > > The design of the code is simply to remove the link to the
> > > > > inserted device which has been removed.
> > > > > 
> > > > > I*think*  this means the calls to device_del and device_add
> > > > > are unnecessary and should go.  enclosure_remove_links and
> > > > > the put of the enclosed device should be sufficient.
> > > > 
> > > > That would make more sense than trying to "reuse" the device
> > > > structure here by tearing it down and adding it back.
> > > 
> > > OK, let's try that.  This should be the patch if someone can try
> > > it (I've compile tested it, but the enclosure system is under a
> > > heap of stuff in the garage).
> > 
> > I can test it now.
> > 
> 
> Yeah, that looks to have worked ok. SES disk locate was also fine
> after losing and rediscovering the disk.

OK, I'll spin up a patch with fixes/reported and tested tags.

> Thanks,
> John
> 
> > But it is a bit suspicious that we had the device_del() and
> > device_add() at all, especially since the code change makes it look
> > a bit more like pre-43d8eb9cfd0 ("ses: add support for enclosure
> > component hot removal")

I think the original reason was to clean out the links.  I vaguely
remember there was once a time when you couldn't clear all the links
simply with sysfs_remove_link.  However, nowadays you can.
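
For reference, a simplified sketch of what enclosure_remove_links() does
today (leaving out the checks for links that were already removed):

static void enclosure_remove_links(struct enclosure_component *cdev)
{
	char name[ENCLOSURE_NAME_SIZE];

	enclosure_link_name(cdev, name);	/* "enclosure_device:<name>" */
	sysfs_remove_link(&cdev->dev->kobj, name);
	sysfs_remove_link(&cdev->cdev.kobj, "device");
}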

James



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-09  1:04               ` James Bottomley
@ 2020-01-14 15:07                 ` Greg KH
  2020-01-14 15:20                   ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: Greg KH @ 2020-01-14 15:07 UTC (permalink / raw)
  To: James Bottomley
  Cc: John Garry, Martin K . Petersen, linux-scsi, Linuxarm,
	linux-kernel, saravanak, Arnd Bergmann

On Wed, Jan 08, 2020 at 05:04:20PM -0800, James Bottomley wrote:
> On Wed, 2020-01-08 at 17:10 +0000, John Garry wrote:
> > On 08/01/2020 16:08, John Garry wrote:
> > > On 08/01/2020 16:01, James Bottomley wrote:
> > > > > > >     cdev->dev = NULL;
> > > > > > >             return device_add(&cdev->cdev);
> > > > > > >         }
> > > > > > >     }
> > > > > > >     return -ENODEV;
> > > > > > > }
> > > > > > 
> > > > > > The design of the code is simply to remove the link to the
> > > > > > inserted device which has been removed.
> > > > > > 
> > > > > > I*think*  this means the calls to device_del and device_add
> > > > > > are unnecessary and should go.  enclosure_remove_links and
> > > > > > the put of the enclosed device should be sufficient.
> > > > > 
> > > > > That would make more sense than trying to "reuse" the device
> > > > > structure here by tearing it down and adding it back.
> > > > 
> > > > OK, let's try that.  This should be the patch if someone can try
> > > > it (I've compile tested it, but the enclosure system is under a
> > > > heap of stuff in the garage).
> > > 
> > > I can test it now.
> > > 
> > 
> > Yeah, that looks to have worked ok. SES disk locate was also fine
> > after losing and rediscovering the disk.
> 
> OK, I'll spin up a patch with fixes/reported and tested tags.

Did this get sent?  I can't seem to find it :(



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-14 15:07                 ` Greg KH
@ 2020-01-14 15:20                   ` John Garry
  2020-01-14 15:28                     ` Greg KH
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2020-01-14 15:20 UTC (permalink / raw)
  To: Greg KH, James Bottomley
  Cc: Martin K . Petersen, linux-scsi, Linuxarm, linux-kernel,
	saravanak, Arnd Bergmann

On 14/01/2020 15:07, Greg KH wrote:
> On Wed, Jan 08, 2020 at 05:04:20PM -0800, James Bottomley wrote:
>> On Wed, 2020-01-08 at 17:10 +0000, John Garry wrote:
>>> On 08/01/2020 16:08, John Garry wrote:
>>>> On 08/01/2020 16:01, James Bottomley wrote:
>>>>>>>>      cdev->dev = NULL;
>>>>>>>>              return device_add(&cdev->cdev);
>>>>>>>>          }
>>>>>>>>      }
>>>>>>>>      return -ENODEV;
>>>>>>>> }
>>>>>>>
>>>>>>> The design of the code is simply to remove the link to the
>>>>>>> inserted device which has been removed.
>>>>>>>
>>>>>>> I*think*  this means the calls to device_del and device_add
>>>>>>> are unnecessary and should go.  enclosure_remove_links and
>>>>>>> the put of the enclosed device should be sufficient.
>>>>>>
>>>>>> That would make more sense than trying to "reuse" the device
>>>>>> structure here by tearing it down and adding it back.
>>>>>
>>>>> OK, let's try that.  This should be the patch if someone can try
>>>>> it (I've compile tested it, but the enclosure system is under a
>>>>> heap of stuff in the garage).
>>>>
>>>> I can test it now.
>>>>
>>>
>>> Yeah, that looks to have worked ok. SES disk locate was also fine
>>> after losing and rediscovering the disk.
>>
>> OK, I'll spin up a patch with fixes/reported and tested tags.
> 
> Did this get sent?  I can't seem to find it :(
> 

Yeah, but you were not cc'ed :(

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200114&id=529244bd1afc102ab164429d338d310d5d65e60d

cheers.
John

> .
> 



* Re: [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge()
  2020-01-14 15:20                   ` John Garry
@ 2020-01-14 15:28                     ` Greg KH
  0 siblings, 0 replies; 14+ messages in thread
From: Greg KH @ 2020-01-14 15:28 UTC (permalink / raw)
  To: John Garry
  Cc: James Bottomley, Martin K . Petersen, linux-scsi, Linuxarm,
	linux-kernel, saravanak, Arnd Bergmann

On Tue, Jan 14, 2020 at 03:20:12PM +0000, John Garry wrote:
> On 14/01/2020 15:07, Greg KH wrote:
> > On Wed, Jan 08, 2020 at 05:04:20PM -0800, James Bottomley wrote:
> > > On Wed, 2020-01-08 at 17:10 +0000, John Garry wrote:
> > > > On 08/01/2020 16:08, John Garry wrote:
> > > > > On 08/01/2020 16:01, James Bottomley wrote:
> > > > > > > > >      cdev->dev = NULL;
> > > > > > > > >              return device_add(&cdev->cdev);
> > > > > > > > >          }
> > > > > > > > >      }
> > > > > > > > >      return -ENODEV;
> > > > > > > > > }
> > > > > > > > 
> > > > > > > > The design of the code is simply to remove the link to the
> > > > > > > > inserted device which has been removed.
> > > > > > > > 
> > > > > > > > I*think*  this means the calls to device_del and device_add
> > > > > > > > are unnecessary and should go.  enclosure_remove_links and
> > > > > > > > the put of the enclosed device should be sufficient.
> > > > > > > 
> > > > > > > That would make more sense than trying to "reuse" the device
> > > > > > > structure here by tearing it down and adding it back.
> > > > > > 
> > > > > > OK, let's try that.  This should be the patch if someone can try
> > > > > > it (I've compile tested it, but the enclosure system is under a
> > > > > > heap of stuff in the garage).
> > > > > 
> > > > > I can test it now.
> > > > > 
> > > > 
> > > > Yeah, that looks to have worked ok. SES disk locate was also fine
> > > > after losing and rediscovering the disk.
> > > 
> > > OK, I'll spin up a patch with fixes/reported and tested tags.
> > 
> > Did this get sent?  I can't seem to find it :(
> > 
> 
> Yeah, but you were not cc'ed :(
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200114&id=529244bd1afc102ab164429d338d310d5d65e60d

Hey, less work for me, that's fine!  :)

thanks for the pointer.

greg k-h


Thread overview: 14+ messages
2020-01-08 11:34 [PATCH v1] driver core: Use list_del_init to replace list_del at device_links_purge() Luo Jiaxing
2020-01-08 11:53 ` John Garry
2020-01-08 12:26 ` Greg KH
2020-01-08 14:50   ` John Garry
2020-01-08 15:44     ` Greg KH
2020-01-08 15:51     ` James Bottomley
2020-01-08 15:57       ` Greg KH
2020-01-08 16:01         ` James Bottomley
2020-01-08 16:08           ` John Garry
2020-01-08 17:10             ` John Garry
2020-01-09  1:04               ` James Bottomley
2020-01-14 15:07                 ` Greg KH
2020-01-14 15:20                   ` John Garry
2020-01-14 15:28                     ` Greg KH
