* [PATCH 0/4] scsi: hisi_sas: Some misc changes
@ 2023-03-20 3:34 chenxiang
2023-03-20 3:34 ` [PATCH 1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list chenxiang
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: chenxiang @ 2023-03-20 3:34 UTC (permalink / raw)
To: jejb, martin.petersen; +Cc: linux-scsi, linuxarm, Xiang Chen
From: Xiang Chen <chenxiang66@hisilicon.com>
This series contain some fixes including:
- Grab sas_dev lock when traversing sas_dev list to avoid NULL pointer
- Handle NCQ error when IPTT is valid
- Ensure all enabled PHYs up during controller reset
- Exit suspend state when usage count of runtime PM is greater than 0
Xingui Yang (2):
scsi: hisi_sas: Grab sas_dev lock when traversing the members of
sas_dev.list
scsi: hisi_sas: Handle NCQ error when IPTT is valid
Yihang Li (2):
scsi: hisi_sas: Ensure all enabled PHYs up during controller reset
scsi: hisi_sas: Exit suspending state when usage count is greater than
0
drivers/scsi/hisi_sas/hisi_sas.h | 3 +-
drivers/scsi/hisi_sas/hisi_sas_main.c | 57 ++++++++++++++++++-----
drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 8 +++-
drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 8 +++-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 83 ++++++++++++++++++++++++++--------
5 files changed, 124 insertions(+), 35 deletions(-)
--
2.8.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
@ 2023-03-20 3:34 ` chenxiang
2023-03-20 3:34 ` [PATCH 2/4] scsi: hisi_sas: Handle NCQ error when IPTT is valid chenxiang
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: chenxiang @ 2023-03-20 3:34 UTC (permalink / raw)
To: jejb, martin.petersen; +Cc: linux-scsi, linuxarm, Xingui Yang, Xiang Chen
From: Xingui Yang <yangxingui@huawei.com>
When free'ing slots in function slot_complete_v3_hw(), it is possible that
sas_dev.list is being traversed elsewhere, and it may trigger a null
pointer exception, such as follows:
==>cq thread ==>scsi_eh_6
==>scsi_error_handler()
==>sas_eh_handle_sas_errors()
==>sas_scsi_find_task()
==>lldd_abort_task()
==>slot_complete_v3_hw() ==>hisi_sas_abort_task()
==>hisi_sas_slot_task_free() ==>dereg_device_v3_hw()
==>list_del_init() ==>list_for_each_entry_safe()
[ 7165.434918] sas: Enter sas_scsi_recover_host busy: 32 failed: 32
[ 7165.434926] sas: trying to find task 0x00000000769b5ba5
[ 7165.434927] sas: sas_scsi_find_task: aborting task 0x00000000769b5ba5
[ 7165.434940] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000769b5ba5) aborted
[ 7165.434964] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000c9f7aa07) ignored
[ 7165.434965] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000e2a1cf01) ignored
[ 7165.434968] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7165.434972] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(0000000022d52d93) ignored
[ 7165.434975] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(0000000066a7516c) ignored
[ 7165.434976] Mem abort info:
[ 7165.434982] ESR = 0x96000004
[ 7165.434991] Exception class = DABT (current EL), IL = 32 bits
[ 7165.434992] SET = 0, FnV = 0
[ 7165.434993] EA = 0, S1PTW = 0
[ 7165.434994] Data abort info:
[ 7165.434994] ISV = 0, ISS = 0x00000004
[ 7165.434995] CM = 0, WnR = 0
[ 7165.434997] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f29543f2
[ 7165.434998] [0000000000000000] pgd=0000000000000000
[ 7165.435003] Internal error: Oops: 96000004 [#1] SMP
[ 7165.439863] Process scsi_eh_6 (pid: 4109, stack limit = 0x00000000c43818d5)
[ 7165.468862] pstate: 00c00009 (nzcv daif +PAN +UAO)
[ 7165.473637] pc : dereg_device_v3_hw+0x68/0xa8 [hisi_sas_v3_hw]
[ 7165.479443] lr : dereg_device_v3_hw+0x2c/0xa8 [hisi_sas_v3_hw]
[ 7165.485247] sp : ffff00001d623bc0
[ 7165.488546] x29: ffff00001d623bc0 x28: ffffa027d03b9508
[ 7165.493835] x27: ffff80278ed50af0 x26: ffffa027dd31e0a8
[ 7165.499123] x25: ffffa027d9b27f88 x24: ffffa027d9b209f8
[ 7165.504411] x23: ffffa027c45b0d60 x22: ffff80278ec07c00
[ 7165.509700] x21: 0000000000000008 x20: ffffa027d9b209f8
[ 7165.514988] x19: ffffa027d9b27f88 x18: ffffffffffffffff
[ 7165.520276] x17: 0000000000000000 x16: 0000000000000000
[ 7165.525564] x15: ffff0000091d9708 x14: ffff0000093b7dc8
[ 7165.530852] x13: ffff0000093b7a23 x12: 6e7265746e692067
[ 7165.536140] x11: 0000000000000000 x10: 0000000000000bb0
[ 7165.541429] x9 : ffff00001d6238f0 x8 : ffffa027d877af00
[ 7165.546718] x7 : ffffa027d6329600 x6 : ffff7e809f58ca00
[ 7165.552006] x5 : 0000000000001f8a x4 : 000000000000088e
[ 7165.557295] x3 : ffffa027d9b27fa8 x2 : 0000000000000000
[ 7165.562583] x1 : 0000000000000000 x0 : 000000003000188e
[ 7165.567872] Call trace:
[ 7165.570309] dereg_device_v3_hw+0x68/0xa8 [hisi_sas_v3_hw]
[ 7165.575775] hisi_sas_abort_task+0x248/0x358 [hisi_sas_main]
[ 7165.581415] sas_eh_handle_sas_errors+0x258/0x8e0 [libsas]
[ 7165.586876] sas_scsi_recover_host+0x134/0x458 [libsas]
[ 7165.592082] scsi_error_handler+0xb4/0x488
[ 7165.596163] kthread+0x134/0x138
[ 7165.599380] ret_from_fork+0x10/0x18
[ 7165.602940] Code: d5033e9f b9000040 aa0103e2 eb03003f (f9400021)
[ 7165.609004] kernel fault(0x1) notification starting on CPU 75
[ 7165.700728] ---[ end trace fc042cbbea224efc ]---
[ 7165.705326] Kernel panic - not syncing: Fatal exception
To fix the issue, grab sas_dev lock when traversing the members of
sas_dev.list in dereg_device_v3_hw() and hisi_sas_release_tasks() to
avoid concurrency of adding and deleting member. When function
hisi_sas_release_tasks() calls hisi_sas_do_release_task() to free slot, the
lock cannot be grabbed again in hisi_sas_slot_task_free(), then a bool
parameter need_lock is added.
Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
---
drivers/scsi/hisi_sas/hisi_sas.h | 3 ++-
drivers/scsi/hisi_sas/hisi_sas_main.c | 25 ++++++++++++++++---------
drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 4 +++-
5 files changed, 23 insertions(+), 13 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 3a5fc36..c37ca6f 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -656,7 +656,8 @@ extern void hisi_sas_phy_down(struct hisi_hba *hisi_hba, int phy_no, int rdy,
extern void hisi_sas_phy_bcast(struct hisi_sas_phy *phy);
extern void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba,
struct sas_task *task,
- struct hisi_sas_slot *slot);
+ struct hisi_sas_slot *slot,
+ bool need_lock);
extern void hisi_sas_init_mem(struct hisi_hba *hisi_hba);
extern void hisi_sas_rst_work_handler(struct work_struct *work);
extern void hisi_sas_sync_rst_work_handler(struct work_struct *work);
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 325d6d6..d2e94979 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -205,7 +205,7 @@ static int hisi_sas_slot_index_alloc(struct hisi_hba *hisi_hba,
}
void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task,
- struct hisi_sas_slot *slot)
+ struct hisi_sas_slot *slot, bool need_lock)
{
int device_id = slot->device_id;
struct hisi_sas_device *sas_dev = &hisi_hba->devices[device_id];
@@ -239,9 +239,13 @@ void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task,
}
}
- spin_lock(&sas_dev->lock);
- list_del_init(&slot->entry);
- spin_unlock(&sas_dev->lock);
+ if (need_lock) {
+ spin_lock(&sas_dev->lock);
+ list_del_init(&slot->entry);
+ spin_unlock(&sas_dev->lock);
+ } else {
+ list_del_init(&slot->entry);
+ }
memset(slot, 0, offsetof(struct hisi_sas_slot, buf));
@@ -1081,7 +1085,7 @@ static void hisi_sas_port_notify_formed(struct asd_sas_phy *sas_phy)
}
static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba, struct sas_task *task,
- struct hisi_sas_slot *slot)
+ struct hisi_sas_slot *slot, bool need_lock)
{
if (task) {
unsigned long flags;
@@ -1098,7 +1102,7 @@ static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba, struct sas_task
spin_unlock_irqrestore(&task->task_state_lock, flags);
}
- hisi_sas_slot_task_free(hisi_hba, task, slot);
+ hisi_sas_slot_task_free(hisi_hba, task, slot, need_lock);
}
static void hisi_sas_release_task(struct hisi_hba *hisi_hba,
@@ -1107,8 +1111,11 @@ static void hisi_sas_release_task(struct hisi_hba *hisi_hba,
struct hisi_sas_slot *slot, *slot2;
struct hisi_sas_device *sas_dev = device->lldd_dev;
+ spin_lock(&sas_dev->lock);
list_for_each_entry_safe(slot, slot2, &sas_dev->list, entry)
- hisi_sas_do_release_task(hisi_hba, slot->task, slot);
+ hisi_sas_do_release_task(hisi_hba, slot->task, slot, false);
+
+ spin_unlock(&sas_dev->lock);
}
void hisi_sas_release_tasks(struct hisi_hba *hisi_hba)
@@ -1634,7 +1641,7 @@ static int hisi_sas_abort_task(struct sas_task *task)
*/
if (rc == TMF_RESP_FUNC_COMPLETE && rc2 != TMF_RESP_FUNC_SUCC) {
if (task->lldd_task)
- hisi_sas_do_release_task(hisi_hba, task, slot);
+ hisi_sas_do_release_task(hisi_hba, task, slot, true);
}
} else if (task->task_proto & SAS_PROTOCOL_SATA ||
task->task_proto & SAS_PROTOCOL_STP) {
@@ -1654,7 +1661,7 @@ static int hisi_sas_abort_task(struct sas_task *task)
*/
if ((sas_dev->dev_status == HISI_SAS_DEV_NCQ_ERR) &&
qc && qc->scsicmd) {
- hisi_sas_do_release_task(hisi_hba, task, slot);
+ hisi_sas_do_release_task(hisi_hba, task, slot, true);
rc = TMF_RESP_FUNC_COMPLETE;
} else {
rc = hisi_sas_softreset_ata_disk(device);
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
index d643c5a..7ea36659 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
@@ -1306,7 +1306,7 @@ static void slot_complete_v1_hw(struct hisi_hba *hisi_hba,
}
out:
- hisi_sas_slot_task_free(hisi_hba, task, slot);
+ hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (task->task_done)
task->task_done(task);
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index cded42f..ef896ef 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -2462,7 +2462,7 @@ static void slot_complete_v2_hw(struct hisi_hba *hisi_hba,
}
task->task_state_flags |= SAS_TASK_STATE_DONE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
- hisi_sas_slot_task_free(hisi_hba, task, slot);
+ hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (!is_internal && (task->task_proto != SAS_PROTOCOL_SMP)) {
spin_lock_irqsave(&device->done_lock, flags);
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 66fcb34..8c78a6b 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -888,6 +888,7 @@ static void dereg_device_v3_hw(struct hisi_hba *hisi_hba,
cfg_abt_set_query_iptt = hisi_sas_read32(hisi_hba,
CFG_ABT_SET_QUERY_IPTT);
+ spin_lock(&sas_dev->lock);
list_for_each_entry_safe(slot, slot2, &sas_dev->list, entry) {
cfg_abt_set_query_iptt &= ~CFG_SET_ABORTED_IPTT_MSK;
cfg_abt_set_query_iptt |= (1 << CFG_SET_ABORTED_EN_OFF) |
@@ -895,6 +896,7 @@ static void dereg_device_v3_hw(struct hisi_hba *hisi_hba,
hisi_sas_write32(hisi_hba, CFG_ABT_SET_QUERY_IPTT,
cfg_abt_set_query_iptt);
}
+ spin_unlock(&sas_dev->lock);
cfg_abt_set_query_iptt &= ~(1 << CFG_SET_ABORTED_EN_OFF);
hisi_sas_write32(hisi_hba, CFG_ABT_SET_QUERY_IPTT,
cfg_abt_set_query_iptt);
@@ -2379,7 +2381,7 @@ static void slot_complete_v3_hw(struct hisi_hba *hisi_hba,
}
task->task_state_flags |= SAS_TASK_STATE_DONE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
- hisi_sas_slot_task_free(hisi_hba, task, slot);
+ hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (!is_internal && (task->task_proto != SAS_PROTOCOL_SMP)) {
spin_lock_irqsave(&device->done_lock, flags);
--
2.8.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/4] scsi: hisi_sas: Handle NCQ error when IPTT is valid
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
2023-03-20 3:34 ` [PATCH 1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list chenxiang
@ 2023-03-20 3:34 ` chenxiang
2023-03-20 3:34 ` [PATCH 3/4] scsi: hisi_sas: Ensure all enabled PHYs up during controller reset chenxiang
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: chenxiang @ 2023-03-20 3:34 UTC (permalink / raw)
To: jejb, martin.petersen; +Cc: linux-scsi, linuxarm, Xingui Yang, Xiang Chen
From: Xingui Yang <yangxingui@huawei.com>
If an NCQ error occurs when the IPTT is valid and slot->abort flag is set
in completion path, sas_task_abort() will be called to abort only one NCQ
command now, and the host would be set to SHOST_RECOVERY state. But this
may not kick-off EH Immediately until other outstanding QCs timeouts. As a
result, the host may remain in the SHOST_RECOVERY state for up to 30
seconds, such as follows:
[7972317.645234] hisi_sas_v3_hw 0000:74:04.0: erroneous completion iptt=3264 task=00000000466116b8 dev id=2 sas_addr=0x5000000000000502 CQ hdr: 0x1883 0x20cc0 0x40000 0x20420000 Error info: 0x0 0x0 0x200000 0x0
[7972341.508264] sas: Enter sas_scsi_recover_host busy: 32 failed: 32
[7972341.984731] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 32 tries: 1
So all NCQ commands that are in the queue should be aborted when an NCQ
error occurs in this scenario.
Fixes: 05d91b557af9 ("scsi: hisi_sas: Directly trigger SCSI error handling for completion errors")
Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
---
drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 6 +++++-
drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 6 +++++-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 6 +++++-
3 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
index 7ea36659..76176b1 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c
@@ -1258,7 +1258,11 @@ static void slot_complete_v1_hw(struct hisi_hba *hisi_hba,
slot_err_v1_hw(hisi_hba, task, slot);
if (unlikely(slot->abort)) {
- sas_task_abort(task);
+ if (dev_is_sata(device) && task->ata_task.use_ncq)
+ sas_ata_device_link_abort(device, true);
+ else
+ sas_task_abort(task);
+
return;
}
goto out;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index ef896ef..746e4d7 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -2404,7 +2404,11 @@ static void slot_complete_v2_hw(struct hisi_hba *hisi_hba,
error_info[2], error_info[3]);
if (unlikely(slot->abort)) {
- sas_task_abort(task);
+ if (dev_is_sata(device) && task->ata_task.use_ncq)
+ sas_ata_device_link_abort(device, true);
+ else
+ sas_task_abort(task);
+
return;
}
goto out;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 8c78a6b..791c0eab 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2327,7 +2327,11 @@ static void slot_complete_v3_hw(struct hisi_hba *hisi_hba,
error_info[0], error_info[1],
error_info[2], error_info[3]);
if (unlikely(slot->abort)) {
- sas_task_abort(task);
+ if (dev_is_sata(device) && task->ata_task.use_ncq)
+ sas_ata_device_link_abort(device, true);
+ else
+ sas_task_abort(task);
+
return;
}
goto out;
--
2.8.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 3/4] scsi: hisi_sas: Ensure all enabled PHYs up during controller reset
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
2023-03-20 3:34 ` [PATCH 1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list chenxiang
2023-03-20 3:34 ` [PATCH 2/4] scsi: hisi_sas: Handle NCQ error when IPTT is valid chenxiang
@ 2023-03-20 3:34 ` chenxiang
2023-03-20 3:34 ` [PATCH 4/4] scsi: hisi_sas: Exit suspending state when usage count is greater than 0 chenxiang
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: chenxiang @ 2023-03-20 3:34 UTC (permalink / raw)
To: jejb, martin.petersen; +Cc: linux-scsi, linuxarm, Yihang Li, Xiang Chen
From: Yihang Li <liyihang9@huawei.com>
For the controller reset operation, hisi_sas_phy_enable() is executed for
each enabled local PHY, and refresh the port id of each device based on
the latest hisi_sas_phy->port_id after 1 second sleep,
hisi_sas_phy->port_id is configured in the interrupt processing function
phy_up_v3_hw(). However, in directly attached scenario, for some SATA
disks the amount of time for phyup more than 1s sometimes. In this case,
incorrect port id may be configured in hisi_sas_refresh_port_id().
As a result, all the internal IOs fail and disk lost, such as follows:
[10717.666565] hisi_sas_v3_hw 0000:74:02.0: phyup: phy1 link_rate=10(sata)
[10718.826813] hisi_sas_v3_hw 0000:74:02.0: erroneous completion iptt=63
task=00000000c1ab1c2b dev id=200 addr=5000000000000501 CQ hdr: 0x8000007 0xc8003f 0x0
0x0 Error info: 0x0 0x0 0x0 0x0
[10718.843428] sas: TMF task open reject failed 5000000000000501
[10718.849242] hisi_sas_v3_hw 0000:74:02.0: erroneous completion iptt=64
task=00000000c1ab1c2b dev id=200 addr=5000000000000501 CQ hdr: 0x8000007 0xc80040 0x0
0x0 Error info: 0x0 0x0 0x0 0x0
[10718.865856] sas: TMF task open reject failed 5000000000000501
[10718.871670] hisi_sas_v3_hw 0000:74:02.0: erroneous completion iptt=65
task=00000000c1ab1c2b dev id=200 addr=5000000000000501 CQ hdr: 0x8000007 0xc80041 0x0
0x0 Error info: 0x0 0x0 0x0 0x0
[10718.888284] sas: TMF task open reject failed 5000000000000501
[10718.894093] sas: executing TMF for 5000000000000501 failed after 3 attempts!
[10718.901114] hisi_sas_v3_hw 0000:74:02.0: ata disk 5000000000000501 reset failed
[10718.908410] hisi_sas_v3_hw 0000:74:02.0: controller reset complete
.....
[10773.298633] ata216.00: revalidation failed (errno=-19)
[10773.303753] ata216.00: disable device
So the time of waitting for PHYs up is 1s which may be not enough. To
solve the issue, running hisi_sas_phy_enable() in parallel through
async operations and use wait_for_completion_timeout() to wait for PHYs
come up instead of directly sleep for 1 second.
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
---
drivers/scsi/hisi_sas/hisi_sas_main.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index d2e94979..412431c 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1520,13 +1520,41 @@ void hisi_sas_controller_reset_prepare(struct hisi_hba *hisi_hba)
}
EXPORT_SYMBOL_GPL(hisi_sas_controller_reset_prepare);
+static void hisi_sas_async_init_wait_phyup(void *data, async_cookie_t cookie)
+{
+ struct hisi_sas_phy *phy = data;
+ struct hisi_hba *hisi_hba = phy->hisi_hba;
+ struct device *dev = hisi_hba->dev;
+ DECLARE_COMPLETION_ONSTACK(completion);
+ int phy_no = phy->sas_phy.id;
+
+ phy->reset_completion = &completion;
+ hisi_sas_phy_enable(hisi_hba, phy_no, 1);
+ if (!wait_for_completion_timeout(&completion,
+ HISI_SAS_WAIT_PHYUP_TIMEOUT))
+ dev_warn(dev, "phy%d wait phyup timed out\n", phy_no);
+
+ phy->reset_completion = NULL;
+}
+
void hisi_sas_controller_reset_done(struct hisi_hba *hisi_hba)
{
struct Scsi_Host *shost = hisi_hba->shost;
+ ASYNC_DOMAIN_EXCLUSIVE(async);
+ int phy_no;
/* Init and wait for PHYs to come up and all libsas event finished. */
- hisi_hba->hw->phys_init(hisi_hba);
- msleep(1000);
+ for (phy_no = 0; phy_no < hisi_hba->n_phy; phy_no++) {
+ struct hisi_sas_phy *phy = &hisi_hba->phy[phy_no];
+
+ if (!(hisi_hba->phy_state & BIT(phy_no)))
+ continue;
+
+ async_schedule_domain(hisi_sas_async_init_wait_phyup,
+ phy, &async);
+ }
+
+ async_synchronize_full_domain(&async);
hisi_sas_refresh_port_id(hisi_hba);
clear_bit(HISI_SAS_REJECT_CMD_BIT, &hisi_hba->flags);
--
2.8.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 4/4] scsi: hisi_sas: Exit suspending state when usage count is greater than 0
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
` (2 preceding siblings ...)
2023-03-20 3:34 ` [PATCH 3/4] scsi: hisi_sas: Ensure all enabled PHYs up during controller reset chenxiang
@ 2023-03-20 3:34 ` chenxiang
2023-04-03 1:59 ` [PATCH 0/4] scsi: hisi_sas: Some misc changes Martin K. Petersen
2023-04-12 2:04 ` Martin K. Petersen
5 siblings, 0 replies; 7+ messages in thread
From: chenxiang @ 2023-03-20 3:34 UTC (permalink / raw)
To: jejb, martin.petersen; +Cc: linux-scsi, linuxarm, Yihang Li, Xiang Chen
From: Yihang Li <liyihang9@huawei.com>
When the current status of the host controller is suspended, enabling a
local PHY just after disabling all local PHYs in expander envirnment,
a hung as follows occurs.
[ 486.854655] INFO: task kworker/u256:1:899 blocked for more than 120 seconds.
[ 486.862207] Not tainted 6.1.0-rc4+ #1
[ 486.870545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 486.878893] task:kworker/u256:1 state:D stack:0 pid:899 ppid:2 flags:0x00000008
[ 486.887745] Workqueue: 0000:74:02.0_disco_q sas_discover_domain [libsas]
[ 486.894704] Call trace:
[ 486.897400] __switch_to+0xf0/0x170
[ 486.901146] __schedule+0x3e4/0x1160
[ 486.904970] schedule+0x64/0x104
[ 486.908442] rpm_resume+0x158/0x6a0
[ 486.912163] __pm_runtime_resume+0x5c/0x84
[ 486.916489] smp_execute_task_sg+0x1f8/0x264 [libsas]
[ 486.921773] sas_discover_expander.part.0+0xbc/0x720 [libsas]
[ 486.927750] sas_discover_root_expander+0x90/0x154 [libsas]
[ 486.933552] sas_discover_domain+0x444/0x6d0 [libsas]
[ 486.938826] process_one_work+0x1e0/0x450
[ 486.943057] worker_thread+0x150/0x44c
[ 486.947015] kthread+0x114/0x120
[ 486.950447] ret_from_fork+0x10/0x20
[ 486.954292] INFO: task kworker/u256:2:1780 blocked for more than 120 seconds.
[ 486.961637] Not tainted 6.1.0-rc4+ #1
[ 486.966087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 486.974356] task:kworker/u256:2 state:D stack:0 pid:1780 ppid:2 flags:0x00000208
[ 486.983141] Workqueue: 0000:74:02.0_event_q sas_port_event_worker [libsas]
[ 486.990252] Call trace:
[ 486.992930] __switch_to+0xf0/0x170
[ 486.996645] __schedule+0x3e4/0x1160
[ 487.000439] schedule+0x64/0x104
[ 487.003886] schedule_timeout+0x17c/0x1c0
[ 487.008102] wait_for_completion+0x7c/0x160
[ 487.012488] __flush_workqueue+0x104/0x3e0
[ 487.016782] sas_porte_bytes_dmaed+0x414/0x454 [libsas]
[ 487.022203] sas_port_event_worker+0x38/0x60 [libsas]
[ 487.027449] process_one_work+0x1e0/0x450
[ 487.031645] worker_thread+0x150/0x44c
[ 487.035594] kthread+0x114/0x120
[ 487.039017] ret_from_fork+0x10/0x20
[ 487.042828] INFO: task bash:11488 blocked for more than 121 seconds.
[ 487.049366] Not tainted 6.1.0-rc4+ #1
[ 487.053746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 487.061953] task:bash state:D stack:0 pid:11488 ppid:10977 flags:0x00000204
[ 487.070698] Call trace:
[ 487.073355] __switch_to+0xf0/0x170
[ 487.077050] __schedule+0x3e4/0x1160
[ 487.080833] schedule+0x64/0x104
[ 487.084270] schedule_timeout+0x17c/0x1c0
[ 487.088474] wait_for_completion+0x7c/0x160
[ 487.092851] __flush_workqueue+0x104/0x3e0
[ 487.097137] drain_workqueue+0xb8/0x160
[ 487.101159] __sas_drain_work+0x50/0x90 [libsas]
[ 487.105963] sas_suspend_ha+0x64/0xd4 [libsas]
[ 487.110590] suspend_v3_hw+0x198/0x1e8 [hisi_sas_v3_hw]
[ 487.115989] pci_pm_runtime_suspend+0x5c/0x1d0
[ 487.120606] __rpm_callback+0x50/0x150
[ 487.124535] rpm_callback+0x74/0x80
[ 487.128204] rpm_suspend+0x110/0x640
[ 487.131955] rpm_idle+0x1f4/0x2d0
[ 487.135447] __pm_runtime_idle+0x58/0x94
[ 487.139538] queue_phy_enable+0xcc/0xf0 [libsas]
[ 487.144330] store_sas_phy_enable+0x74/0x100
[ 487.148770] dev_attr_store+0x20/0x34
[ 487.152606] sysfs_kf_write+0x4c/0x5c
[ 487.156437] kernfs_fop_write_iter+0x120/0x1b0
[ 487.161049] vfs_write+0x2d0/0x36c
[ 487.164625] ksys_write+0x70/0x100
[ 487.168194] __arm64_sys_write+0x24/0x30
[ 487.172280] invoke_syscall+0x50/0x120
[ 487.176186] el0_svc_common.constprop.0+0x168/0x190
[ 487.181214] do_el0_svc+0x34/0xc0
[ 487.184680] el0_svc+0x2c/0xb4
[ 487.187879] el0t_64_sync_handler+0xb8/0xbc
[ 487.192205] el0t_64_sync+0x19c/0x1a0
We find that when all local PHYs are disabled, all the devices will be
removed, the ->runtime_suspend() callback suspend_v3_hw() directly execute
since the controller usage count drop to 0. On the other side, the first
local PHY is enabled through the sysfs interface, and ensures that
function phy_up_v3_hw() is completed due to suspend_v3_hw()->
interrupt_disable_v3_hw(). In the expander scenario,
sas_discover_root_expander() is executed in event work
DISCE_DISCOVER_DOMAIN, which will increases the controller usage count and
carry out a resume and sends SMPIO, it cannot be completed because the
runtime PM status of the controller is RPM_SUSPENDING. At the same time,
the ->runtime_suspend() callback suspend_v3_hw() also cannot complete the
process because of drain libsas event queue in sas_suspend_ha(), so hung
occurs.
(thread 1) | (thread 2)
... |
rpm_idle() |
... |
__update_runtime_status(RPM_SUSPENDING)|
... | ...
suspend_v3_hw() | smp_execute_task_sg()
... | ...
interrupt_disable_v3_hw() | pm_runtime_get_sync()
| ...
... | rpm_resume() //RPM_SUSPENDING
|
__sas_drain_work() |
To fix it, check if the current runtime PM status of the controller allows
to be suspended continue after interrupt_disable_v3_hw(), return
immediately if not.
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hislicon.com>
---
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 73 ++++++++++++++++++++++++++--------
1 file changed, 56 insertions(+), 17 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 791c0eab..592170a 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -604,6 +604,27 @@ static u32 hisi_sas_phy_read32(struct hisi_hba *hisi_hba,
readl_poll_timeout_atomic(regs, val, cond, delay_us, timeout_us);\
})
+static void interrupt_enable_v3_hw(struct hisi_hba *hisi_hba)
+{
+ int i;
+
+ for (i = 0; i < hisi_hba->queue_count; i++)
+ hisi_sas_write32(hisi_hba, OQ0_INT_SRC_MSK + 0x4 * i, 0);
+
+ hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK1, 0xfefefefe);
+ hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK2, 0xfefefefe);
+ hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK3, 0xffc220ff);
+ hisi_sas_write32(hisi_hba, SAS_ECC_INTR_MSK, 0x155555);
+
+ for (i = 0; i < hisi_hba->n_phy; i++) {
+ hisi_sas_phy_write32(hisi_hba, i, CHL_INT1_MSK, 0xf2057fff);
+ hisi_sas_phy_write32(hisi_hba, i, CHL_INT2_MSK, 0xffffbfe);
+ hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_NOT_RDY_MSK, 0x0);
+ hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_PHY_ENA_MSK, 0x0);
+ hisi_sas_phy_write32(hisi_hba, i, SL_RX_BCAST_CHK_MSK, 0x0);
+ }
+}
+
static void init_reg_v3_hw(struct hisi_hba *hisi_hba)
{
int i, j;
@@ -624,20 +645,14 @@ static void init_reg_v3_hw(struct hisi_hba *hisi_hba)
hisi_sas_write32(hisi_hba, ENT_INT_SRC1, 0xffffffff);
hisi_sas_write32(hisi_hba, ENT_INT_SRC2, 0xffffffff);
hisi_sas_write32(hisi_hba, ENT_INT_SRC3, 0xffffffff);
- hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK1, 0xfefefefe);
- hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK2, 0xfefefefe);
- hisi_sas_write32(hisi_hba, ENT_INT_SRC_MSK3, 0xffc220ff);
hisi_sas_write32(hisi_hba, CHNL_PHYUPDOWN_INT_MSK, 0x0);
hisi_sas_write32(hisi_hba, CHNL_ENT_INT_MSK, 0x0);
hisi_sas_write32(hisi_hba, HGC_COM_INT_MSK, 0x0);
- hisi_sas_write32(hisi_hba, SAS_ECC_INTR_MSK, 0x155555);
hisi_sas_write32(hisi_hba, AWQOS_AWCACHE_CFG, 0xf0f0);
hisi_sas_write32(hisi_hba, ARQOS_ARCACHE_CFG, 0xf0f0);
- for (i = 0; i < hisi_hba->queue_count; i++)
- hisi_sas_write32(hisi_hba, OQ0_INT_SRC_MSK + 0x4 * i, 0);
-
hisi_sas_write32(hisi_hba, HYPER_STREAM_ID_EN_CFG, 1);
+ interrupt_enable_v3_hw(hisi_hba);
for (i = 0; i < hisi_hba->n_phy; i++) {
enum sas_linkrate max;
struct hisi_sas_phy *phy = &hisi_hba->phy[i];
@@ -660,13 +675,8 @@ static void init_reg_v3_hw(struct hisi_hba *hisi_hba)
hisi_sas_phy_write32(hisi_hba, i, CHL_INT1, 0xffffffff);
hisi_sas_phy_write32(hisi_hba, i, CHL_INT2, 0xffffffff);
hisi_sas_phy_write32(hisi_hba, i, RXOP_CHECK_CFG_H, 0x1000);
- hisi_sas_phy_write32(hisi_hba, i, CHL_INT1_MSK, 0xf2057fff);
- hisi_sas_phy_write32(hisi_hba, i, CHL_INT2_MSK, 0xffffbfe);
hisi_sas_phy_write32(hisi_hba, i, PHY_CTRL_RDY_MSK, 0x0);
- hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_NOT_RDY_MSK, 0x0);
hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_DWS_RESET_MSK, 0x0);
- hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_PHY_ENA_MSK, 0x0);
- hisi_sas_phy_write32(hisi_hba, i, SL_RX_BCAST_CHK_MSK, 0x0);
hisi_sas_phy_write32(hisi_hba, i, PHYCTRL_OOB_RESTART_MSK, 0x1);
hisi_sas_phy_write32(hisi_hba, i, STP_LINK_TIMER, 0x7f7a120);
hisi_sas_phy_write32(hisi_hba, i, CON_CFG_DRIVER, 0x2a0a01);
@@ -2661,7 +2671,6 @@ static int disable_host_v3_hw(struct hisi_hba *hisi_hba)
u32 status, reg_val;
int rc;
- interrupt_disable_v3_hw(hisi_hba);
hisi_sas_sync_poll_cqs(hisi_hba);
hisi_sas_write32(hisi_hba, DLVRY_QUEUE_ENABLE, 0x0);
@@ -2692,6 +2701,7 @@ static int soft_reset_v3_hw(struct hisi_hba *hisi_hba)
struct device *dev = hisi_hba->dev;
int rc;
+ interrupt_disable_v3_hw(hisi_hba);
rc = disable_host_v3_hw(hisi_hba);
if (rc) {
dev_err(dev, "soft reset: disable host failed rc=%d\n", rc);
@@ -5060,6 +5070,7 @@ static void hisi_sas_reset_prepare_v3_hw(struct pci_dev *pdev)
set_bit(HISI_SAS_RESETTING_BIT, &hisi_hba->flags);
hisi_sas_controller_reset_prepare(hisi_hba);
+ interrupt_disable_v3_hw(hisi_hba);
rc = disable_host_v3_hw(hisi_hba);
if (rc)
dev_err(dev, "FLR: disable host failed rc=%d\n", rc);
@@ -5089,6 +5100,21 @@ enum {
hip08,
};
+static void enable_host_v3_hw(struct hisi_hba *hisi_hba)
+{
+ u32 reg_val;
+
+ hisi_sas_write32(hisi_hba, DLVRY_QUEUE_ENABLE,
+ (u32)((1ULL << hisi_hba->queue_count) - 1));
+
+ phys_init_v3_hw(hisi_hba);
+ reg_val = hisi_sas_read32(hisi_hba, AXI_MASTER_CFG_BASE +
+ AM_CTRL_GLOBAL);
+ reg_val &= ~AM_CTRL_SHUTDOWN_REQ_MSK;
+ hisi_sas_write32(hisi_hba, AXI_MASTER_CFG_BASE +
+ AM_CTRL_GLOBAL, reg_val);
+}
+
static int _suspend_v3_hw(struct device *device)
{
struct pci_dev *pdev = to_pci_dev(device);
@@ -5111,14 +5137,18 @@ static int _suspend_v3_hw(struct device *device)
scsi_block_requests(shost);
set_bit(HISI_SAS_REJECT_CMD_BIT, &hisi_hba->flags);
flush_workqueue(hisi_hba->wq);
+ interrupt_disable_v3_hw(hisi_hba);
+
+ if (atomic_read(&device->power.usage_count)) {
+ dev_err(dev, "PM suspend: host status cannot be suspended\n");
+ rc = -EBUSY;
+ goto err_out;
+ }
rc = disable_host_v3_hw(hisi_hba);
if (rc) {
dev_err(dev, "PM suspend: disable host failed rc=%d\n", rc);
- clear_bit(HISI_SAS_REJECT_CMD_BIT, &hisi_hba->flags);
- clear_bit(HISI_SAS_RESETTING_BIT, &hisi_hba->flags);
- scsi_unblock_requests(shost);
- return rc;
+ goto err_out_recover_host;
}
hisi_sas_init_mem(hisi_hba);
@@ -5129,6 +5159,15 @@ static int _suspend_v3_hw(struct device *device)
dev_warn(dev, "end of suspending controller\n");
return 0;
+
+err_out_recover_host:
+ enable_host_v3_hw(hisi_hba);
+err_out:
+ interrupt_enable_v3_hw(hisi_hba);
+ clear_bit(HISI_SAS_REJECT_CMD_BIT, &hisi_hba->flags);
+ clear_bit(HISI_SAS_RESETTING_BIT, &hisi_hba->flags);
+ scsi_unblock_requests(shost);
+ return rc;
}
static int _resume_v3_hw(struct device *device)
--
2.8.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 0/4] scsi: hisi_sas: Some misc changes
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
` (3 preceding siblings ...)
2023-03-20 3:34 ` [PATCH 4/4] scsi: hisi_sas: Exit suspending state when usage count is greater than 0 chenxiang
@ 2023-04-03 1:59 ` Martin K. Petersen
2023-04-12 2:04 ` Martin K. Petersen
5 siblings, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2023-04-03 1:59 UTC (permalink / raw)
To: chenxiang; +Cc: jejb, martin.petersen, linux-scsi, linuxarm
chenxiang,
> This series contain some fixes including:
> - Grab sas_dev lock when traversing sas_dev list to avoid NULL pointer
> - Handle NCQ error when IPTT is valid
> - Ensure all enabled PHYs up during controller reset
> - Exit suspend state when usage count of runtime PM is greater than 0
Applied to 6.4/scsi-staging, thanks!
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 0/4] scsi: hisi_sas: Some misc changes
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
` (4 preceding siblings ...)
2023-04-03 1:59 ` [PATCH 0/4] scsi: hisi_sas: Some misc changes Martin K. Petersen
@ 2023-04-12 2:04 ` Martin K. Petersen
5 siblings, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2023-04-12 2:04 UTC (permalink / raw)
To: jejb, chenxiang; +Cc: Martin K . Petersen, linux-scsi, linuxarm
On Mon, 20 Mar 2023 11:34:21 +0800, chenxiang wrote:
> This series contain some fixes including:
> - Grab sas_dev lock when traversing sas_dev list to avoid NULL pointer
> - Handle NCQ error when IPTT is valid
> - Ensure all enabled PHYs up during controller reset
> - Exit suspend state when usage count of runtime PM is greater than 0
>
> Xingui Yang (2):
> scsi: hisi_sas: Grab sas_dev lock when traversing the members of
> sas_dev.list
> scsi: hisi_sas: Handle NCQ error when IPTT is valid
>
> [...]
Applied to 6.4/scsi-queue, thanks!
[1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list
https://git.kernel.org/mkp/scsi/c/71fb36b5ff11
[2/4] scsi: hisi_sas: Handle NCQ error when IPTT is valid
https://git.kernel.org/mkp/scsi/c/bb544224da77
[3/4] scsi: hisi_sas: Ensure all enabled PHYs up during controller reset
https://git.kernel.org/mkp/scsi/c/89954f024c3a
[4/4] scsi: hisi_sas: Exit suspending state when usage count is greater than 0
https://git.kernel.org/mkp/scsi/c/e368d38cb952
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-04-12 2:08 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-20 3:34 [PATCH 0/4] scsi: hisi_sas: Some misc changes chenxiang
2023-03-20 3:34 ` [PATCH 1/4] scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list chenxiang
2023-03-20 3:34 ` [PATCH 2/4] scsi: hisi_sas: Handle NCQ error when IPTT is valid chenxiang
2023-03-20 3:34 ` [PATCH 3/4] scsi: hisi_sas: Ensure all enabled PHYs up during controller reset chenxiang
2023-03-20 3:34 ` [PATCH 4/4] scsi: hisi_sas: Exit suspending state when usage count is greater than 0 chenxiang
2023-04-03 1:59 ` [PATCH 0/4] scsi: hisi_sas: Some misc changes Martin K. Petersen
2023-04-12 2:04 ` Martin K. Petersen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).