linux-scsi.vger.kernel.org archive mirror
* [issue report] pm8001 driver crashes with IOMMU enabled
@ 2021-11-24 12:28 John Garry
  2021-11-24 12:43 ` Jinpu Wang
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2021-11-24 12:28 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-scsi

Hi,

When I enable the IOMMU on my arm64 system, the pm8001 driver crashes as 
follows:

[    8.649365] pm80xx 0000:04:00.0: Adding to iommu group 0
[    8.655901] pm80xx 0000:04:00.0: pm80xx: driver version 0.1.40
[    8.661755] pm80xx 0000:04:00.0: enabling device (0140 -> 0142)
[    8.667864] :: pm8001_pci_alloc  530:Setting link rate to default value
[    9.716548] scsi host0: pm80xx
[   10.423522] Freeing initrd memory: 413456K
[   11.693443] Unable to handle kernel paging request at virtual address ffff0000fcebfb00
[   11.701348] Mem abort info:
[   11.704129]   ESR = 0x96000005
[   11.707170]   EC = 0x25: DABT (current EL), IL = 32 bits
[   11.712468]   SET = 0, FnV = 0
[   11.715510]   EA = 0, S1PTW = 0
[   11.718637]   FSC = 0x05: level 1 translation fault
[   11.723501] Data abort info:
[   11.726368]   ISV = 0, ISS = 0x00000005
[   11.730190]   CM = 0, WnR = 0
[   11.733145] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000013d43000
[   11.739832] [ffff0000fcebfb00] pgd=18000a4fffff8003, p4d=18000a4fffff8003, pud=0000000000000000
[   11.748521] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   11.754080] Modules linked in:
[   11.757122] CPU: 1 PID: 7 Comm: kworker/u192:0 Not tainted 5.16.0-rc2-dirty #102
[   11.764505] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
[   11.773015] Workqueue: 0000:04:00.0_disco_q sas_discover_domain
[   11.778926] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   11.785874] pc : pm80xx_chip_smp_req+0x2d0/0x3d0
[   11.790479] lr : pm80xx_chip_smp_req+0xac/0x3d0
[   11.794996] sp : ffff80001258ba60
[   11.798297] x29: ffff80001258ba60 x28: ffff0020a2892b50 x27: ffff0020a2898000
[   11.805421] x26: ffff0020a3ee0000 x25: 0000000000000008 x24: ffff0000fcebfb00
[   11.812546] x23: ffff8000113ab6b8 x22: 0000000000000000 x21: ffff0020a3ed0038
[   11.819670] x20: ffff0020a2890000 x19: ffff80001258badc x18: 00000000fffffffb
[   11.826794] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   11.833917] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[   11.841041] x11: 00000a20098b1000 x10: ffff0020b36515f0 x9 : 0000000000001000
[   11.848165] x8 : 00000a20098b0000 x7 : ffff8000117eb7f0 x6 : 0000000000000001
[   11.855288] x5 : 0000000000000f44 x4 : 0000000000001000 x3 : 0000000000000000
[   11.862412] x2 : ffff8000113ab698 x1 : 0000000000000004 x0 : ffff8000117eb000
[   11.869535] Call trace:
[   11.871969]  pm80xx_chip_smp_req+0x2d0/0x3d0
[   11.876226]  pm8001_task_exec.constprop.0+0x368/0x520
[   11.881266]  pm8001_queue_command+0x1c/0x30
[   11.885437]  smp_execute_task_sg+0xdc/0x204
[   11.889607]  sas_discover_expander.part.0+0xac/0x6cc
[   11.894559]  sas_discover_root_expander+0x8c/0x150
[   11.899337]  sas_discover_domain+0x3ac/0x6a0
[   11.903594]  process_one_work+0x1d0/0x354
[   11.907592]  worker_thread+0x13c/0x470
[   11.911328]  kthread+0x17c/0x190
[   11.914545]  ret_from_fork+0x10/0x20
[   11.918110] Code: 371806e1 910006d6 6b16033f 54000249 (38766b05)
[   11.924192] ---[ end trace b91d59aaee98ea2d ]---
[   11.928796] note: kworker/u192:0[7] exited with preempt_count 1


I notice that the driver is calling virt_to_phys() on a dma_addr_t, 
which is broken:

static int pm80xx_chip_smp_req(struct pm8001_hba_info *pm8001_ha,
struct pm8001_ccb_info *ccb)
{
char *preq_dma_addr = NULL;
__le64 tmp_addr;

tmp_addr = cpu_to_le64((u64)sg_dma_address(&task->smp_task.smp_req));
preq_dma_addr = (char *)phys_to_virt(tmp_addr);

How is this supposed to work? I assume that someone has enabled the 
IOMMU on a system with one of these cards before.
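
To illustrate the concern about converting a DMA address back to a CPU 
pointer, here is a minimal userspace sketch. It is a toy model, not 
kernel code: every toy_* name, the linear-map offset and the IOVA base 
are assumptions made up for the illustration. It only shows why a 
phys_to_virt()-style conversion can appear to work when the DMA address 
happens to equal the physical address, and why it yields an unrelated 
pointer once an IOMMU hands out I/O virtual addresses instead.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: names and constants below are hypothetical. */
#define TOY_LINEAR_OFFSET UINT64_C(0xffff000000000000) /* pretend linear-map base */
#define TOY_IOVA_BASE     UINT64_C(0xfce00000)         /* arbitrary IOVA window */

typedef uint64_t toy_phys_t; /* CPU physical address */
typedef uint64_t toy_dma_t;  /* address as seen by the device */

/* Models phys_to_virt(): meaningful only for real physical addresses. */
static uint64_t toy_phys_to_virt(toy_phys_t pa)
{
	return pa + TOY_LINEAR_OFFSET;
}

/* Models virt_to_phys() for an address inside the linear map. */
static toy_phys_t toy_virt_to_phys(uint64_t va)
{
	assert(va >= TOY_LINEAR_OFFSET);
	return va - TOY_LINEAR_OFFSET;
}

/*
 * Models a DMA mapping. Without an IOMMU the device address is assumed
 * to equal the physical address; with an IOMMU it is an IOVA taken
 * from a separate address space.
 */
static toy_dma_t toy_dma_map(toy_phys_t pa, int iommu_enabled)
{
	if (!iommu_enabled)
		return pa;
	return TOY_IOVA_BASE + (pa & UINT64_C(0xfffff));
}

/* Returns 1 if treating the DMA address as physical recovers 'va'. */
static int toy_naive_roundtrip_ok(uint64_t va, int iommu_enabled)
{
	toy_dma_t dma = toy_dma_map(toy_virt_to_phys(va), iommu_enabled);

	return toy_phys_to_virt(dma) == va;
}
```

In this model toy_naive_roundtrip_ok(va, 0) holds while 
toy_naive_roundtrip_ok(va, 1) does not. The faulting address in the 
oops above (a linear-map-looking prefix followed by a small value) has 
the shape this model would produce, although that reading is an 
inference and not something the log proves. In kernel code the CPU 
address of a scatterlist entry is normally taken from the scatterlist 
itself (sg_virt() exists for this) and not derived from its DMA 
address; whether that is the right fix for this driver is not settled 
by this sketch.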

I have encountered some other RAID cards which bypass the IOMMU to 
access host memory - is that potentially the case here?

Thanks,
John

* Re: [issue report] pm8001 driver crashes with IOMMU enabled
  2021-11-24 12:28 [issue report] pm8001 driver crashes with IOMMU enabled John Garry
@ 2021-11-24 12:43 ` Jinpu Wang
  2021-11-24 16:22   ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Jinpu Wang @ 2021-11-24 12:43 UTC (permalink / raw)
  To: John Garry, Viswas G, Ajish Koshy; +Cc: Jinpu Wang, linux-scsi

+cc folks from Microchip

On Wed, Nov 24, 2021 at 1:28 PM John Garry <john.garry@huawei.com> wrote:
>
> Hi,
>
> When I enable the IOMMU on my arm64 system, the pm8001 driver crashes as
> follows:
> [...]
>
> I notice that the driver is calling virt_to_phys() on a dma_addr_t,
> which is broken:
phys_to_virt you meant.
>
> static int pm80xx_chip_smp_req(struct pm8001_hba_info *pm8001_ha,
> struct pm8001_ccb_info *ccb)
> {
> char *preq_dma_addr = NULL;
> __le64 tmp_addr;
>
> tmp_addr = cpu_to_le64((u64)sg_dma_address(&task->smp_task.smp_req));
> preq_dma_addr = (char *)phys_to_virt(tmp_addr);
>
The code has been there since the initial support in 2013:
f5860992db55 ("[SCSI] pm80xx: Added SPCv/ve specific hardware
functionalities and relevant changes in common files")

> How is this supposed to work? I assume that someone has enabled the
> IOMMU on a system with one of these cards before.
I guess it's due to unaligned memory access on ARM? AFAIK most
of the users are on x86_64.
>
> I have encountered some other RAID cards which bypasses the IOMMU to
> access host memory - is that the case here potentially?
I don't know; maybe the guys from Microchip can answer.
>
> Thanks,
> John
Thanks for reporting!

* Re: [issue report] pm8001 driver crashes with IOMMU enabled
  2021-11-24 12:43 ` Jinpu Wang
@ 2021-11-24 16:22   ` John Garry
  2021-12-24  9:02     ` [issue report] pm8001 issues (was driver crashes with IOMMU enabled) John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2021-11-24 16:22 UTC (permalink / raw)
  To: Jinpu Wang, Viswas G, Ajish Koshy; +Cc: linux-scsi

On 24/11/2021 12:43, Jinpu Wang wrote:
>> I notice that the driver is calling virt_to_phys() on a dma_addr_t,
>> which is broken:
> phys_to_virt you meant.

Right

>> static int pm80xx_chip_smp_req(struct pm8001_hba_info *pm8001_ha,
>> struct pm8001_ccb_info *ccb)
>> {
>> char *preq_dma_addr = NULL;
>> __le64 tmp_addr;
>>
>> tmp_addr = cpu_to_le64((u64)sg_dma_address(&task->smp_task.smp_req));
>> preq_dma_addr = (char *)phys_to_virt(tmp_addr);
>>
> The code was there since the initial support in 2013.
> f5860992db55 ("[SCSI] pm80xx: Added SPCv/ve specific hardware
> functionalities and relevant changes in common files")
> 
>> How is this supposed to work? I assume that someone has enabled the
>> IOMMU on a system with one of these cards before.

One thing to note is that a long time ago I had to fix libsas for broken 
DMA API usage which was exposed when the IOMMU was enabled, so it seems 
strange that this problem was not noticed at that time.

See commit 9702c67c6066 ("scsi: libsas: fix ata xfer length")

> I guess it's due to the unaligned access to memory on ARM? AFAIK most
> of the user are on x86_64.

I doubt it, especially since !IOMMU seems ok.

>> I have encountered some other RAID cards which bypasses the IOMMU to
>> access host memory - is that the case here potentially?
> I don't know, maybe guys from microchip can answer.

Hopefully.

Thanks,
John


* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2021-11-24 16:22   ` John Garry
@ 2021-12-24  9:02     ` John Garry
  2021-12-24 11:58       ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2021-12-24  9:02 UTC (permalink / raw)
  To: Jinpu Wang, Viswas G, Ajish Koshy
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi

+ some recent contributors

Hi microchip guys,

Do you have any ideas about the two outstanding issues I reported for 
the pm8001 driver:
a. my arm system goes into a continuous cycle of SCSI error handling for 
this scsi host
b. maxcpus=1 on commandline crashes during bootup on my arm system - I 
assume that x86 is the same

Thanks,
John

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2021-12-24  9:02     ` [issue report] pm8001 issues (was driver crashes with IOMMU enabled) John Garry
@ 2021-12-24 11:58       ` John Garry
  2021-12-27 13:26         ` Ajish.Koshy
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2021-12-24 11:58 UTC (permalink / raw)
  To: Jinpu Wang, Viswas G, Ajish Koshy
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, Damien Le Moal

On 24/12/2021 09:02, John Garry wrote:
> + some recent contributors
> 
> Hi microchip guys,
> 
> Do you have any idea on the 2x outstanding issues I reported for the 
> pm8001 driver:
> a. my arm system goes into a continuous cycle of SCSI error handling for 
> this scsi host
> b. maxcpus=1 on commandline crashes during bootup on my arm system - I 
> assume that x86 is same also

commit 05c6c029a44d ("scsi: pm80xx: Increase number of supported
queues") looks to cause issue b.

Problem a. still exists prior to this.

Thanks,
John

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2021-12-24 11:58       ` John Garry
@ 2021-12-27 13:26         ` Ajish.Koshy
  2022-01-06 15:49           ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ajish.Koshy @ 2021-12-27 13:26 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,

Regarding the maxcpus=1 issue, I will check and try to reproduce it
on an x86 server.

As for the ARM issues, we need to check internally, as the driver was
never tested on ARM.

Thanks,
Ajish

-----Original Message-----
From: John Garry <john.garry@huawei.com> 
Sent: Friday, December 24, 2021 05:29 PM
To: Jinpu Wang <jinpu.wang@ionos.com>; Viswas G - I30667 <Viswas.G@microchip.com>; Ajish Koshy - I30923 <Ajish.Koshy@microchip.com>
Cc: linux-scsi@vger.kernel.org; vishakhavc@google.com; ipylypiv@google.com; Ruksar.devadi@microchip.com; Damien Le Moal <damien.lemoal@opensource.wdc.com>
Subject: Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)

On 24/12/2021 09:02, John Garry wrote:
> + some recent contributors
>
> Hi microchip guys,
>
> Do you have any idea on the 2x outstanding issues I reported for the
> pm8001 driver:
> a. my arm system goes into a continuous cycle of SCSI error handling 
> for this scsi host b. maxcpus=1 on commandline crashes during bootup 
> on my arm system - I assume that x86 is same also

commit 05c6c029a44d ("scsi: pm80xx: Increase number of supported queues
") looks to cause this issue.

Problem a. still exists prior to this.

Thanks,
John

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2021-12-27 13:26         ` Ajish.Koshy
@ 2022-01-06 15:49           ` John Garry
  2022-01-07 11:12             ` Ajish.Koshy
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2022-01-06 15:49 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

On 27/12/2021 13:26, Ajish.Koshy@microchip.com wrote:
> Regarding maxcpus=1 issue, will check and try to reproduce the
> same on x86 server.
> 
> And for ARM issues, need to check internally as it was never
> tested for the same.

I have found another issue. There is a potential use-after-free in 
pm8001_task_exec():

static int pm8001_task_exec()
{
	...
	case SAS_PROTOCOL_SSP:
	atomic_inc(&pm8001_dev->running_req);
	if (is_tmf)
		rc = pm8001_task_prep_ssp_tm(...);
	else
		rc = pm8001_task_prep_ssp(pm8001_ha, ccb);
	break;
	...

	if (rc) {
		pm8001_dbg(pm8001_ha, IO, "rc is %x\n", rc);
		atomic_dec(&pm8001_dev->running_req);
		goto err_out_tag;
	}
	/* TODO: select normal or high priority */
	spin_lock(&t->task_state_lock); ****
	t->task_state_flags |= SAS_TASK_AT_INITIATOR;
	spin_unlock(&t->task_state_lock);
	...
}


Once the task is dispatched to HW at ****, it is completed async, i.e. 
it may be completed and freed at any point, even before the dispatch 
function returns. So it is illegal to touch the task at this point and 
the task state must be updated before final dispatch to the HW. If you 
enable KASAN you will probably see it complain, as it did for me.
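
The ordering rule described above can be sketched in a few lines of 
userspace C. This is a toy model under stated assumptions, not the 
driver's code: the toy_* names are hypothetical, locking is left out, 
and the dispatch function deliberately models the worst case where the 
task is completed and freed before dispatch returns.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: all names here are hypothetical. */
#define TOY_TASK_AT_INITIATOR 0x1u

struct toy_task {
	unsigned int state_flags;
};

/* Records what the completion path observed, for inspection. */
static unsigned int toy_flags_seen_at_completion;

/*
 * Stands in for handing the task to hardware. Worst case modelled:
 * the task completes and is freed before this function returns.
 */
static void toy_dispatch_to_hw(struct toy_task *t)
{
	assert(t != NULL);
	toy_flags_seen_at_completion = t->state_flags;
	free(t);
}

/* Safe ordering: update the task state first, dispatch last. */
static int toy_task_exec(void)
{
	struct toy_task *t = calloc(1, sizeof(*t));

	if (!t)
		return -1;

	/* All writes to the task happen before it is handed over. */
	t->state_flags |= TOY_TASK_AT_INITIATOR;

	toy_dispatch_to_hw(t);

	/* 't' may already be freed here and must not be dereferenced. */
	return 0;
}
```

Moving the flag update below toy_dispatch_to_hw() in this model would 
write to freed memory, which is the kind of access KASAN is designed 
to report.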

Thanks,
john

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-06 15:49           ` John Garry
@ 2022-01-07 11:12             ` Ajish.Koshy
  2022-01-10 20:21               ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-07 11:12 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,
> 
> On 27/12/2021 13:26, Ajish.Koshy@microchip.com wrote:
> > Regarding maxcpus=1 issue, will check and try to reproduce the same on
> > x86 server.
> >
> > And for ARM issues, need to check internally as it was never tested
> > for the same.
> 
> I have found another issue. There is a potential use-after-free in
> pm8001_task_exec():
> 
> static int pm8001_task_exec()
> {
>         ...
>         case SAS_PROTOCOL_SSP:
>         atomic_inc(&pm8001_dev->running_req);
>         if (is_tmf)
>                 rc = pm8001_task_prep_ssp_tm(...);
>         else
>                 rc = pm8001_task_prep_ssp(pm8001_ha, ccb);
>         break;
>         ...
> 
>         if (rc) {
>                 pm8001_dbg(pm8001_ha, IO, "rc is %x\n", rc);
>                 atomic_dec(&pm8001_dev->running_req);
>                 goto err_out_tag;
>         }
>         /* TODO: select normal or high priority */
>         spin_lock(&t->task_state_lock); ****
>         t->task_state_flags |= SAS_TASK_AT_INITIATOR;
>         spin_unlock(&t->task_state_lock);
>         ...
> }
> 
> 
> Once the task is dispatched to HW at ****, it is completed async, i.e.
> it may be completed and freed at any point, even before the dispatch
> function returns. So it is illegal to touch the task at this point and the task
> state must be updated before final dispatch to the HW. If you enable KASAN
> you will prob see it yell like I saw.
> 

I had a similar thought. After dispatch to HW, there is no point in
touching the task state. But since the code is in the IO path, the change
may need further testing.

> Thanks,
> john

Thanks,
Ajish

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-07 11:12             ` Ajish.Koshy
@ 2022-01-10 20:21               ` John Garry
  2022-01-11 12:40                 ` Ajish.Koshy
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2022-01-10 20:21 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

On 07/01/2022 11:12, Ajish.Koshy@microchip.com wrote:
>> Once the task is dispatched to HW at ****, it is completed async, i.e.
>> it may be completed and freed at any point, even before the dispatch
>> function returns. So it is illegal to touch the task at this point and the task
>> state must be updated before final dispatch to the HW. If you enable KASAN
>> you will prob see it yell like I saw.
>>
> I too have similar thought here. After dispatch to HW, no point to touch the
> task state. But since the code is in IO path, may need further testing.
> 

Hi,

Have you made any progress on the hang which I see on my arm64 system?

I think that you said that you can also see it on an arm64 system - 
would that be with a similar card to mine? I think mine is 8008/9

I have tested some older kernels and v4.11 seems much better.

Thanks,
John

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-10 20:21               ` John Garry
@ 2022-01-11 12:40                 ` Ajish.Koshy
  2022-01-11 13:23                   ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-11 12:40 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,

> >> Once the task is dispatched to HW at ****, it is completed async, i.e.
> >> it may be completed and freed at any point, even before the dispatch
> >> function returns. So it is illegal to touch the task at this point
> >> and the task state must be updated before final dispatch to the HW.
> >> If you enable KASAN you will prob see it yell like I saw.
> >>
> > I too have similar thought here. After dispatch to HW, no point to
> > touch the task state. But since the code is in IO path, may need further
> testing.
> >
> 
> Hi,
> 
> Have you made any progress on the hang which I see on my arm64 system?

Not planned for ARM server.

> 
> I think that you said that you can also see it on an arm64 system - would that
> be with a similar card to mine? I think mine is 8008/9

That was a similar card, i.e. 8076.

> 
> I have tested some older kernels and v4.11 seems much better.
> 
> Thanks,
> John

Just to clarify, the following issues were mentioned in this
thread. Right now I am on an x86 server and don't have an
8008/8009 controller with me here.
Issues:
1. Driver crashes when IOMMU is enabled. Patch already
submitted.
   - Issue was seen on x86 server too.
2. SCSI error handler is triggered on ARM server.
   - Issue not observed on x86 server.
3. maxcpus=1 on commandline crashes during bootup.
   Issue with 8008/8009 controller. Patch created.
   - Based on the code, the issue impacts x86 too.
4. "I have found another issue. There is a potential
   use-after-free in pm8001_task_exec():", where we
   modify task state after dispatching the task to hardware.
   - Generic code. Impacts all platforms, x86 and ARM.

Let us know if we missed any other issue here, or any
issue that impacts x86 too.

Thanks,
Ajish

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-11 12:40                 ` Ajish.Koshy
@ 2022-01-11 13:23                   ` John Garry
  2022-01-13 12:52                     ` Ajish.Koshy
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2022-01-11 13:23 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi Ajish,

>>
>> Have you made any progress on the hang which I see on my arm64 system?
> Not planned for ARM server.
> 
>> I think that you said that you can also see it on an arm64 system - would that
>> be with a similar card to mine? I think mine is 8008/9
> That was similar card i.e. 8076.
> 
>> I have tested some older kernels and v4.11 seems much better.
>>
>> Thanks,
>> John
> Just to get more clarification, in the same thread
> following issues were mentioned. Right now
> I am on x86 server. Don't have 8008/8009 controller
> with me here.
> Issues:
> 1. Driver crashes when IOMMU is enabled. Patch already
> submitted.
>     - Issue was seen on x86 server too.
> 2. Observed triggering of scsi error handler on
>     ARM server.
>     - Issue not observed on x86 server

Your position on this one is not clear.

From an earlier mail [0] I got the impression that you tested on an arm 
platform – did you?

I just don't know for certain whether this is a card issue, a driver 
issue, or both. I have a strong feeling that it is a driver issue. As I 
mentioned, v4.11 seems to work much better than v5.16 - on v4.11 I can 
mount the filesystem and copy files, which is not possible on a newer 
kernel.

IIRC I did use this same card on an x86 platform some time ago and it 
worked ok, but I can't be certain. And it's really painful for me to swap 
the card to an x86 machine to test.

> 3. maxcpus=1 on commandline crashes during bootup.
>     Issue with 8008/8009 controller. Patch created.
>     - Issue impacts x86 too based on the code.
> 4. "I have found another issue. There is a potential
>     use-after-free in pm8001_task_exec():", where we
>     modify task state post task dispatch to hardware
>     - Generic code. Impact on all platform x86 and ARM.
>     
> Let us know if any other issue missed out to
> mention here or issues that impacts x86 too.

Your list looks ok. However I did also mention these logs which I saw on 
my arm machine:

[   12.160631] sas: target proto 0x0 at 500e004aaaaaaa1f:0x10 not handled
[   12.167183] sas: ex 500e004aaaaaaa1f phy16 failed to discover

They are red flags, and may be related to 2, above.

Thanks,
John

[0] 
https://lore.kernel.org/linux-scsi/PH0PR11MB51122D76F40E164C31AFEE54EC719@PH0PR11MB5112.namprd11.prod.outlook.com/


* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-11 13:23                   ` John Garry
@ 2022-01-13 12:52                     ` Ajish.Koshy
  2022-01-13 14:17                       ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-13 12:52 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,

> Hi Ajish,
> 
> >>
> >> Have you made any progress on the hang which I see on my arm64
> system?
> > Not planned for ARM server.
> >
> >> I think that you said that you can also see it on an arm64 system -
> >> would that be with a similar card to mine? I think mine is 8008/9
> > That was similar card i.e. 8076.
> >
> >> I have tested some older kernels and v4.11 seems much better.
> >>
> >> Thanks,
> >> John
> > Just to get more clarification, in the same thread following issues
> > were mentioned. Right now I am on x86 server. Don't have 8008/8009
> > controller with me here.
> > Issues:
> > 1. Driver crashes when IOMMU is enabled. Patch already submitted.
> >     - Issue was seen on x86 server too.
> > 2. Observed triggering of scsi error handler on
> >     ARM server.
> >     - Issue not observed on x86 server
> 
> Your position on this is not clear on this one.
> 
>  From an earlier mail [0] I got the impression that you tested on an arm
> platform – did you?

Yes. At the time of my previous mail I got the chance to load the
driver on one of our testers' ARM servers, with an enclosure connected,
after attaching the controller card. The error handling issue was
observed there.

The card/driver had never been tested or validated on an ARM server
before, so I was curious to see the behavior for the first time. The
driver loads smoothly on an x86 server, however.

I am currently busy with some other issues, so debugging on the ARM
server is not planned for now.

> 
> I just don't know for certain that this is a card issue or an issue with the
> driver issue or both. I have a strong feeling that it is a driver issue. As I
> mentioned, v4.11 seems to work much better than v5.16 - on
> v4.11 I can mount the filesystem and copy files, which is not possible on a
> new kernel.
> 
> IIRC I did use this same card on an x86 platform some time and it worked ok,
> but I can't be certain. And it's really painful for me to swap the card to an x86
> machine to test.
> 
> > 3. maxcpus=1 on commandline crashes during bootup.
> >     Issue with 8008/8009 controller. Patch created.
> >     - Issue impacts x86 too based on the code.
> > 4. "I have found another issue. There is a potential
> >     use-after-free in pm8001_task_exec():", where we
> >     modify task state post task dispatch to hardware
> >     - Generic code. Impact on all platform x86 and ARM.
> >
> > Let us know if any other issue missed out to mention here or issues
> > that impacts x86 too.
> 
> Your list looks ok. However I did also mention these logs which I saw on my
> arm machine:
> 
> [   12.160631] sas: target proto 0x0 at 500e004aaaaaaa1f:0x10 not handled
> [   12.167183] sas: ex 500e004aaaaaaa1f phy16 failed to discover
> 
> They are red flags, and may be related to 2, above.
> 
> Thanks,
> John
> 
> [0]
> https://lore.kernel.org/linux-scsi/PH0PR11MB51122D76F40E164C31AFEE54EC719@PH0PR11MB5112.namprd11.prod.outlook.com/

Thanks,
Ajish

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-13 12:52                     ` Ajish.Koshy
@ 2022-01-13 14:17                       ` John Garry
  2022-01-14 18:21                         ` John Garry
  2022-01-17 14:02                         ` Ajish.Koshy
  0 siblings, 2 replies; 18+ messages in thread
From: John Garry @ 2022-01-13 14:17 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

On 13/01/2022 12:52, Ajish.Koshy@microchip.com wrote:

Hi Ajish,

>>   From an earlier mail [0] I got the impression that you tested on an arm
>> platform – did you?
> Yes, with respect to my previous mail update, at that time got the chance to
> load the driver on ARM server/enclosure connected in one of our tester's
> arm server after attaching the controller card.
> There this error handling issue was observed.
> 
> The card/driver was never tested or validated on ARM server before,
> was curious to see the behavior for the first time. Whereas driver
> loads smoothly on x86 server.
> 
> Currently busy with some other issues, debugging on ARM server is not
> planned for now.
> 

OK, since you do see this same/similar issue with another card on arm 
then I think that it is safe to assume that it is a driver issue.

If you can share the dmesg on the arm machine then at least that would 
be helpful.

Thanks,
John

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-13 14:17                       ` John Garry
@ 2022-01-14 18:21                         ` John Garry
  2022-01-17 13:56                           ` Ajish.Koshy
  2022-01-17 14:02                         ` Ajish.Koshy
  1 sibling, 1 reply; 18+ messages in thread
From: John Garry @ 2022-01-14 18:21 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

On 13/01/2022 14:17, John Garry wrote:
>>>   From an earlier mail [0] I got the impression that you tested on an 
>>> arm
>>> platform – did you?
>> Yes, with respect to my previous mail update, at that time got the 
>> chance to
>> load the driver on ARM server/enclosure connected in one of our tester's
>> arm server after attaching the controller card.
>> There this error handling issue was observed.
>>
>> The card/driver was never tested or validated on ARM server before,
>> was curious to see the behavior for the first time. Whereas driver
>> loads smoothly on x86 server.
>>
>> Currently busy with some other issues, debugging on ARM server is not
>> planned for now.
>>
> 
> OK, since you do see this same/similar issue with another card on arm 
> then I think that it is safe to assume that it is a driver issue.
> 
> If you can share the dmesg on the arm machine then at least that would 
> be helpful.

I notice that UBSAN complains:

[   19.231481] ================================================================================
[   19.239926] UBSAN: shift-out-of-bounds in drivers/scsi/pm8001/pm80xx_hwi.c:1743:17
[   19.247490] shift exponent 32 is too large for 32-bit type 'int'
[   19.253490] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.16.0-rc3-00389-g1758b8fcdbf7 #1018
[   19.261915] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
[   19.270426] Workqueue: events work_for_cpu_fn
[   19.274777] Call trace:
[   19.277211]  dump_backtrace+0x0/0x1b0
[   19.280863]  show_stack+0x1c/0x30
[   19.284167]  dump_stack_lvl+0x7c/0xa8
[   19.287818]  dump_stack+0x1c/0x38
[   19.291121]  ubsan_epilogue+0x10/0x54
[   19.294771]  __ubsan_handle_shift_out_of_bounds+0x148/0x180
[   19.300332]  pm80xx_chip_interrupt_enable+0x74/0x19c
[   19.305287]  pm8001_pci_probe+0xf8c/0x1610
[   19.309372]  local_pci_probe+0x44/0xb0
[   19.313112]  work_for_cpu_fn+0x20/0x34
[   19.316851]  process_one_work+0x224/0x42c
[   19.320849]  worker_thread+0x204/0x44c
[   19.324585]  kthread+0x174/0x190
[   19.327802]  ret_from_fork+0x10/0x20
[   19.331377] ==========================

Here's the code:
static void
pm80xx_chip_interrupt_enable(struct pm8001_hba_info *pm8001_ha, u8 vec)
{
#ifdef PM8001_USE_MSIX
	u32 mask;
	mask = (u32)(1 << vec);

	pm8001_cw32(pm8001_ha, 0, MSGU_ODMR_CLR, (u32)(mask & 0xFFFFFFFF));
	return;
#endif
	pm80xx_chip_intx_interrupt_enable(pm8001_ha);

}

So vec can be >= 32 now and those interrupts are now used - are we 
missing some operations for the upper bits?

Something else I notice is that pm80xx_set_sas_protocol_timer_config() 
is called before the tags are setup in pm8001_init_ccb_tag(), and this 
always fails silently as no tags are available for the command.

I also think that for the tags management, since you use spinlock in 
alloc, spinlock in the free path should also be used, like:

diff --git a/drivers/scsi/pm8001/pm8001_sas.c 
b/drivers/scsi/pm8001/pm8001_sas.c
index 83e73009db5c..0a5e5b5f6975 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -65,7 +65,11 @@ static int pm8001_find_tag(struct sas_task *task, u32 
*tag)
  void pm8001_tag_free(struct pm8001_hba_info *pm8001_ha, u32 tag)
  {
  	void *bitmap = pm8001_ha->tags;
-	clear_bit(tag, bitmap);
+	unsigned long flags;
+
+	spin_lock_irqsave(&pm8001_ha->bitmap_lock, flags);
+	__clear_bit(tag, bitmap);
+	spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
  }

  /**
@@ -85,7 +89,7 @@ inline int pm8001_tag_alloc(struct pm8001_hba_info 
*pm8001_ha, u32 *tag_out)
  		spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
  		return -SAS_QUEUE_FULL;
  	}
-	set_bit(tag, bitmap);
+	__set_bit(tag, bitmap);
  	spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
  	*tag_out = tag;
  	return 0;


Thanks,
John

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-14 18:21                         ` John Garry
@ 2022-01-17 13:56                           ` Ajish.Koshy
  0 siblings, 0 replies; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-17 13:56 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

> >>>   From an earlier mail [0] I got the impression that you tested on
> >>> an arm platform – did you?
> >> Yes, with respect to my previous mail update, at that time got the
> >> chance to load the driver on ARM server/enclosure connected in one of
> >> our tester's arm server after attaching the controller card.
> >> There this error handling issue was observed.
> >>
> >> The card/driver was never tested or validated on ARM server before,
> >> was curious to see the behavior for the first time. Whereas driver
> >> loads smoothly on x86 server.
> >>
> >> Currently busy with some other issues, debugging on ARM server is not
> >> planned for now.
> >>
> >
> > OK, since you do see this same/similar issue with another card on arm
> > then I think that it is safe to assume that it is a driver issue.
> >
> > If you can share the dmesg on the arm machine then at least that would
> > be helpful.
> 
> I notice that UBSAN complains:
> 
> [   19.231481] ================================================================================
> [   19.239926] UBSAN: shift-out-of-bounds in
> drivers/scsi/pm8001/pm80xx_hwi.c:1743:17
> [   19.247490] shift exponent 32 is too large for 32-bit type 'int'
> [   19.253490] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted
> 5.16.0-rc3-00389-g1758b8fcdbf7 #1018
> [   19.261915] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI
> RC0 - V1.16.01 03/15/2019
> [   19.270426] Workqueue: events work_for_cpu_fn
> [   19.274777] Call trace:
> [   19.277211]  dump_backtrace+0x0/0x1b0
> [   19.280863]  show_stack+0x1c/0x30
> [   19.284167]  dump_stack_lvl+0x7c/0xa8
> [   19.287818]  dump_stack+0x1c/0x38
> [   19.291121]  ubsan_epilogue+0x10/0x54
> [   19.294771]  __ubsan_handle_shift_out_of_bounds+0x148/0x180
> [   19.300332]  pm80xx_chip_interrupt_enable+0x74/0x19c
> [   19.305287]  pm8001_pci_probe+0xf8c/0x1610
> [   19.309372]  local_pci_probe+0x44/0xb0
> [   19.313112]  work_for_cpu_fn+0x20/0x34
> [   19.316851]  process_one_work+0x224/0x42c
> [   19.320849]  worker_thread+0x204/0x44c
> [   19.324585]  kthread+0x174/0x190
> [   19.327802]  ret_from_fork+0x10/0x20
> [   19.331377] ==========================
> 
> Here's the code:
> static void
> pm80xx_chip_interrupt_enable(struct pm8001_hba_info *pm8001_ha, u8 vec)
> {
> #ifdef PM8001_USE_MSIX
>         u32 mask;
>         mask = (u32)(1 << vec);
> 
>         pm8001_cw32(pm8001_ha, 0, MSGU_ODMR_CLR, (u32)(mask & 0xFFFFFFFF));
>         return;
> #endif
>         pm80xx_chip_intx_interrupt_enable(pm8001_ha);
> 
> }
> 
> So vec can be >= 32 now and those interrupts are now used - are we missing
> some operations for the upper bits?

Yes. At first look, it seems we are missing it.

#define MSGU_ODMR_CLR 0x38
#define MSGU_ODMR_CLR_U 0x3C

0x38
Address offset 0x38 - bits 31:0
Address offset 0x3C - bits 63:32

The same applies to these registers too:
#define MSGU_ODMR 0x30
#define MSGU_ODMR_U 0x34

0x30
Address offset 0x30 - bits 31:0
Address offset 0x34 - bits 63:32

Let me go through the internals first.
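
One way the upper bits could be handled, as a minimal user-space sketch and not a tested driver change: pick the register from the vector number and shift an unsigned 1, so the shift exponent never reaches 32. The two offsets are the defines quoted above; `struct reg_write` and `odmr_clr_target()` are hypothetical names used only for illustration, and the sketch assumes the hardware exposes at most 64 vectors.

```c
#include <assert.h>
#include <stdint.h>

#define MSGU_ODMR_CLR   0x38 /* bits 31:0  */
#define MSGU_ODMR_CLR_U 0x3C /* bits 63:32 */

/* Hypothetical helper type: the register to write and the bit to set. */
struct reg_write {
	uint32_t offset;
	uint32_t mask;
};

/* Map an MSI-X vector number to the ODMR clear register and bit mask. */
struct reg_write odmr_clr_target(uint8_t vec)
{
	struct reg_write w;

	assert(vec < 64); /* assumption: two 32-bit registers, 64 vectors */

	if (vec < 32) {
		w.offset = MSGU_ODMR_CLR;
		w.mask = UINT32_C(1) << vec;        /* unsigned, exponent 0..31 */
	} else {
		w.offset = MSGU_ODMR_CLR_U;
		w.mask = UINT32_C(1) << (vec - 32); /* exponent 0..31 again */
	}
	return w;
}
```

In the driver the write would presumably then be pm8001_cw32(pm8001_ha, 0, w.offset, w.mask), with the same split for MSGU_ODMR / MSGU_ODMR_U on the disable side, but that still needs checking against the hardware documentation.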

> 
> Something else I notice is that pm80xx_set_sas_protocol_timer_config()
> is called before the tags are setup in pm8001_init_ccb_tag(), and this always
> fails silently as no tags are available for the command.

You are right.
Currently both the call sequence and the error handling are
incorrect.

Probe()
rc = PM8001_CHIP_DISP->chip_init(pm8001_ha); {pm80xx_chip_init()}
        ret = pm80xx_set_sas_protocol_timer_config() // error handling
                 rc = pm8001_tag_alloc(pm8001_ha, &tag);
                            void *bitmap = pm8001_ha->tags;
		

rc = pm8001_init_ccb_tag(pm8001_ha, shost, pdev);
	pm8001_ha->tags = kzalloc(ccb_count, GFP_KERNEL);
	if (!pm8001_ha->tags)
		goto err_out;

I will submit a patch for this.

> 
> I also think that for the tags management, since you use spinlock in alloc,
> spinlock in the free path should also be used, like:
> 
> diff --git a/drivers/scsi/pm8001/pm8001_sas.c
> b/drivers/scsi/pm8001/pm8001_sas.c
> index 83e73009db5c..0a5e5b5f6975 100644
> --- a/drivers/scsi/pm8001/pm8001_sas.c
> +++ b/drivers/scsi/pm8001/pm8001_sas.c
> @@ -65,7 +65,11 @@ static int pm8001_find_tag(struct sas_task *task, u32
> *tag)
>   void pm8001_tag_free(struct pm8001_hba_info *pm8001_ha, u32 tag)
>   {
>         void *bitmap = pm8001_ha->tags;
> -       clear_bit(tag, bitmap);
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&pm8001_ha->bitmap_lock, flags);
> +       __clear_bit(tag, bitmap);
> +       spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
>   }
> 
>   /**
> @@ -85,7 +89,7 @@ inline int pm8001_tag_alloc(struct pm8001_hba_info
> *pm8001_ha, u32 *tag_out)
>                 spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
>                 return -SAS_QUEUE_FULL;
>         }
> -       set_bit(tag, bitmap);
> +       __set_bit(tag, bitmap);
>         spin_unlock_irqrestore(&pm8001_ha->bitmap_lock, flags);
>         *tag_out = tag;
>         return 0;

The diff changes look fine to me.
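
For reference, the locking pattern in the diff can be modelled in user space roughly as below. This is only a sketch: a pthread mutex stands in for bitmap_lock, plain loads and stores stand in for __set_bit()/__clear_bit(), and `struct tag_pool`, `tag_alloc()`, `tag_free()` and `NR_TAGS` are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>

#define NR_TAGS       64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_TAGS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct tag_pool {
	pthread_mutex_t lock;         /* stands in for bitmap_lock */
	unsigned long bits[NR_WORDS]; /* one bit per tag, 1 = in use */
};

struct tag_pool pool = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Find and claim the lowest free tag; returns 0 on success, -1 if full. */
int tag_alloc(struct tag_pool *p, unsigned int *tag_out)
{
	int rc = -1;

	pthread_mutex_lock(&p->lock);
	for (unsigned int tag = 0; tag < NR_TAGS; tag++) {
		unsigned long *word = &p->bits[tag / BITS_PER_WORD];
		unsigned long bit = 1UL << (tag % BITS_PER_WORD);

		if (!(*word & bit)) {
			*word |= bit; /* non-atomic is fine: lock is held */
			*tag_out = tag;
			rc = 0;
			break;
		}
	}
	pthread_mutex_unlock(&p->lock);
	return rc;
}

/* Release a tag under the same lock that the allocation path takes. */
void tag_free(struct tag_pool *p, unsigned int tag)
{
	assert(tag < NR_TAGS);

	pthread_mutex_lock(&p->lock);
	p->bits[tag / BITS_PER_WORD] &= ~(1UL << (tag % BITS_PER_WORD));
	pthread_mutex_unlock(&p->lock);
}
```

Since the search and the set in the alloc path are one critical section, taking the same lock in the free path means the non-atomic bit operations are enough on both sides.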

> 
> 
> Thanks,
> John

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-13 14:17                       ` John Garry
  2022-01-14 18:21                         ` John Garry
@ 2022-01-17 14:02                         ` Ajish.Koshy
  2022-01-18 15:49                           ` John Garry
  1 sibling, 1 reply; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-17 14:02 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,

> Hi Ajish,
> 
> >>   From an earlier mail [0] I got the impression that you tested on an
> >> arm platform – did you?
> > Yes, with respect to my previous mail update, at that time got the
> > chance to load the driver on ARM server/enclosure connected in one of
> > our tester's arm server after attaching the controller card.
> > There this error handling issue was observed.
> >
> > The card/driver was never tested or validated on ARM server before,
> > was curious to see the behavior for the first time. Whereas driver
> > loads smoothly on x86 server.
> >
> > Currently busy with some other issues, debugging on ARM server is not
> > planned for now.
> >
> 
> OK, since you do see this same/similar issue with another card on arm then I
> think that it is safe to assume that it is a driver issue.
> 
> If you can share the dmesg on the arm machine then at least that would be
> helpful.

Right now the arm configuration is not available. Will be difficult
to get dmesg. 
> 
> Thanks,
> John

Thanks,
Ajish

* Re: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-17 14:02                         ` Ajish.Koshy
@ 2022-01-18 15:49                           ` John Garry
  2022-01-19 13:49                             ` Ajish.Koshy
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2022-01-18 15:49 UTC (permalink / raw)
  To: Ajish.Koshy, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi Ajish,

>>
>>>>    From an earlier mail [0] I got the impression that you tested on an
>>>> arm platform – did you?
>>> Yes, with respect to my previous mail update, at that time got the
>>> chance to load the driver on ARM server/enclosure connected in one of
>>> our tester's arm server after attaching the controller card.
>>> There this error handling issue was observed.
>>>
>>> The card/driver was never tested or validated on ARM server before,
>>> was curious to see the behavior for the first time. Whereas driver
>>> loads smoothly on x86 server.
>>>
>>> Currently busy with some other issues, debugging on ARM server is not
>>> planned for now.
>>>
>> OK, since you do see this same/similar issue with another card on arm then I
>> think that it is safe to assume that it is a driver issue.
>>
>> If you can share the dmesg on the arm machine then at least that would be
>> helpful.
> Right now the arm configuration is not available. Will be difficult
> to get dmesg.

By adding (enabling) a tonne of debug logs in the driver and
enabling heavy kernel debug config options, mount+umount works reliably.
So it looks like a timing issue / memory barrier issue - yuck. Since the 
issue is so reliably produced it seems unlikely to be a barrier issue.

There are lots of files in the shost sysfs folder - can any of these be 
used to help debug?

Thanks,
John

* RE: [issue report] pm8001 issues (was driver crashes with IOMMU enabled)
  2022-01-18 15:49                           ` John Garry
@ 2022-01-19 13:49                             ` Ajish.Koshy
  0 siblings, 0 replies; 18+ messages in thread
From: Ajish.Koshy @ 2022-01-19 13:49 UTC (permalink / raw)
  To: john.garry, jinpu.wang, Viswas.G
  Cc: linux-scsi, vishakhavc, ipylypiv, Ruksar.devadi, damien.lemoal,
	Vasanthalakshmi.Tharmarajan

Hi John,
 
> Hi Ajish,
> 
> >>
> >>>>    From an earlier mail [0] I got the impression that you tested on
> >>>> an arm platform – did you?
> >>> Yes, with respect to my previous mail update, at that time got the
> >>> chance to load the driver on ARM server/enclosure connected in one
> >>> of our tester's arm server after attaching the controller card.
> >>> There this error handling issue was observed.
> >>>
> >>> The card/driver was never tested or validated on ARM server before,
> >>> was curious to see the behavior for the first time. Whereas driver
> >>> loads smoothly on x86 server.
> >>>
> >>> Currently busy with some other issues, debugging on ARM server is
> >>> not planned for now.
> >>>
> >> OK, since you do see this same/similar issue with another card on arm
> >> then I think that it is safe to assume that it is a driver issue.
> >>
> >> If you can share the dmesg on the arm machine then at least that
> >> would be helpful.
> > Right now the arm configuration is not available. Will be difficult to
> > get dmesg.
> 
> By adding (enabling) a tonne of debug logs in the driver and enabling
> heavy kernel debug config options, mount+umount works reliably.
> So it looks like a timing issue / memory barrier issue - yuck. Since the issue is
> so reliably produced it seems unlikely to be a barrier issue.
> 
> There are lots of files in the shost sysfs folder - can any of these be used to
> help debug?

For driver-level debugging I normally use the "logging_level" sysfs attribute to
enable and disable logs of different levels at run time.

For example, to enable IO logging:
cat logging_level
00000201h
echo 0x209 > logging_level
cat logging_level
00000209h
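
The value is a bitmask, so the change above is just OR-ing one bit into the current level. A small sketch of that arithmetic follows; the 0x08 value is inferred from the difference between 0x201 and 0x209 in the example, and `IO_LOGGING_BIT` and both helper names are hypothetical, not taken from the driver headers.

```c
#include <stdint.h>

/* Assumed from the example above: 0x209 ^ 0x201 == 0x08. */
#define IO_LOGGING_BIT UINT32_C(0x08)

/* Return the logging level with IO logging switched on. */
uint32_t enable_io_logging(uint32_t level)
{
	return level | IO_LOGGING_BIT;
}

/* Return the logging level with IO logging switched off again. */
uint32_t disable_io_logging(uint32_t level)
{
	return level & ~IO_LOGGING_BIT;
}
```

Writing the old value (0x201 here) back to logging_level turns the IO logs off again without touching the other bits.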

> 
> Thanks,
> John

Thanks,
Ajish

Thread overview: 18+ messages
2021-11-24 12:28 [issue report] pm8001 driver crashes with IOMMU enabled John Garry
2021-11-24 12:43 ` Jinpu Wang
2021-11-24 16:22   ` John Garry
2021-12-24  9:02     ` [issue report] pm8001 issues (was driver crashes with IOMMU enabled) John Garry
2021-12-24 11:58       ` John Garry
2021-12-27 13:26         ` Ajish.Koshy
2022-01-06 15:49           ` John Garry
2022-01-07 11:12             ` Ajish.Koshy
2022-01-10 20:21               ` John Garry
2022-01-11 12:40                 ` Ajish.Koshy
2022-01-11 13:23                   ` John Garry
2022-01-13 12:52                     ` Ajish.Koshy
2022-01-13 14:17                       ` John Garry
2022-01-14 18:21                         ` John Garry
2022-01-17 13:56                           ` Ajish.Koshy
2022-01-17 14:02                         ` Ajish.Koshy
2022-01-18 15:49                           ` John Garry
2022-01-19 13:49                             ` Ajish.Koshy
