linux-kernel.vger.kernel.org archive mirror
* [PATCH] scsi: lpfc: Move work items to a stack list
@ 2019-11-05  8:08 Daniel Wagner
  2019-11-13  3:15 ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Wagner @ 2019-11-05  8:08 UTC (permalink / raw)
  To: linux-scsi; +Cc: linux-kernel, Daniel Wagner, James Smart, Dick Kennedy

Move all work items to a list which lives on the stack while holding
the corresponding lock.

With this we avoid a race between testing whether the list is empty
and extracting an element from the list. Although the
list_remove_head() macro returns a NULL pointer when the list is
empty, the two functions lpfc_sli_handle_slow_ring_event_s4() and
lpfc_sli4_els_xri_abort_event_proc() do not test the returned element
for NULL.

Instead of adding another NULL pointer test, avoid this access
pattern altogether by using the stack list. This also avoids toggling
interrupts on/off for every item.

Fixes: 4f774513f7b3 ("[SCSI] lpfc 8.3.2 : Addition of SLI4 Interface - Queues")
Cc: James Smart <james.smart@broadcom.com>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---

Hi,

While trying to understand what's going on in the Oops below, I figured
that it could be the result of an invalid pointer access. The patch
still needs testing by our customer, but independent of this I think
the patch fixes a real bug.

[  139.392029] general protection fault: 0000 [#1] SMP PTI
[  139.397862] CPU: 5 PID: 998 Comm: kworker/5:13 Tainted: G                   4.12.14-226.g94364da-default #1 SLE15-SP1 (unreleased)
[  139.410962] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.9.1 12/04/2018
[  139.419339] Workqueue: lpfc_wq lpfc_sli4_hba_process_cq [lpfc]
[  139.425847] task: ffff95c996051440 task.stack: ffffaa038601c000
[  139.432459] RIP: 0010:lpfc_set_rrq_active+0xa6/0x2a0 [lpfc]
[  139.438676] RSP: 0018:ffffaa038601fcf8 EFLAGS: 00010046
[  139.444504] RAX: 0000000000000292 RBX: ffff95c5a9a0a000 RCX: 000000000000ffff
[  139.452466] RDX: ffff95c5accbb7b8 RSI: 0064695f74726f70 RDI: ffff95c5a9a0b160
[  139.460427] RBP: ffff95c5a9a0b160 R08: 0000000000000001 R09: 0000000000000002
[  139.468389] R10: ffffaa038601fdd8 R11: 61c8864680b583eb R12: 0000000000000001
[  139.476350] R13: 000000000000ffff R14: 00000000000002bb R15: 0064695f74726f70
[  139.484311] FS:  0000000000000000(0000) GS:ffff95c9bfc80000(0000) knlGS:0000000000000000
[  139.493340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  139.499749] CR2: 0000560354607098 CR3: 00000007a000a003 CR4: 00000000001606e0
[  139.507711] Call Trace:
[  139.510451]  lpfc_sli4_io_xri_aborted+0x1a7/0x250 [lpfc]
[  139.516386]  lpfc_sli4_sp_handle_abort_xri_wcqe.isra.56+0xa0/0x180 [lpfc]
[  139.523964]  ? __switch_to_asm+0x40/0x70
[  139.528338]  ? __switch_to_asm+0x34/0x70
[  139.532718]  ? lpfc_sli4_fp_handle_cqe+0xc3/0x450 [lpfc]
[  139.538649]  lpfc_sli4_fp_handle_cqe+0xc3/0x450 [lpfc]
[  139.544383]  ? __switch_to_asm+0x34/0x70
[  139.548762]  __lpfc_sli4_process_cq+0xea/0x220 [lpfc]
[  139.554393]  ? lpfc_sli4_sp_handle_abort_xri_wcqe.isra.56+0x180/0x180 [lpfc]
[  139.562557]  __lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc]
[  139.568486]  process_one_work+0x1da/0x400
[  139.572959]  worker_thread+0x2b/0x3f0
[  139.577044]  ? process_one_work+0x400/0x400
[  139.581710]  kthread+0x113/0x130
[  139.585310]  ? kthread_create_worker_on_cpu+0x50/0x50
[  139.590945]  ret_from_fork+0x35/0x40

Thanks,
Daniel

 drivers/scsi/lpfc/lpfc_sli.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 294f041961a8..cbeb1f408ccc 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -3903,16 +3903,16 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
 	struct lpfc_cq_event *cq_event;
 	unsigned long iflag;
 	int count = 0;
+	LIST_HEAD(work_list);
 
 	spin_lock_irqsave(&phba->hbalock, iflag);
 	phba->hba_flag &= ~HBA_SP_QUEUE_EVT;
+	list_splice_init(&phba->sli4_hba.sp_queue_event, &work_list);
 	spin_unlock_irqrestore(&phba->hbalock, iflag);
-	while (!list_empty(&phba->sli4_hba.sp_queue_event)) {
+	while (!list_empty(&work_list)) {
 		/* Get the response iocb from the head of work queue */
-		spin_lock_irqsave(&phba->hbalock, iflag);
-		list_remove_head(&phba->sli4_hba.sp_queue_event,
-				 cq_event, struct lpfc_cq_event, list);
-		spin_unlock_irqrestore(&phba->hbalock, iflag);
+		list_remove_head(&work_list, cq_event,
+				 struct lpfc_cq_event, list);
 
 		switch (bf_get(lpfc_wcqe_c_code, &cq_event->cqe.wcqe_cmpl)) {
 		case CQE_CODE_COMPL_WQE:
@@ -3941,6 +3941,17 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
 		if (count == 64)
 			break;
 	}
+
+	/*
+	 * If the limit stops the processing of events, move the
+	 * remaining events back to the main event queue.
+	 */
+	spin_lock_irqsave(&phba->hbalock, iflag);
+	if (!list_empty(&work_list)) {
+		phba->hba_flag |= HBA_SP_QUEUE_EVT;
+		list_splice(&work_list, &phba->sli4_hba.sp_queue_event);
+	}
+	spin_unlock_irqrestore(&phba->hbalock, iflag);
 }
 
 /**
@@ -12877,18 +12888,19 @@ lpfc_sli_intr_handler(int irq, void *dev_id)
 void lpfc_sli4_els_xri_abort_event_proc(struct lpfc_hba *phba)
 {
 	struct lpfc_cq_event *cq_event;
+	LIST_HEAD(work_list);
 
 	/* First, declare the els xri abort event has been handled */
 	spin_lock_irq(&phba->hbalock);
 	phba->hba_flag &= ~ELS_XRI_ABORT_EVENT;
+	list_splice_init(&phba->sli4_hba.sp_els_xri_aborted_work_queue,
+			 &work_list);
 	spin_unlock_irq(&phba->hbalock);
 	/* Now, handle all the els xri abort events */
-	while (!list_empty(&phba->sli4_hba.sp_els_xri_aborted_work_queue)) {
+	while (!list_empty(&work_list)) {
 		/* Get the first event from the head of the event queue */
-		spin_lock_irq(&phba->hbalock);
-		list_remove_head(&phba->sli4_hba.sp_els_xri_aborted_work_queue,
-				 cq_event, struct lpfc_cq_event, list);
-		spin_unlock_irq(&phba->hbalock);
+		list_remove_head(&work_list, cq_event,
+				 struct lpfc_cq_event, list);
 		/* Notify aborted XRI for ELS work queue */
 		lpfc_sli4_els_xri_aborted(phba, &cq_event->cqe.wcqe_axri);
 		/* Free the event processed back to the free pool */
-- 
2.16.4



* Re: [PATCH] scsi: lpfc: Move work items to a stack list
  2019-11-05  8:08 [PATCH] scsi: lpfc: Move work items to a stack list Daniel Wagner
@ 2019-11-13  3:15 ` Martin K. Petersen
  2019-11-19 13:28   ` Daniel Wagner
  0 siblings, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2019-11-13  3:15 UTC (permalink / raw)
  To: James Smart, Dick Kennedy; +Cc: Daniel Wagner, linux-scsi, linux-kernel


James/Dick: Please review!

> Move all work items to a list which lives on the stack while holding
> the corresponding lock.
>
> With this we avoid a race between testing whether the list is empty
> and extracting an element from the list. Although the
> list_remove_head() macro returns a NULL pointer when the list is
> empty, the two functions lpfc_sli_handle_slow_ring_event_s4() and
> lpfc_sli4_els_xri_abort_event_proc() do not test the returned element
> for NULL.
>
> Instead of adding another NULL pointer test, avoid this access
> pattern altogether by using the stack list. This also avoids toggling
> interrupts on/off for every item.
>
> Fixes: 4f774513f7b3 ("[SCSI] lpfc 8.3.2 : Addition of SLI4 Interface - Queues")
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Dick Kennedy <dick.kennedy@broadcom.com>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>
> Hi,
>
> While trying to understand what's going on in the Oops below, I figured
> that it could be the result of an invalid pointer access. The patch
> still needs testing by our customer, but independent of this I think
> the patch fixes a real bug.
>
> [  139.392029] general protection fault: 0000 [#1] SMP PTI
> [  139.397862] CPU: 5 PID: 998 Comm: kworker/5:13 Tainted: G                   4.12.14-226.g94364da-default #1 SLE15-SP1 (unreleased)
> [  139.410962] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.9.1 12/04/2018
> [  139.419339] Workqueue: lpfc_wq lpfc_sli4_hba_process_cq [lpfc]
> [  139.425847] task: ffff95c996051440 task.stack: ffffaa038601c000
> [  139.432459] RIP: 0010:lpfc_set_rrq_active+0xa6/0x2a0 [lpfc]
> [  139.438676] RSP: 0018:ffffaa038601fcf8 EFLAGS: 00010046
> [  139.444504] RAX: 0000000000000292 RBX: ffff95c5a9a0a000 RCX: 000000000000ffff
> [  139.452466] RDX: ffff95c5accbb7b8 RSI: 0064695f74726f70 RDI: ffff95c5a9a0b160
> [  139.460427] RBP: ffff95c5a9a0b160 R08: 0000000000000001 R09: 0000000000000002
> [  139.468389] R10: ffffaa038601fdd8 R11: 61c8864680b583eb R12: 0000000000000001
> [  139.476350] R13: 000000000000ffff R14: 00000000000002bb R15: 0064695f74726f70
> [  139.484311] FS:  0000000000000000(0000) GS:ffff95c9bfc80000(0000) knlGS:0000000000000000
> [  139.493340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  139.499749] CR2: 0000560354607098 CR3: 00000007a000a003 CR4: 00000000001606e0
> [  139.507711] Call Trace:
> [  139.510451]  lpfc_sli4_io_xri_aborted+0x1a7/0x250 [lpfc]
> [  139.516386]  lpfc_sli4_sp_handle_abort_xri_wcqe.isra.56+0xa0/0x180 [lpfc]
> [  139.523964]  ? __switch_to_asm+0x40/0x70
> [  139.528338]  ? __switch_to_asm+0x34/0x70
> [  139.532718]  ? lpfc_sli4_fp_handle_cqe+0xc3/0x450 [lpfc]
> [  139.538649]  lpfc_sli4_fp_handle_cqe+0xc3/0x450 [lpfc]
> [  139.544383]  ? __switch_to_asm+0x34/0x70
> [  139.548762]  __lpfc_sli4_process_cq+0xea/0x220 [lpfc]
> [  139.554393]  ? lpfc_sli4_sp_handle_abort_xri_wcqe.isra.56+0x180/0x180 [lpfc]
> [  139.562557]  __lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc]
> [  139.568486]  process_one_work+0x1da/0x400
> [  139.572959]  worker_thread+0x2b/0x3f0
> [  139.577044]  ? process_one_work+0x400/0x400
> [  139.581710]  kthread+0x113/0x130
> [  139.585310]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  139.590945]  ret_from_fork+0x35/0x40
>
> Thanks,
> Daniel
>
>  drivers/scsi/lpfc/lpfc_sli.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
> index 294f041961a8..cbeb1f408ccc 100644
> --- a/drivers/scsi/lpfc/lpfc_sli.c
> +++ b/drivers/scsi/lpfc/lpfc_sli.c
> @@ -3903,16 +3903,16 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
>  	struct lpfc_cq_event *cq_event;
>  	unsigned long iflag;
>  	int count = 0;
> +	LIST_HEAD(work_list);
>  
>  	spin_lock_irqsave(&phba->hbalock, iflag);
>  	phba->hba_flag &= ~HBA_SP_QUEUE_EVT;
> +	list_splice_init(&phba->sli4_hba.sp_queue_event, &work_list);
>  	spin_unlock_irqrestore(&phba->hbalock, iflag);
> -	while (!list_empty(&phba->sli4_hba.sp_queue_event)) {
> +	while (!list_empty(&work_list)) {
>  		/* Get the response iocb from the head of work queue */
> -		spin_lock_irqsave(&phba->hbalock, iflag);
> -		list_remove_head(&phba->sli4_hba.sp_queue_event,
> -				 cq_event, struct lpfc_cq_event, list);
> -		spin_unlock_irqrestore(&phba->hbalock, iflag);
> +		list_remove_head(&work_list, cq_event,
> +				 struct lpfc_cq_event, list);
>  
>  		switch (bf_get(lpfc_wcqe_c_code, &cq_event->cqe.wcqe_cmpl)) {
>  		case CQE_CODE_COMPL_WQE:
> @@ -3941,6 +3941,17 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
>  		if (count == 64)
>  			break;
>  	}
> +
> +	/*
> +	 * If the limit stops the processing of events, move the
> +	 * remaining events back to the main event queue.
> +	 */
> +	spin_lock_irqsave(&phba->hbalock, iflag);
> +	if (!list_empty(&work_list)) {
> +		phba->hba_flag |= HBA_SP_QUEUE_EVT;
> +		list_splice(&work_list, &phba->sli4_hba.sp_queue_event);
> +	}
> +	spin_unlock_irqrestore(&phba->hbalock, iflag);
>  }
>  
>  /**
> @@ -12877,18 +12888,19 @@ lpfc_sli_intr_handler(int irq, void *dev_id)
>  void lpfc_sli4_els_xri_abort_event_proc(struct lpfc_hba *phba)
>  {
>  	struct lpfc_cq_event *cq_event;
> +	LIST_HEAD(work_list);
>  
>  	/* First, declare the els xri abort event has been handled */
>  	spin_lock_irq(&phba->hbalock);
>  	phba->hba_flag &= ~ELS_XRI_ABORT_EVENT;
> +	list_splice_init(&phba->sli4_hba.sp_els_xri_aborted_work_queue,
> +			 &work_list);
>  	spin_unlock_irq(&phba->hbalock);
>  	/* Now, handle all the els xri abort events */
> -	while (!list_empty(&phba->sli4_hba.sp_els_xri_aborted_work_queue)) {
> +	while (!list_empty(&work_list)) {
>  		/* Get the first event from the head of the event queue */
> -		spin_lock_irq(&phba->hbalock);
> -		list_remove_head(&phba->sli4_hba.sp_els_xri_aborted_work_queue,
> -				 cq_event, struct lpfc_cq_event, list);
> -		spin_unlock_irq(&phba->hbalock);
> +		list_remove_head(&work_list, cq_event,
> +				 struct lpfc_cq_event, list);
>  		/* Notify aborted XRI for ELS work queue */
>  		lpfc_sli4_els_xri_aborted(phba, &cq_event->cqe.wcqe_axri);
>  		/* Free the event processed back to the free pool */

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] scsi: lpfc: Move work items to a stack list
  2019-11-13  3:15 ` Martin K. Petersen
@ 2019-11-19 13:28   ` Daniel Wagner
  2019-11-19 13:32     ` Daniel Wagner
  2019-11-19 17:25     ` James Smart
  0 siblings, 2 replies; 6+ messages in thread
From: Daniel Wagner @ 2019-11-19 13:28 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: James Smart, Dick Kennedy, linux-scsi, linux-kernel

On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote:
> > While trying to understand what's going on in the Oops below, I figured
> > that it could be the result of an invalid pointer access. The patch
> > still needs testing by our customer, but independent of this I think
> > the patch fixes a real bug.

I was able to reproduce the same stack trace with this patch
applied... That is obviously bad. The good news: I have access to this
machine, so maybe I am able to figure out the root cause of this
crash.


* Re: [PATCH] scsi: lpfc: Move work items to a stack list
  2019-11-19 13:28   ` Daniel Wagner
@ 2019-11-19 13:32     ` Daniel Wagner
  2019-11-19 17:25     ` James Smart
  1 sibling, 0 replies; 6+ messages in thread
From: Daniel Wagner @ 2019-11-19 13:32 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: James Smart, Dick Kennedy, linux-scsi, linux-kernel

On Tue, Nov 19, 2019 at 02:28:54PM +0100, Daniel Wagner wrote:
> On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote:
> > > While trying to understand what's going on in the Oops below, I figured
> > > that it could be the result of an invalid pointer access. The patch
> > > still needs testing by our customer, but independent of this I think
> > > the patch fixes a real bug.
> 
> I was able to reproduce the same stack trace with this patch
> applied... That is obviously bad. The good news: I have access to this
> machine, so maybe I am able to figure out the root cause of this
> crash.

Forgot to append the KASAN trace, which points at the same place. I
don't know whether this is the same issue or not.


[  329.217804] ==================================================================
[  329.280494] BUG: KASAN: slab-out-of-bounds in lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  329.351654] Read of size 8 at addr ffff88984f160000 by task kworker/77:1/488
[  329.396559] nvme nvme3: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[  329.412326] 
[  329.412335] CPU: 77 PID: 488 Comm: kworker/77:1 Kdump: loaded Tainted: G            E     5.4.0-rc1-default+ #3
[  329.412338] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 07/21/2019
[  329.412414] Workqueue: lpfc_wq lpfc_sli4_hba_process_cq [lpfc]
[  329.428650] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[  329.765863] Call Trace:
[  329.765888]  dump_stack+0x71/0xab
[  329.765967]  ? lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  329.765981]  print_address_description.constprop.6+0x1b/0x2f0
[  329.912961]  ? lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  329.913001]  ? lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  330.009190]  __kasan_report+0x14e/0x192
[  330.009255]  ? lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  330.009261]  kasan_report+0xe/0x20
[  330.120620]  lpfc_sli4_io_xri_aborted+0x29c/0x3c0 [lpfc]
[  330.120660]  lpfc_sli4_sp_handle_abort_xri_wcqe.isra.55+0x59/0x280 [lpfc]
[  330.226013]  ? __update_load_avg_cfs_rq+0x244/0x470
[  330.226052]  ? lpfc_sli4_fp_handle_cqe+0x127/0x8e0 [lpfc]
[  330.226089]  lpfc_sli4_fp_handle_cqe+0x127/0x8e0 [lpfc]
[  330.358896]  ? lpfc_sli4_sp_handle_abort_xri_wcqe.isra.55+0x280/0x280 [lpfc]
[  330.358907]  ? __switch_to_asm+0x40/0x70
[  330.452995]  ? __switch_to_asm+0x34/0x70
[  330.452998]  ? __switch_to_asm+0x40/0x70
[  330.453000]  ? __switch_to_asm+0x34/0x70
[  330.453002]  ? __switch_to_asm+0x40/0x70
[  330.453005]  ? __switch_to_asm+0x34/0x70
[  330.453041]  __lpfc_sli4_process_cq+0x1e1/0x470 [lpfc]
[  330.453078]  ? lpfc_sli4_sp_handle_abort_xri_wcqe.isra.55+0x280/0x280 [lpfc]
[  330.728428]  ? __switch_to_asm+0x40/0x70
[  330.728466]  __lpfc_sli4_hba_process_cq+0x88/0x1d0 [lpfc]
[  330.728503]  ? lpfc_sli4_fp_handle_cqe+0x8e0/0x8e0 [lpfc]
[  330.855605]  process_one_work+0x46e/0x7f0
[  330.855610]  worker_thread+0x69/0x6b0
[  330.855615]  ? process_one_work+0x7f0/0x7f0
[  330.855620]  kthread+0x1b3/0x1d0
[  330.855624]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  330.855627]  ret_from_fork+0x35/0x40
[  330.855631] 
[  330.855634] Allocated by task 5171:
[  330.855644]  save_stack+0x19/0x80
[  330.855650]  __kasan_kmalloc.constprop.9+0xa0/0xd0
[  331.175452]  __kmalloc+0xfb/0x5d0
[  331.175461]  alloc_pipe_info+0xff/0x210
[  331.175464]  create_pipe_files+0x66/0x2e0
[  331.175467]  __do_pipe_flags+0x2c/0x100
[  331.175470]  do_pipe2+0x80/0x130
[  331.175472]  __x64_sys_pipe2+0x2b/0x30
[  331.175486]  do_syscall_64+0x73/0x230
[  331.395309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  331.395310] 
[  331.395312] Freed by task 5171:
[  331.395317]  save_stack+0x19/0x80
[  331.395319]  __kasan_slab_free+0x105/0x150
[  331.395321]  kfree+0xa6/0x150
[  331.395324]  free_pipe_info+0x106/0x120
[  331.395327]  pipe_release+0xcb/0xf0
[  331.395335]  __fput+0x11d/0x330
[  331.395338]  task_work_run+0xc6/0xf0
[  331.395344]  exit_to_usermode_loop+0x11d/0x120
[  331.730019]  do_syscall_64+0x203/0x230
[  331.730023]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  331.730023] 
[  331.730027] The buggy address belongs to the object at ffff88984f160040
[  331.730027]  which belongs to the cache kmalloc-1k of size 1024
[  331.730030] The buggy address is located 64 bytes to the left of
[  331.730030]  1024-byte region [ffff88984f160040, ffff88984f160440)
[  331.730031] The buggy address belongs to the page:
[  331.730036] page:ffffea00613c5800 refcount:1 mapcount:0 mapping:ffff888107c00700 index:0x0 compound_mapcount: 0
[  331.730042] flags: 0x97ffffc0010200(slab|head)
[  331.730050] raw: 0097ffffc0010200 ffffea00613c4608 ffffea00613c7f88 ffff888107c00700
[  332.266508] raw: 0000000000000000 ffff88984f160040 0000000100000007 0000000000000000
[  332.266509] page dumped because: kasan: bad access detected
[  332.266510] 
[  332.266511] Memory state around the buggy address:
[  332.266516]  ffff88984f15ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  332.266518]  ffff88984f15ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  332.266521] >ffff88984f160000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  332.266522]                    ^
[  332.266525]  ffff88984f160080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  332.266527]  ffff88984f160100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  332.266528] ==================================================================

The kernel I used to create the above KASAN trace is mkp/queue (clean
without my patch), c0bf9a264e10 ("scsi: iscsi: Don't send data to
unbound connection")


* Re: [PATCH] scsi: lpfc: Move work items to a stack list
  2019-11-19 13:28   ` Daniel Wagner
  2019-11-19 13:32     ` Daniel Wagner
@ 2019-11-19 17:25     ` James Smart
  2019-11-19 18:14       ` Daniel Wagner
  1 sibling, 1 reply; 6+ messages in thread
From: James Smart @ 2019-11-19 17:25 UTC (permalink / raw)
  To: Daniel Wagner, Martin K. Petersen
  Cc: Dick Kennedy, linux-scsi, linux-kernel, James Smart



On 11/19/2019 5:28 AM, Daniel Wagner wrote:
> On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote:
>>> While trying to understand what's going on in the Oops below, I figured
>>> that it could be the result of an invalid pointer access. The patch
>>> still needs testing by our customer, but independent of this I think
>>> the patch fixes a real bug.
> I was able to reproduce the same stack trace with this patch
> applied... That is obviously bad. The good news: I have access to this
> machine, so maybe I am able to figure out the root cause of this
> crash.

FYI - Dick and I were taking our time reviewing your patch, as the major
concern we had with it was that the splice and then the re-add to the
work list could let an abort complete before the I/O, which could lead
to DC.

We'll look anew at your repro.

-- james



* Re: [PATCH] scsi: lpfc: Move work items to a stack list
  2019-11-19 17:25     ` James Smart
@ 2019-11-19 18:14       ` Daniel Wagner
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Wagner @ 2019-11-19 18:14 UTC (permalink / raw)
  To: James Smart; +Cc: Martin K. Petersen, Dick Kennedy, linux-scsi, linux-kernel

Hi James,

On Tue, Nov 19, 2019 at 09:25:54AM -0800, James Smart wrote:
> On 11/19/2019 5:28 AM, Daniel Wagner wrote:
> > On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote:
> > > > While trying to understand what's going on in the Oops below, I figured
> > > > that it could be the result of an invalid pointer access. The patch
> > > > still needs testing by our customer, but independent of this I think
> > > > the patch fixes a real bug.
> > I was able to reproduce the same stack trace with this patch
> > applied... That is obviously bad. The good news: I have access to this
> > machine, so maybe I am able to figure out the root cause of this
> > crash.
> 
> fyi - Dick and I were taking our time reviewing your patch as the major
> concern we had with it was that the splice and then the re-add of the
> work-list could have an abort complete before the IO, which could lead to
> DC.

I think there is potentially a bug. The crash I see might not be
related to my attempt here to fix it. After looking at it again, my
patch looks a bit ugly and overly complex compared to what should do
the trick:

@@ -3915,6 +3915,9 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
 				 cq_event, struct lpfc_cq_event, list);
 		spin_unlock_irqrestore(&phba->hbalock, iflag);
 
+		if (!cq_event)
+			break;
+
 		switch (bf_get(lpfc_wcqe_c_code, &cq_event->cqe.wcqe_cmpl)) {
 		case CQE_CODE_COMPL_WQE:
 			irspiocbq = container_of(cq_event, struct lpfc_iocbq,

> We'll look anew at your repro.

Thanks a lot.

Anyway, I tried to gather some more data on it.

(gdb)  l *lpfc_sli4_io_xri_aborted+0x29c
0xa1f2c is in lpfc_sli4_io_xri_aborted (drivers/scsi/lpfc/lpfc_scsi.c:543).
538                             }
539     #endif
540                             qp->abts_scsi_io_bufs--;
541                             spin_unlock(&qp->abts_io_buf_list_lock);
542
543                             if (psb->rdata && psb->rdata->pnode)
544                                     ndlp = psb->rdata->pnode;
545                             else
546                                     ndlp = NULL;
547

I looked at the assembler code and, if I am not completely mistaken,
it is 'psb->rdata' which triggers the KASAN report.

So I thought, let's reset psb->rdata when the buf is freed, in order to
avoid the report. This was just to find out if my theory holds.

diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index ba26df90a36a..8e3a05e44d76 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -791,10 +791,12 @@ lpfc_release_scsi_buf_s4(struct lpfc_hba *phba, struct lpfc_io_buf *psb)
        if (psb->flags & LPFC_SBUF_XBUSY) {
                spin_lock_irqsave(&qp->abts_io_buf_list_lock, iflag);
                psb->pCmd = NULL;
+               psb->rdata = NULL;
                list_add_tail(&psb->list, &qp->lpfc_abts_io_buf_list);
                qp->abts_scsi_io_bufs++;
                spin_unlock_irqrestore(&qp->abts_io_buf_list_lock, iflag);
        } else {
+               psb->rdata = NULL;
                lpfc_release_io_buf(phba, (struct lpfc_io_buf *)psb, qp);
        }
 }

Unfortunately, this didn't help; instead I got a new KASAN report:

[  167.006341] ==================================================================
[  167.006428] BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006434] Read of size 2 at addr ffff88a035f792e2 by task lpfc_worker_3/1130
[  167.006435] 
[  167.006444] CPU: 4 PID: 1130 Comm: lpfc_worker_3 Tainted: G            E     5.4.0-rc1-default+ #3
[  167.006447] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 07/21/2019
[  167.006449] Call Trace:
[  167.006463]  dump_stack+0x71/0xab
[  167.006535]  ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006545]  print_address_description.constprop.6+0x1b/0x2f0
[  167.006615]  ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006684]  ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006688] sd 7:0:0:0: alua: port group 3e9 state N non-preferred supports TolUsNA
[  167.006693]  __kasan_report+0x14e/0x192
[  167.006764]  ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006770]  kasan_report+0xe/0x20
[  167.006840]  lpfc_unreg_login+0x7c/0xc0 [lpfc]
[  167.006912]  lpfc_sli_def_mbox_cmpl+0x329/0x560 [lpfc]
[  167.006921]  ? _raw_spin_lock_irqsave+0x8d/0xf0
[  167.006991]  ? __lpfc_sli_rpi_release.isra.59+0xc0/0xc0 [lpfc]
[  167.007057]  lpfc_sli_handle_mb_event+0x3f3/0x8d0 [lpfc]
[  167.007064]  ? _raw_spin_lock_irqsave+0x8d/0xf0
[  167.007069]  ? _raw_write_lock_bh+0xe0/0xe0
[  167.007138]  ? lpfc_mbox_get+0x1d/0xa0 [lpfc]
[  167.007204]  ? lpfc_sli_free_hbq+0x80/0x80 [lpfc]
[  167.007277]  lpfc_do_work+0x100b/0x26e0 [lpfc]
[  167.007285]  ? __schedule+0x6ca/0xc10
[  167.007288]  ? _raw_spin_lock_irqsave+0x8d/0xf0
[  167.007329]  ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc]
[  167.007334]  ? wait_woken+0x130/0x130
[  167.007375]  ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc]
[  167.007380]  kthread+0x1b3/0x1d0
[  167.007383]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  167.007386]  ret_from_fork+0x35/0x40
[  167.007389] 
[  167.007391] Allocated by task 667:
[  167.007395]  save_stack+0x19/0x80
[  167.007398]  __kasan_kmalloc.constprop.9+0xa0/0xd0
[  167.007400]  __kmalloc+0xfb/0x5d0
[  167.007439]  lpfc_sli4_alloc_resource_identifiers+0x260/0x8a0 [lpfc]
[  167.007478]  lpfc_sli4_hba_setup+0xbbd/0x3480 [lpfc]
[  167.007517]  lpfc_pci_probe_one_s4.isra.46+0x100b/0x1d70 [lpfc]
[  167.007556]  lpfc_pci_probe_one+0x191/0x11b0 [lpfc]
[  167.007563]  local_pci_probe+0x74/0xd0
[  167.007567]  work_for_cpu_fn+0x29/0x40
[  167.007570]  process_one_work+0x46e/0x7f0
[  167.007573]  worker_thread+0x4cd/0x6b0
[  167.007575]  kthread+0x1b3/0x1d0
[  167.007578]  ret_from_fork+0x35/0x40
[  167.007578] 
[  167.007580] Freed by task 0:
[  167.007580] (stack is not available)
[  167.007581] 
[  167.007584] The buggy address belongs to the object at ffff88a035f790c0
[  167.007584]  which belongs to the cache kmalloc-512 of size 512
[  167.007587] The buggy address is located 34 bytes to the right of
[  167.007587]  512-byte region [ffff88a035f790c0, ffff88a035f792c0)
[  167.007588] The buggy address belongs to the page:
[  167.007592] page:ffffea0080d7de40 refcount:1 mapcount:0 mapping:ffff888107c00c40 index:0x0
[  167.007595] flags: 0xd7ffffc0000200(slab)
[  167.007601] raw: 00d7ffffc0000200 ffffea0080ec5588 ffffea0080d3e988 ffff888107c00c40
[  167.007604] raw: 0000000000000000 ffff88a035f790c0 0000000100000006 0000000000000000
[  167.007606] page dumped because: kasan: bad access detected
[  167.007606] 
[  167.007607] Memory state around the buggy address:
[  167.007610]  ffff88a035f79180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  167.007613]  ffff88a035f79200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  167.007615] >ffff88a035f79280: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  167.007617]                                                        ^
[  167.007619]  ffff88a035f79300: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  167.007622]  ffff88a035f79380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  167.007623] ==================================================================


(gdb) l *lpfc_unreg_login+0x7c
0x8bb1c is in lpfc_unreg_login (drivers/scsi/lpfc/lpfc_mbox.c:826).
821             memset(pmb, 0, sizeof (LPFC_MBOXQ_t));
822
823             mb->un.varUnregLogin.rpi = rpi;
824             mb->un.varUnregLogin.rsvd1 = 0;
825             if (phba->sli_rev >= LPFC_SLI_REV3)
826                     mb->un.varUnregLogin.vpi = phba->vpi_ids[vpi];
827
828             mb->mbxCommand = MBX_UNREG_LOGIN;
829             mb->mbxOwner = OWN_HOST;
830


Not sure whether this is again a red herring.

Thanks,
Daniel


