All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
@ 2023-07-26 18:29 Ira Weiny
  2023-07-27  7:54 ` Lukas Wunner
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ira Weiny @ 2023-07-26 18:29 UTC (permalink / raw)
  To: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Lukas Wunner,
	Davidlohr Bueso
  Cc: linux-pci, linux-kernel, Ira Weiny

The following debug object splat was observed in testing.

  [   14.061937] ------------[ cut here ]------------
  [   14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
  [   14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
  ...
  [   14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
  [   14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
  ...
  [   14.116231] Call Trace:
  [   14.117652]  <TASK>
  [   14.118958]  ? debug_print_object+0x7d/0xb0
  [   14.120782]  ? __warn+0x7d/0x130
  [   14.122399]  ? debug_print_object+0x7d/0xb0
  [   14.123746]  ? report_bug+0x18d/0x1c0
  [   14.125025]  ? handle_bug+0x3c/0x80
  [   14.126506]  ? exc_invalid_op+0x13/0x60
  [   14.127796]  ? asm_exc_invalid_op+0x16/0x20
  [   14.129380]  ? debug_print_object+0x7d/0xb0
  [   14.130688]  ? debug_print_object+0x7d/0xb0
  [   14.131997]  ? __pfx_doe_statemachine_work+0x10/0x10
  [   14.133597]  debug_object_free.part.0+0x11b/0x150
  [   14.134940]  doe_statemachine_work+0x45e/0x510
  [   14.136348]  process_one_work+0x1d4/0x3c0
  ...
  [   14.161484]  </TASK>
  [   14.162434] ---[ end trace 0000000000000000 ]---

This occurs because destroy_work_on_stack() was called after signaling
the completion in the calling thread.  This creates a race between
destroy_work_on_stack() and the task->work struct going of scope in the
pci_doe().

Signal the work complete after destroying the work struct.  This is safe
because signal_task_complete() is the final thing the work item does and
the workqueue code is careful not to access the work struct after.

Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
Cc: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/pci/doe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 1b97a5ab71a9..e3aab5edaf70 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
 static void signal_task_complete(struct pci_doe_task *task, int rv)
 {
 	task->rv = rv;
-	task->complete(task);
 	destroy_work_on_stack(&task->work);
+	task->complete(task);
 }
 
 static void signal_task_abort(struct pci_doe_task *task, int rv)

---
base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
change-id: 20230726-doe-fix-f57943f9ea82

Best regards,
-- 
Ira Weiny <ira.weiny@intel.com>


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
  2023-07-26 18:29 [PATCH] PCI/DOE: Fix destroy_work_on_stack() race Ira Weiny
@ 2023-07-27  7:54 ` Lukas Wunner
  2023-07-27 16:59 ` Bjorn Helgaas
  2023-07-27 20:21 ` Bjorn Helgaas
  2 siblings, 0 replies; 6+ messages in thread
From: Lukas Wunner @ 2023-07-27  7:54 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Davidlohr Bueso,
	linux-pci, linux-kernel

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
[...]
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread.  This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
> 
> Signal the work complete after destroying the work struct.  This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
> 
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>

Thanks for catching this.  The offending commit abf04be0e707 was applied
by Dan.  Not sure if that means he's going to apply this fix as well?
Would require an ack from Bjorn in that case.  Or Bjorn applies it.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
  2023-07-26 18:29 [PATCH] PCI/DOE: Fix destroy_work_on_stack() race Ira Weiny
  2023-07-27  7:54 ` Lukas Wunner
@ 2023-07-27 16:59 ` Bjorn Helgaas
  2023-07-27 19:53   ` Dan Williams
  2023-07-27 20:21 ` Bjorn Helgaas
  2 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2023-07-27 16:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Lukas Wunner,
	Davidlohr Bueso, linux-pci, linux-kernel

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
> 
>   [   14.061937] ------------[ cut here ]------------
>   [   14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
>   [   14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
>   ...
>   [   14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
>   [   14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
>   ...
>   [   14.116231] Call Trace:
>   [   14.117652]  <TASK>
>   [   14.118958]  ? debug_print_object+0x7d/0xb0
>   [   14.120782]  ? __warn+0x7d/0x130
>   [   14.122399]  ? debug_print_object+0x7d/0xb0
>   [   14.123746]  ? report_bug+0x18d/0x1c0
>   [   14.125025]  ? handle_bug+0x3c/0x80
>   [   14.126506]  ? exc_invalid_op+0x13/0x60
>   [   14.127796]  ? asm_exc_invalid_op+0x16/0x20
>   [   14.129380]  ? debug_print_object+0x7d/0xb0
>   [   14.130688]  ? debug_print_object+0x7d/0xb0
>   [   14.131997]  ? __pfx_doe_statemachine_work+0x10/0x10
>   [   14.133597]  debug_object_free.part.0+0x11b/0x150
>   [   14.134940]  doe_statemachine_work+0x45e/0x510
>   [   14.136348]  process_one_work+0x1d4/0x3c0
>   ...
>   [   14.161484]  </TASK>
>   [   14.162434] ---[ end trace 0000000000000000 ]---
> 
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread.  This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
> 
> Signal the work complete after destroying the work struct.  This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
> 
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Dan, let me know if you'd rather have me take this.

> ---
>  drivers/pci/doe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 1b97a5ab71a9..e3aab5edaf70 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
>  static void signal_task_complete(struct pci_doe_task *task, int rv)
>  {
>  	task->rv = rv;
> -	task->complete(task);
>  	destroy_work_on_stack(&task->work);
> +	task->complete(task);
>  }
>  
>  static void signal_task_abort(struct pci_doe_task *task, int rv)
> 
> ---
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> change-id: 20230726-doe-fix-f57943f9ea82
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
  2023-07-27 16:59 ` Bjorn Helgaas
@ 2023-07-27 19:53   ` Dan Williams
  0 siblings, 0 replies; 6+ messages in thread
From: Dan Williams @ 2023-07-27 19:53 UTC (permalink / raw)
  To: Bjorn Helgaas, Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Lukas Wunner,
	Davidlohr Bueso, linux-pci, linux-kernel

Bjorn Helgaas wrote:
> On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> > The following debug object splat was observed in testing.
> > 
> >   [   14.061937] ------------[ cut here ]------------
> >   [   14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> >   [   14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> >   ...
> >   [   14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> >   [   14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> >   ...
> >   [   14.116231] Call Trace:
> >   [   14.117652]  <TASK>
> >   [   14.118958]  ? debug_print_object+0x7d/0xb0
> >   [   14.120782]  ? __warn+0x7d/0x130
> >   [   14.122399]  ? debug_print_object+0x7d/0xb0
> >   [   14.123746]  ? report_bug+0x18d/0x1c0
> >   [   14.125025]  ? handle_bug+0x3c/0x80
> >   [   14.126506]  ? exc_invalid_op+0x13/0x60
> >   [   14.127796]  ? asm_exc_invalid_op+0x16/0x20
> >   [   14.129380]  ? debug_print_object+0x7d/0xb0
> >   [   14.130688]  ? debug_print_object+0x7d/0xb0
> >   [   14.131997]  ? __pfx_doe_statemachine_work+0x10/0x10
> >   [   14.133597]  debug_object_free.part.0+0x11b/0x150
> >   [   14.134940]  doe_statemachine_work+0x45e/0x510
> >   [   14.136348]  process_one_work+0x1d4/0x3c0
> >   ...
> >   [   14.161484]  </TASK>
> >   [   14.162434] ---[ end trace 0000000000000000 ]---
> > 
> > This occurs because destroy_work_on_stack() was called after signaling
> > the completion in the calling thread.  This creates a race between
> > destroy_work_on_stack() and the task->work struct going of scope in the
> > pci_doe().
> > 
> > Signal the work complete after destroying the work struct.  This is safe
> > because signal_task_complete() is the final thing the work item does and
> > the workqueue code is careful not to access the work struct after.
> > 
> > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Dan, let me know if you'd rather have me take this.

I have nothing else in the DOE area at this time, so I would be happy if
you took it. It is self contained to drivers/pci/.

Much appreciated.

Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
  2023-07-26 18:29 [PATCH] PCI/DOE: Fix destroy_work_on_stack() race Ira Weiny
  2023-07-27  7:54 ` Lukas Wunner
  2023-07-27 16:59 ` Bjorn Helgaas
@ 2023-07-27 20:21 ` Bjorn Helgaas
  2023-07-27 21:30   ` Ira Weiny
  2 siblings, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2023-07-27 20:21 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Lukas Wunner,
	Davidlohr Bueso, linux-pci, linux-kernel

On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> The following debug object splat was observed in testing.
> 
>   [   14.061937] ------------[ cut here ]------------
>   [   14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
>   [   14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
>   ...
>   [   14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
>   [   14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
>   ...
>   [   14.116231] Call Trace:
>   [   14.117652]  <TASK>
>   [   14.118958]  ? debug_print_object+0x7d/0xb0
>   [   14.120782]  ? __warn+0x7d/0x130
>   [   14.122399]  ? debug_print_object+0x7d/0xb0
>   [   14.123746]  ? report_bug+0x18d/0x1c0
>   [   14.125025]  ? handle_bug+0x3c/0x80
>   [   14.126506]  ? exc_invalid_op+0x13/0x60
>   [   14.127796]  ? asm_exc_invalid_op+0x16/0x20
>   [   14.129380]  ? debug_print_object+0x7d/0xb0
>   [   14.130688]  ? debug_print_object+0x7d/0xb0
>   [   14.131997]  ? __pfx_doe_statemachine_work+0x10/0x10
>   [   14.133597]  debug_object_free.part.0+0x11b/0x150
>   [   14.134940]  doe_statemachine_work+0x45e/0x510
>   [   14.136348]  process_one_work+0x1d4/0x3c0
>   ...
>   [   14.161484]  </TASK>
>   [   14.162434] ---[ end trace 0000000000000000 ]---
> 
> This occurs because destroy_work_on_stack() was called after signaling
> the completion in the calling thread.  This creates a race between
> destroy_work_on_stack() and the task->work struct going of scope in the
> pci_doe().
> 
> Signal the work complete after destroying the work struct.  This is safe
> because signal_task_complete() is the final thing the work item does and
> the workqueue code is careful not to access the work struct after.
> 
> Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> Cc: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for
v6.6.  I edited out the timestamps and some of the call trace from the
splat because they didn't seem relevant.

> ---
>  drivers/pci/doe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 1b97a5ab71a9..e3aab5edaf70 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas
>  static void signal_task_complete(struct pci_doe_task *task, int rv)
>  {
>  	task->rv = rv;
> -	task->complete(task);
>  	destroy_work_on_stack(&task->work);
> +	task->complete(task);
>  }
>  
>  static void signal_task_abort(struct pci_doe_task *task, int rv)
> 
> ---
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> change-id: 20230726-doe-fix-f57943f9ea82
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] PCI/DOE: Fix destroy_work_on_stack() race
  2023-07-27 20:21 ` Bjorn Helgaas
@ 2023-07-27 21:30   ` Ira Weiny
  0 siblings, 0 replies; 6+ messages in thread
From: Ira Weiny @ 2023-07-27 21:30 UTC (permalink / raw)
  To: Bjorn Helgaas, Ira Weiny
  Cc: Bjorn Helgaas, Dan Williams, Jonathan Cameron, Lukas Wunner,
	Davidlohr Bueso, linux-pci, linux-kernel

Bjorn Helgaas wrote:
> On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote:
> > The following debug object splat was observed in testing.
> > 
> >   [   14.061937] ------------[ cut here ]------------
> >   [   14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
> >   [   14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
> >   ...
> >   [   14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
> >   [   14.083485] RIP: 0010:debug_print_object+0x7d/0xb0
> >   ...
> >   [   14.116231] Call Trace:
> >   [   14.117652]  <TASK>
> >   [   14.118958]  ? debug_print_object+0x7d/0xb0
> >   [   14.120782]  ? __warn+0x7d/0x130
> >   [   14.122399]  ? debug_print_object+0x7d/0xb0
> >   [   14.123746]  ? report_bug+0x18d/0x1c0
> >   [   14.125025]  ? handle_bug+0x3c/0x80
> >   [   14.126506]  ? exc_invalid_op+0x13/0x60
> >   [   14.127796]  ? asm_exc_invalid_op+0x16/0x20
> >   [   14.129380]  ? debug_print_object+0x7d/0xb0
> >   [   14.130688]  ? debug_print_object+0x7d/0xb0
> >   [   14.131997]  ? __pfx_doe_statemachine_work+0x10/0x10
> >   [   14.133597]  debug_object_free.part.0+0x11b/0x150
> >   [   14.134940]  doe_statemachine_work+0x45e/0x510
> >   [   14.136348]  process_one_work+0x1d4/0x3c0
> >   ...
> >   [   14.161484]  </TASK>
> >   [   14.162434] ---[ end trace 0000000000000000 ]---
> > 
> > This occurs because destroy_work_on_stack() was called after signaling
> > the completion in the calling thread.  This creates a race between
> > destroy_work_on_stack() and the task->work struct going of scope in the
> > pci_doe().
> > 
> > Signal the work complete after destroying the work struct.  This is safe
> > because signal_task_complete() is the final thing the work item does and
> > the workqueue code is careful not to access the work struct after.
> > 
> > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for
> v6.6.  I edited out the timestamps and some of the call trace from the
> splat because they didn't seem relevant.
> 

Thanks.  I'll make sure to remove the timestamps in future.
Ira

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-07-27 21:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-26 18:29 [PATCH] PCI/DOE: Fix destroy_work_on_stack() race Ira Weiny
2023-07-27  7:54 ` Lukas Wunner
2023-07-27 16:59 ` Bjorn Helgaas
2023-07-27 19:53   ` Dan Williams
2023-07-27 20:21 ` Bjorn Helgaas
2023-07-27 21:30   ` Ira Weiny

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.