Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5

All of lore.kernel.org
 help / color / mirror / Atom feed

* Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
@ 2018-11-07 14:25 xiang xiao
  2018-11-08  6:22 ` Bjorn Andersson
  0 siblings, 1 reply; 10+ messages in thread
From: xiang xiao @ 2018-11-07 14:25 UTC (permalink / raw)
  To: linux-remoteproc, spjoshi, bjorn.andersson

[-- Attachment #1: Type: text/plain, Size: 2119 bytes --]

This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
which skip destroy the virtio device at stop but reinitialize it again at
start:
[  603.446805] remoteproc remoteproc0: crash detected in
f9210000.toppwr:tl421-rproc: type mmufault
[  603.456883] remoteproc remoteproc0: handling crash #1 in
f9210000.toppwr:tl421-rproc
[  603.469593] remoteproc remoteproc0: recovering
f9210000.toppwr:tl421-rproc
[  603.483172] remoteproc remoteproc0: stopped remote processor
f9210000.toppwr:tl421-rproc
[  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
object, something is seriously wrong.

  ^^^^^^^^^^^^^^^^^^^^^
[  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
 4.9.27-04454-gd4c1829-dirty #255
[  603.517468] Hardware name: Banks (DT)
[  603.521581] Workqueue: events rproc_crash_handler_work
[  603.527342] Call trace:
[  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
[  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
[  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
[  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
[  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
[  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
[  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
[  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
[  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
[  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
[  603.591754] [<ffffff8008877a68>] rproc_trigger_recovery+0x2f8/0x324
[  603.598763] [<ffffff8008877b24>] rproc_crash_handler_work+0x90/0xb0
[  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
[  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
[  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
[  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40

When the crash happen, is it better to destroy and recreate all virtio
device and it’s children(rpmsg device) again to match the remote side state
like the original behavior?

Thanks

Xiang

[-- Attachment #2: Type: text/html, Size: 3474 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-07 14:25 Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5 xiang xiao
@ 2018-11-08  6:22 ` Bjorn Andersson
  2018-11-08  6:36   ` xiang xiao
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Andersson @ 2018-11-08  6:22 UTC (permalink / raw)
  To: xiang xiao; +Cc: linux-remoteproc, spjoshi

On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:

> This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
> which skip destroy the virtio device at stop but reinitialize it again at
> start:
> [  603.446805] remoteproc remoteproc0: crash detected in
> f9210000.toppwr:tl421-rproc: type mmufault
> [  603.456883] remoteproc remoteproc0: handling crash #1 in
> f9210000.toppwr:tl421-rproc
> [  603.469593] remoteproc remoteproc0: recovering
> f9210000.toppwr:tl421-rproc
> [  603.483172] remoteproc remoteproc0: stopped remote processor
> f9210000.toppwr:tl421-rproc
> [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> object, something is seriously wrong.
> 

I thought this issue was fixed.

>   ^^^^^^^^^^^^^^^^^^^^^
> [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
>  4.9.27-04454-gd4c1829-dirty #255
> [  603.517468] Hardware name: Banks (DT)
> [  603.521581] Workqueue: events rproc_crash_handler_work
> [  603.527342] Call trace:
> [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> [  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
> [  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
> [  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
> [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> [  603.591754] [<ffffff8008877a68>] rproc_trigger_recovery+0x2f8/0x324
> [  603.598763] [<ffffff8008877b24>] rproc_crash_handler_work+0x90/0xb0
> [  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
> [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> 
> When the crash happen, is it better to destroy and recreate all virtio
> device and itï¿½s children(rpmsg device) again to match the remote side state
> like the original behavior?
> 

Yes, it's likely that the protocols on top does share some state, so we
do not have any choice but to report this up to the virtio device.

Removing and re-probing the devices - rather than having some other form
of notification of this event - makes the code simpler.


But it seems we're trying to re-register the same device the second
time, rather than initialize a new one.

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  6:22 ` Bjorn Andersson
@ 2018-11-08  6:36   ` xiang xiao
  2018-11-08  7:26     ` Bjorn Andersson
  0 siblings, 1 reply; 10+ messages in thread
From: xiang xiao @ 2018-11-08  6:36 UTC (permalink / raw)
  To: bjorn.andersson; +Cc: linux-remoteproc, spjoshi

On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
<bjorn.andersson@linaro.org> wrote:
>
> On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
>
> > This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
> > which skip destroy the virtio device at stop but reinitialize it again at
> > start:
> > [  603.446805] remoteproc remoteproc0: crash detected in
> > f9210000.toppwr:tl421-rproc: type mmufault
> > [  603.456883] remoteproc remoteproc0: handling crash #1 in
> > f9210000.toppwr:tl421-rproc
> > [  603.469593] remoteproc remoteproc0: recovering
> > f9210000.toppwr:tl421-rproc
> > [  603.483172] remoteproc remoteproc0: stopped remote processor
> > f9210000.toppwr:tl421-rproc
> > [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> > object, something is seriously wrong.
> >
>
> I thought this issue was fixed.
>
Which patch fix this issue? I am using 4.19 kernel.

> >   ^^^^^^^^^^^^^^^^^^^^^
> > [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
> >  4.9.27-04454-gd4c1829-dirty #255
> > [  603.517468] Hardware name: Banks (DT)
> > [  603.521581] Workqueue: events rproc_crash_handler_work
> > [  603.527342] Call trace:
> > [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> > [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> > [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> > [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> > [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> > [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> > [  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
> > [  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
> > [  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
> > [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> > [  603.591754] [<ffffff8008877a68>] rproc_trigger_recovery+0x2f8/0x324
> > [  603.598763] [<ffffff8008877b24>] rproc_crash_handler_work+0x90/0xb0
> > [  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
> > [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> > [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> > [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> >
> > When the crash happen, is it better to destroy and recreate all virtio
> > device and it’s children(rpmsg device) again to match the remote side state
> > like the original behavior?
> >
>
> Yes, it's likely that the protocols on top does share some state, so we
> do not have any choice but to report this up to the virtio device.
>
> Removing and re-probing the devices - rather than having some other form
> of notification of this event - makes the code simpler.
>
>
> But it seems we're trying to re-register the same device the second
> time, rather than initialize a new one.
>
If we use one new here, the old need to be destroyed to avoid the leak.
Basically, it's become the old approach again.

We are building many rpmsg devices on top of rproc/virtio, each one
has the internal state sync with the remote side.
If remote side crash and reboot again, all state is stale and need to
reset to the default state.
Removing and re-probing is the clean and simple solution, I think too.

> Regards,
> Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  6:36   ` xiang xiao
@ 2018-11-08  7:26     ` Bjorn Andersson
  2018-11-08  8:16       ` Loic PALLARDY
  2018-11-08  8:23       ` xiang xiao
  0 siblings, 2 replies; 10+ messages in thread
From: Bjorn Andersson @ 2018-11-08  7:26 UTC (permalink / raw)
  To: xiang xiao; +Cc: linux-remoteproc, spjoshi

On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:

> On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> <bjorn.andersson@linaro.org> wrote:
> >
> > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
> >
> > > This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
> > > which skip destroy the virtio device at stop but reinitialize it again at
> > > start:
> > > [  603.446805] remoteproc remoteproc0: crash detected in
> > > f9210000.toppwr:tl421-rproc: type mmufault
> > > [  603.456883] remoteproc remoteproc0: handling crash #1 in
> > > f9210000.toppwr:tl421-rproc
> > > [  603.469593] remoteproc remoteproc0: recovering
> > > f9210000.toppwr:tl421-rproc
> > > [  603.483172] remoteproc remoteproc0: stopped remote processor
> > > f9210000.toppwr:tl421-rproc
> > > [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> > > object, something is seriously wrong.
> > >
> >
> > I thought this issue was fixed.
> >
> Which patch fix this issue? I am using 4.19 kernel.
> 

I'm unable to find such commit right now. And given your report I
believe this still is a bug then.

> > >   ^^^^^^^^^^^^^^^^^^^^^
> > > [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
> > >  4.9.27-04454-gd4c1829-dirty #255
> > > [  603.517468] Hardware name: Banks (DT)
> > > [  603.521581] Workqueue: events rproc_crash_handler_work
> > > [  603.527342] Call trace:
> > > [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> > > [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> > > [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> > > [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> > > [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> > > [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> > > [  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
> > > [  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
> > > [  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
> > > [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> > > [  603.591754] [<ffffff8008877a68>] rproc_trigger_recovery+0x2f8/0x324
> > > [  603.598763] [<ffffff8008877b24>] rproc_crash_handler_work+0x90/0xb0
> > > [  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
> > > [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> > > [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> > > [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> > >
> > > When the crash happen, is it better to destroy and recreate all virtio
> > > device and itï¿½s children(rpmsg device) again to match the remote side state
> > > like the original behavior?
> > >
> >
> > Yes, it's likely that the protocols on top does share some state, so we
> > do not have any choice but to report this up to the virtio device.
> >
> > Removing and re-probing the devices - rather than having some other form
> > of notification of this event - makes the code simpler.
> >
> >
> > But it seems we're trying to re-register the same device the second
> > time, rather than initialize a new one.
> >
> If we use one new here, the old need to be destroyed to avoid the leak.
> Basically, it's become the old approach again.
> 

You're right, for vdevs we must tear them down and bring them up again.

The reason for introducing the offending commit was for carveouts to
be allocated past the stop() call, so that we can stop the core and
provide a coredump for post mortem debugging.

> We are building many rpmsg devices on top of rproc/virtio, each one
> has the internal state sync with the remote side.
> If remote side crash and reboot again, all state is stale and need to
> reset to the default state.
> Removing and re-probing is the clean and simple solution, I think too.
> 

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  7:26     ` Bjorn Andersson
@ 2018-11-08  8:16       ` Loic PALLARDY
  2018-11-08 18:11         ` xiang xiao
  2018-11-08  8:23       ` xiang xiao
  1 sibling, 1 reply; 10+ messages in thread
From: Loic PALLARDY @ 2018-11-08  8:16 UTC (permalink / raw)
  To: Bjorn Andersson, xiang xiao; +Cc: linux-remoteproc, spjoshi



> -----Original Message-----
> From: linux-remoteproc-owner@vger.kernel.org <linux-remoteproc-
> owner@vger.kernel.org> On Behalf Of Bjorn Andersson
> Sent: jeudi 8 novembre 2018 08:26
> To: xiang xiao <xiaoxiang781216@gmail.com>
> Cc: linux-remoteproc@vger.kernel.org; spjoshi@codeaurora.org
> Subject: Re: Regression by commit
> 7e83cab824a86704cdbd7735c19d34e0ce423dc5
> 
> On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:
> 
> > On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> > <bjorn.andersson@linaro.org> wrote:
> > >
> > > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
> > >
> > > > This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
> > > > which skip destroy the virtio device at stop but reinitialize it again at
> > > > start:
> > > > [  603.446805] remoteproc remoteproc0: crash detected in
> > > > f9210000.toppwr:tl421-rproc: type mmufault
> > > > [  603.456883] remoteproc remoteproc0: handling crash #1 in
> > > > f9210000.toppwr:tl421-rproc
> > > > [  603.469593] remoteproc remoteproc0: recovering
> > > > f9210000.toppwr:tl421-rproc
> > > > [  603.483172] remoteproc remoteproc0: stopped remote processor
> > > > f9210000.toppwr:tl421-rproc
> > > > [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> > > > object, something is seriously wrong.
> > > >
> > >
> > > I thought this issue was fixed.
> > >
> > Which patch fix this issue? I am using 4.19 kernel.
> >
> 
> I'm unable to find such commit right now. And given your report I
> believe this still is a bug then.

I send fix [1] for a similar issue in July. After investigation, I found an issue at rpmsg level in remove procedure.
In some case virtio device can't be destroyed as some child are still registered.
Could you please test my patch to see if it fix your issue?

Regards,
Loic

1: https://patchwork.kernel.org/patch/10544757/

> 
> > > >   ^^^^^^^^^^^^^^^^^^^^^
> > > > [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
> > > >  4.9.27-04454-gd4c1829-dirty #255
> > > > [  603.517468] Hardware name: Banks (DT)
> > > > [  603.521581] Workqueue: events rproc_crash_handler_work
> > > > [  603.527342] Call trace:
> > > > [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> > > > [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> > > > [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> > > > [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> > > > [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> > > > [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> > > > [  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
> > > > [  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
> > > > [  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
> > > > [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> > > > [  603.591754] [<ffffff8008877a68>]
> rproc_trigger_recovery+0x2f8/0x324
> > > > [  603.598763] [<ffffff8008877b24>]
> rproc_crash_handler_work+0x90/0xb0
> > > > [  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
> > > > [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> > > > [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> > > > [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> > > >
> > > > When the crash happen, is it better to destroy and recreate all virtio
> > > > device and it's children(rpmsg device) again to match the remote side
> state
> > > > like the original behavior?
> > > >
> > >
> > > Yes, it's likely that the protocols on top does share some state, so we
> > > do not have any choice but to report this up to the virtio device.
> > >
> > > Removing and re-probing the devices - rather than having some other
> form
> > > of notification of this event - makes the code simpler.
> > >
> > >
> > > But it seems we're trying to re-register the same device the second
> > > time, rather than initialize a new one.
> > >
> > If we use one new here, the old need to be destroyed to avoid the leak.
> > Basically, it's become the old approach again.
> >
> 
> You're right, for vdevs we must tear them down and bring them up again.
> 
> The reason for introducing the offending commit was for carveouts to
> be allocated past the stop() call, so that we can stop the core and
> provide a coredump for post mortem debugging.
> 
> > We are building many rpmsg devices on top of rproc/virtio, each one
> > has the internal state sync with the remote side.
> > If remote side crash and reboot again, all state is stale and need to
> > reset to the default state.
> > Removing and re-probing is the clean and simple solution, I think too.
> >
> 
> Regards,
> Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  7:26     ` Bjorn Andersson
  2018-11-08  8:16       ` Loic PALLARDY
@ 2018-11-08  8:23       ` xiang xiao
  2018-11-09  4:23         ` Bjorn Andersson
  1 sibling, 1 reply; 10+ messages in thread
From: xiang xiao @ 2018-11-08  8:23 UTC (permalink / raw)
  To: bjorn.andersson; +Cc: linux-remoteproc, spjoshi

On Thu, Nov 8, 2018 at 3:22 PM Bjorn Andersson
<bjorn.andersson@linaro.org> wrote:
>
> On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:
>
> > On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> > <bjorn.andersson@linaro.org> wrote:
> > >
> > > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:

> > > > When the crash happen, is it better to destroy and recreate all virtio
> > > > device and it’s children(rpmsg device) again to match the remote side state
> > > > like the original behavior?
> > > >
> > >
> > > Yes, it's likely that the protocols on top does share some state, so we
> > > do not have any choice but to report this up to the virtio device.
> > >
> > > Removing and re-probing the devices - rather than having some other form
> > > of notification of this event - makes the code simpler.
> > >
> > >
> > > But it seems we're trying to re-register the same device the second
> > > time, rather than initialize a new one.
> > >
> > If we use one new here, the old need to be destroyed to avoid the leak.
> > Basically, it's become the old approach again.
> >
>
> You're right, for vdevs we must tear them down and bring them up again.
>
> The reason for introducing the offending commit was for carveouts to
> be allocated past the stop() call, so that we can stop the core and
> provide a coredump for post mortem debugging.
>
Why we need stop the remote core before making a coredump?
Basically, the coredump is just a series memory copy, it isn't harmful
to do it while the remote code is still live.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  8:16       ` Loic PALLARDY
@ 2018-11-08 18:11         ` xiang xiao
  2018-11-08 20:10           ` Loic PALLARDY
  0 siblings, 1 reply; 10+ messages in thread
From: xiang xiao @ 2018-11-08 18:11 UTC (permalink / raw)
  To: loic.pallardy; +Cc: bjorn.andersson, linux-remoteproc, spjoshi

Loic, The patch can't fix my issue:(, how do you trigger virtio device
destroy process?
1.Go through shutdown/start circle
2.Or trigger some crash manually
My problem happen in the second case, maybe you can try it to see what
happen on your platform.

On Thu, Nov 8, 2018 at 4:16 PM Loic PALLARDY <loic.pallardy@st.com> wrote:
>
>
>
> > -----Original Message-----
> > From: linux-remoteproc-owner@vger.kernel.org <linux-remoteproc-
> > owner@vger.kernel.org> On Behalf Of Bjorn Andersson
> > Sent: jeudi 8 novembre 2018 08:26
> > To: xiang xiao <xiaoxiang781216@gmail.com>
> > Cc: linux-remoteproc@vger.kernel.org; spjoshi@codeaurora.org
> > Subject: Re: Regression by commit
> > 7e83cab824a86704cdbd7735c19d34e0ce423dc5
> >
> > On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:
> >
> > > On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> > > <bjorn.andersson@linaro.org> wrote:
> > > >
> > > > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
> > > >
> > > > > This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(),
> > > > > which skip destroy the virtio device at stop but reinitialize it again at
> > > > > start:
> > > > > [  603.446805] remoteproc remoteproc0: crash detected in
> > > > > f9210000.toppwr:tl421-rproc: type mmufault
> > > > > [  603.456883] remoteproc remoteproc0: handling crash #1 in
> > > > > f9210000.toppwr:tl421-rproc
> > > > > [  603.469593] remoteproc remoteproc0: recovering
> > > > > f9210000.toppwr:tl421-rproc
> > > > > [  603.483172] remoteproc remoteproc0: stopped remote processor
> > > > > f9210000.toppwr:tl421-rproc
> > > > > [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> > > > > object, something is seriously wrong.
> > > > >
> > > >
> > > > I thought this issue was fixed.
> > > >
> > > Which patch fix this issue? I am using 4.19 kernel.
> > >
> >
> > I'm unable to find such commit right now. And given your report I
> > believe this still is a bug then.
>
> I send fix [1] for a similar issue in July. After investigation, I found an issue at rpmsg level in remove procedure.
> In some case virtio device can't be destroyed as some child are still registered.
> Could you please test my patch to see if it fix your issue?
>
> Regards,
> Loic
>
> 1: https://patchwork.kernel.org/patch/10544757/
>
> >
> > > > >   ^^^^^^^^^^^^^^^^^^^^^
> > > > > [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
> > > > >  4.9.27-04454-gd4c1829-dirty #255
> > > > > [  603.517468] Hardware name: Banks (DT)
> > > > > [  603.521581] Workqueue: events rproc_crash_handler_work
> > > > > [  603.527342] Call trace:
> > > > > [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> > > > > [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> > > > > [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> > > > > [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> > > > > [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> > > > > [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> > > > > [  603.565750] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
> > > > > [  603.572669] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
> > > > > [  603.579390] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
> > > > > [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> > > > > [  603.591754] [<ffffff8008877a68>]
> > rproc_trigger_recovery+0x2f8/0x324
> > > > > [  603.598763] [<ffffff8008877b24>]
> > rproc_crash_handler_work+0x90/0xb0
> > > > > [  603.605778] [<ffffff80080cd570>] process_one_work+0x204/0x704
> > > > > [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> > > > > [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> > > > > [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> > > > >
> > > > > When the crash happen, is it better to destroy and recreate all virtio
> > > > > device and it's children(rpmsg device) again to match the remote side
> > state
> > > > > like the original behavior?
> > > > >
> > > >
> > > > Yes, it's likely that the protocols on top does share some state, so we
> > > > do not have any choice but to report this up to the virtio device.
> > > >
> > > > Removing and re-probing the devices - rather than having some other
> > form
> > > > of notification of this event - makes the code simpler.
> > > >
> > > >
> > > > But it seems we're trying to re-register the same device the second
> > > > time, rather than initialize a new one.
> > > >
> > > If we use one new here, the old need to be destroyed to avoid the leak.
> > > Basically, it's become the old approach again.
> > >
> >
> > You're right, for vdevs we must tear them down and bring them up again.
> >
> > The reason for introducing the offending commit was for carveouts to
> > be allocated past the stop() call, so that we can stop the core and
> > provide a coredump for post mortem debugging.
> >
> > > We are building many rpmsg devices on top of rproc/virtio, each one
> > > has the internal state sync with the remote side.
> > > If remote side crash and reboot again, all state is stale and need to
> > > reset to the default state.
> > > Removing and re-probing is the clean and simple solution, I think too.
> > >
> >
> > Regards,
> > Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08 18:11         ` xiang xiao
@ 2018-11-08 20:10           ` Loic PALLARDY
  0 siblings, 0 replies; 10+ messages in thread
From: Loic PALLARDY @ 2018-11-08 20:10 UTC (permalink / raw)
  To: xiang xiao; +Cc: bjorn.andersson, linux-remoteproc, spjoshi



> -----Original Message-----
> From: xiang xiao <xiaoxiang781216@gmail.com>
> Sent: jeudi 8 novembre 2018 19:11
> To: Loic PALLARDY <loic.pallardy@st.com>
> Cc: bjorn.andersson@linaro.org; linux-remoteproc@vger.kernel.org;
> spjoshi@codeaurora.org
> Subject: Re: Regression by commit
> 7e83cab824a86704cdbd7735c19d34e0ce423dc5
> 
> Loic, The patch can't fix my issue:(, how do you trigger virtio device
> destroy process?
> 1.Go through shutdown/start circle
> 2.Or trigger some crash manually
> My problem happen in the second case, maybe you can try it to see what
> happen on your platform.
> 

I run stress test doing start/shutdown circle on my platform and found issue I reported.
I'll do the same with crash and recovery using proposed debugfs interface for that and let you know

Regards,
Loic

> On Thu, Nov 8, 2018 at 4:16 PM Loic PALLARDY <loic.pallardy@st.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: linux-remoteproc-owner@vger.kernel.org <linux-remoteproc-
> > > owner@vger.kernel.org> On Behalf Of Bjorn Andersson
> > > Sent: jeudi 8 novembre 2018 08:26
> > > To: xiang xiao <xiaoxiang781216@gmail.com>
> > > Cc: linux-remoteproc@vger.kernel.org; spjoshi@codeaurora.org
> > > Subject: Re: Regression by commit
> > > 7e83cab824a86704cdbd7735c19d34e0ce423dc5
> > >
> > > On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:
> > >
> > > > On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> > > > <bjorn.andersson@linaro.org> wrote:
> > > > >
> > > > > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
> > > > >
> > > > > > This commit replace rproc_{shutdown,boot}() with
> rproc_{stop,start}(),
> > > > > > which skip destroy the virtio device at stop but reinitialize it again at
> > > > > > start:
> > > > > > [  603.446805] remoteproc remoteproc0: crash detected in
> > > > > > f9210000.toppwr:tl421-rproc: type mmufault
> > > > > > [  603.456883] remoteproc remoteproc0: handling crash #1 in
> > > > > > f9210000.toppwr:tl421-rproc
> > > > > > [  603.469593] remoteproc remoteproc0: recovering
> > > > > > f9210000.toppwr:tl421-rproc
> > > > > > [  603.483172] remoteproc remoteproc0: stopped remote processor
> > > > > > f9210000.toppwr:tl421-rproc
> > > > > > [  603.495999] kobject (ffffffc0b8c51098): tried to init an initialized
> > > > > > object, something is seriously wrong.
> > > > > >
> > > > >
> > > > > I thought this issue was fixed.
> > > > >
> > > > Which patch fix this issue? I am using 4.19 kernel.
> > > >
> > >
> > > I'm unable to find such commit right now. And given your report I
> > > believe this still is a bug then.
> >
> > I send fix [1] for a similar issue in July. After investigation, I found an issue
> at rpmsg level in remove procedure.
> > In some case virtio device can't be destroyed as some child are still
> registered.
> > Could you please test my patch to see if it fix your issue?
> >
> > Regards,
> > Loic
> >
> > 1: https://patchwork.kernel.org/patch/10544757/
> >
> > >
> > > > > >   ^^^^^^^^^^^^^^^^^^^^^
> > > > > > [  603.506868] CPU: 5 PID: 198 Comm: kworker/5:1 Tainted: G        W
> > > > > >  4.9.27-04454-gd4c1829-dirty #255
> > > > > > [  603.517468] Hardware name: Banks (DT)
> > > > > > [  603.521581] Workqueue: events rproc_crash_handler_work
> > > > > > [  603.527342] Call trace:
> > > > > > [  603.530086] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
> > > > > > [  603.536115] [<ffffff800808bf7c>] show_stack+0x14/0x1c
> > > > > > [  603.541771] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
> > > > > > [  603.547423] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
> > > > > > [  603.553280] [<ffffff800853758c>] device_initialize+0x3c/0xe8
> > > > > > [  603.559609] [<ffffff80085397d4>] device_register+0x14/0x28
> > > > > > [  603.565750] [<ffffff80084b777c>]
> register_virtio_device+0xc4/0x114
> > > > > > [  603.572669] [<ffffff8008878b20>]
> rproc_add_virtio_dev+0x7c/0x108
> > > > > > [  603.579390] [<ffffff8008875cbc>]
> rproc_vdev_do_probe+0x14/0x1c
> > > > > > [  603.585911] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
> > > > > > [  603.591754] [<ffffff8008877a68>]
> > > rproc_trigger_recovery+0x2f8/0x324
> > > > > > [  603.598763] [<ffffff8008877b24>]
> > > rproc_crash_handler_work+0x90/0xb0
> > > > > > [  603.605778] [<ffffff80080cd570>]
> process_one_work+0x204/0x704
> > > > > > [  603.612202] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
> > > > > > [  603.618248] [<ffffff80080d4aec>] kthread+0xec/0x100
> > > > > > [  603.623703] [<ffffff8008083890>] ret_from_fork+0x10/0x40
> > > > > >
> > > > > > When the crash happen, is it better to destroy and recreate all virtio
> > > > > > device and it's children(rpmsg device) again to match the remote
> side
> > > state
> > > > > > like the original behavior?
> > > > > >
> > > > >
> > > > > Yes, it's likely that the protocols on top does share some state, so we
> > > > > do not have any choice but to report this up to the virtio device.
> > > > >
> > > > > Removing and re-probing the devices - rather than having some other
> > > form
> > > > > of notification of this event - makes the code simpler.
> > > > >
> > > > >
> > > > > But it seems we're trying to re-register the same device the second
> > > > > time, rather than initialize a new one.
> > > > >
> > > > If we use one new here, the old need to be destroyed to avoid the
> leak.
> > > > Basically, it's become the old approach again.
> > > >
> > >
> > > You're right, for vdevs we must tear them down and bring them up again.
> > >
> > > The reason for introducing the offending commit was for carveouts to
> > > be allocated past the stop() call, so that we can stop the core and
> > > provide a coredump for post mortem debugging.
> > >
> > > > We are building many rpmsg devices on top of rproc/virtio, each one
> > > > has the internal state sync with the remote side.
> > > > If remote side crash and reboot again, all state is stale and need to
> > > > reset to the default state.
> > > > Removing and re-probing is the clean and simple solution, I think too.
> > > >
> > >
> > > Regards,
> > > Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
  2018-11-08  8:23       ` xiang xiao
@ 2018-11-09  4:23         ` Bjorn Andersson
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Andersson @ 2018-11-09  4:23 UTC (permalink / raw)
  To: xiang xiao; +Cc: linux-remoteproc, spjoshi

On Thu 08 Nov 00:23 PST 2018, xiang xiao wrote:

> On Thu, Nov 8, 2018 at 3:22 PM Bjorn Andersson
> <bjorn.andersson@linaro.org> wrote:
> >
> > On Wed 07 Nov 22:36 PST 2018, xiang xiao wrote:
> >
> > > On Thu, Nov 8, 2018 at 2:18 PM Bjorn Andersson
> > > <bjorn.andersson@linaro.org> wrote:
> > > >
> > > > On Wed 07 Nov 06:25 PST 2018, xiang xiao wrote:
> 
> > > > > When the crash happen, is it better to destroy and recreate all virtio
> > > > > device and itï¿½s children(rpmsg device) again to match the remote side state
> > > > > like the original behavior?
> > > > >
> > > >
> > > > Yes, it's likely that the protocols on top does share some state, so we
> > > > do not have any choice but to report this up to the virtio device.
> > > >
> > > > Removing and re-probing the devices - rather than having some other form
> > > > of notification of this event - makes the code simpler.
> > > >
> > > >
> > > > But it seems we're trying to re-register the same device the second
> > > > time, rather than initialize a new one.
> > > >
> > > If we use one new here, the old need to be destroyed to avoid the leak.
> > > Basically, it's become the old approach again.
> > >
> >
> > You're right, for vdevs we must tear them down and bring them up again.
> >
> > The reason for introducing the offending commit was for carveouts to
> > be allocated past the stop() call, so that we can stop the core and
> > provide a coredump for post mortem debugging.
> >
> Why we need stop the remote core before making a coredump?
> Basically, the coredump is just a series memory copy, it isn't harmful
> to do it while the remote code is still live.

In order to support implementing e.g. parts of the DRM stack the memory
carveouts assigned to the remoteprocs on Qualcomm platforms is protected
from read and write accesses from the application CPU.

Therefor the remoteproc must be stopped before the content is accessed.

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5
@ 2018-11-07  7:36 Xiang Xiao
  0 siblings, 0 replies; 10+ messages in thread
From: Xiang Xiao @ 2018-11-07  7:36 UTC (permalink / raw)
  To: linux-remoteproc

This commit replace rproc_{shutdown,boot}() with rproc_{stop,start}(), which
skip destroy the virtio device at stop but reinitialize it again at start:
[ 1502.733760] remoteproc remoteproc0: crash detected in
f9210000.toppwr:tl421-rproc: type watchdog
[ 1502.744019] remoteproc remoteproc0: handling crash #1 in
f9210000.toppwr:tl421-rproc
[ 1502.820741] remoteproc remoteproc0: recovering
f9210000.toppwr:tl421-rproc
[ 1502.831588] remoteproc remoteproc0: stopped remote processor
f9210000.toppwr:tl421-rproc
[ 1502.844458] kobject (ffffffc0b8d1b898): tried to init an initialized
object, something is seriously wrong.
[ 1502.855415] CPU: 6 PID: 169 Comm: kworker/6:1 Tainted: G        W      
4.9.27-04453-g1a03c96 #253
[ 1502.865442] Hardware name: Banks (DT)
[ 1502.869557] Workqueue: events rproc_crash_handler_work
[ 1502.875319] Call trace:
[ 1502.878066] [<ffffff800808bd9c>] dump_backtrace+0x0/0x1cc
[ 1502.884106] [<ffffff800808bf7c>] show_stack+0x14/0x1c
[ 1502.889761] [<ffffff80083fef08>] dump_stack+0xa8/0xe0
[ 1502.895415] [<ffffff8008402b24>] kobject_init+0x8c/0x9c
[ 1502.901262] [<ffffff800853758c>] device_initialize+0x3c/0xe8
[ 1502.907590] [<ffffff80085397d4>] device_register+0x14/0x28
[ 1502.913727] [<ffffff80084b777c>] register_virtio_device+0xc4/0x114
[ 1502.920641] [<ffffff8008878b20>] rproc_add_virtio_dev+0x7c/0x108
[ 1502.927360] [<ffffff8008875cbc>] rproc_vdev_do_probe+0x14/0x1c
[ 1502.933883] [<ffffff8008875a60>] rproc_start+0xac/0x1ac
[ 1502.939728] [<ffffff8008877a68>] rproc_trigger_recovery+0x2f8/0x324
[ 1502.946740] [<ffffff8008877b24>] rproc_crash_handler_work+0x90/0xb0
[ 1502.953752] [<ffffff80080cd570>] process_one_work+0x204/0x704
[ 1502.960179] [<ffffff80080cdac4>] worker_thread+0x54/0x4a8
[ 1502.966225] [<ffffff80080d4aec>] kthread+0xec/0x100
[ 1502.971683] [<ffffff8008083890>] ret_from_fork+0x10/0x40
When the crash happen, It’s better to destroy and recreate all virtio device
and it’s children(rpmsg device) again to match the remote side state like
the original behavior.

Thanks
Xiang

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-11-09  4:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-07 14:25 Regression by commit 7e83cab824a86704cdbd7735c19d34e0ce423dc5 xiang xiao
2018-11-08  6:22 ` Bjorn Andersson
2018-11-08  6:36   ` xiang xiao
2018-11-08  7:26     ` Bjorn Andersson
2018-11-08  8:16       ` Loic PALLARDY
2018-11-08 18:11         ` xiang xiao
2018-11-08 20:10           ` Loic PALLARDY
2018-11-08  8:23       ` xiang xiao
2018-11-09  4:23         ` Bjorn Andersson
  -- strict thread matches above, loose matches on Subject: below --
2018-11-07  7:36 Xiang Xiao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.