linux-remoteproc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* race condition issue at remote proc startup
@ 2021-05-04  9:45 Yann Sionneau
  2021-05-04 16:43 ` Arnaud POULIQUEN
  0 siblings, 1 reply; 4+ messages in thread
From: Yann Sionneau @ 2021-05-04  9:45 UTC (permalink / raw)
  To: linux-remoteproc; +Cc: Pierre-Yves Kerbrat, Vincent Chardon, Julien Hascoet

Hello,

We (at Kalray) have some difficulties during initialization of a 
remoteproc device, and there seem to have no clean way (at least not one 
we know of) out of this problem.

We need vring defined in the resource table to be completely initialized 
before the remoteproc device is started. By completely initialized I 
mean that the vring device address defined in resource table shall be 
changed from 0xff..ff to a proper address. Currently the remote device 
is started before the initialization has completed, which creates a race 
condition between Linux and the remoteproc device. (We have a particular 
architecture in which the processor running Linux is the same as the 
embedded processor, this is why this problem happens in our case but 
probably not when the processor running Linux is much faster than the 
embedded processor).

Our best attempt up to now is to configure the virtio ring sooner i.e 
during subdevice preparation instead of subdevice start.
i.e. in rproc_handle_vdev change code from
     rvdev->subdev.start = rproc_vdev_do_start;
to
     /* da field in vring must be initialized before powering up
      * the remoterproc, or else race condition may occur.
      * Indeed the remoteproc may read it before it has been initialized.
      */
     rvdev->subdev.prepare = rproc_vdev_do_start;

This works but it has undesired side effects. In particular some 
notifications are sent (the remote proc kick function is being called), 
but since the remote CPU has not been started yet we are not able to 
handle them, thus we simply ignore them if the state of the remote proc 
is not RUNNING.
At least this seems to solve our problem, but this is a particularly 
unpleasant way of solving the problem, in particular it might impact the 
existing remoteproc devices. Do you have any suggestion on some cleaner 
to way to solve this problem?

FYI, here is our arch specific remote proc implementation: 
https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c

PS: there seem to be a similar problem when the remote device is being 
stopped. The vring buffer are destroyed and only after is the remote 
proc device stopped. There is once again a race condition as the remote 
proc device might try to access the vring after their destruction by the 
host. Proposed change is as follow:
In rproc_handle_vdev change code from
     rvdev->subdev.stop = rproc_vdev_do_stop;
to
     rvdev->subdev.unprepare = rproc_vdev_do_stop;

Note this change has much less impact on existing remote proc and is 
symmetric to the previous change thus it might make it sound more logical

PS2: I guess that this issue never showed up before because most other 
use cases are using fixed addresses in the resource tables and not 
dynamically allocated ones at runtime.

Regards,

-- 
Yann


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: race condition issue at remote proc startup
  2021-05-04  9:45 race condition issue at remote proc startup Yann Sionneau
@ 2021-05-04 16:43 ` Arnaud POULIQUEN
  2021-05-04 17:26   ` Julien Hascoet
  0 siblings, 1 reply; 4+ messages in thread
From: Arnaud POULIQUEN @ 2021-05-04 16:43 UTC (permalink / raw)
  To: Yann Sionneau, linux-remoteproc
  Cc: Pierre-Yves Kerbrat, Vincent Chardon, Julien Hascoet

Hello Yann

On 5/4/21 11:45 AM, Yann Sionneau wrote:
> Hello,
> 
> We (at Kalray) have some difficulties during initialization of a remoteproc
> device, and there seem to have no clean way (at least not one we know of) out of
> this problem.
> 
> We need vring defined in the resource table to be completely initialized before
> the remoteproc device is started. By completely initialized I mean that the
> vring device address defined in resource table shall be changed from 0xff..ff to
> a proper address. Currently the remote device is started before the
> initialization has completed, which creates a race condition between Linux and
> the remoteproc device. (We have a particular architecture in which the processor
> running Linux is the same as the embedded processor, this is why this problem
> happens in our case but probably not when the processor running Linux is much
> faster than the embedded processor).

Is the remote side waiting for the vdev status[1] update before accessing the
vrings?

[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307


> 
> Our best attempt up to now is to configure the virtio ring sooner i.e during
> subdevice preparation instead of subdevice start.
> i.e. in rproc_handle_vdev change code from
>     rvdev->subdev.start = rproc_vdev_do_start;
> to
>     /* da field in vring must be initialized before powering up
>      * the remoterproc, or else race condition may occur.
>      * Indeed the remoteproc may read it before it has been initialized.
>      */
>     rvdev->subdev.prepare = rproc_vdev_do_start;
> 
> This works but it has undesired side effects. In particular some notifications
> are sent (the remote proc kick function is being called), but since the remote
> CPU has not been started yet we are not able to handle them, thus we simply
> ignore them if the state of the remote proc is not RUNNING.
> At least this seems to solve our problem, but this is a particularly unpleasant
> way of solving the problem, in particular it might impact the existing
> remoteproc devices. Do you have any suggestion on some cleaner to way to solve
> this problem?
> 
> FYI, here is our arch specific remote proc implementation:
> https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
> 
> 
> PS: there seem to be a similar problem when the remote device is being stopped.
> The vring buffer are destroyed and only after is the remote proc device stopped.
> There is once again a race condition as the remote proc device might try to
> access the vring after their destruction by the host. Proposed change is as follow:
> In rproc_handle_vdev change code from
>     rvdev->subdev.stop = rproc_vdev_do_stop;
> to
>     rvdev->subdev.unprepare = rproc_vdev_do_stop;

Should also be handled with the vdev status.

> 
> Note this change has much less impact on existing remote proc and is symmetric
> to the previous change thus it might make it sound more logical
> 
> PS2: I guess that this issue never showed up before because most other use cases
> are using fixed addresses in the resource tables and not dynamically allocated
> ones at runtime.

We use dynamic vring address allocation without any issue on STM323MP1 platform,
with the coprocessor started before the main processor running Linux.

Regards,
Arnaud

> 
> Regards,
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: race condition issue at remote proc startup
  2021-05-04 16:43 ` Arnaud POULIQUEN
@ 2021-05-04 17:26   ` Julien Hascoet
  2021-05-05 13:45     ` Yann Sionneau
  0 siblings, 1 reply; 4+ messages in thread
From: Julien Hascoet @ 2021-05-04 17:26 UTC (permalink / raw)
  To: Arnaud POULIQUEN
  Cc: Yann Sionneau, linux-remoteproc, Pierre-Yves Kerbrat, Vincent Chardon

Hello, thanks for the quick answer,

> Is the remote side waiting for the vdev status[1] update before accessing the vrings?
> [1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307

We are using the openamp project that does it for us: https://github.com/OpenAMP/open-amp/blob/master/lib/remoteproc/remoteproc.c#L925

I also checked the defines and they seem aligned on both sides:
linux: include/uapi/linux/virtio_config.h:#define VIRTIO_CONFIG_S_DRIVER_OK    4
openamp: lib/include/openamp/virtio.h:#define VIRTIO_CONFIG_STATUS_DRIVER_OK 0x04

Thanks !

----- Original Message -----
From: "Arnaud POULIQUEN" <arnaud.pouliquen@foss.st.com>
To: "Yann Sionneau" <ysionneau@kalray.eu>, linux-remoteproc@vger.kernel.org
Cc: "Pierre-Yves Kerbrat" <pkerbrat@kalrayinc.com>, "Vincent Chardon" <vchardon@kalrayinc.com>, "jhascoet" <jhascoet@kalrayinc.com>
Sent: Tuesday, May 4, 2021 6:43:47 PM
Subject: Re: race condition issue at remote proc startup

Hello Yann

On 5/4/21 11:45 AM, Yann Sionneau wrote:
> Hello,
> 
> We (at Kalray) have some difficulties during initialization of a remoteproc
> device, and there seem to have no clean way (at least not one we know of) out of
> this problem.
> 
> We need vring defined in the resource table to be completely initialized before
> the remoteproc device is started. By completely initialized I mean that the
> vring device address defined in resource table shall be changed from 0xff..ff to
> a proper address. Currently the remote device is started before the
> initialization has completed, which creates a race condition between Linux and
> the remoteproc device. (We have a particular architecture in which the processor
> running Linux is the same as the embedded processor, this is why this problem
> happens in our case but probably not when the processor running Linux is much
> faster than the embedded processor).

Is the remote side waiting for the vdev status[1] update before accessing the
vrings?

[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307


> 
> Our best attempt up to now is to configure the virtio ring sooner i.e during
> subdevice preparation instead of subdevice start.
> i.e. in rproc_handle_vdev change code from
>     rvdev->subdev.start = rproc_vdev_do_start;
> to
>     /* da field in vring must be initialized before powering up
>      * the remoterproc, or else race condition may occur.
>      * Indeed the remoteproc may read it before it has been initialized.
>      */
>     rvdev->subdev.prepare = rproc_vdev_do_start;
> 
> This works but it has undesired side effects. In particular some notifications
> are sent (the remote proc kick function is being called), but since the remote
> CPU has not been started yet we are not able to handle them, thus we simply
> ignore them if the state of the remote proc is not RUNNING.
> At least this seems to solve our problem, but this is a particularly unpleasant
> way of solving the problem, in particular it might impact the existing
> remoteproc devices. Do you have any suggestion on some cleaner to way to solve
> this problem?
> 
> FYI, here is our arch specific remote proc implementation:
> https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
> 
> 
> PS: there seem to be a similar problem when the remote device is being stopped.
> The vring buffer are destroyed and only after is the remote proc device stopped.
> There is once again a race condition as the remote proc device might try to
> access the vring after their destruction by the host. Proposed change is as follow:
> In rproc_handle_vdev change code from
>     rvdev->subdev.stop = rproc_vdev_do_stop;
> to
>     rvdev->subdev.unprepare = rproc_vdev_do_stop;

Should also be handled with the vdev status.

> 
> Note this change has much less impact on existing remote proc and is symmetric
> to the previous change thus it might make it sound more logical
> 
> PS2: I guess that this issue never showed up before because most other use cases
> are using fixed addresses in the resource tables and not dynamically allocated
> ones at runtime.

We use dynamic vring address allocation without any issue on STM323MP1 platform,
with the coprocessor started before the main processor running Linux.

Regards,
Arnaud

> 
> Regards,
>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: race condition issue at remote proc startup
  2021-05-04 17:26   ` Julien Hascoet
@ 2021-05-05 13:45     ` Yann Sionneau
  0 siblings, 0 replies; 4+ messages in thread
From: Yann Sionneau @ 2021-05-05 13:45 UTC (permalink / raw)
  To: Julien Hascoet, Arnaud POULIQUEN
  Cc: linux-remoteproc, Pierre-Yves Kerbrat, Vincent Chardon

Hello,

Thank you Arnaud for your very quick and interesting answer.

Thanks to your pointers Julien identified the *real* issue.

It was an issue in our OpenAMP code which was allocating DAs even in the 
"slave" code path, instead of just doing it in the "master" code path. 
Thus OpenAMP was actually overwritting the DAs that Linux wrote.

Issue fixed, and without any Linux patch ;)

Thanks a lot!

Regards,

Yann

On 04/05/2021 19:26, Julien Hascoet wrote:
> Hello, thanks for the quick answer,
>
>> Is the remote side waiting for the vdev status[1] update before accessing the vrings?
>> [1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307
> We are using the openamp project that does it for us: https://github.com/OpenAMP/open-amp/blob/master/lib/remoteproc/remoteproc.c#L925
>
> I also checked the defines and they seem aligned on both sides:
> linux: include/uapi/linux/virtio_config.h:#define VIRTIO_CONFIG_S_DRIVER_OK    4
> openamp: lib/include/openamp/virtio.h:#define VIRTIO_CONFIG_STATUS_DRIVER_OK 0x04
>
> Thanks !
>
> ----- Original Message -----
> From: "Arnaud POULIQUEN" <arnaud.pouliquen@foss.st.com>
> To: "Yann Sionneau" <ysionneau@kalray.eu>, linux-remoteproc@vger.kernel.org
> Cc: "Pierre-Yves Kerbrat" <pkerbrat@kalrayinc.com>, "Vincent Chardon" <vchardon@kalrayinc.com>, "jhascoet" <jhascoet@kalrayinc.com>
> Sent: Tuesday, May 4, 2021 6:43:47 PM
> Subject: Re: race condition issue at remote proc startup
>
> Hello Yann
>
> On 5/4/21 11:45 AM, Yann Sionneau wrote:
>> Hello,
>>
>> We (at Kalray) have some difficulties during initialization of a remoteproc
>> device, and there seem to have no clean way (at least not one we know of) out of
>> this problem.
>>
>> We need vring defined in the resource table to be completely initialized before
>> the remoteproc device is started. By completely initialized I mean that the
>> vring device address defined in resource table shall be changed from 0xff..ff to
>> a proper address. Currently the remote device is started before the
>> initialization has completed, which creates a race condition between Linux and
>> the remoteproc device. (We have a particular architecture in which the processor
>> running Linux is the same as the embedded processor, this is why this problem
>> happens in our case but probably not when the processor running Linux is much
>> faster than the embedded processor).
> Is the remote side waiting for the vdev status[1] update before accessing the
> vrings?
>
> [1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307
>
>
>> Our best attempt up to now is to configure the virtio ring sooner i.e during
>> subdevice preparation instead of subdevice start.
>> i.e. in rproc_handle_vdev change code from
>>      rvdev->subdev.start = rproc_vdev_do_start;
>> to
>>      /* da field in vring must be initialized before powering up
>>       * the remoterproc, or else race condition may occur.
>>       * Indeed the remoteproc may read it before it has been initialized.
>>       */
>>      rvdev->subdev.prepare = rproc_vdev_do_start;
>>
>> This works but it has undesired side effects. In particular some notifications
>> are sent (the remote proc kick function is being called), but since the remote
>> CPU has not been started yet we are not able to handle them, thus we simply
>> ignore them if the state of the remote proc is not RUNNING.
>> At least this seems to solve our problem, but this is a particularly unpleasant
>> way of solving the problem, in particular it might impact the existing
>> remoteproc devices. Do you have any suggestion on some cleaner to way to solve
>> this problem?
>>
>> FYI, here is our arch specific remote proc implementation:
>> https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
>>
>>
>> PS: there seem to be a similar problem when the remote device is being stopped.
>> The vring buffer are destroyed and only after is the remote proc device stopped.
>> There is once again a race condition as the remote proc device might try to
>> access the vring after their destruction by the host. Proposed change is as follow:
>> In rproc_handle_vdev change code from
>>      rvdev->subdev.stop = rproc_vdev_do_stop;
>> to
>>      rvdev->subdev.unprepare = rproc_vdev_do_stop;
> Should also be handled with the vdev status.
>
>> Note this change has much less impact on existing remote proc and is symmetric
>> to the previous change thus it might make it sound more logical
>>
>> PS2: I guess that this issue never showed up before because most other use cases
>> are using fixed addresses in the resource tables and not dynamically allocated
>> ones at runtime.
> We use dynamic vring address allocation without any issue on STM323MP1 platform,
> with the coprocessor started before the main processor running Linux.
>
> Regards,
> Arnaud
>
>> Regards,
>>


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-05-05 13:45 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-04  9:45 race condition issue at remote proc startup Yann Sionneau
2021-05-04 16:43 ` Arnaud POULIQUEN
2021-05-04 17:26   ` Julien Hascoet
2021-05-05 13:45     ` Yann Sionneau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).