All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-08 20:50 Brian Norris
  2017-09-11 19:48 ` Benson Leung
       [not found] ` <20170908205011.77986-1-briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
  0 siblings, 2 replies; 36+ messages in thread
From: Brian Norris @ 2017-09-08 20:50 UTC (permalink / raw)
  To: Olof Johansson, Benson Leung
  Cc: Lee Jones, linux-kernel, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Shawn Nematbakhsh, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, Brian Norris

From: Shawn Nematbakhsh <shawnn@chromium.org>

pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
one instance of these functions correct, but not the second, fall-back
case. We use the fall-back only when the first command returns an
IN_PROGRESS status, which is only used on some EC firmwares where we
don't want to constantly poll the bus, but instead back off and
sleep/retry for a little while.

Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
---
v3:
 * Added Javier's reviewed tag
 * It's been > 8 months since [1], so why not? And hey, Benson's officially in
   MAINTAINERS now! Too bad no one told me.

 [1] https://patchwork.kernel.org/patch/9450633/


v2:
 * Add Benson in 'To:'
 * make subject prefix more obvious

 drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
index 8dfa7fcb1248..e7bbdf947bbc 100644
--- a/drivers/platform/chrome/cros_ec_proto.c
+++ b/drivers/platform/chrome/cros_ec_proto.c
@@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
 			struct cros_ec_command *msg)
 {
 	int ret;
+	int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
 
 	if (ec_dev->proto_version > 2)
-		ret = ec_dev->pkt_xfer(ec_dev, msg);
+		xfer_fxn = ec_dev->pkt_xfer;
 	else
-		ret = ec_dev->cmd_xfer(ec_dev, msg);
+		xfer_fxn = ec_dev->cmd_xfer;
 
+	ret = (*xfer_fxn)(ec_dev, msg);
 	if (msg->result == EC_RES_IN_PROGRESS) {
 		int i;
 		struct cros_ec_command *status_msg;
@@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
 		for (i = 0; i < EC_COMMAND_RETRIES; i++) {
 			usleep_range(10000, 11000);
 
-			ret = ec_dev->cmd_xfer(ec_dev, status_msg);
+			ret = (*xfer_fxn)(ec_dev, status_msg);
 			if (ret < 0)
 				break;
 
-- 
2.14.1.581.gf28d330327-goog

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-08 20:50 [PATCH v3] platform/chrome: Use proper protocol transfer function Brian Norris
@ 2017-09-11 19:48 ` Benson Leung
       [not found] ` <20170908205011.77986-1-briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
  1 sibling, 0 replies; 36+ messages in thread
From: Benson Leung @ 2017-09-11 19:48 UTC (permalink / raw)
  To: Brian Norris
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Javier Martinez Canillas,
	Shawn Nematbakhsh, Gwendal Grignou, Enric Balletbo, Tomeu Vizoso

[-- Attachment #1: Type: text/plain, Size: 961 bytes --]

Hi Brian,

On Fri, Sep 08, 2017 at 01:50:11PM -0700, Brian Norris wrote:
> From: Shawn Nematbakhsh <shawnn@chromium.org>
> 
> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
> one instance of these functions correct, but not the second, fall-back
> case. We use the fall-back only when the first command returns an
> IN_PROGRESS status, which is only used on some EC firmwares where we
> don't want to constantly poll the bus, but instead back off and
> sleep/retry for a little while.
> 
> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>

Looks good. Applied for v4.14.

Thanks!
Benson

-- 
Benson Leung
Staff Software Engineer
Chrome OS Kernel
Google Inc.
bleung@google.com
Chromium OS Project
bleung@chromium.org

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-08 20:50 [PATCH v3] platform/chrome: Use proper protocol transfer function Brian Norris
@ 2017-09-19 13:44     ` Jon Hunter
       [not found] ` <20170908205011.77986-1-briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
  1 sibling, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-19 13:44 UTC (permalink / raw)
  To: Brian Norris, Olof Johansson, Benson Leung
  Cc: Lee Jones, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson,
	Brian Norris, Javier Martinez Canillas, Shawn Nematbakhsh,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi Brian,

On 08/09/17 21:50, Brian Norris wrote:
> From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> 
> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
> one instance of these functions correct, but not the second, fall-back
> case. We use the fall-back only when the first command returns an
> IN_PROGRESS status, which is only used on some EC firmwares where we
> don't want to constantly poll the bus, but instead back off and
> sleep/retry for a little while.
> 
> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
> Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> Signed-off-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> Reviewed-by: Javier Martinez Canillas <javier-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> ---
> v3:
>  * Added Javier's reviewed tag
>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>    MAINTAINERS now! Too bad no one told me.
> 
>  [1] https://patchwork.kernel.org/patch/9450633/
> 
> 
> v2:
>  * Add Benson in 'To:'
>  * make subject prefix more obvious
> 
>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index 8dfa7fcb1248..e7bbdf947bbc 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>  			struct cros_ec_command *msg)
>  {
>  	int ret;
> +	int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>  
>  	if (ec_dev->proto_version > 2)
> -		ret = ec_dev->pkt_xfer(ec_dev, msg);
> +		xfer_fxn = ec_dev->pkt_xfer;
>  	else
> -		ret = ec_dev->cmd_xfer(ec_dev, msg);
> +		xfer_fxn = ec_dev->cmd_xfer;
>  
> +	ret = (*xfer_fxn)(ec_dev, msg);
>  	if (msg->result == EC_RES_IN_PROGRESS) {
>  		int i;
>  		struct cros_ec_command *status_msg;
> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>  		for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>  			usleep_range(10000, 11000);
>  
> -			ret = ec_dev->cmd_xfer(ec_dev, status_msg);
> +			ret = (*xfer_fxn)(ec_dev, status_msg);
>  			if (ret < 0)
>  				break;
>  

Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
bisect is pointing to this commit. Reverting the above on top of -next
does allow the board to boot successfully. Looks like this board is
proto_version 3 but I have not looked into this any further. Let me know
if you have any thoughts.

Cheers
Jon

[0]
https://nvtb.github.io//linux-next/test_next-20170919/20170918213034/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt


-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-19 13:44     ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-19 13:44 UTC (permalink / raw)
  To: Brian Norris, Olof Johansson, Benson Leung
  Cc: Lee Jones, linux-kernel, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Shawn Nematbakhsh, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

Hi Brian,

On 08/09/17 21:50, Brian Norris wrote:
> From: Shawn Nematbakhsh <shawnn@chromium.org>
> 
> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
> one instance of these functions correct, but not the second, fall-back
> case. We use the fall-back only when the first command returns an
> IN_PROGRESS status, which is only used on some EC firmwares where we
> don't want to constantly poll the bus, but instead back off and
> sleep/retry for a little while.
> 
> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
> ---
> v3:
>  * Added Javier's reviewed tag
>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>    MAINTAINERS now! Too bad no one told me.
> 
>  [1] https://patchwork.kernel.org/patch/9450633/
> 
> 
> v2:
>  * Add Benson in 'To:'
>  * make subject prefix more obvious
> 
>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index 8dfa7fcb1248..e7bbdf947bbc 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>  			struct cros_ec_command *msg)
>  {
>  	int ret;
> +	int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>  
>  	if (ec_dev->proto_version > 2)
> -		ret = ec_dev->pkt_xfer(ec_dev, msg);
> +		xfer_fxn = ec_dev->pkt_xfer;
>  	else
> -		ret = ec_dev->cmd_xfer(ec_dev, msg);
> +		xfer_fxn = ec_dev->cmd_xfer;
>  
> +	ret = (*xfer_fxn)(ec_dev, msg);
>  	if (msg->result == EC_RES_IN_PROGRESS) {
>  		int i;
>  		struct cros_ec_command *status_msg;
> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>  		for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>  			usleep_range(10000, 11000);
>  
> -			ret = ec_dev->cmd_xfer(ec_dev, status_msg);
> +			ret = (*xfer_fxn)(ec_dev, status_msg);
>  			if (ret < 0)
>  				break;
>  

Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
bisect is pointing to this commit. Reverting the above on top of -next
does allow the board to boot successfully. Looks like this board is
proto_version 3 but I have not looked into this any further. Let me know
if you have any thoughts.

Cheers
Jon

[0]
https://nvtb.github.io//linux-next/test_next-20170919/20170918213034/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt


-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-19 13:44     ` Jon Hunter
@ 2017-09-19 14:09         ` Shawn N
  -1 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-19 14:09 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>
> Hi Brian,
>
> On 08/09/17 21:50, Brian Norris wrote:
> > From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> >
> > pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
> > one instance of these functions correct, but not the second, fall-back
> > case. We use the fall-back only when the first command returns an
> > IN_PROGRESS status, which is only used on some EC firmwares where we
> > don't want to constantly poll the bus, but instead back off and
> > sleep/retry for a little while.
> >
> > Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
> > Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> > Signed-off-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> > Reviewed-by: Javier Martinez Canillas <javier-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> > ---
> > v3:
> >  * Added Javier's reviewed tag
> >  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
> >    MAINTAINERS now! Too bad no one told me.
> >
> >  [1] https://patchwork.kernel.org/patch/9450633/
> >
> >
> > v2:
> >  * Add Benson in 'To:'
> >  * make subject prefix more obvious
> >
> >  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> > index 8dfa7fcb1248..e7bbdf947bbc 100644
> > --- a/drivers/platform/chrome/cros_ec_proto.c
> > +++ b/drivers/platform/chrome/cros_ec_proto.c
> > @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
> >                       struct cros_ec_command *msg)
> >  {
> >       int ret;
> > +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
> >
> >       if (ec_dev->proto_version > 2)
> > -             ret = ec_dev->pkt_xfer(ec_dev, msg);
> > +             xfer_fxn = ec_dev->pkt_xfer;
> >       else
> > -             ret = ec_dev->cmd_xfer(ec_dev, msg);
> > +             xfer_fxn = ec_dev->cmd_xfer;
> >
> > +     ret = (*xfer_fxn)(ec_dev, msg);
> >       if (msg->result == EC_RES_IN_PROGRESS) {
> >               int i;
> >               struct cros_ec_command *status_msg;
> > @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
> >               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
> >                       usleep_range(10000, 11000);
> >
> > -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
> > +                     ret = (*xfer_fxn)(ec_dev, status_msg);
> >                       if (ret < 0)
> >                               break;
> >
>
> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
> bisect is pointing to this commit. Reverting the above on top of -next
> does allow the board to boot successfully. Looks like this board is
> proto_version 3 but I have not looked into this any further. Let me know
> if you have any thoughts.


Thanks for the bug report, I'll look into this today.

> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);

So, ec_dev->proto_version > 3? That doesn't seem right.

>
>
> Cheers
> Jon
>
> [0]
> https://nvtb.github.io//linux-next/test_next-20170919/20170918213034/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt
>
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-19 14:09         ` Shawn N
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-19 14:09 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra

On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Brian,
>
> On 08/09/17 21:50, Brian Norris wrote:
> > From: Shawn Nematbakhsh <shawnn@chromium.org>
> >
> > pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
> > one instance of these functions correct, but not the second, fall-back
> > case. We use the fall-back only when the first command returns an
> > IN_PROGRESS status, which is only used on some EC firmwares where we
> > don't want to constantly poll the bus, but instead back off and
> > sleep/retry for a little while.
> >
> > Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
> > Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
> > ---
> > v3:
> >  * Added Javier's reviewed tag
> >  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
> >    MAINTAINERS now! Too bad no one told me.
> >
> >  [1] https://patchwork.kernel.org/patch/9450633/
> >
> >
> > v2:
> >  * Add Benson in 'To:'
> >  * make subject prefix more obvious
> >
> >  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> > index 8dfa7fcb1248..e7bbdf947bbc 100644
> > --- a/drivers/platform/chrome/cros_ec_proto.c
> > +++ b/drivers/platform/chrome/cros_ec_proto.c
> > @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
> >                       struct cros_ec_command *msg)
> >  {
> >       int ret;
> > +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
> >
> >       if (ec_dev->proto_version > 2)
> > -             ret = ec_dev->pkt_xfer(ec_dev, msg);
> > +             xfer_fxn = ec_dev->pkt_xfer;
> >       else
> > -             ret = ec_dev->cmd_xfer(ec_dev, msg);
> > +             xfer_fxn = ec_dev->cmd_xfer;
> >
> > +     ret = (*xfer_fxn)(ec_dev, msg);
> >       if (msg->result == EC_RES_IN_PROGRESS) {
> >               int i;
> >               struct cros_ec_command *status_msg;
> > @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
> >               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
> >                       usleep_range(10000, 11000);
> >
> > -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
> > +                     ret = (*xfer_fxn)(ec_dev, status_msg);
> >                       if (ret < 0)
> >                               break;
> >
>
> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
> bisect is pointing to this commit. Reverting the above on top of -next
> does allow the board to boot successfully. Looks like this board is
> proto_version 3 but I have not looked into this any further. Let me know
> if you have any thoughts.


Thanks for the bug report, I'll look into this today.

> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);

So, ec_dev->proto_version > 3? That doesn't seem right.

>
>
> Cheers
> Jon
>
> [0]
> https://nvtb.github.io//linux-next/test_next-20170919/20170918213034/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt
>
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-19 14:09         ` Shawn N
@ 2017-09-19 16:39             ` Jon Hunter
  -1 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-19 16:39 UTC (permalink / raw)
  To: Shawn N
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra-u79uwXL29TY76Z2rM5mHXA



On 19/09/17 15:09, Shawn N wrote:
> On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> Hi Brian,
>>
>> On 08/09/17 21:50, Brian Norris wrote:
>>> From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>>
>>> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
>>> one instance of these functions correct, but not the second, fall-back
>>> case. We use the fall-back only when the first command returns an
>>> IN_PROGRESS status, which is only used on some EC firmwares where we
>>> don't want to constantly poll the bus, but instead back off and
>>> sleep/retry for a little while.
>>>
>>> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
>>> Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>> Signed-off-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>> Reviewed-by: Javier Martinez Canillas <javier-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
>>> ---
>>> v3:
>>>  * Added Javier's reviewed tag
>>>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>>>    MAINTAINERS now! Too bad no one told me.
>>>
>>>  [1] https://patchwork.kernel.org/patch/9450633/
>>>
>>>
>>> v2:
>>>  * Add Benson in 'To:'
>>>  * make subject prefix more obvious
>>>
>>>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>>> index 8dfa7fcb1248..e7bbdf947bbc 100644
>>> --- a/drivers/platform/chrome/cros_ec_proto.c
>>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>>> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>                       struct cros_ec_command *msg)
>>>  {
>>>       int ret;
>>> +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>>>
>>>       if (ec_dev->proto_version > 2)
>>> -             ret = ec_dev->pkt_xfer(ec_dev, msg);
>>> +             xfer_fxn = ec_dev->pkt_xfer;
>>>       else
>>> -             ret = ec_dev->cmd_xfer(ec_dev, msg);
>>> +             xfer_fxn = ec_dev->cmd_xfer;u
>>>
>>> +     ret = (*xfer_fxn)(ec_dev, msg);
>>>       if (msg->result == EC_RES_IN_PROGRESS) {
>>>               int i;
>>>               struct cros_ec_command *status_msg;
>>> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>>>                       usleep_range(10000, 11000);
>>>
>>> -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
>>> +                     ret = (*xfer_fxn)(ec_dev, status_msg);
>>>                       if (ret < 0)
>>>                               break;
>>>
>>
>> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>> bisect is pointing to this commit. Reverting the above on top of -next
>> does allow the board to boot successfully. Looks like this board is
>> proto_version 3 but I have not looked into this any further. Let me know
>> if you have any thoughts.
> 
> 
> Thanks for the bug report, I'll look into this today.
> 
>> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
> 
> So, ec_dev->proto_version > 3? That doesn't seem right.

You mean != 3, but yes. Looks like an initialisation problem, because if I 
add the following WARNING ...

diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
index e7bbdf947bbc..ad3b3a1e8d54 100644
--- a/drivers/platform/chrome/cros_ec_proto.c
+++ b/drivers/platform/chrome/cros_ec_proto.c
@@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
        int i;
        u8 csum = 0;
 
+       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
        BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
        BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
 
... then I see ...

[    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
[    1.512566] 65535 != 3

Any chance this is being called before the version is initialised?

Cheers
Jon

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-19 16:39             ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-19 16:39 UTC (permalink / raw)
  To: Shawn N
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra



On 19/09/17 15:09, Shawn N wrote:
> On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>> Hi Brian,
>>
>> On 08/09/17 21:50, Brian Norris wrote:
>>> From: Shawn Nematbakhsh <shawnn@chromium.org>
>>>
>>> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
>>> one instance of these functions correct, but not the second, fall-back
>>> case. We use the fall-back only when the first command returns an
>>> IN_PROGRESS status, which is only used on some EC firmwares where we
>>> don't want to constantly poll the bus, but instead back off and
>>> sleep/retry for a little while.
>>>
>>> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
>>> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
>>> Signed-off-by: Brian Norris <briannorris@chromium.org>
>>> Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
>>> ---
>>> v3:
>>>  * Added Javier's reviewed tag
>>>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>>>    MAINTAINERS now! Too bad no one told me.
>>>
>>>  [1] https://patchwork.kernel.org/patch/9450633/
>>>
>>>
>>> v2:
>>>  * Add Benson in 'To:'
>>>  * make subject prefix more obvious
>>>
>>>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>>> index 8dfa7fcb1248..e7bbdf947bbc 100644
>>> --- a/drivers/platform/chrome/cros_ec_proto.c
>>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>>> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>                       struct cros_ec_command *msg)
>>>  {
>>>       int ret;
>>> +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>>>
>>>       if (ec_dev->proto_version > 2)
>>> -             ret = ec_dev->pkt_xfer(ec_dev, msg);
>>> +             xfer_fxn = ec_dev->pkt_xfer;
>>>       else
>>> -             ret = ec_dev->cmd_xfer(ec_dev, msg);
>>> +             xfer_fxn = ec_dev->cmd_xfer;u
>>>
>>> +     ret = (*xfer_fxn)(ec_dev, msg);
>>>       if (msg->result == EC_RES_IN_PROGRESS) {
>>>               int i;
>>>               struct cros_ec_command *status_msg;
>>> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>>>                       usleep_range(10000, 11000);
>>>
>>> -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
>>> +                     ret = (*xfer_fxn)(ec_dev, status_msg);
>>>                       if (ret < 0)
>>>                               break;
>>>
>>
>> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>> bisect is pointing to this commit. Reverting the above on top of -next
>> does allow the board to boot successfully. Looks like this board is
>> proto_version 3 but I have not looked into this any further. Let me know
>> if you have any thoughts.
> 
> 
> Thanks for the bug report, I'll look into this today.
> 
>> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
> 
> So, ec_dev->proto_version > 3? That doesn't seem right.

You mean != 3, but yes. Looks like an initialisation problem, because if I 
add the following WARNING ...

diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
index e7bbdf947bbc..ad3b3a1e8d54 100644
--- a/drivers/platform/chrome/cros_ec_proto.c
+++ b/drivers/platform/chrome/cros_ec_proto.c
@@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
        int i;
        u8 csum = 0;
 
+       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
        BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
        BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
 
... then I see ...

[    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
[    1.512566] 65535 != 3

Any chance this is being called before the version is initialised?

Cheers
Jon

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-19 16:39             ` Jon Hunter
@ 2017-09-19 17:03                 ` Shawn N
  -1 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-19 17:03 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 19, 2017 at 9:39 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>
>
> On 19/09/17 15:09, Shawn N wrote:
>> On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>> Hi Brian,
>>>
>>> On 08/09/17 21:50, Brian Norris wrote:
>>>> From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>>>
>>>> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
>>>> one instance of these functions correct, but not the second, fall-back
>>>> case. We use the fall-back only when the first command returns an
>>>> IN_PROGRESS status, which is only used on some EC firmwares where we
>>>> don't want to constantly poll the bus, but instead back off and
>>>> sleep/retry for a little while.
>>>>
>>>> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
>>>> Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>>> Signed-off-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>>>> Reviewed-by: Javier Martinez Canillas <javier-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
>>>> ---
>>>> v3:
>>>>  * Added Javier's reviewed tag
>>>>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>>>>    MAINTAINERS now! Too bad no one told me.
>>>>
>>>>  [1] https://patchwork.kernel.org/patch/9450633/
>>>>
>>>>
>>>> v2:
>>>>  * Add Benson in 'To:'
>>>>  * make subject prefix more obvious
>>>>
>>>>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>>>> index 8dfa7fcb1248..e7bbdf947bbc 100644
>>>> --- a/drivers/platform/chrome/cros_ec_proto.c
>>>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>>>> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>>                       struct cros_ec_command *msg)
>>>>  {
>>>>       int ret;
>>>> +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>>>>
>>>>       if (ec_dev->proto_version > 2)
>>>> -             ret = ec_dev->pkt_xfer(ec_dev, msg);
>>>> +             xfer_fxn = ec_dev->pkt_xfer;
>>>>       else
>>>> -             ret = ec_dev->cmd_xfer(ec_dev, msg);
>>>> +             xfer_fxn = ec_dev->cmd_xfer;u
>>>>
>>>> +     ret = (*xfer_fxn)(ec_dev, msg);
>>>>       if (msg->result == EC_RES_IN_PROGRESS) {
>>>>               int i;
>>>>               struct cros_ec_command *status_msg;
>>>> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>>               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>>>>                       usleep_range(10000, 11000);
>>>>
>>>> -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
>>>> +                     ret = (*xfer_fxn)(ec_dev, status_msg);
>>>>                       if (ret < 0)
>>>>                               break;
>>>>
>>>
>>> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>>> bisect is pointing to this commit. Reverting the above on top of -next
>>> does allow the board to boot successfully. Looks like this board is
>>> proto_version 3 but I have not looked into this any further. Let me know
>>> if you have any thoughts.
>>
>>
>> Thanks for the bug report, I'll look into this today.
>>
>>> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>>> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>>
>> So, ec_dev->proto_version > 3? That doesn't seem right.
>
> You mean != 3, but yes. Looks like an initialisation problem, because if I
> add the following WARNING ...

I meant > 3 because we check for > 2 in send_command().

>
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index e7bbdf947bbc..ad3b3a1e8d54 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>         int i;
>         u8 csum = 0;
>
> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>
> ... then I see ...
>
> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
> [    1.512566] 65535 != 3
>
> Any chance this is being called before the version is initialised?

It's initialized in cros_ec_query_all().

Considering your trace shows (send_command+0x20/0xd8) as a caller, I'm
guessing that we die on the first call to (*xfer_fxn):

+     ret = (*xfer_fxn)(ec_dev, msg);

That part of the change should be a NOP, I only added a function
pointer so we wouldn't have to re-check protocol_version later. The
syntax looks fine to me even after re-checking, but maybe I missed
something. Let me test on my side.

>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-19 17:03                 ` Shawn N
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-19 17:03 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Brian Norris, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra

On Tue, Sep 19, 2017 at 9:39 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 19/09/17 15:09, Shawn N wrote:
>> On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>>>
>>> Hi Brian,
>>>
>>> On 08/09/17 21:50, Brian Norris wrote:
>>>> From: Shawn Nematbakhsh <shawnn@chromium.org>
>>>>
>>>> pkt_xfer should be used for protocol v3, and cmd_xfer otherwise. We had
>>>> one instance of these functions correct, but not the second, fall-back
>>>> case. We use the fall-back only when the first command returns an
>>>> IN_PROGRESS status, which is only used on some EC firmwares where we
>>>> don't want to constantly poll the bus, but instead back off and
>>>> sleep/retry for a little while.
>>>>
>>>> Fixes: 2c7589af3c4d ("mfd: cros_ec: add proto v3 skeleton")
>>>> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
>>>> Signed-off-by: Brian Norris <briannorris@chromium.org>
>>>> Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
>>>> ---
>>>> v3:
>>>>  * Added Javier's reviewed tag
>>>>  * It's been > 8 months since [1], so why not? And hey, Benson's officially in
>>>>    MAINTAINERS now! Too bad no one told me.
>>>>
>>>>  [1] https://patchwork.kernel.org/patch/9450633/
>>>>
>>>>
>>>> v2:
>>>>  * Add Benson in 'To:'
>>>>  * make subject prefix more obvious
>>>>
>>>>  drivers/platform/chrome/cros_ec_proto.c | 8 +++++---
>>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>>>> index 8dfa7fcb1248..e7bbdf947bbc 100644
>>>> --- a/drivers/platform/chrome/cros_ec_proto.c
>>>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>>>> @@ -60,12 +60,14 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>>                       struct cros_ec_command *msg)
>>>>  {
>>>>       int ret;
>>>> +     int (*xfer_fxn)(struct cros_ec_device *ec, struct cros_ec_command *msg);
>>>>
>>>>       if (ec_dev->proto_version > 2)
>>>> -             ret = ec_dev->pkt_xfer(ec_dev, msg);
>>>> +             xfer_fxn = ec_dev->pkt_xfer;
>>>>       else
>>>> -             ret = ec_dev->cmd_xfer(ec_dev, msg);
>>>> +             xfer_fxn = ec_dev->cmd_xfer;u
>>>>
>>>> +     ret = (*xfer_fxn)(ec_dev, msg);
>>>>       if (msg->result == EC_RES_IN_PROGRESS) {
>>>>               int i;
>>>>               struct cros_ec_command *status_msg;
>>>> @@ -88,7 +90,7 @@ static int send_command(struct cros_ec_device *ec_dev,
>>>>               for (i = 0; i < EC_COMMAND_RETRIES; i++) {
>>>>                       usleep_range(10000, 11000);
>>>>
>>>> -                     ret = ec_dev->cmd_xfer(ec_dev, status_msg);
>>>> +                     ret = (*xfer_fxn)(ec_dev, status_msg);
>>>>                       if (ret < 0)
>>>>                               break;
>>>>
>>>
>>> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>>> bisect is pointing to this commit. Reverting the above on top of -next
>>> does allow the board to boot successfully. Looks like this board is
>>> proto_version 3 but I have not looked into this any further. Let me know
>>> if you have any thoughts.
>>
>>
>> Thanks for the bug report, I'll look into this today.
>>
>>> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>>> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>>
>> So, ec_dev->proto_version > 3? That doesn't seem right.
>
> You mean != 3, but yes. Looks like an initialisation problem, because if I
> add the following WARNING ...

I meant > 3 because we check for > 2 in send_command().

>
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index e7bbdf947bbc..ad3b3a1e8d54 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>         int i;
>         u8 csum = 0;
>
> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>
> ... then I see ...
>
> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
> [    1.512566] 65535 != 3
>
> Any chance this is being called before the version is initialised?

It's initialized in cros_ec_query_all().

Considering your trace shows (send_command+0x20/0xd8) as a caller, I'm
guessing that we die on the first call to (*xfer_fxn):

+     ret = (*xfer_fxn)(ec_dev, msg);

That part of the change should be a NOP, I only added a function
pointer so we wouldn't have to re-check protocol_version later. The
syntax looks fine to me even after re-checking, but maybe I missed
something. Let me test on my side.

>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-19 16:39             ` Jon Hunter
@ 2017-09-19 17:14                 ` Brian Norris
  -1 siblings, 0 replies; 36+ messages in thread
From: Brian Norris @ 2017-09-19 17:14 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Javier Martinez Canillas, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi Jon,

On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
> On 19/09/17 15:09, Shawn N wrote:
> > On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
> >> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
> >> bisect is pointing to this commit. Reverting the above on top of -next
> >> does allow the board to boot successfully. Looks like this board is
> >> proto_version 3 but I have not looked into this any further. Let me know
> >> if you have any thoughts.
> > 
> > 
> > Thanks for the bug report, I'll look into this today.

Yes, thanks!

> >> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
> >> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
> > 
> > So, ec_dev->proto_version > 3? That doesn't seem right.
> 
> You mean != 3, but yes. Looks like an initialisation problem, because if I 
> add the following WARNING ...
> 
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index e7bbdf947bbc..ad3b3a1e8d54 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>         int i;
>         u8 csum = 0;
>  
> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>  
> ... then I see ...
> 
> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
> [    1.512566] 65535 != 3
> 
> Any chance this is being called before the version is initialised?

If it's uninitialized, it should be 0 (the structure is kzalloc'd, and
the call stack you point to clearly shows it's at least been allocated
already). Also, if it's uninitialized, then you should be BUG'ing even
without this patch; the patch you've bisected to is only modifying the
*second* (or later) attempt to send the command, and it's using the same
'ec_dev' structure.

Furthermore, the only assignments to this 'proto_version' field look
like they're only writing one of 0, 2, 3, or

   min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)

. I don't see where 0xffff comes from.

So...is there any chance we've got a heap corruption somewhere?
Somebody's overwriting 'ec_dev->proto_version' accidentally?

Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-19 17:14                 ` Brian Norris
  0 siblings, 0 replies; 36+ messages in thread
From: Brian Norris @ 2017-09-19 17:14 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Javier Martinez Canillas,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso, linux-tegra

Hi Jon,

On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
> On 19/09/17 15:09, Shawn N wrote:
> > On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
> >> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
> >> bisect is pointing to this commit. Reverting the above on top of -next
> >> does allow the board to boot successfully. Looks like this board is
> >> proto_version 3 but I have not looked into this any further. Let me know
> >> if you have any thoughts.
> > 
> > 
> > Thanks for the bug report, I'll look into this today.

Yes, thanks!

> >> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
> >> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
> > 
> > So, ec_dev->proto_version > 3? That doesn't seem right.
> 
> You mean != 3, but yes. Looks like an initialisation problem, because if I 
> add the following WARNING ...
> 
> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> index e7bbdf947bbc..ad3b3a1e8d54 100644
> --- a/drivers/platform/chrome/cros_ec_proto.c
> +++ b/drivers/platform/chrome/cros_ec_proto.c
> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>         int i;
>         u8 csum = 0;
>  
> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>  
> ... then I see ...
> 
> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
> [    1.512566] 65535 != 3
> 
> Any chance this is being called before the version is initialised?

If it's uninitialized, it should be 0 (the structure is kzalloc'd, and
the call stack you point to clearly shows it's at least been allocated
already). Also, if it's uninitialized, then you should be BUG'ing even
without this patch; the patch you've bisected to is only modifying the
*second* (or later) attempt to send the command, and it's using the same
'ec_dev' structure.

Furthermore, the only assignments to this 'proto_version' field look
like they're only writing one of 0, 2, 3, or

   min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)

. I don't see where 0xffff comes from.

So...is there any chance we've got a heap corruption somewhere?
Somebody's overwriting 'ec_dev->proto_version' accidentally?

Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-19 17:14                 ` Brian Norris
@ 2017-09-20  6:05                     ` Shawn N
  -1 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-20  6:05 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Brian Norris, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
getting messed up, or the reply buffer is getting corrupted somehow.

               ec_dev->proto_version =
                        min(EC_HOST_REQUEST_VERSION,
                                        fls(proto_info->protocol_versions) - 1);

If proto_info->protocol_versions == 0 then ec_dev->proto_version will
be assigned 0xffff. The logic here seems strange to me, if the EC is
successfully replying to our v3 command then obviously it supports v3
(maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
Anyway, we need to figure out what is happening with our
EC_HOST_REQUEST_VERSION host command.

On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> Hi Jon,
>
> On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
>> On 19/09/17 15:09, Shawn N wrote:
>> > On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>> >> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>> >> bisect is pointing to this commit. Reverting the above on top of -next
>> >> does allow the board to boot successfully. Looks like this board is
>> >> proto_version 3 but I have not looked into this any further. Let me know
>> >> if you have any thoughts.
>> >
>> >
>> > Thanks for the bug report, I'll look into this today.
>
> Yes, thanks!
>
>> >> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>> >> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>> >
>> > So, ec_dev->proto_version > 3? That doesn't seem right.
>>
>> You mean != 3, but yes. Looks like an initialisation problem, because if I
>> add the following WARNING ...
>>
>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>> index e7bbdf947bbc..ad3b3a1e8d54 100644
>> --- a/drivers/platform/chrome/cros_ec_proto.c
>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>>         int i;
>>         u8 csum = 0;
>>
>> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>>
>> ... then I see ...
>>
>> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
>> [    1.512566] 65535 != 3
>>
>> Any chance this is being called before the version is initialised?
>
> If it's uninitialized, it should be 0 (the structure is kzalloc'd, and
> the call stack you point to clearly shows it's at least been allocated
> already). Also, if it's uninitialized, then you should be BUG'ing even
> without this patch; the patch you've bisected to is only modifying the
> *second* (or later) attempt to send the command, and it's using the same
> 'ec_dev' structure.
>
> Furthermore, the only assignments to this 'proto_version' field look
> like they're only writing one of 0, 2, 3, or
>
>    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
>
> . I don't see where 0xffff comes from.
>
> So...is there any chance we've got a heap corruption somewhere?
> Somebody's overwriting 'ec_dev->proto_version' accidentally?
>
> Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-20  6:05                     ` Shawn N
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-20  6:05 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Brian Norris, Benson Leung, Lee Jones,
	linux-kernel, Doug Anderson, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
getting messed up, or the reply buffer is getting corrupted somehow.

               ec_dev->proto_version =
                        min(EC_HOST_REQUEST_VERSION,
                                        fls(proto_info->protocol_versions) - 1);

If proto_info->protocol_versions == 0 then ec_dev->proto_version will
be assigned 0xffff. The logic here seems strange to me, if the EC is
successfully replying to our v3 command then obviously it supports v3
(maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
Anyway, we need to figure out what is happening with our
EC_HOST_REQUEST_VERSION host command.

On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris@chromium.org> wrote:
> Hi Jon,
>
> On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
>> On 19/09/17 15:09, Shawn N wrote:
>> > On Tue, Sep 19, 2017 at 6:44 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>> >> Tegra124 Nyan-Big is currently crashing during boot with -next [0] and
>> >> bisect is pointing to this commit. Reverting the above on top of -next
>> >> does allow the board to boot successfully. Looks like this board is
>> >> proto_version 3 but I have not looked into this any further. Let me know
>> >> if you have any thoughts.
>> >
>> >
>> > Thanks for the bug report, I'll look into this today.
>
> Yes, thanks!
>
>> >> [    1.502497] kernel BUG at drivers/platform/chrome/cros_ec_proto.c:34!
>> >> 34 BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>> >
>> > So, ec_dev->proto_version > 3? That doesn't seem right.
>>
>> You mean != 3, but yes. Looks like an initialisation problem, because if I
>> add the following WARNING ...
>>
>> diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
>> index e7bbdf947bbc..ad3b3a1e8d54 100644
>> --- a/drivers/platform/chrome/cros_ec_proto.c
>> +++ b/drivers/platform/chrome/cros_ec_proto.c
>> @@ -31,6 +31,7 @@ static int prepare_packet(struct cros_ec_device *ec_dev,
>>         int i;
>>         u8 csum = 0;
>>
>> +       WARN(ec_dev->proto_version != EC_HOST_REQUEST_VERSION, "%d != %d", ec_dev->proto_version, EC_HOST_REQUEST_VERSION);
>>         BUG_ON(ec_dev->proto_version != EC_HOST_REQUEST_VERSION);
>>         BUG_ON(msg->outsize + sizeof(*request) > ec_dev->dout_size);
>>
>> ... then I see ...
>>
>> [    1.502495] WARNING: CPU: 0 PID: 1 at drivers/platform/chrome/cros_ec_proto.c:35 cros_ec_prepare_tx+0x190/0x1a8
>> [    1.512566] 65535 != 3
>>
>> Any chance this is being called before the version is initialised?
>
> If it's uninitialized, it should be 0 (the structure is kzalloc'd, and
> the call stack you point to clearly shows it's at least been allocated
> already). Also, if it's uninitialized, then you should be BUG'ing even
> without this patch; the patch you've bisected to is only modifying the
> *second* (or later) attempt to send the command, and it's using the same
> 'ec_dev' structure.
>
> Furthermore, the only assignments to this 'proto_version' field look
> like they're only writing one of 0, 2, 3, or
>
>    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
>
> . I don't see where 0xffff comes from.
>
> So...is there any chance we've got a heap corruption somewhere?
> Somebody's overwriting 'ec_dev->proto_version' accidentally?
>
> Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-20  6:05                     ` Shawn N
@ 2017-09-20  6:13                         ` Brian Norris
  -1 siblings, 0 replies; 36+ messages in thread
From: Brian Norris @ 2017-09-20  6:13 UTC (permalink / raw)
  To: Shawn N
  Cc: Jon Hunter, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi,

On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
> getting messed up, or the reply buffer is getting corrupted somehow.
> 
>                ec_dev->proto_version =
>                         min(EC_HOST_REQUEST_VERSION,
>                                         fls(proto_info->protocol_versions) - 1);
> 
> If proto_info->protocol_versions == 0 then ec_dev->proto_version will
> be assigned 0xffff. The logic here seems strange to me, if the EC is

Whoops...

> successfully replying to our v3 command then obviously it supports v3
> (maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
> Anyway, we need to figure out what is happening with our
> EC_HOST_REQUEST_VERSION host command.
> 
> On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> > Hi Jon,
> >
> > On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
> >> On 19/09/17 15:09, Shawn N wrote:
...
> > Furthermore, the only assignments to this 'proto_version' field look
> > like they're only writing one of 0, 2, 3, or
> >
> >    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
> >
> > . I don't see where 0xffff comes from.

...I'm an idiot. While the rvalue (the expression above) is an int (e.g,
-1), it's getting cast into a uint16_t (ec_dev->proto_version). So
that's where the 0xffff can come from.

Sorry if I misled you Shawn :(

Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-20  6:13                         ` Brian Norris
  0 siblings, 0 replies; 36+ messages in thread
From: Brian Norris @ 2017-09-20  6:13 UTC (permalink / raw)
  To: Shawn N
  Cc: Jon Hunter, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel, Doug Anderson, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

Hi,

On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
> getting messed up, or the reply buffer is getting corrupted somehow.
> 
>                ec_dev->proto_version =
>                         min(EC_HOST_REQUEST_VERSION,
>                                         fls(proto_info->protocol_versions) - 1);
> 
> If proto_info->protocol_versions == 0 then ec_dev->proto_version will
> be assigned 0xffff. The logic here seems strange to me, if the EC is

Whoops...

> successfully replying to our v3 command then obviously it supports v3
> (maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
> Anyway, we need to figure out what is happening with our
> EC_HOST_REQUEST_VERSION host command.
> 
> On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris@chromium.org> wrote:
> > Hi Jon,
> >
> > On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
> >> On 19/09/17 15:09, Shawn N wrote:
...
> > Furthermore, the only assignments to this 'proto_version' field look
> > like they're only writing one of 0, 2, 3, or
> >
> >    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
> >
> > . I don't see where 0xffff comes from.

...I'm an idiot. While the rvalue (the expression above) is an int (e.g,
-1), it's getting cast into a uint16_t (ec_dev->proto_version). So
that's where the 0xffff can come from.

Sorry if I misled you Shawn :(

Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-20  6:13                         ` Brian Norris
@ 2017-09-20 20:22                             ` Shawn N
  -1 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-20 20:22 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Brian Norris, Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> Hi,
>
> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>> getting messed up, or the reply buffer is getting corrupted somehow.
>>
>>                ec_dev->proto_version =
>>                         min(EC_HOST_REQUEST_VERSION,
>>                                         fls(proto_info->protocol_versions) - 1);
>>

Checking this closer, the first host command we send after we boot the
kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
(see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
this doesn't seem to happen on the Chromium OS nyan_big release
kernel, I suggest to hook up a logic analyzer and see if the SPI
master is doing something bad.

The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
"the host command was received by the EC and is currently being
handled, poll status until completion". So the caller polls status
with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
(which is interpreted to mean "the host command I sent previously has
now successfully completed"), and returns success. The problem here is
that the initial host command was never received at all, and no reply
was ever received, so our reply data is all zero.

Two things need to be fixed here:

1) Find out why the first host command after boot is failing. Probe
SPI pins and see what's going on.
2) Fix error handling so we properly return an error (or properly
retry the entire command) when a protocol error occurs (I made some
attempt in https://chromium-review.googlesource.com/385080/, probably
I should revisit that).

>> If proto_info->protocol_versions == 0 then ec_dev->proto_version will
>> be assigned 0xffff. The logic here seems strange to me, if the EC is
>
> Whoops...
>
>> successfully replying to our v3 command then obviously it supports v3
>> (maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
>> Anyway, we need to figure out what is happening with our
>> EC_HOST_REQUEST_VERSION host command.
>>
>> On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>> > Hi Jon,
>> >
>> > On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
>> >> On 19/09/17 15:09, Shawn N wrote:
> ...
>> > Furthermore, the only assignments to this 'proto_version' field look
>> > like they're only writing one of 0, 2, 3, or
>> >
>> >    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
>> >
>> > . I don't see where 0xffff comes from.
>
> ...I'm an idiot. While the rvalue (the expression above) is an int (e.g,
> -1), it's getting cast into a uint16_t (ec_dev->proto_version). So
> that's where the 0xffff can come from.

I saw that before and overlooked it too, so we're both idiots.

>
> Sorry if I misled you Shawn :(
>
> Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-20 20:22                             ` Shawn N
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-09-20 20:22 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
> Hi,
>
> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>> getting messed up, or the reply buffer is getting corrupted somehow.
>>
>>                ec_dev->proto_version =
>>                         min(EC_HOST_REQUEST_VERSION,
>>                                         fls(proto_info->protocol_versions) - 1);
>>

Checking this closer, the first host command we send after we boot the
kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
(see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
this doesn't seem to happen on the Chromium OS nyan_big release
kernel, I suggest to hook up a logic analyzer and see if the SPI
master is doing something bad.

The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
"the host command was received by the EC and is currently being
handled, poll status until completion". So the caller polls status
with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
(which is interpreted to mean "the host command I sent previously has
now successfully completed"), and returns success. The problem here is
that the initial host command was never received at all, and no reply
was ever received, so our reply data is all zero.

Two things need to be fixed here:

1) Find out why the first host command after boot is failing. Probe
SPI pins and see what's going on.
2) Fix error handling so we properly return an error (or properly
retry the entire command) when a protocol error occurs (I made some
attempt in https://chromium-review.googlesource.com/385080/, probably
I should revisit that).

>> If proto_info->protocol_versions == 0 then ec_dev->proto_version will
>> be assigned 0xffff. The logic here seems strange to me, if the EC is
>
> Whoops...
>
>> successfully replying to our v3 command then obviously it supports v3
>> (maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
>> Anyway, we need to figure out what is happening with our
>> EC_HOST_REQUEST_VERSION host command.
>>
>> On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris@chromium.org> wrote:
>> > Hi Jon,
>> >
>> > On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
>> >> On 19/09/17 15:09, Shawn N wrote:
> ...
>> > Furthermore, the only assignments to this 'proto_version' field look
>> > like they're only writing one of 0, 2, 3, or
>> >
>> >    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
>> >
>> > . I don't see where 0xffff comes from.
>
> ...I'm an idiot. While the rvalue (the expression above) is an int (e.g,
> -1), it's getting cast into a uint16_t (ec_dev->proto_version). So
> that's where the 0xffff can come from.

I saw that before and overlooked it too, so we're both idiots.

>
> Sorry if I misled you Shawn :(
>
> Brian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-20 20:22                             ` Shawn N
  (?)
@ 2017-09-25 23:15                             ` Shawn N
  2017-09-26 15:40                                 ` Jon Hunter
  2017-10-10 13:35                                 ` Jon Hunter
  -1 siblings, 2 replies; 36+ messages in thread
From: Shawn N @ 2017-09-25 23:15 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

[-- Attachment #1: Type: text/plain, Size: 6440 bytes --]

On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>> Hi,
>>
>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>
>>>                ec_dev->proto_version =
>>>                         min(EC_HOST_REQUEST_VERSION,
>>>                                         fls(proto_info->protocol_versions) - 1);
>>>
>
> Checking this closer, the first host command we send after we boot the
> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
> this doesn't seem to happen on the Chromium OS nyan_big release
> kernel, I suggest to hook up a logic analyzer and see if the SPI
> master is doing something bad.
>
> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
> "the host command was received by the EC and is currently being
> handled, poll status until completion". So the caller polls status
> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
> (which is interpreted to mean "the host command I sent previously has
> now successfully completed"), and returns success. The problem here is
> that the initial host command was never received at all, and no reply
> was ever received, so our reply data is all zero.
>
> Two things need to be fixed here:
>
> 1) Find out why the first host command after boot is failing. Probe
> SPI pins and see what's going on.
> 2) Fix error handling so we properly return an error (or properly
> retry the entire command) when a protocol error occurs (I made some
> attempt in https://chromium-review.googlesource.com/385080/, probably
> I should revisit that).

The below patch will fix error handling and will make things mostly
work on nyan_big, because we'll fall back to V2 protocol after the
initial failure. But we should still investigate why we're getting
errors on the first host command. We aren't seeing these errors when
we send commands from firmware, so I suspect something is wrong in
kernel SPI HW initialization that causes the first command to fail.

From: Shawn Nematbakhsh <shawnn@chromium.org>
Date: Mon, 25 Sep 2017 14:32:38 -0700
Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling

For host commands that take a long time to process, cros ec can return
early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
status with EC_CMD_GET_COMMS_STATUS until completion of the command.

None of the above applies when data link errors are encountered. When
errors such as EC_SPI_PAST_END are encountered during command
transmission, it usually means the command was not received by the EC.
Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
almost always the wrong decision, and can result in host commands
silently being lost.

Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
---
 drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
index c9714072e224..d33e3847e11e 100644
--- a/drivers/mfd/cros_ec_spi.c
+++ b/drivers/mfd/cros_ec_spi.c
@@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
cros_ec_device *ec_dev,
        u8 *ptr;
        u8 *rx_buf;
        u8 sum;
+       u8 rx_byte;
        int ret = 0, final_ret;

        len = cros_ec_prepare_tx(ec_dev, ec_msg);
@@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
cros_ec_device *ec_dev,
        if (!ret) {
                /* Verify that EC can process command */
                for (i = 0; i < len; i++) {
-                       switch (rx_buf[i]) {
-                       case EC_SPI_PAST_END:
-                       case EC_SPI_RX_BAD_DATA:
-                       case EC_SPI_NOT_READY:
-                               ret = -EAGAIN;
-                               ec_msg->result = EC_RES_IN_PROGRESS;
-                       default:
+                       rx_byte = rx_buf[i];
+                       if (rx_byte == EC_SPI_PAST_END  ||
+                           rx_byte == EC_SPI_RX_BAD_DATA ||
+                           rx_byte == EC_SPI_NOT_READY) {
+                               ret = -EREMOTEIO;
                                break;
                        }
-                       if (ret)
-                               break;
                }
-               if (!ret)
-                       ret = cros_ec_spi_receive_packet(ec_dev,
-                                       ec_msg->insize + sizeof(*response));
-       } else {
-               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
        }

+       if (!ret)
+               ret = cros_ec_spi_receive_packet(ec_dev,
+                               ec_msg->insize + sizeof(*response));
+       else
+               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
+
        final_ret = terminate_request(ec_dev);

        spi_bus_unlock(ec_spi->spi->master);
-- 
2.12.2

>
>>> If proto_info->protocol_versions == 0 then ec_dev->proto_version will
>>> be assigned 0xffff. The logic here seems strange to me, if the EC is
>>
>> Whoops...
>>
>>> successfully replying to our v3 command then obviously it supports v3
>>> (maybe it will be useful someday if EC_HOST_REQUEST_VERSION is rev'd).
>>> Anyway, we need to figure out what is happening with our
>>> EC_HOST_REQUEST_VERSION host command.
>>>
>>> On Tue, Sep 19, 2017 at 10:14 AM, Brian Norris <briannorris@chromium.org> wrote:
>>> > Hi Jon,
>>> >
>>> > On Tue, Sep 19, 2017 at 05:39:56PM +0100, Jon Hunter wrote:
>>> >> On 19/09/17 15:09, Shawn N wrote:
>> ...
>>> > Furthermore, the only assignments to this 'proto_version' field look
>>> > like they're only writing one of 0, 2, 3, or
>>> >
>>> >    min(EC_HOST_REQUEST_VERSION, fls(proto_info->protocol_versions) - 1)
>>> >
>>> > . I don't see where 0xffff comes from.
>>
>> ...I'm an idiot. While the rvalue (the expression above) is an int (e.g,
>> -1), it's getting cast into a uint16_t (ec_dev->proto_version). So
>> that's where the 0xffff can come from.
>
> I saw that before and overlooked it too, so we're both idiots.
>
>>
>> Sorry if I misled you Shawn :(
>>
>> Brian

[-- Attachment #2: 0001-mfd-cros-ec-spi-Fix-in-progress-error-signaling.patch --]
[-- Type: text/x-patch, Size: 2352 bytes --]

From 9a9073b069001f0d265b735e263f0f9823965e95 Mon Sep 17 00:00:00 2001
From: Shawn Nematbakhsh <shawnn@chromium.org>
Date: Mon, 25 Sep 2017 14:32:38 -0700
Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling

For host commands that take a long time to process, cros ec can return
early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
status with EC_CMD_GET_COMMS_STATUS until completion of the command.

None of the above applies when data link errors are encountered. When
errors such as EC_SPI_PAST_END are encountered during command
transmission, it usually means the command was not received by the EC.
Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
almost always the wrong decision, and can result in host commands
silently being lost.

Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
---
 drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
index c9714072e224..d33e3847e11e 100644
--- a/drivers/mfd/cros_ec_spi.c
+++ b/drivers/mfd/cros_ec_spi.c
@@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
 	u8 *ptr;
 	u8 *rx_buf;
 	u8 sum;
+	u8 rx_byte;
 	int ret = 0, final_ret;
 
 	len = cros_ec_prepare_tx(ec_dev, ec_msg);
@@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
 	if (!ret) {
 		/* Verify that EC can process command */
 		for (i = 0; i < len; i++) {
-			switch (rx_buf[i]) {
-			case EC_SPI_PAST_END:
-			case EC_SPI_RX_BAD_DATA:
-			case EC_SPI_NOT_READY:
-				ret = -EAGAIN;
-				ec_msg->result = EC_RES_IN_PROGRESS;
-			default:
+			rx_byte = rx_buf[i];
+			if (rx_byte == EC_SPI_PAST_END  ||
+			    rx_byte == EC_SPI_RX_BAD_DATA ||
+			    rx_byte == EC_SPI_NOT_READY) {
+				ret = -EREMOTEIO;
 				break;
 			}
-			if (ret)
-				break;
 		}
-		if (!ret)
-			ret = cros_ec_spi_receive_packet(ec_dev,
-					ec_msg->insize + sizeof(*response));
-	} else {
-		dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
 	}
 
+	if (!ret)
+		ret = cros_ec_spi_receive_packet(ec_dev,
+				ec_msg->insize + sizeof(*response));
+	else
+		dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
+
 	final_ret = terminate_request(ec_dev);
 
 	spi_bus_unlock(ec_spi->spi->master);
-- 
2.12.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-25 23:15                             ` Shawn N
@ 2017-09-26 15:40                                 ` Jon Hunter
  2017-10-10 13:35                                 ` Jon Hunter
  1 sibling, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-26 15:40 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra


On 26/09/17 00:15, Shawn N wrote:
> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>>> Hi,
>>>
>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>
>>>>                ec_dev->proto_version =
>>>>                         min(EC_HOST_REQUEST_VERSION,
>>>>                                         fls(proto_info->protocol_versions) - 1);
>>>>
>>
>> Checking this closer, the first host command we send after we boot the
>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>> this doesn't seem to happen on the Chromium OS nyan_big release
>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>> master is doing something bad.
>>
>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>> "the host command was received by the EC and is currently being
>> handled, poll status until completion". So the caller polls status
>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>> (which is interpreted to mean "the host command I sent previously has
>> now successfully completed"), and returns success. The problem here is
>> that the initial host command was never received at all, and no reply
>> was ever received, so our reply data is all zero.
>>
>> Two things need to be fixed here:
>>
>> 1) Find out why the first host command after boot is failing. Probe
>> SPI pins and see what's going on.

Yes, I will see if I can look into this.

>> 2) Fix error handling so we properly return an error (or properly
>> retry the entire command) when a protocol error occurs (I made some
>> attempt in https://chromium-review.googlesource.com/385080/, probably
>> I should revisit that).
> 
> The below patch will fix error handling and will make things mostly
> work on nyan_big, because we'll fall back to V2 protocol after the
> initial failure. But we should still investigate why we're getting
> errors on the first host command. We aren't seeing these errors when
> we send commands from firmware, so I suspect something is wrong in
> kernel SPI HW initialization that causes the first command to fail.
> 
> From: Shawn Nematbakhsh <shawnn@chromium.org>
> Date: Mon, 25 Sep 2017 14:32:38 -0700
> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
> 
> For host commands that take a long time to process, cros ec can return
> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
> 
> None of the above applies when data link errors are encountered. When
> errors such as EC_SPI_PAST_END are encountered during command
> transmission, it usually means the command was not received by the EC.
> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
> almost always the wrong decision, and can result in host commands
> silently being lost.
> 
> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> ---
>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> index c9714072e224..d33e3847e11e 100644
> --- a/drivers/mfd/cros_ec_spi.c
> +++ b/drivers/mfd/cros_ec_spi.c
> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
> cros_ec_device *ec_dev,
>         u8 *ptr;
>         u8 *rx_buf;
>         u8 sum;
> +       u8 rx_byte;
>         int ret = 0, final_ret;
> 
>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
> cros_ec_device *ec_dev,
>         if (!ret) {
>                 /* Verify that EC can process command */
>                 for (i = 0; i < len; i++) {
> -                       switch (rx_buf[i]) {
> -                       case EC_SPI_PAST_END:
> -                       case EC_SPI_RX_BAD_DATA:
> -                       case EC_SPI_NOT_READY:
> -                               ret = -EAGAIN;
> -                               ec_msg->result = EC_RES_IN_PROGRESS;
> -                       default:
> +                       rx_byte = rx_buf[i];
> +                       if (rx_byte == EC_SPI_PAST_END  ||
> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
> +                           rx_byte == EC_SPI_NOT_READY) {
> +                               ret = -EREMOTEIO;
>                                 break;
>                         }
> -                       if (ret)
> -                               break;
>                 }
> -               if (!ret)
> -                       ret = cros_ec_spi_receive_packet(ec_dev,
> -                                       ec_msg->insize + sizeof(*response));
> -       } else {
> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>         }
> 
> +       if (!ret)
> +               ret = cros_ec_spi_receive_packet(ec_dev,
> +                               ec_msg->insize + sizeof(*response));
> +       else
> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> +
>         final_ret = terminate_request(ec_dev);
> 
>         spi_bus_unlock(ec_spi->spi->master);
> 

Thanks! Works for me ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-09-26 15:40                                 ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-09-26 15:40 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra


On 26/09/17 00:15, Shawn N wrote:
> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>>> Hi,
>>>
>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>
>>>>                ec_dev->proto_version =
>>>>                         min(EC_HOST_REQUEST_VERSION,
>>>>                                         fls(proto_info->protocol_versions) - 1);
>>>>
>>
>> Checking this closer, the first host command we send after we boot the
>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>> this doesn't seem to happen on the Chromium OS nyan_big release
>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>> master is doing something bad.
>>
>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>> "the host command was received by the EC and is currently being
>> handled, poll status until completion". So the caller polls status
>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>> (which is interpreted to mean "the host command I sent previously has
>> now successfully completed"), and returns success. The problem here is
>> that the initial host command was never received at all, and no reply
>> was ever received, so our reply data is all zero.
>>
>> Two things need to be fixed here:
>>
>> 1) Find out why the first host command after boot is failing. Probe
>> SPI pins and see what's going on.

Yes, I will see if I can look into this.

>> 2) Fix error handling so we properly return an error (or properly
>> retry the entire command) when a protocol error occurs (I made some
>> attempt in https://chromium-review.googlesource.com/385080/, probably
>> I should revisit that).
> 
> The below patch will fix error handling and will make things mostly
> work on nyan_big, because we'll fall back to V2 protocol after the
> initial failure. But we should still investigate why we're getting
> errors on the first host command. We aren't seeing these errors when
> we send commands from firmware, so I suspect something is wrong in
> kernel SPI HW initialization that causes the first command to fail.
> 
> From: Shawn Nematbakhsh <shawnn@chromium.org>
> Date: Mon, 25 Sep 2017 14:32:38 -0700
> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
> 
> For host commands that take a long time to process, cros ec can return
> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
> 
> None of the above applies when data link errors are encountered. When
> errors such as EC_SPI_PAST_END are encountered during command
> transmission, it usually means the command was not received by the EC.
> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
> almost always the wrong decision, and can result in host commands
> silently being lost.
> 
> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> ---
>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> index c9714072e224..d33e3847e11e 100644
> --- a/drivers/mfd/cros_ec_spi.c
> +++ b/drivers/mfd/cros_ec_spi.c
> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
> cros_ec_device *ec_dev,
>         u8 *ptr;
>         u8 *rx_buf;
>         u8 sum;
> +       u8 rx_byte;
>         int ret = 0, final_ret;
> 
>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
> cros_ec_device *ec_dev,
>         if (!ret) {
>                 /* Verify that EC can process command */
>                 for (i = 0; i < len; i++) {
> -                       switch (rx_buf[i]) {
> -                       case EC_SPI_PAST_END:
> -                       case EC_SPI_RX_BAD_DATA:
> -                       case EC_SPI_NOT_READY:
> -                               ret = -EAGAIN;
> -                               ec_msg->result = EC_RES_IN_PROGRESS;
> -                       default:
> +                       rx_byte = rx_buf[i];
> +                       if (rx_byte == EC_SPI_PAST_END  ||
> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
> +                           rx_byte == EC_SPI_NOT_READY) {
> +                               ret = -EREMOTEIO;
>                                 break;
>                         }
> -                       if (ret)
> -                               break;
>                 }
> -               if (!ret)
> -                       ret = cros_ec_spi_receive_packet(ec_dev,
> -                                       ec_msg->insize + sizeof(*response));
> -       } else {
> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>         }
> 
> +       if (!ret)
> +               ret = cros_ec_spi_receive_packet(ec_dev,
> +                               ec_msg->insize + sizeof(*response));
> +       else
> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> +
>         final_ret = terminate_request(ec_dev);
> 
>         spi_bus_unlock(ec_spi->spi->master);
> 

Thanks! Works for me ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-25 23:15                             ` Shawn N
@ 2017-10-10 13:35                                 ` Jon Hunter
  2017-10-10 13:35                                 ` Jon Hunter
  1 sibling, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-10-10 13:35 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra


On 26/09/17 00:15, Shawn N wrote:
> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>>> Hi,
>>>
>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>
>>>>                ec_dev->proto_version =
>>>>                         min(EC_HOST_REQUEST_VERSION,
>>>>                                         fls(proto_info->protocol_versions) - 1);
>>>>
>>
>> Checking this closer, the first host command we send after we boot the
>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>> this doesn't seem to happen on the Chromium OS nyan_big release
>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>> master is doing something bad.
>>
>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>> "the host command was received by the EC and is currently being
>> handled, poll status until completion". So the caller polls status
>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>> (which is interpreted to mean "the host command I sent previously has
>> now successfully completed"), and returns success. The problem here is
>> that the initial host command was never received at all, and no reply
>> was ever received, so our reply data is all zero.
>>
>> Two things need to be fixed here:
>>
>> 1) Find out why the first host command after boot is failing. Probe
>> SPI pins and see what's going on.
>> 2) Fix error handling so we properly return an error (or properly
>> retry the entire command) when a protocol error occurs (I made some
>> attempt in https://chromium-review.googlesource.com/385080/, probably
>> I should revisit that).
> 
> The below patch will fix error handling and will make things mostly
> work on nyan_big, because we'll fall back to V2 protocol after the
> initial failure. But we should still investigate why we're getting
> errors on the first host command. We aren't seeing these errors when
> we send commands from firmware, so I suspect something is wrong in
> kernel SPI HW initialization that causes the first command to fail.
I have been looking into this a bit more and it appears to be timing 
related. I found that enabling some debug in the Tegra SPI driver the
problem would go away and seems that adding a delay before sending the
SPI message would also workaround the problem. 

Looking back at the Tegra Linux test history [0], it appears that after
v4.10-rc5 [1] I start to see the following message which is the first 
indication of some SPI issues ...

 cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4)

I attempted to bisect this, but I was not successful because each time
I would ended up somewhere different. I found that even with v4.9 I may
see the issue 1 in 20 times and so I realised that I am not even sure 
when the problem really started or if has always been there. It seems
that for older kernels it is harder to reproduce. I am wondering now if
some timing has changed somewhere causing us to see the problem more
frequently with newer kernels?

I see the the cros-ec binding defines the following ...

"google,cros-ec-spi-pre-delay: Some implementations of the EC need a  
 little time to wake up from sleep before they can receive SPI transfers 
 at a high clock rate. This property specifies the delay, in usecs, 
 between the assertion of the CS to the start of the first clock pulse."

I found that adding the following also worked around the problem ...

diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
index 5cf987b5401e..0baa6bfc0f36 100644
--- a/arch/arm/boot/dts/tegra124-nyan.dtsi
+++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
@@ -317,6 +317,7 @@
                        interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
                        reg = <0>;
 
+                       google,cros-ec-spi-pre-delay = <10>;
                        google,cros-ec-spi-msg-delay = <2000>;
 
                        i2c-tunnel {

I have tried 50 boots with the above and I have seen no SPI failures
on boot.

I did look to see if it is possible to probe the SPI signals with a
scope but from the schematics I am not sure if they are accessible or
buried in the PCB.

Is it possible that Tegra is sending the SPI message too soon for the
EC?

Cheers
Jon

[0] https://nvtb.github.io/linux/
[1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-10-10 13:35                                 ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-10-10 13:35 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra


On 26/09/17 00:15, Shawn N wrote:
> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>>> Hi,
>>>
>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>
>>>>                ec_dev->proto_version =
>>>>                         min(EC_HOST_REQUEST_VERSION,
>>>>                                         fls(proto_info->protocol_versions) - 1);
>>>>
>>
>> Checking this closer, the first host command we send after we boot the
>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>> this doesn't seem to happen on the Chromium OS nyan_big release
>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>> master is doing something bad.
>>
>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>> "the host command was received by the EC and is currently being
>> handled, poll status until completion". So the caller polls status
>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>> (which is interpreted to mean "the host command I sent previously has
>> now successfully completed"), and returns success. The problem here is
>> that the initial host command was never received at all, and no reply
>> was ever received, so our reply data is all zero.
>>
>> Two things need to be fixed here:
>>
>> 1) Find out why the first host command after boot is failing. Probe
>> SPI pins and see what's going on.
>> 2) Fix error handling so we properly return an error (or properly
>> retry the entire command) when a protocol error occurs (I made some
>> attempt in https://chromium-review.googlesource.com/385080/, probably
>> I should revisit that).
> 
> The below patch will fix error handling and will make things mostly
> work on nyan_big, because we'll fall back to V2 protocol after the
> initial failure. But we should still investigate why we're getting
> errors on the first host command. We aren't seeing these errors when
> we send commands from firmware, so I suspect something is wrong in
> kernel SPI HW initialization that causes the first command to fail.
I have been looking into this a bit more and it appears to be timing 
related. I found that enabling some debug in the Tegra SPI driver the
problem would go away and seems that adding a delay before sending the
SPI message would also workaround the problem. 

Looking back at the Tegra Linux test history [0], it appears that after
v4.10-rc5 [1] I start to see the following message which is the first 
indication of some SPI issues ...

 cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4)

I attempted to bisect this, but I was not successful because each time
I would ended up somewhere different. I found that even with v4.9 I may
see the issue 1 in 20 times and so I realised that I am not even sure 
when the problem really started or if has always been there. It seems
that for older kernels it is harder to reproduce. I am wondering now if
some timing has changed somewhere causing us to see the problem more
frequently with newer kernels?

I see the the cros-ec binding defines the following ...

"google,cros-ec-spi-pre-delay: Some implementations of the EC need a  
 little time to wake up from sleep before they can receive SPI transfers 
 at a high clock rate. This property specifies the delay, in usecs, 
 between the assertion of the CS to the start of the first clock pulse."

I found that adding the following also worked around the problem ...

diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
index 5cf987b5401e..0baa6bfc0f36 100644
--- a/arch/arm/boot/dts/tegra124-nyan.dtsi
+++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
@@ -317,6 +317,7 @@
                        interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
                        reg = <0>;
 
+                       google,cros-ec-spi-pre-delay = <10>;
                        google,cros-ec-spi-msg-delay = <2000>;
 
                        i2c-tunnel {

I have tried 50 boots with the above and I have seen no SPI failures
on boot.

I did look to see if it is possible to probe the SPI signals with a
scope but from the schematics I am not sure if they are accessible or
buried in the PCB.

Is it possible that Tegra is sending the SPI message too soon for the
EC?

Cheers
Jon

[0] https://nvtb.github.io/linux/
[1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-10-10 13:35                                 ` Jon Hunter
  (?)
@ 2017-10-10 15:33                                 ` Shawn N
       [not found]                                   ` <CALaWCOP=0O7AMobu4YX0z=fJLopcQwv1Vm1_Bbp3XTaty1y6fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 36+ messages in thread
From: Shawn N @ 2017-10-10 15:33 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra, amstan

On Tue, Oct 10, 2017 at 6:35 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> On 26/09/17 00:15, Shawn N wrote:
>> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@google.com> wrote:
>>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@chromium.org> wrote:
>>>> Hi,
>>>>
>>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>>
>>>>>                ec_dev->proto_version =
>>>>>                         min(EC_HOST_REQUEST_VERSION,
>>>>>                                         fls(proto_info->protocol_versions) - 1);
>>>>>
>>>
>>> Checking this closer, the first host command we send after we boot the
>>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>>> this doesn't seem to happen on the Chromium OS nyan_big release
>>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>>> master is doing something bad.
>>>
>>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>>> "the host command was received by the EC and is currently being
>>> handled, poll status until completion". So the caller polls status
>>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>>> (which is interpreted to mean "the host command I sent previously has
>>> now successfully completed"), and returns success. The problem here is
>>> that the initial host command was never received at all, and no reply
>>> was ever received, so our reply data is all zero.
>>>
>>> Two things need to be fixed here:
>>>
>>> 1) Find out why the first host command after boot is failing. Probe
>>> SPI pins and see what's going on.
>>> 2) Fix error handling so we properly return an error (or properly
>>> retry the entire command) when a protocol error occurs (I made some
>>> attempt in https://chromium-review.googlesource.com/385080/, probably
>>> I should revisit that).
>>
>> The below patch will fix error handling and will make things mostly
>> work on nyan_big, because we'll fall back to V2 protocol after the
>> initial failure. But we should still investigate why we're getting
>> errors on the first host command. We aren't seeing these errors when
>> we send commands from firmware, so I suspect something is wrong in
>> kernel SPI HW initialization that causes the first command to fail.
> I have been looking into this a bit more and it appears to be timing
> related. I found that enabling some debug in the Tegra SPI driver the
> problem would go away and seems that adding a delay before sending the
> SPI message would also workaround the problem.
>
> Looking back at the Tegra Linux test history [0], it appears that after
> v4.10-rc5 [1] I start to see the following message which is the first
> indication of some SPI issues ...
>
>  cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4)
>
> I attempted to bisect this, but I was not successful because each time
> I would ended up somewhere different. I found that even with v4.9 I may
> see the issue 1 in 20 times and so I realised that I am not even sure
> when the problem really started or if has always been there. It seems
> that for older kernels it is harder to reproduce. I am wondering now if
> some timing has changed somewhere causing us to see the problem more
> frequently with newer kernels?
>
> I see the the cros-ec binding defines the following ...
>
> "google,cros-ec-spi-pre-delay: Some implementations of the EC need a
>  little time to wake up from sleep before they can receive SPI transfers
>  at a high clock rate. This property specifies the delay, in usecs,
>  between the assertion of the CS to the start of the first clock pulse."
>
> I found that adding the following also worked around the problem ...
>
> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
> index 5cf987b5401e..0baa6bfc0f36 100644
> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi
> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
> @@ -317,6 +317,7 @@
>                         interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
>                         reg = <0>;
>
> +                       google,cros-ec-spi-pre-delay = <10>;
>                         google,cros-ec-spi-msg-delay = <2000>;
>
>                         i2c-tunnel {
>
> I have tried 50 boots with the above and I have seen no SPI failures
> on boot.
>
> I did look to see if it is possible to probe the SPI signals with a
> scope but from the schematics I am not sure if they are accessible or
> buried in the PCB.
>
> Is it possible that Tegra is sending the SPI message too soon for the
> EC?

I have not worked much with the cros_ec stm32 SPI host interface, but
I think it's possible. Also note that we define
"google,cros-ec-spi-pre-delay = <30>;" for the veyron family of
devices, which also use an stm32-family embedded controller. Some of
the folks on cc worked more on veyron and may have more insight.

I'm still not clear on why we see an error only on the first
transaction after boot. In this case, the embedded controller
previously handled host commands from firmware just fine, and the
handoff between firmware and the kernel shouldn't be significant, from
the EC point of view. If host command timing is consistent from the
master, I would expect to see some constant error rate, eg. some
chance any host command will fail, rather than the first host command
always failing.

>
> Cheers
> Jon
>
> [0] https://nvtb.github.io/linux/
> [1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-10-10 15:33                                 ` Shawn N
@ 2017-10-10 16:52                                       ` Doug Anderson
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Anderson @ 2017-10-10 16:52 UTC (permalink / raw)
  To: Shawn N
  Cc: Jon Hunter, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Norris, Brian Norris,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Alexandru M Stan

Hi,

On Tue, Oct 10, 2017 at 8:33 AM, Shawn N <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> index 5cf987b5401e..0baa6bfc0f36 100644
>> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi
>> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> @@ -317,6 +317,7 @@
>>                         interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
>>                         reg = <0>;
>>
>> +                       google,cros-ec-spi-pre-delay = <10>;
>>                         google,cros-ec-spi-msg-delay = <2000>;
>>
>>                         i2c-tunnel {
>>
>> I have tried 50 boots with the above and I have seen no SPI failures
>> on boot.
>>
>> I did look to see if it is possible to probe the SPI signals with a
>> scope but from the schematics I am not sure if they are accessible or
>> buried in the PCB.
>>
>> Is it possible that Tegra is sending the SPI message too soon for the
>> EC?
>
> I have not worked much with the cros_ec stm32 SPI host interface, but
> I think it's possible. Also note that we define
> "google,cros-ec-spi-pre-delay = <30>;" for the veyron family of
> devices, which also use an stm32-family embedded controller. Some of
> the folks on cc worked more on veyron and may have more insight.

Alex is the expert here, but basically we enabled hibernation on the
EC for veyron so it needed a bunch of extra time to wakeup.  Before
hibernation there was no known case where the delay was needed as far
as I understand.  I don't think we had hibernation on tegra.

It's _possible_ that we sometimes need a delay even on tegra, but I'm
not sure.  It's also possible that the extra delay just shifts things
around a little bit and changes the timing.  Possibly at boot the
system is always extra busy and adding the 10 us delay is enough to
"reliably" get your transfer to happen at a non-busy time.  It would
be interesting to see of a 10us delay _before_ asserting chip select
fixed things just as well.  If a delay before asserting chip select
fixes things as well as a delay after asserting chip select then
you're probably just pushing things around and not doing a real fix.

One issue I _know_ we have is that we sometimes drop EC packets on the
floor because we get interrupted midway through the transfer and the
EC gives up on us before we get context switched back in.  I think
this is more common at suspend/resume and at boot when the system is
typically very busy.  That case can actually be made _worse_ by the
"cros-ec-spi-pre-delay", as described in
<https://chromium-review.googlesource.com/439713>.  Even when we
removed delay on rk3399-gru-* we could still sometimes see transfer
errors and the agreed upon "fix" for this is to do the transfer in a
higher priority context.  Some details are in the (unfortunately
private) <https://issuetracker.google.com/35579351>, but to copy the
important bits from the bug::

From Mark Brown (upstream SPI maintainer) as long as there is no
contention on this SPI bus (there isn't for us) the transfer will
happen in the context of the calling process.  That means "bumping"
the priority would work.  ...but as per Brian bumping the priority is
ugly and we need to make sure we don't conflict with userspace.   One
solution (inspired by Guenter) is to just _always_ use a high priority
worker to do the transfer and just block waiting for the worker to
finish.  This should be easy, reliable, and clean even if it is
slightly awkward.  This also shouldn't be too hard to do, we think.


NOTE: a really truly long term fix for this is to change the protocol
to have reliably retries.  That's planned but (as far as I can tell)
stalled.  Details in the bug
<https://bugs.chromium.org/p/chromium/issues/detail?id=678675>


> I'm still not clear on why we see an error only on the first
> transaction after boot. In this case, the embedded controller
> previously handled host commands from firmware just fine, and the
> handoff between firmware and the kernel shouldn't be significant, from
> the EC point of view. If host command timing is consistent from the
> master, I would expect to see some constant error rate, eg. some
> chance any host command will fail, rather than the first host command
> always failing.

The AP itself is often quite busy at boot and so the timings for
everything change.  That could easily explain the problems.

-Doug

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-10-10 16:52                                       ` Doug Anderson
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Anderson @ 2017-10-10 16:52 UTC (permalink / raw)
  To: Shawn N
  Cc: Jon Hunter, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra, Alexandru M Stan

Hi,

On Tue, Oct 10, 2017 at 8:33 AM, Shawn N <shawnn@chromium.org> wrote:
>> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> index 5cf987b5401e..0baa6bfc0f36 100644
>> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi
>> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> @@ -317,6 +317,7 @@
>>                         interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
>>                         reg = <0>;
>>
>> +                       google,cros-ec-spi-pre-delay = <10>;
>>                         google,cros-ec-spi-msg-delay = <2000>;
>>
>>                         i2c-tunnel {
>>
>> I have tried 50 boots with the above and I have seen no SPI failures
>> on boot.
>>
>> I did look to see if it is possible to probe the SPI signals with a
>> scope but from the schematics I am not sure if they are accessible or
>> buried in the PCB.
>>
>> Is it possible that Tegra is sending the SPI message too soon for the
>> EC?
>
> I have not worked much with the cros_ec stm32 SPI host interface, but
> I think it's possible. Also note that we define
> "google,cros-ec-spi-pre-delay = <30>;" for the veyron family of
> devices, which also use an stm32-family embedded controller. Some of
> the folks on cc worked more on veyron and may have more insight.

Alex is the expert here, but basically we enabled hibernation on the
EC for veyron so it needed a bunch of extra time to wakeup.  Before
hibernation there was no known case where the delay was needed as far
as I understand.  I don't think we had hibernation on tegra.

It's _possible_ that we sometimes need a delay even on tegra, but I'm
not sure.  It's also possible that the extra delay just shifts things
around a little bit and changes the timing.  Possibly at boot the
system is always extra busy and adding the 10 us delay is enough to
"reliably" get your transfer to happen at a non-busy time.  It would
be interesting to see of a 10us delay _before_ asserting chip select
fixed things just as well.  If a delay before asserting chip select
fixes things as well as a delay after asserting chip select then
you're probably just pushing things around and not doing a real fix.

One issue I _know_ we have is that we sometimes drop EC packets on the
floor because we get interrupted midway through the transfer and the
EC gives up on us before we get context switched back in.  I think
this is more common at suspend/resume and at boot when the system is
typically very busy.  That case can actually be made _worse_ by the
"cros-ec-spi-pre-delay", as described in
<https://chromium-review.googlesource.com/439713>.  Even when we
removed delay on rk3399-gru-* we could still sometimes see transfer
errors and the agreed upon "fix" for this is to do the transfer in a
higher priority context.  Some details are in the (unfortunately
private) <https://issuetracker.google.com/35579351>, but to copy the
important bits from the bug::

>From Mark Brown (upstream SPI maintainer) as long as there is no
contention on this SPI bus (there isn't for us) the transfer will
happen in the context of the calling process.  That means "bumping"
the priority would work.  ...but as per Brian bumping the priority is
ugly and we need to make sure we don't conflict with userspace.   One
solution (inspired by Guenter) is to just _always_ use a high priority
worker to do the transfer and just block waiting for the worker to
finish.  This should be easy, reliable, and clean even if it is
slightly awkward.  This also shouldn't be too hard to do, we think.


NOTE: a really truly long term fix for this is to change the protocol
to have reliably retries.  That's planned but (as far as I can tell)
stalled.  Details in the bug
<https://bugs.chromium.org/p/chromium/issues/detail?id=678675>


> I'm still not clear on why we see an error only on the first
> transaction after boot. In this case, the embedded controller
> previously handled host commands from firmware just fine, and the
> handoff between firmware and the kernel shouldn't be significant, from
> the EC point of view. If host command timing is consistent from the
> master, I would expect to see some constant error rate, eg. some
> chance any host command will fail, rather than the first host command
> always failing.

The AP itself is often quite busy at boot and so the timings for
everything change.  That could easily explain the problems.

-Doug

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-10-10 16:52                                       ` Doug Anderson
@ 2017-11-07 11:28                                           ` Jon Hunter
  -1 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-11-07 11:28 UTC (permalink / raw)
  To: Doug Anderson, Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Norris, Brian Norris,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Alexandru M Stan


On 10/10/17 17:52, Doug Anderson wrote:

...

>> I'm still not clear on why we see an error only on the first
>> transaction after boot. In this case, the embedded controller
>> previously handled host commands from firmware just fine, and the
>> handoff between firmware and the kernel shouldn't be significant, from
>> the EC point of view. If host command timing is consistent from the
>> master, I would expect to see some constant error rate, eg. some
>> chance any host command will fail, rather than the first host command
>> always failing.
> 
> The AP itself is often quite busy at boot and so the timings for
> everything change.  That could easily explain the problems.

Sorry for the delay, but I have finally had some time to look at this a
bit closer. I have been able to track down where the additional delay is
really needed and seems to explain what is going on.

For starters, the SPI chip-select is under h/w control and so the
software delay has no impact on the timing between the chip-select
going active and the transaction starting as I had first thought.

I found that a delay is needed between completing the probe the Tegra
SPI device and the first SPI transaction issued to the EC. In the Tegra
SPI probe the SPI controller is programmed for master mode, but at the
same time it clears the chip-select inactive polarity bit meaning that
initially the SPI chip-select default to active-high (rather that low
which seems odd). I believe that this then drives the chip-select low
(active for the EC) and until it is then configured when spi_setup() is
called which configures it as active-low for the EC.
To get the first transaction to work for the EC there needs to be a
delay after we program the chip-select polarity when spi_setup() is
called. For example ...

diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
index 584367f3a0ed..c1075c3c60c8 100644
--- a/drivers/mfd/cros_ec_spi.c
+++ b/drivers/mfd/cros_ec_spi.c
@@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
        if (err < 0)
                return err;

+       udelay(100);
+
        ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
        if (ec_spi == NULL)
                return -ENOMEM;


You may say why not put a delay in the tegra_spi_setup() itself, but we
have some other SPI devices such as flash devices which are also use an
active-low chip-select which don't have any issues with this to date.
Furthermore, this delay is also probably device specific.

From an EC perspective, if the chip-select is asserted is there a
turnaround time before it can be asserted again? Or in this case maybe
the issue is that the chip-select is asserted but no transaction occurs
before it is de-asserted and so is causing a problem. Please note that a
delay of around ~50us above still fails but 100us seems to be ok.

Finally, this also works-around the problem by avoiding the chip-select
from being pulsed low by defaulting all chip-selects to active-low but
maybe this just masks the problem.

diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
index a76acedd7e2f..7c18204e61d9 100644
--- a/drivers/spi/spi-tegra114.c
+++ b/drivers/spi/spi-tegra114.c
@@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
                goto exit_pm_disable;
        }
-       tspi->def_command1_reg  = SPI_M_S;
+       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;
        tegra_spi_writel(tspi, tspi->def_command1_reg, SPI_COMMAND1);
        pm_runtime_put(&pdev->dev);

Cheers
Jon

--
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-11-07 11:28                                           ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-11-07 11:28 UTC (permalink / raw)
  To: Doug Anderson, Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Brian Norris, Brian Norris, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra, Alexandru M Stan


On 10/10/17 17:52, Doug Anderson wrote:

...

>> I'm still not clear on why we see an error only on the first
>> transaction after boot. In this case, the embedded controller
>> previously handled host commands from firmware just fine, and the
>> handoff between firmware and the kernel shouldn't be significant, from
>> the EC point of view. If host command timing is consistent from the
>> master, I would expect to see some constant error rate, eg. some
>> chance any host command will fail, rather than the first host command
>> always failing.
> 
> The AP itself is often quite busy at boot and so the timings for
> everything change.  That could easily explain the problems.

Sorry for the delay, but I have finally had some time to look at this a
bit closer. I have been able to track down where the additional delay is
really needed and seems to explain what is going on.

For starters, the SPI chip-select is under h/w control and so the
software delay has no impact on the timing between the chip-select
going active and the transaction starting as I had first thought.

I found that a delay is needed between completing the probe the Tegra
SPI device and the first SPI transaction issued to the EC. In the Tegra
SPI probe the SPI controller is programmed for master mode, but at the
same time it clears the chip-select inactive polarity bit meaning that
initially the SPI chip-select default to active-high (rather that low
which seems odd). I believe that this then drives the chip-select low
(active for the EC) and until it is then configured when spi_setup() is
called which configures it as active-low for the EC.
To get the first transaction to work for the EC there needs to be a
delay after we program the chip-select polarity when spi_setup() is
called. For example ...

diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
index 584367f3a0ed..c1075c3c60c8 100644
--- a/drivers/mfd/cros_ec_spi.c
+++ b/drivers/mfd/cros_ec_spi.c
@@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
        if (err < 0)
                return err;

+       udelay(100);
+
        ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
        if (ec_spi == NULL)
                return -ENOMEM;


You may say why not put a delay in the tegra_spi_setup() itself, but we
have some other SPI devices such as flash devices which are also use an
active-low chip-select which don't have any issues with this to date.
Furthermore, this delay is also probably device specific.

>From an EC perspective, if the chip-select is asserted is there a
turnaround time before it can be asserted again? Or in this case maybe
the issue is that the chip-select is asserted but no transaction occurs
before it is de-asserted and so is causing a problem. Please note that a
delay of around ~50us above still fails but 100us seems to be ok.

Finally, this also works-around the problem by avoiding the chip-select
from being pulsed low by defaulting all chip-selects to active-low but
maybe this just masks the problem.

diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
index a76acedd7e2f..7c18204e61d9 100644
--- a/drivers/spi/spi-tegra114.c
+++ b/drivers/spi/spi-tegra114.c
@@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
                goto exit_pm_disable;
        }
-       tspi->def_command1_reg  = SPI_M_S;
+       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;
        tegra_spi_writel(tspi, tspi->def_command1_reg, SPI_COMMAND1);
        pm_runtime_put(&pdev->dev);

Cheers
Jon

--
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-11-07 11:28                                           ` Jon Hunter
@ 2017-11-07 17:22                                               ` Doug Anderson
  -1 siblings, 0 replies; 36+ messages in thread
From: Doug Anderson @ 2017-11-07 17:22 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Norris, Brian Norris,
	Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Alexandru M Stan

Hi,

On Tue, Nov 7, 2017 at 3:28 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>
> On 10/10/17 17:52, Doug Anderson wrote:
>
> ...
>
>>> I'm still not clear on why we see an error only on the first
>>> transaction after boot. In this case, the embedded controller
>>> previously handled host commands from firmware just fine, and the
>>> handoff between firmware and the kernel shouldn't be significant, from
>>> the EC point of view. If host command timing is consistent from the
>>> master, I would expect to see some constant error rate, eg. some
>>> chance any host command will fail, rather than the first host command
>>> always failing.
>>
>> The AP itself is often quite busy at boot and so the timings for
>> everything change.  That could easily explain the problems.
>
> Sorry for the delay, but I have finally had some time to look at this a
> bit closer. I have been able to track down where the additional delay is
> really needed and seems to explain what is going on.
>
> For starters, the SPI chip-select is under h/w control and so the
> software delay has no impact on the timing between the chip-select
> going active and the transaction starting as I had first thought.
>
> I found that a delay is needed between completing the probe the Tegra
> SPI device and the first SPI transaction issued to the EC. In the Tegra
> SPI probe the SPI controller is programmed for master mode, but at the
> same time it clears the chip-select inactive polarity bit meaning that
> initially the SPI chip-select default to active-high (rather that low
> which seems odd). I believe that this then drives the chip-select low
> (active for the EC) and until it is then configured when spi_setup() is
> called which configures it as active-low for the EC.
> To get the first transaction to work for the EC there needs to be a
> delay after we program the chip-select polarity when spi_setup() is
> called. For example ...
>
> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> index 584367f3a0ed..c1075c3c60c8 100644
> --- a/drivers/mfd/cros_ec_spi.c
> +++ b/drivers/mfd/cros_ec_spi.c
> @@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
>         if (err < 0)
>                 return err;
>
> +       udelay(100);
> +

This isn't totally crazy, but actually you could probably do this:

ec_spi->last_transfer_ns = ktime_get_ns();

...that will leverage already existing code and constants and also
will avoid doing a delay if it wasn't needed.  You could also then get
rid of some "if (ec_spi->last_transfer_ns)" tests in the code.  I'd
support landing that.


>         ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
>         if (ec_spi == NULL)
>                 return -ENOMEM;
>
>
> You may say why not put a delay in the tegra_spi_setup() itself, but we
> have some other SPI devices such as flash devices which are also use an
> active-low chip-select which don't have any issues with this to date.
> Furthermore, this delay is also probably device specific.
>
> From an EC perspective, if the chip-select is asserted is there a
> turnaround time before it can be asserted again? Or in this case maybe
> the issue is that the chip-select is asserted but no transaction occurs
> before it is de-asserted and so is causing a problem. Please note that a
> delay of around ~50us above still fails but 100us seems to be ok.

Really nice detective work!


> Finally, this also works-around the problem by avoiding the chip-select
> from being pulsed low by defaulting all chip-selects to active-low but
> maybe this just masks the problem.
>
> diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
> index a76acedd7e2f..7c18204e61d9 100644
> --- a/drivers/spi/spi-tegra114.c
> +++ b/drivers/spi/spi-tegra114.c
> @@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
>                 goto exit_pm_disable;
>         }
> -       tspi->def_command1_reg  = SPI_M_S;
> +       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;

Note that even though I'd support adding some sort of delay to the
cros_ec driver, I'd also say that it _might_ make sense to mess with
the SPI driver too, just to avoid glitching the lines at bootup.  As
you said, you shouldn't just willy nilly change the default but it
seems like it could make sense to define an initial (board level)
pinctrl state that's in effect until the first call to setup_bus().

That's a bit like the "init" pinctrl state that exists precisely to
address these types of glitches, but unfortunately you can't use the
automatic-ness of the "init" pinctrl state.  That code assumes that by
the end of probe you know how your pins should be configured.  In your
case you don't know how your pins should be configured until the first
setup_bus() is called...

Presumably you could add your own pinctrl state that mimics "init", or
you could figure out how to improve the way the "init" pinctrl state
works to allow a driver to defer the transition to "default".

Another option might be to try to use the "idle" state, maybe in
conjunction with the "init" pinctrl state.  AKA use "init" to avoid
glitches during probe and then require that by the end of probe that
the device is Runtime Suspended, which would leave you in "idle"
state.  Then in setup_bus you make sure you keep track of the polarity
in such a way that when you Runtime Resume you can come out of "idle"
state without glitching.  The board would need to define "init" and
"idle" in a way that won't glitch things.


As with my advice always, note that I'm not an expert so feel free to
call into question my advice if it looks wrong.  :-P


>         tegra_spi_writel(tspi, tspi->def_command1_reg, SPI_COMMAND1);
>         pm_runtime_put(&pdev->dev);
>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-11-07 17:22                                               ` Doug Anderson
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Anderson @ 2017-11-07 17:22 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Brian Norris, Brian Norris, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra, Alexandru M Stan

Hi,

On Tue, Nov 7, 2017 at 3:28 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> On 10/10/17 17:52, Doug Anderson wrote:
>
> ...
>
>>> I'm still not clear on why we see an error only on the first
>>> transaction after boot. In this case, the embedded controller
>>> previously handled host commands from firmware just fine, and the
>>> handoff between firmware and the kernel shouldn't be significant, from
>>> the EC point of view. If host command timing is consistent from the
>>> master, I would expect to see some constant error rate, eg. some
>>> chance any host command will fail, rather than the first host command
>>> always failing.
>>
>> The AP itself is often quite busy at boot and so the timings for
>> everything change.  That could easily explain the problems.
>
> Sorry for the delay, but I have finally had some time to look at this a
> bit closer. I have been able to track down where the additional delay is
> really needed and seems to explain what is going on.
>
> For starters, the SPI chip-select is under h/w control and so the
> software delay has no impact on the timing between the chip-select
> going active and the transaction starting as I had first thought.
>
> I found that a delay is needed between completing the probe the Tegra
> SPI device and the first SPI transaction issued to the EC. In the Tegra
> SPI probe the SPI controller is programmed for master mode, but at the
> same time it clears the chip-select inactive polarity bit meaning that
> initially the SPI chip-select default to active-high (rather that low
> which seems odd). I believe that this then drives the chip-select low
> (active for the EC) and until it is then configured when spi_setup() is
> called which configures it as active-low for the EC.
> To get the first transaction to work for the EC there needs to be a
> delay after we program the chip-select polarity when spi_setup() is
> called. For example ...
>
> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> index 584367f3a0ed..c1075c3c60c8 100644
> --- a/drivers/mfd/cros_ec_spi.c
> +++ b/drivers/mfd/cros_ec_spi.c
> @@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
>         if (err < 0)
>                 return err;
>
> +       udelay(100);
> +

This isn't totally crazy, but actually you could probably do this:

ec_spi->last_transfer_ns = ktime_get_ns();

...that will leverage already existing code and constants and also
will avoid doing a delay if it wasn't needed.  You could also then get
rid of some "if (ec_spi->last_transfer_ns)" tests in the code.  I'd
support landing that.


>         ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
>         if (ec_spi == NULL)
>                 return -ENOMEM;
>
>
> You may say why not put a delay in the tegra_spi_setup() itself, but we
> have some other SPI devices such as flash devices which are also use an
> active-low chip-select which don't have any issues with this to date.
> Furthermore, this delay is also probably device specific.
>
> From an EC perspective, if the chip-select is asserted is there a
> turnaround time before it can be asserted again? Or in this case maybe
> the issue is that the chip-select is asserted but no transaction occurs
> before it is de-asserted and so is causing a problem. Please note that a
> delay of around ~50us above still fails but 100us seems to be ok.

Really nice detective work!


> Finally, this also works-around the problem by avoiding the chip-select
> from being pulsed low by defaulting all chip-selects to active-low but
> maybe this just masks the problem.
>
> diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
> index a76acedd7e2f..7c18204e61d9 100644
> --- a/drivers/spi/spi-tegra114.c
> +++ b/drivers/spi/spi-tegra114.c
> @@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
>                 goto exit_pm_disable;
>         }
> -       tspi->def_command1_reg  = SPI_M_S;
> +       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;

Note that even though I'd support adding some sort of delay to the
cros_ec driver, I'd also say that it _might_ make sense to mess with
the SPI driver too, just to avoid glitching the lines at bootup.  As
you said, you shouldn't just willy nilly change the default but it
seems like it could make sense to define an initial (board level)
pinctrl state that's in effect until the first call to setup_bus().

That's a bit like the "init" pinctrl state that exists precisely to
address these types of glitches, but unfortunately you can't use the
automatic-ness of the "init" pinctrl state.  That code assumes that by
the end of probe you know how your pins should be configured.  In your
case you don't know how your pins should be configured until the first
setup_bus() is called...

Presumably you could add your own pinctrl state that mimics "init", or
you could figure out how to improve the way the "init" pinctrl state
works to allow a driver to defer the transition to "default".

Another option might be to try to use the "idle" state, maybe in
conjunction with the "init" pinctrl state.  AKA use "init" to avoid
glitches during probe and then require that by the end of probe that
the device is Runtime Suspended, which would leave you in "idle"
state.  Then in setup_bus you make sure you keep track of the polarity
in such a way that when you Runtime Resume you can come out of "idle"
state without glitching.  The board would need to define "init" and
"idle" in a way that won't glitch things.


As with my advice always, note that I'm not an expert so feel free to
call into question my advice if it looks wrong.  :-P


>         tegra_spi_writel(tspi, tspi->def_command1_reg, SPI_COMMAND1);
>         pm_runtime_put(&pdev->dev);
>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-11-07 17:22                                               ` Doug Anderson
  (?)
@ 2017-11-08 10:20                                               ` Jon Hunter
  2017-11-08 16:45                                                 ` Doug Anderson
  -1 siblings, 1 reply; 36+ messages in thread
From: Jon Hunter @ 2017-11-08 10:20 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Brian Norris, Brian Norris, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra, Alexandru M Stan

Hi Doug,

On 07/11/17 17:22, Doug Anderson wrote:
> On Tue, Nov 7, 2017 at 3:28 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>> On 10/10/17 17:52, Doug Anderson wrote:
>>
>> ...
>>
>>>> I'm still not clear on why we see an error only on the first
>>>> transaction after boot. In this case, the embedded controller
>>>> previously handled host commands from firmware just fine, and the
>>>> handoff between firmware and the kernel shouldn't be significant, from
>>>> the EC point of view. If host command timing is consistent from the
>>>> master, I would expect to see some constant error rate, eg. some
>>>> chance any host command will fail, rather than the first host command
>>>> always failing.
>>>
>>> The AP itself is often quite busy at boot and so the timings for
>>> everything change.  That could easily explain the problems.
>>
>> Sorry for the delay, but I have finally had some time to look at this a
>> bit closer. I have been able to track down where the additional delay is
>> really needed and seems to explain what is going on.
>>
>> For starters, the SPI chip-select is under h/w control and so the
>> software delay has no impact on the timing between the chip-select
>> going active and the transaction starting as I had first thought.
>>
>> I found that a delay is needed between completing the probe the Tegra
>> SPI device and the first SPI transaction issued to the EC. In the Tegra
>> SPI probe the SPI controller is programmed for master mode, but at the
>> same time it clears the chip-select inactive polarity bit meaning that
>> initially the SPI chip-select default to active-high (rather that low
>> which seems odd). I believe that this then drives the chip-select low
>> (active for the EC) and until it is then configured when spi_setup() is
>> called which configures it as active-low for the EC.
>> To get the first transaction to work for the EC there needs to be a
>> delay after we program the chip-select polarity when spi_setup() is
>> called. For example ...
>>
>> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
>> index 584367f3a0ed..c1075c3c60c8 100644
>> --- a/drivers/mfd/cros_ec_spi.c
>> +++ b/drivers/mfd/cros_ec_spi.c
>> @@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
>>         if (err < 0)
>>                 return err;
>>
>> +       udelay(100);
>> +
> 
> This isn't totally crazy, but actually you could probably do this:
> 
> ec_spi->last_transfer_ns = ktime_get_ns();
> 
> ...that will leverage already existing code and constants and also
> will avoid doing a delay if it wasn't needed.  You could also then get
> rid of some "if (ec_spi->last_transfer_ns)" tests in the code.  I'd
> support landing that.
> 
> 
>>         ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
>>         if (ec_spi == NULL)
>>                 return -ENOMEM;
>>


OK, yes and that does work well too. I will send a patch with the
following for review.

diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
index 584367f3a0ed..477f8e81dc34 100644
--- a/drivers/mfd/cros_ec_spi.c
+++ b/drivers/mfd/cros_ec_spi.c
@@ -671,6 +671,7 @@ static int cros_ec_spi_probe(struct spi_device *spi)
                           sizeof(struct ec_response_get_protocol_info);
        ec_dev->dout_size = sizeof(struct ec_host_request);

+       ec_spi->last_transfer_ns = ktime_get_ns();

        err = cros_ec_register(ec_dev);
        if (err) {

>> You may say why not put a delay in the tegra_spi_setup() itself, but we
>> have some other SPI devices such as flash devices which are also use an
>> active-low chip-select which don't have any issues with this to date.
>> Furthermore, this delay is also probably device specific.
>>
>> From an EC perspective, if the chip-select is asserted is there a
>> turnaround time before it can be asserted again? Or in this case maybe
>> the issue is that the chip-select is asserted but no transaction occurs
>> before it is de-asserted and so is causing a problem. Please note that a
>> delay of around ~50us above still fails but 100us seems to be ok.
> 
> Really nice detective work!
> 
> 
>> Finally, this also works-around the problem by avoiding the chip-select
>> from being pulsed low by defaulting all chip-selects to active-low but
>> maybe this just masks the problem.
>>
>> diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
>> index a76acedd7e2f..7c18204e61d9 100644
>> --- a/drivers/spi/spi-tegra114.c
>> +++ b/drivers/spi/spi-tegra114.c
>> @@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
>>                 goto exit_pm_disable;
>>         }
>> -       tspi->def_command1_reg  = SPI_M_S;
>> +       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;
> 
> Note that even though I'd support adding some sort of delay to the
> cros_ec driver, I'd also say that it _might_ make sense to mess with
> the SPI driver too, just to avoid glitching the lines at bootup.  As
> you said, you shouldn't just willy nilly change the default but it
> seems like it could make sense to define an initial (board level)
> pinctrl state that's in effect until the first call to setup_bus().

I am thinking about making that above change as well, because the reset
value of the chip-select polarity bits default to active-low. For
active-high devices I don't think that you can ever avoid the
chip-select asserting for a period after reset as that is the default
setting, but I don't see why we are clearing these bits in probe.

I will do a bit more testing with this to avoid any regressions, but
both changes seem worthwhile IMO.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-11-08 10:20                                               ` Jon Hunter
@ 2017-11-08 16:45                                                 ` Doug Anderson
  0 siblings, 0 replies; 36+ messages in thread
From: Doug Anderson @ 2017-11-08 16:45 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Shawn N, Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Brian Norris, Brian Norris, Gwendal Grignou, Enric Balletbo,
	Tomeu Vizoso, linux-tegra, Alexandru M Stan

Hi,

On Wed, Nov 8, 2017 at 2:20 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
> Hi Doug,
>
> On 07/11/17 17:22, Doug Anderson wrote:
>> On Tue, Nov 7, 2017 at 3:28 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>>> On 10/10/17 17:52, Doug Anderson wrote:
>>>
>>> ...
>>>
>>>>> I'm still not clear on why we see an error only on the first
>>>>> transaction after boot. In this case, the embedded controller
>>>>> previously handled host commands from firmware just fine, and the
>>>>> handoff between firmware and the kernel shouldn't be significant, from
>>>>> the EC point of view. If host command timing is consistent from the
>>>>> master, I would expect to see some constant error rate, eg. some
>>>>> chance any host command will fail, rather than the first host command
>>>>> always failing.
>>>>
>>>> The AP itself is often quite busy at boot and so the timings for
>>>> everything change.  That could easily explain the problems.
>>>
>>> Sorry for the delay, but I have finally had some time to look at this a
>>> bit closer. I have been able to track down where the additional delay is
>>> really needed and seems to explain what is going on.
>>>
>>> For starters, the SPI chip-select is under h/w control and so the
>>> software delay has no impact on the timing between the chip-select
>>> going active and the transaction starting as I had first thought.
>>>
>>> I found that a delay is needed between completing the probe the Tegra
>>> SPI device and the first SPI transaction issued to the EC. In the Tegra
>>> SPI probe the SPI controller is programmed for master mode, but at the
>>> same time it clears the chip-select inactive polarity bit meaning that
>>> initially the SPI chip-select default to active-high (rather that low
>>> which seems odd). I believe that this then drives the chip-select low
>>> (active for the EC) and until it is then configured when spi_setup() is
>>> called which configures it as active-low for the EC.
>>> To get the first transaction to work for the EC there needs to be a
>>> delay after we program the chip-select polarity when spi_setup() is
>>> called. For example ...
>>>
>>> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
>>> index 584367f3a0ed..c1075c3c60c8 100644
>>> --- a/drivers/mfd/cros_ec_spi.c
>>> +++ b/drivers/mfd/cros_ec_spi.c
>>> @@ -648,6 +648,8 @@ static int cros_ec_spi_probe(struct spi_device *spi)
>>>         if (err < 0)
>>>                 return err;
>>>
>>> +       udelay(100);
>>> +
>>
>> This isn't totally crazy, but actually you could probably do this:
>>
>> ec_spi->last_transfer_ns = ktime_get_ns();
>>
>> ...that will leverage already existing code and constants and also
>> will avoid doing a delay if it wasn't needed.  You could also then get
>> rid of some "if (ec_spi->last_transfer_ns)" tests in the code.  I'd
>> support landing that.
>>
>>
>>>         ec_spi = devm_kzalloc(dev, sizeof(*ec_spi), GFP_KERNEL);
>>>         if (ec_spi == NULL)
>>>                 return -ENOMEM;
>>>
>
>
> OK, yes and that does work well too. I will send a patch with the
> following for review.
>
> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> index 584367f3a0ed..477f8e81dc34 100644
> --- a/drivers/mfd/cros_ec_spi.c
> +++ b/drivers/mfd/cros_ec_spi.c
> @@ -671,6 +671,7 @@ static int cros_ec_spi_probe(struct spi_device *spi)
>                            sizeof(struct ec_response_get_protocol_info);
>         ec_dev->dout_size = sizeof(struct ec_host_request);
>
> +       ec_spi->last_transfer_ns = ktime_get_ns();
>
>         err = cros_ec_register(ec_dev);
>         if (err) {

That's sufficient to make it work, but now that "last_transfer_ns" can
never be 0 you should also fix that.  It's hard to paste a patch in
gmail and the change is trivial so I didn't try to find another way to
post it, but roughly:

- * @last_transfer_ns: time that we last finished a transfer, or 0 if there
- *     if no record
+ * @last_transfer_ns: time that we last finished a transfer.

===

+       unsigned long delay;    /* The delay completed so far */

        len = cros_ec_prepare_tx(ec_dev, ec_msg);
        dev_dbg(ec_dev->dev, "prepared, len=%d\n", len);

        /* If it's too soon to do another transaction, wait */
-       if (ec_spi->last_transfer_ns) {
-               unsigned long delay;    /* The delay completed so far */
-
-               delay = ktime_get_ns() - ec_spi->last_transfer_ns;
-               if (delay < EC_SPI_RECOVERY_TIME_NS)
-                       ndelay(EC_SPI_RECOVERY_TIME_NS - delay);
-       }
+       delay = ktime_get_ns() - ec_spi->last_transfer_ns;
+       if (delay < EC_SPI_RECOVERY_TIME_NS)
+               ndelay(EC_SPI_RECOVERY_TIME_NS - delay);

===

+       unsigned long delay;    /* The delay completed so far */

        len = cros_ec_prepare_tx(ec_dev, ec_msg);
        dev_dbg(ec_dev->dev, "prepared, len=%d\n", len);

        /* If it's too soon to do another transaction, wait */
-       if (ec_spi->last_transfer_ns) {
-               unsigned long delay;    /* The delay completed so far */
-
-               delay = ktime_get_ns() - ec_spi->last_transfer_ns;
-               if (delay < EC_SPI_RECOVERY_TIME_NS)
-                       ndelay(EC_SPI_RECOVERY_TIME_NS - delay);
-       }
+       delay = ktime_get_ns() - ec_spi->last_transfer_ns;
+       if (delay < EC_SPI_RECOVERY_TIME_NS)
+               ndelay(EC_SPI_RECOVERY_TIME_NS - delay);


>>> You may say why not put a delay in the tegra_spi_setup() itself, but we
>>> have some other SPI devices such as flash devices which are also use an
>>> active-low chip-select which don't have any issues with this to date.
>>> Furthermore, this delay is also probably device specific.
>>>
>>> From an EC perspective, if the chip-select is asserted is there a
>>> turnaround time before it can be asserted again? Or in this case maybe
>>> the issue is that the chip-select is asserted but no transaction occurs
>>> before it is de-asserted and so is causing a problem. Please note that a
>>> delay of around ~50us above still fails but 100us seems to be ok.
>>
>> Really nice detective work!
>>
>>
>>> Finally, this also works-around the problem by avoiding the chip-select
>>> from being pulsed low by defaulting all chip-selects to active-low but
>>> maybe this just masks the problem.
>>>
>>> diff --git a/drivers/spi/spi-tegra114.c b/drivers/spi/spi-tegra114.c
>>> index a76acedd7e2f..7c18204e61d9 100644
>>> --- a/drivers/spi/spi-tegra114.c
>>> +++ b/drivers/spi/spi-tegra114.c
>>> @@ -1117,7 +1117,7 @@ static int tegra_spi_probe(struct platform_device
>>>                 goto exit_pm_disable;
>>>         }
>>> -       tspi->def_command1_reg  = SPI_M_S;
>>> +       tspi->def_command1_reg = SPI_M_S | SPI_CS_POL_INACTIVE_MASK;
>>
>> Note that even though I'd support adding some sort of delay to the
>> cros_ec driver, I'd also say that it _might_ make sense to mess with
>> the SPI driver too, just to avoid glitching the lines at bootup.  As
>> you said, you shouldn't just willy nilly change the default but it
>> seems like it could make sense to define an initial (board level)
>> pinctrl state that's in effect until the first call to setup_bus().
>
> I am thinking about making that above change as well, because the reset
> value of the chip-select polarity bits default to active-low. For
> active-high devices I don't think that you can ever avoid the
> chip-select asserting for a period after reset as that is the default
> setting, but I don't see why we are clearing these bits in probe.

Is the default muxing option for these pins to be SPI?  If not then
you'd have to look at the default muxing option and possibly the
default pull option.


> I will do a bit more testing with this to avoid any regressions, but
> both changes seem worthwhile IMO.

It does seem slightly risky to change the default for all boards, but
I'll leave it up to you.  I think you can get this right with all the
pinctrl stuff I mentioned, but that definitely would be a lot more
work.  Also: if you land the cros ec change, it's not exactly
urgent...

-Doug

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-09-26 15:40                                 ` Jon Hunter
@ 2017-11-14 15:56                                     ` Jon Hunter
  -1 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-11-14 15:56 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Brian Norris, Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi Shawn,

On 26/09/17 16:40, Jon Hunter wrote:
> On 26/09/17 00:15, Shawn N wrote:

...

>> From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>> Date: Mon, 25 Sep 2017 14:32:38 -0700
>> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
>>
>> For host commands that take a long time to process, cros ec can return
>> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
>> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
>>
>> None of the above applies when data link errors are encountered. When
>> errors such as EC_SPI_PAST_END are encountered during command
>> transmission, it usually means the command was not received by the EC.
>> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
>> almost always the wrong decision, and can result in host commands
>> silently being lost.
>>
>> Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>> ---
>>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
>>  1 file changed, 12 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
>> index c9714072e224..d33e3847e11e 100644
>> --- a/drivers/mfd/cros_ec_spi.c
>> +++ b/drivers/mfd/cros_ec_spi.c
>> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
>> cros_ec_device *ec_dev,
>>         u8 *ptr;
>>         u8 *rx_buf;
>>         u8 sum;
>> +       u8 rx_byte;
>>         int ret = 0, final_ret;
>>
>>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
>> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
>> cros_ec_device *ec_dev,
>>         if (!ret) {
>>                 /* Verify that EC can process command */
>>                 for (i = 0; i < len; i++) {
>> -                       switch (rx_buf[i]) {
>> -                       case EC_SPI_PAST_END:
>> -                       case EC_SPI_RX_BAD_DATA:
>> -                       case EC_SPI_NOT_READY:
>> -                               ret = -EAGAIN;
>> -                               ec_msg->result = EC_RES_IN_PROGRESS;
>> -                       default:
>> +                       rx_byte = rx_buf[i];
>> +                       if (rx_byte == EC_SPI_PAST_END  ||
>> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
>> +                           rx_byte == EC_SPI_NOT_READY) {
>> +                               ret = -EREMOTEIO;
>>                                 break;
>>                         }
>> -                       if (ret)
>> -                               break;
>>                 }
>> -               if (!ret)
>> -                       ret = cros_ec_spi_receive_packet(ec_dev,
>> -                                       ec_msg->insize + sizeof(*response));
>> -       } else {
>> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>>         }
>>
>> +       if (!ret)
>> +               ret = cros_ec_spi_receive_packet(ec_dev,
>> +                               ec_msg->insize + sizeof(*response));
>> +       else
>> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>> +
>>         final_ret = terminate_request(ec_dev);
>>
>>         spi_bus_unlock(ec_spi->spi->master);
>>
> 
> Thanks! Works for me ...
> 
> Tested-by: Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>

I can't find the formal patch you sent out for the above, but I have not
seen it picked up yet. I am guess that Lee did not pick it up because
there was still an open question. Anyhow we may want to circle back with
Lee on this so that this does get picked up.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-11-14 15:56                                     ` Jon Hunter
  0 siblings, 0 replies; 36+ messages in thread
From: Jon Hunter @ 2017-11-14 15:56 UTC (permalink / raw)
  To: Shawn N
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

Hi Shawn,

On 26/09/17 16:40, Jon Hunter wrote:
> On 26/09/17 00:15, Shawn N wrote:

...

>> From: Shawn Nematbakhsh <shawnn@chromium.org>
>> Date: Mon, 25 Sep 2017 14:32:38 -0700
>> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
>>
>> For host commands that take a long time to process, cros ec can return
>> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
>> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
>>
>> None of the above applies when data link errors are encountered. When
>> errors such as EC_SPI_PAST_END are encountered during command
>> transmission, it usually means the command was not received by the EC.
>> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
>> almost always the wrong decision, and can result in host commands
>> silently being lost.
>>
>> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
>> ---
>>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
>>  1 file changed, 12 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
>> index c9714072e224..d33e3847e11e 100644
>> --- a/drivers/mfd/cros_ec_spi.c
>> +++ b/drivers/mfd/cros_ec_spi.c
>> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
>> cros_ec_device *ec_dev,
>>         u8 *ptr;
>>         u8 *rx_buf;
>>         u8 sum;
>> +       u8 rx_byte;
>>         int ret = 0, final_ret;
>>
>>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
>> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
>> cros_ec_device *ec_dev,
>>         if (!ret) {
>>                 /* Verify that EC can process command */
>>                 for (i = 0; i < len; i++) {
>> -                       switch (rx_buf[i]) {
>> -                       case EC_SPI_PAST_END:
>> -                       case EC_SPI_RX_BAD_DATA:
>> -                       case EC_SPI_NOT_READY:
>> -                               ret = -EAGAIN;
>> -                               ec_msg->result = EC_RES_IN_PROGRESS;
>> -                       default:
>> +                       rx_byte = rx_buf[i];
>> +                       if (rx_byte == EC_SPI_PAST_END  ||
>> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
>> +                           rx_byte == EC_SPI_NOT_READY) {
>> +                               ret = -EREMOTEIO;
>>                                 break;
>>                         }
>> -                       if (ret)
>> -                               break;
>>                 }
>> -               if (!ret)
>> -                       ret = cros_ec_spi_receive_packet(ec_dev,
>> -                                       ec_msg->insize + sizeof(*response));
>> -       } else {
>> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>>         }
>>
>> +       if (!ret)
>> +               ret = cros_ec_spi_receive_packet(ec_dev,
>> +                               ec_msg->insize + sizeof(*response));
>> +       else
>> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
>> +
>>         final_ret = terminate_request(ec_dev);
>>
>>         spi_bus_unlock(ec_spi->spi->master);
>>
> 
> Thanks! Works for me ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

I can't find the formal patch you sent out for the above, but I have not
seen it picked up yet. I am guess that Lee did not pick it up because
there was still an open question. Anyhow we may want to circle back with
Lee on this so that this does get picked up.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
  2017-11-14 15:56                                     ` Jon Hunter
@ 2017-11-14 15:59                                         ` Shawn N
  -1 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-11-14 15:59 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Doug Anderson, Brian Norris,
	Brian Norris, Gwendal Grignou, Enric Balletbo, Tomeu Vizoso,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi Jon,

On Tue, Nov 14, 2017 at 7:56 AM, Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>
> Hi Shawn,
>
> On 26/09/17 16:40, Jon Hunter wrote:
> > On 26/09/17 00:15, Shawn N wrote:
>
> ...
>
> >> From: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> >> Date: Mon, 25 Sep 2017 14:32:38 -0700
> >> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
> >>
> >> For host commands that take a long time to process, cros ec can return
> >> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
> >> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
> >>
> >> None of the above applies when data link errors are encountered. When
> >> errors such as EC_SPI_PAST_END are encountered during command
> >> transmission, it usually means the command was not received by the EC.
> >> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
> >> almost always the wrong decision, and can result in host commands
> >> silently being lost.
> >>
> >> Signed-off-by: Shawn Nematbakhsh <shawnn-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> >> ---
> >>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
> >>  1 file changed, 12 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> >> index c9714072e224..d33e3847e11e 100644
> >> --- a/drivers/mfd/cros_ec_spi.c
> >> +++ b/drivers/mfd/cros_ec_spi.c
> >> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
> >> cros_ec_device *ec_dev,
> >>         u8 *ptr;
> >>         u8 *rx_buf;
> >>         u8 sum;
> >> +       u8 rx_byte;
> >>         int ret = 0, final_ret;
> >>
> >>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
> >> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
> >> cros_ec_device *ec_dev,
> >>         if (!ret) {
> >>                 /* Verify that EC can process command */
> >>                 for (i = 0; i < len; i++) {
> >> -                       switch (rx_buf[i]) {
> >> -                       case EC_SPI_PAST_END:
> >> -                       case EC_SPI_RX_BAD_DATA:
> >> -                       case EC_SPI_NOT_READY:
> >> -                               ret = -EAGAIN;
> >> -                               ec_msg->result = EC_RES_IN_PROGRESS;
> >> -                       default:
> >> +                       rx_byte = rx_buf[i];
> >> +                       if (rx_byte == EC_SPI_PAST_END  ||
> >> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
> >> +                           rx_byte == EC_SPI_NOT_READY) {
> >> +                               ret = -EREMOTEIO;
> >>                                 break;
> >>                         }
> >> -                       if (ret)
> >> -                               break;
> >>                 }
> >> -               if (!ret)
> >> -                       ret = cros_ec_spi_receive_packet(ec_dev,
> >> -                                       ec_msg->insize + sizeof(*response));
> >> -       } else {
> >> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> >>         }
> >>
> >> +       if (!ret)
> >> +               ret = cros_ec_spi_receive_packet(ec_dev,
> >> +                               ec_msg->insize + sizeof(*response));
> >> +       else
> >> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> >> +
> >>         final_ret = terminate_request(ec_dev);
> >>
> >>         spi_bus_unlock(ec_spi->spi->master);
> >>
> >
> > Thanks! Works for me ...
> >
> > Tested-by: Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>
> I can't find the formal patch you sent out for the above, but I have not
> seen it picked up yet. I am guess that Lee did not pick it up because
> there was still an open question. Anyhow we may want to circle back with
> Lee on this so that this does get picked up.

The formal patch is here:

https://lkml.org/lkml/2017/9/27/707

I will circle back to ensure that it gets picked up.

>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
@ 2017-11-14 15:59                                         ` Shawn N
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn N @ 2017-11-14 15:59 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Olof Johansson, Benson Leung, Lee Jones, linux-kernel,
	Doug Anderson, Brian Norris, Brian Norris, Gwendal Grignou,
	Enric Balletbo, Tomeu Vizoso, linux-tegra

Hi Jon,

On Tue, Nov 14, 2017 at 7:56 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Shawn,
>
> On 26/09/17 16:40, Jon Hunter wrote:
> > On 26/09/17 00:15, Shawn N wrote:
>
> ...
>
> >> From: Shawn Nematbakhsh <shawnn@chromium.org>
> >> Date: Mon, 25 Sep 2017 14:32:38 -0700
> >> Subject: [PATCH] mfd: cros ec: spi: Fix "in progress" error signaling
> >>
> >> For host commands that take a long time to process, cros ec can return
> >> early by signaling a EC_RES_IN_PROGRESS result. The host must then poll
> >> status with EC_CMD_GET_COMMS_STATUS until completion of the command.
> >>
> >> None of the above applies when data link errors are encountered. When
> >> errors such as EC_SPI_PAST_END are encountered during command
> >> transmission, it usually means the command was not received by the EC.
> >> Treating such errors as if they were 'EC_RES_IN_PROGRESS' results is
> >> almost always the wrong decision, and can result in host commands
> >> silently being lost.
> >>
> >> Signed-off-by: Shawn Nematbakhsh <shawnn@chromium.org>
> >> ---
> >>  drivers/mfd/cros_ec_spi.c | 26 ++++++++++++--------------
> >>  1 file changed, 12 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/mfd/cros_ec_spi.c b/drivers/mfd/cros_ec_spi.c
> >> index c9714072e224..d33e3847e11e 100644
> >> --- a/drivers/mfd/cros_ec_spi.c
> >> +++ b/drivers/mfd/cros_ec_spi.c
> >> @@ -377,6 +377,7 @@ static int cros_ec_pkt_xfer_spi(struct
> >> cros_ec_device *ec_dev,
> >>         u8 *ptr;
> >>         u8 *rx_buf;
> >>         u8 sum;
> >> +       u8 rx_byte;
> >>         int ret = 0, final_ret;
> >>
> >>         len = cros_ec_prepare_tx(ec_dev, ec_msg);
> >> @@ -421,25 +422,22 @@ static int cros_ec_pkt_xfer_spi(struct
> >> cros_ec_device *ec_dev,
> >>         if (!ret) {
> >>                 /* Verify that EC can process command */
> >>                 for (i = 0; i < len; i++) {
> >> -                       switch (rx_buf[i]) {
> >> -                       case EC_SPI_PAST_END:
> >> -                       case EC_SPI_RX_BAD_DATA:
> >> -                       case EC_SPI_NOT_READY:
> >> -                               ret = -EAGAIN;
> >> -                               ec_msg->result = EC_RES_IN_PROGRESS;
> >> -                       default:
> >> +                       rx_byte = rx_buf[i];
> >> +                       if (rx_byte == EC_SPI_PAST_END  ||
> >> +                           rx_byte == EC_SPI_RX_BAD_DATA ||
> >> +                           rx_byte == EC_SPI_NOT_READY) {
> >> +                               ret = -EREMOTEIO;
> >>                                 break;
> >>                         }
> >> -                       if (ret)
> >> -                               break;
> >>                 }
> >> -               if (!ret)
> >> -                       ret = cros_ec_spi_receive_packet(ec_dev,
> >> -                                       ec_msg->insize + sizeof(*response));
> >> -       } else {
> >> -               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> >>         }
> >>
> >> +       if (!ret)
> >> +               ret = cros_ec_spi_receive_packet(ec_dev,
> >> +                               ec_msg->insize + sizeof(*response));
> >> +       else
> >> +               dev_err(ec_dev->dev, "spi transfer failed: %d\n", ret);
> >> +
> >>         final_ret = terminate_request(ec_dev);
> >>
> >>         spi_bus_unlock(ec_spi->spi->master);
> >>
> >
> > Thanks! Works for me ...
> >
> > Tested-by: Jon Hunter <jonathanh@nvidia.com>
>
> I can't find the formal patch you sent out for the above, but I have not
> seen it picked up yet. I am guess that Lee did not pick it up because
> there was still an open question. Anyhow we may want to circle back with
> Lee on this so that this does get picked up.

The formal patch is here:

https://lkml.org/lkml/2017/9/27/707

I will circle back to ensure that it gets picked up.

>
> Cheers
> Jon
>
> --
> nvpublic

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2017-11-14 15:59 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-08 20:50 [PATCH v3] platform/chrome: Use proper protocol transfer function Brian Norris
2017-09-11 19:48 ` Benson Leung
     [not found] ` <20170908205011.77986-1-briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
2017-09-19 13:44   ` Jon Hunter
2017-09-19 13:44     ` Jon Hunter
     [not found]     ` <02aa65e7-e967-055b-2af3-2e9b6ef77935-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-09-19 14:09       ` Shawn N
2017-09-19 14:09         ` Shawn N
     [not found]         ` <CALaWCOMj0wQk5OfYOYqU_sZUt2SQBhy=HaP-qOiB5aMf9G8inw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-09-19 16:39           ` Jon Hunter
2017-09-19 16:39             ` Jon Hunter
     [not found]             ` <c3c5d08b-2df2-2e5b-cb09-bd4b3011e3df-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-09-19 17:03               ` Shawn N
2017-09-19 17:03                 ` Shawn N
2017-09-19 17:14               ` Brian Norris
2017-09-19 17:14                 ` Brian Norris
     [not found]                 ` <20170919171401.GA10968-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2017-09-20  6:05                   ` Shawn N
2017-09-20  6:05                     ` Shawn N
     [not found]                     ` <CALaWCOPzT-BWu-YcMY+xEAWGRmvvVEoA64ceEK3zG3K-wajskQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-09-20  6:13                       ` Brian Norris
2017-09-20  6:13                         ` Brian Norris
     [not found]                         ` <20170920061317.GB13616-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2017-09-20 20:22                           ` Shawn N
2017-09-20 20:22                             ` Shawn N
2017-09-25 23:15                             ` Shawn N
2017-09-26 15:40                               ` Jon Hunter
2017-09-26 15:40                                 ` Jon Hunter
     [not found]                                 ` <d8aff55f-796b-4e8d-edf3-b8d55a65eda0-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-11-14 15:56                                   ` Jon Hunter
2017-11-14 15:56                                     ` Jon Hunter
     [not found]                                     ` <4cf2fa5b-f909-1ab1-f743-e1cb9a66c604-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-11-14 15:59                                       ` Shawn N
2017-11-14 15:59                                         ` Shawn N
2017-10-10 13:35                               ` Jon Hunter
2017-10-10 13:35                                 ` Jon Hunter
2017-10-10 15:33                                 ` Shawn N
     [not found]                                   ` <CALaWCOP=0O7AMobu4YX0z=fJLopcQwv1Vm1_Bbp3XTaty1y6fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-10 16:52                                     ` Doug Anderson
2017-10-10 16:52                                       ` Doug Anderson
     [not found]                                       ` <CAD=FV=VobwsHyo96JxAgEPqG2ji5gMoNz6JJxQaQmZ3MD+dxRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-11-07 11:28                                         ` Jon Hunter
2017-11-07 11:28                                           ` Jon Hunter
     [not found]                                           ` <5ae1292d-14dc-b917-84cc-2758e82c4795-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-11-07 17:22                                             ` Doug Anderson
2017-11-07 17:22                                               ` Doug Anderson
2017-11-08 10:20                                               ` Jon Hunter
2017-11-08 16:45                                                 ` Doug Anderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.