linux-integrity.vger.kernel.org archive mirror
* [PATCH] Fix Atmel TPM crash caused by too frequent queries
@ 2020-09-26 22:31 Hao Wu
  2020-09-26 22:57 ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-26 22:31 UTC (permalink / raw)
  To: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, hamza,
	james.l.morris, nayna, why2jjj.linux, zohar
  Cc: linux-integrity, pmenzel, kgold, seungyeop.han, shrihari.kalkar,
	anish.jhaveri, hao.wu

In kernel 4.14 we fixed the TPM sleep logic,
changing it from msleep to usleep_range, so that the TPM
sleeps for exactly TPM_TIMEOUT (=5ms) afterwards.
Before that fix, msleep(5) actually slept for
around 15ms.
The fix is https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
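
For reference, tpm_msleep is a thin wrapper around usleep_range; a
simplified sketch of it (based on drivers/char/tpm/tpm.h, exact
bounds may differ between kernel versions):
```
static inline void tpm_msleep(unsigned int delay_msec)
{
	/* hrtimer-based: wakes up within the requested window, unlike
	 * msleep(), which rounds small delays up to whole scheduler
	 * ticks and so overshoots (msleep(5) ~= 15ms in practice) */
	usleep_range((delay_msec * 1000) - TPM_TIMEOUT_RANGE_US,
		     delay_msec * 1000);
}
```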

That fix uncovered that TPM_TIMEOUT had not been properly
set previously. We recently found that the TPM driver in kernel 4.14+
(including 5.9-rc4) crashes Atmel TPM chips by querying
them too frequently.

The TPM crash signature is
```
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error

$ sudo dmesg | grep tpm0
[59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
[59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
```

From the error code "-62" (-ETIME), this looks similar to another bug
https://patchwork.kernel.org/patch/10520247/
where "TPM_TIMEOUT_USECS_MAX" and "TPM_TIMEOUT_USECS_MIN"
are too small, causing the TPM to be queried too frequently
and thus to crash.

We finally found that the root cause is
the sleep timeout used in `wait_for_tpm_stat`.

In 4.16, commit
https://github.com/torvalds/linux/commit/cf151a9a44d52a63332e8e926234574fe5a5d784
introduced `TPM_POLL_SLEEP` (1ms) and used it instead of
TPM_TIMEOUT in `wait_for_tpm_stat`.

Since 4.18, commits
https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00
further reduced the polling sleep in wait_for_tpm_stat to less than 1ms.

Our patch here defines a `TPM_TIMEOUT_WAIT_STAT` (15ms) used only in
`wait_for_tpm_stat`, to fix the crash on Atmel chips
without introducing an unnecessary performance regression
in other code paths.
15ms is the sleep duration TPM chips actually got before the 4.14
change from msleep to usleep_range, which is why it fixes the crash.

Exploring a smaller `TPM_TIMEOUT_WAIT_STAT` should be a separate
effort to revisit later. This patch is meant to fix the regression
introduced in 4.14.

Test Plan:
- Run the fixed kernel on systems with Atmel TPM chips
  and ensure the crash does not happen
- Run the fixed kernel on systems with other TPM chips
  (IFX / WEC / STM) and ensure no breakages from tpm-tools

Signed-off-by: Hao Wu <hao.wu@rubrik.com>
---
 drivers/char/tpm/tpm.h          | 1 +
 drivers/char/tpm/tpm_tis_core.c | 3 +--
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 947d1db0a5cc..899097ae9756 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -41,6 +41,7 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
+	TPM_TIMEOUT_WAIT_STAT = 15,      /* msecs */
 	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
 	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
 };
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 65ab1b027949..8aa5eef10c28 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -80,8 +80,7 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
 		}
 	} else {
 		do {
-			usleep_range(TPM_TIMEOUT_USECS_MIN,
-				     TPM_TIMEOUT_USECS_MAX);
+			tpm_msleep(TPM_TIMEOUT_WAIT_STAT);
 			status = chip->ops->status(chip);
 			if ((status & mask) == mask)
 				return 0;
-- 
2.17.1



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-26 22:31 [PATCH] Fix Atmel TPM crash caused by too frequent queries Hao Wu
@ 2020-09-26 22:57 ` James Bottomley
  2020-09-26 23:10   ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-09-26 22:57 UTC (permalink / raw)
  To: Hao Wu, peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, hamza,
	james.l.morris, nayna, why2jjj.linux, zohar
  Cc: linux-integrity, pmenzel, kgold, seungyeop.han, shrihari.kalkar,
	anish.jhaveri

On Sat, 2020-09-26 at 15:31 -0700, Hao Wu wrote:
> Since kernel 4.14, we fixed the TPM sleep logic
> from msleep to usleep_range, so that the TPM
> sleeps exactly with TPM_TIMEOUT (=5ms) afterward.
> Before that fix, msleep(5) actually sleeps for
> around 15ms.
> The fix is 
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> 
> That fix uncovered that the TPM_TIMEOUT was not properly
> set previously. We recently found the TPM driver in kernel 4.14+
> (including 5.9-rc4) crashes Atmel TPM chips with
> too frequent TPM queries.

I've seen this with my nuvoton too ... although it seems to be because
my chip is somewhat experimental (SW upgrade from 1.2 to 2.0).  The
problem with your patch is it reintroduces the massive delays that
msleep has ... that's why usleep was used.  The patch I use locally to
fix this keeps usleep, can you try it (attached).

James

---

From d40a8c7691a72de28ea66a78bd177db36a79710a Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Wed, 11 Jul 2018 10:11:14 -0700
Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression

tpm_tis regressed recently to the point where the TPM being driven by
it falls off the bus and cannot be contacted after some hours of use.
This is the failure trace:

jejb@jarvis:~> dmesg|grep tpm
[    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
[14566.626614] tpm tpm0: Operation Timed out
[14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
[14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
[14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
...

The problem is caused by a change that made us poke the TPM far
more often to see if it's ready.  Apparently something about the bus
it's on and the TPM means that it crashes or falls off the bus if you
poke it too often, and once this happens, only a reboot will recover
it.

The fix I've come up with is to adjust the timings so the TPM no
longer falls off the bus.  Obviously, this fix works for my Nuvoton
NPCT6xxx, but that's the only TPM I've tested it with.

Fixes: 424eaf910c32 ("tpm: reduce polling time to usecs for even finer granularity")
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/char/tpm/tpm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 947d1db0a5cc..e4f4b98418ab 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -41,8 +41,8 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
-	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 1000,      /* usecs */
 };
 
 /* TPM addresses */
-- 
2.26.2




* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-26 22:57 ` James Bottomley
@ 2020-09-26 23:10   ` Hao Wu
  2020-09-27 18:25     ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-26 23:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Resending following email in plaintext.

----

Hi James,

Thanks for following up.

We have actually tried changing
TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX
according to https://patchwork.kernel.org/patch/10520247/
It does not solve the problem for the Atmel chip. The chips facing
the crash are not experimental; the crash happens commonly in
the production systems we and our customers are using.
It is widely seen on Cisco UCS C220 / C240 systems, which use
Atmel chips.

Thanks
Hao  

> On Sep 26, 2020, at 3:57 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
> 
> On Sat, 2020-09-26 at 15:31 -0700, Hao Wu wrote:
>> Since kernel 4.14, we fixed the TPM sleep logic
>> from msleep to usleep_range, so that the TPM
>> sleeps exactly with TPM_TIMEOUT (=5ms) afterward.
>> Before that fix, msleep(5) actually sleeps for
>> around 15ms.
>> The fix is 
>> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
>> 
>> That fix uncovered that the TPM_TIMEOUT was not properly
>> set previously. We recently found the TPM driver in kernel 4.14+
>> (including 5.9-rc4) crashes Atmel TPM chips with
>> too frequent TPM queries.
> 
> I've seen this with my nuvoton too ... although it seems to be because
> my chip is somewhat experimental (SW upgrade from 1.2 to 2.0).  The
> problem with your patch is it reintroduces the massive delays that
> msleep has ... that's why usleep was used.  The patch I use locally to
> fix this keeps usleep, can you try it (attached).
> 
> James
> 
> ---
> 
> From d40a8c7691a72de28ea66a78bd177db36a79710a Mon Sep 17 00:00:00 2001
> From: James Bottomley <James.Bottomley@HansenPartnership.com>
> Date: Wed, 11 Jul 2018 10:11:14 -0700
> Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
> 
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
> 
> jejb@jarvis:~> dmesg|grep tpm
> [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
> 
> The problem is caused by a change that made us poke the TPM far
> more often to see if it's ready.  Apparently something about the bus
> it's on and the TPM means that it crashes or falls off the bus if you
> poke it too often, and once this happens, only a reboot will recover
> it.
> 
> The fix I've come up with is to adjust the timings so the TPM no
> longer falls off the bus.  Obviously, this fix works for my Nuvoton
> NPCT6xxx, but that's the only TPM I've tested it with.
> 
> Fixes: 424eaf910c32 ("tpm: reduce polling time to usecs for even finer granularity")
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
> drivers/char/tpm/tpm.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 947d1db0a5cc..e4f4b98418ab 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -41,8 +41,8 @@ enum tpm_timeout {
> 	TPM_TIMEOUT_RETRY = 100, /* msecs */
> 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
> 	TPM_TIMEOUT_POLL = 1,	/* msecs */
> -	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
> +	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
> +	TPM_TIMEOUT_USECS_MAX = 1000,      /* usecs */
> };
> 
> /* TPM addresses */
> -- 
> 2.26.2
> 
> 



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-26 23:10   ` Hao Wu
@ 2020-09-27 18:25     ` James Bottomley
  2020-09-28  0:11       ` Hao Wu
  2020-09-28  1:08       ` Jarkko Sakkinen
  0 siblings, 2 replies; 51+ messages in thread
From: James Bottomley @ 2020-09-27 18:25 UTC (permalink / raw)
  To: Hao Wu
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote:
> Resending following email in plaintext.
> 
> ----
> 
> Hi James,
> 
> Thanks for following up.
> 
> We have actually tried changing
> TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX
> according to https://patchwork.kernel.org/patch/10520247/
> It does not solve the problem for the Atmel chip. The chips facing
> the crash are not experimental; the crash happens commonly in
> the production systems we and our customers are using.
> It is widely seen on Cisco UCS C220 / C240 systems, which use
> Atmel chips.

Well, I came up with the values in that patch by trial and error ....
all I know is they work for my nuvoton. If they're not right for you,
see if you can find what values actually do work for your TPM.  The
difference between msleep and usleep_range is that the former can have
an indefinitely long timeout and the latter has a range bounded one. 
If you think msleep works for you, the chances are it doesn't and
you're relying on the large upper bound to make the bug infrequent
enough for you not to see it.  Playing with the values in usleep_range
will help you find what the actual timeout is and eliminate the problem
for good.
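
Purely as an illustration of that experiment (this is not existing
driver code; the in-tree driver uses the compile-time
TPM_TIMEOUT_USECS_MIN/MAX constants), one could make the bounds
module parameters so they can be tuned without rebuilding:
```
/* hypothetical tunables for probing a TPM's real tolerance */
static uint poll_us_min = 100;
static uint poll_us_max = 500;
module_param(poll_us_min, uint, 0644);
module_param(poll_us_max, uint, 0644);

/* then, in the wait_for_tpm_stat() polling loop: */
usleep_range(poll_us_min, poll_us_max);
```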

James




* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-27 18:25     ` James Bottomley
@ 2020-09-28  0:11       ` Hao Wu
  2020-09-28  0:15         ` Hao Wu
  2020-09-28  1:22         ` James Bottomley
  2020-09-28  1:08       ` Jarkko Sakkinen
  1 sibling, 2 replies; 51+ messages in thread
From: Hao Wu @ 2020-09-28  0:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Hi James,

Maybe there is a misunderstanding. Here I am using tpm_msleep, not msleep.
tpm_msleep uses usleep_range underneath. See
https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188

The reason I chose 15ms is that before
https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
(where msleep was changed to tpm_msleep, which is essentially usleep_range),
the actual sleep time was 15ms; here we change this back to 15ms to fix the
regression.

Thanks
Hao 

> On Sep 27, 2020, at 11:25 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote:
>> Resending following email in plaintext.
>> 
>> ----
>> 
>> Hi James,
>> 
>> Thanks for following up.
>> 
>> We have actually tried changing
>> TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX
>> according to https://patchwork.kernel.org/patch/10520247/
>> It does not solve the problem for the Atmel chip. The chips facing
>> the crash are not experimental; the crash happens commonly in
>> the production systems we and our customers are using.
>> It is widely seen on Cisco UCS C220 / C240 systems, which use
>> Atmel chips.
> 
> Well, I came up with the values in that patch by trial and error ....
> all I know is they work for my nuvoton. If they're not right for you,
> see if you can find what values actually do work for your TPM.  The
> difference between msleep and usleep_range is that the former can have
> an indefinitely long timeout and the latter has a range bounded one. 
> If you think msleep works for you, the chances are it doesn't and
> you're relying on the large upper bound to make the bug infrequent
> enough for you not to see it.  Playing with the values in usleep_range
> will help you find what the actual timeout is and eliminate the problem
> for good.
> 
> James



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  0:11       ` Hao Wu
@ 2020-09-28  0:15         ` Hao Wu
  2020-09-28  1:22         ` James Bottomley
  1 sibling, 0 replies; 51+ messages in thread
From: Hao Wu @ 2020-09-28  0:15 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh,
	Hamza Attak, nayna, why2jjj.linux, zohar, linux-integrity,
	Paul Menzel, Ken Goldman, Seungyeop Han, Shrihari Kalkar,
	Anish Jhaveri

I am attaching the original bug report to this thread so that new reviewers can get better context.

---

Hi TPM Driver Maintainers,

We are from the Rubrik engineering team. We found a TPM driver update since kernel 4.14 that causes Atmel TPM chips to crash. We have root-caused it and have a patch in the kernel used by Rubrik products. Given the “bug” impacts not just Rubrik products but all Atmel TPM chips, we are asking for the issue to be fixed in the upstream kernel.

The commit that introduced the bug in 4.14:
https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3

Affected platforms / systems:
- Cisco UCS C220 M5 with Atmel TPM chip
	- Ubuntu 16.04
	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4
- Cisco UCS C240 M5 with Atmel TPM chip
	- Ubuntu 16.04
	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4

```
# TPM chip used in the problematic platform
$ tpm_version
TPM 1.2 Version Info:
  Chip Version:        1.2.66.1
  Spec Level:          2
  Errata Revision:     3
  TPM Vendor ID:       ATML
  TPM Version:         01010000
  Manufacturer Info:   41544d4c
```

Not all Cisco server nodes face the crash, but a good portion (around 50% of the nodes at Rubrik) persistently hit the TPM crash issue.

Our other platforms, which use TPM chips from other vendors, do not have the issue, even though they run the same software as the problematic platforms. The unaffected TPM vendors are:
- IFX
- STM
- WEC

TPM crash signature:
```
# error when running tpm tool
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error

# tpm driver sends error
$ sudo dmesg | grep tpm0
[59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
[59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
```

Our Root Cause Analysis
From the error code “-62” (-ETIME), it looks similar to another bug https://patchwork.kernel.org/patch/10520247/
where “TPM_TIMEOUT_USECS_MAX” and “TPM_TIMEOUT_USECS_MIN” are too small, causing the TPM to be queried too frequently and thus to crash.

The Atmel TPM chip crash is similar: commit https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 changed the TPM sleep logic from `msleep` to `tpm_msleep`, which uses `usleep_range`.

As https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 intended, using usleep_range makes the sleep period shorter and more precise, because msleep actually sleeps longer than requested when the period is within 20ms.

According to our experiments,
- usleep_range makes the TPM sleep precisely for TPM_TIMEOUT (i.e. 5ms https://github.com/torvalds/linux/blob/v4.14/drivers/char/tpm/tpm.h#L52)
- msleep(TPM_TIMEOUT) actually sleeps around 15ms    
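
A minimal, hypothetical test module to reproduce this measurement
(not part of the driver; the module and names below are ours):
```
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/ktime.h>

static int __init sleeptest_init(void)
{
	ktime_t t0;

	/* how long does msleep(5) really take? */
	t0 = ktime_get();
	msleep(5);
	pr_info("msleep(5) slept %lld us\n",
		ktime_us_delta(ktime_get(), t0));

	/* same request via the hrtimer-based usleep_range, mirroring
	 * tpm_msleep(5) = usleep_range(4700, 5000) */
	t0 = ktime_get();
	usleep_range(4700, 5000);
	pr_info("usleep_range(4700, 5000) slept %lld us\n",
		ktime_us_delta(ktime_get(), t0));
	return 0;
}

static void __exit sleeptest_exit(void)
{
}

module_init(sleeptest_init);
module_exit(sleeptest_exit);
MODULE_LICENSE("GPL");
```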

Thus the TPM gets queried more frequently than before, which is likely the root cause of the Atmel chip crash. We fix it by bumping TPM_TIMEOUT up to 15ms.


Rubrik Patch (patch only for 4.14)
```
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 72d3ce4..9b8f3f8 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -49,7 +49,15 @@ enum tpm_const {
 };

 enum tpm_timeout {
-       TPM_TIMEOUT = 5,        /* msecs */
+       TPM_TIMEOUT = 15,       /* msecs */
        TPM_TIMEOUT_RETRY = 100, /* msecs */
        TPM_TIMEOUT_RANGE_US = 300      /* usecs */
 };
```
With the patch, the Atmel TPM chip crash is fixed.

Proposal
We want the upstream kernel to adopt this fix, or a better fix for the Atmel chip that does not bring a performance regression for other TPM chips. We understand that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 was intended to shorten the TPM response time, but it does not work well for Atmel TPM chips. Perhaps we should at least override the TPM_TIMEOUT value for Atmel chips.

Thanks
Hao

> On Sep 27, 2020, at 5:11 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
> Hi James,
> 
> Maybe there is a misunderstanding. Here I am using tpm_msleep, not msleep.
> tpm_msleep uses usleep_range underneath. See
> https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188
> 
> The reason I chose 15ms is that before
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> (where msleep was changed to tpm_msleep, which is essentially usleep_range),
> the actual sleep time was 15ms; here we change this back to 15ms to fix the
> regression.
> 
> Thanks
> Hao 
> 
>> On Sep 27, 2020, at 11:25 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>> 
>> On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote:
>>> Resending following email in plaintext.
>>> 
>>> ----
>>> 
>>> Hi James,
>>> 
>>> Thanks for following up.
>>> 
>>> We have actually tried changing
>>> TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX
>>> according to https://patchwork.kernel.org/patch/10520247/
>>> It does not solve the problem for the Atmel chip. The chips facing
>>> the crash are not experimental; the crash happens commonly in
>>> the production systems we and our customers are using.
>>> It is widely seen on Cisco UCS C220 / C240 systems, which use
>>> Atmel chips.
>> 
>> Well, I came up with the values in that patch by trial and error ....
>> all I know is they work for my nuvoton. If they're not right for you,
>> see if you can find what values actually do work for your TPM.  The
>> difference between msleep and usleep_range is that the former can have
>> an indefinitely long timeout and the latter has a range bounded one. 
>> If you think msleep works for you, the chances are it doesn't and
>> you're relying on the large upper bound to make the bug infrequent
>> enough for you not to see it.  Playing with the values in usleep_range
>> will help you find what the actual timeout is and eliminate the problem
>> for good.
>> 
>> James
> 



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-27 18:25     ` James Bottomley
  2020-09-28  0:11       ` Hao Wu
@ 2020-09-28  1:08       ` Jarkko Sakkinen
  2020-09-28  6:03         ` Hao Wu
  1 sibling, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-28  1:08 UTC (permalink / raw)
  To: James Bottomley, Nayna Jain
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Sun, Sep 27, 2020 at 11:25:39AM -0700, James Bottomley wrote:
> On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote:
> > Resending following email in plaintext.
> > 
> > ----
> > 
> > Hi James,
> > 
> > Thanks for following up.
> > 
> > We have actually tried changing
> > TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX
> > according to https://patchwork.kernel.org/patch/10520247/
> > It does not solve the problem for the Atmel chip. The chips facing
> > the crash are not experimental; the crash happens commonly in
> > the production systems we and our customers are using.
> > It is widely seen on Cisco UCS C220 / C240 systems, which use
> > Atmel chips.
> 
> Well, I came up with the values in that patch by trial and error ....
> all I know is they work for my nuvoton. If they're not right for you,
> see if you can find what values actually do work for your TPM.  The
> difference between msleep and usleep_range is that the former can have
> an indefinitely long timeout and the latter has a range bounded one. 
> If you think msleep works for you, the chances are it doesn't and
> you're relying on the large upper bound to make the bug infrequent
> > enough for you not to see it.  Playing with the values in usleep_range
> will help you find what the actual timeout is and eliminate the problem
> for good.
> 
> James

I think I should revert 424eaf910c329, up until more legit values are found.

/Jarkko


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  0:11       ` Hao Wu
  2020-09-28  0:15         ` Hao Wu
@ 2020-09-28  1:22         ` James Bottomley
  2020-09-28  5:59           ` Hao Wu
  1 sibling, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-09-28  1:22 UTC (permalink / raw)
  To: Hao Wu
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Sun, 2020-09-27 at 17:11 -0700, Hao Wu wrote:
> Hi James,
> 
> Maybe there is a misunderstanding. Here I am using tpm_msleep, not
> msleep. tpm_msleep uses usleep_range underneath. See
> https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188

Right, I had missed that.

> The reason I chose 15ms is that before
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> (where msleep was changed to tpm_msleep, which is essentially
> usleep_range), the actual sleep time was 15ms; here we change this
> back to 15ms to fix the regression.

Right now most TIS TPMs operate successfully with a sleep in there of
the range 0.1-0.5ms.  Upping that to 15ms introduces a 100x delay in
our status wait for the TPM to become ready, potentially slowing down
all TIS TPM operations by 100x, which will hit us most with the PCR
writes we do for IMA logging ... that seems like a bad bargain for a
single TPM family manufacturer.

However, there is another possibility: it's something to do with the
byte read; I notice you don't require the same slowdown for the burst
count read, which actually reads the status register and burst count as
a read32.  If that really is the case, for the Atmel would substituting
a read32 and just throwing the upper bytes away in tpm_tis_status()
allow us to keep the current timings?  I can actually try doing this
and see if it fixes my nuvoton.

James




* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  1:22         ` James Bottomley
@ 2020-09-28  5:59           ` Hao Wu
  2020-09-28 22:11             ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-28  5:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Hi James,

>  Upping that to 15ms introduces a 100x delay in
> our status wait for the TPM to become ready, potentially slowing down
> all TIS TPM operations by 100x, which will hit us most with the PCR
> writes we do for IMA logging ... that seems like a bad bargain for a
> single TPM family manufacturer.
1. It is unlikely to be a 100x delay.
According to
https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00
those commits only optimized from 14 sec to 7 sec, i.e. only a 2x speedup, from reducing the sleep time from 5ms to under 1ms.
Changing it back to 15ms here is therefore very unlikely to cause a 100x delay.
The optimization is not worth a breakage on chips from one manufacturer.

2. In my opinion, the kernel should support all manufacturers that were supported before.
Dropping support for any of them would lead to kernel divergence, because users of those
chips have to use the kernel anyway. Maybe we can hear the maintainers’ opinion on this.

3. I am somewhat opposed to coming up with smaller values without doing comprehensive
qualification on all supported manufacturers. Stability is probably more important for such software.
Looking back at the commits that introduced the breakages, only one or two
chips were tested. If that is the common case, perhaps we should refactor
the TPM driver to better support per-manufacturer configuration?

> However, there is another possibility: it's something to do with the
> byte read; I notice you don't require the same slowdown for the burst
> count read, which actually reads the status register and burst count as
> a read32.  If that really is the case, for the Atmel would substituting
> a read32 and just throwing the upper bytes away in tpm_tis_status()
> allow us to keep the current timings?  I can actually try doing this
> and see if it fixes my nuvoton.

It would be helpful if you can find a solution without reducing performance.
I think it is a separate problem to address, though. Maybe it is not worth
mixing them in the same fix.

Thanks
Hao

> On Sep 27, 2020, at 6:22 PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> On Sun, 2020-09-27 at 17:11 -0700, Hao Wu wrote:
>> Hi James,
>> 
>> Maybe there is a misunderstanding. Here I am using tpm_msleep, not
>> msleep. tpm_msleep uses usleep_range underneath. See
>> https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188
> 
> Right, I had missed that.
> 
>> The reason I chose 15ms is that before
>> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
>> (where msleep was changed to tpm_msleep, which is essentially
>> usleep_range), the actual sleep time was 15ms; here we change this
>> back to 15ms to fix the regression.
> 
> Right now most TIS TPMs operate successfully with a sleep in there of
> the range 0.1-0.5ms.  Upping that to 15ms introduces a 100x delay in
> our status wait for the TPM to become ready, potentially slowing down
> all TIS TPM operations by 100x, which will hit us most with the PCR
> writes we do for IMA logging ... that seems like a bad bargain for a
> single TPM family manufacturer.
> 
> However, there is another possibility: it's something to do with the
> byte read; I notice you don't require the same slowdown for the burst
> count read, which actually reads the status register and burst count as
> a read32.  If that really is the case, for the Atmel would substituting
> a read32 and just throwing the upper bytes away in tpm_tis_status()
> allow us to keep the current timings?  I can actually try doing this
> and see if it fixes my nuvoton.
> 
> James
> 
> 



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  1:08       ` Jarkko Sakkinen
@ 2020-09-28  6:03         ` Hao Wu
  2020-09-28 14:16           ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-28  6:03 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Hi Jarkko,

Just to be clear, it is not caused by that single commit, but by a few accumulated commits:
https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
https://github.com/torvalds/linux/commit/cf151a9a44d52a63332e8e926234574fe5a5d784
https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00

The easy way is probably to just apply the patch I provided, and then revisit the value
for TPM_TIMEOUT_WAIT_STAT.

Thanks
Hao

> 
> I think I should revert 424eaf910c329, up until more legit values are found.
> 
> /Jarkko



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  6:03         ` Hao Wu
@ 2020-09-28 14:16           ` Jarkko Sakkinen
  2020-09-28 17:49             ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-28 14:16 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Sun, Sep 27, 2020 at 11:03:47PM -0700, Hao Wu wrote:
> Hi Jarkko,
> 
> Just to be clear, it is not caused by that single commit, but by a few accumulated commits:
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> https://github.com/torvalds/linux/commit/cf151a9a44d52a63332e8e926234574fe5a5d784
> https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
> https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00
> 
> The easy way is probably to just apply the patch I provided, and then revisit the value
> for TPM_TIMEOUT_WAIT_STAT.

When you respond, please quote properly, and do not top post.  The
discussion is impossible to follow this way.

I'm not sure if I buy that. Which one is the first failing commit?

/Jarkko


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 14:16           ` Jarkko Sakkinen
@ 2020-09-28 17:49             ` Hao Wu
  2020-09-28 19:47               ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-28 17:49 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Hi Jarkko,

https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 
is the one that introduced the issue in 4.14. The other three commits then
changed the relevant code a bit. You can check the timestamp / release version
of each commit to understand the relationship.

I think the original patch commit message can help you understand the root cause.
Attaching the commit here for your convenience.

Thanks
Hao

-----

In kernel 4.14 we fixed the TPM sleep logic,
changing it from msleep to usleep_range, so that the TPM
sleeps for exactly TPM_TIMEOUT (=5ms) afterwards.
Before that fix, msleep(5) actually slept for
around 15ms.
The fix is https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3

That fix uncovered that TPM_TIMEOUT had not been properly
set previously. We recently found that the TPM driver in kernel 4.14+
(including 5.9-rc4) crashes Atmel TPM chips by querying
them too frequently.

The TPM crash signature is
```
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error

$ sudo dmesg | grep tpm0
[59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
[59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
```

From the error code "-62" (-ETIME), this looks similar to another bug
https://patchwork.kernel.org/patch/10520247/
where "TPM_TIMEOUT_USECS_MAX" and "TPM_TIMEOUT_USECS_MIN"
are too small, causing the TPM to be queried too frequently
and thus to crash.

We finally found that the root cause is
the sleep timeout used in `wait_for_tpm_stat`.

In 4.16, commit
https://github.com/torvalds/linux/commit/cf151a9a44d52a63332e8e926234574fe5a5d784
introduced `TPM_POLL_SLEEP` (1ms) and used it instead of
TPM_TIMEOUT in `wait_for_tpm_stat`.

Since 4.18, commits
https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00
further reduced the polling sleep in wait_for_tpm_stat to less than 1ms.

Our patch here defines a `TPM_TIMEOUT_WAIT_STAT` (15ms) used only in
`wait_for_tpm_stat`, to fix the crash on Atmel chips
without introducing an unnecessary performance regression
in other code paths.
15ms is the sleep duration TPM chips actually got before the 4.14
change from msleep to usleep_range, which is why it fixes the crash.

Exploring a smaller `TPM_TIMEOUT_WAIT_STAT` should be a separate
effort to revisit later. This patch is meant to fix the regression
introduced in 4.14.

Test Plan:
- Run the fixed kernel on systems with Atmel TPM chips
 and ensure the crash does not happen
- Run the fixed kernel on systems with other TPM chips
 (IFX / WEC / STM) and ensure no breakages from tpm-tools

Signed-off-by: Hao Wu <hao.wu@rubrik.com>
---
drivers/char/tpm/tpm.h          | 1 +
drivers/char/tpm/tpm_tis_core.c | 3 +--
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 947d1db0a5cc..899097ae9756 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -41,6 +41,7 @@ enum tpm_timeout {
	TPM_TIMEOUT_RETRY = 100, /* msecs */
	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
	TPM_TIMEOUT_POLL = 1,	/* msecs */
+	TPM_TIMEOUT_WAIT_STAT = 15,      /* msecs */
	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
};
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 65ab1b027949..8aa5eef10c28 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -80,8 +80,7 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
		}
	} else {
		do {
-			usleep_range(TPM_TIMEOUT_USECS_MIN,
-				     TPM_TIMEOUT_USECS_MAX);
+			tpm_msleep(TPM_TIMEOUT_WAIT_STAT);
			status = chip->ops->status(chip);
			if ((status & mask) == mask)
				return 0;
-- 
2.17.1



> On Sep 28, 2020, at 7:16 AM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Sun, Sep 27, 2020 at 11:03:47PM -0700, Hao Wu wrote:
>> Hi Jarkko,
>> 
>> Just to be clear, it is not caused by that single commit, but by a few accumulated commits:
>> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
>> https://github.com/torvalds/linux/commit/cf151a9a44d52a63332e8e926234574fe5a5d784
>> https://github.com/torvalds/linux/commit/59f5a6b07f6434efac0057dc2f303a96b871811b
>> https://github.com/torvalds/linux/commit/424eaf910c329ab06ad03a527ef45dcf6a328f00
>> 
>> The easy way is probably to just apply the patch I provided, and then revisit the value
>> for TPM_TIMEOUT_WAIT_STAT.
> 
> When you respond, please quote properly, and do not top post.  The
> discussion is impossible to follow this way.
> 
> I'm not sure if I buy that. Which one is the first failing commit?
> 
> /Jarkko



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 17:49             ` Hao Wu
@ 2020-09-28 19:47               ` Jarkko Sakkinen
  2020-09-28 20:27                 ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-28 19:47 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Mon, Sep 28, 2020 at 10:49:56AM -0700, Hao Wu wrote:
> Hi Jarkko,
> 
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 
> is the one that introduced the issue in 4.14. The other three commits then
> changed the relevant code a bit. You can check the timestamp / release version
> of each commit to understand the relationship.
> 
> I think the original patch commit message can help you understand the root cause.
> Attaching the commit here for your convenience.
> 
> Thanks
> Hao

Please, again, when you respond quote properly instead of putting your
response on top. Thank you.

Yes, I know the issue, and it is already documented in James'
earlier patch that did a similar change. I.e., for some reason some TPMs
(or the bus itself) do not like being poked too often.

So: what if you revert to using msleep(TPM_TIMEOUT) in
wait_for_tpm_stat(), i.e. revert to the behaviour before the
aforementioned commit?
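
I.e. something along these lines (a sketch of the pre-9f3fc7bcddcb
loop body in wait_for_tpm_stat(), reconstructed from that commit and
untested):
```
	do {
		/* nominally 5 ms, ~15 ms in practice because msleep()
		 * rounds small delays up to whole scheduler ticks */
		msleep(TPM_TIMEOUT);
		status = chip->ops->status(chip);
		if ((status & mask) == mask)
			return 0;
	} while (time_before(jiffies, stop));
```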

/Jarkko


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 19:47               ` Jarkko Sakkinen
@ 2020-09-28 20:27                 ` Hao Wu
  2020-09-30  2:11                   ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-28 20:27 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri



> On Sep 28, 2020, at 12:47 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Mon, Sep 28, 2020 at 10:49:56AM -0700, Hao Wu wrote:
>> Hi Jarkko,
>> 
>> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 
>> is the one that introduced the issue in 4.14. The other three commits then
>> changed the relevant code a bit. You can check the timestamp / release version
>> of each commit to understand the relationship.
>> 
>> I think the original patch commit message can help you understand the root cause.
>> Attaching the commit here for your convenience.
>> 
>> Thanks
>> Hao
> 
> Please, again, when you respond quote properly instead of putting your
> response on top. Thank you.
> 
> Yes, I know the issue, and it is already documented in James'
> earlier patch that did a similar change. I.e., for some reason some TPMs
> (or the bus itself) do not like being poked too often.
Yes, probably. Although the issue James’s patch fixes has the same error code,
it is a different, if similar, issue.

> So: what if you revert to using msleep(TPM_TIMEOUT) in
> wait_for_tpm_stat(), i.e. revert to the behaviour before the
> aforementioned commit?
I believe that should resolve the issue as well

Thanks
Hao







* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28  5:59           ` Hao Wu
@ 2020-09-28 22:11             ` James Bottomley
  2020-09-29  4:46               ` Hao Wu
  2020-09-30  2:16               ` Jarkko Sakkinen
  0 siblings, 2 replies; 51+ messages in thread
From: James Bottomley @ 2020-09-28 22:11 UTC (permalink / raw)
  To: Hao Wu
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
[...]
> > However, there is another possibility: it's something to do with
> > the byte read; I notice you don't require the same slowdown for the
> > burst count read, which actually reads the status register and
> > burst count as a read32.  If that really is the case, for the Atmel
> > would substituting a read32 and just throwing the upper bytes away
> > in tpm_tis_status() allow us to keep the current timings?  I can
> > actually try doing this and see if it fixes my nuvoton.
> 
> It would be helpful if you can find a solution without reducing
> performance. I think it is a separate problem to address, though.
> Maybe it is not worth mixing them in the same fix.

Well, if it works, no other fix is needed.

This is what I'm currently trying out on my nuvoton with the timings
reverted to being those in the vanilla kernel.  So far it hasn't
crashed, but I haven't run it for long enough to be sure yet.

James

---

diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 6b884badabe7..c4dbac8edc9b 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -233,9 +233,9 @@ static u8 tpm_tis_status(struct tpm_chip *chip)
 {
 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
 	int rc;
-	u8 status;
+	u32 status;
 
-	rc = tpm_tis_read8(priv, TPM_STS(priv->locality), &status);
+	rc = tpm_tis_read32(priv, TPM_STS(priv->locality), &status);
 	if (rc < 0)
 		return 0;
 



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 22:11             ` James Bottomley
@ 2020-09-29  4:46               ` Hao Wu
  2020-09-30  2:16               ` Jarkko Sakkinen
  1 sibling, 0 replies; 51+ messages in thread
From: Hao Wu @ 2020-09-29  4:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, Hamza Attak,
	nayna, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri



> On Sep 28, 2020, at 3:11 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
> 
> On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> [...]
>>> However, there is another possibility: it's something to do with
>>> the byte read; I notice you don't require the same slowdown for the
>>> burst count read, which actually reads the status register and
>>> burst count as a read32.  If that really is the case, for the Atmel
>>> would substituting a read32 and just throwing the upper bytes away
>>> in tpm_tis_status() allow us to keep the current timings?  I can
>>> actually try doing this and see if it fixes my nuvoton.
>> 
>> It would be helpful if you can find a solution without reducing
>> performance. I think it is a separate problem to address, though.
>> Maybe it is not worth mixing them in the same fix.
> 
> Well, if it works, no other fix is needed.
> 
> This is what I'm currently trying out on my nuvoton with the timings
> reverted to being those in the vanilla kernel.  So far it hasn't
> crashed, but I haven't run it for long enough to be sure yet.
> 
> James
> 
> ---
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 6b884badabe7..c4dbac8edc9b 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -233,9 +233,9 @@ static u8 tpm_tis_status(struct tpm_chip *chip)
> {
> 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> 	int rc;
> -	u8 status;
> +	u32 status;
> 
> -	rc = tpm_tis_read8(priv, TPM_STS(priv->locality), &status);
> +	rc = tpm_tis_read32(priv, TPM_STS(priv->locality), &status);
> 	if (rc < 0)
> 		return 0;
> 
> 

Thanks, James, for the quick patch.
I will apply it this week and see whether it helps or not.

Hao


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 20:27                 ` Hao Wu
@ 2020-09-30  2:11                   ` Jarkko Sakkinen
  2020-09-30  3:41                     ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-30  2:11 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Mon, Sep 28, 2020 at 01:27:14PM -0700, Hao Wu wrote:
> 
> 
> > On Sep 28, 2020, at 12:47 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > 
> > On Mon, Sep 28, 2020 at 10:49:56AM -0700, Hao Wu wrote:
> >> Hi Jarkko,
> >> 
> >> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 
> >> is the one that introduced the issue in 4.14. The other three commits then
> >> changed the relevant code a bit. You can check the timestamp / release version
> >> of each commit to understand the relationship.
> >> 
> >> I think the original patch commit message can help you understand the root cause.
> >> Attaching the commit here for your convenience.
> >> 
> >> Thanks
> >> Hao
> > 
> > Please, again, when you respond quote properly instead of putting your
> > response on top. Thank you.
> > 
> > Yes, I know the issue, and it is already documented in James'
> > earlier patch that did a similar change. I.e., for some reason some TPMs
> > (or the bus itself) do not like being poked too often.
> Yes, probably. Although the issue James’s patch fixes has the same error code,
> it is a different, if similar, issue.

OK, great.

> > So: what if you revert to using msleep(TPM_TIMEOUT) in
> > wait_for_tpm_stat(), i.e. revert to the behaviour before the
> > aforementioned commit?
> I believe that should resolve the issue as well

I'd return to the old code that works instead of doing something new
along those lines. James?

Anyway, thanks a lot for coming up with this. I think we are making at
least some progress sorting this out.

Also, I want to underline that my comments about quoting emails do not
mean that I don't appreciate this feedback. It is
just a "protocol thing".

> Thanks
> Hao

/Jarkko


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-28 22:11             ` James Bottomley
  2020-09-29  4:46               ` Hao Wu
@ 2020-09-30  2:16               ` Jarkko Sakkinen
  2020-09-30 14:54                 ` James Bottomley
  1 sibling, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-30  2:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley wrote:
> On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> [...]
> > > However, there is another possibility: it's something to do with
> > > the byte read; I notice you don't require the same slowdown for the
> > > burst count read, which actually reads the status register and
> > > burst count as a read32.  If that really is the case, for the Atmel
> > > would substituting a read32 and just throwing the upper bytes away
> > > in tpm_tis_status() allow us to keep the current timings?  I can
> > > actually try doing this and see if it fixes my nuvoton.
> > 
> > It would be helpful if you can find a solution without reducing
> > performance. I think it is a separate problem to address, though.
> > Maybe it is not worth mixing them in the same fix.
> 
> Well, if it works, no other fix is needed.
> 
> This is what I'm currently trying out on my nuvoton with the timings
> reverted to being those in the vanilla kernel.  So far it hasn't
> crashed, but I haven't run it for long enough to be sure yet.
> 
> James

OK, so the bus does not like one byte reads but prefers full (32-bit)
word reads? I.e. what's the context?

> ---
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 6b884badabe7..c4dbac8edc9b 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -233,9 +233,9 @@ static u8 tpm_tis_status(struct tpm_chip *chip)
>  {
>  	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
>  	int rc;
> -	u8 status;
> +	u32 status;
>  
> -	rc = tpm_tis_read8(priv, TPM_STS(priv->locality), &status);
> +	rc = tpm_tis_read32(priv, TPM_STS(priv->locality), &status);
>  	if (rc < 0)
>  		return 0;
>  
> 

/Jarkko


* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30  2:11                   ` Jarkko Sakkinen
@ 2020-09-30  3:41                     ` Hao Wu
       [not found]                       ` <EA1EE8F8-F054-4E1B-B830-231398D33CB8@rubrik.com>
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-09-30  3:41 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna Jain, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri


> On Sep 29, 2020, at 7:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Mon, Sep 28, 2020 at 01:27:14PM -0700, Hao Wu wrote:
>> 
>> 
>>> On Sep 28, 2020, at 12:47 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>> 
>>> On Mon, Sep 28, 2020 at 10:49:56AM -0700, Hao Wu wrote:
>>>> Hi Jarkko,
>>>> 
>>>> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 
>>>> is the one that introduced the issue in 4.14. The other three commits then
>>>> changed the relevant code a bit. You can check the timestamp / release version
>>>> of each commit to understand the relationship.
>>>> 
>>>> I think the original patch commit message can help you understand the root cause.
>>>> Attaching the commit here for your convenience.
>>>> 
>>>> Thanks
>>>> Hao
>>> 
>>> Please, again, when you respond quote properly instead of putting your
>>> response on top. Thank you.
>>> 
>>> Yes, I know the issue, and it is already documented in James'
>>> earlier patch that did a similar change. I.e., for some reason some TPMs
>>> (or the bus itself) do not like being poked too often.
>> Yes, probably. Although the issue James’s patch fixes has the same error code,
>> it is a different, if similar, issue.
> 
> OK, great.
> 
>>> So: what if you revert to using msleep(TPM_TIMEOUT) in
>>> wait_for_tpm_stat(), i.e. revert to the behaviour before the
>>> aforementioned commit?
>> I believe that should resolve the issue as well
> 
> I'd return to the old code that works instead of doing something new
> along those lines. James?
> 
I would not go back to msleep, which is actually the wrong way to do it.
We don’t know the actual time it will sleep on different systems in the future.
Currently, my measurement shows that msleep(TPM_TIMEOUT), i.e. msleep(5),
sleeps for 15ms. Maybe we should use tpm_msleep to do the sleep precisely.

I will test out James’ patch and your proposal this week and get back to you either way.

> Anyway, thanks a lot for coming up with this. I think we are making at
> least some progress sorting this out.
> 
> Also, I want to underline that my comments about quoting emails do not
> mean that I don't appreciate this feedback. It is
> just a "protocol thing".
No worries. I am not familiar with the rules here. Thank you for the corrections.

> 
>> Thanks
>> Hao
> 
> /Jarkko

Hao



* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30  2:16               ` Jarkko Sakkinen
@ 2020-09-30 14:54                 ` James Bottomley
  2020-09-30 15:37                   ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-09-30 14:54 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley wrote:
> > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > [...]
> > > > However, there is another possibility: it's something to do
> > > > with the byte read; I notice you don't require the same
> > > > slowdown for the burst count read, which actually reads the
> > > > status register and burst count as a read32.  If that really is
> > > > the case, for the Atmel would substituting a read32 and just
> > > > throwing the upper bytes away in tpm_tis_status() allow us to
> > > > keep the current timings?  I can actually try doing this and
> > > > see if it fixes my nuvoton.
> > > 
> > > It would be helpful if you can find a solution without reducing
> > > performance. I think it is a separate problem to address, though.
> > > Maybe it is not worth mixing them in the same fix.
> > 
> > Well, if it works, no other fix is needed.
> > 
> > This is what I'm currently trying out on my nuvoton with the
> > timings reverted to being those in the vanilla kernel.  So far it
> > hasn't crashed, but I haven't run it for long enough to be sure
> > yet.
> > 
> > James
> 
> OK, so the bus does not like one byte reads but prefers full (32-bit)
> word reads? I.e. what's the context?

It's not supported by anything in the spec, just empirical observation.
However, the spec says the status register is 24 bits: the upper 16
being the burst count.  When we read the whole status register,
including the burst count, we do a read32.  I observed that the
elongated timing was only added for the read8 code, not the read32,
which supports the theory that the former causes the Atmel to crash
but the latter doesn't.  Of course it's always possible that
probabilistically the Atmel is going to crash on the burst count read,
but that's exercised far less than the status-only read.
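
A minimal sketch of the substitution I have in mind; tpm_tis_read32()
and the TPM_STS() offset macro already exist in tpm_tis_core, but this
is untested, not a finished fix:

```
static u8 tpm_tis_status(struct tpm_chip *chip)
{
	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
	u32 status;

	/* One 32-bit access covers the status byte plus burst count. */
	if (tpm_tis_read32(priv, TPM_STS(priv->locality), &status))
		return 0;

	/* Throw the upper bytes (the burst count) away. */
	return status & 0xff;
}
```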

James
 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30 14:54                 ` James Bottomley
@ 2020-09-30 15:37                   ` Jarkko Sakkinen
  2020-09-30 20:48                     ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-30 15:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley wrote:
> On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley wrote:
> > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > [...]
> > > > > However, there is another possibility: it's something to do
> > > > > with the byte read; I notice you don't require the same
> > > > > slowdown for the burst count read, which actually reads the
> > > > > status register and burst count as a read32.  If that really is
> > > > > the case, for the atmel would substituting a read32 and just
> > > > > throwing the upper bytes away in tpm_tis_status() allow us to
> > > > > keep the current timings?  I can actually try doing this and
> > > > > see if it fixes my nuvoton.
> > > > 
> > > > If would be helpful if you can find the solution without reducing
> > > > performance. I think it is a separate problem to address though.
> > > > Maybe not worth to mix them in the same fix.
> > > 
> > > Well, if it works, no other fix is needed.
> > > 
> > > This is what I'm currently trying out on my nuvoton with the
> > > timings reverted to being those in the vanilla kernel.  So far it
> > > hasn't crashed, but I haven't run it for long enough to be sure
> > > yet.
> > > 
> > > James
> > 
> > OK, so the bus does not like one byte reads but prefers full (32-bit)
> > word reads? I.e. what's the context?
> 
> It's not supported by anything in the spec just empirical observation. 
> However, the spec says the status register is 24 bits: the upper 16
> being the burst count.  When we read the whole status register,
> including the burst count, we do a read32. I observed that the
> elongated timing was only added for the read8 code not the read32 which
> supports the theory that the former causes the Atmel to crash but the
> latter doesn't.  Of course it's always possible that probabilistically
> the Atmel is going to crash on the burst count read, but that's
> exercised far less than the status only read.

This paragraph is a good enough explanation for me. Can you include it
in the final commit as soon as we hear how your fix works for Hao?

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30 15:37                   ` Jarkko Sakkinen
@ 2020-09-30 20:48                     ` James Bottomley
  2020-09-30 21:09                       ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-09-30 20:48 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
> On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley wrote:
> > On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley wrote:
> > > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > > [...]
> > > > > > However, there is another possibility: it's something to do
> > > > > > with the byte read; I notice you don't require the same
> > > > > > slowdown for the burst count read, which actually reads the
> > > > > > status register and burst count as a read32.  If that
> > > > > > really is the case, for the atmel would substituting a
> > > > > > read32 and just throwing the upper bytes away in
> > > > > > tpm_tis_status() allow us to keep the current timings?  I
> > > > > > can actually try doing this and see if it fixes my nuvoton.
> > > > > 
> > > > > If would be helpful if you can find the solution without
> > > > > reducing performance. I think it is a separate problem to
> > > > > address though. Maybe not worth to mix them in the same fix.
> > > > 
> > > > Well, if it works, no other fix is needed.
> > > > 
> > > > This is what I'm currently trying out on my nuvoton with the
> > > > timings reverted to being those in the vanilla kernel.  So far
> > > > it hasn't crashed, but I haven't run it for long enough to be
> > > > sure yet.
> > > > 
> > > > James
> > > 
> > > OK, so the bus does not like one byte reads but prefers full (32-
> > > bit) word reads? I.e. what's the context?
> > 
> > It's not supported by anything in the spec just empirical
> > observation.  However, the spec says the status register is 24
> > bits: the upper 16 being the burst count.  When we read the whole
> > status register, including the burst count, we do a read32. I
> > observed that the elongated timing was only added for the read8
> > code not the read32 which supports the theory that the former
> > causes the Atmel to crash but the latter doesn't.  Of course it's
> > always possible that probabilistically the Atmel is going to crash
> > on the burst count read, but that's exercised far less than the
> > status only read.
> 
> This paragraph is good enough explanation for me. Can you include it
> to the final commit as soon as we hear how your fix works for Hao?

Sure.  I'm afraid I have to report that it didn't work for me.  My
Nuvoton is definitely annoyed by the frequency of the prodding rather
than the register width.

James



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30 20:48                     ` James Bottomley
@ 2020-09-30 21:09                       ` Jarkko Sakkinen
  2020-09-30 22:31                         ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-09-30 21:09 UTC (permalink / raw)
  To: James Bottomley
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
> On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
> > On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley wrote:
> > > On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > > > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley wrote:
> > > > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > > > [...]
> > > > > > > However, there is another possibility: it's something to do
> > > > > > > with the byte read; I notice you don't require the same
> > > > > > > slowdown for the burst count read, which actually reads the
> > > > > > > status register and burst count as a read32.  If that
> > > > > > > really is the case, for the atmel would substituting a
> > > > > > > read32 and just throwing the upper bytes away in
> > > > > > > tpm_tis_status() allow us to keep the current timings?  I
> > > > > > > can actually try doing this and see if it fixes my nuvoton.
> > > > > > 
> > > > > > If would be helpful if you can find the solution without
> > > > > > reducing performance. I think it is a separate problem to
> > > > > > address though. Maybe not worth to mix them in the same fix.
> > > > > 
> > > > > Well, if it works, no other fix is needed.
> > > > > 
> > > > > This is what I'm currently trying out on my nuvoton with the
> > > > > timings reverted to being those in the vanilla kernel.  So far
> > > > > it hasn't crashed, but I haven't run it for long enough to be
> > > > > sure yet.
> > > > > 
> > > > > James
> > > > 
> > > > OK, so the bus does not like one byte reads but prefers full (32-
> > > > bit) word reads? I.e. what's the context?
> > > 
> > > It's not supported by anything in the spec just empirical
> > > observation.  However, the spec says the status register is 24
> > > bits: the upper 16 being the burst count.  When we read the whole
> > > status register, including the burst count, we do a read32. I
> > > observed that the elongated timing was only added for the read8
> > > code not the read32 which supports the theory that the former
> > > causes the Atmel to crash but the latter doesn't.  Of course it's
> > > always possible that probabilistically the Atmel is going to crash
> > > on the burst count read, but that's exercised far less than the
> > > status only read.
> > 
> > This paragraph is good enough explanation for me. Can you include it
> > to the final commit as soon as we hear how your fix works for Hao?
> 
> Sure.  I'm afraid I have to report that it didn't work for me.  My
> Nuvoton is definitely annoyed by the frequency of the prodding rather
> than the register width.

Sorry, this might have been stated at some point, but what type of bus
is it connected with?

Does it help in any way to tune the frequency?

I also wonder if we could adjust the frequency dynamically, i.e. start
with an optimistic value and lower it until we find the sweet spot.

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30 21:09                       ` Jarkko Sakkinen
@ 2020-09-30 22:31                         ` James Bottomley
  2020-10-01  1:50                           ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-09-30 22:31 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
> > On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
> > > On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley wrote:
> > > > On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > > > > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley
> > > > > wrote:
> > > > > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > > > > [...]
> > > > > > > > However, there is another possibility: it's something
> > > > > > > > to do
> > > > > > > > with the byte read; I notice you don't require the same
> > > > > > > > slowdown for the burst count read, which actually reads
> > > > > > > > the
> > > > > > > > status register and burst count as a read32.  If that
> > > > > > > > really is the case, for the atmel would substituting a
> > > > > > > > read32 and just throwing the upper bytes away in
> > > > > > > > tpm_tis_status() allow us to keep the current
> > > > > > > > timings?  I
> > > > > > > > can actually try doing this and see if it fixes my
> > > > > > > > nuvoton.
> > > > > > > 
> > > > > > > If would be helpful if you can find the solution without
> > > > > > > reducing performance. I think it is a separate problem to
> > > > > > > address though. Maybe not worth to mix them in the same
> > > > > > > fix.
> > > > > > 
> > > > > > Well, if it works, no other fix is needed.
> > > > > > 
> > > > > > This is what I'm currently trying out on my nuvoton with
> > > > > > the
> > > > > > timings reverted to being those in the vanilla kernel.  So
> > > > > > far
> > > > > > it hasn't crashed, but I haven't run it for long enough to
> > > > > > be
> > > > > > sure yet.
> > > > > > 
> > > > > > James
> > > > > 
> > > > > OK, so the bus does not like one byte reads but prefers full
> > > > > (32-
> > > > > bit) word reads? I.e. what's the context?
> > > > 
> > > > It's not supported by anything in the spec just empirical
> > > > observation.  However, the spec says the status register is 24
> > > > bits: the upper 16 being the burst count.  When we read the
> > > > whole
> > > > status register, including the burst count, we do a read32. I
> > > > observed that the elongated timing was only added for the read8
> > > > code not the read32 which supports the theory that the former
> > > > causes the Atmel to crash but the latter doesn't.  Of course
> > > > it's
> > > > always possible that probabilistically the Atmel is going to
> > > > crash
> > > > on the burst count read, but that's exercised far less than the
> > > > status only read.
> > > 
> > > This paragraph is good enough explanation for me. Can you include
> > > it
> > > to the final commit as soon as we hear how your fix works for
> > > Hao?
> > 
> > Sure.  I'm afraid I have to report that it didn't work for me.  My
> > Nuvoton is definitely annoyed by the frequency of the prodding
> > rather
> > than the register width.
> 
> Sorry, this might have been stated at some point but what type of bus
> is it connected with?

It's hard to tell: this is my Dell Laptop, but I'd have to bet LPC.

> Does it help in any way to tune the frequency?

Of the bus?  We simply don't have access: a TIS TPM is projected at a
specific memory mapped address and all the conversion to the LPC back
end is done by memory read/write operations.  The TPM itself has a
clock but doesn't give the TIS interface software control.
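
To illustrate: a status read is an ordinary MMIO load from the fixed
TIS register window, and the chipset turns it into LPC cycles behind
our back.  Sketch only, with the addresses per the TIS spec:

```
static u8 tis_read_status_locality0(void)
{
	void __iomem *base = ioremap(0xFED40000, 0x1000);	/* locality 0 window */
	u8 status;

	if (!base)
		return 0;

	status = ioread8(base + 0x18);	/* TPM_STS offset in the TIS map */
	iounmap(base);
	return status;
}
```

There is no register in that window that would let software slow down
or speed up the bus clock itself.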

> I also wonder if we could adjust the frequency dynamically. I.e.
> start with optimistic value and lower it until finding the sweet
> spot.

The problem is the way this crashes: the TPM seems to be unrecoverable.
If it were recoverable without a hard reset of the entire machine, we
could certainly play around with it.  I can try alternative mechanisms
to see if anything's viable, but to all intents and purposes, it looks
like my TPM simply stops responding to the TIS interface.

James



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-30 22:31                         ` James Bottomley
@ 2020-10-01  1:50                           ` Jarkko Sakkinen
  2020-10-01  4:53                             ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-10-01  1:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> > On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
> > > On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
> > > > On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley wrote:
> > > > > On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > > > > > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley
> > > > > > wrote:
> > > > > > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > > > > > [...]
> > > > > > > > > However, there is another possibility: it's something
> > > > > > > > > to do
> > > > > > > > > with the byte read; I notice you don't require the same
> > > > > > > > > slowdown for the burst count read, which actually reads
> > > > > > > > > the
> > > > > > > > > status register and burst count as a read32.  If that
> > > > > > > > > really is the case, for the atmel would substituting a
> > > > > > > > > read32 and just throwing the upper bytes away in
> > > > > > > > > tpm_tis_status() allow us to keep the current
> > > > > > > > > timings?  I
> > > > > > > > > can actually try doing this and see if it fixes my
> > > > > > > > > nuvoton.
> > > > > > > > 
> > > > > > > > If would be helpful if you can find the solution without
> > > > > > > > reducing performance. I think it is a separate problem to
> > > > > > > > address though. Maybe not worth to mix them in the same
> > > > > > > > fix.
> > > > > > > 
> > > > > > > Well, if it works, no other fix is needed.
> > > > > > > 
> > > > > > > This is what I'm currently trying out on my nuvoton with
> > > > > > > the
> > > > > > > timings reverted to being those in the vanilla kernel.  So
> > > > > > > far
> > > > > > > it hasn't crashed, but I haven't run it for long enough to
> > > > > > > be
> > > > > > > sure yet.
> > > > > > > 
> > > > > > > James
> > > > > > 
> > > > > > OK, so the bus does not like one byte reads but prefers full
> > > > > > (32-
> > > > > > bit) word reads? I.e. what's the context?
> > > > > 
> > > > > It's not supported by anything in the spec just empirical
> > > > > observation.  However, the spec says the status register is 24
> > > > > bits: the upper 16 being the burst count.  When we read the
> > > > > whole
> > > > > status register, including the burst count, we do a read32. I
> > > > > observed that the elongated timing was only added for the read8
> > > > > code not the read32 which supports the theory that the former
> > > > > causes the Atmel to crash but the latter doesn't.  Of course
> > > > > it's
> > > > > always possible that probabilistically the Atmel is going to
> > > > > crash
> > > > > on the burst count read, but that's exercised far less than the
> > > > > status only read.
> > > > 
> > > > This paragraph is good enough explanation for me. Can you include
> > > > it
> > > > to the final commit as soon as we hear how your fix works for
> > > > Hao?
> > > 
> > > Sure.  I'm afraid I have to report that it didn't work for me.  My
> > > Nuvoton is definitely annoyed by the frequency of the prodding
> > > rather
> > > than the register width.
> > 
> > Sorry, this might have been stated at some point but what type of bus
> > is it connected with?
> 
> It's hard to tell: this is my Dell Laptop, but I'd have to bet LPC.
> 
> > Does it help in any way to tune the frequency?
> 
> Of the bus?  We simply don't have access: a TIS TPM is projected at a
> specific memory mapped address and all the conversion to the LPC back
> end is done by memory read/write operations.  The TPM itself has a
> clock but doesn't give the TIS interface software control.

Some TPMs use tpm_tis_spi instead of MMIO.

> > I also wonder if we could adjust the frequency dynamically. I.e.
> > start with optimistic value and lower it until finding the sweet
> > spot.
> 
> The problem is the way this crashes: the TPM seems to be unrecoverable.
> If it were recoverable without a hard reset of the entire machine, we
> could certainly play around with it.  I can try alternative mechanisms
> to see if anything's viable, but to all intents and purposes, it looks
> like my TPM simply stops responding to the TIS interface.

A quickly scraped idea, probably with some holes in it, but I was
thinking something like:

1. Initially set a slow value for latency; this could be the original 15
   ms.
2. Use this to read TPM_PT_VENDOR_STRING_*.
3. Look up, based on the vendor string, a latency that works from a
   fixup table (the fallback latency could be the existing latency); a
   rough sketch is below.
4. Set the legit latency.
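
A rough sketch of the table in step 3.  The struct, the helper and the
"ATML" entry are made up for illustration; only TPM_TIMEOUT_USECS_MAX
is existing code:

```
struct tpm_latency_fixup {
	const char *vendor;		/* TPM_PT_VENDOR_STRING_1 value */
	unsigned int latency_usecs;	/* poll interval known to work */
};

static const struct tpm_latency_fixup tpm_latency_fixups[] = {
	{ "ATML", 15000 },	/* Atmel: needs the old ~15 ms interval */
};

static unsigned int tpm_lookup_latency(const char *vendor)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(tpm_latency_fixups); i++)
		if (!strncmp(vendor, tpm_latency_fixups[i].vendor, 4))
			return tpm_latency_fixups[i].latency_usecs;

	return TPM_TIMEOUT_USECS_MAX;	/* fallback: existing latency */
}
```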

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-01  1:50                           ` Jarkko Sakkinen
@ 2020-10-01  4:53                             ` James Bottomley
  2020-10-01 18:15                               ` Nayna
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-10-01  4:53 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak, nayna,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> > On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> > > On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
> > > > On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
> > > > > On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley
> > > > > wrote:
> > > > > > On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
> > > > > > > On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley
> > > > > > > wrote:
> > > > > > > > On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
> > > > > > > > [...]
> > > > > > > > > > However, there is another possibility: it's
> > > > > > > > > > something to do with the byte read; I notice you
> > > > > > > > > > don't require the same slowdown for the burst count
> > > > > > > > > > read, which actually reads the status register and
> > > > > > > > > > burst count as a read32.  If that really is the
> > > > > > > > > > case, for the atmel would substituting a read32 and
> > > > > > > > > > just throwing the upper bytes away in
> > > > > > > > > > tpm_tis_status() allow us to keep the current
> > > > > > > > > > timings?  I can actually try doing this and see if
> > > > > > > > > > it fixes my nuvoton.
> > > > > > > > > 
> > > > > > > > > If would be helpful if you can find the solution
> > > > > > > > > without reducing performance. I think it is a
> > > > > > > > > separate problem to address though. Maybe not worth
> > > > > > > > > to mix them in the same fix.
> > > > > > > > 
> > > > > > > > Well, if it works, no other fix is needed.
> > > > > > > > 
> > > > > > > > This is what I'm currently trying out on my nuvoton
> > > > > > > > with the timings reverted to being those in the vanilla
> > > > > > > > kernel.  So far it hasn't crashed, but I haven't run it
> > > > > > > > for long enough to be sure yet.
> > > > > > > > 
> > > > > > > > James
> > > > > > > 
> > > > > > > OK, so the bus does not like one byte reads but prefers
> > > > > > > full (32-bit) word reads? I.e. what's the context?
> > > > > > 
> > > > > > It's not supported by anything in the spec just empirical
> > > > > > observation.  However, the spec says the status register is
> > > > > > 24 bits: the upper 16 being the burst count.  When we read
> > > > > > the whole status register, including the burst count, we do
> > > > > > a read32. I observed that the elongated timing was only
> > > > > > added for the read8 code not the read32 which supports the
> > > > > > theory that the former causes the Atmel to crash but the
> > > > > > latter doesn't.  Of course it's always possible that
> > > > > > probabilistically the Atmel is going to crash on the burst
> > > > > > count read, but that's exercised far less than the status
> > > > > > only read.
> > > > > 
> > > > > This paragraph is good enough explanation for me. Can you
> > > > > include it to the final commit as soon as we hear how your
> > > > > fix works for Hao?
> > > > 
> > > > Sure.  I'm afraid I have to report that it didn't work for
> > > > me.  My Nuvoton is definitely annoyed by the frequency of the
> > > > prodding rather than the register width.
> > > 
> > > Sorry, this might have been stated at some point but what type of
> > > bus is it connected with?
> > 
> > It's hard to tell: this is my Dell Laptop, but I'd have to bet LPC.
> > 
> > > Does it help in any way to tune the frequency?
> > 
> > specific memory mapped address and all the conversion to the LPC
> > back end is done by memory read/write operations.  The TPM itself
> > has a clock but doesn't give the TIS interface software control.
> 
> Some TPM's use tpm_tis_spi instead of MMIO.

Yes, but I'm fairly certain mine's not SPI.

> > > I also wonder if we could adjust the frequency dynamically. I.e.
> > > start with optimistic value and lower it until finding the sweet
> > > spot.
> > 
> > The problem is the way this crashes: the TPM seems to be
> > unrecoverable. If it were recoverable without a hard reset of the
> > entire machine, we could certainly play around with it.  I can try
> > alternative mechanisms to see if anything's viable, but to all
> > intents and purposes, it looks like my TPM simply stops responding
> > to the TIS interface.
> 
> A quickly scraped idea probably with some holes in it but I was
> thinking something like
> 
> 1. Initially set slow value for latency, this could be the original
> 15 ms.
> 2. Use this to read TPM_PT_VENDOR_STRING_*.
> 3. Lookup based vendor string from a fixup table a latency that works
>    (the fallback latency could be the existing latency).

Well, yes, that was sort of what I was thinking of doing for the Atmel
... except I was thinking of using the TIS VID (the 16-bit assigned
vendor ID), which means we can get the information to set the timeout
before we have to do any TPM operations.

> 4. Set the legit latency.
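
Roughly like this; TPM_DID_VID() and tpm_tis_read32() already exist in
tpm_tis_core, while the helper and the Atmel check are illustrative
(0x1114 being Atmel's assigned vendor ID):

```
static bool tpm_tis_needs_slow_poll(struct tpm_tis_data *priv)
{
	u32 vendor;

	/* DID/VID is readable before any TPM command is issued. */
	if (tpm_tis_read32(priv, TPM_DID_VID(0), &vendor))
		return true;	/* be conservative if the read fails */

	return (vendor & 0xffff) == 0x1114;
}
```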

James



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
       [not found]                       ` <EA1EE8F8-F054-4E1B-B830-231398D33CB8@rubrik.com>
@ 2020-10-01 14:16                         ` Mimi Zohar
  0 siblings, 0 replies; 51+ messages in thread
From: Mimi Zohar @ 2020-10-01 14:16 UTC (permalink / raw)
  To: Hao Wu, Jarkko Sakkinen, James Bottomley
  Cc: Nayna Jain, peterhuewe, jgg, arnd, gregkh, Hamza Attak,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

Hi Hao,

When posting to the mailing list, please respond using plain text and
inline/bottom post.

On Wed, 2020-09-30 at 22:26 -0700, Hao Wu wrote:
> 7.1
> 
> - Jarkko’s proposal: using msleep(TPM_TIMEOUT) for wait_for_tpm_stat() resolves
>   the Atmel crash as we expect.
> 
> ---
>  drivers/char/tpm/tpm_tis_core.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 65ab1b027949..7dd9bcff542d 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -80,8 +80,9 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>  		}
>  	} else {
>  		do {
> -			usleep_range(TPM_TIMEOUT_USECS_MIN,
> -				     TPM_TIMEOUT_USECS_MAX);
> +			msleep(TPM_TIMEOUT);
> +			// usleep_range(TPM_TIMEOUT_USECS_MIN,
> +			//	     TPM_TIMEOUT_USECS_MAX);
>  			status = chip->ops->status(chip);
>  			if ((status & mask) == mask)
>  				return 0;
> --
> 2.17.1
> 
> But I think tpm_msleep(15) is still the right way to go.

Using msleep is definitely not the right way to go.  Please refer
to commit a233a0289cf9 ("tpm: msleep() delays - replace with
usleep_range() in i2c nuvoton driver") for a detailed explanation.

thanks,

Mimi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-01  4:53                             ` James Bottomley
@ 2020-10-01 18:15                               ` Nayna
  2020-10-01 18:32                                 ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Nayna @ 2020-10-01 18:15 UTC (permalink / raw)
  To: James Bottomley, Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri


On 10/1/20 12:53 AM, James Bottomley wrote:
> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>> On Wed, Sep 30, 2020 at 01:48:15PM -0700, James Bottomley wrote:
>>>>> On Wed, 2020-09-30 at 18:37 +0300, Jarkko Sakkinen wrote:
>>>>>> On Wed, Sep 30, 2020 at 07:54:58AM -0700, James Bottomley
>>>>>> wrote:
>>>>>>> On Wed, 2020-09-30 at 05:16 +0300, Jarkko Sakkinen wrote:
>>>>>>>> On Mon, Sep 28, 2020 at 03:11:39PM -0700, James Bottomley
>>>>>>>> wrote:
>>>>>>>>> On Sun, 2020-09-27 at 22:59 -0700, Hao Wu wrote:
>>>>>>>>> [...]
>>>>>>>>>>> However, there is another possibility: it's
>>>>>>>>>>> something to do with the byte read; I notice you
>>>>>>>>>>> don't require the same slowdown for the burst count
>>>>>>>>>>> read, which actually reads the status register and
>>>>>>>>>>> burst count as a read32.  If that really is the
>>>>>>>>>>> case, for the atmel would substituting a read32 and
>>>>>>>>>>> just throwing the upper bytes away in
>>>>>>>>>>> tpm_tis_status() allow us to keep the current
>>>>>>>>>>> timings?  I can actually try doing this and see if
>>>>>>>>>>> it fixes my nuvoton.
>>>>>>>>>> If would be helpful if you can find the solution
>>>>>>>>>> without reducing performance. I think it is a
>>>>>>>>>> separate problem to address though. Maybe not worth
>>>>>>>>>> to mix them in the same fix.
>>>>>>>>> Well, if it works, no other fix is needed.
>>>>>>>>>
>>>>>>>>> This is what I'm currently trying out on my nuvoton
>>>>>>>>> with the timings reverted to being those in the vanilla
>>>>>>>>> kernel.  So far it hasn't crashed, but I haven't run it
>>>>>>>>> for long enough to be sure yet.
>>>>>>>>>
>>>>>>>>> James
>>>>>>>> OK, so the bus does not like one byte reads but prefers
>>>>>>>> full (32-bit) word reads? I.e. what's the context?
>>>>>>> It's not supported by anything in the spec just empirical
>>>>>>> observation.  However, the spec says the status register is
>>>>>>> 24 bits: the upper 16 being the burst count.  When we read
>>>>>>> the whole status register, including the burst count, we do
>>>>>>> a read32. I observed that the elongated timing was only
>>>>>>> added for the read8 code not the read32 which supports the
>>>>>>> theory that the former causes the Atmel to crash but the
>>>>>>> latter doesn't.  Of course it's always possible that
>>>>>>> probabilistically the Atmel is going to crash on the burst
>>>>>>> count read, but that's exercised far less than the status
>>>>>>> only read.
>>>>>> This paragraph is good enough explanation for me. Can you
>>>>>> include it to the final commit as soon as we hear how your
>>>>>> fix works for Hao?
>>>>> Sure.  I'm afraid I have to report that it didn't work for
>>>>> me.  My Nuvoton is definitely annoyed by the frequency of the
>>>>> prodding rather than the register width.
>>>> Sorry, this might have been stated at some point but what type of
>>>> bus is it connected with?
>>> It's hard to tell: this is my Dell Laptop, but I'd have to bet LPC.
>>>
>>>> Does it help in any way to tune the frequency?
>>> specific memory mapped address and all the conversion to the LPC
>>> back end is done by memory read/write operations.  The TPM itself
>>> has a clock but doesn't give the TIS interface software control.
>> Some TPM's use tpm_tis_spi instead of MMIO.
> Yes, but I'm fairly certain mine's not SPI.
>
>>>> I also wonder if we could adjust the frequency dynamically. I.e.
>>>> start with optimistic value and lower it until finding the sweet
>>>> spot.
>>> The problem is the way this crashes: the TPM seems to be
>>> unrecoverable. If it were recoverable without a hard reset of the
>>> entire machine, we could certainly play around with it.  I can try
>>> alternative mechanisms to see if anything's viable, but to all
>>> intents and purposes, it looks like my TPM simply stops responding
>>> to the TIS interface.
>> A quickly scraped idea probably with some holes in it but I was
>> thinking something like
>>
>> 1. Initially set slow value for latency, this could be the original
>> 15 ms.
>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>> 3. Lookup based vendor string from a fixup table a latency that works
>>     (the fallback latency could be the existing latency).
> Well, yes, that was sort of what I was thinking of doing for the Atmel
> ... except I was thinking of using the TIS VID (16 byte assigned vendor
> ID) which means we can get the information to set the timeout before we
> have to do any TPM operations.


I wonder if the timeout issue exists for all TPM commands from the same
manufacturer.  For example, does the Atmel TPM also crash when
extending PCRs?

In addition to defining a per-TPM-vendor lookup table for the timeout,
would it be a good idea to also define a Kconfig/boot param option to
allow setting the timeout?  This would enable setting the timeout based
on the specific use.
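
Something as small as a module parameter for tpm_tis could do it; this
is hypothetical, not an existing option:

```
#include <linux/moduleparam.h>

static unsigned int wait_stat_msecs;	/* 0 = keep the built-in default */
module_param(wait_stat_msecs, uint, 0444);
MODULE_PARM_DESC(wait_stat_msecs,
		 "Override the TPM status poll interval in milliseconds");
```

That would allow setting e.g. tpm_tis.wait_stat_msecs=15 on the kernel
command line for the TPMs that need it.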

I was also wondering how we will decide the lookup table values for
each vendor?

Thanks & Regards,

       - Nayna


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-01 18:15                               ` Nayna
@ 2020-10-01 18:32                                 ` James Bottomley
  2020-10-01 23:04                                   ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-10-01 18:32 UTC (permalink / raw)
  To: Nayna, Jarkko Sakkinen
  Cc: Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> On 10/1/20 12:53 AM, James Bottomley wrote:
> > On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> > > On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> > > > On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
[...]
> > > > > I also wonder if we could adjust the frequency dynamically.
> > > > > I.e. start with optimistic value and lower it until finding
> > > > > the sweet spot.
> > > >  
> > > > The problem is the way this crashes: the TPM seems to be
> > > > unrecoverable. If it were recoverable without a hard reset of
> > > > the entire machine, we could certainly play around with it.  I
> > > > can try alternative mechanisms to see if anything's viable, but
> > > > to all intents and purposes, it looks like my TPM simply stops
> > > > responding to the TIS interface.
> > >  
> > > A quickly scraped idea probably with some holes in it but I was
> > > thinking something like
> > > 
> > > 1. Initially set slow value for latency, this could be the
> > > original 15 ms.
> > > 2. Use this to read TPM_PT_VENDOR_STRING_*.
> > > 3. Lookup based vendor string from a fixup table a latency that
> > > works
> > >     (the fallback latency could be the existing latency).
> >  
> > Well, yes, that was sort of what I was thinking of doing for the
> > Atmel ... except I was thinking of using the TIS VID (16 byte
> > assigned vendor ID) which means we can get the information to set
> > the timeout before we have to do any TPM operations.
> 
> I wonder if the timeout issue exists for all TPM commands for the
> same manufacturer.  For example, does the ATMEL TPM also crash when 
> extending  PCRs ?
> 
> In addition to defining a per TPM vendor based lookup table for
> timeout, would it be a good idea to also define a Kconfig/boot param
> option to allow timeout setting.  This will enable to set the timeout
> based on the specific use.

I don't think we need to go that far (yet).  The timing change has been in
upstream since:

commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
Author: Nayna Jain <nayna@linux.vnet.ibm.com>
Date:   Wed May 16 01:51:25 2018 -0400

    tpm: reduce polling time to usecs for even finer granularity
 
Which was in the released kernel 4.18: over two years ago.  In all that
time we've discovered two problems: mine, which looks to be an artifact
of an experimental upgrade process in a new Nuvoton, and the Atmel.
That means pretty much every other TPM simply works with the existing
timings.

> I was also thinking how will we decide the lookup table values for
> each vendor ?

I wasn't thinking we would.  I was thinking I'd do a simple exception
for the Atmel and nothing else.  I don't think my Nuvoton is in any way
characteristic.  Indeed my pluggable TPM rainbow bridge system works
just fine with a Nuvoton and the current timings.

We can add additional exceptions if they actually turn up.

James



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-01 18:32                                 ` James Bottomley
@ 2020-10-01 23:04                                   ` Jarkko Sakkinen
  2020-10-17  6:11                                     ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-10-01 23:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Nayna, Hao Wu, peterhuewe, jgg, arnd, gregkh, Hamza Attak,
	why2jjj.linux, zohar, linux-integrity, Paul Menzel, Ken Goldman,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> > On 10/1/20 12:53 AM, James Bottomley wrote:
> > > On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> > > > On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> > > > > On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> [...]
> > > > > > I also wonder if we could adjust the frequency dynamically.
> > > > > > I.e. start with optimistic value and lower it until finding
> > > > > > the sweet spot.
> > > > >  
> > > > > The problem is the way this crashes: the TPM seems to be
> > > > > unrecoverable. If it were recoverable without a hard reset of
> > > > > the entire machine, we could certainly play around with it.  I
> > > > > can try alternative mechanisms to see if anything's viable, but
> > > > > to all intents and purposes, it looks like my TPM simply stops
> > > > > responding to the TIS interface.
> > > >  
> > > > A quickly scraped idea probably with some holes in it but I was
> > > > thinking something like
> > > > 
> > > > 1. Initially set slow value for latency, this could be the
> > > > original 15 ms.
> > > > 2. Use this to read TPM_PT_VENDOR_STRING_*.
> > > > 3. Lookup based vendor string from a fixup table a latency that
> > > > works
> > > >     (the fallback latency could be the existing latency).
> > >  
> > > Well, yes, that was sort of what I was thinking of doing for the
> > > Atmel ... except I was thinking of using the TIS VID (16 byte
> > > assigned vendor ID) which means we can get the information to set
> > > the timeout before we have to do any TPM operations.
> > 
> > I wonder if the timeout issue exists for all TPM commands for the
> > same manufacturer.  For example, does the ATMEL TPM also crash when 
> > extending  PCRs ?
> > 
> > In addition to defining a per TPM vendor based lookup table for
> > timeout, would it be a good idea to also define a Kconfig/boot param
> > option to allow timeout setting.  This will enable to set the timeout
> > based on the specific use.
> 
> I don't think we need go that far (yet).  The timing change has been in
> upstream since:
> 
> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
> Date:   Wed May 16 01:51:25 2018 -0400
> 
>     tpm: reduce polling time to usecs for even finer granularity
>  
> Which was in the released kernel 4.18: over two years ago.  In all that
> time we've discovered two problems: mine which looks to be an artifact
> of an experimental upgrade process in a new nuvoton and the Atmel. 
> That means pretty much every other TPM simply works with the existing
> timings
> 
> > I was also thinking how will we decide the lookup table values for
> > each vendor ?
> 
> I wasn't thinking we would.  I was thinking I'd do a simple exception
> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
> characteristic.  Indeed my pluggable TPM rainbow bridge system works
> just fine with a Nuvoton and the current timings.
> 
> We can add additional exceptions if they actually turn up.

I'd add a table and fallback.

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-01 23:04                                   ` Jarkko Sakkinen
@ 2020-10-17  6:11                                     ` Hao Wu
  2020-10-18  5:09                                       ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-10-17  6:11 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>> [...]
>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>> the sweet spot.
>>>>>> 
>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>> responding to the TIS interface.
>>>>> 
>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>> thinking something like
>>>>> 
>>>>> 1. Initially set slow value for latency, this could be the
>>>>> original 15 ms.
>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>> works
>>>>>    (the fallback latency could be the existing latency).
>>>> 
>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>> assigned vendor ID) which means we can get the information to set
>>>> the timeout before we have to do any TPM operations.
>>> 
>>> I wonder if the timeout issue exists for all TPM commands for the
>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>> extending  PCRs ?
>>> 
>>> In addition to defining a per TPM vendor based lookup table for
>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>> option to allow timeout setting.  This will enable to set the timeout
>>> based on the specific use.
>> 
>> I don't think we need go that far (yet).  The timing change has been in
>> upstream since:
>> 
>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>> Date:   Wed May 16 01:51:25 2018 -0400
>> 
>>    tpm: reduce polling time to usecs for even finer granularity
>> 
>> Which was in the released kernel 4.18: over two years ago.  In all that
>> time we've discovered two problems: mine which looks to be an artifact
>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>> That means pretty much every other TPM simply works with the existing
>> timings
>> 
>>> I was also thinking how will we decide the lookup table values for
>>> each vendor ?
>> 
>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>> just fine with a Nuvoton and the current timings.
>> 
>> We can add additional exceptions if they actually turn up.
> 
> I'd add a table and fallback.
> 

Hi folks,

I want to follow up on this a bit and check whether we have reached a consensus
on how to fix the timeout issue for the Atmel chip.

Should we revert the changes or introduce the lookup table for chips?

Is there anything I can help with from the Rubrik side?

Thanks
Hao
 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-17  6:11                                     ` Hao Wu
@ 2020-10-18  5:09                                       ` Jarkko Sakkinen
  2020-10-18  5:20                                         ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-10-18  5:09 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
> > On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > 
> > On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
> >> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> >>> On 10/1/20 12:53 AM, James Bottomley wrote:
> >>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> >>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> >>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> >> [...]
> >>>>>>> I also wonder if we could adjust the frequency dynamically.
> >>>>>>> I.e. start with optimistic value and lower it until finding
> >>>>>>> the sweet spot.
> >>>>>> 
> >>>>>> The problem is the way this crashes: the TPM seems to be
> >>>>>> unrecoverable. If it were recoverable without a hard reset of
> >>>>>> the entire machine, we could certainly play around with it.  I
> >>>>>> can try alternative mechanisms to see if anything's viable, but
> >>>>>> to all intents and purposes, it looks like my TPM simply stops
> >>>>>> responding to the TIS interface.
> >>>>> 
> >>>>> A quickly scraped idea probably with some holes in it but I was
> >>>>> thinking something like
> >>>>> 
> >>>>> 1. Initially set slow value for latency, this could be the
> >>>>> original 15 ms.
> >>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
> >>>>> 3. Lookup based vendor string from a fixup table a latency that
> >>>>> works
> >>>>>    (the fallback latency could be the existing latency).
> >>>> 
> >>>> Well, yes, that was sort of what I was thinking of doing for the
> >>>> Atmel ... except I was thinking of using the TIS VID (16 byte
> >>>> assigned vendor ID) which means we can get the information to set
> >>>> the timeout before we have to do any TPM operations.
> >>> 
> >>> I wonder if the timeout issue exists for all TPM commands for the
> >>> same manufacturer.  For example, does the ATMEL TPM also crash when 
> >>> extending  PCRs ?
> >>> 
> >>> In addition to defining a per TPM vendor based lookup table for
> >>> timeout, would it be a good idea to also define a Kconfig/boot param
> >>> option to allow timeout setting.  This will enable to set the timeout
> >>> based on the specific use.
> >> 
> >> I don't think we need go that far (yet).  The timing change has been in
> >> upstream since:
> >> 
> >> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
> >> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
> >> Date:   Wed May 16 01:51:25 2018 -0400
> >> 
> >>    tpm: reduce polling time to usecs for even finer granularity
> >> 
> >> Which was in the released kernel 4.18: over two years ago.  In all that
> >> time we've discovered two problems: mine which looks to be an artifact
> >> of an experimental upgrade process in a new nuvoton and the Atmel. 
> >> That means pretty much every other TPM simply works with the existing
> >> timings
> >> 
> >>> I was also thinking how will we decide the lookup table values for
> >>> each vendor ?
> >> 
> >> I wasn't thinking we would.  I was thinking I'd do a simple exception
> >> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
> >> characteristic.  Indeed my pluggable TPM rainbow bridge system works
> >> just fine with a Nuvoton and the current timings.
> >> 
> >> We can add additional exceptions if they actually turn up.
> > 
> > I'd add a table and fallback.
> > 
> 
> Hi folks,
> 
> I want to follow up this a bit and check whether we reached a consensus 
> on how to fix the timeout issue for Atmel chip.
> 
> Should we revert the changes or introduce the lookup table for chips.
> 
> Is there anything I can help from Rubrik side.
> 
> Thanks
> Hao

There is nothing to revert, as the previous patch was not applied, but
I'm of course ready to review any new attempts.

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-18  5:09                                       ` Jarkko Sakkinen
@ 2020-10-18  5:20                                         ` Hao Wu
  2020-11-14  4:39                                           ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-10-18  5:20 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>> 
>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>> [...]
>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>> the sweet spot.
>>>>>>>> 
>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>> responding to the TIS interface.
>>>>>>> 
>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>> thinking something like
>>>>>>> 
>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>> original 15 ms.
>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>> works
>>>>>>>   (the fallback latency could be the existing latency).
>>>>>> 
>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>> the timeout before we have to do any TPM operations.
>>>>> 
>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>> extending  PCRs ?
>>>>> 
>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>> based on the specific use.
>>>> 
>>>> I don't think we need go that far (yet).  The timing change has been in
>>>> upstream since:
>>>> 
>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>> 
>>>>   tpm: reduce polling time to usecs for even finer granularity
>>>> 
>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>> time we've discovered two problems: mine which looks to be an artifact
>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>> That means pretty much every other TPM simply works with the existing
>>>> timings
>>>> 
>>>>> I was also thinking how will we decide the lookup table values for
>>>>> each vendor ?
>>>> 
>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>> just fine with a Nuvoton and the current timings.
>>>> 
>>>> We can add additional exceptions if they actually turn up.
>>> 
>>> I'd add a table and fallback.
>>> 
>> 
>> Hi folks,
>> 
>> I want to follow up this a bit and check whether we reached a consensus 
>> on how to fix the timeout issue for Atmel chip.
>> 
>> Should we revert the changes or introduce the lookup table for chips.
>> 
>> Is there anything I can help from Rubrik side.
>> 
>> Thanks
>> Hao
> 
> There is nothing to revert as the previous was not applied but I'm
> of course ready to review any new attempts.
> 

Hi Jarkko,

By “revert” I meant reverting the timeout value changes by applying
the patch I proposed, as the timeout value discussed does cause issues.

Why don’t we apply the patch and then improve the performance in a way
that does not break TPMs?

Hao
 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-10-18  5:20                                         ` Hao Wu
@ 2020-11-14  4:39                                           ` Hao Wu
  2020-11-18 21:11                                             ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2020-11-14  4:39 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>> 
>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>> 
>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>> [...]
>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>> the sweet spot.
>>>>>>>>> 
>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>> responding to the TIS interface.
>>>>>>>> 
>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>> thinking something like
>>>>>>>> 
>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>> original 15 ms.
>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>> works
>>>>>>>>  (the fallback latency could be the existing latency).
>>>>>>> 
>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>> the timeout before we have to do any TPM operations.
>>>>>> 
>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>> extending  PCRs ?
>>>>>> 
>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>> based on the specific use.
>>>>> 
>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>> upstream since:
>>>>> 
>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>> 
>>>>>  tpm: reduce polling time to usecs for even finer granularity
>>>>> 
>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>> That means pretty much every other TPM simply works with the existing
>>>>> timings
>>>>> 
>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>> each vendor ?
>>>>> 
>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>> just fine with a Nuvoton and the current timings.
>>>>> 
>>>>> We can add additional exceptions if they actually turn up.
>>>> 
>>>> I'd add a table and fallback.
>>>> 
>>> 
>>> Hi folks,
>>> 
>>> I want to follow up this a bit and check whether we reached a consensus 
>>> on how to fix the timeout issue for Atmel chip.
>>> 
>>> Should we revert the changes or introduce the lookup table for chips.
>>> 
>>> Is there anything I can help from Rubrik side.
>>> 
>>> Thanks
>>> Hao
>> 
>> There is nothing to revert as the previous was not applied but I'm
>> of course ready to review any new attempts.
>> 
> 
> Hi Jarkko,
> 
> By “revert” I meant we revert the timeout value changes by applying
> the patch I proposed, as the timeout value discussed does cause issues.
> 
> Why don’t we apply the patch and improve the perf in the way of not
> breaking TPMs ? 
> 
> Hao

Hi Jarkko and folks,

It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
It looks like we currently have the following choices:
1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
   (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
   This is the currently proposed patch.
3. Fix the regression by making an exception for the Atmel chip.

Should we reach consensus on which one we want to pursue before digging into the implementation
of the patch? In my opinion, I prefer to fix the regression with 2, and then pursue 1 as the long-term
solution; 3 is hacky.
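
To make option 1 concrete, here is a minimal sketch of what such a fixup table could look like (illustrative only — the struct, table entries, and helper name are assumptions, not the actual patch):

```c
/*
 * Hypothetical per-vendor fixup table for the wait_for_tpm_stat sleep
 * time, with a fallback to the existing fast polling value.
 */
struct tpm_wait_stat_fixup {
	u32 vid;		/* TIS vendor ID, e.g. TPM_VID_ATML */
	unsigned int wait_us;	/* sleep time for wait_for_tpm_stat */
};

static const struct tpm_wait_stat_fixup tpm_wait_stat_fixups[] = {
	{ TPM_VID_ATML, 15000 },	/* Atmel: restore the pre-4.14 15 ms */
};

static unsigned int tpm_wait_stat_us(u32 vid)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(tpm_wait_stat_fixups); i++)
		if (tpm_wait_stat_fixups[i].vid == vid)
			return tpm_wait_stat_fixups[i].wait_us;

	return TPM_TIMEOUT_USECS_MAX;	/* fallback: keep current behaviour */
}
```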

Let me know what you all think.

Hao
 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-11-14  4:39                                           ` Hao Wu
@ 2020-11-18 21:11                                             ` Jarkko Sakkinen
  2020-11-18 23:23                                               ` Hao Wu
  2021-05-09  6:18                                               ` Hao Wu
  0 siblings, 2 replies; 51+ messages in thread
From: Jarkko Sakkinen @ 2020-11-18 21:11 UTC (permalink / raw)
  To: Hao Wu
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
> > On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> > 
> >> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> >> 
> >> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
> >>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> >>>> 
> >>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
> >>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> >>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
> >>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> >>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> >>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> >>>>> [...]
> >>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
> >>>>>>>>>> I.e. start with optimistic value and lower it until finding
> >>>>>>>>>> the sweet spot.
> >>>>>>>>> 
> >>>>>>>>> The problem is the way this crashes: the TPM seems to be
> >>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
> >>>>>>>>> the entire machine, we could certainly play around with it.  I
> >>>>>>>>> can try alternative mechanisms to see if anything's viable, but
> >>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
> >>>>>>>>> responding to the TIS interface.
> >>>>>>>> 
> >>>>>>>> A quickly scraped idea probably with some holes in it but I was
> >>>>>>>> thinking something like
> >>>>>>>> 
> >>>>>>>> 1. Initially set slow value for latency, this could be the
> >>>>>>>> original 15 ms.
> >>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
> >>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
> >>>>>>>> works
> >>>>>>>>  (the fallback latency could be the existing latency).
> >>>>>>> 
> >>>>>>> Well, yes, that was sort of what I was thinking of doing for the
> >>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
> >>>>>>> assigned vendor ID) which means we can get the information to set
> >>>>>>> the timeout before we have to do any TPM operations.
> >>>>>> 
> >>>>>> I wonder if the timeout issue exists for all TPM commands for the
> >>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
> >>>>>> extending  PCRs ?
> >>>>>> 
> >>>>>> In addition to defining a per TPM vendor based lookup table for
> >>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
> >>>>>> option to allow timeout setting.  This will enable to set the timeout
> >>>>>> based on the specific use.
> >>>>> 
> >>>>> I don't think we need go that far (yet).  The timing change has been in
> >>>>> upstream since:
> >>>>> 
> >>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
> >>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
> >>>>> Date:   Wed May 16 01:51:25 2018 -0400
> >>>>> 
> >>>>>  tpm: reduce polling time to usecs for even finer granularity
> >>>>> 
> >>>>> Which was in the released kernel 4.18: over two years ago.  In all that
> >>>>> time we've discovered two problems: mine which looks to be an artifact
> >>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
> >>>>> That means pretty much every other TPM simply works with the existing
> >>>>> timings
> >>>>> 
> >>>>>> I was also thinking how will we decide the lookup table values for
> >>>>>> each vendor ?
> >>>>> 
> >>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
> >>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
> >>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
> >>>>> just fine with a Nuvoton and the current timings.
> >>>>> 
> >>>>> We can add additional exceptions if they actually turn up.
> >>>> 
> >>>> I'd add a table and fallback.
> >>>> 
> >>> 
> >>> Hi folks,
> >>> 
> >>> I want to follow up this a bit and check whether we reached a consensus 
> >>> on how to fix the timeout issue for Atmel chip.
> >>> 
> >>> Should we revert the changes or introduce the lookup table for chips.
> >>> 
> >>> Is there anything I can help from Rubrik side.
> >>> 
> >>> Thanks
> >>> Hao
> >> 
> >> There is nothing to revert as the previous was not applied but I'm
> >> of course ready to review any new attempts.
> >> 
> > 
> > Hi Jarkko,
> > 
> > By “revert” I meant we revert the timeout value changes by applying
> > the patch I proposed, as the timeout value discussed does cause issues.
> > 
> > Why don’t we apply the patch and improve the perf in the way of not
> > breaking TPMs ? 
> > 
> > Hao
> 
> Hi Jarkko and folks,
> 
> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
> It looks like we currently have the following choices:
> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
>    This is the currently proposed patch.
> 3. Fix the regression by making an exception for the Atmel chip.
> 
> Should we reach consensus on which one we want to pursue before digging
> into the implementation of the patch? In my opinion, I prefer to fix the
> regression with 2, and then pursue 1 as the long-term solution; 3 is
> hacky.

What does option 1 fix for *all* vendors?

> Let me know what you all think.
> 
> Hao

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-11-18 21:11                                             ` Jarkko Sakkinen
@ 2020-11-18 23:23                                               ` Hao Wu
  2021-05-09  6:18                                               ` Hao Wu
  1 sibling, 0 replies; 51+ messages in thread
From: Hao Wu @ 2020-11-18 23:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
>>> 
>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>> 
>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>>> 
>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>> [...]
>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>> 
>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>> 
>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>>>> thinking something like
>>>>>>>>>> 
>>>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>>>> original 15 ms.
>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>>>> works
>>>>>>>>>> (the fallback latency could be the existing latency).
>>>>>>>>> 
>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>> 
>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>>>> extending  PCRs ?
>>>>>>>> 
>>>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>>>> based on the specific use.
>>>>>>> 
>>>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>>>> upstream since:
>>>>>>> 
>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>> 
>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>> 
>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>> timings
>>>>>>> 
>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>> each vendor ?
>>>>>>> 
>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>> 
>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>> 
>>>>>> I'd add a table and fallback.
>>>>>> 
>>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> I want to follow up this a bit and check whether we reached a consensus 
>>>>> on how to fix the timeout issue for Atmel chip.
>>>>> 
>>>>> Should we revert the changes or introduce the lookup table for chips.
>>>>> 
>>>>> Is there anything I can help from Rubrik side.
>>>>> 
>>>>> Thanks
>>>>> Hao
>>>> 
>>>> There is nothing to revert as the previous was not applied but I'm
>>>> of course ready to review any new attempts.
>>>> 
>>> 
>>> Hi Jarkko,
>>> 
>>> By “revert” I meant we revert the timeout value changes by applying
>>> the patch I proposed, as the timeout value discussed does cause issues.
>>> 
>>> Why don’t we apply the patch and improve the perf in the way of not
>>> breaking TPMs ? 
>>> 
>>> Hao
>> 
>> Hi Jarkko and folks,
>> 
>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
>> It looks like we currently have the following choices:
>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
>>    This is the currently proposed patch.
>> 3. Fix the regression by making an exception for the Atmel chip.
>> 
>> Should we reach consensus on which one we want to pursue before digging
>> into the implementation of the patch? In my opinion, I prefer to fix the
>> regression with 2, and then pursue 1 as the long-term solution; 3 is
>> hacky.
> 
> What does option 1 fix for *all* vendors?
> 
>> Let me know what you all think.
>> 
>> Hao
> 
> /Jarkko

I meant that the timing is potentially sensitive per vendor;
option 1 is a solution for all vendors if this issue happens again.
Option 1 is just what you mentioned in the earlier discussion:
>>>>>> I'd add a table and fallback.


Hao


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-11-18 21:11                                             ` Jarkko Sakkinen
  2020-11-18 23:23                                               ` Hao Wu
@ 2021-05-09  6:18                                               ` Hao Wu
  2021-05-09  6:31                                                 ` Hao Wu
  1 sibling, 1 reply; 51+ messages in thread
From: Hao Wu @ 2021-05-09  6:18 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: James Bottomley, Nayna, peterhuewe, jgg, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
>>> 
>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>> 
>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>>> 
>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>> [...]
>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>> 
>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>> 
>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>>>> thinking something like
>>>>>>>>>> 
>>>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>>>> original 15 ms.
>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>>>> works
>>>>>>>>>> (the fallback latency could be the existing latency).
>>>>>>>>> 
>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>> 
>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>>>> extending  PCRs ?
>>>>>>>> 
>>>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>>>> based on the specific use.
>>>>>>> 
>>>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>>>> upstream since:
>>>>>>> 
>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>> 
>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>> 
>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>> timings
>>>>>>> 
>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>> each vendor ?
>>>>>>> 
>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>> 
>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>> 
>>>>>> I'd add a table and fallback.
>>>>>> 
>>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> I want to follow up this a bit and check whether we reached a consensus 
>>>>> on how to fix the timeout issue for Atmel chip.
>>>>> 
>>>>> Should we revert the changes or introduce the lookup table for chips.
>>>>> 
>>>>> Is there anything I can help from Rubrik side.
>>>>> 
>>>>> Thanks
>>>>> Hao
>>>> 
>>>> There is nothing to revert as the previous was not applied but I'm
>>>> of course ready to review any new attempts.
>>>> 
>>> 
>>> Hi Jarkko,
>>> 
>>> By “revert” I meant we revert the timeout value changes by applying
>>> the patch I proposed, as the timeout value discussed does cause issues.
>>> 
>>> Why don’t we apply the patch and improve the perf in the way of not
>>> breaking TPMs ? 
>>> 
>>> Hao
>> 
>> Hi Jarkko and folks,
>> 
>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
>> It looks like we currently have the following choices:
>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
>>    This is the currently proposed patch.
>> 3. Fix the regression by making an exception for the Atmel chip.
>> 
>> Should we reach consensus on which one we want to pursue before digging
>> into the implementation of the patch? In my opinion, I prefer to fix the
>> regression with 2, and then pursue 1 as the long-term solution; 3 is
>> hacky.
> 
> What does option 1 fix for *all* vendors?
> 
>> Let me know what you all think.
>> 
>> Hao
> 
> /Jarkko

Hi Jarkko and folks,

It has been a while again. In my previous message I answered Jarkko’s question about option 1.
Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
Somehow I couldn’t find the last message I sent, but it is at
https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/

At a high level, option 1 is to add a timing lookup table per manufacturer, so that we can
configure the timing for each chip respectively. Then we don’t need to worry that fixing the ATMEL
timing may cause performance degradation for other chips.
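
As a rough sketch, the driver could consult such a table once at probe time, reusing the vendor ID it already reads; tpm_wait_stat_us() here is the hypothetical lookup helper sketched earlier in the thread:

```c
/*
 * Illustrative only: pick the wait_for_tpm_stat sleep time after the
 * TIS vendor ID has been read during tpm_tis_core_init().
 */
priv->manufacturer_id = vendor;
chip->timeout_wait_stat = tpm_wait_stat_us(priv->manufacturer_id);
```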

I do want to push the fix into the TPM driver; this is likely to be hit again going forward when people
refactor without testing chips from all manufacturers.

Let me know how I should push this forward.

Thanks
Hao


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-05-09  6:18                                               ` Hao Wu
@ 2021-05-09  6:31                                                 ` Hao Wu
  2021-05-10  2:17                                                   ` Mimi Zohar
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2021-05-09  6:31 UTC (permalink / raw)
  To: Jarkko Sakkinen, James Bottomley, peterhuewe, jgg
  Cc: Nayna, arnd, gregkh, Hamza Attak, why2jjj.linux, zohar,
	linux-integrity, Paul Menzel, Ken Goldman, Seungyeop Han,
	Shrihari Kalkar, Anish Jhaveri

> On May 8, 2021, at 11:18 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
>> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>> 
>> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
>>>> 
>>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>> 
>>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>>>> 
>>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>>> [...]
>>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>>> 
>>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>>>>> thinking something like
>>>>>>>>>>> 
>>>>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>>>>> original 15 ms.
>>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>>>>> works
>>>>>>>>>>> (the fallback latency could be the existing latency).
>>>>>>>>>> 
>>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>>> 
>>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>>>>> extending  PCRs ?
>>>>>>>>> 
>>>>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>>>>> based on the specific use.
>>>>>>>> 
>>>>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>>>>> upstream since:
>>>>>>>> 
>>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>>> 
>>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>>> 
>>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>>> timings
>>>>>>>> 
>>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>>> each vendor ?
>>>>>>>> 
>>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>>> 
>>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>>> 
>>>>>>> I'd add a table and fallback.
>>>>>>> 
>>>>>> 
>>>>>> Hi folks,
>>>>>> 
>>>>>> I want to follow up this a bit and check whether we reached a consensus 
>>>>>> on how to fix the timeout issue for Atmel chip.
>>>>>> 
>>>>>> Should we revert the changes or introduce the lookup table for chips.
>>>>>> 
>>>>>> Is there anything I can help from Rubrik side.
>>>>>> 
>>>>>> Thanks
>>>>>> Hao
>>>>> 
>>>>> There is nothing to revert as the previous was not applied but I'm
>>>>> of course ready to review any new attempts.
>>>>> 
>>>> 
>>>> Hi Jarkko,
>>>> 
>>>> By “revert” I meant we revert the timeout value changes by applying
>>>> the patch I proposed, as the timeout value discussed does cause issues.
>>>> 
>>>> Why don’t we apply the patch and improve the perf in the way of not
>>>> breaking TPMs ? 
>>>> 
>>>> Hao
>>> 
>>> Hi Jarkko and folks,
>>> 
>>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
>>> It looks like we currently have the following choices:
>>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
>>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
>>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
>>>    This is the currently proposed patch.
>>> 3. Fix the regression by making an exception for the Atmel chip.
>>> 
>>> Should we reach consensus on which one we want to pursue before digging
>>> into the implementation of the patch? In my opinion, I prefer to fix the
>>> regression with 2, and then pursue 1 as the long-term solution; 3 is
>>> hacky.
>> 
>> What does option 1 fix for *all* vendors?
>> 
>>> Let me know what you all think.
>>> 
>>> Hao
>> 
>> /Jarkko
> 
> Hi Jarkko and folks,
> 
> It has been a while again. In my previous message I answered Jarkko’s question about option 1.
> Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
> Somehow I couldn’t find the last message I sent, but it is at
> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> 
> At a high level, option 1 is to add a timing lookup table per manufacturer, so that we can
> configure the timing for each chip respectively. Then we don’t need to worry that fixing the ATMEL
> timing may cause performance degradation for other chips.
> 
> I do want to push the fix into the TPM driver; this is likely to be hit again going forward when people
> refactor without testing chips from all manufacturers.
> 
> Let me know how I should push this forward.
> 
> Thanks
> Hao
> 
It looks like Jarkko’s email address (jarkko.sakkinen@linux.intel.com) is unreachable now;
can another TPM maintainer / reviewer help make a call and unblock this?

Thanks
Hao

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-05-09  6:31                                                 ` Hao Wu
@ 2021-05-10  2:17                                                   ` Mimi Zohar
  2021-05-10  3:15                                                     ` Hao Wu
  2021-05-10 17:28                                                     ` Jarkko Sakkinen
  0 siblings, 2 replies; 51+ messages in thread
From: Mimi Zohar @ 2021-05-10  2:17 UTC (permalink / raw)
  To: Hao Wu, Jarkko Sakkinen, James Bottomley, peterhuewe, jgg
  Cc: Nayna, arnd, gregkh, Hamza Attak, why2jjj.linux, zohar,
	linux-integrity, Paul Menzel, Ken Goldman, Seungyeop Han,
	Shrihari Kalkar, Anish Jhaveri, Jarkko Sakkinen

[Cc'ing Jarkko Sakkinen <jarkko@kernel.org>]

On Sat, 2021-05-08 at 23:31 -0700, Hao Wu wrote:
> > On May 8, 2021, at 11:18 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> > 
> >> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> >> 
> >> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
> >>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> >>>> 
> >>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> >>>>> 
> >>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
> >>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> >>>>>>> 
> >>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
> >>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> >>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
> >>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> >>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> >>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> >>>>>>>> [...]
> >>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
> >>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
> >>>>>>>>>>>>> the sweet spot.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
> >>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
> >>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
> >>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
> >>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
> >>>>>>>>>>>> responding to the TIS interface.
> >>>>>>>>>>> 
> >>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
> >>>>>>>>>>> thinking something like
> >>>>>>>>>>> 
> >>>>>>>>>>> 1. Initially set slow value for latency, this could be the
> >>>>>>>>>>> original 15 ms.
> >>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
> >>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
> >>>>>>>>>>> works
> >>>>>>>>>>> (the fallback latency could be the existing latency).
> >>>>>>>>>> 
> >>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
> >>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
> >>>>>>>>>> assigned vendor ID) which means we can get the information to set
> >>>>>>>>>> the timeout before we have to do any TPM operations.
> >>>>>>>>> 
> >>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
> >>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
> >>>>>>>>> extending  PCRs ?
> >>>>>>>>> 
> >>>>>>>>> In addition to defining a per TPM vendor based lookup table for
> >>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
> >>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
> >>>>>>>>> based on the specific use.
> >>>>>>>> 
> >>>>>>>> I don't think we need go that far (yet).  The timing change has been in
> >>>>>>>> upstream since:
> >>>>>>>> 
> >>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
> >>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
> >>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
> >>>>>>>> 
> >>>>>>>> tpm: reduce polling time to usecs for even finer granularity
> >>>>>>>> 
> >>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
> >>>>>>>> time we've discovered two problems: mine which looks to be an artifact
> >>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
> >>>>>>>> That means pretty much every other TPM simply works with the existing
> >>>>>>>> timings
> >>>>>>>> 
> >>>>>>>>> I was also thinking how will we decide the lookup table values for
> >>>>>>>>> each vendor ?
> >>>>>>>> 
> >>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
> >>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
> >>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
> >>>>>>>> just fine with a Nuvoton and the current timings.
> >>>>>>>> 
> >>>>>>>> We can add additional exceptions if they actually turn up.
> >>>>>>> 
> >>>>>>> I'd add a table and fallback.
> >>>>>>> 
> >>>>>> 
> >>>>>> Hi folks,
> >>>>>> 
> >>>>>> I want to follow up this a bit and check whether we reached a consensus 
> >>>>>> on how to fix the timeout issue for Atmel chip.
> >>>>>> 
> >>>>>> Should we revert the changes or introduce the lookup table for chips.
> >>>>>> 
> >>>>>> Is there anything I can help from Rubrik side.
> >>>>>> 
> >>>>>> Thanks
> >>>>>> Hao
> >>>>> 
> >>>>> There is nothing to revert as the previous was not applied but I'm
> >>>>> of course ready to review any new attempts.
> >>>>> 
> >>>> 
> >>>> Hi Jarkko,
> >>>> 
> >>>> By “revert” I meant we revert the timeout value changes by applying
> >>>> the patch I proposed, as the timeout value discussed does cause issues.
> >>>> 
> >>>> Why don’t we apply the patch and improve the perf in the way of not
> >>>> breaking TPMs ? 
> >>>> 
> >>>> Hao
> >>> 
> >>> Hi Jarkko and folks,
> >>> 
> >>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
> >>> It looks like we currently have the following choices:
> >>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
> >>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
> >>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
> >>>    This is the currently proposed patch.
> >>> 3. Fix the regression by making an exception for the Atmel chip.
> >>> 
> >>> Should we reach consensus on which one we want to pursue before digging
> >>> into the implementation of the patch? In my opinion, I prefer to fix the
> >>> regression with 2, and then pursue 1 as the long-term solution; 3 is
> >>> hacky.
> >> 
> >> What does option 1 fix for *all* vendors?
> >> 
> >>> Let me know what you all think.
> >>> 
> >>> Hao
> >> 
> >> /Jarkko
> > 
> > Hi Jarkko and folks,
> > 
> > It has been a while again. In my previous message I answered Jarkko’s question about option 1.
> > Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
> > Somehow I couldn’t find the last message I sent, but it is at
> > https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> > 
> > At a high level, option 1 is to add a timing lookup table per manufacturer, so that we can
> > configure the timing for each chip respectively. Then we don’t need to worry that fixing the ATMEL
> > timing may cause performance degradation for other chips.
> > 
> > I do want to push the fix into the TPM driver; this is likely to be hit again going forward when people
> > refactor without testing chips from all manufacturers.
> > 
> > Let me know how I should push this forward.
> > 
> > Thanks
> > Hao
> > 
> It looks like Jarkko’s email address (jarkko.sakkinen@linux.intel.com) is unreachable now;
> can another TPM maintainer / reviewer help make a call and unblock this?

A while ago Jarkko asked everyone to use his kernel.org address.

Mimi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-05-10  2:17                                                   ` Mimi Zohar
@ 2021-05-10  3:15                                                     ` Hao Wu
  2021-05-10 17:28                                                     ` Jarkko Sakkinen
  1 sibling, 0 replies; 51+ messages in thread
From: Hao Wu @ 2021-05-10  3:15 UTC (permalink / raw)
  To: Mimi Zohar, jarkko
  Cc: James Bottomley, peterhuewe, jgg, Nayna, arnd, gregkh,
	Hamza Attak, why2jjj.linux, zohar, linux-integrity, Paul Menzel,
	Ken Goldman, Seungyeop Han, Shrihari Kalkar, Anish Jhaveri,
	Jarkko Sakkinen

> On May 9, 2021, at 7:17 PM, Mimi Zohar <zohar@linux.ibm.com> wrote:
> 
> [Cc'ing Jarkko Sakkinen <jarkko@kernel.org>]
> 
> On Sat, 2021-05-08 at 23:31 -0700, Hao Wu wrote:
>>> On May 8, 2021, at 11:18 PM, Hao Wu <hao.wu@rubrik.com> wrote:
>>> 
>>>> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>> 
>>>> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
>>>>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
>>>>>> 
>>>>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>>>> 
>>>>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>>>>>>> [...]
>>>>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>>>>>>> the sweet spot.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>>>>>>> responding to the TIS interface.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>>>>>>> thinking something like
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>>>>>>> original 15 ms.
>>>>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>>>>>>> works
>>>>>>>>>>>>> (the fallback latency could be the existing latency).
>>>>>>>>>>>> 
>>>>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>>>>>>> the timeout before we have to do any TPM operations.
>>>>>>>>>>> 
>>>>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>>>>>>> extending  PCRs ?
>>>>>>>>>>> 
>>>>>>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>>>>>>> based on the specific use.
>>>>>>>>>> 
>>>>>>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>>>>>>> upstream since:
>>>>>>>>>> 
>>>>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>>>>>>> 
>>>>>>>>>> tpm: reduce polling time to usecs for even finer granularity
>>>>>>>>>> 
>>>>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>>>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>>>>>>> That means pretty much every other TPM simply works with the existing
>>>>>>>>>> timings
>>>>>>>>>> 
>>>>>>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>>>>>>> each vendor ?
>>>>>>>>>> 
>>>>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>>>>>>> just fine with a Nuvoton and the current timings.
>>>>>>>>>> 
>>>>>>>>>> We can add additional exceptions if they actually turn up.
>>>>>>>>> 
>>>>>>>>> I'd add a table and fallback.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi folks,
>>>>>>>> 
>>>>>>>> I want to follow up this a bit and check whether we reached a consensus 
>>>>>>>> on how to fix the timeout issue for Atmel chip.
>>>>>>>> 
>>>>>>>> Should we revert the changes or introduce the lookup table for chips.
>>>>>>>> 
>>>>>>>> Is there anything I can help from Rubrik side.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Hao
>>>>>>> 
>>>>>>> There is nothing to revert as the previous was not applied but I'm
>>>>>>> of course ready to review any new attempts.
>>>>>>> 
>>>>>> 
>>>>>> Hi Jarkko,
>>>>>> 
>>>>>> By “revert” I meant we revert the timeout value changes by applying
>>>>>> the patch I proposed, as the timeout value discussed does cause issues.
>>>>>> 
>>>>>> Why don’t we apply the patch and improve the perf in the way of not
>>>>>> breaking TPMs ? 
>>>>>> 
>>>>>> Hao
>>>>> 
>>>>> Hi Jarkko and folks,
>>>>> 
>>>>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
>>>>> It looks like we currently have the following choices:
>>>>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
>>>>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
>>>>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
>>>>>    This is the currently proposed patch.
>>>>> 3. Fix the regression by making an exception for the Atmel chip.
>>>>> 
>>>>> Should we reach consensus on which one we want to pursue before digging
>>>>> into the implementation of the patch? In my opinion, I prefer to fix the
>>>>> regression with 2, and then pursue 1 as the long-term solution; 3 is
>>>>> hacky.
>>>> 
>>>> What does option 1 fix for *all* vendors?
>>>> 
>>>>> Let me know what you all think.
>>>>> 
>>>>> Hao
>>>> 
>>>> /Jarkko
>>> 
>>> Hi Jarkko and folks,
>>> 
>>> It has been a while again. In my previous message I answered Jarkko’s question about option 1.
>>> Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
>>> Somehow I couldn’t find the last message I sent, but it is at
>>> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
>>> 
>>> At a high level, option 1 is to add a timing lookup table per manufacturer, so that we can
>>> configure the timing for each chip respectively. Then we don’t need to worry that fixing the ATMEL
>>> timing may cause performance degradation for other chips.
>>> 
>>> I do want to push the fix into the TPM driver; this is likely to be hit again going forward when people
>>> refactor without testing chips from all manufacturers.
>>> 
>>> Let me know how I should push this forward.
>>> 
>>> Thanks
>>> Hao
>>> 
>> It looks like Jarkko’s email address (jarkko.sakkinen@linux.intel.com) is unreachable now;
>> can another TPM maintainer / reviewer help make a call and unblock this?
> 
> A while ago Jarkko asked everyone to use his kernel.org address.
> 
> Mimi

Ah thanks Mimi, just found Jarkko’s address.

Jarkko please check the message above when you have a chance.

Hao


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-05-10  2:17                                                   ` Mimi Zohar
  2021-05-10  3:15                                                     ` Hao Wu
@ 2021-05-10 17:28                                                     ` Jarkko Sakkinen
  1 sibling, 0 replies; 51+ messages in thread
From: Jarkko Sakkinen @ 2021-05-10 17:28 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Hao Wu, Jarkko Sakkinen, James Bottomley, peterhuewe, jgg, Nayna,
	arnd, gregkh, Hamza Attak, why2jjj.linux, zohar, linux-integrity,
	Paul Menzel, Ken Goldman, Seungyeop Han, Shrihari Kalkar,
	Anish Jhaveri

On Sun, May 09, 2021 at 10:17:01PM -0400, Mimi Zohar wrote:
> [Cc'ing Jarkko Sakkinen <jarkko@kernel.org>]
> 
> On Sat, 2021-05-08 at 23:31 -0700, Hao Wu wrote:
> > > On May 8, 2021, at 11:18 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> > > 
> > >> On Nov 18, 2020, at 1:11 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > >> 
> > >> On Fri, Nov 13, 2020 at 08:39:28PM -0800, Hao Wu wrote:
> > >>>> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> > >>>> 
> > >>>>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > >>>>> 
> > >>>>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
> > >>>>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > >>>>>>> 
> > >>>>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
> > >>>>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
> > >>>>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
> > >>>>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
> > >>>>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
> > >>>>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
> > >>>>>>>> [...]
> > >>>>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
> > >>>>>>>>>>>>> I.e. start with optimistic value and lower it until finding
> > >>>>>>>>>>>>> the sweet spot.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> The problem is the way this crashes: the TPM seems to be
> > >>>>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
> > >>>>>>>>>>>> the entire machine, we could certainly play around with it.  I
> > >>>>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
> > >>>>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
> > >>>>>>>>>>>> responding to the TIS interface.
> > >>>>>>>>>>> 
> > >>>>>>>>>>> A quickly scraped idea probably with some holes in it but I was
> > >>>>>>>>>>> thinking something like
> > >>>>>>>>>>> 
> > >>>>>>>>>>> 1. Initially set slow value for latency, this could be the
> > >>>>>>>>>>> original 15 ms.
> > >>>>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
> > >>>>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
> > >>>>>>>>>>> works
> > >>>>>>>>>>> (the fallback latency could be the existing latency).
> > >>>>>>>>>> 
> > >>>>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
> > >>>>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
> > >>>>>>>>>> assigned vendor ID) which means we can get the information to set
> > >>>>>>>>>> the timeout before we have to do any TPM operations.
> > >>>>>>>>> 
> > >>>>>>>>> I wonder if the timeout issue exists for all TPM commands for the
> > >>>>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
> > >>>>>>>>> extending  PCRs ?
> > >>>>>>>>> 
> > >>>>>>>>> In addition to defining a per TPM vendor based lookup table for
> > >>>>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
> > >>>>>>>>> option to allow timeout setting.  This will enable to set the timeout
> > >>>>>>>>> based on the specific use.
> > >>>>>>>> 
> > >>>>>>>> I don't think we need go that far (yet).  The timing change has been in
> > >>>>>>>> upstream since:
> > >>>>>>>> 
> > >>>>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
> > >>>>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
> > >>>>>>>> Date:   Wed May 16 01:51:25 2018 -0400
> > >>>>>>>> 
> > >>>>>>>> tpm: reduce polling time to usecs for even finer granularity
> > >>>>>>>> 
> > >>>>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
> > >>>>>>>> time we've discovered two problems: mine which looks to be an artifact
> > >>>>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
> > >>>>>>>> That means pretty much every other TPM simply works with the existing
> > >>>>>>>> timings
> > >>>>>>>> 
> > >>>>>>>>> I was also thinking how will we decide the lookup table values for
> > >>>>>>>>> each vendor ?
> > >>>>>>>> 
> > >>>>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
> > >>>>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
> > >>>>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
> > >>>>>>>> just fine with a Nuvoton and the current timings.
> > >>>>>>>> 
> > >>>>>>>> We can add additional exceptions if they actually turn up.
> > >>>>>>> 
> > >>>>>>> I'd add a table and fallback.
> > >>>>>>> 
> > >>>>>> 
> > >>>>>> Hi folks,
> > >>>>>> 
> > >>>>>> I want to follow up this a bit and check whether we reached a consensus 
> > >>>>>> on how to fix the timeout issue for Atmel chip.
> > >>>>>> 
> > >>>>>> Should we revert the changes or introduce the lookup table for chips.
> > >>>>>> 
> > >>>>>> Is there anything I can help from Rubrik side.
> > >>>>>> 
> > >>>>>> Thanks
> > >>>>>> Hao
> > >>>>> 
> > >>>>> There is nothing to revert as the previous was not applied but I'm
> > >>>>> of course ready to review any new attempts.
> > >>>>> 
> > >>>> 
> > >>>> Hi Jarkko,
> > >>>> 
> > >>>> By “revert” I meant we revert the timeout value changes by applying
> > >>>> the patch I proposed, as the timeout value discussed does cause issues.
> > >>>> 
> > >>>> Why don’t we apply the patch and improve the perf in the way of not
> > >>>> breaking TPMs ? 
> > >>>> 
> > >>>> Hao
> > >>> 
> > >>> Hi Jarkko and folks,
> > >>> 
> > >>> It’s been a while since our last discussion. I want to push a fix upstream for the Atmel chip.
> > >>> It looks like we currently have the following choices:
> > >>> 1. Generic fix for all vendors: have a lookup table for the sleep time of wait_for_tpm_stat
> > >>>    (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch).
> > >>> 2. Quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15 ms.
> > >>>    This is the currently proposed patch.
> > >>> 3. Fix the regression by making an exception for the Atmel chip.
> > >>> 
> > >>> Should we reach consensus on which one we want to pursue before digging
> > >>> into the implementation of the patch? In my opinion, I prefer to fix the
> > >>> regression with 2, and then pursue 1 as the long-term solution; 3 is
> > >>> hacky.
> > >> 
> > >> What does option 1 fix for *all* vendors?
> > >> 
> > >>> Let me know what you all think.
> > >>> 
> > >>> Hao
> > >> 
> > >> /Jarkko
> > > 
> > > Hi Jarkko and folks,
> > > 
> > > It has been a while again. In my previous message I answered Jarkko’s question about option 1.
> > > Jarkko, let me know if it is clear to you, or if you have further questions or suggestions on what to do next.
> > > Somehow I couldn’t find the last message I sent, but it is at
> > > https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> > > 
> > > At a high level, option 1 is to add a timing lookup table per manufacturer, so that we can
> > > configure the timing for each chip respectively. Then we don’t need to worry that fixing the ATMEL
> > > timing may cause performance degradation for other chips.
> > > 
> > > I do want to push the fix into the TPM driver; this is likely to be hit again going forward when people
> > > refactor without testing chips from all manufacturers.
> > > 
> > > Let me know how I should push this forward.
> > > 
> > > Thanks
> > > Hao
> > > 
> > It looks like Jarkko’s email address (jarkko.sakkinen@linux.intel.com) is unreachable now;
> > can another TPM maintainer / reviewer help make a call and unblock this?
> 
> A while ago Jarkko asked everyone to use his kernel.org address.

Hao, I cannot really say that much about a patch that does not exist.

The whole consensus thing based on a plan is just the semantically wrong
way to look at things. If you have an idea for a patch, make your
own choice and just send the patch.

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-29 20:06     ` Jarkko Sakkinen
@ 2021-06-30  4:27       ` Hao Wu
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Wu @ 2021-06-30  4:27 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Shrihari Kalkar, Seungyeop Han, Anish Jhaveri, peterhuewe, jgg,
	linux-integrity, Paul Menzel, Ken Goldman, zohar, why2jjj.linux,
	Hamza Attak, gregkh, arnd, Nayna, James.Bottomley

> On Jun 29, 2021, at 1:06 PM, Jarkko Sakkinen <jarkko@kernel.org> wrote:
> 
> On Wed, Jun 23, 2021 at 10:49:27PM -0700, Hao Wu wrote:
>>> On Jun 23, 2021, at 6:35 AM, Jarkko Sakkinen <jarkko@kernel.org> wrote:
>>> 
>>> On Sun, Jun 20, 2021 at 04:18:09PM -0700, Hao Wu wrote:
>>>> This is a fix for the ATMEL TPM crash bug reported in
>>>> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
>>>> 
>>>> According to the discussions in the original thread,
>>>> we don't want to revert the timeout of wait_for_tpm_stat
>>>> for non-ATMEL chips, as that would bring back the performance cost.
>>>> For investigation and analysis of why wait_for_tpm_stat
>>>> caused the issue, and how the regression was introduced,
>>>> please read the original thread above.
>>>> 
>>>> Thus the proposed fix here is to only revert the timeout
>>>> for ATMEL chips by checking the vendor ID.
>>>> 
>>>> Test Plan:
>>>> - Run fixed kernel with ATMEL TPM chips and see crash
>>>> has been fixed.
>>>> - Run fixed kernel with non-ATMEL TPM chips, and confirm
>>>> the timeout has not been changed.
>>> 
>>> Please move the test plan right before the diffstat if you want to include it,
>>> so that it does not go into the commit log.
>> Hi Jarkko, I am not sure whether I understood your suggestion. I removed
>> the test plan from the commit message in an updated commit:
>> https://patchwork.kernel.org/project/linux-integrity/patch/20210624053321.861-1-hao.wu@rubrik.com/
>> 
>> Let me know if I misunderstood this. I am fine with not including the test plan
>> if it is not something expected by the Linux community.
>> I personally think it is helpful for judging the confidence of the commit.
>> 
>>> 
>>>> ---
> 
> You can add it right here. Then it won't be included in the actual
> commit log but will still be available in the patch.
> 
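(For context, the convention described above is standard git behaviour: everything after the "---" separator survives in the mail but is dropped by `git am`. A schematic sketch, not the actual patch:)

```
Subject: [PATCH] tpm: ...

Commit message that ends up in the git log.

Signed-off-by: Author <author@example.com>
---
Test Plan:
- notes kept in the mail, dropped from the commit log

 drivers/char/tpm/tpm.h | 1 +
 1 file changed, 1 insertion(+)
```
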
I see, thanks Jarkko. I updated the patch:
https://patchwork.kernel.org/project/linux-integrity/patch/20210630042205.30051-1-hao.wu@rubrik.com/
Hopefully it makes more sense now.

> /Jarkko 

Hao

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-24  5:33 ` Hao Wu
@ 2021-06-29 20:07   ` Jarkko Sakkinen
  0 siblings, 0 replies; 51+ messages in thread
From: Jarkko Sakkinen @ 2021-06-29 20:07 UTC (permalink / raw)
  To: Hao Wu
  Cc: shrihari.kalkar, seungyeop.han, anish.jhaveri, peterhuewe, jgg,
	linux-integrity, pmenzel, kgold, zohar, why2jjj.linux, hamza,
	gregkh, arnd, nayna, James.Bottomley

On Wed, Jun 23, 2021 at 10:33:21PM -0700, Hao Wu wrote:
> This is a fix for the ATMEL TPM crash bug reported in
> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> 
> According to the discussions in the original thread,
> we don't want to revert the timeout of wait_for_tpm_stat
> for non-ATMEL chips, which brings back the performance cost.
> For investigation and analysis of why wait_for_tpm_stat
> caused the issue, and how the regression was introduced,
> please read the original thread above.
> 
> Thus the proposed fix here is to only revert the timeout
> for ATMEL chips by checking the vendor ID.
> ---

Lacking "Fixes" and "Signed-off-by".

/Jarkko

>  drivers/char/tpm/tpm.h          |  9 ++++++++-
>  drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
>  include/linux/tpm.h             |  2 ++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 283f78211c3a..bc6aa7f9e119 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -42,7 +42,9 @@ enum tpm_timeout {
>  	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>  	TPM_TIMEOUT_POLL = 1,	/* msecs */
>  	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
> +	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
> +	TPM_TIMEOUT_WAIT_STAT = 500,	/* usecs */
> +	TPM_ATML_TIMEOUT_WAIT_STAT = 15000	/* usecs */
>  };
>  
>  /* TPM addresses */
> @@ -189,6 +191,11 @@ static inline void tpm_msleep(unsigned int delay_msec)
>  		     delay_msec * 1000);
>  };
>  
> +static inline void tpm_usleep(unsigned int delay_usec)
> +{
> +	usleep_range(delay_usec - TPM_TIMEOUT_RANGE_US, delay_usec);
> +};
> +
>  int tpm_chip_start(struct tpm_chip *chip);
>  void tpm_chip_stop(struct tpm_chip *chip);
>  struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 55b9d3965ae1..9ddd4edfe1c2 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -80,8 +80,12 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>  		}
>  	} else {
>  		do {
> -			usleep_range(TPM_TIMEOUT_USECS_MIN,
> -				     TPM_TIMEOUT_USECS_MAX);
> +			if (chip->timeout_wait_stat && 
> +				chip->timeout_wait_stat >= TPM_TIMEOUT_WAIT_STAT) {
> +				tpm_usleep((unsigned int)(chip->timeout_wait_stat));
> +			} else {
> +				tpm_usleep((unsigned int)(TPM_TIMEOUT_WAIT_STAT));
> +			}
>  			status = chip->ops->status(chip);
>  			if ((status & mask) == mask)
>  				return 0;
> @@ -934,6 +938,8 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>  	chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX);
>  	chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX);
>  	chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX);
> +	/* init timeout for wait_for_tpm_stat */
> +	chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
>  	priv->phy_ops = phy_ops;
>  	dev_set_drvdata(&chip->dev, priv);
>  
> @@ -983,6 +989,15 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>  
>  	priv->manufacturer_id = vendor;
>  
> +	switch (priv->manufacturer_id) {
> +	case TPM_VID_ATML:
> +        /* ATMEL chip needs longer timeout to avoid crash */
> +		chip->timeout_wait_stat = TPM_ATML_TIMEOUT_WAIT_STAT;
> +		break;
> +	default:
> +		chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
> +	}
> +
>  	rc = tpm_tis_read8(priv, TPM_RID(0), &rid);
>  	if (rc < 0)
>  		goto out_err;
> diff --git a/include/linux/tpm.h b/include/linux/tpm.h
> index aa11fe323c56..35f2a0260d76 100644
> --- a/include/linux/tpm.h
> +++ b/include/linux/tpm.h
> @@ -150,6 +150,7 @@ struct tpm_chip {
>  	bool timeout_adjusted;
>  	unsigned long duration[TPM_NUM_DURATIONS]; /* jiffies */
>  	bool duration_adjusted;
> +	unsigned long timeout_wait_stat; /* usecs */
>  
>  	struct dentry *bios_dir[TPM_NUM_EVENT_LOG_FILES];
>  
> @@ -269,6 +270,7 @@ enum tpm2_cc_attrs {
>  #define TPM_VID_INTEL    0x8086
>  #define TPM_VID_WINBOND  0x1050
>  #define TPM_VID_STM      0x104A
> +#define TPM_VID_ATML     0x1114
>  
>  enum tpm_chip_flags {
>  	TPM_CHIP_FLAG_TPM2		= BIT(1),
> -- 
> 2.29.0.vfs.0.0
> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-24  5:49   ` Hao Wu
@ 2021-06-29 20:06     ` Jarkko Sakkinen
  2021-06-30  4:27       ` Hao Wu
  0 siblings, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2021-06-29 20:06 UTC (permalink / raw)
  To: Hao Wu
  Cc: Shrihari Kalkar, Seungyeop Han, Anish Jhaveri, peterhuewe, jgg,
	linux-integrity, Paul Menzel, Ken Goldman, zohar, why2jjj.linux,
	Hamza Attak, gregkh, arnd, Nayna, James.Bottomley

On Wed, Jun 23, 2021 at 10:49:27PM -0700, Hao Wu wrote:
> > On Jun 23, 2021, at 6:35 AM, Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > 
> > On Sun, Jun 20, 2021 at 04:18:09PM -0700, Hao Wu wrote:
> >> This is a fix for the ATMEL TPM crash bug reported in
> >> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> >> 
> >> According to the discussions in the original thread,
> >> we don't want to revert the timeout of wait_for_tpm_stat
> >> for non-ATMEL chips, which brings back the performance cost.
> >> For investigation and analysis of why wait_for_tpm_stat
> >> caused the issue, and how the regression was introduced,
> >> please read the original thread above.
> >> 
> >> Thus the proposed fix here is to only revert the timeout
> >> for ATMEL chips by checking the vendor ID.
> >> 
> >> Test Plan:
> >> - Run fixed kernel with ATMEL TPM chips and see crash
> >>  has been fixed.
> >> - Run fixed kernel with non-ATMEL TPM chips, and confirm
> >>  the timeout has not been changed.
> > 
> > Please move the test plan right before the diffstat if you want to include such,
> > so that it does not go into the commit log.
> Hi Jarkko, I am not sure whether I understood your suggestion. I removed
> the test plan from the commit message in an updated commit
> https://patchwork.kernel.org/project/linux-integrity/patch/20210624053321.861-1-hao.wu@rubrik.com/
> 
> Let me know if I misunderstood this. I am fine with not including the test plan,
> if this is not something expected by the Linux community.
> I personally think it is helpful for understanding the confidence in the commit.
> 
> > 
> >> ---

You can add it right here. Then it won't be included in the actual
commit log but will still be available in the patch.
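
For illustration, the layout that advice describes: git am takes everything
above the first "---" as the commit message and ignores the notes between
the "---" and the diffstat, so a test plan placed there stays visible in the
posted patch without entering the commit log:

```
Thus the proposed fix here is to only revert the timeout
for ATMEL chips by checking the vendor ID.

Signed-off-by: Hao Wu <hao.wu@rubrik.com>
---
Test Plan:
- Run fixed kernel with ATMEL TPM chips and see the crash
  has been fixed.

 drivers/char/tpm/tpm.h          |  9 ++++++++-
 drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
```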

/Jarkko 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-23 13:35 ` Jarkko Sakkinen
@ 2021-06-24  5:49   ` Hao Wu
  2021-06-29 20:06     ` Jarkko Sakkinen
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Wu @ 2021-06-24  5:49 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Shrihari Kalkar, Seungyeop Han, Anish Jhaveri, peterhuewe, jgg,
	linux-integrity, Paul Menzel, Ken Goldman, zohar, why2jjj.linux,
	Hamza Attak, gregkh, arnd, Nayna, James.Bottomley

> On Jun 23, 2021, at 6:35 AM, Jarkko Sakkinen <jarkko@kernel.org> wrote:
> 
> On Sun, Jun 20, 2021 at 04:18:09PM -0700, Hao Wu wrote:
>> This is a fix for the ATMEL TPM crash bug reported in
>> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
>> 
>> According to the discussions in the original thread,
>> we don't want to revert the timeout of wait_for_tpm_stat
>> for non-ATMEL chips, which brings back the performance cost.
>> For investigation and analysis of why wait_for_tpm_stat
>> caused the issue, and how the regression was introduced,
>> please read the original thread above.
>> 
>> Thus the proposed fix here is to only revert the timeout
>> for ATMEL chips by checking the vendor ID.
>> 
>> Test Plan:
>> - Run fixed kernel with ATMEL TPM chips and see crash
>>  has been fixed.
>> - Run fixed kernel with non-ATMEL TPM chips, and confirm
>>  the timeout has not been changed.
> 
> Please move the test plan right before the diffstat if you want to include such,
> so that it does not go into the commit log.
Hi Jarkko, I am not sure whether I understood your suggestion. I removed
the test plan from the commit message in an updated commit
https://patchwork.kernel.org/project/linux-integrity/patch/20210624053321.861-1-hao.wu@rubrik.com/

Let me know if I misunderstood this. I am fine with not including the test plan,
if this is not something expected by the Linux community.
I personally think it is helpful for understanding the confidence in the commit.

> 
>> ---
>> drivers/char/tpm/tpm.h          |  9 ++++++++-
>> drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
>> include/linux/tpm.h             |  2 ++
>> 3 files changed, 27 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
>> index 283f78211c3a..bc6aa7f9e119 100644
>> --- a/drivers/char/tpm/tpm.h
>> +++ b/drivers/char/tpm/tpm.h
>> @@ -42,7 +42,9 @@ enum tpm_timeout {
>> 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>> 	TPM_TIMEOUT_POLL = 1,	/* msecs */
>> 	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
>> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
>> +	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
>> +	TPM_TIMEOUT_WAIT_STAT = 500,	/* usecs */
>> +	TPM_ATML_TIMEOUT_WAIT_STAT = 15000	/* usecs */
>> };
>> 
>> /* TPM addresses */
>> @@ -189,6 +191,11 @@ static inline void tpm_msleep(unsigned int delay_msec)
>> 		     delay_msec * 1000);
>> };
>> 
>> +static inline void tpm_usleep(unsigned int delay_usec)
>> +{
>> +	usleep_range(delay_usec - TPM_TIMEOUT_RANGE_US, delay_usec);
>> +};
>> +
>> int tpm_chip_start(struct tpm_chip *chip);
>> void tpm_chip_stop(struct tpm_chip *chip);
>> struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
>> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
>> index 55b9d3965ae1..9ddd4edfe1c2 100644
>> --- a/drivers/char/tpm/tpm_tis_core.c
>> +++ b/drivers/char/tpm/tpm_tis_core.c
>> @@ -80,8 +80,12 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>> 		}
>> 	} else {
>> 		do {
>> -			usleep_range(TPM_TIMEOUT_USECS_MIN,
>> -				     TPM_TIMEOUT_USECS_MAX);
>> +			if (chip->timeout_wait_stat && 
>> +				chip->timeout_wait_stat >= TPM_TIMEOUT_WAIT_STAT) {
>> +				tpm_usleep((unsigned int)(chip->timeout_wait_stat));
>> +			} else {
>> +				tpm_usleep((unsigned int)(TPM_TIMEOUT_WAIT_STAT));
>> +			}
>> 			status = chip->ops->status(chip);
>> 			if ((status & mask) == mask)
>> 				return 0;
>> @@ -934,6 +938,8 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>> 	chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX);
>> 	chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX);
>> 	chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX);
>> +	/* init timeout for wait_for_tpm_stat */
>> +	chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
>> 	priv->phy_ops = phy_ops;
>> 	dev_set_drvdata(&chip->dev, priv);
>> 
>> @@ -983,6 +989,15 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>> 
>> 	priv->manufacturer_id = vendor;
>> 
>> +	switch (priv->manufacturer_id) {
>> +	case TPM_VID_ATML:
>> +        /* ATMEL chip needs longer timeout to avoid crash */
>> +		chip->timeout_wait_stat = TPM_ATML_TIMEOUT_WAIT_STAT;
>> +		break;
>> +	default:
>> +		chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
>> +	}
>> +
>> 	rc = tpm_tis_read8(priv, TPM_RID(0), &rid);
>> 	if (rc < 0)
>> 		goto out_err;
>> diff --git a/include/linux/tpm.h b/include/linux/tpm.h
>> index aa11fe323c56..35f2a0260d76 100644
>> --- a/include/linux/tpm.h
>> +++ b/include/linux/tpm.h
>> @@ -150,6 +150,7 @@ struct tpm_chip {
>> 	bool timeout_adjusted;
>> 	unsigned long duration[TPM_NUM_DURATIONS]; /* jiffies */
>> 	bool duration_adjusted;
>> +	unsigned long timeout_wait_stat; /* usecs */
>> 
>> 	struct dentry *bios_dir[TPM_NUM_EVENT_LOG_FILES];
>> 
>> @@ -269,6 +270,7 @@ enum tpm2_cc_attrs {
>> #define TPM_VID_INTEL    0x8086
>> #define TPM_VID_WINBOND  0x1050
>> #define TPM_VID_STM      0x104A
>> +#define TPM_VID_ATML     0x1114
>> 
>> enum tpm_chip_flags {
>> 	TPM_CHIP_FLAG_TPM2		= BIT(1),
>> -- 
>> 2.29.0.vfs.0.0
>> 
>> 
> 
> /Jarkko

Hao


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-20 23:18 Hao Wu
  2021-06-23 13:35 ` Jarkko Sakkinen
@ 2021-06-24  5:33 ` Hao Wu
  2021-06-29 20:07   ` Jarkko Sakkinen
  1 sibling, 1 reply; 51+ messages in thread
From: Hao Wu @ 2021-06-24  5:33 UTC (permalink / raw)
  To: hao.wu, shrihari.kalkar, seungyeop.han, anish.jhaveri,
	peterhuewe, jarkko, jgg, linux-integrity, pmenzel, kgold, zohar,
	why2jjj.linux, hamza, gregkh, arnd, nayna, James.Bottomley

This is a fix for the ATMEL TPM crash bug reported in
https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/

According to the discussions in the original thread,
we don't want to revert the timeout of wait_for_tpm_stat
for non-ATMEL chips, which brings back the performance cost.
For investigation and analysis of why wait_for_tpm_stat
caused the issue, and how the regression was introduced,
please read the original thread above.

Thus the proposed fix here is to only revert the timeout
for ATMEL chips by checking the vendor ID.
---
 drivers/char/tpm/tpm.h          |  9 ++++++++-
 drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
 include/linux/tpm.h             |  2 ++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 283f78211c3a..bc6aa7f9e119 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -42,7 +42,9 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
 	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
+	TPM_TIMEOUT_WAIT_STAT = 500,	/* usecs */
+	TPM_ATML_TIMEOUT_WAIT_STAT = 15000	/* usecs */
 };
 
 /* TPM addresses */
@@ -189,6 +191,11 @@ static inline void tpm_msleep(unsigned int delay_msec)
 		     delay_msec * 1000);
 };
 
+static inline void tpm_usleep(unsigned int delay_usec)
+{
+	usleep_range(delay_usec - TPM_TIMEOUT_RANGE_US, delay_usec);
+};
+
 int tpm_chip_start(struct tpm_chip *chip);
 void tpm_chip_stop(struct tpm_chip *chip);
 struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 55b9d3965ae1..9ddd4edfe1c2 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -80,8 +80,12 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
 		}
 	} else {
 		do {
-			usleep_range(TPM_TIMEOUT_USECS_MIN,
-				     TPM_TIMEOUT_USECS_MAX);
+			if (chip->timeout_wait_stat && 
+				chip->timeout_wait_stat >= TPM_TIMEOUT_WAIT_STAT) {
+				tpm_usleep((unsigned int)(chip->timeout_wait_stat));
+			} else {
+				tpm_usleep((unsigned int)(TPM_TIMEOUT_WAIT_STAT));
+			}
 			status = chip->ops->status(chip);
 			if ((status & mask) == mask)
 				return 0;
@@ -934,6 +938,8 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
 	chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX);
 	chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX);
 	chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX);
+	/* init timeout for wait_for_tpm_stat */
+	chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
 	priv->phy_ops = phy_ops;
 	dev_set_drvdata(&chip->dev, priv);
 
@@ -983,6 +989,15 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
 
 	priv->manufacturer_id = vendor;
 
+	switch (priv->manufacturer_id) {
+	case TPM_VID_ATML:
+        /* ATMEL chip needs longer timeout to avoid crash */
+		chip->timeout_wait_stat = TPM_ATML_TIMEOUT_WAIT_STAT;
+		break;
+	default:
+		chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
+	}
+
 	rc = tpm_tis_read8(priv, TPM_RID(0), &rid);
 	if (rc < 0)
 		goto out_err;
diff --git a/include/linux/tpm.h b/include/linux/tpm.h
index aa11fe323c56..35f2a0260d76 100644
--- a/include/linux/tpm.h
+++ b/include/linux/tpm.h
@@ -150,6 +150,7 @@ struct tpm_chip {
 	bool timeout_adjusted;
 	unsigned long duration[TPM_NUM_DURATIONS]; /* jiffies */
 	bool duration_adjusted;
+	unsigned long timeout_wait_stat; /* usecs */
 
 	struct dentry *bios_dir[TPM_NUM_EVENT_LOG_FILES];
 
@@ -269,6 +270,7 @@ enum tpm2_cc_attrs {
 #define TPM_VID_INTEL    0x8086
 #define TPM_VID_WINBOND  0x1050
 #define TPM_VID_STM      0x104A
+#define TPM_VID_ATML     0x1114
 
 enum tpm_chip_flags {
 	TPM_CHIP_FLAG_TPM2		= BIT(1),
-- 
2.29.0.vfs.0.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2021-06-20 23:18 Hao Wu
@ 2021-06-23 13:35 ` Jarkko Sakkinen
  2021-06-24  5:49   ` Hao Wu
  2021-06-24  5:33 ` Hao Wu
  1 sibling, 1 reply; 51+ messages in thread
From: Jarkko Sakkinen @ 2021-06-23 13:35 UTC (permalink / raw)
  To: Hao Wu
  Cc: shrihari.kalkar, seungyeop.han, anish.jhaveri, peterhuewe, jgg,
	linux-integrity, pmenzel, kgold, zohar, why2jjj.linux, hamza,
	gregkh, arnd, nayna, James.Bottomley

On Sun, Jun 20, 2021 at 04:18:09PM -0700, Hao Wu wrote:
> This is a fix for the ATMEL TPM crash bug reported in
> https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
> 
> According to the discussions in the original thread,
> we don't want to revert the timeout of wait_for_tpm_stat
> for non-ATMEL chips, which brings back the performance cost.
> For investigation and analysis of why wait_for_tpm_stat
> caused the issue, and how the regression was introduced,
> please read the original thread above.
> 
> Thus the proposed fix here is to only revert the timeout
> for ATMEL chips by checking the vendor ID.
> 
> Test Plan:
> - Run fixed kernel with ATMEL TPM chips and see crash
>   has been fixed.
> - Run fixed kernel with non-ATMEL TPM chips, and confirm
>   the timeout has not been changed.

Please move the test plan right before the diffstat if you want to include such,
so that it does not go into the commit log.


> ---
>  drivers/char/tpm/tpm.h          |  9 ++++++++-
>  drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
>  include/linux/tpm.h             |  2 ++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 283f78211c3a..bc6aa7f9e119 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -42,7 +42,9 @@ enum tpm_timeout {
>  	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>  	TPM_TIMEOUT_POLL = 1,	/* msecs */
>  	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
> +	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
> +	TPM_TIMEOUT_WAIT_STAT = 500,	/* usecs */
> +	TPM_ATML_TIMEOUT_WAIT_STAT = 15000	/* usecs */
>  };
>  
>  /* TPM addresses */
> @@ -189,6 +191,11 @@ static inline void tpm_msleep(unsigned int delay_msec)
>  		     delay_msec * 1000);
>  };
>  
> +static inline void tpm_usleep(unsigned int delay_usec)
> +{
> +	usleep_range(delay_usec - TPM_TIMEOUT_RANGE_US, delay_usec);
> +};
> +
>  int tpm_chip_start(struct tpm_chip *chip);
>  void tpm_chip_stop(struct tpm_chip *chip);
>  struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 55b9d3965ae1..9ddd4edfe1c2 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -80,8 +80,12 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>  		}
>  	} else {
>  		do {
> -			usleep_range(TPM_TIMEOUT_USECS_MIN,
> -				     TPM_TIMEOUT_USECS_MAX);
> +			if (chip->timeout_wait_stat && 
> +				chip->timeout_wait_stat >= TPM_TIMEOUT_WAIT_STAT) {
> +				tpm_usleep((unsigned int)(chip->timeout_wait_stat));
> +			} else {
> +				tpm_usleep((unsigned int)(TPM_TIMEOUT_WAIT_STAT));
> +			}
>  			status = chip->ops->status(chip);
>  			if ((status & mask) == mask)
>  				return 0;
> @@ -934,6 +938,8 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>  	chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX);
>  	chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX);
>  	chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX);
> +	/* init timeout for wait_for_tpm_stat */
> +	chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
>  	priv->phy_ops = phy_ops;
>  	dev_set_drvdata(&chip->dev, priv);
>  
> @@ -983,6 +989,15 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>  
>  	priv->manufacturer_id = vendor;
>  
> +	switch (priv->manufacturer_id) {
> +	case TPM_VID_ATML:
> +        /* ATMEL chip needs longer timeout to avoid crash */
> +		chip->timeout_wait_stat = TPM_ATML_TIMEOUT_WAIT_STAT;
> +		break;
> +	default:
> +		chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
> +	}
> +
>  	rc = tpm_tis_read8(priv, TPM_RID(0), &rid);
>  	if (rc < 0)
>  		goto out_err;
> diff --git a/include/linux/tpm.h b/include/linux/tpm.h
> index aa11fe323c56..35f2a0260d76 100644
> --- a/include/linux/tpm.h
> +++ b/include/linux/tpm.h
> @@ -150,6 +150,7 @@ struct tpm_chip {
>  	bool timeout_adjusted;
>  	unsigned long duration[TPM_NUM_DURATIONS]; /* jiffies */
>  	bool duration_adjusted;
> +	unsigned long timeout_wait_stat; /* usecs */
>  
>  	struct dentry *bios_dir[TPM_NUM_EVENT_LOG_FILES];
>  
> @@ -269,6 +270,7 @@ enum tpm2_cc_attrs {
>  #define TPM_VID_INTEL    0x8086
>  #define TPM_VID_WINBOND  0x1050
>  #define TPM_VID_STM      0x104A
> +#define TPM_VID_ATML     0x1114
>  
>  enum tpm_chip_flags {
>  	TPM_CHIP_FLAG_TPM2		= BIT(1),
> -- 
> 2.29.0.vfs.0.0
> 
> 

/Jarkko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] Fix Atmel TPM crash caused by too frequent queries
@ 2021-06-20 23:18 Hao Wu
  2021-06-23 13:35 ` Jarkko Sakkinen
  2021-06-24  5:33 ` Hao Wu
  0 siblings, 2 replies; 51+ messages in thread
From: Hao Wu @ 2021-06-20 23:18 UTC (permalink / raw)
  To: hao.wu, shrihari.kalkar, seungyeop.han, anish.jhaveri,
	peterhuewe, jarkko, jgg, linux-integrity, pmenzel, kgold, zohar,
	why2jjj.linux, hamza, gregkh, arnd, nayna, James.Bottomley

This is a fix for the ATMEL TPM crash bug reported in
https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/

According to the discussions in the original thread,
we don't want to revert the timeout of wait_for_tpm_stat
for non-ATMEL chips, which brings back the performance cost.
For investigation and analysis of why wait_for_tpm_stat
caused the issue, and how the regression was introduced,
please read the original thread above.

Thus the proposed fix here is to only revert the timeout
for ATMEL chips by checking the vendor ID.
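
For context, a minimal sketch (assuming the surrounding tpm_tis_core_init()
code, matching the hunk below) of where the vendor ID being checked comes
from: the TPM_DID_VID register carries the device ID in its high 16 bits and
the TCG-assigned vendor ID in its low 16 bits, and the vendor half ends up
in priv->manufacturer_id:

```c
u32 vendor;

/* Read the 32-bit DID/VID register; device ID in the high 16 bits,
 * vendor ID in the low 16 bits. */
rc = tpm_tis_read32(priv, TPM_DID_VID(0), &vendor);
if (rc < 0)
	goto out_err;

priv->manufacturer_id = vendor;	/* 0x1114 == TPM_VID_ATML for Atmel */
```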

Test Plan:
- Run fixed kernel with ATMEL TPM chips and see crash
  has been fixed.
- Run fixed kernel with non-ATMEL TPM chips, and confirm
  the timeout has not been changed.
---
 drivers/char/tpm/tpm.h          |  9 ++++++++-
 drivers/char/tpm/tpm_tis_core.c | 19 +++++++++++++++++--
 include/linux/tpm.h             |  2 ++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 283f78211c3a..bc6aa7f9e119 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -42,7 +42,9 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
 	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
+	TPM_TIMEOUT_WAIT_STAT = 500,	/* usecs */
+	TPM_ATML_TIMEOUT_WAIT_STAT = 15000	/* usecs */
 };
 
 /* TPM addresses */
@@ -189,6 +191,11 @@ static inline void tpm_msleep(unsigned int delay_msec)
 		     delay_msec * 1000);
 };
 
+static inline void tpm_usleep(unsigned int delay_usec)
+{
+	usleep_range(delay_usec - TPM_TIMEOUT_RANGE_US, delay_usec);
+};
+
 int tpm_chip_start(struct tpm_chip *chip);
 void tpm_chip_stop(struct tpm_chip *chip);
 struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 55b9d3965ae1..9ddd4edfe1c2 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -80,8 +80,12 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
 		}
 	} else {
 		do {
-			usleep_range(TPM_TIMEOUT_USECS_MIN,
-				     TPM_TIMEOUT_USECS_MAX);
+			if (chip->timeout_wait_stat && 
+				chip->timeout_wait_stat >= TPM_TIMEOUT_WAIT_STAT) {
+				tpm_usleep((unsigned int)(chip->timeout_wait_stat));
+			} else {
+				tpm_usleep((unsigned int)(TPM_TIMEOUT_WAIT_STAT));
+			}
 			status = chip->ops->status(chip);
 			if ((status & mask) == mask)
 				return 0;
@@ -934,6 +938,8 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
 	chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX);
 	chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX);
 	chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX);
+	/* init timeout for wait_for_tpm_stat */
+	chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
 	priv->phy_ops = phy_ops;
 	dev_set_drvdata(&chip->dev, priv);
 
@@ -983,6 +989,15 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
 
 	priv->manufacturer_id = vendor;
 
+	switch (priv->manufacturer_id) {
+	case TPM_VID_ATML:
+        /* ATMEL chip needs longer timeout to avoid crash */
+		chip->timeout_wait_stat = TPM_ATML_TIMEOUT_WAIT_STAT;
+		break;
+	default:
+		chip->timeout_wait_stat = TPM_TIMEOUT_WAIT_STAT;
+	}
+
 	rc = tpm_tis_read8(priv, TPM_RID(0), &rid);
 	if (rc < 0)
 		goto out_err;
diff --git a/include/linux/tpm.h b/include/linux/tpm.h
index aa11fe323c56..35f2a0260d76 100644
--- a/include/linux/tpm.h
+++ b/include/linux/tpm.h
@@ -150,6 +150,7 @@ struct tpm_chip {
 	bool timeout_adjusted;
 	unsigned long duration[TPM_NUM_DURATIONS]; /* jiffies */
 	bool duration_adjusted;
+	unsigned long timeout_wait_stat; /* usecs */
 
 	struct dentry *bios_dir[TPM_NUM_EVENT_LOG_FILES];
 
@@ -269,6 +270,7 @@ enum tpm2_cc_attrs {
 #define TPM_VID_INTEL    0x8086
 #define TPM_VID_WINBOND  0x1050
 #define TPM_VID_STM      0x104A
+#define TPM_VID_ATML     0x1114
 
 enum tpm_chip_flags {
 	TPM_CHIP_FLAG_TPM2		= BIT(1),
-- 
2.29.0.vfs.0.0
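
To make the effect of the values above concrete, a sketch of the per-poll
sleep windows in wait_for_tpm_stat() under this patch, derived from the
tpm_usleep() helper with TPM_TIMEOUT_RANGE_US = 300:

```c
/* Non-Atmel chips keep the short poll interval: sleeps 200..500 usecs. */
tpm_usleep(TPM_TIMEOUT_WAIT_STAT);

/* Atmel chips get the pre-4.14 pace back: sleeps 14700..15000 usecs. */
tpm_usleep(TPM_ATML_TIMEOUT_WAIT_STAT);
```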


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-14  6:13 Hao Wu
  2020-09-14  6:17 ` Greg KH
@ 2020-09-15  2:52 ` Hao Wu
  1 sibling, 0 replies; 51+ messages in thread
From: Hao Wu @ 2020-09-15  2:52 UTC (permalink / raw)
  To: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, hamza, james.l.morris
  Cc: linux-integrity, pmenzel, kgold, seungyeop.han, shrihari.kalkar,
	anish.jhaveri

Applied the fix on 5.9-rc4, but it looks like it is not enough to fix the
issue. There should be other TPM timing changes since 4.15 (where we tested
and ensured it works). Need to look into it a bit to see what the fix in the
master branch should be.

Thanks
Hao

> On Sep 13, 2020, at 11:13 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
> Since kernel 4.14, we fixed the TPM sleep logic
> from msleep to usleep_range, so that the TPM
> sleeps exactly with TPM_TIMEOUT (=5ms) afterward.
> Before that fix, msleep(5) actually sleeps for
> around 15ms.
> The fix is https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> 
> That fix uncovered that the TPM_TIMEOUT was not properly
> set previously. We recently found the TPM driver in kernel 4.14+
> (including 5.9-rc4) crashes Atmel TPM chips with
> too frequent TPM queries.
> 
> The TPM crash signature is
> ```
> $ tpm_sealdata -z
> Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error
> 
> $ sudo dmesg | grep tpm0
> [59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
> [59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
> ```
> 
> From the error code "-62", it looks similar to another bug
> https://patchwork.kernel.org/patch/10520247/
> where the "TPM_TIMEOUT_USECS_MAX" and "TPM_TIMEOUT_USECS_MIN" values
> are too small, which causes the TPM to get queried too frequently,
> and thus crashes.
> 
> This patch fixes TPM_TIMEOUT to 15 ms, which was
> the actual timeout TPM chips used before the fix
> from msleep to usleep_range, and thus fixes the crash.
> 
> Test Plan:
> - Run fixed kernel on system with Atmel TPM chips
>  and ensure crash does not happen
> - Run fixed kernel on system with other TPM chips
>   (IFX / WEC / STM) and ensure no breakage from tpm-tools
> ---
> drivers/char/tpm/tpm.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 947d1db0a5cc..73259ac0a997 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -37,7 +37,7 @@
> #define TPM_RETRY		50
> 
> enum tpm_timeout {
> -	TPM_TIMEOUT = 5,	/* msecs */
> +	TPM_TIMEOUT = 15,	/* msecs */
> 	TPM_TIMEOUT_RETRY = 100, /* msecs */
> 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
> 	TPM_TIMEOUT_POLL = 1,	/* msecs */
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
  2020-09-14  6:13 Hao Wu
@ 2020-09-14  6:17 ` Greg KH
  2020-09-15  2:52 ` Hao Wu
  1 sibling, 0 replies; 51+ messages in thread
From: Greg KH @ 2020-09-14  6:17 UTC (permalink / raw)
  To: Hao Wu
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, hamza, james.l.morris,
	linux-integrity, pmenzel, kgold, seungyeop.han, shrihari.kalkar,
	anish.jhaveri

On Sun, Sep 13, 2020 at 11:13:43PM -0700, Hao Wu wrote:
> Since kernel 4.14, we fixed the TPM sleep logic
> from msleep to usleep_range, so that the TPM
> sleeps exactly with TPM_TIMEOUT (=5ms) afterward.
> Before that fix, msleep(5) actually sleeps for
> around 15ms.
> The fix is https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> 
> That fix uncovered that the TPM_TIMEOUT was not properly
> set previously. We recently found the TPM driver in kernel 4.14+
> (including 5.9-rc4) crashes Atmel TPM chips with
> too frequent TPM queries.
> 
> The TPM crash signature is
> ```
> $ tpm_sealdata -z
> Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error
> 
> $ sudo dmesg | grep tpm0
> [59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
> [59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
> ```
> 
> From the error code "-62", it looks similar to another bug
> https://patchwork.kernel.org/patch/10520247/
> where the "TPM_TIMEOUT_USECS_MAX" and "TPM_TIMEOUT_USECS_MIN" values
> are too small, which causes the TPM to get queried too frequently,
> and thus crashes.
> 
> This patch fixes TPM_TIMEOUT to 15 ms, which was
> the actual timeout TPM chips used before the fix
> from msleep to usleep_range, and thus fixes the crash.
> 
> Test Plan:
> - Run fixed kernel on system with Atmel TPM chips
>   and ensure crash does not happen
> - Run fixed kernel on system with other TPM chips
>   (IFX / WEC / STM) and ensure no breakage from tpm-tools
> ---
>  drivers/char/tpm/tpm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 947d1db0a5cc..73259ac0a997 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -37,7 +37,7 @@
>  #define TPM_RETRY		50
>  
>  enum tpm_timeout {
> -	TPM_TIMEOUT = 5,	/* msecs */
> +	TPM_TIMEOUT = 15,	/* msecs */
>  	TPM_TIMEOUT_RETRY = 100, /* msecs */
>  	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>  	TPM_TIMEOUT_POLL = 1,	/* msecs */
> -- 
> 2.17.1
> 

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch does not have a Signed-off-by: line.  Please read the
  kernel file, Documentation/SubmittingPatches and resend it after
  adding that line.  Note, the line needs to be in the body of the
  email, before the patch, not at the bottom of the patch or in the
  email signature.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] Fix Atmel TPM crash caused by too frequent queries
@ 2020-09-14  6:13 Hao Wu
  2020-09-14  6:17 ` Greg KH
  2020-09-15  2:52 ` Hao Wu
  0 siblings, 2 replies; 51+ messages in thread
From: Hao Wu @ 2020-09-14  6:13 UTC (permalink / raw)
  To: peterhuewe, jarkko.sakkinen, jgg, arnd, gregkh, hamza, james.l.morris
  Cc: linux-integrity, pmenzel, kgold, seungyeop.han, shrihari.kalkar,
	hao.wu, anish.jhaveri

Since kernel 4.14, we fixed the TPM sleep logic
from msleep to usleep_range, so that the TPM
sleeps exactly with TPM_TIMEOUT (=5ms) afterward.
Before that fix, msleep(5) actually sleeps for
around 15ms.
The fix is https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
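
As a sketch of the two sleep paths (illustrative; exact rounding depends on
CONFIG_HZ, but msleep() rounds the delay up to whole jiffies while
usleep_range() uses hrtimers and stays within the requested window):

```c
/* Old behavior: asks for 5 ms, but jiffy rounding typically yields ~15 ms. */
msleep(TPM_TIMEOUT);

/* Behavior after the fix: sleeps within the requested 5.0..5.3 ms window. */
usleep_range(TPM_TIMEOUT * 1000,
	     TPM_TIMEOUT * 1000 + TPM_TIMEOUT_RANGE_US);
```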

That fix uncovered that the TPM_TIMEOUT was not properly
set previously. We recently found the TPM driver in kernel 4.14+
(including 5.9-rc4) crashes Atmel TPM chips with
too frequent TPM queries.

The TPM crash signature is
```
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error

$ sudo dmesg | grep tpm0
[59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
[59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
```

From the error code "-62", it looks similar to another bug
https://patchwork.kernel.org/patch/10520247/
where the "TPM_TIMEOUT_USECS_MAX" and "TPM_TIMEOUT_USECS_MIN" values
are too small, which causes the TPM to get queried too frequently,
and thus crashes.

This patch fixes TPM_TIMEOUT to 15 ms, which was
the actual timeout TPM chips used before the fix
from msleep to usleep_range, and thus fixes the crash.

Test Plan:
- Run fixed kernel on system with Atmel TPM chips
  and ensure crash does not happen
- Run fixed kernel on system with other TPM chips
  (IFX / WEC / STM) and ensure no breakage from tpm-tools
---
 drivers/char/tpm/tpm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 947d1db0a5cc..73259ac0a997 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -37,7 +37,7 @@
 #define TPM_RETRY		50
 
 enum tpm_timeout {
-	TPM_TIMEOUT = 5,	/* msecs */
+	TPM_TIMEOUT = 15,	/* msecs */
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2021-06-30  4:27 UTC | newest]

Thread overview: 51+ messages
2020-09-26 22:31 [PATCH] Fix Atmel TPM crash caused by too frequent queries Hao Wu
2020-09-26 22:57 ` James Bottomley
2020-09-26 23:10   ` Hao Wu
2020-09-27 18:25     ` James Bottomley
2020-09-28  0:11       ` Hao Wu
2020-09-28  0:15         ` Hao Wu
2020-09-28  1:22         ` James Bottomley
2020-09-28  5:59           ` Hao Wu
2020-09-28 22:11             ` James Bottomley
2020-09-29  4:46               ` Hao Wu
2020-09-30  2:16               ` Jarkko Sakkinen
2020-09-30 14:54                 ` James Bottomley
2020-09-30 15:37                   ` Jarkko Sakkinen
2020-09-30 20:48                     ` James Bottomley
2020-09-30 21:09                       ` Jarkko Sakkinen
2020-09-30 22:31                         ` James Bottomley
2020-10-01  1:50                           ` Jarkko Sakkinen
2020-10-01  4:53                             ` James Bottomley
2020-10-01 18:15                               ` Nayna
2020-10-01 18:32                                 ` James Bottomley
2020-10-01 23:04                                   ` Jarkko Sakkinen
2020-10-17  6:11                                     ` Hao Wu
2020-10-18  5:09                                       ` Jarkko Sakkinen
2020-10-18  5:20                                         ` Hao Wu
2020-11-14  4:39                                           ` Hao Wu
2020-11-18 21:11                                             ` Jarkko Sakkinen
2020-11-18 23:23                                               ` Hao Wu
2021-05-09  6:18                                               ` Hao Wu
2021-05-09  6:31                                                 ` Hao Wu
2021-05-10  2:17                                                   ` Mimi Zohar
2021-05-10  3:15                                                     ` Hao Wu
2021-05-10 17:28                                                     ` Jarkko Sakkinen
2020-09-28  1:08       ` Jarkko Sakkinen
2020-09-28  6:03         ` Hao Wu
2020-09-28 14:16           ` Jarkko Sakkinen
2020-09-28 17:49             ` Hao Wu
2020-09-28 19:47               ` Jarkko Sakkinen
2020-09-28 20:27                 ` Hao Wu
2020-09-30  2:11                   ` Jarkko Sakkinen
2020-09-30  3:41                     ` Hao Wu
     [not found]                       ` <EA1EE8F8-F054-4E1B-B830-231398D33CB8@rubrik.com>
2020-10-01 14:16                         ` Mimi Zohar
  -- strict thread matches above, loose matches on Subject: below --
2021-06-20 23:18 Hao Wu
2021-06-23 13:35 ` Jarkko Sakkinen
2021-06-24  5:49   ` Hao Wu
2021-06-29 20:06     ` Jarkko Sakkinen
2021-06-30  4:27       ` Hao Wu
2021-06-24  5:33 ` Hao Wu
2021-06-29 20:07   ` Jarkko Sakkinen
2020-09-14  6:13 Hao Wu
2020-09-14  6:17 ` Greg KH
2020-09-15  2:52 ` Hao Wu
