All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
       [not found] <9173F912-F682-44CC-8408-565A6C675749@rubrik.com>
@ 2020-09-11  4:18 ` Greg KH
  2020-09-11 19:35   ` Ken Goldman
  0 siblings, 1 reply; 8+ messages in thread
From: Greg KH @ 2020-09-11  4:18 UTC (permalink / raw)
  To: Hao Wu
  Cc: peterhuewe, jarkko.sakkinen, jgg, arnd, linux-integrity,
	Seungyeop Han, Shrihari Kalkar, Anish Jhaveri

On Thu, Sep 10, 2020 at 03:13:22PM -0700, Hao Wu wrote:
> Hi TPM Driver Maintainers,
> 
> We are from Rubrik engineering team. We found a TPM driver update since kernel 4.14 causing atmel TPM chips crash. We have root caused it and have a patch on the kernel used by Rubrik products. Given the “bug” has impact on not just Rubrik products, but all atmel TPM chips, we are asking to fix the issue in the kernel upstream.
> 
> The commit that introduced the bug since 4.14 
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 <https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3>
> 
> Effected platform / system:
> - Cisco UCS C220 M5 with atmel TPM chip
> 	- Ubuntu 16.04
>   	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4
> - Cisco UCS C240 M5 with atmel TPM chip
> 	- Ubuntu 16.04
>   	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4
> 
> ```
> # TPM chip used in the problematic platform
> $ tpm_version
> TPM 1.2 Version Info:
>   Chip Version:        1.2.66.1
>   Spec Level:          2
>   Errata Revision:     3
>   TPM Vendor ID:       ATML
>   TPM Version:         01010000
>   Manufacturer Info:   41544d4c
> ```
> 
> Not all Cisco server nodes are facing the crash, but there are a good amount of portion (around 50% from nodes in Rubrik) are persistently having the TPM crash issue.
> 
> Our other platforms using TPM chips from other vendors do not have issue. These platform are running the same software as the problematic platforms run. Those TPM non-effected vendors are
> - IFX
> - STM
> - WEC
> 
> TPM crash signature:
> ```
> # error when running tpm tool
> $ tpm_sealdata -z
> Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error
> 
> # tpm driver sends error
> $ sudo dmesg | grep tpm0
> [59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
> [59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
> ```
> 
> Our Root Cause Analysis
> From the error code “-62”, it looks similar to another bug https://patchwork.kernel.org/patch/10520247/ <https://patchwork.kernel.org/patch/10520247/>
> where the “TPM_TIMEOUT_USECS_MAX” and “TPM_TIMEOUT_USEC_MIN” is too small, which causes TPM get queried too frequently, and thus crashes.
> 
> The issue for atmel TPM chip crash is similar the commit https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 <https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3> changed TPM sleep logic from using `msleep` to `tpm_msleep`, in which `usleep_range` is used.
> 
> As what https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 <https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3> intended to do, using usleep_range can make the sleep period shorter, because msleep actually sleeps longer when the sleep perioid is within 20ms, and usleep_range can make it more precise.
> 
> According to our experiments,
> - usleep_range makes the TPM sleep precisely for TPM_TIMEOUT (i.e. 5ms https://github.com/torvalds/linux/blob/v4.14/drivers/char/tpm/tpm.h#L52 <https://github.com/torvalds/linux/blob/v4.14/drivers/char/tpm/tpm.h#L52>)
> - msleep(TPM_TIMEOUT) actually sleeps around 15ms    
> 
> Thus the TPM get queried more frequently than before, which is likely the root cause of the atmel chip crash. We fix it by bumping up the TPM_TIMEOUT to 15ms.
> 
> 
> Rubrik Patch
> ```
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 72d3ce4..9b8f3f8 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -49,7 +49,15 @@ enum tpm_const {
>  };
> 
>  enum tpm_timeout {
> -       TPM_TIMEOUT = 5,        /* msecs */
> +       TPM_TIMEOUT = 15,       /* msecs */
>         TPM_TIMEOUT_RETRY = 100, /* msecs */
>         TPM_TIMEOUT_RANGE_US = 300      /* usecs */
>  };
> ```
> With the patch, the atmel TPM chip crash is fixed.  
> 
> Proposal
> We want the kernel upstream to adopt the fix or have a better fix for the atmel chip while not bring performance regression for other TPM chips. We understand that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 <https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3> was intended to shorten the TPM respond time, but it does not work well for atmel TPM chips. Probably we should override TPM_TIMEOUT value for atmel chips at least.
> 
> Thanks
> Hao

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch is malformed (tabs converted to spaces, linewrapped, etc.)
  and can not be applied.  Please read the file,
  Documentation/email-clients.txt in order to fix this.

- Your patch does not have a Signed-off-by: line.  Please read the
  kernel file, Documentation/SubmittingPatches and resend it after
  adding that line.  Note, the line needs to be in the body of the
  email, before the patch, not at the bottom of the patch or in the
  email signature.

- Your patch was sent privately to Greg.  Kernel development is done in
  public, please always cc: a public mailing list with a patch
  submission.  Using the tool, scripts/get_maintainer.pl on the patch
  will tell you what mailing list to cc.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-11  4:18 ` [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip Greg KH
@ 2020-09-11 19:35   ` Ken Goldman
  2020-09-12  7:37     ` Paul Menzel
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Goldman @ 2020-09-11 19:35 UTC (permalink / raw)
  Cc: linux-integrity

On 9/11/2020 12:18 AM, Greg KH wrote:
> Thus the TPM get queried more frequently than before, which is likely the root cause of the atmel chip crash. We fix it by bumping up the TPM_TIMEOUT to 15ms.
> 
> 
> Rubrik Patch
> ```
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 72d3ce4..9b8f3f8 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -49,7 +49,15 @@ enum tpm_const {
>   };
> 
>   enum tpm_timeout {
> -       TPM_TIMEOUT = 5,        /* msecs */
> +       TPM_TIMEOUT = 15,       /* msecs */
>          TPM_TIMEOUT_RETRY = 100, /* msecs */
>          TPM_TIMEOUT_RANGE_US = 300      /* usecs */
>   };
> ```
> With the patch, the atmel TPM chip crash is fixed.
> 
> Proposal
> We want the kernel upstream to adopt the fix or have a better fix for the atmel chip while not bring performance regression for other TPM chips. We understand thathttps://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3  <https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3>  was intended to shorten the TPM

Is this the poll time, which was reduced at one point?  If so ...

Be careful about making this a global change.  It could reduce the TPM 
performance by 3x. We don't want to affect all TPMs to fix a bug in an 
old TPM 1.2 chip from one vendor.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-11 19:35   ` Ken Goldman
@ 2020-09-12  7:37     ` Paul Menzel
  2020-09-12  8:13       ` Hao Wu
       [not found]       ` <B003F2A9-2DF5-4633-91C4-FD6B8A3353ED@rubrik.com>
  0 siblings, 2 replies; 8+ messages in thread
From: Paul Menzel @ 2020-09-12  7:37 UTC (permalink / raw)
  To: Ken Goldman; +Cc: Hao Wu, linux-integrity, Hamza Attak

Dear Ken,


Any reason you stripped the CC list? Adding back Hao Wu and adding the 
patch author.

Am 11.09.20 um 21:35 schrieb Ken Goldman:
> On 9/11/2020 12:18 AM, Greg KH wrote:
>> Thus the TPM get queried more frequently than before, which is likely 
>> the root cause of the atmel chip crash. We fix it by bumping up the 
>> TPM_TIMEOUT to 15ms.
>>
>>
>> Rubrik Patch
>> ```
>> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
>> index 72d3ce4..9b8f3f8 100644
>> --- a/drivers/char/tpm/tpm.h
>> +++ b/drivers/char/tpm/tpm.h
>> @@ -49,7 +49,15 @@ enum tpm_const {
>>   };
>>
>>   enum tpm_timeout {
>> -       TPM_TIMEOUT = 5,        /* msecs */
>> +       TPM_TIMEOUT = 15,       /* msecs */
>>          TPM_TIMEOUT_RETRY = 100, /* msecs */
>>          TPM_TIMEOUT_RANGE_US = 300      /* usecs */
>>   };
>> ```
>> With the patch, the atmel TPM chip crash is fixed.
>>
>> Proposal
>> We want the kernel upstream to adopt the fix or have a better fix for 
>> the atmel chip while not bring performance regression for other TPM 
>> chips. We understand 
>> that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
>> was intended to shorten the TPM
> 
> Is this the poll time, which was reduced at one point?  If so ...

The commit summary is: tpm: replace msleep() with  usleep_range() in TPM 
1.2/2.0 generic drivers.

> Be careful about making this a global change.  It could reduce the TPM 
> performance by 3x. We don't want to affect all TPMs to fix a bug in an 
> old TPM 1.2 chip from one vendor.

Linux has a no regression policy, so the performance penalty wouldn’t 
matter. Unfortunately, the regression was only noticed several years 
after being introduced in Linux v4.14-rc2.

Hao, I wouldn’t expect a longer timeout causing the TPM to be queried 
less frequently, but I do not know the code well.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-12  7:37     ` Paul Menzel
@ 2020-09-12  8:13       ` Hao Wu
       [not found]       ` <B003F2A9-2DF5-4633-91C4-FD6B8A3353ED@rubrik.com>
  1 sibling, 0 replies; 8+ messages in thread
From: Hao Wu @ 2020-09-12  8:13 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ken Goldman, linux-integrity, Hamza Attak, Seungyeop Han,
	Shrihari Kalkar, Anish Jhaveri

Resend following email with plaintext
-----

Hi Paul and Ken,

Thanks for quick responses over this report. 

> Hao, I wouldn’t expect a longer timeout causing the TPM to be queried less frequently, but I do not know the code well.
From our understanding, the TPM queries might be retried due to some reason, increase timeout 3x would lower the query frequency to 1/3. The explanation might be wrong, but the fact looks like the timeout matters.  Unfortunately, engineers from Rubrik are not experts over the TPM driver code neither :(

>> Be careful about making this a global change.  It could reduce the TPM performance by 3x. We don't want to affect all TPMs to fix a bug in an old TPM 1.2 chip from one vendor.
> 
> Linux has a no regression policy, so the performance penalty wouldn’t matter. Unfortunately, the regression was only noticed several years after being introduced in Linux v4.14-rc2.
So does that mean we are good to apply the global change ? Or we need to discuss about the actual fix further ?

Thanks
Hao

> On Sep 12, 2020, at 12:37 AM, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> 
> Dear Ken,
> 
> 
> Any reason you stripped the CC list? Adding back Hao Wu and adding the patch author.
> 
> Am 11.09.20 um 21:35 schrieb Ken Goldman:
>> On 9/11/2020 12:18 AM, Greg KH wrote:
>>> Thus the TPM get queried more frequently than before, which is likely the root cause of the atmel chip crash. We fix it by bumping up the TPM_TIMEOUT to 15ms.
>>> 
>>> 
>>> Rubrik Patch
>>> ```
>>> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
>>> index 72d3ce4..9b8f3f8 100644
>>> --- a/drivers/char/tpm/tpm.h
>>> +++ b/drivers/char/tpm/tpm.h
>>> @@ -49,7 +49,15 @@ enum tpm_const {
>>>   };
>>> 
>>>   enum tpm_timeout {
>>> -       TPM_TIMEOUT = 5,        /* msecs */
>>> +       TPM_TIMEOUT = 15,       /* msecs */
>>>          TPM_TIMEOUT_RETRY = 100, /* msecs */
>>>          TPM_TIMEOUT_RANGE_US = 300      /* usecs */
>>>   };
>>> ```
>>> With the patch, the atmel TPM chip crash is fixed.
>>> 
>>> Proposal
>>> We want the kernel upstream to adopt the fix or have a better fix for the atmel chip while not bring performance regression for other TPM chips. We understand that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
>>> was intended to shorten the TPM
>> Is this the poll time, which was reduced at one point?  If so ...
> 
> The commit summary is: tpm: replace msleep() with  usleep_range() in TPM 1.2/2.0 generic drivers.
> 
>> Be careful about making this a global change.  It could reduce the TPM performance by 3x. We don't want to affect all TPMs to fix a bug in an old TPM 1.2 chip from one vendor.
> 
> Linux has a no regression policy, so the performance penalty wouldn’t matter. Unfortunately, the regression was only noticed several years after being introduced in Linux v4.14-rc2.
> 
> Hao, I wouldn’t expect a longer timeout causing the TPM to be queried less frequently, but I do not know the code well.
> 
> 
> Kind regards,
> 
> Paul


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
       [not found]       ` <B003F2A9-2DF5-4633-91C4-FD6B8A3353ED@rubrik.com>
@ 2020-09-12  8:14         ` Paul Menzel
  2020-09-12 14:17           ` hamza
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Menzel @ 2020-09-12  8:14 UTC (permalink / raw)
  To: Hao Wu
  Cc: Ken Goldman, linux-integrity, Hamza Attak, Seungyeop Han,
	Shrihari Kalkar

Dear Hao,


Thank you for the reply.

Am 12.09.20 um 10:10 schrieb Hao Wu:

> Thanks for quick responses over this report.

Thank you for the quick follow-up.

>> Hao, I wouldn’t expect a longer timeout causing the TPM to be
>> queried less frequently, but I do not know the code well.
> From our understanding, the TPM queries might be retried due to some
> reason, increase timeout 3x would lower the query frequency to 1/3.
> The explanation might be wrong, but the fact looks like the timeout
> matters.  Unfortunately, engineers from Rubrik are not experts over
> the TPM driver code neither :(
> 
>>> Be careful about making this a global change.  It could reduce
>>> the TPM performance by 3x. We don't want to affect all TPMs to
>>> fix a bug in an old TPM 1.2 chip from one vendor.
>> 
>> Linux has a no regression policy, so the performance penalty
>> wouldn’t matter. Unfortunately, the regression was only noticed
>> several years after being introduced in Linux v4.14-rc2.
> 
> So does that mean we are good to apply the global change ? Or we need
> to discuss about the actual fix further?
To get a fix into the stable series, it first needs to be applied to the 
master branch. I guess you tested also with Linux master, right?

Please add the explanation from your email to Greg into the git commit 
message, format the patch with `git format-patch -1 -o outgoing` and 
send it with `git send-email outgoing/*` to the addresses listed for the 
subsystem in `MAINTAINERS` and the people listed in the commit 
introducing the regression.

Then it can be properly reviewed and discussed.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-12  8:14         ` Paul Menzel
@ 2020-09-12 14:17           ` hamza
  2020-09-13  4:51             ` Hao Wu
  0 siblings, 1 reply; 8+ messages in thread
From: hamza @ 2020-09-12 14:17 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Hao Wu, Ken Goldman, linux-integrity, Seungyeop Han, Shrihari Kalkar

Hello everyone,

Dear Hao,

The first aim of the patch you mentioned is not to shorten TPM timings, 
as no
timing is changed at all on this patch, but to count time more 
accurately.

If your TPM needs more time to perform a specific operation, it has 
nothing to
do with this patch, but rather with its specified timings in the 
kernel. The
timings for your Atmel TPM were certainly wrong before 4.14 to begin 
with, they
were not counted accurately and your TPM working was potentially a 
"positive"
side effect of the imprecise time counting (ie non-specified delays 
were added
every time msleep was used instead of usleep_range).
TPM polling should not be affected because we use a different way to 
count
timings in the kernel.

If you see issues with your Atmel TPM, I'd suggest to make the changes 
in the
tpm_atmel code only (tpm_infineon would be a suitable example to look 
at) and
discuss the timing values with Atmel TPMs experts, but reverting this 
patch and
affecting all the other TPMs seems like a misunderstanding.

Thanks,
Hamza ATTAK.

On Sat, Sep 12, 2020 at 10:14, Paul Menzel <pmenzel@molgen.mpg.de> 
wrote:
> Dear Hao,
> 
> 
> Thank you for the reply.
> 
> Am 12.09.20 um 10:10 schrieb Hao Wu:
> 
>> Thanks for quick responses over this report.
> 
> Thank you for the quick follow-up.
> 
>>> Hao, I wouldn’t expect a longer timeout causing the TPM to be
>>> queried less frequently, but I do not know the code well.
>> From our understanding, the TPM queries might be retried due to some
>> reason, increase timeout 3x would lower the query frequency to 1/3.
>> The explanation might be wrong, but the fact looks like the timeout
>> matters.  Unfortunately, engineers from Rubrik are not experts over
>> the TPM driver code neither :(
>> 
>>>> Be careful about making this a global change.  It could reduce
>>>> the TPM performance by 3x. We don't want to affect all TPMs to
>>>> fix a bug in an old TPM 1.2 chip from one vendor.
>>> 
>>> Linux has a no regression policy, so the performance penalty
>>> wouldn’t matter. Unfortunately, the regression was only noticed
>>> several years after being introduced in Linux v4.14-rc2.
>> 
>> So does that mean we are good to apply the global change ? Or we need
>> to discuss about the actual fix further?
> To get a fix into the stable series, it first needs to be applied to 
> the master branch. I guess you tested also with Linux master, right?
> 
> Please add the explanation from your email to Greg into the git 
> commit message, format the patch with `git format-patch -1 -o 
> outgoing` and send it with `git send-email outgoing/*` to the 
> addresses listed for the subsystem in `MAINTAINERS` and the people 
> listed in the commit introducing the regression.
> 
> Then it can be properly reviewed and discussed.
> 
> 
> Kind regards,
> 
> Paul



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-12 14:17           ` hamza
@ 2020-09-13  4:51             ` Hao Wu
  2020-09-13  6:38               ` Hao Wu
  0 siblings, 1 reply; 8+ messages in thread
From: Hao Wu @ 2020-09-13  4:51 UTC (permalink / raw)
  To: hamza
  Cc: Paul Menzel, Ken Goldman, linux-integrity, Seungyeop Han,
	Shrihari Kalkar

Sorry again and resend following email in plaintext
-------

Hi Hamza and everyone,

Thanks Hamza for the response on this issue. 

> The first aim of the patch you mentioned is not to shorten TPM timings, as no
> timing is changed at all on this patch, but to count time more accurately.
It is true that the code was not intended to do so explicitly, however the fact is that counting time more accurately in this context is shortening the TPM timings.

> If your TPM needs more time to perform a specific operation, it has nothing to
> do with this patch, but rather with its specified timings in the kernel. The
> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they
> were not counted accurately and your TPM working was potentially a "positive"
> side effect of the imprecise time counting (ie non-specified delays were added
> every time msleep was used instead of usleep_range).
> TPM polling should not be affected because we use a different way to count
> timings in the kernel.
As our experiments show, the msleep is not inaccurate specifically to atmel TPM, but inaccurate in general in the kernel (or linux). Before the patch, it is likely all TPM timings are around 15ms instead of 5ms. (Which is why  https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 says “we see results going from 1m57s unpatched to 40s") 

> If you see issues with your Atmel TPM, I'd suggest to make the changes in the
> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and
> discuss the timing values with Atmel TPMs experts, but reverting this patch and
> affecting all the other TPMs seems like a misunderstanding.
I understand that the patch https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 makes the timing more accurate, that surfaced the issue. I am not sure how the original 5ms timeout came up at very beginning. If there is already a TPM standard with the 5ms, I agree that TPM hardware manufacturers should follow that. If the 5ms in the driver code was defined based on the tests on which hardwares, then it is likely that the inaccurate msleep mislead the original author. But in either way, we probably need a fix in driver code instead of hardware fix or firmware fix to resolve the breakage.

It is not only observed in the Atmel TPM in Rubrik’s inventory, but also has been observed in multiple our customers’ machines. Thus I think it has wide impact on Atmel TPM chips, and code fix is required.    

From this issue, it looks like the timeout value could be different across TPM chip manufacturers, probably the QA on such change is better to be required for all TPM chip manufacturers in the future ? It looks like https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3  has only tested IFX and STM, which potentially is the root cause of the issue. I am not familiar with the QA process of linux kernel, so correct me if I am wrong.

Regarding of the final fix in the master, I agree that Atmel-specific fix is probably better than global fix. I can take a further look in this path, if folks all agree on that. Let me know your thoughts.

Thanks
Hao

> On Sep 12, 2020, at 7:17 AM, hamza@hpe.com wrote:
> 
> Hello everyone,
> 
> Dear Hao,
> 
> The first aim of the patch you mentioned is not to shorten TPM timings, as no
> timing is changed at all on this patch, but to count time more accurately.
> 
> If your TPM needs more time to perform a specific operation, it has nothing to
> do with this patch, but rather with its specified timings in the kernel. The
> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they
> were not counted accurately and your TPM working was potentially a "positive"
> side effect of the imprecise time counting (ie non-specified delays were added
> every time msleep was used instead of usleep_range).
> TPM polling should not be affected because we use a different way to count
> timings in the kernel.
> 
> If you see issues with your Atmel TPM, I'd suggest to make the changes in the
> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and
> discuss the timing values with Atmel TPMs experts, but reverting this patch and
> affecting all the other TPMs seems like a misunderstanding.
> 
> Thanks,
> Hamza ATTAK.
> 
> On Sat, Sep 12, 2020 at 10:14, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>> Dear Hao,
>> Thank you for the reply.
>> Am 12.09.20 um 10:10 schrieb Hao Wu:
>>> Thanks for quick responses over this report.
>> Thank you for the quick follow-up.
>>>> Hao, I wouldn´t expect a longer timeout causing the TPM to be
>>>> queried less frequently, but I do not know the code well.
>>> From our understanding, the TPM queries might be retried due to some
>>> reason, increase timeout 3x would lower the query frequency to 1/3.
>>> The explanation might be wrong, but the fact looks like the timeout
>>> matters.  Unfortunately, engineers from Rubrik are not experts over
>>> the TPM driver code neither :(
>>>>> Be careful about making this a global change.  It could reduce
>>>>> the TPM performance by 3x. We don't want to affect all TPMs to
>>>>> fix a bug in an old TPM 1.2 chip from one vendor.
>>>> Linux has a no regression policy, so the performance penalty
>>>> wouldn´t matter. Unfortunately, the regression was only noticed
>>>> several years after being introduced in Linux v4.14-rc2.
>>> So does that mean we are good to apply the global change ? Or we need
>>> to discuss about the actual fix further?
>> To get a fix into the stable series, it first needs to be applied to the master branch. I guess you tested also with Linux master, right?
>> Please add the explanation from your email to Greg into the git commit message, format the patch with `git format-patch -1 -o outgoing` and send it with `git send-email outgoing/*` to the addresses listed for the subsystem in `MAINTAINERS` and the people listed in the commit introducing the regression.
>> Then it can be properly reviewed and discussed.
>> Kind regards,
>> Paul
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip
  2020-09-13  4:51             ` Hao Wu
@ 2020-09-13  6:38               ` Hao Wu
  0 siblings, 0 replies; 8+ messages in thread
From: Hao Wu @ 2020-09-13  6:38 UTC (permalink / raw)
  To: Hamza Attak, Paul Menzel, Ken Goldman
  Cc: linux-integrity, Seungyeop Han, Shrihari Kalkar

Hi Hamza, Paul, and Ken,

I just took a look at the TPM driver code and specifically tpm_infineon.c, but it looks like it is not that straightforward to override TPM_TIMEOUT by TPM chip today. I may not have good expertise to make that done shortly. 

Could we checkin the global fix first to address the regression? Or someone with more TPM driver expertise can help work on the TPM-chip specific fix?

Thanks
Hao

> On Sep 12, 2020, at 9:51 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
> Sorry again and resend following email in plaintext
> -------
> 
> Hi Hamza and everyone,
> 
> Thanks Hamza for the response on this issue. 
> 
>> The first aim of the patch you mentioned is not to shorten TPM timings, as no
>> timing is changed at all on this patch, but to count time more accurately.
> It is true that the code was not intended to do so explicitly, however the fact is that counting time more accurately in this context is shortening the TPM timings.
> 
>> If your TPM needs more time to perform a specific operation, it has nothing to
>> do with this patch, but rather with its specified timings in the kernel. The
>> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they
>> were not counted accurately and your TPM working was potentially a "positive"
>> side effect of the imprecise time counting (ie non-specified delays were added
>> every time msleep was used instead of usleep_range).
>> TPM polling should not be affected because we use a different way to count
>> timings in the kernel.
> As our experiments show, the msleep is not inaccurate specifically to atmel TPM, but inaccurate in general in the kernel (or linux). Before the patch, it is likely all TPM timings are around 15ms instead of 5ms. (Which is why  https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 says “we see results going from 1m57s unpatched to 40s") 
> 
>> If you see issues with your Atmel TPM, I'd suggest to make the changes in the
>> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and
>> discuss the timing values with Atmel TPMs experts, but reverting this patch and
>> affecting all the other TPMs seems like a misunderstanding.
> I understand that the patch https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 makes the timing more accurate, that surfaced the issue. I am not sure how the original 5ms timeout came up at very beginning. If there is already a TPM standard with the 5ms, I agree that TPM hardware manufacturers should follow that. If the 5ms in the driver code was defined based on the tests on which hardwares, then it is likely that the inaccurate msleep mislead the original author. But in either way, we probably need a fix in driver code instead of hardware fix or firmware fix to resolve the breakage.
> 
> It is not only observed in the Atmel TPM in Rubrik’s inventory, but also has been observed in multiple our customers’ machines. Thus I think it has wide impact on Atmel TPM chips, and code fix is required.    
> 
> From this issue, it looks like the timeout value could be different across TPM chip manufacturers, probably the QA on such change is better to be required for all TPM chip manufacturers in the future ? It looks like https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3  has only tested IFX and STM, which potentially is the root cause of the issue. I am not familiar with the QA process of linux kernel, so correct me if I am wrong.
> 
> Regarding of the final fix in the master, I agree that Atmel-specific fix is probably better than global fix. I can take a further look in this path, if folks all agree on that. Let me know your thoughts.
> 
> Thanks
> Hao
> 
>> On Sep 12, 2020, at 7:17 AM, hamza@hpe.com wrote:
>> 
>> Hello everyone,
>> 
>> Dear Hao,
>> 
>> The first aim of the patch you mentioned is not to shorten TPM timings, as no
>> timing is changed at all on this patch, but to count time more accurately.
>> 
>> If your TPM needs more time to perform a specific operation, it has nothing to
>> do with this patch, but rather with its specified timings in the kernel. The
>> timings for your Atmel TPM were certainly wrong before 4.14 to begin with, they
>> were not counted accurately and your TPM working was potentially a "positive"
>> side effect of the imprecise time counting (ie non-specified delays were added
>> every time msleep was used instead of usleep_range).
>> TPM polling should not be affected because we use a different way to count
>> timings in the kernel.
>> 
>> If you see issues with your Atmel TPM, I'd suggest to make the changes in the
>> tpm_atmel code only (tpm_infineon would be a suitable example to look at) and
>> discuss the timing values with Atmel TPMs experts, but reverting this patch and
>> affecting all the other TPMs seems like a misunderstanding.
>> 
>> Thanks,
>> Hamza ATTAK.
>> 
>> On Sat, Sep 12, 2020 at 10:14, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>>> Dear Hao,
>>> Thank you for the reply.
>>> Am 12.09.20 um 10:10 schrieb Hao Wu:
>>>> Thanks for quick responses over this report.
>>> Thank you for the quick follow-up.
>>>>> Hao, I wouldn´t expect a longer timeout causing the TPM to be
>>>>> queried less frequently, but I do not know the code well.
>>>> From our understanding, the TPM queries might be retried due to some
>>>> reason, increase timeout 3x would lower the query frequency to 1/3.
>>>> The explanation might be wrong, but the fact looks like the timeout
>>>> matters.  Unfortunately, engineers from Rubrik are not experts over
>>>> the TPM driver code neither :(
>>>>>> Be careful about making this a global change.  It could reduce
>>>>>> the TPM performance by 3x. We don't want to affect all TPMs to
>>>>>> fix a bug in an old TPM 1.2 chip from one vendor.
>>>>> Linux has a no regression policy, so the performance penalty
>>>>> wouldn´t matter. Unfortunately, the regression was only noticed
>>>>> several years after being introduced in Linux v4.14-rc2.
>>>> So does that mean we are good to apply the global change ? Or we need
>>>> to discuss about the actual fix further?
>>> To get a fix into the stable series, it first needs to be applied to the master branch. I guess you tested also with Linux master, right?
>>> Please add the explanation from your email to Greg into the git commit message, format the patch with `git format-patch -1 -o outgoing` and send it with `git send-email outgoing/*` to the addresses listed for the subsystem in `MAINTAINERS` and the people listed in the commit introducing the regression.
>>> Then it can be properly reviewed and discussed.
>>> Kind regards,
>>> Paul
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-09-13  6:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <9173F912-F682-44CC-8408-565A6C675749@rubrik.com>
2020-09-11  4:18 ` [Bug Report] Kernel 4.14+ TPM Driver Bug for Atmel TPM Chip Greg KH
2020-09-11 19:35   ` Ken Goldman
2020-09-12  7:37     ` Paul Menzel
2020-09-12  8:13       ` Hao Wu
     [not found]       ` <B003F2A9-2DF5-4633-91C4-FD6B8A3353ED@rubrik.com>
2020-09-12  8:14         ` Paul Menzel
2020-09-12 14:17           ` hamza
2020-09-13  4:51             ` Hao Wu
2020-09-13  6:38               ` Hao Wu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.