linux-integrity.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hao Wu <hao.wu@rubrik.com>
To: Hao Wu <hao.wu@rubrik.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
	peterhuewe@gmx.de, jarkko.sakkinen@linux.intel.com, jgg@ziepe.ca,
	arnd@arndb.de, gregkh@linuxfoundation.org,
	Hamza Attak <hamza@hpe.com>,
	nayna@linux.vnet.ibm.com, why2jjj.linux@gmail.com,
	zohar@linux.vnet.ibm.com, linux-integrity@vger.kernel.org,
	Paul Menzel <pmenzel@molgen.mpg.de>,
	Ken Goldman <kgold@linux.ibm.com>,
	Seungyeop Han <seungyeop.han@rubrik.com>,
	Shrihari Kalkar <shrihari.kalkar@rubrik.com>,
	Anish Jhaveri <anish.jhaveri@rubrik.com>
Subject: Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
Date: Sun, 27 Sep 2020 17:15:51 -0700	[thread overview]
Message-ID: <3C374AD2-4591-4B5F-89AA-730F2CCBC05F@rubrik.com> (raw)
In-Reply-To: <E6E3C07D-57F4-48F5-B4A9-50868B82E779@rubrik.com>

I am attaching the original bug report to this thread for new reviewers to get better context

---

Hi TPM Driver Maintainers,

We are from Rubrik engineering team. We found a TPM driver update since kernel 4.14 causing atmel TPM chips crash. We have root caused it and have a patch on the kernel used by Rubrik products. Given the “bug” has impact on not just Rubrik products, but all atmel TPM chips, we are asking to fix the issue in the kernel upstream.

The commit that introduced the bug since 4.14 
https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3

Effected platform / system:
- Cisco UCS C220 M5 with atmel TPM chip
	- Ubuntu 16.04
  	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4
- Cisco UCS C240 M5 with atmel TPM chip
	- Ubuntu 16.04
  	- Kernel 4.14 / 4.15 / 4.18 / 4.20 / 5.8 / 5.9-rc4

```
# TPM chip used in the problematic platform
$ tpm_version
TPM 1.2 Version Info:
  Chip Version:        1.2.66.1
  Spec Level:          2
  Errata Revision:     3
  TPM Vendor ID:       ATML
  TPM Version:         01010000
  Manufacturer Info:   41544d4c
```

Not all Cisco server nodes are facing the crash, but there are a good amount of portion (around 50% from nodes in Rubrik) are persistently having the TPM crash issue.

Our other platforms using TPM chips from other vendors do not have issue. These platform are running the same software as the problematic platforms run. Those TPM non-effected vendors are
- IFX
- STM
- WEC

TPM crash signature:
```
# error when running tpm tool
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl, code=0087 (135), I/O error

# tpm driver sends error
$ sudo dmesg | grep tpm0
[59154.665549] tpm tpm0: tpm_try_transmit: send(): error -62
[59154.809532] tpm tpm0: tpm_try_transmit: send(): error -62
```

Our Root Cause Analysis
From the error code “-62”, it looks similar to another bug https://patchwork.kernel.org/patch/10520247/
where the “TPM_TIMEOUT_USECS_MAX” and “TPM_TIMEOUT_USEC_MIN” is too small, which causes TPM get queried too frequently, and thus crashes.

The issue for atmel TPM chip crash is similar the commit https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 changed TPM sleep logic from using `msleep` to `tpm_msleep`, in which `usleep_range` is used.

As what https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 intended to do, using usleep_range can make the sleep period shorter, because msleep actually sleeps longer when the sleep perioid is within 20ms, and usleep_range can make it more precise.

According to our experiments,
- usleep_range makes the TPM sleep precisely for TPM_TIMEOUT (i.e. 5ms https://github.com/torvalds/linux/blob/v4.14/drivers/char/tpm/tpm.h#L52)
- msleep(TPM_TIMEOUT) actually sleeps around 15ms    

Thus the TPM get queried more frequently than before, which is likely the root cause of the atmel chip crash. We fix it by bumping up the TPM_TIMEOUT to 15ms.


Rubrik Patch (patch only for 4.14)
```
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 72d3ce4..9b8f3f8 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -49,7 +49,15 @@ enum tpm_const {
 };

 enum tpm_timeout {
-       TPM_TIMEOUT = 5,        /* msecs */
+       TPM_TIMEOUT = 15,       /* msecs */
        TPM_TIMEOUT_RETRY = 100, /* msecs */
        TPM_TIMEOUT_RANGE_US = 300      /* usecs */
 };
```
With the patch, the atmel TPM chip crash is fixed.  

Proposal
We want the kernel upstream to adopt the fix or have a better fix for the atmel chip while not bring performance regression for other TPM chips. We understand that https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3 was intended to shorten the TPM respond time, but it does not work well for atmel TPM chips. Probably we should override TPM_TIMEOUT value for atmel chips at least.

Thanks
Hao

> On Sep 27, 2020, at 5:11 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
> Hi James,
> 
> Maybe there is a misunderstanding. Here I am using tpm_msleep, not msleep.
> tpm_msleep is using usleep underlaying. See
> https://github.com/torvalds/linux/blob/master/drivers/char/tpm/tpm.h#L188
> 
> The reasons I choose 15ms, is because before 
> https://github.com/torvalds/linux/commit/9f3fc7bcddcb51234e23494531f93ab60475e1c3
> (Where msleep is changed to tpm_msleep (which is essentially usleep)),
> The actual sleep time is 15ms, thus here we change this back to 15ms to fix
> regression.
> 
> Thanks
> Hao 
> 
>> On Sep 27, 2020, at 11:25 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>> 
>> On Sat, 2020-09-26 at 16:10 -0700, Hao Wu wrote:
>>> Resending following email in plaintext.
>>> 
>>> ----
>>> 
>>> Hi James,
>>> 
>>> Thanks for following up.
>>> 
>>> We have actually tried change 
>>> TPM_TIMEOUT_USECS_MIN / TPM_TIMEOUT_USECS_MAX 
>>> according to https://patchwork.kernel.org/patch/10520247/
>>> It does not solve the problem for ATMEL chip. The chips facing crash
>>> is 
>>> not experimental, but happens commonly in 
>>> the production systems we and our customers are using.
>>> It is widely found in Cisco 220 / 240 systems which are using
>>> Ateml chips.
>> 
>> Well, I came up with the values in that patch by trial and error ....
>> all I know is they work for my nuvoton. If they're not right for you,
>> see if you can find what values actually do work for your TPM.  The
>> difference between msleep and usleep_range is that the former can have
>> an indefinitely long timeout and the latter has a range bounded one. 
>> If you think msleep works for you, the chances are it doesn't and
>> you're relying on the large upper bound to make the bug infrequent
>> enough for you not to see it.  Playing with the values in usleep range
>> will help you find what the actual timeout is and eliminate the problem
>> for good.
>> 
>> James
> 


  reply	other threads:[~2020-09-28  0:15 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-26 22:31 [PATCH] Fix Atmel TPM crash caused by too frequent queries Hao Wu
2020-09-26 22:57 ` James Bottomley
2020-09-26 23:10   ` Hao Wu
2020-09-27 18:25     ` James Bottomley
2020-09-28  0:11       ` Hao Wu
2020-09-28  0:15         ` Hao Wu [this message]
2020-09-28  1:22         ` James Bottomley
2020-09-28  5:59           ` Hao Wu
2020-09-28 22:11             ` James Bottomley
2020-09-29  4:46               ` Hao Wu
2020-09-30  2:16               ` Jarkko Sakkinen
2020-09-30 14:54                 ` James Bottomley
2020-09-30 15:37                   ` Jarkko Sakkinen
2020-09-30 20:48                     ` James Bottomley
2020-09-30 21:09                       ` Jarkko Sakkinen
2020-09-30 22:31                         ` James Bottomley
2020-10-01  1:50                           ` Jarkko Sakkinen
2020-10-01  4:53                             ` James Bottomley
2020-10-01 18:15                               ` Nayna
2020-10-01 18:32                                 ` James Bottomley
2020-10-01 23:04                                   ` Jarkko Sakkinen
2020-10-17  6:11                                     ` Hao Wu
2020-10-18  5:09                                       ` Jarkko Sakkinen
2020-10-18  5:20                                         ` Hao Wu
2020-11-14  4:39                                           ` Hao Wu
2020-11-18 21:11                                             ` Jarkko Sakkinen
2020-11-18 23:23                                               ` Hao Wu
2021-05-09  6:18                                               ` Hao Wu
2021-05-09  6:31                                                 ` Hao Wu
2021-05-10  2:17                                                   ` Mimi Zohar
2021-05-10  3:15                                                     ` Hao Wu
2021-05-10 17:28                                                     ` Jarkko Sakkinen
2020-09-28  1:08       ` Jarkko Sakkinen
2020-09-28  6:03         ` Hao Wu
2020-09-28 14:16           ` Jarkko Sakkinen
2020-09-28 17:49             ` Hao Wu
2020-09-28 19:47               ` Jarkko Sakkinen
2020-09-28 20:27                 ` Hao Wu
2020-09-30  2:11                   ` Jarkko Sakkinen
2020-09-30  3:41                     ` Hao Wu
     [not found]                       ` <EA1EE8F8-F054-4E1B-B830-231398D33CB8@rubrik.com>
2020-10-01 14:16                         ` Mimi Zohar
  -- strict thread matches above, loose matches on Subject: below --
2021-06-20 23:18 Hao Wu
2021-06-23 13:35 ` Jarkko Sakkinen
2021-06-24  5:49   ` Hao Wu
2021-06-29 20:06     ` Jarkko Sakkinen
2021-06-30  4:27       ` Hao Wu
2021-06-24  5:33 ` Hao Wu
2021-06-29 20:07   ` Jarkko Sakkinen
2020-09-14  6:13 Hao Wu
2020-09-14  6:17 ` Greg KH
2020-09-15  2:52 ` Hao Wu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3C374AD2-4591-4B5F-89AA-730F2CCBC05F@rubrik.com \
    --to=hao.wu@rubrik.com \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=anish.jhaveri@rubrik.com \
    --cc=arnd@arndb.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=hamza@hpe.com \
    --cc=jarkko.sakkinen@linux.intel.com \
    --cc=jgg@ziepe.ca \
    --cc=kgold@linux.ibm.com \
    --cc=linux-integrity@vger.kernel.org \
    --cc=nayna@linux.vnet.ibm.com \
    --cc=peterhuewe@gmx.de \
    --cc=pmenzel@molgen.mpg.de \
    --cc=seungyeop.han@rubrik.com \
    --cc=shrihari.kalkar@rubrik.com \
    --cc=why2jjj.linux@gmail.com \
    --cc=zohar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).