Re: [PATCH v2] tpm: fix Atmel TPM crash caused by too frequent queries

From: Hao Wu <hao.wu@rubrik.com>
To: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Shrihari Kalkar <shrihari.kalkar@rubrik.com>,
	Seungyeop Han <seungyeop.han@rubrik.com>,
	Anish Jhaveri <anish.jhaveri@rubrik.com>,
	peterhuewe@gmx.de, jgg@ziepe.ca, linux-integrity@vger.kernel.org,
	Paul Menzel <pmenzel@molgen.mpg.de>,
	Ken Goldman <kgold@linux.ibm.com>,
	zohar@linux.vnet.ibm.com, why2jjj.linux@gmail.com,
	Hamza Attak <hamza@hpe.com>,
	gregkh@linuxfoundation.org, arnd@arndb.de,
	Nayna <nayna@linux.vnet.ibm.com>,
	James.Bottomley@hansenpartnership.com
Subject: Re: [PATCH v2] tpm: fix Atmel TPM crash caused by too frequent queries
Date: Sun, 11 Jul 2021 00:37:07 -0700
Message-ID: <A470A175-40B2-4357-826A-FA4A9737B49A@rubrik.com>
In-Reply-To: <C8D0A56A-F62D-4D07-8AC7-B03608246B0F@rubrik.com>

> On Jul 9, 2021, at 12:23 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
>> On Jul 9, 2021, at 10:47 AM, Jarkko Sakkinen <jarkko@kernel.org> wrote:
>> 
>> On Thu, Jul 08, 2021 at 09:40:28PM -0700, Hao Wu wrote:
>>> Atmel TPM 1.2 chips have crashed with the error
>>> `tpm_try_transmit: send(): error -62` since kernel 4.14.
>>> The error appears in the kernel log after running `tpm_sealdata -z`.
>>> The command itself fails as follows:
>>> ```
>>> $ tpm_sealdata -z
>>> Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl,
>>> code=0087 (135), I/O error
>>> ```
>>> 
>>> The issue was reproduced with the following Atmel TPM chip:
>>> ```
>>> $ tpm_version
>>> T0  TPM 1.2 Version Info:
>>> Chip Version:        1.2.66.1
>>> Spec Level:          2
>>> Errata Revision:     3
>>> TPM Vendor ID:       ATML
>>> TPM Version:         01010000
>>> Manufacturer Info:   41544d4c
>>> ```
>>> 
>>> The root cause of the issue is that the TPM calls to msleep()
>>> were replaced with usleep_range() [1], which reduces
>>> the actual timeout. Experiments show that
>>> the original msleep(5) actually sleeps for 15ms.
>>> Because of a known timeout issue in the Atmel TPM 1.2 chip,
>>> a timeout shorter than 15ms can cause the error described above.
>>> 
>>> Later changes in kernel 4.16 [2] and 4.18 [3, 4] reduced
>>> the timeout further, to less than 1ms. Experiments show that
>>> the problematic timeout in the latest kernel is the one
>>> in `wait_for_tpm_stat`.
>>> 
>>> To fix it, the patch reverts the timeout of `wait_for_tpm_stat`
>>> to 15ms for all Atmel TPM 1.2 chips, but leaves it untouched
>>> for Atmel TPM 2.0 chips and chips from other vendors.
>>> As explained above, 15ms is the actual timeout
>>> from before this issue was introduced,
>>> so the old value is used here.
>>> Specifically, TPM_ATML_TIMEOUT_WAIT_STAT_MIN is set to 14700us and
>>> TPM_ATML_TIMEOUT_WAIT_STAT_MAX is set to 15000us, following
>>> the existing TPM_TIMEOUT_RANGE_US (300us).
>>> The fix has been tested on a system with the affected Atmel chip,
>>> with no issues observed after boot up.
>>> 
>>> References:
>>> [1] 9f3fc7bcddcb tpm: replace msleep() with usleep_range() in TPM
>>> 1.2/2.0 generic drivers
>>> [2] cf151a9a44d5 tpm: reduce tpm polling delay in tpm_tis_core
>>> [3] 59f5a6b07f64 tpm: reduce poll sleep time in tpm_transmit()
>>> [4] 424eaf910c32 tpm: reduce polling time to usecs for even finer
>>> granularity
>>> 
>>> Fixes: 9f3fc7bcddcb ("tpm: replace msleep() with usleep_range() in TPM 1.2/2.0 generic drivers")
>>> Link: https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
>>> Signed-off-by: Hao Wu <hao.wu@rubrik.com>
>>> ---
>>> This version (v2) has the following changes on top of the last (v1):
>>> - follow the existing way to define two timeouts (min and max)
>>> for ATMEL chips, thus keeping the exact timeout logic for
>>> non-ATMEL chips.
>>> - limit the timeout increase to only ATMEL TPM 1.2 chips,
>>> because it is not an issue for TPM 2.0 chips yet.
>>> 
>>> Test Plan:
>>> - Run the fixed kernel with ATMEL TPM chips and confirm the crash is fixed.
>>> - Run the fixed kernel with non-ATMEL TPM chips, and confirm
>>> the timeout has not been changed.
>>> 
>>> drivers/char/tpm/tpm.h          |  6 ++++--
>>> drivers/char/tpm/tpm_tis_core.c | 23 +++++++++++++++++++++--
>>> include/linux/tpm.h             |  3 +++
>>> 3 files changed, 28 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
>>> index 283f78211c3a..6de1b44c4aab 100644
>>> --- a/drivers/char/tpm/tpm.h
>>> +++ b/drivers/char/tpm/tpm.h
>>> @@ -41,8 +41,10 @@ enum tpm_timeout {
>>> 	TPM_TIMEOUT_RETRY = 100, /* msecs */
>>> 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>>> 	TPM_TIMEOUT_POLL = 1,	/* msecs */
>>> -	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
>>> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
>>> +	TPM_TIMEOUT_USECS_MIN = 100,	/* usecs */
>>> +	TPM_TIMEOUT_USECS_MAX = 500,	/* usecs */
>>> +	TPM_ATML_TIMEOUT_WAIT_STAT_MIN = 14700,	/* usecs */
>>> +	TPM_ATML_TIMEOUT_WAIT_STAT_MAX = 15000	/* usecs */
>>> };
>>> 
>>> /* TPM addresses */
>>> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
>>> index 55b9d3965ae1..ae27d66fdd94 100644
>>> --- a/drivers/char/tpm/tpm_tis_core.c
>>> +++ b/drivers/char/tpm/tpm_tis_core.c
>>> @@ -80,8 +80,17 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>>> 		}
>>> 	} else {
>>> 		do {
>>> -			usleep_range(TPM_TIMEOUT_USECS_MIN,
>>> -				     TPM_TIMEOUT_USECS_MAX);
>>> +			/* this code path could be executed before
>>> +			 * timeouts initialized in chip instance.
>>> +			 */
>>> +			if (chip->timeout_wait_stat_min &&
>>> +			    chip->timeout_wait_stat_max)
>>> +				usleep_range(chip->timeout_wait_stat_min,
>>> +					     chip->timeout_wait_stat_max);
>>> +			else
>>> +				usleep_range(TPM_TIMEOUT_USECS_MIN,
>>> +					     TPM_TIMEOUT_USECS_MAX);
>> 
>> This starts to look otherwise fine but you don't need this condition.
>> Just initialize variables to TPM_TIMEOUT_USECS_{MIN, MAX} for non-Atmel.
> I am not sure I got your point. We have discussed this question over a few rounds before,
> and I answered it then. The check is required because `wait_for_tpm_stat` can run before
> the initialization I added in `tpm_tis_core_init`:
> ```
> +	chip->timeout_wait_stat_min = TPM_TIMEOUT_USECS_MIN;
> +	chip->timeout_wait_stat_max = TPM_TIMEOUT_USECS_MAX;
> ```
> so the condition is needed as a fallback to avoid a crash at system startup.
> 
> Let me know if this makes sense. If needed, I can confirm this again.
I double-checked this and found that the init lines in `tpm_tis_core_init`
now run before this code path. It may have been an issue in one
of my older revisions, which left me with the wrong impression.
The condition seems OK to remove in the current revision.
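
To make the simplification concrete, here is a minimal userspace sketch of
how the code could look without the fallback. It is only a model, not the
kernel code: `struct chip_model`, `model_init()`, `model_poll_sleep()`,
`usleep_range_stub()` and the `is_atmel_tpm12` flag are hypothetical
stand-ins. Only the constant values and the two field names are taken
from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the patch; everything else is a hypothetical model. */
enum {
	TPM_TIMEOUT_USECS_MIN = 100,		/* usecs */
	TPM_TIMEOUT_USECS_MAX = 500,		/* usecs */
	TPM_ATML_TIMEOUT_WAIT_STAT_MIN = 14700,	/* usecs */
	TPM_ATML_TIMEOUT_WAIT_STAT_MAX = 15000	/* usecs */
};

/* Stand-in for the two fields the patch adds to struct tpm_chip. */
struct chip_model {
	unsigned int timeout_wait_stat_min;
	unsigned int timeout_wait_stat_max;
};

/* Last range passed to the stub, so the behavior can be inspected. */
static unsigned int last_sleep_min, last_sleep_max;

/* Stub for usleep_range(): records the range and does not sleep. */
static void usleep_range_stub(unsigned int min_us, unsigned int max_us)
{
	assert(min_us <= max_us);
	last_sleep_min = min_us;
	last_sleep_max = max_us;
}

/*
 * Models the init step: the defaults are always set first, and only an
 * Atmel TPM 1.2 chip is given the longer range. How the chip is detected
 * is outside this sketch, hence the plain flag.
 */
static void model_init(struct chip_model *chip, bool is_atmel_tpm12)
{
	chip->timeout_wait_stat_min = TPM_TIMEOUT_USECS_MIN;
	chip->timeout_wait_stat_max = TPM_TIMEOUT_USECS_MAX;
	if (is_atmel_tpm12) {
		chip->timeout_wait_stat_min = TPM_ATML_TIMEOUT_WAIT_STAT_MIN;
		chip->timeout_wait_stat_max = TPM_ATML_TIMEOUT_WAIT_STAT_MAX;
	}
}

/* Models one polling sleep in wait_for_tpm_stat, with no zero check. */
static void model_poll_sleep(const struct chip_model *chip)
{
	usleep_range_stub(chip->timeout_wait_stat_min,
			  chip->timeout_wait_stat_max);
}
```

This only holds as long as the init step runs before the first poll, which
is what I observed in the current revision.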

What I am not fully sure of is whether the behavior is consistent across other TPM 1.2 chips and TPM 2.0 chips.
Should we still keep the condition for robustness, or ship without it?
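
If we do keep it, one possible shape is to move the check into a small
helper so the polling loop stays simple. The following is a rough userspace
sketch under that assumption; `struct chip_model`, `struct sleep_range` and
`wait_stat_range()` are hypothetical names, not existing kernel API, and
only the two default values and the field names come from the patch.

```c
#include <assert.h>

/* Default range from drivers/char/tpm/tpm.h, as quoted in the patch. */
enum {
	TPM_TIMEOUT_USECS_MIN = 100,	/* usecs */
	TPM_TIMEOUT_USECS_MAX = 500	/* usecs */
};

/* Stand-in for the two fields the patch adds to struct tpm_chip. */
struct chip_model {
	unsigned int timeout_wait_stat_min;
	unsigned int timeout_wait_stat_max;
};

/* Hypothetical pair of values to hand to usleep_range(). */
struct sleep_range {
	unsigned int min_us;
	unsigned int max_us;
};

/*
 * Returns the per-chip range when both fields are set, and the generic
 * defaults otherwise (for example, if a poll happens before init).
 */
static struct sleep_range wait_stat_range(const struct chip_model *chip)
{
	struct sleep_range r = { TPM_TIMEOUT_USECS_MIN, TPM_TIMEOUT_USECS_MAX };

	if (chip->timeout_wait_stat_min && chip->timeout_wait_stat_max) {
		r.min_us = chip->timeout_wait_stat_min;
		r.max_us = chip->timeout_wait_stat_max;
	}
	assert(r.min_us <= r.max_us);
	return r;
}
```

With a zeroed chip this yields 100/500, and with the Atmel values it yields
14700/15000, so the fallback costs little once the fields are initialized.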

>> /Jarkko
> 
> Hao

Hao