Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries

From: Hao Wu <hao.wu@rubrik.com>
To: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
	Nayna <nayna@linux.vnet.ibm.com>,
	peterhuewe@gmx.de, jgg@ziepe.ca, arnd@arndb.de,
	gregkh@linuxfoundation.org, Hamza Attak <hamza@hpe.com>,
	why2jjj.linux@gmail.com, zohar@linux.vnet.ibm.com,
	linux-integrity@vger.kernel.org,
	Paul Menzel <pmenzel@molgen.mpg.de>,
	Ken Goldman <kgold@linux.ibm.com>,
	Seungyeop Han <seungyeop.han@rubrik.com>,
	Shrihari Kalkar <shrihari.kalkar@rubrik.com>,
	Anish Jhaveri <anish.jhaveri@rubrik.com>
Subject: Re: [PATCH] Fix Atmel TPM crash caused by too frequent queries
Date: Fri, 13 Nov 2020 20:39:28 -0800	[thread overview]
Message-ID: <9E249567-4901-4FA4-BA89-EF6DE51F7E7A@rubrik.com> (raw)
In-Reply-To: <53B75B06-FD89-4B00-BC3F-46C5B28DC201@rubrik.com>

> On Oct 17, 2020, at 10:20 PM, Hao Wu <hao.wu@rubrik.com> wrote:
> 
>> On Oct 17, 2020, at 10:09 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>> 
>> On Fri, Oct 16, 2020 at 11:11:37PM -0700, Hao Wu wrote:
>>>> On Oct 1, 2020, at 4:04 PM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>>>> 
>>>> On Thu, Oct 01, 2020 at 11:32:59AM -0700, James Bottomley wrote:
>>>>> On Thu, 2020-10-01 at 14:15 -0400, Nayna wrote:
>>>>>> On 10/1/20 12:53 AM, James Bottomley wrote:
>>>>>>> On Thu, 2020-10-01 at 04:50 +0300, Jarkko Sakkinen wrote:
>>>>>>>> On Wed, Sep 30, 2020 at 03:31:20PM -0700, James Bottomley wrote:
>>>>>>>>> On Thu, 2020-10-01 at 00:09 +0300, Jarkko Sakkinen wrote:
>>>>> [...]
>>>>>>>>>> I also wonder if we could adjust the frequency dynamically.
>>>>>>>>>> I.e. start with optimistic value and lower it until finding
>>>>>>>>>> the sweet spot.
>>>>>>>>> 
>>>>>>>>> The problem is the way this crashes: the TPM seems to be
>>>>>>>>> unrecoverable. If it were recoverable without a hard reset of
>>>>>>>>> the entire machine, we could certainly play around with it.  I
>>>>>>>>> can try alternative mechanisms to see if anything's viable, but
>>>>>>>>> to all intents and purposes, it looks like my TPM simply stops
>>>>>>>>> responding to the TIS interface.
>>>>>>>> 
>>>>>>>> A quickly scraped idea probably with some holes in it but I was
>>>>>>>> thinking something like
>>>>>>>> 
>>>>>>>> 1. Initially set slow value for latency, this could be the
>>>>>>>> original 15 ms.
>>>>>>>> 2. Use this to read TPM_PT_VENDOR_STRING_*.
>>>>>>>> 3. Lookup based vendor string from a fixup table a latency that
>>>>>>>> works
>>>>>>>>  (the fallback latency could be the existing latency).
>>>>>>> 
>>>>>>> Well, yes, that was sort of what I was thinking of doing for the
>>>>>>> Atmel ... except I was thinking of using the TIS VID (16 byte
>>>>>>> assigned vendor ID) which means we can get the information to set
>>>>>>> the timeout before we have to do any TPM operations.
>>>>>> 
>>>>>> I wonder if the timeout issue exists for all TPM commands for the
>>>>>> same manufacturer.  For example, does the ATMEL TPM also crash when 
>>>>>> extending  PCRs ?
>>>>>> 
>>>>>> In addition to defining a per TPM vendor based lookup table for
>>>>>> timeout, would it be a good idea to also define a Kconfig/boot param
>>>>>> option to allow timeout setting.  This will enable to set the timeout
>>>>>> based on the specific use.
>>>>> 
>>>>> I don't think we need go that far (yet).  The timing change has been in
>>>>> upstream since:
>>>>> 
>>>>> commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
>>>>> Author: Nayna Jain <nayna@linux.vnet.ibm.com>
>>>>> Date:   Wed May 16 01:51:25 2018 -0400
>>>>> 
>>>>>  tpm: reduce polling time to usecs for even finer granularity
>>>>> 
>>>>> Which was in the released kernel 4.18: over two years ago.  In all that
>>>>> time we've discovered two problems: mine which looks to be an artifact
>>>>> of an experimental upgrade process in a new nuvoton and the Atmel. 
>>>>> That means pretty much every other TPM simply works with the existing
>>>>> timings
>>>>> 
>>>>>> I was also thinking how will we decide the lookup table values for
>>>>>> each vendor ?
>>>>> 
>>>>> I wasn't thinking we would.  I was thinking I'd do a simple exception
>>>>> for the Atmel and nothing else.  I don't think my Nuvoton is in any way
>>>>> characteristic.  Indeed my pluggable TPM rainbow bridge system works
>>>>> just fine with a Nuvoton and the current timings.
>>>>> 
>>>>> We can add additional exceptions if they actually turn up.
>>>> 
>>>> I'd add a table and fallback.
>>>> 
>>> 
>>> Hi folks,
>>> 
>>> I want to follow up this a bit and check whether we reached a consensus 
>>> on how to fix the timeout issue for Atmel chip.
>>> 
>>> Should we revert the changes or introduce the lookup table for chips.
>>> 
>>> Is there anything I can help from Rubrik side.
>>> 
>>> Thanks
>>> Hao
>> 
>> There is nothing to revert as the previous was not applied but I'm
>> of course ready to review any new attempts.
>> 
> 
> Hi Jarkko,
> 
> By “revert” I meant we revert the timeout value changes by applying
> the patch I proposed, as the timeout value discussed does cause issues.
> 
> Why don’t we apply the patch and improve the perf in the way of not
> breaking TPMs ? 
> 
> Hao

Hi Jarkko and folks,

It’s being a while since our last discussion. I want to push a fix in the upstream for ateml chip. 
It looks like we currently have following choices:
1.  generic fix for all vendors: have a lookup table for sleep time of wait_for_tpm_stat 
  (i.e. TPM_TIMEOUT_WAIT_STAT in my proposed patch) 
2.  quick fix for the regression: change the sleep time of wait_for_tpm_stat back to 15ms.
  It is the current proposed patch
3. Fix regression by making exception for ateml chip.  

Should we reach consensus on which one we want to pursue before dig into implementation
of the patch? In my opinion, I prefer to fix the regression with 2, and then pursue 1 as long-term
solution. 3 is hacky.

Let me know what do you guys think

Hao