linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Calvin Owens <calvinowens@fb.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Peter Huewe <peterhuewe@gmx.de>,
	Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, "Arnd Bergmann" <arnd@arndb.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"linux-integrity@vger.kernel.org"
	<linux-integrity@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	Calvin Owens <calvinowens@fb.com>
Subject: Re: [PATCH] tpm: Make timeout logic simpler and more robust
Date: Tue, 12 Mar 2019 19:37:30 +0000	[thread overview]
Message-ID: <20190312193702.GA4148@Haydn> (raw)
In-Reply-To: <1552350463.23859.8.camel@HansenPartnership.com>

On Monday 03/11 at 17:27 -0700, James Bottomley wrote:
> On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > e're having lots of problems with TPM commands timing out, and we're
> > seeing these problems across lots of different hardware (both v1/v2).
> > 
> > I instrumented the driver to collect latency data, but I wasn't able
> > to find any specific timeout to fix: it seems like many of them are
> > too aggressive. So I tried replacing all the timeout logic with a
> > single universal long timeout, and found that makes our TPMs 100%
> > reliable.
> > 
> > Given that this timeout logic is very complex, problematic, and
> > appears to serve no real purpose, I propose simply deleting all of
> > it.
> 
> "no real purpose" is a bit strong given that all these timeouts are
> standards mandated.  

Sure, in fairness I said "appears to" ;)

We tested this on roughly a hundred machines with a variety of hardware,
they were flaky before and essentially perfectly reliable after this
patch. So that's where I'm coming from here.

> The purpose stated by the standards is that there needs to be a way of
> differentiating the TPM crashed from the TPM is taking a very long
> time to respond.  For a normally functioning TPM it looks complex and
> unnecessary, but for a malfunctioning one it's a lifesaver.

Does getting -EWHATEVER some 2-3 seconds more quickly really make much
of a difference? That's all we're talking about changing here, right?

> Could you first check it's not a problem we introduced with our polling
> changes?  My nuvoton still doesn't work properly with the default poll
> timings but it works flawlessly if I use the patch below.  I think my
> nuvoton is a bit out of spec (it's a very early model that was software
> upgraded from 1.2 to 2.0) because no-one else on the list seems to see
> the problems I see, but perhaps you are.

I did consider the polling changes. My thinking was that, since the poll
loops I was seeing time out are all gated on time_before(), it would
only potentially change how much the final poll overruns the target
jiffies, and wasn't as likely to help as changing the timeouts
themselves.

The theory about poking it too aggressively making it fall off the bus
definitely makes sense, but the success of this "universal timeout"
approach suggests to me that the timeouts themselves are the root
problem with the flakiness we're seeing in production.

Thanks,
Calvin
 
> James
> 
> ---
> 
> From 249d60a9fafa8638433e545b50dab6987346cb26 Mon Sep 17 00:00:00 2001
> From: James Bottomley <James.Bottomley@HansenPartnership.com>
> Date: Wed, 11 Jul 2018 10:11:14 -0700
> Subject: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
> 
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
> 
> jejb@jarvis:~> dmesg|grep tpm
> [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
> 
> The problem is caused by a change that caused us to poke the TPM far
> more often to see if it's ready.  Apparently something about the bus
> its on and the TPM means that it crashes or falls off the bus if you
> poke it too often and once this happens, only a reboot will recover
> it.
> 
> The fix I've come up with is to adjust the timings so the TPM no
> longer falls of the bus.  Obviously, this fix works for my Nuvoton
> NPCT6xxx but that's the only TPM I've tested it with.
> 
> Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer granularity
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> 
> diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> index 4b104245afed..a6c806d98950 100644
> --- a/drivers/char/tpm/tpm.h
> +++ b/drivers/char/tpm/tpm.h
> @@ -64,8 +64,8 @@ enum tpm_timeout {
>  	TPM_TIMEOUT_RETRY = 100, /* msecs */
>  	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
>  	TPM_TIMEOUT_POLL = 1,	/* msecs */
> -	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
> -	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
> +	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
> +	TPM_TIMEOUT_USECS_MAX = 1000,      /* usecs */
>  };
>  
>  /* TPM addresses */

  parent reply	other threads:[~2019-03-12 19:38 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-11 23:54 [PATCH] tpm: Make timeout logic simpler and more robust Calvin Owens
2019-03-12  0:27 ` James Bottomley
2019-03-12 12:50   ` Jarkko Sakkinen
2019-03-12 14:42     ` James Bottomley
2019-03-12 15:39       ` Jarkko Sakkinen
2019-03-12 19:41         ` Calvin Owens
2019-03-12 16:59       ` Mimi Zohar
2019-03-12 17:14         ` James Bottomley
2019-03-12 18:32           ` Mimi Zohar
2019-03-12 19:37   ` Calvin Owens [this message]
2019-03-12 12:36 ` Jarkko Sakkinen
2019-03-12 16:56   ` Mimi Zohar
2019-03-12 14:55 ` Jarkko Sakkinen
2019-03-12 17:04 ` Mimi Zohar
2019-03-12 20:08   ` Calvin Owens
2019-03-12 20:56     ` Mimi Zohar
2019-03-13 13:22   ` Jarkko Sakkinen
2019-03-13 13:23     ` Jarkko Sakkinen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190312193702.GA4148@Haydn \
    --to=calvinowens@fb.com \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=Kernel-team@fb.com \
    --cc=arnd@arndb.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=jarkko.sakkinen@linux.intel.com \
    --cc=jgg@ziepe.ca \
    --cc=linux-integrity@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterhuewe@gmx.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).