All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression in tpm_tis driver: the TPM now fatally offlines itself after a few hours of use
@ 2018-07-11 17:04 James Bottomley
  2018-07-11 17:11 ` [PATCH] tpm.h: increase poll timings to fix tpm_tis regression James Bottomley
  0 siblings, 1 reply; 10+ messages in thread
From: James Bottomley @ 2018-07-11 17:04 UTC (permalink / raw)
  To: linux-integrity; +Cc: Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

First a caveat: all my laptop security goes through the TPM, so I'm a
much more industrial consumer of the technology than most users, who
don't use a TPM at all.

However, since 4.18-rc1 I've been seeing these errors with the TPM:

jejb@jarvis:~> dmesg|grep tpm
[    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
[14566.626614] tpm tpm0: Operation Timed out
[14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
[14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
[14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14579.710418] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14581.710404] tpm tpm0: tpm_try_transmit: tpm_send: error -62
...

What happens is that I get one command that errors out with ETIME and
from that point on every TPM operation always returns -ETIME and it's
impossible to recover the TPM by any means except a reboot.

There are only three patches to tpm_tis in the merge window and it
looks like reverting this one fixes the problem:

commit 424eaf910c329ab06ad03a527ef45dcf6a328f00
Author: Nayna Jain <nayna@linux.vnet.ibm.com>
Date:   Wed May 16 01:51:25 2018 -0400

    tpm: reduce polling time to usecs for even finer granularity

As far as I can tell, all that patch does is cause the TPM to be poked
far more often to see if it's finished, but something about this rate
of poking is causing it to drop off its bus.  Based on this theory,
I've got a proposed fix which increases the timing parameters so we can
maintain the performance benefits of the above patch while remedying
the regression (I'll send it as a reply to this report).

Regards,

James

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 17:04 Regression in tpm_tis driver: the TPM now fatally offlines itself after a few hours of use James Bottomley
@ 2018-07-11 17:11 ` James Bottomley
  2018-07-11 18:21   ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: James Bottomley @ 2018-07-11 17:11 UTC (permalink / raw)
  To: linux-integrity; +Cc: Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

tpm_tis regressed recently to the point where the TPM being driven by
it falls off the bus and cannot be contacted after some hours of use.
This is the failure trace:

jejb@jarvis:~> dmesg|grep tpm
[    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
[14566.626614] tpm tpm0: Operation Timed out
[14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
[14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
[14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
...

The problem is caused by a change that caused us to poke the TPM far
more often to see if it's ready.  Apparently something about the bus
its on and the TPM means that it crashes or falls off the bus if you
poke it too often and once this happens, only a reboot will recover
it.

The fix I've come up with is to adjust the timings so the TPM no
longer falls of the bus.  Obviously, this fix works for my Nuvoton
NPCT6xxx but that's the only TPM I've tested it with.

Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer granularity
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/char/tpm/tpm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 8114ea88449f..5d147a8dcca8 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -58,8 +58,8 @@ enum tpm_timeout {
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
-	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 1000,      /* usecs */
 };
 
 /* TPM addresses */
-- 
2.13.7

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 17:11 ` [PATCH] tpm.h: increase poll timings to fix tpm_tis regression James Bottomley
@ 2018-07-11 18:21   ` Jason Gunthorpe
  2018-07-11 19:08     ` James Bottomley
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2018-07-11 18:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
> 
> jejb@jarvis:~> dmesg|grep tpm
> [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
> 
> The problem is caused by a change that caused us to poke the TPM far
> more often to see if it's ready.  Apparently something about the bus
> its on and the TPM means that it crashes or falls off the bus if you
> poke it too often and once this happens, only a reboot will recover
> it.

I wonder if something about triggering ETIME even once breaks the
driver so it can't talk to the chip at all thereafter..

Ie it doesn't abort the command properly and becomes desynced with the
TIS execution protocol.

I would be very surprised if polling the TIS status register effects
the firmware running inside the chip..

BTW, no interrupt in your latop setup? I'm surprised by that..

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 18:21   ` Jason Gunthorpe
@ 2018-07-11 19:08     ` James Bottomley
  2018-07-11 20:01       ` Jason Gunthorpe
  2018-07-16 20:16       ` Jarkko Sakkinen
  0 siblings, 2 replies; 10+ messages in thread
From: James Bottomley @ 2018-07-11 19:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > tpm_tis regressed recently to the point where the TPM being driven
> > by
> > it falls off the bus and cannot be contacted after some hours of
> > use.
> > This is the failure trace:
> > 
> > jejb@jarvis:~> dmesg|grep tpm
> > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 
> > 2)
> > [14566.626614] tpm tpm0: Operation Timed out
> > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system
> > error -62
> > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system
> > error -62
> > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > ...
> > 
> > The problem is caused by a change that caused us to poke the TPM
> > far
> > more often to see if it's ready.  Apparently something about the
> > bus
> > its on and the TPM means that it crashes or falls off the bus if
> > you
> > poke it too often and once this happens, only a reboot will recover
> > it.
> 
> I wonder if something about triggering ETIME even once breaks the
> driver so it can't talk to the chip at all thereafter..
> 
> Ie it doesn't abort the command properly and becomes desynced with
> the TIS execution protocol.

Yes, I wondered about this, but I don't understand the bus protocol
well enough.  The tpm-interface:tpm_try_transmit() which throws the
first ETIME says after we get that we send chip->ops->cancel() which
tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
there a bigger hammer I can hit it with?

> I would be very surprised if polling the TIS status register effects.

Well, I was surprised too, but given the patch that's causing the
problem, it was the only explanation I could come up with.

> the firmware running inside the chip..
> 
> 
> BTW, no interrupt in your latop setup? I'm surprised by that..

Heh, I hear rumours that most laptops are delivered like this and a
large software company might be annoyed enough about the problem to
mandate interrupts the next time it produces a hardware requirements
spec.

James

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 19:08     ` James Bottomley
@ 2018-07-11 20:01       ` Jason Gunthorpe
  2018-07-11 20:39         ` James Bottomley
  2018-07-16 20:16       ` Jarkko Sakkinen
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2018-07-11 20:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > > tpm_tis regressed recently to the point where the TPM being driven
> > > by
> > > it falls off the bus and cannot be contacted after some hours of
> > > use.
> > > This is the failure trace:
> > > 
> > > jejb@jarvis:~> dmesg|grep tpm
> > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 
> > > 2)
> > > [14566.626614] tpm tpm0: Operation Timed out
> > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system
> > > error -62
> > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system
> > > error -62
> > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > ...
> > > 
> > > The problem is caused by a change that caused us to poke the TPM
> > > far
> > > more often to see if it's ready.  Apparently something about the
> > > bus
> > > its on and the TPM means that it crashes or falls off the bus if
> > > you
> > > poke it too often and once this happens, only a reboot will recover
> > > it.
> > 
> > I wonder if something about triggering ETIME even once breaks the
> > driver so it can't talk to the chip at all thereafter..
> > 
> > Ie it doesn't abort the command properly and becomes desynced with
> > the TIS execution protocol.
> 
> Yes, I wondered about this, but I don't understand the bus protocol
> well enough.  The tpm-interface:tpm_try_transmit() which throws the
> first ETIME says after we get that we send chip->ops->cancel() which
> tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
> there a bigger hammer I can hit it with?

I don't remember off hand.. But this is, IMHO, a better guess than the
firmware crashes from reading the status register..

Perhaps reducing the timeout to force a ETIME would prove the theory?

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 20:01       ` Jason Gunthorpe
@ 2018-07-11 20:39         ` James Bottomley
  2018-07-11 20:51           ` Peter Huewe
  0 siblings, 1 reply; 10+ messages in thread
From: James Bottomley @ 2018-07-11 20:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > > > tpm_tis regressed recently to the point where the TPM being
> > > > driven
> > > > by
> > > > it falls off the bus and cannot be contacted after some hours
> > > > of
> > > > use.
> > > > This is the failure trace:
> > > > 
> > > > jejb@jarvis:~> dmesg|grep tpm
> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
> > > > rev-id 
> > > > 2)
> > > > [14566.626614] tpm tpm0: Operation Timed out
> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
> > > > system
> > > > error -62
> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
> > > > system
> > > > error -62
> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > ...
> > > > 
> > > > The problem is caused by a change that caused us to poke the
> > > > TPM
> > > > far
> > > > more often to see if it's ready.  Apparently something about
> > > > the
> > > > bus
> > > > its on and the TPM means that it crashes or falls off the bus
> > > > if
> > > > you
> > > > poke it too often and once this happens, only a reboot will
> > > > recover
> > > > it.
> > > 
> > > I wonder if something about triggering ETIME even once breaks the
> > > driver so it can't talk to the chip at all thereafter..
> > > 
> > > Ie it doesn't abort the command properly and becomes desynced
> > > with the TIS execution protocol.
> > 
> > Yes, I wondered about this, but I don't understand the bus protocol
> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
> > first ETIME says after we get that we send chip->ops->cancel()
> > which tpm_tis simply translates to tpm_tis_ready() which also times
> > out.  Is there a bigger hammer I can hit it with?
> 
> I don't remember off hand.. But this is, IMHO, a better guess than
> the firmware crashes from reading the status register..

Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
don't have any tools to probe the LPC.

James


> Perhaps reducing the timeout to force a ETIME would prove the theory?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 20:39         ` James Bottomley
@ 2018-07-11 20:51           ` Peter Huewe
  2018-07-11 20:54             ` James Bottomley
  2018-07-11 21:06             ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Huewe @ 2018-07-11 20:51 UTC (permalink / raw)
  To: James Bottomley, Jason Gunthorpe
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain



Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>:
>On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
>> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
>> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
>> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
>> > > > tpm_tis regressed recently to the point where the TPM being
>> > > > driven
>> > > > by
>> > > > it falls off the bus and cannot be contacted after some hours
>> > > > of
>> > > > use.
>> > > > This is the failure trace:
>> > > > 
>> > > > jejb@jarvis:~> dmesg|grep tpm
>> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
>> > > > rev-id 
>> > > > 2)
>> > > > [14566.626614] tpm tpm0: Operation Timed out
>> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
>> > > > system
>> > > > error -62
>> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
>> > > > system
>> > > > error -62
>> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > ...
>> > > > 
>> > > > The problem is caused by a change that caused us to poke the
>> > > > TPM
>> > > > far
>> > > > more often to see if it's ready.  Apparently something about
>> > > > the
>> > > > bus
>> > > > its on and the TPM means that it crashes or falls off the bus
>> > > > if
>> > > > you
>> > > > poke it too often and once this happens, only a reboot will
>> > > > recover
>> > > > it.
>> > > 
>> > > I wonder if something about triggering ETIME even once breaks the
>> > > driver so it can't talk to the chip at all thereafter..
>> > > 
>> > > Ie it doesn't abort the command properly and becomes desynced
>> > > with the TIS execution protocol.
>> > 
>> > Yes, I wondered about this, but I don't understand the bus protocol
>> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
>> > first ETIME says after we get that we send chip->ops->cancel()
>> > which tpm_tis simply translates to tpm_tis_ready() which also times
>> > out.  Is there a bigger hammer I can hit it with?
>> 
>> I don't remember off hand.. But this is, IMHO, a better guess than
>> the firmware crashes from reading the status register..
>
>Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
>don't have any tools to probe the LPC.
>

I doubt that your fTPM is actually attached to LPC.
And usually if lpc  wedges it takes down your pc with it (from my experience)

I wonder what the contents of the acc and sts registers actually are.


If you want i can send you a small tis 'debug' tool tomorrow.

>James
>
>
>> Perhaps reducing the timeout to force a ETIME would prove the theory?

-- 
Sent from my mobile

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 20:51           ` Peter Huewe
@ 2018-07-11 20:54             ` James Bottomley
  2018-07-11 21:06             ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: James Bottomley @ 2018-07-11 20:54 UTC (permalink / raw)
  To: Peter Huewe, Jason Gunthorpe
  Cc: linux-integrity, Jarkko Sakkinen, Thorsten Leemhuis, Nayna Jain

On Wed, 2018-07-11 at 22:51 +0200, Peter Huewe wrote:
> 
> Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottoml
> ey@HansenPartnership.com>:
> > On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> > > On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> > > > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
[...]
> > > > > I wonder if something about triggering ETIME even once breaks
> > > > > the driver so it can't talk to the chip at all thereafter..
> > > > > 
> > > > > Ie it doesn't abort the command properly and becomes desynced
> > > > > with the TIS execution protocol.
> > > > 
> > > > Yes, I wondered about this, but I don't understand the bus
> > > > protocol well enough.  The tpm-interface:tpm_try_transmit()
> > > > which throws the first ETIME says after we get that we send
> > > > chip->ops->cancel() which tpm_tis simply translates to
> > > > tpm_tis_ready() which also times out.  Is there a bigger hammer
> > > > I can hit it with?
> > > 
> > > I don't remember off hand.. But this is, IMHO, a better guess
> > > than the firmware crashes from reading the status register..
> > 
> > Oh, actually, I think the bus crashes or wedges, not the TPM.  I
> > just don't have any tools to probe the LPC.
> > 
> 
> I doubt that your fTPM is actually attached to LPC.
> And usually if lpc  wedges it takes down your pc with it (from my
> experience)

How do I tell what it is attached to?  Sysfs doesn't seem to know but I
assume the ACPI information should contain something relevant.

> I wonder what the contents of the acc and sts registers actually are.
> 
> 
> If you want i can send you a small tis 'debug' tool tomorrow.

Sure, that would help ... I'm just leafing through the TIS manual now
to see if there's anything I can build into the driver.

James

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 20:51           ` Peter Huewe
  2018-07-11 20:54             ` James Bottomley
@ 2018-07-11 21:06             ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2018-07-11 21:06 UTC (permalink / raw)
  To: Peter Huewe
  Cc: James Bottomley, linux-integrity, Jarkko Sakkinen,
	Thorsten Leemhuis, Nayna Jain

On Wed, Jul 11, 2018 at 10:51:40PM +0200, Peter Huewe wrote:
> 
> 
> Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>:
> >On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> >> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> >> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> >> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> >> > > > tpm_tis regressed recently to the point where the TPM being
> >> > > > driven
> >> > > > by
> >> > > > it falls off the bus and cannot be contacted after some hours
> >> > > > of
> >> > > > use.
> >> > > > This is the failure trace:
> >> > > > 
> >> > > > jejb@jarvis:~> dmesg|grep tpm
> >> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
> >> > > > rev-id 
> >> > > > 2)
> >> > > > [14566.626614] tpm tpm0: Operation Timed out
> >> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
> >> > > > system
> >> > > > error -62
> >> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
> >> > > > system
> >> > > > error -62
> >> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > ...
> >> > > > 
> >> > > > The problem is caused by a change that caused us to poke the
> >> > > > TPM
> >> > > > far
> >> > > > more often to see if it's ready.  Apparently something about
> >> > > > the
> >> > > > bus
> >> > > > its on and the TPM means that it crashes or falls off the bus
> >> > > > if
> >> > > > you
> >> > > > poke it too often and once this happens, only a reboot will
> >> > > > recover
> >> > > > it.
> >> > > 
> >> > > I wonder if something about triggering ETIME even once breaks the
> >> > > driver so it can't talk to the chip at all thereafter..
> >> > > 
> >> > > Ie it doesn't abort the command properly and becomes desynced
> >> > > with the TIS execution protocol.
> >> > 
> >> > Yes, I wondered about this, but I don't understand the bus protocol
> >> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
> >> > first ETIME says after we get that we send chip->ops->cancel()
> >> > which tpm_tis simply translates to tpm_tis_ready() which also times
> >> > out.  Is there a bigger hammer I can hit it with?
> >> 
> >> I don't remember off hand.. But this is, IMHO, a better guess than
> >> the firmware crashes from reading the status register..
> >
> >Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
> >don't have any tools to probe the LPC.
> >
> 
> I doubt that your fTPM is actually attached to LPC.
> And usually if lpc  wedges it takes down your pc with it (from my experience)

Yes, agree. Very, very unlikely LPC crashed or wedged.

You could try reading the ID numbers register (eg the source of this
data 'device-id 0xFE rev-id 2').

That should be on the HW side of the FIFO bridge and if it returs the
same values as boot the bus is not broken, no matter what the bus is.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tpm.h: increase poll timings to fix tpm_tis regression
  2018-07-11 19:08     ` James Bottomley
  2018-07-11 20:01       ` Jason Gunthorpe
@ 2018-07-16 20:16       ` Jarkko Sakkinen
  1 sibling, 0 replies; 10+ messages in thread
From: Jarkko Sakkinen @ 2018-07-16 20:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jason Gunthorpe, linux-integrity, Thorsten Leemhuis, Nayna Jain

On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> Yes, I wondered about this, but I don't understand the bus protocol
> well enough.  The tpm-interface:tpm_try_transmit() which throws the
> first ETIME says after we get that we send chip->ops->cancel() which
> tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
> there a bigger hammer I can hit it with?

tpm_tis_ready() gets called twice and it is unsynced (no wait). Maybe
this confuses the chip somehow?

/Jarkko

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-07-16 20:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 17:04 Regression in tpm_tis driver: the TPM now fatally offlines itself after a few hours of use James Bottomley
2018-07-11 17:11 ` [PATCH] tpm.h: increase poll timings to fix tpm_tis regression James Bottomley
2018-07-11 18:21   ` Jason Gunthorpe
2018-07-11 19:08     ` James Bottomley
2018-07-11 20:01       ` Jason Gunthorpe
2018-07-11 20:39         ` James Bottomley
2018-07-11 20:51           ` Peter Huewe
2018-07-11 20:54             ` James Bottomley
2018-07-11 21:06             ` Jason Gunthorpe
2018-07-16 20:16       ` Jarkko Sakkinen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.