linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] IPMI state machine regression
@ 2018-08-21 22:14 Andrew Banman
  2018-08-22 16:14 ` Corey Minyard
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Banman @ 2018-08-21 22:14 UTC (permalink / raw)
  To: Corey Minyard, Arnd Bergmann, Greg Kroah-Hartman
  Cc: abanman, justin.ernst, rja, frank.ramsay, openipmi-developer,
	linux-kernel

Dear IPMI supporters,

We observe a window in IPMI BT's opportunistic get capabilities request,
wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
This leaves IPMI hosed until you reload modules. Looks to happen after we call
schedule().

The problem is intermittent, but we observe ~50% fail rate on 4.18
kernels on our UV4 systems.

Here's the relevant snippet (the action starts at 120.367379):

[  118.316764] IPMI System Interface driver.
[  118.354246] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[  118.354249] ipmi_si: SMBIOS: mem 0xce4 regsize 1 spacing 1 irq 6
[  118.354251] ipmi_si: Adding SMBIOS-specified bt state machine
[  118.354311] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[  118.381164] ipmi_si IPI0001:00: [io  0x0ce4-0x0ce6] regsize 1 spacing 1 irq 6
[  118.381167] ipmi_si: Adding ACPI-specified bt state machine
[  118.382811] ipmi_si: Trying SMBIOS-specified bt state machine at mem address 0xce4, slave address 0x20, irq 6
[  118.382815] ipmi_si dmi-ipmi-si.0: Could not set up I/O space
[  118.382816] ipmi_si: Trying ACPI-specified bt state machine at i/o address 0xce4, slave address 0x0, irq 6
[  118.382911] IPMI BT: flag reset [ ]
[  118.383008] BT: XACTION [ ] TO=5000000 - 0
[  118.383059] BT: WR_BYTES [ ] TO=5000000 - 0
[  118.383272] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  118.388034] BT: WR_CONSUME [ ] TO=5000000 - 4000
[  118.388059] BT: RD_WAIT [ ] TO=4996000 - 0
[  118.396017] BT: RD_WAIT [ ] TO=4996000 - 4000
[  118.492015] BT: RD_WAIT [ ] TO=4992000 - 4000
[  118.520073] BT: RD_WAIT [ ] TO=4988000 - 4000
[  118.528079] BT: RD_WAIT [ ] TO=4984000 - 4000
[  118.540049] BT: RD_WAIT [ ] TO=4980000 - 4000
[  118.548011] BT: RD_WAIT [ ] TO=4976000 - 4000
[  118.564033] BT: RD_WAIT [ ] TO=4972000 - 4000
[  118.572109] BT: RD_WAIT [ ] TO=4968000 - 4000
[  118.580013] BT: RD_WAIT [ ] TO=4964000 - 4000
[  118.588017] BT: RD_WAIT [ ] TO=4960000 - 4000
[  118.608014] BT: RD_WAIT [ ] TO=4956000 - 4000
[  118.628058] BT: RD_WAIT [ ] TO=4952000 - 4000
[  118.636053] BT: RD_WAIT [ ] TO=4948000 - 4000
[  118.648016] BT: RD_WAIT [ ] TO=4944000 - 4000
[  118.672028] BT: RD_WAIT [ ] TO=4940000 - 4000
[  118.684056] BT: RD_WAIT [ ] TO=4936000 - 4000
[  118.696021] BT: RD_WAIT [ ] TO=4932000 - 4000
[  118.708015] BT: RD_WAIT [ ] TO=4928000 - 4000
[  118.720018] BT: RD_WAIT [ ] TO=4924000 - 4000
[  118.732063] BT: RD_WAIT [ ] TO=4920000 - 4000
[  118.744021] BT: RD_WAIT [ ] TO=4916000 - 4000
[  118.752089] BT: RD_WAIT [ ] TO=4912000 - 4000
[  118.764011] BT: RD_WAIT [ ] TO=4908000 - 4000
[  118.784009] BT: RD_WAIT [ ] TO=4904000 - 4000
[  118.808011] BT: RD_WAIT [ ] TO=4900000 - 4000
[  118.832010] BT: RD_WAIT [ ] TO=4896000 - 4000
[  118.844011] BT: RD_WAIT [ ] TO=4892000 - 4000
[  118.860036] BT: RD_WAIT [ ] TO=4888000 - 4000
[  118.868102] BT: RD_WAIT [ ] TO=4884000 - 4000
[  118.876012] BT: RD_WAIT [ ] TO=4880000 - 4000
[  118.904011] BT: RD_WAIT [ ] TO=4876000 - 4000
[  118.924013] BT: RD_WAIT [ ] TO=4872000 - 4000
[  118.948016] BT: RD_WAIT [ ] TO=4868000 - 4000
[  118.984043] BT: RD_WAIT [ ] TO=4864000 - 4000
[  118.992059] BT: RD_WAIT [ ] TO=4860000 - 4000
[  118.995422]  for 2000000
[  118.995650]
[  119.000026] BT: RD_WAIT [ ] TO=4856000 - 4000
[  119.012018] BT: RD_WAIT [ ] TO=4852000 - 4000
[  119.032010] BT: RD_WAIT [ ] TO=4848000 - 4000
[  119.048013] BT: RD_WAIT [ ] TO=4844000 - 4000
[  119.060013] BT: RD_WAIT [ ] TO=4840000 - 4000
[  119.080014] BT: RD_WAIT [ ] TO=4836000 - 4000
[  119.096021] BT: RD_WAIT [ ] TO=4832000 - 4000
[  119.120009] BT: RD_WAIT [ ] TO=4828000 - 4000
[  119.128017] BT: RD_WAIT [ ] TO=4824000 - 4000
[  119.136011] BT: RD_WAIT [ ] TO=4820000 - 4000
[  119.144013] BT: RD_WAIT [ ] TO=4816000 - 4000
[  119.164008] BT: RD_WAIT [ ] TO=4812000 - 4000
[  119.184010] BT: RD_WAIT [ ] TO=4808000 - 4000
[  119.204015] BT: RD_WAIT [ ] TO=4804000 - 4000
[  119.264034] BT: RD_WAIT [ ] TO=4800000 - 4000
[  119.272013] BT: RD_WAIT [ ] TO=4796000 - 4000
[  119.288013] BT: RD_WAIT [ ] TO=4792000 - 4000
[  119.308011] BT: RD_WAIT [ ] TO=4788000 - 4000
[  119.324011] BT: RD_WAIT [ ] TO=4784000 - 4000
[  119.344010] BT: RD_WAIT [ ] TO=4780000 - 4000
[  119.364027] BT: RD_WAIT [ ] TO=4776000 - 4000
[  119.376011] BT: RD_WAIT [ ] TO=4772000 - 4000
[  119.396010] BT: RD_WAIT [ ] TO=4768000 - 4000
[  119.408013] BT: RD_WAIT [ ] TO=4764000 - 4000
[  119.424012] BT: RD_WAIT [ B2H ] TO=4760000 - 4000
[  119.434845] BT: CLEAR_B2H [ H_BUSY ] TO=4756000 - 0
[  119.592204] BT: RD_BYTES [ H_BUSY ] TO=4756000 - 0
[  119.650481] BT: XACTION [ ] TO=5000000 - 0
[  119.676062] BT: WR_BYTES [ ] TO=5000000 - 0
[  119.734596] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  119.740019] BT: WR_CONSUME [ ] TO=5000000 - 4000
[  119.740036] BT: RD_WAIT [ ] TO=4996000 - 0
[  119.748032] BT: RD_WAIT [ B2H ] TO=4996000 - 4000
[  119.748101] BT: CLEAR_B2H [ H_BUSY ] TO=4992000 - 0
[  119.748124] BT: RD_BYTES [ H_BUSY ] TO=4992000 - 0
[  119.748348] BT: XACTION [ ] TO=5000000 - 0
[  119.748382] BT: WR_BYTES [ ] TO=5000000 - 0
[  119.748537] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  119.760013] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[  119.806271] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[  119.806282] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[  119.816909] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[  119.816944] BT: XACTION [ ] TO=5000000 - 0
[  119.827061] BT: WR_BYTES [ ] TO=5000000 - 0
[  119.827088] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  119.852024] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[  119.875752] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[  119.895099] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[  119.991768] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[  119.991805] BT: XACTION [ ] TO=5000000 - 0
[  120.004993] BT: WR_BYTES [ ] TO=5000000 - 0
[  120.027171] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  120.048024] BT: WR_CONSUME [ ] TO=5000000 - 4000
[  120.072949] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[  120.085892] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[  120.097374] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[  120.097406] BT: XACTION [ ] TO=5000000 - 0
[  120.125568] BT: WR_BYTES [ ] TO=5000000 - 0
[  120.141827] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  120.264026] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000
[  120.287267] BT: RD_WAIT [ B2H ] TO=4996000 - 0
[  120.287276] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0
[  120.287280] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0
[  120.287429] ipmi_si IPI0001:00: Using irq 6
[  120.287447] **Interrupt: 1534434704.287445059
[  120.287457] BT: XACTION [ ] TO=5000000 - 0
[  120.287466] BT: WR_BYTES [ ] TO=5000000 - 0
[  120.287500] BT: WR_CONSUME [ H2B ] TO=5000000 - 0
[  120.297445] **Interrupt: 1534434704.297445691
[  120.332534] BT: WR_CONSUME [ B2H ] TO=5000000 - 0
[  120.332538] BT: RD_WAIT [ B2H ] TO=5000000 - 0
[  120.342956] BT: CLEAR_B2H [ H_BUSY ] TO=5000000 - 0
[  120.342969] BT: RD_BYTES [ H_BUSY ] TO=5000000 - 0
[  120.343244] **Done: 1534434704.343244675
[  120.367379] BT: IDLE [ ] TO=5000000 - 0
[  120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
[  120.376634] **Timer: 1534434704.376633453
[  120.376646] BT: XACTION [ ] TO=5000000 - 92000
[  120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
[  120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
[  120.377025] Send:  18 08
[  120.377027] **Enqueue: 1534434704.377028726
[  120.377028] **Start2: 1534434704.377029859
[  120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
[  120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
[  120.377140] Recv::  1c 08 d5
[  120.377156] Send:  18 01
[  120.377161] **Enqueue: 1534434704.377161047
[  120.377164] **Start2: 1534434704.377164751
[  120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
[  120.377237] Recv::  1c 01 d5
[  120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
[  120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
[  120.377252] ipmi_si IPI0001:00: Unable to register device: error -5
[  120.569865] IPMI SSIF Interface driver

Also FYI,

Bad cleanup code causes a null pointer and adds a race condition; this
is addressed by "[PATCH] Remove redundant cleanup in ipmi_register_smi"
from Justin Ernst <justin.ernst@hpe.com>, which I paste here:

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 51832b8a2c62..3b0b50c4f064 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3395,12 +3395,12 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers,
 
  out:
 	if (rv) {
-		ipmi_bmc_unregister(intf);
-		list_del_rcu(&intf->link);
+		/*
+		 * ipmi_unregister_smi must be called to clean up after
+		 * failure. We unlock the mutex to allow ipmi_unregister_smi
+		 * to lock it and perform cleanup.
+		 */
 		mutex_unlock(&ipmi_interfaces_mutex);
-		synchronize_srcu(&ipmi_interfaces_srcu);
-		cleanup_srcu_struct(&intf->users_srcu);
-		kref_put(&intf->refcount, intf_free);
 	} else {
 		/*
 		 * Keep memory order straight for RCU readers.  Make
--

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] IPMI state machine regression
  2018-08-21 22:14 [RFC] IPMI state machine regression Andrew Banman
@ 2018-08-22 16:14 ` Corey Minyard
  2018-08-22 16:23   ` Andrew Banman
  0 siblings, 1 reply; 6+ messages in thread
From: Corey Minyard @ 2018-08-22 16:14 UTC (permalink / raw)
  To: Andrew Banman, Arnd Bergmann, Greg Kroah-Hartman
  Cc: justin.ernst, rja, frank.ramsay, openipmi-developer, linux-kernel

On 08/21/2018 05:14 PM, Andrew Banman wrote:
> Dear IPMI supporters,
>
> We observe a window in IPMI BT's opportunistic get capabilities request,
> wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> This leaves IPMI hosed until you reload modules. Looks to happen after we call
> schedule().

When was the latest kernel where this worked properly?  Also, what 
hardware is this?

BTW, you can use the "hotmod" capability of the IPMI driver to add the 
device
dynamically.

-corey

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] IPMI state machine regression
  2018-08-22 16:14 ` Corey Minyard
@ 2018-08-22 16:23   ` Andrew Banman
  2018-08-23 16:22     ` Corey Minyard
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Banman @ 2018-08-22 16:23 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Andrew Banman, Arnd Bergmann, Greg Kroah-Hartman, justin.ernst,
	rja, frank.ramsay, openipmi-developer, linux-kernel

On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
> On 08/21/2018 05:14 PM, Andrew Banman wrote:
> > Dear IPMI supporters,
> > 
> > We observe a window in IPMI BT's opportunistic get capabilities request,
> > wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> > machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> > bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> > There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> > This leaves IPMI hosed until you reload modules. Looks to happen after we call
> > schedule().
> 
> When was the latest kernel where this worked properly?  Also, what hardware
> is this?

This is UV4.

First known bad commit, but I am not sure if the timing issue predates
it:

commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
Author: Jeremy Kerr <jk@ozlabs.org>
Date:   Fri Aug 25 15:47:24 2017 +0800

    ipmi: allow dynamic BMC version information

Hits less frequently with older kernels so I didn't see it until
recently when it became more frequent.

> 
> BTW, you can use the "hotmod" capability of the IPMI driver to add the
> device
> dynamically.
> 
> -corey

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] IPMI state machine regression
  2018-08-22 16:23   ` Andrew Banman
@ 2018-08-23 16:22     ` Corey Minyard
  2018-08-23 16:52       ` Andrew Banman
  0 siblings, 1 reply; 6+ messages in thread
From: Corey Minyard @ 2018-08-23 16:22 UTC (permalink / raw)
  To: Andrew Banman, Corey Minyard
  Cc: Arnd Bergmann, Greg Kroah-Hartman, justin.ernst, rja,
	frank.ramsay, openipmi-developer, linux-kernel

On 08/22/2018 11:23 AM, Andrew Banman wrote:
> On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
>> On 08/21/2018 05:14 PM, Andrew Banman wrote:
>>> Dear IPMI supporters,
>>>
>>> We observe a window in IPMI BT's opportunistic get capabilities request,
>>> wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
>>> machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
>>> bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
>>> There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
>>> This leaves IPMI hosed until you reload modules. Looks to happen after we call
>>> schedule().
>> When was the latest kernel where this worked properly?  Also, what hardware
>> is this?
> This is UV4.
>
> First known bad commit, but I am not sure if the timing issue predates
> it:
>
> commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> Author: Jeremy Kerr <jk@ozlabs.org>
> Date:   Fri Aug 25 15:47:24 2017 +0800
>
>      ipmi: allow dynamic BMC version information
>
> Hits less frequently with older kernels so I didn't see it until
> recently when it became more frequent.

Ok, that's for the crash, which makes sense.  But that's an easy problem 
to fix.
I would like a "Tested-by" on that, if you get to test it, though I was 
able to
simulate various failures there to test it out.

So reading between the lines ("more frequent") I'm guessing that this still
happened with older kernels, but is becoming annoying with newer kernels.
I would guess recent changes causes it to happen more often due to changes
in the way the upper layer interacts with the lower layers, you will 
have more
messages at startup, and the timing is somewhat different.

The BT code itself hasn't changed much in over 10 years.  Nothing that
looks like it would cause an issue like this.  So I would guess this is an
issue that has been around for a while.

I don't have any real hardware with a BT interface, just the one in qemu,
but I've never seen it there.

It actually looks like the state machine is working ok.  But the BMC is
responding to a "Get Device ID" command with:

    Recv::  1c 08 d5


That's an error response with D5, which is "Cannot execute command.
Command, or request parameter(s), not supported in present state."
That's an error response from your BMC.  That particular command
shouldn't ever respond with that error, so I think the bug here is
with your BMC.

-corey



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] IPMI state machine regression
  2018-08-23 16:22     ` Corey Minyard
@ 2018-08-23 16:52       ` Andrew Banman
  2018-08-23 19:06         ` Corey Minyard
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Banman @ 2018-08-23 16:52 UTC (permalink / raw)
  To: minyard
  Cc: Andrew Banman, Arnd Bergmann, Greg Kroah-Hartman, justin.ernst,
	rja, frank.ramsay, openipmi-developer, linux-kernel

On Thu, Aug 23, 2018 at 11:22:58AM -0500, Corey Minyard wrote:
> On 08/22/2018 11:23 AM, Andrew Banman wrote:
> > On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
> > > On 08/21/2018 05:14 PM, Andrew Banman wrote:
> > > > Dear IPMI supporters,
> > > > 
> > > > We observe a window in IPMI BT's opportunistic get capabilities request,
> > > > wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> > > > machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> > > > bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> > > > There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> > > > This leaves IPMI hosed until you reload modules. Looks to happen after we call
> > > > schedule().
> > > When was the latest kernel where this worked properly?  Also, what hardware
> > > is this?
> > This is UV4.
> > 
> > First known bad commit, but I am not sure if the timing issue predates
> > it:
> > 
> > commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> > Author: Jeremy Kerr <jk@ozlabs.org>
> > Date:   Fri Aug 25 15:47:24 2017 +0800
> > 
> >      ipmi: allow dynamic BMC version information
> > 
> > Hits less frequently with older kernels so I didn't see it until
> > recently when it became more frequent.
> 
> Ok, that's for the crash, which makes sense.  But that's an easy problem to
> fix.
> I would like a "Tested-by" on that, if you get to test it, though I was able
> to
> simulate various failures there to test it out.

Testing it today. We'll respond to the patch you sent out. Thanks for
fixing that!

> 
> So reading between the lines ("more frequent") I'm guessing that this still
> happened with older kernels, but is becoming annoying with newer kernels.
> I would guess recent changes causes it to happen more often due to changes
> in the way the upper layer interacts with the lower layers, you will have
> more
> messages at startup, and the timing is somewhat different.
> 
> The BT code itself hasn't changed much in over 10 years.  Nothing that
> looks like it would cause an issue like this.  So I would guess this is an
> issue that has been around for a while.

That's what I suspect as well.

> 
> I don't have any real hardware with a BT interface, just the one in qemu,
> but I've never seen it there.
> 
> It actually looks like the state machine is working ok.  But the BMC is
> responding to a "Get Device ID" command with:
> 
>    Recv::  1c 08 d5

That completion code is spoofed in after bt_start_transaction when bt->state
is in WR_CONSUME. This is pretty visible from the message trace, but here
is another with an explicit debug printk (below).

Maybe I'm totally reading this wrong, but I think that err gets passed 
on to the upper layer:

	start_transaction: return 0xD5 -> return_hosed_msg -> deliver_recv_msg
		-> ipmi_smi_msg_received


[  120.367379] BT: IDLE [ ] TO=5000000 - 0
[  120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
[  120.376634] **Timer: 1534434704.376633453
[  120.376646] BT: XACTION [ ] TO=5000000 - 92000
[  120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
[  120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
[  120.377025] Send:  18 08
[  120.377027] **Enqueue: 1534434704.377028726
[  120.377028] **Start2: 1534434704.377029859
# Added DBG # [  120.377028] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [  120.377030]  18
# Added DBG # [  120.377030]  08
# Added DBG # [  120.377030]
[  120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
[  120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
[  120.377140] Recv::  1c 08 d5
[  120.377156] Send:  18 01
[  120.377161] **Enqueue: 1534434704.377161047
[  120.377164] **Start2: 1534434704.377164751
# Added DBG # [  120.377165] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [  120.377166]  18
# Added DBG # [  120.377167]  01
# Added DBG # [  120.377168]
[  120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
[  120.377237] Recv::  1c 01 d5
[  120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
[  120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
[  120.377252] ipmi_si IPI0001:00: Unable to register device: error -5


diff --git a/drivers/char/ipmi/ipmi_bt_sm.c b/drivers/char/ipmi/ipmi_bt_sm.c
index d99d744..1be4de9 100644
--- a/drivers/char/ipmi/ipmi_bt_sm.c
+++ b/drivers/char/ipmi/ipmi_bt_sm.c
@@ -24,7 +24,7 @@
  * value
  */

-static int bt_debug = 4; /* 0 == BT_DEBUG_OFF */
+static int bt_debug = 2; /* 0 == BT_DEBUG_OFF */

 module_param(bt_debug, int, 0644);
 MODULE_PARM_DESC(bt_debug, "debug bitmask, 1=enable, 2=messages, 4=states");
@@ -217,8 +217,14 @@ static int bt_start_transaction(struct si_sm_data *bt,
        if (bt->state == BT_STATE_LONG_BUSY)
                return IPMI_NODE_BUSY_ERR;

-   if (bt->state != BT_STATE_IDLE)
+ if (bt->state != BT_STATE_IDLE) {
+         pr_crit("bt->state = %d : bt_start_transaction returned "
+                 "IPMI_NOT_IN_MY_STATE_ERR 0xd5\n", bt->state);
+         for (i = 0; i < size; i ++)
+                 printk(" %02x", data[i]);
+         printk("\n");
                return IPMI_NOT_IN_MY_STATE_ERR;
+ }

        if (bt_debug & BT_DEBUG_MSG) {
                printk(KERN_WARNING "BT: +++++++++++++++++ New command\n");

> 
> 
> That's an error response with D5, which is "Cannot execute command.
> Command, or request parameter(s), not supported in present state."
> That's an error response from your BMC.  That particular command
> shouldn't ever respond with that error, so I think the bug here is
> with your BMC.

My BMC engineers and I cannot reproduce the problem running ipmi commands
direct from the BMC.

> 
> -corey
> 
> 

Thank you,

Andrew

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] IPMI state machine regression
  2018-08-23 16:52       ` Andrew Banman
@ 2018-08-23 19:06         ` Corey Minyard
  0 siblings, 0 replies; 6+ messages in thread
From: Corey Minyard @ 2018-08-23 19:06 UTC (permalink / raw)
  To: Andrew Banman
  Cc: Arnd Bergmann, Greg Kroah-Hartman, justin.ernst, rja,
	frank.ramsay, openipmi-developer, linux-kernel

On 08/23/2018 11:52 AM, Andrew Banman wrote:
>
>> I don't have any real hardware with a BT interface, just the one in qemu,
>> but I've never seen it there.
>>
>> It actually looks like the state machine is working ok.  But the BMC is
>> responding to a "Get Device ID" command with:
>>
>>     Recv::  1c 08 d5
> That completion code is spoofed in after bt_start_transaction when bt->state
> is in WR_CONSUME. This is pretty visible from the message trace, but here
> is another with an explicit debug printk (below).
>
> Maybe I'm totally reading this wrong, but I think that err gets passed
> on to the upper layer:
>
> 	start_transaction: return 0xD5 -> return_hosed_msg -> deliver_recv_msg
> 		-> ipmi_smi_msg_received

Ah, yes, you are right.  I didn't write the BT state machine, so I'm not 
terribly
familiar with how it works.  I'm pretty sure the author is no longer 
with HP.

>
> [  120.367379] BT: IDLE [ ] TO=5000000 - 0
> [  120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
> [  120.376634] **Timer: 1534434704.376633453
> [  120.376646] BT: XACTION [ ] TO=5000000 - 92000
> [  120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
> [  120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
> [  120.377025] Send:  18 08
> [  120.377027] **Enqueue: 1534434704.377028726
> [  120.377028] **Start2: 1534434704.377029859

So what appears to be happening is that the BT state machine sends the
BT capabilities commands to the BMC.  While this is going on the upper
layer sends a messages, but the state machine can't send because it's
in the middle of another operation.

The "device" in qemu is almost instantaneous, so you would never
see it there.

So this is almost certainly due to the timing and messaging changes.
in the upper layer.  Another step may be necessary in initialization
to allow this to happen.

Let me poke at this a bit and I'll send another patch.

Thanks,

-corey

> # Added DBG # [  120.377028] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
> # Added DBG # [  120.377030]  18
> # Added DBG # [  120.377030]  08
> # Added DBG # [  120.377030]
> [  120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
> [  120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
> [  120.377140] Recv::  1c 08 d5
> [  120.377156] Send:  18 01
> [  120.377161] **Enqueue: 1534434704.377161047
> [  120.377164] **Start2: 1534434704.377164751
> # Added DBG # [  120.377165] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
> # Added DBG # [  120.377166]  18
> # Added DBG # [  120.377167]  01
> # Added DBG # [  120.377168]
> [  120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
> [  120.377237] Recv::  1c 01 d5
> [  120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> [  120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
> [  120.377252] ipmi_si IPI0001:00: Unable to register device: error -5
>
>


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-08-23 19:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-21 22:14 [RFC] IPMI state machine regression Andrew Banman
2018-08-22 16:14 ` Corey Minyard
2018-08-22 16:23   ` Andrew Banman
2018-08-23 16:22     ` Corey Minyard
2018-08-23 16:52       ` Andrew Banman
2018-08-23 19:06         ` Corey Minyard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).