linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ipmi: retry to get device id when error
@ 2020-09-13 12:02 Xianting Tian
  2020-09-13 12:39 ` Corey Minyard
  0 siblings, 1 reply; 5+ messages in thread
From: Xianting Tian @ 2020-09-13 12:02 UTC (permalink / raw)
  To: minyard, arnd, gregkh; +Cc: openipmi-developer, linux-kernel, Xianting Tian

We can't get bmc's device id with low probability when loading ipmi driver,
it caused bmc device register failed. This issue may caused by bad lpc
signal quality. When this issue happened, we got below kernel printks:
	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
	[Wed Sep  9 19:52:03 2020] IPMI BT: using default values
	[Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device id: -5
	[Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register device: error -5

When this issue happened, we want to manually unload the driver and try to
load it again, but it can't be unloaded by 'rmmod' as it is already 'in use'.

We add below 'printk' in handle_one_recv_msg(), when this issue happened,
the msg we received is "Recv: 1c 01 d5", which means the data_len is 1,
data[0] is 0xd5.
	Debug code:
	static int handle_one_recv_msg(struct ipmi_smi *intf,
                               struct ipmi_smi_msg *msg) {
        	printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
		... ...
	}
Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
and 'data[0] != 0'.

We used this patch to retry to get device id when error happen, we
reproduced this issue again and the retry succeed on the first retry, we
finally got the correct msg and then all is ok:
Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00

So use retry machanism in this patch to give bmc more opportunity to
correctly response kernel.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
---
 drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 737c0b6b2..bfb2de77a 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -34,6 +34,7 @@
 #include <linux/uuid.h>
 #include <linux/nospec.h>
 #include <linux/vmalloc.h>
+#include <linux/delay.h>
 
 #define IPMI_DRIVER_VERSION "39.2"
 
@@ -60,6 +61,9 @@ enum ipmi_panic_event_op {
 #else
 #define IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE
 #endif
+
+#define GET_DEVICE_ID_MAX_RETRY	5
+
 static enum ipmi_panic_event_op ipmi_send_panic_event = IPMI_PANIC_DEFAULT;
 
 static int panic_op_write_handler(const char *val,
@@ -2426,19 +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)
 static int __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)
 {
 	int rv;
-
-	bmc->dyn_id_set = 2;
+	unsigned int retry_count = 0;
 
 	intf->null_user_handler = bmc_device_id_handler;
 
+retry:
+	bmc->dyn_id_set = 2;
+
 	rv = send_get_device_id_cmd(intf);
 	if (rv)
 		return rv;
 
 	wait_event(intf->waitq, bmc->dyn_id_set != 2);
 
-	if (!bmc->dyn_id_set)
+	if (!bmc->dyn_id_set) {
+		msleep(1000);
+		if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
+			goto retry;
+
 		rv = -EIO; /* Something went wrong in the fetch. */
+	}
 
 	/* dyn_id_set makes the id data available. */
 	smp_rmb();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] ipmi: retry to get device id when error
  2020-09-13 12:02 [PATCH] ipmi: retry to get device id when error Xianting Tian
@ 2020-09-13 12:39 ` Corey Minyard
  2020-09-13 14:10   ` Tianxianting
  0 siblings, 1 reply; 5+ messages in thread
From: Corey Minyard @ 2020-09-13 12:39 UTC (permalink / raw)
  To: Xianting Tian; +Cc: arnd, gregkh, openipmi-developer, linux-kernel

On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote:
> We can't get bmc's device id with low probability when loading ipmi driver,
> it caused bmc device register failed. This issue may caused by bad lpc
> signal quality. When this issue happened, we got below kernel printks:
> 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> 	[Wed Sep  9 19:52:03 2020] IPMI BT: using default values
> 	[Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
> 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device id: -5
> 	[Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register device: error -5
> 
> When this issue happened, we want to manually unload the driver and try to
> load it again, but it can't be unloaded by 'rmmod' as it is already 'in use'.

I'm not sure this patch is a good idea; it would cause a long boot delay
in situations where there really isn't a BMC out there.  Yes, it
happens.

You don't have to reload the driver to add a device, though.  You can
hot-add devices using /sys/modules/ipmi_si/parameters/hotmod.  Look in
Documentation/driver-api/ipmi.rst for details.

Does that work for you?

-corey

> 
> We add below 'printk' in handle_one_recv_msg(), when this issue happened,
> the msg we received is "Recv: 1c 01 d5", which means the data_len is 1,
> data[0] is 0xd5.
> 	Debug code:
> 	static int handle_one_recv_msg(struct ipmi_smi *intf,
>                                struct ipmi_smi_msg *msg) {
>         	printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
> 		... ...
> 	}
> Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
> and 'data[0] != 0'.
> 
> We used this patch to retry to get device id when error happen, we
> reproduced this issue again and the retry succeed on the first retry, we
> finally got the correct msg and then all is ok:
> Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00
> 
> So use retry machanism in this patch to give bmc more opportunity to
> correctly response kernel.
> 
> Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
> ---
>  drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index 737c0b6b2..bfb2de77a 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -34,6 +34,7 @@
>  #include <linux/uuid.h>
>  #include <linux/nospec.h>
>  #include <linux/vmalloc.h>
> +#include <linux/delay.h>
>  
>  #define IPMI_DRIVER_VERSION "39.2"
>  
> @@ -60,6 +61,9 @@ enum ipmi_panic_event_op {
>  #else
>  #define IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE
>  #endif
> +
> +#define GET_DEVICE_ID_MAX_RETRY	5
> +
>  static enum ipmi_panic_event_op ipmi_send_panic_event = IPMI_PANIC_DEFAULT;
>  
>  static int panic_op_write_handler(const char *val,
> @@ -2426,19 +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)
>  static int __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)
>  {
>  	int rv;
> -
> -	bmc->dyn_id_set = 2;
> +	unsigned int retry_count = 0;
>  
>  	intf->null_user_handler = bmc_device_id_handler;
>  
> +retry:
> +	bmc->dyn_id_set = 2;
> +
>  	rv = send_get_device_id_cmd(intf);
>  	if (rv)
>  		return rv;
>  
>  	wait_event(intf->waitq, bmc->dyn_id_set != 2);
>  
> -	if (!bmc->dyn_id_set)
> +	if (!bmc->dyn_id_set) {
> +		msleep(1000);
> +		if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
> +			goto retry;
> +
>  		rv = -EIO; /* Something went wrong in the fetch. */
> +	}
>  
>  	/* dyn_id_set makes the id data available. */
>  	smp_rmb();
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] ipmi: retry to get device id when error
  2020-09-13 12:39 ` Corey Minyard
@ 2020-09-13 14:10   ` Tianxianting
  2020-09-13 17:26     ` Corey Minyard
  0 siblings, 1 reply; 5+ messages in thread
From: Tianxianting @ 2020-09-13 14:10 UTC (permalink / raw)
  To: minyard; +Cc: arnd, gregkh, openipmi-developer, linux-kernel

Hi Corey
Thanks for your quickly reply,
We didn't try the method you mentioned, actually, I didn't know it before you told me:(
The issue ever occurred on our 2 ceph storage server both with low probability.
We finally use this patch to solve the issue, it can automatically solve the issue when it happened. So no need to judge and reload ipmi driver manually or develop additional scripts to make it.
The 1 second delay is acceptable to us.

If there really isn't a BMC out there, ipmi driver will not be loaded, is it right?

May be we can adjust to retry 3 times with 500ms interval? 

Thanks in advance if you can feedback again.

-----Original Message-----
From: Corey Minyard [mailto:tcminyard@gmail.com] On Behalf Of Corey Minyard
Sent: Sunday, September 13, 2020 8:40 PM
To: tianxianting (RD) <tian.xianting@h3c.com>
Cc: arnd@arndb.de; gregkh@linuxfoundation.org; openipmi-developer@lists.sourceforge.net; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ipmi: retry to get device id when error

On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote:
> We can't get bmc's device id with low probability when loading ipmi 
> driver, it caused bmc device register failed. This issue may caused by 
> bad lpc signal quality. When this issue happened, we got below kernel printks:
> 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> 	[Wed Sep  9 19:52:03 2020] IPMI BT: using default values
> 	[Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
> 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device id: -5
> 	[Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register 
> device: error -5
> 
> When this issue happened, we want to manually unload the driver and 
> try to load it again, but it can't be unloaded by 'rmmod' as it is already 'in use'.

I'm not sure this patch is a good idea; it would cause a long boot delay in situations where there really isn't a BMC out there.  Yes, it happens.

You don't have to reload the driver to add a device, though.  You can hot-add devices using /sys/modules/ipmi_si/parameters/hotmod.  Look in Documentation/driver-api/ipmi.rst for details.

Does that work for you?

-corey

> 
> We add below 'printk' in handle_one_recv_msg(), when this issue 
> happened, the msg we received is "Recv: 1c 01 d5", which means the 
> data_len is 1, data[0] is 0xd5.
> 	Debug code:
> 	static int handle_one_recv_msg(struct ipmi_smi *intf,
>                                struct ipmi_smi_msg *msg) {
>         	printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
> 		... ...
> 	}
> Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
> and 'data[0] != 0'.
> 
> We used this patch to retry to get device id when error happen, we 
> reproduced this issue again and the retry succeed on the first retry, 
> we finally got the correct msg and then all is ok:
> Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00
> 
> So use retry machanism in this patch to give bmc more opportunity to 
> correctly response kernel.
> 
> Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
> ---
>  drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
> b/drivers/char/ipmi/ipmi_msghandler.c
> index 737c0b6b2..bfb2de77a 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -34,6 +34,7 @@
>  #include <linux/uuid.h>
>  #include <linux/nospec.h>
>  #include <linux/vmalloc.h>
> +#include <linux/delay.h>
>  
>  #define IPMI_DRIVER_VERSION "39.2"
>  
> @@ -60,6 +61,9 @@ enum ipmi_panic_event_op {  #else  #define 
> IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE  #endif
> +
> +#define GET_DEVICE_ID_MAX_RETRY	5
> +
>  static enum ipmi_panic_event_op ipmi_send_panic_event = 
> IPMI_PANIC_DEFAULT;
>  
>  static int panic_op_write_handler(const char *val, @@ -2426,19 
> +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)  static int 
> __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)  {
>  	int rv;
> -
> -	bmc->dyn_id_set = 2;
> +	unsigned int retry_count = 0;
>  
>  	intf->null_user_handler = bmc_device_id_handler;
>  
> +retry:
> +	bmc->dyn_id_set = 2;
> +
>  	rv = send_get_device_id_cmd(intf);
>  	if (rv)
>  		return rv;
>  
>  	wait_event(intf->waitq, bmc->dyn_id_set != 2);
>  
> -	if (!bmc->dyn_id_set)
> +	if (!bmc->dyn_id_set) {
> +		msleep(1000);
> +		if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
> +			goto retry;
> +
>  		rv = -EIO; /* Something went wrong in the fetch. */
> +	}
>  
>  	/* dyn_id_set makes the id data available. */
>  	smp_rmb();
> --
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ipmi: retry to get device id when error
  2020-09-13 14:10   ` Tianxianting
@ 2020-09-13 17:26     ` Corey Minyard
  2020-09-14  8:25       ` Tianxianting
  0 siblings, 1 reply; 5+ messages in thread
From: Corey Minyard @ 2020-09-13 17:26 UTC (permalink / raw)
  To: Tianxianting; +Cc: arnd, gregkh, openipmi-developer, linux-kernel

On Sun, Sep 13, 2020 at 02:10:01PM +0000, Tianxianting wrote:
> Hi Corey
> Thanks for your quickly reply,
> We didn't try the method you mentioned, actually, I didn't know it before you told me:(
> The issue ever occurred on our 2 ceph storage server both with low probability.
> We finally use this patch to solve the issue, it can automatically solve the issue when it happened. So no need to judge and reload ipmi driver manually or develop additional scripts to make it.
> The 1 second delay is acceptable to us.
> 
> If there really isn't a BMC out there, ipmi driver will not be loaded, is it right?

No, there are systems that have IPMI controllers that are specified in
the ACPI/SMBIOS tables but have no IPMI controller.

> 
> May be we can adjust to retry 3 times with 500ms interval? 

Maybe there is another way.  What I'm guessing is happening is not that
the interface is lossy, but that the BMC is in the middle of a reset or
an upgrade.  The D5 completion code means: Cannot execute command. Command,
or request parameter(s), not supported in present state.

That error is also returned from bt_start_transaction() in the driver if
the driver is not in the right state when starting a transaction,
which may point to a bug in the BT state machine.  Search for
IPMI_NOT_IN_MY_STATE_ERR in drivers/char/ipmi/ipmi_bt_sm.c.

If it's coming fron the state machine, it would be handy to know what
state it was in when the error happened to help trace the bug down.
That's what I would suggest first.  Fix the fundamental bug, if you can.
a pr_warn() added to the code there would be all that's needed, and
thats probably a good permanent addition.

I would be ok with a patch that retried some number of times if it got a
D5 completion code.  That wouldn't have any effect on other systems.
Plus you could add a D1 and D2 completion code, which are similar
things.  Add the new completion codes to include/uapi/linux/ipmi_msgdefs.h
and implement the retry.  That makes sense from a general point of view.

Thanks,

-corey

> 
> Thanks in advance if you can feedback again.
> 
> -----Original Message-----
> From: Corey Minyard [mailto:tcminyard@gmail.com] On Behalf Of Corey Minyard
> Sent: Sunday, September 13, 2020 8:40 PM
> To: tianxianting (RD) <tian.xianting@h3c.com>
> Cc: arnd@arndb.de; gregkh@linuxfoundation.org; openipmi-developer@lists.sourceforge.net; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ipmi: retry to get device id when error
> 
> On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote:
> > We can't get bmc's device id with low probability when loading ipmi 
> > driver, it caused bmc device register failed. This issue may caused by 
> > bad lpc signal quality. When this issue happened, we got below kernel printks:
> > 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> > 	[Wed Sep  9 19:52:03 2020] IPMI BT: using default values
> > 	[Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
> > 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device id: -5
> > 	[Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register 
> > device: error -5
> > 
> > When this issue happened, we want to manually unload the driver and 
> > try to load it again, but it can't be unloaded by 'rmmod' as it is already 'in use'.
> 
> I'm not sure this patch is a good idea; it would cause a long boot delay in situations where there really isn't a BMC out there.  Yes, it happens.
> 
> You don't have to reload the driver to add a device, though.  You can hot-add devices using /sys/modules/ipmi_si/parameters/hotmod.  Look in Documentation/driver-api/ipmi.rst for details.
> 
> Does that work for you?
> 
> -corey
> 
> > 
> > We add below 'printk' in handle_one_recv_msg(), when this issue 
> > happened, the msg we received is "Recv: 1c 01 d5", which means the 
> > data_len is 1, data[0] is 0xd5.
> > 	Debug code:
> > 	static int handle_one_recv_msg(struct ipmi_smi *intf,
> >                                struct ipmi_smi_msg *msg) {
> >         	printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
> > 		... ...
> > 	}
> > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
> > and 'data[0] != 0'.
> > 
> > We used this patch to retry to get device id when error happen, we 
> > reproduced this issue again and the retry succeed on the first retry, 
> > we finally got the correct msg and then all is ok:
> > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00
> > 
> > So use retry machanism in this patch to give bmc more opportunity to 
> > correctly response kernel.
> > 
> > Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
> > ---
> >  drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
> > b/drivers/char/ipmi/ipmi_msghandler.c
> > index 737c0b6b2..bfb2de77a 100644
> > --- a/drivers/char/ipmi/ipmi_msghandler.c
> > +++ b/drivers/char/ipmi/ipmi_msghandler.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/uuid.h>
> >  #include <linux/nospec.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/delay.h>
> >  
> >  #define IPMI_DRIVER_VERSION "39.2"
> >  
> > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op {  #else  #define 
> > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE  #endif
> > +
> > +#define GET_DEVICE_ID_MAX_RETRY	5
> > +
> >  static enum ipmi_panic_event_op ipmi_send_panic_event = 
> > IPMI_PANIC_DEFAULT;
> >  
> >  static int panic_op_write_handler(const char *val, @@ -2426,19 
> > +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)  static int 
> > __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)  {
> >  	int rv;
> > -
> > -	bmc->dyn_id_set = 2;
> > +	unsigned int retry_count = 0;
> >  
> >  	intf->null_user_handler = bmc_device_id_handler;
> >  
> > +retry:
> > +	bmc->dyn_id_set = 2;
> > +
> >  	rv = send_get_device_id_cmd(intf);
> >  	if (rv)
> >  		return rv;
> >  
> >  	wait_event(intf->waitq, bmc->dyn_id_set != 2);
> >  
> > -	if (!bmc->dyn_id_set)
> > +	if (!bmc->dyn_id_set) {
> > +		msleep(1000);
> > +		if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
> > +			goto retry;
> > +
> >  		rv = -EIO; /* Something went wrong in the fetch. */
> > +	}
> >  
> >  	/* dyn_id_set makes the id data available. */
> >  	smp_rmb();
> > --
> > 2.17.1
> > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] ipmi: retry to get device id when error
  2020-09-13 17:26     ` Corey Minyard
@ 2020-09-14  8:25       ` Tianxianting
  0 siblings, 0 replies; 5+ messages in thread
From: Tianxianting @ 2020-09-14  8:25 UTC (permalink / raw)
  To: minyard; +Cc: arnd, gregkh, openipmi-developer, linux-kernel

Hi Corey,
Thanks a lot for you kindly guide.

I send the version 2 patch to you: [PATCH] [v2] ipmi: retry to get device id when error
If there is still something wrong or I missed something, please correct me:)

-----Original Message-----
From: Corey Minyard [mailto:tcminyard@gmail.com] On Behalf Of Corey Minyard
Sent: Monday, September 14, 2020 1:26 AM
To: tianxianting (RD) <tian.xianting@h3c.com>
Cc: arnd@arndb.de; gregkh@linuxfoundation.org; openipmi-developer@lists.sourceforge.net; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ipmi: retry to get device id when error

On Sun, Sep 13, 2020 at 02:10:01PM +0000, Tianxianting wrote:
> Hi Corey
> Thanks for your quickly reply,
> We didn't try the method you mentioned, actually, I didn't know it 
> before you told me:( The issue ever occurred on our 2 ceph storage server both with low probability.
> We finally use this patch to solve the issue, it can automatically solve the issue when it happened. So no need to judge and reload ipmi driver manually or develop additional scripts to make it.
> The 1 second delay is acceptable to us.
> 
> If there really isn't a BMC out there, ipmi driver will not be loaded, is it right?

No, there are systems that have IPMI controllers that are specified in the ACPI/SMBIOS tables but have no IPMI controller.

> 
> May be we can adjust to retry 3 times with 500ms interval? 

Maybe there is another way.  What I'm guessing is happening is not that the interface is lossy, but that the BMC is in the middle of a reset or an upgrade.  The D5 completion code means: Cannot execute command. Command, or request parameter(s), not supported in present state.

That error is also returned from bt_start_transaction() in the driver if the driver is not in the right state when starting a transaction, which may point to a bug in the BT state machine.  Search for IPMI_NOT_IN_MY_STATE_ERR in drivers/char/ipmi/ipmi_bt_sm.c.

If it's coming fron the state machine, it would be handy to know what state it was in when the error happened to help trace the bug down.
That's what I would suggest first.  Fix the fundamental bug, if you can.
a pr_warn() added to the code there would be all that's needed, and thats probably a good permanent addition.

I would be ok with a patch that retried some number of times if it got a
D5 completion code.  That wouldn't have any effect on other systems.
Plus you could add a D1 and D2 completion code, which are similar things.  Add the new completion codes to include/uapi/linux/ipmi_msgdefs.h and implement the retry.  That makes sense from a general point of view.

Thanks,

-corey

> 
> Thanks in advance if you can feedback again.
> 
> -----Original Message-----
> From: Corey Minyard [mailto:tcminyard@gmail.com] On Behalf Of Corey 
> Minyard
> Sent: Sunday, September 13, 2020 8:40 PM
> To: tianxianting (RD) <tian.xianting@h3c.com>
> Cc: arnd@arndb.de; gregkh@linuxfoundation.org; 
> openipmi-developer@lists.sourceforge.net; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ipmi: retry to get device id when error
> 
> On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote:
> > We can't get bmc's device id with low probability when loading ipmi 
> > driver, it caused bmc device register failed. This issue may caused 
> > by bad lpc signal quality. When this issue happened, we got below kernel printks:
> > 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
> > 	[Wed Sep  9 19:52:03 2020] IPMI BT: using default values
> > 	[Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
> > 	[Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device id: -5
> > 	[Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register
> > device: error -5
> > 
> > When this issue happened, we want to manually unload the driver and 
> > try to load it again, but it can't be unloaded by 'rmmod' as it is already 'in use'.
> 
> I'm not sure this patch is a good idea; it would cause a long boot delay in situations where there really isn't a BMC out there.  Yes, it happens.
> 
> You don't have to reload the driver to add a device, though.  You can hot-add devices using /sys/modules/ipmi_si/parameters/hotmod.  Look in Documentation/driver-api/ipmi.rst for details.
> 
> Does that work for you?
> 
> -corey
> 
> > 
> > We add below 'printk' in handle_one_recv_msg(), when this issue 
> > happened, the msg we received is "Recv: 1c 01 d5", which means the 
> > data_len is 1, data[0] is 0xd5.
> > 	Debug code:
> > 	static int handle_one_recv_msg(struct ipmi_smi *intf,
> >                                struct ipmi_smi_msg *msg) {
> >         	printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
> > 		... ...
> > 	}
> > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
> > and 'data[0] != 0'.
> > 
> > We used this patch to retry to get device id when error happen, we 
> > reproduced this issue again and the retry succeed on the first 
> > retry, we finally got the correct msg and then all is ok:
> > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00
> > 
> > So use retry machanism in this patch to give bmc more opportunity to 
> > correctly response kernel.
> > 
> > Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
> > ---
> >  drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/char/ipmi/ipmi_msghandler.c
> > b/drivers/char/ipmi/ipmi_msghandler.c
> > index 737c0b6b2..bfb2de77a 100644
> > --- a/drivers/char/ipmi/ipmi_msghandler.c
> > +++ b/drivers/char/ipmi/ipmi_msghandler.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/uuid.h>
> >  #include <linux/nospec.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/delay.h>
> >  
> >  #define IPMI_DRIVER_VERSION "39.2"
> >  
> > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op {  #else  #define 
> > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE  #endif
> > +
> > +#define GET_DEVICE_ID_MAX_RETRY	5
> > +
> >  static enum ipmi_panic_event_op ipmi_send_panic_event = 
> > IPMI_PANIC_DEFAULT;
> >  
> >  static int panic_op_write_handler(const char *val, @@ -2426,19
> > +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)  static 
> > +int
> > __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)  {
> >  	int rv;
> > -
> > -	bmc->dyn_id_set = 2;
> > +	unsigned int retry_count = 0;
> >  
> >  	intf->null_user_handler = bmc_device_id_handler;
> >  
> > +retry:
> > +	bmc->dyn_id_set = 2;
> > +
> >  	rv = send_get_device_id_cmd(intf);
> >  	if (rv)
> >  		return rv;
> >  
> >  	wait_event(intf->waitq, bmc->dyn_id_set != 2);
> >  
> > -	if (!bmc->dyn_id_set)
> > +	if (!bmc->dyn_id_set) {
> > +		msleep(1000);
> > +		if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
> > +			goto retry;
> > +
> >  		rv = -EIO; /* Something went wrong in the fetch. */
> > +	}
> >  
> >  	/* dyn_id_set makes the id data available. */
> >  	smp_rmb();
> > --
> > 2.17.1
> > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-09-14  8:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-13 12:02 [PATCH] ipmi: retry to get device id when error Xianting Tian
2020-09-13 12:39 ` Corey Minyard
2020-09-13 14:10   ` Tianxianting
2020-09-13 17:26     ` Corey Minyard
2020-09-14  8:25       ` Tianxianting

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).