* [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
@ 2021-10-15  3:38 Shuai Xue
  2021-10-15 15:37 ` Luck, Tony
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Shuai Xue @ 2021-10-15  3:38 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, bp, tony.luck, james.morse, lenb, rjw
  Cc: xueshuai, zhangliguang, zhuo.song

When injecting an error into the platform, the OSPM executes an
EXECUTE_OPERATION action to instruct the platform to begin the injection
operation. The OSPM then busy-waits, repeatedly executing the
CHECK_BUSY_STATUS action until the platform indicates that the operation
is complete. Currently the platform must respond within 1 millisecond,
which is too strict for some platforms.

For example, on an Arm platform, when injecting a Processor Correctable
error, the OSPM will warn:
    Firmware does not respond in time.

And a message is printed on the console:
    echo: write error: Input/output error

We observe that the waiting time for DDR error injection is about 10 ms
and that for PCIe error injection is about 500 ms on the Arm platform.

In this patch, we relax the response timeout to 1 second and allow the
user to pass the timeout value as an argument.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/einj.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 133156759551..fa2386ee37db 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -14,6 +14,7 @@
 
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/moduleparam.h>
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/debugfs.h>
@@ -28,9 +29,9 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "EINJ: " fmt
 
-#define SPIN_UNIT		100			/* 100ns */
-/* Firmware should respond within 1 milliseconds */
-#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
+#define SPIN_UNIT		100			/* 100us */
+/* Firmware should respond within 1 seconds */
+#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)
 #define ACPI5_VENDOR_BIT	BIT(31)
 #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
 				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
@@ -40,6 +41,8 @@
  * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
  */
 static int acpi5;
+static int timeout_default = FIRMWARE_TIMEOUT;
+module_param(timeout_default, int, 0644);
 
 struct set_error_type_with_address {
 	u32	type;
@@ -176,7 +179,7 @@ static int einj_timedout(u64 *t)
 		return 1;
 	}
 	*t -= SPIN_UNIT;
-	ndelay(SPIN_UNIT);
+	udelay(SPIN_UNIT);
 	touch_nmi_watchdog();
 	return 0;
 }
@@ -403,7 +406,7 @@ static int __einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
 			       u64 param3, u64 param4)
 {
 	struct apei_exec_context ctx;
-	u64 val, trigger_paddr, timeout = FIRMWARE_TIMEOUT;
+	u64 val, trigger_paddr, timeout = timeout_default;
 	int rc;
 
 	einj_exec_ctx_init(&ctx);
-- 
2.20.1.12.g72788fdb



* RE: [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-15  3:38 [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second Shuai Xue
@ 2021-10-15 15:37 ` Luck, Tony
  2021-10-17  4:06   ` Shuai Xue
  2021-10-22 13:44 ` [PATCH v2] " Shuai Xue
  2021-10-26  7:28 ` [PATCH v3] " Shuai Xue
  2 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2021-10-15 15:37 UTC (permalink / raw)
  To: Shuai Xue, linux-kernel, linux-acpi, bp, james.morse, lenb, rjw
  Cc: zhangliguang, zhuo.song

> We observe that the waiting time for DDR error injection is about 10 ms
> and that for PCIe error injection is about 500 ms on the Arm platform.
>
> In this patch, we relax the response timeout to 1 second and allow the
> user to pass the timeout value as an argument.

Spinning for 1ms was maybe ok. Spinning for up to 1s seems like a bad idea.

This code is executed inside a mutex ... so maybe it is safe to sleep instead of spin?

-Tony


* Re: [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-15 15:37 ` Luck, Tony
@ 2021-10-17  4:06   ` Shuai Xue
  2021-10-18 15:40     ` Luck, Tony
  0 siblings, 1 reply; 14+ messages in thread
From: Shuai Xue @ 2021-10-17  4:06 UTC (permalink / raw)
  To: Luck, Tony, linux-kernel, linux-acpi, bp, james.morse, lenb, rjw
  Cc: zhangliguang, zhuo.song

Hi, Tony,

Thank you for your reply.

> Spinning for 1ms was maybe ok. Spinning for up to 1s seems like a bad idea.
>
> This code is executed inside a mutex ... so maybe it is safe to sleep instead of spin?

Maybe the email subject misled you. This code does NOT spin for 1 sec. The
period of each spin depends on SPIN_UNIT.

> -#define SPIN_UNIT		100			/* 100ns */
> -/* Firmware should respond within 1 milliseconds */
> -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
> +#define SPIN_UNIT		100			/* 100us */
> +/* Firmware should respond within 1 seconds */
> +#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)

The period was 100 ns and is now 100 us. In my opinion, spinning for 100 ns or 100 us is OK :)

The timeout_default is set to FIRMWARE_TIMEOUT (1 sec) by default. If the platform does not
respond within timeout_default after multiple spins, the OSPM will print a warning message to
dmesg.

Regards,
Shuai
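
For reference, the numbers behind the two sets of constants can be worked
out with a small user-space sketch. It is not code from einj.c: the helper
names below are made up for illustration, and it assumes only what the diff
shows, namely that einj_timedout() subtracts one unit from the remaining
budget and delays for one unit on every unsuccessful poll.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in einj.c): number of delay steps taken before
 * a budget of `timeout` units, decremented by `unit` per poll, runs out.
 */
static uint64_t polls_before_timeout(uint64_t timeout, uint64_t unit)
{
	assert(unit > 0);
	return timeout / unit;
}

/*
 * Hypothetical helper: worst-case time spent in delays, in nanoseconds.
 * `unit_ns` is the real duration of one delay step.
 */
static uint64_t worst_case_wait_ns(uint64_t timeout, uint64_t unit,
				   uint64_t unit_ns)
{
	return polls_before_timeout(timeout, unit) * unit_ns;
}
```

With the old constants (a budget of 1,000,000 ns in 100 ns steps) this gives
10,000 polls and 1 ms in total. With the v1 constants (a budget of
1,000,000 us in 100 us steps) it is still 10,000 polls, but the total comes
to 1,000,000,000 ns, i.e. 1 second, if the firmware never reports
completion.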



* Re: [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-17  4:06   ` Shuai Xue
@ 2021-10-18 15:40     ` Luck, Tony
  2021-10-19 13:33       ` Shuai Xue
  0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2021-10-18 15:40 UTC (permalink / raw)
  To: Shuai Xue
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

On Sun, Oct 17, 2021 at 12:06:52PM +0800, Shuai Xue wrote:
> Hi, Tony,
> 
> Thank you for your reply.
> 
> > Spinning for 1ms was maybe ok. Spinning for up to 1s seems like a bad idea.
> >
> > This code is executed inside a mutex ... so maybe it is safe to sleep instead of spin?
> 
> Maybe the email subject misled you. This code does NOT spin for 1 sec. The
> period of each spin depends on SPIN_UNIT.

Not just the subject line. See the comment you changed here:

> > -#define SPIN_UNIT		100			/* 100ns */
> > -/* Firmware should respond within 1 milliseconds */
> > -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
> > +#define SPIN_UNIT		100			/* 100us */
> > +/* Firmware should respond within 1 seconds */
> > +#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)

That definitely reads to me as the timeout being increased from
1 millisecond to 1 second, with the old code polling for completion
every 100ns and the new code polling every 100us.
> 
> The period was 100 ns and is now 100 us. In my opinion, spinning for 100 ns or 100 us is OK :)

But what does the code do in between polls? The calling code is:

        for (;;) {
                rc = apei_exec_run(&ctx, ACPI_EINJ_CHECK_BUSY_STATUS);
                if (rc)
                        return rc;
                val = apei_exec_ctx_get_output(&ctx);
                if (!(val & EINJ_OP_BUSY))
                        break;
                if (einj_timedout(&timeout))
                        return -EIO;
        }

Now apei_exec_run() and apei_exec_ctx_get_output() are a maze of
functions & macros. But I don't think they can block, sleep, or
context switch.

So this code is "spinning" until either BIOS says the operation is
complete, or the FIRMWARE_TIMEOUT is reached.

It avoids triggering a watchdog by the call to touch_nmi_watchdog()
after each spin between polls. But the whole thing may be spinning
for a second.

I'm not at all sure that I'm right that the spin could be replaced
with an msleep(). It will certainly slow things down for systems
and EINJ operations that actually complete quickly, because instead
of returning within 100ns (or 100us with your patch) it will sleep
for 1 ms (rounded up to the next jiffy ... so 4 ms on HZ=250 systems).

But I don't care if my error injections take 4ms.

I do care that one logical CPU spins for 1 second.

-Tony
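
The jiffy rounding mentioned above can be illustrated with a short
user-space sketch. This is an assumption-laden model, not the kernel's
msleep(): the function name is invented, it only rounds a request up to
whole timer ticks, and it assumes HZ divides 1000 evenly. The real msleep()
may sleep longer than this lower bound.

```c
#include <assert.h>

/*
 * Hypothetical helper: round a millisecond sleep request up to a whole
 * number of timer ticks and return the result in milliseconds.
 * Assumes 1000 % hz == 0 (true for HZ = 100, 250 and 1000).
 */
static unsigned int ms_rounded_to_tick(unsigned int ms, unsigned int hz)
{
	unsigned int tick_ms;
	unsigned int ticks;

	assert(hz > 0 && 1000 % hz == 0);
	tick_ms = 1000 / hz;			/* length of one tick in ms */
	ticks = (ms + tick_ms - 1) / tick_ms;	/* round up to whole ticks */
	return ticks * tick_ms;
}
```

Under these assumptions a 1 ms request becomes 4 ms at HZ=250 and 10 ms at
HZ=100, and stays at 1 ms at HZ=1000.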


* Re: [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-18 15:40     ` Luck, Tony
@ 2021-10-19 13:33       ` Shuai Xue
  0 siblings, 0 replies; 14+ messages in thread
From: Shuai Xue @ 2021-10-19 13:33 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

Hi Tony,

> I'm not at all sure that I'm right that the spin could be replaced
> with an msleep(). It will certainly slow things down for systems
> and EINJ operations that actually complete quickly, because instead
> of returning within 100ns (or 100us with your patch) it will sleep
> for 1 ms (rounded up to the next jiffy ... so 4 ms on HZ=250 systems).
>
> But I don't care if my error injections take 4ms.
>
> I do care that one logical CPU spins for 1 second.
Agreed. The side effect of sleeping is to slow down injections that
actually complete quickly, but error injection is not time-critical.

I will send a v2 patch that uses msleep() soon.

Regards.
Shuai



* [PATCH v2] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-15  3:38 [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second Shuai Xue
  2021-10-15 15:37 ` Luck, Tony
@ 2021-10-22 13:44 ` Shuai Xue
  2021-10-22 23:54   ` Luck, Tony
  2021-10-26  7:28 ` [PATCH v3] " Shuai Xue
  2 siblings, 1 reply; 14+ messages in thread
From: Shuai Xue @ 2021-10-22 13:44 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, bp, tony.luck, james.morse, lenb, rjw
  Cc: xueshuai, zhangliguang, zhuo.song

When injecting an error into the platform, the OSPM executes an
EXECUTE_OPERATION action to instruct the platform to begin the injection
operation. The OSPM then busy-waits, repeatedly executing the
CHECK_BUSY_STATUS action until the platform indicates that the operation
is complete. Currently the platform must respond within 1 millisecond,
which is too strict for some platforms.

For example, on an Arm platform, when injecting a Processor Correctable
error, the OSPM will warn:
    Firmware does not respond in time.

And a message is printed on the console:
    echo: write error: Input/output error

We observe that the waiting time for DDR error injection is about 10 ms
and that for PCIe error injection is about 500 ms on the Arm platform.

In this patch, we relax the response timeout to 1 second and allow the
user to pass the timeout value as an argument.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
Changelog v1 -> v2:
- Implemented the timeout in msleep instead of udelay.
- Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
---
 drivers/acpi/apei/einj.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 133156759551..e411eb30e0ee 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -28,9 +28,9 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "EINJ: " fmt
 
-#define SPIN_UNIT		100			/* 100ns */
-/* Firmware should respond within 1 milliseconds */
-#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
+#define SLEEP_UNIT		1			/* 1ms */
+/* Firmware should respond within 1 seconds */
+#define FIRMWARE_TIMEOUT	(1 * MSEC_PER_SEC)
 #define ACPI5_VENDOR_BIT	BIT(31)
 #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
 				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
@@ -40,6 +40,8 @@
  * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
  */
 static int acpi5;
+static int timeout_default = FIRMWARE_TIMEOUT;
+module_param(timeout_default, int, 0644);
 
 struct set_error_type_with_address {
 	u32	type;
@@ -171,12 +173,12 @@ static int einj_get_available_error_type(u32 *type)
 
 static int einj_timedout(u64 *t)
 {
-	if ((s64)*t < SPIN_UNIT) {
+	if ((s64)*t < SLEEP_UNIT) {
 		pr_warn(FW_WARN "Firmware does not respond in time\n");
 		return 1;
 	}
-	*t -= SPIN_UNIT;
-	ndelay(SPIN_UNIT);
+	*t -= SLEEP_UNIT;
+	msleep(SLEEP_UNIT);
 	touch_nmi_watchdog();
 	return 0;
 }
@@ -403,7 +405,7 @@ static int __einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
 			       u64 param3, u64 param4)
 {
 	struct apei_exec_context ctx;
-	u64 val, trigger_paddr, timeout = FIRMWARE_TIMEOUT;
+	u64 val, trigger_paddr, timeout = timeout_default;
 	int rc;
 
 	einj_exec_ctx_init(&ctx);
-- 
2.20.1.12.g72788fdb



* Re: [PATCH v2] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-22 13:44 ` [PATCH v2] " Shuai Xue
@ 2021-10-22 23:54   ` Luck, Tony
  2021-10-24  9:10     ` Shuai Xue
  0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2021-10-22 23:54 UTC (permalink / raw)
  To: Shuai Xue
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

On Fri, Oct 22, 2021 at 09:44:24PM +0800, Shuai Xue wrote:
> When injecting an error into the platform, the OSPM executes an
> EXECUTE_OPERATION action to instruct the platform to begin the injection
> operation. The OSPM then busy-waits, repeatedly executing the
> CHECK_BUSY_STATUS action until the platform indicates that the operation
> is complete. Currently the platform must respond within 1 millisecond,
> which is too strict for some platforms.
> 
> For example, on an Arm platform, when injecting a Processor Correctable
> error, the OSPM will warn:
>     Firmware does not respond in time.
> 
> And a message is printed on the console:
>     echo: write error: Input/output error
> 
> We observe that the waiting time for DDR error injection is about 10 ms
> and that for PCIe error injection is about 500 ms on the Arm platform.
> 
> In this patch, we relax the response timeout to 1 second and allow the
> user to pass the timeout value as an argument.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
> Changelog v1 -> v2:
> - Implemented the timeout in msleep instead of udelay.
> - Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
> ---
>  drivers/acpi/apei/einj.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
> index 133156759551..e411eb30e0ee 100644
> --- a/drivers/acpi/apei/einj.c
> +++ b/drivers/acpi/apei/einj.c
> @@ -28,9 +28,9 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) "EINJ: " fmt
>  
> -#define SPIN_UNIT		100			/* 100ns */
> -/* Firmware should respond within 1 milliseconds */
> -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
> +#define SLEEP_UNIT		1			/* 1ms */

I know I pointed you to msleep() ... sorry, I was wrong. For a
1 ms sleep the recommendation is to use usleep_range().

See this write-up in Documentation/timers/timers-howto.rst:

                - Why not msleep for (1ms - 20ms)?
                        Explained originally here:
                                https://lore.kernel.org/r/15327.1186166232@lwn.net

                        msleep(1~20) may not do what the caller intends, and
                        will often sleep longer (~20 ms actual sleep for any
                        value given in the 1~20ms range). In many cases this
                        is not the desired behavior.

To answer the question posed in that document on "What is a good range?"

I don't think injection cares too much about precision here. Maybe go
with

	usleep_range(1000, 5000);
[with #defines for SLEEP_UNIT_MIN, SLEEP_UNIT_MAX instead of those
numbers]

> +/* Firmware should respond within 1 seconds */
> +#define FIRMWARE_TIMEOUT	(1 * MSEC_PER_SEC)
>  #define ACPI5_VENDOR_BIT	BIT(31)
>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
> @@ -40,6 +40,8 @@
>   * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
>   */
>  static int acpi5;
> +static int timeout_default = FIRMWARE_TIMEOUT;
> +module_param(timeout_default, int, 0644);

You've set the default to 1 second. Who would use this parameter?
Do you anticipate systems that take even longer to inject?
A user might set a shorter limit ... but I don't see why they
would want to.

>  
>  struct set_error_type_with_address {
>  	u32	type;
> @@ -171,12 +173,12 @@ static int einj_get_available_error_type(u32 *type)
>  
>  static int einj_timedout(u64 *t)
>  {
> -	if ((s64)*t < SPIN_UNIT) {
> +	if ((s64)*t < SLEEP_UNIT) {
>  		pr_warn(FW_WARN "Firmware does not respond in time\n");
>  		return 1;
>  	}
> -	*t -= SPIN_UNIT;
> -	ndelay(SPIN_UNIT);
> +	*t -= SLEEP_UNIT;
> +	msleep(SLEEP_UNIT);
>  	touch_nmi_watchdog();

Since we are sleeping instead of spinning, maybe we don't need to
touch the nmi watchdog?

>  	return 0;
>  }
> @@ -403,7 +405,7 @@ static int __einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
>  			       u64 param3, u64 param4)
>  {
>  	struct apei_exec_context ctx;
> -	u64 val, trigger_paddr, timeout = FIRMWARE_TIMEOUT;
> +	u64 val, trigger_paddr, timeout = timeout_default;
>  	int rc;
>  
>  	einj_exec_ctx_init(&ctx);
> -- 
> 2.20.1.12.g72788fdb
> 

-Tony
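
The review comments above (usleep_range() with named bounds, no watchdog
touch) could be combined roughly as in the following user-space model. It is
only a sketch under stated assumptions, not the eventual kernel code:
fake_usleep_range() is a stand-in that records the request instead of
sleeping, wait_for_firmware() and its busy_polls argument model a
hypothetical firmware, and the budget is assumed to be counted in
microseconds and charged at SLEEP_UNIT_MIN per poll.

```c
#include <assert.h>
#include <stdint.h>

#define SLEEP_UNIT_MIN		1000	/* 1 ms, in microseconds */
#define SLEEP_UNIT_MAX		5000	/* 5 ms, in microseconds */
#define FIRMWARE_TIMEOUT_US	1000000	/* 1 second, in microseconds */

/* Bookkeeping for the stand-in sleep function below. */
static uint64_t slept_min_us;
static unsigned int sleep_calls;

/* Stand-in for usleep_range(): records the request, does not sleep. */
static void fake_usleep_range(unsigned long min_us, unsigned long max_us)
{
	assert(min_us <= max_us);
	slept_min_us += min_us;
	sleep_calls++;
}

/* Returns 1 once the budget is used up, else sleeps one unit and returns 0. */
static int timedout(uint64_t *t)
{
	if ((int64_t)*t < SLEEP_UNIT_MIN)
		return 1;
	*t -= SLEEP_UNIT_MIN;
	fake_usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
	return 0;
}

/*
 * Poll a hypothetical firmware that reports "busy" for the first
 * busy_polls status checks. Returns 0 on completion, -1 on timeout.
 */
static int wait_for_firmware(unsigned int busy_polls)
{
	uint64_t timeout = FIRMWARE_TIMEOUT_US;
	unsigned int polls = 0;

	slept_min_us = 0;
	sleep_calls = 0;
	for (;;) {
		if (polls++ >= busy_polls)
			break;		/* firmware is no longer busy */
		if (timedout(&timeout))
			return -1;
	}
	return 0;
}
```

One consequence of charging only SLEEP_UNIT_MIN per poll is that the
wall-clock limit is approximate: in this model 1000 sleeps are allowed, and
each may last anywhere between the minimum and the maximum of the range.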


* Re: [PATCH v2] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-22 23:54   ` Luck, Tony
@ 2021-10-24  9:10     ` Shuai Xue
  2021-10-25 12:49       ` Shuai Xue
  0 siblings, 1 reply; 14+ messages in thread
From: Shuai Xue @ 2021-10-24  9:10 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

Hi, Tony,

Thank you for your comments.

> I know I pointed you to msleep() ... sorry, I was wrong. For a
> 1 ms sleep the recommendation is to use usleep_range().
>
> See this write-up in Documentation/timers/timers-howto.rst:
>
>                - Why not msleep for (1ms - 20ms)?
>                        Explained originally here:
>                                https://lore.kernel.org/r/15327.1186166232@lwn.net
>
>                        msleep(1~20) may not do what the caller intends, and
>                        will often sleep longer (~20 ms actual sleep for any
>                        value given in the 1~20ms range). In many cases this
>                        is not the desired behavior.
>
> To answer the question posed in that document on "What is a good range?"
>
> I don't think injection cares too much about precision here. Maybe go
> with
>
>	usleep_range(1000, 5000);
> [with #defines for SLEEP_UNIT_MIN, SLEEP_UNIT_MAX instead of those
> numbers]
Got it. Thank you. I will change it later.


>> +/* Firmware should respond within 1 seconds */
>> +#define FIRMWARE_TIMEOUT	(1 * MSEC_PER_SEC)
>>  #define ACPI5_VENDOR_BIT	BIT(31)
>>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
>> @@ -40,6 +40,8 @@
>>   * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
>>   */
>>  static int acpi5;
>> +static int timeout_default = FIRMWARE_TIMEOUT;
>> +module_param(timeout_default, int, 0644);
>
> You've set the default to 1 second. Who would use this parameter?
> Do you anticipate systems that take even longer to inject?
> A user might set a shorter limit ... but I don't see why they
> would want to.
No, I don't. EINJ provides a hardware error injection mechanism for
developing and debugging firmware code and hardware RAS features. When we
tested on an Arm platform, it could not meet the original timeout limit.
Therefore, we sent this patch to relax the upper bound of the timeout. To
help other platforms that encounter the same problem, we expose the
timeout as a parameter configurable from user space.


>>  struct set_error_type_with_address {
>>  	u32	type;
>> @@ -171,12 +173,12 @@ static int einj_get_available_error_type(u32 *type)
>>
>>  static int einj_timedout(u64 *t)
>>  {
>> -	if ((s64)*t < SPIN_UNIT) {
>> +	if ((s64)*t < SLEEP_UNIT) {
>>  		pr_warn(FW_WARN "Firmware does not respond in time\n");
>>  		return 1;
>>  	}
>> -	*t -= SPIN_UNIT;
>> -	ndelay(SPIN_UNIT);
>> +	*t -= SLEEP_UNIT;
>> +	msleep(SLEEP_UNIT);
>>  	touch_nmi_watchdog();
>
> Since we are sleeping instead of spinning, maybe we don't need to
> touch the nmi watchdog?
Agreed. I will delete it in the next version.

Regards,
Shuai


* Re: [PATCH v2] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-24  9:10     ` Shuai Xue
@ 2021-10-25 12:49       ` Shuai Xue
  2021-10-25 15:59         ` Luck, Tony
  0 siblings, 1 reply; 14+ messages in thread
From: Shuai Xue @ 2021-10-25 12:49 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

Hi, Tony,

>>> +/* Firmware should respond within 1 seconds */
>>> +#define FIRMWARE_TIMEOUT	(1 * MSEC_PER_SEC)
>>>  #define ACPI5_VENDOR_BIT	BIT(31)
>>>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>>>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
>>> @@ -40,6 +40,8 @@
>>>   * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
>>>   */
>>>  static int acpi5;
>>> +static int timeout_default = FIRMWARE_TIMEOUT;
>>> +module_param(timeout_default, int, 0644);
>>
>> You've set the default to 1 second. Who would use this parameter?
>> Do you anticipate systems that take even longer to inject?
>> A user might set a shorter limit ... but I don't see why they
>> would want to.
> No, I don't. EINJ provides a hardware error injection mechanism for
> developing and debugging firmware code and hardware RAS features. When we
> tested on an Arm platform, it could not meet the original timeout limit.
> Therefore, we sent this patch to relax the upper bound of the timeout. To
> help other platforms that encounter the same problem, we expose the
> timeout as a parameter configurable from user space.

What's your opinion about this interface?

Regards,

Shuai.


>>> +++ b/drivers/acpi/apei/einj.c
>>> @@ -28,9 +28,9 @@
>>>  #undef pr_fmt
>>>  #define pr_fmt(fmt) "EINJ: " fmt
>>>  
>>> -#define SPIN_UNIT		100			/* 100ns */
>>> -/* Firmware should respond within 1 milliseconds */
>>> -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
>>> +#define SLEEP_UNIT		1			/* 1ms */
>>
>> I know I pointed you to msleep() ... sorry, I was wrong. For a
>> 1 ms sleep the recommendation is to use usleep_range()
>>
>> See this write-up in Documentation/timers/timers-howto.rst:
>>
>>                 - Why not msleep for (1ms - 20ms)?
>>                         Explained originally here:
>>                                 https://lore.kernel.org/r/15327.1186166232@lwn.net
>>
>>                         msleep(1~20) may not do what the caller intends, and
>>                         will often sleep longer (~20 ms actual sleep for any
>>                         value given in the 1~20ms range). In many cases this
>>                         is not the desired behavior.
>>
>> To answer the question posed in that document on "What is a good range?"
>>
>> I don't think injection cares too much about precision here. Maybe go
>> with
>>
>> 	usleep_range(1000, 5000);
>> [with #defines for SLEEP_UNIT_MIN, SLEEP_UNIT_MAX instead of those
>> numbers]
>>
>>> +/* Firmware should respond within 1 seconds */
>>> +#define FIRMWARE_TIMEOUT	(1 * MSEC_PER_SEC)
>>>  #define ACPI5_VENDOR_BIT	BIT(31)
>>>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>>>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
>>> @@ -40,6 +40,8 @@
>>>   * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action.
>>>   */
>>>  static int acpi5;
>>> +static int timeout_default = FIRMWARE_TIMEOUT;
>>> +module_param(timeout_default, int, 0644);
>>
>> You've set the default to 1 second. Who would use this parameter?
>> Do you anticipate systems that take even longer to inject?
>> A user might set a shorter limit ... but I don't see why they
>> would want to.
>>
>>>  
>>>  struct set_error_type_with_address {
>>>  	u32	type;
>>> @@ -171,12 +173,12 @@ static int einj_get_available_error_type(u32 *type)
>>>  
>>>  static int einj_timedout(u64 *t)
>>>  {
>>> -	if ((s64)*t < SPIN_UNIT) {
>>> +	if ((s64)*t < SLEEP_UNIT) {
>>>  		pr_warn(FW_WARN "Firmware does not respond in time\n");
>>>  		return 1;
>>>  	}
>>> -	*t -= SPIN_UNIT;
>>> -	ndelay(SPIN_UNIT);
>>> +	*t -= SLEEP_UNIT;
>>> +	msleep(SLEEP_UNIT);
>>>  	touch_nmi_watchdog();
>>
>> Since we are sleeping instead of spinning, maybe we don't need to
>> touch the nmi watchdog?
>>
>>>  	return 0;
>>>  }
>>> @@ -403,7 +405,7 @@ static int __einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
>>>  			       u64 param3, u64 param4)
>>>  {
>>>  	struct apei_exec_context ctx;
>>> -	u64 val, trigger_paddr, timeout = FIRMWARE_TIMEOUT;
>>> +	u64 val, trigger_paddr, timeout = timeout_default;
>>>  	int rc;
>>>  
>>>  	einj_exec_ctx_init(&ctx);
>>> -- 
>>> 2.20.1.12.g72788fdb
>>>
>>
>> -Tony
>>
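
As a rough illustration of the timers-howto note quoted above, the sketch
below models a tick-based sleep in userspace C. It assumes the request is
rounded up to whole timer ticks and that one extra tick is added so the sleep
is never short; this is believed to match how msleep() computed its timeout in
kernels of this era, but treat it as an assumption. The HZ values are only
examples, and ms_to_ticks() and tick_sleep_worst_case_ms() are names made up
for this note, not kernel functions.

```c
#include <assert.h>

/* Round a millisecond count up to whole timer ticks at the given HZ. */
static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
	assert(hz > 0);
	return (ms * hz + 999) / 1000;
}

/*
 * Assumed model of a tick-based sleep: the request is rounded up to
 * whole ticks and one extra tick is added. Returns the resulting
 * worst-case sleep length in milliseconds.
 */
static unsigned long tick_sleep_worst_case_ms(unsigned long ms,
					      unsigned long hz)
{
	unsigned long ticks = ms_to_ticks(ms, hz) + 1;

	return ticks * 1000 / hz;
}
```

Under these assumptions, a 1 ms request with HZ=100 can take up to 20 ms,
while with HZ=1000 it is bounded by about 2 ms. That spread is the reason
usleep_range() is suggested for sleeps this short.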


* RE: [PATCH v2] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-25 12:49       ` Shuai Xue
@ 2021-10-25 15:59         ` Luck, Tony
  0 siblings, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2021-10-25 15:59 UTC (permalink / raw)
  To: Shuai Xue
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

>> No, I don't. EINJ provides a hardware error injection mechanism for
>> developing and debugging firmware code and hardware RAS features. When we
>> tested on an Arm platform, it could not meet the original timeout limit, so
>> we sent this patch to relax the upper bound of the timeout. To help other
>> platforms that encounter the same problem, we expose the timeout as a
>> parameter configurable from user space.
>
> What's your opinion about this interface?

I can't see a case where anyone would use it. So it is just useless fluff.

I say drop it from the next rev of the patch.

-Tony


* [PATCH v3] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-15  3:38 [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second Shuai Xue
  2021-10-15 15:37 ` Luck, Tony
  2021-10-22 13:44 ` [PATCH v2] " Shuai Xue
@ 2021-10-26  7:28 ` Shuai Xue
  2021-10-26 17:05   ` Luck, Tony
  2 siblings, 1 reply; 14+ messages in thread
From: Shuai Xue @ 2021-10-26  7:28 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, bp, tony.luck, james.morse, lenb, rjw
  Cc: xueshuai, zhangliguang, zhuo.song

When injecting an error into the platform, the OSPM executes an
EXECUTE_OPERATION action to instruct the platform to begin the injection
operation. And then, the OSPM busy waits for a while by continually
executing CHECK_BUSY_STATUS action until the platform indicates that the
operation is complete. More specifically, the platform is limited to
respond within 1 millisecond right now. This is too strict for some
platforms.

For example, on an Arm platform, when injecting a Processor Correctable error,
the OSPM will warn:
    Firmware does not respond in time.

And a message is printed on the console:
    echo: write error: Input/output error

We observe that the waiting time for DDR error injection is about 10 ms and
that for PCIe error injection is about 500 ms on an Arm platform.

In this patch, we relax the response timeout to 1 second.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
Changelog v2 -> v3:
- Implemented the timeout in usleep_range instead of msleep.
- Dropped command line interface of timeout.
- Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
---
 drivers/acpi/apei/einj.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 133156759551..6e1ff4b62a8f 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -28,9 +28,10 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "EINJ: " fmt
 
-#define SPIN_UNIT		100			/* 100ns */
-/* Firmware should respond within 1 milliseconds */
-#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
+#define SLEEP_UNIT_MIN		1000			/* 1ms */
+#define SLEEP_UNIT_MAX		5000			/* 5ms */
+/* Firmware should respond within 1 seconds */
+#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)
 #define ACPI5_VENDOR_BIT	BIT(31)
 #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
 				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
@@ -171,13 +172,13 @@ static int einj_get_available_error_type(u32 *type)
 
 static int einj_timedout(u64 *t)
 {
-	if ((s64)*t < SPIN_UNIT) {
+	if ((s64)*t < SLEEP_UNIT_MIN) {
 		pr_warn(FW_WARN "Firmware does not respond in time\n");
 		return 1;
 	}
-	*t -= SPIN_UNIT;
-	ndelay(SPIN_UNIT);
-	touch_nmi_watchdog();
+	*t -= SLEEP_UNIT_MIN;
+	usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
+
 	return 0;
 }
 
-- 
2.20.1.12.g72788fdb



* Re: [PATCH v3] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-26  7:28 ` [PATCH v3] " Shuai Xue
@ 2021-10-26 17:05   ` Luck, Tony
  2021-10-27  2:18     ` Shuai Xue
  2021-10-27 18:24     ` Rafael J. Wysocki
  0 siblings, 2 replies; 14+ messages in thread
From: Luck, Tony @ 2021-10-26 17:05 UTC (permalink / raw)
  To: Rafael J. Wysocki, Shuai Xue
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

On Tue, Oct 26, 2021 at 03:28:29PM +0800, Shuai Xue wrote:
> When injecting an error into the platform, the OSPM executes an
> EXECUTE_OPERATION action to instruct the platform to begin the injection
> operation. And then, the OSPM busy waits for a while by continually
> executing CHECK_BUSY_STATUS action until the platform indicates that the
> operation is complete. More specifically, the platform is limited to
> respond within 1 millisecond right now. This is too strict for some
> platforms.
> 
> For example, in Arm platform, when injecting a Processor Correctable error,
> the OSPM will warn:
>     Firmware does not respond in time.
> 
> And a message is printed on the console:
>     echo: write error: Input/output error
> 
> We observe that the waiting time for DDR error injection is about 10 ms and
> that for PCIe error injection is about 500 ms in Arm platform.
> 
> In this patch, we relax the response timeout to 1 second.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

Reviewed-by: Tony Luck <tony.luck@intel.com>

Rafael: Do you want to take this in the acpi tree? If not, I can
apply it to the RAS tree (already at -rc7, so in next merge cycle
after 5.16-rc1 comes out).

> ---
> Changelog v2 -> v3:
> - Implemented the timeout in usleep_range instead of msleep.
> - Dropped command line interface of timeout.
> - Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
> ---
>  drivers/acpi/apei/einj.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
> index 133156759551..6e1ff4b62a8f 100644
> --- a/drivers/acpi/apei/einj.c
> +++ b/drivers/acpi/apei/einj.c
> @@ -28,9 +28,10 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) "EINJ: " fmt
>  
> -#define SPIN_UNIT		100			/* 100ns */
> -/* Firmware should respond within 1 milliseconds */
> -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
> +#define SLEEP_UNIT_MIN		1000			/* 1ms */
> +#define SLEEP_UNIT_MAX		5000			/* 5ms */
> +/* Firmware should respond within 1 seconds */
> +#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)
>  #define ACPI5_VENDOR_BIT	BIT(31)
>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
> @@ -171,13 +172,13 @@ static int einj_get_available_error_type(u32 *type)
>  
>  static int einj_timedout(u64 *t)
>  {
> -	if ((s64)*t < SPIN_UNIT) {
> +	if ((s64)*t < SLEEP_UNIT_MIN) {
>  		pr_warn(FW_WARN "Firmware does not respond in time\n");
>  		return 1;
>  	}
> -	*t -= SPIN_UNIT;
> -	ndelay(SPIN_UNIT);
> -	touch_nmi_watchdog();
> +	*t -= SLEEP_UNIT_MIN;
> +	usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
> +
>  	return 0;
>  }
>  
> -- 
> 2.20.1.12.g72788fdb
> 


* Re: [PATCH v3] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-26 17:05   ` Luck, Tony
@ 2021-10-27  2:18     ` Shuai Xue
  2021-10-27 18:24     ` Rafael J. Wysocki
  1 sibling, 0 replies; 14+ messages in thread
From: Shuai Xue @ 2021-10-27  2:18 UTC (permalink / raw)
  To: Luck, Tony, Rafael J. Wysocki
  Cc: linux-kernel, linux-acpi, bp, james.morse, lenb, rjw,
	zhangliguang, zhuo.song

Hi Tony,

Thank you for your patient review. :)

Cheers,
Shuai

On 2021/10/27 AM1:05, Luck, Tony wrote:
> On Tue, Oct 26, 2021 at 03:28:29PM +0800, Shuai Xue wrote:
>> When injecting an error into the platform, the OSPM executes an
>> EXECUTE_OPERATION action to instruct the platform to begin the injection
>> operation. And then, the OSPM busy waits for a while by continually
>> executing CHECK_BUSY_STATUS action until the platform indicates that the
>> operation is complete. More specifically, the platform is limited to
>> respond within 1 millisecond right now. This is too strict for some
>> platforms.
>>
>> For example, in Arm platform, when injecting a Processor Correctable error,
>> the OSPM will warn:
>>     Firmware does not respond in time.
>>
>> And a message is printed on the console:
>>     echo: write error: Input/output error
>>
>> We observe that the waiting time for DDR error injection is about 10 ms and
>> that for PCIe error injection is about 500 ms in Arm platform.
>>
>> In this patch, we relax the response timeout to 1 second.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> 
> Rafael: Do you want to take this in the acpi tree? If not, I can
> apply it to the RAS tree (already at -rc7, so in next merge cycle
> after 5.16-rc1 comes out).
> 
>> ---
>> Changelog v2 -> v3:
>> - Implemented the timeout in usleep_range instead of msleep.
>> - Dropped command line interface of timeout.
>> - Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
>> ---
>>  drivers/acpi/apei/einj.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
>> index 133156759551..6e1ff4b62a8f 100644
>> --- a/drivers/acpi/apei/einj.c
>> +++ b/drivers/acpi/apei/einj.c
>> @@ -28,9 +28,10 @@
>>  #undef pr_fmt
>>  #define pr_fmt(fmt) "EINJ: " fmt
>>  
>> -#define SPIN_UNIT		100			/* 100ns */
>> -/* Firmware should respond within 1 milliseconds */
>> -#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
>> +#define SLEEP_UNIT_MIN		1000			/* 1ms */
>> +#define SLEEP_UNIT_MAX		5000			/* 5ms */
>> +/* Firmware should respond within 1 seconds */
>> +#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)
>>  #define ACPI5_VENDOR_BIT	BIT(31)
>>  #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
>>  				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
>> @@ -171,13 +172,13 @@ static int einj_get_available_error_type(u32 *type)
>>  
>>  static int einj_timedout(u64 *t)
>>  {
>> -	if ((s64)*t < SPIN_UNIT) {
>> +	if ((s64)*t < SLEEP_UNIT_MIN) {
>>  		pr_warn(FW_WARN "Firmware does not respond in time\n");
>>  		return 1;
>>  	}
>> -	*t -= SPIN_UNIT;
>> -	ndelay(SPIN_UNIT);
>> -	touch_nmi_watchdog();
>> +	*t -= SLEEP_UNIT_MIN;
>> +	usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
>> +
>>  	return 0;
>>  }
>>  
>> -- 
>> 2.20.1.12.g72788fdb
>>


* Re: [PATCH v3] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.
  2021-10-26 17:05   ` Luck, Tony
  2021-10-27  2:18     ` Shuai Xue
@ 2021-10-27 18:24     ` Rafael J. Wysocki
  1 sibling, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2021-10-27 18:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Rafael J. Wysocki, Shuai Xue, Linux Kernel Mailing List,
	ACPI Devel Maling List, Borislav Petkov, James Morse, Len Brown,
	Rafael J. Wysocki, luanshi, zhuo.song

On Tue, Oct 26, 2021 at 7:05 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Tue, Oct 26, 2021 at 03:28:29PM +0800, Shuai Xue wrote:
> > When injecting an error into the platform, the OSPM executes an
> > EXECUTE_OPERATION action to instruct the platform to begin the injection
> > operation. And then, the OSPM busy waits for a while by continually
> > executing CHECK_BUSY_STATUS action until the platform indicates that the
> > operation is complete. More specifically, the platform is limited to
> > respond within 1 millisecond right now. This is too strict for some
> > platforms.
> >
> > For example, in Arm platform, when injecting a Processor Correctable error,
> > the OSPM will warn:
> >     Firmware does not respond in time.
> >
> > And a message is printed on the console:
> >     echo: write error: Input/output error
> >
> > We observe that the waiting time for DDR error injection is about 10 ms and
> > that for PCIe error injection is about 500 ms in Arm platform.
> >
> > In this patch, we relax the response timeout to 1 second.
> >
> > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
>
> Rafael: Do you want to take this in the acpi tree? If not, I can
> apply it to the RAS tree (already at -rc7, so in next merge cycle
> after 5.16-rc1 comes out).

I'll queue it up for 5.16.

Thanks!

> > ---
> > Changelog v2 -> v3:
> > - Implemented the timeout in usleep_range instead of msleep.
> > - Dropped command line interface of timeout.
> > - Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
> > ---
> >  drivers/acpi/apei/einj.c | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
> > index 133156759551..6e1ff4b62a8f 100644
> > --- a/drivers/acpi/apei/einj.c
> > +++ b/drivers/acpi/apei/einj.c
> > @@ -28,9 +28,10 @@
> >  #undef pr_fmt
> >  #define pr_fmt(fmt) "EINJ: " fmt
> >
> > -#define SPIN_UNIT            100                     /* 100ns */
> > -/* Firmware should respond within 1 milliseconds */
> > -#define FIRMWARE_TIMEOUT     (1 * NSEC_PER_MSEC)
> > +#define SLEEP_UNIT_MIN               1000                    /* 1ms */
> > +#define SLEEP_UNIT_MAX               5000                    /* 5ms */
> > +/* Firmware should respond within 1 seconds */
> > +#define FIRMWARE_TIMEOUT     (1 * USEC_PER_SEC)
> >  #define ACPI5_VENDOR_BIT     BIT(31)
> >  #define MEM_ERROR_MASK               (ACPI_EINJ_MEMORY_CORRECTABLE | \
> >                               ACPI_EINJ_MEMORY_UNCORRECTABLE | \
> > @@ -171,13 +172,13 @@ static int einj_get_available_error_type(u32 *type)
> >
> >  static int einj_timedout(u64 *t)
> >  {
> > -     if ((s64)*t < SPIN_UNIT) {
> > +     if ((s64)*t < SLEEP_UNIT_MIN) {
> >               pr_warn(FW_WARN "Firmware does not respond in time\n");
> >               return 1;
> >       }
> > -     *t -= SPIN_UNIT;
> > -     ndelay(SPIN_UNIT);
> > -     touch_nmi_watchdog();
> > +     *t -= SLEEP_UNIT_MIN;
> > +     usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
> > +
> >       return 0;
> >  }
> >
> > --
> > 2.20.1.12.g72788fdb
> >


Thread overview: 14+ messages
2021-10-15  3:38 [PATCH] ACPI, APEI, EINJ: Relax platform response timeout to 1 second Shuai Xue
2021-10-15 15:37 ` Luck, Tony
2021-10-17  4:06   ` Shuai Xue
2021-10-18 15:40     ` Luck, Tony
2021-10-19 13:33       ` Shuai Xue
2021-10-22 13:44 ` [PATCH v2] " Shuai Xue
2021-10-22 23:54   ` Luck, Tony
2021-10-24  9:10     ` Shuai Xue
2021-10-25 12:49       ` Shuai Xue
2021-10-25 15:59         ` Luck, Tony
2021-10-26  7:28 ` [PATCH v3] " Shuai Xue
2021-10-26 17:05   ` Luck, Tony
2021-10-27  2:18     ` Shuai Xue
2021-10-27 18:24     ` Rafael J. Wysocki
