All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
@ 2021-04-28  9:27 Alexey Bogoslavsky
  2021-04-28 12:42 ` hch
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Alexey Bogoslavsky @ 2021-04-28  9:27 UTC (permalink / raw)
  To: linux-nvme; +Cc: hch, kbusch, axboe, sagi

From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>

The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.

Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
---
 drivers/nvme/host/core.c | 89 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 78 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2f45e8fcdd7c..9768d2e84562 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -57,6 +57,26 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
 
+static unsigned long apst_primary_timeout_ms = 100;
+module_param(apst_primary_timeout_ms, ulong, 0644);
+MODULE_PARM_DESC(apst_primary_timeout_ms,
+	"primary APST timeout in ms");
+
+static unsigned long apst_secondary_timeout_ms = 2000;
+module_param(apst_secondary_timeout_ms, ulong, 0644);
+MODULE_PARM_DESC(apst_secondary_timeout_ms,
+	"secondary APST timeout in ms");
+
+static unsigned long apst_primary_latency_tol_us = 15000;
+module_param(apst_primary_latency_tol_us, ulong, 0644);
+MODULE_PARM_DESC(apst_primary_latency_tol_us,
+	"primary APST latency tolerance in us");
+
+static unsigned long apst_secondary_latency_tol_us = 100000;
+module_param(apst_secondary_latency_tol_us, ulong, 0644);
+MODULE_PARM_DESC(apst_secondary_latency_tol_us,
+	"secondary APST latency tolerance in us");
+
 static bool streams;
 module_param(streams, bool, 0644);
 MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
@@ -2185,14 +2205,54 @@ static int nvme_configure_acre(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
+/*
+ * The function checks whether the given total (exlat + enlat) latency of
+ * a power state allows the latter to be used as an APST transition target.
+ * It does so by comparing the latency to the primary and secondary latency
+ * tolerances defined by module params. If there's a match, the corresponding
+ * timeout value is returned and the matching tolerance index (1 or 2) is
+ * reported.
+ */
+static bool nvme_apst_get_transition_time(u64 total_latency,
+		u64 *transition_time, unsigned *last_index)
+{
+	if (total_latency <= apst_primary_latency_tol_us) {
+		if (*last_index == 1)
+			return false;
+		*last_index = 1;
+		*transition_time = apst_primary_timeout_ms;
+		return true;
+	}
+	if (apst_secondary_timeout_ms &&
+		total_latency <= apst_secondary_latency_tol_us) {
+		if (*last_index <= 2)
+			return false;
+		*last_index = 2;
+		*transition_time = apst_secondary_timeout_ms;
+		return true;
+	}
+	return false;
+}
+
 /*
  * APST (Autonomous Power State Transition) lets us program a table of power
  * state transitions that the controller will perform automatically.
- * We configure it with a simple heuristic: we are willing to spend at most 2%
- * of the time transitioning between power states.  Therefore, when running in
- * any given state, we will enter the next lower-power non-operational state
- * after waiting 50 * (enlat + exlat) microseconds, as long as that state's exit
- * latency is under the requested maximum latency.
+ *
+ * Depending on module params, one of the two supported techniques will be used:
+ *
+ * - If the parameters provide explicit timeouts and tolerances, they will be
+ *   used to build a table with up to 2 non-operational states to transition to.
+ *   The default parameter values were selected based on the values used by
+ *   Microsoft's and Intel's NVMe drivers. Yet, since we don't implement dynamic
+ *   regeneration of the APST table in the event of switching between external
+ *   and battery power, the timeouts and tolerances reflect a compromise
+ *   between values used by Microsoft for AC and battery scenarios.
+ * - If not, we'll configure the table with a simple heuristic: we are willing
+ *   to spend at most 2% of the time transitioning between power states.
+ *   Therefore, when running in any given state, we will enter the next
+ *   lower-power non-operational state after waiting 50 * (enlat + exlat)
+ *   microseconds, as long as that state's exit latency is under the requested
+ *   maximum latency.
  *
  * We will not autonomously enter any non-operational state for which the total
  * latency exceeds ps_max_latency_us.
@@ -2208,6 +2268,7 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
 	int max_ps = -1;
 	int state;
 	int ret;
+	unsigned last_lt_index = UINT_MAX;
 
 	/*
 	 * If APST isn't supported or if we haven't been initialized yet,
@@ -2266,13 +2327,19 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
 			le32_to_cpu(ctrl->psd[state].entry_lat);
 
 		/*
-		 * This state is good.  Use it as the APST idle target for
-		 * higher power states.
+		 * This state is good. It can be used as the APST idle target
+		 * for higher power states.
 		 */
-		transition_ms = total_latency_us + 19;
-		do_div(transition_ms, 20);
-		if (transition_ms > (1 << 24) - 1)
-			transition_ms = (1 << 24) - 1;
+		if (apst_primary_timeout_ms && apst_primary_latency_tol_us) {
+			if (!nvme_apst_get_transition_time(total_latency_us,
+					&transition_ms, &last_lt_index))
+				continue;
+		} else {
+			transition_ms = total_latency_us + 19;
+			do_div(transition_ms, 20);
+			if (transition_ms > (1 << 24) - 1)
+				transition_ms = (1 << 24) - 1;
+		}
 
 		target = cpu_to_le64((state << 3) | (transition_ms << 8));
 		if (max_ps == -1)
-- 
2.17.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
@ 2021-04-28 12:42 ` hch
  2021-04-28 14:44   ` Andy Lutomirski
  2021-05-19  7:00 ` hch
  2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
  2 siblings, 1 reply; 15+ messages in thread
From: hch @ 2021-04-28 12:42 UTC (permalink / raw)
  To: Alexey Bogoslavsky; +Cc: linux-nvme, hch, kbusch, axboe, sagi, Andy Lutomirski

Adding Andy who wrote the original APST code.

On Wed, Apr 28, 2021 at 09:27:36AM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> 
> The algorithm that was used until now for building the APST configuration
> table has been found to produce entries with excessively long ITPT
> (idle time prior to transition) for devices declaring relatively long
> entry and exit latencies for non-operational power states. This leads
> to unnecessary waste of power and, as a result, failure to pass
> mandatory power consumption tests on Chromebook platforms.
> 
> The new algorithm is based on two predefined ITPT values and two
> predefined latency tolerances. Based on these values, as well as on
> exit and entry latencies reported by the device, the algorithm looks
> for up to 2 suitable non-operational power states to use as primary
> and secondary APST transition targets. The predefined values are
> supplied to the nvme driver as module parameters:
> 
>  - apst_primary_timeout_ms (default: 100)
>  - apst_secondary_timeout_ms (default: 2000)
>  - apst_primary_latency_tol_us (default: 15000)
>  - apst_secondary_latency_tol_us (default: 100000)
> 
> The algorithm echoes the approach used by Intel's and Microsoft's drivers
> on Windows. The specific default parameter values are also based on those
> drivers. Yet, this patch doesn't introduce the ability to dynamically
> regenerate the APST table in the event of switching the power source from
> AC to battery and back. Adding this functionality may be considered in the
> future. In the meantime, the timeouts and tolerances reflect a compromise
> between values used by Microsoft for AC and battery scenarios.
> 
> In most NVMe devices the new algorithm causes them to implement a more
> aggressive power saving policy. While beneficial in most cases, this
> sometimes comes at the price of a higher IO processing latency in certain
> scenarios as well as at the price of a potential impact on the drive's
> endurance (due to more frequent context saving when entering deep non-
> operational states). So in order to provide a fallback for systems where
> these regressions cannot be tolerated, the patch allows to revert to
> the legacy behavior by setting either apst_primary_timeout_ms or
> apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
> fine tuning the default values of the module parameters) the legacy behavior
> can be removed.
> 
> TESTING.
> 
> The new algorithm has been extensively tested. Initially, simulations were
> used to compare APST tables generated by old and new algorithms for a wide
> range of devices. After that, power consumption, performance and latencies
> were measured under different workloads on devices from multiple vendors
> (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
> and the findings.
> 
> General observations.
> The effect the patch has on the APST table varies depending on the entry and
> exit latencies advertised by the devices. For some devices, the effect is
> negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
> transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
> the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
> SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
> the initial transition happens after a longer idle time, but takes the device
> to a lower power state.
> 
> Workflows.
> In order to evaluate the patch's effect on the power consumption and latency,
> 7 workflows were used for each device. The workflows were designed to test
> the scenarios where significant differences between the old and new behaviors
> are most likely. Each workflow was tested twice: with the new and with the
> old APST table generation implementation. Power consumption, performance and
> latency were measured in the process. The following workflows were used:
> 1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
> 2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
>    idle time
> 3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
>    idle time
> 4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
>    idle time
> 5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
>    idle time
> 6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
>    idle time
> 7) Repeated pattern of a single random read of a 4K packet followed by 150ms
>    idle time
> 
> Power consumption
> Actual power consumption measurements produced predictable results in
> accordance with the APST mechanism's theory of operation.
> Devices with long entry and exit latencies such as WD SN530 showed huge
> improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
> KBG40ZNS where the resulting APST table looks virtually identical with
> both legacy and new algorithms, showed little or no change in the average power
> consumption on all workflows. Devices with extra short latencies such as
> Samsung PM991 showed moderate increase in power consumption of up to 18% in
> worst case scenarios.
> In addition, on Intel and Samsung devices a more complex impact was observed
> on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
> non-operational states between the writes the devices start performing background
> operations leading to an increase of power consumption. With the old APST tables
> part of these operations are delayed until the scenario is over and a longer idle
> period begins, but eventually this extra power is consumed anyway.
> 
> Performance.
> In terms of performance measured on sustained write or read scenarios, the
> effect of the patch is minimal as in this case the device doesn't enter low power
> states.
> 
> Latency
> As expected, in devices where the patch causes a more aggressive power saving
> policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
> certain scenarios. Workflow number 7, specifically designed to simulate the
> worst case scenario as far as latency is concerned, indeed shows a sharp
> increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
> WD SN530). The latency increase on other workloads and other devices is much
> milder or non-existent.
> 
> Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
> ---
>  drivers/nvme/host/core.c | 89 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 78 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2f45e8fcdd7c..9768d2e84562 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -57,6 +57,26 @@ static bool force_apst;
>  module_param(force_apst, bool, 0644);
>  MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>  
> +static unsigned long apst_primary_timeout_ms = 100;
> +module_param(apst_primary_timeout_ms, ulong, 0644);
> +MODULE_PARM_DESC(apst_primary_timeout_ms,
> +	"primary APST timeout in ms");
> +
> +static unsigned long apst_secondary_timeout_ms = 2000;
> +module_param(apst_secondary_timeout_ms, ulong, 0644);
> +MODULE_PARM_DESC(apst_secondary_timeout_ms,
> +	"secondary APST timeout in ms");
> +
> +static unsigned long apst_primary_latency_tol_us = 15000;
> +module_param(apst_primary_latency_tol_us, ulong, 0644);
> +MODULE_PARM_DESC(apst_primary_latency_tol_us,
> +	"primary APST latency tolerance in us");
> +
> +static unsigned long apst_secondary_latency_tol_us = 100000;
> +module_param(apst_secondary_latency_tol_us, ulong, 0644);
> +MODULE_PARM_DESC(apst_secondary_latency_tol_us,
> +	"secondary APST latency tolerance in us");
> +
>  static bool streams;
>  module_param(streams, bool, 0644);
>  MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
> @@ -2185,14 +2205,54 @@ static int nvme_configure_acre(struct nvme_ctrl *ctrl)
>  	return ret;
>  }
>  
> +/*
> + * The function checks whether the given total (exlat + enlat) latency of
> + * a power state allows the latter to be used as an APST transition target.
> + * It does so by comparing the latency to the primary and secondary latency
> + * tolerances defined by module params. If there's a match, the corresponding
> + * timeout value is returned and the matching tolerance index (1 or 2) is
> + * reported.
> + */
> +static bool nvme_apst_get_transition_time(u64 total_latency,
> +		u64 *transition_time, unsigned *last_index)
> +{
> +	if (total_latency <= apst_primary_latency_tol_us) {
> +		if (*last_index == 1)
> +			return false;
> +		*last_index = 1;
> +		*transition_time = apst_primary_timeout_ms;
> +		return true;
> +	}
> +	if (apst_secondary_timeout_ms &&
> +		total_latency <= apst_secondary_latency_tol_us) {
> +		if (*last_index <= 2)
> +			return false;
> +		*last_index = 2;
> +		*transition_time = apst_secondary_timeout_ms;
> +		return true;
> +	}
> +	return false;
> +}
> +
>  /*
>   * APST (Autonomous Power State Transition) lets us program a table of power
>   * state transitions that the controller will perform automatically.
> - * We configure it with a simple heuristic: we are willing to spend at most 2%
> - * of the time transitioning between power states.  Therefore, when running in
> - * any given state, we will enter the next lower-power non-operational state
> - * after waiting 50 * (enlat + exlat) microseconds, as long as that state's exit
> - * latency is under the requested maximum latency.
> + *
> + * Depending on module params, one of the two supported techniques will be used:
> + *
> + * - If the parameters provide explicit timeouts and tolerances, they will be
> + *   used to build a table with up to 2 non-operational states to transition to.
> + *   The default parameter values were selected based on the values used by
> + *   Microsoft's and Intel's NVMe drivers. Yet, since we don't implement dynamic
> + *   regeneration of the APST table in the event of switching between external
> + *   and battery power, the timeouts and tolerances reflect a compromise
> + *   between values used by Microsoft for AC and battery scenarios.
> + * - If not, we'll configure the table with a simple heuristic: we are willing
> + *   to spend at most 2% of the time transitioning between power states.
> + *   Therefore, when running in any given state, we will enter the next
> + *   lower-power non-operational state after waiting 50 * (enlat + exlat)
> + *   microseconds, as long as that state's exit latency is under the requested
> + *   maximum latency.
>   *
>   * We will not autonomously enter any non-operational state for which the total
>   * latency exceeds ps_max_latency_us.
> @@ -2208,6 +2268,7 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
>  	int max_ps = -1;
>  	int state;
>  	int ret;
> +	unsigned last_lt_index = UINT_MAX;
>  
>  	/*
>  	 * If APST isn't supported or if we haven't been initialized yet,
> @@ -2266,13 +2327,19 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl)
>  			le32_to_cpu(ctrl->psd[state].entry_lat);
>  
>  		/*
> -		 * This state is good.  Use it as the APST idle target for
> -		 * higher power states.
> +		 * This state is good. It can be used as the APST idle target
> +		 * for higher power states.
>  		 */
> -		transition_ms = total_latency_us + 19;
> -		do_div(transition_ms, 20);
> -		if (transition_ms > (1 << 24) - 1)
> -			transition_ms = (1 << 24) - 1;
> +		if (apst_primary_timeout_ms && apst_primary_latency_tol_us) {
> +			if (!nvme_apst_get_transition_time(total_latency_us,
> +					&transition_ms, &last_lt_index))
> +				continue;
> +		} else {
> +			transition_ms = total_latency_us + 19;
> +			do_div(transition_ms, 20);
> +			if (transition_ms > (1 << 24) - 1)
> +				transition_ms = (1 << 24) - 1;
> +		}
>  
>  		target = cpu_to_le64((state << 3) | (transition_ms << 8));
>  		if (max_ps == -1)
> -- 
> 2.17.1
---end quoted text---

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28 12:42 ` hch
@ 2021-04-28 14:44   ` Andy Lutomirski
  2021-04-28 15:45     ` Alexey Bogoslavsky
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2021-04-28 14:44 UTC (permalink / raw)
  To: hch; +Cc: Alexey Bogoslavsky, linux-nvme, kbusch, axboe, sagi, Andy Lutomirski

On Wed, Apr 28, 2021 at 5:43 AM hch@lst.de <hch@lst.de> wrote:
>
> Adding Andy who wrote the original APST code.
>
> On Wed, Apr 28, 2021 at 09:27:36AM +0000, Alexey Bogoslavsky wrote:
> > From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >
> > The algorithm that was used until now for building the APST configuration
> > table has been found to produce entries with excessively long ITPT
> > (idle time prior to transition) for devices declaring relatively long
> > entry and exit latencies for non-operational power states. This leads
> > to unnecessary waste of power and, as a result, failure to pass
> > mandatory power consumption tests on Chromebook platforms.
> >
> > The new algorithm is based on two predefined ITPT values and two
> > predefined latency tolerances. Based on these values, as well as on
> > exit and entry latencies reported by the device, the algorithm looks
> > for up to 2 suitable non-operational power states to use as primary
> > and secondary APST transition targets. The predefined values are
> > supplied to the nvme driver as module parameters:
> >
> >  - apst_primary_timeout_ms (default: 100)
> >  - apst_secondary_timeout_ms (default: 2000)
> >  - apst_primary_latency_tol_us (default: 15000)
> >  - apst_secondary_latency_tol_us (default: 100000)
> >
> > The algorithm echoes the approach used by Intel's and Microsoft's drivers
> > on Windows. The specific default parameter values are also based on those
> > drivers. Yet, this patch doesn't introduce the ability to dynamically
> > regenerate the APST table in the event of switching the power source from
> > AC to battery and back. Adding this functionality may be considered in the
> > future. In the meantime, the timeouts and tolerances reflect a compromise
> > between values used by Microsoft for AC and battery scenarios.
> >
> > In most NVMe devices the new algorithm causes them to implement a more
> > aggressive power saving policy. While beneficial in most cases, this
> > sometimes comes at the price of a higher IO processing latency in certain
> > scenarios as well as at the price of a potential impact on the drive's
> > endurance (due to more frequent context saving when entering deep non-
> > operational states). So in order to provide a fallback for systems where
> > these regressions cannot be tolerated, the patch allows to revert to
> > the legacy behavior by setting either apst_primary_timeout_ms or
> > apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
> > fine tuning the default values of the module parameters) the legacy behavior
> > can be removed.

Can you give an example of the APST states and latencies on a device
for which this is useful?

I'm not opposed to adjusting the algorithm, but I'd like to understand
what we're up against.  If Linux were the only game in town, I would
say that the approach in this patch is unfortunate because of the
arbitrary thresholds it introduces, but if it tracks Windows, then
we're probably okay.

--Andy

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28 14:44   ` Andy Lutomirski
@ 2021-04-28 15:45     ` Alexey Bogoslavsky
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Bogoslavsky @ 2021-04-28 15:45 UTC (permalink / raw)
  To: Andy Lutomirski, hch; +Cc: linux-nvme, kbusch, axboe, sagi

On Wed, Apr 28, 2021 at 5:43 AM hch@lst.de <hch@lst.de> wrote:
>
> Adding Andy who wrote the original APST code.
>
> On Wed, Apr 28, 2021 at 09:27:36AM +0000, Alexey Bogoslavsky wrote:
> > From: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >
> > The algorithm that was used until now for building the APST configuration
> > table has been found to produce entries with excessively long ITPT
> > (idle time prior to transition) for devices declaring relatively long
> > entry and exit latencies for non-operational power states. This leads
> > to unnecessary waste of power and, as a result, failure to pass
> > mandatory power consumption tests on Chromebook platforms.
> >
> > The new algorithm is based on two predefined ITPT values and two
> > predefined latency tolerances. Based on these values, as well as on
> > exit and entry latencies reported by the device, the algorithm looks
> > for up to 2 suitable non-operational power states to use as primary
> > and secondary APST transition targets. The predefined values are
> > supplied to the nvme driver as module parameters:
> >
> >  - apst_primary_timeout_ms (default: 100)
> >  - apst_secondary_timeout_ms (default: 2000)
> >  - apst_primary_latency_tol_us (default: 15000)
> >  - apst_secondary_latency_tol_us (default: 100000)
> >
> > The algorithm echoes the approach used by Intel's and Microsoft's drivers
> > on Windows. The specific default parameter values are also based on those
> > drivers. Yet, this patch doesn't introduce the ability to dynamically
> > regenerate the APST table in the event of switching the power source from
> > AC to battery and back. Adding this functionality may be considered in the
> > future. In the meantime, the timeouts and tolerances reflect a compromise
> > between values used by Microsoft for AC and battery scenarios.
> >
> > In most NVMe devices the new algorithm causes them to implement a more
> > aggressive power saving policy. While beneficial in most cases, this
> > sometimes comes at the price of a higher IO processing latency in certain
> > scenarios as well as at the price of a potential impact on the drive's
> > endurance (due to more frequent context saving when entering deep non-
> > operational states). So in order to provide a fallback for systems where
> > these regressions cannot be tolerated, the patch allows to revert to
> > the legacy behavior by setting either apst_primary_timeout_ms or
> > apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
> > fine tuning the default values of the module parameters) the legacy behavior
> > can be removed.

>  Can you give an example of the APST states and latencies on a device
>  for which this is useful?

Sure. Originally, we faced this problem with WD SN530 device that failed to pass
Google's power consumption tests. The device reports the following latencies:
PS3: entry: 3900, exit: 11000 (translates to 745ms ITPT)
PS4: entry: 5000, exit: 39000 (translates to 2200ms ITPT)

Then we started looking at other devices and found more with a similar problem,
e.g. Crucial P5:
PS3: entry: 10000, exit: 2500 (translates to 625ms ITPT)
PS4: entry: 12000, exit: 35000 (translates to 2350ms ITPT)

>  I'm not opposed to adjusting the algorithm, but I'd like to understand
>  what we're up against.  If Linux were the only game in town, I would
>  say that the approach in this patch is unfortunate because of the
>  arbitrary thresholds it introduces, but if it tracks Windows, then
>  we're probably okay.
>--Andy

I agree. Using arbitrary numbers would be a very bad idea. But the numbers I'm
suggesting are indeed based on the schemes used on Windows, so they have proved
viable on a huge number of devices.

Regards,
Alexey

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] nvme: extend and modify the APST configuration algorithm
  2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
  2021-04-28 12:42 ` hch
@ 2021-05-19  7:00 ` hch
  2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
  2 siblings, 0 replies; 15+ messages in thread
From: hch @ 2021-05-19  7:00 UTC (permalink / raw)
  To: Alexey Bogoslavsky; +Cc: linux-nvme, hch, kbusch, axboe, sagi, Andy Lutomirski

Thanks,

applied to nvme-5.14.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
  2021-04-28 12:42 ` hch
  2021-05-19  7:00 ` hch
@ 2023-01-16 18:32 ` Alexey Bogoslavsky
  2023-01-17  7:14   ` 'hch@lst.de'
  2023-01-17 15:54   ` Keith Busch
  2 siblings, 2 replies; 15+ messages in thread
From: Alexey Bogoslavsky @ 2023-01-16 18:32 UTC (permalink / raw)
  To: linux-pci; +Cc: bhelgaas, 'hch@lst.de'

From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>

A bug was found in SN730 WD SSD that causes occasional false AER reporting
of correctable errors. While functionally harmless, this causes error
messages to appear in the system log (dmesg) which, in turn, causes
problems in automated platform validation tests. Since the issue can not
be fixed by FW, customers asked for correctable error reporting to be
quirked out in the kernel for this particular device.

The patch was manually verified. It was checked that correctable errors
are still detected but ignored for the target device (SN730), and are both
detected and reported for devices not affected by this quirk.

Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
---
 drivers/pci/pcie/aer.c  | 3 +++
 drivers/pci/quirks.c    | 6 ++++++
 include/linux/pci.h     | 1 +
 include/linux/pci_ids.h | 4 ++++
 4 files changed, 14 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7ee79d7b192..5cc24d28b76d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		goto out;
 	}
 
+	if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
+		return;
+
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 285acc4aaccc..21e4972fb242 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5911,6 +5911,12 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1536, rom_bar_overlap_defect);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1537, rom_bar_overlap_defect);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1538, rom_bar_overlap_defect);
 
+static void wd_ignore_correctable_errors(struct pci_dev *pdev)
+{
+	pdev->ignore_correctable_errors = true;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
+
 #ifdef CONFIG_PCIEASPM
 /*
  * Several Intel DG2 graphics devices advertise that they can only tolerate
diff --git a/include/linux/pci.h b/include/linux/pci.h
index adffd65e84b4..1e002be07255 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -463,6 +463,7 @@ struct pci_dev {
 	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
 	unsigned int	no_vf_scan:1;		/* Don't scan for VFs after IOV enablement */
 	unsigned int	no_command_memory:1;	/* No PCI_COMMAND_MEMORY */
+	unsigned int	ignore_correctable_errors:1;	/* Ignore correctable errors reported */
 	unsigned int	rom_bar_overlap:1;	/* ROM BAR disable broken */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index b362d90eb9b0..b5a08a4cf047 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2299,9 +2299,13 @@
 #define PCI_VENDOR_ID_QUICKNET		0x15e2
 #define PCI_DEVICE_ID_QUICKNET_XJ	0x0500
 
+#define PCI_VENDOR_ID_WESTERN_DIGITAL		0x15B7
+#define PCI_DEVICE_ID_WESTERN_DIGITAL_SN730	0x5006
+
 /*
  * ADDI-DATA GmbH communication cards mailto:info@addi-data.com
  */
+
 #define PCI_VENDOR_ID_ADDIDATA                 0x15B8
 #define PCI_DEVICE_ID_ADDIDATA_APCI7500        0x7000
 #define PCI_DEVICE_ID_ADDIDATA_APCI7420        0x7001
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
@ 2023-01-17  7:14   ` 'hch@lst.de'
  2023-01-17 13:20     ` Alexey Bogoslavsky
  2023-01-17 15:54   ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: 'hch@lst.de' @ 2023-01-17  7:14 UTC (permalink / raw)
  To: Alexey Bogoslavsky; +Cc: linux-pci, bhelgaas, 'hch@lst.de'

On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> 
> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> of correctable errors. While functionally harmless, this causes error
> messages to appear in the system log (dmesg) which, in turn, causes
> problems in automated platform validation tests. Since the issue can not
> be fixed by FW, customers asked for correctable error reporting to be
> quirked out in the kernel for this particular device.
> 
> The patch was manually verified. It was checked that correctable errors
> are still detected but ignored for the target device (SN730), and are both
> detected and reported for devices not affected by this quirk.
> 
> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
> ---
>  drivers/pci/pcie/aer.c  | 3 +++
>  drivers/pci/quirks.c    | 6 ++++++
>  include/linux/pci.h     | 1 +
>  include/linux/pci_ids.h | 4 ++++
>  4 files changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d7ee79d7b192..5cc24d28b76d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  		goto out;
>  	}
>  
> +	if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)

No need for the inner braces.

> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);

Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
instead.

But overall I'm not really sure it's worth adding code just to surpress
a harmless warning.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-17  7:14   ` 'hch@lst.de'
@ 2023-01-17 13:20     ` Alexey Bogoslavsky
  2023-01-17 14:22       ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Bogoslavsky @ 2023-01-17 13:20 UTC (permalink / raw)
  To: 'hch@lst.de'; +Cc: linux-pci, bhelgaas

>From: 'hch@lst.de' <hch@lst.de> 
>Sent: Tuesday, January 17, 2023 9:14 AM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>>
>> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> of correctable errors. While functionally harmless, this causes error
>> messages to appear in the system log (dmesg) which, in turn, causes
>> problems in automated platform validation tests. Since the issue can not
>> be fixed by FW, customers asked for correctable error reporting to be
>> quirked out in the kernel for this particular device.
>>
>> The patch was manually verified. It was checked that correctable errors
>> are still detected but ignored for the target device (SN730), and are both
>> detected and reported for devices not affected by this quirk.
>>
>> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com
>> ---
>>  drivers/pci/pcie/aer.c  | 3 +++
>>  drivers/pci/quirks.c    | 6 ++++++
>>  include/linux/pci.h     | 1 +
>>  include/linux/pci_ids.h | 4 ++++
>>  4 files changed, 14 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index d7ee79d7b192..5cc24d28b76d 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>               goto out;
>>       }
>>
>> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)

>No need for the inner braces.
Will fix

>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);

>Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
instead.
Will fix both, thanks
>But overall I'm not really sure it's worth adding code just to surpress
>a harmless warning.
This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-17 13:20     ` Alexey Bogoslavsky
@ 2023-01-17 14:22       ` Bjorn Helgaas
  2023-01-17 18:06         ` Alexey Bogoslavsky
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2023-01-17 14:22 UTC (permalink / raw)
  To: Alexey Bogoslavsky
  Cc: 'hch@lst.de', linux-pci, bhelgaas, Rajat Khandelwal

[+cc Rajat]

On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
> >From: 'hch@lst.de' <hch@lst.de> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests.

I don't think automated test problems warrant an OS change to suppress
warnings.  Those are internal tests where you can automatically ignore
warnings you don't care about.

> >> Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.

Customer complaints are a little more of an issue.  But let's clarify
what's happening here.  You mention "false AER reporting," and I want
to understand that better.  I guess you mean that SN730 logs a
correctable error and generates an ERR_COR message when no error
actually occurred?

I hesitate to turn off correctable error reporting completely because
that information is useful for monitoring overall system health.  But
there are two things I think we could do:

  - Reformat the correctable error messages so they are obviously
    different from uncorrectable errors and indicate that these have
    been automatically corrected, and

  - Possibly rate-limit them on a per-device basis, e.g., something
    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@linux.intel.com

If we do end up having to turn off reporting completely, I would log a
note that we're doing that so we know we're giving up something and
there may be legitimate errors that we will never see.

> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> >>
> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com

Check the history and use the same format here, i.e.,

  Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>

(Also for the From: line inserted at the top.)

> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
> 
> >No need for the inner braces.
> Will fix
> 
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
> 
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
  2023-01-17  7:14   ` 'hch@lst.de'
@ 2023-01-17 15:54   ` Keith Busch
  2023-01-17 18:15     ` Alexey Bogoslavsky
  1 sibling, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-01-17 15:54 UTC (permalink / raw)
  To: Alexey Bogoslavsky; +Cc: linux-pci, bhelgaas, 'hch@lst.de'

On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> 
> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> of correctable errors. While functionally harmless, this causes error
> messages to appear in the system log (dmesg) which, in turn, causes
> problems in automated platform validation tests. Since the issue can not
> be fixed by FW, customers asked for correctable error reporting to be
> quirked out in the kernel for this particular device.
> 
> The patch was manually verified. It was checked that correctable errors
> are still detected but ignored for the target device (SN730), and are both
> detected and reported for devices not affected by this quirk.

If you're just going to have the kernel ignore these, are you not able
to suppress the ERR_COR message at the source? Have the following
options been tried?

 a. Disabling Correctable Error Reporting Enable in Device Control
    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
 b. Setting AER Correctable Error Mask Register to all 1's

I think it's usually possible for firmware to hardwire these. If the
firmware can't do that, quirking the kernel to always disable reporting
sounds like a better option. If either of the above fail to suppress the
error messages, then I guess having the kernel ignore it is the only
option.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-17 14:22       ` Bjorn Helgaas
@ 2023-01-17 18:06         ` Alexey Bogoslavsky
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Bogoslavsky @ 2023-01-17 18:06 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: 'hch@lst.de', linux-pci, bhelgaas, Rajat Khandelwal

>From: Bjorn Helgaas <helgaas@kernel.org> 
>Sent: Tuesday, January 17, 2023 4:23 PM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: 'hch@lst.de' <hch@lst.de>; linux-pci@vger.kernel.org; bhelgaas@google.com; Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>[+cc Rajat]

>On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
>> >From: 'hch@lst.de' <hch@lst.de>
>> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>> >>
>> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> >> of correctable errors. While functionally harmless, this causes error
>> >> messages to appear in the system log (dmesg) which, in turn, causes
>> >> problems in automated platform validation tests.

>I don't think automated test problems warrant an OS change to suppress
>warnings.  Those are internal tests where you can automatically ignore
>warnings you don't care about.

>> >> Since the issue can not
>> >> be fixed by FW, customers asked for correctable error reporting to be
>> >> quirked out in the kernel for this particular device.

>Customer complaints are a little more of an issue.  But let's clarify
>what's happening here.  You mention "false AER reporting," and I want
>to understand that better.  I guess you mean that SN730 logs a
>correctable error and generates an ERR_COR message when no error
>actually occurred?
Exactly

>I hesitate to turn off correctable error reporting completely because
>that information is useful for monitoring overall system health.  But
>there are two things I think we could do:

>  - Reformat the correctable error messages so they are obviously
>    different from uncorrectable errors and indicate that these have
>    been automatically corrected, and
We've had long and tiresome discussions with the customers using this
specific device in their design trying to convince them to just ignore
the messages, but they wouldn't budge
>  - Possibly rate-limit them on a per-device basis, e.g., something
>    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@linux.intel.com
The thing is on this device correctable PCI errors can never be trusted. So reporting
them in any format or at any rate makes little sense as one can never assume they're real.
Also, the loss of potential real error notifications is something the customers seem
to be OK with, given the pros and cons.

>If we do end up having to turn off reporting completely, I would log a
>note that we're doing that so we know we're giving up something and
>there may be legitimate errors that we will never see.
You mean, print a warning message the first time such error is reported by the device
and ignore all additional ERR_COR messages? This seems like a fine idea, I would
definitely be willing to implement it

>> >> The patch was manually verified. It was checked that correctable errors
>> >> are still detected but ignored for the target device (SN730), and are both
>> >> detected and reported for devices not affected by this quirk.
>> >>
>> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com

>Check the history and use the same format here, i.e.,

>  Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>

>(Also for the From: line inserted at the top.)
Sorry about that, will fix
> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
>
> >No need for the inner braces.
> Will fix
>
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
>
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-17 15:54   ` Keith Busch
@ 2023-01-17 18:15     ` Alexey Bogoslavsky
  2023-04-11 22:15       ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Bogoslavsky @ 2023-01-17 18:15 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, bhelgaas, 'hch@lst.de'


>From: Keith Busch <kbusch@kernel.org> 
>Sent: Tuesday, January 17, 2023 5:55 PM
>To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
>Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
>Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

>On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
>> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
>>
>> A bug was found in SN730 WD SSD that causes occasional false AER reporting
>> of correctable errors. While functionally harmless, this causes error
>> messages to appear in the system log (dmesg) which, in turn, causes
>> problems in automated platform validation tests. Since the issue can not
>> be fixed by FW, customers asked for correctable error reporting to be
>> quirked out in the kernel for this particular device.
>
>> The patch was manually verified. It was checked that correctable errors
>> are still detected but ignored for the target device (SN730), and are both
>> detected and reported for devices not affected by this quirk.

>If you're just going to have the kernel ignore these, are you not able
>to suppress the ERR_COR message at the source? Have the following
>options been tried?

> a. Disabling Correctable Error Reporting Enable in Device Control
>    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> b. Setting AER Correctable Error Mask Register to all 1's

>I think it's usually possible for firmware to hardwire these. If the
I believe these options were discussed but deemed non-viable. I'll double check anyway
>If firmware can't do that, quirking the kernel to always disable reporting
>sounds like a better option. If either of the above fail to suppress the
>error messages, then I guess having the kernel ignore it is the only
>option.
This could probably work. I'll discuss this with our FW team to make sure the issue can
be resolved this way. Thank you

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-01-17 18:15     ` Alexey Bogoslavsky
@ 2023-04-11 22:15       ` Bjorn Helgaas
  2023-04-24 11:27         ` Alexey Bogoslavsky
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2023-04-11 22:15 UTC (permalink / raw)
  To: Alexey Bogoslavsky
  Cc: Keith Busch, linux-pci, Bjorn Helgas, Christoph Hellwig,
	Grant Grundler, Rajat Khandelwal

[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@kernel.org> 
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> 
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
> 
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
> 
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution.  If not, I think your quirk is a reasonable approach
and just needs a litle polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@chromium.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-04-11 22:15       ` Bjorn Helgaas
@ 2023-04-24 11:27         ` Alexey Bogoslavsky
  2023-04-28 22:00           ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Bogoslavsky @ 2023-04-24 11:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, linux-pci, Bjorn Helgas, Christoph Hellwig,
	Grant Grundler, Rajat Khandelwal

Hello Bjorn,

Sorry for not addressing your questions earlier. As you may have heard, WD experienced
a hacking attack which left us with no access to the company e-mail for weeks.
As for the patch, no FW change was an option as the product causing the issue was
basically at the end of life. So, I prepared a workaround that took into account
all the comments from the community.

Yet, at this point it seems like the company has lost interest in promoting this patch altogether.
So we could just drop it. Please let me know if there's anything I need to do to request that
officially.

Thank you,
Alexey

-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org> 
Sent: Wednesday, April 12, 2023 1:15 AM
To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>; linux-pci@vger.kernel.org; Bjorn Helgas <bhelgaas@google.com>; Christoph Hellwig <hch@lst.de>; Grant Grundler <grundler@chromium.org>; Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD

CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.


[+cc Grant, Rajat]

On Tue, Jan 17, 2023 at 06:15:28PM +0000, Alexey Bogoslavsky wrote:
> >From: Keith Busch <kbusch@kernel.org>
> >Sent: Tuesday, January 17, 2023 5:55 PM
> >To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
> >Cc: linux-pci@vger.kernel.org; bhelgaas@google.com; 'hch@lst.de' <hch@lst.de>
> >Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
>
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests. Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.
> >
> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
>
> >If you're just going to have the kernel ignore these, are you not able
> >to suppress the ERR_COR message at the source? Have the following
> >options been tried?
>
> > a. Disabling Correctable Error Reporting Enable in Device Control
> >    Register; i.e. mask out PCI_EXP_DEVCTL_CERE.
> > b. Setting AER Correctable Error Mask Register to all 1's
>
> >I think it's usually possible for firmware to hardwire these. If the
>
> I believe these options were discussed but deemed non-viable. I'll
> double check anyway
>
> >If firmware can't do that, quirking the kernel to always disable
> >reporting sounds like a better option. If either of the above fail
> >to suppress the error messages, then I guess having the kernel
> >ignore it is the only option.
>
> This could probably work. I'll discuss this with our FW team to make
> sure the issue can be resolved this way. Thank you

Any resolution on this FW possibility?

We have patches in progress to rate-limit correctable error messages
and make them KERN_INFO instead of KERN_WARN [1], but I don't think
that's going to be a good enough solution for you because nobody wants
to see even an informational message every 5 seconds if the message is
useless.

If firmware on the device can turn off these errors, that would be the
best solution.  If not, I think your quirk is a reasonable approach
and just needs a litle polishing per the previous comments.

Bjorn

[1] https://lore.kernel.org/r/20230317175109.3859943-1-grundler@chromium.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
  2023-04-24 11:27         ` Alexey Bogoslavsky
@ 2023-04-28 22:00           ` Bjorn Helgaas
  0 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2023-04-28 22:00 UTC (permalink / raw)
  To: Alexey Bogoslavsky
  Cc: Keith Busch, linux-pci, Bjorn Helgas, Christoph Hellwig,
	Grant Grundler, Rajat Khandelwal

On Mon, Apr 24, 2023 at 11:27:43AM +0000, Alexey Bogoslavsky wrote:
> Hello Bjorn,
> 
> Sorry for not addressing your questions earlier. As you may have
> heard, WD experienced a hacking attack which left us with no access
> to the company e-mail for weeks.

So sorry to hear that, that is incredibly disruptive.

> As for the patch, no FW change was an option as the product causing
> the issue was basically at the end of life. So, I prepared a
> workaround that took into account all the comments from the
> community.

Makes sense and is a very common situation.  In general we try to
avoid "fixing" issues by requiring a firmware update because even if
such an update is available, most users will not have it.

I think the hope was that a kernel quirk could tweak something on the
device to make it stop reporting these errors.

> Yet, at this point it seems like the company has lost interest in
> promoting this patch altogether.  So we could just drop it. Please
> let me know if there's anything I need to do to request that
> officially.

No worries, you don't need to do anything.  If you can point me to any
users or bug reports, maybe I can tidy this up and merge it.  Those
users are still being annoyed by the error spam, and that seems
pointless.

Bjorn

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-04-28 22:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-28  9:27 [PATCH 1/1] nvme: extend and modify the APST configuration algorithm Alexey Bogoslavsky
2021-04-28 12:42 ` hch
2021-04-28 14:44   ` Andy Lutomirski
2021-04-28 15:45     ` Alexey Bogoslavsky
2021-05-19  7:00 ` hch
2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
2023-01-17  7:14   ` 'hch@lst.de'
2023-01-17 13:20     ` Alexey Bogoslavsky
2023-01-17 14:22       ` Bjorn Helgaas
2023-01-17 18:06         ` Alexey Bogoslavsky
2023-01-17 15:54   ` Keith Busch
2023-01-17 18:15     ` Alexey Bogoslavsky
2023-04-11 22:15       ` Bjorn Helgaas
2023-04-24 11:27         ` Alexey Bogoslavsky
2023-04-28 22:00           ` Bjorn Helgaas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.