openbmc.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds
@ 2020-08-27 13:29 Andrew Jeffery
  2020-08-27 13:30 ` [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG Andrew Jeffery
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Andrew Jeffery @ 2020-08-27 13:29 UTC (permalink / raw)
  To: joel; +Cc: openbmc

Hello,

This series works around reliability problems with the MAX31785 fan controller
as observed in the field on some POWER systems.

I'm the first to admit the patches are not elegant, so feedback there is
appreciated.

Separately, our previous workarounds have run aground upstream as the hwmon
maintainer was unable to reproduce our observations on the MAX31785 evaluation
kit. I've recently received an evaluation kit, so I plan on putting some of
these issues to the test myself. Ultimately this will help determine whether we
have issues with our fan card designs or whether the controller itself is at
fault (I have to admit, given some of the failures, it's hard to see how the
controller might not be at fault). Basically, this paragraph is my excuse for
not pushing these patches further upstream for the moment; I will re-evaluate
once I have the results from testing against the evkit.

In the mean time, these patches resolve issues we've seen in some system
deployments. Taken together, I've put the driver through an unbind/bind loop
of over 20,000 iterations with no "loss" of fans, where prior to the series we
typically achieved only a few hundred. This feels like a significant
improvement, so please consider merging despite their ugliness.

Cheers,

Andrew

Andrew Jeffery (3):
  pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
  pmbus: (max31785) Add a local pmbus_set_page() implementation
  pmbus: (core) Add a one-shot retry in pmbus_set_page()

 drivers/hwmon/pmbus/max31785.c   | 55 ++++++++++++++++++++++++++------
 drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++-------
 2 files changed, 66 insertions(+), 22 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
  2020-08-27 13:29 [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Andrew Jeffery
@ 2020-08-27 13:30 ` Andrew Jeffery
  2020-09-01  6:17   ` Joel Stanley
  2020-08-27 13:30 ` [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation Andrew Jeffery
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Andrew Jeffery @ 2020-08-27 13:30 UTC (permalink / raw)
  To: joel; +Cc: openbmc

It has been observed across large fleets of systems that a small subset
of those systems occasionally loose control of some number of fans
across a BMC reboot (their hwmon fan attributes are missing from sysfs).

From extensive testing and tracing it was discovered that writes
enabling a fan in FAN_CONFIG_1_2 failed to stick on the system under
test with a frequency of about 1 in 1000 re-binds of the driver.

The MAX31785 datasheet recommends in the documentation for
MFR_FAN_CONFIG that the asssociated fan(s) be disabled before updating
the register. The sequence in question implements this suggestion, and
the observed loss-of-fans symptom occurs when the write to re-enable the
fan in FAN_CONFIG_1_2 fails to stick.

The trace data suggests a one-shot retry is enough to successfully
update FAN_CONFIG_1_2. With the workaround, no loss of fans was observed
in over 20,000 consecutive rebinds of the driver.

Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
---
 drivers/hwmon/pmbus/max31785.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/hwmon/pmbus/max31785.c b/drivers/hwmon/pmbus/max31785.c
index cbcd0b2301f4..88b7156d777e 100644
--- a/drivers/hwmon/pmbus/max31785.c
+++ b/drivers/hwmon/pmbus/max31785.c
@@ -376,6 +376,7 @@ static int max31785_of_fan_config(struct i2c_client *client,
 	u32 page;
 	u32 uval;
 	int ret;
+	int i;
 
 	if (!of_device_is_compatible(child, "pmbus-fan"))
 		return 0;
@@ -552,10 +553,24 @@ static int max31785_of_fan_config(struct i2c_client *client,
 	if (ret < 0)
 		return ret;
 
-	ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_FAN_CONFIG_12,
-						 pb_cfg);
-	if (ret < 0)
-		return ret;
+	for (i = 0; i < 2; i++) {
+		ret = max31785_i2c_smbus_write_byte_data(client,
+							 PMBUS_FAN_CONFIG_12,
+							 pb_cfg);
+		if (ret < 0)
+			continue;
+
+		ret = max31785_i2c_smbus_read_byte_data(client,
+							PMBUS_FAN_CONFIG_12);
+		if (ret < 0)
+			continue;
+
+		if (ret == pb_cfg)
+			break;
+	}
+
+	if (i == 2)
+		return -EIO;
 
 	/*
 	 * Fans are on pages 0 - 5. If the page property of a fan node is
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation
  2020-08-27 13:29 [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Andrew Jeffery
  2020-08-27 13:30 ` [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG Andrew Jeffery
@ 2020-08-27 13:30 ` Andrew Jeffery
  2020-09-01  6:18   ` Joel Stanley
  2020-08-27 13:30 ` [PATCH linux dev-5.8 3/3] pmbus: (core) Add a one-shot retry in pmbus_set_page() Andrew Jeffery
  2020-09-01  6:21 ` [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Joel Stanley
  3 siblings, 1 reply; 8+ messages in thread
From: Andrew Jeffery @ 2020-08-27 13:30 UTC (permalink / raw)
  To: joel; +Cc: openbmc

Extensive testing and tracing has shown that the MAX31785 is unreliable
in the face of PAGE write commands, ACK'ing the PAGE request but
reporting a value of 0 on some subsequent PAGE reads. The trace data
suggests that a one-shot retry of the PAGE write is enough to get the
requested value to stick.

As we configure the device before registering with the PMBus core,
centralise PAGE handling inside the driver and implement the one-shot
retry semantics there.

Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
---
 drivers/hwmon/pmbus/max31785.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/hwmon/pmbus/max31785.c b/drivers/hwmon/pmbus/max31785.c
index 88b7156d777e..a392c0efe0a6 100644
--- a/drivers/hwmon/pmbus/max31785.c
+++ b/drivers/hwmon/pmbus/max31785.c
@@ -361,6 +361,27 @@ static int max31785_write_word_data(struct i2c_client *client, int page,
 	return -ENXIO;
 }
 
+static int max31785_pmbus_set_page(struct i2c_client *client, int page)
+{
+	int ret;
+	int i;
+
+	for (i = 0; i < 2; i++) {
+		ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
+		if (ret < 0)
+			return ret;
+
+		ret = max31785_i2c_smbus_read_byte_data(client, PMBUS_PAGE);
+		if (ret < 0)
+			return ret;
+
+		if (ret == page)
+			return 0;
+	}
+
+	return -EIO;
+}
+
 /*
  * Returns negative error codes if an unrecoverable problem is detected, 0 if a
  * recoverable problem is detected, or a positive value on success.
@@ -392,7 +413,7 @@ static int max31785_of_fan_config(struct i2c_client *client,
 		return -ENXIO;
 	}
 
-	ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
+	ret = max31785_pmbus_set_page(client, page);
 	if (ret < 0)
 		return ret;
 
@@ -613,7 +634,7 @@ static int max31785_of_tmp_config(struct i2c_client *client,
 		return -ENXIO;
 	}
 
-	ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
+	ret = max31785_pmbus_set_page(client, page);
 	if (ret < 0)
 		return ret;
 
@@ -714,7 +735,7 @@ static int max31785_configure_dual_tach(struct i2c_client *client,
 	int i;
 
 	for (i = 0; i < MAX31785_NR_FAN_PAGES; i++) {
-		ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, i);
+		ret = max31785_pmbus_set_page(client, i);
 		if (ret < 0)
 			return ret;
 
@@ -756,7 +777,7 @@ static int max31785_probe(struct i2c_client *client,
 
 	*info = max31785_info;
 
-	ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, 255);
+	ret = max31785_pmbus_set_page(client, 255);
 	if (ret < 0)
 		return ret;
 
@@ -798,8 +819,7 @@ static int max31785_probe(struct i2c_client *client,
 		if (!have_fan || fan_configured)
 			continue;
 
-		ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE,
-							 i);
+		ret = max31785_pmbus_set_page(client, i);
 		if (ret < 0)
 			return ret;
 
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH linux dev-5.8 3/3] pmbus: (core) Add a one-shot retry in pmbus_set_page()
  2020-08-27 13:29 [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Andrew Jeffery
  2020-08-27 13:30 ` [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG Andrew Jeffery
  2020-08-27 13:30 ` [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation Andrew Jeffery
@ 2020-08-27 13:30 ` Andrew Jeffery
  2020-09-01  6:21 ` [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Joel Stanley
  3 siblings, 0 replies; 8+ messages in thread
From: Andrew Jeffery @ 2020-08-27 13:30 UTC (permalink / raw)
  To: joel; +Cc: openbmc

From extensive testing and tracing it was discovered that the MAX31785
occasionally fails to switch pages despite ACK'ing the PAGE PMBus data
write. I suspect this behaviour had been seen on other devices as well,
as pmbus_set_page() already read-back the freshly set value and errored
out if it wasn't what we requested.

In the case of the MAX31785 it was shown that a one-shot retry was
enough to get the PAGE write to stick if the inital command failed. To
improve robustness, only error out if the one-shot retry also fails to
stick.

Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
---
 drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/hwmon/pmbus/pmbus_core.c b/drivers/hwmon/pmbus/pmbus_core.c
index 702c25010369..7d0aac59af31 100644
--- a/drivers/hwmon/pmbus/pmbus_core.c
+++ b/drivers/hwmon/pmbus/pmbus_core.c
@@ -158,25 +158,34 @@ int pmbus_set_page(struct i2c_client *client, int page, int phase)
 
 	if (!(data->info->func[page] & PMBUS_PAGE_VIRTUAL) &&
 	    data->info->pages > 1 && page != data->currpage) {
+		int i;
+
 		dev_dbg(&client->dev, "Want page %u, %u cached\n", page,
 			data->currpage);
 
-		rv = i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
-		if (rv < 0) {
+		for (i = 0; i < 2; i++) {
 			rv = i2c_smbus_write_byte_data(client, PMBUS_PAGE,
 						       page);
-			dev_dbg(&client->dev,
-				"Failed to set page %u, performed one-shot retry %s: %d\n",
-				page, rv ? "and failed" : "with success", rv);
+
+			if (rv)
+				continue;
+
+			rv = i2c_smbus_read_byte_data(client, PMBUS_PAGE);
 			if (rv < 0)
-				return rv;
+				continue;
+
+			/* Success, exit loop */
+			if (rv == page)
+				break;
+
+			rv = i2c_smbus_read_byte_data(client, PMBUS_STATUS_CML);
+			if (rv < 0)
+				continue;
+
+			if (rv & PB_CML_FAULT_INVALID_DATA)
+				return -EIO;
 		}
-
-		rv = i2c_smbus_read_byte_data(client, PMBUS_PAGE);
-		if (rv < 0)
-			return rv;
-
-		if (rv != page)
+		if (i == 2)
 			return -EIO;
 	}
 	data->currpage = page;
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
  2020-08-27 13:30 ` [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG Andrew Jeffery
@ 2020-09-01  6:17   ` Joel Stanley
  0 siblings, 0 replies; 8+ messages in thread
From: Joel Stanley @ 2020-09-01  6:17 UTC (permalink / raw)
  To: Andrew Jeffery; +Cc: OpenBMC Maillist

On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
>
> It has been observed across large fleets of systems that a small subset
> of those systems occasionally loose control of some number of fans

lose

> across a BMC reboot (their hwmon fan attributes are missing from sysfs).
>
> From extensive testing and tracing it was discovered that writes
> enabling a fan in FAN_CONFIG_1_2 failed to stick on the system under
> test with a frequency of about 1 in 1000 re-binds of the driver.
>
> The MAX31785 datasheet recommends in the documentation for
> MFR_FAN_CONFIG that the asssociated fan(s) be disabled before updating

associated

> the register. The sequence in question implements this suggestion, and
> the observed loss-of-fans symptom occurs when the write to re-enable the
> fan in FAN_CONFIG_1_2 fails to stick.
>
> The trace data suggests a one-shot retry is enough to successfully
> update FAN_CONFIG_1_2. With the workaround, no loss of fans was observed
> in over 20,000 consecutive rebinds of the driver.

Once you add your i2c throttling you should back out this change and re-test.

>
> Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
> ---
>  drivers/hwmon/pmbus/max31785.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/hwmon/pmbus/max31785.c b/drivers/hwmon/pmbus/max31785.c
> index cbcd0b2301f4..88b7156d777e 100644
> --- a/drivers/hwmon/pmbus/max31785.c
> +++ b/drivers/hwmon/pmbus/max31785.c
> @@ -376,6 +376,7 @@ static int max31785_of_fan_config(struct i2c_client *client,
>         u32 page;
>         u32 uval;
>         int ret;
> +       int i;
>
>         if (!of_device_is_compatible(child, "pmbus-fan"))
>                 return 0;
> @@ -552,10 +553,24 @@ static int max31785_of_fan_config(struct i2c_client *client,
>         if (ret < 0)
>                 return ret;
>
> -       ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_FAN_CONFIG_12,
> -                                                pb_cfg);
> -       if (ret < 0)
> -               return ret;
> +       for (i = 0; i < 2; i++) {
> +               ret = max31785_i2c_smbus_write_byte_data(client,
> +                                                        PMBUS_FAN_CONFIG_12,
> +                                                        pb_cfg);
> +               if (ret < 0)
> +                       continue;
> +
> +               ret = max31785_i2c_smbus_read_byte_data(client,
> +                                                       PMBUS_FAN_CONFIG_12);
> +               if (ret < 0)
> +                       continue;
> +
> +               if (ret == pb_cfg)
> +                       break;
> +       }
> +
> +       if (i == 2)
> +               return -EIO;
>
>         /*
>          * Fans are on pages 0 - 5. If the page property of a fan node is
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation
  2020-08-27 13:30 ` [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation Andrew Jeffery
@ 2020-09-01  6:18   ` Joel Stanley
  0 siblings, 0 replies; 8+ messages in thread
From: Joel Stanley @ 2020-09-01  6:18 UTC (permalink / raw)
  To: Andrew Jeffery; +Cc: OpenBMC Maillist

On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
>
> Extensive testing and tracing has shown that the MAX31785 is unreliable
> in the face of PAGE write commands, ACK'ing the PAGE request but
> reporting a value of 0 on some subsequent PAGE reads. The trace data
> suggests that a one-shot retry of the PAGE write is enough to get the
> requested value to stick.
>
> As we configure the device before registering with the PMBus core,
> centralise PAGE handling inside the driver and implement the one-shot
> retry semantics there.

Are you still doing a retry with every operation due to the macro magic?

>
> Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
> ---
>  drivers/hwmon/pmbus/max31785.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hwmon/pmbus/max31785.c b/drivers/hwmon/pmbus/max31785.c
> index 88b7156d777e..a392c0efe0a6 100644
> --- a/drivers/hwmon/pmbus/max31785.c
> +++ b/drivers/hwmon/pmbus/max31785.c
> @@ -361,6 +361,27 @@ static int max31785_write_word_data(struct i2c_client *client, int page,
>         return -ENXIO;
>  }
>
> +static int max31785_pmbus_set_page(struct i2c_client *client, int page)
> +{
> +       int ret;
> +       int i;
> +
> +       for (i = 0; i < 2; i++) {
> +               ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
> +               if (ret < 0)
> +                       return ret;
> +
> +               ret = max31785_i2c_smbus_read_byte_data(client, PMBUS_PAGE);
> +               if (ret < 0)
> +                       return ret;
> +
> +               if (ret == page)
> +                       return 0;
> +       }
> +
> +       return -EIO;
> +}
> +
>  /*
>   * Returns negative error codes if an unrecoverable problem is detected, 0 if a
>   * recoverable problem is detected, or a positive value on success.
> @@ -392,7 +413,7 @@ static int max31785_of_fan_config(struct i2c_client *client,
>                 return -ENXIO;
>         }
>
> -       ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
> +       ret = max31785_pmbus_set_page(client, page);
>         if (ret < 0)
>                 return ret;
>
> @@ -613,7 +634,7 @@ static int max31785_of_tmp_config(struct i2c_client *client,
>                 return -ENXIO;
>         }
>
> -       ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, page);
> +       ret = max31785_pmbus_set_page(client, page);
>         if (ret < 0)
>                 return ret;
>
> @@ -714,7 +735,7 @@ static int max31785_configure_dual_tach(struct i2c_client *client,
>         int i;
>
>         for (i = 0; i < MAX31785_NR_FAN_PAGES; i++) {
> -               ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, i);
> +               ret = max31785_pmbus_set_page(client, i);
>                 if (ret < 0)
>                         return ret;
>
> @@ -756,7 +777,7 @@ static int max31785_probe(struct i2c_client *client,
>
>         *info = max31785_info;
>
> -       ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE, 255);
> +       ret = max31785_pmbus_set_page(client, 255);
>         if (ret < 0)
>                 return ret;
>
> @@ -798,8 +819,7 @@ static int max31785_probe(struct i2c_client *client,
>                 if (!have_fan || fan_configured)
>                         continue;
>
> -               ret = max31785_i2c_smbus_write_byte_data(client, PMBUS_PAGE,
> -                                                        i);
> +               ret = max31785_pmbus_set_page(client, i);
>                 if (ret < 0)
>                         return ret;
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds
  2020-08-27 13:29 [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Andrew Jeffery
                   ` (2 preceding siblings ...)
  2020-08-27 13:30 ` [PATCH linux dev-5.8 3/3] pmbus: (core) Add a one-shot retry in pmbus_set_page() Andrew Jeffery
@ 2020-09-01  6:21 ` Joel Stanley
  2020-10-26  0:45   ` Andrew Jeffery
  3 siblings, 1 reply; 8+ messages in thread
From: Joel Stanley @ 2020-09-01  6:21 UTC (permalink / raw)
  To: Andrew Jeffery; +Cc: OpenBMC Maillist

On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
>
> Hello,
>
> This series works around reliability problems with the MAX31785 fan controller
> as observed in the field on some POWER systems.
>
> I'm the first to admit the patches are not elegant, so feedback there is
> appreciated.

The way you implemented the loop took me several goes to grok. I
finally got there.


>
> Separately, our previous workarounds have run aground upstream as the hwmon
> maintainer was unable to reproduce our observations on the MAX31785 evaluation
> kit. I've recently received an evaluation kit, so I plan on putting some of
> these issues to the test myself. Ultimately this will help determine whether we
> have issues with our fan card designs or whether the controller itself is at
> fault (I have to admit, given some of the failures, it's hard to see how the
> controller might not be at fault). Basically, this paragraph is my excuse for
> not pushing these patches further upstream for the moment; I will re-evaluate
> once I have the results from testing against the evkit.

I would post this series upstream so Guneter has some context for
future patches that come out of your investigation.

>
> In the mean time, these patches resolve issues we've seen in some system
> deployments. Taken together, I've put the driver through an unbind/bind loop
> of over 20,000 iterations with no "loss" of fans, where prior to the series we
> typically achieved only a few hundred. This feels like a significant
> improvement, so please consider merging despite their ugliness.

Do you want these in dev-5.4, or both 5.4 and 5.8?

>
> Cheers,
>
> Andrew
>
> Andrew Jeffery (3):
>   pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
>   pmbus: (max31785) Add a local pmbus_set_page() implementation
>   pmbus: (core) Add a one-shot retry in pmbus_set_page()
>
>  drivers/hwmon/pmbus/max31785.c   | 55 ++++++++++++++++++++++++++------
>  drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++-------
>  2 files changed, 66 insertions(+), 22 deletions(-)
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds
  2020-09-01  6:21 ` [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Joel Stanley
@ 2020-10-26  0:45   ` Andrew Jeffery
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Jeffery @ 2020-10-26  0:45 UTC (permalink / raw)
  To: Joel Stanley; +Cc: OpenBMC Maillist



On Tue, 1 Sep 2020, at 15:51, Joel Stanley wrote:
> On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
> >
> > Hello,
> >
> > This series works around reliability problems with the MAX31785 fan controller
> > as observed in the field on some POWER systems.
> >
> > I'm the first to admit the patches are not elegant, so feedback there is
> > appreciated.
> 
> The way you implemented the loop took me several goes to grok. I
> finally got there.
> 
> 
> >
> > Separately, our previous workarounds have run aground upstream as the hwmon
> > maintainer was unable to reproduce our observations on the MAX31785 evaluation
> > kit. I've recently received an evaluation kit, so I plan on putting some of
> > these issues to the test myself. Ultimately this will help determine whether we
> > have issues with our fan card designs or whether the controller itself is at
> > fault (I have to admit, given some of the failures, it's hard to see how the
> > controller might not be at fault). Basically, this paragraph is my excuse for
> > not pushing these patches further upstream for the moment; I will re-evaluate
> > once I have the results from testing against the evkit.
> 
> I would post this series upstream so Guneter has some context for
> future patches that come out of your investigation.
> 
> >
> > In the mean time, these patches resolve issues we've seen in some system
> > deployments. Taken together, I've put the driver through an unbind/bind loop
> > of over 20,000 iterations with no "loss" of fans, where prior to the series we
> > typically achieved only a few hundred. This feels like a significant
> > improvement, so please consider merging despite their ugliness.
> 
> Do you want these in dev-5.4, or both 5.4 and 5.8?

A bit late to respond, but can we apply these to 5.8 as well? I've experimented
with applying the same workaround as we did for the UCD90320, but it looks
like a bad idea: We're not actually monitoring the UCD90320 sensors at runtime
and so this hides the scheduling latency/jank that the I2C throttling patches
introduce. This jank is exposed by the fact that unlike the UCD90320 we do
monitor the MAX31785 sensors. Decreasing the jank by increasing the upper
bound of usleep_range() to e.g. a full tick leads to unacceptable latencies when
reading the MAX31785 sensor data (one read blows out to ~2.8 seconds due
to the caching strategy implemented in pmbus_core.c).

Cheers,

Andrew

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-10-26  0:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-27 13:29 [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Andrew Jeffery
2020-08-27 13:30 ` [PATCH linux dev-5.8 1/3] pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG Andrew Jeffery
2020-09-01  6:17   ` Joel Stanley
2020-08-27 13:30 ` [PATCH linux dev-5.8 2/3] pmbus: (max31785) Add a local pmbus_set_page() implementation Andrew Jeffery
2020-09-01  6:18   ` Joel Stanley
2020-08-27 13:30 ` [PATCH linux dev-5.8 3/3] pmbus: (core) Add a one-shot retry in pmbus_set_page() Andrew Jeffery
2020-09-01  6:21 ` [PATCH linux dev-5.8 0/3] MAX31785 Fan Controller Work-arounds Joel Stanley
2020-10-26  0:45   ` Andrew Jeffery

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).