All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
@ 2016-12-20 14:16 Vipin K Parashar
  2016-12-21  5:24 ` Mukesh Ojha
  2017-01-27  0:17 ` Michael Ellerman
  0 siblings, 2 replies; 11+ messages in thread
From: Vipin K Parashar @ 2016-12-20 14:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Vipin K Parashar

Added check for OPAL_WRONG_STATE error code returned from OPAL.
Currently Linux flashes "unexpected error" over console for this
error. This will avoid throwing such message and return I/O error
for such OPAL failures.

Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..ab91d53 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -866,9 +866,10 @@ int opal_error_code(int rc)
 	case OPAL_NO_MEM:		return -ENOMEM;
 	case OPAL_PERMISSION:		return -EPERM;
 
-	case OPAL_UNSUPPORTED:		return -EIO;
-	case OPAL_HARDWARE:		return -EIO;
-	case OPAL_INTERNAL_ERROR:	return -EIO;
+	case OPAL_UNSUPPORTED:
+	case OPAL_HARDWARE:
+	case OPAL_INTERNAL_ERROR:
+	case OPAL_WRONG_STATE:		return -EIO;
 	default:
 		pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
 		return -EIO;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2016-12-20 14:16 [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails Vipin K Parashar
@ 2016-12-21  5:24 ` Mukesh Ojha
  2017-01-27  0:17 ` Michael Ellerman
  1 sibling, 0 replies; 11+ messages in thread
From: Mukesh Ojha @ 2016-12-21  5:24 UTC (permalink / raw)
  To: linuxppc-dev



On Tuesday 20 December 2016 07:46 PM, Vipin K Parashar wrote:
> Added check for OPAL_WRONG_STATE error code returned from OPAL.
> Currently Linux flashes "unexpected error" over console for this
> error. This will avoid throwing such message and return I/O error
> for such OPAL failures.
>
> Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/opal.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 2822935..ab91d53 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -866,9 +866,10 @@ int opal_error_code(int rc)
>   	case OPAL_NO_MEM:		return -ENOMEM;
>   	case OPAL_PERMISSION:		return -EPERM;
>
> -	case OPAL_UNSUPPORTED:		return -EIO;
> -	case OPAL_HARDWARE:		return -EIO;
> -	case OPAL_INTERNAL_ERROR:	return -EIO;
> +	case OPAL_UNSUPPORTED:
> +	case OPAL_HARDWARE:
> +	case OPAL_INTERNAL_ERROR:
> +	case OPAL_WRONG_STATE:		return -EIO;

Looks good.

Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>

>   	default:
>   		pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
>   		return -EIO;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2016-12-20 14:16 [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails Vipin K Parashar
  2016-12-21  5:24 ` Mukesh Ojha
@ 2017-01-27  0:17 ` Michael Ellerman
  2017-01-27  6:48   ` Vipin K Parashar
  1 sibling, 1 reply; 11+ messages in thread
From: Michael Ellerman @ 2017-01-27  0:17 UTC (permalink / raw)
  To: Vipin K Parashar, linuxppc-dev; +Cc: Vipin K Parashar

Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:

> Added check for OPAL_WRONG_STATE error code returned from OPAL.
> Currently Linux flashes "unexpected error" over console for this
> error. This will avoid throwing such message and return I/O error
> for such OPAL failures.

Why do we expect to get OPAL_WRONG_STATE ?

cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-01-27  0:17 ` Michael Ellerman
@ 2017-01-27  6:48   ` Vipin K Parashar
  2017-02-13  0:43     ` Michael Ellerman
  0 siblings, 1 reply; 11+ messages in thread
From: Vipin K Parashar @ 2017-01-27  6:48 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev

OPAL returns OPAL_WRONG_STATE for XSCOM operations

done to read any core FIR which is sleeping, offline.


On Friday 27 January 2017 05:47 AM, Michael Ellerman wrote:
> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>
>> Added check for OPAL_WRONG_STATE error code returned from OPAL.
>> Currently Linux flashes "unexpected error" over console for this
>> error. This will avoid throwing such message and return I/O error
>> for such OPAL failures.
> Why do we expect to get OPAL_WRONG_STATE ?
>
> cheers
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-01-27  6:48   ` Vipin K Parashar
@ 2017-02-13  0:43     ` Michael Ellerman
  2017-02-15  5:01       ` Stewart Smith
  2017-02-15 20:12       ` Vipin K Parashar
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Ellerman @ 2017-02-13  0:43 UTC (permalink / raw)
  To: Vipin K Parashar, linuxppc-dev; +Cc: Stewart Smith

Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:

> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>
> done to read any core FIR which is sleeping, offline.

OK.

Do we know why Linux is causing that to happen?

It's also returned from many of the XIVE routines if we're in the wrong
xive mode, all of which would indicate a fairly bad Linux bug.

Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
explicitly so we could differentiate from other errors:

    commit 9c2d82394fd2303847cac4a665dee62556ca528a
    Author:     Russell Currey <ruscur@russell.cc>
    AuthorDate: Mon Mar 21 12:00:00 2016 +1100

    xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
    
    xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
    OPAL_HARDWARE if they didn't.  This doesn't provide information about why
    the operation failed, such as if the CPU happens to be asleep.
    
    This is specifically useful in error scanning, so if every CPU is being
    scanned for errors, sleeping CPUs likely aren't the cause of failures.
    
    So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
    sleeping.
    
    Signed-off-by: Russell Currey <ruscur@russell.cc>
    Reviewed-by: Alistair Popple <alistair@popple.id.au>
    Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>



So I'm still not convinced that quietly swallowing this error and
mapping it to -EIO along with several of the other error codes is the
right thing to do.

cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-13  0:43     ` Michael Ellerman
@ 2017-02-15  5:01       ` Stewart Smith
  2017-02-15 20:12       ` Vipin K Parashar
  1 sibling, 0 replies; 11+ messages in thread
From: Stewart Smith @ 2017-02-15  5:01 UTC (permalink / raw)
  To: Michael Ellerman, Vipin K Parashar, linuxppc-dev

Michael Ellerman <mpe@ellerman.id.au> writes:
> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>
>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>
>> done to read any core FIR which is sleeping, offline.
>
> OK.
>
> Do we know why Linux is causing that to happen?
>
> It's also returned from many of the XIVE routines if we're in the wrong
> xive mode, all of which would indicate a fairly bad Linux bug.
>
> Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
> explicitly so we could differentiate from other errors:
>
>     commit 9c2d82394fd2303847cac4a665dee62556ca528a
>     Author:     Russell Currey <ruscur@russell.cc>
>     AuthorDate: Mon Mar 21 12:00:00 2016 +1100
>
>     xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
>     
>     xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
>     OPAL_HARDWARE if they didn't.  This doesn't provide information about why
>     the operation failed, such as if the CPU happens to be asleep.
>     
>     This is specifically useful in error scanning, so if every CPU is being
>     scanned for errors, sleeping CPUs likely aren't the cause of failures.
>     
>     So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
>     sleeping.
>     
>     Signed-off-by: Russell Currey <ruscur@russell.cc>
>     Reviewed-by: Alistair Popple <alistair@popple.id.au>
>     Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>
>
>
> So I'm still not convinced that quietly swallowing this error and
> mapping it to -EIO along with several of the other error codes is the
> right thing to do.

FWIW I agree - pretty limited cases where it should just be converted
into -EIO and passed on - probably *just* the debugfs interface to be honest.

-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-13  0:43     ` Michael Ellerman
  2017-02-15  5:01       ` Stewart Smith
@ 2017-02-15 20:12       ` Vipin K Parashar
  2017-02-16  0:52         ` Stewart Smith
  1 sibling, 1 reply; 11+ messages in thread
From: Vipin K Parashar @ 2017-02-15 20:12 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: Stewart Smith

Hi Michael,

Thanks!! for review.

Answers to your questions as below:


On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>
>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>
>> done to read any core FIR which is sleeping, offline.
> OK.
>
> Do we know why Linux is causing that to happen?

This issue is originally seen upon running STAF (Software Test
Automation Framework) stress tests and off-lining some cores
with stress tests running.

It can also be re-created after off-lining few cores and following
one of below methods.
1. Executing Linux "sensors" command
2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
    where X is offline CPU.

Its "opal_get_sensor_data" Linux API that that triggers
OPAL call "opal_sensor_read", performing XSCOM ops here.
If core is found sleeping/offline Linux throws up
"opal_error_code: Unexpected OPAL error" error onto console.

Currently Linux isn't aware about OPAL_WRONG_STATE return code
from OPAL. Thus it prints "Unexpected OPAL error" message, same
as it would log for any unknown OPAL return codes.

Seeing this error over console has been a concern for Test and
would puzzle real user as well. This patch makes Linux aware about
OPAL_WRONG_STATE return code from OPAL and stops printing
"Unexpected OPAL error" message onto console for OPAL fails
with OPAL_WRONG_STATE

>
> It's also returned from many of the XIVE routines if we're in the wrong
> xive mode, all of which would indicate a fairly bad Linux bug.
>
> Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
> explicitly so we could differentiate from other errors:
>
>      commit 9c2d82394fd2303847cac4a665dee62556ca528a
>      Author:     Russell Currey <ruscur@russell.cc>
>      AuthorDate: Mon Mar 21 12:00:00 2016 +1100
>
>      xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
>      
>      xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
>      OPAL_HARDWARE if they didn't.  This doesn't provide information about why
>      the operation failed, such as if the CPU happens to be asleep.
>      
>      This is specifically useful in error scanning, so if every CPU is being
>      scanned for errors, sleeping CPUs likely aren't the cause of failures.
>      
>      So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
>      sleeping.
>      
>      Signed-off-by: Russell Currey <ruscur@russell.cc>
>      Reviewed-by: Alistair Popple <alistair@popple.id.au>
>      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>
>
>
> So I'm still not convinced that quietly swallowing this error and
> mapping it to -EIO along with several of the other error codes is the
> right thing to do.

How about returning -ENXIO upon receiving OPAL_WRONG_STATE ?

while -EIO remains to be returned for OPAL_HARDWARE.

I can send out new patch doing pr_notice for fails with supported OPAL
return codes and pr_err for any unexpected OPAL return code. So this way
we will have logging of any OPAL call failure onto Linux log and only
unexpected OPAL error codes would get flashed onto console.

>
> cheers
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-15 20:12       ` Vipin K Parashar
@ 2017-02-16  0:52         ` Stewart Smith
  2017-02-20  5:03           ` Michael Ellerman
  0 siblings, 1 reply; 11+ messages in thread
From: Stewart Smith @ 2017-02-16  0:52 UTC (permalink / raw)
  To: Vipin K Parashar, Michael Ellerman, linuxppc-dev

Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>
>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>
>>> done to read any core FIR which is sleeping, offline.
>> OK.
>>
>> Do we know why Linux is causing that to happen?
>
> This issue is originally seen upon running STAF (Software Test
> Automation Framework) stress tests and off-lining some cores
> with stress tests running.
>
> It can also be re-created after off-lining few cores and following
> one of below methods.
> 1. Executing Linux "sensors" command
> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>     where X is offline CPU.
>
> Its "opal_get_sensor_data" Linux API that that triggers
> OPAL call "opal_sensor_read", performing XSCOM ops here.
> If core is found sleeping/offline Linux throws up
> "opal_error_code: Unexpected OPAL error" error onto console.
>
> Currently Linux isn't aware about OPAL_WRONG_STATE return code
> from OPAL. Thus it prints "Unexpected OPAL error" message, same
> as it would log for any unknown OPAL return codes.
>
> Seeing this error over console has been a concern for Test and
> would puzzle real user as well. This patch makes Linux aware about
> OPAL_WRONG_STATE return code from OPAL and stops printing
> "Unexpected OPAL error" message onto console for OPAL fails
> with OPAL_WRONG_STATE

Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
return the xscom_read return code in event of error.

I would argue that converting to EIO in that instance is probably
correct... or EAGAIN? EAGAIN may be more correct in the situation where
the core is just sleeping.

What kind of offlining are you doing?

Arguably, the correct behaviour would be to remove said sensors when the
core is offline.

>> It's also returned from many of the XIVE routines if we're in the wrong
>> xive mode, all of which would indicate a fairly bad Linux bug.
>>
>> Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
>> explicitly so we could differentiate from other errors:
>>
>>      commit 9c2d82394fd2303847cac4a665dee62556ca528a
>>      Author:     Russell Currey <ruscur@russell.cc>
>>      AuthorDate: Mon Mar 21 12:00:00 2016 +1100
>>
>>      xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
>>      
>>      xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
>>      OPAL_HARDWARE if they didn't.  This doesn't provide information about why
>>      the operation failed, such as if the CPU happens to be asleep.
>>      
>>      This is specifically useful in error scanning, so if every CPU is being
>>      scanned for errors, sleeping CPUs likely aren't the cause of failures.
>>      
>>      So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
>>      sleeping.
>>      
>>      Signed-off-by: Russell Currey <ruscur@russell.cc>
>>      Reviewed-by: Alistair Popple <alistair@popple.id.au>
>>      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>>
>>
>>
>> So I'm still not convinced that quietly swallowing this error and
>> mapping it to -EIO along with several of the other error codes is the
>> right thing to do.
>
> How about returning -ENXIO upon receiving OPAL_WRONG_STATE ?

I think ENXIO would be incorrect, as it is "No such device or address"
when that almost certainly isn't the case. It's a solvable error -
there's just some XSCOMS you can't do at certain times.


-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-16  0:52         ` Stewart Smith
@ 2017-02-20  5:03           ` Michael Ellerman
  2017-02-23  3:52             ` Stewart Smith
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Ellerman @ 2017-02-20  5:03 UTC (permalink / raw)
  To: Stewart Smith, Vipin K Parashar, linuxppc-dev

Stewart Smith <stewart@linux.vnet.ibm.com> writes:

> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>>
>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>
>>>> done to read any core FIR which is sleeping, offline.
>>> OK.
>>>
>>> Do we know why Linux is causing that to happen?
>>
>> This issue is originally seen upon running STAF (Software Test
>> Automation Framework) stress tests and off-lining some cores
>> with stress tests running.
>>
>> It can also be re-created after off-lining few cores and following
>> one of below methods.
>> 1. Executing Linux "sensors" command
>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>>     where X is offline CPU.
>>
>> Its "opal_get_sensor_data" Linux API that that triggers
>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>> If core is found sleeping/offline Linux throws up
>> "opal_error_code: Unexpected OPAL error" error onto console.
>>
>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>> as it would log for any unknown OPAL return codes.
>>
>> Seeing this error over console has been a concern for Test and
>> would puzzle real user as well. This patch makes Linux aware about
>> OPAL_WRONG_STATE return code from OPAL and stops printing
>> "Unexpected OPAL error" message onto console for OPAL fails
>> with OPAL_WRONG_STATE
>
> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
> return the xscom_read return code in event of error.
>
> I would argue that converting to EIO in that instance is probably
> correct... or EAGAIN? EAGAIN may be more correct in the situation where
> the core is just sleeping.
>
> What kind of offlining are you doing?
>
> Arguably, the correct behaviour would be to remove said sensors when the
> core is offline.

Right, that would be ideal. There appear to be at least two other hwmon
drivers that are CPU hotplug aware (coretemp and via-cputemp).

But perhaps it's not possible to work out which sensors are attached to
which CPU etc., I haven't looked in detail.

In that case changing just opal_get_sensor_data() to handle
OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
asked to read a sensor on an offline CPU and we aren't able to detect
that.

cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-20  5:03           ` Michael Ellerman
@ 2017-02-23  3:52             ` Stewart Smith
  2017-02-28  9:20               ` Vipin K Parashar
  0 siblings, 1 reply; 11+ messages in thread
From: Stewart Smith @ 2017-02-23  3:52 UTC (permalink / raw)
  To: Michael Ellerman, Vipin K Parashar, linuxppc-dev

Michael Ellerman <mpe@ellerman.id.au> writes:

> Stewart Smith <stewart@linux.vnet.ibm.com> writes:
>
>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>>>
>>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>>
>>>>> done to read any core FIR which is sleeping, offline.
>>>> OK.
>>>>
>>>> Do we know why Linux is causing that to happen?
>>>
>>> This issue is originally seen upon running STAF (Software Test
>>> Automation Framework) stress tests and off-lining some cores
>>> with stress tests running.
>>>
>>> It can also be re-created after off-lining few cores and following
>>> one of below methods.
>>> 1. Executing Linux "sensors" command
>>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>>>     where X is offline CPU.
>>>
>>> Its "opal_get_sensor_data" Linux API that that triggers
>>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>>> If core is found sleeping/offline Linux throws up
>>> "opal_error_code: Unexpected OPAL error" error onto console.
>>>
>>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>>> as it would log for any unknown OPAL return codes.
>>>
>>> Seeing this error over console has been a concern for Test and
>>> would puzzle real user as well. This patch makes Linux aware about
>>> OPAL_WRONG_STATE return code from OPAL and stops printing
>>> "Unexpected OPAL error" message onto console for OPAL fails
>>> with OPAL_WRONG_STATE
>>
>> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
>> return the xscom_read return code in event of error.
>>
>> I would argue that converting to EIO in that instance is probably
>> correct... or EAGAIN? EAGAIN may be more correct in the situation where
>> the core is just sleeping.
>>
>> What kind of offlining are you doing?
>>
>> Arguably, the correct behaviour would be to remove said sensors when the
>> core is offline.
>
> Right, that would be ideal. There appear to be at least two other hwmon
> drivers that are CPU hotplug aware (coretemp and via-cputemp).
>
> But perhaps it's not possible to work out which sensors are attached to
> which CPU etc., I haven't looked in detail.

Each core-temp@ sensor has a ibm,pir property, so linking back to what
core shouldn't be too hard. For mem-temp@ sensors, we have the chip-id.

> In that case changing just opal_get_sensor_data() to handle
> OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
> asked to read a sensor on an offline CPU and we aren't able to detect
> that.

Agree.

-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
  2017-02-23  3:52             ` Stewart Smith
@ 2017-02-28  9:20               ` Vipin K Parashar
  0 siblings, 0 replies; 11+ messages in thread
From: Vipin K Parashar @ 2017-02-28  9:20 UTC (permalink / raw)
  To: Stewart Smith, Michael Ellerman, linuxppc-dev

Thanks!! for review.

Sending out v2 with  suggested changes.


On Thursday 23 February 2017 09:22 AM, Stewart Smith wrote:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> Stewart Smith <stewart@linux.vnet.ibm.com> writes:
>>
>>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>>>> Vipin K Parashar <vipin@linux.vnet.ibm.com> writes:
>>>>>
>>>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>>>
>>>>>> done to read any core FIR which is sleeping, offline.
>>>>> OK.
>>>>>
>>>>> Do we know why Linux is causing that to happen?
>>>> This issue is originally seen upon running STAF (Software Test
>>>> Automation Framework) stress tests and off-lining some cores
>>>> with stress tests running.
>>>>
>>>> It can also be re-created after off-lining few cores and following
>>>> one of below methods.
>>>> 1. Executing Linux "sensors" command
>>>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>>>>      where X is offline CPU.
>>>>
>>>> Its "opal_get_sensor_data" Linux API that that triggers
>>>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>>>> If core is found sleeping/offline Linux throws up
>>>> "opal_error_code: Unexpected OPAL error" error onto console.
>>>>
>>>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>>>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>>>> as it would log for any unknown OPAL return codes.
>>>>
>>>> Seeing this error over console has been a concern for Test and
>>>> would puzzle real user as well. This patch makes Linux aware about
>>>> OPAL_WRONG_STATE return code from OPAL and stops printing
>>>> "Unexpected OPAL error" message onto console for OPAL fails
>>>> with OPAL_WRONG_STATE
>>> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
>>> return the xscom_read return code in event of error.
>>>
>>> I would argue that converting to EIO in that instance is probably
>>> correct... or EAGAIN? EAGAIN may be more correct in the situation where
>>> the core is just sleeping.
>>>
>>> What kind of offlining are you doing?
>>>
>>> Arguably, the correct behaviour would be to remove said sensors when the
>>> core is offline.
>> Right, that would be ideal. There appear to be at least two other hwmon
>> drivers that are CPU hotplug aware (coretemp and via-cputemp).
>>
>> But perhaps it's not possible to work out which sensors are attached to
>> which CPU etc., I haven't looked in detail.
> Each core-temp@ sensor has a ibm,pir property, so linking back to what
> core shouldn't be too hard. For mem-temp@ sensors, we have the chip-id.
>
>> In that case changing just opal_get_sensor_data() to handle
>> OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
>> asked to read a sensor on an offline CPU and we aren't able to detect
>> that.
> Agree.
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-02-28  9:20 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-20 14:16 [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails Vipin K Parashar
2016-12-21  5:24 ` Mukesh Ojha
2017-01-27  0:17 ` Michael Ellerman
2017-01-27  6:48   ` Vipin K Parashar
2017-02-13  0:43     ` Michael Ellerman
2017-02-15  5:01       ` Stewart Smith
2017-02-15 20:12       ` Vipin K Parashar
2017-02-16  0:52         ` Stewart Smith
2017-02-20  5:03           ` Michael Ellerman
2017-02-23  3:52             ` Stewart Smith
2017-02-28  9:20               ` Vipin K Parashar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.