linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
@ 2015-09-21  9:29 Wei Yang
  2015-09-21 11:49 ` Gavin Shan
  0 siblings, 1 reply; 5+ messages in thread
From: Wei Yang @ 2015-09-21  9:29 UTC (permalink / raw)
  To: gwshan; +Cc: linuxppc-dev, Wei Yang

Current EEH infrastructure would avoid to handle EEH when a PE is passed to
guest, while if this PE is a Child PE of the one hit EEH, host would handle
this. By doing so, this would leads to guest hang. The correct way is
avoid to handle it on host and let guest to recover.

This patch avoids to handle EEH on a passed Child PE.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_pe.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 5cde950..c6d0e9f 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
  * callback returns something other than NULL, or no more PEs
  * to be traversed.
  */
+static void *__eeh_pe_get(void *data, void *flag);
 void *eeh_pe_traverse(struct eeh_pe *root,
 		      eeh_traverse_func fn, void *flag)
 {
@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
 	void *ret;
 
 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
+		if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
+			continue;
 		ret = fn(pe, flag);
 		if (ret) return ret;
 	}
@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,
 
 	/* Traverse root PE */
 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
+		if (eeh_pe_passed(pe))
+			continue;
 		eeh_pe_for_each_dev(pe, edev, tmp) {
 			ret = fn(edev, flag);
 			if (ret)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
  2015-09-21  9:29 [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE Wei Yang
@ 2015-09-21 11:49 ` Gavin Shan
  2015-09-22  4:43   ` Wei Yang
  0 siblings, 1 reply; 5+ messages in thread
From: Gavin Shan @ 2015-09-21 11:49 UTC (permalink / raw)
  To: Wei Yang; +Cc: gwshan, linuxppc-dev

On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote:
>Current EEH infrastructure would avoid to handle EEH when a PE is passed to
>guest, while if this PE is a Child PE of the one hit EEH, host would handle
>this. By doing so, this would leads to guest hang. The correct way is
>avoid to handle it on host and let guest to recover.
>
>This patch avoids to handle EEH on a passed Child PE.
>

Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting
EEH error, right? If so, I'm not sure if you really tested this code. Does
it work for you?

When the parent PE (PF) is stopped for EEH recovery, it sounds impossible
that the child PE can't be affected and just escape from the error. The
question is how the guest can continue to work after the EEH recovery on
parent PE?

>Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>---
> arch/powerpc/kernel/eeh_pe.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>index 5cde950..c6d0e9f 100644
>--- a/arch/powerpc/kernel/eeh_pe.c
>+++ b/arch/powerpc/kernel/eeh_pe.c
>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
>  * callback returns something other than NULL, or no more PEs
>  * to be traversed.
>  */
>+static void *__eeh_pe_get(void *data, void *flag);
> void *eeh_pe_traverse(struct eeh_pe *root,
> 		      eeh_traverse_func fn, void *flag)
> {
>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
> 	void *ret;
>
> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>+		if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
>+			continue;

The code change here seems ugly.

The "flag" can be extended to carry the information to skip pass-through
PEs or not. So the function calling eeh_pe_traverse() decides to skip
pass-through PEs or not.

> 		ret = fn(pe, flag);
> 		if (ret) return ret;
> 	}
>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,

>
> 	/* Traverse root PE */
> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>+		if (eeh_pe_passed(pe))
>+			continue;
> 		eeh_pe_for_each_dev(pe, edev, tmp) {
> 			ret = fn(edev, flag);
> 			if (ret)
>-- 
>2.5.0
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
  2015-09-21 11:49 ` Gavin Shan
@ 2015-09-22  4:43   ` Wei Yang
  2015-09-22 23:07     ` Gavin Shan
  0 siblings, 1 reply; 5+ messages in thread
From: Wei Yang @ 2015-09-22  4:43 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Wei Yang, linuxppc-dev

On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote:
>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote:
>>Current EEH infrastructure would avoid to handle EEH when a PE is passed to
>>guest, while if this PE is a Child PE of the one hit EEH, host would handle
>>this. By doing so, this would leads to guest hang. The correct way is
>>avoid to handle it on host and let guest to recover.
>>
>>This patch avoids to handle EEH on a passed Child PE.
>>
>
>Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting
>EEH error, right? If so, I'm not sure if you really tested this code. Does
>it work for you?

Yes, I inject error on Parent Bus PE.

>
>When the parent PE (PF) is stopped for EEH recovery, it sounds impossible
>that the child PE can't be affected and just escape from the error. The
>question is how the guest can continue to work after the EEH recovery on
>parent PE?

What I see is the PF is covering and VF in guest is recovering.

>
>>Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>>---
>> arch/powerpc/kernel/eeh_pe.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>>index 5cde950..c6d0e9f 100644
>>--- a/arch/powerpc/kernel/eeh_pe.c
>>+++ b/arch/powerpc/kernel/eeh_pe.c
>>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
>>  * callback returns something other than NULL, or no more PEs
>>  * to be traversed.
>>  */
>>+static void *__eeh_pe_get(void *data, void *flag);
>> void *eeh_pe_traverse(struct eeh_pe *root,
>> 		      eeh_traverse_func fn, void *flag)
>> {
>>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
>> 	void *ret;
>>
>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>+		if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
>>+			continue;
>
>The code change here seems ugly.
>
>The "flag" can be extended to carry the information to skip pass-through
>PEs or not. So the function calling eeh_pe_traverse() decides to skip
>pass-through PEs or not.

I don't get the point, which "flag" you mean? Add a flag in eeh_pe?

>
>> 		ret = fn(pe, flag);
>> 		if (ret) return ret;
>> 	}
>>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,
>
>>
>> 	/* Traverse root PE */
>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>+		if (eeh_pe_passed(pe))
>>+			continue;
>> 		eeh_pe_for_each_dev(pe, edev, tmp) {
>> 			ret = fn(edev, flag);
>> 			if (ret)
>>-- 
>>2.5.0
>>

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
  2015-09-22  4:43   ` Wei Yang
@ 2015-09-22 23:07     ` Gavin Shan
  2015-09-25  8:19       ` Wei Yang
  0 siblings, 1 reply; 5+ messages in thread
From: Gavin Shan @ 2015-09-22 23:07 UTC (permalink / raw)
  To: Wei Yang; +Cc: Gavin Shan, linuxppc-dev

On Tue, Sep 22, 2015 at 12:43:03PM +0800, Wei Yang wrote:
>On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote:
>>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote:
>>>Current EEH infrastructure would avoid to handle EEH when a PE is passed to
>>>guest, while if this PE is a Child PE of the one hit EEH, host would handle
>>>this. By doing so, this would leads to guest hang. The correct way is
>>>avoid to handle it on host and let guest to recover.
>>>
>>>This patch avoids to handle EEH on a passed Child PE.
>>>
>>
>>Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting
>>EEH error, right? If so, I'm not sure if you really tested this code. Does
>>it work for you?
>
>Yes, I inject error on Parent Bus PE.
>
>>
>>When the parent PE (PF) is stopped for EEH recovery, it sounds impossible
>>that the child PE can't be affected and just escape from the error. The
>>question is how the guest can continue to work after the EEH recovery on
>>parent PE?
>
>What I see is the PF is covering and VF in guest is recovering.
>

What do you mean by "covering"? Which PE's error is detected first in your
testing? There is potentially race here: when the VF PE's error is detected
first and guest tries to recover it. After the recovery happened on guest
side, the host detects the PF PE's error and tries to recover it. During
the recovery, the VF PE is total unusable but guest doesn't know that and
operate like usual. I'm not sure what kinds of problems it can incur, but
it would incur issues.

On the other hand, if PF PE's error is detected on host first, and then the
guest detects the error on VF PE. After that, the host and guest try to do
recovery at same time. Host issues PE reset to PF PE, which isn't finished
yet. Then guest issues PE reset to VF PE, which will cause another EEH error.

So I'm not sure if had this patch fully tested. If so, it seems the result is
just achieved by luck, I guess.

>>
>>>Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>>>---
>>> arch/powerpc/kernel/eeh_pe.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>>>index 5cde950..c6d0e9f 100644
>>>--- a/arch/powerpc/kernel/eeh_pe.c
>>>+++ b/arch/powerpc/kernel/eeh_pe.c
>>>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
>>>  * callback returns something other than NULL, or no more PEs
>>>  * to be traversed.
>>>  */
>>>+static void *__eeh_pe_get(void *data, void *flag);
>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>> 		      eeh_traverse_func fn, void *flag)
>>> {
>>>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
>>> 	void *ret;
>>>
>>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>+		if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
>>>+			continue;
>>
>>The code change here seems ugly.
>>
>>The "flag" can be extended to carry the information to skip pass-through
>>PEs or not. So the function calling eeh_pe_traverse() decides to skip
>>pass-through PEs or not.
>
>I don't get the point, which "flag" you mean? Add a flag in eeh_pe?
>

>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>>                   eeh_traverse_func fn, void *flag)
                                            ^^^^^^^^^^
                                            This one

The code needn't to be changed in a hurry though. I don't think it's right
way to fix the issue as discussed as above.

>>
>>> 		ret = fn(pe, flag);
>>> 		if (ret) return ret;
>>> 	}
>>>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,
>>
>>>
>>> 	/* Traverse root PE */
>>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>+		if (eeh_pe_passed(pe))
>>>+			continue;
>>> 		eeh_pe_for_each_dev(pe, edev, tmp) {
>>> 			ret = fn(edev, flag);
>>> 			if (ret)
>>>-- 
>>>2.5.0
>>>
>
>-- 
>Richard Yang
>Help you, Help me

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
  2015-09-22 23:07     ` Gavin Shan
@ 2015-09-25  8:19       ` Wei Yang
  0 siblings, 0 replies; 5+ messages in thread
From: Wei Yang @ 2015-09-25  8:19 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Wei Yang, linuxppc-dev

On Wed, Sep 23, 2015 at 09:07:41AM +1000, Gavin Shan wrote:
>On Tue, Sep 22, 2015 at 12:43:03PM +0800, Wei Yang wrote:
>>On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote:
>>>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote:
>>>>Current EEH infrastructure would avoid to handle EEH when a PE is passed to
>>>>guest, while if this PE is a Child PE of the one hit EEH, host would handle
>>>>this. By doing so, this would leads to guest hang. The correct way is
>>>>avoid to handle it on host and let guest to recover.
>>>>
>>>>This patch avoids to handle EEH on a passed Child PE.
>>>>
>>>
>>>Ok. It's fixing the problem the guest, which owns a VF, when its PF hitting
>>>EEH error, right? If so, I'm not sure if you really tested this code. Does
>>>it work for you?
>>
>>Yes, I inject error on Parent Bus PE.
>>
>>>
>>>When the parent PE (PF) is stopped for EEH recovery, it sounds impossible
>>>that the child PE can't be affected and just escape from the error. The
>>>question is how the guest can continue to work after the EEH recovery on
>>>parent PE?
>>
>>What I see is the PF is covering and VF in guest is recovering.
>>
>
>What do you mean by "covering"? Which PE's error is detected first in your
>testing? There is potentially race here: when the VF PE's error is detected
>first and guest tries to recover it. After the recovery happened on guest
>side, the host detects the PF PE's error and tries to recover it. During
>the recovery, the VF PE is total unusable but guest doesn't know that and
>operate like usual. I'm not sure what kinds of problems it can incur, but
>it would incur issues.
>
>On the other hand, if PF PE's error is detected on host first, and then the
>guest detects the error on VF PE. After that, the host and guest try to do
>recovery at same time. Host issues PE reset to PF PE, which isn't finished
>yet. Then guest issues PE reset to VF PE, which will cause another EEH error.
>
>So I'm not sure if had this patch fully tested. If so, it seems the result is
>just achieved by luck, I guess.
>

It looks really has some race condition.

So the expected order should be PF's PE recovered then VF's PE recover in
guest, right?

>>>
>>>>Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>>>>---
>>>> arch/powerpc/kernel/eeh_pe.c | 5 +++++
>>>> 1 file changed, 5 insertions(+)
>>>>
>>>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>>>>index 5cde950..c6d0e9f 100644
>>>>--- a/arch/powerpc/kernel/eeh_pe.c
>>>>+++ b/arch/powerpc/kernel/eeh_pe.c
>>>>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
>>>>  * callback returns something other than NULL, or no more PEs
>>>>  * to be traversed.
>>>>  */
>>>>+static void *__eeh_pe_get(void *data, void *flag);
>>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>>> 		      eeh_traverse_func fn, void *flag)
>>>> {
>>>>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
>>>> 	void *ret;
>>>>
>>>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>>+		if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
>>>>+			continue;
>>>
>>>The code change here seems ugly.
>>>
>>>The "flag" can be extended to carry the information to skip pass-through
>>>PEs or not. So the function calling eeh_pe_traverse() decides to skip
>>>pass-through PEs or not.
>>
>>I don't get the point, which "flag" you mean? Add a flag in eeh_pe?
>>
>
>>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>>>                   eeh_traverse_func fn, void *flag)
>                                            ^^^^^^^^^^
>                                            This one
>
>The code needn't to be changed in a hurry though. I don't think it's right
>way to fix the issue as discussed as above.
>
>>>
>>>> 		ret = fn(pe, flag);
>>>> 		if (ret) return ret;
>>>> 	}
>>>>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,
>>>
>>>>
>>>> 	/* Traverse root PE */
>>>> 	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>>+		if (eeh_pe_passed(pe))
>>>>+			continue;
>>>> 		eeh_pe_for_each_dev(pe, edev, tmp) {
>>>> 			ret = fn(edev, flag);
>>>> 			if (ret)
>>>>-- 
>>>>2.5.0
>>>>
>>
>>-- 
>>Richard Yang
>>Help you, Help me

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-09-25  8:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-21  9:29 [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE Wei Yang
2015-09-21 11:49 ` Gavin Shan
2015-09-22  4:43   ` Wei Yang
2015-09-22 23:07     ` Gavin Shan
2015-09-25  8:19       ` Wei Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).