All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
@ 2011-12-14  5:09 Ira Weiny
       [not found] ` <20111213210918.9cf79850.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ira Weiny @ 2011-12-14  5:09 UTC (permalink / raw)
  To: Alex Netes; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA


In addition print transaction ID of all DR PATH dumps to make sure we know
which MAD's they refer to.

Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
---
 opensm/osm_helper.c      |    5 +++--
 opensm/osm_sm_mad_ctrl.c |    7 +++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
index f9f3d9d..b6591c4 100644
--- a/opensm/osm_helper.c
+++ b/opensm/osm_helper.c
@@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
 		char buf[BUF_SIZE];
 		unsigned n;
 
-		n = sprintf(buf, "Received SMP on a %u hop path: "
-			    "Initial path = ", p_smp->hop_count);
+		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
+			    "Initial path = ",
+			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
 		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
 				      p_smp->initial_path,
 				      p_smp->hop_count + 1);
diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
index ee92c66..6abf8b8 100644
--- a/opensm/osm_sm_mad_ctrl.c
+++ b/opensm/osm_sm_mad_ctrl.c
@@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
 		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
 		cl_ntoh64(p_smp->trans_id));
 
+	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
+		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
+	} else {
+		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
+			cl_ntoh16(p_madw->mad_addr.dest_lid));
+	}
+
 	/*
 	   If this was a SubnSet MAD, then this error might indicate a problem
 	   in configuring the subnet. In this case - need to mark that there was
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found] ` <20111213210918.9cf79850.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2011-12-15  2:21   ` Hal Rosenstock
       [not found]     ` <4EE959A8.3000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hal Rosenstock @ 2011-12-15  2:21 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/14/2011 12:09 AM, Ira Weiny wrote:
> 
> In addition print transaction ID of all DR PATH dumps to make sure we know
> which MAD's they refer to.
> 
> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
> ---
>  opensm/osm_helper.c      |    5 +++--
>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> index f9f3d9d..b6591c4 100644
> --- a/opensm/osm_helper.c
> +++ b/opensm/osm_helper.c
> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
>  		char buf[BUF_SIZE];
>  		unsigned n;
>  
> -		n = sprintf(buf, "Received SMP on a %u hop path: "
> -			    "Initial path = ", p_smp->hop_count);
> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
> +			    "Initial path = ",
> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>  				      p_smp->initial_path,
>  				      p_smp->hop_count + 1);
> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> index ee92c66..6abf8b8 100644
> --- a/opensm/osm_sm_mad_ctrl.c
> +++ b/opensm/osm_sm_mad_ctrl.c
> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>  		cl_ntoh64(p_smp->trans_id));
>  
> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);

Rather than here, should this be in osm_vendor_ibumad.c ? There's
already one similar log there but looks like evicted entry logging was
not done. If not, then do any logs there need to be removed as redundant ?

Also, does this log every timeout (at error level) ? If so, that might
not be a good thing in all subnets as timeouts are common.

> +	} else {
> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
> +			cl_ntoh16(p_madw->mad_addr.dest_lid));

Log the TID here too ?

-- Hal

> +	}
> +
>  	/*
>  	   If this was a SubnSet MAD, then this error might indicate a problem
>  	   in configuring the subnet. In this case - need to mark that there was

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]     ` <4EE959A8.3000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-15  3:18       ` Ira Weiny
       [not found]         ` <20111214191840.a3a80cf3.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ira Weiny @ 2011-12-15  3:18 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 14 Dec 2011 18:21:28 -0800
Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 12/14/2011 12:09 AM, Ira Weiny wrote:
> > 
> > In addition print transaction ID of all DR PATH dumps to make sure we know
> > which MAD's they refer to.
> > 
> > Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
> > ---
> >  opensm/osm_helper.c      |    5 +++--
> >  opensm/osm_sm_mad_ctrl.c |    7 +++++++
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> > index f9f3d9d..b6591c4 100644
> > --- a/opensm/osm_helper.c
> > +++ b/opensm/osm_helper.c
> > @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
> >  		char buf[BUF_SIZE];
> >  		unsigned n;
> >  
> > -		n = sprintf(buf, "Received SMP on a %u hop path: "
> > -			    "Initial path = ", p_smp->hop_count);
> > +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
> > +			    "Initial path = ",
> > +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
> >  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
> >  				      p_smp->initial_path,
> >  				      p_smp->hop_count + 1);
> > diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> > index ee92c66..6abf8b8 100644
> > --- a/opensm/osm_sm_mad_ctrl.c
> > +++ b/opensm/osm_sm_mad_ctrl.c
> > @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
> >  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
> >  		cl_ntoh64(p_smp->trans_id));
> >  
> > +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
> > +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
> 
> Rather than here, should this be in osm_vendor_ibumad.c ? There's
> already one similar log there but looks like evicted entry logging was
> not done. If not, then do any logs there need to be removed as redundant ?

Yes looking a bit closer I see that is redundant with the current umad_status
implementation.  IE the message you get is:

	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0

That is useless.  I can alter the patch to remove that as well.

> 
> Also, does this log every timeout (at error level) ? If so, that might
> not be a good thing in all subnets as timeouts are common.

Why would you say that?  I think it is very valid to know what nodes are
timeing out.  When would you not want to know the destination of what is
timing out?

> 
> > +	} else {
> > +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
> > +			cl_ntoh16(p_madw->mad_addr.dest_lid));
> 
> Log the TID here too ?

Actually I think moving that into the error print above is better.

Sending V2 now,
Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]         ` <20111214191840.a3a80cf3.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2011-12-15 14:15           ` Hal Rosenstock
       [not found]             ` <4EEA00ED.2080609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hal Rosenstock @ 2011-12-15 14:15 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/14/2011 10:18 PM, Ira Weiny wrote:
> On Wed, 14 Dec 2011 18:21:28 -0800
> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
>>>
>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>> which MAD's they refer to.
>>>
>>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
>>> ---
>>>  opensm/osm_helper.c      |    5 +++--
>>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>> index f9f3d9d..b6591c4 100644
>>> --- a/opensm/osm_helper.c
>>> +++ b/opensm/osm_helper.c
>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
>>>  		char buf[BUF_SIZE];
>>>  		unsigned n;
>>>  
>>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
>>> -			    "Initial path = ", p_smp->hop_count);
>>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
>>> +			    "Initial path = ",
>>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>  				      p_smp->initial_path,
>>>  				      p_smp->hop_count + 1);
>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>> index ee92c66..6abf8b8 100644
>>> --- a/opensm/osm_sm_mad_ctrl.c
>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
>>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>  		cl_ntoh64(p_smp->trans_id));
>>>  
>>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
>>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>
>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
>> already one similar log there but looks like evicted entry logging was
>> not done. If not, then do any logs there need to be removed as redundant ?
> 
> Yes looking a bit closer I see that is redundant with the current umad_status
> implementation.  

Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
only currently called for NodeInfo. That's one reason why it's down at
that layer right now. I can see your v2 patch addresses this.

> IE the message you get is:
> 
> 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
> 
> That is useless.  I can alter the patch to remove that as well.

Another alternative is that if it's a bug, why not just fix that ?

>>
>> Also, does this log every timeout (at error level) ? If so, that might
>> not be a good thing in all subnets as timeouts are common.
> 
> Why would you say that?  I think it is very valid to know what nodes are
> timeing out.  When would you not want to know the destination of what is
> timing out?

Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
many subnets; maybe not yours.

I was just saying level might be reduced but I can see 5411 is error
level too.

>>
>>> +	} else {
>>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
>>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
>>
>> Log the TID here too ?
> 
> Actually I think moving that into the error print above is better.

Sure; that's another way of accomplishing the same thing.

-- Hal

> Sending V2 now,
> Ira
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]             ` <4EEA00ED.2080609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-15 17:40               ` Ira Weiny
       [not found]                 ` <20111215094006.8aa8f198.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ira Weiny @ 2011-12-15 17:40 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 15 Dec 2011 06:15:09 -0800
Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 12/14/2011 10:18 PM, Ira Weiny wrote:
> > On Wed, 14 Dec 2011 18:21:28 -0800
> > Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> > 
> >> On 12/14/2011 12:09 AM, Ira Weiny wrote:
> >>>
> >>> In addition print transaction ID of all DR PATH dumps to make sure we know
> >>> which MAD's they refer to.
> >>>
> >>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
> >>> ---
> >>>  opensm/osm_helper.c      |    5 +++--
> >>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
> >>>  2 files changed, 10 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> >>> index f9f3d9d..b6591c4 100644
> >>> --- a/opensm/osm_helper.c
> >>> +++ b/opensm/osm_helper.c
> >>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
> >>>  		char buf[BUF_SIZE];
> >>>  		unsigned n;
> >>>  
> >>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
> >>> -			    "Initial path = ", p_smp->hop_count);
> >>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
> >>> +			    "Initial path = ",
> >>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
> >>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
> >>>  				      p_smp->initial_path,
> >>>  				      p_smp->hop_count + 1);
> >>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> >>> index ee92c66..6abf8b8 100644
> >>> --- a/opensm/osm_sm_mad_ctrl.c
> >>> +++ b/opensm/osm_sm_mad_ctrl.c
> >>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
> >>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
> >>>  		cl_ntoh64(p_smp->trans_id));
> >>>  
> >>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
> >>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
> >>
> >> Rather than here, should this be in osm_vendor_ibumad.c ? There's
> >> already one similar log there but looks like evicted entry logging was
> >> not done. If not, then do any logs there need to be removed as redundant ?
> > 
> > Yes looking a bit closer I see that is redundant with the current umad_status
> > implementation.  
> 
> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
> only currently called for NodeInfo. That's one reason why it's down at
> that layer right now. I can see your v2 patch addresses this.
> 
> > IE the message you get is:
> > 
> > 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
> > 
> > That is useless.  I can alter the patch to remove that as well.
> 
> Another alternative is that if it's a bug, why not just fix that ?

I was not sure if it was a bug (but I did not dig through the umad and kernel
layers to verify that).  It looks to me like umad_status only returns TIMEOUT
as an error.  Furthermore, I think the "response" MAD returned (and printed in
the above logs) does not include a DR path since it is not really a response
but just a place holder.

Does the kernel/umad layer copy the request MAD into the "response" MAD on
timeout?[*]  (It looks to me like only some of the data is copied into the
"response" MAD; method, attr_id, and transaction ID).  If the entire request
is to be copied then the above might be bug.  If not then the print statement
I removed is useless but the other one at least confirms the "fake" MAD is
equivalent to the one printed in the error call back.

If you want to keep the error printing at the vendor layer then I guess the
appropriate place to print the error would be on the request mad like this...

			} else {
				p_req_madw->status = IB_TIMEOUT;
				/* cb frees req_madw */
				pthread_mutex_lock(&p_vend->cb_mutex);
				pthread_cleanup_push(unlock_mutex,
						     &p_vend->cb_mutex);
// dump p_req_madw here before callback
				(*p_bind->send_err_callback) (p_bind->
							      client_context,
							      p_req_madw);
				pthread_cleanup_pop(1);
			}

Or you are correct we could just remove all the logging in the vendor layer
here.  Alex, any opinions?

Ira

[*] I don't think the request MAD should be copied except for the TID as is
documented so you can find the request MAD.

> 
> >>
> >> Also, does this log every timeout (at error level) ? If so, that might
> >> not be a good thing in all subnets as timeouts are common.
> > 
> > Why would you say that?  I think it is very valid to know what nodes are
> > timeing out.  When would you not want to know the destination of what is
> > timing out?
> 
> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
> many subnets; maybe not yours.
> 
> I was just saying level might be reduced but I can see 5411 is error
> level too.
> 
> >>
> >>> +	} else {
> >>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
> >>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
> >>
> >> Log the TID here too ?
> > 
> > Actually I think moving that into the error print above is better.
> 
> Sure; that's another way of accomplishing the same thing.
> 
> -- Hal
> 
> > Sending V2 now,
> > Ira
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]                 ` <20111215094006.8aa8f198.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2011-12-15 22:20                   ` Hal Rosenstock
       [not found]                     ` <4EEA72A5.8080602-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hal Rosenstock @ 2011-12-15 22:20 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/15/2011 12:40 PM, Ira Weiny wrote:
> On Thu, 15 Dec 2011 06:15:09 -0800
> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
>> On 12/14/2011 10:18 PM, Ira Weiny wrote:
>>> On Wed, 14 Dec 2011 18:21:28 -0800
>>> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>>
>>>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
>>>>>
>>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>>>> which MAD's they refer to.
>>>>>
>>>>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
>>>>> ---
>>>>>  opensm/osm_helper.c      |    5 +++--
>>>>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
>>>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>>>> index f9f3d9d..b6591c4 100644
>>>>> --- a/opensm/osm_helper.c
>>>>> +++ b/opensm/osm_helper.c
>>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
>>>>>  		char buf[BUF_SIZE];
>>>>>  		unsigned n;
>>>>>  
>>>>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
>>>>> -			    "Initial path = ", p_smp->hop_count);
>>>>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
>>>>> +			    "Initial path = ",
>>>>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>>>  				      p_smp->initial_path,
>>>>>  				      p_smp->hop_count + 1);
>>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>>>> index ee92c66..6abf8b8 100644
>>>>> --- a/opensm/osm_sm_mad_ctrl.c
>>>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
>>>>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>>>  		cl_ntoh64(p_smp->trans_id));
>>>>>  
>>>>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
>>>>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>>>
>>>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
>>>> already one similar log there but looks like evicted entry logging was
>>>> not done. If not, then do any logs there need to be removed as redundant ?
>>>
>>> Yes looking a bit closer I see that is redundant with the current umad_status
>>> implementation.  
>>
>> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
>> only currently called for NodeInfo. That's one reason why it's down at
>> that layer right now. I can see your v2 patch addresses this.
>>
>>> IE the message you get is:
>>>
>>> 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
>>>
>>> That is useless.  I can alter the patch to remove that as well.
>>
>> Another alternative is that if it's a bug, why not just fix that ?
> 
> I was not sure if it was a bug (but I did not dig through the umad and kernel
> layers to verify that).  It looks to me like umad_status only returns TIMEOUT
> as an error.  Furthermore, I think the "response" MAD returned (and printed in
> the above logs) does not include a DR path since it is not really a response
> but just a place holder.

I thought that the sent MAD is returned in the timeout case. That would
include the DR paths in the SMP case. It seems to work only some of the
time though and I'm not sure why that is.

> Does the kernel/umad layer copy the request MAD into the "response" MAD on
> timeout?[*]

See above and below under [*]

>  (It looks to me like only some of the data is copied into the
> "response" MAD; method, attr_id, and transaction ID).  

That's certainly the observation in some but not all cases...

> If the entire request
> is to be copied then the above might be bug.  

I thought the request MAD is supposed to be returned on a send error
(including timeout). Maybe some shortcut/optimization was taken here
in some cases.

> If not then the print statement
> I removed is useless but the other one at least confirms the "fake" MAD is
> equivalent to the one printed in the error call back.

Yes, but why have the extra log message just for that ?

> If you want to keep the error printing at the vendor layer then I guess the
> appropriate place to print the error would be on the request mad like this...
> 
> 			} else {
> 				p_req_madw->status = IB_TIMEOUT;
> 				/* cb frees req_madw */
> 				pthread_mutex_lock(&p_vend->cb_mutex);
> 				pthread_cleanup_push(unlock_mutex,
> 						     &p_vend->cb_mutex);
> // dump p_req_madw here before callback
> 				(*p_bind->send_err_callback) (p_bind->
> 							      client_context,
> 							      p_req_madw);
> 				pthread_cleanup_pop(1);
> 			}

There's also the cancelled case too that calls send_err_callback.

> Or you are correct we could just remove all the logging in the vendor layer
> here.  
>
> Alex, any opinions?
> 
> Ira
> 
> [*] I don't think the request MAD should be copied except for the TID as is
> documented so you can find the request MAD.

What documentation are you referring to ?

-- Hal

>>
>>>>
>>>> Also, does this log every timeout (at error level) ? If so, that might
>>>> not be a good thing in all subnets as timeouts are common.
>>>
>>> Why would you say that?  I think it is very valid to know what nodes are
>>> timeing out.  When would you not want to know the destination of what is
>>> timing out?
>>
>> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
>> many subnets; maybe not yours.
>>
>> I was just saying level might be reduced but I can see 5411 is error
>> level too.
>>
>>>>
>>>>> +	} else {
>>>>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
>>>>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
>>>>
>>>> Log the TID here too ?
>>>
>>> Actually I think moving that into the error print above is better.
>>
>> Sure; that's another way of accomplishing the same thing.
>>
>> -- Hal
>>
>>> Sending V2 now,
>>> Ira
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]                     ` <4EEA72A5.8080602-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-15 22:54                       ` Ira Weiny
       [not found]                         ` <20111215145430.20f6adee.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ira Weiny @ 2011-12-15 22:54 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 15 Dec 2011 14:20:21 -0800
Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 12/15/2011 12:40 PM, Ira Weiny wrote:
> > On Thu, 15 Dec 2011 06:15:09 -0800
> > Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> > 
> >> On 12/14/2011 10:18 PM, Ira Weiny wrote:
> >>> On Wed, 14 Dec 2011 18:21:28 -0800
> >>> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> >>>
> >>>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
> >>>>>
> >>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
> >>>>> which MAD's they refer to.
> >>>>>
> >>>>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
> >>>>> ---
> >>>>>  opensm/osm_helper.c      |    5 +++--
> >>>>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
> >>>>>  2 files changed, 10 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
> >>>>> index f9f3d9d..b6591c4 100644
> >>>>> --- a/opensm/osm_helper.c
> >>>>> +++ b/opensm/osm_helper.c
> >>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
> >>>>>  		char buf[BUF_SIZE];
> >>>>>  		unsigned n;
> >>>>>  
> >>>>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
> >>>>> -			    "Initial path = ", p_smp->hop_count);
> >>>>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
> >>>>> +			    "Initial path = ",
> >>>>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
> >>>>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
> >>>>>  				      p_smp->initial_path,
> >>>>>  				      p_smp->hop_count + 1);
> >>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
> >>>>> index ee92c66..6abf8b8 100644
> >>>>> --- a/opensm/osm_sm_mad_ctrl.c
> >>>>> +++ b/opensm/osm_sm_mad_ctrl.c
> >>>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
> >>>>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
> >>>>>  		cl_ntoh64(p_smp->trans_id));
> >>>>>  
> >>>>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
> >>>>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
> >>>>
> >>>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
> >>>> already one similar log there but looks like evicted entry logging was
> >>>> not done. If not, then do any logs there need to be removed as redundant ?
> >>>
> >>> Yes looking a bit closer I see that is redundant with the current umad_status
> >>> implementation.  
> >>
> >> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
> >> only currently called for NodeInfo. That's one reason why it's down at
> >> that layer right now. I can see your v2 patch addresses this.
> >>
> >>> IE the message you get is:
> >>>
> >>> 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
> >>>
> >>> That is useless.  I can alter the patch to remove that as well.
> >>
> >> Another alternative is that if it's a bug, why not just fix that ?
> > 
> > I was not sure if it was a bug (but I did not dig through the umad and kernel
> > layers to verify that).  It looks to me like umad_status only returns TIMEOUT
> > as an error.  Furthermore, I think the "response" MAD returned (and printed in
> > the above logs) does not include a DR path since it is not really a response
> > but just a place holder.
> 
> I thought that the sent MAD is returned in the timeout case. That would
> include the DR paths in the SMP case. It seems to work only some of the
> time though and I'm not sure why that is.

I did not see this in umad_recv but the man page for umad_send does say.

     "If  reply is not received within the specified value [TIMEOUT], the original
     buffer is returned in the read channel with the status field set (to non
     zero)."

So I guess this is a bug.  :-(

> 
> > Does the kernel/umad layer copy the request MAD into the "response" MAD on
> > timeout?[*]
> 
> See above and below under [*]
> 
> >  (It looks to me like only some of the data is copied into the
> > "response" MAD; method, attr_id, and transaction ID).  
> 
> That's certainly the observation in some but not all cases...

I am not dead set against leaving the DR path print in if you think it might
be valid.  I was unsure of the behaviour in this area.

> 
> > If the entire request
> > is to be copied then the above might be bug.  
> 
> I thought the request MAD is supposed to be returned on a send error
> (including timeout). Maybe some shortcut/optimization was taken here
> in some cases.
> 
> > If not then the print statement
> > I removed is useless but the other one at least confirms the "fake" MAD is
> > equivalent to the one printed in the error call back.
> 
> Yes, but why have the extra log message just for that ?

I guess for the reason above.  To check the response MAD.

> 
> > If you want to keep the error printing at the vendor layer then I guess the
> > appropriate place to print the error would be on the request mad like this...
> > 
> > 			} else {
> > 				p_req_madw->status = IB_TIMEOUT;
> > 				/* cb frees req_madw */
> > 				pthread_mutex_lock(&p_vend->cb_mutex);
> > 				pthread_cleanup_push(unlock_mutex,
> > 						     &p_vend->cb_mutex);
> > // dump p_req_madw here before callback
> > 				(*p_bind->send_err_callback) (p_bind->
> > 							      client_context,
> > 							      p_req_madw);
> > 				pthread_cleanup_pop(1);
> > 			}
> 
> There's also the cancelled case too that calls send_err_callback.

Isn't this more reason to print the address information in the error callback?

> 
> > Or you are correct we could just remove all the logging in the vendor layer
> > here.  
> >
> > Alex, any opinions?
> > 
> > Ira
> > 
> > [*] I don't think the request MAD should be copied except for the TID as is
> > documented so you can find the request MAD.
> 
> What documentation are you referring to ?

man page for umad_status.  It does not mention anything other than the TID.
After reading the umad_send man page I guess we could assume the other fields
are also valid.

	"ETIMEDOUT means that the packet had a send-timeout indication. In this case, the transaction ID will be set to the TID of the original request."

Ira

> 
> -- Hal
> 
> >>
> >>>>
> >>>> Also, does this log every timeout (at error level) ? If so, that might
> >>>> not be a good thing in all subnets as timeouts are common.
> >>>
> >>> Why would you say that?  I think it is very valid to know what nodes are
> >>> timeing out.  When would you not want to know the destination of what is
> >>> timing out?
> >>
> >> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
> >> many subnets; maybe not yours.
> >>
> >> I was just saying level might be reduced but I can see 5411 is error
> >> level too.
> >>
> >>>>
> >>>>> +	} else {
> >>>>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
> >>>>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
> >>>>
> >>>> Log the TID here too ?
> >>>
> >>> Actually I think moving that into the error print above is better.
> >>
> >> Sure; that's another way of accomplishing the same thing.
> >>
> >> -- Hal
> >>
> >>> Sending V2 now,
> >>> Ira
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
       [not found]                         ` <20111215145430.20f6adee.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2011-12-16 13:03                           ` Hal Rosenstock
  0 siblings, 0 replies; 8+ messages in thread
From: Hal Rosenstock @ 2011-12-16 13:03 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Alex Netes, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/15/2011 5:54 PM, Ira Weiny wrote:
> On Thu, 15 Dec 2011 14:20:21 -0800
> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
>> On 12/15/2011 12:40 PM, Ira Weiny wrote:
>>> On Thu, 15 Dec 2011 06:15:09 -0800
>>> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>>
>>>> On 12/14/2011 10:18 PM, Ira Weiny wrote:
>>>>> On Wed, 14 Dec 2011 18:21:28 -0800
>>>>> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>>>>
>>>>>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
>>>>>>>
>>>>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>>>>>> which MAD's they refer to.
>>>>>>>
>>>>>>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
>>>>>>> ---
>>>>>>>  opensm/osm_helper.c      |    5 +++--
>>>>>>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
>>>>>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>>>>>> index f9f3d9d..b6591c4 100644
>>>>>>> --- a/opensm/osm_helper.c
>>>>>>> +++ b/opensm/osm_helper.c
>>>>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
>>>>>>>  		char buf[BUF_SIZE];
>>>>>>>  		unsigned n;
>>>>>>>  
>>>>>>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
>>>>>>> -			    "Initial path = ", p_smp->hop_count);
>>>>>>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
>>>>>>> +			    "Initial path = ",
>>>>>>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>>>>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>>>>>  				      p_smp->initial_path,
>>>>>>>  				      p_smp->hop_count + 1);
>>>>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>>>>>> index ee92c66..6abf8b8 100644
>>>>>>> --- a/opensm/osm_sm_mad_ctrl.c
>>>>>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>>>>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
>>>>>>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>>>>>  		cl_ntoh64(p_smp->trans_id));
>>>>>>>  
>>>>>>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
>>>>>>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>>>>>
>>>>>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
>>>>>> already one similar log there but looks like evicted entry logging was
>>>>>> not done. If not, then do any logs there need to be removed as redundant ?
>>>>>
>>>>> Yes looking a bit closer I see that is redundant with the current umad_status
>>>>> implementation.  
>>>>
>>>> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
>>>> only currently called for NodeInfo. That's one reason why it's down at
>>>> that layer right now. I can see your v2 patch addresses this.
>>>>
>>>>> IE the message you get is:
>>>>>
>>>>> 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
>>>>>
>>>>> That is useless.  I can alter the patch to remove that as well.
>>>>
>>>> Another alternative is that if it's a bug, why not just fix that ?
>>>
>>> I was not sure if it was a bug (but I did not dig through the umad and kernel
>>> layers to verify that).  It looks to me like umad_status only returns TIMEOUT
>>> as an error.  Furthermore, I think the "response" MAD returned (and printed in
>>> the above logs) does not include a DR path since it is not really a response
>>> but just a place holder.
>>
>> I thought that the sent MAD is returned in the timeout case. That would
>> include the DR paths in the SMP case. It seems to work only some of the
>> time though and I'm not sure why that is.
> 
> I did not see this in umad_recv but the man page for umad_send does say.
> 
>      "If  reply is not received within the specified value [TIMEOUT], the original
>      buffer is returned in the read channel with the status field set (to non
>      zero)."
>
> So I guess this is a bug.  :-(

The kernel (mad/user_mad) documentation is more definitive.

>>
>>> Does the kernel/umad layer copy the request MAD into the "response" MAD on
>>> timeout?[*]
>>
>> See above and below under [*]
>>
>>>  (It looks to me like only some of the data is copied into the
>>> "response" MAD; method, attr_id, and transaction ID).  
>>
>> That's certainly the observation in some but not all cases...
> 
> I am not dead set against leaving the DR path print in if you think it might
> be valid.  I was unsure of the behaviour in this area.
> 
>>
>>> If the entire request
>>> is to be copied then the above might be bug.  
>>
>> I thought the request MAD is supposed to be returned on a send error
>> (including timeout). Maybe some shortcut/optimization was taken here
>> in some cases.
>>
>>> If not then the print statement
>>> I removed is useless but the other one at least confirms the "fake" MAD is
>>> equivalent to the one printed in the error call back.
>>
>> Yes, but why have the extra log message just for that ?
> 
> I guess for the reason above.  To check the response MAD.

But your v2 patch is pushing that up to the callback so why log that twice ?

>>
>>> If you want to keep the error printing at the vendor layer then I guess the
>>> appropriate place to print the error would be on the request mad like this...
>>>
>>> 			} else {
>>> 				p_req_madw->status = IB_TIMEOUT;
>>> 				/* cb frees req_madw */
>>> 				pthread_mutex_lock(&p_vend->cb_mutex);
>>> 				pthread_cleanup_push(unlock_mutex,
>>> 						     &p_vend->cb_mutex);
>>> // dump p_req_madw here before callback
>>> 				(*p_bind->send_err_callback) (p_bind->
>>> 							      client_context,
>>> 							      p_req_madw);
>>> 				pthread_cleanup_pop(1);
>>> 			}
>>
>> There's also the cancelled case too that calls send_err_callback.
> 
> Isn't this more reason to print the address information in the error callback?

Not really. It's the tradeoff of doing it twice in the vendor layer
(could be combined into one routine called from the 2 places) to doing
it in every per class send error callback routine. Also, it becomes an
unwritten rule that any new GS managers using the vendor layer would
need (to know to) implement this. That's why this approach was
originally chosen (albeit with the bug where the DR path is not accurate).

>>
>>> Or you are correct we could just remove all the logging in the vendor layer
>>> here.  
>>>
>>> Alex, any opinions?
>>>
>>> Ira
>>>
>>> [*] I don't think the request MAD should be copied except for the TID as is
>>> documented so you can find the request MAD.
>>
>> What documentation are you referring to ?
> 
> man page for umad_status.  It does not mention anything other than the TID.
> After reading the umad_send man page I guess we could assume the other fields
> are also valid.
> 
> 	"ETIMEDOUT means that the packet had a send-timeout indication. In this case, the transaction ID will be set to the TID of the original request."

This documentation may not be complete/perfect.

I think it comes down to choosing the approach: in the vendor layer or
in the send error callbacks (your v3 patch is mainly fine). Neither is a
very large change and I think this is largely stylistic more than
anything else. It's up to Alex.

-- Hal

> Ira
> 
>>
>> -- Hal
>>
>>>>
>>>>>>
>>>>>> Also, does this log every timeout (at error level) ? If so, that might
>>>>>> not be a good thing in all subnets as timeouts are common.
>>>>>
>>>>> Why would you say that?  I think it is very valid to know what nodes are
>>>>> timeing out.  When would you not want to know the destination of what is
>>>>> timing out?
>>>>
>>>> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
>>>> many subnets; maybe not yours.
>>>>
>>>> I was just saying level might be reduced but I can see 5411 is error
>>>> level too.
>>>>
>>>>>>
>>>>>>> +	} else {
>>>>>>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
>>>>>>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
>>>>>>
>>>>>> Log the TID here too ?
>>>>>
>>>>> Actually I think moving that into the error print above is better.
>>>>
>>>> Sure; that's another way of accomplishing the same thing.
>>>>
>>>> -- Hal
>>>>
>>>>> Sending V2 now,
>>>>> Ira
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-12-16 13:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-14  5:09 [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD Ira Weiny
     [not found] ` <20111213210918.9cf79850.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15  2:21   ` Hal Rosenstock
     [not found]     ` <4EE959A8.3000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15  3:18       ` Ira Weiny
     [not found]         ` <20111214191840.a3a80cf3.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15 14:15           ` Hal Rosenstock
     [not found]             ` <4EEA00ED.2080609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15 17:40               ` Ira Weiny
     [not found]                 ` <20111215094006.8aa8f198.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15 22:20                   ` Hal Rosenstock
     [not found]                     ` <4EEA72A5.8080602-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15 22:54                       ` Ira Weiny
     [not found]                         ` <20111215145430.20f6adee.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-16 13:03                           ` Hal Rosenstock

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.