linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] scsi: core: Log SCSI command age with errors
@ 2019-09-23  6:01 Milan P. Gandhi
  2019-09-23 13:02 ` Laurence Oberman
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Milan P. Gandhi @ 2019-09-23  6:01 UTC (permalink / raw)
  To: linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

Couple of users had requested to print the SCSI command age along 
with command failure errors. This is a small change, but allows 
users to get more important information about the command that was 
failed, it would help the users in debugging the command failures:

Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
---
diff --git a/drivers/scsi/scsi_logging.c b/drivers/scsi/scsi_logging.c
index ecc5918e372a..ca2182bc53c6 100644
--- a/drivers/scsi/scsi_logging.c
+++ b/drivers/scsi/scsi_logging.c
@@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,
 	const char *mlret_string = scsi_mlreturn_string(disposition);
 	const char *hb_string = scsi_hostbyte_string(cmd->result);
 	const char *db_string = scsi_driverbyte_string(cmd->result);
+	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
 
 	logbuf = scsi_log_reserve_buffer(&logbuf_len);
 	if (!logbuf)
@@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,
 
 	if (db_string)
 		off += scnprintf(logbuf + off, logbuf_len - off,
-				 "driverbyte=%s", db_string);
+				 "driverbyte=%s ", db_string);
 	else
 		off += scnprintf(logbuf + off, logbuf_len - off,
-				 "driverbyte=0x%02x", driver_byte(cmd->result));
+				 "driverbyte=0x%02x ",
+				 driver_byte(cmd->result));
+
+	off += scnprintf(logbuf + off, logbuf_len - off,
+			 "cmd-age=%lus", cmd_age);
+
 out_printk:
 	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
 	scsi_log_release_buffer(logbuf);


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-23  6:01 [PATCH] scsi: core: Log SCSI command age with errors Milan P. Gandhi
@ 2019-09-23 13:02 ` Laurence Oberman
  2019-09-23 13:24   ` Milan P. Gandhi
  2019-09-25 15:45 ` Bart Van Assche
  2019-09-27 15:26 ` Martin Wilck
  2 siblings, 1 reply; 10+ messages in thread
From: Laurence Oberman @ 2019-09-23 13:02 UTC (permalink / raw)
  To: Milan P. Gandhi, linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along 
> with command failure errors. This is a small change, but allows 
> users to get more important information about the command that was 
> failed, it would help the users in debugging the command failures:
> 
> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>  	const char *db_string = scsi_driverbyte_string(cmd->result);
> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>  
>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>  	if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  
>  	if (db_string)
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=%s", db_string);
> +				 "driverbyte=%s ", db_string);
>  	else
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> +				 "driverbyte=0x%02x ",
> +				 driver_byte(cmd->result));
> +
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);
> +
>  out_printk:
>  	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
>  	scsi_log_release_buffer(logbuf);
> 

This looks to be a useful debug addition to me, and the code looks
correct.
I believe this has also been tested by Milan in our lab.

Reviewed-by: Laurence Oberman <loberman@redhat.com> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-23 13:02 ` Laurence Oberman
@ 2019-09-23 13:24   ` Milan P. Gandhi
  0 siblings, 0 replies; 10+ messages in thread
From: Milan P. Gandhi @ 2019-09-23 13:24 UTC (permalink / raw)
  To: Laurence Oberman, linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

On 9/23/19 6:32 PM, Laurence Oberman wrote:
> On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
>> Couple of users had requested to print the SCSI command age along 
>> with command failure errors. This is a small change, but allows 
>> users to get more important information about the command that was 
>> failed, it would help the users in debugging the command failures:
>>
>> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
>> ---
>> diff --git a/drivers/scsi/scsi_logging.c
>> b/drivers/scsi/scsi_logging.c
>> index ecc5918e372a..ca2182bc53c6 100644
>> --- a/drivers/scsi/scsi_logging.c
>> +++ b/drivers/scsi/scsi_logging.c
>> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>>  	const char *db_string = scsi_driverbyte_string(cmd->result);
>> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>>  
>>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>>  	if (!logbuf)
>> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>>  
>>  	if (db_string)
>>  		off += scnprintf(logbuf + off, logbuf_len - off,
>> -				 "driverbyte=%s", db_string);
>> +				 "driverbyte=%s ", db_string);
>>  	else
>>  		off += scnprintf(logbuf + off, logbuf_len - off,
>> -				 "driverbyte=0x%02x", driver_byte(cmd-
>>> result));
>> +				 "driverbyte=0x%02x ",
>> +				 driver_byte(cmd->result));
>> +
>> +	off += scnprintf(logbuf + off, logbuf_len - off,
>> +			 "cmd-age=%lus", cmd_age);
>> +
>>  out_printk:
>>  	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
>>  	scsi_log_release_buffer(logbuf);
>>
> 
> This looks to be a useful debug addition to me, and the code looks
> correct.
> I believe this has also been tested by Milan in our lab.
> 
> Reviewed-by: Laurence Oberman <loberman@redhat.com> 
> 
Yes, the patch was tested locally using scsi_debug as well as in real 
storage issues caused by bad disks in customer environment.

Thanks,
Milan.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-23  6:01 [PATCH] scsi: core: Log SCSI command age with errors Milan P. Gandhi
  2019-09-23 13:02 ` Laurence Oberman
@ 2019-09-25 15:45 ` Bart Van Assche
  2019-09-26  5:01   ` Milan P. Gandhi
  2019-09-27 15:26 ` Martin Wilck
  2 siblings, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2019-09-25 15:45 UTC (permalink / raw)
  To: Milan P. Gandhi, linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);

Have you considered to change cmd-age into cmd_age? I'm afraid otherwise 
someone might interpret the hyphen as a subtraction sign...

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-25 15:45 ` Bart Van Assche
@ 2019-09-26  5:01   ` Milan P. Gandhi
  0 siblings, 0 replies; 10+ messages in thread
From: Milan P. Gandhi @ 2019-09-26  5:01 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel, linux-scsi
  Cc: jejb, martin.petersen, Laurence Oberman

On 9/25/19 9:15 PM, Bart Van Assche wrote:
> On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
>> +    off += scnprintf(logbuf + off, logbuf_len - off,
>> +             "cmd-age=%lus", cmd_age);
> 
> Have you considered to change cmd-age into cmd_age? I'm afraid otherwise someone might interpret the hyphen as a subtraction sign...

Thanks for your suggestion Bart.
Yes, it would be better to use cmd_age to avoid any confusion.
I will change it and send the updated patch.

Thanks,
Milan.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-23  6:01 [PATCH] scsi: core: Log SCSI command age with errors Milan P. Gandhi
  2019-09-23 13:02 ` Laurence Oberman
  2019-09-25 15:45 ` Bart Van Assche
@ 2019-09-27 15:26 ` Martin Wilck
  2019-09-27 17:45   ` Laurence Oberman
  2 siblings, 1 reply; 10+ messages in thread
From: Martin Wilck @ 2019-09-27 15:26 UTC (permalink / raw)
  To: Milan P. Gandhi, linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

On Mon, 2019-09-23 at 11:31 +0530,  Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along 
> with command failure errors. This is a small change, but allows 
> users to get more important information about the command that was 
> failed, it would help the users in debugging the command failures:
> 
> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>  	const char *db_string = scsi_driverbyte_string(cmd->result);
> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;

This comes down to pretty coarse time resolution, does it not? More
often than not, the time difference shown will be 0. I'd recommend at
least millisecond resolution.

>  
>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>  	if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  
>  	if (db_string)
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=%s", db_string);
> +				 "driverbyte=%s ", db_string);
>  	else
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> +				 "driverbyte=0x%02x ",
> +				 driver_byte(cmd->result));
> +
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);

This is certainly helpful in some situations. Yet I am unsure if it
should *always* be printed. I wouldn't say it's as important as the 
other stuff scsi_print_result() provides. After all, by activating
MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
accuracy from currently available logs. 

So perhaps make this depend on a module parameter?

Also, we should carefully ponder if we want to put this on the same
line as the driver byte, as users may have created scripts for parsing
this output.

Regards,
Martin



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-27 15:26 ` Martin Wilck
@ 2019-09-27 17:45   ` Laurence Oberman
  2019-09-30  8:42     ` Martin Wilck
  0 siblings, 1 reply; 10+ messages in thread
From: Laurence Oberman @ 2019-09-27 17:45 UTC (permalink / raw)
  To: Martin Wilck, Milan P. Gandhi, linux-kernel, linux-scsi
  Cc: jejb, martin.petersen

On Fri, 2019-09-27 at 17:26 +0200, Martin Wilck wrote:
> On Mon, 2019-09-23 at 11:31 +0530,  Milan P. Gandhi wrote:
> > Couple of users had requested to print the SCSI command age along 
> > with command failure errors. This is a small change, but allows 
> > users to get more important information about the command that was 
> > failed, it would help the users in debugging the command failures:
> > 
> > Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> > ---
> > diff --git a/drivers/scsi/scsi_logging.c
> > b/drivers/scsi/scsi_logging.c
> > index ecc5918e372a..ca2182bc53c6 100644
> > --- a/drivers/scsi/scsi_logging.c
> > +++ b/drivers/scsi/scsi_logging.c
> > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> >  	const char *mlret_string = scsi_mlreturn_string(disposition);
> >  	const char *hb_string = scsi_hostbyte_string(cmd->result);
> >  	const char *db_string = scsi_driverbyte_string(cmd->result);
> > +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
> 
> This comes down to pretty coarse time resolution, does it not? More
> often than not, the time difference shown will be 0. I'd recommend at
> least millisecond resolution.
> 
> >  
> >  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
> >  	if (!logbuf)
> > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> >  
> >  	if (db_string)
> >  		off += scnprintf(logbuf + off, logbuf_len - off,
> > -				 "driverbyte=%s", db_string);
> > +				 "driverbyte=%s ", db_string);
> >  	else
> >  		off += scnprintf(logbuf + off, logbuf_len - off,
> > -				 "driverbyte=0x%02x", driver_byte(cmd-
> > > result));
> > 
> > +				 "driverbyte=0x%02x ",
> > +				 driver_byte(cmd->result));
> > +
> > +	off += scnprintf(logbuf + off, logbuf_len - off,
> > +			 "cmd-age=%lus", cmd_age);
> 
> This is certainly helpful in some situations. Yet I am unsure if it
> should *always* be printed. I wouldn't say it's as important as the 
> other stuff scsi_print_result() provides. After all, by activating
> MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
> accuracy from currently available logs. 
> 
> So perhaps make this depend on a module parameter?
> 
> Also, we should carefully ponder if we want to put this on the same
> line as the driver byte, as users may have created scripts for
> parsing
> this output.
> 
> Regards,
> Martin
> 
> 
Hi Martin

Agreed about log extraction, but turning that on with a busy workload
in a production environment is not practical. We cant do it with
systems with 1000's of luns and 1000's of IOPS/sec.
Also second resolution is good enough for the debug we want to see.

Regards
Laurence


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-27 17:45   ` Laurence Oberman
@ 2019-09-30  8:42     ` Martin Wilck
  2019-09-30  9:05       ` Milan P. Gandhi
  0 siblings, 1 reply; 10+ messages in thread
From: Martin Wilck @ 2019-09-30  8:42 UTC (permalink / raw)
  To: Laurence Oberman, Milan P. Gandhi, linux-kernel, linux-scsi
  Cc: jejb, martin.petersen

On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
> 
> Hi Martin
> 
> Agreed about log extraction, but turning that on with a busy workload
> in a production environment is not practical. We cant do it with
> systems with 1000's of luns and 1000's of IOPS/sec.
> Also second resolution is good enough for the debug we want to see.

I gather that you look at a specific problem where second resolution is
sufficient. For upstream, the generic usefulness should be considered,
and I don't think we can say today that better-than-second resolution
will never be useful, so I still vote for milliseconds.

Wrt the enablement of the option on highly loaded systems, I'm not sure
I understand. You need to enable SCSI logging anyway, don't you? Is it
an issue to have to set 2 sysfs values rather than just one?

Regards,
Martin

> 
> Regards
> Laurence
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-30  8:42     ` Martin Wilck
@ 2019-09-30  9:05       ` Milan P. Gandhi
  2019-09-30  9:46         ` Martin Wilck
  0 siblings, 1 reply; 10+ messages in thread
From: Milan P. Gandhi @ 2019-09-30  9:05 UTC (permalink / raw)
  To: Martin Wilck, Laurence Oberman, linux-kernel, linux-scsi
  Cc: jejb, martin.petersen

On 9/30/19 2:12 PM, Martin Wilck wrote:
> On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
>>
>> Hi Martin
>>
>> Agreed about log extraction, but turning that on with a busy workload
>> in a production environment is not practical. We cant do it with
>> systems with 1000's of luns and 1000's of IOPS/sec.
>> Also second resolution is good enough for the debug we want to see.
> 
> I gather that you look at a specific problem where second resolution is
> sufficient. For upstream, the generic usefulness should be considered,
> and I don't think we can say today that better-than-second resolution
> will never be useful, so I still vote for milliseconds.

Ok, I will change it to ms.

> Wrt the enablement of the option on highly loaded systems, I'm not sure
> I understand. You need to enable SCSI logging anyway, don't you?

By default we keep the SCSI debug logging disabled or am I missing 
something?

>Is it an issue to have to set 2 sysfs values rather than just one?

The idea here is to capture the above debug data even without 
any user interventions to change any sysfs entries or to enable 
debug logging on busy, critical production systems.

Also, we are not changing the existing text in SCSI command error log,
but we are only adding one single word at the end of message. Ideally
the user scripts are written to grep specific pattern from the logs.
Since we are not replacing any existing text from the logs, the 
scripts should still work with this change as well.

Thanks,
Milan.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] scsi: core: Log SCSI command age with errors
  2019-09-30  9:05       ` Milan P. Gandhi
@ 2019-09-30  9:46         ` Martin Wilck
  0 siblings, 0 replies; 10+ messages in thread
From: Martin Wilck @ 2019-09-30  9:46 UTC (permalink / raw)
  To: mgandhi, Laurence Oberman, linux-kernel, linux-scsi; +Cc: jejb, martin.petersen

Hello Milan,

On Mon, 2019-09-30 at 14:35 +0530, Milan P. Gandhi wrote:
> On 9/30/19 2:12 PM, Martin Wilck wrote:

> > Wrt the enablement of the option on highly loaded systems, I'm not
> > sure
> > I understand. You need to enable SCSI logging anyway, don't you?
> 
> By default we keep the SCSI debug logging disabled or am I missing 
> something?
> 
> > Is it an issue to have to set 2 sysfs values rather than just one?
> 
> The idea here is to capture the above debug data even without 
> any user interventions to change any sysfs entries or to enable 
> debug logging on busy, critical production systems.

So, you're looking at the scsi_io_completion() code path. In my
experience that isn't reliable for bug hunting because of the the
message rate limiting. Therefore I prefer using SCSI logging
MLCOMPLETE=1, where no rate limiting applies. But that's just a side
note, it depends on the case what's more useful.

Back to the cmd age output, IMO we're are on a thin line between
capturing useful information and keeping the logs neat. As I already
said, I'm not convinced that this information, as important it may be
for the case(s) you're currently investigating, has the same generic
degree of importance or usefulness as what's currently printed (the CDB
and the sense data). But OTOH, that's just a gut feeling, and I can't
claim to have the experience to make general statement on it. If noone
else has issues with this being printed by default, I'm not going
oppose it. 

> Also, we are not changing the existing text in SCSI command error
> log,
> but we are only adding one single word at the end of message. Ideally
> the user scripts are written to grep specific pattern from the logs.
> Since we are not replacing any existing text from the logs, the 
> scripts should still work with this change as well.

You are certainly aware that such scripts don't necessarily conform to
what kernel developers would consider "ideal" :-) But again, I just
wanted to raise the issue; if noone else thinks it matters, fine with
me.

Thanks
Martin



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-09-30  9:46 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-23  6:01 [PATCH] scsi: core: Log SCSI command age with errors Milan P. Gandhi
2019-09-23 13:02 ` Laurence Oberman
2019-09-23 13:24   ` Milan P. Gandhi
2019-09-25 15:45 ` Bart Van Assche
2019-09-26  5:01   ` Milan P. Gandhi
2019-09-27 15:26 ` Martin Wilck
2019-09-27 17:45   ` Laurence Oberman
2019-09-30  8:42     ` Martin Wilck
2019-09-30  9:05       ` Milan P. Gandhi
2019-09-30  9:46         ` Martin Wilck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).