All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] SUNRPC: Move call to rpc_count_iostats before rpc_call_done
@ 2019-05-23 20:13 Dave Wysochanski
  2019-05-23 20:13 ` [PATCH 2/3] SUNRPC: Use proper printk specifiers for unsigned long long Dave Wysochanski
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Dave Wysochanski @ 2019-05-23 20:13 UTC (permalink / raw)
  To: linux-nfs

For diagnostic purposes, it would be useful to have an rpc_iostats
metric of RPCs completing with tk_status < 0.  Unfortunately,
tk_status is reset inside the rpc_call_done functions for each
operation, and the call to tally the per-op metrics comes after
rpc_call_done.  Refactor the call to rpc_count_iostat earlier in
rpc_exit_task so we can count these RPCs completing in error.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 net/sunrpc/sched.c | 5 +++++
 net/sunrpc/xprt.c  | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 28956c70100a..543caef296e4 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -22,6 +22,7 @@
 #include <linux/sched/mm.h>
 
 #include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/metrics.h>
 
 #include "sunrpc.h"
 
@@ -770,6 +771,10 @@ rpc_reset_task_statistics(struct rpc_task *task)
 void rpc_exit_task(struct rpc_task *task)
 {
 	task->tk_action = NULL;
+	if (task->tk_ops->rpc_count_stats)
+		task->tk_ops->rpc_count_stats(task, task->tk_calldata);
+	else if (task->tk_client)
+		rpc_count_iostats(task, task->tk_client->cl_metrics);
 	if (task->tk_ops->rpc_call_done != NULL) {
 		task->tk_ops->rpc_call_done(task, task->tk_calldata);
 		if (task->tk_action != NULL) {
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index d7117d241460..d4477b227a96 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1716,10 +1716,6 @@ void xprt_release(struct rpc_task *task)
 	}
 
 	xprt = req->rq_xprt;
-	if (task->tk_ops->rpc_count_stats != NULL)
-		task->tk_ops->rpc_count_stats(task, task->tk_calldata);
-	else if (task->tk_client)
-		rpc_count_iostats(task, task->tk_client->cl_metrics);
 	xprt_request_dequeue_all(task, req);
 	spin_lock_bh(&xprt->transport_lock);
 	xprt->ops->release_xprt(xprt, task);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] SUNRPC: Use proper printk specifiers for unsigned long long
  2019-05-23 20:13 [PATCH 1/3] SUNRPC: Move call to rpc_count_iostats before rpc_call_done Dave Wysochanski
@ 2019-05-23 20:13 ` Dave Wysochanski
  2019-05-23 20:13 ` [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0 Dave Wysochanski
  2019-05-23 20:13 ` [PATCH] mountstats: add per-op error counts for mountstats command Dave Wysochanski
  2 siblings, 0 replies; 18+ messages in thread
From: Dave Wysochanski @ 2019-05-23 20:13 UTC (permalink / raw)
  To: linux-nfs

Update the printk specifiers inside _print_rpc_iostats to avoid
a checkpatch warning.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 net/sunrpc/stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 71166b393732..8b2d3c58ffae 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -224,7 +224,7 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
 			       int op, const struct rpc_procinfo *procs)
 {
 	_print_name(seq, op, procs);
-	seq_printf(seq, "%lu %lu %lu %Lu %Lu %Lu %Lu %Lu\n",
+	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
 		   stats->om_ops,
 		   stats->om_ntrans,
 		   stats->om_timeouts,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-23 20:13 [PATCH 1/3] SUNRPC: Move call to rpc_count_iostats before rpc_call_done Dave Wysochanski
  2019-05-23 20:13 ` [PATCH 2/3] SUNRPC: Use proper printk specifiers for unsigned long long Dave Wysochanski
@ 2019-05-23 20:13 ` Dave Wysochanski
  2019-05-30 21:38   ` J. Bruce Fields
  2019-05-23 20:13 ` [PATCH] mountstats: add per-op error counts for mountstats command Dave Wysochanski
  2 siblings, 1 reply; 18+ messages in thread
From: Dave Wysochanski @ 2019-05-23 20:13 UTC (permalink / raw)
  To: linux-nfs

We often see various error conditions with NFS4.x that show up with
a very high operation count all completing with tk_status < 0 in a
short period of time.  Add a count to rpc_iostats to record on a
per-op basis the ops that complete in this manner, which will
enable lower overhead diagnostics.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 include/linux/sunrpc/metrics.h | 7 ++++++-
 net/sunrpc/stats.c             | 8 ++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
index 1b3751327575..0ee3f7052846 100644
--- a/include/linux/sunrpc/metrics.h
+++ b/include/linux/sunrpc/metrics.h
@@ -30,7 +30,7 @@
 #include <linux/ktime.h>
 #include <linux/spinlock.h>
 
-#define RPC_IOSTATS_VERS	"1.0"
+#define RPC_IOSTATS_VERS	"1.1"
 
 struct rpc_iostats {
 	spinlock_t		om_lock;
@@ -66,6 +66,11 @@ struct rpc_iostats {
 	ktime_t			om_queue,	/* queued for xmit */
 				om_rtt,		/* RPC RTT */
 				om_execute;	/* RPC execution */
+	/*
+	 * The count of operations that complete with tk_status < 0.
+	 * These statuses usually indicate error conditions.
+	 */
+	unsigned long           om_error_status;
 } ____cacheline_aligned;
 
 struct rpc_task;
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 8b2d3c58ffae..737414247ca7 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
 
 	execute = ktime_sub(now, task->tk_start);
 	op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
+	if (task->tk_status < 0)
+		op_metrics->om_error_status++;
 
 	spin_unlock(&op_metrics->om_lock);
 
@@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
 	a->om_execute = ktime_add(a->om_execute, b->om_execute);
+	a->om_error_status += b->om_error_status;
 }
 
 static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
 			       int op, const struct rpc_procinfo *procs)
 {
 	_print_name(seq, op, procs);
-	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
+	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
 		   stats->om_ops,
 		   stats->om_ntrans,
 		   stats->om_timeouts,
@@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
 		   stats->om_bytes_recv,
 		   ktime_to_ms(stats->om_queue),
 		   ktime_to_ms(stats->om_rtt),
-		   ktime_to_ms(stats->om_execute));
+		   ktime_to_ms(stats->om_execute),
+		   stats->om_error_status);
 }
 
 void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH] mountstats: add per-op error counts for mountstats command
  2019-05-23 20:13 [PATCH 1/3] SUNRPC: Move call to rpc_count_iostats before rpc_call_done Dave Wysochanski
  2019-05-23 20:13 ` [PATCH 2/3] SUNRPC: Use proper printk specifiers for unsigned long long Dave Wysochanski
  2019-05-23 20:13 ` [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0 Dave Wysochanski
@ 2019-05-23 20:13 ` Dave Wysochanski
  2019-05-23 20:33   ` [PATCH v2] " Dave Wysochanski
  2 siblings, 1 reply; 18+ messages in thread
From: Dave Wysochanski @ 2019-05-23 20:13 UTC (permalink / raw)
  To: linux-nfs

Display the count of ops completing with error status (status < 0)
on kernels that support it.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 tools/mountstats/mountstats.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/mountstats/mountstats.py b/tools/mountstats/mountstats.py
index c7fb8bb1..b8c720d5 100755
--- a/tools/mountstats/mountstats.py
+++ b/tools/mountstats/mountstats.py
@@ -475,7 +475,9 @@ class DeviceData:
                 retrans = stats[2] - count
                 if retrans != 0:
                     print('\t%d retrans (%d%%)' % (retrans, ((retrans * 100) / count)), end=' ')
-                    print('\t%d major timeouts' % stats[3])
+                    print('\t%d major timeouts' % stats[3], end='')
+                if len(stats) >= 10 and stats[9] != 0:
+                    print('%d errors (%d%%)' % (stats[9], ((stats[9] * 100) / count)))
                 else:
                     print('')
                 print('\tavg bytes sent per op: %d\tavg bytes received per op: %d' % \
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2] mountstats: add per-op error counts for mountstats command
  2019-05-23 20:13 ` [PATCH] mountstats: add per-op error counts for mountstats command Dave Wysochanski
@ 2019-05-23 20:33   ` Dave Wysochanski
  2019-06-03 14:31     ` Steve Dickson
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Wysochanski @ 2019-05-23 20:33 UTC (permalink / raw)
  To: linux-nfs

Display the count of ops completing with error status (status < 0)
on kernels that support it.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 tools/mountstats/mountstats.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/mountstats/mountstats.py b/tools/mountstats/mountstats.py
index c7fb8bb1..2f525f4b 100755
--- a/tools/mountstats/mountstats.py
+++ b/tools/mountstats/mountstats.py
@@ -475,7 +475,9 @@ class DeviceData:
                 retrans = stats[2] - count
                 if retrans != 0:
                     print('\t%d retrans (%d%%)' % (retrans, ((retrans * 100) / count)), end=' ')
-                    print('\t%d major timeouts' % stats[3])
+                    print('\t%d major timeouts' % stats[3], end='')
+                if len(stats) >= 10 and stats[9] != 0:
+                    print('\t%d errors (%d%%)' % (stats[9], ((stats[9] * 100) / count)))
                 else:
                     print('')
                 print('\tavg bytes sent per op: %d\tavg bytes received per op: %d' % \
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-23 20:13 ` [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0 Dave Wysochanski
@ 2019-05-30 21:38   ` J. Bruce Fields
  2019-05-30 22:19     ` Chuck Lever
  2019-05-31  0:17     ` David Wysochanski
  0 siblings, 2 replies; 18+ messages in thread
From: J. Bruce Fields @ 2019-05-30 21:38 UTC (permalink / raw)
  To: Dave Wysochanski; +Cc: linux-nfs

On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski wrote:
> We often see various error conditions with NFS4.x that show up with
> a very high operation count all completing with tk_status < 0 in a
> short period of time.  Add a count to rpc_iostats to record on a
> per-op basis the ops that complete in this manner, which will
> enable lower overhead diagnostics.

Looks like a good idea to me.

It's too bad we can't distinguish the errors.  (E.g. ESTALE on a lookup
call is a lot more interesting than ENOENT.)  But understood that
maintaining (number of possible errors) * (number of possible ops)
counters is probably overkill, so just counting the number of errors
seems like a good start.

--b.

> 
> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> ---
>  include/linux/sunrpc/metrics.h | 7 ++++++-
>  net/sunrpc/stats.c             | 8 ++++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
> index 1b3751327575..0ee3f7052846 100644
> --- a/include/linux/sunrpc/metrics.h
> +++ b/include/linux/sunrpc/metrics.h
> @@ -30,7 +30,7 @@
>  #include <linux/ktime.h>
>  #include <linux/spinlock.h>
>  
> -#define RPC_IOSTATS_VERS	"1.0"
> +#define RPC_IOSTATS_VERS	"1.1"
>  
>  struct rpc_iostats {
>  	spinlock_t		om_lock;
> @@ -66,6 +66,11 @@ struct rpc_iostats {
>  	ktime_t			om_queue,	/* queued for xmit */
>  				om_rtt,		/* RPC RTT */
>  				om_execute;	/* RPC execution */
> +	/*
> +	 * The count of operations that complete with tk_status < 0.
> +	 * These statuses usually indicate error conditions.
> +	 */
> +	unsigned long           om_error_status;
>  } ____cacheline_aligned;
>  
>  struct rpc_task;
> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> index 8b2d3c58ffae..737414247ca7 100644
> --- a/net/sunrpc/stats.c
> +++ b/net/sunrpc/stats.c
> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
>  
>  	execute = ktime_sub(now, task->tk_start);
>  	op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
> +	if (task->tk_status < 0)
> +		op_metrics->om_error_status++;
>  
>  	spin_unlock(&op_metrics->om_lock);
>  
> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
>  	a->om_queue = ktime_add(a->om_queue, b->om_queue);
>  	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
>  	a->om_execute = ktime_add(a->om_execute, b->om_execute);
> +	a->om_error_status += b->om_error_status;
>  }
>  
>  static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>  			       int op, const struct rpc_procinfo *procs)
>  {
>  	_print_name(seq, op, procs);
> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
>  		   stats->om_ops,
>  		   stats->om_ntrans,
>  		   stats->om_timeouts,
> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>  		   stats->om_bytes_recv,
>  		   ktime_to_ms(stats->om_queue),
>  		   ktime_to_ms(stats->om_rtt),
> -		   ktime_to_ms(stats->om_execute));
> +		   ktime_to_ms(stats->om_execute),
> +		   stats->om_error_status);
>  }
>  
>  void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-30 21:38   ` J. Bruce Fields
@ 2019-05-30 22:19     ` Chuck Lever
  2019-05-30 22:33       ` Bruce Fields
  2019-05-31  0:17     ` David Wysochanski
  1 sibling, 1 reply; 18+ messages in thread
From: Chuck Lever @ 2019-05-30 22:19 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Dave Wysochanski, Linux NFS Mailing List



> On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
> 
> On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski wrote:
>> We often see various error conditions with NFS4.x that show up with
>> a very high operation count all completing with tk_status < 0 in a
>> short period of time.  Add a count to rpc_iostats to record on a
>> per-op basis the ops that complete in this manner, which will
>> enable lower overhead diagnostics.
> 
> Looks like a good idea to me.
> 
> It's too bad we can't distinguish the errors.  (E.g. ESTALE on a lookup
> call is a lot more interesting than ENOENT.)  But understood that
> maintaining (number of possible errors) * (number of possible ops)
> counters is probably overkill, so just counting the number of errors
> seems like a good start.

We now have trace points that can do that too.


> --b.
> 
>> 
>> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
>> ---
>> include/linux/sunrpc/metrics.h | 7 ++++++-
>> net/sunrpc/stats.c             | 8 ++++++--
>> 2 files changed, 12 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
>> index 1b3751327575..0ee3f7052846 100644
>> --- a/include/linux/sunrpc/metrics.h
>> +++ b/include/linux/sunrpc/metrics.h
>> @@ -30,7 +30,7 @@
>> #include <linux/ktime.h>
>> #include <linux/spinlock.h>
>> 
>> -#define RPC_IOSTATS_VERS	"1.0"
>> +#define RPC_IOSTATS_VERS	"1.1"
>> 
>> struct rpc_iostats {
>> 	spinlock_t		om_lock;
>> @@ -66,6 +66,11 @@ struct rpc_iostats {
>> 	ktime_t			om_queue,	/* queued for xmit */
>> 				om_rtt,		/* RPC RTT */
>> 				om_execute;	/* RPC execution */
>> +	/*
>> +	 * The count of operations that complete with tk_status < 0.
>> +	 * These statuses usually indicate error conditions.
>> +	 */
>> +	unsigned long           om_error_status;
>> } ____cacheline_aligned;
>> 
>> struct rpc_task;
>> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
>> index 8b2d3c58ffae..737414247ca7 100644
>> --- a/net/sunrpc/stats.c
>> +++ b/net/sunrpc/stats.c
>> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
>> 
>> 	execute = ktime_sub(now, task->tk_start);
>> 	op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
>> +	if (task->tk_status < 0)
>> +		op_metrics->om_error_status++;
>> 
>> 	spin_unlock(&op_metrics->om_lock);
>> 
>> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
>> 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
>> 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
>> 	a->om_execute = ktime_add(a->om_execute, b->om_execute);
>> +	a->om_error_status += b->om_error_status;
>> }
>> 
>> static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>> 			       int op, const struct rpc_procinfo *procs)
>> {
>> 	_print_name(seq, op, procs);
>> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
>> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
>> 		   stats->om_ops,
>> 		   stats->om_ntrans,
>> 		   stats->om_timeouts,
>> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>> 		   stats->om_bytes_recv,
>> 		   ktime_to_ms(stats->om_queue),
>> 		   ktime_to_ms(stats->om_rtt),
>> -		   ktime_to_ms(stats->om_execute));
>> +		   ktime_to_ms(stats->om_execute),
>> +		   stats->om_error_status);
>> }
>> 
>> void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
>> -- 
>> 2.20.1

--
Chuck Lever




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-30 22:19     ` Chuck Lever
@ 2019-05-30 22:33       ` Bruce Fields
  2019-05-31 13:25         ` Chuck Lever
  0 siblings, 1 reply; 18+ messages in thread
From: Bruce Fields @ 2019-05-30 22:33 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Dave Wysochanski, Linux NFS Mailing List

On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> 
> 
> > On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
> > 
> > On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski wrote:
> >> We often see various error conditions with NFS4.x that show up with
> >> a very high operation count all completing with tk_status < 0 in a
> >> short period of time.  Add a count to rpc_iostats to record on a
> >> per-op basis the ops that complete in this manner, which will
> >> enable lower overhead diagnostics.
> > 
> > Looks like a good idea to me.
> > 
> > It's too bad we can't distinguish the errors.  (E.g. ESTALE on a lookup
> > call is a lot more interesting than ENOENT.)  But understood that
> > maintaining (number of possible errors) * (number of possible ops)
> > counters is probably overkill, so just counting the number of errors
> > seems like a good start.
> 
> We now have trace points that can do that too.

You mean, that can report every error (and its value)?

I assume having these statistics in mountstats is still useful, though.

--b.

> 
> 
> > --b.
> > 
> >> 
> >> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> >> ---
> >> include/linux/sunrpc/metrics.h | 7 ++++++-
> >> net/sunrpc/stats.c             | 8 ++++++--
> >> 2 files changed, 12 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
> >> index 1b3751327575..0ee3f7052846 100644
> >> --- a/include/linux/sunrpc/metrics.h
> >> +++ b/include/linux/sunrpc/metrics.h
> >> @@ -30,7 +30,7 @@
> >> #include <linux/ktime.h>
> >> #include <linux/spinlock.h>
> >> 
> >> -#define RPC_IOSTATS_VERS	"1.0"
> >> +#define RPC_IOSTATS_VERS	"1.1"
> >> 
> >> struct rpc_iostats {
> >> 	spinlock_t		om_lock;
> >> @@ -66,6 +66,11 @@ struct rpc_iostats {
> >> 	ktime_t			om_queue,	/* queued for xmit */
> >> 				om_rtt,		/* RPC RTT */
> >> 				om_execute;	/* RPC execution */
> >> +	/*
> >> +	 * The count of operations that complete with tk_status < 0.
> >> +	 * These statuses usually indicate error conditions.
> >> +	 */
> >> +	unsigned long           om_error_status;
> >> } ____cacheline_aligned;
> >> 
> >> struct rpc_task;
> >> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> >> index 8b2d3c58ffae..737414247ca7 100644
> >> --- a/net/sunrpc/stats.c
> >> +++ b/net/sunrpc/stats.c
> >> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
> >> 
> >> 	execute = ktime_sub(now, task->tk_start);
> >> 	op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
> >> +	if (task->tk_status < 0)
> >> +		op_metrics->om_error_status++;
> >> 
> >> 	spin_unlock(&op_metrics->om_lock);
> >> 
> >> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
> >> 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
> >> 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
> >> 	a->om_execute = ktime_add(a->om_execute, b->om_execute);
> >> +	a->om_error_status += b->om_error_status;
> >> }
> >> 
> >> static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
> >> 			       int op, const struct rpc_procinfo *procs)
> >> {
> >> 	_print_name(seq, op, procs);
> >> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
> >> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
> >> 		   stats->om_ops,
> >> 		   stats->om_ntrans,
> >> 		   stats->om_timeouts,
> >> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
> >> 		   stats->om_bytes_recv,
> >> 		   ktime_to_ms(stats->om_queue),
> >> 		   ktime_to_ms(stats->om_rtt),
> >> -		   ktime_to_ms(stats->om_execute));
> >> +		   ktime_to_ms(stats->om_execute),
> >> +		   stats->om_error_status);
> >> }
> >> 
> >> void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
> >> -- 
> >> 2.20.1
> 
> --
> Chuck Lever
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-30 21:38   ` J. Bruce Fields
  2019-05-30 22:19     ` Chuck Lever
@ 2019-05-31  0:17     ` David Wysochanski
  1 sibling, 0 replies; 18+ messages in thread
From: David Wysochanski @ 2019-05-31  0:17 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Thu, May 30, 2019 at 5:39 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski wrote:
> > We often see various error conditions with NFS4.x that show up with
> > a very high operation count all completing with tk_status < 0 in a
> > short period of time.  Add a count to rpc_iostats to record on a
> > per-op basis the ops that complete in this manner, which will
> > enable lower overhead diagnostics.
>
> Looks like a good idea to me.
>
> It's too bad we can't distinguish the errors.  (E.g. ESTALE on a lookup
> call is a lot more interesting than ENOENT.)  But understood that
> maintaining (number of possible errors) * (number of possible ops)
> counters is probably overkill, so just counting the number of errors
> seems like a good start.
>
> --b.
>

I did consider a more elaborate approach, where each code would be
counted.  Most likely that would need to go into debugfs.  I didn't
pursue that but may look into it in the future along with other work.

I do think the < 0 status counts have value even if we do not know the
specific code, they are very low overhead, and always there if added
to mountstats.  I was envisioning using this along with something like
a periodic capture of mountstats possibly using PCP and then using
some statistics as a "NFS health" measure for a specific mount point.
There are already some "NFS health" metrics that may be calculated
with existing mountstats (for example if you're clever you can spot
certain NFS4 protocol loops), but the error counts would make it much
easier and reliable.


> >
> > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > ---
> >  include/linux/sunrpc/metrics.h | 7 ++++++-
> >  net/sunrpc/stats.c             | 8 ++++++--
> >  2 files changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
> > index 1b3751327575..0ee3f7052846 100644
> > --- a/include/linux/sunrpc/metrics.h
> > +++ b/include/linux/sunrpc/metrics.h
> > @@ -30,7 +30,7 @@
> >  #include <linux/ktime.h>
> >  #include <linux/spinlock.h>
> >
> > -#define RPC_IOSTATS_VERS     "1.0"
> > +#define RPC_IOSTATS_VERS     "1.1"
> >
> >  struct rpc_iostats {
> >       spinlock_t              om_lock;
> > @@ -66,6 +66,11 @@ struct rpc_iostats {
> >       ktime_t                 om_queue,       /* queued for xmit */
> >                               om_rtt,         /* RPC RTT */
> >                               om_execute;     /* RPC execution */
> > +     /*
> > +      * The count of operations that complete with tk_status < 0.
> > +      * These statuses usually indicate error conditions.
> > +      */
> > +     unsigned long           om_error_status;
> >  } ____cacheline_aligned;
> >
> >  struct rpc_task;
> > diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> > index 8b2d3c58ffae..737414247ca7 100644
> > --- a/net/sunrpc/stats.c
> > +++ b/net/sunrpc/stats.c
> > @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
> >
> >       execute = ktime_sub(now, task->tk_start);
> >       op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
> > +     if (task->tk_status < 0)
> > +             op_metrics->om_error_status++;
> >
> >       spin_unlock(&op_metrics->om_lock);
> >
> > @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
> >       a->om_queue = ktime_add(a->om_queue, b->om_queue);
> >       a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
> >       a->om_execute = ktime_add(a->om_execute, b->om_execute);
> > +     a->om_error_status += b->om_error_status;
> >  }
> >
> >  static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
> >                              int op, const struct rpc_procinfo *procs)
> >  {
> >       _print_name(seq, op, procs);
> > -     seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
> > +     seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
> >                  stats->om_ops,
> >                  stats->om_ntrans,
> >                  stats->om_timeouts,
> > @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
> >                  stats->om_bytes_recv,
> >                  ktime_to_ms(stats->om_queue),
> >                  ktime_to_ms(stats->om_rtt),
> > -                ktime_to_ms(stats->om_execute));
> > +                ktime_to_ms(stats->om_execute),
> > +                stats->om_error_status);
> >  }
> >
> >  void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
> > --
> > 2.20.1



-- 
Dave Wysochanski
Principal Software Maintenance Engineer
T: 919-754-4024

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-30 22:33       ` Bruce Fields
@ 2019-05-31 13:25         ` Chuck Lever
  2019-06-03 18:53           ` Dave Wysochanski
  0 siblings, 1 reply; 18+ messages in thread
From: Chuck Lever @ 2019-05-31 13:25 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Dave Wysochanski, Linux NFS Mailing List



> On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
>> 
>> 
>>> On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
>>> 
>>> On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski wrote:
>>>> We often see various error conditions with NFS4.x that show up with
>>>> a very high operation count all completing with tk_status < 0 in a
>>>> short period of time.  Add a count to rpc_iostats to record on a
>>>> per-op basis the ops that complete in this manner, which will
>>>> enable lower overhead diagnostics.
>>> 
>>> Looks like a good idea to me.
>>> 
>>> It's too bad we can't distinguish the errors.  (E.g. ESTALE on a lookup
>>> call is a lot more interesting than ENOENT.)  But understood that
>>> maintaining (number of possible errors) * (number of possible ops)
>>> counters is probably overkill, so just counting the number of errors
>>> seems like a good start.
>> 
>> We now have trace points that can do that too.
> 
> You mean, that can report every error (and its value)?

Yes, the nfs_xdr_status trace point reports the error by value and symbolic name.


> I assume having these statistics in mountstats is still useful, though.
> 
> --b.
> 
>> 
>> 
>>> --b.
>>> 
>>>> 
>>>> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
>>>> ---
>>>> include/linux/sunrpc/metrics.h | 7 ++++++-
>>>> net/sunrpc/stats.c             | 8 ++++++--
>>>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>>> 
>>>> diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
>>>> index 1b3751327575..0ee3f7052846 100644
>>>> --- a/include/linux/sunrpc/metrics.h
>>>> +++ b/include/linux/sunrpc/metrics.h
>>>> @@ -30,7 +30,7 @@
>>>> #include <linux/ktime.h>
>>>> #include <linux/spinlock.h>
>>>> 
>>>> -#define RPC_IOSTATS_VERS	"1.0"
>>>> +#define RPC_IOSTATS_VERS	"1.1"
>>>> 
>>>> struct rpc_iostats {
>>>> 	spinlock_t		om_lock;
>>>> @@ -66,6 +66,11 @@ struct rpc_iostats {
>>>> 	ktime_t			om_queue,	/* queued for xmit */
>>>> 				om_rtt,		/* RPC RTT */
>>>> 				om_execute;	/* RPC execution */
>>>> +	/*
>>>> +	 * The count of operations that complete with tk_status < 0.
>>>> +	 * These statuses usually indicate error conditions.
>>>> +	 */
>>>> +	unsigned long           om_error_status;
>>>> } ____cacheline_aligned;
>>>> 
>>>> struct rpc_task;
>>>> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
>>>> index 8b2d3c58ffae..737414247ca7 100644
>>>> --- a/net/sunrpc/stats.c
>>>> +++ b/net/sunrpc/stats.c
>>>> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task,
>>>> 
>>>> 	execute = ktime_sub(now, task->tk_start);
>>>> 	op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute);
>>>> +	if (task->tk_status < 0)
>>>> +		op_metrics->om_error_status++;
>>>> 
>>>> 	spin_unlock(&op_metrics->om_lock);
>>>> 
>>>> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b)
>>>> 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
>>>> 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
>>>> 	a->om_execute = ktime_add(a->om_execute, b->om_execute);
>>>> +	a->om_error_status += b->om_error_status;
>>>> }
>>>> 
>>>> static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>>>> 			       int op, const struct rpc_procinfo *procs)
>>>> {
>>>> 	_print_name(seq, op, procs);
>>>> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu\n",
>>>> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n",
>>>> 		   stats->om_ops,
>>>> 		   stats->om_ntrans,
>>>> 		   stats->om_timeouts,
>>>> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats,
>>>> 		   stats->om_bytes_recv,
>>>> 		   ktime_to_ms(stats->om_queue),
>>>> 		   ktime_to_ms(stats->om_rtt),
>>>> -		   ktime_to_ms(stats->om_execute));
>>>> +		   ktime_to_ms(stats->om_execute),
>>>> +		   stats->om_error_status);
>>>> }
>>>> 
>>>> void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt)
>>>> -- 
>>>> 2.20.1
>> 
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] mountstats: add per-op error counts for mountstats command
  2019-05-23 20:33   ` [PATCH v2] " Dave Wysochanski
@ 2019-06-03 14:31     ` Steve Dickson
  0 siblings, 0 replies; 18+ messages in thread
From: Steve Dickson @ 2019-06-03 14:31 UTC (permalink / raw)
  To: Dave Wysochanski, linux-nfs



On 5/23/19 4:33 PM, Dave Wysochanski wrote:
> Display the count of ops completing with error status (status < 0)
> on kernels that support it.
> 
> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> ---
>  tools/mountstats/mountstats.py | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
Committed... 

steved.
> 
> diff --git a/tools/mountstats/mountstats.py b/tools/mountstats/mountstats.py
> index c7fb8bb1..2f525f4b 100755
> --- a/tools/mountstats/mountstats.py
> +++ b/tools/mountstats/mountstats.py
> @@ -475,7 +475,9 @@ class DeviceData:
>                  retrans = stats[2] - count
>                  if retrans != 0:
>                      print('\t%d retrans (%d%%)' % (retrans, ((retrans * 100) / count)), end=' ')
> -                    print('\t%d major timeouts' % stats[3])
> +                    print('\t%d major timeouts' % stats[3], end='')
> +                if len(stats) >= 10 and stats[9] != 0:
> +                    print('\t%d errors (%d%%)' % (stats[9], ((stats[9] * 100) / count)))
>                  else:
>                      print('')
>                  print('\tavg bytes sent per op: %d\tavg bytes received per op: %d' % \
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-05-31 13:25         ` Chuck Lever
@ 2019-06-03 18:53           ` Dave Wysochanski
  2019-06-03 18:56             ` Chuck Lever
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Wysochanski @ 2019-06-03 18:53 UTC (permalink / raw)
  To: Chuck Lever, Bruce Fields; +Cc: Linux NFS Mailing List

On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
> > On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org>
> > wrote:
> > 
> > On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> > > 
> > > 
> > > > On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
> > > > 
> > > > On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski
> > > > wrote:
> > > > > We often see various error conditions with NFS4.x that show
> > > > > up with
> > > > > a very high operation count all completing with tk_status < 0
> > > > > in a
> > > > > short period of time.  Add a count to rpc_iostats to record
> > > > > on a
> > > > > per-op basis the ops that complete in this manner, which will
> > > > > enable lower overhead diagnostics.
> > > > 
> > > > Looks like a good idea to me.
> > > > 
> > > > It's too bad we can't distinguish the errors.  (E.g. ESTALE on
> > > > a lookup
> > > > call is a lot more interesting than ENOENT.)  But understood
> > > > that
> > > > maintaining (number of possible errors) * (number of possible
> > > > ops)
> > > > counters is probably overkill, so just counting the number of
> > > > errors
> > > > seems like a good start.
> > > 
> > > We now have trace points that can do that too.
> > 
> > You mean, that can report every error (and its value)?
> 
> Yes, the nfs_xdr_status trace point reports the error by value and
> symbolic name.
> 

The tracepoint is very useful I agree.  I don't think it will show:
a) the mount
b) the opcode

Or am I mistaken and there's a way to get those with a filter or
another tracepoint?


> 
> > I assume having these statistics in mountstats is still useful,
> > though.
> > 
> > --b.
> > 
> > > 
> > > 
> > > > --b.
> > > > 
> > > > > 
> > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > ---
> > > > > include/linux/sunrpc/metrics.h | 7 ++++++-
> > > > > net/sunrpc/stats.c             | 8 ++++++--
> > > > > 2 files changed, 12 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/sunrpc/metrics.h
> > > > > b/include/linux/sunrpc/metrics.h
> > > > > index 1b3751327575..0ee3f7052846 100644
> > > > > --- a/include/linux/sunrpc/metrics.h
> > > > > +++ b/include/linux/sunrpc/metrics.h
> > > > > @@ -30,7 +30,7 @@
> > > > > #include <linux/ktime.h>
> > > > > #include <linux/spinlock.h>
> > > > > 
> > > > > -#define RPC_IOSTATS_VERS	"1.0"
> > > > > +#define RPC_IOSTATS_VERS	"1.1"
> > > > > 
> > > > > struct rpc_iostats {
> > > > > 	spinlock_t		om_lock;
> > > > > @@ -66,6 +66,11 @@ struct rpc_iostats {
> > > > > 	ktime_t			om_queue,	/* queued for
> > > > > xmit */
> > > > > 				om_rtt,		/* RPC RTT */
> > > > > 				om_execute;	/* RPC
> > > > > execution */
> > > > > +	/*
> > > > > +	 * The count of operations that complete with tk_status
> > > > > < 0.
> > > > > +	 * These statuses usually indicate error conditions.
> > > > > +	 */
> > > > > +	unsigned long           om_error_status;
> > > > > } ____cacheline_aligned;
> > > > > 
> > > > > struct rpc_task;
> > > > > diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> > > > > index 8b2d3c58ffae..737414247ca7 100644
> > > > > --- a/net/sunrpc/stats.c
> > > > > +++ b/net/sunrpc/stats.c
> > > > > @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const
> > > > > struct rpc_task *task,
> > > > > 
> > > > > 	execute = ktime_sub(now, task->tk_start);
> > > > > 	op_metrics->om_execute = ktime_add(op_metrics-
> > > > > >om_execute, execute);
> > > > > +	if (task->tk_status < 0)
> > > > > +		op_metrics->om_error_status++;
> > > > > 
> > > > > 	spin_unlock(&op_metrics->om_lock);
> > > > > 
> > > > > @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct
> > > > > rpc_iostats *a, struct rpc_iostats *b)
> > > > > 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
> > > > > 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
> > > > > 	a->om_execute = ktime_add(a->om_execute, b-
> > > > > >om_execute);
> > > > > +	a->om_error_status += b->om_error_status;
> > > > > }
> > > > > 
> > > > > static void _print_rpc_iostats(struct seq_file *seq, struct
> > > > > rpc_iostats *stats,
> > > > > 			       int op, const struct
> > > > > rpc_procinfo *procs)
> > > > > {
> > > > > 	_print_name(seq, op, procs);
> > > > > -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu
> > > > > %llu\n",
> > > > > +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu
> > > > > %lu\n",
> > > > > 		   stats->om_ops,
> > > > > 		   stats->om_ntrans,
> > > > > 		   stats->om_timeouts,
> > > > > @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct
> > > > > seq_file *seq, struct rpc_iostats *stats,
> > > > > 		   stats->om_bytes_recv,
> > > > > 		   ktime_to_ms(stats->om_queue),
> > > > > 		   ktime_to_ms(stats->om_rtt),
> > > > > -		   ktime_to_ms(stats->om_execute));
> > > > > +		   ktime_to_ms(stats->om_execute),
> > > > > +		   stats->om_error_status);
> > > > > }
> > > > > 
> > > > > void rpc_clnt_show_stats(struct seq_file *seq, struct
> > > > > rpc_clnt *clnt)
> > > > > -- 
> > > > > 2.20.1
> > > 
> > > --
> > > Chuck Lever
> 
> --
> Chuck Lever
> 
> 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-03 18:53           ` Dave Wysochanski
@ 2019-06-03 18:56             ` Chuck Lever
  2019-06-03 19:05               ` Dave Wysochanski
  2019-06-04 14:45               ` Bruce Fields
  0 siblings, 2 replies; 18+ messages in thread
From: Chuck Lever @ 2019-06-03 18:56 UTC (permalink / raw)
  To: Dave Wysochanski; +Cc: Bruce Fields, Linux NFS Mailing List



> On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com> wrote:
> 
> On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
>>> On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org>
>>> wrote:
>>> 
>>> On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
>>>>> 
>>>>> On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski
>>>>> wrote:
>>>>>> We often see various error conditions with NFS4.x that show
>>>>>> up with
>>>>>> a very high operation count all completing with tk_status < 0
>>>>>> in a
>>>>>> short period of time.  Add a count to rpc_iostats to record
>>>>>> on a
>>>>>> per-op basis the ops that complete in this manner, which will
>>>>>> enable lower overhead diagnostics.
>>>>> 
>>>>> Looks like a good idea to me.
>>>>> 
>>>>> It's too bad we can't distinguish the errors.  (E.g. ESTALE on
>>>>> a lookup
>>>>> call is a lot more interesting than ENOENT.)  But understood
>>>>> that
>>>>> maintaining (number of possible errors) * (number of possible
>>>>> ops)
>>>>> counters is probably overkill, so just counting the number of
>>>>> errors
>>>>> seems like a good start.
>>>> 
>>>> We now have trace points that can do that too.
>>> 
>>> You mean, that can report every error (and its value)?
>> 
>> Yes, the nfs_xdr_status trace point reports the error by value and
>> symbolic name.
>> 
> 
> The tracepoint is very useful I agree.  I don't think it will show:
> a) the mount
> b) the opcode
> 
> Or am I mistaken and there's a way to get those with a filter or
> another tracepoint?

The opcode can be exposed by another trace point, but the link between
the two trace points is tenuous and could be improved.

I don't believe any of the NFS trace points expose the mount. My testing
is largely on a single mount so my imagination stopped there.


>>> I assume having these statistics in mountstats is still useful,
>>> though.
>>> 
>>> --b.
>>> 
>>>> 
>>>> 
>>>>> --b.
>>>>> 
>>>>>> 
>>>>>> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
>>>>>> ---
>>>>>> include/linux/sunrpc/metrics.h | 7 ++++++-
>>>>>> net/sunrpc/stats.c             | 8 ++++++--
>>>>>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>>>>> 
>>>>>> diff --git a/include/linux/sunrpc/metrics.h
>>>>>> b/include/linux/sunrpc/metrics.h
>>>>>> index 1b3751327575..0ee3f7052846 100644
>>>>>> --- a/include/linux/sunrpc/metrics.h
>>>>>> +++ b/include/linux/sunrpc/metrics.h
>>>>>> @@ -30,7 +30,7 @@
>>>>>> #include <linux/ktime.h>
>>>>>> #include <linux/spinlock.h>
>>>>>> 
>>>>>> -#define RPC_IOSTATS_VERS	"1.0"
>>>>>> +#define RPC_IOSTATS_VERS	"1.1"
>>>>>> 
>>>>>> struct rpc_iostats {
>>>>>> 	spinlock_t		om_lock;
>>>>>> @@ -66,6 +66,11 @@ struct rpc_iostats {
>>>>>> 	ktime_t			om_queue,	/* queued for
>>>>>> xmit */
>>>>>> 				om_rtt,		/* RPC RTT */
>>>>>> 				om_execute;	/* RPC
>>>>>> execution */
>>>>>> +	/*
>>>>>> +	 * The count of operations that complete with tk_status
>>>>>> < 0.
>>>>>> +	 * These statuses usually indicate error conditions.
>>>>>> +	 */
>>>>>> +	unsigned long           om_error_status;
>>>>>> } ____cacheline_aligned;
>>>>>> 
>>>>>> struct rpc_task;
>>>>>> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
>>>>>> index 8b2d3c58ffae..737414247ca7 100644
>>>>>> --- a/net/sunrpc/stats.c
>>>>>> +++ b/net/sunrpc/stats.c
>>>>>> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const
>>>>>> struct rpc_task *task,
>>>>>> 
>>>>>> 	execute = ktime_sub(now, task->tk_start);
>>>>>> 	op_metrics->om_execute = ktime_add(op_metrics-
>>>>>>> om_execute, execute);
>>>>>> +	if (task->tk_status < 0)
>>>>>> +		op_metrics->om_error_status++;
>>>>>> 
>>>>>> 	spin_unlock(&op_metrics->om_lock);
>>>>>> 
>>>>>> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct
>>>>>> rpc_iostats *a, struct rpc_iostats *b)
>>>>>> 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
>>>>>> 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
>>>>>> 	a->om_execute = ktime_add(a->om_execute, b-
>>>>>>> om_execute);
>>>>>> +	a->om_error_status += b->om_error_status;
>>>>>> }
>>>>>> 
>>>>>> static void _print_rpc_iostats(struct seq_file *seq, struct
>>>>>> rpc_iostats *stats,
>>>>>> 			       int op, const struct
>>>>>> rpc_procinfo *procs)
>>>>>> {
>>>>>> 	_print_name(seq, op, procs);
>>>>>> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu
>>>>>> %llu\n",
>>>>>> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu
>>>>>> %lu\n",
>>>>>> 		   stats->om_ops,
>>>>>> 		   stats->om_ntrans,
>>>>>> 		   stats->om_timeouts,
>>>>>> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct
>>>>>> seq_file *seq, struct rpc_iostats *stats,
>>>>>> 		   stats->om_bytes_recv,
>>>>>> 		   ktime_to_ms(stats->om_queue),
>>>>>> 		   ktime_to_ms(stats->om_rtt),
>>>>>> -		   ktime_to_ms(stats->om_execute));
>>>>>> +		   ktime_to_ms(stats->om_execute),
>>>>>> +		   stats->om_error_status);
>>>>>> }
>>>>>> 
>>>>>> void rpc_clnt_show_stats(struct seq_file *seq, struct
>>>>>> rpc_clnt *clnt)
>>>>>> -- 
>>>>>> 2.20.1
>>>> 
>>>> --
>>>> Chuck Lever
>> 
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-03 18:56             ` Chuck Lever
@ 2019-06-03 19:05               ` Dave Wysochanski
  2019-06-03 19:08                 ` Chuck Lever
  2019-06-04 14:45               ` Bruce Fields
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Wysochanski @ 2019-06-03 19:05 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Bruce Fields, Linux NFS Mailing List

On Mon, 2019-06-03 at 14:56 -0400, Chuck Lever wrote:
> > On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com>
> > wrote:
> > 
> > On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
> > > > On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org
> > > > >
> > > > wrote:
> > > > 
> > > > On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> > > > > 
> > > > > 
> > > > > > On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
> > > > > > 
> > > > > > On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski
> > > > > > wrote:
> > > > > > > We often see various error conditions with NFS4.x that
> > > > > > > show
> > > > > > > up with
> > > > > > > a very high operation count all completing with tk_status
> > > > > > > < 0
> > > > > > > in a
> > > > > > > short period of time.  Add a count to rpc_iostats to
> > > > > > > record
> > > > > > > on a
> > > > > > > per-op basis the ops that complete in this manner, which
> > > > > > > will
> > > > > > > enable lower overhead diagnostics.
> > > > > > 
> > > > > > Looks like a good idea to me.
> > > > > > 
> > > > > > It's too bad we can't distinguish the errors.  (E.g. ESTALE
> > > > > > on
> > > > > > a lookup
> > > > > > call is a lot more interesting than ENOENT.)  But
> > > > > > understood
> > > > > > that
> > > > > > maintaining (number of possible errors) * (number of
> > > > > > possible
> > > > > > ops)
> > > > > > counters is probably overkill, so just counting the number
> > > > > > of
> > > > > > errors
> > > > > > seems like a good start.
> > > > > 
> > > > > We now have trace points that can do that too.
> > > > 
> > > > You mean, that can report every error (and its value)?
> > > 
> > > Yes, the nfs_xdr_status trace point reports the error by value
> > > and
> > > symbolic name.
> > > 
> > 
> > The tracepoint is very useful I agree.  I don't think it will show:
> > a) the mount
> > b) the opcode
> > 
> > Or am I mistaken and there's a way to get those with a filter or
> > another tracepoint?
> 
> The opcode can be exposed by another trace point, but the link
> between
> the two trace points is tenuous and could be improved.
> 

I think the number is there but it's not decoded!  This was for a WRITE
completing with BAD_STATEID - very nice:

   kworker/u16:0-31130 [006] .... 949028.602298: nfs4_xdr_status: operation 38: nfs status -10025 (BAD_STATEID)



> I don't believe any of the NFS trace points expose the mount. My
> testing
> is largely on a single mount so my imagination stopped there.
> 

Ok thanks for the confirmation.


> 
> > > > I assume having these statistics in mountstats is still useful,
> > > > though.
> > > > 
> > > > --b.
> > > > 
> > > > > 
> > > > > 
> > > > > > --b.
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > > > ---
> > > > > > > include/linux/sunrpc/metrics.h | 7 ++++++-
> > > > > > > net/sunrpc/stats.c             | 8 ++++++--
> > > > > > > 2 files changed, 12 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/include/linux/sunrpc/metrics.h
> > > > > > > b/include/linux/sunrpc/metrics.h
> > > > > > > index 1b3751327575..0ee3f7052846 100644
> > > > > > > --- a/include/linux/sunrpc/metrics.h
> > > > > > > +++ b/include/linux/sunrpc/metrics.h
> > > > > > > @@ -30,7 +30,7 @@
> > > > > > > #include <linux/ktime.h>
> > > > > > > #include <linux/spinlock.h>
> > > > > > > 
> > > > > > > -#define RPC_IOSTATS_VERS	"1.0"
> > > > > > > +#define RPC_IOSTATS_VERS	"1.1"
> > > > > > > 
> > > > > > > struct rpc_iostats {
> > > > > > > 	spinlock_t		om_lock;
> > > > > > > @@ -66,6 +66,11 @@ struct rpc_iostats {
> > > > > > > 	ktime_t			om_queue,	/* queued
> > > > > > > for
> > > > > > > xmit */
> > > > > > > 				om_rtt,		/* RPC
> > > > > > > RTT */
> > > > > > > 				om_execute;	/* RPC
> > > > > > > execution */
> > > > > > > +	/*
> > > > > > > +	 * The count of operations that complete with tk_status
> > > > > > > < 0.
> > > > > > > +	 * These statuses usually indicate error conditions.
> > > > > > > +	 */
> > > > > > > +	unsigned long           om_error_status;
> > > > > > > } ____cacheline_aligned;
> > > > > > > 
> > > > > > > struct rpc_task;
> > > > > > > diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> > > > > > > index 8b2d3c58ffae..737414247ca7 100644
> > > > > > > --- a/net/sunrpc/stats.c
> > > > > > > +++ b/net/sunrpc/stats.c
> > > > > > > @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const
> > > > > > > struct rpc_task *task,
> > > > > > > 
> > > > > > > 	execute = ktime_sub(now, task->tk_start);
> > > > > > > 	op_metrics->om_execute = ktime_add(op_metrics-
> > > > > > > > om_execute, execute);
> > > > > > > 
> > > > > > > +	if (task->tk_status < 0)
> > > > > > > +		op_metrics->om_error_status++;
> > > > > > > 
> > > > > > > 	spin_unlock(&op_metrics->om_lock);
> > > > > > > 
> > > > > > > @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct
> > > > > > > rpc_iostats *a, struct rpc_iostats *b)
> > > > > > > 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
> > > > > > > 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
> > > > > > > 	a->om_execute = ktime_add(a->om_execute, b-
> > > > > > > > om_execute);
> > > > > > > 
> > > > > > > +	a->om_error_status += b->om_error_status;
> > > > > > > }
> > > > > > > 
> > > > > > > static void _print_rpc_iostats(struct seq_file *seq,
> > > > > > > struct
> > > > > > > rpc_iostats *stats,
> > > > > > > 			       int op, const struct
> > > > > > > rpc_procinfo *procs)
> > > > > > > {
> > > > > > > 	_print_name(seq, op, procs);
> > > > > > > -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu
> > > > > > > %llu\n",
> > > > > > > +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu
> > > > > > > %lu\n",
> > > > > > > 		   stats->om_ops,
> > > > > > > 		   stats->om_ntrans,
> > > > > > > 		   stats->om_timeouts,
> > > > > > > @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct
> > > > > > > seq_file *seq, struct rpc_iostats *stats,
> > > > > > > 		   stats->om_bytes_recv,
> > > > > > > 		   ktime_to_ms(stats->om_queue),
> > > > > > > 		   ktime_to_ms(stats->om_rtt),
> > > > > > > -		   ktime_to_ms(stats->om_execute));
> > > > > > > +		   ktime_to_ms(stats->om_execute),
> > > > > > > +		   stats->om_error_status);
> > > > > > > }
> > > > > > > 
> > > > > > > void rpc_clnt_show_stats(struct seq_file *seq, struct
> > > > > > > rpc_clnt *clnt)
> > > > > > > -- 
> > > > > > > 2.20.1
> > > > > 
> > > > > --
> > > > > Chuck Lever
> > > 
> > > --
> > > Chuck Lever
> 
> --
> Chuck Lever
> 
> 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-03 19:05               ` Dave Wysochanski
@ 2019-06-03 19:08                 ` Chuck Lever
  0 siblings, 0 replies; 18+ messages in thread
From: Chuck Lever @ 2019-06-03 19:08 UTC (permalink / raw)
  To: Dave Wysochanski; +Cc: Bruce Fields, Linux NFS Mailing List



> On Jun 3, 2019, at 3:05 PM, Dave Wysochanski <dwysocha@redhat.com> wrote:
> 
> On Mon, 2019-06-03 at 14:56 -0400, Chuck Lever wrote:
>>> On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com>
>>> wrote:
>>> 
>>> On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
>>>>> On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org
>>>>>> 
>>>>> wrote:
>>>>> 
>>>>> On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
>>>>>> 
>>>>>> 
>>>>>>> On May 30, 2019, at 5:38 PM, bfields@fieldses.org wrote:
>>>>>>> 
>>>>>>> On Thu, May 23, 2019 at 04:13:50PM -0400, Dave Wysochanski
>>>>>>> wrote:
>>>>>>>> We often see various error conditions with NFS4.x that
>>>>>>>> show
>>>>>>>> up with
>>>>>>>> a very high operation count all completing with tk_status
>>>>>>>> < 0
>>>>>>>> in a
>>>>>>>> short period of time.  Add a count to rpc_iostats to
>>>>>>>> record
>>>>>>>> on a
>>>>>>>> per-op basis the ops that complete in this manner, which
>>>>>>>> will
>>>>>>>> enable lower overhead diagnostics.
>>>>>>> 
>>>>>>> Looks like a good idea to me.
>>>>>>> 
>>>>>>> It's too bad we can't distinguish the errors.  (E.g. ESTALE
>>>>>>> on
>>>>>>> a lookup
>>>>>>> call is a lot more interesting than ENOENT.)  But
>>>>>>> understood
>>>>>>> that
>>>>>>> maintaining (number of possible errors) * (number of
>>>>>>> possible
>>>>>>> ops)
>>>>>>> counters is probably overkill, so just counting the number
>>>>>>> of
>>>>>>> errors
>>>>>>> seems like a good start.
>>>>>> 
>>>>>> We now have trace points that can do that too.
>>>>> 
>>>>> You mean, that can report every error (and its value)?
>>>> 
>>>> Yes, the nfs_xdr_status trace point reports the error by value
>>>> and
>>>> symbolic name.
>>>> 
>>> 
>>> The tracepoint is very useful I agree.  I don't think it will show:
>>> a) the mount
>>> b) the opcode
>>> 
>>> Or am I mistaken and there's a way to get those with a filter or
>>> another tracepoint?
>> 
>> The opcode can be exposed by another trace point, but the link
>> between
>> the two trace points is tenuous and could be improved.
>> 
> 
> I think the number is there but it's not decoded!  This was for a WRITE
> completing with BAD_STATEID - very nice:
> 
>   kworker/u16:0-31130 [006] .... 949028.602298: nfs4_xdr_status: operation 38: nfs status -10025 (BAD_STATEID)

OK, I'm smarter than I look. This shows the NFS operation,
not the RPC procedure (which would be NFS4 COMPOUND), which
is the right thing to do.


>> I don't believe any of the NFS trace points expose the mount. My
>> testing
>> is largely on a single mount so my imagination stopped there.
>> 
> 
> Ok thanks for the confirmation.

Well, it's something worth thinking carefully about: if
you see a slick way to make that information available,
have at it.


>>>>> I assume having these statistics in mountstats is still useful,
>>>>> though.
>>>>> 
>>>>> --b.
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> --b.
>>>>>>> 
>>>>>>>> 
>>>>>>>> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
>>>>>>>> ---
>>>>>>>> include/linux/sunrpc/metrics.h | 7 ++++++-
>>>>>>>> net/sunrpc/stats.c             | 8 ++++++--
>>>>>>>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>>>>>>> 
>>>>>>>> diff --git a/include/linux/sunrpc/metrics.h
>>>>>>>> b/include/linux/sunrpc/metrics.h
>>>>>>>> index 1b3751327575..0ee3f7052846 100644
>>>>>>>> --- a/include/linux/sunrpc/metrics.h
>>>>>>>> +++ b/include/linux/sunrpc/metrics.h
>>>>>>>> @@ -30,7 +30,7 @@
>>>>>>>> #include <linux/ktime.h>
>>>>>>>> #include <linux/spinlock.h>
>>>>>>>> 
>>>>>>>> -#define RPC_IOSTATS_VERS	"1.0"
>>>>>>>> +#define RPC_IOSTATS_VERS	"1.1"
>>>>>>>> 
>>>>>>>> struct rpc_iostats {
>>>>>>>> 	spinlock_t		om_lock;
>>>>>>>> @@ -66,6 +66,11 @@ struct rpc_iostats {
>>>>>>>> 	ktime_t			om_queue,	/* queued
>>>>>>>> for
>>>>>>>> xmit */
>>>>>>>> 				om_rtt,		/* RPC
>>>>>>>> RTT */
>>>>>>>> 				om_execute;	/* RPC
>>>>>>>> execution */
>>>>>>>> +	/*
>>>>>>>> +	 * The count of operations that complete with tk_status
>>>>>>>> < 0.
>>>>>>>> +	 * These statuses usually indicate error conditions.
>>>>>>>> +	 */
>>>>>>>> +	unsigned long           om_error_status;
>>>>>>>> } ____cacheline_aligned;
>>>>>>>> 
>>>>>>>> struct rpc_task;
>>>>>>>> diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
>>>>>>>> index 8b2d3c58ffae..737414247ca7 100644
>>>>>>>> --- a/net/sunrpc/stats.c
>>>>>>>> +++ b/net/sunrpc/stats.c
>>>>>>>> @@ -176,6 +176,8 @@ void rpc_count_iostats_metrics(const
>>>>>>>> struct rpc_task *task,
>>>>>>>> 
>>>>>>>> 	execute = ktime_sub(now, task->tk_start);
>>>>>>>> 	op_metrics->om_execute = ktime_add(op_metrics-
>>>>>>>>> om_execute, execute);
>>>>>>>> 
>>>>>>>> +	if (task->tk_status < 0)
>>>>>>>> +		op_metrics->om_error_status++;
>>>>>>>> 
>>>>>>>> 	spin_unlock(&op_metrics->om_lock);
>>>>>>>> 
>>>>>>>> @@ -218,13 +220,14 @@ static void _add_rpc_iostats(struct
>>>>>>>> rpc_iostats *a, struct rpc_iostats *b)
>>>>>>>> 	a->om_queue = ktime_add(a->om_queue, b->om_queue);
>>>>>>>> 	a->om_rtt = ktime_add(a->om_rtt, b->om_rtt);
>>>>>>>> 	a->om_execute = ktime_add(a->om_execute, b-
>>>>>>>>> om_execute);
>>>>>>>> 
>>>>>>>> +	a->om_error_status += b->om_error_status;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> static void _print_rpc_iostats(struct seq_file *seq,
>>>>>>>> struct
>>>>>>>> rpc_iostats *stats,
>>>>>>>> 			       int op, const struct
>>>>>>>> rpc_procinfo *procs)
>>>>>>>> {
>>>>>>>> 	_print_name(seq, op, procs);
>>>>>>>> -	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu
>>>>>>>> %llu\n",
>>>>>>>> +	seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu
>>>>>>>> %lu\n",
>>>>>>>> 		   stats->om_ops,
>>>>>>>> 		   stats->om_ntrans,
>>>>>>>> 		   stats->om_timeouts,
>>>>>>>> @@ -232,7 +235,8 @@ static void _print_rpc_iostats(struct
>>>>>>>> seq_file *seq, struct rpc_iostats *stats,
>>>>>>>> 		   stats->om_bytes_recv,
>>>>>>>> 		   ktime_to_ms(stats->om_queue),
>>>>>>>> 		   ktime_to_ms(stats->om_rtt),
>>>>>>>> -		   ktime_to_ms(stats->om_execute));
>>>>>>>> +		   ktime_to_ms(stats->om_execute),
>>>>>>>> +		   stats->om_error_status);
>>>>>>>> }
>>>>>>>> 
>>>>>>>> void rpc_clnt_show_stats(struct seq_file *seq, struct
>>>>>>>> rpc_clnt *clnt)
>>>>>>>> -- 
>>>>>>>> 2.20.1
>>>>>> 
>>>>>> --
>>>>>> Chuck Lever
>>>> 
>>>> --
>>>> Chuck Lever
>> 
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-03 18:56             ` Chuck Lever
  2019-06-03 19:05               ` Dave Wysochanski
@ 2019-06-04 14:45               ` Bruce Fields
  2019-06-04 14:56                 ` Trond Myklebust
  2019-06-07 14:27                 ` Dave Wysochanski
  1 sibling, 2 replies; 18+ messages in thread
From: Bruce Fields @ 2019-06-04 14:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Dave Wysochanski, Linux NFS Mailing List

On Mon, Jun 03, 2019 at 02:56:29PM -0400, Chuck Lever wrote:
> > On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com> wrote:
> > On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
> >>> On May 30, 2019, at 6:33 PM, Bruce Fields <bfields@fieldses.org>
> >>> wrote:
> >>> On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> >>>> We now have trace points that can do that too.
> >>> 
> >>> You mean, that can report every error (and its value)?
> >> 
> >> Yes, the nfs_xdr_status trace point reports the error by value and
> >> symbolic name.
> > 
> > The tracepoint is very useful I agree.  I don't think it will show:
> > a) the mount
> > b) the opcode
> > 
> > Or am I mistaken and there's a way to get those with a filter or
> > another tracepoint?
> 
> The opcode can be exposed by another trace point, but the link between
> the two trace points is tenuous and could be improved.
> 
> I don't believe any of the NFS trace points expose the mount. My testing
> is largely on a single mount so my imagination stopped there.

Dumb question: is it possible to add more fields to tracepoints without
breaking some kind of backwards compatibility?

I wonder if adding, say, an xid and an xprt pointer to tracepoints when
available would help with this kind of thing.

In any case, I think Dave's stats will still be handy if only because
they're on all the time.

--b.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-04 14:45               ` Bruce Fields
@ 2019-06-04 14:56                 ` Trond Myklebust
  2019-06-07 14:27                 ` Dave Wysochanski
  1 sibling, 0 replies; 18+ messages in thread
From: Trond Myklebust @ 2019-06-04 14:56 UTC (permalink / raw)
  To: bfields, chuck.lever; +Cc: linux-nfs, dwysocha

On Tue, 2019-06-04 at 10:45 -0400, Bruce Fields wrote:
> On Mon, Jun 03, 2019 at 02:56:29PM -0400, Chuck Lever wrote:
> > > On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com
> > > > wrote:
> > > On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
> > > > > On May 30, 2019, at 6:33 PM, Bruce Fields <
> > > > > bfields@fieldses.org>
> > > > > wrote:
> > > > > On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> > > > > > We now have trace points that can do that too.
> > > > > 
> > > > > You mean, that can report every error (and its value)?
> > > > 
> > > > Yes, the nfs_xdr_status trace point reports the error by value
> > > > and
> > > > symbolic name.
> > > 
> > > The tracepoint is very useful I agree.  I don't think it will
> > > show:
> > > a) the mount
> > > b) the opcode
> > > 
> > > Or am I mistaken and there's a way to get those with a filter or
> > > another tracepoint?
> > 
> > The opcode can be exposed by another trace point, but the link
> > between
> > the two trace points is tenuous and could be improved.
> > 
> > I don't believe any of the NFS trace points expose the mount. My
> > testing
> > is largely on a single mount so my imagination stopped there.
> 
> Dumb question: is it possible to add more fields to tracepoints
> without
> breaking some kind of backwards compatibility?

Tracepoints are not to be considered an API.

> I wonder if adding, say, an xid and an xprt pointer to tracepoints
> when
> available would help with this kind of thing.
> 
> In any case, I think Dave's stats will still be handy if only because
> they're on all the time.

Given that someone who shall remain anonymous added the struct rpc_rqst
to the struct xdr_stream for "debugging purposes", it should be quite
possible to add both the XID and the xprt in this case.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0
  2019-06-04 14:45               ` Bruce Fields
  2019-06-04 14:56                 ` Trond Myklebust
@ 2019-06-07 14:27                 ` Dave Wysochanski
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Wysochanski @ 2019-06-07 14:27 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: Linux NFS Mailing List

On Tue, 2019-06-04 at 10:45 -0400, Bruce Fields wrote:
> On Mon, Jun 03, 2019 at 02:56:29PM -0400, Chuck Lever wrote:
> > > On Jun 3, 2019, at 2:53 PM, Dave Wysochanski <dwysocha@redhat.com
> > > > wrote:
> > > On Fri, 2019-05-31 at 09:25 -0400, Chuck Lever wrote:
> > > > > On May 30, 2019, at 6:33 PM, Bruce Fields <
> > > > > bfields@fieldses.org>
> > > > > wrote:
> > > > > On Thu, May 30, 2019 at 06:19:54PM -0400, Chuck Lever wrote:
> > > > > > We now have trace points that can do that too.
> > > > > 
> > > > > You mean, that can report every error (and its value)?
> > > > 
> > > > Yes, the nfs_xdr_status trace point reports the error by value
> > > > and
> > > > symbolic name.
> > > 
> > > The tracepoint is very useful I agree.  I don't think it will
> > > show:
> > > a) the mount
> > > b) the opcode
> > > 
> > > Or am I mistaken and there's a way to get those with a filter or
> > > another tracepoint?
> > 
> > The opcode can be exposed by another trace point, but the link
> > between
> > the two trace points is tenuous and could be improved.
> > 
> > I don't believe any of the NFS trace points expose the mount. My
> > testing
> > is largely on a single mount so my imagination stopped there.
> 
> Dumb question: is it possible to add more fields to tracepoints
> without
> breaking some kind of backwards compatibility?
> 
> I wonder if adding, say, an xid and an xprt pointer to tracepoints
> when
> available would help with this kind of thing.
> 
> In any case, I think Dave's stats will still be handy if only because
> they're on all the time.
> 
> --b.

Trond or Anna, will you take this series for mountstats or are you
opposed to it?

I think it is useful in conjuction with the tracepoints because it is
always on and easy to know which mount is involved (we often start with
a customer saying mount XYZ has some issue or is hanging).  If you see
problems or want other testing please let me know.

Thanks!


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-06-07 14:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-23 20:13 [PATCH 1/3] SUNRPC: Move call to rpc_count_iostats before rpc_call_done Dave Wysochanski
2019-05-23 20:13 ` [PATCH 2/3] SUNRPC: Use proper printk specifiers for unsigned long long Dave Wysochanski
2019-05-23 20:13 ` [PATCH 3/3] SUNRPC: Count ops completing with tk_status < 0 Dave Wysochanski
2019-05-30 21:38   ` J. Bruce Fields
2019-05-30 22:19     ` Chuck Lever
2019-05-30 22:33       ` Bruce Fields
2019-05-31 13:25         ` Chuck Lever
2019-06-03 18:53           ` Dave Wysochanski
2019-06-03 18:56             ` Chuck Lever
2019-06-03 19:05               ` Dave Wysochanski
2019-06-03 19:08                 ` Chuck Lever
2019-06-04 14:45               ` Bruce Fields
2019-06-04 14:56                 ` Trond Myklebust
2019-06-07 14:27                 ` Dave Wysochanski
2019-05-31  0:17     ` David Wysochanski
2019-05-23 20:13 ` [PATCH] mountstats: add per-op error counts for mountstats command Dave Wysochanski
2019-05-23 20:33   ` [PATCH v2] " Dave Wysochanski
2019-06-03 14:31     ` Steve Dickson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.