linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* error=Invalid slot
@ 2019-04-15 15:04 Chuck Lever
  2019-04-15 16:05 ` Trond Myklebust
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Lever @ 2019-04-15 15:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

Just happened again. Any thoughts about where I should start looking?

Mon Apr 15 11:01:40 EDT 2019
4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1024
...
fio-3.1
Starting 12 processes
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
4k100test: Laying out IO file (1 file / 1024MiB)
fio: native_fallocate call failed: Operation not supported
fio: io_u error on file 4k100test.7.0: Invalid slot: read offset=938229760, buflen=4096
fio: pid=2653, err=57/file:io_u.c:1756, func=io_u error, error=Invalid slot
fio: io_u error on file 4k100test.10.0: Invalid slot: read offset=331800576, buflen=40963s]
fio: pid=2650, err=57/file:io_u.c:1756, func=io_u error, error=Invalid slot
Jobs: 10 (f=10): [r(2),X(1),r(2),X(1),r(6)][100.0%][r=617MiB/s,w=0KiB/s][r=158k,w=0 IOPS][eta 00m:00s]
4k100test: (groupid=0, jobs=12): err=57 (file:io_u.c:1756, func=io_u error, error=Invalid slot): pid=2648: Mon Apr 15 11:02:25 2019
   read: IOPS=157k, BW=613MiB/s (643MB/s)(11.4GiB/19024msec)
    slat (usec): min=2, max=9808, avg= 7.36, stdev=30.80
    clat (usec): min=43, max=1236.1k, avg=74443.93, stdev=85310.03
     lat (usec): min=134, max=1236.1k, avg=74451.53, stdev=85309.77
    clat percentiles (usec):
     |  1.00th=[   807],  5.00th=[  1287], 10.00th=[  1647], 20.00th=[  2376],
     | 30.00th=[  3490], 40.00th=[  8160], 50.00th=[ 69731], 60.00th=[ 76022],
     | 70.00th=[ 80217], 80.00th=[139461], 90.00th=[181404], 95.00th=[233833],
     | 99.00th=[375391], 99.50th=[438305], 99.90th=[549454], 99.95th=[616563],
     | 99.99th=[750781]
   bw (  KiB/s): min=46941, max=70036, per=8.73%, avg=54819.99, stdev=4131.49, samples=430
   iops        : min=11735, max=17509, avg=13704.80, stdev=1032.88, samples=430
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.14%, 750=0.63%
  lat (usec)   : 1000=1.48%
  lat (msec)   : 2=12.84%, 4=17.70%, 10=7.87%, 20=0.77%, 50=0.21%
  lat (msec)   : 100=34.04%, 250=20.30%, 500=3.82%, 750=0.20%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=2.04%, sys=7.06%, ctx=698855, majf=0, minf=6299
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=2986716,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: bw=613MiB/s (643MB/s), 613MiB/s-613MiB/s (643MB/s-643MB/s), io=11.4GiB (12.2GB), run=19024-19024msec

--
Chuck Lever




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: error=Invalid slot
  2019-04-15 15:04 error=Invalid slot Chuck Lever
@ 2019-04-15 16:05 ` Trond Myklebust
  2019-04-15 16:11   ` Chuck Lever
  2019-04-17 19:50   ` Chuck Lever
  0 siblings, 2 replies; 5+ messages in thread
From: Trond Myklebust @ 2019-04-15 16:05 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs

Hi Chuck,


On Mon, 2019-04-15 at 11:04 -0400, Chuck Lever wrote:
> Just happened again. Any thoughts about where I should start looking?
> 
> Mon Apr 15 11:01:40 EDT 2019
> 4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> (T) 4096B-4096B, ioengine=libaio, iodepth=1024
> ...
> fio-3.1
> Starting 12 processes
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> fio: io_u error on file 4k100test.7.0: Invalid slot: read
> offset=938229760, buflen=4096

Does the following patch fix the race?

8<--------------------------------------
From 4c8759eafad9bb7ea2626a53296e30618aeefcc7 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@hammerspace.com>
Date: Mon, 15 Apr 2019 11:54:13 -0400
Subject: [PATCH] SUNRPC: Ignore queue transmission errors on successful
 transmission

If a request transmission fails due to write space or slot unavailability
errors, but the queued task then gets transmitted before it has time to
process the error in call_transmit_status() or call_bc_transmit_status(),
we need to suppress the transmission error code to prevent it from leaking
out of the RPC layer.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 net/sunrpc/clnt.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index fa900bb44cd5..369a2648dafc 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2101,8 +2101,8 @@ call_transmit_status(struct rpc_task *task)
 	 * test first.
 	 */
 	if (rpc_task_transmitted(task)) {
-		if (task->tk_status == 0)
-			xprt_request_wait_receive(task);
+		task->tk_status = 0;
+		xprt_request_wait_receive(task);
 		return;
 	}
 
@@ -2187,6 +2187,9 @@ call_bc_transmit_status(struct rpc_task *task)
 {
 	struct rpc_rqst *req = task->tk_rqstp;
 
+	if (rpc_task_transmitted(task))
+		task->tk_status = 0;
+
 	dprint_status(task);
 
 	switch (task->tk_status) {
-- 
2.20.1

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: error=Invalid slot
  2019-04-15 16:05 ` Trond Myklebust
@ 2019-04-15 16:11   ` Chuck Lever
  2019-04-15 16:24     ` Trond Myklebust
  2019-04-17 19:50   ` Chuck Lever
  1 sibling, 1 reply; 5+ messages in thread
From: Chuck Lever @ 2019-04-15 16:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Apr 15, 2019, at 12:05 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> Hi Chuck,
> 
> 
> On Mon, 2019-04-15 at 11:04 -0400, Chuck Lever wrote:
>> Just happened again. Any thoughts about where I should start looking?
>> 
>> Mon Apr 15 11:01:40 EDT 2019
>> 4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
>> (T) 4096B-4096B, ioengine=libaio, iodepth=1024
>> ...
>> fio-3.1
>> Starting 12 processes
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> fio: io_u error on file 4k100test.7.0: Invalid slot: read
>> offset=938229760, buflen=4096
> 
> Does the following patch fix the race?
> 
> 8<--------------------------------------
> From 4c8759eafad9bb7ea2626a53296e30618aeefcc7 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> Date: Mon, 15 Apr 2019 11:54:13 -0400
> Subject: [PATCH] SUNRPC: Ignore queue transmission errors on successful
> transmission
> 
> If a request transmission fails due to write space or slot unavailability
> errors, but the queued task then gets transmitted before it has time to
> process the error in call_transmit_status() or call_bc_transmit_status(),
> we need to suppress the transmission error code to prevent it from leaking
> out of the RPC layer.
> 
> Reported-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/clnt.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index fa900bb44cd5..369a2648dafc 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -2101,8 +2101,8 @@ call_transmit_status(struct rpc_task *task)
> 	 * test first.
> 	 */
> 	if (rpc_task_transmitted(task)) {
> -		if (task->tk_status == 0)
> -			xprt_request_wait_receive(task);
> +		task->tk_status = 0;
> +		xprt_request_wait_receive(task);
> 		return;
> 	}
> 
> @@ -2187,6 +2187,9 @@ call_bc_transmit_status(struct rpc_task *task)
> {
> 	struct rpc_rqst *req = task->tk_rqstp;
> 
> +	if (rpc_task_transmitted(task))
> +		task->tk_status = 0;
> +
> 	dprint_status(task);
> 
> 	switch (task->tk_status) {

I was about to try something like this.

I don't have a 100% reproducer. I will apply your patch and
wait for the problem to appear over the next few days.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: error=Invalid slot
  2019-04-15 16:11   ` Chuck Lever
@ 2019-04-15 16:24     ` Trond Myklebust
  0 siblings, 0 replies; 5+ messages in thread
From: Trond Myklebust @ 2019-04-15 16:24 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs

On Mon, 2019-04-15 at 12:11 -0400, Chuck Lever wrote:
> > 
> I was about to try something like this.
> 

So, the reason for that approach is that I believe we can currently
leak ENOBUFS and possibly other errors in addition to EBADSLT through
the same race.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: error=Invalid slot
  2019-04-15 16:05 ` Trond Myklebust
  2019-04-15 16:11   ` Chuck Lever
@ 2019-04-17 19:50   ` Chuck Lever
  1 sibling, 0 replies; 5+ messages in thread
From: Chuck Lever @ 2019-04-17 19:50 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Apr 15, 2019, at 12:05 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> Hi Chuck,
> 
> 
> On Mon, 2019-04-15 at 11:04 -0400, Chuck Lever wrote:
>> Just happened again. Any thoughts about where I should start looking?
>> 
>> Mon Apr 15 11:01:40 EDT 2019
>> 4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
>> (T) 4096B-4096B, ioengine=libaio, iodepth=1024
>> ...
>> fio-3.1
>> Starting 12 processes
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> fio: io_u error on file 4k100test.7.0: Invalid slot: read
>> offset=938229760, buflen=4096
> 
> Does the following patch fix the race?
> 
> 8<--------------------------------------
> From 4c8759eafad9bb7ea2626a53296e30618aeefcc7 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> Date: Mon, 15 Apr 2019 11:54:13 -0400
> Subject: [PATCH] SUNRPC: Ignore queue transmission errors on successful
> transmission
> 
> If a request transmission fails due to write space or slot unavailability
> errors, but the queued task then gets transmitted before it has time to
> process the error in call_transmit_status() or call_bc_transmit_status(),
> we need to suppress the transmission error code to prevent it from leaking
> out of the RPC layer.
> 
> Reported-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/clnt.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index fa900bb44cd5..369a2648dafc 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -2101,8 +2101,8 @@ call_transmit_status(struct rpc_task *task)
> 	 * test first.
> 	 */
> 	if (rpc_task_transmitted(task)) {
> -		if (task->tk_status == 0)
> -			xprt_request_wait_receive(task);
> +		task->tk_status = 0;
> +		xprt_request_wait_receive(task);
> 		return;
> 	}
> 
> @@ -2187,6 +2187,9 @@ call_bc_transmit_status(struct rpc_task *task)
> {
> 	struct rpc_rqst *req = task->tk_rqstp;
> 
> +	if (rpc_task_transmitted(task))
> +		task->tk_status = 0;
> +
> 	dprint_status(task);
> 
> 	switch (task->tk_status) {
> -- 
> 2.20.1
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

I haven't been able to reproduce the problem.

Tested-by: Chuck Lever <chuck.lever@oracle.com>


--
Chuck Lever




^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-04-17 19:50 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-15 15:04 error=Invalid slot Chuck Lever
2019-04-15 16:05 ` Trond Myklebust
2019-04-15 16:11   ` Chuck Lever
2019-04-15 16:24     ` Trond Myklebust
2019-04-17 19:50   ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).