From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH] SUNRPC: Remove rpc_xprt::tsh_size
Date: Thu, 3 Jan 2019 15:53:56 -0500
Message-ID: <90B38E07-3241-4CCD-A4C8-AB78BADFB0CD@oracle.com>
In-Reply-To: <0331de80b8161f8bf16a92de20049cafb0c228da.camel@hammerspace.com>
> On Jan 3, 2019, at 1:47 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Thu, 2019-01-03 at 13:29 -0500, Chuck Lever wrote:
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index d5ce1a8..66b08aa 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -678,6 +678,31 @@ static void xs_stream_data_receive_workfn(struct work_struct *work)
>>
>> #define XS_SENDMSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL)
>>
>> +static int xs_send_record_marker(struct sock_xprt *transport,
>> +				 const struct rpc_rqst *req)
>> +{
>> +	static struct msghdr msg = {
>> +		.msg_name = NULL,
>> +		.msg_namelen = 0,
>> +		.msg_flags = (XS_SENDMSG_FLAGS | MSG_MORE),
>> +	};
>> +	rpc_fraghdr marker;
>> +	struct kvec iov = {
>> +		.iov_base = &marker,
>> +		.iov_len = sizeof(marker),
>> +	};
>> +	u32 reclen;
>> +
>> +	if (unlikely(!transport->sock))
>> +		return -ENOTSOCK;
>> +	if (req->rq_bytes_sent)
>> +		return 0;
>
> The test needs to use transport->xmit.offset, not req->rq_bytes_sent.
OK, that seems to work better.
> You also need to update transport->xmit.offset on success,
That causes the first 4 bytes of rq_snd_buf not to be sent.
Not updating xmit.offset seems more correct.
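For the archive, here's roughly what I'm trying now. Sketch only, not a
final patch: the same helper with the check switched to
transport->xmit.offset, and with the offset deliberately left alone on
success.

/* Sketch: gate on transport->xmit.offset instead of req->rq_bytes_sent.
 * The offset is intentionally not advanced here, so the marker bytes
 * are never counted against rq_snd_buf.
 */
static int xs_send_record_marker(struct sock_xprt *transport,
				 const struct rpc_rqst *req)
{
	static struct msghdr msg = {
		.msg_flags = (XS_SENDMSG_FLAGS | MSG_MORE),
	};
	rpc_fraghdr marker;
	struct kvec iov = {
		.iov_base = &marker,
		.iov_len = sizeof(marker),
	};

	if (unlikely(!transport->sock))
		return -ENOTSOCK;
	/* A non-zero offset means we are resuming a partially sent
	 * request, so the record marker already went out. */
	if (transport->xmit.offset != 0)
		return 0;

	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | req->rq_snd_buf.len);
	return kernel_sendmsg(transport->sock, &msg, &iov, 1, iov.iov_len);
}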
> and be
> prepared to handle the case where < sizeof(marker) bytes get
> transmitted due to a write_space condition.
Probably the only recourse is to break the connection.
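Concretely, I'm picturing something along these lines in the send path.
Just a sketch; the exact plumbing, and whether xprt_force_disconnect()
is the right hammer here, is still an open question.

	/* Sketch of caller-side handling, e.g. in xs_tcp_send_request();
	 * error codes are illustrative only. */
	status = xs_send_record_marker(transport, req);
	if (status >= 0 && status < (int)sizeof(rpc_fraghdr)) {
		/* A short write of the 4-byte marker cannot be resumed,
		 * because xmit.offset does not cover the marker bytes.
		 * Kill the connection and let the client retransmit the
		 * whole request on a fresh one. */
		xprt_force_disconnect(req->rq_xprt);
		return -ENOTCONN;
	}
	if (status < 0)
		return status;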
>> +
>> +	reclen = req->rq_snd_buf.len;
>> +	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | reclen);
>> +	return kernel_sendmsg(transport->sock, &msg, &iov, 1, iov.iov_len);
>
>
> So what does this do for performance? I'd expect that adding another
> dive into the socket layer will come with penalties.
NFSv3 on TCP, sec=sys, 56Gb/s IPoIB, v4.20 + my v4.21 patches
fio, 8KB random, 70% read, 30% write, 16 threads, iodepth=16
Without this patch:
read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51092msec)
write: IOPS=12.3k, BW=96.3MiB/s (101MB/s)(4918MiB/51092msec)
With this patch:
read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
Seems like that's in the noise.
--
Chuck Lever