From: Badari Pulavarty <pbadari@us.ibm.com>
To: linux-nfs@vger.kernel.org
Subject: [RFC][PATCH] Vector read/write support for NFS (DIO) client
Date: Tue, 12 Apr 2011 08:32:15 -0700
Message-ID: <1302622335.3877.62.camel@badari-desktop>

Hi,

We recently ran into a serious performance issue with the NFS client.
It turned out to be due to the lack of readv/writev support in the
NFS (O_DIRECT) client.

Here is our use-case:

In our cloud environment, our storage is over NFS. Files on NFS
are passed to the guest as block devices (using O_DIRECT). When
the guest does IO on these block devices, the IOs end up as
O_DIRECT writes to NFS (on the KVM host).

QEMU (on the host) gets a vector from the virtio ring and submits
it. Old versions of QEMU linearized the vector they got from KVM
(copied it into a single buffer) and submitted that buffer, so the
NFS client always received a single buffer.

Later versions of QEMU eliminated this copy and submit the vector
directly using preadv()/pwritev().
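
For illustration, here is a minimal userspace sketch (not actual
QEMU code; the constants and helper name are made up) of how such a
vector is submitted with pwritev() on an O_DIRECT file descriptor:

#define _GNU_SOURCE		/* for O_DIRECT and pwritev() */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEGS	64		/* 64 x 4K = 256K, as in the example below */
#define SEGSZ	4096

static ssize_t submit_vector(const char *path, off_t pos)
{
	struct iovec iov[SEGS];
	int fd, i;

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return -1;
	for (i = 0; i < SEGS; i++) {
		/* O_DIRECT needs aligned buffers; guest pages are
		 * naturally page-aligned and page-sized */
		if (posix_memalign(&iov[i].iov_base, SEGSZ, SEGSZ))
			return -1;	/* cleanup elided for brevity */
		iov[i].iov_len = SEGSZ;
	}
	/* one syscall carries the whole vector to the NFS client */
	return pwritev(fd, iov, SEGS, pos);
}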

The NFS client loops through the vector and submits each element as
a separate request for each IO < wsize. In our case (negotiated
wsize=1MB), a 256K IO arrives as 64 vectors of 4K each, so we end
up submitting 64 4K FILE_SYNC IOs. The server has to complete each
4K IO synchronously, which causes serious performance degradation.
We are trying to see whether performance improves if we convert the
IOs to ASYNC, but our initial results don't look good.

Supporting readv/writev in the NFS client for all possible cases is
hard. However, if all the vectors are page-aligned and the IO sizes
are page-multiples, the case fits the current code easily. Luckily,
the QEMU use case meets these requirements.

Here is the patch to add this support. Comments?

Thanks,
Badari

Add special readv/writev support for NFS O_DIRECT. Currently the
NFS client O_DIRECT read/write path iterates over the individual
elements in the vector, submits an IO for each, and waits for them
to finish. If an individual IO size is < r/wsize, each IO will be
FILE_SYNC, and the server has to finish each of the smaller IOs
synchronously. This causes serious performance degradation.

This patch adds a special case that submits one larger IO from the
client, but only when all the IO buffers are page-aligned, each
individual IO is a page-multiple, and the total IO size is no
larger than r/wsize.

This is the common case for QEMU usage.
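
For testing, the eligibility rule can be mirrored in userspace. A
minimal sketch, assuming a 4K PAGE_SIZE and that the caller knows
the mount's negotiated wsize (the helper name is made up):

#include <stdint.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096UL

static int qualifies_for_single_io(const struct iovec *iov,
				   unsigned long nr_segs, size_t wsize)
{
	size_t total = 0;
	unsigned long seg;

	if (nr_segs == 1)
		return 0;	/* single segment: the old path handles it */
	for (seg = 0; seg < nr_segs; seg++) {
		if (((uintptr_t)iov[seg].iov_base & (PAGE_SIZE - 1)) ||
		    (iov[seg].iov_len & (PAGE_SIZE - 1)))
			return 0;	/* misaligned: kernel falls back */
		total += iov[seg].iov_len;
	}
	return total <= wsize;	/* must fit in a single RPC */
}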

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
---
 fs/nfs/direct.c |  259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 259 insertions(+)

Index: linux-2.6.38.2/fs/nfs/direct.c
===================================================================
--- linux-2.6.38.2.orig/fs/nfs/direct.c	2011-04-12 10:56:29.374266292 -0400
+++ linux-2.6.38.2/fs/nfs/direct.c	2011-04-12 11:01:21.883266310 -0400
@@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
 	.rpc_release = nfs_direct_read_release,
 };
 
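+/*
+ * Return the total number of pages in the vector if every segment
+ * is page-aligned and a page-multiple and the whole vector fits in
+ * one r/wsize request; return 0 to fall back to the per-segment
+ * code.  @rw is non-zero for a write, zero for a read.
+ */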
+static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
+				      const struct iovec *iov,
+				      unsigned long nr_segs, int rw)
+{
+	unsigned long seg;
+	int pages = 0;
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	size_t size;
+
+	/* If it's a single vector, use the old code */
+	if (nr_segs == 1)
+		return pages;
+	/*
+	 * Check whether all the IO buffers are page-aligned and all
+	 * the individual IO sizes are page-multiples.
+	 * If so, we can submit the whole vector as a single IO.
+	 * Otherwise, fall back to the old code.
+	 */
+	for (seg = 0; seg < nr_segs; seg++) {
+		unsigned long user_addr = (unsigned long)iov[seg].iov_base;
+
+		if ((user_addr & (PAGE_SIZE - 1)) ||
+			(iov[seg].iov_len & (PAGE_SIZE - 1))) {
+			pages = 0;
+			break;
+		}
+		pages += (iov[seg].iov_len >> PAGE_SHIFT);
+	}
+
+	if (rw)
+		size = NFS_SERVER(inode)->wsize;
+	else
+		size = NFS_SERVER(inode)->rsize;
+
+	/*
+	 * If the total IO size is > wsize/rsize, we would have to split
+	 * the IO into r/wsize chunks anyway - fall back to the old code.
+	 */
+	if (pages > (size >> PAGE_SHIFT))
+		pages = 0;
+
+	return pages;
+}
+
 /*
  * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
  * operation.  If nfs_readdata_alloc() or get_user_pages() fails,
@@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
 	return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
+/*
+ * Special case: efficient readv() support - only if all the IO buffers
+ * in the vector are page-aligned, the IO sizes are page-multiples, and
+ * the total IO size is <= rsize. We can then map all of the vectors
+ * together and submit them as a single IO instead of operating on one
+ * vector entry at a time.
+ */
+static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
+						 const struct iovec *iov,
+						 unsigned long nr_segs,
+						 int pages,
+						 loff_t pos)
+{
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	unsigned long user_addr = (unsigned long)iov->iov_base;
+	size_t bytes = pages << PAGE_SHIFT;
+	struct rpc_task *task;
+	struct rpc_message msg = {
+		.rpc_cred = ctx->cred,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.rpc_client = NFS_CLIENT(inode),
+		.rpc_message = &msg,
+		.callback_ops = &nfs_read_direct_ops,
+		.workqueue = nfsiod_workqueue,
+		.flags = RPC_TASK_ASYNC,
+	};
+	unsigned int pgbase;
+	int result;
+	int seg;
+	int mapped = 0;
+	int nr_pages;
+	ssize_t started = 0;
+	struct nfs_read_data *data;
+
+	result = -ENOMEM;
+	data = nfs_readdata_alloc(pages);
+	if (unlikely(!data))
+		return result;
+
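+	/* all segments were verified page-aligned, so pgbase is always 0 */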
+	pgbase = user_addr & ~PAGE_MASK;
+
+	for (seg = 0; seg < nr_segs; seg++) {
+		user_addr = (unsigned long)iov[seg].iov_base;
+		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
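+		/* readv: data lands in these pages, so map them writable */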
+		down_read(&current->mm->mmap_sem);
+		result = get_user_pages(current, current->mm, user_addr,
+					nr_pages, 1, 0, &data->pagevec[mapped],
+					NULL);
+		up_read(&current->mm->mmap_sem);
+		if (result < 0) {
+			/* release the pages mapped so far */
+			nfs_direct_release_pages(data->pagevec, mapped);
+			nfs_readdata_free(data);
+			return result;
+		}
+		mapped += result;
+	}
+
+	get_dreq(dreq);
+
+	data->req = (struct nfs_page *) dreq;
+	data->inode = inode;
+	data->cred = msg.rpc_cred;
+	data->args.fh = NFS_FH(inode);
+	data->args.context = ctx;
+	data->args.offset = pos;
+	data->args.pgbase = pgbase;
+	data->args.pages = data->pagevec;
+	data->args.count = bytes;
+	data->res.fattr = &data->fattr;
+	data->res.eof = 0;
+	data->res.count = bytes;
+	nfs_fattr_init(&data->fattr);
+
+	task_setup_data.task = &data->task;
+	task_setup_data.callback_data = data;
+	msg.rpc_argp = &data->args;
+	msg.rpc_resp = &data->res;
+	NFS_PROTO(inode)->read_setup(data, &msg);
+
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		goto out;
+	rpc_put_task(task);
+
+	started += bytes;
+out:
+	if (started)
+		return started;
+	return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
 static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 					      const struct iovec *iov,
 					      unsigned long nr_segs,
@@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_
 
 	get_dreq(dreq);
 
+	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
+	if (result) {
+		result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
+							result, pos);
+		if (result < 0)
+			goto out;
+		requested_bytes += result;
+		pos += result;
+		goto out;
+	}
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
 		result = nfs_direct_read_schedule_segment(dreq, vec, pos);
@@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
 		return result < 0 ? result : -EIO;
 	}
 
+out:
 	if (put_dreq(dreq))
 		nfs_direct_complete(dreq);
 	return 0;
@@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
 	return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
+/*
+ * Special case: efficient writev() support - only if all the IO buffers
+ * in the vector are page-aligned, the IO sizes are page-multiples, and
+ * the total IO size is <= wsize. We can then map all of the vectors
+ * together and submit them as a single IO instead of operating on one
+ * vector entry at a time.
+ */
+static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
+						 const struct iovec *iov,
+						 unsigned long nr_segs,
+						 int pages,
+						 loff_t pos, int sync)
+{
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	unsigned long user_addr = (unsigned long)iov->iov_base;
+	size_t bytes = pages << PAGE_SHIFT;
+	struct rpc_task *task;
+	struct rpc_message msg = {
+		.rpc_cred = ctx->cred,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.rpc_client = NFS_CLIENT(inode),
+		.rpc_message = &msg,
+		.callback_ops = &nfs_write_direct_ops,
+		.workqueue = nfsiod_workqueue,
+		.flags = RPC_TASK_ASYNC,
+	};
+	unsigned int pgbase;
+	int result;
+	int seg;
+	int mapped = 0;
+	int nr_pages;
+	ssize_t started = 0;
+	struct nfs_write_data *data;
+
+	result = -ENOMEM;
+	data = nfs_writedata_alloc(pages);
+	if (unlikely(!data))
+		return result;
+
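+	/* all segments were verified page-aligned, so pgbase is always 0 */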
+	pgbase = user_addr & ~PAGE_MASK;
+
+	for (seg = 0; seg < nr_segs; seg++) {
+		user_addr = (unsigned long)iov[seg].iov_base;
+		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
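+		/* writev: pages are only read from, so no writable mapping needed */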
+		down_read(&current->mm->mmap_sem);
+		result = get_user_pages(current, current->mm, user_addr,
+					nr_pages, 0, 0, &data->pagevec[mapped], NULL);
+		up_read(&current->mm->mmap_sem);
+		if (result < 0) {
+			/* release the pages mapped so far */
+			nfs_direct_release_pages(data->pagevec, mapped);
+			nfs_writedata_free(data);
+			return result;
+		}
+		mapped += result;
+	}
+
+	get_dreq(dreq);
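+	/* track this write so it can be resent if the commit fails */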
+	list_move_tail(&data->pages, &dreq->rewrite_list);
+
+	data->req = (struct nfs_page *) dreq;
+	data->inode = inode;
+	data->cred = msg.rpc_cred;
+	data->args.fh = NFS_FH(inode);
+	data->args.context = ctx;
+	data->args.offset = pos;
+	data->args.pgbase = pgbase;
+	data->args.pages = data->pagevec;
+	data->args.count = bytes;
+	data->args.stable = sync;
+	data->res.fattr = &data->fattr;
+	data->res.count = bytes;
+	data->res.verf = &data->verf;
+	nfs_fattr_init(&data->fattr);
+
+	task_setup_data.task = &data->task;
+	task_setup_data.callback_data = data;
+	msg.rpc_argp = &data->args;
+	msg.rpc_resp = &data->res;
+	NFS_PROTO(inode)->write_setup(data, &msg);
+
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		goto out;
+	rpc_put_task(task);
+
+	started += bytes;
+out:
+	if (started)
+		return started;
+	return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       const struct iovec *iov,
 					       unsigned long nr_segs,
@@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule
 
 	get_dreq(dreq);
 
+	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
+	if (result) {
+		result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
+							result, pos, sync);
+		if (result < 0)
+			goto out;
+		requested_bytes += result;
+		pos += result;
+		goto out;
+	}
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
 		result = nfs_direct_write_schedule_segment(dreq, vec,
@@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
 		return result < 0 ? result : -EIO;
 	}
 
+out:
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq, dreq->inode);
 	return 0;


