All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC-V2] [PATCH 0/7] Zero Copy
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:55 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> In this patch series I am trying to take another stab at zero copy. 
> Please review and provide your feedback.

Something went wrong in my scripts and the subject got wrong for all
patches in the series. First patch subject is copied to other patches.
I will respond each patch with correct subject. Sorry for the inconvenience.

- JV

> 
> Goal:
> 
> 9P Linux client makes an additional copy of read/write buffer into the kernel 
> buffer.  There are some transports(especially in the virtualization 
> environment) which can avoid this additional copy by directly sending user 
> buffer to the server.
> 
> Design Goals.
> 
> - Have minimal changes to the net layer so that common code is not polluted by 
>   the transport specifics.
> - Create a common transport library which can be used by other transports.
> - Avoid additional optimizations in the initial attempt (more details below) 
>   and focus on achieving basic functionality. 
> 
> Design
> 
> Send the payload buffers separately to the transport layer if it asks for it.
> Transport layer specifies the preference through newly introduced field in the 
> transport module.  (clnt->trans_mod->pref)
> This method has few advantages.
>    - Keeps the net layer clean and lets the transport layer deal with specifics.
>    - mapping user addr into kernel pages pins the memory. Lack of flow control 
>      make the system vulnerable to denial-of-service attacks. This change gives 
>      transport layer more control to implement effective flow control.
>   - If a transport layer doesn't see the need to handle payload separately, 
>     it can set the preference accordingly so that current code works with no 
>     changes. This is very useful for transports which has no plans of 
>     converting/pinning user pages. Especially things become more complex as 
>     copy_to_user()  is not possible as reads(RREAD) are handled by the
>     transport layer in the interrupt context.
> 
> TREAD/RERROR scenario.
> This is a rather sticky issue to deal with for the !dotl protocol. This is not 
> a problem in 9P2000.L as the error is a known size (errno) but in other 
> protocols it is a string of size (ERRMAX).  To take care of TREAD/RERROR 
> scenario in !dotl we make sure that the read buffer is big enough to 
> accommodate  ERRMAX string. If the read size is small, don't send the payload 
> buffer separately to the transport layer  even if it set its preferences other 
> way (P9_TRANS_PREF_PAYLOAD_SEP).
>   
> For bigger reads, RERROR is handled by copying back user buffers into kernel 
> buffer in the case of error. As this is done only in the error path it should 
> not affect the regular performance.
> 
> Created trans_common.[ch] to house common functions so that other transport 
> layers can take advantage of them.
> 
> msize: One of the major advantage of this patch series is to have bigger msize 
> to pull off bigger read/writes from the server. Increasing the msize is not 
> really a solution as majority of other transactions are extremely small which 
> could result in waste of kernel heap.  To address this problem we need to have 
> two sizes of PDUs. 
> Given that this is an additional optimization/usecase of zero copy..and not a 
> NEED to implement zerocopy itself, I am differing it to next round of changes.
> 
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> 
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for zero copy.
  2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:56   ` Venkateswararao Jujjuri (JV)
  2011-02-08 15:20     ` [V9fs-developer] " Latchesar Ionkov
  0 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:56 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> ---
>  net/9p/Makefile       |    1 +
>  net/9p/trans_common.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/9p/trans_common.h |   26 ++++++++++++++
>  3 files changed, 115 insertions(+), 0 deletions(-)
>  create mode 100644 net/9p/trans_common.c
>  create mode 100644 net/9p/trans_common.h
> 
> diff --git a/net/9p/Makefile b/net/9p/Makefile
> index 198a640..a0874cc 100644
> --- a/net/9p/Makefile
> +++ b/net/9p/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
>  	util.o \
>  	protocol.o \
>  	trans_fd.o \
> +	trans_common.o \
> 
>  9pnet_virtio-objs := \
>  	trans_virtio.o \
> diff --git a/net/9p/trans_common.c b/net/9p/trans_common.c
> new file mode 100644
> index 0000000..dad57d2
> --- /dev/null
> +++ b/net/9p/trans_common.c
> @@ -0,0 +1,88 @@
> +/*
> + * Copyright IBM Corporation, 2010
> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +#include <linux/slab.h>
> +#include <net/9p/9p.h>
> +#include <net/9p/client.h>
> +#include <linux/scatterlist.h>
> +#include "trans_common.h"
> +
> +/**
> + *  p9_release_req_pages - Release pages after the transaction.
> + *  @*private: PDU's private page of type virtio_rpage_info_t
> + */
> +void
> +p9_release_req_pages(void *private)
> +{
> +	virtio_rpage_info_t *vpinfo = private;
> +	int i = 0;
> +
> +	while (vpinfo->vp_data[i] && vpinfo->vp_nr_pages--) {
> +		put_page(vpinfo->vp_data[i]);
> +		i++;
> +	}
> +}
> +
> +/**
> + * payload_gup - Calculates number of pages that needs to be pinned and
> + * pins them ehter for read/write through get_user_pages_fast().
> + */
> +int
> +payload_gup(struct p9_req_t *req, size_t *pdata_off, int *pdata_len, u8 rw)
> +{
> +	int nr_pages;
> +	uint32_t first_page_bytes = 0;
> +	uint32_t pdata_mapped_pages;
> +	virtio_rpage_info_t  *rpinfo;
> +
> +	nr_pages = req->tc->pbuf_size >> PAGE_SHIFT;
> +	*pdata_off = (size_t)req->tc->pbuf & (PAGE_SIZE-1);
> +
> +	if (*pdata_off)
> +		first_page_bytes = min((PAGE_SIZE - *pdata_off),
> +				req->tc->pbuf_size);
> +
> +	if (req->tc->pbuf_size - (first_page_bytes + (nr_pages << PAGE_SHIFT))){
> +		/* trailing partial page */
> +		nr_pages++;
> +	}       
> +	if (first_page_bytes) {
> +		/* leading partial page */
> +		nr_pages++;
> +	}
> +	/* TODO: Use buffer on PDU instead of allocating */
> +	rpinfo = kmalloc(sizeof(virtio_rpage_info_t) +
> +			sizeof(struct page *) * nr_pages, GFP_KERNEL);
> +	req->tc->private = (void *)rpinfo;
> +	pdata_mapped_pages = get_user_pages_fast((unsigned long)req->tc->pbuf,
> +			nr_pages, rw, &rpinfo->vp_data[0]);
> +
> +	if (pdata_mapped_pages < 0) {
> +		printk("get_user_pages_fast failed:%d udata:%p" "nr_pages:%d\n",
> +				pdata_mapped_pages, req->tc->pbuf, nr_pages);
> +		pdata_mapped_pages = 0;
> +		kfree(rpinfo);
> +		return -EIO;
> +	}
> +	rpinfo->vp_nr_pages = pdata_mapped_pages;
> +	if (*pdata_off) {
> +		*pdata_len = first_page_bytes;
> +		*pdata_len += min((req->tc->pbuf_size - *pdata_len),
> +				((size_t)pdata_mapped_pages - 1) << PAGE_SHIFT);
> +	} else {
> +		*pdata_len = min (req->tc->pbuf_size,
> +				(size_t)pdata_mapped_pages << PAGE_SHIFT);
> +	}
> +	return 0;
> +}
> diff --git a/net/9p/trans_common.h b/net/9p/trans_common.h
> new file mode 100644
> index 0000000..8c85392
> --- /dev/null
> +++ b/net/9p/trans_common.h
> @@ -0,0 +1,26 @@
> +/*
> + * Copyright IBM Corporation, 2010
> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +/**
> + * struct virtio_rpage_info - To store mapped page information in PDU.
> + * @vp_nr_pages: Number of mapped pages
> + * @vp_data: Array of page pointers
> + */
> +typedef struct virtio_rpage_info {
> +       int vp_nr_pages;
> +       struct page *vp_data[0];
> +} virtio_rpage_info_t;
> +
> +void p9_release_req_pages(void *);
> +int payload_gup(struct p9_req_t *, size_t *, int *, u8);



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 3/7] [net/9p] Assign type of transaction to tc->pdu->id which is otherwise unsed.
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate " Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:56   ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:56 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> ---
>  net/9p/protocol.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> index c500a0b..dfc358f 100644
> --- a/net/9p/protocol.c
> +++ b/net/9p/protocol.c
> @@ -579,6 +579,7 @@ EXPORT_SYMBOL(p9stat_read);
> 
>  int p9pdu_prepare(struct p9_fcall *pdu, int16_t tag, int8_t type)
>  {
> +	pdu->id = type;
>  	return p9pdu_writef(pdu, 0, "dbw", 0, type, tag);
>  }
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 4/7] [net/9p] Add gup/zero_copy support to VirtIO transport layer.
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:56   ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:56 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> ---
>  net/9p/trans_virtio.c |   85 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 79 insertions(+), 6 deletions(-)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index c8f3f72..607f064 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -45,6 +45,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/virtio.h>
>  #include <linux/virtio_9p.h>
> +#include "trans_common.h"
> 
>  #define VIRTQUEUE_NUM	128
> 
> @@ -155,6 +156,12 @@ static void req_done(struct virtqueue *vq)
>  					rc->tag);
>  			req = p9_tag_lookup(chan->client, rc->tag);
>  			req->status = REQ_STATUS_RCVD;
> +			if (req->tc->private) {
> +				/*Release pages */
> +				p9_release_req_pages(req->tc->private);
> +				kfree(req->tc->private);
> +				req->tc->private = NULL;
> +			}
>  			p9_client_cb(chan->client, req);
>  		} else {
>  			spin_unlock_irqrestore(&chan->lock, flags);
> @@ -202,6 +209,30 @@ static int p9_virtio_cancel(struct p9_client *client, struct p9_req_t *req)
>  	return 1;
>  }
> 
> +static int
> +pack_sg_list_p(struct scatterlist *sg, int start, int limit, size_t pdata_off,
> +		struct page **pdata, int count)
> +{
> +	int s;
> +	int i = 0;
> +	int index = start;
> +
> +	if (pdata_off) {
> +		s = min((int)(PAGE_SIZE - pdata_off), count);
> +		sg_set_page(&sg[index++], pdata[i++], s, pdata_off);
> +		count -= s;
> +	}
> +
> +	while (count) {
> +		BUG_ON(index > limit);
> +		s = min((int)PAGE_SIZE, count);
> +		sg_set_page(&sg[index++], pdata[i++], s, 0);
> +		count -= s;
> +	}
> +
> +	return index-start;
> +}
> +
>  /**
>   * p9_virtio_request - issue a request
>   * @client: client instance issuing the request
> @@ -212,22 +243,64 @@ static int p9_virtio_cancel(struct p9_client *client, struct p9_req_t *req)
>  static int
>  p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
>  {
> -	int in, out;
> +	int in, out, inp, outp;
>  	struct virtio_chan *chan = client->trans;
>  	char *rdata = (char *)req->rc+sizeof(struct p9_fcall);
>  	unsigned long flags;
> -	int err;
> +	size_t pdata_off=0;
> +	virtio_rpage_info_t *rpinfo;
> +	int err, pdata_len=0;
> 
>  	P9_DPRINTK(P9_DEBUG_TRANS, "9p debug: virtio request\n");
> 
>  req_retry:
>  	req->status = REQ_STATUS_SENT;
> 
> +	if (req->tc->pbuf_size &&
> +			(req->tc->pbuf && !segment_eq(get_fs(), KERNEL_DS))) {
> +		err = payload_gup(req, &pdata_off, &pdata_len,
> +				req->tc->id == P9_TREAD ? 1 : 0 );
> +		if (err < 0)
> +			return err;
> +	}
> +	rpinfo = (virtio_rpage_info_t *)req->tc->private;
> +
>  	spin_lock_irqsave(&chan->lock, flags);
> -	out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
> -								req->tc->size);
> -	in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM-out, rdata,
> -								client->msize);
> +
> +	/* Handle out VirtIO ring buffers */
> +	if (req->tc->pbuf_size && (req->tc->id == P9_TWRITE)) {
> +		/* We have additional write payload buffer to take care */
> +		out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
> +				req->tc->size);
> +		outp = pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
> +				pdata_off, rpinfo->vp_data, pdata_len);
> +		out += outp;
> +	} else {
> +		out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
> +				req->tc->size);
> +	}
> +
> +	/* Handle in VirtIO ring buffers */
> +	if (req->tc->pbuf_size && (req->tc->id == P9_TREAD)) {
> +		/* We have additional Read payload buffer to take care */
> +		inp = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM, rdata, 11);
> +		/* 
> +		 * Running executables in the filesystem may result in
> +		 * a read request with kernel buffer as opposed to user buffer.
> +		 */
> +		if (req->tc->pbuf && !segment_eq(get_fs(), KERNEL_DS)) {
> +			in = pack_sg_list_p(chan->sg, out+inp, VIRTQUEUE_NUM,
> +					pdata_off, rpinfo->vp_data, pdata_len);
> +		} else {
> +			in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM,
> +					(char *)req->tc->pbuf,
> +					req->tc->pbuf_size);
> +		}
> +		in += inp;
> +	} else {
> +		in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM, rdata,
> +				client->msize);
> +	}
> 
>  	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc);
>  	if (err < 0) {



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 5/7] [net/9p] Add preferences to transport layer.
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:57   ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:57 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> This patch adds preferences field to the p9_trans_module.
> Through this, now transport layer can express its preference about the
> payload. i.e if payload neds to be part of the PDU or it prefers it
> to be sent sepearetly so that the transport layer can handle it in
> a better way.
> 
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> ---
>  include/net/9p/transport.h |    9 +++++++++
>  net/9p/trans_virtio.c      |    1 +
>  2 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/include/net/9p/transport.h b/include/net/9p/transport.h
> index 6d5886e..13c01a8 100644
> --- a/include/net/9p/transport.h
> +++ b/include/net/9p/transport.h
> @@ -26,11 +26,19 @@
>  #ifndef NET_9P_TRANSPORT_H
>  #define NET_9P_TRANSPORT_H
> 
> +#define P9_TRANS_PREF_PAYLOAD_MASK 0x1
> +
> +/* Default. Add Payload to PDU before sending it down to transport layer */
> +#define P9_TRANS_PREF_PAYLOAD_DEF  0x0 
> +/* Send pay load seperately to transport layer along with PDU.*/
> +#define P9_TRANS_PREF_PAYLOAD_SEP  0x1
> +
>  /**
>   * struct p9_trans_module - transport module interface
>   * @list: used to maintain a list of currently available transports
>   * @name: the human-readable name of the transport
>   * @maxsize: transport provided maximum packet size
> + * @pref: Preferences of this transport
>   * @def: set if this transport should be considered the default
>   * @create: member function to create a new connection on this transport
>   * @request: member function to issue a request to the transport
> @@ -47,6 +55,7 @@ struct p9_trans_module {
>  	struct list_head list;
>  	char *name;		/* name of transport */
>  	int maxsize;		/* max message size of transport */
> +	int pref;               /* Preferences of this transport */
>  	int def;		/* this transport should be default */
>  	struct module *owner;
>  	int (*create)(struct p9_client *, const char *, char *);
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 607f064..c76ace6 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -521,6 +521,7 @@ static struct p9_trans_module p9_virtio_trans = {
>  	.request = p9_virtio_request,
>  	.cancel = p9_virtio_cancel,
>  	.maxsize = PAGE_SIZE*16,
> +	.pref = P9_TRANS_PREF_PAYLOAD_DEF,
>  	.def = 0,
>  	.owner = THIS_MODULE,
>  };



^ permalink raw reply	[flat|nested] 26+ messages in thread

* :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:57   ` Venkateswararao Jujjuri (JV)
  2011-02-08 21:09     ` [V9fs-developer] " Eric Van Hensbergen
  0 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:57 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> ---
>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+), 14 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index a848bca..f939edf 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>  	if (count < rsize)
>  		rsize = count;
> 
> -	req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
> +	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
> +			P9_TRANS_PREF_PAYLOAD_SEP) {
> +		req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
> +				rsize, data ? data : udata);
> +	} else {
> +		req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
> +				rsize);
> +	}
>  	if (IS_ERR(req)) {
>  		err = PTR_ERR(req);
>  		goto error;
> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
> 
>  	P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
> 
> -	if (data) {
> -		memmove(data, dataptr, count);
> -	} else {
> -		err = copy_to_user(udata, dataptr, count);
> -		if (err) {
> -			err = -EFAULT;
> -			goto free_and_error;
> +	if (!req->tc->pbuf_size) {
> +		if (data) {
> +			memmove(data, dataptr, count);
> +		} else {
> +			err = copy_to_user(udata, dataptr, count);
> +			if (err) {
> +				err = -EFAULT;
> +				goto free_and_error;
> +			}
>  		}
>  	}
>  	p9_free_req(clnt, req);
> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
> 
>  	if (count < rsize)
>  		rsize = count;
> -	if (data)
> -		req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
> -								rsize, data);
> -	else
> -		req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
> -								rsize, udata);
> +
> +	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
> +			P9_TRANS_PREF_PAYLOAD_SEP) {
> +		req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
> +				rsize, data ? data : udata);
> +	} else {
> +		if (data)
> +			req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
> +					offset, rsize, data);
> +		else
> +			req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
> +					offset, rsize, udata);
> +	}
> +
>  	if (IS_ERR(req)) {
>  		err = PTR_ERR(req);
>  		goto error;
> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> index dfc358f..ea778dd 100644
> --- a/net/9p/protocol.c
> +++ b/net/9p/protocol.c
> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>  	return size - len;
>  }
> 
> +static size_t
> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
> +{
> +	size_t len = min(pdu->capacity - pdu->size, size);
> +	pdu->pbuf = udata;
> +	pdu->pbuf_size = len;
> +	return size - len;
> +}
> +
> +static size_t
> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
> +{
> +	size_t len = min(pdu->capacity - pdu->size, size);
> +	pdu->pbuf = udata;
> +	pdu->pbuf_size = len;
> +	return size - len;
> +}
> +
>  /*
>  	b - int8_t
>  	w - int16_t
> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>  					errcode = -EFAULT;
>  			}
>  			break;
> +		case 'E':{
> +				 int32_t count = va_arg(ap, int32_t);
> +				 const char *udata = va_arg(ap, const void *);
> +				 errcode = p9pdu_writef(pdu, proto_version, "d",
> +						 count);
> +				 if (!errcode && pdu_write_ur(pdu, udata,
> +							 count))
> +					 errcode = -EFAULT;
> +			 }
> +			 break;
> +		case 'F':{
> +				 int32_t count = va_arg(ap, int32_t);
> +				 const char *udata = va_arg(ap, const void *);
> +				 errcode = p9pdu_writef(pdu, proto_version, "d",
> +						 count);
> +				 if (!errcode && pdu_write_uw(pdu, udata,
> +							 count))
> +					 errcode = -EFAULT;
> +			 }
> +			 break;
>  		case 'U':{
>  				int32_t count = va_arg(ap, int32_t);
>  				const char __user *udata =



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 7/7] [net/9p] Handle TREAD/RERROR case in !dotl case.
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  6:58   ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  6:58 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: v9fs-developer, linux-fsdevel

On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
> In addition, this patch also avoids zero copy for short reads in !dotl case.
> 
> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
> 
> Conflicts:
> 
> 	net/9p/protocol.c
> ---
>  include/net/9p/9p.h |    2 +
>  net/9p/client.c     |   74 +++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 51 insertions(+), 25 deletions(-)
> 
> diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
> index 9c939c2..7313801 100644
> --- a/include/net/9p/9p.h
> +++ b/include/net/9p/9p.h
> @@ -320,6 +320,8 @@ enum p9_qid_t {
>  /* Room for readdir header */
>  #define P9_READDIRHDRSZ	24
> 
> +#define P9_ERRMAX 256 /* FIXME: Check what is the correct value */
> +
>  /**
>   * struct p9_str - length prefixed string type
>   * @len: length of the string
> diff --git a/net/9p/client.c b/net/9p/client.c
> index f939edf..7f34c42 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -443,6 +443,7 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
>  {
>  	int8_t type;
>  	int err;
> +	int ecode;
> 
>  	err = p9_parse_header(req->rc, NULL, &type, NULL, 0);
>  	if (err) {
> @@ -450,36 +451,53 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
>  		return err;
>  	}
> 
> -	if (type == P9_RERROR || type == P9_RLERROR) {
> -		int ecode;
> -
> -		if (!p9_is_proto_dotl(c)) {
> -			char *ename;
> +	if (type != P9_RERROR && type != P9_RLERROR) 
> +		return 0;
> 
> -			err = p9pdu_readf(req->rc, c->proto_version, "s?d",
> -								&ename, &ecode);
> -			if (err)
> -				goto out_err;
> +	if (!p9_is_proto_dotl(c)) {
> +		char *ename;
> +
> +		if (req->tc->pbuf_size) {
> +			/* Handle user buffers */
> +			size_t len = req->rc->size - req->rc->offset;
> +			if (req->tc->pbuf &&
> +					!segment_eq(get_fs(), KERNEL_DS)) {
> +				/* User Buffer */
> +				err = copy_from_user(
> +					&req->rc->sdata[req->rc->offset],
> +					req->tc->pbuf, len);
> +				if (err) {
> +					err = -EFAULT;
> +					return err;
> +				}
> +			} else {
> +				/* Kernel Buffer */
> +				memmove(&req->rc->sdata[req->rc->offset],
> +						req->tc->pbuf, len);
> +			}
> +		}
> +		err = p9pdu_readf(req->rc, c->proto_version, "s?d",
> +				&ename, &ecode);
> +		if (err)
> +			goto out_err;
> 
> -			if (p9_is_proto_dotu(c))
> -				err = -ecode;
> +		if (p9_is_proto_dotu(c))
> +			err = -ecode;
> 
> -			if (!err || !IS_ERR_VALUE(err)) {
> -				err = p9_errstr2errno(ename, strlen(ename));
> +		if (!err || !IS_ERR_VALUE(err)) {
> +			err = p9_errstr2errno(ename, strlen(ename));
> 
> -				P9_DPRINTK(P9_DEBUG_9P, "<<< RERROR (%d) %s\n", -ecode, ename);
> +			P9_DPRINTK(P9_DEBUG_9P, "<<< RERROR (%d) %s\n", -ecode, ename);
> 
> -				kfree(ename);
> -			}
> -		} else {
> -			err = p9pdu_readf(req->rc, c->proto_version, "d", &ecode);
> -			err = -ecode;
> -
> -			P9_DPRINTK(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
> +			kfree(ename);
>  		}
> +	} else {
> +		err = p9pdu_readf(req->rc, c->proto_version, "d", &ecode);
> +		err = -ecode;
> +
> +		P9_DPRINTK(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
> +	}
> 
> -	} else
> -		err = 0;
> 
>  	return err;
> 
> @@ -1270,8 +1288,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>  	if (count < rsize)
>  		rsize = count;
> 
> -	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
> -			P9_TRANS_PREF_PAYLOAD_SEP) {
> +	/* for !p9_proto_2000L, we need to have enough space on PDU
> +	 * to handle TREAD/RERROR. Hence don't attempt payload
> +	 * seperaion for small reads even if the transport prefers
> +	 * P9_TRANS_PREF_PAYLOAD_SEP */
> +	if (((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
> +			P9_TRANS_PREF_PAYLOAD_SEP) &&
> +			((clnt->proto_version == p9_proto_2000L) ||
> +			 rsize > 2 * P9_ERRMAX) ) {
>  		req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>  				rsize, data ? data : udata);
>  	} else {



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC-V2] [PATCH 0/7] Zero Copy
@ 2011-02-07  7:21 Venkateswararao Jujjuri (JV)
  2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri

In this patch series I am trying to take another stab at zero copy. 
Please review and provide your feedback.

Goal:

9P Linux client makes an additional copy of read/write buffer into the kernel 
buffer.  There are some transports(especially in the virtualization 
environment) which can avoid this additional copy by directly sending user 
buffer to the server.

Design Goals.

- Have minimal changes to the net layer so that common code is not polluted by 
  the transport specifics.
- Create a common transport library which can be used by other transports.
- Avoid additional optimizations in the initial attempt (more details below) 
  and focus on achieving basic functionality. 

Design

Send the payload buffers separately to the transport layer if it asks for it.
Transport layer specifies the preference through newly introduced field in the 
transport module.  (clnt->trans_mod->pref)
This method has few advantages.
   - Keeps the net layer clean and lets the transport layer deal with specifics.
   - mapping user addr into kernel pages pins the memory. Lack of flow control 
     make the system vulnerable to denial-of-service attacks. This change gives 
     transport layer more control to implement effective flow control.
  - If a transport layer doesn't see the need to handle payload separately, 
    it can set the preference accordingly so that current code works with no 
    changes. This is very useful for transports which has no plans of 
    converting/pinning user pages. Especially things become more complex as 
    copy_to_user()  is not possible as reads(RREAD) are handled by the
    transport layer in the interrupt context.

TREAD/RERROR scenario.
This is a rather sticky issue to deal with for the !dotl protocol. This is not 
a problem in 9P2000.L as the error is a known size (errno) but in other 
protocols it is a string of size (ERRMAX).  To take care of TREAD/RERROR 
scenario in !dotl we make sure that the read buffer is big enough to 
accommodate  ERRMAX string. If the read size is small, don't send the payload 
buffer separately to the transport layer  even if it set its preferences other 
way (P9_TRANS_PREF_PAYLOAD_SEP).
  
For bigger reads, RERROR is handled by copying back user buffers into kernel 
buffer in the case of error. As this is done only in the error path it should 
not affect the regular performance.
 
Created trans_common.[ch] to house common functions so that other transport 
layers can take advantage of them.

msize: One of the major advantage of this patch series is to have bigger msize 
to pull off bigger read/writes from the server. Increasing the msize is not 
really a solution as majority of other transactions are extremely small which 
could result in waste of kernel heap.  To address this problem we need to have 
two sizes of PDUs. 
Given that this is an additional optimization/usecase of zero copy..and not a 
NEED to implement zerocopy itself, I am differing it to next round of changes.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
  2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 include/net/9p/9p.h |    6 ++++++
 net/9p/protocol.c   |    3 +++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index 071fd7a..9c939c2 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -689,6 +689,9 @@ struct p9_rwstat {
  * @tag: transaction id of the request
  * @offset: used by marshalling routines to track currentposition in buffer
  * @capacity: used by marshalling routines to track total capacity
+ * @pbuf: Payload buffer given by the caller
+ * @pbuf_size: pbuf size to be read/written
+ * @private: For transport layer's use.
  * @sdata: payload
  *
  * &p9_fcall represents the structure for all 9P RPC
@@ -705,6 +708,9 @@ struct p9_fcall {
 
 	size_t offset;
 	size_t capacity;
+	const uint8_t *pbuf;
+	size_t pbuf_size;
+	void *private;
 
 	uint8_t *sdata;
 };
diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 1e308f2..c500a0b 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -606,6 +606,9 @@ void p9pdu_reset(struct p9_fcall *pdu)
 {
 	pdu->offset = 0;
 	pdu->size = 0;
+	pdu->private = NULL;
+	pdu->pbuf = NULL;
+	pdu->pbuf_size = 0;
 }
 
 int p9dirent_read(char *buf, int len, struct p9_dirent *dirent,
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
  2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:56   ` [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for " Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate " Venkateswararao Jujjuri (JV)
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 net/9p/Makefile       |    1 +
 net/9p/trans_common.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/9p/trans_common.h |   26 ++++++++++++++
 3 files changed, 115 insertions(+), 0 deletions(-)
 create mode 100644 net/9p/trans_common.c
 create mode 100644 net/9p/trans_common.h

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 198a640..a0874cc 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
 	util.o \
 	protocol.o \
 	trans_fd.o \
+	trans_common.o \
 
 9pnet_virtio-objs := \
 	trans_virtio.o \
diff --git a/net/9p/trans_common.c b/net/9p/trans_common.c
new file mode 100644
index 0000000..dad57d2
--- /dev/null
+++ b/net/9p/trans_common.c
@@ -0,0 +1,88 @@
+/*
+ * Copyright IBM Corporation, 2010
+ * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/slab.h>
+#include <net/9p/9p.h>
+#include <net/9p/client.h>
+#include <linux/scatterlist.h>
+#include "trans_common.h"
+
+/**
+ *  p9_release_req_pages - Release pages after the transaction.
+ *  @*private: PDU's private page of type virtio_rpage_info_t
+ */
+void
+p9_release_req_pages(void *private)
+{
+	virtio_rpage_info_t *vpinfo = private;
+	int i = 0;
+
+	while (vpinfo->vp_data[i] && vpinfo->vp_nr_pages--) {
+		put_page(vpinfo->vp_data[i]);
+		i++;
+	}
+}
+
+/**
+ * payload_gup - Calculates number of pages that needs to be pinned and
+ * pins them ehter for read/write through get_user_pages_fast().
+ */
+int
+payload_gup(struct p9_req_t *req, size_t *pdata_off, int *pdata_len, u8 rw)
+{
+	int nr_pages;
+	uint32_t first_page_bytes = 0;
+	uint32_t pdata_mapped_pages;
+	virtio_rpage_info_t  *rpinfo;
+
+	nr_pages = req->tc->pbuf_size >> PAGE_SHIFT;
+	*pdata_off = (size_t)req->tc->pbuf & (PAGE_SIZE-1);
+
+	if (*pdata_off)
+		first_page_bytes = min((PAGE_SIZE - *pdata_off),
+				req->tc->pbuf_size);
+
+	if (req->tc->pbuf_size - (first_page_bytes + (nr_pages << PAGE_SHIFT))){
+		/* trailing partial page */
+		nr_pages++;
+	}       
+	if (first_page_bytes) {
+		/* leading partial page */
+		nr_pages++;
+	}
+	/* TODO: Use buffer on PDU instead of allocating */
+	rpinfo = kmalloc(sizeof(virtio_rpage_info_t) +
+			sizeof(struct page *) * nr_pages, GFP_KERNEL);
+	req->tc->private = (void *)rpinfo;
+	pdata_mapped_pages = get_user_pages_fast((unsigned long)req->tc->pbuf,
+			nr_pages, rw, &rpinfo->vp_data[0]);
+
+	if (pdata_mapped_pages < 0) {
+		printk("get_user_pages_fast failed:%d udata:%p" "nr_pages:%d\n",
+				pdata_mapped_pages, req->tc->pbuf, nr_pages);
+		pdata_mapped_pages = 0;
+		kfree(rpinfo);
+		return -EIO;
+	}
+	rpinfo->vp_nr_pages = pdata_mapped_pages;
+	if (*pdata_off) {
+		*pdata_len = first_page_bytes;
+		*pdata_len += min((req->tc->pbuf_size - *pdata_len),
+				((size_t)pdata_mapped_pages - 1) << PAGE_SHIFT);
+	} else {
+		*pdata_len = min (req->tc->pbuf_size,
+				(size_t)pdata_mapped_pages << PAGE_SHIFT);
+	}
+	return 0;
+}
diff --git a/net/9p/trans_common.h b/net/9p/trans_common.h
new file mode 100644
index 0000000..8c85392
--- /dev/null
+++ b/net/9p/trans_common.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright IBM Corporation, 2010
+ * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+/**
+ * struct virtio_rpage_info - To store mapped page information in PDU.
+ * @vp_nr_pages: Number of mapped pages
+ * @vp_data: Array of page pointers
+ */
+typedef struct virtio_rpage_info {
+       int vp_nr_pages;
+       struct page *vp_data[0];
+} virtio_rpage_info_t;
+
+void p9_release_req_pages(void *);
+int payload_gup(struct p9_req_t *, size_t *, int *, u8);
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
                   ` (2 preceding siblings ...)
  2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:56   ` [RFC] [PATCH 3/7] [net/9p] Assign type of transaction to tc->pdu->id which is otherwise unsed Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 net/9p/protocol.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index c500a0b..dfc358f 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -579,6 +579,7 @@ EXPORT_SYMBOL(p9stat_read);
 
 int p9pdu_prepare(struct p9_fcall *pdu, int16_t tag, int8_t type)
 {
+	pdu->id = type;
 	return p9pdu_writef(pdu, 0, "dbw", 0, type, tag);
 }
 
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
                   ` (3 preceding siblings ...)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate " Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:56   ` [RFC] [PATCH 4/7] [net/9p] Add gup/zero_copy support to VirtIO transport layer Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 net/9p/trans_virtio.c |   85 +++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index c8f3f72..607f064 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -45,6 +45,7 @@
 #include <linux/scatterlist.h>
 #include <linux/virtio.h>
 #include <linux/virtio_9p.h>
+#include "trans_common.h"
 
 #define VIRTQUEUE_NUM	128
 
@@ -155,6 +156,12 @@ static void req_done(struct virtqueue *vq)
 					rc->tag);
 			req = p9_tag_lookup(chan->client, rc->tag);
 			req->status = REQ_STATUS_RCVD;
+			if (req->tc->private) {
+				/*Release pages */
+				p9_release_req_pages(req->tc->private);
+				kfree(req->tc->private);
+				req->tc->private = NULL;
+			}
 			p9_client_cb(chan->client, req);
 		} else {
 			spin_unlock_irqrestore(&chan->lock, flags);
@@ -202,6 +209,30 @@ static int p9_virtio_cancel(struct p9_client *client, struct p9_req_t *req)
 	return 1;
 }
 
+static int
+pack_sg_list_p(struct scatterlist *sg, int start, int limit, size_t pdata_off,
+		struct page **pdata, int count)
+{
+	int s;
+	int i = 0;
+	int index = start;
+
+	if (pdata_off) {
+		s = min((int)(PAGE_SIZE - pdata_off), count);
+		sg_set_page(&sg[index++], pdata[i++], s, pdata_off);
+		count -= s;
+	}
+
+	while (count) {
+		BUG_ON(index > limit);
+		s = min((int)PAGE_SIZE, count);
+		sg_set_page(&sg[index++], pdata[i++], s, 0);
+		count -= s;
+	}
+
+	return index-start;
+}
+
 /**
  * p9_virtio_request - issue a request
  * @client: client instance issuing the request
@@ -212,22 +243,64 @@ static int p9_virtio_cancel(struct p9_client *client, struct p9_req_t *req)
 static int
 p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 {
-	int in, out;
+	int in, out, inp, outp;
 	struct virtio_chan *chan = client->trans;
 	char *rdata = (char *)req->rc+sizeof(struct p9_fcall);
 	unsigned long flags;
-	int err;
+	size_t pdata_off=0;
+	virtio_rpage_info_t *rpinfo;
+	int err, pdata_len=0;
 
 	P9_DPRINTK(P9_DEBUG_TRANS, "9p debug: virtio request\n");
 
 req_retry:
 	req->status = REQ_STATUS_SENT;
 
+	if (req->tc->pbuf_size &&
+			(req->tc->pbuf && !segment_eq(get_fs(), KERNEL_DS))) {
+		err = payload_gup(req, &pdata_off, &pdata_len,
+				req->tc->id == P9_TREAD ? 1 : 0 );
+		if (err < 0)
+			return err;
+	}
+	rpinfo = (virtio_rpage_info_t *)req->tc->private;
+
 	spin_lock_irqsave(&chan->lock, flags);
-	out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
-								req->tc->size);
-	in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM-out, rdata,
-								client->msize);
+
+	/* Handle out VirtIO ring buffers */
+	if (req->tc->pbuf_size && (req->tc->id == P9_TWRITE)) {
+		/* We have additional write payload buffer to take care */
+		out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
+				req->tc->size);
+		outp = pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
+				pdata_off, rpinfo->vp_data, pdata_len);
+		out += outp;
+	} else {
+		out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM, req->tc->sdata,
+				req->tc->size);
+	}
+
+	/* Handle in VirtIO ring buffers */
+	if (req->tc->pbuf_size && (req->tc->id == P9_TREAD)) {
+		/* We have additional Read payload buffer to take care */
+		inp = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM, rdata, 11);
+		/* 
+		 * Running executables in the filesystem may result in
+		 * a read request with kernel buffer as opposed to user buffer.
+		 */
+		if (req->tc->pbuf && !segment_eq(get_fs(), KERNEL_DS)) {
+			in = pack_sg_list_p(chan->sg, out+inp, VIRTQUEUE_NUM,
+					pdata_off, rpinfo->vp_data, pdata_len);
+		} else {
+			in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM,
+					(char *)req->tc->pbuf,
+					req->tc->pbuf_size);
+		}
+		in += inp;
+	} else {
+		in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM, rdata,
+				client->msize);
+	}
 
 	err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc);
 	if (err < 0) {
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
                   ` (4 preceding siblings ...)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:57   ` [RFC] [PATCH 5/7] [net/9p] Add preferences to transport layer Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

This patch adds preferences field to the p9_trans_module.
Through this, now transport layer can express its preference about the
payload. i.e if payload neds to be part of the PDU or it prefers it
to be sent sepearetly so that the transport layer can handle it in
a better way.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 include/net/9p/transport.h |    9 +++++++++
 net/9p/trans_virtio.c      |    1 +
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/net/9p/transport.h b/include/net/9p/transport.h
index 6d5886e..13c01a8 100644
--- a/include/net/9p/transport.h
+++ b/include/net/9p/transport.h
@@ -26,11 +26,19 @@
 #ifndef NET_9P_TRANSPORT_H
 #define NET_9P_TRANSPORT_H
 
+#define P9_TRANS_PREF_PAYLOAD_MASK 0x1
+
+/* Default. Add Payload to PDU before sending it down to transport layer */
+#define P9_TRANS_PREF_PAYLOAD_DEF  0x0 
+/* Send pay load seperately to transport layer along with PDU.*/
+#define P9_TRANS_PREF_PAYLOAD_SEP  0x1
+
 /**
  * struct p9_trans_module - transport module interface
  * @list: used to maintain a list of currently available transports
  * @name: the human-readable name of the transport
  * @maxsize: transport provided maximum packet size
+ * @pref: Preferences of this transport
  * @def: set if this transport should be considered the default
  * @create: member function to create a new connection on this transport
  * @request: member function to issue a request to the transport
@@ -47,6 +55,7 @@ struct p9_trans_module {
 	struct list_head list;
 	char *name;		/* name of transport */
 	int maxsize;		/* max message size of transport */
+	int pref;               /* Preferences of this transport */
 	int def;		/* this transport should be default */
 	struct module *owner;
 	int (*create)(struct p9_client *, const char *, char *);
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 607f064..c76ace6 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -521,6 +521,7 @@ static struct p9_trans_module p9_virtio_trans = {
 	.request = p9_virtio_request,
 	.cancel = p9_virtio_cancel,
 	.maxsize = PAGE_SIZE*16,
+	.pref = P9_TRANS_PREF_PAYLOAD_DEF,
 	.def = 0,
 	.owner = THIS_MODULE,
 };
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
                   ` (5 preceding siblings ...)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:57   ` :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol Venkateswararao Jujjuri (JV)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
---
 net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
 net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index a848bca..f939edf 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
 	if (count < rsize)
 		rsize = count;
 
-	req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
+	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
+			P9_TRANS_PREF_PAYLOAD_SEP) {
+		req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
+				rsize, data ? data : udata);
+	} else {
+		req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
+				rsize);
+	}
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
 		goto error;
@@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
 
 	P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
 
-	if (data) {
-		memmove(data, dataptr, count);
-	} else {
-		err = copy_to_user(udata, dataptr, count);
-		if (err) {
-			err = -EFAULT;
-			goto free_and_error;
+	if (!req->tc->pbuf_size) {
+		if (data) {
+			memmove(data, dataptr, count);
+		} else {
+			err = copy_to_user(udata, dataptr, count);
+			if (err) {
+				err = -EFAULT;
+				goto free_and_error;
+			}
 		}
 	}
 	p9_free_req(clnt, req);
@@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
 
 	if (count < rsize)
 		rsize = count;
-	if (data)
-		req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
-								rsize, data);
-	else
-		req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
-								rsize, udata);
+
+	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
+			P9_TRANS_PREF_PAYLOAD_SEP) {
+		req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
+				rsize, data ? data : udata);
+	} else {
+		if (data)
+			req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
+					offset, rsize, data);
+		else
+			req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
+					offset, rsize, udata);
+	}
+
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
 		goto error;
diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index dfc358f..ea778dd 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
 	return size - len;
 }
 
+static size_t
+pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
+{
+	size_t len = min(pdu->capacity - pdu->size, size);
+	pdu->pbuf = udata;
+	pdu->pbuf_size = len;
+	return size - len;
+}
+
+static size_t
+pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
+{
+	size_t len = min(pdu->capacity - pdu->size, size);
+	pdu->pbuf = udata;
+	pdu->pbuf_size = len;
+	return size - len;
+}
+
 /*
 	b - int8_t
 	w - int16_t
@@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
 					errcode = -EFAULT;
 			}
 			break;
+		case 'E':{
+				 int32_t count = va_arg(ap, int32_t);
+				 const char *udata = va_arg(ap, const void *);
+				 errcode = p9pdu_writef(pdu, proto_version, "d",
+						 count);
+				 if (!errcode && pdu_write_ur(pdu, udata,
+							 count))
+					 errcode = -EFAULT;
+			 }
+			 break;
+		case 'F':{
+				 int32_t count = va_arg(ap, int32_t);
+				 const char *udata = va_arg(ap, const void *);
+				 errcode = p9pdu_writef(pdu, proto_version, "d",
+						 count);
+				 if (!errcode && pdu_write_uw(pdu, udata,
+							 count))
+					 errcode = -EFAULT;
+			 }
+			 break;
 		case 'U':{
 				int32_t count = va_arg(ap, int32_t);
 				const char __user *udata =
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy.
  2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
                   ` (6 preceding siblings ...)
  2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
@ 2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
  2011-02-07  6:58   ` [PATCH 7/7] [net/9p] Handle TREAD/RERROR case in !dotl case Venkateswararao Jujjuri (JV)
  7 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC (permalink / raw)
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri (JV)

In addition, this patch also avoids zero copy for short reads in !dotl case.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>

Conflicts:

	net/9p/protocol.c
---
 include/net/9p/9p.h |    2 +
 net/9p/client.c     |   74 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 51 insertions(+), 25 deletions(-)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index 9c939c2..7313801 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -320,6 +320,8 @@ enum p9_qid_t {
 /* Room for readdir header */
 #define P9_READDIRHDRSZ	24
 
+#define P9_ERRMAX 256 /* FIXME: Check what is the correct value */
+
 /**
  * struct p9_str - length prefixed string type
  * @len: length of the string
diff --git a/net/9p/client.c b/net/9p/client.c
index f939edf..7f34c42 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -443,6 +443,7 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
 {
 	int8_t type;
 	int err;
+	int ecode;
 
 	err = p9_parse_header(req->rc, NULL, &type, NULL, 0);
 	if (err) {
@@ -450,36 +451,53 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
 		return err;
 	}
 
-	if (type == P9_RERROR || type == P9_RLERROR) {
-		int ecode;
-
-		if (!p9_is_proto_dotl(c)) {
-			char *ename;
+	if (type != P9_RERROR && type != P9_RLERROR) 
+		return 0;
 
-			err = p9pdu_readf(req->rc, c->proto_version, "s?d",
-								&ename, &ecode);
-			if (err)
-				goto out_err;
+	if (!p9_is_proto_dotl(c)) {
+		char *ename;
+
+		if (req->tc->pbuf_size) {
+			/* Handle user buffers */
+			size_t len = req->rc->size - req->rc->offset;
+			if (req->tc->pbuf &&
+					!segment_eq(get_fs(), KERNEL_DS)) {
+				/* User Buffer */
+				err = copy_from_user(
+					&req->rc->sdata[req->rc->offset],
+					req->tc->pbuf, len);
+				if (err) {
+					err = -EFAULT;
+					return err;
+				}
+			} else {
+				/* Kernel Buffer */
+				memmove(&req->rc->sdata[req->rc->offset],
+						req->tc->pbuf, len);
+			}
+		}
+		err = p9pdu_readf(req->rc, c->proto_version, "s?d",
+				&ename, &ecode);
+		if (err)
+			goto out_err;
 
-			if (p9_is_proto_dotu(c))
-				err = -ecode;
+		if (p9_is_proto_dotu(c))
+			err = -ecode;
 
-			if (!err || !IS_ERR_VALUE(err)) {
-				err = p9_errstr2errno(ename, strlen(ename));
+		if (!err || !IS_ERR_VALUE(err)) {
+			err = p9_errstr2errno(ename, strlen(ename));
 
-				P9_DPRINTK(P9_DEBUG_9P, "<<< RERROR (%d) %s\n", -ecode, ename);
+			P9_DPRINTK(P9_DEBUG_9P, "<<< RERROR (%d) %s\n", -ecode, ename);
 
-				kfree(ename);
-			}
-		} else {
-			err = p9pdu_readf(req->rc, c->proto_version, "d", &ecode);
-			err = -ecode;
-
-			P9_DPRINTK(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
+			kfree(ename);
 		}
+	} else {
+		err = p9pdu_readf(req->rc, c->proto_version, "d", &ecode);
+		err = -ecode;
+
+		P9_DPRINTK(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
+	}
 
-	} else
-		err = 0;
 
 	return err;
 
@@ -1270,8 +1288,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
 	if (count < rsize)
 		rsize = count;
 
-	if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
-			P9_TRANS_PREF_PAYLOAD_SEP) {
+	/* for !p9_proto_2000L, we need to have enough space on PDU
+	 * to handle TREAD/RERROR. Hence don't attempt payload
+	 * seperaion for small reads even if the transport prefers
+	 * P9_TRANS_PREF_PAYLOAD_SEP */
+	if (((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
+			P9_TRANS_PREF_PAYLOAD_SEP) &&
+			((clnt->proto_version == p9_proto_2000L) ||
+			 rsize > 2 * P9_ERRMAX) ) {
 		req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
 				rsize, data ? data : udata);
 	} else {
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for zero copy.
  2011-02-07  6:56   ` [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for " Venkateswararao Jujjuri (JV)
@ 2011-02-08 15:20     ` Latchesar Ionkov
  2011-02-08 17:21       ` Venkateswararao Jujjuri (JV)
  0 siblings, 1 reply; 26+ messages in thread
From: Latchesar Ionkov @ 2011-02-08 15:20 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: linux-fsdevel, v9fs-developer

Can you please rename the common structures so they don't have virtio
related names?

Thanks,
    Lucho

On Sun, Feb 6, 2011 at 11:56 PM, Venkateswararao Jujjuri (JV)
<jvrao@linux.vnet.ibm.com> wrote:
> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>> ---
>>  net/9p/Makefile       |    1 +
>>  net/9p/trans_common.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  net/9p/trans_common.h |   26 ++++++++++++++
>>  3 files changed, 115 insertions(+), 0 deletions(-)
>>  create mode 100644 net/9p/trans_common.c
>>  create mode 100644 net/9p/trans_common.h
>>
>> diff --git a/net/9p/Makefile b/net/9p/Makefile
>> index 198a640..a0874cc 100644
>> --- a/net/9p/Makefile
>> +++ b/net/9p/Makefile
>> @@ -9,6 +9,7 @@ obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
>>       util.o \
>>       protocol.o \
>>       trans_fd.o \
>> +     trans_common.o \
>>
>>  9pnet_virtio-objs := \
>>       trans_virtio.o \
>> diff --git a/net/9p/trans_common.c b/net/9p/trans_common.c
>> new file mode 100644
>> index 0000000..dad57d2
>> --- /dev/null
>> +++ b/net/9p/trans_common.c
>> @@ -0,0 +1,88 @@
>> +/*
>> + * Copyright IBM Corporation, 2010
>> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>> + * as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it would be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> + *
>> + */
>> +
>> +#include <linux/slab.h>
>> +#include <net/9p/9p.h>
>> +#include <net/9p/client.h>
>> +#include <linux/scatterlist.h>
>> +#include "trans_common.h"
>> +
>> +/**
>> + *  p9_release_req_pages - Release pages after the transaction.
>> + *  @*private: PDU's private page of type virtio_rpage_info_t
>> + */
>> +void
>> +p9_release_req_pages(void *private)
>> +{
>> +     virtio_rpage_info_t *vpinfo = private;
>> +     int i = 0;
>> +
>> +     while (vpinfo->vp_data[i] && vpinfo->vp_nr_pages--) {
>> +             put_page(vpinfo->vp_data[i]);
>> +             i++;
>> +     }
>> +}
>> +
>> +/**
>> + * payload_gup - Calculates number of pages that needs to be pinned and
>> + * pins them ehter for read/write through get_user_pages_fast().
>> + */
>> +int
>> +payload_gup(struct p9_req_t *req, size_t *pdata_off, int *pdata_len, u8 rw)
>> +{
>> +     int nr_pages;
>> +     uint32_t first_page_bytes = 0;
>> +     uint32_t pdata_mapped_pages;
>> +     virtio_rpage_info_t  *rpinfo;
>> +
>> +     nr_pages = req->tc->pbuf_size >> PAGE_SHIFT;
>> +     *pdata_off = (size_t)req->tc->pbuf & (PAGE_SIZE-1);
>> +
>> +     if (*pdata_off)
>> +             first_page_bytes = min((PAGE_SIZE - *pdata_off),
>> +                             req->tc->pbuf_size);
>> +
>> +     if (req->tc->pbuf_size - (first_page_bytes + (nr_pages << PAGE_SHIFT))){
>> +             /* trailing partial page */
>> +             nr_pages++;
>> +     }
>> +     if (first_page_bytes) {
>> +             /* leading partial page */
>> +             nr_pages++;
>> +     }
>> +     /* TODO: Use buffer on PDU instead of allocating */
>> +     rpinfo = kmalloc(sizeof(virtio_rpage_info_t) +
>> +                     sizeof(struct page *) * nr_pages, GFP_KERNEL);
>> +     req->tc->private = (void *)rpinfo;
>> +     pdata_mapped_pages = get_user_pages_fast((unsigned long)req->tc->pbuf,
>> +                     nr_pages, rw, &rpinfo->vp_data[0]);
>> +
>> +     if (pdata_mapped_pages < 0) {
>> +             printk("get_user_pages_fast failed:%d udata:%p" "nr_pages:%d\n",
>> +                             pdata_mapped_pages, req->tc->pbuf, nr_pages);
>> +             pdata_mapped_pages = 0;
>> +             kfree(rpinfo);
>> +             return -EIO;
>> +     }
>> +     rpinfo->vp_nr_pages = pdata_mapped_pages;
>> +     if (*pdata_off) {
>> +             *pdata_len = first_page_bytes;
>> +             *pdata_len += min((req->tc->pbuf_size - *pdata_len),
>> +                             ((size_t)pdata_mapped_pages - 1) << PAGE_SHIFT);
>> +     } else {
>> +             *pdata_len = min (req->tc->pbuf_size,
>> +                             (size_t)pdata_mapped_pages << PAGE_SHIFT);
>> +     }
>> +     return 0;
>> +}
>> diff --git a/net/9p/trans_common.h b/net/9p/trans_common.h
>> new file mode 100644
>> index 0000000..8c85392
>> --- /dev/null
>> +++ b/net/9p/trans_common.h
>> @@ -0,0 +1,26 @@
>> +/*
>> + * Copyright IBM Corporation, 2010
>> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>> + * as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it would be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> + *
>> + */
>> +
>> +/**
>> + * struct virtio_rpage_info - To store mapped page information in PDU.
>> + * @vp_nr_pages: Number of mapped pages
>> + * @vp_data: Array of page pointers
>> + */
>> +typedef struct virtio_rpage_info {
>> +       int vp_nr_pages;
>> +       struct page *vp_data[0];
>> +} virtio_rpage_info_t;
>> +
>> +void p9_release_req_pages(void *);
>> +int payload_gup(struct p9_req_t *, size_t *, int *, u8);
>
>
>
> ------------------------------------------------------------------------------
> The modern datacenter depends on network connectivity to access resources
> and provide services. The best practices for maximizing a physical server's
> connectivity to a physical network are well understood - see how these
> rules translate into the virtual world?
> http://p.sf.net/sfu/oracle-sfdevnlfb
> _______________________________________________
> V9fs-developer mailing list
> V9fs-developer@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for zero copy.
  2011-02-08 15:20     ` [V9fs-developer] " Latchesar Ionkov
@ 2011-02-08 17:21       ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-08 17:21 UTC (permalink / raw)
  To: Latchesar Ionkov; +Cc: linux-fsdevel, v9fs-developer

On 2/8/2011 7:20 AM, Latchesar Ionkov wrote:
> Can you please rename the common structures so they don't have virtio
> related names?
> 

Sure.

- JV

> Thanks,
>     Lucho
> 
> On Sun, Feb 6, 2011 at 11:56 PM, Venkateswararao Jujjuri (JV)
> <jvrao@linux.vnet.ibm.com> wrote:
>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>> ---
>>>  net/9p/Makefile       |    1 +
>>>  net/9p/trans_common.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>  net/9p/trans_common.h |   26 ++++++++++++++
>>>  3 files changed, 115 insertions(+), 0 deletions(-)
>>>  create mode 100644 net/9p/trans_common.c
>>>  create mode 100644 net/9p/trans_common.h
>>>
>>> diff --git a/net/9p/Makefile b/net/9p/Makefile
>>> index 198a640..a0874cc 100644
>>> --- a/net/9p/Makefile
>>> +++ b/net/9p/Makefile
>>> @@ -9,6 +9,7 @@ obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
>>>       util.o \
>>>       protocol.o \
>>>       trans_fd.o \
>>> +     trans_common.o \
>>>
>>>  9pnet_virtio-objs := \
>>>       trans_virtio.o \
>>> diff --git a/net/9p/trans_common.c b/net/9p/trans_common.c
>>> new file mode 100644
>>> index 0000000..dad57d2
>>> --- /dev/null
>>> +++ b/net/9p/trans_common.c
>>> @@ -0,0 +1,88 @@
>>> +/*
>>> + * Copyright IBM Corporation, 2010
>>> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify it
>>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>>> + * as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it would be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>> + *
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <net/9p/9p.h>
>>> +#include <net/9p/client.h>
>>> +#include <linux/scatterlist.h>
>>> +#include "trans_common.h"
>>> +
>>> +/**
>>> + *  p9_release_req_pages - Release pages after the transaction.
>>> + *  @*private: PDU's private page of type virtio_rpage_info_t
>>> + */
>>> +void
>>> +p9_release_req_pages(void *private)
>>> +{
>>> +     virtio_rpage_info_t *vpinfo = private;
>>> +     int i = 0;
>>> +
>>> +     while (vpinfo->vp_data[i] && vpinfo->vp_nr_pages--) {
>>> +             put_page(vpinfo->vp_data[i]);
>>> +             i++;
>>> +     }
>>> +}
>>> +
>>> +/**
>>> + * payload_gup - Calculates number of pages that needs to be pinned and
>>> + * pins them ehter for read/write through get_user_pages_fast().
>>> + */
>>> +int
>>> +payload_gup(struct p9_req_t *req, size_t *pdata_off, int *pdata_len, u8 rw)
>>> +{
>>> +     int nr_pages;
>>> +     uint32_t first_page_bytes = 0;
>>> +     uint32_t pdata_mapped_pages;
>>> +     virtio_rpage_info_t  *rpinfo;
>>> +
>>> +     nr_pages = req->tc->pbuf_size >> PAGE_SHIFT;
>>> +     *pdata_off = (size_t)req->tc->pbuf & (PAGE_SIZE-1);
>>> +
>>> +     if (*pdata_off)
>>> +             first_page_bytes = min((PAGE_SIZE - *pdata_off),
>>> +                             req->tc->pbuf_size);
>>> +
>>> +     if (req->tc->pbuf_size - (first_page_bytes + (nr_pages << PAGE_SHIFT))){
>>> +             /* trailing partial page */
>>> +             nr_pages++;
>>> +     }
>>> +     if (first_page_bytes) {
>>> +             /* leading partial page */
>>> +             nr_pages++;
>>> +     }
>>> +     /* TODO: Use buffer on PDU instead of allocating */
>>> +     rpinfo = kmalloc(sizeof(virtio_rpage_info_t) +
>>> +                     sizeof(struct page *) * nr_pages, GFP_KERNEL);
>>> +     req->tc->private = (void *)rpinfo;
>>> +     pdata_mapped_pages = get_user_pages_fast((unsigned long)req->tc->pbuf,
>>> +                     nr_pages, rw, &rpinfo->vp_data[0]);
>>> +
>>> +     if (pdata_mapped_pages < 0) {
>>> +             printk("get_user_pages_fast failed:%d udata:%p" "nr_pages:%d\n",
>>> +                             pdata_mapped_pages, req->tc->pbuf, nr_pages);
>>> +             pdata_mapped_pages = 0;
>>> +             kfree(rpinfo);
>>> +             return -EIO;
>>> +     }
>>> +     rpinfo->vp_nr_pages = pdata_mapped_pages;
>>> +     if (*pdata_off) {
>>> +             *pdata_len = first_page_bytes;
>>> +             *pdata_len += min((req->tc->pbuf_size - *pdata_len),
>>> +                             ((size_t)pdata_mapped_pages - 1) << PAGE_SHIFT);
>>> +     } else {
>>> +             *pdata_len = min (req->tc->pbuf_size,
>>> +                             (size_t)pdata_mapped_pages << PAGE_SHIFT);
>>> +     }
>>> +     return 0;
>>> +}
>>> diff --git a/net/9p/trans_common.h b/net/9p/trans_common.h
>>> new file mode 100644
>>> index 0000000..8c85392
>>> --- /dev/null
>>> +++ b/net/9p/trans_common.h
>>> @@ -0,0 +1,26 @@
>>> +/*
>>> + * Copyright IBM Corporation, 2010
>>> + * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify it
>>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>>> + * as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it would be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>> + *
>>> + */
>>> +
>>> +/**
>>> + * struct virtio_rpage_info - To store mapped page information in PDU.
>>> + * @vp_nr_pages: Number of mapped pages
>>> + * @vp_data: Array of page pointers
>>> + */
>>> +typedef struct virtio_rpage_info {
>>> +       int vp_nr_pages;
>>> +       struct page *vp_data[0];
>>> +} virtio_rpage_info_t;
>>> +
>>> +void p9_release_req_pages(void *);
>>> +int payload_gup(struct p9_req_t *, size_t *, int *, u8);
>>
>>
>>
>> ------------------------------------------------------------------------------
>> The modern datacenter depends on network connectivity to access resources
>> and provide services. The best practices for maximizing a physical server's
>> connectivity to a physical network are well understood - see how these
>> rules translate into the virtual world?
>> http://p.sf.net/sfu/oracle-sfdevnlfb
>> _______________________________________________
>> V9fs-developer mailing list
>> V9fs-developer@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-07  6:57   ` :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol Venkateswararao Jujjuri (JV)
@ 2011-02-08 21:09     ` Eric Van Hensbergen
  2011-02-08 21:16       ` Eric Van Hensbergen
  2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Van Hensbergen @ 2011-02-08 21:09 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: linux-fsdevel, v9fs-developer

One thing I wonder is if we always want to zero copy for payload.  In
the extreme, do we want to take the overhead of pinning an extra page
if we are only reading/writing a byte?  memcpy is expensive for large
packets, but may actually be more efficient for small packets.

Have we done any performance measurements of this code with various
payload sizes versus non-zero-copy?  Of course that may not really
show the impact of pinning the extra pages....

In any case, if such a tradeoff did exist, we might choose to not do
zero copy for requests smaller than some size -- and that might
alleviate some of the problems with the legacy protocols (such as the
Rerror issue) -- killing two birds with one stone.  In any case, given
the implementations it should be really easy to shut me up with
comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
4192 byte payloads (without caches enabled of course).

       -eric


On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
<jvrao@linux.vnet.ibm.com> wrote:
> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>> ---
>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>
>> diff --git a/net/9p/client.c b/net/9p/client.c
>> index a848bca..f939edf 100644
>> --- a/net/9p/client.c
>> +++ b/net/9p/client.c
>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>       if (count < rsize)
>>               rsize = count;
>>
>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>> +                             rsize, data ? data : udata);
>> +     } else {
>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>> +                             rsize);
>> +     }
>>       if (IS_ERR(req)) {
>>               err = PTR_ERR(req);
>>               goto error;
>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>
>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>
>> -     if (data) {
>> -             memmove(data, dataptr, count);
>> -     } else {
>> -             err = copy_to_user(udata, dataptr, count);
>> -             if (err) {
>> -                     err = -EFAULT;
>> -                     goto free_and_error;
>> +     if (!req->tc->pbuf_size) {
>> +             if (data) {
>> +                     memmove(data, dataptr, count);
>> +             } else {
>> +                     err = copy_to_user(udata, dataptr, count);
>> +                     if (err) {
>> +                             err = -EFAULT;
>> +                             goto free_and_error;
>> +                     }
>>               }
>>       }
>>       p9_free_req(clnt, req);
>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>
>>       if (count < rsize)
>>               rsize = count;
>> -     if (data)
>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>> -                                                             rsize, data);
>> -     else
>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>> -                                                             rsize, udata);
>> +
>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>> +                             rsize, data ? data : udata);
>> +     } else {
>> +             if (data)
>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>> +                                     offset, rsize, data);
>> +             else
>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>> +                                     offset, rsize, udata);
>> +     }
>> +
>>       if (IS_ERR(req)) {
>>               err = PTR_ERR(req);
>>               goto error;
>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>> index dfc358f..ea778dd 100644
>> --- a/net/9p/protocol.c
>> +++ b/net/9p/protocol.c
>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>       return size - len;
>>  }
>>
>> +static size_t
>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>> +{
>> +     size_t len = min(pdu->capacity - pdu->size, size);
>> +     pdu->pbuf = udata;
>> +     pdu->pbuf_size = len;
>> +     return size - len;
>> +}
>> +
>> +static size_t
>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>> +{
>> +     size_t len = min(pdu->capacity - pdu->size, size);
>> +     pdu->pbuf = udata;
>> +     pdu->pbuf_size = len;
>> +     return size - len;
>> +}
>> +
>>  /*
>>       b - int8_t
>>       w - int16_t
>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>                                       errcode = -EFAULT;
>>                       }
>>                       break;
>> +             case 'E':{
>> +                              int32_t count = va_arg(ap, int32_t);
>> +                              const char *udata = va_arg(ap, const void *);
>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>> +                                              count);
>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>> +                                                      count))
>> +                                      errcode = -EFAULT;
>> +                      }
>> +                      break;
>> +             case 'F':{
>> +                              int32_t count = va_arg(ap, int32_t);
>> +                              const char *udata = va_arg(ap, const void *);
>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>> +                                              count);
>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>> +                                                      count))
>> +                                      errcode = -EFAULT;
>> +                      }
>> +                      break;
>>               case 'U':{
>>                               int32_t count = va_arg(ap, int32_t);
>>                               const char __user *udata =
>
>
>
> ------------------------------------------------------------------------------
> The modern datacenter depends on network connectivity to access resources
> and provide services. The best practices for maximizing a physical server's
> connectivity to a physical network are well understood - see how these
> rules translate into the virtual world?
> http://p.sf.net/sfu/oracle-sfdevnlfb
> _______________________________________________
> V9fs-developer mailing list
> V9fs-developer@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-08 21:09     ` [V9fs-developer] " Eric Van Hensbergen
@ 2011-02-08 21:16       ` Eric Van Hensbergen
  2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
  2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Van Hensbergen @ 2011-02-08 21:16 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: linux-fsdevel, v9fs-developer

oh and, for reference, while a different environment, my request is
based on a little bit more than idle fancy.  Check out the graph on
page 6 in: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8182&rep=rep1&type=pdf

        -eric


On Tue, Feb 8, 2011 at 3:09 PM, Eric Van Hensbergen <ericvh@gmail.com> wrote:
> One thing I wonder is if we always want to zero copy for payload.  In
> the extreme, do we want to take the overhead of pinning an extra page
> if we are only reading/writing a byte?  memcpy is expensive for large
> packets, but may actually be more efficient for small packets.
>
> Have we done any performance measurements of this code with various
> payload sizes versus non-zero-copy?  Of course that may not really
> show the impact of pinning the extra pages....
>
> In any case, if such a tradeoff did exist, we might choose to not do
> zero copy for requests smaller than some size -- and that might
> alleviate some of the problems with the legacy protocols (such as the
> Rerror issue) -- killing two birds with one stone.  In any case, given
> the implementations it should be really easy to shut me up with
> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
> 4192 byte payloads (without caches enabled of course).
>
>       -eric
>
>
> On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
> <jvrao@linux.vnet.ibm.com> wrote:
>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>> ---
>>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>> index a848bca..f939edf 100644
>>> --- a/net/9p/client.c
>>> +++ b/net/9p/client.c
>>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>       if (count < rsize)
>>>               rsize = count;
>>>
>>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>>> +                             rsize, data ? data : udata);
>>> +     } else {
>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>>> +                             rsize);
>>> +     }
>>>       if (IS_ERR(req)) {
>>>               err = PTR_ERR(req);
>>>               goto error;
>>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>
>>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>>
>>> -     if (data) {
>>> -             memmove(data, dataptr, count);
>>> -     } else {
>>> -             err = copy_to_user(udata, dataptr, count);
>>> -             if (err) {
>>> -                     err = -EFAULT;
>>> -                     goto free_and_error;
>>> +     if (!req->tc->pbuf_size) {
>>> +             if (data) {
>>> +                     memmove(data, dataptr, count);
>>> +             } else {
>>> +                     err = copy_to_user(udata, dataptr, count);
>>> +                     if (err) {
>>> +                             err = -EFAULT;
>>> +                             goto free_and_error;
>>> +                     }
>>>               }
>>>       }
>>>       p9_free_req(clnt, req);
>>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>>
>>>       if (count < rsize)
>>>               rsize = count;
>>> -     if (data)
>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>>> -                                                             rsize, data);
>>> -     else
>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>>> -                                                             rsize, udata);
>>> +
>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>>> +                             rsize, data ? data : udata);
>>> +     } else {
>>> +             if (data)
>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>>> +                                     offset, rsize, data);
>>> +             else
>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>>> +                                     offset, rsize, udata);
>>> +     }
>>> +
>>>       if (IS_ERR(req)) {
>>>               err = PTR_ERR(req);
>>>               goto error;
>>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>>> index dfc358f..ea778dd 100644
>>> --- a/net/9p/protocol.c
>>> +++ b/net/9p/protocol.c
>>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>>       return size - len;
>>>  }
>>>
>>> +static size_t
>>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>>> +{
>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>> +     pdu->pbuf = udata;
>>> +     pdu->pbuf_size = len;
>>> +     return size - len;
>>> +}
>>> +
>>> +static size_t
>>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>>> +{
>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>> +     pdu->pbuf = udata;
>>> +     pdu->pbuf_size = len;
>>> +     return size - len;
>>> +}
>>> +
>>>  /*
>>>       b - int8_t
>>>       w - int16_t
>>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>>                                       errcode = -EFAULT;
>>>                       }
>>>                       break;
>>> +             case 'E':{
>>> +                              int32_t count = va_arg(ap, int32_t);
>>> +                              const char *udata = va_arg(ap, const void *);
>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>> +                                              count);
>>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>>> +                                                      count))
>>> +                                      errcode = -EFAULT;
>>> +                      }
>>> +                      break;
>>> +             case 'F':{
>>> +                              int32_t count = va_arg(ap, int32_t);
>>> +                              const char *udata = va_arg(ap, const void *);
>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>> +                                              count);
>>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>>> +                                                      count))
>>> +                                      errcode = -EFAULT;
>>> +                      }
>>> +                      break;
>>>               case 'U':{
>>>                               int32_t count = va_arg(ap, int32_t);
>>>                               const char __user *udata =
>>
>>
>>
>> ------------------------------------------------------------------------------
>> The modern datacenter depends on network connectivity to access resources
>> and provide services. The best practices for maximizing a physical server's
>> connectivity to a physical network are well understood - see how these
>> rules translate into the virtual world?
>> http://p.sf.net/sfu/oracle-sfdevnlfb
>> _______________________________________________
>> V9fs-developer mailing list
>> V9fs-developer@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-08 21:09     ` [V9fs-developer] " Eric Van Hensbergen
  2011-02-08 21:16       ` Eric Van Hensbergen
@ 2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
  2011-02-09  1:59         ` Venkateswararao Jujjuri (JV)
  1 sibling, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-08 23:50 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, v9fs-developer

On 2/8/2011 1:09 PM, Eric Van Hensbergen wrote:
> One thing I wonder is if we always want to zero copy for payload.  In
> the extreme, do we want to take the overhead of pinning an extra page
> if we are only reading/writing a byte?  memcpy is expensive for large
> packets, but may actually be more efficient for small packets.
I am not a memory expert, but I would assume memcpy also need to do
same thing similar to get_user_pages() short of pinning pages. But I see the point.

> 
> Have we done any performance measurements of this code with various
> payload sizes versus non-zero-copy?  Of course that may not really
> show the impact of pinning the extra pages....

All our testing is with large buffers. Did not test with small buffers.

> 
> In any case, if such a tradeoff did exist, we might choose to not do
> zero copy for requests smaller than some size -- and that might
> alleviate some of the problems with the legacy protocols (such as the
> Rerror issue) -- killing two birds with one stone.  In any case, given

I think it is a wise decision to avoid zero copy if iosize+hdr_size <= pagesize.
But It doesn't change any of today's complexity. Except may be saving an if
condition.

> the implementations it should be really easy to shut me up with
> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
> 4192 byte payloads (without caches enabled of course).

I think this is good experiment will publish data.

> 
>        -eric
> 
> 
> On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
> <jvrao@linux.vnet.ibm.com> wrote:
>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>> ---
>>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>> index a848bca..f939edf 100644
>>> --- a/net/9p/client.c
>>> +++ b/net/9p/client.c
>>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>       if (count < rsize)
>>>               rsize = count;
>>>
>>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>>> +                             rsize, data ? data : udata);
>>> +     } else {
>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>>> +                             rsize);
>>> +     }
>>>       if (IS_ERR(req)) {
>>>               err = PTR_ERR(req);
>>>               goto error;
>>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>
>>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>>
>>> -     if (data) {
>>> -             memmove(data, dataptr, count);
>>> -     } else {
>>> -             err = copy_to_user(udata, dataptr, count);
>>> -             if (err) {
>>> -                     err = -EFAULT;
>>> -                     goto free_and_error;
>>> +     if (!req->tc->pbuf_size) {
>>> +             if (data) {
>>> +                     memmove(data, dataptr, count);
>>> +             } else {
>>> +                     err = copy_to_user(udata, dataptr, count);
>>> +                     if (err) {
>>> +                             err = -EFAULT;
>>> +                             goto free_and_error;
>>> +                     }
>>>               }
>>>       }
>>>       p9_free_req(clnt, req);
>>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>>
>>>       if (count < rsize)
>>>               rsize = count;
>>> -     if (data)
>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>>> -                                                             rsize, data);
>>> -     else
>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>>> -                                                             rsize, udata);
>>> +
>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>>> +                             rsize, data ? data : udata);
>>> +     } else {
>>> +             if (data)
>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>>> +                                     offset, rsize, data);
>>> +             else
>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>>> +                                     offset, rsize, udata);
>>> +     }
>>> +
>>>       if (IS_ERR(req)) {
>>>               err = PTR_ERR(req);
>>>               goto error;
>>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>>> index dfc358f..ea778dd 100644
>>> --- a/net/9p/protocol.c
>>> +++ b/net/9p/protocol.c
>>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>>       return size - len;
>>>  }
>>>
>>> +static size_t
>>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>>> +{
>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>> +     pdu->pbuf = udata;
>>> +     pdu->pbuf_size = len;
>>> +     return size - len;
>>> +}
>>> +
>>> +static size_t
>>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>>> +{
>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>> +     pdu->pbuf = udata;
>>> +     pdu->pbuf_size = len;
>>> +     return size - len;
>>> +}
>>> +
>>>  /*
>>>       b - int8_t
>>>       w - int16_t
>>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>>                                       errcode = -EFAULT;
>>>                       }
>>>                       break;
>>> +             case 'E':{
>>> +                              int32_t count = va_arg(ap, int32_t);
>>> +                              const char *udata = va_arg(ap, const void *);
>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>> +                                              count);
>>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>>> +                                                      count))
>>> +                                      errcode = -EFAULT;
>>> +                      }
>>> +                      break;
>>> +             case 'F':{
>>> +                              int32_t count = va_arg(ap, int32_t);
>>> +                              const char *udata = va_arg(ap, const void *);
>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>> +                                              count);
>>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>>> +                                                      count))
>>> +                                      errcode = -EFAULT;
>>> +                      }
>>> +                      break;
>>>               case 'U':{
>>>                               int32_t count = va_arg(ap, int32_t);
>>>                               const char __user *udata =
>>
>>
>>
>> ------------------------------------------------------------------------------
>> The modern datacenter depends on network connectivity to access resources
>> and provide services. The best practices for maximizing a physical server's
>> connectivity to a physical network are well understood - see how these
>> rules translate into the virtual world?
>> http://p.sf.net/sfu/oracle-sfdevnlfb
>> _______________________________________________
>> V9fs-developer mailing list
>> V9fs-developer@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
@ 2011-02-09  1:59         ` Venkateswararao Jujjuri (JV)
  2011-02-09 14:28           ` Eric Van Hensbergen
  0 siblings, 1 reply; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-09  1:59 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, v9fs-developer

On 2/8/2011 3:50 PM, Venkateswararao Jujjuri (JV) wrote:
> On 2/8/2011 1:09 PM, Eric Van Hensbergen wrote:
>> One thing I wonder is if we always want to zero copy for payload.  In
>> the extreme, do we want to take the overhead of pinning an extra page
>> if we are only reading/writing a byte?  memcpy is expensive for large
>> packets, but may actually be more efficient for small packets.
> I am not a memory expert, but I would assume memcpy also need to do
> same thing similar to get_user_pages() short of pinning pages. But I see the point.
> 
>>
>> Have we done any performance measurements of this code with various
>> payload sizes versus non-zero-copy?  Of course that may not really
>> show the impact of pinning the extra pages....
> 
> All our testing is with large buffers. Did not test with small buffers.
> 
>>
>> In any case, if such a tradeoff did exist, we might choose to not do
>> zero copy for requests smaller than some size -- and that might
>> alleviate some of the problems with the legacy protocols (such as the
>> Rerror issue) -- killing two birds with one stone.  In any case, given
> 
> I think it is a wise decision to avoid zero copy if iosize+hdr_size <= pagesize.
> But It doesn't change any of today's complexity. Except may be saving an if
> condition.
> 
>> the implementations it should be really easy to shut me up with
>> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
>> 4192 byte payloads (without caches enabled of course).
> 
> I think this is good experiment will publish data.

BTW, unless we have bigger msize with differentiating pdu sizes these experiments
may not make sense.

- JV

> 
>>
>>        -eric
>>
>>
>> On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
>> <jvrao@linux.vnet.ibm.com> wrote:
>>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>>> ---
>>>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>>> index a848bca..f939edf 100644
>>>> --- a/net/9p/client.c
>>>> +++ b/net/9p/client.c
>>>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>       if (count < rsize)
>>>>               rsize = count;
>>>>
>>>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>>>> +                             rsize, data ? data : udata);
>>>> +     } else {
>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>>>> +                             rsize);
>>>> +     }
>>>>       if (IS_ERR(req)) {
>>>>               err = PTR_ERR(req);
>>>>               goto error;
>>>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>
>>>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>>>
>>>> -     if (data) {
>>>> -             memmove(data, dataptr, count);
>>>> -     } else {
>>>> -             err = copy_to_user(udata, dataptr, count);
>>>> -             if (err) {
>>>> -                     err = -EFAULT;
>>>> -                     goto free_and_error;
>>>> +     if (!req->tc->pbuf_size) {
>>>> +             if (data) {
>>>> +                     memmove(data, dataptr, count);
>>>> +             } else {
>>>> +                     err = copy_to_user(udata, dataptr, count);
>>>> +                     if (err) {
>>>> +                             err = -EFAULT;
>>>> +                             goto free_and_error;
>>>> +                     }
>>>>               }
>>>>       }
>>>>       p9_free_req(clnt, req);
>>>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>>>
>>>>       if (count < rsize)
>>>>               rsize = count;
>>>> -     if (data)
>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>>>> -                                                             rsize, data);
>>>> -     else
>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>>>> -                                                             rsize, udata);
>>>> +
>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>>>> +                             rsize, data ? data : udata);
>>>> +     } else {
>>>> +             if (data)
>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>>>> +                                     offset, rsize, data);
>>>> +             else
>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>>>> +                                     offset, rsize, udata);
>>>> +     }
>>>> +
>>>>       if (IS_ERR(req)) {
>>>>               err = PTR_ERR(req);
>>>>               goto error;
>>>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>>>> index dfc358f..ea778dd 100644
>>>> --- a/net/9p/protocol.c
>>>> +++ b/net/9p/protocol.c
>>>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>>>       return size - len;
>>>>  }
>>>>
>>>> +static size_t
>>>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>>>> +{
>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>> +     pdu->pbuf = udata;
>>>> +     pdu->pbuf_size = len;
>>>> +     return size - len;
>>>> +}
>>>> +
>>>> +static size_t
>>>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>>>> +{
>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>> +     pdu->pbuf = udata;
>>>> +     pdu->pbuf_size = len;
>>>> +     return size - len;
>>>> +}
>>>> +
>>>>  /*
>>>>       b - int8_t
>>>>       w - int16_t
>>>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>>>                                       errcode = -EFAULT;
>>>>                       }
>>>>                       break;
>>>> +             case 'E':{
>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>> +                              const char *udata = va_arg(ap, const void *);
>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>> +                                              count);
>>>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>>>> +                                                      count))
>>>> +                                      errcode = -EFAULT;
>>>> +                      }
>>>> +                      break;
>>>> +             case 'F':{
>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>> +                              const char *udata = va_arg(ap, const void *);
>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>> +                                              count);
>>>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>>>> +                                                      count))
>>>> +                                      errcode = -EFAULT;
>>>> +                      }
>>>> +                      break;
>>>>               case 'U':{
>>>>                               int32_t count = va_arg(ap, int32_t);
>>>>                               const char __user *udata =
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> The modern datacenter depends on network connectivity to access resources
>>> and provide services. The best practices for maximizing a physical server's
>>> connectivity to a physical network are well understood - see how these
>>> rules translate into the virtual world?
>>> http://p.sf.net/sfu/oracle-sfdevnlfb
>>> _______________________________________________
>>> V9fs-developer mailing list
>>> V9fs-developer@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>>
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-09  1:59         ` Venkateswararao Jujjuri (JV)
@ 2011-02-09 14:28           ` Eric Van Hensbergen
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Van Hensbergen @ 2011-02-09 14:28 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: linux-fsdevel, v9fs-developer

On Tue, Feb 8, 2011 at 7:59 PM, Venkateswararao Jujjuri (JV)
<jvrao@linux.vnet.ibm.com> wrote:
>>
>>> the implementations it should be really easy to shut me up with
>>> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
>>> 4192 byte payloads (without caches enabled of course).
>>
>> I think this is good experiment will publish data.
>
> BTW, unless we have bigger msize with differentiating pdu sizes these experiments
> may not make sense.
>

Not sure I agree (at least in the 1-4k scale), we are measuring the
overhead of memcpy versus the overhead of mapping/pinning the
additional sg -- or am I not thinking clearly.  Its possible... I have
not had coffee yet.  I suppose with your small buffer patch series
there might be some performance differences due to different allocator
behavior, and while I don't think it'll be significant, it may be
worth re-doing the experiment once we have that in place.

Which brings up another question -- I know your team are doing
functional regressions, but are they also dong performance
regressions?  Since we are starting into the optimization patches it
may not be a bad idea to track how the changes are impacting
scalability, latency, and throughput (as well as some metric of
resource consumption, but that may be a harder metric to track).

         -eric


    -eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-08 21:16       ` Eric Van Hensbergen
@ 2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
  2011-02-09 21:12           ` Venkateswararao Jujjuri (JV)
  2011-02-09 21:18           ` Eric Van Hensbergen
  0 siblings, 2 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-09 21:09 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, v9fs-developer

WRITE

IO SIZE      TOTAL SIZE       No ZC                ZC
1                   1MB                22.4 kb/s         19.8 kb/s
32                 32MB              711 kb/s          633 kb/s
64                 64MB              1.4 mb/s          1.3  mb/s
128               128MB             2.8 mb/s          2.6 mb/s
256               256MB             5.6 mb/s          5.1 mb/s
512               512MB            10.4 mb/s        10.2 mb/s
1024              1GB               19.7 mb/s         20.4 mb/s
2048              2GB               40.1 mb/s          43.7 mb/s
4096              4GB               71.4 mb/s          73.1 mb/s



READ
IO SIZE      TOTAL SIZE       No ZC                ZC
1                   1MB                26.6 kb/s         23.1 kb/s
32                 32MB              783 kb/s           734 kb/s
64                 64MB              1.7 mb/s          1.5 mb/s
128               128MB             3.4 mb/s          3.0 mb/s
256               256MB             4.2 mb/s           5.9 mb/s
512               512MB            6.9 mb/s            11.6 mb/s
1024              1GB               23.3 mb/s          23.4 mb/s
2048              2GB               42.5 mb/s          45.4 mb/s
4096              4GB               67.4 mb/s          73.9 mb/s

As you can see, the difference is marginal..but zc improves as the IO size
increases.
In the past we have seen tremendous improvements with different msizes.
It  is mostly because of shipping bigger chunks of data which is possible with
zero copy.
Also it could be my setup/box even on the host I am getting same/similar numbers.

- JV




On 2/8/2011 1:16 PM, Eric Van Hensbergen wrote:
> oh and, for reference, while a different environment, my request is
> based on a little bit more than idle fancy.  Check out the graph on
> page 6 in: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8182&rep=rep1&type=pdf
> 
>         -eric
> 
> 
> On Tue, Feb 8, 2011 at 3:09 PM, Eric Van Hensbergen <ericvh@gmail.com> wrote:
>> One thing I wonder is if we always want to zero copy for payload.  In
>> the extreme, do we want to take the overhead of pinning an extra page
>> if we are only reading/writing a byte?  memcpy is expensive for large
>> packets, but may actually be more efficient for small packets.
>>
>> Have we done any performance measurements of this code with various
>> payload sizes versus non-zero-copy?  Of course that may not really
>> show the impact of pinning the extra pages....
>>
>> In any case, if such a tradeoff did exist, we might choose to not do
>> zero copy for requests smaller than some size -- and that might
>> alleviate some of the problems with the legacy protocols (such as the
>> Rerror issue) -- killing two birds with one stone.  In any case, given
>> the implementations it should be really easy to shut me up with
>> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
>> 4192 byte payloads (without caches enabled of course).
>>
>>       -eric
>>
>>
>> On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
>> <jvrao@linux.vnet.ibm.com> wrote:
>>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>>> ---
>>>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>>> index a848bca..f939edf 100644
>>>> --- a/net/9p/client.c
>>>> +++ b/net/9p/client.c
>>>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>       if (count < rsize)
>>>>               rsize = count;
>>>>
>>>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>>>> +                             rsize, data ? data : udata);
>>>> +     } else {
>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>>>> +                             rsize);
>>>> +     }
>>>>       if (IS_ERR(req)) {
>>>>               err = PTR_ERR(req);
>>>>               goto error;
>>>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>
>>>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>>>
>>>> -     if (data) {
>>>> -             memmove(data, dataptr, count);
>>>> -     } else {
>>>> -             err = copy_to_user(udata, dataptr, count);
>>>> -             if (err) {
>>>> -                     err = -EFAULT;
>>>> -                     goto free_and_error;
>>>> +     if (!req->tc->pbuf_size) {
>>>> +             if (data) {
>>>> +                     memmove(data, dataptr, count);
>>>> +             } else {
>>>> +                     err = copy_to_user(udata, dataptr, count);
>>>> +                     if (err) {
>>>> +                             err = -EFAULT;
>>>> +                             goto free_and_error;
>>>> +                     }
>>>>               }
>>>>       }
>>>>       p9_free_req(clnt, req);
>>>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>>>
>>>>       if (count < rsize)
>>>>               rsize = count;
>>>> -     if (data)
>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>>>> -                                                             rsize, data);
>>>> -     else
>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>>>> -                                                             rsize, udata);
>>>> +
>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>>>> +                             rsize, data ? data : udata);
>>>> +     } else {
>>>> +             if (data)
>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>>>> +                                     offset, rsize, data);
>>>> +             else
>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>>>> +                                     offset, rsize, udata);
>>>> +     }
>>>> +
>>>>       if (IS_ERR(req)) {
>>>>               err = PTR_ERR(req);
>>>>               goto error;
>>>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>>>> index dfc358f..ea778dd 100644
>>>> --- a/net/9p/protocol.c
>>>> +++ b/net/9p/protocol.c
>>>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>>>       return size - len;
>>>>  }
>>>>
>>>> +static size_t
>>>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>>>> +{
>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>> +     pdu->pbuf = udata;
>>>> +     pdu->pbuf_size = len;
>>>> +     return size - len;
>>>> +}
>>>> +
>>>> +static size_t
>>>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>>>> +{
>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>> +     pdu->pbuf = udata;
>>>> +     pdu->pbuf_size = len;
>>>> +     return size - len;
>>>> +}
>>>> +
>>>>  /*
>>>>       b - int8_t
>>>>       w - int16_t
>>>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>>>                                       errcode = -EFAULT;
>>>>                       }
>>>>                       break;
>>>> +             case 'E':{
>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>> +                              const char *udata = va_arg(ap, const void *);
>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>> +                                              count);
>>>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>>>> +                                                      count))
>>>> +                                      errcode = -EFAULT;
>>>> +                      }
>>>> +                      break;
>>>> +             case 'F':{
>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>> +                              const char *udata = va_arg(ap, const void *);
>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>> +                                              count);
>>>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>>>> +                                                      count))
>>>> +                                      errcode = -EFAULT;
>>>> +                      }
>>>> +                      break;
>>>>               case 'U':{
>>>>                               int32_t count = va_arg(ap, int32_t);
>>>>                               const char __user *udata =
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> The modern datacenter depends on network connectivity to access resources
>>> and provide services. The best practices for maximizing a physical server's
>>> connectivity to a physical network are well understood - see how these
>>> rules translate into the virtual world?
>>> http://p.sf.net/sfu/oracle-sfdevnlfb
>>> _______________________________________________
>>> V9fs-developer mailing list
>>> V9fs-developer@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
@ 2011-02-09 21:12           ` Venkateswararao Jujjuri (JV)
  2011-02-09 21:18           ` Eric Van Hensbergen
  1 sibling, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-09 21:12 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, v9fs-developer

On 2/9/2011 1:09 PM, Venkateswararao Jujjuri (JV) wrote:
> WRITE

dd if=/dev/zero of=/pmnt/file1 bs=4096 count=1MB (variable bs = IO SIZE)

> 
> IO SIZE      TOTAL SIZE       No ZC                ZC
> 1                   1MB                22.4 kb/s         19.8 kb/s
> 32                 32MB              711 kb/s          633 kb/s
> 64                 64MB              1.4 mb/s          1.3  mb/s
> 128               128MB             2.8 mb/s          2.6 mb/s
> 256               256MB             5.6 mb/s          5.1 mb/s
> 512               512MB            10.4 mb/s        10.2 mb/s
> 1024              1GB               19.7 mb/s         20.4 mb/s
> 2048              2GB               40.1 mb/s          43.7 mb/s
> 4096              4GB               71.4 mb/s          73.1 mb/s
> 
> 
> 
> READ
dd of=/dev/null if=/pmnt/file1 bs=4096 count=1MB(variable bs = IO SIZE)

> IO SIZE      TOTAL SIZE       No ZC                ZC
> 1                   1MB                26.6 kb/s         23.1 kb/s
> 32                 32MB              783 kb/s           734 kb/s
> 64                 64MB              1.7 mb/s          1.5 mb/s
> 128               128MB             3.4 mb/s          3.0 mb/s
> 256               256MB             4.2 mb/s           5.9 mb/s
> 512               512MB            6.9 mb/s            11.6 mb/s
> 1024              1GB               23.3 mb/s          23.4 mb/s
> 2048              2GB               42.5 mb/s          45.4 mb/s
> 4096              4GB               67.4 mb/s          73.9 mb/s
> 
> As you can see, the difference is marginal..but zc improves as the IO size
> increases.
> In the past we have seen tremendous improvements with different msizes.
> It  is mostly because of shipping bigger chunks of data which is possible with
> zero copy.
> Also it could be my setup/box even on the host I am getting same/similar numbers.
> 
> - JV
> 
> 
> 
> 
> On 2/8/2011 1:16 PM, Eric Van Hensbergen wrote:
>> oh and, for reference, while a different environment, my request is
>> based on a little bit more than idle fancy.  Check out the graph on
>> page 6 in: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8182&rep=rep1&type=pdf
>>
>>         -eric
>>
>>
>> On Tue, Feb 8, 2011 at 3:09 PM, Eric Van Hensbergen <ericvh@gmail.com> wrote:
>>> One thing I wonder is if we always want to zero copy for payload.  In
>>> the extreme, do we want to take the overhead of pinning an extra page
>>> if we are only reading/writing a byte?  memcpy is expensive for large
>>> packets, but may actually be more efficient for small packets.
>>>
>>> Have we done any performance measurements of this code with various
>>> payload sizes versus non-zero-copy?  Of course that may not really
>>> show the impact of pinning the extra pages....
>>>
>>> In any case, if such a tradeoff did exist, we might choose to not do
>>> zero copy for requests smaller than some size -- and that might
>>> alleviate some of the problems with the legacy protocols (such as the
>>> Rerror issue) -- killing two birds with one stone.  In any case, given
>>> the implementations it should be really easy to shut me up with
>>> comparison data of zc and non-zc for 1, 64, 128, 256, 512, 1024, 2048,
>>> 4192 byte payloads (without caches enabled of course).
>>>
>>>       -eric
>>>
>>>
>>> On Mon, Feb 7, 2011 at 12:57 AM, Venkateswararao Jujjuri (JV)
>>> <jvrao@linux.vnet.ibm.com> wrote:
>>>> On 2/6/2011 11:21 PM, Venkateswararao Jujjuri (JV) wrote:
>>>>> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
>>>>> ---
>>>>>  net/9p/client.c   |   45 +++++++++++++++++++++++++++++++--------------
>>>>>  net/9p/protocol.c |   38 ++++++++++++++++++++++++++++++++++++++
>>>>>  2 files changed, 69 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>>>> index a848bca..f939edf 100644
>>>>> --- a/net/9p/client.c
>>>>> +++ b/net/9p/client.c
>>>>> @@ -1270,7 +1270,14 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>>       if (count < rsize)
>>>>>               rsize = count;
>>>>>
>>>>> -     req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
>>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqE", fid->fid, offset,
>>>>> +                             rsize, data ? data : udata);
>>>>> +     } else {
>>>>> +             req = p9_client_rpc(clnt, P9_TREAD, "dqd", fid->fid, offset,
>>>>> +                             rsize);
>>>>> +     }
>>>>>       if (IS_ERR(req)) {
>>>>>               err = PTR_ERR(req);
>>>>>               goto error;
>>>>> @@ -1284,13 +1291,15 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
>>>>>
>>>>>       P9_DPRINTK(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
>>>>>
>>>>> -     if (data) {
>>>>> -             memmove(data, dataptr, count);
>>>>> -     } else {
>>>>> -             err = copy_to_user(udata, dataptr, count);
>>>>> -             if (err) {
>>>>> -                     err = -EFAULT;
>>>>> -                     goto free_and_error;
>>>>> +     if (!req->tc->pbuf_size) {
>>>>> +             if (data) {
>>>>> +                     memmove(data, dataptr, count);
>>>>> +             } else {
>>>>> +                     err = copy_to_user(udata, dataptr, count);
>>>>> +                     if (err) {
>>>>> +                             err = -EFAULT;
>>>>> +                             goto free_and_error;
>>>>> +                     }
>>>>>               }
>>>>>       }
>>>>>       p9_free_req(clnt, req);
>>>>> @@ -1323,12 +1332,20 @@ p9_client_write(struct p9_fid *fid, char *data, const char __user *udata,
>>>>>
>>>>>       if (count < rsize)
>>>>>               rsize = count;
>>>>> -     if (data)
>>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid, offset,
>>>>> -                                                             rsize, data);
>>>>> -     else
>>>>> -             req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid, offset,
>>>>> -                                                             rsize, udata);
>>>>> +
>>>>> +     if ((clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_MASK) ==
>>>>> +                     P9_TRANS_PREF_PAYLOAD_SEP) {
>>>>> +             req = p9_client_rpc(clnt, P9_TWRITE, "dqF", fid->fid, offset,
>>>>> +                             rsize, data ? data : udata);
>>>>> +     } else {
>>>>> +             if (data)
>>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqD", fid->fid,
>>>>> +                                     offset, rsize, data);
>>>>> +             else
>>>>> +                     req = p9_client_rpc(clnt, P9_TWRITE, "dqU", fid->fid,
>>>>> +                                     offset, rsize, udata);
>>>>> +     }
>>>>> +
>>>>>       if (IS_ERR(req)) {
>>>>>               err = PTR_ERR(req);
>>>>>               goto error;
>>>>> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
>>>>> index dfc358f..ea778dd 100644
>>>>> --- a/net/9p/protocol.c
>>>>> +++ b/net/9p/protocol.c
>>>>> @@ -114,6 +114,24 @@ pdu_write_u(struct p9_fcall *pdu, const char __user *udata, size_t size)
>>>>>       return size - len;
>>>>>  }
>>>>>
>>>>> +static size_t
>>>>> +pdu_write_uw(struct p9_fcall *pdu, const char *udata, size_t size)
>>>>> +{
>>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>>> +     pdu->pbuf = udata;
>>>>> +     pdu->pbuf_size = len;
>>>>> +     return size - len;
>>>>> +}
>>>>> +
>>>>> +static size_t
>>>>> +pdu_write_ur(struct p9_fcall *pdu, const char *udata, size_t size)
>>>>> +{
>>>>> +     size_t len = min(pdu->capacity - pdu->size, size);
>>>>> +     pdu->pbuf = udata;
>>>>> +     pdu->pbuf_size = len;
>>>>> +     return size - len;
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>       b - int8_t
>>>>>       w - int16_t
>>>>> @@ -445,6 +463,26 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
>>>>>                                       errcode = -EFAULT;
>>>>>                       }
>>>>>                       break;
>>>>> +             case 'E':{
>>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>>> +                              const char *udata = va_arg(ap, const void *);
>>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>>> +                                              count);
>>>>> +                              if (!errcode && pdu_write_ur(pdu, udata,
>>>>> +                                                      count))
>>>>> +                                      errcode = -EFAULT;
>>>>> +                      }
>>>>> +                      break;
>>>>> +             case 'F':{
>>>>> +                              int32_t count = va_arg(ap, int32_t);
>>>>> +                              const char *udata = va_arg(ap, const void *);
>>>>> +                              errcode = p9pdu_writef(pdu, proto_version, "d",
>>>>> +                                              count);
>>>>> +                              if (!errcode && pdu_write_uw(pdu, udata,
>>>>> +                                                      count))
>>>>> +                                      errcode = -EFAULT;
>>>>> +                      }
>>>>> +                      break;
>>>>>               case 'U':{
>>>>>                               int32_t count = va_arg(ap, int32_t);
>>>>>                               const char __user *udata =
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> The modern datacenter depends on network connectivity to access resources
>>>> and provide services. The best practices for maximizing a physical server's
>>>> connectivity to a physical network are well understood - see how these
>>>> rules translate into the virtual world?
>>>> http://p.sf.net/sfu/oracle-sfdevnlfb
>>>> _______________________________________________
>>>> V9fs-developer mailing list
>>>> V9fs-developer@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/v9fs-developer
>>>>
>>>
> 
> 
> 
> ------------------------------------------------------------------------------
> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
> Pinpoint memory and threading errors before they happen.
> Find and fix more than 250 security defects in the development cycle.
> Locate bottlenecks in serial and parallel code that limit performance.
> http://p.sf.net/sfu/intel-dev2devfeb
> _______________________________________________
> V9fs-developer mailing list
> V9fs-developer@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/v9fs-developer



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
  2011-02-09 21:12           ` Venkateswararao Jujjuri (JV)
@ 2011-02-09 21:18           ` Eric Van Hensbergen
  2011-02-09 21:39             ` Venkateswararao Jujjuri (JV)
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Van Hensbergen @ 2011-02-09 21:18 UTC (permalink / raw)
  To: Venkateswararao Jujjuri (JV); +Cc: linux-fsdevel, v9fs-developer

On Wed, Feb 9, 2011 at 3:09 PM, Venkateswararao Jujjuri (JV)
<jvrao@linux.vnet.ibm.com> wrote:
> WRITE
>
> IO SIZE      TOTAL SIZE       No ZC                ZC
> 1                   1MB                22.4 kb/s         19.8 kb/s
> 32                 32MB              711 kb/s          633 kb/s
> 64                 64MB              1.4 mb/s          1.3  mb/s
> 128               128MB             2.8 mb/s          2.6 mb/s
> 256               256MB             5.6 mb/s          5.1 mb/s
> 512               512MB            10.4 mb/s        10.2 mb/s
> 1024              1GB               19.7 mb/s         20.4 mb/s
> 2048              2GB               40.1 mb/s          43.7 mb/s
> 4096              4GB               71.4 mb/s          73.1 mb/s
>
> READ
> IO SIZE      TOTAL SIZE       No ZC                ZC
> 1                   1MB                26.6 kb/s         23.1 kb/s
> 32                 32MB              783 kb/s           734 kb/s
> 64                 64MB              1.7 mb/s          1.5 mb/s
> 128               128MB             3.4 mb/s          3.0 mb/s
> 256               256MB             4.2 mb/s           5.9 mb/s
> 512               512MB            6.9 mb/s            11.6 mb/s
> 1024              1GB               23.3 mb/s          23.4 mb/s
> 2048              2GB               42.5 mb/s          45.4 mb/s
> 4096              4GB               67.4 mb/s          73.9 mb/s
>
> As you can see, the difference is marginal..but zc improves as the IO size
> increases.
> In the past we have seen tremendous improvements with different msizes.
> It  is mostly because of shipping bigger chunks of data which is possible with
> zero copy.
> Also it could be my setup/box even on the host I am getting same/similar numbers.
>

So the break even point for write is around 512 and for read it is
somewhere between 128 and 256 -- but I think there may be some
justification then for not doing zc for payloads of 128 or less.
Interesting number, its the same as ERRMAX :)  These numbers will be
different system to system of course, but I imagine on a server class
machine the tradeoff size moves higher instead of lower (since
processor and caches are likely to be faster).  How characteristic is
the machine you tested it on JV?

      -eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [V9fs-developer] :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol.
  2011-02-09 21:18           ` Eric Van Hensbergen
@ 2011-02-09 21:39             ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 26+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2011-02-09 21:39 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, v9fs-developer

On 2/9/2011 1:18 PM, Eric Van Hensbergen wrote:
> On Wed, Feb 9, 2011 at 3:09 PM, Venkateswararao Jujjuri (JV)
> <jvrao@linux.vnet.ibm.com> wrote:
>> WRITE
>>
>> IO SIZE      TOTAL SIZE       No ZC                ZC
>> 1                   1MB                22.4 kb/s         19.8 kb/s
>> 32                 32MB              711 kb/s          633 kb/s
>> 64                 64MB              1.4 mb/s          1.3  mb/s
>> 128               128MB             2.8 mb/s          2.6 mb/s
>> 256               256MB             5.6 mb/s          5.1 mb/s
>> 512               512MB            10.4 mb/s        10.2 mb/s
>> 1024              1GB               19.7 mb/s         20.4 mb/s
>> 2048              2GB               40.1 mb/s          43.7 mb/s
>> 4096              4GB               71.4 mb/s          73.1 mb/s
>>
>> READ
>> IO SIZE      TOTAL SIZE       No ZC                ZC
>> 1                   1MB                26.6 kb/s         23.1 kb/s
>> 32                 32MB              783 kb/s           734 kb/s
>> 64                 64MB              1.7 mb/s          1.5 mb/s
>> 128               128MB             3.4 mb/s          3.0 mb/s
>> 256               256MB             4.2 mb/s           5.9 mb/s
>> 512               512MB            6.9 mb/s            11.6 mb/s
>> 1024              1GB               23.3 mb/s          23.4 mb/s
>> 2048              2GB               42.5 mb/s          45.4 mb/s
>> 4096              4GB               67.4 mb/s          73.9 mb/s
>>
>> As you can see, the difference is marginal..but zc improves as the IO size
>> increases.
>> In the past we have seen tremendous improvements with different msizes.
>> It  is mostly because of shipping bigger chunks of data which is possible with
>> zero copy.
>> Also it could be my setup/box even on the host I am getting same/similar numbers.
>>
> 
> So the break even point for write is around 512 and for read it is
> somewhere between 128 and 256 -- but I think there may be some
> justification then for not doing zc for payloads of 128 or less.
> Interesting number, its the same as ERRMAX :)  These numbers will be
> different system to system of course, but I imagine on a server class
> machine the tradeoff size moves higher instead of lower (since
> processor and caches are likely to be faster).  How characteristic is
> the machine you tested it on JV?

It is a HS21 blade a two socket quad core Xeon with 4 GB memory, IO to the local
disk.
As I said, throughput on the host also in the same range...we could very well be
capped by the
disk performance. But I agree that if the iosize+hdr size < 4k we can just use
non-zero copy.
I don't think it is going to swing the pendulum of performance/complexity in
either way..but
given that we are going to allocate atleast 4k buffers, it makes sense to use it
if we can accommodate everything in there.

- JV
> 
>       -eric



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2011-02-09 21:40 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for " Venkateswararao Jujjuri (JV)
2011-02-08 15:20     ` [V9fs-developer] " Latchesar Ionkov
2011-02-08 17:21       ` Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate " Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 3/7] [net/9p] Assign type of transaction to tc->pdu->id which is otherwise unsed Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 4/7] [net/9p] Add gup/zero_copy support to VirtIO transport layer Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:57   ` [RFC] [PATCH 5/7] [net/9p] Add preferences to transport layer Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:57   ` :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol Venkateswararao Jujjuri (JV)
2011-02-08 21:09     ` [V9fs-developer] " Eric Van Hensbergen
2011-02-08 21:16       ` Eric Van Hensbergen
2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
2011-02-09 21:12           ` Venkateswararao Jujjuri (JV)
2011-02-09 21:18           ` Eric Van Hensbergen
2011-02-09 21:39             ` Venkateswararao Jujjuri (JV)
2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
2011-02-09  1:59         ` Venkateswararao Jujjuri (JV)
2011-02-09 14:28           ` Eric Van Hensbergen
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:58   ` [PATCH 7/7] [net/9p] Handle TREAD/RERROR case in !dotl case Venkateswararao Jujjuri (JV)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.