All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC
@ 2013-02-07 14:51 Jeff Layton
  2013-02-07 14:51 ` [PATCH v3 1/2] sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer Jeff Layton
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-07 14:51 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

This patchset is a respin of the portion of my DRC overhaul that Bruce
has not yet committed. The main difference is the first patch, which
adds a routine to strip the trailing checksum off of a decrypted or
integrity-verified buffer.

I've tested both the client and server with different krb5 flavors (and
using both the v1 and v2 codepaths) and it seems to work fine with those
checksum blobs stripped off.

I think we should consider this for 3.9 since XID collisions are a lot
more likely now with the new DRC code in place.

Jeff Layton (2):
  sunrpc: trim off trailing checksum before returning decrypted or
    integrity authenticated buffer
  nfsd: keep a checksum of the first 256 bytes of request

 fs/nfsd/cache.h                     |  5 ++++
 fs/nfsd/nfscache.c                  | 53 ++++++++++++++++++++++++++++++++++---
 include/linux/sunrpc/xdr.h          |  1 +
 net/sunrpc/auth_gss/gss_krb5_wrap.c |  2 ++
 net/sunrpc/auth_gss/svcauth_gss.c   | 10 +++++--
 net/sunrpc/xdr.c                    | 42 +++++++++++++++++++++++++++++
 6 files changed, 107 insertions(+), 6 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/2] sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer
  2013-02-07 14:51 [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC Jeff Layton
@ 2013-02-07 14:51 ` Jeff Layton
  2013-02-07 14:51 ` [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request Jeff Layton
  2013-02-07 15:11 ` [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC J. Bruce Fields
  2 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-07 14:51 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

When GSSAPI integrity signatures are in use, or when we're using GSSAPI
privacy with the v2 token format, there is a trailing checksum on the
xdr_buf that is returned.

It's checked during the authentication stage, and afterward nothing
cares about it. Ordinarily that's not a problem, since the XDR code
generally ignores it, but it will be a problem when we try to compute a
checksum over the buffer to help prevent XID collisions in the
duplicate reply cache.

Fix the code to trim off the checksums after verifying them. Note that
in unwrap_integ_data, we must avoid trying to reverify the checksum if
the request was deferred since it will no longer be present when it's
revisited.
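To make the mechanics concrete, here is a minimal userspace model of the length-fixup logic (a sketch only: `xdr_buf_model` and `xdr_buf_trim_model` are hypothetical stand-ins for the kernel's struct xdr_buf and the new xdr_buf_trim(), with the iovec base pointers omitted since only the lengths change):

```c
#include <stddef.h>

/*
 * Simplified model of struct xdr_buf: a head kvec, a page-data region,
 * and a tail kvec. Only the segment lengths matter for trimming.
 * (Hypothetical userspace sketch, not the kernel structure.)
 */
struct xdr_buf_model {
	size_t head_len;	/* head[0].iov_len */
	size_t page_len;	/* page_len */
	size_t tail_len;	/* tail[0].iov_len */
	size_t len;		/* total buffer length */
};

/* Trim at most "len" bytes off the end: tail first, then pages, then head. */
static void xdr_buf_trim_model(struct xdr_buf_model *buf, size_t len)
{
	size_t trim = len, cur;

	cur = buf->tail_len < trim ? buf->tail_len : trim;
	buf->tail_len -= cur;
	trim -= cur;

	cur = buf->page_len < trim ? buf->page_len : trim;
	buf->page_len -= cur;
	trim -= cur;

	cur = buf->head_len < trim ? buf->head_len : trim;
	buf->head_len -= cur;
	trim -= cur;

	/* The total shrinks only by what was actually trimmed. */
	buf->len -= len - trim;
}
```

As in the real helper, asking for more than the buffer holds simply trims what is there and adjusts the total accordingly.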

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/sunrpc/xdr.h          |  1 +
 net/sunrpc/auth_gss/gss_krb5_wrap.c |  2 ++
 net/sunrpc/auth_gss/svcauth_gss.c   | 10 +++++++--
 net/sunrpc/xdr.c                    | 42 +++++++++++++++++++++++++++++++++++++
 4 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index 224d060..15f9204 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -152,6 +152,7 @@ xdr_adjust_iovec(struct kvec *iov, __be32 *p)
 extern void xdr_shift_buf(struct xdr_buf *, size_t);
 extern void xdr_buf_from_iov(struct kvec *, struct xdr_buf *);
 extern int xdr_buf_subsegment(struct xdr_buf *, struct xdr_buf *, unsigned int, unsigned int);
+extern void xdr_buf_trim(struct xdr_buf *, unsigned int);
 extern int xdr_buf_read_netobj(struct xdr_buf *, struct xdr_netobj *, unsigned int);
 extern int read_bytes_from_xdr_buf(struct xdr_buf *, unsigned int, void *, unsigned int);
 extern int write_bytes_to_xdr_buf(struct xdr_buf *, unsigned int, void *, unsigned int);
diff --git a/net/sunrpc/auth_gss/gss_krb5_wrap.c b/net/sunrpc/auth_gss/gss_krb5_wrap.c
index 107c452..88edec9 100644
--- a/net/sunrpc/auth_gss/gss_krb5_wrap.c
+++ b/net/sunrpc/auth_gss/gss_krb5_wrap.c
@@ -574,6 +574,8 @@ gss_unwrap_kerberos_v2(struct krb5_ctx *kctx, int offset, struct xdr_buf *buf)
 	buf->head[0].iov_len -= GSS_KRB5_TOK_HDR_LEN + headskip;
 	buf->len -= GSS_KRB5_TOK_HDR_LEN + headskip;
 
+	/* Trim off the checksum blob */
+	xdr_buf_trim(buf, GSS_KRB5_TOK_HDR_LEN + tailskip);
 	return GSS_S_COMPLETE;
 }
 
diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index 73e9573..a5b41e2 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -817,13 +817,17 @@ read_u32_from_xdr_buf(struct xdr_buf *buf, int base, u32 *obj)
  *	The server uses base of head iovec as read pointer, while the
  *	client uses separate pointer. */
 static int
-unwrap_integ_data(struct xdr_buf *buf, u32 seq, struct gss_ctx *ctx)
+unwrap_integ_data(struct svc_rqst *rqstp, struct xdr_buf *buf, u32 seq, struct gss_ctx *ctx)
 {
 	int stat = -EINVAL;
 	u32 integ_len, maj_stat;
 	struct xdr_netobj mic;
 	struct xdr_buf integ_buf;
 
+	/* Did we already verify the signature on the original pass through? */
+	if (rqstp->rq_deferred)
+		return 0;
+
 	integ_len = svc_getnl(&buf->head[0]);
 	if (integ_len & 3)
 		return stat;
@@ -846,6 +850,8 @@ unwrap_integ_data(struct xdr_buf *buf, u32 seq, struct gss_ctx *ctx)
 		goto out;
 	if (svc_getnl(&buf->head[0]) != seq)
 		goto out;
+	/* trim off the mic at the end before returning */
+	xdr_buf_trim(buf, mic.len + 4);
 	stat = 0;
 out:
 	kfree(mic.data);
@@ -1190,7 +1196,7 @@ svcauth_gss_accept(struct svc_rqst *rqstp, __be32 *authp)
 			/* placeholders for length and seq. number: */
 			svc_putnl(resv, 0);
 			svc_putnl(resv, 0);
-			if (unwrap_integ_data(&rqstp->rq_arg,
+			if (unwrap_integ_data(rqstp, &rqstp->rq_arg,
 					gc->gc_seq, rsci->mechctx))
 				goto garbage_args;
 			break;
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 5605563..02c1577 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -879,6 +879,48 @@ xdr_buf_subsegment(struct xdr_buf *buf, struct xdr_buf *subbuf,
 }
 EXPORT_SYMBOL_GPL(xdr_buf_subsegment);
 
+/**
+ * xdr_buf_trim - lop at most "len" bytes off the end of "buf"
+ * @buf: buf to be trimmed
+ * @len: number of bytes to reduce "buf" by
+ *
+ * Trim an xdr_buf by the given number of bytes by fixing up the lengths. Note
+ * that it's possible that we'll trim less than that amount if the xdr_buf is
+ * too small, or if (for instance) it's all in the head and the parser has
+ * already read too far into it.
+ */
+void xdr_buf_trim(struct xdr_buf *buf, unsigned int len)
+{
+	size_t cur;
+	unsigned int trim = len;
+
+	if (buf->tail[0].iov_len) {
+		cur = min_t(size_t, buf->tail[0].iov_len, trim);
+		buf->tail[0].iov_len -= cur;
+		trim -= cur;
+		/* XXX: ok to leave tail[0].iov_base as non-NULL here? */
+		if (!trim)
+			goto fix_len;
+	}
+
+	if (buf->page_len) {
+		cur = min_t(unsigned int, buf->page_len, trim);
+		buf->page_len -= cur;
+		trim -= cur;
+		if (!trim)
+			goto fix_len;
+	}
+
+	if (buf->head[0].iov_len) {
+		cur = min_t(size_t, buf->head[0].iov_len, trim);
+		buf->head[0].iov_len -= cur;
+		trim -= cur;
+	}
+fix_len:
+	buf->len -= (len - trim);
+}
+EXPORT_SYMBOL_GPL(xdr_buf_trim);
+
 static void __read_bytes_from_xdr_buf(struct xdr_buf *subbuf, void *obj, unsigned int len)
 {
 	unsigned int this_len;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 14:51 [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC Jeff Layton
  2013-02-07 14:51 ` [PATCH v3 1/2] sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer Jeff Layton
@ 2013-02-07 14:51 ` Jeff Layton
  2013-02-07 15:51   ` Chuck Lever
  2013-02-07 15:11 ` [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC J. Bruce Fields
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2013-02-07 14:51 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Now that we're allowing more DRC entries, it becomes a lot easier to hit
problems with XID collisions. In order to mitigate those, calculate the
crc32 of up to the first 256 bytes of each request coming in and store
that in the cache entry, along with the total length of the request.
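The idea can be sketched in plain userspace C (a hedged illustration: `crc32_sketch` and `request_csum` are hypothetical stand-ins for the kernel's crc32() and the patch's nfsd_cache_crc(), and the page array is flattened into one contiguous region):

```c
#include <stddef.h>
#include <stdint.h>

#define RC_CSUMLEN 256U	/* same limit the patch uses */

/* Bitwise CRC-32 (zlib-style, polynomial 0xEDB88320). Chaining calls
 * continues the same CRC across buffer boundaries. */
static uint32_t crc32_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320U & -(crc & 1));
	}
	return ~crc;
}

/* Checksum at most the first RC_CSUMLEN bytes of a request that is
 * split into a head buffer plus page data, continuing the CRC across
 * the boundary the way the patch walks rq_arg. */
static uint32_t request_csum(uint32_t seed,
			     const unsigned char *head, size_t head_len,
			     const unsigned char *pages, size_t page_len)
{
	size_t csum_len = head_len + page_len;
	size_t len;
	uint32_t crc;

	if (csum_len > RC_CSUMLEN)
		csum_len = RC_CSUMLEN;

	/* head first */
	len = head_len < csum_len ? head_len : csum_len;
	crc = crc32_sketch(seed, head, len);
	csum_len -= len;

	/* then continue into the page data */
	if (csum_len)
		crc = crc32_sketch(crc, pages, csum_len);
	return crc;
}
```

Two requests that share an XID but differ anywhere in the first 256 bytes will get different checksums; bytes beyond the window are deliberately ignored (the stored total length still distinguishes requests of different sizes).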

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfsd/cache.h    |  5 +++++
 fs/nfsd/nfscache.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 9c7232b..4822db3 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -29,6 +29,8 @@ struct svc_cacherep {
 	u32			c_prot;
 	u32			c_proc;
 	u32			c_vers;
+	unsigned int		c_len;
+	u32			c_crc;
 	unsigned long		c_timestamp;
 	union {
 		struct kvec	u_vec;
@@ -73,6 +75,9 @@ enum {
 /* Cache entries expire after this time period */
 #define RC_EXPIRE		(120 * HZ)
 
+/* Checksum this amount of the request */
+#define RC_CSUMLEN		(256U)
+
 int	nfsd_reply_cache_init(void);
 void	nfsd_reply_cache_shutdown(void);
 int	nfsd_cache_lookup(struct svc_rqst *);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index f754469..a8c3f1e 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -11,6 +11,8 @@
 #include <linux/slab.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/highmem.h>
+#include <linux/crc32.h>
+#include <linux/sunrpc/svcauth_gss.h>
 
 #include "nfsd.h"
 #include "cache.h"
@@ -24,6 +26,7 @@ static struct list_head 	lru_head;
 static struct kmem_cache	*drc_slab;
 static unsigned int		num_drc_entries;
 static unsigned int		max_drc_entries;
+static u32			crc_seed;
 
 /*
  * Calculate the hash index from an XID.
@@ -130,6 +133,9 @@ int nfsd_reply_cache_init(void)
 	INIT_LIST_HEAD(&lru_head);
 	max_drc_entries = nfsd_cache_size_limit();
 	num_drc_entries = 0;
+
+	/* Is a random seed any better than some well-defined constant? */
+	get_random_bytes(&crc_seed, sizeof(crc_seed));
 	return 0;
 out_nomem:
 	printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -238,12 +244,45 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
 }
 
 /*
+ * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
+ */
+static u32
+nfsd_cache_crc(struct svc_rqst *rqstp)
+{
+	int idx;
+	unsigned int base;
+	u32 crc;
+	struct xdr_buf *buf = &rqstp->rq_arg;
+	const unsigned char *p = buf->head[0].iov_base;
+	size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
+				RC_CSUMLEN);
+	size_t len = min(buf->head[0].iov_len, csum_len);
+
+	/* rq_arg.head first */
+	crc = crc32(crc_seed, p, len);
+	csum_len -= len;
+
+	/* Continue into page array */
+	idx = buf->page_base / PAGE_SIZE;
+	base = buf->page_base & ~PAGE_MASK;
+	while (csum_len) {
+		p = page_address(buf->pages[idx]) + base;
+		len = min(PAGE_SIZE - base, csum_len);
+		crc = crc32(crc, p, len);
+		csum_len -= len;
+		base = 0;
+		++idx;
+	}
+	return crc;
+}
+
+/*
  * Search the request hash for an entry that matches the given rqstp.
  * Must be called with cache_lock held. Returns the found entry or
  * NULL on failure.
  */
 static struct svc_cacherep *
-nfsd_cache_search(struct svc_rqst *rqstp)
+nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
 {
 	struct svc_cacherep	*rp;
 	struct hlist_node	*hn;
@@ -257,6 +296,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
 	hlist_for_each_entry(rp, hn, rh, c_hash) {
 		if (xid == rp->c_xid && proc == rp->c_proc &&
 		    proto == rp->c_prot && vers == rp->c_vers &&
+		    rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
 		    rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
 		    rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
 			return rp;
@@ -276,7 +316,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 	__be32			xid = rqstp->rq_xid;
 	u32			proto =  rqstp->rq_prot,
 				vers = rqstp->rq_vers,
-				proc = rqstp->rq_proc;
+				proc = rqstp->rq_proc,
+				crc;
 	unsigned long		age;
 	int type = rqstp->rq_cachetype;
 	int rtn;
@@ -287,10 +328,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 		return RC_DOIT;
 	}
 
+	crc = nfsd_cache_crc(rqstp);
+
 	spin_lock(&cache_lock);
 	rtn = RC_DOIT;
 
-	rp = nfsd_cache_search(rqstp);
+	rp = nfsd_cache_search(rqstp, crc);
 	if (rp)
 		goto found_entry;
 
@@ -318,7 +361,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 	 * Must search again just in case someone inserted one
 	 * after we dropped the lock above.
 	 */
-	found = nfsd_cache_search(rqstp);
+	found = nfsd_cache_search(rqstp, crc);
 	if (found) {
 		nfsd_reply_cache_free_locked(rp);
 		rp = found;
@@ -344,6 +387,8 @@ setup_entry:
 	rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
 	rp->c_prot = proto;
 	rp->c_vers = vers;
+	rp->c_len = rqstp->rq_arg.len;
+	rp->c_crc = crc;
 
 	hash_refile(rp);
 	lru_put_end(rp);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC
  2013-02-07 14:51 [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC Jeff Layton
  2013-02-07 14:51 ` [PATCH v3 1/2] sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer Jeff Layton
  2013-02-07 14:51 ` [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request Jeff Layton
@ 2013-02-07 15:11 ` J. Bruce Fields
  2 siblings, 0 replies; 20+ messages in thread
From: J. Bruce Fields @ 2013-02-07 15:11 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

On Thu, Feb 07, 2013 at 09:51:39AM -0500, Jeff Layton wrote:
> This patchset is a respin of the portion of my DRC overhaul that Bruce
> has not yet committed. The main difference is the first patch, which
> adds a routine to strip the trailing checksum off of a decrypted or
> integrity-verified buffer.
> 
> I've tested both the client and server with different krb5 flavors (and
> using both the v1 and v2 codepaths) and it seems to work fine with those
> checksum blobs stripped off.
> 
> I think we should consider this for 3.9 since XID collisions are a lot
> more likely now with the new DRC code in place.

Looks good--applying for 3.9 absent any objections.

--b.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 14:51 ` [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request Jeff Layton
@ 2013-02-07 15:51   ` Chuck Lever
  2013-02-07 16:00     ` J. Bruce Fields
  2013-02-07 18:03     ` Jeff Layton
  0 siblings, 2 replies; 20+ messages in thread
From: Chuck Lever @ 2013-02-07 15:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: bfields, linux-nfs


On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:

> Now that we're allowing more DRC entries, it becomes a lot easier to hit
> problems with XID collisions. In order to mitigate those, calculate the
> crc32 of up to the first 256 bytes of each request coming in and store
> that in the cache entry, along with the total length of the request.

I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.

Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.


-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 15:51   ` Chuck Lever
@ 2013-02-07 16:00     ` J. Bruce Fields
  2013-02-07 16:23       ` Chuck Lever
  2013-02-07 16:32       ` Myklebust, Trond
  2013-02-07 18:03     ` Jeff Layton
  1 sibling, 2 replies; 20+ messages in thread
From: J. Bruce Fields @ 2013-02-07 16:00 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
> 
> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > Now that we're allowing more DRC entries, it becomes a lot easier to
> > hit problems with XID collisions. In order to mitigate those,
> > calculate the crc32 of up to the first 256 bytes of each request
> > coming in and store that in the cache entry, along with the total
> > length of the request.
> 
> I'm happy to see a checksummed DRC finally become reality for the
> Linux NFS server.
> 
> Have you measured the CPU utilization impact and CPU cache footprint
> of performing a CRC computation for every incoming RPC?

Note this is over the first 256 bytes of the request--which we're
probably just about to read for xdr decoding anyway.

> I'm wondering if a simpler checksum might be just as useful but less
> costly to compute.

What would be an example of a simpler checksum?

--b.

> 
> 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/nfsd/cache.h    |  5 +++++
> > fs/nfsd/nfscache.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 54 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
> > index 9c7232b..4822db3 100644
> > --- a/fs/nfsd/cache.h
> > +++ b/fs/nfsd/cache.h
> > @@ -29,6 +29,8 @@ struct svc_cacherep {
> > 	u32			c_prot;
> > 	u32			c_proc;
> > 	u32			c_vers;
> > +	unsigned int		c_len;
> > +	u32			c_crc;
> > 	unsigned long		c_timestamp;
> > 	union {
> > 		struct kvec	u_vec;
> > @@ -73,6 +75,9 @@ enum {
> > /* Cache entries expire after this time period */
> > #define RC_EXPIRE		(120 * HZ)
> > 
> > +/* Checksum this amount of the request */
> > +#define RC_CSUMLEN		(256U)
> > +
> > int	nfsd_reply_cache_init(void);
> > void	nfsd_reply_cache_shutdown(void);
> > int	nfsd_cache_lookup(struct svc_rqst *);
> > diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> > index f754469..a8c3f1e 100644
> > --- a/fs/nfsd/nfscache.c
> > +++ b/fs/nfsd/nfscache.c
> > @@ -11,6 +11,8 @@
> > #include <linux/slab.h>
> > #include <linux/sunrpc/addr.h>
> > #include <linux/highmem.h>
> > +#include <linux/crc32.h>
> > +#include <linux/sunrpc/svcauth_gss.h>
> > 
> > #include "nfsd.h"
> > #include "cache.h"
> > @@ -24,6 +26,7 @@ static struct list_head 	lru_head;
> > static struct kmem_cache	*drc_slab;
> > static unsigned int		num_drc_entries;
> > static unsigned int		max_drc_entries;
> > +static u32			crc_seed;
> > 
> > /*
> >  * Calculate the hash index from an XID.
> > @@ -130,6 +133,9 @@ int nfsd_reply_cache_init(void)
> > 	INIT_LIST_HEAD(&lru_head);
> > 	max_drc_entries = nfsd_cache_size_limit();
> > 	num_drc_entries = 0;
> > +
> > +	/* Is a random seed any better than some well-defined constant? */
> > +	get_random_bytes(&crc_seed, sizeof(crc_seed));
> > 	return 0;
> > out_nomem:
> > 	printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
> > @@ -238,12 +244,45 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
> > }
> > 
> > /*
> > + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
> > + */
> > +static u32
> > +nfsd_cache_crc(struct svc_rqst *rqstp)
> > +{
> > +	int idx;
> > +	unsigned int base;
> > +	u32 crc;
> > +	struct xdr_buf *buf = &rqstp->rq_arg;
> > +	const unsigned char *p = buf->head[0].iov_base;
> > +	size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
> > +				RC_CSUMLEN);
> > +	size_t len = min(buf->head[0].iov_len, csum_len);
> > +
> > +	/* rq_arg.head first */
> > +	crc = crc32(crc_seed, p, len);
> > +	csum_len -= len;
> > +
> > +	/* Continue into page array */
> > +	idx = buf->page_base / PAGE_SIZE;
> > +	base = buf->page_base & ~PAGE_MASK;
> > +	while (csum_len) {
> > +		p = page_address(buf->pages[idx]) + base;
> > +		len = min(PAGE_SIZE - base, csum_len);
> > +		crc = crc32(crc, p, len);
> > +		csum_len -= len;
> > +		base = 0;
> > +		++idx;
> > +	}
> > +	return crc;
> > +}
> > +
> > +/*
> >  * Search the request hash for an entry that matches the given rqstp.
> >  * Must be called with cache_lock held. Returns the found entry or
> >  * NULL on failure.
> >  */
> > static struct svc_cacherep *
> > -nfsd_cache_search(struct svc_rqst *rqstp)
> > +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
> > {
> > 	struct svc_cacherep	*rp;
> > 	struct hlist_node	*hn;
> > @@ -257,6 +296,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
> > 	hlist_for_each_entry(rp, hn, rh, c_hash) {
> > 		if (xid == rp->c_xid && proc == rp->c_proc &&
> > 		    proto == rp->c_prot && vers == rp->c_vers &&
> > +		    rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
> > 		    rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
> > 		    rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
> > 			return rp;
> > @@ -276,7 +316,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > 	__be32			xid = rqstp->rq_xid;
> > 	u32			proto =  rqstp->rq_prot,
> > 				vers = rqstp->rq_vers,
> > -				proc = rqstp->rq_proc;
> > +				proc = rqstp->rq_proc,
> > +				crc;
> > 	unsigned long		age;
> > 	int type = rqstp->rq_cachetype;
> > 	int rtn;
> > @@ -287,10 +328,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > 		return RC_DOIT;
> > 	}
> > 
> > +	crc = nfsd_cache_crc(rqstp);
> > +
> > 	spin_lock(&cache_lock);
> > 	rtn = RC_DOIT;
> > 
> > -	rp = nfsd_cache_search(rqstp);
> > +	rp = nfsd_cache_search(rqstp, crc);
> > 	if (rp)
> > 		goto found_entry;
> > 
> > @@ -318,7 +361,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
> > 	 * Must search again just in case someone inserted one
> > 	 * after we dropped the lock above.
> > 	 */
> > -	found = nfsd_cache_search(rqstp);
> > +	found = nfsd_cache_search(rqstp, crc);
> > 	if (found) {
> > 		nfsd_reply_cache_free_locked(rp);
> > 		rp = found;
> > @@ -344,6 +387,8 @@ setup_entry:
> > 	rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
> > 	rp->c_prot = proto;
> > 	rp->c_vers = vers;
> > +	rp->c_len = rqstp->rq_arg.len;
> > +	rp->c_crc = crc;
> > 
> > 	hash_refile(rp);
> > 	lru_put_end(rp);
> > -- 
> > 1.7.11.7
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:00     ` J. Bruce Fields
@ 2013-02-07 16:23       ` Chuck Lever
  2013-02-07 16:37         ` J. Bruce Fields
  2013-02-07 16:41         ` Jim Rees
  2013-02-07 16:32       ` Myklebust, Trond
  1 sibling, 2 replies; 20+ messages in thread
From: Chuck Lever @ 2013-02-07 16:23 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs


On Feb 7, 2013, at 11:00 AM, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
>> 
>> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>> 
>>> Now that we're allowing more DRC entries, it becomes a lot easier to
>>> hit problems with XID collisions. In order to mitigate those,
>>> calculate the crc32 of up to the first 256 bytes of each request
>>> coming in and store that in the cache entry, along with the total
>>> length of the request.
>> 
>> I'm happy to see a checksummed DRC finally become reality for the
>> Linux NFS server.
>> 
>> Have you measured the CPU utilization impact and CPU cache footprint
>> of performing a CRC computation for every incoming RPC?
> 
> Note this is over the first 256 bytes of the request--which we're
> probably just about to read for xdr decoding anyway.

XDR decoding is copying and branching.  Computing a CRC involves real math, which tends to be significantly more expensive than successfully predicted branches, especially on low-power CPUs that might be found in SOHO NAS products.


> 
>> I'm wondering if a simpler checksum might be just as useful but less
>> costly to compute.
> 
> What would be an example of a simpler checksum?

The same one TCP uses, like a simple additive sum, or an XOR.  Is a heavyweight checksum needed because checksums generated with a simple function are more likely to collide?
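[A quick userspace sketch of the tradeoff being raised here: an additive sum is order-insensitive, so two requests containing the same bytes in a different order collide, while CRC-32 is position-sensitive and guaranteed to detect any error burst of 32 bits or less. `sum32` and `crc32_sketch` are illustrative helpers, not kernel APIs.]

```c
#include <stddef.h>
#include <stdint.h>

/* Simple additive checksum: cheap, but invariant under byte reordering. */
static uint32_t sum32(const unsigned char *p, size_t len)
{
	uint32_t s = 0;

	while (len--)
		s += *p++;
	return s;
}

/* Bitwise CRC-32 (polynomial 0xEDB88320): position-sensitive, and
 * detects any burst of differing bits spanning 32 bits or less. */
static uint32_t crc32_sketch(const unsigned char *p, size_t len)
{
	uint32_t crc = 0xFFFFFFFFU;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320U & -(crc & 1));
	}
	return ~crc;
}
```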

Not that this should hold up merging Jeff's work!  We can easily tweak or replace the checksum algorithm after it's upstream.  It's not kABI.

But someone should assess the impact of the additional checksum computation.  CRC seems to me heavier than is needed here.


Possible tweaks:

Why 256 bytes?  Is that too much?  Or not enough for some NFSv4 compounds that might often start with the same data?  Could we, for instance, use fewer bytes for NFSv2 and NFSv3?  Or even a variable checksum length depending on the NFS operation?  Is 256 bytes enough for NFSv4.1, whose compounds always start with the same operation?

If integrity or privacy is in play, can we use that information in place of a separate DRC checksum?



> 
> --b.
> 
>> 
>> 
>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>> ---
>>> fs/nfsd/cache.h    |  5 +++++
>>> fs/nfsd/nfscache.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>> 2 files changed, 54 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
>>> index 9c7232b..4822db3 100644
>>> --- a/fs/nfsd/cache.h
>>> +++ b/fs/nfsd/cache.h
>>> @@ -29,6 +29,8 @@ struct svc_cacherep {
>>> 	u32			c_prot;
>>> 	u32			c_proc;
>>> 	u32			c_vers;
>>> +	unsigned int		c_len;
>>> +	u32			c_crc;
>>> 	unsigned long		c_timestamp;
>>> 	union {
>>> 		struct kvec	u_vec;
>>> @@ -73,6 +75,9 @@ enum {
>>> /* Cache entries expire after this time period */
>>> #define RC_EXPIRE		(120 * HZ)
>>> 
>>> +/* Checksum this amount of the request */
>>> +#define RC_CSUMLEN		(256U)
>>> +
>>> int	nfsd_reply_cache_init(void);
>>> void	nfsd_reply_cache_shutdown(void);
>>> int	nfsd_cache_lookup(struct svc_rqst *);
>>> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
>>> index f754469..a8c3f1e 100644
>>> --- a/fs/nfsd/nfscache.c
>>> +++ b/fs/nfsd/nfscache.c
>>> @@ -11,6 +11,8 @@
>>> #include <linux/slab.h>
>>> #include <linux/sunrpc/addr.h>
>>> #include <linux/highmem.h>
>>> +#include <linux/crc32.h>
>>> +#include <linux/sunrpc/svcauth_gss.h>
>>> 
>>> #include "nfsd.h"
>>> #include "cache.h"
>>> @@ -24,6 +26,7 @@ static struct list_head 	lru_head;
>>> static struct kmem_cache	*drc_slab;
>>> static unsigned int		num_drc_entries;
>>> static unsigned int		max_drc_entries;
>>> +static u32			crc_seed;
>>> 
>>> /*
>>> * Calculate the hash index from an XID.
>>> @@ -130,6 +133,9 @@ int nfsd_reply_cache_init(void)
>>> 	INIT_LIST_HEAD(&lru_head);
>>> 	max_drc_entries = nfsd_cache_size_limit();
>>> 	num_drc_entries = 0;
>>> +
>>> +	/* Is a random seed any better than some well-defined constant? */
>>> +	get_random_bytes(&crc_seed, sizeof(crc_seed));
>>> 	return 0;
>>> out_nomem:
>>> 	printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
>>> @@ -238,12 +244,45 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
>>> }
>>> 
>>> /*
>>> + * Walk an xdr_buf and get a CRC for at most the first RC_CSUMLEN bytes
>>> + */
>>> +static u32
>>> +nfsd_cache_crc(struct svc_rqst *rqstp)
>>> +{
>>> +	int idx;
>>> +	unsigned int base;
>>> +	u32 crc;
>>> +	struct xdr_buf *buf = &rqstp->rq_arg;
>>> +	const unsigned char *p = buf->head[0].iov_base;
>>> +	size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
>>> +				RC_CSUMLEN);
>>> +	size_t len = min(buf->head[0].iov_len, csum_len);
>>> +
>>> +	/* rq_arg.head first */
>>> +	crc = crc32(crc_seed, p, len);
>>> +	csum_len -= len;
>>> +
>>> +	/* Continue into page array */
>>> +	idx = buf->page_base / PAGE_SIZE;
>>> +	base = buf->page_base & ~PAGE_MASK;
>>> +	while (csum_len) {
>>> +		p = page_address(buf->pages[idx]) + base;
>>> +		len = min(PAGE_SIZE - base, csum_len);
>>> +		crc = crc32(crc, p, len);
>>> +		csum_len -= len;
>>> +		base = 0;
>>> +		++idx;
>>> +	}
>>> +	return crc;
>>> +}
>>> +
>>> +/*
>>> * Search the request hash for an entry that matches the given rqstp.
>>> * Must be called with cache_lock held. Returns the found entry or
>>> * NULL on failure.
>>> */
>>> static struct svc_cacherep *
>>> -nfsd_cache_search(struct svc_rqst *rqstp)
>>> +nfsd_cache_search(struct svc_rqst *rqstp, u32 crc)
>>> {
>>> 	struct svc_cacherep	*rp;
>>> 	struct hlist_node	*hn;
>>> @@ -257,6 +296,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
>>> 	hlist_for_each_entry(rp, hn, rh, c_hash) {
>>> 		if (xid == rp->c_xid && proc == rp->c_proc &&
>>> 		    proto == rp->c_prot && vers == rp->c_vers &&
>>> +		    rqstp->rq_arg.len == rp->c_len && crc == rp->c_crc &&
>>> 		    rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
>>> 		    rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
>>> 			return rp;
>>> @@ -276,7 +316,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>>> 	__be32			xid = rqstp->rq_xid;
>>> 	u32			proto =  rqstp->rq_prot,
>>> 				vers = rqstp->rq_vers,
>>> -				proc = rqstp->rq_proc;
>>> +				proc = rqstp->rq_proc,
>>> +				crc;
>>> 	unsigned long		age;
>>> 	int type = rqstp->rq_cachetype;
>>> 	int rtn;
>>> @@ -287,10 +328,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>>> 		return RC_DOIT;
>>> 	}
>>> 
>>> +	crc = nfsd_cache_crc(rqstp);
>>> +
>>> 	spin_lock(&cache_lock);
>>> 	rtn = RC_DOIT;
>>> 
>>> -	rp = nfsd_cache_search(rqstp);
>>> +	rp = nfsd_cache_search(rqstp, crc);
>>> 	if (rp)
>>> 		goto found_entry;
>>> 
>>> @@ -318,7 +361,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
>>> 	 * Must search again just in case someone inserted one
>>> 	 * after we dropped the lock above.
>>> 	 */
>>> -	found = nfsd_cache_search(rqstp);
>>> +	found = nfsd_cache_search(rqstp, crc);
>>> 	if (found) {
>>> 		nfsd_reply_cache_free_locked(rp);
>>> 		rp = found;
>>> @@ -344,6 +387,8 @@ setup_entry:
>>> 	rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
>>> 	rp->c_prot = proto;
>>> 	rp->c_vers = vers;
>>> +	rp->c_len = rqstp->rq_arg.len;
>>> +	rp->c_crc = crc;
>>> 
>>> 	hash_refile(rp);
>>> 	lru_put_end(rp);
>>> -- 
>>> 1.7.11.7
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> -- 
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>> 
>> 
>> 
>> 

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:00     ` J. Bruce Fields
  2013-02-07 16:23       ` Chuck Lever
@ 2013-02-07 16:32       ` Myklebust, Trond
  2013-02-07 18:35         ` Jeff Layton
  2013-02-08 15:41         ` Jeff Layton
  1 sibling, 2 replies; 20+ messages in thread
From: Myklebust, Trond @ 2013-02-07 16:32 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: Jeff Layton, linux-nfs

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Thursday, February 07, 2013 11:01 AM
> To: Chuck Lever
> Cc: Jeff Layton; linux-nfs@vger.kernel.org
> Subject: Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of
> request
> 
> On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
> >
> > On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> >
> > > Now that we're allowing more DRC entries, it becomes a lot easier to
> > > hit problems with XID collisions. In order to mitigate those,
> > > calculate the crc32 of up to the first 256 bytes of each request
> > > coming in and store that in the cache entry, along with the total
> > > length of the request.
> >
> > I'm happy to see a checksummed DRC finally become reality for the
> > Linux NFS server.
> >
> > Have you measured the CPU utilization impact and CPU cache footprint
> > of performing a CRC computation for every incoming RPC?
> 
> Note this is over the first 256 bytes of the request--which we're probably just
> about to read for xdr decoding anyway.

- Would it make sense perhaps to generate the checksum as you are reading the data?
- Also, is 256 bytes sufficient? How far does that get you with your average WRITE compound?
- Could the integrity checksum in RPCSEC_GSS/krbi be reused as a DRC checksum?

Cheers
   Trond


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:23       ` Chuck Lever
@ 2013-02-07 16:37         ` J. Bruce Fields
  2013-02-07 16:41         ` Jim Rees
  1 sibling, 0 replies; 20+ messages in thread
From: J. Bruce Fields @ 2013-02-07 16:37 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Thu, Feb 07, 2013 at 11:23:17AM -0500, Chuck Lever wrote:
> 
> On Feb 7, 2013, at 11:00 AM, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
> >> 
> >> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> >> 
> >>> Now that we're allowing more DRC entries, it becomes a lot easier to
> >>> hit problems with XID collisions. In order to mitigate those,
> >>> calculate the crc32 of up to the first 256 bytes of each request
> >>> coming in and store that in the cache entry, along with the total
> >>> length of the request.
> >> 
> >> I'm happy to see a checksummed DRC finally become reality for the
> >> Linux NFS server.
> >> 
> >> Have you measured the CPU utilization impact and CPU cache footprint
> >> of performing a CRC computation for every incoming RPC?
> > 
> > Note this is over the first 256 bytes of the request--which we're
> > probably just about to read for xdr decoding anyway.
> 
> XDR decoding is copying and branching.  Computing a CRC involves real math, which tends to be significantly more expensive than successfully predicted branches, especially on low-power CPUs that might be found in SOHO NAS products.

OK, I wouldn't know.

(I was just responding to the "cache footprint" question--I thought you
were concerned about reading in a bunch of the request.)  Looks like the
biggest piece of the crc32 code is a 1k lookup table?

> >> I'm wondering if a simpler checksum might be just as useful but less
> >> costly to compute.
> > 
> > What would be an example of a simpler checksum?
> 
> The same one TCP uses, like a simple additive sum, or an XOR.  Is a heavyweight checksum needed because checksums generated with a simple function are more likely to collide?
> 
> Not that this should hold up merging Jeff's work!  We can easily tweak or replace the checksum algorithm after it's upstream.  It's not kABI.
> 
> But someone should assess the impact of the additional checksum computation.  CRC seems to me heavier than is needed here.

OK, sure, may be worth looking into.

> Possible tweaks:
> 
> Why 256 bytes?  Is that too much?  Or not enough for some NFSv4
> compounds that might often start with the same data?  Could we, for
> instance, use fewer bytes for NFSv2 and NFSv3?  Or even a variable
> checksum length depending on the NFS operation?  Is 256 bytes enough
> for NFSv4.1, whose compounds always start with the same operation?

NFSv4.1 has the drc turned completely off.

> If integrity or privacy is in play, can we use that information in
> place of a separate DRC checksum?

There's a gss sequence number that's incremented even on resends of the
same rpc, so this doesn't work.  (By design: you don't want an attacker
to be able to replay an old rpc.)

--b.


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:23       ` Chuck Lever
  2013-02-07 16:37         ` J. Bruce Fields
@ 2013-02-07 16:41         ` Jim Rees
  1 sibling, 0 replies; 20+ messages in thread
From: Jim Rees @ 2013-02-07 16:41 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Jeff Layton, linux-nfs

Chuck Lever wrote:

  > 
  >> I'm wondering if a simpler checksum might be just as useful but less
  >> costly to compute.
  > 
  > What would be an example of a simpler checksum?
  
  The same one TCP uses, like a simple additive sum, or an XOR.  Is a
  heavyweight checksum needed because checksums generated with a simple
  function are more likely to collide?

At least CRC isn't as bad as a crypto checksum like md5. Those are often
misused when what's really wanted is just a simple guard against accidental
corruption. In this case I'd go with the tcp checksum, which I think has an
api and even an accelerated implementation depending on the machine arch,
although I can't find it right now.


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 15:51   ` Chuck Lever
  2013-02-07 16:00     ` J. Bruce Fields
@ 2013-02-07 18:03     ` Jeff Layton
  2013-02-08 13:27       ` Jeff Layton
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2013-02-07 18:03 UTC (permalink / raw)
  To: Chuck Lever; +Cc: bfields, linux-nfs

On Thu, 7 Feb 2013 10:51:02 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > Now that we're allowing more DRC entries, it becomes a lot easier to hit
> > problems with XID collisions. In order to mitigate those, calculate the
> > crc32 of up to the first 256 bytes of each request coming in and store
> > that in the cache entry, along with the total length of the request.
> 
> I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.
> 
> Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.
> 

No, I haven't, at least not in any sort of rigorous way. It's pretty
negligible on "normal" PC hardware, but I think most intel and amd cpus
have instructions for handling crc32. I'm ok with a different checksum,
we don't need anything cryptographically secure here. I simply chose
crc32 since it has an easily available API, and I figured it would be
fairly lightweight.

I originally had hopes of just using the checksum in the TCP/UDP
header, but that's handled in hw by some cards and so isn't available.
There are also things that change during a retransmit (like GSS sequence
numbers), so we can't just scrape those out anyway.

As far as why 256 bytes, we had had a bug opened a while back (by
Oracle, I think), that asked us to add this capability and suggested
200 bytes. I like powers of 2 so I rounded up. We could easily extend
that though by just changing RC_CSUMLEN if we think it's not enough.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:32       ` Myklebust, Trond
@ 2013-02-07 18:35         ` Jeff Layton
  2013-02-08 15:41         ` Jeff Layton
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-07 18:35 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

On Thu, 7 Feb 2013 16:32:20 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> > owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> > Sent: Thursday, February 07, 2013 11:01 AM
> > To: Chuck Lever
> > Cc: Jeff Layton; linux-nfs@vger.kernel.org
> > Subject: Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of
> > request
> > 
> > On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
> > >
> > > On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > >
> > > > Now that we're allowing more DRC entries, it becomes a lot easier to
> > > > hit problems with XID collisions. In order to mitigate those,
> > > > calculate the crc32 of up to the first 256 bytes of each request
> > > > coming in and store that in the cache entry, along with the total
> > > > length of the request.
> > >
> > > I'm happy to see a checksummed DRC finally become reality for the
> > > Linux NFS server.
> > >
> > > Have you measured the CPU utilization impact and CPU cache footprint
> > > of performing a CRC computation for every incoming RPC?
> > 
> > Note this is over the first 256 bytes of the request--which we're probably just
> > about to read for xdr decoding anyway.
> 
> - Would it make sense perhaps to generate the checksum as you are reading the data?

It would be nice, but that would require some significant
reengineering, AFAICT. I'm not sure this is worth doing all of that,
but maybe there's an easy way to do that that I'm not seeing.

> - Also, is 256 bytes sufficient? How far does that get you with your average WRITE compound?

Mostly that length comes from a bug we had opened a while back which was
entitled "Oracle has insisted of all the NAS vendors that for all the
dNFS IO, the first 200 bytes check-sum of every write to be validated
before the commit takes place." The bug is marked private or I'd post a
link to it here.

In any case, the title is poorly worded, but basically they were saying
we should checksum the first 200 bytes of write data as a guard against
xid collisions in the DRC. I rounded it up to 256 just because I like
powers of 2 and we needed some extra to cover the NFS header anyway.

We could always extend that, or even make it variable based on some
criteria.

> - Could the integrity checksum in RPCSEC_GSS/krbi be reused as a DRC checksum?
> 

Sadly, no. As Bruce pointed out, that has GSSAPI sequence numbers,
which change on a retransmit. Scraping the checksum out of the TCP/UDP
headers somehow is also problematic for similar reasons...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 18:03     ` Jeff Layton
@ 2013-02-08 13:27       ` Jeff Layton
  2013-02-08 15:42         ` Chuck Lever
  2013-02-08 20:55         ` J. Bruce Fields
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-08 13:27 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Chuck Lever, bfields, linux-nfs, rees

On Thu, 7 Feb 2013 13:03:16 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Thu, 7 Feb 2013 10:51:02 -0500
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> > 
> > On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > > Now that we're allowing more DRC entries, it becomes a lot easier to hit
> > > problems with XID collisions. In order to mitigate those, calculate the
> > > crc32 of up to the first 256 bytes of each request coming in and store
> > > that in the cache entry, along with the total length of the request.
> > 
> > I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.
> > 
> > Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.
> > 
> 
> No, I haven't, at least not in any sort of rigorous way. It's pretty
> negligible on "normal" PC hardware, but I think most intel and amd cpus
> have instructions for handling crc32. I'm ok with a different checksum,
> we don't need anything cryptographically secure here. I simply chose
> crc32 since it has an easily available API, and I figured it would be
> fairly lightweight.
> 

After an abortive attempt to measure this with ftrace, I ended up
hacking together a patch to just measure the latency of the
nfsd_cache_csum/_crc functions to get some rough numbers. On my x86_64
KVM guest, the avg time to calculate the crc32 is ~1750ns. Using IP
checksums cuts that roughly in half to ~800ns. I'm not sure how best to
measure the cache footprint however.

Neither seems terribly significant, especially given the other
inefficiencies in this code. OTOH, I guess those latencies can add up,
and I don't see any need to use crc32 over the net/checksum.h routines.
We probably ought to go with my RFC patch from yesterday.

This could look very different on other arches too of course, but I
don't have a test rig for any other arches at the moment.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-07 16:32       ` Myklebust, Trond
  2013-02-07 18:35         ` Jeff Layton
@ 2013-02-08 15:41         ` Jeff Layton
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-08 15:41 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

On Thu, 7 Feb 2013 16:32:20 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> > owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> > Sent: Thursday, February 07, 2013 11:01 AM
> > To: Chuck Lever
> > Cc: Jeff Layton; linux-nfs@vger.kernel.org
> > Subject: Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of
> > request
> > 
> > On Thu, Feb 07, 2013 at 10:51:02AM -0500, Chuck Lever wrote:
> > >
> > > On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > >
> > > > Now that we're allowing more DRC entries, it becomes a lot easier to
> > > > hit problems with XID collisions. In order to mitigate those,
> > > > calculate the crc32 of up to the first 256 bytes of each request
> > > > coming in and store that in the cache entry, along with the total
> > > > length of the request.
> > >
> > > I'm happy to see a checksummed DRC finally become reality for the
> > > Linux NFS server.
> > >
> > > Have you measured the CPU utilization impact and CPU cache footprint
> > > of performing a CRC computation for every incoming RPC?
> > 
> > Note this is over the first 256 bytes of the request--which we're probably just
> > about to read for xdr decoding anyway.
> 
> - Would it make sense perhaps to generate the checksum as you are reading the data?

Maybe, but I'm not sure it's worth the engineering effort. On my highly
unscientific test rig, the time to calculate the checksum is ~800ns
using the IP checksum routine. If we can run the checksum while we read
the data off the socket, then we might shave some of that off, but that
"feels" like small potatoes to me.

OTOH, I don't have a good way to measure what the effect of this is on
the CPU cache. Any ideas of how to measure whether that's significant?

> - Also, is 256 bytes sufficient? How far does that get you with your average WRITE compound?

I also had a look at this, and it looks like the data portion of a
WRITE compound from the Linux client starts at ~80 bytes from the
beginning of the NFS part of the frame. So with a 256 byte checksum
length, we're checksumming over about 176 bytes of write data from a
typical NFSv4.0 WRITE request.

It might not hurt to extend that out to 384 bytes or so? We have to
figure that some clients also set the tag field, so having a bit of
fudge factor for that might be reasonable.

Making it variable doesn't seem like it would buy us too much since
most calls are much smaller than 256 bytes to begin with.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 13:27       ` Jeff Layton
@ 2013-02-08 15:42         ` Chuck Lever
  2013-02-08 15:57           ` Jeff Layton
  2013-02-08 20:55         ` J. Bruce Fields
  1 sibling, 1 reply; 20+ messages in thread
From: Chuck Lever @ 2013-02-08 15:42 UTC (permalink / raw)
  To: Jeff Layton; +Cc: bfields, linux-nfs, rees


On Feb 8, 2013, at 8:27 AM, Jeff Layton <jlayton@redhat.com> wrote:

> On Thu, 7 Feb 2013 13:03:16 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Thu, 7 Feb 2013 10:51:02 -0500
>> Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> 
>>> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>> 
>>>> Now that we're allowing more DRC entries, it becomes a lot easier to hit
>>>> problems with XID collisions. In order to mitigate those, calculate the
>>>> crc32 of up to the first 256 bytes of each request coming in and store
>>>> that in the cache entry, along with the total length of the request.
>>> 
>>> I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.
>>> 
>>> Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.
>>> 
>> 
>> No, I haven't, at least not in any sort of rigorous way. It's pretty
>> negligible on "normal" PC hardware, but I think most intel and amd cpus
>> have instructions for handling crc32. I'm ok with a different checksum,
>> we don't need anything cryptographically secure here. I simply chose
>> crc32 since it has an easily available API, and I figured it would be
>> fairly lightweight.
>> 
> 
> After an abortive attempt to measure this with ftrace, I ended up
> hacking together a patch to just measure the latency of the
> nfsd_cache_csum/_crc functions to get some rough numbers. On my x86_64
> KVM guest, the avg time to calculate the crc32 is ~1750ns. Using IP
> checksums cuts that roughly in half to ~800ns. I'm not sure how best to
> measure the cache footprint however.

Thanks for sorting that, I feel better having a real data point.

> Neither seems terribly significant, especially given the other
> inefficiencies in this code. OTOH, I guess those latencies can add up,
> and I don't see any need to use crc32 over the net/checksum.h routines.
> We probably ought to go with my RFC patch from yesterday.
> 
> This could look very different on other arches too of course, but I
> don't have a test rig for any other arches at the moment.

Going with the IP checksum seems perfectly adequate to me.

Bruce made a good point that it is difficult to know how to test the efficacy of this change.  We don't really know of any especially bad cases that defeat the current DRC, though I know the Red Hat Q/A team likes to use NFS/UDP and they seem to run into problems with some frequency.  Having a chat with Rick Macklem and/or Tom Talpey on this topic might have some value, and likewise perhaps the bug's OP might have some ideas about what test cases they prefer.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 15:42         ` Chuck Lever
@ 2013-02-08 15:57           ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-08 15:57 UTC (permalink / raw)
  To: Chuck Lever; +Cc: bfields, linux-nfs, rees

On Fri, 8 Feb 2013 10:42:20 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Feb 8, 2013, at 8:27 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Thu, 7 Feb 2013 13:03:16 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> > 
> >> On Thu, 7 Feb 2013 10:51:02 -0500
> >> Chuck Lever <chuck.lever@oracle.com> wrote:
> >> 
> >>> 
> >>> On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> >>> 
> >>>> Now that we're allowing more DRC entries, it becomes a lot easier to hit
> >>>> problems with XID collisions. In order to mitigate those, calculate the
> >>>> crc32 of up to the first 256 bytes of each request coming in and store
> >>>> that in the cache entry, along with the total length of the request.
> >>> 
> >>> I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.
> >>> 
> >>> Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.
> >>> 
> >> 
> >> No, I haven't, at least not in any sort of rigorous way. It's pretty
> >> negligible on "normal" PC hardware, but I think most intel and amd cpus
> >> have instructions for handling crc32. I'm ok with a different checksum,
> >> we don't need anything cryptographically secure here. I simply chose
> >> crc32 since it has an easily available API, and I figured it would be
> >> fairly lightweight.
> >> 
> > 
> > After an abortive attempt to measure this with ftrace, I ended up
> > hacking together a patch to just measure the latency of the
> > nfsd_cache_csum/_crc functions to get some rough numbers. On my x86_64
> > KVM guest, the avg time to calculate the crc32 is ~1750ns. Using IP
> > checksums cuts that roughly in half to ~800ns. I'm not sure how best to
> > measure the cache footprint however.
> 
> Thanks for sorting that, I feel better having a real data point.
> 

Me too -- thanks for raising the question.

> > Neither seems terribly significant, especially given the other
> > inefficiencies in this code. OTOH, I guess those latencies can add up,
> > and I don't see any need to use crc32 over the net/checksum.h routines.
> > We probably ought to go with my RFC patch from yesterday.
> > 
> > This could look very different on other arches too of course, but I
> > don't have a test rig for any other arches at the moment.
> 
> Going with the IP checksum seems perfectly adequate to me.
> 

Agreed.

> Bruce made a good point that it is difficult to know how to test the efficacy of this change.  We don't really know of any especially bad cases that defeat the current DRC, though I know the Red Hat Q/A team likes to use NFS/UDP and they seem to run into problems with some frequency.  Having a chat with Rick Macklem and/or Tom Talpey on this topic might have some value, and likewise perhaps the bug's OP might have some ideas about what test cases they prefer.
> 

It is difficult to know since XID collisions are so rare (or at least
we think so). When they do happen they're often not noticeable, but we
do have some documented cases of it happening in the field...

RH's QA team does test with NFS/UDP, but I'm not aware of any problems
with XID collisions per se. The problems they've had are that the DRC
sometimes just plain doesn't work. Based on some testing we've done
internally, I think the problem is likely due to the fact that it's
trivial to blow out the current fixed-size DRC cache. Until I backport
this and get them to deploy it onto our test NFS servers though, it's
going to be hard to know for certain.

I decided to add in the extra checksumming for 3 main reasons:

1) I was in here anyway, and it didn't seem too tough to do

2) the potential for xid collisions will be greater with a larger DRC

3) we've had people request it in the past

Given that it seems fairly cheap to add this, we might as well do it.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 13:27       ` Jeff Layton
  2013-02-08 15:42         ` Chuck Lever
@ 2013-02-08 20:55         ` J. Bruce Fields
  2013-02-08 20:59           ` Chuck Lever
  2013-02-09 11:36           ` Jeff Layton
  1 sibling, 2 replies; 20+ messages in thread
From: J. Bruce Fields @ 2013-02-08 20:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Chuck Lever, linux-nfs, rees

On Fri, Feb 08, 2013 at 08:27:06AM -0500, Jeff Layton wrote:
> On Thu, 7 Feb 2013 13:03:16 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Thu, 7 Feb 2013 10:51:02 -0500
> > Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > > 
> > > On Feb 7, 2013, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > > Now that we're allowing more DRC entries, it becomes a lot easier to hit
> > > > problems with XID collisions. In order to mitigate those, calculate the
> > > > crc32 of up to the first 256 bytes of each request coming in and store
> > > > that in the cache entry, along with the total length of the request.
> > > 
> > > I'm happy to see a checksummed DRC finally become reality for the Linux NFS server.
> > > 
> > > Have you measured the CPU utilization impact and CPU cache footprint of performing a CRC computation for every incoming RPC?  I'm wondering if a simpler checksum might be just as useful but less costly to compute.
> > > 
> > 
> > No, I haven't, at least not in any sort of rigorous way. It's pretty
> > negligible on "normal" PC hardware, but I think most intel and amd cpus
> > have instructions for handling crc32. I'm ok with a different checksum,
> > we don't need anything cryptographically secure here. I simply chose
> > crc32 since it has an easily available API, and I figured it would be
> > fairly lightweight.
> > 
> 
> After an abortive attempt to measure this with ftrace, I ended up
> hacking together a patch to just measure the latency of the
> nfsd_cache_csum/_crc functions to get some rough numbers. On my x86_64
> KVM guest, the avg time to calculate the crc32 is ~1750ns. Using IP
> checksums cuts that roughly in half to ~800ns. I'm not sure how best to
> measure the cache footprint however.
> 
> Neither seems terribly significant, especially given the other
> inefficiencies in this code. OTOH, I guess those latencies can add up,
> and I don't see any need to use crc32 over the net/checksum.h routines.
> We probably ought to go with my RFC patch from yesterday.

OK, I hadn't committed the original yet, so I've just rolled them
together and added a little of the above to the changelog.  Look OK?
Chuck, should I add a Reviewed-by: ?

--b.

commit a937bd422ccc4306cdc81b5aa60b12a7212b70d3
Author: Jeff Layton <jlayton@redhat.com>
Date:   Mon Feb 4 11:57:27 2013 -0500

    nfsd: keep a checksum of the first 256 bytes of request
    
    Now that we're allowing more DRC entries, it becomes a lot easier to hit
    problems with XID collisions. In order to mitigate those, calculate a
    checksum of up to the first 256 bytes of each request coming in and store
    that in the cache entry, along with the total length of the request.
    
    This initially used crc32, but Chuck Lever and Jim Rees pointed out that
    crc32 is probably more heavyweight than we really need for generating
    these checksums, and recommended looking at using the same routines that
    are used to generate checksums for IP packets.
    
    On an x86_64 KVM guest measurements with ftrace showed ~800ns to use
    csum_partial vs ~1750ns for crc32.  The difference probably isn't
    terribly significant, but for now we may as well use csum_partial.
    
    Signed-off-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfsd/cache.h b/fs/nfsd/cache.h
index 9c7232b..87fd141 100644
--- a/fs/nfsd/cache.h
+++ b/fs/nfsd/cache.h
@@ -29,6 +29,8 @@ struct svc_cacherep {
 	u32			c_prot;
 	u32			c_proc;
 	u32			c_vers;
+	unsigned int		c_len;
+	__wsum			c_csum;
 	unsigned long		c_timestamp;
 	union {
 		struct kvec	u_vec;
@@ -73,6 +75,9 @@ enum {
 /* Cache entries expire after this time period */
 #define RC_EXPIRE		(120 * HZ)
 
+/* Checksum this amount of the request */
+#define RC_CSUMLEN		(256U)
+
 int	nfsd_reply_cache_init(void);
 void	nfsd_reply_cache_shutdown(void);
 int	nfsd_cache_lookup(struct svc_rqst *);
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index f754469..40db57e 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/highmem.h>
+#include <net/checksum.h>
 
 #include "nfsd.h"
 #include "cache.h"
@@ -130,6 +131,7 @@ int nfsd_reply_cache_init(void)
 	INIT_LIST_HEAD(&lru_head);
 	max_drc_entries = nfsd_cache_size_limit();
 	num_drc_entries = 0;
+
 	return 0;
 out_nomem:
 	printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
@@ -238,12 +240,45 @@ nfsd_reply_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
 }
 
 /*
+ * Walk an xdr_buf and get a checksum of at most the first RC_CSUMLEN bytes
+ */
+static __wsum
+nfsd_cache_csum(struct svc_rqst *rqstp)
+{
+	int idx;
+	unsigned int base;
+	__wsum csum;
+	struct xdr_buf *buf = &rqstp->rq_arg;
+	const unsigned char *p = buf->head[0].iov_base;
+	size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
+				RC_CSUMLEN);
+	size_t len = min(buf->head[0].iov_len, csum_len);
+
+	/* rq_arg.head first */
+	csum = csum_partial(p, len, 0);
+	csum_len -= len;
+
+	/* Continue into page array */
+	idx = buf->page_base / PAGE_SIZE;
+	base = buf->page_base & ~PAGE_MASK;
+	while (csum_len) {
+		p = page_address(buf->pages[idx]) + base;
+		len = min(PAGE_SIZE - base, csum_len);
+		csum = csum_partial(p, len, csum);
+		csum_len -= len;
+		base = 0;
+		++idx;
+	}
+	return csum;
+}
+
+/*
  * Search the request hash for an entry that matches the given rqstp.
  * Must be called with cache_lock held. Returns the found entry or
  * NULL on failure.
  */
 static struct svc_cacherep *
-nfsd_cache_search(struct svc_rqst *rqstp)
+nfsd_cache_search(struct svc_rqst *rqstp, __wsum csum)
 {
 	struct svc_cacherep	*rp;
 	struct hlist_node	*hn;
@@ -257,6 +292,7 @@ nfsd_cache_search(struct svc_rqst *rqstp)
 	hlist_for_each_entry(rp, hn, rh, c_hash) {
 		if (xid == rp->c_xid && proc == rp->c_proc &&
 		    proto == rp->c_prot && vers == rp->c_vers &&
+		    rqstp->rq_arg.len == rp->c_len && csum == rp->c_csum &&
 		    rpc_cmp_addr(svc_addr(rqstp), (struct sockaddr *)&rp->c_addr) &&
 		    rpc_get_port(svc_addr(rqstp)) == rpc_get_port((struct sockaddr *)&rp->c_addr))
 			return rp;
@@ -277,6 +313,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 	u32			proto =  rqstp->rq_prot,
 				vers = rqstp->rq_vers,
 				proc = rqstp->rq_proc;
+	__wsum			csum;
 	unsigned long		age;
 	int type = rqstp->rq_cachetype;
 	int rtn;
@@ -287,10 +324,12 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 		return RC_DOIT;
 	}
 
+	csum = nfsd_cache_csum(rqstp);
+
 	spin_lock(&cache_lock);
 	rtn = RC_DOIT;
 
-	rp = nfsd_cache_search(rqstp);
+	rp = nfsd_cache_search(rqstp, csum);
 	if (rp)
 		goto found_entry;
 
@@ -318,7 +357,7 @@ nfsd_cache_lookup(struct svc_rqst *rqstp)
 	 * Must search again just in case someone inserted one
 	 * after we dropped the lock above.
 	 */
-	found = nfsd_cache_search(rqstp);
+	found = nfsd_cache_search(rqstp, csum);
 	if (found) {
 		nfsd_reply_cache_free_locked(rp);
 		rp = found;
@@ -344,6 +383,8 @@ setup_entry:
 	rpc_set_port((struct sockaddr *)&rp->c_addr, rpc_get_port(svc_addr(rqstp)));
 	rp->c_prot = proto;
 	rp->c_vers = vers;
+	rp->c_len = rqstp->rq_arg.len;
+	rp->c_csum = csum;
 
 	hash_refile(rp);
 	lru_put_end(rp);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 20:55         ` J. Bruce Fields
@ 2013-02-08 20:59           ` Chuck Lever
  2013-02-08 21:02             ` J. Bruce Fields
  2013-02-09 11:36           ` Jeff Layton
  1 sibling, 1 reply; 20+ messages in thread
From: Chuck Lever @ 2013-02-08 20:59 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs, rees


On Feb 8, 2013, at 3:55 PM, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> [...]
> 
> OK, I hadn't committed the original yet, so I've just rolled them
> together and added a little of the above to the changelog.  Look OK?
> Chuck, should I add a Reviewed-by: ?

Not sure my participation counts as review.  How about:

  Stones-thrown-by: Chuck Lever <chuck.lever@oracle.com>

> [...]

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 20:59           ` Chuck Lever
@ 2013-02-08 21:02             ` J. Bruce Fields
  0 siblings, 0 replies; 20+ messages in thread
From: J. Bruce Fields @ 2013-02-08 21:02 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, rees

On Fri, Feb 08, 2013 at 03:59:53PM -0500, Chuck Lever wrote:
> 
> On Feb 8, 2013, at 3:55 PM, "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > [...]
> > 
> > OK, I hadn't committed the original yet, so I've just rolled them
> > together and added a little of the above to the changelog.  Look OK?
> > Chuck, should I add a Reviewed-by: ?
> 
> Not sure my participation counts as review.  How about:
> 
>   Stones-thrown-by: Chuck Lever <chuck.lever@oracle.com>

As you wish.--b.

> [...]


* Re: [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request
  2013-02-08 20:55         ` J. Bruce Fields
  2013-02-08 20:59           ` Chuck Lever
@ 2013-02-09 11:36           ` Jeff Layton
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-02-09 11:36 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Chuck Lever, linux-nfs, rees

On Fri, 8 Feb 2013 15:55:55 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> [...]
> 
> OK, I hadn't committed the original yet, so I've just rolled them
> together and added a little of the above to the changelog.  Look OK?
> Chuck, should I add a Reviewed-by: ?
> 
> --b.
> 
> commit a937bd422ccc4306cdc81b5aa60b12a7212b70d3
> Author: Jeff Layton <jlayton@redhat.com>
> Date:   Mon Feb 4 11:57:27 2013 -0500
> 
>     nfsd: keep a checksum of the first 256 bytes of request
>     
>     Now that we're allowing more DRC entries, it becomes a lot easier to hit
>     problems with XID collisions. In order to mitigate those, calculate a
>     checksum of up to the first 256 bytes of each request coming in and store
>     that in the cache entry, along with the total length of the request.
>     
>     This initially used crc32, but Chuck Lever and Jim Rees pointed out that
>     crc32 is probably more heavyweight than we really need for generating
>     these checksums, and recommended looking at using the same routines that
>     are used to generate checksums for IP packets.
>     
>     On an x86_64 KVM guest measurements with ftrace showed ~800ns to use
>     csum_partial vs ~1750ns for crc32.  The difference probably isn't
>     terribly significant, but for now we may as well use csum_partial.
>     
>     Signed-off-by: Jeff Layton <jlayton@redhat.com>
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 


Thanks Bruce. Looks good to me.

> [...]

-- 
Jeff Layton <jlayton@redhat.com>


end of thread, other threads:[~2013-02-09 11:37 UTC | newest]

Thread overview: 20+ messages
2013-02-07 14:51 [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC Jeff Layton
2013-02-07 14:51 ` [PATCH v3 1/2] sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer Jeff Layton
2013-02-07 14:51 ` [PATCH v3 2/2] nfsd: keep a checksum of the first 256 bytes of request Jeff Layton
2013-02-07 15:51   ` Chuck Lever
2013-02-07 16:00     ` J. Bruce Fields
2013-02-07 16:23       ` Chuck Lever
2013-02-07 16:37         ` J. Bruce Fields
2013-02-07 16:41         ` Jim Rees
2013-02-07 16:32       ` Myklebust, Trond
2013-02-07 18:35         ` Jeff Layton
2013-02-08 15:41         ` Jeff Layton
2013-02-07 18:03     ` Jeff Layton
2013-02-08 13:27       ` Jeff Layton
2013-02-08 15:42         ` Chuck Lever
2013-02-08 15:57           ` Jeff Layton
2013-02-08 20:55         ` J. Bruce Fields
2013-02-08 20:59           ` Chuck Lever
2013-02-08 21:02             ` J. Bruce Fields
2013-02-09 11:36           ` Jeff Layton
2013-02-07 15:11 ` [PATCH v3 0/2] nfsd: checksum first 256 bytes of request to guard against XID collisions in the DRC J. Bruce Fields
