* [PATCH 0/9] Cache deferal improvements - try++
@ 2010-02-03  6:31 NeilBrown
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Thanks to Bruce challenging me to justify the complexity of my
previous version of this, I have managed to simplify it significantly.

I have changed the rules for sunrpc_caches so that items that have
expired get removed at the earliest opportunity even if they are still
referenced, and in particular so that sunrpc_cache_lookup never
returns an expired item.
This means that cache_check doesn't need to check for "expired" any
more and so only initiates an upcall for items that are not VALID.

This means that when the upcall is responded to, it will always be
exactly that item that is updated - never a different item with the
same key.  So there is no longer any need to repeat the lookup.
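
For illustration, here is a rough user-space model of that rule (a sketch
only -- it is not the kernel code, and the real changes are in the patches
below): lookup discards an expired match and hands back a fresh,
not-yet-valid entry instead, so the check step only needs to ask "is it
valid?" before deciding whether to upcall.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct entry {
	struct entry *next;
	char key[32];
	int valid;
	time_t expiry_time;
};

static struct entry *chain;	/* one toy hash chain */

static struct entry *toy_lookup(const char *key)
{
	struct entry **pp, *e;

	for (pp = &chain; (e = *pp) != NULL; pp = &e->next) {
		if (strcmp(e->key, key) != 0)
			continue;
		if (e->expiry_time >= time(NULL))
			return e;	/* usable, or waiting for an update */
		*pp = e->next;		/* expired: unhash and discard */
		free(e);
		break;
	}
	/* No usable match: return a fresh entry that is not yet valid. */
	e = calloc(1, sizeof(*e));
	if (!e)
		abort();
	snprintf(e->key, sizeof(e->key), "%s", key);
	e->expiry_time = time(NULL) + 30;
	e->next = chain;
	chain = e;
	return e;
}

static int toy_check(struct entry *e)
{
	if (!e->valid) {
		printf("upcall needed for %s\n", e->key);
		return -1;	/* cache_check would start an upcall here */
	}
	return 0;
}

int main(void)
{
	struct entry *e = toy_lookup("client1");

	toy_check(e);		/* not valid yet => upcall */
	e->valid = 1;		/* the upcall reply updates this same entry */
	toy_check(toy_lookup("client1"));  /* same entry, now valid, no upcall */
	return 0;
}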

The last 3 patches in this series are simply "cleanups" that I
happened across while mucking about in the code.  They should have zero
change in functionality, and if you don't think they are cleanups
(the second-last is now questionable), feel free to ignore them.

I have tested this to ensure that it doesn't completely break things,
and to ensure that it fixes the problem(*) but I haven't hammered on
it very hard.

(*)
The problem is exhibited by sending a stream of writes to the NFS
server and then occasionally flushing the export cache (exportfs -f).
The problem manifests as a write not getting a reply and the client
having to retransmit.
It counts as 'fixed' if there are no retransmit delays.
The following program generates the required writes and reports the delays.

Without the patch I get delays of 60 seconds with TCP and 5 seconds
with UDP.

NeilBrown

/*
 * write to NFS server and report delays exceeding 1 second.
 */

#define _GNU_SOURCE

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>	/* posix_memalign */
#include <string.h>	/* memset */
#include <unistd.h>	/* write */
#include <sys/fcntl.h>

int main(int argc, char *argv[])
{
	int usec;
	//int fd = open(argv[1], O_WRONLY|O_DIRECT|O_CREAT, 0666);
	//int fd = open(argv[1], O_WRONLY|O_SYNC|O_CREAT, 0666);
	int fd = open(argv[1], O_WRONLY|O_CREAT, 0666);
	char *buf;
	struct timeval tv1, tv2;

	posix_memalign((void **)&buf, 4096, 409600);
	memset(buf, 0x5a, 409600);

	while(1) {
		gettimeofday(&tv1, NULL);
		write(fd, buf, 409600);
		gettimeofday(&tv2, NULL);
		usec = (tv2.tv_sec*1000000 + tv2.tv_usec) -
			(tv1.tv_sec*1000000 + tv1.tv_usec);
		if (usec > 1000000)
			printf(" %d\n", usec/1000000);
		else
			printf(".");
		fflush(stdout);
	}
}
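
For reference, the server-side flushing can be as simple as the loop
below (just a sketch; any way of running "exportfs -f" periodically on
the server while the writer above runs on the client will do):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* flush the export caches once a second */
	while (1) {
		system("exportfs -f");
		sleep(1);
	}
	return 0;
}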


---

NeilBrown (9):
      sunrpc: don't keep expired entries in the auth caches.
      sunrpc/cache: factor out cache_is_expired
      sunrpc: never return expired entries in sunrpc_cache_lookup
      sunrpc/cache: allow threads to block while waiting for cache update.
      nfsd/idmap: drop special request deferal in favour of improved default.
      sunrpc: close connection when a request is irretrievably lost.
      nfsd: factor out hash functions for export caches.
      svcauth_gss: replace a trivial 'switch' with an 'if'
      sunrpc/cache: change deferred-request hash table to use hlist.


 fs/nfsd/export.c                  |   40 ++++++++------
 fs/nfsd/nfs4idmap.c               |  105 ++++--------------------------------
 include/linux/sunrpc/cache.h      |    5 +-
 include/linux/sunrpc/svcauth.h    |   10 ++-
 net/sunrpc/auth_gss/svcauth_gss.c |   51 ++++++++---------
 net/sunrpc/cache.c                |  109 ++++++++++++++++++++++++++-----------
 net/sunrpc/svc.c                  |    3 +
 net/sunrpc/svc_xprt.c             |   11 ++++
 net/sunrpc/svcauth_unix.c         |   11 +++-
 9 files changed, 169 insertions(+), 176 deletions(-)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/9] sunrpc: don't keep expired entries in the auth caches.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 7/9] nfsd: factor out hash functions for export caches NeilBrown
  2010-02-03  6:31   ` [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
       [not found]     ` <20100203063130.12945.29226.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 9/9] sunrpc/cache: change deferred-request hash table to use hlist NeilBrown
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Currently, expired entries remain in the auth caches as long
as there is a reference.
This was needed long ago when the auth_domain cache used the same
cache infrastructure.  But since that (being a very different sort
of cache) was separated, this test is no longer needed.

So remove the test on refcnt and tidy up the surrounding code.

This allows the cache_dequeue call (which needed to be there to
drop a potentially awkward reference) to be moved outside of the
spinlock, which is a better place for it.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/cache.c |   13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 39bddba..83592e0 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -397,31 +397,28 @@ static int cache_clean(void)
 		/* Ok, now to clean this strand */
 
 		cp = & current_detail->hash_table[current_index];
-		ch = *cp;
-		for (; ch; cp= & ch->next, ch= *cp) {
+		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
 			if (current_detail->nextcheck > ch->expiry_time)
 				current_detail->nextcheck = ch->expiry_time+1;
 			if (ch->expiry_time >= get_seconds() &&
 			    ch->last_refresh >= current_detail->flush_time)
 				continue;
-			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
-				cache_dequeue(current_detail, ch);
 
-			if (atomic_read(&ch->ref.refcount) == 1)
-				break;
-		}
-		if (ch) {
 			*cp = ch->next;
 			ch->next = NULL;
 			current_detail->entries--;
 			rv = 1;
+			break;
 		}
+
 		write_unlock(&current_detail->hash_lock);
 		d = current_detail;
 		if (!ch)
 			current_index ++;
 		spin_unlock(&cache_list_lock);
 		if (ch) {
+			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
+				cache_dequeue(current_detail, ch);
 			cache_revisit_request(ch);
 			cache_put(ch, d);
 		}



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/9] sunrpc/cache: factor out cache_is_expired
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (6 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
       [not found]     ` <20100203063131.12945.65321.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update NeilBrown
  2010-02-03 19:43   ` [PATCH 0/9] Cache deferal improvements - try++ J. Bruce Fields
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

This removes a tiny bit of code duplication, but more importantly
prepares for the following patch, which will perform the expiry check in
cache_lookup and the rest of the validity check in cache_check.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/cache.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 83592e0..9826c5c 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -49,6 +49,12 @@ static void cache_init(struct cache_head *h)
 	h->last_refresh = now;
 }
 
+static inline int cache_is_expired(struct cache_detail *detail, struct cache_head *h)
+{
+	return  (h->expiry_time < get_seconds()) ||
+		(detail->flush_time > h->last_refresh);
+}
+
 struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 				       struct cache_head *key, int hash)
 {
@@ -184,9 +190,7 @@ static int cache_make_upcall(struct cache_detail *cd, struct cache_head *h)
 static inline int cache_is_valid(struct cache_detail *detail, struct cache_head *h)
 {
 	if (!test_bit(CACHE_VALID, &h->flags) ||
-	    h->expiry_time < get_seconds())
-		return -EAGAIN;
-	else if (detail->flush_time > h->last_refresh)
+	    cache_is_expired(detail, h))
 		return -EAGAIN;
 	else {
 		/* entry is valid */
@@ -400,8 +404,7 @@ static int cache_clean(void)
 		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
 			if (current_detail->nextcheck > ch->expiry_time)
 				current_detail->nextcheck = ch->expiry_time+1;
-			if (ch->expiry_time >= get_seconds() &&
-			    ch->last_refresh >= current_detail->flush_time)
+			if (!cache_is_expired(current_detail, ch))
 				continue;
 
 			*cp = ch->next;



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (5 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 5/9] nfsd/idmap: drop special request deferal in favour of improved default NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
       [not found]     ` <20100203063131.12945.97779.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 2/9] sunrpc/cache: factor out cache_is_expired NeilBrown
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

If sunrpc_cache_lookup finds an expired entry, remove it from
the cache and return a freshly created non-VALID entry instead.
This ensures that we only ever get a usable entry, or an
entry that will become usable once an update arrives.
i.e. we will never need to repeat the lookup.

This allows us to remove the 'is_expired' test from cache_check
(i.e. from cache_is_valid).  cache_check should never get an expired
entry as 'lookup' will never return one.  If it does happen - due to
inconvenient timing - then just accept it as still valid; it won't be
very much past its use-by date.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/cache.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 9826c5c..3e1ef8b 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -59,7 +59,7 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 				       struct cache_head *key, int hash)
 {
 	struct cache_head **head,  **hp;
-	struct cache_head *new = NULL;
+	struct cache_head *new = NULL, *freeme = NULL;
 
 	head = &detail->hash_table[hash];
 
@@ -68,6 +68,9 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
 		struct cache_head *tmp = *hp;
 		if (detail->match(tmp, key)) {
+			if (cache_is_expired(detail, tmp))
+				/* This entry is expired, we will discard it. */
+				break;
 			cache_get(tmp);
 			read_unlock(&detail->hash_lock);
 			return tmp;
@@ -92,6 +95,13 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
 		struct cache_head *tmp = *hp;
 		if (detail->match(tmp, key)) {
+			if (cache_is_expired(detail, tmp)) {
+				*hp = tmp->next;
+				tmp->next = NULL;
+				detail->entries --;
+				freeme = tmp;
+				break;
+			}
 			cache_get(tmp);
 			write_unlock(&detail->hash_lock);
 			cache_put(new, detail);
@@ -104,6 +114,8 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 	cache_get(new);
 	write_unlock(&detail->hash_lock);
 
+	if (freeme)
+		cache_put(freeme, detail);
 	return new;
 }
 EXPORT_SYMBOL_GPL(sunrpc_cache_lookup);
@@ -189,8 +201,7 @@ static int cache_make_upcall(struct cache_detail *cd, struct cache_head *h)
 
 static inline int cache_is_valid(struct cache_detail *detail, struct cache_head *h)
 {
-	if (!test_bit(CACHE_VALID, &h->flags) ||
-	    cache_is_expired(detail, h))
+	if (!test_bit(CACHE_VALID, &h->flags))
 		return -EAGAIN;
 	else {
 		/* entry is valid */



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (7 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 2/9] sunrpc/cache: factor out cache_is_expired NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
  2010-04-15 15:55     ` J. Bruce Fields
  2010-02-03 19:43   ` [PATCH 0/9] Cache deferal improvements - try++ J. Bruce Fields
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

The current practice of waiting for cache updates by queueing the
whole request to be retried has (at least) two problems.

1/ With NFSv4, requests can be quite complex and re-trying a whole
  request when a later part fails should only be a last resort, not a
  normal practice.

2/ Large requests, and in particular any 'write' request, will not be
  queued by the current code and doing so would be undesirable.

In many cases only a very short wait is needed before the cache gets
valid data.

So, providing the underlying transport permits it by setting
 ->thread_wait,
arrange to wait briefly for an upcall to be completed (as reflected in
the clearing of CACHE_PENDING).
If the short wait was not long enough and CACHE_PENDING is still set,
fall back on the old approach.

The 'thread_wait' value is set to 5 seconds when there are spare
threads, and 1 second when there are no spare threads.

These values are probably much higher than needed, but will ensure
some forward progress.

Note that as we (now) only request an update for a non-valid item,
and as non-valid items are updated in place, it is extremely unlikely
that cache_check will return -ETIMEDOUT.  Normally cache_defer_req
will sleep for a short while and then find that the item is valid.
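
As a user-space analogy of the scheme (not the sunrpc code -- the names
below are invented for the sketch): wait up to 'thread_wait' for the
pending upcall to be answered, and if it is still pending when the
timeout expires, fall back to the old behaviour of queueing the request
for replay.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static int pending = 1;			/* stands in for CACHE_PENDING */

static void *upcall_reply(void *arg)
{
	(void)arg;
	sleep(1);			/* userspace takes a moment to answer */
	pthread_mutex_lock(&lock);
	pending = 0;			/* like clearing CACHE_PENDING */
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static int handle_cache_miss(int thread_wait_secs)
{
	struct timespec deadline;
	int rc = 0, still_pending;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += thread_wait_secs;

	pthread_mutex_lock(&lock);
	while (pending && rc == 0)
		rc = pthread_cond_timedwait(&done, &lock, &deadline);
	still_pending = pending;
	pthread_mutex_unlock(&lock);

	if (still_pending) {
		printf("still pending: defer the whole request as before\n");
		return -1;
	}
	printf("upcall answered in time: continue in this thread\n");
	return 0;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, upcall_reply, NULL);
	handle_cache_miss(5);		/* 5s with spare threads, 1s without */
	pthread_join(t, NULL);
	return 0;
}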


Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/cache.h |    3 +++
 net/sunrpc/cache.c           |   44 +++++++++++++++++++++++++++++++++++++++++-
 net/sunrpc/svc_xprt.c        |   11 +++++++++++
 3 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 6f52b4d..ef3db11 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -125,6 +125,9 @@ struct cache_detail {
  */
 struct cache_req {
 	struct cache_deferred_req *(*defer)(struct cache_req *req);
+	int thread_wait;  /* How long (jiffies) we can block the
+			   * current thread to wait for updates.
+			   */
 };
 /* this must be embedded in a deferred_request that is being
  * delayed awaiting cache-fill
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 3e1ef8b..b23b2e6 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -508,10 +508,22 @@ static LIST_HEAD(cache_defer_list);
 static struct list_head cache_defer_hash[DFR_HASHSIZE];
 static int cache_defer_cnt;
 
+struct thread_deferred_req {
+	struct cache_deferred_req handle;
+	wait_queue_head_t wait;
+};
+static void cache_restart_thread(struct cache_deferred_req *dreq, int too_many)
+{
+	struct thread_deferred_req *dr =
+		container_of(dreq, struct thread_deferred_req, handle);
+	wake_up(&dr->wait);
+}
+
 static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 {
 	struct cache_deferred_req *dreq, *discard;
 	int hash = DFR_HASH(item);
+	struct thread_deferred_req sleeper;
 
 	if (cache_defer_cnt >= DFR_MAX) {
 		/* too much in the cache, randomly drop this one,
@@ -520,7 +532,14 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 		if (net_random()&1)
 			return -ENOMEM;
 	}
-	dreq = req->defer(req);
+	if (req->thread_wait) {
+		dreq = &sleeper.handle;
+		init_waitqueue_head(&sleeper.wait);
+		dreq->revisit = cache_restart_thread;
+	} else
+		dreq = req->defer(req);
+
+ retry:
 	if (dreq == NULL)
 		return -ENOMEM;
 
@@ -554,6 +573,29 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 		cache_revisit_request(item);
 		return -EAGAIN;
 	}
+
+	if (dreq == &sleeper.handle) {
+		wait_event_interruptible_timeout(
+			sleeper.wait,
+			!test_bit(CACHE_PENDING, &item->flags)
+			|| list_empty(&sleeper.handle.hash),
+			req->thread_wait);
+		spin_lock(&cache_defer_lock);
+		if (!list_empty(&sleeper.handle.hash)) {
+			list_del_init(&sleeper.handle.recent);
+			list_del_init(&sleeper.handle.hash);
+			cache_defer_cnt--;
+		}
+		spin_unlock(&cache_defer_lock);
+		if (test_bit(CACHE_PENDING, &item->flags)) {
+			/* item is still pending, try request
+			 * deferral
+			 */
+			dreq = req->defer(req);
+			goto retry;
+		}
+		return -EEXIST;
+	}
 	return 0;
 }
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7d1f9e9..ae71d51 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -643,6 +643,11 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (signalled() || kthread_should_stop())
 		return -EINTR;
 
+	/* Normally we will wait up to 5 seconds for any required
+	 * cache information to be provided.
+	 */
+	rqstp->rq_chandle.thread_wait = 5*HZ;
+
 	spin_lock_bh(&pool->sp_lock);
 	xprt = svc_xprt_dequeue(pool);
 	if (xprt) {
@@ -650,6 +655,12 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 		svc_xprt_get(xprt);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+
+		/* As there is a shortage of threads and this request
+		 * had to be queue, don't allow the thread to wait so
+		 * long for cache updates.
+		 */
+		rqstp->rq_chandle.thread_wait = 1*HZ;
 	} else {
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/9] nfsd/idmap: drop special request deferal in favour of improved default.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (4 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 8/9] svcauth_gss: replace a trivial 'switch' with an 'if' NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
  2010-02-03  6:31   ` [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup NeilBrown
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

The idmap code manages request deferral by waiting for a reply from
userspace rather than putting the NFS request on a queue to be retried
from the start.
Now that the common deferral code does this, there is no need for the
special code in idmap.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfs4idmap.c |  105 +++++----------------------------------------------
 1 files changed, 11 insertions(+), 94 deletions(-)

diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c
index 6e2983b..f3de309 100644
--- a/fs/nfsd/nfs4idmap.c
+++ b/fs/nfsd/nfs4idmap.c
@@ -481,109 +481,26 @@ nfsd_idmap_shutdown(void)
 	cache_unregister(&nametoid_cache);
 }
 
-/*
- * Deferred request handling
- */
-
-struct idmap_defer_req {
-       struct cache_req		req;
-       struct cache_deferred_req deferred_req;
-       wait_queue_head_t	waitq;
-       atomic_t			count;
-};
-
-static inline void
-put_mdr(struct idmap_defer_req *mdr)
-{
-	if (atomic_dec_and_test(&mdr->count))
-		kfree(mdr);
-}
-
-static inline void
-get_mdr(struct idmap_defer_req *mdr)
-{
-	atomic_inc(&mdr->count);
-}
-
-static void
-idmap_revisit(struct cache_deferred_req *dreq, int toomany)
-{
-	struct idmap_defer_req *mdr =
-		container_of(dreq, struct idmap_defer_req, deferred_req);
-
-	wake_up(&mdr->waitq);
-	put_mdr(mdr);
-}
-
-static struct cache_deferred_req *
-idmap_defer(struct cache_req *req)
-{
-	struct idmap_defer_req *mdr =
-		container_of(req, struct idmap_defer_req, req);
-
-	mdr->deferred_req.revisit = idmap_revisit;
-	get_mdr(mdr);
-	return (&mdr->deferred_req);
-}
-
-static inline int
-do_idmap_lookup(struct ent *(*lookup_fn)(struct ent *), struct ent *key,
-		struct cache_detail *detail, struct ent **item,
-		struct idmap_defer_req *mdr)
-{
-	*item = lookup_fn(key);
-	if (!*item)
-		return -ENOMEM;
-	return cache_check(detail, &(*item)->h, &mdr->req);
-}
-
-static inline int
-do_idmap_lookup_nowait(struct ent *(*lookup_fn)(struct ent *),
-			struct ent *key, struct cache_detail *detail,
-			struct ent **item)
-{
-	int ret = -ENOMEM;
-
-	*item = lookup_fn(key);
-	if (!*item)
-		goto out_err;
-	ret = -ETIMEDOUT;
-	if (!test_bit(CACHE_VALID, &(*item)->h.flags)
-			|| (*item)->h.expiry_time < get_seconds()
-			|| detail->flush_time > (*item)->h.last_refresh)
-		goto out_put;
-	ret = -ENOENT;
-	if (test_bit(CACHE_NEGATIVE, &(*item)->h.flags))
-		goto out_put;
-	return 0;
-out_put:
-	cache_put(&(*item)->h, detail);
-out_err:
-	*item = NULL;
-	return ret;
-}
-
 static int
 idmap_lookup(struct svc_rqst *rqstp,
 		struct ent *(*lookup_fn)(struct ent *), struct ent *key,
 		struct cache_detail *detail, struct ent **item)
 {
-	struct idmap_defer_req *mdr;
 	int ret;
 
-	mdr = kzalloc(sizeof(*mdr), GFP_KERNEL);
-	if (!mdr)
+	*item = lookup_fn(key);
+	if (!*item)
 		return -ENOMEM;
-	atomic_set(&mdr->count, 1);
-	init_waitqueue_head(&mdr->waitq);
-	mdr->req.defer = idmap_defer;
-	ret = do_idmap_lookup(lookup_fn, key, detail, item, mdr);
-	if (ret == -EAGAIN) {
-		wait_event_interruptible_timeout(mdr->waitq,
-			test_bit(CACHE_VALID, &(*item)->h.flags), 1 * HZ);
-		ret = do_idmap_lookup_nowait(lookup_fn, key, detail, item);
+ retry:
+	ret = cache_check(detail, &(*item)->h, &rqstp->rq_chandle);
+
+	if (ret == -ETIMEDOUT) {
+		struct ent *prev_item = *item;
+		*item = lookup_fn(key);
+		if (*item != prev_item)
+			goto retry;
+		cache_put(&(*item)->h, detail);
 	}
-	put_mdr(mdr);
 	return ret;
 }
 



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 7/9] nfsd: factor out hash functions for export caches NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
  2010-02-03 15:43     ` Chuck Lever
  2010-02-03  6:31   ` [PATCH 1/9] sunrpc: don't keep expired entries in the auth caches NeilBrown
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

If we drop a request in the sunrpc layer, either due to kmalloc failure
or due to a cache miss when we could not queue the request for later
replay, then close the connection to encourage the client to retry sooner.

Note that if the drop happens in the NFS layer, NFSERR_JUKEBOX
(aka NFS4ERR_DELAY) is returned to guide the client concerning
replay.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svcauth.h    |   10 +++++++---
 net/sunrpc/auth_gss/svcauth_gss.c |   12 ++++++------
 net/sunrpc/svc.c                  |    3 +++
 net/sunrpc/svcauth_unix.c         |   11 ++++++++---
 4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/sunrpc/svcauth.h b/include/linux/sunrpc/svcauth.h
index d39dbdc..1126693 100644
--- a/include/linux/sunrpc/svcauth.h
+++ b/include/linux/sunrpc/svcauth.h
@@ -108,9 +108,13 @@ struct auth_ops {
 #define	SVC_NEGATIVE	4
 #define	SVC_OK		5
 #define	SVC_DROP	6
-#define	SVC_DENIED	7
-#define	SVC_PENDING	8
-#define	SVC_COMPLETE	9
+#define	SVC_CLOSE	7	/* Like SVC_DROP, but request is definitely
+				 * lost so if there is a tcp connection, it
+				 * should be closed
+				 */
+#define	SVC_DENIED	8
+#define	SVC_PENDING	9
+#define	SVC_COMPLETE	10
 
 
 extern int	svc_authenticate(struct svc_rqst *rqstp, __be32 *authp);
diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index e34bc53..4db9562 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -963,7 +963,7 @@ svcauth_gss_set_client(struct svc_rqst *rqstp)
 	if (rqstp->rq_gssclient == NULL)
 		return SVC_DENIED;
 	stat = svcauth_unix_set_client(rqstp);
-	if (stat == SVC_DROP)
+	if (stat == SVC_DROP || stat == SVC_CLOSE)
 		return stat;
 	return SVC_OK;
 }
@@ -1017,7 +1017,7 @@ static int svcauth_gss_handle_init(struct svc_rqst *rqstp,
 		return SVC_DENIED;
 	memset(&rsikey, 0, sizeof(rsikey));
 	if (dup_netobj(&rsikey.in_handle, &gc->gc_ctx))
-		return SVC_DROP;
+		return SVC_CLOSE;
 	*authp = rpc_autherr_badverf;
 	if (svc_safe_getnetobj(argv, &tmpobj)) {
 		kfree(rsikey.in_handle.data);
@@ -1025,22 +1025,22 @@ static int svcauth_gss_handle_init(struct svc_rqst *rqstp,
 	}
 	if (dup_netobj(&rsikey.in_token, &tmpobj)) {
 		kfree(rsikey.in_handle.data);
-		return SVC_DROP;
+		return SVC_CLOSE;
 	}
 
 	/* Perform upcall, or find upcall result: */
 	rsip = rsi_lookup(&rsikey);
 	rsi_free(&rsikey);
 	if (!rsip)
-		return SVC_DROP;
+		return SVC_CLOSE;
 	switch (cache_check(&rsi_cache, &rsip->h, &rqstp->rq_chandle)) {
 	case -EAGAIN:
 	case -ETIMEDOUT:
 	case -ENOENT:
 		/* No upcall result: */
-		return SVC_DROP;
+		return SVC_CLOSE;
 	case 0:
-		ret = SVC_DROP;
+		ret = SVC_CLOSE;
 		/* Got an answer to the upcall; use it: */
 		if (gss_write_init_verf(rqstp, rsip))
 			goto out;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 538ca43..e750988 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1050,6 +1050,9 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
 		goto err_bad;
 	case SVC_DENIED:
 		goto err_bad_auth;
+	case SVC_CLOSE:
+		if (test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
+			svc_close_xprt(rqstp->rq_xprt);
 	case SVC_DROP:
 		goto dropit;
 	case SVC_COMPLETE:
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index d8c0411..a25f8ba 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -668,6 +668,8 @@ static struct group_info *unix_gid_find(uid_t uid, struct svc_rqst *rqstp)
 	switch (ret) {
 	case -ENOENT:
 		return ERR_PTR(-ENOENT);
+	case -ETIMEDOUT:
+		return ERR_PTR(-ESHUTDOWN);
 	case 0:
 		gi = get_group_info(ug->gi);
 		cache_put(&ug->h, &unix_gid_cache);
@@ -714,8 +716,9 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
 	switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
 		default:
 			BUG();
-		case -EAGAIN:
 		case -ETIMEDOUT:
+			return SVC_CLOSE;
+		case -EAGAIN:
 			return SVC_DROP;
 		case -ENOENT:
 			return SVC_DENIED;
@@ -730,6 +733,8 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
 	switch (PTR_ERR(gi)) {
 	case -EAGAIN:
 		return SVC_DROP;
+	case -ESHUTDOWN:
+		return SVC_CLOSE;
 	case -ENOENT:
 		break;
 	default:
@@ -770,7 +775,7 @@ svcauth_null_accept(struct svc_rqst *rqstp, __be32 *authp)
 	cred->cr_gid = (gid_t) -1;
 	cred->cr_group_info = groups_alloc(0);
 	if (cred->cr_group_info == NULL)
-		return SVC_DROP; /* kmalloc failure - client must retry */
+		return SVC_CLOSE; /* kmalloc failure - client must retry */
 
 	/* Put NULL verifier */
 	svc_putnl(resv, RPC_AUTH_NULL);
@@ -834,7 +839,7 @@ svcauth_unix_accept(struct svc_rqst *rqstp, __be32 *authp)
 		goto badcred;
 	cred->cr_group_info = groups_alloc(slen);
 	if (cred->cr_group_info == NULL)
-		return SVC_DROP;
+		return SVC_CLOSE;
 	for (i = 0; i < slen; i++)
 		GROUP_AT(cred->cr_group_info, i) = svc_getnl(argv);
 	if (svc_getu32(argv) != htonl(RPC_AUTH_NULL) || svc_getu32(argv) != 0) {



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 7/9] nfsd: factor out hash functions for export caches.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-02-03  6:31   ` NeilBrown
       [not found]     ` <20100203063131.12945.38791.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2010-02-03  6:31   ` [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost NeilBrown
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Both the _lookup and the _update functions for these two caches
independently calculate the hash of the key.
So factor out that code for improved reuse.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/export.c |   40 +++++++++++++++++++++++-----------------
 1 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index c487810..cb8766c 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -258,10 +258,9 @@ static struct cache_detail svc_expkey_cache = {
 	.alloc		= expkey_alloc,
 };
 
-static struct svc_expkey *
-svc_expkey_lookup(struct svc_expkey *item)
+static int
+svc_expkey_hash(struct svc_expkey *item)
 {
-	struct cache_head *ch;
 	int hash = item->ek_fsidtype;
 	char * cp = (char*)item->ek_fsid;
 	int len = key_len(item->ek_fsidtype);
@@ -269,6 +268,14 @@ svc_expkey_lookup(struct svc_expkey *item)
 	hash ^= hash_mem(cp, len, EXPKEY_HASHBITS);
 	hash ^= hash_ptr(item->ek_client, EXPKEY_HASHBITS);
 	hash &= EXPKEY_HASHMASK;
+	return hash;
+}
+
+static struct svc_expkey *
+svc_expkey_lookup(struct svc_expkey *item)
+{
+	struct cache_head *ch;
+	int hash = svc_expkey_hash(item);
 
 	ch = sunrpc_cache_lookup(&svc_expkey_cache, &item->h,
 				 hash);
@@ -282,13 +289,7 @@ static struct svc_expkey *
 svc_expkey_update(struct svc_expkey *new, struct svc_expkey *old)
 {
 	struct cache_head *ch;
-	int hash = new->ek_fsidtype;
-	char * cp = (char*)new->ek_fsid;
-	int len = key_len(new->ek_fsidtype);
-
-	hash ^= hash_mem(cp, len, EXPKEY_HASHBITS);
-	hash ^= hash_ptr(new->ek_client, EXPKEY_HASHBITS);
-	hash &= EXPKEY_HASHMASK;
+	int hash = svc_expkey_hash(new);
 
 	ch = sunrpc_cache_update(&svc_expkey_cache, &new->h,
 				 &old->h, hash);
@@ -737,14 +738,22 @@ struct cache_detail svc_export_cache = {
 	.alloc		= svc_export_alloc,
 };
 
-static struct svc_export *
-svc_export_lookup(struct svc_export *exp)
+static int
+svc_export_hash(struct svc_export *exp)
 {
-	struct cache_head *ch;
 	int hash;
+
 	hash = hash_ptr(exp->ex_client, EXPORT_HASHBITS);
 	hash ^= hash_ptr(exp->ex_path.dentry, EXPORT_HASHBITS);
 	hash ^= hash_ptr(exp->ex_path.mnt, EXPORT_HASHBITS);
+	return hash;
+}
+
+static struct svc_export *
+svc_export_lookup(struct svc_export *exp)
+{
+	struct cache_head *ch;
+	int hash = svc_export_hash(exp);
 
 	ch = sunrpc_cache_lookup(&svc_export_cache, &exp->h,
 				 hash);
@@ -758,10 +767,7 @@ static struct svc_export *
 svc_export_update(struct svc_export *new, struct svc_export *old)
 {
 	struct cache_head *ch;
-	int hash;
-	hash = hash_ptr(old->ex_client, EXPORT_HASHBITS);
-	hash ^= hash_ptr(old->ex_path.dentry, EXPORT_HASHBITS);
-	hash ^= hash_ptr(old->ex_path.mnt, EXPORT_HASHBITS);
+	int hash = svc_export_hash(old);
 
 	ch = sunrpc_cache_update(&svc_export_cache, &new->h,
 				 &old->h,



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 8/9] svcauth_gss: replace a trivial 'switch' with an 'if'
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (3 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 9/9] sunrpc/cache: change deferred-request hash table to use hlist NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
  2010-02-03  6:31   ` [PATCH 5/9] nfsd/idmap: drop special request deferal in favour of improved default NeilBrown
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Code like:

  switch(xxx) {
  case -error1:
  case -error2:
     ..
     return;
  case 0:
     stuff;
  }

can more naturally be written:

  if (xxx < 0)
      return;

  stuff;

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/auth_gss/svcauth_gss.c |   41 +++++++++++++++++--------------------
 1 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index 4db9562..2916f69 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -1033,30 +1033,27 @@ static int svcauth_gss_handle_init(struct svc_rqst *rqstp,
 	rsi_free(&rsikey);
 	if (!rsip)
 		return SVC_CLOSE;
-	switch (cache_check(&rsi_cache, &rsip->h, &rqstp->rq_chandle)) {
-	case -EAGAIN:
-	case -ETIMEDOUT:
-	case -ENOENT:
+	if (cache_check(&rsi_cache, &rsip->h, &rqstp->rq_chandle) < 0)
 		/* No upcall result: */
 		return SVC_CLOSE;
-	case 0:
-		ret = SVC_CLOSE;
-		/* Got an answer to the upcall; use it: */
-		if (gss_write_init_verf(rqstp, rsip))
-			goto out;
-		if (resv->iov_len + 4 > PAGE_SIZE)
-			goto out;
-		svc_putnl(resv, RPC_SUCCESS);
-		if (svc_safe_putnetobj(resv, &rsip->out_handle))
-			goto out;
-		if (resv->iov_len + 3 * 4 > PAGE_SIZE)
-			goto out;
-		svc_putnl(resv, rsip->major_status);
-		svc_putnl(resv, rsip->minor_status);
-		svc_putnl(resv, GSS_SEQ_WIN);
-		if (svc_safe_putnetobj(resv, &rsip->out_token))
-			goto out;
-	}
+
+	ret = SVC_CLOSE;
+	/* Got an answer to the upcall; use it: */
+	if (gss_write_init_verf(rqstp, rsip))
+		goto out;
+	if (resv->iov_len + 4 > PAGE_SIZE)
+		goto out;
+	svc_putnl(resv, RPC_SUCCESS);
+	if (svc_safe_putnetobj(resv, &rsip->out_handle))
+		goto out;
+	if (resv->iov_len + 3 * 4 > PAGE_SIZE)
+		goto out;
+	svc_putnl(resv, rsip->major_status);
+	svc_putnl(resv, rsip->minor_status);
+	svc_putnl(resv, GSS_SEQ_WIN);
+	if (svc_safe_putnetobj(resv, &rsip->out_token))
+		goto out;
+
 	ret = SVC_COMPLETE;
 out:
 	cache_put(&rsip->h, &rsi_cache);



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 9/9] sunrpc/cache: change deferred-request hash table to use hlist.
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (2 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 1/9] sunrpc: don't keep expired entries in the auth caches NeilBrown
@ 2010-02-03  6:31   ` NeilBrown
  2010-02-03  6:31   ` [PATCH 8/9] svcauth_gss: replace a trivial 'switch' with an 'if' NeilBrown
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2010-02-03  6:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Being a hash table, hlist is the best option.

There is currently some ugliness where we treat "->next == NULL" as
a special case to avoid having to initialise the whole array.
This change nicely gets rid of that case.
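
(An aside on why that works: an empty hlist_head is just a NULL 'first'
pointer, so a zero-filled static array is already a valid set of empty
chains, while an empty list_head must point back at itself.  The sketch
below uses stand-alone stand-ins with the same shape as the kernel types;
the array names are made up for the example.)

#include <stdio.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };
struct list_head  { struct list_head *next, *prev; };

/* zero-initialised, as static arrays are */
static struct hlist_head defer_hash[64];
static struct list_head  old_defer_hash[64];

static int toy_hlist_empty(const struct hlist_head *h) { return h->first == NULL; }
static int toy_list_empty(const struct list_head *l)   { return l->next == l; }

int main(void)
{
	/* prints 1: an uninitialised hlist bucket already reads as empty */
	printf("hlist bucket empty without init: %d\n",
	       toy_hlist_empty(&defer_hash[0]));
	/* prints 0: a list_head bucket needs INIT_LIST_HEAD (or the old
	 * "->next == NULL" special case) before it reads as empty */
	printf("list bucket empty without init:  %d\n",
	       toy_list_empty(&old_defer_hash[0]));
	return 0;
}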

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/cache.h |    2 +-
 net/sunrpc/cache.c           |   30 ++++++++++++------------------
 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index ef3db11..ba9d72c 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -133,7 +133,7 @@ struct cache_req {
  * delayed awaiting cache-fill
  */
 struct cache_deferred_req {
-	struct list_head	hash;	/* on hash chain */
+	struct hlist_node	hash;	/* on hash chain */
 	struct list_head	recent; /* on fifo */
 	struct cache_head	*item;  /* cache item we wait on */
 	void			*owner; /* we might need to discard all defered requests
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index b23b2e6..421b493 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -505,7 +505,7 @@ EXPORT_SYMBOL_GPL(cache_purge);
 
 static DEFINE_SPINLOCK(cache_defer_lock);
 static LIST_HEAD(cache_defer_list);
-static struct list_head cache_defer_hash[DFR_HASHSIZE];
+static struct hlist_head cache_defer_hash[DFR_HASHSIZE];
 static int cache_defer_cnt;
 
 struct thread_deferred_req {
@@ -549,9 +549,7 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 
 	list_add(&dreq->recent, &cache_defer_list);
 
-	if (cache_defer_hash[hash].next == NULL)
-		INIT_LIST_HEAD(&cache_defer_hash[hash]);
-	list_add(&dreq->hash, &cache_defer_hash[hash]);
+	hlist_add_head(&dreq->hash, &cache_defer_hash[hash]);
 
 	/* it is in, now maybe clean up */
 	discard = NULL;
@@ -559,7 +557,7 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 		discard = list_entry(cache_defer_list.prev,
 				     struct cache_deferred_req, recent);
 		list_del_init(&discard->recent);
-		list_del_init(&discard->hash);
+		hlist_del_init(&discard->hash);
 		cache_defer_cnt--;
 	}
 	spin_unlock(&cache_defer_lock);
@@ -578,12 +576,12 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
 		wait_event_interruptible_timeout(
 			sleeper.wait,
 			!test_bit(CACHE_PENDING, &item->flags)
-			|| list_empty(&sleeper.handle.hash),
+			|| hlist_unhashed(&sleeper.handle.hash),
 			req->thread_wait);
 		spin_lock(&cache_defer_lock);
-		if (!list_empty(&sleeper.handle.hash)) {
+		if (!hlist_unhashed(&sleeper.handle.hash)) {
 			list_del_init(&sleeper.handle.recent);
-			list_del_init(&sleeper.handle.hash);
+			hlist_del_init(&sleeper.handle.hash);
 			cache_defer_cnt--;
 		}
 		spin_unlock(&cache_defer_lock);
@@ -604,24 +602,20 @@ static void cache_revisit_request(struct cache_head *item)
 	struct cache_deferred_req *dreq;
 	struct list_head pending;
 
-	struct list_head *lp;
+	struct hlist_node *lp, *tmp;
 	int hash = DFR_HASH(item);
 
 	INIT_LIST_HEAD(&pending);
 	spin_lock(&cache_defer_lock);
 
-	lp = cache_defer_hash[hash].next;
-	if (lp) {
-		while (lp != &cache_defer_hash[hash]) {
-			dreq = list_entry(lp, struct cache_deferred_req, hash);
-			lp = lp->next;
+	hlist_for_each_entry_safe(dreq, lp, tmp,
+				  &cache_defer_hash[hash], hash)
 			if (dreq->item == item) {
-				list_del_init(&dreq->hash);
+				hlist_del_init(&dreq->hash);
 				list_move(&dreq->recent, &pending);
 				cache_defer_cnt--;
 			}
-		}
-	}
+
 	spin_unlock(&cache_defer_lock);
 
 	while (!list_empty(&pending)) {
@@ -642,7 +636,7 @@ void cache_clean_deferred(void *owner)
 
 	list_for_each_entry_safe(dreq, tmp, &cache_defer_list, recent) {
 		if (dreq->owner == owner) {
-			list_del_init(&dreq->hash);
+			hlist_del_init(&dreq->hash);
 			list_move(&dreq->recent, &pending);
 			cache_defer_cnt--;
 		}



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03  6:31   ` [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost NeilBrown
@ 2010-02-03 15:43     ` Chuck Lever
  2010-02-03 21:23       ` Neil Brown
  0 siblings, 1 reply; 27+ messages in thread
From: Chuck Lever @ 2010-02-03 15:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: J. Bruce Fields, linux-nfs

On 02/03/2010 01:31 AM, NeilBrown wrote:
> If we drop a request in the sunrpc layer, either due kmalloc failure,
> or due to a cache miss when we could not queue the request for later
> replay, then close the connection to encourage the client to retry sooner.

I studied connection dropping behavior a few years back, and decided 
that dropping the connection on a retransmit is nearly always 
counterproductive.  Any other pending requests on a connection that is 
dropped must also be retransmitted, which means one retransmit suddenly 
turns into many.  And then you get into issues of idempotency and all 
the extra traffic and the long delays and the risk of reconnecting on a 
different port so that XID replay is undetectable...

I don't think dropping the connection will cause the client to 
retransmit sooner.  Clients I have encountered will reconnect and 
retransmit only after their retransmit timeout fires, never sooner.

Unfortunately NFSv4 requires a connection drop before a retransmit, but 
NFSv3 does not.  NFSv4 servers are rather supposed to try very hard not 
to drop requests.

How often do you expect this kind of recovery to be necessary?  Would it 
be possible to drop only for NFSv4 connections?


-- 
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Cache deferal improvements - try++
       [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
                     ` (8 preceding siblings ...)
  2010-02-03  6:31   ` [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update NeilBrown
@ 2010-02-03 19:43   ` J. Bruce Fields
  9 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-02-03 19:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:30PM +1100, NeilBrown wrote:
> Thanks to Bruce challenging me to justify the complexity of my
> previous version of this I have managed to simplify it significantly.

And thanks for persisting!  As you know it's also been a worry of mine
that the v4 compound behavior isn't correct with respect to deferrals,
so it will be reassuring to have this sorted out.  May be a day or two
before I get to take a careful look.

--b.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 15:43     ` Chuck Lever
@ 2010-02-03 21:23       ` Neil Brown
  2010-02-03 22:20         ` Trond Myklebust
  2010-02-03 22:34         ` Chuck Lever
  0 siblings, 2 replies; 27+ messages in thread
From: Neil Brown @ 2010-02-03 21:23 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, linux-nfs

On Wed, 03 Feb 2010 10:43:04 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> On 02/03/2010 01:31 AM, NeilBrown wrote:
> > If we drop a request in the sunrpc layer, either due kmalloc failure,
> > or due to a cache miss when we could not queue the request for later
> > replay, then close the connection to encourage the client to retry sooner.
> 
> I studied connection dropping behavior a few years back, and decided 
> that dropping the connection on a retransmit is nearly always 
> counterproductive.  Any other pending requests on a connection that is 
> dropped must also be retransmitted, which means one retransmit suddenly 
> turns into many.  And then you get into issues of idempotency and all 
> the extra traffic and the long delays and the risk of reconnecting on a 
> different port so that XID replay is undetectable...

You make some good points there, thanks.

> 
> I don't think dropping the connection will cause the client to 
> retransmit sooner.  Clients I have encountered will reconnect and 
> retransmit only after their retransmit timeout fires, never sooner.
> 

I thought I had noticed the Linux client resending immediately, but it would
have been a while ago, and I could easily be remembering wrongly.

My reasoning was that if the connection is closed then the client can *know*
that they won't get a response to any outstanding requests, rather than
having to use the timeout heuristic.  How the client uses that information I
don't know, but at least they would have it.

> Unfortunately NFSv4 requires a connection drop before a retransmit, but 
> NFSv3 does not.  NFSv4 servers are rather supposed to try very hard not 
> to drop requests.
> 
> How often do you expect this kind of recovery to be necessary?  Would it 
> be possible to drop only for NFSv4 connections?
> 

With the improved handling of large requests I would expect this kind of
recovery to be needed only very rarely.
Yes, it would be quite easy to only drop connections on which we have seen an
NFSv4 request... and maybe also connections on which we have not successfully
handled any request yet(?).
What if, instead of closing the connection, we set a flag so that it would be
closed as soon as it had been idle for 1 second,  thus flushing any other
pending requests???   That probably doesn't help - there would easily be real
cases where other threads of activity keep the connection busy, while the
thread waiting for the lost request still needs a full time-out.

I would be happy with the v4-only version.

Thanks,
NeilBrown



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 21:23       ` Neil Brown
@ 2010-02-03 22:20         ` Trond Myklebust
  2010-02-03 22:34           ` J. Bruce Fields
  2010-02-03 22:40           ` Chuck Lever
  2010-02-03 22:34         ` Chuck Lever
  1 sibling, 2 replies; 27+ messages in thread
From: Trond Myklebust @ 2010-02-03 22:20 UTC (permalink / raw)
  To: Neil Brown; +Cc: Chuck Lever, J. Bruce Fields, linux-nfs

On Thu, 2010-02-04 at 08:23 +1100, Neil Brown wrote: 
> On Wed, 03 Feb 2010 10:43:04 -0500
> Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > I don't think dropping the connection will cause the client to 
> > retransmit sooner.  Clients I have encountered will reconnect and 
> > retransmit only after their retransmit timeout fires, never sooner.
> > 
> 
> I thought I had noticed the Linux client resending immediately, but it would
> have been a while ago, and I could easily be remembering wrongly.

It depends on who closes the connection.

The client assumes that if the _server_ closes the connection, then it
may be having resource congestion issues. In order to give the server
time to recover, the client will delay reconnecting for 3 seconds (with
an exponential back off).

If, on the other hand, the client was the one that initiated the
connection closure, then it will try to reconnect immediately.
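
(A rough sketch of that policy in isolation -- this is not the client's
actual reconnect code, and the 300 second cap here is only an assumption
for the illustration:)

#include <stdio.h>

static unsigned int reconnect_delay(int server_closed, unsigned int *backoff)
{
	unsigned int delay;

	if (!server_closed) {
		*backoff = 3;		/* reset for any later server close */
		return 0;		/* client closed: reconnect at once */
	}
	delay = *backoff;		/* 3s, then back off exponentially */
	*backoff = *backoff * 2 > 300 ? 300 : *backoff * 2;
	return delay;
}

int main(void)
{
	unsigned int backoff = 3;
	int i;

	for (i = 0; i < 4; i++)
		printf("server-closed reconnect #%d: wait %us\n",
		       i + 1, reconnect_delay(1, &backoff));
	printf("client-closed reconnect:     wait %us\n",
	       reconnect_delay(0, &backoff));
	return 0;
}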

Cheers
  Trond


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 21:23       ` Neil Brown
  2010-02-03 22:20         ` Trond Myklebust
@ 2010-02-03 22:34         ` Chuck Lever
  1 sibling, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2010-02-03 22:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: J. Bruce Fields, linux-nfs

On 02/03/2010 04:23 PM, Neil Brown wrote:
> On Wed, 03 Feb 2010 10:43:04 -0500
> Chuck Lever<chuck.lever@oracle.com>  wrote:
>
>> On 02/03/2010 01:31 AM, NeilBrown wrote:
>>> If we drop a request in the sunrpc layer, either due kmalloc failure,
>>> or due to a cache miss when we could not queue the request for later
>>> replay, then close the connection to encourage the client to retry sooner.
>>
>> I studied connection dropping behavior a few years back, and decided
>> that dropping the connection on a retransmit is nearly always
>> counterproductive.  Any other pending requests on a connection that is
>> dropped must also be retransmitted, which means one retransmit suddenly
>> turns into many.  And then you get into issues of idempotency and all
>> the extra traffic and the long delays and the risk of reconnecting on a
>> different port so that XID replay is undetectable...
>
> You make some good points there, thanks.
>
>>
>> I don't think dropping the connection will cause the client to
>> retransmit sooner.  Clients I have encountered will reconnect and
>> retransmit only after their retransmit timeout fires, never sooner.
>>
>
> I thought I had noticed the Linux client resending immediately, but it would
> have been a while ago, and I could easily be remembering wrongly.
>
> My reasoning was that if the connection is closed then the client can *know*
> that they won't get a response to any outstanding requests, rather than
> having to use the timeout heuristic.  How the client uses that information I
> don't know, but at least they would have it.

So, I seem to remember expecting that behavior, and then being 
disappointed when it didn't work that way :-)

It's also the case that there is some pathological behavior around 
reconnect in some of the older 2.6 NFS clients.  Trond would probably 
remember the details there.

In any event, I think you would do well to make some direct observations 
with several different vintages of Linux NFS clients, just to be sure 
this works as you expect it to with reasonable clients.

I noticed that connection drops are especially onerous when a server is 
under load, and that's exactly when drops seem to occur most often. It's 
one of those corner cases that's nearly impossible to test well.

>> Unfortunately NFSv4 requires a connection drop before a retransmit, but
>> NFSv3 does not.  NFSv4 servers are rather supposed to try very hard not
>> to drop requests.
>>
>> How often do you expect this kind of recovery to be necessary?  Would it
>> be possible to drop only for NFSv4 connections?
>>
>
> With the improved handling of large requests I would expect this kind of
> recovery would be very rarely needed.
>
> Yes, it would be quite easy to only drop connections on which we have seen an
> NFSv4 request... and maybe also connections on which we have not successfully
> handled any request yet(?).
> What if, instead of closing the connection, we set a flag so that it would be
> closed as soon as it had been idle for 1 second,  thus flushing any other
> pending requests???   That probably doesn't help - there would easily be real
> cases where other threads of activity keep the connection busy, while the
> thread waiting for the lost request still needs a full time-out.
>
> I would be happy with the v4-only version.

v4 would almost be required to work this way, I think.  I wouldn't 
object to a v4-only implementation for now.

I'm sure there are cases where v3 will have to drop the connection... 
I'd just like to ensure that it's absolutely a last resort.

-- 
chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 22:20         ` Trond Myklebust
@ 2010-02-03 22:34           ` J. Bruce Fields
  2010-02-03 22:40           ` Chuck Lever
  1 sibling, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-02-03 22:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Neil Brown, Chuck Lever, linux-nfs

On Wed, Feb 03, 2010 at 05:20:10PM -0500, Trond Myklebust wrote:
> On Thu, 2010-02-04 at 08:23 +1100, Neil Brown wrote: 
> > On Wed, 03 Feb 2010 10:43:04 -0500
> > Chuck Lever <chuck.lever@oracle.com> wrote:
> > > 
> > > I don't think dropping the connection will cause the client to 
> > > retransmit sooner.  Clients I have encountered will reconnect and 
> > > retransmit only after their retransmit timeout fires, never sooner.
> > > 
> > 
> > I thought I had noticed the Linux client resending immediately, but it would
> > have been a while ago, and I could easily be remembering wrongly.
> 
> It depends on who closes the connection.
> 
> The client assumes that if the _server_ closes the connection, then it
> may be having resource congestion issues. In order to give the server
> time to recover, the client will delay reconnecting for 3 seconds (with
> an exponential back off).
> 
> If, on the other hand, the client was the one that initiated the
> connection closure, then it will try to reconnect immediately.

So, if I understand Neil's patches right:

	- First we try waiting at least one second for the upcall.
	- Then we try to return JUKEBOX/DELAY.  (But if we're still
	  processing the rpc header we may not have the option.)
	- Then we give up and drop the request.

Upcalls shouldn't normally take a second, so something's broken or
congested, whether it's us or our Kerberos server.  So telling the
client we're congested sounds right, as does the client response Trond
describes.
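
Schematically, my reading of the intended fallback order is something
like the following (purely illustrative, not code from the patches;
every name here is invented):

enum miss_action { USE_CACHED_ITEM, SEND_BUSY_REPLY, DROP_AND_CLOSE };

struct miss_ctx {
	int upcall_answered;	/* set once the userspace daemon has replied */
	int can_send_reply;	/* have we parsed enough to encode any reply? */
};

/* Stand-in for the real "block the nfsd thread" machinery. */
static int wait_for_upcall(struct miss_ctx *ctx, int seconds)
{
	(void)seconds;		/* the series waits roughly one second */
	return ctx->upcall_answered;
}

static enum miss_action handle_cache_miss(struct miss_ctx *ctx)
{
	/* 1. Wait (about a second) for the upcall to complete. */
	if (wait_for_upcall(ctx, 1))
		return USE_CACHED_ITEM;

	/* 2. Still pending: ask the client to retry later
	 *    (NFS4ERR_DELAY / NFS3ERR_JUKEBOX), if we can reply at all. */
	if (ctx->can_send_reply)
		return SEND_BUSY_REPLY;

	/* 3. Last resort: drop the request and, with patch 6/9, close
	 *    the connection so the client is not left guessing. */
	return DROP_AND_CLOSE;
}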

--b.


* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 22:20         ` Trond Myklebust
  2010-02-03 22:34           ` J. Bruce Fields
@ 2010-02-03 22:40           ` Chuck Lever
  2010-02-03 23:13             ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Chuck Lever @ 2010-02-03 22:40 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Neil Brown, J. Bruce Fields, linux-nfs

On 02/03/2010 05:20 PM, Trond Myklebust wrote:
> On Thu, 2010-02-04 at 08:23 +1100, Neil Brown wrote:
>> On Wed, 03 Feb 2010 10:43:04 -0500
>> Chuck Lever<chuck.lever@oracle.com>  wrote:
>>>
>>> I don't think dropping the connection will cause the client to
>>> retransmit sooner.  Clients I have encountered will reconnect and
>>> retransmit only after their retransmit timeout fires, never sooner.
>>>
>>
>> I thought I had noticed the Linux client resending immediately, but it would
>> have been a while ago, and I could easily be remembering wrongly.
>
> It depends on who closes the connection.
>
> The client assumes that if the _server_ closes the connection, then it
> may be having resource congestion issues. In order to give the server
> time to recover, the client will delay reconnecting for 3 seconds (with
> an exponential back off).
 >
> If, on the other hand, the client was the one that initiated the
> connection closure, then it will try to reconnect immediately.

That's only if there are RPC requests immediately ready to send, though, 
right?  A request that is waiting for a reply when the connection is 
dropped wouldn't be resent until its retransmit timer expired, I thought.

And, this behavior is true only for late-model clients... some of the 
earlier 2.6 clients have some trouble with this scenario, I seem to recall.

-- 
chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 22:40           ` Chuck Lever
@ 2010-02-03 23:13             ` Trond Myklebust
  2010-02-04  0:06               ` Chuck Lever
  0 siblings, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2010-02-03 23:13 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Neil Brown, J. Bruce Fields, linux-nfs

On Wed, 2010-02-03 at 17:40 -0500, Chuck Lever wrote: 
> On 02/03/2010 05:20 PM, Trond Myklebust wrote:
> > On Thu, 2010-02-04 at 08:23 +1100, Neil Brown wrote:
> >> On Wed, 03 Feb 2010 10:43:04 -0500
> >> Chuck Lever<chuck.lever@oracle.com>  wrote:
> >>>
> >>> I don't think dropping the connection will cause the client to
> >>> retransmit sooner.  Clients I have encountered will reconnect and
> >>> retransmit only after their retransmit timeout fires, never sooner.
> >>>
> >>
> >> I thought I had noticed the Linux client resending immediately, but it would
> >> have been a while ago, and I could easily be remembering wrongly.
> >
> > It depends on who closes the connection.
> >
> > The client assumes that if the _server_ closes the connection, then it
> > may be having resource congestion issues. In order to give the server
> > time to recover, the client will delay reconnecting for 3 seconds (with
> > an exponential back off).
>  >
> > If, on the other hand, the client was the one that initiated the
> > connection closure, then it will try to reconnect immediately.
> 
> That's only if there are RPC requests immediately ready to send, though, 
> right?  A request that is waiting for a reply when the connection is 
> dropped wouldn't be resent until its retransmit timer expired, I thought.

No, that is incorrect.

We call xprt_wake_pending_tasks() both when the connection is closed,
and when it is re-established: see the details in
xs_tcp_state_change(). 

It doesn't make sense for the client to defer resending requests when it
knows that the original connection was lost. Deferring would simply mean
that the chances of the server evicting the reply from its DRC will
increase.
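
The shape of it is roughly as follows (an illustrative sketch only, not
the kernel's xs_tcp_state_change(); the types and names are invented):

#include <errno.h>
#include <stddef.h>

enum conn_state { CONN_ESTABLISHED, CONN_CLOSED };

struct pending_task {
	struct pending_task *next;
	void (*wake)(struct pending_task *task, int status);
};

struct example_xprt {
	struct pending_task *pending;	/* requests waiting on this transport */
};

/* Wake every request queued on the transport with the given status. */
static void wake_pending_tasks(struct example_xprt *xprt, int status)
{
	struct pending_task *t;

	for (t = xprt->pending; t != NULL; t = t->next)
		t->wake(t, status);
}

/* Called on both connection loss and (re)establishment, so waiting
 * requests get to retransmit as soon as possible rather than sitting
 * on their RPC timeout. */
static void connection_state_change(struct example_xprt *xprt, enum conn_state state)
{
	switch (state) {
	case CONN_ESTABLISHED:
		wake_pending_tasks(xprt, 0);		/* go ahead and (re)send */
		break;
	case CONN_CLOSED:
		wake_pending_tasks(xprt, -EAGAIN);	/* requeue for retransmit */
		break;
	}
}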

> And, this behavior is true only for late-model clients... some of the 
> eariler 2.6 clients have some trouble with this scenario, I seem to recall.

Yes. Some of the earlier clients were too aggressive when reconnecting,
which sometimes led to problems when the server was truly congested. I
hope we've fixed that now, and that the distributions that are still
supporting older kernels are working on backporting those fixes.

Cheers
  Trond




* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-03 23:13             ` Trond Myklebust
@ 2010-02-04  0:06               ` Chuck Lever
  2010-02-04  0:24                 ` Trond Myklebust
  0 siblings, 1 reply; 27+ messages in thread
From: Chuck Lever @ 2010-02-04  0:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Neil Brown, J. Bruce Fields, linux-nfs

On 02/03/2010 06:13 PM, Trond Myklebust wrote:
> On Wed, 2010-02-03 at 17:40 -0500, Chuck Lever wrote:
>> On 02/03/2010 05:20 PM, Trond Myklebust wrote:
>>> On Thu, 2010-02-04 at 08:23 +1100, Neil Brown wrote:
>>>> On Wed, 03 Feb 2010 10:43:04 -0500
>>>> Chuck Lever<chuck.lever@oracle.com>   wrote:
>>>>>
>>>>> I don't think dropping the connection will cause the client to
>>>>> retransmit sooner.  Clients I have encountered will reconnect and
>>>>> retransmit only after their retransmit timeout fires, never sooner.
>>>>>
>>>>
>>>> I thought I had noticed the Linux client resending immediately, but it would
>>>> have been a while ago, and I could easily be remembering wrongly.
>>>
>>> It depends on who closes the connection.
>>>
>>> The client assumes that if the _server_ closes the connection, then it
>>> may be having resource congestion issues. In order to give the server
>>> time to recover, the client will delay reconnecting for 3 seconds (with
>>> an exponential back off).
>>   >
>>> If, on the other hand, the client was the one that initiated the
>>> connection closure, then it will try to reconnect immediately.
>>
>> That's only if there are RPC requests immediately ready to send, though,
>> right?  A request that is waiting for a reply when the connection is
>> dropped wouldn't be resent until its retransmit timer expired, I thought.
>
> No, that is incorrect.
>
> We call xprt_wake_pending_tasks() both when the connection is closed,
> and when it is re-established: see the details in
> xs_tcp_state_change().
>
> It doesn't make sense for the client to defer resending requests when it
> knows that the original connection was lost. Deferring would simply mean
> that the chances of the server evicting the reply from its DRC will
> increase.

True, thanks for clarifying.

My only concern would be that we perhaps should not assume that every 
2.6 NFS client does this.  There was an awful lot of churn at one point 
as you were getting all of these ducks in a row.  Before you changed the 
underlying RPC transports to return only ENOTCONN, for instance, was 
this behavior the same?  I wouldn't swear to it in the 2.6.7 to 2.6.9 
vintage kernels.

-- 
chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost.
  2010-02-04  0:06               ` Chuck Lever
@ 2010-02-04  0:24                 ` Trond Myklebust
  0 siblings, 0 replies; 27+ messages in thread
From: Trond Myklebust @ 2010-02-04  0:24 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Neil Brown, J. Bruce Fields, linux-nfs

On Wed, 2010-02-03 at 19:06 -0500, Chuck Lever wrote: 
> My only concern would be that we perhaps should not assume that every 
> 2.6 NFS client does this.  There was an awful lot of churn at one point 
> as you were getting all of these ducks in a row.  Before you changed the 
> underlying RPC transports to return only ENOTCONN, for instance, was 
> this behavior the same?  I wouldn't swear to it in the 2.6.7 - .9 
> vintage kernels.

Every change in this area is performance-sensitive, and should certainly
be tested carefully, particularly so with older kernels.

However, we shouldn't forget that the changes to the client were
motivated by poor performance against a variety of servers and in a
variety of workloads, so it isn't unreasonable to expect that SUSE,
Red Hat, Debian, Oracle, etc. do have an interest in fixing their NFS clients.

My question would therefore rather be: do they already have a fix in
their latest kernel revisions, and if not, are they able to give us a
timeframe?

Cheers
  Trond



* Re: [PATCH 1/9] sunrpc: don't keep expired entries in the auth caches.
       [not found]     ` <20100203063130.12945.29226.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-03-15  0:58       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-03-15  0:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> currently expired entries remain in the auth caches as long
> as there is a reference.
> This was needed long ago when the auth_domain cache used the same
> cache infrastructure.  But since that (being a very different sort
> of cache) was separated, this test is no longer needed.
> 
> So remove the test on refcnt and tidy up the surrounding code.
> 
> This allows the cache_dequeue call (which needed to be there to
> drop a potentially awkward reference) can be moved outside of the
> spinlock which is a better place for it.

Applied, thanks.--b.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  net/sunrpc/cache.c |   13 +++++--------
>  1 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 39bddba..83592e0 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -397,31 +397,28 @@ static int cache_clean(void)
>  		/* Ok, now to clean this strand */
>  
>  		cp = & current_detail->hash_table[current_index];
> -		ch = *cp;
> -		for (; ch; cp= & ch->next, ch= *cp) {
> +		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
>  			if (current_detail->nextcheck > ch->expiry_time)
>  				current_detail->nextcheck = ch->expiry_time+1;
>  			if (ch->expiry_time >= get_seconds() &&
>  			    ch->last_refresh >= current_detail->flush_time)
>  				continue;
> -			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
> -				cache_dequeue(current_detail, ch);
>  
> -			if (atomic_read(&ch->ref.refcount) == 1)
> -				break;
> -		}
> -		if (ch) {
>  			*cp = ch->next;
>  			ch->next = NULL;
>  			current_detail->entries--;
>  			rv = 1;
> +			break;
>  		}
> +
>  		write_unlock(&current_detail->hash_lock);
>  		d = current_detail;
>  		if (!ch)
>  			current_index ++;
>  		spin_unlock(&cache_list_lock);
>  		if (ch) {
> +			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
> +				cache_dequeue(current_detail, ch);
>  			cache_revisit_request(ch);
>  			cache_put(ch, d);
>  		}
> 
> 


* Re: [PATCH 2/9] sunrpc/cache: factor out cache_is_expired
       [not found]     ` <20100203063131.12945.65321.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-03-15  0:58       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-03-15  0:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> This removes a tiny bit of code duplication, but more important
> prepares for following patch which will perform the expiry check in
> cache_lookup and the rest of the validity check in cache_check.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

Also applied.--b.

> ---
>  net/sunrpc/cache.c |   13 ++++++++-----
>  1 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 83592e0..9826c5c 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -49,6 +49,12 @@ static void cache_init(struct cache_head *h)
>  	h->last_refresh = now;
>  }
>  
> +static inline int cache_is_expired(struct cache_detail *detail, struct cache_head *h)
> +{
> +	return  (h->expiry_time < get_seconds()) ||
> +		(detail->flush_time > h->last_refresh);
> +}
> +
>  struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  				       struct cache_head *key, int hash)
>  {
> @@ -184,9 +190,7 @@ static int cache_make_upcall(struct cache_detail *cd, struct cache_head *h)
>  static inline int cache_is_valid(struct cache_detail *detail, struct cache_head *h)
>  {
>  	if (!test_bit(CACHE_VALID, &h->flags) ||
> -	    h->expiry_time < get_seconds())
> -		return -EAGAIN;
> -	else if (detail->flush_time > h->last_refresh)
> +	    cache_is_expired(detail, h))
>  		return -EAGAIN;
>  	else {
>  		/* entry is valid */
> @@ -400,8 +404,7 @@ static int cache_clean(void)
>  		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
>  			if (current_detail->nextcheck > ch->expiry_time)
>  				current_detail->nextcheck = ch->expiry_time+1;
> -			if (ch->expiry_time >= get_seconds() &&
> -			    ch->last_refresh >= current_detail->flush_time)
> +			if (!cache_is_expired(current_detail, ch))
>  				continue;
>  
>  			*cp = ch->next;
> 
> 


* Re: [PATCH 7/9] nfsd: factor out hash functions for export caches.
       [not found]     ` <20100203063131.12945.38791.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-03-17 19:35       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-03-17 19:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> Both the _lookup and the _update functions for these two caches
> independently calculate the hash of the key.
> So factor out that code for improved reuse.

Applied.--b.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/export.c |   40 +++++++++++++++++++++++-----------------
>  1 files changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index c487810..cb8766c 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -258,10 +258,9 @@ static struct cache_detail svc_expkey_cache = {
>  	.alloc		= expkey_alloc,
>  };
>  
> -static struct svc_expkey *
> -svc_expkey_lookup(struct svc_expkey *item)
> +static int
> +svc_expkey_hash(struct svc_expkey *item)
>  {
> -	struct cache_head *ch;
>  	int hash = item->ek_fsidtype;
>  	char * cp = (char*)item->ek_fsid;
>  	int len = key_len(item->ek_fsidtype);
> @@ -269,6 +268,14 @@ svc_expkey_lookup(struct svc_expkey *item)
>  	hash ^= hash_mem(cp, len, EXPKEY_HASHBITS);
>  	hash ^= hash_ptr(item->ek_client, EXPKEY_HASHBITS);
>  	hash &= EXPKEY_HASHMASK;
> +	return hash;
> +}
> +
> +static struct svc_expkey *
> +svc_expkey_lookup(struct svc_expkey *item)
> +{
> +	struct cache_head *ch;
> +	int hash = svc_expkey_hash(item);
>  
>  	ch = sunrpc_cache_lookup(&svc_expkey_cache, &item->h,
>  				 hash);
> @@ -282,13 +289,7 @@ static struct svc_expkey *
>  svc_expkey_update(struct svc_expkey *new, struct svc_expkey *old)
>  {
>  	struct cache_head *ch;
> -	int hash = new->ek_fsidtype;
> -	char * cp = (char*)new->ek_fsid;
> -	int len = key_len(new->ek_fsidtype);
> -
> -	hash ^= hash_mem(cp, len, EXPKEY_HASHBITS);
> -	hash ^= hash_ptr(new->ek_client, EXPKEY_HASHBITS);
> -	hash &= EXPKEY_HASHMASK;
> +	int hash = svc_expkey_hash(new);
>  
>  	ch = sunrpc_cache_update(&svc_expkey_cache, &new->h,
>  				 &old->h, hash);
> @@ -737,14 +738,22 @@ struct cache_detail svc_export_cache = {
>  	.alloc		= svc_export_alloc,
>  };
>  
> -static struct svc_export *
> -svc_export_lookup(struct svc_export *exp)
> +static int
> +svc_export_hash(struct svc_export *exp)
>  {
> -	struct cache_head *ch;
>  	int hash;
> +
>  	hash = hash_ptr(exp->ex_client, EXPORT_HASHBITS);
>  	hash ^= hash_ptr(exp->ex_path.dentry, EXPORT_HASHBITS);
>  	hash ^= hash_ptr(exp->ex_path.mnt, EXPORT_HASHBITS);
> +	return hash;
> +}
> +
> +static struct svc_export *
> +svc_export_lookup(struct svc_export *exp)
> +{
> +	struct cache_head *ch;
> +	int hash = svc_export_hash(exp);
>  
>  	ch = sunrpc_cache_lookup(&svc_export_cache, &exp->h,
>  				 hash);
> @@ -758,10 +767,7 @@ static struct svc_export *
>  svc_export_update(struct svc_export *new, struct svc_export *old)
>  {
>  	struct cache_head *ch;
> -	int hash;
> -	hash = hash_ptr(old->ex_client, EXPORT_HASHBITS);
> -	hash ^= hash_ptr(old->ex_path.dentry, EXPORT_HASHBITS);
> -	hash ^= hash_ptr(old->ex_path.mnt, EXPORT_HASHBITS);
> +	int hash = svc_export_hash(old);
>  
>  	ch = sunrpc_cache_update(&svc_export_cache, &new->h,
>  				 &old->h,
> 
> 


* Re: [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup
       [not found]     ` <20100203063131.12945.97779.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-03-17 21:33       ` J. Bruce Fields
  2010-03-24  1:22         ` Neil Brown
  0 siblings, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2010-03-17 21:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> If sunrpc_cache_lookup finds an expired entry, remove it from
> the cache and return a freshly created non-VALID entry instead.
> This ensures that we only ever get a usable entry, or an
> entry that will become usable once an update arrives.
> i.e. we will never need to repeat the lookup.
> 
> This allows us to remove the 'is_expired' test from cache_check
> (i.e. from cache_is_valid).  cache_check should never get an expired
> entry as 'lookup' will never return one.  If it does happen - due to
> inconvenient timing - then just accept it as still valid, it won't be
> very much past it's use-by date.

Looks right to me.  Thanks, applied.

By the way, if we get sunrpc_cache_update(old, new1) and
sunrpc_cache_update(old, new2) simultaneously, what happens?

More generally: should we try to ensure that a cache never contains two
entries which match the same key?

--b.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  net/sunrpc/cache.c |   17 ++++++++++++++---
>  1 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 9826c5c..3e1ef8b 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -59,7 +59,7 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  				       struct cache_head *key, int hash)
>  {
>  	struct cache_head **head,  **hp;
> -	struct cache_head *new = NULL;
> +	struct cache_head *new = NULL, *freeme = NULL;
>  
>  	head = &detail->hash_table[hash];
>  
> @@ -68,6 +68,9 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
>  		struct cache_head *tmp = *hp;
>  		if (detail->match(tmp, key)) {
> +			if (cache_is_expired(detail, tmp))
> +				/* This entry is expired, we will discard it. */
> +				break;
>  			cache_get(tmp);
>  			read_unlock(&detail->hash_lock);
>  			return tmp;
> @@ -92,6 +95,13 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
>  		struct cache_head *tmp = *hp;
>  		if (detail->match(tmp, key)) {
> +			if (cache_is_expired(detail, tmp)) {
> +				*hp = tmp->next;
> +				tmp->next = NULL;
> +				detail->entries --;
> +				freeme = tmp;
> +				break;
> +			}
>  			cache_get(tmp);
>  			write_unlock(&detail->hash_lock);
>  			cache_put(new, detail);
> @@ -104,6 +114,8 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  	cache_get(new);
>  	write_unlock(&detail->hash_lock);
>  
> +	if (freeme)
> +		cache_put(freeme, detail);
>  	return new;
>  }
>  EXPORT_SYMBOL_GPL(sunrpc_cache_lookup);
> @@ -189,8 +201,7 @@ static int cache_make_upcall(struct cache_detail *cd, struct cache_head *h)
>  
>  static inline int cache_is_valid(struct cache_detail *detail, struct cache_head *h)
>  {
> -	if (!test_bit(CACHE_VALID, &h->flags) ||
> -	    cache_is_expired(detail, h))
> +	if (!test_bit(CACHE_VALID, &h->flags))
>  		return -EAGAIN;
>  	else {
>  		/* entry is valid */
> 
> 


* Re: [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup
  2010-03-17 21:33       ` J. Bruce Fields
@ 2010-03-24  1:22         ` Neil Brown
  2010-03-30 15:11           ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: Neil Brown @ 2010-03-24  1:22 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Wed, 17 Mar 2010 17:33:07 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> > If sunrpc_cache_lookup finds an expired entry, remove it from
> > the cache and return a freshly created non-VALID entry instead.
> > This ensures that we only ever get a usable entry, or an
> > entry that will become usable once an update arrives.
> > i.e. we will never need to repeat the lookup.
> > 
> > This allows us to remove the 'is_expired' test from cache_check
> > (i.e. from cache_is_valid).  cache_check should never get an expired
> > entry as 'lookup' will never return one.  If it does happen - due to
> > inconvenient timing - then just accept it as still valid, it won't be
> > very much past it's use-by date.
> 
> Looks right to me.  Thanks, applied.
> 
> By the way, if we get sunrpc_cache_update(old, new1) and
> sunrpc_cache_update(old, new2) simultaneously, what happens?

Interesting question.
I guess you could get two entries for the same key in the cache.
However, the ->parse routines are protected by i_mutex, so you would need one
update to come through /proc/net/rpc/..../channel, and the other to come
through the legacy nfsd syscall.
Highly unlikely.

> 
> More generally: should we try to ensure that a cache never contains two
> entries which match the same key?

I don't think we need to.  The newer will override the older, which will
eventually expire from the cache or be flushed.
So, worst case, someone will look in the /content file, see two entries with
the same key, and get confused.  I don't think it is a problem that needs
fixing.

Thanks,
NeilBrown




* Re: [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup
  2010-03-24  1:22         ` Neil Brown
@ 2010-03-30 15:11           ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-03-30 15:11 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-nfs

On Wed, Mar 24, 2010 at 12:22:00PM +1100, Neil Brown wrote:
> On Wed, 17 Mar 2010 17:33:07 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> > > If sunrpc_cache_lookup finds an expired entry, remove it from
> > > the cache and return a freshly created non-VALID entry instead.
> > > This ensures that we only ever get a usable entry, or an
> > > entry that will become usable once an update arrives.
> > > i.e. we will never need to repeat the lookup.
> > > 
> > > This allows us to remove the 'is_expired' test from cache_check
> > > (i.e. from cache_is_valid).  cache_check should never get an expired
> > > entry as 'lookup' will never return one.  If it does happen - due to
> > > inconvenient timing - then just accept it as still valid, it won't be
> > > very much past it's use-by date.
> > 
> > Looks right to me.  Thanks, applied.
> > 
> > By the way, if we get sunrpc_cache_update(old, new1) and
> > sunrpc_cache_update(old, new2) simultaneously, what happens?
> 
> Interesting question.
> I guess you could get two entries for the same key in the cache.
> However the ->parse routines are protected by i_mutex

Oh, right, missed that.  Might simplify verification of this sort of
thing to have that be the responsibility of the core cache code rather
than the caller, though.

> so you would need on
> update to come through /proc/net/rpc/..../channel, and the other to come
> through the legacy nfsd syscal.
> Highly unlikely.
> 
> > 
> > More generally: should we try to ensure that a cache never contains two
> > entries which match the same key?
> 
> I don't think we need to.  The newer will over-ride the older which will
> eventually expire from the cache or be flushed.
> So worst-case someone will look in the /content file, see two entries with
> the same key, and get confused.  I don't think it is a problem that needs
> fixing.

Well, my real fear here is that an rpc call could stall indefinitely if
it waited on one item while the other one got updated.

I don't see how that's possible, though.

--b.


* Re: [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update.
  2010-02-03  6:31   ` [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update NeilBrown
@ 2010-04-15 15:55     ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2010-04-15 15:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On Wed, Feb 03, 2010 at 05:31:31PM +1100, NeilBrown wrote:
> The current practice of waiting for cache updates by queueing the
> whole request to be retried has (at least) two problems.
> 
> 1/ With NFSv4, requests can be quite complex and re-trying a whole
>   request when a latter part fails should only be a last-resort, not a
>   normal practice.

I want to avoid it completely, actually.  It's not so much that the
retry is expensive, as that it's actually incorrect in the case of
nonidempotent operations.  (Chief among them, SEQUENCE--see
rq_usedeferral).

How about just letting the fallback be making the client retry?

We could do that just in the v4 case (returning DELAY to v4 clients, but
doing the existing deferral for others).  Or we could handle them all
similarly (returning JUKEBOX in the v3 case, and dropping in the v2
case?)

Making the client retry is not as nice, but this should be an unusual
case.  I'm not sure how to think about what the best behavior would be in
such a case.
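
For the "handle them all similarly" option, the choice would look
something like this (a sketch only, not code from any patch; the names
are invented):

enum miss_reply { RETURN_DELAY, RETURN_JUKEBOX, DROP_REQUEST };

/* Pick a per-version fallback when we cannot wait for the cache. */
static enum miss_reply choose_fallback(int nfs_version)
{
	switch (nfs_version) {
	case 4:
		return RETURN_DELAY;	/* NFS4ERR_DELAY */
	case 3:
		return RETURN_JUKEBOX;	/* NFS3ERR_JUKEBOX, same wire value */
	default:
		return DROP_REQUEST;	/* v2 has no "try again later" error */
	}
}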

> @@ -554,6 +573,29 @@ static int cache_defer_req(struct cache_req *req, struct cache_head *item)
>  		cache_revisit_request(item);
>  		return -EAGAIN;
>  	}
> +
> +	if (dreq == &sleeper.handle) {
> +		wait_event_interruptible_timeout(
> +			sleeper.wait,
> +			!test_bit(CACHE_PENDING, &item->flags)
> +			|| list_empty(&sleeper.handle.hash),
> +			req->thread_wait);
> +		spin_lock(&cache_defer_lock);
> +		if (!list_empty(&sleeper.handle.hash)) {
> +			list_del_init(&sleeper.handle.recent);
> +			list_del_init(&sleeper.handle.hash);
> +			cache_defer_cnt--;
> +		}
> +		spin_unlock(&cache_defer_lock);
> +		if (test_bit(CACHE_PENDING, &item->flags)) {
> +			/* item is still pending, try request
> +			 * deferral
> +			 */
> +			dreq = req->defer(req);
> +			goto retry;
> +		}
> +		return -EEXIST;

Can this be right?  If we wake up from sleep with CACHE_PENDING cleared,
that's the successful case.

--b.


end of thread, other threads:[~2010-04-15 15:55 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-03  6:31 [PATCH 0/9] Cache deferal improvements - try++ NeilBrown
     [not found] ` <20100203060657.12945.27293.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-02-03  6:31   ` [PATCH 7/9] nfsd: factor out hash functions for export caches NeilBrown
     [not found]     ` <20100203063131.12945.38791.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-03-17 19:35       ` J. Bruce Fields
2010-02-03  6:31   ` [PATCH 6/9] sunrpc: close connection when a request is irretrievably lost NeilBrown
2010-02-03 15:43     ` Chuck Lever
2010-02-03 21:23       ` Neil Brown
2010-02-03 22:20         ` Trond Myklebust
2010-02-03 22:34           ` J. Bruce Fields
2010-02-03 22:40           ` Chuck Lever
2010-02-03 23:13             ` Trond Myklebust
2010-02-04  0:06               ` Chuck Lever
2010-02-04  0:24                 ` Trond Myklebust
2010-02-03 22:34         ` Chuck Lever
2010-02-03  6:31   ` [PATCH 1/9] sunrpc: don't keep expired entries in the auth caches NeilBrown
     [not found]     ` <20100203063130.12945.29226.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-03-15  0:58       ` J. Bruce Fields
2010-02-03  6:31   ` [PATCH 9/9] sunrpc/cache: change deferred-request hash table to use hlist NeilBrown
2010-02-03  6:31   ` [PATCH 8/9] svcauth_gss: replace a trivial 'switch' with an 'if' NeilBrown
2010-02-03  6:31   ` [PATCH 5/9] nfsd/idmap: drop special request deferal in favour of improved default NeilBrown
2010-02-03  6:31   ` [PATCH 3/9] sunrpc: never return expired entries in sunrpc_cache_lookup NeilBrown
     [not found]     ` <20100203063131.12945.97779.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-03-17 21:33       ` J. Bruce Fields
2010-03-24  1:22         ` Neil Brown
2010-03-30 15:11           ` J. Bruce Fields
2010-02-03  6:31   ` [PATCH 2/9] sunrpc/cache: factor out cache_is_expired NeilBrown
     [not found]     ` <20100203063131.12945.65321.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-03-15  0:58       ` J. Bruce Fields
2010-02-03  6:31   ` [PATCH 4/9] sunrpc/cache: allow threads to block while waiting for cache update NeilBrown
2010-04-15 15:55     ` J. Bruce Fields
2010-02-03 19:43   ` [PATCH 0/9] Cache deferal improvements - try++ J. Bruce Fields
