[PATCH v3 0/7] nfs-utils: support for lifting grace period early

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/7] nfs-utils: support for lifting grace period early
@ 2014-09-08 16:30 Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 1/7] sm-notify: inform the kernel if there were no hosts to notify Jeff Layton
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

v3:
- account for change to NFSDCLTRACK_RECLAIM_COMPLETE env var

This patchset adds some support to sm-notify and nfsdcltrack for lifting
the grace periods early. Allowing this to actually work depends on the
companion kernel patchset, but the approach I've taken here should deal
properly with userland/kernel mismatches.

There are two main pieces:

sm-notify: in the event that sm-notify isn't sending any NOTIFY
requests, we don't expect to see any reclaims from clients. In that
case, we should be able to safely lift the lockd grace period early.
The first patch in the series implements this (though we'll probably
need a bit of selinux work to get that working in Fedora under enforcing
mode).

nfsdcltrack: if there are no v4.0 clients and all v4.1+ clients have
issued a RECLAIM_COMPLETE, then we can go ahead and end the nfsd grace
period. The remainder of the patchset adds the support for this. This
requires revving the DB schema for it, and making use of the environment
variables that are passed to the upcall by the kernel.

With an updated kernel and nfs-utils, I typically see the grace period
being lifted just a few seconds after it starts. It may take a little
longer with more clients, but this is a vast improvement over having
to wait 90s after each reboot to get meaningful work done.

Assuming that the kernel parts are acceptable to Bruce et. al., then I
think we'll want this merged around the same time.

Jeff Layton (7):
  sm-notify: inform the kernel if there were no hosts to notify
  nfsdcltrack: update comments in sqlite.c
  nfsdcltrack: rename CLD_* constants with CLTRACK_* prefixes
  nfsdcltrack: overhaul database initializtion
  nfsdcltrack: update schema to v2
  nfsdcltrack: grab the NFSDCLTRACK_RECLAIM_COMPLETE env var if it's
    present
  nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment

 utils/nfsdcltrack/nfsdcltrack.c |  93 +++++++++-
 utils/nfsdcltrack/sqlite.c      | 375 ++++++++++++++++++++++++++++++----------
 utils/nfsdcltrack/sqlite.h      |   5 +-
 utils/statd/sm-notify.c         |  25 +++
 4 files changed, 400 insertions(+), 98 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/7] sm-notify: inform the kernel if there were no hosts to notify
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c Jeff Layton
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

In the event that there no hosts to be notified after a reboot, there's
no real reason to force lockd to wait the entire grace period before
handing out locks. We're not expecting any reclaim requests to come in
that situation.

Have sm-notify do a write to /proc/fs/lockd/nlm_end_grace if that file
is present. That informs the kernel that it's OK to go ahead and lift
lockd's grace period early.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/statd/sm-notify.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/utils/statd/sm-notify.c b/utils/statd/sm-notify.c
index 9dbe5d908336..828a6991f8e6 100644
--- a/utils/statd/sm-notify.c
+++ b/utils/statd/sm-notify.c
@@ -42,6 +42,8 @@
 #define NSM_TIMEOUT	2
 #define NSM_MAX_TIMEOUT	120	/* don't make this too big */
 
+#define NLM_END_GRACE_FILE	"/proc/fs/lockd/nlm_end_grace"
+
 struct nsm_host {
 	struct nsm_host *	next;
 	char *			name;
@@ -450,6 +452,28 @@ retry:
 	return sock;
 }
 
+/* Inform the kernel that it's OK to lift lockd's grace period */
+static void
+nsm_lift_grace_period(void)
+{
+	int fd;
+
+	fd = open(NLM_END_GRACE_FILE, O_WRONLY);
+	if (fd < 0) {
+		/* Don't warn if file isn't present */
+		if (errno != ENOENT)
+			xlog(L_WARNING, "Unable to open %s: %m",
+				NLM_END_GRACE_FILE);
+		return;
+	}
+
+	if (write(fd, "Y", 1) < 0)
+		xlog(L_WARNING, "Unable to write to %s: %m", NLM_END_GRACE_FILE);
+
+	close(fd);
+	return;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -534,6 +558,7 @@ usage:		fprintf(stderr,
 	(void)nsm_retire_monitored_hosts();
 	if (nsm_load_notify_list(smn_get_host) == 0) {
 		xlog(D_GENERAL, "No hosts to notify; exiting");
+		nsm_lift_grace_period();
 		return 0;
 	}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 1/7] sm-notify: inform the kernel if there were no hosts to notify Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-11 15:53   ` J. Bruce Fields
  2014-09-08 16:30 ` [PATCH v3 3/7] nfsdcltrack: rename CLD_* constants with CLTRACK_* prefixes Jeff Layton
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

Clean up and fix some inaccuracies.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/sqlite.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index bac678980938..352f029d55c0 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -21,17 +21,15 @@
  * Explanation:
  *
  * This file contains the code to manage the sqlite backend database for the
- * clstated upcall daemon.
+ * nfsdcltrack usermodehelper upcall program.
  *
  * The main database is called main.sqlite and contains the following tables:
  *
  * parameters: simple key/value pairs for storing database info
  *
- * clients: one column containing a BLOB with the as sent by the client
- * 	    and a timestamp (in epoch seconds) of when the record was
- * 	    established
- *
- * FIXME: should we also record the fsid being accessed?
+ * clients: an "id" column containing a BLOB with the long-form clientid as
+ * 	    sent by the client, a "time" column containing a timestamp (in
+ * 	    epoch seconds) of when the record was last updated.
  */
 
 #ifdef HAVE_CONFIG_H
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 3/7] nfsdcltrack: rename CLD_* constants with CLTRACK_* prefixes
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 1/7] sm-notify: inform the kernel if there were no hosts to notify Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 4/7] nfsdcltrack: overhaul database initializtion Jeff Layton
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

From: Jeff Layton <jlayton@poochiereds.net>

Since nfsdcld has been dead for a few years now, clean up the prefixes
on the constants.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/sqlite.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index 352f029d55c0..52fcaa4d7ca3 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -50,10 +50,10 @@
 
 #include "xlog.h"
 
-#define CLD_SQLITE_SCHEMA_VERSION 1
+#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 1
 
 /* in milliseconds */
-#define CLD_SQLITE_BUSY_TIMEOUT 10000
+#define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
 
 /* private data structures */
 
@@ -111,7 +111,7 @@ sqlite_prepare_dbh(const char *topdir)
 		return ret;
 	}
 
-	ret = sqlite3_busy_timeout(dbh, CLD_SQLITE_BUSY_TIMEOUT);
+	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "Unable to set sqlite busy timeout: %d", ret);
 		sqlite3_close(dbh);
@@ -156,7 +156,7 @@ sqlite_maindb_init(const char *topdir)
 	/* insert version into table -- ignore error if it fails */
 	ret = snprintf(buf, sizeof(buf),
 		       "INSERT OR IGNORE INTO parameters values (\"version\", "
-		       "\"%d\");", CLD_SQLITE_SCHEMA_VERSION);
+		       "\"%d\");", CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION);
 	if (ret < 0) {
 		goto out_err;
 	} else if ((size_t)ret >= sizeof(buf)) {
@@ -189,10 +189,10 @@ sqlite_maindb_init(const char *topdir)
 
 	/* process SELECT result */
 	ret = sqlite3_column_int(stmt, 0);
-	if (ret != CLD_SQLITE_SCHEMA_VERSION) {
+	if (ret != CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION) {
 		xlog(L_ERROR, "Unsupported database schema version! "
 			"Expected %d, got %d.",
-			CLD_SQLITE_SCHEMA_VERSION, ret);
+			CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION, ret);
 		ret = -EINVAL;
 		goto out_err;
 	}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 4/7] nfsdcltrack: overhaul database initializtion
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
                   ` (2 preceding siblings ...)
  2014-09-08 16:30 ` [PATCH v3 3/7] nfsdcltrack: rename CLD_* constants with CLTRACK_* prefixes Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 5/7] nfsdcltrack: update schema to v2 Jeff Layton
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

We have some possibility for races with nfsdcltrack when the DB schema
is upgraded.

Suppose we update the nfs-utils package on a machine after the DB has
been initialized. With the current scheme of initializing the DB only
during the "init" phase, we could end up with a new program that expects
a new schema with an old database.

We could try to do a one-time update when the package is installed, but
that could be racy. We could get an upcall between when the program is
installed and when we run the update. Also, relying on packaging to get
that right is tricky at best. To fix this, change how the database
initialization and checking of the schema revision works.

On every upcall, attempt to open the db as we normally would. If that
fails, then try to create the directory if it doesn't exist and then
retry the open. If it fails again, then give up.

If we get a successful open, then query the DB for the schema version.
If it matches what we expect, then declare success and move on. If the
query fails then assume that the DB isn't set up yet. Start an exclusive
transaction, check the schema version again and then set up the DB if no
one raced in to create it in the meantime.

This should only add a tiny bit of overhead on most upcalls (just an
extra select of the parameters table), and should improve the
performance of the "init" upcall. It'll also make it possible to handle
DB schema changes sanely.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/nfsdcltrack.c |   2 +-
 utils/nfsdcltrack/sqlite.c      | 225 +++++++++++++++++++++++++---------------
 utils/nfsdcltrack/sqlite.h      |   1 -
 3 files changed, 142 insertions(+), 86 deletions(-)

diff --git a/utils/nfsdcltrack/nfsdcltrack.c b/utils/nfsdcltrack/nfsdcltrack.c
index 4334340197b1..f2813610be73 100644
--- a/utils/nfsdcltrack/nfsdcltrack.c
+++ b/utils/nfsdcltrack/nfsdcltrack.c
@@ -241,7 +241,7 @@ cltrack_init(const char __attribute__((unused)) *unused)
 	}
 
 	/* set up storage db */
-	ret = sqlite_maindb_init(storagedir);
+	ret = sqlite_prepare_dbh(storagedir);
 	if (ret) {
 		xlog(L_ERROR, "Failed to init database: %d", ret);
 		/*
diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index 52fcaa4d7ca3..dde2f7ff8621 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -88,135 +88,192 @@ mkdir_if_not_exist(const char *dirname)
 	return ret;
 }
 
-/* Open the database and set up the database handle for it */
-int
-sqlite_prepare_dbh(const char *topdir)
+static int
+sqlite_query_schema_version(void)
 {
 	int ret;
+	sqlite3_stmt *stmt = NULL;
 
-	/* Do nothing if the database handle is already set up */
-	if (dbh)
-		return 0;
-
-	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", topdir);
-	if (ret < 0)
-		return ret;
-
-	buf[PATH_MAX - 1] = '\0';
-
-	ret = sqlite3_open(buf, &dbh);
+	/* prepare select query */
+	ret = sqlite3_prepare_v2(dbh,
+		"SELECT value FROM parameters WHERE key == \"version\";",
+		 -1, &stmt, NULL);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to open main database: %d", ret);
-		dbh = NULL;
-		return ret;
+		xlog(L_ERROR, "Unable to prepare select statement: %s",
+			sqlite3_errstr(ret));
+		ret = 0;
+		goto out;
 	}
 
-	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to set sqlite busy timeout: %d", ret);
-		sqlite3_close(dbh);
-		dbh = NULL;
+	/* query schema version */
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(L_ERROR, "Select statement execution failed: %s",
+				sqlite3_errstr(ret));
+		ret = 0;
+		goto out;
 	}
 
+	ret = sqlite3_column_int(stmt, 0);
+out:
+	sqlite3_finalize(stmt);
 	return ret;
 }
 
 /*
- * Open the "main" database, and attempt to initialize it by creating the
- * parameters table and inserting the schema version into it. Ignore any errors
- * from that, and then attempt to select the version out of it again. If the
- * version appears wrong, then assume that the DB is corrupt or has been
- * upgraded, and return an error. If all of that works, then attempt to create
- * the "clients" table.
+ * Start an exclusive transaction and recheck the DB schema version. If it's
+ * still zero (indicating a new database) then set it up. If that all works,
+ * then insert schema version into the parameters table and commit the
+ * transaction. On any error, rollback the transaction.
  */
 int
-sqlite_maindb_init(const char *topdir)
+sqlite_maindb_init_v1(void)
 {
 	int ret;
 	char *err = NULL;
-	sqlite3_stmt *stmt = NULL;
 
-	ret = mkdir_if_not_exist(topdir);
-	if (ret)
+	/* Start a transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
 		return ret;
+	}
 
-	ret = sqlite_prepare_dbh(topdir);
-	if (ret)
-		return ret;
+	/*
+	 * Check schema version again. This time, under an exclusive
+	 * transaction to guard against racing DB setup attempts
+	 */
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case 0:
+		/* Query failed again -- set up DB */
+		break;
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* Someone else raced in and set it up */
+		ret = 0;
+		goto rollback;
+	default:
+		/* Something went wrong -- fail! */
+		ret = -EINVAL;
+		goto rollback;
+	}
 
-	/* Try to create table */
-	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
+	ret = sqlite3_exec(dbh, "CREATE TABLE parameters "
 				"(key TEXT PRIMARY KEY, value TEXT);",
 				NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to create parameter table: %d", ret);
-		goto out_err;
+		xlog(L_ERROR, "Unable to create parameter table: %s", err);
+		goto rollback;
 	}
 
-	/* insert version into table -- ignore error if it fails */
-	ret = snprintf(buf, sizeof(buf),
-		       "INSERT OR IGNORE INTO parameters values (\"version\", "
-		       "\"%d\");", CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION);
+	/* create the "clients" table */
+	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
+				"time INTEGER);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create clients table: %s", err);
+		goto rollback;
+	}
+
+
+	/* insert version into parameters table */
+	ret = snprintf(buf, sizeof(buf), "INSERT OR FAIL INTO parameters "
+			"values (\"version\", \"%d\");",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
 	if (ret < 0) {
-		goto out_err;
+		xlog(L_ERROR, "sprintf failed!");
+		goto rollback;
 	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
 		ret = -EINVAL;
-		goto out_err;
+		goto rollback;
 	}
 
 	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to insert into parameter table: %d",
-				ret);
-		goto out_err;
+		xlog(L_ERROR, "Unable to insert into parameter table: %s", err);
+		goto rollback;
 	}
 
-	ret = sqlite3_prepare_v2(dbh,
-		"SELECT value FROM parameters WHERE key == \"version\";",
-		 -1, &stmt, NULL);
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to prepare select statement: %d", ret);
-		goto out_err;
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
 	}
+out:
+	sqlite3_free(err);
+	return ret;
 
-	/* check schema version */
-	ret = sqlite3_step(stmt);
-	if (ret != SQLITE_ROW) {
-		xlog(L_ERROR, "Select statement execution failed: %s",
-				sqlite3_errmsg(dbh));
-		goto out_err;
-	}
+rollback:
+	/* Attempt to rollback the transaction */
+	sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	goto out;
+}
 
-	/* process SELECT result */
-	ret = sqlite3_column_int(stmt, 0);
-	if (ret != CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION) {
-		xlog(L_ERROR, "Unsupported database schema version! "
-			"Expected %d, got %d.",
-			CLTRACK_SQLITE_CURRENT_SCHEMA_VERSION, ret);
-		ret = -EINVAL;
-		goto out_err;
-	}
+/* Open the database and set up the database handle for it */
+int
+sqlite_prepare_dbh(const char *topdir)
+{
+	int ret;
 
-	/* now create the "clients" table */
-	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS clients "
-				"(id BLOB PRIMARY KEY, time INTEGER);",
-				NULL, NULL, &err);
+	/* Do nothing if the database handle is already set up */
+	if (dbh)
+		return 0;
+
+	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", topdir);
+	if (ret < 0)
+		return ret;
+
+	buf[PATH_MAX - 1] = '\0';
+
+	/* open a new DB handle */
+	ret = sqlite3_open(buf, &dbh);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to create clients table: %s", err);
-		goto out_err;
+		/* try to create the dir */
+		ret = mkdir_if_not_exist(topdir);
+		if (ret)
+			goto out_close;
+
+		/* retry open */
+		ret = sqlite3_open(buf, &dbh);
+		if (ret != SQLITE_OK)
+			goto out_close;
 	}
 
-	sqlite3_free(err);
-	sqlite3_finalize(stmt);
-	return 0;
+	/* set busy timeout */
+	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to set sqlite busy timeout: %s",
+				sqlite3_errstr(ret));
+		goto out_close;
+	}
 
-out_err:
-	if (err) {
-		xlog(L_ERROR, "sqlite error: %s", err);
-		sqlite3_free(err);
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* DB is already set up. Do nothing */
+		ret = 0;
+		break;
+	case 0:
+		/* Query failed -- try to set up new DB */
+		ret = sqlite_maindb_init_v1();
+		if (ret)
+			goto out_close;
+		break;
+	default:
+		/* Unknown DB version -- downgrade? Fail */
+		xlog(L_ERROR, "Unsupported database schema version! "
+			"Expected %d, got %d.",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION, ret);
+		ret = -EINVAL;
+		goto out_close;
 	}
-	sqlite3_finalize(stmt);
-	sqlite3_close(dbh);
+
+	return ret;
+out_close:
+	sqlite3_close_v2(dbh);
+	dbh = NULL;
 	return ret;
 }
 
diff --git a/utils/nfsdcltrack/sqlite.h b/utils/nfsdcltrack/sqlite.h
index ebf04c3cdbaa..3713bfb66e2f 100644
--- a/utils/nfsdcltrack/sqlite.h
+++ b/utils/nfsdcltrack/sqlite.h
@@ -21,7 +21,6 @@
 #define _SQLITE_H_
 
 int sqlite_prepare_dbh(const char *topdir);
-int sqlite_maindb_init(const char *topdir);
 int sqlite_insert_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
 int sqlite_check_client(const unsigned char *clname, const size_t namelen);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
                   ` (3 preceding siblings ...)
  2014-09-08 16:30 ` [PATCH v3 4/7] nfsdcltrack: overhaul database initializtion Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-11 19:55   ` J. Bruce Fields
  2014-09-08 16:30 ` [PATCH v3 6/7] nfsdcltrack: grab the NFSDCLTRACK_RECLAIM_COMPLETE env var if it's present Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 7/7] nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment Jeff Layton
  6 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

From: Jeff Layton <jlayton@poochiereds.net>

In order to allow knfsd's lock manager to lift its grace period early,
we need to figure out whether all clients have finished reclaiming
their state not. Unfortunately, the current code doesn't allow us to
ascertain this. All we track for each client is a timestamp that tells
us when the last "check" or "create" operation came in.

We need to track the two timestamps separately. Add a new
"reclaim_complete" column to the database that tells us when the last
"create" operation came in. For now, we just insert "0" in that column
but a later patch will make it so that we insert a real timestamp for
v4.1+ client records.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/sqlite.c | 102 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 94 insertions(+), 8 deletions(-)

diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index dde2f7ff8621..e260e81b1722 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -29,7 +29,10 @@
  *
  * clients: an "id" column containing a BLOB with the long-form clientid as
  * 	    sent by the client, a "time" column containing a timestamp (in
- * 	    epoch seconds) of when the record was last updated.
+ * 	    epoch seconds) of when the record was last updated, and a
+ * 	    "reclaim_complete" column containing a timestamp (in epoch seconds)
+ * 	    of when the last "create" operation came in for v4.1+ clients.
+ * 	    v4.0 clients should always have this set to 0.
  */
 
 #ifdef HAVE_CONFIG_H
@@ -50,7 +53,7 @@
 
 #include "xlog.h"
 
-#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 1
+#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
 
 /* in milliseconds */
 #define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
@@ -120,6 +123,81 @@ out:
 	return ret;
 }
 
+static int
+sqlite_maindb_update_v1_to_v2(void)
+{
+	int ret, ret2;
+	char *err;
+
+	/* begin transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
+		goto rollback;
+	}
+
+	/*
+	 * Check schema version again. This time, under an exclusive
+	 * transaction to guard against racing DB setup attempts
+	 */
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case 1:
+		/* Still at v1 -- do conversion */
+		break;
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* Someone else raced in and set it up */
+		ret = 0;
+		goto rollback;
+	default:
+		/* Something went wrong -- fail! */
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	/* create v2 clients table */
+	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
+				"reclaim_complete INTEGER;",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to update clients table: %s", err);
+		goto rollback;
+	}
+
+	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET value = %d "
+			"WHERE key = \"version\";",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		goto rollback;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to update schema version: %s", err);
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
+	}
+out:
+	sqlite3_free(err);
+	return ret;
+rollback:
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
+	goto out;
+}
+
 /*
  * Start an exclusive transaction and recheck the DB schema version. If it's
  * still zero (indicating a new database) then set it up. If that all works,
@@ -127,9 +205,9 @@ out:
  * transaction. On any error, rollback the transaction.
  */
 int
-sqlite_maindb_init_v1(void)
+sqlite_maindb_init_v2(void)
 {
-	int ret;
+	int ret, ret2;
 	char *err = NULL;
 
 	/* Start a transaction */
@@ -169,7 +247,7 @@ sqlite_maindb_init_v1(void)
 
 	/* create the "clients" table */
 	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
-				"time INTEGER);",
+				"time INTEGER, reclaim_complete INTEGER);",
 				NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "Unable to create clients table: %s", err);
@@ -207,7 +285,9 @@ out:
 
 rollback:
 	/* Attempt to rollback the transaction */
-	sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
 	goto out;
 }
 
@@ -255,9 +335,15 @@ sqlite_prepare_dbh(const char *topdir)
 		/* DB is already set up. Do nothing */
 		ret = 0;
 		break;
+	case 1:
+		/* Old DB -- update to new schema */
+		ret = sqlite_maindb_update_v1_to_v2();
+		if (ret)
+			goto out_close;
+		break;
 	case 0:
 		/* Query failed -- try to set up new DB */
-		ret = sqlite_maindb_init_v1();
+		ret = sqlite_maindb_init_v2();
 		if (ret)
 			goto out_close;
 		break;
@@ -289,7 +375,7 @@ sqlite_insert_client(const unsigned char *clname, const size_t namelen)
 	sqlite3_stmt *stmt = NULL;
 
 	ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients VALUES "
-				      "(?, strftime('%s', 'now'));", -1,
+					"(?, strftime('%s', 'now'), 0);", -1,
 					&stmt, NULL);
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 6/7] nfsdcltrack: grab the NFSDCLTRACK_RECLAIM_COMPLETE env var if it's present
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
                   ` (4 preceding siblings ...)
  2014-09-08 16:30 ` [PATCH v3 5/7] nfsdcltrack: update schema to v2 Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  2014-09-08 16:30 ` [PATCH v3 7/7] nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment Jeff Layton
  6 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

...and set the reclaim_complete field in the DB based on whether it's
true or not. If it's true, set it to the current time. If it's not, then
set it to zero.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/nfsdcltrack.c | 21 +++++++++++++++++++--
 utils/nfsdcltrack/sqlite.c      | 16 ++++++++++++----
 utils/nfsdcltrack/sqlite.h      |  3 ++-
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/utils/nfsdcltrack/nfsdcltrack.c b/utils/nfsdcltrack/nfsdcltrack.c
index f2813610be73..873b1fae056d 100644
--- a/utils/nfsdcltrack/nfsdcltrack.c
+++ b/utils/nfsdcltrack/nfsdcltrack.c
@@ -37,6 +37,7 @@
 #include <libgen.h>
 #include <sys/inotify.h>
 #include <dirent.h>
+#include <limits.h>
 #ifdef HAVE_SYS_CAPABILITY_H
 #include <sys/prctl.h>
 #include <sys/capability.h>
@@ -254,6 +255,22 @@ cltrack_init(const char __attribute__((unused)) *unused)
 	return ret;
 }
 
+/*
+ * Fetch the contents of the NFSDCLTRACK_RECLAIM_COMPLETE env var. If
+ * it's set and the first character is 'Y' then return true. Otherwise
+ * return false.
+ */
+static bool
+cltrack_get_reclaim_complete(void)
+{
+	char *reclaim_complete = getenv("NFSDCLTRACK_RECLAIM_COMPLETE");
+
+	if (reclaim_complete && *reclaim_complete == 'Y')
+		return true;
+
+	return false;
+}
+
 static int
 cltrack_create(const char *id)
 {
@@ -270,7 +287,7 @@ cltrack_create(const char *id)
 	if (len < 0)
 		return (int)len;
 
-	ret = sqlite_insert_client(blob, len);
+	ret = sqlite_insert_client(blob, len, cltrack_get_reclaim_complete());
 
 	return ret ? -EREMOTEIO : ret;
 }
@@ -323,7 +340,7 @@ cltrack_check_legacy(const unsigned char *blob, const ssize_t len)
 	}
 
 	/* Dir exists, try to insert record into db */
-	ret = sqlite_insert_client(blob, len);
+	ret = sqlite_insert_client(blob, len, 0);
 	if (ret) {
 		xlog(D_GENERAL, "Failed to insert client: %d", ret);
 		return -EREMOTEIO;
diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index e260e81b1722..e2a860f4b924 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -369,14 +369,22 @@ out_close:
  * Returns a non-zero sqlite error code, or SQLITE_OK (aka 0)
  */
 int
-sqlite_insert_client(const unsigned char *clname, const size_t namelen)
+sqlite_insert_client(const unsigned char *clname, const size_t namelen,
+			const bool reclaim_complete)
 {
 	int ret;
 	sqlite3_stmt *stmt = NULL;
 
-	ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients VALUES "
-					"(?, strftime('%s', 'now'), 0);", -1,
-					&stmt, NULL);
+	if (reclaim_complete)
+		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
+				"VALUES (?, strftime('%s', 'now'), "
+				"strftime('%s', 'now'));", -1,
+				&stmt, NULL);
+	else
+		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
+				"VALUES (?, strftime('%s', 'now'), 0);", -1,
+				&stmt, NULL);
+
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
 			__func__, sqlite3_errmsg(dbh));
diff --git a/utils/nfsdcltrack/sqlite.h b/utils/nfsdcltrack/sqlite.h
index 3713bfb66e2f..9d7357bcf832 100644
--- a/utils/nfsdcltrack/sqlite.h
+++ b/utils/nfsdcltrack/sqlite.h
@@ -21,7 +21,8 @@
 #define _SQLITE_H_
 
 int sqlite_prepare_dbh(const char *topdir);
-int sqlite_insert_client(const unsigned char *clname, const size_t namelen);
+int sqlite_insert_client(const unsigned char *clname, const size_t namelen,
+				const bool reclaim_complete);
 int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
 int sqlite_check_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_unreclaimed(const time_t grace_start);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 7/7] nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment
  2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
                   ` (5 preceding siblings ...)
  2014-09-08 16:30 ` [PATCH v3 6/7] nfsdcltrack: grab the NFSDCLTRACK_RECLAIM_COMPLETE env var if it's present Jeff Layton
@ 2014-09-08 16:30 ` Jeff Layton
  6 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-08 16:30 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs, bfields

Allow the fetching of NFSDCLTRACK_GRACE_START out of environment
variables. If it's present in the "create" or "init" upcalls, then we
can use that to query the database to see whether there are any clients
that have not issued a RECLAIM_COMPLETE since that time.

If there aren't any, then we know that all reclaim activity is now done
and we can then cue the kernel to lift the grace period.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 utils/nfsdcltrack/nfsdcltrack.c | 72 ++++++++++++++++++++++++++++++++++++++++-
 utils/nfsdcltrack/sqlite.c      | 40 +++++++++++++++++++++++
 utils/nfsdcltrack/sqlite.h      |  1 +
 3 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/utils/nfsdcltrack/nfsdcltrack.c b/utils/nfsdcltrack/nfsdcltrack.c
index 873b1fae056d..7d20f7d27d1a 100644
--- a/utils/nfsdcltrack/nfsdcltrack.c
+++ b/utils/nfsdcltrack/nfsdcltrack.c
@@ -50,6 +50,8 @@
 #define CLD_DEFAULT_STORAGEDIR NFS_STATEDIR "/nfsdcltrack"
 #endif
 
+#define NFSD_END_GRACE_FILE "/proc/fs/nfsd/v4_end_grace"
+
 /* defined by RFC 3530 */
 #define NFS4_OPAQUE_LIMIT	1024
 
@@ -211,6 +213,64 @@ cltrack_set_caps(void)
 	return ret;
 }
 
+/* Inform the kernel that it's OK to lift nfsd's grace period */
+static void
+cltrack_lift_grace_period(void)
+{
+	int fd;
+
+	fd = open(NFSD_END_GRACE_FILE, O_WRONLY);
+	if (fd < 0) {
+		/* Don't warn if file isn't present */
+		if (errno != ENOENT)
+			xlog(L_WARNING, "Unable to open %s: %m",
+				NFSD_END_GRACE_FILE);
+		return;
+	}
+
+	if (write(fd, "Y", 1) < 0)
+		xlog(L_WARNING, "Unable to write to %s: %m",
+				NFSD_END_GRACE_FILE);
+
+	close(fd);
+	return;
+}
+
+/*
+ * Fetch the contents of the NFSDCLTRACK_GRACE_START env var. If it's not set
+ * or there's an error converting it to time_t, then return LONG_MAX.
+ */
+static time_t
+cltrack_get_grace_start(void)
+{
+	time_t grace_start;
+	char *end;
+	char *grace_start_str = getenv("NFSDCLTRACK_GRACE_START");
+
+	if (!grace_start_str)
+		return LONG_MAX;
+
+	errno = 0;
+	grace_start = strtol(grace_start_str, &end, 0);
+	/* Problem converting or value is too large? */
+	if (errno)
+		return LONG_MAX;
+
+	return grace_start;
+}
+
+static bool
+cltrack_reclaims_complete(void)
+{
+	time_t grace_start = cltrack_get_grace_start();
+
+	/* Don't query DB if we didn't get a valid boot time */
+	if (grace_start == LONG_MAX)
+		return false;
+
+	return !sqlite_query_reclaiming(grace_start);
+}
+
 static int
 cltrack_init(const char __attribute__((unused)) *unused)
 {
@@ -251,7 +311,11 @@ cltrack_init(const char __attribute__((unused)) *unused)
 		 * stop upcalling until the problem is resolved.
 		 */
 		ret = -EACCES;
+	} else {
+		if (cltrack_reclaims_complete())
+			cltrack_lift_grace_period();
 	}
+
 	return ret;
 }
 
@@ -276,6 +340,7 @@ cltrack_create(const char *id)
 {
 	int ret;
 	ssize_t len;
+	bool reclaim_complete;
 
 	xlog(D_GENERAL, "%s: create client record.", __func__);
 
@@ -287,7 +352,12 @@ cltrack_create(const char *id)
 	if (len < 0)
 		return (int)len;
 
-	ret = sqlite_insert_client(blob, len, cltrack_get_reclaim_complete());
+	reclaim_complete = cltrack_get_reclaim_complete();
+
+	ret = sqlite_insert_client(blob, len, reclaim_complete);
+
+	if (!ret && reclaim_complete && cltrack_reclaims_complete())
+		cltrack_lift_grace_period();
 
 	return ret ? -EREMOTEIO : ret;
 }
diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
index e2a860f4b924..616a5e59d528 100644
--- a/utils/nfsdcltrack/sqlite.c
+++ b/utils/nfsdcltrack/sqlite.c
@@ -547,3 +547,43 @@ sqlite_remove_unreclaimed(time_t grace_start)
 	sqlite3_free(err);
 	return ret;
 }
+
+/*
+ * Are there any clients that are possibly still reclaiming? Return a positive
+ * integer (usually number of clients) if so. If not, then return 0. On any
+ * error, return non-zero.
+ */
+int
+sqlite_query_reclaiming(const time_t grace_start)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
+				      "reclaim_complete < ?", -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: unable to prepare select statement: %s",
+				__func__, sqlite3_errstr(ret));
+		return ret;
+	}
+
+	ret = sqlite3_bind_int64(stmt, 1, (sqlite3_int64)grace_start);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind int64 failed: %s",
+				__func__, sqlite3_errstr(ret));
+		return ret;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(L_ERROR, "%s: unexpected return code from select: %s",
+				__func__, sqlite3_errstr(ret));
+		return ret;
+	}
+
+	ret = sqlite3_column_int(stmt, 0);
+	sqlite3_finalize(stmt);
+	xlog(D_GENERAL, "%s: there are %d clients that have not completed "
+			"reclaim", __func__, ret);
+	return ret;
+}
diff --git a/utils/nfsdcltrack/sqlite.h b/utils/nfsdcltrack/sqlite.h
index 9d7357bcf832..32b92d799b4a 100644
--- a/utils/nfsdcltrack/sqlite.h
+++ b/utils/nfsdcltrack/sqlite.h
@@ -26,5 +26,6 @@ int sqlite_insert_client(const unsigned char *clname, const size_t namelen,
 int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
 int sqlite_check_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_unreclaimed(const time_t grace_start);
+int sqlite_query_reclaiming(const time_t grace_start);
 
 #endif /* _SQLITE_H */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c
  2014-09-08 16:30 ` [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c Jeff Layton
@ 2014-09-11 15:53   ` J. Bruce Fields
  0 siblings, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2014-09-11 15:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: steved, linux-nfs

On Mon, Sep 08, 2014 at 12:30:16PM -0400, Jeff Layton wrote:
> Clean up and fix some inaccuracies.

By the way, I also notice the man page refers to it as a daemon.

--b.

> 
> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
> ---
>  utils/nfsdcltrack/sqlite.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
> index bac678980938..352f029d55c0 100644
> --- a/utils/nfsdcltrack/sqlite.c
> +++ b/utils/nfsdcltrack/sqlite.c
> @@ -21,17 +21,15 @@
>   * Explanation:
>   *
>   * This file contains the code to manage the sqlite backend database for the
> - * clstated upcall daemon.
> + * nfsdcltrack usermodehelper upcall program.
>   *
>   * The main database is called main.sqlite and contains the following tables:
>   *
>   * parameters: simple key/value pairs for storing database info
>   *
> - * clients: one column containing a BLOB with the as sent by the client
> - * 	    and a timestamp (in epoch seconds) of when the record was
> - * 	    established
> - *
> - * FIXME: should we also record the fsid being accessed?
> + * clients: an "id" column containing a BLOB with the long-form clientid as
> + * 	    sent by the client, a "time" column containing a timestamp (in
> + * 	    epoch seconds) of when the record was last updated.
>   */
>  
>  #ifdef HAVE_CONFIG_H
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-08 16:30 ` [PATCH v3 5/7] nfsdcltrack: update schema to v2 Jeff Layton
@ 2014-09-11 19:55   ` J. Bruce Fields
  2014-09-11 20:28     ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2014-09-11 19:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: steved, linux-nfs

On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
> From: Jeff Layton <jlayton@poochiereds.net>
> 
> In order to allow knfsd's lock manager to lift its grace period early,
> we need to figure out whether all clients have finished reclaiming
> their state not. Unfortunately, the current code doesn't allow us to
> ascertain this. All we track for each client is a timestamp that tells
> us when the last "check" or "create" operation came in.
> 
> We need to track the two timestamps separately. Add a new
> "reclaim_complete" column to the database that tells us when the last
> "create" operation came in. For now, we just insert "0" in that column
> but a later patch will make it so that we insert a real timestamp for
> v4.1+ client records.

If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
be counting a 4.1 client as allowed to reclaim on the next boot until we
get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
reclaim if all we got the previous boot was a reclaim open (a "check"
operation).

--b.

> 
> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
> ---
>  utils/nfsdcltrack/sqlite.c | 102 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 94 insertions(+), 8 deletions(-)
> 
> diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
> index dde2f7ff8621..e260e81b1722 100644
> --- a/utils/nfsdcltrack/sqlite.c
> +++ b/utils/nfsdcltrack/sqlite.c
> @@ -29,7 +29,10 @@
>   *
>   * clients: an "id" column containing a BLOB with the long-form clientid as
>   * 	    sent by the client, a "time" column containing a timestamp (in
> - * 	    epoch seconds) of when the record was last updated.
> + * 	    epoch seconds) of when the record was last updated, and a
> + * 	    "reclaim_complete" column containing a timestamp (in epoch seconds)
> + * 	    of when the last "create" operation came in for v4.1+ clients.
> + * 	    v4.0 clients should always have this set to 0.
>   */
>  
>  #ifdef HAVE_CONFIG_H
> @@ -50,7 +53,7 @@
>  
>  #include "xlog.h"
>  
> -#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 1
> +#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
>  
>  /* in milliseconds */
>  #define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
> @@ -120,6 +123,81 @@ out:
>  	return ret;
>  }
>  
> +static int
> +sqlite_maindb_update_v1_to_v2(void)
> +{
> +	int ret, ret2;
> +	char *err;
> +
> +	/* begin transaction */
> +	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
> +				&err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to begin transaction: %s", err);
> +		goto rollback;
> +	}
> +
> +	/*
> +	 * Check schema version again. This time, under an exclusive
> +	 * transaction to guard against racing DB setup attempts
> +	 */
> +	ret = sqlite_query_schema_version();
> +	switch (ret) {
> +	case 1:
> +		/* Still at v1 -- do conversion */
> +		break;
> +	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
> +		/* Someone else raced in and set it up */
> +		ret = 0;
> +		goto rollback;
> +	default:
> +		/* Something went wrong -- fail! */
> +		ret = -EINVAL;
> +		goto rollback;
> +	}
> +
> +	/* create v2 clients table */
> +	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
> +				"reclaim_complete INTEGER;",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to update clients table: %s", err);
> +		goto rollback;
> +	}
> +
> +	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET value = %d "
> +			"WHERE key = \"version\";",
> +			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		goto rollback;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		ret = -EINVAL;
> +		goto rollback;
> +	}
> +
> +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to update schema version: %s", err);
> +		goto rollback;
> +	}
> +
> +	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to commit transaction: %s", err);
> +		goto rollback;
> +	}
> +out:
> +	sqlite3_free(err);
> +	return ret;
> +rollback:
> +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
> +	if (ret2 != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
> +	goto out;
> +}
> +
>  /*
>   * Start an exclusive transaction and recheck the DB schema version. If it's
>   * still zero (indicating a new database) then set it up. If that all works,
> @@ -127,9 +205,9 @@ out:
>   * transaction. On any error, rollback the transaction.
>   */
>  int
> -sqlite_maindb_init_v1(void)
> +sqlite_maindb_init_v2(void)
>  {
> -	int ret;
> +	int ret, ret2;
>  	char *err = NULL;
>  
>  	/* Start a transaction */
> @@ -169,7 +247,7 @@ sqlite_maindb_init_v1(void)
>  
>  	/* create the "clients" table */
>  	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
> -				"time INTEGER);",
> +				"time INTEGER, reclaim_complete INTEGER);",
>  				NULL, NULL, &err);
>  	if (ret != SQLITE_OK) {
>  		xlog(L_ERROR, "Unable to create clients table: %s", err);
> @@ -207,7 +285,9 @@ out:
>  
>  rollback:
>  	/* Attempt to rollback the transaction */
> -	sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
> +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
> +	if (ret2 != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
>  	goto out;
>  }
>  
> @@ -255,9 +335,15 @@ sqlite_prepare_dbh(const char *topdir)
>  		/* DB is already set up. Do nothing */
>  		ret = 0;
>  		break;
> +	case 1:
> +		/* Old DB -- update to new schema */
> +		ret = sqlite_maindb_update_v1_to_v2();
> +		if (ret)
> +			goto out_close;
> +		break;
>  	case 0:
>  		/* Query failed -- try to set up new DB */
> -		ret = sqlite_maindb_init_v1();
> +		ret = sqlite_maindb_init_v2();
>  		if (ret)
>  			goto out_close;
>  		break;
> @@ -289,7 +375,7 @@ sqlite_insert_client(const unsigned char *clname, const size_t namelen)
>  	sqlite3_stmt *stmt = NULL;
>  
>  	ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients VALUES "
> -				      "(?, strftime('%s', 'now'));", -1,
> +					"(?, strftime('%s', 'now'), 0);", -1,
>  					&stmt, NULL);
>  	if (ret != SQLITE_OK) {
>  		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-11 19:55   ` J. Bruce Fields
@ 2014-09-11 20:28     ` Jeff Layton
  2014-09-12 13:36       ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2014-09-11 20:28 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: steved, linux-nfs

On Thu, 11 Sep 2014 15:55:47 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@poochiereds.net>
> > 
> > In order to allow knfsd's lock manager to lift its grace period early,
> > we need to figure out whether all clients have finished reclaiming
> > their state not. Unfortunately, the current code doesn't allow us to
> > ascertain this. All we track for each client is a timestamp that tells
> > us when the last "check" or "create" operation came in.
> > 
> > We need to track the two timestamps separately. Add a new
> > "reclaim_complete" column to the database that tells us when the last
> > "create" operation came in. For now, we just insert "0" in that column
> > but a later patch will make it so that we insert a real timestamp for
> > v4.1+ client records.
> 
> If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
> be counting a 4.1 client as allowed to reclaim on the next boot until we
> get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
> reclaim if all we got the previous boot was a reclaim open (a "check"
> operation).
> 
> --b.
> 

Yeah, I guess so, with a bit of a clarification I think...

We don't want to allow a v4.1 client to reclaim if it didn't send a
RECLAIM_COMPLETE prior to the last reboot *and* the grace period ended
prior to the last reboot.

IOW, in the case where the reboot occurs before the grace period ends,
we don't want to clean out the and deny reclaims. FWIW, the legacy
client tracker got this very wrong -- if you did a couple of rapid
reboots in succession you couldn't reclaim once everything was back up.

I'll have to ponder how best to fix that. Given that the logic required
is quite different between v4.0 and v4.1 clients, we may have to add yet
another column to the DB to track what sort of client this is.

> > 
> > Signed-off-by: Jeff Layton <jlayton@primarydata.com>
> > ---
> >  utils/nfsdcltrack/sqlite.c | 102
> > +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 94
> > insertions(+), 8 deletions(-)
> > 
> > diff --git a/utils/nfsdcltrack/sqlite.c b/utils/nfsdcltrack/sqlite.c
> > index dde2f7ff8621..e260e81b1722 100644
> > --- a/utils/nfsdcltrack/sqlite.c
> > +++ b/utils/nfsdcltrack/sqlite.c
> > @@ -29,7 +29,10 @@
> >   *
> >   * clients: an "id" column containing a BLOB with the long-form
> > clientid as
> >   * 	    sent by the client, a "time" column containing a
> > timestamp (in
> > - * 	    epoch seconds) of when the record was last updated.
> > + * 	    epoch seconds) of when the record was last updated,
> > and a
> > + * 	    "reclaim_complete" column containing a timestamp
> > (in epoch seconds)
> > + * 	    of when the last "create" operation came in for
> > v4.1+ clients.
> > + * 	    v4.0 clients should always have this set to 0.
> >   */
> >  
> >  #ifdef HAVE_CONFIG_H
> > @@ -50,7 +53,7 @@
> >  
> >  #include "xlog.h"
> >  
> > -#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 1
> > +#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
> >  
> >  /* in milliseconds */
> >  #define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
> > @@ -120,6 +123,81 @@ out:
> >  	return ret;
> >  }
> >  
> > +static int
> > +sqlite_maindb_update_v1_to_v2(void)
> > +{
> > +	int ret, ret2;
> > +	char *err;
> > +
> > +	/* begin transaction */
> > +	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;",
> > NULL, NULL,
> > +				&err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to begin transaction: %s",
> > err);
> > +		goto rollback;
> > +	}
> > +
> > +	/*
> > +	 * Check schema version again. This time, under an
> > exclusive
> > +	 * transaction to guard against racing DB setup attempts
> > +	 */
> > +	ret = sqlite_query_schema_version();
> > +	switch (ret) {
> > +	case 1:
> > +		/* Still at v1 -- do conversion */
> > +		break;
> > +	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
> > +		/* Someone else raced in and set it up */
> > +		ret = 0;
> > +		goto rollback;
> > +	default:
> > +		/* Something went wrong -- fail! */
> > +		ret = -EINVAL;
> > +		goto rollback;
> > +	}
> > +
> > +	/* create v2 clients table */
> > +	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
> > +				"reclaim_complete INTEGER;",
> > +				NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to update clients table:
> > %s", err);
> > +		goto rollback;
> > +	}
> > +
> > +	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET
> > value = %d "
> > +			"WHERE key = \"version\";",
> > +			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
> > +	if (ret < 0) {
> > +		xlog(L_ERROR, "sprintf failed!");
> > +		goto rollback;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		xlog(L_ERROR, "sprintf output too long! (%d
> > chars)", ret);
> > +		ret = -EINVAL;
> > +		goto rollback;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL,
> > &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to update schema version:
> > %s", err);
> > +		goto rollback;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL,
> > &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to commit transaction: %s",
> > err);
> > +		goto rollback;
> > +	}
> > +out:
> > +	sqlite3_free(err);
> > +	return ret;
> > +rollback:
> > +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL,
> > NULL, &err);
> > +	if (ret2 != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to rollback transaction:
> > %s", err);
> > +	goto out;
> > +}
> > +
> >  /*
> >   * Start an exclusive transaction and recheck the DB schema
> > version. If it's
> >   * still zero (indicating a new database) then set it up. If that
> > all works, @@ -127,9 +205,9 @@ out:
> >   * transaction. On any error, rollback the transaction.
> >   */
> >  int
> > -sqlite_maindb_init_v1(void)
> > +sqlite_maindb_init_v2(void)
> >  {
> > -	int ret;
> > +	int ret, ret2;
> >  	char *err = NULL;
> >  
> >  	/* Start a transaction */
> > @@ -169,7 +247,7 @@ sqlite_maindb_init_v1(void)
> >  
> >  	/* create the "clients" table */
> >  	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB
> > PRIMARY KEY, "
> > -				"time INTEGER);",
> > +				"time INTEGER, reclaim_complete
> > INTEGER);", NULL, NULL, &err);
> >  	if (ret != SQLITE_OK) {
> >  		xlog(L_ERROR, "Unable to create clients table:
> > %s", err); @@ -207,7 +285,9 @@ out:
> >  
> >  rollback:
> >  	/* Attempt to rollback the transaction */
> > -	sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL,
> > &err);
> > +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL,
> > NULL, &err);
> > +	if (ret2 != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to rollback transaction:
> > %s", err); goto out;
> >  }
> >  
> > @@ -255,9 +335,15 @@ sqlite_prepare_dbh(const char *topdir)
> >  		/* DB is already set up. Do nothing */
> >  		ret = 0;
> >  		break;
> > +	case 1:
> > +		/* Old DB -- update to new schema */
> > +		ret = sqlite_maindb_update_v1_to_v2();
> > +		if (ret)
> > +			goto out_close;
> > +		break;
> >  	case 0:
> >  		/* Query failed -- try to set up new DB */
> > -		ret = sqlite_maindb_init_v1();
> > +		ret = sqlite_maindb_init_v2();
> >  		if (ret)
> >  			goto out_close;
> >  		break;
> > @@ -289,7 +375,7 @@ sqlite_insert_client(const unsigned char
> > *clname, const size_t namelen) sqlite3_stmt *stmt = NULL;
> >  
> >  	ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO
> > clients VALUES "
> > -				      "(?, strftime('%s',
> > 'now'));", -1,
> > +					"(?, strftime('%s',
> > 'now'), 0);", -1, &stmt, NULL);
> >  	if (ret != SQLITE_OK) {
> >  		xlog(L_ERROR, "%s: insert statement prepare
> > failed: %s", -- 
> > 1.9.3
> > 


-- 
Jeff Layton <jlayton@primarydata.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-11 20:28     ` Jeff Layton
@ 2014-09-12 13:36       ` Jeff Layton
  2014-09-12 14:21         ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2014-09-12 13:36 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: steved, linux-nfs

On Thu, 11 Sep 2014 16:28:36 -0400
Jeff Layton <jeff.layton@primarydata.com> wrote:

> On Thu, 11 Sep 2014 15:55:47 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
> > > From: Jeff Layton <jlayton@poochiereds.net>
> > > 
> > > In order to allow knfsd's lock manager to lift its grace period early,
> > > we need to figure out whether all clients have finished reclaiming
> > > their state not. Unfortunately, the current code doesn't allow us to
> > > ascertain this. All we track for each client is a timestamp that tells
> > > us when the last "check" or "create" operation came in.
> > > 
> > > We need to track the two timestamps separately. Add a new
> > > "reclaim_complete" column to the database that tells us when the last
> > > "create" operation came in. For now, we just insert "0" in that column
> > > but a later patch will make it so that we insert a real timestamp for
> > > v4.1+ client records.
> > 
> > If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
> > be counting a 4.1 client as allowed to reclaim on the next boot until we
> > get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
> > reclaim if all we got the previous boot was a reclaim open (a "check"
> > operation).
> > 
> > --b.
> > 
> 
> Yeah, I guess so, with a bit of a clarification I think...
> 
> We don't want to allow a v4.1 client to reclaim if it didn't send a
> RECLAIM_COMPLETE prior to the last reboot *and* the grace period ended
> prior to the last reboot.
> 
> IOW, in the case where the reboot occurs before the grace period ends,
> we don't want to clean out the and deny reclaims. FWIW, the legacy
> client tracker got this very wrong -- if you did a couple of rapid
> reboots in succession you couldn't reclaim once everything was back up.
> 
> I'll have to ponder how best to fix that. Given that the logic required
> is quite different between v4.0 and v4.1 clients, we may have to add yet
> another column to the DB to track what sort of client this is.
> 

This new requirement complicates things quite a bit. I'll have to
respin both patchsets.

I think we can fix this by ensuring that we clean out any v4.1+ clients
that have not done a "create" since the start of the grace period
during a "grace_done" upcall. For v4.0 clients, we can't do that of
course since a v4.0 client may reclaim opens but never do a new one
(and so may never send a "create" at all).

That means that we'll need also to send something in the "check" upcall
that indicates the client's minorversion. The good news is that we
won't need a new column in the DB since the only timestamp that matters
for v4.1+ clients is the "create" time. We can just avoid setting the
time field for v4.1+ clients on the "check" upcall.

Now that we need to send info about the minorversion in a "check", I
may go back to sending an actual minorversion in the upcall's
environment vars. It doesn't make sense to me to send a boolean about
RECLAIM_COMPLETE when the client hasn't actually sent one.

I'll get started on reworking this but I have no idea on an ETA just
yet. Hopefully I can have something that works by next week sometime.

-- 
Jeff Layton <jlayton@primarydata.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 13:36       ` Jeff Layton
@ 2014-09-12 14:21         ` Jeff Layton
  2014-09-12 14:36           ` J. Bruce Fields
  2014-09-12 15:42           ` Trond Myklebust
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-12 14:21 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: steved, linux-nfs

On Fri, 12 Sep 2014 09:36:00 -0400
Jeff Layton <jlayton@primarydata.com> wrote:

> On Thu, 11 Sep 2014 16:28:36 -0400
> Jeff Layton <jeff.layton@primarydata.com> wrote:
> 
> > On Thu, 11 Sep 2014 15:55:47 -0400
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
> > > > From: Jeff Layton <jlayton@poochiereds.net>
> > > > 
> > > > In order to allow knfsd's lock manager to lift its grace period early,
> > > > we need to figure out whether all clients have finished reclaiming
> > > > their state not. Unfortunately, the current code doesn't allow us to
> > > > ascertain this. All we track for each client is a timestamp that tells
> > > > us when the last "check" or "create" operation came in.
> > > > 
> > > > We need to track the two timestamps separately. Add a new
> > > > "reclaim_complete" column to the database that tells us when the last
> > > > "create" operation came in. For now, we just insert "0" in that column
> > > > but a later patch will make it so that we insert a real timestamp for
> > > > v4.1+ client records.
> > > 
> > > If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
> > > be counting a 4.1 client as allowed to reclaim on the next boot until we
> > > get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
> > > reclaim if all we got the previous boot was a reclaim open (a "check"
> > > operation).
> > > 
> > > --b.
> > > 
> > 
> > Yeah, I guess so, with a bit of a clarification I think...
> > 
> > We don't want to allow a v4.1 client to reclaim if it didn't send a
> > RECLAIM_COMPLETE prior to the last reboot *and* the grace period ended
> > prior to the last reboot.
> > 
> > IOW, in the case where the reboot occurs before the grace period ends,
> > we don't want to clean out the and deny reclaims. FWIW, the legacy
> > client tracker got this very wrong -- if you did a couple of rapid
> > reboots in succession you couldn't reclaim once everything was back up.
> > 
> > I'll have to ponder how best to fix that. Given that the logic required
> > is quite different between v4.0 and v4.1 clients, we may have to add yet
> > another column to the DB to track what sort of client this is.
> > 
> 
> This new requirement complicates things quite a bit. I'll have to
> respin both patchsets.
> 
> I think we can fix this by ensuring that we clean out any v4.1+ clients
> that have not done a "create" since the start of the grace period
> during a "grace_done" upcall. For v4.0 clients, we can't do that of
> course since a v4.0 client may reclaim opens but never do a new one
> (and so may never send a "create" at all).
> 
> That means that we'll need also to send something in the "check" upcall
> that indicates the client's minorversion. The good news is that we
> won't need a new column in the DB since the only timestamp that matters
> for v4.1+ clients is the "create" time. We can just avoid setting the
> time field for v4.1+ clients on the "check" upcall.
> 
> Now that we need to send info about the minorversion in a "check", I
> may go back to sending an actual minorversion in the upcall's
> environment vars. It doesn't make sense to me to send a boolean about
> RECLAIM_COMPLETE when the client hasn't actually sent one.
> 
> I'll get started on reworking this but I have no idea on an ETA just
> yet. Hopefully I can have something that works by next week sometime.
>  

This is actually a much larger can of worms than it originally looks.
Consider this:

Server reboots and v4.1+ client reclaims a few records but never sends
a RECLAIM_COMPLETE (client bug or maybe some bad timing?). Grace period
eventually ends, and its record is purged from the DB.

Now we have a client that has reclaimed some files but that has no
record on stable storage.

One possibility is to prematurely expire v4.1+ clients that have not
sent a RECLAIM_COMPLETE when the grace period ends.

That seems problematic though -- what about clients that just happen to
do an EXCHANGE_ID just before the grace period is going to end, and
that get expired before they can issue their RECLAIM_COMPLETE. Will
that be a problem for them?

Thoughts?
-- 
Jeff Layton <jlayton@primarydata.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 14:21         ` Jeff Layton
@ 2014-09-12 14:36           ` J. Bruce Fields
  2014-09-12 15:21             ` J. Bruce Fields
  2014-09-12 15:42           ` Trond Myklebust
  1 sibling, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2014-09-12 14:36 UTC (permalink / raw)
  To: Jeff Layton; +Cc: steved, linux-nfs

On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
> On Fri, 12 Sep 2014 09:36:00 -0400
> Jeff Layton <jlayton@primarydata.com> wrote:
> 
> > On Thu, 11 Sep 2014 16:28:36 -0400
> > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > 
> > > On Thu, 11 Sep 2014 15:55:47 -0400
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > 
> > > > On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
> > > > > From: Jeff Layton <jlayton@poochiereds.net>
> > > > > 
> > > > > In order to allow knfsd's lock manager to lift its grace period early,
> > > > > we need to figure out whether all clients have finished reclaiming
> > > > > their state not. Unfortunately, the current code doesn't allow us to
> > > > > ascertain this. All we track for each client is a timestamp that tells
> > > > > us when the last "check" or "create" operation came in.
> > > > > 
> > > > > We need to track the two timestamps separately. Add a new
> > > > > "reclaim_complete" column to the database that tells us when the last
> > > > > "create" operation came in. For now, we just insert "0" in that column
> > > > > but a later patch will make it so that we insert a real timestamp for
> > > > > v4.1+ client records.
> > > > 
> > > > If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
> > > > be counting a 4.1 client as allowed to reclaim on the next boot until we
> > > > get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
> > > > reclaim if all we got the previous boot was a reclaim open (a "check"
> > > > operation).
> > > > 
> > > > --b.
> > > > 
> > > 
> > > Yeah, I guess so, with a bit of a clarification I think...
> > > 
> > > We don't want to allow a v4.1 client to reclaim if it didn't send a
> > > RECLAIM_COMPLETE prior to the last reboot *and* the grace period ended
> > > prior to the last reboot.
> > > 
> > > IOW, in the case where the reboot occurs before the grace period ends,
> > > we don't want to clean out the and deny reclaims. FWIW, the legacy
> > > client tracker got this very wrong -- if you did a couple of rapid
> > > reboots in succession you couldn't reclaim once everything was back up.
> > > 
> > > I'll have to ponder how best to fix that. Given that the logic required
> > > is quite different between v4.0 and v4.1 clients, we may have to add yet
> > > another column to the DB to track what sort of client this is.
> > > 
> > 
> > This new requirement complicates things quite a bit. I'll have to
> > respin both patchsets.
> > 
> > I think we can fix this by ensuring that we clean out any v4.1+ clients
> > that have not done a "create" since the start of the grace period
> > during a "grace_done" upcall. For v4.0 clients, we can't do that of
> > course since a v4.0 client may reclaim opens but never do a new one
> > (and so may never send a "create" at all).
> > 
> > That means that we'll need also to send something in the "check" upcall
> > that indicates the client's minorversion. The good news is that we
> > won't need a new column in the DB since the only timestamp that matters
> > for v4.1+ clients is the "create" time. We can just avoid setting the
> > time field for v4.1+ clients on the "check" upcall.
> > 
> > Now that we need to send info about the minorversion in a "check", I
> > may go back to sending an actual minorversion in the upcall's
> > environment vars. It doesn't make sense to me to send a boolean about
> > RECLAIM_COMPLETE when the client hasn't actually sent one.
> > 
> > I'll get started on reworking this but I have no idea on an ETA just
> > yet. Hopefully I can have something that works by next week sometime.
> >  
> 
> This is actually a much larger can of worms than it originally looks.
> Consider this:
> 
> Server reboots and v4.1+ client reclaims a few records but never sends
> a RECLAIM_COMPLETE (client bug or maybe some bad timing?).

The client must send a RECLAIM_COMPLETE.  It's not permitted to do any
regular opens, for example, till it does.

So either the client is buggy (too bad), or it's lost touch with the
server (in which case it will eventually expire normally).

I don't see any work to do here.

> Grace period
> eventually ends, and its record is purged from the DB.
> 
> Now we have a client that has reclaimed some files but that has no
> record on stable storage.
> 
> One possibility is to prematurely expire v4.1+ clients that have not
> sent a RECLAIM_COMPLETE when the grace period ends.
> 
> That seems problematic though -- what about clients that just happen to
> do an EXCHANGE_ID just before the grace period is going to end, and
> that get expired before they can issue their RECLAIM_COMPLETE. Will
> that be a problem for them?

In that case a client will send a reclaim, get back a NO_GRACE error,
mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
and continue normally.  (To the extent it can--signalling affected
processes or EIOing further attempts to use the unreclaimed state, or
whatever.)

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 14:36           ` J. Bruce Fields
@ 2014-09-12 15:21             ` J. Bruce Fields
  2014-09-12 15:54               ` Trond Myklebust
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2014-09-12 15:21 UTC (permalink / raw)
  To: Jeff Layton; +Cc: steved, linux-nfs

On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
> > Grace period
> > eventually ends, and its record is purged from the DB.
> > 
> > Now we have a client that has reclaimed some files but that has no
> > record on stable storage.
> > 
> > One possibility is to prematurely expire v4.1+ clients that have not
> > sent a RECLAIM_COMPLETE when the grace period ends.
> > 
> > That seems problematic though -- what about clients that just happen to
> > do an EXCHANGE_ID just before the grace period is going to end, and
> > that get expired before they can issue their RECLAIM_COMPLETE. Will
> > that be a problem for them?
> 
> In that case a client will send a reclaim, get back a NO_GRACE error,
> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
> and continue normally.  (To the extent it can--signalling affected
> processes or EIOing further attempts to use the unreclaimed state, or
> whatever.)

The one thing the server *could* do in this sort of case is extend the
grace period by a little--I seem to recall the spec giving some leeway
for this kind of thing.

So for example the server could have a heuristics like: extend the grace
period by another second each time we notice there's been an EXCHANGE_ID
or reclaim in the previous second, up to some maximum.  And I suppose it
could also delay the grace period until someone actually attempts a
non-reclaim open.

In isolation a single client slipping in the end like that sounds like a
freak event, but if there's a ton of state to reclaim perhaps it could
become more likely.

I don't think that's a priority, we might just want to make sure we know
how to do that in the future.

But now that I think about it I don't see the existing or proposed
nfsdcltrack stuff tying our hands in any way here.  It just gives the
kernel some extra information, and the kernel still has discretion about
when exactly it wants to end the grace period.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 14:21         ` Jeff Layton
  2014-09-12 14:36           ` J. Bruce Fields
@ 2014-09-12 15:42           ` Trond Myklebust
  1 sibling, 0 replies; 21+ messages in thread
From: Trond Myklebust @ 2014-09-12 15:42 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J. Bruce Fields, Steve Dickson, linux-nfs

On Fri, Sep 12, 2014 at 10:21 AM, Jeff Layton
<jeff.layton@primarydata.com> wrote:
> On Fri, 12 Sep 2014 09:36:00 -0400
> Jeff Layton <jlayton@primarydata.com> wrote:
>
>> On Thu, 11 Sep 2014 16:28:36 -0400
>> Jeff Layton <jeff.layton@primarydata.com> wrote:
>>
>> > On Thu, 11 Sep 2014 15:55:47 -0400
>> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
>> >
>> > > On Mon, Sep 08, 2014 at 12:30:19PM -0400, Jeff Layton wrote:
>> > > > From: Jeff Layton <jlayton@poochiereds.net>
>> > > >
>> > > > In order to allow knfsd's lock manager to lift its grace period early,
>> > > > we need to figure out whether all clients have finished reclaiming
>> > > > their state not. Unfortunately, the current code doesn't allow us to
>> > > > ascertain this. All we track for each client is a timestamp that tells
>> > > > us when the last "check" or "create" operation came in.
>> > > >
>> > > > We need to track the two timestamps separately. Add a new
>> > > > "reclaim_complete" column to the database that tells us when the last
>> > > > "create" operation came in. For now, we just insert "0" in that column
>> > > > but a later patch will make it so that we insert a real timestamp for
>> > > > v4.1+ client records.
>> > >
>> > > If I understand correctly, then nfsdcltrack has a bug here: we shouldn't
>> > > be counting a 4.1 client as allowed to reclaim on the next boot until we
>> > > get the RECLAIM_COMPLETE, but nfsdcltrack is allowing a 4.1 client to
>> > > reclaim if all we got the previous boot was a reclaim open (a "check"
>> > > operation).
>> > >
>> > > --b.
>> > >
>> >
>> > Yeah, I guess so, with a bit of a clarification I think...
>> >
>> > We don't want to allow a v4.1 client to reclaim if it didn't send a
>> > RECLAIM_COMPLETE prior to the last reboot *and* the grace period ended
>> > prior to the last reboot.
>> >
>> > IOW, in the case where the reboot occurs before the grace period ends,
>> > we don't want to clean out the and deny reclaims. FWIW, the legacy
>> > client tracker got this very wrong -- if you did a couple of rapid
>> > reboots in succession you couldn't reclaim once everything was back up.
>> >
>> > I'll have to ponder how best to fix that. Given that the logic required
>> > is quite different between v4.0 and v4.1 clients, we may have to add yet
>> > another column to the DB to track what sort of client this is.
>> >
>>
>> This new requirement complicates things quite a bit. I'll have to
>> respin both patchsets.
>>
>> I think we can fix this by ensuring that we clean out any v4.1+ clients
>> that have not done a "create" since the start of the grace period
>> during a "grace_done" upcall. For v4.0 clients, we can't do that of
>> course since a v4.0 client may reclaim opens but never do a new one
>> (and so may never send a "create" at all).
>>
>> That means that we'll need also to send something in the "check" upcall
>> that indicates the client's minorversion. The good news is that we
>> won't need a new column in the DB since the only timestamp that matters
>> for v4.1+ clients is the "create" time. We can just avoid setting the
>> time field for v4.1+ clients on the "check" upcall.
>>
>> Now that we need to send info about the minorversion in a "check", I
>> may go back to sending an actual minorversion in the upcall's
>> environment vars. It doesn't make sense to me to send a boolean about
>> RECLAIM_COMPLETE when the client hasn't actually sent one.
>>
>> I'll get started on reworking this but I have no idea on an ETA just
>> yet. Hopefully I can have something that works by next week sometime.
>>
>
> This is actually a much larger can of worms than it originally looks.
> Consider this:
>
> Server reboots and v4.1+ client reclaims a few records but never sends
> a RECLAIM_COMPLETE (client bug or maybe some bad timing?). Grace period
> eventually ends, and its record is purged from the DB.
>
> Now we have a client that has reclaimed some files but that has no
> record on stable storage.
>
> One possibility is to prematurely expire v4.1+ clients that have not
> sent a RECLAIM_COMPLETE when the grace period ends.
>
> That seems problematic though -- what about clients that just happen to
> do an EXCHANGE_ID just before the grace period is going to end, and
> that get expired before they can issue their RECLAIM_COMPLETE. Will
> that be a problem for them?
>
> Thoughts?

See RFC5661 section 8.4.3, which describes those edge conditions, and
how to deal with them.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 15:21             ` J. Bruce Fields
@ 2014-09-12 15:54               ` Trond Myklebust
  2014-09-12 16:05                 ` J. Bruce Fields
  2014-09-12 16:07                 ` Jeff Layton
  0 siblings, 2 replies; 21+ messages in thread
From: Trond Myklebust @ 2014-09-12 15:54 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, Steve Dickson, linux-nfs

On Fri, Sep 12, 2014 at 11:21 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
>> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
>> > Grace period
>> > eventually ends, and its record is purged from the DB.
>> >
>> > Now we have a client that has reclaimed some files but that has no
>> > record on stable storage.
>> >
>> > One possibility is to prematurely expire v4.1+ clients that have not
>> > sent a RECLAIM_COMPLETE when the grace period ends.
>> >
>> > That seems problematic though -- what about clients that just happen to
>> > do an EXCHANGE_ID just before the grace period is going to end, and
>> > that get expired before they can issue their RECLAIM_COMPLETE. Will
>> > that be a problem for them?
>>
>> In that case a client will send a reclaim, get back a NO_GRACE error,
>> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
>> and continue normally.  (To the extent it can--signalling affected
>> processes or EIOing further attempts to use the unreclaimed state, or
>> whatever.)
>
> The one thing the server *could* do in this sort of case is extend the
> grace period by a little--I seem to recall the spec giving some leeway
> for this kind of thing.


Section 8.4.2.1.

> So for example the server could have a heuristics like: extend the grace
> period by another second each time we notice there's been an EXCHANGE_ID
> or reclaim in the previous second, up to some maximum.  And I suppose it
> could also delay the grace period until someone actually attempts a
> non-reclaim open.
>
> In isolation a single client slipping in the end like that sounds like a
> freak event, but if there's a ton of state to reclaim perhaps it could
> become more likely.
>
> I don't think that's a priority, we might just want to make sure we know
> how to do that in the future.
>
> But now that I think about it I don't see the existing or proposed
> nfsdcltrack stuff tying our hands in any way here.  It just gives the
> kernel some extra information, and the kernel still has discretion about
> when exactly it wants to end the grace period.
>

It is even allowed to grant reclaim lock attempts after the grace
period has ended _if_ and only if it can guarantee that no conflicting
locks were issued.

However note that the NFSv4.1 client is not actually allowed to issue
non-reclaim lock requests before it has issued a RECLAIM_COMPLETE. I
dunno how religiously we stick to that in Linux (I think we do), but
the point is that the server can and should rely on the client
_always_ sending a RECLAIM_COMPLETE if it is going to establish new
locks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 15:54               ` Trond Myklebust
@ 2014-09-12 16:05                 ` J. Bruce Fields
  2014-09-12 16:07                 ` Jeff Layton
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2014-09-12 16:05 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Layton, Steve Dickson, linux-nfs

On Fri, Sep 12, 2014 at 11:54:17AM -0400, Trond Myklebust wrote:
> On Fri, Sep 12, 2014 at 11:21 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
> >> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
> >> > Grace period
> >> > eventually ends, and its record is purged from the DB.
> >> >
> >> > Now we have a client that has reclaimed some files but that has no
> >> > record on stable storage.
> >> >
> >> > One possibility is to prematurely expire v4.1+ clients that have not
> >> > sent a RECLAIM_COMPLETE when the grace period ends.
> >> >
> >> > That seems problematic though -- what about clients that just happen to
> >> > do an EXCHANGE_ID just before the grace period is going to end, and
> >> > that get expired before they can issue their RECLAIM_COMPLETE. Will
> >> > that be a problem for them?
> >>
> >> In that case a client will send a reclaim, get back a NO_GRACE error,
> >> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
> >> and continue normally.  (To the extent it can--signalling affected
> >> processes or EIOing further attempts to use the unreclaimed state, or
> >> whatever.)
> >
> > The one thing the server *could* do in this sort of case is extend the
> > grace period by a little--I seem to recall the spec giving some leeway
> > for this kind of thing.
> 
> 
> Section 8.4.2.1.

Thanks!

	http://tools.ietf.org/html/rfc5661#section-8.4.2.1

	"Some additional time in order to allow a client to establish a
	new client ID and session and to effect lock reclaims may be
	added to the lease time."

I thought there was something else but I actually must have been
remembering the "diligently flushing" language describing delegation
recalls.  Anyway.

> 
> > So for example the server could have a heuristics like: extend the grace
> > period by another second each time we notice there's been an EXCHANGE_ID
> > or reclaim in the previous second, up to some maximum.  And I suppose it
> > could also delay the grace period until someone actually attempts a
> > non-reclaim open.
> >
> > In isolation a single client slipping in the end like that sounds like a
> > freak event, but if there's a ton of state to reclaim perhaps it could
> > become more likely.
> >
> > I don't think that's a priority, we might just want to make sure we know
> > how to do that in the future.
> >
> > But now that I think about it I don't see the existing or proposed
> > nfsdcltrack stuff tying our hands in any way here.  It just gives the
> > kernel some extra information, and the kernel still has discretion about
> > when exactly it wants to end the grace period.
> >
> 
> It is even allowed to grant reclaim lock attempts after the grace
> period has ended _if_ and only if it can guarantee that no conflicting
> locks were issued.
> 
> However note that the NFSv4.1 client is not actually allowed to issue
> non-reclaim lock requests before it has issued a RECLAIM_COMPLETE. I
> dunno how religiously we stick to that in Linux (I think we do),

Yes, the server strictly enforces this so we'd know if the client
skipped the RECLAIM_COMPLETE.  (Any open would fail over 4.1.)

> but the point is that the server can and should rely on the client
> _always_ sending a RECLAIM_COMPLETE if it is going to establish new
> locks.

Yes.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 15:54               ` Trond Myklebust
  2014-09-12 16:05                 ` J. Bruce Fields
@ 2014-09-12 16:07                 ` Jeff Layton
  2014-09-12 16:29                   ` Trond Myklebust
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2014-09-12 16:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Jeff Layton, Steve Dickson, linux-nfs

On Fri, 12 Sep 2014 11:54:17 -0400
Trond Myklebust <trondmy@gmail.com> wrote:

> On Fri, Sep 12, 2014 at 11:21 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
> >> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
> >> > Grace period
> >> > eventually ends, and its record is purged from the DB.
> >> >
> >> > Now we have a client that has reclaimed some files but that has no
> >> > record on stable storage.
> >> >
> >> > One possibility is to prematurely expire v4.1+ clients that have not
> >> > sent a RECLAIM_COMPLETE when the grace period ends.
> >> >
> >> > That seems problematic though -- what about clients that just happen to
> >> > do an EXCHANGE_ID just before the grace period is going to end, and
> >> > that get expired before they can issue their RECLAIM_COMPLETE. Will
> >> > that be a problem for them?
> >>
> >> In that case a client will send a reclaim, get back a NO_GRACE error,
> >> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
> >> and continue normally.  (To the extent it can--signalling affected
> >> processes or EIOing further attempts to use the unreclaimed state, or
> >> whatever.)
> >
> > The one thing the server *could* do in this sort of case is extend the
> > grace period by a little--I seem to recall the spec giving some leeway
> > for this kind of thing.
> 
> 
> Section 8.4.2.1.
> 
> > So for example the server could have a heuristics like: extend the grace
> > period by another second each time we notice there's been an EXCHANGE_ID
> > or reclaim in the previous second, up to some maximum.  And I suppose it
> > could also delay the grace period until someone actually attempts a
> > non-reclaim open.
> >
> > In isolation a single client slipping in the end like that sounds like a
> > freak event, but if there's a ton of state to reclaim perhaps it could
> > become more likely.
> >
> > I don't think that's a priority, we might just want to make sure we know
> > how to do that in the future.
> >
> > But now that I think about it I don't see the existing or proposed
> > nfsdcltrack stuff tying our hands in any way here.  It just gives the
> > kernel some extra information, and the kernel still has discretion about
> > when exactly it wants to end the grace period.
> >
> 
> It is even allowed to grant reclaim lock attempts after the grace
> period has ended _if_ and only if it can guarantee that no conflicting
> locks were issued.
> 
> However note that the NFSv4.1 client is not actually allowed to issue
> non-reclaim lock requests before it has issued a RECLAIM_COMPLETE. I
> dunno how religiously we stick to that in Linux (I think we do), but
> the point is that the server can and should rely on the client
> _always_ sending a RECLAIM_COMPLETE if it is going to establish new
> locks.

Yeah, I'm pretty sure that bit is enforced. The problem situation that
I think Bruce was referring to is this:

Server reboots. Client1 reclaims some of its locks (but not all) and
never sends a RECLAIM_COMPLETE. Grace period ends and then server
hands out a lock to client2 that was previously held by client1 but
that didn't get reclaimed.

Server reboots again, prior to the client1 expiring (so its record is
still in the DB). Now client1 comes back and starts reclaiming again.
This time it reclaims all of its locks and we have a conflict between
it and client2.

It's a solvable problem, but I'll need to work through how best to do
so.

-- 
Jeff Layton <jlayton@primarydata.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 16:07                 ` Jeff Layton
@ 2014-09-12 16:29                   ` Trond Myklebust
  2014-09-12 16:33                     ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Trond Myklebust @ 2014-09-12 16:29 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, J. Bruce Fields, Steve Dickson, Linux NFS Mailing List

On Fri, Sep 12, 2014 at 12:07 PM, Jeff Layton
<jeff.layton@primarydata.com> wrote:
> On Fri, 12 Sep 2014 11:54:17 -0400
> Trond Myklebust <trondmy@gmail.com> wrote:
>
>> On Fri, Sep 12, 2014 at 11:21 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
>> >> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
>> >> > Grace period
>> >> > eventually ends, and its record is purged from the DB.
>> >> >
>> >> > Now we have a client that has reclaimed some files but that has no
>> >> > record on stable storage.
>> >> >
>> >> > One possibility is to prematurely expire v4.1+ clients that have not
>> >> > sent a RECLAIM_COMPLETE when the grace period ends.
>> >> >
>> >> > That seems problematic though -- what about clients that just happen to
>> >> > do an EXCHANGE_ID just before the grace period is going to end, and
>> >> > that get expired before they can issue their RECLAIM_COMPLETE. Will
>> >> > that be a problem for them?
>> >>
>> >> In that case a client will send a reclaim, get back a NO_GRACE error,
>> >> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
>> >> and continue normally.  (To the extent it can--signalling affected
>> >> processes or EIOing further attempts to use the unreclaimed state, or
>> >> whatever.)
>> >
>> > The one thing the server *could* do in this sort of case is extend the
>> > grace period by a little--I seem to recall the spec giving some leeway
>> > for this kind of thing.
>>
>>
>> Section 8.4.2.1.
>>
>> > So for example the server could have a heuristics like: extend the grace
>> > period by another second each time we notice there's been an EXCHANGE_ID
>> > or reclaim in the previous second, up to some maximum.  And I suppose it
>> > could also delay the grace period until someone actually attempts a
>> > non-reclaim open.
>> >
>> > In isolation a single client slipping in the end like that sounds like a
>> > freak event, but if there's a ton of state to reclaim perhaps it could
>> > become more likely.
>> >
>> > I don't think that's a priority, we might just want to make sure we know
>> > how to do that in the future.
>> >
>> > But now that I think about it I don't see the existing or proposed
>> > nfsdcltrack stuff tying our hands in any way here.  It just gives the
>> > kernel some extra information, and the kernel still has discretion about
>> > when exactly it wants to end the grace period.
>> >
>>
>> It is even allowed to grant reclaim lock attempts after the grace
>> period has ended _if_ and only if it can guarantee that no conflicting
>> locks were issued.
>>
>> However note that the NFSv4.1 client is not actually allowed to issue
>> non-reclaim lock requests before it has issued a RECLAIM_COMPLETE. I
>> dunno how religiously we stick to that in Linux (I think we do), but
>> the point is that the server can and should rely on the client
>> _always_ sending a RECLAIM_COMPLETE if it is going to establish new
>> locks.
>
> Yeah, I'm pretty sure that bit is enforced. The problem situation that
> I think Bruce was referring to is this:
>
> Server reboots. Client1 reclaims some of its locks (but not all) and
> never sends a RECLAIM_COMPLETE. Grace period ends and then server
> hands out a lock to client2 that was previously held by client1 but
> that didn't get reclaimed.
>
> Server reboots again, prior to the client1 expiring (so its record is
> still in the DB). Now client1 comes back and starts reclaiming again.
> This time it reclaims all of its locks and we have a conflict between
> it and client2.
>
> It's a solvable problem, but I'll need to work through how best to do
> so.
>
> --

That's the first edge condition described in section 8.4.3.

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 5/7] nfsdcltrack: update schema to v2
  2014-09-12 16:29                   ` Trond Myklebust
@ 2014-09-12 16:33                     ` Jeff Layton
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2014-09-12 16:33 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Layton, Trond Myklebust, J. Bruce Fields, Steve Dickson,
	Linux NFS Mailing List

On Fri, 12 Sep 2014 12:29:01 -0400
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Fri, Sep 12, 2014 at 12:07 PM, Jeff Layton
> <jeff.layton@primarydata.com> wrote:
> > On Fri, 12 Sep 2014 11:54:17 -0400
> > Trond Myklebust <trondmy@gmail.com> wrote:
> >
> >> On Fri, Sep 12, 2014 at 11:21 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >> > On Fri, Sep 12, 2014 at 10:36:21AM -0400, J. Bruce Fields wrote:
> >> >> On Fri, Sep 12, 2014 at 10:21:53AM -0400, Jeff Layton wrote:
> >> >> > Grace period
> >> >> > eventually ends, and its record is purged from the DB.
> >> >> >
> >> >> > Now we have a client that has reclaimed some files but that has no
> >> >> > record on stable storage.
> >> >> >
> >> >> > One possibility is to prematurely expire v4.1+ clients that have not
> >> >> > sent a RECLAIM_COMPLETE when the grace period ends.
> >> >> >
> >> >> > That seems problematic though -- what about clients that just happen to
> >> >> > do an EXCHANGE_ID just before the grace period is going to end, and
> >> >> > that get expired before they can issue their RECLAIM_COMPLETE. Will
> >> >> > that be a problem for them?
> >> >>
> >> >> In that case a client will send a reclaim, get back a NO_GRACE error,
> >> >> mark the rest of its state as unrecoverable, send the RECLAIM_COMPLETE,
> >> >> and continue normally.  (To the extent it can--signalling affected
> >> >> processes or EIOing further attempts to use the unreclaimed state, or
> >> >> whatever.)
> >> >
> >> > The one thing the server *could* do in this sort of case is extend the
> >> > grace period by a little--I seem to recall the spec giving some leeway
> >> > for this kind of thing.
> >>
> >>
> >> Section 8.4.2.1.
> >>
> >> > So for example the server could have a heuristics like: extend the grace
> >> > period by another second each time we notice there's been an EXCHANGE_ID
> >> > or reclaim in the previous second, up to some maximum.  And I suppose it
> >> > could also delay the grace period until someone actually attempts a
> >> > non-reclaim open.
> >> >
> >> > In isolation a single client slipping in the end like that sounds like a
> >> > freak event, but if there's a ton of state to reclaim perhaps it could
> >> > become more likely.
> >> >
> >> > I don't think that's a priority, we might just want to make sure we know
> >> > how to do that in the future.
> >> >
> >> > But now that I think about it I don't see the existing or proposed
> >> > nfsdcltrack stuff tying our hands in any way here.  It just gives the
> >> > kernel some extra information, and the kernel still has discretion about
> >> > when exactly it wants to end the grace period.
> >> >
> >>
> >> It is even allowed to grant reclaim lock attempts after the grace
> >> period has ended _if_ and only if it can guarantee that no conflicting
> >> locks were issued.
> >>
> >> However note that the NFSv4.1 client is not actually allowed to issue
> >> non-reclaim lock requests before it has issued a RECLAIM_COMPLETE. I
> >> dunno how religiously we stick to that in Linux (I think we do), but
> >> the point is that the server can and should rely on the client
> >> _always_ sending a RECLAIM_COMPLETE if it is going to establish new
> >> locks.
> >
> > Yeah, I'm pretty sure that bit is enforced. The problem situation that
> > I think Bruce was referring to is this:
> >
> > Server reboots. Client1 reclaims some of its locks (but not all) and
> > never sends a RECLAIM_COMPLETE. Grace period ends and then server
> > hands out a lock to client2 that was previously held by client1 but
> > that didn't get reclaimed.
> >
> > Server reboots again, prior to the client1 expiring (so its record is
> > still in the DB). Now client1 comes back and starts reclaiming again.
> > This time it reclaims all of its locks and we have a conflict between
> > it and client2.
> >
> > It's a solvable problem, but I'll need to work through how best to do
> > so.
> >
> > --
> 
> That's the first edge condition described in section 8.4.3.
> 

Actually, it's case #2 I think...

-- 
Jeff Layton <jlayton@primarydata.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-09-12 16:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-08 16:30 [PATCH v3 0/7] nfs-utils: support for lifting grace period early Jeff Layton
2014-09-08 16:30 ` [PATCH v3 1/7] sm-notify: inform the kernel if there were no hosts to notify Jeff Layton
2014-09-08 16:30 ` [PATCH v3 2/7] nfsdcltrack: update comments in sqlite.c Jeff Layton
2014-09-11 15:53   ` J. Bruce Fields
2014-09-08 16:30 ` [PATCH v3 3/7] nfsdcltrack: rename CLD_* constants with CLTRACK_* prefixes Jeff Layton
2014-09-08 16:30 ` [PATCH v3 4/7] nfsdcltrack: overhaul database initializtion Jeff Layton
2014-09-08 16:30 ` [PATCH v3 5/7] nfsdcltrack: update schema to v2 Jeff Layton
2014-09-11 19:55   ` J. Bruce Fields
2014-09-11 20:28     ` Jeff Layton
2014-09-12 13:36       ` Jeff Layton
2014-09-12 14:21         ` Jeff Layton
2014-09-12 14:36           ` J. Bruce Fields
2014-09-12 15:21             ` J. Bruce Fields
2014-09-12 15:54               ` Trond Myklebust
2014-09-12 16:05                 ` J. Bruce Fields
2014-09-12 16:07                 ` Jeff Layton
2014-09-12 16:29                   ` Trond Myklebust
2014-09-12 16:33                     ` Jeff Layton
2014-09-12 15:42           ` Trond Myklebust
2014-09-08 16:30 ` [PATCH v3 6/7] nfsdcltrack: grab the NFSDCLTRACK_RECLAIM_COMPLETE env var if it's present Jeff Layton
2014-09-08 16:30 ` [PATCH v3 7/7] nfsdcltrack: fetch NFSDCLTRACK_GRACE_START out of environment Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.