linux-nfs.vger.kernel.org archive mirror
* [nfs-utils PATCH RFC 0/7] restore nfsdcld
@ 2018-11-06 18:36 Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 1/7] Revert "nfsdcltrack: remove the nfsdcld daemon" Scott Mayhew
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

When nfsdcld was released, it was quickly deprecated in favor of the
nfsdcltrack usermodehelper, so as to not require another running daemon.
The nfsdcld code was removed from nfs-utils in 2012.  That prevents
NFSv4 clients from reclaiming locks from nfsd instances running in
containers, since neither nfsdcltrack nor the legacy client tracking code
works in containers.  These patches restore the nfsdcld code.

These patches are intended to go alongside some kernel patches that
introduce an enhancement that allows nfsd to "slurp" up the client
records during client tracking initialization and store them internally
in a hash table.  This enables nfsd to check whether an NFSv4 client is
allowed to reclaim without having to do an upcall to nfsdcld.  It also
allows nfsd to decide to end the v4 grace period early if the number of
RECLAIM_COMPLETE operations it has received from "known" clients is
equal to the number of entries in the hash table.  It also allows nfsd
to skip the v4 grace period altogether if it knows there are no clients
allowed to reclaim.

The new nfsdcld code will work with older kernels, but in that case
nfsd cannot exit the grace period early or skip the grace period
altogether.
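
The grace-period decisions described above can be sketched roughly as
follows.  This is only an illustration and is not taken from the kernel
patches: every name here (struct known_client, struct reclaim_table,
find_client, and so on) is hypothetical, and a linear scan stands in for
the hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one client loaded from nfsdcld at startup. */
struct known_client {
	const char *id;          /* long-form client identifier */
	bool reclaim_complete;   /* has this client sent RECLAIM_COMPLETE? */
};

/* Hypothetical in-memory table; a real one would be a hash table. */
struct reclaim_table {
	struct known_client *clients;
	size_t nr_known;         /* number of entries in clients */
	size_t nr_complete;      /* known clients that finished reclaiming */
};

/* Linear lookup standing in for the hash table lookup. */
struct known_client *find_client(struct reclaim_table *t, const char *id)
{
	assert(t != NULL && id != NULL);
	for (size_t i = 0; i < t->nr_known; i++)
		if (strcmp(t->clients[i].id, id) == 0)
			return &t->clients[i];
	return NULL;
}

/* Answer "may this client reclaim?" locally, without an upcall. */
bool client_may_reclaim(struct reclaim_table *t, const char *id)
{
	return find_client(t, id) != NULL;
}

/* Count a RECLAIM_COMPLETE once per known client; ignore unknown clients. */
void note_reclaim_complete(struct reclaim_table *t, const char *id)
{
	struct known_client *c = find_client(t, id);

	if (c && !c->reclaim_complete) {
		c->reclaim_complete = true;
		t->nr_complete++;
	}
}

/* No client is allowed to reclaim: the grace period can be skipped. */
bool grace_can_be_skipped(const struct reclaim_table *t)
{
	return t->nr_known == 0;
}

/* Every known client has finished: the grace period can end early. */
bool grace_can_end_early(const struct reclaim_table *t)
{
	return t->nr_known > 0 && t->nr_complete == t->nr_known;
}
```

Counting each client at most once keeps a repeated RECLAIM_COMPLETE from
ending the grace period while another known client is still reclaiming.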

Scott Mayhew (7):
  Revert "nfsdcltrack: remove the nfsdcld daemon"
  nfsdcld: move nfsdcld to its own directory
  nfsdcld: a few enhancements
  nfsdcld: remove some unused functions
  nfsdcld: the -p option should specify the rpc_pipefs mountpoint
  nfsdcld: add /etc/nfs.conf support
  systemd: add a unit file for nfsdcld

 .gitignore                   |   1 +
 configure.ac                 |  23 +
 nfs.conf                     |   4 +
 support/include/cld.h        |   1 +
 systemd/nfs-server.service   |   2 +
 systemd/nfsdcld.service      |  11 +
 utils/Makefile.am            |   4 +
 utils/nfsdcld/Makefile.am    |  19 +
 utils/nfsdcld/cld-internal.h |  30 ++
 utils/nfsdcld/nfsdcld.c      | 761 +++++++++++++++++++++++++++++++
 utils/nfsdcld/nfsdcld.man    | 199 +++++++++
 utils/nfsdcld/sqlite.c       | 837 +++++++++++++++++++++++++++++++++++
 utils/nfsdcld/sqlite.h       |  33 ++
 13 files changed, 1925 insertions(+)
 create mode 100644 systemd/nfsdcld.service
 create mode 100644 utils/nfsdcld/Makefile.am
 create mode 100644 utils/nfsdcld/cld-internal.h
 create mode 100644 utils/nfsdcld/nfsdcld.c
 create mode 100644 utils/nfsdcld/nfsdcld.man
 create mode 100644 utils/nfsdcld/sqlite.c
 create mode 100644 utils/nfsdcld/sqlite.h

-- 
2.17.1



* [nfs-utils PATCH RFC 1/7] Revert "nfsdcltrack: remove the nfsdcld daemon"
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 2/7] nfsdcld: move nfsdcld to its own directory Scott Mayhew
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

This reverts commit 2cf11ec6ed261ef56bbd0d73ff404fe69f1fefb0.

nfsdcld was originally deprecated in favor of a usermodehelper, so as to
not require another running daemon.  But that turned out to make
namespaced nfsd's too difficult, so this commit brings back nfsdcld.

A few things have been changed since nfsdcld was removed in 2012:

- libnfs.a is now libnfs.la
- sqlite_insert_client and sqlite_check_client both take a bool
  has_session which is being hard-coded to false for now.
- sqlite_insert_client also takes a bool zerotime which is being
  hard-coded to false.
- an xlog() call whose arguments did not match the format string
  (and could crash nfsdcld) has been fixed
- cld_pipe_open() now calls event_initialized() rather than checking for
  EVLIST_INIT directly (which does not compile with newer versions
  of libevent).
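
As an aside on the xlog() item above: a mismatch between a format string
and its arguments can be caught at compile time when the logging
function carries the printf format attribute supported by GCC and Clang.
The sketch below is not part of this patch and does not use the
nfs-utils xlog(); log_to_buf() and report_open_failure() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical logger that formats into a caller-supplied buffer.
 * The format attribute tells the compiler that argument 3 is a
 * printf-style format string and that the variadic arguments start
 * at position 4, so -Wformat can check each conversion.
 */
int log_to_buf(char *out, size_t outlen, const char *fmt, ...)
	__attribute__((format(printf, 3, 4)));

int log_to_buf(char *out, size_t outlen, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(out != NULL && outlen > 0);
	va_start(ap, fmt);
	n = vsnprintf(out, outlen, fmt, ap);
	va_end(ap);
	return n;
}

/* Arguments are passed in the same order as the conversions: %s, then %d. */
int report_open_failure(char *out, size_t outlen, const char *func, int err)
{
	return log_to_buf(out, outlen, "%s: unable to open new pipe (%d)",
			  func, err);
}
```

With the attribute in place, passing the int before the string for this
format would be expected to draw a -Wformat warning rather than fail at
run time.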

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 .gitignore                    |   1 +
 utils/nfsdcltrack/Makefile.am |   7 +-
 utils/nfsdcltrack/nfsdcld.c   | 611 ++++++++++++++++++++++++++++++++++
 utils/nfsdcltrack/nfsdcld.man | 185 ++++++++++
 4 files changed, 802 insertions(+), 2 deletions(-)
 create mode 100644 utils/nfsdcltrack/nfsdcld.c
 create mode 100644 utils/nfsdcltrack/nfsdcld.man

diff --git a/.gitignore b/.gitignore
index e91e7a2..d4f4f34 100644
--- a/.gitignore
+++ b/.gitignore
@@ -54,6 +54,7 @@ utils/rquotad/rquotad
 utils/rquotad/rquota.h
 utils/rquotad/rquota_xdr.c
 utils/showmount/showmount
+utils/nfsdcltrack/nfsdcld
 utils/nfsdcltrack/nfsdcltrack
 utils/statd/statd
 tools/locktest/testlk
diff --git a/utils/nfsdcltrack/Makefile.am b/utils/nfsdcltrack/Makefile.am
index 2f7fe3d..0f599c0 100644
--- a/utils/nfsdcltrack/Makefile.am
+++ b/utils/nfsdcltrack/Makefile.am
@@ -4,11 +4,14 @@
 # overridden at config time. The kernel "knows" the /sbin name.
 sbindir = /sbin
 
-man8_MANS	= nfsdcltrack.man
+man8_MANS	= nfsdcld.man nfsdcltrack.man
 EXTRA_DIST	= $(man8_MANS)
 
 AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
-sbin_PROGRAMS	= nfsdcltrack
+sbin_PROGRAMS	= nfsdcld nfsdcltrack
+
+nfsdcld_SOURCES = nfsdcld.c sqlite.c
+nfsdcld_LDADD = ../../support/nfs/libnfs.la $(LIBEVENT) $(LIBSQLITE) $(LIBCAP)
 
 noinst_HEADERS	= sqlite.h
 
diff --git a/utils/nfsdcltrack/nfsdcld.c b/utils/nfsdcltrack/nfsdcld.c
new file mode 100644
index 0000000..082f3ab
--- /dev/null
+++ b/utils/nfsdcltrack/nfsdcld.c
@@ -0,0 +1,611 @@
+/*
+ * nfsdcld.c -- NFSv4 client name tracking daemon
+ *
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif /* HAVE_CONFIG_H */
+
+#include <errno.h>
+#include <event.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/inotify.h>
+#ifdef HAVE_SYS_CAPABILITY_H
+#include <sys/prctl.h>
+#include <sys/capability.h>
+#endif
+
+#include "xlog.h"
+#include "nfslib.h"
+#include "cld.h"
+#include "sqlite.h"
+
+#ifndef PIPEFS_DIR
+#define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
+#endif
+
+#define DEFAULT_CLD_PATH	PIPEFS_DIR "/nfsd/cld"
+
+#ifndef CLD_DEFAULT_STORAGEDIR
+#define CLD_DEFAULT_STORAGEDIR NFS_STATEDIR "/nfsdcld"
+#endif
+
+#define UPCALL_VERSION		1
+
+/* private data structures */
+struct cld_client {
+	int			cl_fd;
+	struct event		cl_event;
+	struct cld_msg	cl_msg;
+};
+
+/* global variables */
+static char *pipepath = DEFAULT_CLD_PATH;
+static int 		inotify_fd = -1;
+static struct event	pipedir_event;
+
+static struct option longopts[] =
+{
+	{ "help", 0, NULL, 'h' },
+	{ "foreground", 0, NULL, 'F' },
+	{ "debug", 0, NULL, 'd' },
+	{ "pipe", 1, NULL, 'p' },
+	{ "storagedir", 1, NULL, 's' },
+	{ NULL, 0, 0, 0 },
+};
+
+/* forward declarations */
+static void cldcb(int UNUSED(fd), short which, void *data);
+
+static void
+usage(char *progname)
+{
+	printf("%s [ -hFd ] [ -p pipe ] [ -s dir ]\n", progname);
+}
+
+static int
+cld_set_caps(void)
+{
+	int ret = 0;
+#ifdef HAVE_SYS_CAPABILITY_H
+	unsigned long i;
+	cap_t caps;
+
+	if (getuid() != 0) {
+		xlog(L_ERROR, "Not running as root. Daemon won't be able to "
+			      "open the pipe after dropping capabilities!");
+		return -EINVAL;
+	}
+
+	/* prune the bounding set to nothing */
+	for (i = 0; prctl(PR_CAPBSET_READ, i, 0, 0, 0) >= 0 ; ++i) {
+		ret = prctl(PR_CAPBSET_DROP, i, 0, 0, 0);
+		if (ret) {
+			xlog(L_ERROR, "Unable to prune capability %lu from "
+				      "bounding set: %m", i);
+			return -errno;
+		}
+	}
+
+	/* get a blank capset */
+	caps = cap_init();
+	if (caps == NULL) {
+		xlog(L_ERROR, "Unable to get blank capability set: %m");
+		return -errno;
+	}
+
+	/* reset the process capabilities */
+	if (cap_set_proc(caps) != 0) {
+		xlog(L_ERROR, "Unable to set process capabilities: %m");
+		ret = -errno;
+	}
+	cap_free(caps);
+#endif
+	return ret;
+}
+
+#define INOTIFY_EVENT_MAX (sizeof(struct inotify_event) + NAME_MAX)
+
+static int
+cld_pipe_open(struct cld_client *clnt)
+{
+	int fd;
+
+	xlog(D_GENERAL, "%s: opening upcall pipe %s", __func__, pipepath);
+	fd = open(pipepath, O_RDWR, 0);
+	if (fd < 0) {
+		xlog(D_GENERAL, "%s: open of %s failed: %m", __func__, pipepath);
+		return -errno;
+	}
+
+	if (event_initialized(&clnt->cl_event))
+		event_del(&clnt->cl_event);
+	if (clnt->cl_fd >= 0)
+		close(clnt->cl_fd);
+
+	clnt->cl_fd = fd;
+	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, cldcb, clnt);
+	/* event_add is done by the caller */
+	return 0;
+}
+
+static void
+cld_inotify_cb(int UNUSED(fd), short which, void *data)
+{
+	int ret;
+	size_t elen;
+	ssize_t rret;
+	char evbuf[INOTIFY_EVENT_MAX];
+	char *dirc = NULL, *pname;
+	struct inotify_event *event = (struct inotify_event *)evbuf;
+	struct cld_client *clnt = data;
+
+	if (which != EV_READ)
+		return;
+
+	xlog(D_GENERAL, "%s: called for EV_READ", __func__);
+
+	dirc = strndup(pipepath, PATH_MAX);
+	if (!dirc) {
+		xlog(L_ERROR, "%s: unable to allocate memory", __func__);
+		goto out;
+	}
+
+	rret = read(inotify_fd, evbuf, INOTIFY_EVENT_MAX);
+	if (rret < 0) {
+		xlog(L_ERROR, "%s: read from inotify fd failed: %m", __func__);
+		goto out;
+	}
+
+	/* check to see if we have a filename in the evbuf */
+	if (!event->len) {
+		xlog(D_GENERAL, "%s: no filename in inotify event", __func__);
+		goto out;
+	}
+
+	pname = basename(dirc);
+	elen = strnlen(event->name, event->len);
+
+	/* does the filename match our pipe? */
+	if (strlen(pname) != elen || memcmp(pname, event->name, elen)) {
+		xlog(D_GENERAL, "%s: wrong filename (%s)", __func__,
+				event->name);
+		goto out;
+	}
+
+	ret = cld_pipe_open(clnt);
+	switch (ret) {
+	case 0:
+		/* re-add the event for the cl_event pipe */
+		event_add(&clnt->cl_event, NULL);
+		break;
+	case -ENOENT:
+		/* pipe must have disappeared, wait for it to come back */
+		goto out;
+	default:
+		/* anything else is fatal */
+		xlog(L_FATAL, "%s: unable to open new pipe (%d). Aborting.",
+			__func__, ret);
+		exit(ret);
+	}
+
+out:
+	event_add(&pipedir_event, NULL);
+	free(dirc);
+}
+
+static int
+cld_inotify_setup(void)
+{
+	int ret;
+	char *dirc, *dname;
+
+	dirc = strndup(pipepath, PATH_MAX);
+	if (!dirc) {
+		xlog_err("%s: unable to allocate memory", __func__);
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	dname = dirname(dirc);
+
+	inotify_fd = inotify_init();
+	if (inotify_fd < 0) {
+		xlog_err("%s: inotify_init failed: %m", __func__);
+		ret = -errno;
+		goto out_free;
+	}
+
+	ret = inotify_add_watch(inotify_fd, dname, IN_CREATE);
+	if (ret < 0) {
+		xlog_err("%s: inotify_add_watch failed: %m", __func__);
+		ret = -errno;
+		goto out_err;
+	}
+
+out_free:
+	free(dirc);
+	return 0;
+out_err:
+	close(inotify_fd);
+	goto out_free;
+}
+
+/*
+ * Set an inotify watch on the directory that should contain the pipe, and then
+ * try to open it. If it fails with anything but -ENOENT, return the error
+ * immediately.
+ *
+ * If it succeeds, then set up the pipe event handler. At that point, set up
+ * the inotify event handler and go ahead and return success.
+ */
+static int
+cld_pipe_init(struct cld_client *clnt)
+{
+	int ret;
+
+	xlog(D_GENERAL, "%s: init pipe handlers", __func__);
+
+	ret = cld_inotify_setup();
+	if (ret != 0)
+		goto out;
+
+	clnt->cl_fd = -1;
+	ret = cld_pipe_open(clnt);
+	switch (ret) {
+	case 0:
+		/* add the event and we're good to go */
+		event_add(&clnt->cl_event, NULL);
+		break;
+	case -ENOENT:
+		/* ignore this error -- cld_inotify_cb will handle it */
+		ret = 0;
+		break;
+	default:
+		/* anything else is fatal */
+		close(inotify_fd);
+		goto out;
+	}
+
+	/* set event for inotify read */
+	event_set(&pipedir_event, inotify_fd, EV_READ, cld_inotify_cb, clnt);
+	event_add(&pipedir_event, NULL);
+out:
+	return ret;
+}
+
+static void
+cld_not_implemented(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: downcalling with not implemented error", __func__);
+
+	/* set up reply */
+	cmsg->cm_status = -EOPNOTSUPP;
+
+	bsize = sizeof(*cmsg);
+
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize)
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+
+	/* reopen pipe, just to be sure */
+	ret = cld_pipe_open(clnt);
+	if (ret) {
+		xlog(L_FATAL, "%s: unable to reopen pipe: %d", __func__, ret);
+		exit(ret);
+	}
+}
+
+static void
+cld_create(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: create client record.", __func__);
+
+
+	ret = sqlite_insert_client(cmsg->cm_u.cm_name.cn_id,
+				   cmsg->cm_u.cm_name.cn_len,
+				   false,
+				   false);
+
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+		ret = cld_pipe_open(clnt);
+		if (ret) {
+			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
+					__func__, ret);
+			exit(ret);
+		}
+	}
+}
+
+static void
+cld_remove(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: remove client record.", __func__);
+
+	ret = sqlite_remove_client(cmsg->cm_u.cm_name.cn_id,
+				   cmsg->cm_u.cm_name.cn_len);
+
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "%s: downcall with status %d", __func__,
+			cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+		ret = cld_pipe_open(clnt);
+		if (ret) {
+			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
+					__func__, ret);
+			exit(ret);
+		}
+	}
+}
+
+static void
+cld_check(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: check client record", __func__);
+
+	ret = sqlite_check_client(cmsg->cm_u.cm_name.cn_id,
+				  cmsg->cm_u.cm_name.cn_len,
+				  false);
+
+	/* set up reply */
+	cmsg->cm_status = ret ? -EACCES : ret;
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "%s: downcall with status %d", __func__,
+			cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+		ret = cld_pipe_open(clnt);
+		if (ret) {
+			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
+					__func__, ret);
+			exit(ret);
+		}
+	}
+}
+
+static void
+cld_gracedone(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: grace done. cm_gracetime=%ld", __func__,
+			cmsg->cm_u.cm_gracetime);
+
+	ret = sqlite_remove_unreclaimed(cmsg->cm_u.cm_gracetime);
+
+	/* set up reply: downcall with 0 status */
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+		ret = cld_pipe_open(clnt);
+		if (ret) {
+			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
+					__func__, ret);
+			exit(ret);
+		}
+	}
+}
+
+static void
+cldcb(int UNUSED(fd), short which, void *data)
+{
+	ssize_t len;
+	struct cld_client *clnt = data;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	if (which != EV_READ)
+		goto out;
+
+	len = atomicio(read, clnt->cl_fd, cmsg, sizeof(*cmsg));
+	if (len <= 0) {
+		xlog(L_ERROR, "%s: pipe read failed: %m", __func__);
+		cld_pipe_open(clnt);
+		goto out;
+	}
+
+	if (cmsg->cm_vers > UPCALL_VERSION) {
+		xlog(L_ERROR, "%s: unsupported upcall version: %hu",
+				__func__, cmsg->cm_vers);
+		cld_pipe_open(clnt);
+		goto out;
+	}
+
+	switch(cmsg->cm_cmd) {
+	case Cld_Create:
+		cld_create(clnt);
+		break;
+	case Cld_Remove:
+		cld_remove(clnt);
+		break;
+	case Cld_Check:
+		cld_check(clnt);
+		break;
+	case Cld_GraceDone:
+		cld_gracedone(clnt);
+		break;
+	default:
+		xlog(L_WARNING, "%s: command %u is not yet implemented",
+				__func__, cmsg->cm_cmd);
+		cld_not_implemented(clnt);
+	}
+out:
+	event_add(&clnt->cl_event, NULL);
+}
+
+int
+main(int argc, char **argv)
+{
+	int arg;
+	int rc = 0;
+	bool foreground = false;
+	char *progname;
+	char *storagedir = CLD_DEFAULT_STORAGEDIR;
+	struct cld_client clnt;
+
+	memset(&clnt, 0, sizeof(clnt));
+
+	progname = strdup(basename(argv[0]));
+	if (!progname) {
+		fprintf(stderr, "%s: unable to allocate memory.\n", argv[0]);
+		return 1;
+	}
+
+	event_init();
+	xlog_syslog(0);
+	xlog_stderr(1);
+
+	/* process command-line options */
+	while ((arg = getopt_long(argc, argv, "hdFp:s:", longopts,
+				  NULL)) != EOF) {
+		switch (arg) {
+		case 'd':
+			xlog_config(D_ALL, 1);
+			break;
+		case 'F':
+			foreground = true;
+			break;
+		case 'p':
+			pipepath = optarg;
+			break;
+		case 's':
+			storagedir = optarg;
+			break;
+		default:
+			usage(progname);
+			return 0;
+		}
+	}
+
+
+	xlog_open(progname);
+	if (!foreground) {
+		xlog_syslog(1);
+		xlog_stderr(0);
+		rc = daemon(0, 0);
+		if (rc) {
+			xlog(L_ERROR, "Unable to daemonize: %m");
+			goto out;
+		}
+	}
+
+	/* drop all capabilities */
+	rc = cld_set_caps();
+	if (rc)
+		goto out;
+
+	/*
+	 * now see if the storagedir is writable by root w/o CAP_DAC_OVERRIDE.
+	 * If it isn't then give the user a warning but proceed as if
+	 * everything is OK. If the DB has already been created, then
+	 * everything might still work. If it doesn't exist at all, then
+	 * assume that the maindb init will be able to create it. Fail on
+	 * anything else.
+	 */
+	if (access(storagedir, W_OK) == -1) {
+		switch (errno) {
+		case EACCES:
+			xlog(L_WARNING, "Storage directory %s is not writable. "
+					"Should be owned by root and writable "
+					"by owner!", storagedir);
+			break;
+		case ENOENT:
+			/* ignore and assume that we can create dir as root */
+			break;
+		default:
+			xlog(L_ERROR, "Unexpected error when checking access "
+				      "on %s: %m", storagedir);
+			rc = -errno;
+			goto out;
+		}
+	}
+
+	/* set up storage db */
+	rc = sqlite_prepare_dbh(storagedir);
+	if (rc) {
+		xlog(L_ERROR, "Failed to open main database: %d", rc);
+		goto out;
+	}
+
+	/* set up event handler */
+	rc = cld_pipe_init(&clnt);
+	if (rc)
+		goto out;
+
+	xlog(D_GENERAL, "%s: Starting event dispatch handler.", __func__);
+	rc = event_dispatch();
+	if (rc < 0)
+		xlog(L_ERROR, "%s: event_dispatch failed: %m", __func__);
+
+	close(clnt.cl_fd);
+	close(inotify_fd);
+out:
+	free(progname);
+	return rc;
+}
diff --git a/utils/nfsdcltrack/nfsdcld.man b/utils/nfsdcltrack/nfsdcld.man
new file mode 100644
index 0000000..9ddaf64
--- /dev/null
+++ b/utils/nfsdcltrack/nfsdcld.man
@@ -0,0 +1,185 @@
+.\" Automatically generated by Pod::Man 2.22 (Pod::Simple 3.13)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" Set up some character translations and predefined strings.  \*(-- will
+.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
+.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
+.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
+.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
+.\" nothing in troff, for use with C<>.
+.tr \(*W-
+.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
+.ie n \{\
+.    ds -- \(*W-
+.    ds PI pi
+.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
+.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
+.    ds L" ""
+.    ds R" ""
+.    ds C` ""
+.    ds C' ""
+'br\}
+.el\{\
+.    ds -- \|\(em\|
+.    ds PI \(*p
+.    ds L" ``
+.    ds R" ''
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el       .ds Aq '
+.\"
+.\" If the F register is turned on, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD.  Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.ie \nF \{\
+.    de IX
+.    tm Index:\\$1\t\\n%\t"\\$2"
+..
+.    nr % 0
+.    rr F
+.\}
+.el \{\
+.    de IX
+..
+.\}
+.\"
+.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
+.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
+.    \" fudge factors for nroff and troff
+.if n \{\
+.    ds #H 0
+.    ds #V .8m
+.    ds #F .3m
+.    ds #[ \f1
+.    ds #] \fP
+.\}
+.if t \{\
+.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
+.    ds #V .6m
+.    ds #F 0
+.    ds #[ \&
+.    ds #] \&
+.\}
+.    \" simple accents for nroff and troff
+.if n \{\
+.    ds ' \&
+.    ds ` \&
+.    ds ^ \&
+.    ds , \&
+.    ds ~ ~
+.    ds /
+.\}
+.if t \{\
+.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
+.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
+.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
+.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
+.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
+.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
+.\}
+.    \" troff and (daisy-wheel) nroff accents
+.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
+.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
+.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
+.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
+.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
+.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
+.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
+.ds ae a\h'-(\w'a'u*4/10)'e
+.ds Ae A\h'-(\w'A'u*4/10)'E
+.    \" corrections for vroff
+.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
+.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
+.    \" for low resolution devices (crt and lpr)
+.if \n(.H>23 .if \n(.V>19 \
+\{\
+.    ds : e
+.    ds 8 ss
+.    ds o a
+.    ds d- d\h'-1'\(ga
+.    ds D- D\h'-1'\(hy
+.    ds th \o'bp'
+.    ds Th \o'LP'
+.    ds ae ae
+.    ds Ae AE
+.\}
+.rm #[ #] #H #V #F C
+.\" ========================================================================
+.\"
+.IX Title "NFSDCLD 8"
+.TH NFSDCLD 8 "2011-12-21" "" ""
+.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH "NAME"
+nfsdcld \- NFSv4 Client Tracking Daemon
+.SH "SYNOPSIS"
+.IX Header "SYNOPSIS"
+nfsdcld [\-d] [\-F] [\-p path] [\-s stable storage dir]
+.SH "DESCRIPTION"
+.IX Header "DESCRIPTION"
+nfsdcld is the NFSv4 client tracking daemon. It is not necessary to run
+this daemon on machines that are not acting as NFSv4 servers.
+.PP
+When a network partition is combined with a server reboot, there are
+edge conditions that can cause the server to grant lock reclaims when
+other clients have taken conflicting locks in the interim. A more detailed
+explanation of this issue is described in \s-1RFC\s0 3530, section 8.6.3.
+.PP
+In order to prevent these problems, the server must track a small amount
+of per-client information on stable storage. This daemon provides the
+userspace piece of that functionality.
+.SH "OPTIONS"
+.IX Header "OPTIONS"
+.IP "\fB\-d\fR, \fB\-\-debug\fR" 4
+.IX Item "-d, --debug"
+Enable debug level logging.
+.IP "\fB\-F\fR, \fB\-\-foreground\fR" 4
+.IX Item "-F, --foreground"
+Runs the daemon in the foreground and prints all output to stderr.
+.IP "\fB\-p\fR \fIpipe\fR, \fB\-\-pipe\fR=\fIpipe\fR" 4
+.IX Item "-p pipe, --pipe=pipe"
+Location of the \*(L"cld\*(R" upcall pipe. The default value is
+\&\fI/var/lib/nfs/rpc_pipefs/nfsd/cld\fR. If the pipe does not exist when the
+daemon starts then it will wait for it to be created.
+.IP "\fB\-s\fR \fIstorage_dir\fR, \fB\-\-storagedir\fR=\fIstorage_dir\fR" 4
+.IX Item "-s storagedir, --storagedir=storage_dir"
+Directory where stable storage information should be kept. The default
+value is \fI/var/lib/nfs/nfsdcld\fR.
+.SH "NOTES"
+.IX Header "NOTES"
+The Linux kernel NFSv4 server has historically tracked this information
+on stable storage by manipulating information on the filesystem
+directly, in the directory to which \fI/proc/fs/nfsd/nfsv4recoverydir\fR
+points.
+.PP
+This daemon requires a kernel that supports the nfsdcld upcall. If the
+kernel does not support the new upcall, or is using the legacy client
+name tracking code then it will not create the pipe that nfsdcld uses to
+talk to the kernel.
+.PP
+This daemon should be run as root, as the pipe that it uses to communicate
+with the kernel is only accessible by root. The daemon however does drop all
+superuser capabilities after starting. Because of this, the \fIstoragedir\fR
+should be owned by root, and be readable and writable by owner.
+.SH "AUTHORS"
+.IX Header "AUTHORS"
+The nfsdcld daemon was developed by Jeff Layton <jlayton@redhat.com>.
-- 
2.17.1



* [nfs-utils PATCH RFC 2/7] nfsdcld: move nfsdcld to its own directory
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 1/7] Revert "nfsdcltrack: remove the nfsdcld daemon" Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements Scott Mayhew
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

The schema for the db used by nfsdcld is going to diverge quite a bit
from that used by nfsdcltrack.  It will be easier if the programs are
kept in separate directories.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 .gitignore                                 |   2 +-
 configure.ac                               |  23 +
 utils/Makefile.am                          |   4 +
 utils/nfsdcld/Makefile.am                  |  19 +
 utils/{nfsdcltrack => nfsdcld}/nfsdcld.c   |   0
 utils/{nfsdcltrack => nfsdcld}/nfsdcld.man |   0
 utils/nfsdcld/sqlite.c                     | 601 +++++++++++++++++++++
 utils/nfsdcld/sqlite.h                     |  32 ++
 utils/nfsdcltrack/Makefile.am              |   7 +-
 9 files changed, 682 insertions(+), 6 deletions(-)
 create mode 100644 utils/nfsdcld/Makefile.am
 rename utils/{nfsdcltrack => nfsdcld}/nfsdcld.c (100%)
 rename utils/{nfsdcltrack => nfsdcld}/nfsdcld.man (100%)
 create mode 100644 utils/nfsdcld/sqlite.c
 create mode 100644 utils/nfsdcld/sqlite.h

diff --git a/.gitignore b/.gitignore
index d4f4f34..e97b31f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -54,7 +54,7 @@ utils/rquotad/rquotad
 utils/rquotad/rquota.h
 utils/rquotad/rquota_xdr.c
 utils/showmount/showmount
-utils/nfsdcltrack/nfsdcld
+utils/nfsdcld/nfsdcld
 utils/nfsdcltrack/nfsdcltrack
 utils/statd/statd
 tools/locktest/testlk
diff --git a/configure.ac b/configure.ac
index 4b698dd..d576caf 100644
--- a/configure.ac
+++ b/configure.ac
@@ -232,6 +232,12 @@ else
 	AM_CONDITIONAL(MOUNT_CONFIG, [test "$enable_mount" = "yes"])
 fi
 
+AC_ARG_ENABLE(nfsdcld,
+	[AC_HELP_STRING([--disable-nfsdcld],
+			[disable NFSv4 clientid tracking daemon @<:@default=no@:>@])],
+	enable_nfsdcld=$enableval,
+	enable_nfsdcld="yes")
+
 AC_ARG_ENABLE(nfsdcltrack,
 	[AC_HELP_STRING([--disable-nfsdcltrack],
 			[disable NFSv4 clientid tracking programs @<:@default=no@:>@])],
@@ -318,6 +324,20 @@ if test "$enable_nfsv4" = yes; then
   dnl Check for sqlite3
   AC_SQLITE3_VERS
 
+  if test "$enable_nfsdcld" = "yes"; then
+	AC_CHECK_HEADERS([libgen.h sys/inotify.h], ,
+		AC_MSG_ERROR([Cannot find header needed for nfsdcld]))
+
+    case $libsqlite3_cv_is_recent in
+    yes) ;;
+    unknown)
+      dnl do not fail when cross-compiling
+      AC_MSG_WARN([assuming sqlite is at least v3.3]) ;;
+    *)
+      AC_MSG_ERROR([nfsdcld requires sqlite-devel]) ;;
+    esac
+  fi
+
   if test "$enable_nfsdcltrack" = "yes"; then
 	AC_CHECK_HEADERS([libgen.h sys/inotify.h], ,
 		AC_MSG_ERROR([Cannot find header needed for nfsdcltrack]))
@@ -333,6 +353,7 @@ if test "$enable_nfsv4" = yes; then
   fi
 
 else
+  enable_nfsdcld="no"
   enable_nfsdcltrack="no"
 fi
 
@@ -343,6 +364,7 @@ if test "$enable_nfsv41" = yes; then
 fi
 
 dnl enable nfsidmap when its support by libnfsidmap
+AM_CONDITIONAL(CONFIG_NFSDCLD, [test "$enable_nfsdcld" = "yes" ])
 AM_CONDITIONAL(CONFIG_NFSDCLTRACK, [test "$enable_nfsdcltrack" = "yes" ])
 
 
@@ -619,6 +641,7 @@ AC_CONFIG_FILES([
 	tools/nfsconf/Makefile
 	utils/Makefile
 	utils/blkmapd/Makefile
+	utils/nfsdcld/Makefile
 	utils/nfsdcltrack/Makefile
 	utils/exportfs/Makefile
 	utils/gssd/Makefile
diff --git a/utils/Makefile.am b/utils/Makefile.am
index d361aea..674c9ad 100644
--- a/utils/Makefile.am
+++ b/utils/Makefile.am
@@ -19,6 +19,10 @@ if CONFIG_MOUNT
 OPTDIRS += mount
 endif
 
+if CONFIG_NFSDCLD
+OPTDIRS += nfsdcld
+endif
+
 if CONFIG_NFSDCLTRACK
 OPTDIRS += nfsdcltrack
 endif
diff --git a/utils/nfsdcld/Makefile.am b/utils/nfsdcld/Makefile.am
new file mode 100644
index 0000000..8239be8
--- /dev/null
+++ b/utils/nfsdcld/Makefile.am
@@ -0,0 +1,19 @@
+## Process this file with automake to produce Makefile.in
+
+# These binaries go in /sbin (not /usr/sbin), and that cannot be
+# overridden at config time. The kernel "knows" the /sbin name.
+sbindir = /sbin
+
+man8_MANS	= nfsdcld.man
+EXTRA_DIST	= $(man8_MANS)
+
+AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
+sbin_PROGRAMS	= nfsdcld
+
+nfsdcld_SOURCES = nfsdcld.c sqlite.c
+nfsdcld_LDADD = ../../support/nfs/libnfs.la $(LIBEVENT) $(LIBSQLITE) $(LIBCAP)
+
+noinst_HEADERS	= sqlite.h
+
+MAINTAINERCLEANFILES = Makefile.in
+
diff --git a/utils/nfsdcltrack/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
similarity index 100%
rename from utils/nfsdcltrack/nfsdcld.c
rename to utils/nfsdcld/nfsdcld.c
diff --git a/utils/nfsdcltrack/nfsdcld.man b/utils/nfsdcld/nfsdcld.man
similarity index 100%
rename from utils/nfsdcltrack/nfsdcld.man
rename to utils/nfsdcld/nfsdcld.man
diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
new file mode 100644
index 0000000..c59f777
--- /dev/null
+++ b/utils/nfsdcld/sqlite.c
@@ -0,0 +1,601 @@
+/*
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+/*
+ * Explanation:
+ *
+ * This file contains the code to manage the sqlite backend database for the
+ * nfsdcltrack usermodehelper upcall program.
+ *
+ * The main database is called main.sqlite and contains the following tables:
+ *
+ * parameters: simple key/value pairs for storing database info
+ *
+ * clients: an "id" column containing a BLOB with the long-form clientid as
+ * 	    sent by the client, a "time" column containing a timestamp (in
+ * 	    epoch seconds) of when the record was last updated, and a
+ * 	    "has_session" column containing a boolean value indicating
+ * 	    whether the client has sessions (v4.1+) or not (v4.0).
+ */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif /* HAVE_CONFIG_H */
+
+#include <dirent.h>
+#include <errno.h>
+#include <event.h>
+#include <stdbool.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sqlite3.h>
+#include <linux/limits.h>
+
+#include "xlog.h"
+#include "sqlite.h"
+
+#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
+
+/* in milliseconds */
+#define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
+
+/* private data structures */
+
+/* global variables */
+
+/* reusable pathname and sql command buffer */
+static char buf[PATH_MAX];
+
+/* global database handle */
+static sqlite3 *dbh;
+
+/* forward declarations */
+
+/* make a directory, ignoring EEXIST errors unless it's not a directory */
+static int
+mkdir_if_not_exist(const char *dirname)
+{
+	int ret;
+	struct stat statbuf;
+
+	ret = mkdir(dirname, S_IRWXU);
+	if (ret && errno != EEXIST)
+		return -errno;
+
+	ret = stat(dirname, &statbuf);
+	if (ret)
+		return -errno;
+
+	if (!S_ISDIR(statbuf.st_mode))
+		ret = -ENOTDIR;
+
+	return ret;
+}
+
+static int
+sqlite_query_schema_version(void)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	/* prepare select query */
+	ret = sqlite3_prepare_v2(dbh,
+		"SELECT value FROM parameters WHERE key == \"version\";",
+		 -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to prepare select statement: %s",
+			sqlite3_errmsg(dbh));
+		ret = 0;
+		goto out;
+	}
+
+	/* query schema version */
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(D_GENERAL, "Select statement execution failed: %s",
+				sqlite3_errmsg(dbh));
+		ret = 0;
+		goto out;
+	}
+
+	ret = sqlite3_column_int(stmt, 0);
+out:
+	sqlite3_finalize(stmt);
+	return ret;
+}
+
+static int
+sqlite_maindb_update_v1_to_v2(void)
+{
+	int ret, ret2;
+	char *err;
+
+	/* begin transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
+		goto rollback;
+	}
+
+	/*
+	 * Check schema version again. This time, under an exclusive
+	 * transaction to guard against racing DB setup attempts
+	 */
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case 1:
+		/* Still at v1 -- do conversion */
+		break;
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* Someone else raced in and set it up */
+		ret = 0;
+		goto rollback;
+	default:
+		/* Something went wrong -- fail! */
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	/* create v2 clients table */
+	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
+				"has_session INTEGER;",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to update clients table: %s", err);
+		goto rollback;
+	}
+
+	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET value = %d "
+			"WHERE key = \"version\";",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		goto rollback;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to update schema version: %s", err);
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
+	}
+out:
+	sqlite3_free(err);
+	return ret;
+rollback:
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
+	goto out;
+}
+
+/*
+ * Start an exclusive transaction and recheck the DB schema version. If it's
+ * still zero (indicating a new database) then set it up. If that all works,
+ * then insert schema version into the parameters table and commit the
+ * transaction. On any error, rollback the transaction.
+ */
+static int
+sqlite_maindb_init_v2(void)
+{
+	int ret, ret2;
+	char *err = NULL;
+
+	/* Start a transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
+		return ret;
+	}
+
+	/*
+	 * Check schema version again. This time, under an exclusive
+	 * transaction to guard against racing DB setup attempts
+	 */
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case 0:
+		/* Query failed again -- set up DB */
+		break;
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* Someone else raced in and set it up */
+		ret = 0;
+		goto rollback;
+	default:
+		/* Something went wrong -- fail! */
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "CREATE TABLE parameters "
+				"(key TEXT PRIMARY KEY, value TEXT);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create parameter table: %s", err);
+		goto rollback;
+	}
+
+	/* create the "clients" table */
+	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
+				"time INTEGER, has_session INTEGER);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create clients table: %s", err);
+		goto rollback;
+	}
+
+
+	/* insert version into parameters table */
+	ret = snprintf(buf, sizeof(buf), "INSERT OR FAIL INTO parameters "
+			"values (\"version\", \"%d\");",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		goto rollback;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to insert into parameter table: %s", err);
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
+	}
+out:
+	sqlite3_free(err);
+	return ret;
+
+rollback:
+	/* Attempt to rollback the transaction */
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
+	goto out;
+}
+
+/* Open the database and set up the database handle for it */
+int
+sqlite_prepare_dbh(const char *topdir)
+{
+	int ret;
+
+	/* Do nothing if the database handle is already set up */
+	if (dbh)
+		return 0;
+
+	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", topdir);
+	if (ret < 0)
+		return ret;
+
+	buf[PATH_MAX - 1] = '\0';
+
+	/* open a new DB handle */
+	ret = sqlite3_open(buf, &dbh);
+	if (ret != SQLITE_OK) {
+		/* try to create the dir */
+		ret = mkdir_if_not_exist(topdir);
+		if (ret)
+			goto out_close;
+
+		/* retry open */
+		ret = sqlite3_open(buf, &dbh);
+		if (ret != SQLITE_OK)
+			goto out_close;
+	}
+
+	/* set busy timeout */
+	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to set sqlite busy timeout: %s",
+				sqlite3_errmsg(dbh));
+		goto out_close;
+	}
+
+	ret = sqlite_query_schema_version();
+	switch (ret) {
+	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+		/* DB is already set up. Do nothing */
+		ret = 0;
+		break;
+	case 1:
+		/* Old DB -- update to new schema */
+		ret = sqlite_maindb_update_v1_to_v2();
+		if (ret)
+			goto out_close;
+		break;
+	case 0:
+		/* Query failed -- try to set up new DB */
+		ret = sqlite_maindb_init_v2();
+		if (ret)
+			goto out_close;
+		break;
+	default:
+		/* Unknown DB version -- downgrade? Fail */
+		xlog(L_ERROR, "Unsupported database schema version! "
+			"Expected %d, got %d.",
+			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION, ret);
+		ret = -EINVAL;
+		goto out_close;
+	}
+
+	return ret;
+out_close:
+	sqlite3_close(dbh);
+	dbh = NULL;
+	return ret;
+}
+
+/*
+ * Create a client record
+ *
+ * Returns a non-zero sqlite error code, or SQLITE_OK (aka 0)
+ */
+int
+sqlite_insert_client(const unsigned char *clname, const size_t namelen,
+			const bool has_session, const bool zerotime)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	if (zerotime)
+		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
+				"VALUES (?, 0, ?);", -1, &stmt, NULL);
+	else
+		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
+				"VALUES (?, strftime('%s', 'now'), ?);", -1,
+				&stmt, NULL);
+
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
+			__func__, sqlite3_errmsg(dbh));
+		return ret;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind blob failed: %s", __func__,
+				sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_int(stmt, 2, (int)has_session);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind int failed: %s", __func__,
+				sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(L_ERROR, "%s: unexpected return code from insert: %s",
+				__func__, sqlite3_errmsg(dbh));
+
+out_err:
+	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	return ret;
+}
+
+/* Remove a client record */
+int
+sqlite_remove_client(const unsigned char *clname, const size_t namelen)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	ret = sqlite3_prepare_v2(dbh, "DELETE FROM clients WHERE id==?", -1,
+				 &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: statement prepare failed: %s",
+				__func__, sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind blob failed: %s", __func__,
+				sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(L_ERROR, "%s: unexpected return code from delete: %d",
+				__func__, ret);
+
+out_err:
+	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	return ret;
+}
+
+/*
+ * Is the given clname in the clients table? If so, then update its timestamp
+ * and return success. If the record isn't present, or the update fails, then
+ * return an error.
+ */
+int
+sqlite_check_client(const unsigned char *clname, const size_t namelen,
+			const bool has_session)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
+				      "id==?", -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
+				__func__, sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind blob failed: %s",
+				__func__, sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(L_ERROR, "%s: unexpected return code from select: %d",
+				__func__, ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_column_int(stmt, 0);
+	xlog(D_GENERAL, "%s: select returned %d rows", __func__, ret);
+	if (ret != 1) {
+		ret = -EACCES;
+		goto out_err;
+	}
+
+	/* Only update timestamp for v4.0 clients */
+	if (has_session) {
+		ret = SQLITE_OK;
+		goto out_err;
+	}
+
+	sqlite3_finalize(stmt);
+	stmt = NULL;
+	ret = sqlite3_prepare_v2(dbh, "UPDATE OR FAIL clients SET "
+				      "time=strftime('%s', 'now') WHERE id==?",
+				 -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
+				__func__, sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind blob failed: %s",
+				__func__, sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(L_ERROR, "%s: unexpected return code from update: %s",
+				__func__, sqlite3_errmsg(dbh));
+
+out_err:
+	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	return ret;
+}
+
+/*
+ * remove any client records that were not reclaimed since grace_start.
+ */
+int
+sqlite_remove_unreclaimed(time_t grace_start)
+{
+	int ret;
+	char *err = NULL;
+
+	ret = snprintf(buf, sizeof(buf), "DELETE FROM clients WHERE time < %ld",
+			grace_start);
+	if (ret < 0) {
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		return ret;
+	}
+
+	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK)
+		xlog(L_ERROR, "%s: delete failed: %s", __func__, err);
+
+	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
+	sqlite3_free(err);
+	return ret;
+}
+
+/*
+ * Are there any clients that are possibly still reclaiming? Return a positive
+ * integer (usually number of clients) if so. If not, then return 0. On any
+ * error, return non-zero.
+ */
+int
+sqlite_query_reclaiming(const time_t grace_start)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+
+	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
+				      "time < ? OR has_session != 1", -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: unable to prepare select statement: %s",
+				__func__, sqlite3_errmsg(dbh));
+		return ret;
+	}
+
+	ret = sqlite3_bind_int64(stmt, 1, (sqlite3_int64)grace_start);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: bind int64 failed: %s",
+				__func__, sqlite3_errmsg(dbh));
+		return ret;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(L_ERROR, "%s: unexpected return code from select: %s",
+				__func__, sqlite3_errmsg(dbh));
+		return ret;
+	}
+
+	ret = sqlite3_column_int(stmt, 0);
+	sqlite3_finalize(stmt);
+	xlog(D_GENERAL, "%s: there are %d clients that have not completed "
+			"reclaim", __func__, ret);
+	return ret;
+}
diff --git a/utils/nfsdcld/sqlite.h b/utils/nfsdcld/sqlite.h
new file mode 100644
index 0000000..06e7c04
--- /dev/null
+++ b/utils/nfsdcld/sqlite.h
@@ -0,0 +1,32 @@
+/*
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _SQLITE_H_
+#define _SQLITE_H_
+
+int sqlite_prepare_dbh(const char *topdir);
+int sqlite_insert_client(const unsigned char *clname, const size_t namelen,
+				const bool has_session, const bool zerotime);
+int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
+int sqlite_check_client(const unsigned char *clname, const size_t namelen,
+				const bool has_session);
+int sqlite_remove_unreclaimed(const time_t grace_start);
+int sqlite_query_reclaiming(const time_t grace_start);
+
+#endif /* _SQLITE_H */
diff --git a/utils/nfsdcltrack/Makefile.am b/utils/nfsdcltrack/Makefile.am
index 0f599c0..2f7fe3d 100644
--- a/utils/nfsdcltrack/Makefile.am
+++ b/utils/nfsdcltrack/Makefile.am
@@ -4,14 +4,11 @@
 # overridden at config time. The kernel "knows" the /sbin name.
 sbindir = /sbin
 
-man8_MANS	= nfsdcld.man nfsdcltrack.man
+man8_MANS	= nfsdcltrack.man
 EXTRA_DIST	= $(man8_MANS)
 
 AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
-sbin_PROGRAMS	= nfsdcld nfsdcltrack
-
-nfsdcld_SOURCES = nfsdcld.c sqlite.c
-nfsdcld_LDADD = ../../support/nfs/libnfs.la $(LIBEVENT) $(LIBSQLITE) $(LIBCAP)
+sbin_PROGRAMS	= nfsdcltrack
 
 noinst_HEADERS	= sqlite.h
 
-- 
2.17.1



* [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 1/7] Revert "nfsdcltrack: remove the nfsdcld daemon" Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 2/7] nfsdcld: move nfsdcld to its own directory Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-10 14:16   ` Jeff Layton
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 4/7] nfsdcld: remove some unused functions Scott Mayhew
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

1) Adopt the concept of "reboot epochs" (but not coordinated grace
periods via the "need" and "enforcing" flags) from Jeff Layton's
"Active/Active NFS Server Recovery" presentation from the Fall 2018 NFS
Bakeathon.  See
http://nfsv4bat.org/Documents/BakeAThon/2018/Active_Active%20NFS%20Server%20Recovery.pdf

- add a new table "grace" which contains two integer columns
  representing the "current" epoch (where new client records are stored)
  and the "recovery" epoch (which has the records for clients that are
  allowed to recover)
- replace the "clients" table with table(s) named "rec-CCCCCCCCCCCCCCCC"
  (where the sixteen Cs are the zero-padded hex value of the epoch),
  containing a single column "id" which stores the client id string
- when going from normal operation into grace, the current epoch becomes
  the recovery epoch, the current epoch is incremented, and a new table
  is created for the current epoch.  Clients are allowed to reclaim if
  they have a record in the table corresponding to the recovery epoch;
  new records are added to the table corresponding to the current
  epoch.
- when moving from grace back to normal operation, the table associated
  with the recovery epoch is deleted and the recovery epoch becomes
  zero.
- if the server restarts before exiting the previous grace period, then
  the epochs are not changed, and all records in the table associated
  with the "current" epoch are cleared out.
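
The epoch transitions listed above can be sketched in memory, without
sqlite.  This is only an illustration: the names epoch_state,
epoch_table_name, epoch_grace_start and epoch_grace_done are hypothetical
and do not appear in the patch, and the lowercase hex digits in the table
name are an assumption.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>	/* only needed by callers that compare table names */

/* Hypothetical in-memory stand-in for the single row of the "grace" table. */
struct epoch_state {
	uint64_t current;	/* epoch that receives new client records */
	uint64_t recovery;	/* epoch clients may reclaim from; 0 = not in grace */
};

/* Format the per-epoch table name: "rec-" followed by 16 hex digits. */
static int
epoch_table_name(char *out, size_t len, uint64_t epoch)
{
	int n = snprintf(out, len, "rec-%016" PRIx64, epoch);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Normal operation -> grace.  If recovery is already non-zero, the server
 * restarted before the previous grace period ended, so the epochs stay as
 * they are (a real daemon would also clear the current epoch's table).
 * Returns 1 if the epochs advanced, 0 if they were left alone.
 */
static int
epoch_grace_start(struct epoch_state *st)
{
	if (st->recovery != 0)
		return 0;
	st->recovery = st->current;
	st->current++;
	return 1;
}

/* Grace -> normal operation: the recovery epoch is retired. */
static void
epoch_grace_done(struct epoch_state *st)
{
	st->recovery = 0;
}
```

Starting from the initial state (current = 1, recovery = 0), one call to
epoch_grace_start() yields current = 2 and recovery = 1, so reclaims are
checked against the table for epoch 1 while new records go to the table
for epoch 2.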

2) Allow knfsd to "slurp" the client records during startup.

During client tracking initialization, knfsd will do an upcall to get a
list of clients from the database.  nfsdcld will do one downcall with a
status of -EINPROGRESS for each client record in the database, followed
by a final downcall with a status of 0.  This allows two things:

- knfsd can check whether a client is allowed to reclaim without
  performing an upcall to nfsdcld
- knfsd can decide to end the grace period early by tracking the number
  of RECLAIM_COMPLETE operations it receives from "known" clients, or
  it can skip the grace period altogether if no clients are allowed
  to reclaim.
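
The receiving side of this exchange can be modelled roughly as follows.
This is a sketch under assumptions, not the kernel implementation: the
slurp_state type and the slurp_* functions are hypothetical, and plain
counters stand in for the hash table that nfsd would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver state for the client-list exchange. */
struct slurp_state {
	size_t known;		/* records received with status -EINPROGRESS */
	size_t completed;	/* RECLAIM_COMPLETE seen from known clients */
	bool done;		/* final status-0 downcall has arrived */
};

/* Feed one downcall status; returns 0 if accepted, -1 if unexpected. */
static int
slurp_downcall(struct slurp_state *st, int status)
{
	if (st->done)
		return -1;	/* nothing may follow the final downcall */
	if (status == -EINPROGRESS) {
		st->known++;	/* one more client allowed to reclaim */
		return 0;
	}
	if (status == 0) {
		st->done = true;
		return 0;
	}
	return -1;
}

/* Record a RECLAIM_COMPLETE from a client that is on the known list. */
static void
slurp_reclaim_complete(struct slurp_state *st)
{
	if (st->completed < st->known)
		st->completed++;
}

/*
 * Grace can end once the list is complete and every known client has
 * finished reclaiming.  With zero known clients this is true right away,
 * which corresponds to skipping the grace period.
 */
static bool
slurp_can_end_grace(const struct slurp_state *st)
{
	return st->done && st->completed == st->known;
}
```

With two client records, the sequence is two -EINPROGRESS downcalls, one
downcall with status 0, and then two RECLAIM_COMPLETE operations before
slurp_can_end_grace() returns true.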

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 support/include/cld.h        |   1 +
 utils/nfsdcld/Makefile.am    |   2 +-
 utils/nfsdcld/cld-internal.h |  30 +++
 utils/nfsdcld/nfsdcld.c      | 160 +++++++++++-
 utils/nfsdcld/sqlite.c       | 483 ++++++++++++++++++++++++++++-------
 utils/nfsdcld/sqlite.h       |  11 +-
 6 files changed, 579 insertions(+), 108 deletions(-)
 create mode 100644 utils/nfsdcld/cld-internal.h

diff --git a/support/include/cld.h b/support/include/cld.h
index f14a9ab..c1f5b70 100644
--- a/support/include/cld.h
+++ b/support/include/cld.h
@@ -33,6 +33,7 @@ enum cld_command {
 	Cld_Remove,		/* remove record of this cm_id */
 	Cld_Check,		/* is this cm_id allowed? */
 	Cld_GraceDone,		/* grace period is complete */
+	Cld_GraceStart,
 };
 
 /* representation of long-form NFSv4 client ID */
diff --git a/utils/nfsdcld/Makefile.am b/utils/nfsdcld/Makefile.am
index 8239be8..d1da749 100644
--- a/utils/nfsdcld/Makefile.am
+++ b/utils/nfsdcld/Makefile.am
@@ -13,7 +13,7 @@ sbin_PROGRAMS	= nfsdcld
 nfsdcld_SOURCES = nfsdcld.c sqlite.c
 nfsdcld_LDADD = ../../support/nfs/libnfs.la $(LIBEVENT) $(LIBSQLITE) $(LIBCAP)
 
-noinst_HEADERS	= sqlite.h
+noinst_HEADERS	= sqlite.h cld-internal.h
 
 MAINTAINERCLEANFILES = Makefile.in
 
diff --git a/utils/nfsdcld/cld-internal.h b/utils/nfsdcld/cld-internal.h
new file mode 100644
index 0000000..a90cced
--- /dev/null
+++ b/utils/nfsdcld/cld-internal.h
@@ -0,0 +1,30 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _CLD_INTERNAL_H_
+#define _CLD_INTERNAL_H_
+
+struct cld_client {
+	int			cl_fd;
+	struct event		cl_event;
+	struct cld_msg	cl_msg;
+};
+
+uint64_t current_epoch;
+uint64_t recovery_epoch;
+
+#endif /* _CLD_INTERNAL_H_ */
diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
index 082f3ab..9b1ad98 100644
--- a/utils/nfsdcld/nfsdcld.c
+++ b/utils/nfsdcld/nfsdcld.c
@@ -42,7 +42,9 @@
 #include "xlog.h"
 #include "nfslib.h"
 #include "cld.h"
+#include "cld-internal.h"
 #include "sqlite.h"
+#include "../mount/version.h"
 
 #ifndef PIPEFS_DIR
 #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
@@ -54,19 +56,17 @@
 #define CLD_DEFAULT_STORAGEDIR NFS_STATEDIR "/nfsdcld"
 #endif
 
+#define NFSD_END_GRACE_FILE "/proc/fs/nfsd/v4_end_grace"
+
 #define UPCALL_VERSION		1
 
 /* private data structures */
-struct cld_client {
-	int			cl_fd;
-	struct event		cl_event;
-	struct cld_msg	cl_msg;
-};
 
 /* global variables */
 static char *pipepath = DEFAULT_CLD_PATH;
 static int 		inotify_fd = -1;
 static struct event	pipedir_event;
+static bool old_kernel = false;
 
 static struct option longopts[] =
 {
@@ -298,6 +298,43 @@ out:
 	return ret;
 }
 
+/*
+ * Older kernels will not tell nfsdcld when a grace period has started.
+ * Therefore we have to peek at the /proc/fs/nfsd/v4_end_grace file to
+ * see if nfsd is in grace.  We have to do this for create and remove
+ * upcalls to ensure that the correct table is being updated - otherwise
+ * we could lose client records when the grace period is lifted.
+ */
+static int
+cld_check_grace_period(void)
+{
+	int fd, ret = 0;
+	char c;
+
+	if (!old_kernel)
+		return 0;
+	if (recovery_epoch != 0)
+		return 0;
+	fd = open(NFSD_END_GRACE_FILE, O_RDONLY);
+	if (fd < 0) {
+		xlog(L_WARNING, "Unable to open %s: %m",
+			NFSD_END_GRACE_FILE);
+		return 1;
+	}
+	if (read(fd, &c, 1) < 0) {
+		xlog(L_WARNING, "Unable to read from %s: %m", NFSD_END_GRACE_FILE);
+		close(fd);	/* don't leak the descriptor on the error path */
+		return 1;
+	}
+	close(fd);
+	if (c == 'N') {
+		xlog(L_WARNING, "nfsd is in grace but didn't send a gracestart upcall, "
+			"please update the kernel");
+		ret = sqlite_grace_start();
+	}
+	return ret;
+}
+
 static void
 cld_not_implemented(struct cld_client *clnt)
 {
@@ -332,14 +369,17 @@ cld_create(struct cld_client *clnt)
 	ssize_t bsize, wsize;
 	struct cld_msg *cmsg = &clnt->cl_msg;
 
+	ret = cld_check_grace_period();
+	if (ret)
+		goto reply;
+
 	xlog(D_GENERAL, "%s: create client record.", __func__);
 
 
 	ret = sqlite_insert_client(cmsg->cm_u.cm_name.cn_id,
-				   cmsg->cm_u.cm_name.cn_len,
-				   false,
-				   false);
+				   cmsg->cm_u.cm_name.cn_len);
 
+reply:
 	cmsg->cm_status = ret ? -EREMOTEIO : ret;
 
 	bsize = sizeof(*cmsg);
@@ -365,11 +405,16 @@ cld_remove(struct cld_client *clnt)
 	ssize_t bsize, wsize;
 	struct cld_msg *cmsg = &clnt->cl_msg;
 
+	ret = cld_check_grace_period();
+	if (ret)
+		goto reply;
+
 	xlog(D_GENERAL, "%s: remove client record.", __func__);
 
 	ret = sqlite_remove_client(cmsg->cm_u.cm_name.cn_id,
 				   cmsg->cm_u.cm_name.cn_len);
 
+reply:
 	cmsg->cm_status = ret ? -EREMOTEIO : ret;
 
 	bsize = sizeof(*cmsg);
@@ -396,12 +441,26 @@ cld_check(struct cld_client *clnt)
 	ssize_t bsize, wsize;
 	struct cld_msg *cmsg = &clnt->cl_msg;
 
+	/*
+	 * If we get a check upcall at all, it means we're talking to an old
+	 * kernel.  Furthermore, if we're not in grace it means this is the
+	 * first client to do a reclaim.  Log a message and use
+	 * sqlite_grace_start() to advance the epoch numbers.
+	 */
+	if (recovery_epoch == 0) {
+		xlog(D_GENERAL, "%s: received a check upcall, please update the kernel",
+			__func__);
+		ret = sqlite_grace_start();
+		if (ret)
+			goto reply;
+	}
+
 	xlog(D_GENERAL, "%s: check client record", __func__);
 
 	ret = sqlite_check_client(cmsg->cm_u.cm_name.cn_id,
-				  cmsg->cm_u.cm_name.cn_len,
-				  false);
+				  cmsg->cm_u.cm_name.cn_len);
 
+reply:
 	/* set up reply */
 	cmsg->cm_status = ret ? -EACCES : ret;
 
@@ -429,11 +488,27 @@ cld_gracedone(struct cld_client *clnt)
 	ssize_t bsize, wsize;
 	struct cld_msg *cmsg = &clnt->cl_msg;
 
-	xlog(D_GENERAL, "%s: grace done. cm_gracetime=%ld", __func__,
-			cmsg->cm_u.cm_gracetime);
+	/*
+	 * If we got a "gracedone" upcall while we're not in grace, then
+	 * 1) we must be talking to an old kernel
+	 * 2) no clients attempted to reclaim
+	 * In that case, log a message and use sqlite_grace_start() to
+	 * advance the epoch numbers, and then proceed as normal.
+	 */
+	if (recovery_epoch == 0) {
+		xlog(D_GENERAL, "%s: received gracedone upcall "
+			"while not in grace, please update the kernel",
+			__func__);
+		ret = sqlite_grace_start();
+		if (ret)
+			goto reply;
+	}
+
+	xlog(D_GENERAL, "%s: grace done.", __func__);
 
-	ret = sqlite_remove_unreclaimed(cmsg->cm_u.cm_gracetime);
+	ret = sqlite_grace_done();
 
+reply:
 	/* set up reply: downcall with 0 status */
 	cmsg->cm_status = ret ? -EREMOTEIO : ret;
 
@@ -453,6 +528,59 @@ cld_gracedone(struct cld_client *clnt)
 	}
 }
 
+static int
+gracestart_callback(struct cld_client *clnt) {
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	cmsg->cm_status = -EINPROGRESS;
+
+	bsize = sizeof(struct cld_msg);
+
+	xlog(D_GENERAL, "Sending client %.*s",
+			cmsg->cm_u.cm_name.cn_len, cmsg->cm_u.cm_name.cn_id);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize)
+		return -EIO;
+	return 0;
+}
+
+static void
+cld_gracestart(struct cld_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: updating grace epochs", __func__);
+
+	ret = sqlite_grace_start();
+	if (ret)
+		goto reply;
+
+	xlog(D_GENERAL, "%s: sending client records to the kernel", __func__);
+
+	ret = sqlite_iterate_recovery(&gracestart_callback, clnt);
+
+reply:
+	/* set up reply: downcall with 0 status */
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+
+	bsize = sizeof(struct cld_msg);
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
+			 __func__, wsize);
+		ret = cld_pipe_open(clnt);
+		if (ret) {
+			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
+					__func__, ret);
+			exit(ret);
+		}
+	}
+}
+
 static void
 cldcb(int UNUSED(fd), short which, void *data)
 {
@@ -490,6 +618,9 @@ cldcb(int UNUSED(fd), short which, void *data)
 	case Cld_GraceDone:
 		cld_gracedone(clnt);
 		break;
+	case Cld_GraceStart:
+		cld_gracestart(clnt);
+		break;
 	default:
 		xlog(L_WARNING, "%s: command %u is not yet implemented",
 				__func__, cmsg->cm_cmd);
@@ -586,6 +717,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	if (linux_version_code() < MAKE_VERSION(4, 20, 0))
+		old_kernel = true;
+
 	/* set up storage db */
 	rc = sqlite_prepare_dbh(storagedir);
 	if (rc) {
diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index c59f777..67549c9 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -21,17 +21,24 @@
  * Explanation:
  *
  * This file contains the code to manage the sqlite backend database for the
- * nfsdcltrack usermodehelper upcall program.
+ * nfsdcld client tracking daemon.
  *
  * The main database is called main.sqlite and contains the following tables:
  *
  * parameters: simple key/value pairs for storing database info
  *
- * clients: an "id" column containing a BLOB with the long-form clientid as
- * 	    sent by the client, a "time" column containing a timestamp (in
- * 	    epoch seconds) of when the record was last updated, and a
- * 	    "has_session" column containing a boolean value indicating
- * 	    whether the client has sessions (v4.1+) or not (v4.0).
+ * grace: a "current" column containing an INTEGER representing the current
+ *        epoch (where should new values be stored) and a "recovery" column
+ *        containing an INTEGER representing the recovery epoch (from what
+ *        epoch are we allowed to recover).  A recovery epoch of 0 means
+ *        normal operation (grace period not in force).  Note: sqlite stores
+ *        integers as signed values, so these must be cast to a uint64_t when
+ *        retrieving them from the database and back to an int64_t when storing
+ *        them in the database.
+ *
+ * rec-CCCCCCCCCCCCCCCC (where C is the hex representation of the epoch value):
+ *        a single "id" column containing a BLOB with the long-form clientid
+ *        as sent by the client.
  */
 
 #ifdef HAVE_CONFIG_H
@@ -47,16 +54,21 @@
 #include <sys/types.h>
 #include <fcntl.h>
 #include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <limits.h>
 #include <sqlite3.h>
 #include <linux/limits.h>
 
 #include "xlog.h"
 #include "sqlite.h"
+#include "cld.h"
+#include "cld-internal.h"
 
-#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
+#define CLD_SQLITE_LATEST_SCHEMA_VERSION 3
 
 /* in milliseconds */
-#define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
+#define CLD_SQLITE_BUSY_TIMEOUT 10000
 
 /* private data structures */
 
@@ -124,7 +136,7 @@ out:
 }
 
 static int
-sqlite_maindb_update_v1_to_v2(void)
+sqlite_maindb_update_schema(int oldversion)
 {
 	int ret, ret2;
 	char *err;
@@ -142,32 +154,66 @@ sqlite_maindb_update_v1_to_v2(void)
 	 * transaction to guard against racing DB setup attempts
 	 */
 	ret = sqlite_query_schema_version();
-	switch (ret) {
-	case 1:
-		/* Still at v1 -- do conversion */
-		break;
-	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
-		/* Someone else raced in and set it up */
-		ret = 0;
+	if (ret != oldversion) {
+		if (ret == CLD_SQLITE_LATEST_SCHEMA_VERSION)
+			/* Someone else raced in and set it up */
+			ret = 0;
+		else
+			/* Something went wrong -- fail! */
+			ret = -EINVAL;
 		goto rollback;
-	default:
-		/* Something went wrong -- fail! */
-		ret = -EINVAL;
+	}
+
+	/* Still at old version -- do conversion */
+
+	/* create grace table */
+	ret = sqlite3_exec(dbh, "CREATE TABLE grace "
+				"(current INTEGER , recovery INTEGER);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create grace table: %s", err);
+		goto rollback;
+	}
+
+	/* insert initial epochs into grace table */
+	ret = sqlite3_exec(dbh, "INSERT OR FAIL INTO grace "
+				"values (1, 0);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to set initial epochs: %s", err);
+		goto rollback;
+	}
+
+	/* create recovery table for current epoch */
+	ret = sqlite3_exec(dbh, "CREATE TABLE \"rec-0000000000000001\" "
+				"(id BLOB PRIMARY KEY);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create recovery table "
+				"for current epoch: %s", err);
+		goto rollback;
+	}
+
+	/* copy records from old clients table */
+	ret = sqlite3_exec(dbh, "INSERT INTO \"rec-0000000000000001\" "
+				"SELECT id FROM clients;",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to copy client records: %s", err);
 		goto rollback;
 	}
 
-	/* create v2 clients table */
-	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
-				"has_session INTEGER;",
+	/* drop the old clients table */
+	ret = sqlite3_exec(dbh, "DROP TABLE clients;",
 				NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to update clients table: %s", err);
+		xlog(L_ERROR, "Unable to drop old clients table: %s", err);
 		goto rollback;
 	}
 
 	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET value = %d "
 			"WHERE key = \"version\";",
-			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
+			CLD_SQLITE_LATEST_SCHEMA_VERSION);
 	if (ret < 0) {
 		xlog(L_ERROR, "sprintf failed!");
 		goto rollback;
@@ -205,7 +251,7 @@ rollback:
  * transaction. On any error, rollback the transaction.
  */
 static int
-sqlite_maindb_init_v2(void)
+sqlite_maindb_init_v3(void)
 {
 	int ret, ret2;
 	char *err = NULL;
@@ -227,7 +273,7 @@ sqlite_maindb_init_v2(void)
 	case 0:
 		/* Query failed again -- set up DB */
 		break;
-	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+	case CLD_SQLITE_LATEST_SCHEMA_VERSION:
 		/* Someone else raced in and set it up */
 		ret = 0;
 		goto rollback;
@@ -245,20 +291,38 @@ sqlite_maindb_init_v2(void)
 		goto rollback;
 	}
 
-	/* create the "clients" table */
-	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
-				"time INTEGER, has_session INTEGER);",
+	/* create grace table */
+	ret = sqlite3_exec(dbh, "CREATE TABLE grace "
+				"(current INTEGER , recovery INTEGER);",
 				NULL, NULL, &err);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "Unable to create clients table: %s", err);
+		xlog(L_ERROR, "Unable to create grace table: %s", err);
 		goto rollback;
 	}
 
+	/* insert initial epochs into grace table */
+	ret = sqlite3_exec(dbh, "INSERT OR FAIL INTO grace "
+				"values (1, 0);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to set initial epochs: %s", err);
+		goto rollback;
+	}
+
+	/* create recovery table for current epoch */
+	ret = sqlite3_exec(dbh, "CREATE TABLE \"rec-0000000000000001\" "
+				"(id BLOB PRIMARY KEY);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create recovery table "
+				"for current epoch: %s", err);
+		goto rollback;
+	}
 
 	/* insert version into parameters table */
 	ret = snprintf(buf, sizeof(buf), "INSERT OR FAIL INTO parameters "
 			"values (\"version\", \"%d\");",
-			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
+			CLD_SQLITE_LATEST_SCHEMA_VERSION);
 	if (ret < 0) {
 		xlog(L_ERROR, "sprintf failed!");
 		goto rollback;
@@ -291,6 +355,42 @@ rollback:
 	goto out;
 }
 
+static int
+sqlite_startup_query_grace(void)
+{
+	int ret;
+	uint64_t tcur;
+	uint64_t trec;
+	sqlite3_stmt *stmt = NULL;
+
+	/* prepare select query */
+	ret = sqlite3_prepare_v2(dbh, "SELECT * FROM grace;", -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to prepare select statement: %s",
+			sqlite3_errmsg(dbh));
+		goto out;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(D_GENERAL, "Select statement execution failed: %s",
+				sqlite3_errmsg(dbh));
+		goto out;
+	}
+
+	tcur = (uint64_t)sqlite3_column_int64(stmt, 0);
+	trec = (uint64_t)sqlite3_column_int64(stmt, 1);
+
+	current_epoch = tcur;
+	recovery_epoch = trec;
+	ret = 0;
+	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
+		__func__, current_epoch, recovery_epoch);
+out:
+	sqlite3_finalize(stmt);
+	return ret;
+}
+
 /* Open the database and set up the database handle for it */
 int
 sqlite_prepare_dbh(const char *topdir)
@@ -322,7 +422,7 @@ sqlite_prepare_dbh(const char *topdir)
 	}
 
 	/* set busy timeout */
-	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
+	ret = sqlite3_busy_timeout(dbh, CLD_SQLITE_BUSY_TIMEOUT);
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "Unable to set sqlite busy timeout: %s",
 				sqlite3_errmsg(dbh));
@@ -331,19 +431,26 @@ sqlite_prepare_dbh(const char *topdir)
 
 	ret = sqlite_query_schema_version();
 	switch (ret) {
-	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
+	case CLD_SQLITE_LATEST_SCHEMA_VERSION:
 		/* DB is already set up. Do nothing */
 		ret = 0;
 		break;
+	case 2:
+		/* Old DB -- update to new schema */
+		ret = sqlite_maindb_update_schema(2);
+		if (ret)
+			goto out_close;
+		break;
+
 	case 1:
 		/* Old DB -- update to new schema */
-		ret = sqlite_maindb_update_v1_to_v2();
+		ret = sqlite_maindb_update_schema(1);
 		if (ret)
 			goto out_close;
 		break;
 	case 0:
 		/* Query failed -- try to set up new DB */
-		ret = sqlite_maindb_init_v2();
+		ret = sqlite_maindb_init_v3();
 		if (ret)
 			goto out_close;
 		break;
@@ -351,11 +458,13 @@ sqlite_prepare_dbh(const char *topdir)
 		/* Unknown DB version -- downgrade? Fail */
 		xlog(L_ERROR, "Unsupported database schema version! "
 			"Expected %d, got %d.",
-			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION, ret);
+			CLD_SQLITE_LATEST_SCHEMA_VERSION, ret);
 		ret = -EINVAL;
 		goto out_close;
 	}
 
+	ret = sqlite_startup_query_grace();
+
 	return ret;
 out_close:
 	sqlite3_close(dbh);
@@ -369,20 +478,22 @@ out_close:
  * Returns a non-zero sqlite error code, or SQLITE_OK (aka 0)
  */
 int
-sqlite_insert_client(const unsigned char *clname, const size_t namelen,
-			const bool has_session, const bool zerotime)
+sqlite_insert_client(const unsigned char *clname, const size_t namelen)
 {
 	int ret;
 	sqlite3_stmt *stmt = NULL;
 
-	if (zerotime)
-		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
-				"VALUES (?, 0, ?);", -1, &stmt, NULL);
-	else
-		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
-				"VALUES (?, strftime('%s', 'now'), ?);", -1,
-				&stmt, NULL);
+	ret = snprintf(buf, sizeof(buf), "INSERT OR REPLACE INTO \"rec-%016lx\" "
+				"VALUES (?);", current_epoch);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		return -EINVAL;
+	}
 
+	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
 			__func__, sqlite3_errmsg(dbh));
@@ -397,13 +508,6 @@ sqlite_insert_client(const unsigned char *clname, const size_t namelen,
 		goto out_err;
 	}
 
-	ret = sqlite3_bind_int(stmt, 2, (int)has_session);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: bind int failed: %s", __func__,
-				sqlite3_errmsg(dbh));
-		goto out_err;
-	}
-
 	ret = sqlite3_step(stmt);
 	if (ret == SQLITE_DONE)
 		ret = SQLITE_OK;
@@ -424,8 +528,18 @@ sqlite_remove_client(const unsigned char *clname, const size_t namelen)
 	int ret;
 	sqlite3_stmt *stmt = NULL;
 
-	ret = sqlite3_prepare_v2(dbh, "DELETE FROM clients WHERE id==?", -1,
-				 &stmt, NULL);
+	ret = snprintf(buf, sizeof(buf), "DELETE FROM \"rec-%016lx\" "
+				"WHERE id==?;", current_epoch);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		return -EINVAL;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
+
 	if (ret != SQLITE_OK) {
 		xlog(L_ERROR, "%s: statement prepare failed: %s",
 				__func__, sqlite3_errmsg(dbh));
@@ -459,18 +573,26 @@ out_err:
  * return an error.
  */
 int
-sqlite_check_client(const unsigned char *clname, const size_t namelen,
-			const bool has_session)
+sqlite_check_client(const unsigned char *clname, const size_t namelen)
 {
 	int ret;
 	sqlite3_stmt *stmt = NULL;
 
-	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
-				      "id==?", -1, &stmt, NULL);
+	ret = snprintf(buf, sizeof(buf), "SELECT count(*) FROM  \"rec-%016lx\" "
+				"WHERE id==?;", recovery_epoch);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		return -EINVAL;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
 	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
-				__func__, sqlite3_errmsg(dbh));
-		goto out_err;
+		xlog(L_ERROR, "%s: select statement prepare failed: %s",
+			__func__, sqlite3_errmsg(dbh));
+		return ret;
 	}
 
 	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
@@ -495,37 +617,10 @@ sqlite_check_client(const unsigned char *clname, const size_t namelen,
 		goto out_err;
 	}
 
-	/* Only update timestamp for v4.0 clients */
-	if (has_session) {
-		ret = SQLITE_OK;
-		goto out_err;
-	}
-
 	sqlite3_finalize(stmt);
-	stmt = NULL;
-	ret = sqlite3_prepare_v2(dbh, "UPDATE OR FAIL clients SET "
-				      "time=strftime('%s', 'now') WHERE id==?",
-				 -1, &stmt, NULL);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
-				__func__, sqlite3_errmsg(dbh));
-		goto out_err;
-	}
 
-	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
-				SQLITE_STATIC);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: bind blob failed: %s",
-				__func__, sqlite3_errmsg(dbh));
-		goto out_err;
-	}
-
-	ret = sqlite3_step(stmt);
-	if (ret == SQLITE_DONE)
-		ret = SQLITE_OK;
-	else
-		xlog(L_ERROR, "%s: unexpected return code from update: %s",
-				__func__, sqlite3_errmsg(dbh));
+	/* Now insert the client into the table for the current epoch */
+	return sqlite_insert_client(clname, namelen);
 
 out_err:
 	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
@@ -599,3 +694,211 @@ sqlite_query_reclaiming(const time_t grace_start)
 			"reclaim", __func__, ret);
 	return ret;
 }
+
+int
+sqlite_grace_start(void)
+{
+	int ret, ret2;
+	char *err = NULL;
+	uint64_t tcur = current_epoch;
+	uint64_t trec = recovery_epoch;
+
+	/* begin transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
+		goto rollback;
+	}
+
+	if (trec == 0) {
+		/*
+		 * A normal grace start - update the epoch values in the grace
+		 * table and create a new table for the current reboot epoch.
+		 */
+		trec = tcur;
+		tcur++;
+
+		ret = snprintf(buf, sizeof(buf), "UPDATE grace "
+				"SET current = %ld, recovery = %ld;",
+				(int64_t)tcur, (int64_t)trec);
+		if (ret < 0) {
+			xlog(L_ERROR, "sprintf failed!");
+			goto rollback;
+		} else if ((size_t)ret >= sizeof(buf)) {
+			xlog(L_ERROR, "sprintf output too long! (%d chars)",
+				ret);
+			ret = -EINVAL;
+			goto rollback;
+		}
+
+		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+		if (ret != SQLITE_OK) {
+			xlog(L_ERROR, "Unable to update epochs: %s", err);
+			goto rollback;
+		}
+
+		ret = snprintf(buf, sizeof(buf), "CREATE TABLE \"rec-%016lx\" "
+				"(id BLOB PRIMARY KEY);",
+				tcur);
+		if (ret < 0) {
+			xlog(L_ERROR, "sprintf failed!");
+			goto rollback;
+		} else if ((size_t)ret >= sizeof(buf)) {
+			xlog(L_ERROR, "sprintf output too long! (%d chars)",
+				ret);
+			ret = -EINVAL;
+			goto rollback;
+		}
+
+		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+		if (ret != SQLITE_OK) {
+			xlog(L_ERROR, "Unable to create table for current epoch: %s",
+				err);
+			goto rollback;
+		}
+	} else {
+		/* Server restarted while in grace - don't update the epoch
+		 * values in the grace table, just clear out the records for
+		 * the current reboot epoch.
+		 */
+		ret = snprintf(buf, sizeof(buf), "DELETE FROM \"rec-%016lx\";",
+				tcur);
+		if (ret < 0) {
+			xlog(L_ERROR, "sprintf failed!");
+			goto rollback;
+		} else if ((size_t)ret >= sizeof(buf)) {
+			xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+			ret = -EINVAL;
+			goto rollback;
+		}
+
+		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+		if (ret != SQLITE_OK) {
+			xlog(L_ERROR, "Unable to clear table for current epoch: %s",
+				err);
+			goto rollback;
+		}
+	}
+
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
+	}
+
+	current_epoch = tcur;
+	recovery_epoch = trec;
+	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
+		__func__, current_epoch, recovery_epoch);
+
+out:
+	sqlite3_free(err);
+	return ret;
+rollback:
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
+	goto out;
+}
+
+int
+sqlite_grace_done(void)
+{
+	int ret, ret2;
+	char *err = NULL;
+
+	/* begin transaction */
+	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
+				&err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to begin transaction: %s", err);
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "UPDATE grace SET recovery = 0;",
+			NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to clear recovery epoch: %s", err);
+		goto rollback;
+	}
+
+	ret = snprintf(buf, sizeof(buf), "DROP TABLE \"rec-%016lx\";",
+		recovery_epoch);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		goto rollback;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		ret = -EINVAL;
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to drop table for recovery epoch: %s",
+			err);
+		goto rollback;
+	}
+
+	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to commit transaction: %s", err);
+		goto rollback;
+	}
+
+	recovery_epoch = 0;
+	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
+		__func__, current_epoch, recovery_epoch);
+
+out:
+	sqlite3_free(err);
+	return ret;
+rollback:
+	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
+	if (ret2 != SQLITE_OK)
+		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
+	goto out;
+}
+
+
+int
+sqlite_iterate_recovery(int (*cb)(struct cld_client *clnt), struct cld_client *clnt)
+{
+	int ret;
+	sqlite3_stmt *stmt = NULL;
+	struct cld_msg *cmsg = &clnt->cl_msg;
+
+	if (recovery_epoch == 0) {
+		xlog(D_GENERAL, "%s: not in grace!", __func__);
+		return -EINVAL;
+	}
+
+	ret = snprintf(buf, sizeof(buf), "SELECT * FROM \"rec-%016lx\";",
+		recovery_epoch);
+	if (ret < 0) {
+		xlog(L_ERROR, "sprintf failed!");
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
+		return -EINVAL;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "%s: select statement prepare failed: %s",
+			__func__, sqlite3_errmsg(dbh));
+		return ret;
+	}
+
+	while ((ret = sqlite3_step(stmt)) == SQLITE_ROW) {
+		memcpy(&cmsg->cm_u.cm_name.cn_id, sqlite3_column_blob(stmt, 0),
+			NFS4_OPAQUE_LIMIT);
+		cmsg->cm_u.cm_name.cn_len = sqlite3_column_bytes(stmt, 0);
+		cb(clnt);
+	}
+	if (ret == SQLITE_DONE)
+		ret = 0;
+	sqlite3_finalize(stmt);
+	return ret;
+}
diff --git a/utils/nfsdcld/sqlite.h b/utils/nfsdcld/sqlite.h
index 06e7c04..5c56f75 100644
--- a/utils/nfsdcld/sqlite.h
+++ b/utils/nfsdcld/sqlite.h
@@ -20,13 +20,16 @@
 #ifndef _SQLITE_H_
 #define _SQLITE_H_
 
+struct cld_client;
+
 int sqlite_prepare_dbh(const char *topdir);
-int sqlite_insert_client(const unsigned char *clname, const size_t namelen,
-				const bool has_session, const bool zerotime);
+int sqlite_insert_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
-int sqlite_check_client(const unsigned char *clname, const size_t namelen,
-				const bool has_session);
+int sqlite_check_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_unreclaimed(const time_t grace_start);
 int sqlite_query_reclaiming(const time_t grace_start);
+int sqlite_grace_start(void);
+int sqlite_grace_done(void);
+int sqlite_iterate_recovery(int (*cb)(struct cld_client *clnt), struct cld_client *clnt);
 
 #endif /* _SQLITE_H */
-- 
2.17.1
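
Editor's note on the epoch scheme in the patch above: the "grace" table
holds one (current, recovery) pair and each epoch gets its own
"rec-<16 hex digits>" table.  The transitions made by
sqlite_grace_start() and sqlite_grace_done() can be sketched without
sqlite as below.  This is only an illustration under assumptions: the
struct and the function names rec_table_name(), grace_start() and
grace_done() are hypothetical, and PRIx64 is used here where the patch
itself formats with "%016lx".

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory stand-in for the single row of the "grace" table. */
struct epochs {
	uint64_t current;	/* epoch whose table receives new records */
	uint64_t recovery;	/* epoch clients may reclaim from; 0 = not in grace */
};

/*
 * Format the per-epoch table name, e.g. "rec-0000000000000001".
 * Returns 0 on success, -1 if the name does not fit in the buffer.
 */
static int rec_table_name(char *out, size_t len, uint64_t epoch)
{
	int n;

	assert(out != NULL);
	n = snprintf(out, len, "rec-%016" PRIx64, epoch);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Follows the two branches of sqlite_grace_start() in the patch. */
static void grace_start(struct epochs *e)
{
	if (e->recovery == 0) {
		/* normal start: old current becomes recovery, open a new epoch */
		e->recovery = e->current;
		e->current++;
	}
	/*
	 * else: restarted while still in grace.  The epochs stay as they
	 * are; the patch only clears the current epoch's records.
	 */
}

/* Follows sqlite_grace_done(): recovery table dropped, recovery reset to 0. */
static void grace_done(struct epochs *e)
{
	e->recovery = 0;
}
```

Starting from the initial row (1, 0), one grace_start() gives
current=2, recovery=1; a second grace_start() before grace_done()
leaves both values unchanged, which matches the "server restarted while
in grace" branch.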



* [nfs-utils PATCH RFC 4/7] nfsdcld: remove some unused functions
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
                   ` (2 preceding siblings ...)
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 5/7] nfsdcld: the -p option should specify the rpc_pipefs mountpoint Scott Mayhew
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

Get rid of sqlite_query_reclaiming() and sqlite_remove_unreclaimed(),
which are not used.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 utils/nfsdcld/sqlite.c | 67 ------------------------------------------
 utils/nfsdcld/sqlite.h |  2 --
 2 files changed, 69 deletions(-)

diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
index 67549c9..71187f7 100644
--- a/utils/nfsdcld/sqlite.c
+++ b/utils/nfsdcld/sqlite.c
@@ -628,73 +628,6 @@ out_err:
 	return ret;
 }
 
-/*
- * remove any client records that were not reclaimed since grace_start.
- */
-int
-sqlite_remove_unreclaimed(time_t grace_start)
-{
-	int ret;
-	char *err = NULL;
-
-	ret = snprintf(buf, sizeof(buf), "DELETE FROM clients WHERE time < %ld",
-			grace_start);
-	if (ret < 0) {
-		return ret;
-	} else if ((size_t)ret >= sizeof(buf)) {
-		ret = -EINVAL;
-		return ret;
-	}
-
-	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
-	if (ret != SQLITE_OK)
-		xlog(L_ERROR, "%s: delete failed: %s", __func__, err);
-
-	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
-	sqlite3_free(err);
-	return ret;
-}
-
-/*
- * Are there any clients that are possibly still reclaiming? Return a positive
- * integer (usually number of clients) if so. If not, then return 0. On any
- * error, return non-zero.
- */
-int
-sqlite_query_reclaiming(const time_t grace_start)
-{
-	int ret;
-	sqlite3_stmt *stmt = NULL;
-
-	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
-				      "time < ? OR has_session != 1", -1, &stmt, NULL);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: unable to prepare select statement: %s",
-				__func__, sqlite3_errmsg(dbh));
-		return ret;
-	}
-
-	ret = sqlite3_bind_int64(stmt, 1, (sqlite3_int64)grace_start);
-	if (ret != SQLITE_OK) {
-		xlog(L_ERROR, "%s: bind int64 failed: %s",
-				__func__, sqlite3_errmsg(dbh));
-		return ret;
-	}
-
-	ret = sqlite3_step(stmt);
-	if (ret != SQLITE_ROW) {
-		xlog(L_ERROR, "%s: unexpected return code from select: %s",
-				__func__, sqlite3_errmsg(dbh));
-		return ret;
-	}
-
-	ret = sqlite3_column_int(stmt, 0);
-	sqlite3_finalize(stmt);
-	xlog(D_GENERAL, "%s: there are %d clients that have not completed "
-			"reclaim", __func__, ret);
-	return ret;
-}
-
 int
 sqlite_grace_start(void)
 {
diff --git a/utils/nfsdcld/sqlite.h b/utils/nfsdcld/sqlite.h
index 5c56f75..757e5cc 100644
--- a/utils/nfsdcld/sqlite.h
+++ b/utils/nfsdcld/sqlite.h
@@ -26,8 +26,6 @@ int sqlite_prepare_dbh(const char *topdir);
 int sqlite_insert_client(const unsigned char *clname, const size_t namelen);
 int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
 int sqlite_check_client(const unsigned char *clname, const size_t namelen);
-int sqlite_remove_unreclaimed(const time_t grace_start);
-int sqlite_query_reclaiming(const time_t grace_start);
 int sqlite_grace_start(void);
 int sqlite_grace_done(void);
 int sqlite_iterate_recovery(int (*cb)(struct cld_client *clnt), struct cld_client *clnt);
-- 
2.17.1



* [nfs-utils PATCH RFC 5/7] nfsdcld: the -p option should specify the rpc_pipefs mountpoint
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
                   ` (3 preceding siblings ...)
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 4/7] nfsdcld: remove some unused functions Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support Scott Mayhew
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 7/7] systemd: add a unit file for nfsdcld Scott Mayhew
  6 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

Change the -p option to specify the rpc_pipefs mountpoint rather than
the full path to the cld pipe file.  This is consistent with other
daemons that use the rpc_pipefs filesystem.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
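Editor's note: with this change the pipe path is always the rpc_pipefs
mountpoint followed by the fixed "/nfsd/cld" suffix.  A minimal sketch
of that join follows.  The helper name build_cld_pipepath() and the
macro CLD_PIPE_SUFFIX are hypothetical, and the sketch reports
truncation explicitly, whereas the patch itself uses strlcpy()/strlcat().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same value as DEFAULT_CLD_PATH after this patch. */
#define CLD_PIPE_SUFFIX "/nfsd/cld"

/*
 * Hypothetical helper: join the rpc_pipefs mountpoint and the cld suffix.
 * Returns 0 on success, -1 if the result would not fit in the buffer.
 */
static int build_cld_pipepath(char *out, size_t len, const char *pipefs_dir)
{
	int n;

	assert(out != NULL && pipefs_dir != NULL);

	/* refuse early rather than hand back a silently truncated path */
	if (strlen(pipefs_dir) + strlen(CLD_PIPE_SUFFIX) >= len)
		return -1;

	n = snprintf(out, len, "%s%s", pipefs_dir, CLD_PIPE_SUFFIX);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Passing "/var/lib/nfs/rpc_pipefs" yields
"/var/lib/nfs/rpc_pipefs/nfsd/cld", the same path the old -p default
pointed at.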
 utils/nfsdcld/nfsdcld.c   | 17 ++++++++++-------
 utils/nfsdcld/nfsdcld.man |  9 ++++-----
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
index 9b1ad98..272c7c5 100644
--- a/utils/nfsdcld/nfsdcld.c
+++ b/utils/nfsdcld/nfsdcld.c
@@ -46,11 +46,11 @@
 #include "sqlite.h"
 #include "../mount/version.h"
 
-#ifndef PIPEFS_DIR
-#define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
+#ifndef DEFAULT_PIPEFS_DIR
+#define DEFAULT_PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
 #endif
 
-#define DEFAULT_CLD_PATH	PIPEFS_DIR "/nfsd/cld"
+#define DEFAULT_CLD_PATH	"/nfsd/cld"
 
 #ifndef CLD_DEFAULT_STORAGEDIR
 #define CLD_DEFAULT_STORAGEDIR NFS_STATEDIR "/nfsdcld"
@@ -63,7 +63,8 @@
 /* private data structures */
 
 /* global variables */
-static char *pipepath = DEFAULT_CLD_PATH;
+static char pipefs_dir[PATH_MAX] = DEFAULT_PIPEFS_DIR;
+static char pipepath[PATH_MAX];
 static int 		inotify_fd = -1;
 static struct event	pipedir_event;
 static bool old_kernel = false;
@@ -73,7 +74,7 @@ static struct option longopts[] =
 	{ "help", 0, NULL, 'h' },
 	{ "foreground", 0, NULL, 'F' },
 	{ "debug", 0, NULL, 'd' },
-	{ "pipe", 1, NULL, 'p' },
+	{ "pipefsdir", 1, NULL, 'p' },
 	{ "storagedir", 1, NULL, 's' },
 	{ NULL, 0, 0, 0 },
 };
@@ -84,7 +85,7 @@ static void cldcb(int UNUSED(fd), short which, void *data);
 static void
 usage(char *progname)
 {
-	printf("%s [ -hFd ] [ -p pipe ] [ -s dir ]\n", progname);
+	printf("%s [ -hFd ] [ -p pipefsdir ] [ -s storagedir ]\n", progname);
 }
 
 static int
@@ -663,7 +664,7 @@ main(int argc, char **argv)
 			foreground = true;
 			break;
 		case 'p':
-			pipepath = optarg;
+			strlcpy(pipefs_dir, optarg, sizeof(pipefs_dir));
 			break;
 		case 's':
 			storagedir = optarg;
@@ -674,6 +675,8 @@ main(int argc, char **argv)
 		}
 	}
 
+	strlcpy(pipepath, pipefs_dir, sizeof(pipepath));
+	strlcat(pipepath, DEFAULT_CLD_PATH, sizeof(pipepath));
 
 	xlog_open(progname);
 	if (!foreground) {
diff --git a/utils/nfsdcld/nfsdcld.man b/utils/nfsdcld/nfsdcld.man
index 9ddaf64..b607ba6 100644
--- a/utils/nfsdcld/nfsdcld.man
+++ b/utils/nfsdcld/nfsdcld.man
@@ -155,11 +155,10 @@ Enable debug level logging.
 .IP "\fB\-F\fR, \fB\-\-foreground\fR" 4
 .IX Item "-F, --foreground"
 Runs the daemon in the foreground and prints all output to stderr
-.IP "\fB\-p\fR \fIpipe\fR, \fB\-\-pipe\fR=\fIpipe\fR" 4
-.IX Item "-p pipe, --pipe=pipe"
-Location of the \*(L"cld\*(R" upcall pipe. The default value is
-\&\fI/var/lib/nfs/rpc_pipefs/nfsd/cld\fR. If the pipe does not exist when the
-daemon starts then it will wait for it to be created.
+.IP "\fB\-p\fR \fIpath\fR, \fB\-\-pipefsdir\fR=\fIpath\fR" 4
+.IX Item "-p path, --pipefsdir=path"
+Location of the rpc_pipefs filesystem. The default value is
+\&\fI/var/lib/nfs/rpc_pipefs\fR.
 .IP "\fB\-s\fR \fIstorage_dir\fR, \fB\-\-storagedir\fR=\fIstorage_dir\fR" 4
 .IX Item "-s storagedir, --storagedir=storage_dir"
 Directory where stable storage information should be kept. The default
-- 
2.17.1



* [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
                   ` (4 preceding siblings ...)
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 5/7] nfsdcld: the -p option should specify the rpc_pipefs mountpoint Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  2018-11-08 12:00   ` Steve Dickson
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 7/7] systemd: add a unit file for nfsdcld Scott Mayhew
  6 siblings, 1 reply; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
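Editor's note: the patch reads /etc/nfs.conf before the getopt loop, so
for storagedir and the pipefs directory the effective order is built-in
default, then nfs.conf, then command line.  That layering can be
sketched as below.  The struct and resolve_setting() are hypothetical
names used only for illustration, not part of nfs-utils.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one setting as seen at each layer; NULL = not set. */
struct setting_sources {
	const char *builtin;	/* compiled-in default, never NULL */
	const char *conffile;	/* value from /etc/nfs.conf, if any */
	const char *cmdline;	/* value from -s or -p, if any */
};

/* Later layers win: command line over nfs.conf over the built-in default. */
static const char *resolve_setting(const struct setting_sources *src)
{
	assert(src != NULL && src->builtin != NULL);

	if (src->cmdline)
		return src->cmdline;
	if (src->conffile)
		return src->conffile;
	return src->builtin;
}
```

The debug setting is additive rather than layered: either "debug = 1"
in nfs.conf or -d on the command line turns debug logging on.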
 nfs.conf                  |  4 ++++
 utils/nfsdcld/nfsdcld.c   | 13 +++++++++++++
 utils/nfsdcld/nfsdcld.man | 15 +++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/nfs.conf b/nfs.conf
index 0d0ec9b..2157b9c 100644
--- a/nfs.conf
+++ b/nfs.conf
@@ -33,6 +33,10 @@
 # state-directory-path=/var/lib/nfs
 # ha-callout=
 #
+#[nfsdcld]
+# debug=0
+# storagedir=/var/lib/nfs/nfsdcld
+#
 #[nfsdcltrack]
 # debug=0
 # storagedir=/var/lib/nfs/nfsdcltrack
diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
index 272c7c5..313c68f 100644
--- a/utils/nfsdcld/nfsdcld.c
+++ b/utils/nfsdcld/nfsdcld.c
@@ -45,6 +45,7 @@
 #include "cld-internal.h"
 #include "sqlite.h"
 #include "../mount/version.h"
+#include "conffile.h"
 
 #ifndef DEFAULT_PIPEFS_DIR
 #define DEFAULT_PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
@@ -640,6 +641,7 @@ main(int argc, char **argv)
 	char *progname;
 	char *storagedir = CLD_DEFAULT_STORAGEDIR;
 	struct cld_client clnt;
+	char *s;
 
 	memset(&clnt, 0, sizeof(clnt));
 
@@ -653,6 +655,17 @@ main(int argc, char **argv)
 	xlog_syslog(0);
 	xlog_stderr(1);
 
+	conf_init_file(NFS_CONFFILE);
+	s = conf_get_str("general", "pipefs-directory");
+	if (s)
+		strlcpy(pipefs_dir, s, sizeof(pipefs_dir));
+	s = conf_get_str("nfsdcld", "storagedir");
+	if (s)
+		storagedir = s;
+	rc = conf_get_num("nfsdcld", "debug", 0);
+	if (rc > 0)
+		xlog_config(D_ALL, 1);
+
 	/* process command-line options */
 	while ((arg = getopt_long(argc, argv, "hdFp:s:", longopts,
 				  NULL)) != EOF) {
diff --git a/utils/nfsdcld/nfsdcld.man b/utils/nfsdcld/nfsdcld.man
index b607ba6..c271d14 100644
--- a/utils/nfsdcld/nfsdcld.man
+++ b/utils/nfsdcld/nfsdcld.man
@@ -163,6 +163,21 @@ Location of the rpc_pipefs filesystem. The default value is
 .IX Item "-s storagedir, --storagedir=storage_dir"
 Directory where stable storage information should be kept. The default
 value is \fI/var/lib/nfs/nfsdcld\fR.
+.SH "CONFIGURATION FILE"
+.IX Header "CONFIGURATION FILE"
+The following values are recognized in the \fB[nfsdcld]\fR section
+of the \fI/etc/nfs.conf\fR configuration file:
+.IP "\fBstoragedir\fR" 4
+.IX Item "storagedir"
+Equivalent to \fB\-s\fR/\fB\-\-storagedir\fR.
+.IP "\fBdebug\fR" 4
+.IX Item "debug"
+Setting "debug = 1" is equivalent to \fB\-d\fR/\fB\-\-debug\fR.
+.LP
+In addition, the following value is recognized from the \fB[general]\fR section:
+.IP "\fBpipefs\-directory\fR" 4
+.IX Item "pipefs-directory"
+Equivalent to \fB\-p\fR/\fB\-\-pipefsdir\fR.
 .SH "NOTES"
 .IX Header "NOTES"
 The Linux kernel NFSv4 server has historically tracked this information
-- 
2.17.1



* [nfs-utils PATCH RFC 7/7] systemd: add a unit file for nfsdcld
  2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
                   ` (5 preceding siblings ...)
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support Scott Mayhew
@ 2018-11-06 18:36 ` Scott Mayhew
  6 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-06 18:36 UTC (permalink / raw)
  To: steved; +Cc: jlayton, linux-nfs

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 systemd/nfs-server.service |  2 ++
 systemd/nfsdcld.service    | 11 +++++++++++
 2 files changed, 13 insertions(+)
 create mode 100644 systemd/nfsdcld.service

diff --git a/systemd/nfs-server.service b/systemd/nfs-server.service
index 136552b..24118d6 100644
--- a/systemd/nfs-server.service
+++ b/systemd/nfs-server.service
@@ -6,10 +6,12 @@ Requires= nfs-mountd.service
 Wants=rpcbind.socket network-online.target
 Wants=rpc-statd.service nfs-idmapd.service
 Wants=rpc-statd-notify.service
+Wants=nfsdcld.service
 
 After= network-online.target local-fs.target
 After= proc-fs-nfsd.mount rpcbind.socket nfs-mountd.service
 After= nfs-idmapd.service rpc-statd.service
+After= nfsdcld.service
 Before= rpc-statd-notify.service
 
 # GSS services dependencies and ordering
diff --git a/systemd/nfsdcld.service b/systemd/nfsdcld.service
new file mode 100644
index 0000000..0ed6921
--- /dev/null
+++ b/systemd/nfsdcld.service
@@ -0,0 +1,11 @@
+[Unit]
+Description=NFSv4 Client Tracking Daemon
+DefaultDependencies=no
+Conflicts=umount.target
+Requires=rpc_pipefs.target proc-fs-nfsd.mount
+After=rpc_pipefs.target proc-fs-nfsd.mount
+ConditionPathExists=|!/var/lib/nfs/v4recovery
+
+[Service]
+Type=forking
+ExecStart=/usr/sbin/nfsdcld
-- 
2.17.1



* Re: [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support Scott Mayhew
@ 2018-11-08 12:00   ` Steve Dickson
  2018-11-08 13:09     ` Scott Mayhew
  0 siblings, 1 reply; 11+ messages in thread
From: Steve Dickson @ 2018-11-08 12:00 UTC (permalink / raw)
  To: Scott Mayhew; +Cc: jlayton, linux-nfs



On 11/6/18 1:36 PM, Scott Mayhew wrote:
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  nfs.conf                  |  4 ++++
>  utils/nfsdcld/nfsdcld.c   | 13 +++++++++++++
>  utils/nfsdcld/nfsdcld.man | 15 +++++++++++++++
>  3 files changed, 32 insertions(+)
> 
> diff --git a/nfs.conf b/nfs.conf
> index 0d0ec9b..2157b9c 100644
> --- a/nfs.conf
> +++ b/nfs.conf
> @@ -33,6 +33,10 @@
>  # state-directory-path=/var/lib/nfs
>  # ha-callout=
>  #
> +#[nfsdcld]
Starting very recently, all sections are now un-commented
 
> +# debug=0
> +# storagedir=/var/lib/nfs/nfsdcld
Does this also need a  
   # pipefsdir=/var/lib/nfs/rpc_pipefs

Or are you grabbing that from the [general] section?

steved.
> +#
>  #[nfsdcltrack]
>  # debug=0
>  # storagedir=/var/lib/nfs/nfsdcltrack
> diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
> index 272c7c5..313c68f 100644
> --- a/utils/nfsdcld/nfsdcld.c
> +++ b/utils/nfsdcld/nfsdcld.c
> @@ -45,6 +45,7 @@
>  #include "cld-internal.h"
>  #include "sqlite.h"
>  #include "../mount/version.h"
> +#include "conffile.h"
>  
>  #ifndef DEFAULT_PIPEFS_DIR
>  #define DEFAULT_PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> @@ -640,6 +641,7 @@ main(int argc, char **argv)
>  	char *progname;
>  	char *storagedir = CLD_DEFAULT_STORAGEDIR;
>  	struct cld_client clnt;
> +	char *s;
>  
>  	memset(&clnt, 0, sizeof(clnt));
>  
> @@ -653,6 +655,17 @@ main(int argc, char **argv)
>  	xlog_syslog(0);
>  	xlog_stderr(1);
>  
> +	conf_init_file(NFS_CONFFILE);
> +	s = conf_get_str("general", "pipefs-directory");
> +	if (s)
> +		strlcpy(pipefs_dir, s, sizeof(pipefs_dir));
> +	s = conf_get_str("nfsdcld", "storagedir");
> +	if (s)
> +		storagedir = s;
> +	rc = conf_get_num("nfsdcld", "debug", 0);
> +	if (rc > 0)
> +		xlog_config(D_ALL, 1);
> +
>  	/* process command-line options */
>  	while ((arg = getopt_long(argc, argv, "hdFp:s:", longopts,
>  				  NULL)) != EOF) {
> diff --git a/utils/nfsdcld/nfsdcld.man b/utils/nfsdcld/nfsdcld.man
> index b607ba6..c271d14 100644
> --- a/utils/nfsdcld/nfsdcld.man
> +++ b/utils/nfsdcld/nfsdcld.man
> @@ -163,6 +163,21 @@ Location of the rpc_pipefs filesystem. The default value is
>  .IX Item "-s storagedir, --storagedir=storage_dir"
>  Directory where stable storage information should be kept. The default
>  value is \fI/var/lib/nfs/nfsdcld\fR.
> +.SH "CONFIGURATION FILE"
> +.IX Header "CONFIGURATION FILE"
> +The following values are recognized in the \fB[nfsdcld]\fR section
> +of the \fI/etc/nfs.conf\fR configuration file:
> +.IP "\fBstoragedir\fR" 4
> +.IX Item "storagedir"
> +Equivalent to \fB\-s\fR/\fB\-\-storagedir\fR.
> +.IP "\fBdebug\fR" 4
> +.IX Item "debug"
> +Setting "debug = 1" is equivalent to \fB\-d\fR/\fB\-\-debug\fR.
> +.LP
> +In addition, the following value is recognized from the \fB[general]\fR section:
> +.IP "\fBpipefs\-directory\fR" 4
> +.IX Item "pipefs-directory"
> +Equivalent to \fB\-p\fR/\fB\-\-pipefsdir\fR.
>  .SH "NOTES"
>  .IX Header "NOTES"
>  The Linux kernel NFSv4 server has historically tracked this information
> 


* Re: [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support
  2018-11-08 12:00   ` Steve Dickson
@ 2018-11-08 13:09     ` Scott Mayhew
  0 siblings, 0 replies; 11+ messages in thread
From: Scott Mayhew @ 2018-11-08 13:09 UTC (permalink / raw)
  To: Steve Dickson; +Cc: jlayton, linux-nfs

On Thu, 08 Nov 2018, Steve Dickson wrote:

> 
> 
> On 11/6/18 1:36 PM, Scott Mayhew wrote:
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> >  nfs.conf                  |  4 ++++
> >  utils/nfsdcld/nfsdcld.c   | 13 +++++++++++++
> >  utils/nfsdcld/nfsdcld.man | 15 +++++++++++++++
> >  3 files changed, 32 insertions(+)
> > 
> > diff --git a/nfs.conf b/nfs.conf
> > index 0d0ec9b..2157b9c 100644
> > --- a/nfs.conf
> > +++ b/nfs.conf
> > @@ -33,6 +33,10 @@
> >  # state-directory-path=/var/lib/nfs
> >  # ha-callout=
> >  #
> > +#[nfsdcld]
> Starting very recently, all sections are now uncommented

Okay, I see that now.

>  
> > +# debug=0
> > +# storagedir=/var/lib/nfs/nfsdcld
> Does this also need a  
>    # pipefsdir=/var/lib/nfs/rpc_pipefs
> 
> Or are you grabbing that from the [general] section?

Yep, that comes from the general section, just like the other pipefs
users.
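
To spell out the lookup order: the patch reads nfs.conf before the
getopt_long() loop, so a value given on the command line should end up
overriding the one from the config file, which in turn overrides the
built-in default.  A minimal sketch of that precedence follows; the
helper name effective_setting() is hypothetical and does not appear in
the patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of nfsdcld): pick the effective value
 * for one setting.  A NULL argument means "not specified at that level".
 */
const char *effective_setting(const char *builtin_default,
			      const char *conf_value,
			      const char *cli_value)
{
	if (cli_value)		/* e.g. -p / --pipefsdir */
		return cli_value;
	if (conf_value)		/* e.g. pipefs-directory in [general] */
		return conf_value;
	return builtin_default;	/* e.g. DEFAULT_PIPEFS_DIR */
}
```

The same ordering would apply to storagedir, with the [nfsdcld] section
taking the place of [general].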

-Scott

> 
> steved.
> > +#
> >  #[nfsdcltrack]
> >  # debug=0
> >  # storagedir=/var/lib/nfs/nfsdcltrack
> > diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
> > index 272c7c5..313c68f 100644
> > --- a/utils/nfsdcld/nfsdcld.c
> > +++ b/utils/nfsdcld/nfsdcld.c
> > @@ -45,6 +45,7 @@
> >  #include "cld-internal.h"
> >  #include "sqlite.h"
> >  #include "../mount/version.h"
> > +#include "conffile.h"
> >  
> >  #ifndef DEFAULT_PIPEFS_DIR
> >  #define DEFAULT_PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> > @@ -640,6 +641,7 @@ main(int argc, char **argv)
> >  	char *progname;
> >  	char *storagedir = CLD_DEFAULT_STORAGEDIR;
> >  	struct cld_client clnt;
> > +	char *s;
> >  
> >  	memset(&clnt, 0, sizeof(clnt));
> >  
> > @@ -653,6 +655,17 @@ main(int argc, char **argv)
> >  	xlog_syslog(0);
> >  	xlog_stderr(1);
> >  
> > +	conf_init_file(NFS_CONFFILE);
> > +	s = conf_get_str("general", "pipefs-directory");
> > +	if (s)
> > +		strlcpy(pipefs_dir, s, sizeof(pipefs_dir));
> > +	s = conf_get_str("nfsdcld", "storagedir");
> > +	if (s)
> > +		storagedir = s;
> > +	rc = conf_get_num("nfsdcld", "debug", 0);
> > +	if (rc > 0)
> > +		xlog_config(D_ALL, 1);
> > +
> >  	/* process command-line options */
> >  	while ((arg = getopt_long(argc, argv, "hdFp:s:", longopts,
> >  				  NULL)) != EOF) {
> > diff --git a/utils/nfsdcld/nfsdcld.man b/utils/nfsdcld/nfsdcld.man
> > index b607ba6..c271d14 100644
> > --- a/utils/nfsdcld/nfsdcld.man
> > +++ b/utils/nfsdcld/nfsdcld.man
> > @@ -163,6 +163,21 @@ Location of the rpc_pipefs filesystem. The default value is
> >  .IX Item "-s storagedir, --storagedir=storage_dir"
> >  Directory where stable storage information should be kept. The default
> >  value is \fI/var/lib/nfs/nfsdcld\fR.
> > +.SH "CONFIGURATION FILE"
> > +.IX Header "CONFIGURATION FILE"
> > +The following values are recognized in the \fB[nfsdcld]\fR section
> > +of the \fI/etc/nfs.conf\fR configuration file:
> > +.IP "\fBstoragedir\fR" 4
> > +.IX Item "storagedir"
> > +Equivalent to \fB\-s\fR/\fB\-\-storagedir\fR.
> > +.IP "\fBdebug\fR" 4
> > +.IX Item "debug"
> > +Setting "debug = 1" is equivalent to \fB\-d\fR/\fB\-\-debug\fR.
> > +.LP
> > +In addition, the following value is recognized from the \fB[general]\fR section:
> > +.IP "\fBpipefs\-directory\fR" 4
> > +.IX Item "pipefs-directory"
> > +Equivalent to \fB\-p\fR/\fB\-\-pipefsdir\fR.
> >  .SH "NOTES"
> >  .IX Header "NOTES"
> >  The Linux kernel NFSv4 server has historically tracked this information
> > 


* Re: [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements
  2018-11-06 18:36 ` [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements Scott Mayhew
@ 2018-11-10 14:16   ` Jeff Layton
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff Layton @ 2018-11-10 14:16 UTC (permalink / raw)
  To: Scott Mayhew, steved; +Cc: linux-nfs

On Tue, 2018-11-06 at 13:36 -0500, Scott Mayhew wrote:
> 1) Adopt the concept of "reboot epochs" (but not coordinated grace
> periods via the "need" and "enforcing" flags) from Jeff Layton's
> "Active/Active NFS Server Recovery" presentation from the Fall 2018 NFS
> Bakeathon.  See
> http://nfsv4bat.org/Documents/BakeAThon/2018/Active_Active%20NFS%20Server%20Recovery.pdf
> 
> - add a new table "grace" which contains two integer columns
>   representing the "current" epoch (where new client records are stored)
>   and the "recovery" epoch (which has the records for clients that are
>   allowed to recover)
> - replace the "clients" table with table(s) named "rec-CCCCCCCCCCCCCCCC"
>   (where C is the hex value of the epoch), containing a single column
>   "id" which stores the client id string
> - when going from normal operation into grace, the current epoch becomes
>   the recovery epoch, the current epoch is incremented, and a new table
>   is created for the current epoch.  Clients are allowed to reclaim if
>   they have a record in the table corresponding to the recovery epoch
>   and new records are added to the table corresponding to the current
>   epoch.
> - when moving from grace back to normal operation, the table associated
>   with the recovery epoch is deleted and the recovery epoch becomes
>   zero.
> - if the server restarts before exiting the previous grace period, then
>   the epochs are not changed, and all records in the table associated
>   with the "current" epoch are cleared out.
> 
> 2) Allow knfsd to "slurp" the client records during startup.
> 
> During client tracking initialization, knfsd will do an upcall to get a
> list of clients from the database.  nfsdcld will do one downcall with a
> status of -EINPROGRESS for each client record in the database, followed
> by a final downcall with a status of 0.  This will allow 2 things
> 
> - knfsd can check whether a client is allowed to reclaim without
>   performing an upcall to nfsdcld
> - knfsd can decide to end the grace period early by tracking the number
>   of RECLAIM_COMPLETE operations it receives from "known" clients, or
>   it can skip the grace period altogether if no clients are allowed
>   to reclaim.
> 
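
For anyone following along, the epoch transitions described above can
be modelled in a few lines of C.  This is only an in-memory sketch: the
names epoch_state, epoch_grace_start() and epoch_grace_done() are
hypothetical, and the actual patch keeps this state in the sqlite
"grace" table and the per-epoch "rec-" tables.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for the single row of the "grace" table. */
struct epoch_state {
	uint64_t current;	/* epoch where new client records are stored */
	uint64_t recovery;	/* epoch clients may reclaim from; 0 = not in grace */
};

/*
 * Entering grace.  Returns true if a new epoch was started, false if the
 * server restarted while still in grace.  In the latter case the epochs
 * are left alone (the real code would also clear the records in the
 * current epoch's table).
 */
bool epoch_grace_start(struct epoch_state *st)
{
	if (st->recovery != 0)
		return false;
	st->recovery = st->current;
	st->current++;
	return true;
}

/*
 * Leaving grace: the recovery epoch's table would be dropped here, and
 * the recovery epoch goes back to 0.
 */
void epoch_grace_done(struct epoch_state *st)
{
	st->recovery = 0;
}
```

Starting from the initial values (current=1, recovery=0), a grace start
gives (2, 1), a second grace start before grace ends leaves (2, 1)
untouched, and grace done gives (2, 0).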


Thanks for doing this work, Scott. This should give us an even more
robust recovery backend that is suitable for containerization, and
possibly something we could extend to do active/active clustered NFS
properly with knfsd.

The changes look great overall -- one minor thing inline below:

> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  support/include/cld.h        |   1 +
>  utils/nfsdcld/Makefile.am    |   2 +-
>  utils/nfsdcld/cld-internal.h |  30 +++
>  utils/nfsdcld/nfsdcld.c      | 160 +++++++++++-
>  utils/nfsdcld/sqlite.c       | 483 ++++++++++++++++++++++++++++-------
>  utils/nfsdcld/sqlite.h       |  11 +-
>  6 files changed, 579 insertions(+), 108 deletions(-)
>  create mode 100644 utils/nfsdcld/cld-internal.h
> 
> diff --git a/support/include/cld.h b/support/include/cld.h
> index f14a9ab..c1f5b70 100644
> --- a/support/include/cld.h
> +++ b/support/include/cld.h
> @@ -33,6 +33,7 @@ enum cld_command {
>  	Cld_Remove,		/* remove record of this cm_id */
>  	Cld_Check,		/* is this cm_id allowed? */
>  	Cld_GraceDone,		/* grace period is complete */
> +	Cld_GraceStart,
>  };
>  
>  /* representation of long-form NFSv4 client ID */
> diff --git a/utils/nfsdcld/Makefile.am b/utils/nfsdcld/Makefile.am
> index 8239be8..d1da749 100644
> --- a/utils/nfsdcld/Makefile.am
> +++ b/utils/nfsdcld/Makefile.am
> @@ -13,7 +13,7 @@ sbin_PROGRAMS	= nfsdcld
>  nfsdcld_SOURCES = nfsdcld.c sqlite.c
>  nfsdcld_LDADD = ../../support/nfs/libnfs.la $(LIBEVENT) $(LIBSQLITE) $(LIBCAP)
>  
> -noinst_HEADERS	= sqlite.h
> +noinst_HEADERS	= sqlite.h cld-internal.h
>  
>  MAINTAINERCLEANFILES = Makefile.in
>  
> diff --git a/utils/nfsdcld/cld-internal.h b/utils/nfsdcld/cld-internal.h
> new file mode 100644
> index 0000000..a90cced
> --- /dev/null
> +++ b/utils/nfsdcld/cld-internal.h
> @@ -0,0 +1,30 @@
> +/*
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> + * Boston, MA 02110-1301, USA.
> + */
> +
> +#ifndef _CLD_INTERNAL_H_
> +#define _CLD_INTERNAL_H_
> +
> +struct cld_client {
> +	int			cl_fd;
> +	struct event		cl_event;
> +	struct cld_msg	cl_msg;
> +};
> +
> +uint64_t current_epoch;
> +uint64_t recovery_epoch;
> +
> +#endif /* _CLD_INTERNAL_H_ */
> diff --git a/utils/nfsdcld/nfsdcld.c b/utils/nfsdcld/nfsdcld.c
> index 082f3ab..9b1ad98 100644
> --- a/utils/nfsdcld/nfsdcld.c
> +++ b/utils/nfsdcld/nfsdcld.c
> @@ -42,7 +42,9 @@
>  #include "xlog.h"
>  #include "nfslib.h"
>  #include "cld.h"
> +#include "cld-internal.h"
>  #include "sqlite.h"
> +#include "../mount/version.h"
>  
>  #ifndef PIPEFS_DIR
>  #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> @@ -54,19 +56,17 @@
>  #define CLD_DEFAULT_STORAGEDIR NFS_STATEDIR "/nfsdcld"
>  #endif
>  
> +#define NFSD_END_GRACE_FILE "/proc/fs/nfsd/v4_end_grace"
> +
>  #define UPCALL_VERSION		1
>  
>  /* private data structures */
> -struct cld_client {
> -	int			cl_fd;
> -	struct event		cl_event;
> -	struct cld_msg	cl_msg;
> -};
>  
>  /* global variables */
>  static char *pipepath = DEFAULT_CLD_PATH;
>  static int 		inotify_fd = -1;
>  static struct event	pipedir_event;
> +static bool old_kernel = false;
>  
>  static struct option longopts[] =
>  {
> @@ -298,6 +298,43 @@ out:
>  	return ret;
>  }
>  
> +/*
> + * Older kernels will not tell nfsdcld when a grace period has started.
> + * Therefore we have to peek at the /proc/fs/nfsd/v4_end_grace file to
> + * see if nfsd is in grace.  We have to do this for create and remove
> + * upcalls to ensure that the correct table is being updated - otherwise
> + * we could lose client records when the grace period is lifted.
> + */
> +static int
> +cld_check_grace_period(void)
> +{
> +	int fd, ret = 0;
> +	char c;
> +
> +	if (!old_kernel)
> +		return 0;
> +	if (recovery_epoch != 0)
> +		return 0;
> +	fd = open(NFSD_END_GRACE_FILE, O_RDONLY);
> +	if (fd < 0) {
> +		xlog(L_WARNING, "Unable to open %s: %m",
> +			NFSD_END_GRACE_FILE);
> +		return 1;
> +	}
> +	if (read(fd, &c, 1) < 0) {
> +		xlog(L_WARNING, "Unable to read from %s: %m",
> +			NFSD_END_GRACE_FILE);
> +		return 1;
> +	}
> +	close(fd);
> +	if (c == 'N') {
> +		xlog(L_WARNING, "nfsd is in grace but didn't send a gracestart upcall, "
> +			"please update the kernel");
> +		ret = sqlite_grace_start();
> +	}
> +	return ret;
> +}
> +
>  static void
>  cld_not_implemented(struct cld_client *clnt)
>  {
> @@ -332,14 +369,17 @@ cld_create(struct cld_client *clnt)
>  	ssize_t bsize, wsize;
>  	struct cld_msg *cmsg = &clnt->cl_msg;
>  
> +	ret = cld_check_grace_period();
> +	if (ret)
> +		goto reply;
> +
>  	xlog(D_GENERAL, "%s: create client record.", __func__);
>  
>  
>  	ret = sqlite_insert_client(cmsg->cm_u.cm_name.cn_id,
> -				   cmsg->cm_u.cm_name.cn_len,
> -				   false,
> -				   false);
> +				   cmsg->cm_u.cm_name.cn_len);
>  
> +reply:
>  	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>  
>  	bsize = sizeof(*cmsg);
> @@ -365,11 +405,16 @@ cld_remove(struct cld_client *clnt)
>  	ssize_t bsize, wsize;
>  	struct cld_msg *cmsg = &clnt->cl_msg;
>  
> +	ret = cld_check_grace_period();
> +	if (ret)
> +		goto reply;
> +
>  	xlog(D_GENERAL, "%s: remove client record.", __func__);
>  
>  	ret = sqlite_remove_client(cmsg->cm_u.cm_name.cn_id,
>  				   cmsg->cm_u.cm_name.cn_len);
>  
> +reply:
>  	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>  
>  	bsize = sizeof(*cmsg);
> @@ -396,12 +441,26 @@ cld_check(struct cld_client *clnt)
>  	ssize_t bsize, wsize;
>  	struct cld_msg *cmsg = &clnt->cl_msg;
>  
> +	/*
> +	 * If we get a check upcall at all, it means we're talking to an old
> +	 * kernel.  Furthermore, if we're not in grace it means this is the
> +	 * first client to do a reclaim.  Log a message and use
> +	 * sqlite_grace_start() to advance the epoch numbers.
> +	 */
> +	if (recovery_epoch == 0) {
> +		xlog(D_GENERAL, "%s: received a check upcall, please update the kernel",
> +			__func__);
> +		ret = sqlite_grace_start();
> +		if (ret)
> +			goto reply;
> +	}
> +
>  	xlog(D_GENERAL, "%s: check client record", __func__);
>  
>  	ret = sqlite_check_client(cmsg->cm_u.cm_name.cn_id,
> -				  cmsg->cm_u.cm_name.cn_len,
> -				  false);
> +				  cmsg->cm_u.cm_name.cn_len);
>  
> +reply:
>  	/* set up reply */
>  	cmsg->cm_status = ret ? -EACCES : ret;
>  
> @@ -429,11 +488,27 @@ cld_gracedone(struct cld_client *clnt)
>  	ssize_t bsize, wsize;
>  	struct cld_msg *cmsg = &clnt->cl_msg;
>  
> -	xlog(D_GENERAL, "%s: grace done. cm_gracetime=%ld", __func__,
> -			cmsg->cm_u.cm_gracetime);
> +	/*
> +	 * If we got a "gracedone" upcall while we're not in grace, then
> +	 * 1) we must be talking to an old kernel
> +	 * 2) no clients attempted to reclaim
> +	 * In that case, log a message and use sqlite_grace_start() to
> +	 * advance the epoch numbers, and then proceed as normal.
> +	 */
> +	if (recovery_epoch == 0) {
> +		xlog(D_GENERAL, "%s: received gracedone upcall "
> +			"while not in grace, please update the kernel",
> +			__func__);
> +		ret = sqlite_grace_start();
> +		if (ret)
> +			goto reply;
> +	}
> +
> +	xlog(D_GENERAL, "%s: grace done.", __func__);
>  
> -	ret = sqlite_remove_unreclaimed(cmsg->cm_u.cm_gracetime);
> +	ret = sqlite_grace_done();
>  
> +reply:
>  	/* set up reply: downcall with 0 status */
>  	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>  
> @@ -453,6 +528,59 @@ cld_gracedone(struct cld_client *clnt)
>  	}
>  }
>  
> +static int
> +gracestart_callback(struct cld_client *clnt) {
> +	ssize_t bsize, wsize;
> +	struct cld_msg *cmsg = &clnt->cl_msg;
> +
> +	cmsg->cm_status = -EINPROGRESS;
> +
> +	bsize = sizeof(struct cld_msg);
> +
> +	xlog(D_GENERAL, "Sending client %.*s",
> +			cmsg->cm_u.cm_name.cn_len, cmsg->cm_u.cm_name.cn_id);
> +	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> +	if (wsize != bsize)
> +		return -EIO;
> +	return 0;
> +}
> +
> +static void
> +cld_gracestart(struct cld_client *clnt)
> +{
> +	int ret;
> +	ssize_t bsize, wsize;
> +	struct cld_msg *cmsg = &clnt->cl_msg;
> +
> +	xlog(D_GENERAL, "%s: updating grace epochs", __func__);
> +
> +	ret = sqlite_grace_start();
> +	if (ret)
> +		goto reply;
> +
> +	xlog(D_GENERAL, "%s: sending client records to the kernel", __func__);
> +
> +	ret = sqlite_iterate_recovery(&gracestart_callback, clnt);
> +
> +reply:
> +	/* set up reply: downcall with 0 status */
> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> +
> +	bsize = sizeof(struct cld_msg);
> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> +	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> +	if (wsize != bsize) {
> +		xlog(L_ERROR, "%s: problem writing to cld pipe (%ld): %m",
> +			 __func__, wsize);
> +		ret = cld_pipe_open(clnt);
> +		if (ret) {
> +			xlog(L_FATAL, "%s: unable to reopen pipe: %d",
> +					__func__, ret);
> +			exit(ret);
> +		}
> +	}
> +}
> +
>  static void
>  cldcb(int UNUSED(fd), short which, void *data)
>  {
> @@ -490,6 +618,9 @@ cldcb(int UNUSED(fd), short which, void *data)
>  	case Cld_GraceDone:
>  		cld_gracedone(clnt);
>  		break;
> +	case Cld_GraceStart:
> +		cld_gracestart(clnt);
> +		break;
>  	default:
>  		xlog(L_WARNING, "%s: command %u is not yet implemented",
>  				__func__, cmsg->cm_cmd);
> @@ -586,6 +717,9 @@ main(int argc, char **argv)
>  		}
>  	}
>  
> +	if (linux_version_code() < MAKE_VERSION(4, 20, 0))
> +		old_kernel = true;
> +
>  	/* set up storage db */
>  	rc = sqlite_prepare_dbh(storagedir);
>  	if (rc) {
> diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c
> index c59f777..67549c9 100644
> --- a/utils/nfsdcld/sqlite.c
> +++ b/utils/nfsdcld/sqlite.c
> @@ -21,17 +21,24 @@
>   * Explanation:
>   *
>   * This file contains the code to manage the sqlite backend database for the
> - * nfsdcltrack usermodehelper upcall program.
> + * nfsdcld client tracking daemon.
>   *
>   * The main database is called main.sqlite and contains the following tables:
>   *
>   * parameters: simple key/value pairs for storing database info
>   *
> - * clients: an "id" column containing a BLOB with the long-form clientid as
> - * 	    sent by the client, a "time" column containing a timestamp (in
> - * 	    epoch seconds) of when the record was last updated, and a
> - * 	    "has_session" column containing a boolean value indicating
> - * 	    whether the client has sessions (v4.1+) or not (v4.0).
> + * grace: a "current" column containing an INTEGER representing the current
> + *        epoch (where should new values be stored) and a "recovery" column
> + *        containing an INTEGER representing the recovery epoch (from what
> + *        epoch are we allowed to recover).  A recovery epoch of 0 means
> + *        normal operation (grace period not in force).  Note: sqlite stores
> + *        integers as signed values, so these must be cast to a uint64_t when
> + *        retrieving them from the database and back to an int64_t when storing
> + *        them in the database.
> + *
> + * rec-CCCCCCCCCCCCCCCC (where C is the hex representation of the epoch value):
> + *        a single "id" column containing a BLOB with the long-form clientid
> + *        as sent by the client.
>   */
>  
>  #ifdef HAVE_CONFIG_H
> @@ -47,16 +54,21 @@
>  #include <sys/types.h>
>  #include <fcntl.h>
>  #include <unistd.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <limits.h>
>  #include <sqlite3.h>
>  #include <linux/limits.h>
>  
>  #include "xlog.h"
>  #include "sqlite.h"
> +#include "cld.h"
> +#include "cld-internal.h"
>  
> -#define CLTRACK_SQLITE_LATEST_SCHEMA_VERSION 2
> +#define CLD_SQLITE_LATEST_SCHEMA_VERSION 3
>  
>  /* in milliseconds */
> -#define CLTRACK_SQLITE_BUSY_TIMEOUT 10000
> +#define CLD_SQLITE_BUSY_TIMEOUT 10000
>  
>  /* private data structures */
>  
> @@ -124,7 +136,7 @@ out:
>  }
>  
>  static int
> -sqlite_maindb_update_v1_to_v2(void)
> +sqlite_maindb_update_schema(int oldversion)
>  {
>  	int ret, ret2;
>  	char *err;
> @@ -142,32 +154,66 @@ sqlite_maindb_update_v1_to_v2(void)
>  	 * transaction to guard against racing DB setup attempts
>  	 */
>  	ret = sqlite_query_schema_version();
> -	switch (ret) {
> -	case 1:
> -		/* Still at v1 -- do conversion */
> -		break;
> -	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
> -		/* Someone else raced in and set it up */
> -		ret = 0;
> +	if (ret != oldversion) {
> +		if (ret == CLD_SQLITE_LATEST_SCHEMA_VERSION)
> +			/* Someone else raced in and set it up */
> +			ret = 0;
> +		else
> +			/* Something went wrong -- fail! */
> +			ret = -EINVAL;
>  		goto rollback;
> -	default:
> -		/* Something went wrong -- fail! */
> -		ret = -EINVAL;
> +	}
> +
> +	/* Still at old version -- do conversion */
> +
> +	/* create grace table */
> +	ret = sqlite3_exec(dbh, "CREATE TABLE grace "
> +				"(current INTEGER , recovery INTEGER);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to create grace table: %s", err);
> +		goto rollback;
> +	}
> +
> +	/* insert initial epochs into grace table */
> +	ret = sqlite3_exec(dbh, "INSERT OR FAIL INTO grace "
> +				"values (1, 0);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to set initial epochs: %s", err);
> +		goto rollback;
> +	}
> +
> +	/* create recovery table for current epoch */
> +	ret = sqlite3_exec(dbh, "CREATE TABLE \"rec-0000000000000001\" "
> +				"(id BLOB PRIMARY KEY);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to create recovery table "
> +				"for current epoch: %s", err);
> +		goto rollback;
> +	}
> +
> +	/* copy records from old clients table */
> +	ret = sqlite3_exec(dbh, "INSERT INTO \"rec-0000000000000001\" "
> +				"SELECT id FROM clients;",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to copy client records: %s", err);
>  		goto rollback;
>  	}
>  
> -	/* create v2 clients table */
> -	ret = sqlite3_exec(dbh, "ALTER TABLE clients ADD COLUMN "
> -				"has_session INTEGER;",
> +	/* drop the old clients table */
> +	ret = sqlite3_exec(dbh, "DROP TABLE clients;",
>  				NULL, NULL, &err);
>  	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "Unable to update clients table: %s", err);
> +		xlog(L_ERROR, "Unable to drop old clients table: %s", err);
>  		goto rollback;
>  	}
>  
>  	ret = snprintf(buf, sizeof(buf), "UPDATE parameters SET value = %d "
>  			"WHERE key = \"version\";",
> -			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
> +			CLD_SQLITE_LATEST_SCHEMA_VERSION);
>  	if (ret < 0) {
>  		xlog(L_ERROR, "sprintf failed!");
>  		goto rollback;
> @@ -205,7 +251,7 @@ rollback:
>   * transaction. On any error, rollback the transaction.
>   */
>  static int
> -sqlite_maindb_init_v2(void)
> +sqlite_maindb_init_v3(void)
>  {
>  	int ret, ret2;
>  	char *err = NULL;
> @@ -227,7 +273,7 @@ sqlite_maindb_init_v2(void)
>  	case 0:
>  		/* Query failed again -- set up DB */
>  		break;
> -	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
> +	case CLD_SQLITE_LATEST_SCHEMA_VERSION:
>  		/* Someone else raced in and set it up */
>  		ret = 0;
>  		goto rollback;
> @@ -245,20 +291,38 @@ sqlite_maindb_init_v2(void)
>  		goto rollback;
>  	}
>  
> -	/* create the "clients" table */
> -	ret = sqlite3_exec(dbh, "CREATE TABLE clients (id BLOB PRIMARY KEY, "
> -				"time INTEGER, has_session INTEGER);",
> +	/* create grace table */
> +	ret = sqlite3_exec(dbh, "CREATE TABLE grace "
> +				"(current INTEGER , recovery INTEGER);",
> 				NULL, NULL, &err);
>  	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "Unable to create clients table: %s", err);
> +		xlog(L_ERROR, "Unable to create grace table: %s", err);
>  		goto rollback;
>  	}
>  
> +	/* insert initial epochs into grace table */
> +	ret = sqlite3_exec(dbh, "INSERT OR FAIL INTO grace "
> +				"values (1, 0);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to set initial epochs: %s", err);
> +		goto rollback;
> +	}
> +
> +	/* create recovery table for current epoch */
> +	ret = sqlite3_exec(dbh, "CREATE TABLE \"rec-0000000000000001\" "
> +				"(id BLOB PRIMARY KEY);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to create recovery table "
> +				"for current epoch: %s", err);
> +		goto rollback;
> +	}
>  
>  	/* insert version into parameters table */
>  	ret = snprintf(buf, sizeof(buf), "INSERT OR FAIL INTO parameters "
>  			"values (\"version\", \"%d\");",
> -			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION);
> +			CLD_SQLITE_LATEST_SCHEMA_VERSION);
>  	if (ret < 0) {
>  		xlog(L_ERROR, "sprintf failed!");
>  		goto rollback;
> @@ -291,6 +355,42 @@ rollback:
>  	goto out;
>  }
>  
> +static int
> +sqlite_startup_query_grace(void)
> +{
> +	int ret;
> +	uint64_t tcur;
> +	uint64_t trec;
> +	sqlite3_stmt *stmt = NULL;
> +
> +	/* prepare select query */
> +	ret = sqlite3_prepare_v2(dbh, "SELECT * FROM grace;", -1, &stmt, NULL);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to prepare select statement: %s",
> +			sqlite3_errmsg(dbh));
> +		goto out;
> +	}
> +
> +	ret = sqlite3_step(stmt);
> +	if (ret != SQLITE_ROW) {
> +		xlog(D_GENERAL, "Select statement execution failed: %s",
> +				sqlite3_errmsg(dbh));
> +		goto out;
> +	}
> +
> +	tcur = (uint64_t)sqlite3_column_int(stmt, 0);
> +	trec = (uint64_t)sqlite3_column_int(stmt, 1);

I think you want to use sqlite3_column_int64 here:

https://www.sqlite.org/c3ref/column_blob.html
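
To illustrate the concern: sqlite3_column_int() only hands back the low
32 bits of the stored value, so an epoch above 2^32 would not round-trip.
The sketch below models the two read paths without linking against
sqlite; epoch_read_narrow() and epoch_read_wide() are hypothetical names,
and the narrow variant ignores sign extension for simplicity.

```c
#include <stdint.h>

/*
 * Models reading a 64-bit column through a 32-bit accessor: only the
 * low 32 bits of the stored value survive.
 */
uint64_t epoch_read_narrow(int64_t stored)
{
	return (uint64_t)stored & 0xffffffffu;
}

/*
 * Models reading the same column through a 64-bit accessor such as
 * sqlite3_column_int64(): the full value is preserved.
 */
uint64_t epoch_read_wide(int64_t stored)
{
	return (uint64_t)stored;
}
```

With a stored epoch of 0x100000001, the narrow read yields 1 while the
wide read yields 0x100000001; small epochs read the same either way,
which is why the problem would not show up in ordinary testing.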

> +
> +	current_epoch = tcur;
> +	recovery_epoch = trec;
> +	ret = 0;
> +	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
> +		__func__, current_epoch, recovery_epoch);
> +out:
> +	sqlite3_finalize(stmt);
> +	return ret;
> +}
> +
>  /* Open the database and set up the database handle for it */
>  int
>  sqlite_prepare_dbh(const char *topdir)
> @@ -322,7 +422,7 @@ sqlite_prepare_dbh(const char *topdir)
>  	}
>  
>  	/* set busy timeout */
> -	ret = sqlite3_busy_timeout(dbh, CLTRACK_SQLITE_BUSY_TIMEOUT);
> +	ret = sqlite3_busy_timeout(dbh, CLD_SQLITE_BUSY_TIMEOUT);
>  	if (ret != SQLITE_OK) {
>  		xlog(L_ERROR, "Unable to set sqlite busy timeout: %s",
>  				sqlite3_errmsg(dbh));
> @@ -331,19 +431,26 @@ sqlite_prepare_dbh(const char *topdir)
>  
>  	ret = sqlite_query_schema_version();
>  	switch (ret) {
> -	case CLTRACK_SQLITE_LATEST_SCHEMA_VERSION:
> +	case CLD_SQLITE_LATEST_SCHEMA_VERSION:
>  		/* DB is already set up. Do nothing */
>  		ret = 0;
>  		break;
> +	case 2:
> +		/* Old DB -- update to new schema */
> +		ret = sqlite_maindb_update_schema(2);
> +		if (ret)
> +			goto out_close;
> +		break;
> +
>  	case 1:
>  		/* Old DB -- update to new schema */
> -		ret = sqlite_maindb_update_v1_to_v2();
> +		ret = sqlite_maindb_update_schema(1);
>  		if (ret)
>  			goto out_close;
>  		break;
>  	case 0:
>  		/* Query failed -- try to set up new DB */
> -		ret = sqlite_maindb_init_v2();
> +		ret = sqlite_maindb_init_v3();
>  		if (ret)
>  			goto out_close;
>  		break;
> @@ -351,11 +458,13 @@ sqlite_prepare_dbh(const char *topdir)
>  		/* Unknown DB version -- downgrade? Fail */
>  		xlog(L_ERROR, "Unsupported database schema version! "
>  			"Expected %d, got %d.",
> -			CLTRACK_SQLITE_LATEST_SCHEMA_VERSION, ret);
> +			CLD_SQLITE_LATEST_SCHEMA_VERSION, ret);
>  		ret = -EINVAL;
>  		goto out_close;
>  	}
>  
> +	ret = sqlite_startup_query_grace();
> +
>  	return ret;
>  out_close:
>  	sqlite3_close(dbh);
> @@ -369,20 +478,22 @@ out_close:
>   * Returns a non-zero sqlite error code, or SQLITE_OK (aka 0)
>   */
>  int
> -sqlite_insert_client(const unsigned char *clname, const size_t namelen,
> -			const bool has_session, const bool zerotime)
> +sqlite_insert_client(const unsigned char *clname, const size_t namelen)
>  {
>  	int ret;
>  	sqlite3_stmt *stmt = NULL;
>  
> -	if (zerotime)
> -		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
> -				"VALUES (?, 0, ?);", -1, &stmt, NULL);
> -	else
> -		ret = sqlite3_prepare_v2(dbh, "INSERT OR REPLACE INTO clients "
> -				"VALUES (?, strftime('%s', 'now'), ?);", -1,
> -				&stmt, NULL);
> +	ret = snprintf(buf, sizeof(buf), "INSERT OR REPLACE INTO \"rec-%016lx\" "
> +				"VALUES (?);", current_epoch);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		return -EINVAL;
> +	}
>  
> +	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
>  	if (ret != SQLITE_OK) {
>  		xlog(L_ERROR, "%s: insert statement prepare failed: %s",
>  			__func__, sqlite3_errmsg(dbh));
> @@ -397,13 +508,6 @@ sqlite_insert_client(const unsigned char *clname, const size_t namelen,
>  		goto out_err;
>  	}
>  
> -	ret = sqlite3_bind_int(stmt, 2, (int)has_session);
> -	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "%s: bind int failed: %s", __func__,
> -				sqlite3_errmsg(dbh));
> -		goto out_err;
> -	}
> -
>  	ret = sqlite3_step(stmt);
>  	if (ret == SQLITE_DONE)
>  		ret = SQLITE_OK;
> @@ -424,8 +528,18 @@ sqlite_remove_client(const unsigned char *clname, const size_t namelen)
>  	int ret;
>  	sqlite3_stmt *stmt = NULL;
>  
> -	ret = sqlite3_prepare_v2(dbh, "DELETE FROM clients WHERE id==?", -1,
> -				 &stmt, NULL);
> +	ret = snprintf(buf, sizeof(buf), "DELETE FROM \"rec-%016lx\" "
> +				"WHERE id==?;", current_epoch);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		return -EINVAL;
> +	}
> +
> +	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
> +
>  	if (ret != SQLITE_OK) {
>  		xlog(L_ERROR, "%s: statement prepare failed: %s",
>  				__func__, sqlite3_errmsg(dbh));
> @@ -459,18 +573,26 @@ out_err:
>   * return an error.
>   */
>  int
> -sqlite_check_client(const unsigned char *clname, const size_t namelen,
> -			const bool has_session)
> +sqlite_check_client(const unsigned char *clname, const size_t namelen)
>  {
>  	int ret;
>  	sqlite3_stmt *stmt = NULL;
>  
> -	ret = sqlite3_prepare_v2(dbh, "SELECT count(*) FROM clients WHERE "
> -				      "id==?", -1, &stmt, NULL);
> +	ret = snprintf(buf, sizeof(buf), "SELECT count(*) FROM  \"rec-%016lx\" "
> +				"WHERE id==?;", recovery_epoch);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		return -EINVAL;
> +	}
> +
> +	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
>  	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
> -				__func__, sqlite3_errmsg(dbh));
> -		goto out_err;
> +		xlog(L_ERROR, "%s: select statement prepare failed: %s",
> +			__func__, sqlite3_errmsg(dbh));
> +		return ret;
>  	}
>  
>  	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> @@ -495,37 +617,10 @@ sqlite_check_client(const unsigned char *clname, const size_t namelen,
>  		goto out_err;
>  	}
>  
> -	/* Only update timestamp for v4.0 clients */
> -	if (has_session) {
> -		ret = SQLITE_OK;
> -		goto out_err;
> -	}
> -
>  	sqlite3_finalize(stmt);
> -	stmt = NULL;
> -	ret = sqlite3_prepare_v2(dbh, "UPDATE OR FAIL clients SET "
> -				      "time=strftime('%s', 'now') WHERE id==?",
> -				 -1, &stmt, NULL);
> -	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "%s: unable to prepare update statement: %s",
> -				__func__, sqlite3_errmsg(dbh));
> -		goto out_err;
> -	}
>  
> -	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> -				SQLITE_STATIC);
> -	if (ret != SQLITE_OK) {
> -		xlog(L_ERROR, "%s: bind blob failed: %s",
> -				__func__, sqlite3_errmsg(dbh));
> -		goto out_err;
> -	}
> -
> -	ret = sqlite3_step(stmt);
> -	if (ret == SQLITE_DONE)
> -		ret = SQLITE_OK;
> -	else
> -		xlog(L_ERROR, "%s: unexpected return code from update: %s",
> -				__func__, sqlite3_errmsg(dbh));
> +	/* Now insert the client into the table for the current epoch */
> +	return sqlite_insert_client(clname, namelen);
>  
>  out_err:
>  	xlog(D_GENERAL, "%s: returning %d", __func__, ret);
> @@ -599,3 +694,211 @@ sqlite_query_reclaiming(const time_t grace_start)
>  			"reclaim", __func__, ret);
>  	return ret;
>  }
> +
> +int
> +sqlite_grace_start(void)
> +{
> +	int ret, ret2;
> +	char *err;
> +	uint64_t tcur = current_epoch;
> +	uint64_t trec = recovery_epoch;
> +
> +	/* begin transaction */
> +	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
> +				&err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to begin transaction: %s", err);
> +		goto rollback;
> +	}
> +
> +	if (trec == 0) {
> +		/*
> +		 * A normal grace start - update the epoch values in the grace
> +		 * table and create a new table for the current reboot epoch.
> +		 */
> +		trec = tcur;
> +		tcur++;
> +
> +		ret = snprintf(buf, sizeof(buf), "UPDATE grace "
> +				"SET current = %ld, recovery = %ld;",
> +				(int64_t)tcur, (int64_t)trec);
> +		if (ret < 0) {
> +			xlog(L_ERROR, "sprintf failed!");
> +			goto rollback;
> +		} else if ((size_t)ret >= sizeof(buf)) {
> +			xlog(L_ERROR, "sprintf output too long! (%d chars)",
> +				ret);
> +			ret = -EINVAL;
> +			goto rollback;
> +		}
> +
> +		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +		if (ret != SQLITE_OK) {
> +			xlog(L_ERROR, "Unable to update epochs: %s", err);
> +			goto rollback;
> +		}
> +
> +		ret = snprintf(buf, sizeof(buf), "CREATE TABLE \"rec-%016lx\" "
> +				"(id BLOB PRIMARY KEY);",
> +				tcur);
> +		if (ret < 0) {
> +			xlog(L_ERROR, "sprintf failed!");
> +			goto rollback;
> +		} else if ((size_t)ret >= sizeof(buf)) {
> +			xlog(L_ERROR, "sprintf output too long! (%d chars)",
> +				ret);
> +			ret = -EINVAL;
> +			goto rollback;
> +		}
> +
> +		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +		if (ret != SQLITE_OK) {
> +			xlog(L_ERROR, "Unable to create table for current epoch: %s",
> +				err);
> +			goto rollback;
> +		}
> +	} else {
> +		/* Server restarted while in grace - don't update the epoch
> +		 * values in the grace table, just clear out the records for
> +		 * the current reboot epoch.
> +		 */
> +		ret = snprintf(buf, sizeof(buf), "DELETE FROM \"rec-%016lx\";",
> +				tcur);
> +		if (ret < 0) {
> +			xlog(L_ERROR, "sprintf failed!");
> +			goto rollback;
> +		} else if ((size_t)ret >= sizeof(buf)) {
> +			xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +			ret = -EINVAL;
> +			goto rollback;
> +		}
> +
> +		ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +		if (ret != SQLITE_OK) {
> +			xlog(L_ERROR, "Unable to clear table for current epoch: %s",
> +				err);
> +			goto rollback;
> +		}
> +	}
> +
> +	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to commit transaction: %s", err);
> +		goto rollback;
> +	}
> +
> +	current_epoch = tcur;
> +	recovery_epoch = trec;
> +	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
> +		__func__, current_epoch, recovery_epoch);
> +
> +out:
> +	sqlite3_free(err);
> +	return ret;
> +rollback:
> +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
> +	if (ret2 != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
> +	goto out;
> +}
> +
> +int
> +sqlite_grace_done(void)
> +{
> +	int ret, ret2;
> +	char *err;
> +
> +	/* begin transaction */
> +	ret = sqlite3_exec(dbh, "BEGIN EXCLUSIVE TRANSACTION;", NULL, NULL,
> +				&err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to begin transaction: %s", err);
> +		goto rollback;
> +	}
> +
> +	ret = sqlite3_exec(dbh, "UPDATE grace SET recovery = \"0\";",
> +			NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to clear recovery epoch: %s", err);
> +		goto rollback;
> +	}
> +
> +	ret = snprintf(buf, sizeof(buf), "DROP TABLE \"rec-%016lx\";",
> +		recovery_epoch);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		goto rollback;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		ret = -EINVAL;
> +		goto rollback;
> +	}
> +
> +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to drop table for recovery epoch: %s",
> +			err);
> +		goto rollback;
> +	}
> +
> +	ret = sqlite3_exec(dbh, "COMMIT TRANSACTION;", NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to commit transaction: %s", err);
> +		goto rollback;
> +	}
> +
> +	recovery_epoch = 0;
> +	xlog(D_GENERAL, "%s: current_epoch=%lu recovery_epoch=%lu",
> +		__func__, current_epoch, recovery_epoch);
> +
> +out:
> +	sqlite3_free(err);
> +	return ret;
> +rollback:
> +	ret2 = sqlite3_exec(dbh, "ROLLBACK TRANSACTION;", NULL, NULL, &err);
> +	if (ret2 != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to rollback transaction: %s", err);
> +	goto out;
> +}
> +
> +
> +int
> +sqlite_iterate_recovery(int (*cb)(struct cld_client *clnt), struct cld_client *clnt)
> +{
> +	int ret;
> +	sqlite3_stmt *stmt = NULL;
> +	struct cld_msg *cmsg = &clnt->cl_msg;
> +
> +	if (recovery_epoch == 0) {
> +		xlog(D_GENERAL, "%s: not in grace!", __func__);
> +		return -EINVAL;
> +	}
> +
> +	ret = snprintf(buf, sizeof(buf), "SELECT * FROM \"rec-%016lx\";",
> +		recovery_epoch);
> +	if (ret < 0) {
> +		xlog(L_ERROR, "sprintf failed!");
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		xlog(L_ERROR, "sprintf output too long! (%d chars)", ret);
> +		return -EINVAL;
> +	}
> +
> +	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "%s: select statement prepare failed: %s",
> +			__func__, sqlite3_errmsg(dbh));
> +		return ret;
> +	}
> +
> +	while ((ret = sqlite3_step(stmt)) == SQLITE_ROW) {
> +		memcpy(&cmsg->cm_u.cm_name.cn_id, sqlite3_column_blob(stmt, 0),
> +			NFS4_OPAQUE_LIMIT);
> +		cmsg->cm_u.cm_name.cn_len = sqlite3_column_bytes(stmt, 0);
> +		cb(clnt);
> +	}
> +	if (ret == SQLITE_DONE)
> +		ret = 0;
> +	sqlite3_finalize(stmt);
> +	return ret;
> +}
> diff --git a/utils/nfsdcld/sqlite.h b/utils/nfsdcld/sqlite.h
> index 06e7c04..5c56f75 100644
> --- a/utils/nfsdcld/sqlite.h
> +++ b/utils/nfsdcld/sqlite.h
> @@ -20,13 +20,16 @@
>  #ifndef _SQLITE_H_
>  #define _SQLITE_H_
>  
> +struct cld_client;
> +
>  int sqlite_prepare_dbh(const char *topdir);
> -int sqlite_insert_client(const unsigned char *clname, const size_t namelen,
> -				const bool has_session, const bool zerotime);
> +int sqlite_insert_client(const unsigned char *clname, const size_t namelen);
>  int sqlite_remove_client(const unsigned char *clname, const size_t namelen);
> -int sqlite_check_client(const unsigned char *clname, const size_t namelen,
> -				const bool has_session);
> +int sqlite_check_client(const unsigned char *clname, const size_t namelen);
>  int sqlite_remove_unreclaimed(const time_t grace_start);
>  int sqlite_query_reclaiming(const time_t grace_start);
> +int sqlite_grace_start(void);
> +int sqlite_grace_done(void);
> +int sqlite_iterate_recovery(int (*cb)(struct cld_client *clnt), struct cld_client *clnt);
>  
>  #endif /* _SQLITE_H */

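Two details in the quoted code may be worth a second look: the table names are formatted with "%016lx" from a uint64_t, which is only correct where long is 64 bits, and sqlite_iterate_recovery() copies NFS4_OPAQUE_LIMIT bytes from the blob regardless of what sqlite3_column_bytes() reports, which may read past a shorter stored id. A hedged sketch of one possible alternative follows; both helper names are hypothetical and are not part of the patch or of SQLite.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the per-epoch table name ("rec-" followed by
 * 16 hex digits) portably. Returns 0 on success, -EINVAL on a formatting
 * error or if the name does not fit in out.
 */
int epoch_table_name(char *out, size_t outlen, uint64_t epoch)
{
	int ret;

	assert(out != NULL);
	ret = snprintf(out, outlen, "rec-%016" PRIx64, epoch);
	if (ret < 0 || (size_t)ret >= outlen)
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical helper: copy a blob into a fixed-size destination without
 * reading past the source or writing past the destination. Returns the
 * number of bytes copied.
 */
size_t copy_blob_bounded(void *dst, size_t dstlen,
			 const void *src, size_t srclen)
{
	size_t n = srclen < dstlen ? srclen : dstlen;

	assert(dst != NULL);
	if (src == NULL || n == 0)
		return 0;
	memcpy(dst, src, n);
	return n;
}
```

In the iteration loop, the length from sqlite3_column_bytes() would be passed as srclen and the returned count stored in cn_len, so the length and the copied bytes stay consistent.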
-- 
Jeff Layton <jlayton@kernel.org>


end of thread, newest message ~2018-11-10 14:16 UTC

Thread overview: 11+ messages
2018-11-06 18:36 [nfs-utils PATCH RFC 0/7] restore nfsdcld Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 1/7] Revert "nfsdcltrack: remove the nfsdcld daemon" Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 2/7] nfsdcld: move nfsdcld to its own directory Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 3/7] nfsdcld: a few enhancements Scott Mayhew
2018-11-10 14:16   ` Jeff Layton
2018-11-06 18:36 ` [nfs-utils PATCH RFC 4/7] nfsdcld: remove some unused functions Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 5/7] nfsdcld: the -p option should specify the rpc_pipefs mountpoint Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 6/7] nfsdcld: add /etc/nfs.conf support Scott Mayhew
2018-11-08 12:00   ` Steve Dickson
2018-11-08 13:09     ` Scott Mayhew
2018-11-06 18:36 ` [nfs-utils PATCH RFC 7/7] systemd: add a unit file for nfsdcld Scott Mayhew
