[PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
@ 2011-12-14 13:57 Jeff Layton
  2011-12-14 13:57 ` [PATCH 1/7] clstated: add clname tracking daemon stub Jeff Layton
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

This patchset is the userspace portion of the knfsd client name tracking
overhaul. See this patch series for an explanation:

    nfsd: overhaul the client name tracking code (RFC)

The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
and handles the requests.

The data is stored using a sqlite database. The main reason for this is
that it takes care of most of the fussy details and atomicity concerns
of tracking the information on stable storage.

For now, the daemon is only suitable for single-host configurations.
The plan is to later extend this to be suitable for clustered
configurations as well.

The code is still a little rough, so be gentle. It also lacks things
like a manpage. I plan to add all that before doing a "formal" patch
submission, but I wanted to get some early review of the overall design
before to spend a lot of time knocking off the rough edges.

Jeff Layton (7):
  clstated: add clname tracking daemon stub
  clstated: reattempt the pipe open if it fails on ENOENT
  clstated: add autoconf goop for sqlite
  clstated: add routines for a sqlite backend database
  clstated: add remove functionality
  clstated: add check/update functionality
  clstated: add function to remove unreclaimed client records

 aclocal/libsqlite3.m4      |   33 +++
 configure.ac               |   23 ++
 utils/Makefile.am          |    4 +
 utils/clstated/Makefile.am |   14 +
 utils/clstated/clstated.c  |  355 +++++++++++++++++++++++++++
 utils/clstated/sqlite.c    |  572 ++++++++++++++++++++++++++++++++++++++++++++
 utils/clstated/sqlite.h    |   30 +++
 7 files changed, 1031 insertions(+), 0 deletions(-)
 create mode 100644 aclocal/libsqlite3.m4
 create mode 100644 utils/clstated/Makefile.am
 create mode 100644 utils/clstated/clstated.c
 create mode 100644 utils/clstated/sqlite.c
 create mode 100644 utils/clstated/sqlite.h

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/7] clstated: add clname tracking daemon stub
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 13:57 ` [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT Jeff Layton
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

This program opens and "listens" on the new nfsd/clname rpc_pipefs pipe.
The code here doesn't actually do anything on stable storage yet. That
will be added in a later patch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 configure.ac               |    1 +
 utils/Makefile.am          |    1 +
 utils/clstated/Makefile.am |   14 +++
 utils/clstated/clstated.c  |  243 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 259 insertions(+), 0 deletions(-)
 create mode 100644 utils/clstated/Makefile.am
 create mode 100644 utils/clstated/clstated.c

diff --git a/configure.ac b/configure.ac
index f101b86..822cf17 100644
--- a/configure.ac
+++ b/configure.ac
@@ -458,6 +458,7 @@ AC_CONFIG_FILES([
 	tools/nfs-iostat/Makefile
 	utils/Makefile
 	utils/blkmapd/Makefile
+	utils/clstated/Makefile
 	utils/exportfs/Makefile
 	utils/gssd/Makefile
 	utils/idmapd/Makefile
diff --git a/utils/Makefile.am b/utils/Makefile.am
index d074b85..beaf4e9 100644
--- a/utils/Makefile.am
+++ b/utils/Makefile.am
@@ -22,6 +22,7 @@ OPTDIRS += mount
 endif
 
 SUBDIRS = \
+	clstated \
 	exportfs \
 	mountd \
 	nfsd \
diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
new file mode 100644
index 0000000..9937353
--- /dev/null
+++ b/utils/clstated/Makefile.am
@@ -0,0 +1,14 @@
+## Process this file with automake to produce Makefile.in
+
+#man8_MANS	= clstated.man
+#EXTRA_DIST = $(man8_MANS)
+
+AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
+sbin_PROGRAMS	= clstated
+
+clstated_SOURCES = clstated.c
+
+clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
+
+MAINTAINERCLEANFILES = Makefile.in
+
diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
new file mode 100644
index 0000000..93a2710
--- /dev/null
+++ b/utils/clstated/clstated.c
@@ -0,0 +1,243 @@
+/*
+ * clstated.c -- NFSv4 client state tracking daemon
+ *
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif /* HAVE_CONFIG_H */
+
+#include <errno.h>
+#include <event.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <linux/nfsd/clstate.h>
+
+#include "xlog.h"
+#include "nfslib.h"
+
+#ifndef PIPEFS_DIR
+#define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
+#endif
+
+#define DEFAULT_CLNAMED_PATH	PIPEFS_DIR "/nfsd/clstate"
+
+/* private data structures */
+struct clstate_client {
+	int			cl_fd;
+	struct event		cl_event;
+	struct clstate_msg	cl_msg;
+};
+
+/* global variables */
+static char *pipepath = DEFAULT_CLNAMED_PATH;
+
+static struct option longopts[] =
+{
+	{ "help", 0, NULL, 'h' },
+	{ "foreground", 0, NULL, 'F' },
+	{ "debug", 0, NULL, 'd' },
+	{ "path", 1, NULL, 'p' },
+	{ NULL, 0, 0, 0 },
+};
+
+/* forward declarations */
+static void clstatecb(int UNUSED(fd), short which, void *data);
+
+static void
+usage(char *progname)
+{
+	printf("Usage:\n");
+	printf("%s [ -hFdp ]\n", progname);
+}
+
+static int
+clstate_pipe_open(struct clstate_client *clnt)
+{
+	clnt->cl_fd = open(pipepath, O_RDWR, 0);
+	if (clnt->cl_fd < 0) {
+		xlog(L_ERROR, "%s: unable to open %s: %m", __func__, pipepath);
+		return clnt->cl_fd;
+	}
+
+	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);
+	event_add(&clnt->cl_event, NULL);
+
+	return clnt->cl_fd;
+}
+
+static void
+clstate_pipe_reopen(struct clstate_client *clnt)
+{
+	int fd;
+
+	fd = open(pipepath, O_RDWR, 0);
+	if (fd < 0) {
+		xlog_warn("%s: Re-opening of %s failed: %m");
+		return;
+	}
+
+	if ((clnt->cl_event.ev_flags & EVLIST_INIT))
+		event_del(&clnt->cl_event);
+	if (clnt->cl_fd > 0)
+		close(clnt->cl_fd);
+
+	clnt->cl_fd = fd;
+	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);
+	/* event_add is done by the caller */
+}
+
+static void
+clstate_create(struct clstate_client *clnt)
+{
+	ssize_t bsize, wsize;
+	struct clstate_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: create client record", __func__);
+
+	/* FIXME: create client record on storage here */
+
+	/* set up reply */
+	cmsg->cm_status = 0;
+	cmsg->cm_len = 0;
+	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
+		sizeof(cmsg->cm_u.cm_id));
+
+	bsize = sizeof(*cmsg);
+
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
+			 __func__, wsize);
+		clstate_pipe_reopen(clnt);
+	}
+}
+
+static void
+clstatecb(int UNUSED(fd), short which, void *data)
+{
+	ssize_t len;
+	struct clstate_client *clnt = data;
+	struct clstate_msg *cmsg = &clnt->cl_msg;
+
+	if (which != EV_READ)
+		goto out;
+
+	len = atomicio(read, clnt->cl_fd, cmsg, sizeof(*cmsg));
+	if (len <= 0) {
+		xlog_warn("%s: pipe read failed: %m", __func__);
+		clstate_pipe_reopen(clnt);
+		goto out;
+	}
+
+	switch(cmsg->cm_cmd) {
+	case Cl_Create:
+		clstate_create(clnt);
+		break;
+	case Cl_Expire:
+	case Cl_NrToReclaim:
+	case Cl_Allow:
+	case Cl_GraceDone:
+		xlog_warn("%s: command %u is not yet implemented", __func__,
+			  cmsg->cm_cmd);
+		break;
+	default:
+		xlog_warn("%s: unrecognized command %u", __func__,
+			  cmsg->cm_cmd);
+		clstate_pipe_reopen(clnt);
+	}
+out:
+	event_add(&clnt->cl_event, NULL);
+}
+
+int
+main(int argc, char **argv)
+{
+	char arg;
+	int rc = 0, fd;
+	bool foreground = false;
+	char *progname;
+	struct clstate_client clnt;
+
+	memset(&clnt, 0, sizeof(clnt));
+
+	progname = strdup(basename(argv[0]));
+	if (!progname) {
+		fprintf(stderr, "%s: unable to allocate memory.\n", argv[0]);
+		return 1;
+	}
+
+	event_init();
+	xlog_syslog(0);
+	xlog_stderr(1);
+
+	/* process command-line options */
+	while ((arg = getopt_long(argc, argv, "hdFp:", longopts,
+				  NULL)) != EOF) {
+		switch (arg) {
+		case 'd':
+			xlog_config(D_ALL, 1);
+			break;
+		case 'F':
+			foreground = true;
+			break;
+		case 'p':
+			pipepath = optarg;
+			break;
+		default:
+			usage(progname);
+			return 0;
+		}
+	}
+
+
+	xlog_open(progname);
+	if (!foreground) {
+		xlog_syslog(1);
+		xlog_stderr(0);
+		rc = daemon(0, 0);
+		if (rc) {
+			xlog(L_ERROR, "Unable to daemonize: %m");
+			goto out;
+		}
+	}
+
+	/* set up storage db */
+
+	/* set up event handler */
+	fd = clstate_pipe_open(&clnt);
+	if (fd < 0) {
+		rc = fd;
+		goto out;
+	}
+
+	rc = event_dispatch();
+	if (rc < 0)
+		xlog(L_ERROR, "%s: event_dispatch failed: %m", __func__);
+
+	close(clnt.cl_fd);
+out:
+	free(progname);
+	return rc;
+}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
  2011-12-14 13:57 ` [PATCH 1/7] clstated: add clname tracking daemon stub Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 15:09   ` Steve Dickson
  2011-12-14 13:57 ` [PATCH 3/7] clstated: add autoconf goop for sqlite Jeff Layton
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

Generally, we want this daemon started before nfsd starts. Deal with the
situation where the pipe hasn't shown up yet.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 utils/clstated/clstated.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
index 93a2710..aab09a7 100644
--- a/utils/clstated/clstated.c
+++ b/utils/clstated/clstated.c
@@ -75,10 +75,17 @@ usage(char *progname)
 static int
 clstate_pipe_open(struct clstate_client *clnt)
 {
-	clnt->cl_fd = open(pipepath, O_RDWR, 0);
-	if (clnt->cl_fd < 0) {
-		xlog(L_ERROR, "%s: unable to open %s: %m", __func__, pipepath);
-		return clnt->cl_fd;
+	/* loop until the pipe is present or open fails for another reason */
+	for (;;) {
+		clnt->cl_fd = open(pipepath, O_RDWR, 0);
+		if (clnt->cl_fd >= 0) {
+			break;
+		} else if (errno != ENOENT) {
+			xlog(D_GENERAL, "%s: unable to open %s: %m", __func__,
+					pipepath);
+			return clnt->cl_fd;
+		}
+		sleep(1);
 	}
 
 	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/7] clstated: add autoconf goop for sqlite
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
  2011-12-14 13:57 ` [PATCH 1/7] clstated: add clname tracking daemon stub Jeff Layton
  2011-12-14 13:57 ` [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 13:57 ` [PATCH 4/7] clstated: add routines for a sqlite backend database Jeff Layton
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

Mostly cribbed from Chuck Lever's new-statd rewrite a few years ago...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 aclocal/libsqlite3.m4 |   33 +++++++++++++++++++++++++++++++++
 configure.ac          |   22 ++++++++++++++++++++++
 utils/Makefile.am     |    5 ++++-
 3 files changed, 59 insertions(+), 1 deletions(-)
 create mode 100644 aclocal/libsqlite3.m4

diff --git a/aclocal/libsqlite3.m4 b/aclocal/libsqlite3.m4
new file mode 100644
index 0000000..73d1e46
--- /dev/null
+++ b/aclocal/libsqlite3.m4
@@ -0,0 +1,33 @@
+dnl Checks for matching sqlite3 header and library, and
+dnl sufficient sqlite3 version.
+dnl
+AC_DEFUN([AC_SQLITE3_VERS], [
+  AC_CHECK_HEADERS([sqlite3.h], ,)
+
+  dnl look for the library; do not add to LIBS if found
+  AC_CHECK_LIB([sqlite3], [sqlite3_libversion_number], [LIBSQLITE=-lsqlite3], ,)
+  AC_SUBST(LIBSQLITE)
+
+  AC_MSG_CHECKING(for suitable sqlite3 version)
+
+  AC_CACHE_VAL([libsqlite3_cv_is_recent],
+   [
+    saved_LIBS="$LIBS"
+    LIBS=-lsqlite3
+    AC_TRY_RUN([
+	#include <stdio.h>
+	#include <sqlite3.h>
+	int main()
+	{
+		int vers = sqlite3_libversion_number();
+
+		return vers != SQLITE_VERSION_NUMBER ||
+			vers < 3003000;
+	}
+       ], [libsqlite3_cv_is_recent=yes], [libsqlite3_cv_is_recent=no],
+       [libsqlite3_cv_is_recent=unknown])
+    LIBS="$saved_LIBS"])
+
+  AC_MSG_RESULT($libsqlite3_cv_is_recent)
+  AM_CONDITIONAL(CONFIG_SQLITE3, [test "$libsqlite3_cv_is_recent" = "yes"])
+])dnl
diff --git a/configure.ac b/configure.ac
index 822cf17..bc7f6db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -186,6 +186,12 @@ else
 	AM_CONDITIONAL(MOUNT_CONFIG, [test "$enable_mount" = "yes"])
 fi
 
+AC_ARG_ENABLE(clstated,
+	[AC_HELP_STRING([--enable-clstated],
+			[Create clstated NFSv4 clientid tracking daemon. <:@default=yes@:>@])],
+	enable_clstated=$enableval,
+	enable_clstated="maybe")
+
 dnl Check for TI-RPC library and headers
 AC_LIBTIRPC
 
@@ -259,6 +265,9 @@ if test "$enable_nfsv4" = yes; then
   dnl check for the keyutils libraries and headers
   AC_KEYUTILS
 
+  dnl Check for sqlite3
+  AC_SQLITE3_VERS
+
   dnl librpcsecgss already has a dependency on libgssapi,
   dnl but we need to make sure we get the right version
   if test "$enable_gss" = yes; then
@@ -328,6 +337,19 @@ fi
 dnl Check for IPv6 support
 AC_IPV6
 
+dnl Decide what to do with clstated based on sqlite3 test result
+if test "$enable_nfsv4" = "yes" -a "$libsqlite3_cv_is_recent" != "yes" ; then
+	if test "$enable_clstated" = "yes"; then
+		AC_MSG_ERROR([clstated requires sqlite3])
+	elif test "$enable_clstated" != "no"; then
+		AC_MSG_WARN([clstated requires sqlite3, autodisabling it])
+		enable_clstated="no"
+	fi
+fi
+
+dnl Now set CONFIG_CLSTATED properly
+AM_CONDITIONAL(CONFIG_CLSTATED, [test "$enable_clstated" != "no" ])
+
 dnl *************************************************************
 dnl Check for headers
 dnl *************************************************************
diff --git a/utils/Makefile.am b/utils/Makefile.am
index beaf4e9..31562a3 100644
--- a/utils/Makefile.am
+++ b/utils/Makefile.am
@@ -21,8 +21,11 @@ if CONFIG_MOUNT
 OPTDIRS += mount
 endif
 
+if CONFIG_CLSTATED
+OPTDIRS += clstated
+endif
+
 SUBDIRS = \
-	clstated \
 	exportfs \
 	mountd \
 	nfsd \
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
                   ` (2 preceding siblings ...)
  2011-12-14 13:57 ` [PATCH 3/7] clstated: add autoconf goop for sqlite Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 14:56   ` Chuck Lever
  2011-12-14 13:57 ` [PATCH 5/7] clstated: add remove functionality Jeff Layton
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

Rather than roll our own "storage engine", use sqlite instead. It fits
the bill nicely as it does:

- durable on-disk storage
- the ability to constrain record uniqueness
- a facility for collating and searching the host records

...it does add a build dependency to nfs-utils, but almost all modern
distros provide those packages.

The current incarnation of this code dynamically links against a
provided sqlite library, but we could also consider including their
single-file "amalgamation" to reduce dependencies (though with all
the caveats that that entails).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 utils/clstated/Makefile.am |    4 +-
 utils/clstated/clstated.c  |   16 ++-
 utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
 utils/clstated/sqlite.h    |   27 ++++
 4 files changed, 393 insertions(+), 5 deletions(-)
 create mode 100644 utils/clstated/sqlite.c
 create mode 100644 utils/clstated/sqlite.h

diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
index 9937353..d1cef3c 100644
--- a/utils/clstated/Makefile.am
+++ b/utils/clstated/Makefile.am
@@ -6,9 +6,9 @@
 AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
 sbin_PROGRAMS	= clstated
 
-clstated_SOURCES = clstated.c
+clstated_SOURCES = clstated.c sqlite.c
 
-clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
+clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
 
 MAINTAINERCLEANFILES = Makefile.in
 
diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
index aab09a7..95ac696 100644
--- a/utils/clstated/clstated.c
+++ b/utils/clstated/clstated.c
@@ -36,6 +36,7 @@
 
 #include "xlog.h"
 #include "nfslib.h"
+#include "sqlite.h"
 
 #ifndef PIPEFS_DIR
 #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
@@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
 static void
 clstate_create(struct clstate_client *clnt)
 {
+	int ret;
 	ssize_t bsize, wsize;
 	struct clstate_msg *cmsg = &clnt->cl_msg;
 
-	xlog(D_GENERAL, "%s: create client record", __func__);
+	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
+			cmsg->cm_addr);
 
-	/* FIXME: create client record on storage here */
+	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
+					cmsg->cm_len);
 
 	/* set up reply */
-	cmsg->cm_status = 0;
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
 	cmsg->cm_len = 0;
 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
 		sizeof(cmsg->cm_u.cm_id));
 
 	bsize = sizeof(*cmsg);
 
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
 	if (wsize != bsize) {
 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
@@ -231,6 +236,11 @@ main(int argc, char **argv)
 	}
 
 	/* set up storage db */
+	rc = clstate_maindb_init();
+	if (rc) {
+		xlog(L_ERROR, "Failed to open main database: %d", rc);
+		goto out;
+	}
 
 	/* set up event handler */
 	fd = clstate_pipe_open(&clnt);
diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
new file mode 100644
index 0000000..ae83634
--- /dev/null
+++ b/utils/clstated/sqlite.c
@@ -0,0 +1,351 @@
+/*
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+/*
+ * Explanation:
+ *
+ * This file contains the code to manage the sqlite backend database for the
+ * clstated upcall daemon.
+ *
+ * The main database is called main.sqlite and contains the following tables:
+ *
+ * parameters: simple key/value pairs for storing database info
+ *
+ * addresses: list of server-side addresses recorded in the db, along with
+ * 	      the filenames of their respective db files.
+ *
+ * The daemon attaches to each server-address database as needed. Each
+ * server-address database has the following tables:
+ *
+ * clients: one column containing a BLOB with the clstate as sent by the client
+ * 	    and a timestamp (in epoch seconds) of when the record was
+ * 	    established
+ *
+ * FIXME: should we also record the fsid being accessed?
+ */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif /* HAVE_CONFIG_H */
+
+#include <errno.h>
+#include <event.h>
+#include <stdbool.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sqlite3.h>
+#include <linux/limits.h>
+#include <linux/nfsd/clstate.h>
+
+#include "xlog.h"
+
+#define CLSTATE_SQLITE_SCHEMA_VERSION 1
+
+#ifndef CLSTATE_DIR
+#define CLSTATE_DIR NFS_STATEDIR "/clstate"
+#endif
+
+/* in milliseconds */
+#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
+
+/* private data structures */
+
+/* global variables */
+
+/* top level DB directory */
+static char *clstate_topdir = CLSTATE_DIR;
+
+/* reusable pathname and sql command buffer */
+static char buf[PATH_MAX];
+
+/* global database handle */
+static sqlite3 *dbh;
+
+/* forward declarations */
+
+/* '.' is not allowed in dbnames -- convert them to '-' */
+static void
+addr_to_dbname(const char *src, char *dst)
+{
+	while(*src) {
+		if (*src == '.')
+			*dst = '-';
+		else
+			*dst = *src;
+		++src;
+		++dst;
+	}
+	*dst = '\0';
+}
+
+void
+clstate_set_topdir(char *topdir)
+{
+	clstate_topdir = topdir;
+}
+
+/* make a directory, ignoring EEXIST errors unless it's not a directory */
+static int
+mkdir_if_not_exist(char *dirname)
+{
+	int ret;
+	struct stat statbuf;
+
+	ret = mkdir(dirname, S_IRWXU);
+	if (ret && errno != EEXIST)
+		return -errno;
+
+	ret = stat(dirname, &statbuf);
+	if (ret)
+		return -errno;
+
+	if (!S_ISDIR(statbuf.st_mode))
+		ret = -ENOTDIR;
+
+	return ret;
+}
+
+/*
+ * Open the "main" database, and attempt to initialize it by creating the
+ * parameters table and inserting the schema version into it. Ignore any errors
+ * from that, and then attempt to select the version out of it again. If the
+ * version appears wrong, then assume that the DB is corrupt or has been
+ * upgraded, and return an error.
+ */
+int
+clstate_maindb_init(void)
+{
+	int ret;
+	char *err = NULL;
+	sqlite3_stmt *stmt = NULL;
+
+	ret = mkdir_if_not_exist(clstate_topdir);
+	if (ret)
+		return ret;
+
+	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
+	if (ret < 0)
+		return ret;
+
+	buf[PATH_MAX - 1] = '\0';
+
+	ret = sqlite3_open(buf, &dbh);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to open main database: %d", ret);
+		return ret;
+	}
+
+	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
+		goto out_err;
+	}
+
+	/* Try to create table */
+	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
+				"(key TEXT PRIMARY KEY, value TEXT);",
+				NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
+		goto out_err;
+	}
+
+	/* insert version into table -- ignore error if it fails */
+	ret = snprintf(buf, sizeof(buf),
+		       "INSERT OR IGNORE INTO parameters values (\"version\", "
+		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
+	if (ret < 0) {
+		goto out_err;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_err;
+	}
+
+	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
+				ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_prepare_v2(dbh,
+		"SELECT value FROM parameters WHERE key == \"version\";",
+		 -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
+		goto out_err;
+	}
+
+	/* check schema version */
+	ret = sqlite3_step(stmt);
+	if (ret != SQLITE_ROW) {
+		xlog(D_GENERAL, "Select statement execution failed: %s",
+				sqlite3_errmsg(dbh));
+		goto out_err;
+	}
+
+	/* process SELECT result */
+	ret = sqlite3_column_int(stmt, 0);
+	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
+		xlog(L_ERROR, "Unsupported database schema version! "
+			"Expected %d, got %d.",
+			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
+		ret = -EINVAL;
+		goto out_err;
+	}
+
+	sqlite3_free(err);
+	sqlite3_finalize(stmt);
+	return 0;
+
+out_err:
+	if (err) {
+		xlog(L_ERROR, "sqlite error: %s", err);
+		sqlite3_free(err);
+	}
+	sqlite3_finalize(stmt);
+	sqlite3_close(dbh);
+	return ret;
+}
+
+static int
+clstate_db_attach(char *dbname)
+{
+	int ret;
+	char *err;
+
+	/* convert address string into valid filename */
+	ret = snprintf(buf, sizeof(buf),
+			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
+			clstate_topdir, dbname, dbname);
+	if (ret < 0) {
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		return ret;
+	}
+
+	/* attach to new DB in filename */
+	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK)
+		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
+
+	xlog(D_GENERAL, "Attached database %s", dbname);
+	return ret;
+}
+
+static int
+clstate_db_detach(char *dbname)
+{
+	int ret;
+	char *err;
+
+	/* convert address string into valid filename */
+	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
+	if (ret < 0) {
+		return ret;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		return ret;
+	}
+
+	/* attach to new DB in filename */
+	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK)
+		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
+
+	xlog(D_GENERAL, "Detached database %s", dbname);
+	return ret;
+}
+
+/*
+ * Create a client record
+ */
+int
+clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
+			const size_t namelen)
+{
+	int ret;
+	char *err = NULL;
+	char dbname[CLSTATE_MAX_ADDRESS_LEN];
+	sqlite3_stmt *stmt = NULL;
+
+	addr_to_dbname((const char *)addr, dbname);
+
+	ret = clstate_db_attach(dbname);
+	if (ret != SQLITE_OK)
+		return ret;
+
+	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
+				   "(id BLOB PRIMARY KEY, time INTEGER);",
+				   dbname);
+	if (ret < 0) {
+		goto out_detach;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_detach;
+	}
+
+	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK) {
+		xlog(L_ERROR, "Unable to create table: %s", err);
+		goto out_detach;
+	}
+
+	ret = snprintf(buf, sizeof(buf),
+			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
+			"strftime('%%s', 'now'));", dbname);
+	if (ret < 0) {
+		goto out_detach;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_detach;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Bind blob failed: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(D_GENERAL, "Unexpected return code from insert: %s",
+				sqlite3_errmsg(dbh));
+
+out_err:
+	xlog(D_GENERAL, "%s returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	sqlite3_free(err);
+out_detach:
+	clstate_db_detach(dbname);
+	return ret;
+}
diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
new file mode 100644
index 0000000..4f81745
--- /dev/null
+++ b/utils/clstated/sqlite.h
@@ -0,0 +1,27 @@
+/*
+ * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _SQLITE_H_
+#define _SQLITE_H_
+
+void clstate_set_topdir(char *topdir);
+int clstate_maindb_init(void);
+int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
+
+#endif /* _SQLITE_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/7] clstated: add remove functionality
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
                   ` (3 preceding siblings ...)
  2011-12-14 13:57 ` [PATCH 4/7] clstated: add routines for a sqlite backend database Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 13:57 ` [PATCH 6/7] clstated: add check/update functionality Jeff Layton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

Allow the kernel to ask for removal of a client record.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 utils/clstated/clstated.c |   34 +++++++++++++++++++++++++++-
 utils/clstated/sqlite.c   |   54 +++++++++++++++++++++++++++++++++++++++++++++
 utils/clstated/sqlite.h   |    1 +
 3 files changed, 88 insertions(+), 1 deletions(-)

diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
index 95ac696..54214cb 100644
--- a/utils/clstated/clstated.c
+++ b/utils/clstated/clstated.c
@@ -147,6 +147,36 @@ clstate_create(struct clstate_client *clnt)
 }
 
 static void
+clstate_remove(struct clstate_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct clstate_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: expire client record. cm_addr=%s", __func__,
+			cmsg->cm_addr);
+
+	ret = clstate_remove_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
+					cmsg->cm_len);
+
+	/* set up reply */
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+	cmsg->cm_len = 0;
+	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
+			sizeof(cmsg->cm_u.cm_id));
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
+			 __func__, wsize);
+		clstate_pipe_reopen(clnt);
+	}
+}
+
+static void
 clstatecb(int UNUSED(fd), short which, void *data)
 {
 	ssize_t len;
@@ -168,8 +198,10 @@ clstatecb(int UNUSED(fd), short which, void *data)
 		clstate_create(clnt);
 		break;
 	case Cl_Expire:
-	case Cl_NrToReclaim:
+		clstate_remove(clnt);
+		break;
 	case Cl_Allow:
+	case Cl_NrToReclaim:
 	case Cl_GraceDone:
 		xlog_warn("%s: command %u is not yet implemented", __func__,
 			  cmsg->cm_cmd);
diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
index ae83634..78ce6b4 100644
--- a/utils/clstated/sqlite.c
+++ b/utils/clstated/sqlite.c
@@ -349,3 +349,57 @@ out_detach:
 	clstate_db_detach(dbname);
 	return ret;
 }
+
+/* Remove a client record */
+int
+clstate_remove_client(const unsigned char *addr, const unsigned char *clname,
+			const size_t namelen)
+{
+	int ret;
+	char *err = NULL;
+	char dbname[CLSTATE_MAX_ADDRESS_LEN];
+	sqlite3_stmt *stmt = NULL;
+
+	addr_to_dbname((const char *)addr, dbname);
+
+	ret = clstate_db_attach(dbname);
+	if (ret != SQLITE_OK)
+		return ret;
+
+	ret = snprintf(buf, sizeof(buf),
+			"DELETE FROM '%s'.clients WHERE id==?", dbname);
+	if (ret < 0) {
+		goto out_detach;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_detach;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Bind blob failed: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(D_GENERAL, "Unexpected return code from insert: %s",
+				sqlite3_errmsg(dbh));
+
+out_err:
+	xlog(D_GENERAL, "%s returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	sqlite3_free(err);
+out_detach:
+	clstate_db_detach(dbname);
+	return ret;
+}
diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
index 4f81745..6f5637f 100644
--- a/utils/clstated/sqlite.h
+++ b/utils/clstated/sqlite.h
@@ -23,5 +23,6 @@
 void clstate_set_topdir(char *topdir);
 int clstate_maindb_init(void);
 int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
+int clstate_remove_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
 
 #endif /* _SQLITE_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/7] clstated: add check/update functionality
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
                   ` (4 preceding siblings ...)
  2011-12-14 13:57 ` [PATCH 5/7] clstated: add remove functionality Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 13:57 ` [PATCH 7/7] clstated: add function to remove unreclaimed client records Jeff Layton
  2011-12-14 15:23 ` [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Steve Dickson
  7 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

Add functions to check whether a client is allowed to reclaim, and
update its timestamp in the DB if so. We do this with an UPDATE OR
FAIL SQL statement. If the update fails for any reason (like, say
the record doesn't exist), then the kernel can disallow the reclaim.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 utils/clstated/clstated.c |   32 ++++++++++++++++++++++++
 utils/clstated/sqlite.c   |   58 +++++++++++++++++++++++++++++++++++++++++++++
 utils/clstated/sqlite.h   |    1 +
 3 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
index 54214cb..885e302 100644
--- a/utils/clstated/clstated.c
+++ b/utils/clstated/clstated.c
@@ -177,6 +177,36 @@ clstate_remove(struct clstate_client *clnt)
 }
 
 static void
+clstate_check(struct clstate_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct clstate_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: check client record. cm_addr=%s", __func__,
+			cmsg->cm_addr);
+
+	ret = clstate_check_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
+					cmsg->cm_len);
+
+	/* set up reply */
+	cmsg->cm_status = ret ? -EACCES : ret;
+	cmsg->cm_len = 0;
+	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
+			sizeof(cmsg->cm_u.cm_id));
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
+			 __func__, wsize);
+		clstate_pipe_reopen(clnt);
+	}
+}
+
+static void
 clstatecb(int UNUSED(fd), short which, void *data)
 {
 	ssize_t len;
@@ -201,6 +231,8 @@ clstatecb(int UNUSED(fd), short which, void *data)
 		clstate_remove(clnt);
 		break;
 	case Cl_Allow:
+		clstate_check(clnt);
+		break;
 	case Cl_NrToReclaim:
 	case Cl_GraceDone:
 		xlog_warn("%s: command %u is not yet implemented", __func__,
diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
index 78ce6b4..d6f6a90 100644
--- a/utils/clstated/sqlite.c
+++ b/utils/clstated/sqlite.c
@@ -403,3 +403,61 @@ out_detach:
 	clstate_db_detach(dbname);
 	return ret;
 }
+
+/*
+ * Update a client's timestamp. Return an error if the update fails. On error,
+ * the kernel can infer that reclaim should be denied.
+ */
+int
+clstate_check_client(const unsigned char *addr, const unsigned char *clname,
+			const size_t namelen)
+{
+	int ret;
+	char *err = NULL;
+	char dbname[CLSTATE_MAX_ADDRESS_LEN];
+	sqlite3_stmt *stmt = NULL;
+
+	addr_to_dbname((const char *)addr, dbname);
+
+	ret = clstate_db_attach(dbname);
+	if (ret != SQLITE_OK)
+		return ret;
+
+	ret = snprintf(buf, sizeof(buf),
+			"UPDATE OR FAIL '%s'.clients SET time=strftime('%%s', "
+			"'now') WHERE id==?", dbname);
+	if (ret < 0) {
+		goto out_detach;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_detach;
+	}
+
+	ret = sqlite3_prepare_v2(dbh, buf, -1, &stmt, NULL);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Unable to prepare update statement: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
+				SQLITE_STATIC);
+	if (ret != SQLITE_OK) {
+		xlog(D_GENERAL, "Bind blob failed: %d", ret);
+		goto out_err;
+	}
+
+	ret = sqlite3_step(stmt);
+	if (ret == SQLITE_DONE)
+		ret = SQLITE_OK;
+	else
+		xlog(D_GENERAL, "Unexpected return code from update: %s",
+				sqlite3_errmsg(dbh));
+
+out_err:
+	xlog(D_GENERAL, "%s returning %d", __func__, ret);
+	sqlite3_finalize(stmt);
+	sqlite3_free(err);
+out_detach:
+	clstate_db_detach(dbname);
+	return ret;
+}
diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
index 6f5637f..46d2f24 100644
--- a/utils/clstated/sqlite.h
+++ b/utils/clstated/sqlite.h
@@ -24,5 +24,6 @@ void clstate_set_topdir(char *topdir);
 int clstate_maindb_init(void);
 int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
 int clstate_remove_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
+int clstate_check_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
 
 #endif /* _SQLITE_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/7] clstated: add function to remove unreclaimed client records
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
                   ` (5 preceding siblings ...)
  2011-12-14 13:57 ` [PATCH 6/7] clstated: add check/update functionality Jeff Layton
@ 2011-12-14 13:57 ` Jeff Layton
  2011-12-14 15:23 ` [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Steve Dickson
  7 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 13:57 UTC (permalink / raw)
  To: linux-nfs

This should remove any client record from any server address DB that
has a timestamp prior to the given time.

Eventually, this call will need to be made cluster aware when this is
run in a clustered configuration. For now, this is only suitable for
single-host configurations.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 utils/clstated/clstated.c |   33 +++++++++++++-
 utils/clstated/sqlite.c   |  109 +++++++++++++++++++++++++++++++++++++++++++++
 utils/clstated/sqlite.h   |    1 +
 3 files changed, 142 insertions(+), 1 deletions(-)

diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
index 885e302..a19d269 100644
--- a/utils/clstated/clstated.c
+++ b/utils/clstated/clstated.c
@@ -207,6 +207,35 @@ clstate_check(struct clstate_client *clnt)
 }
 
 static void
+clstate_gracedone(struct clstate_client *clnt)
+{
+	int ret;
+	ssize_t bsize, wsize;
+	struct clstate_msg *cmsg = &clnt->cl_msg;
+
+	xlog(D_GENERAL, "%s: grace done. cm_gracetime=%ld", __func__,
+			cmsg->cm_u.cm_gracetime);
+
+	ret = clstate_remove_unreclaimed(cmsg->cm_u.cm_gracetime);
+
+	/* set up reply: downcall with 0 status */
+	cmsg->cm_status = ret ? -EREMOTEIO : ret;
+	cmsg->cm_len = 0;
+	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
+			sizeof(cmsg->cm_u));
+
+	bsize = sizeof(*cmsg);
+
+	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
+	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
+	if (wsize != bsize) {
+		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
+			 __func__, wsize);
+		clstate_pipe_reopen(clnt);
+	}
+}
+
+static void
 clstatecb(int UNUSED(fd), short which, void *data)
 {
 	ssize_t len;
@@ -233,8 +262,10 @@ clstatecb(int UNUSED(fd), short which, void *data)
 	case Cl_Allow:
 		clstate_check(clnt);
 		break;
-	case Cl_NrToReclaim:
 	case Cl_GraceDone:
+		clstate_gracedone(clnt);
+		break;
+	case Cl_NrToReclaim:
 		xlog_warn("%s: command %u is not yet implemented", __func__,
 			  cmsg->cm_cmd);
 		break;
diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
index d6f6a90..c8e15e5 100644
--- a/utils/clstated/sqlite.c
+++ b/utils/clstated/sqlite.c
@@ -44,6 +44,7 @@
 #include "config.h"
 #endif /* HAVE_CONFIG_H */
 
+#include <dirent.h>
 #include <errno.h>
 #include <event.h>
 #include <stdbool.h>
@@ -97,6 +98,30 @@ addr_to_dbname(const char *src, char *dst)
 	*dst = '\0';
 }
 
+/*
+ * strip the '.sqlite' from the end of the filename. If there is no postfix,
+ * then return NULL and don't do anything.
+ */
+static char *
+filename_to_dbname(const char *src, char *dst)
+{
+	char *postfix;
+	size_t len;
+
+	/* no .sqlite prefix? Then we're not interested */
+	postfix = strstr(src, ".sqlite");
+	if (!postfix)
+		return postfix;
+
+	len = postfix - src;
+	if (len > CLSTATE_MAX_ADDRESS_LEN)
+		return NULL;
+
+	strncpy(dst, src, len);
+	dst[len] = '\0';
+	return dst;
+}
+
 void
 clstate_set_topdir(char *topdir)
 {
@@ -461,3 +486,87 @@ out_detach:
 	clstate_db_detach(dbname);
 	return ret;
 }
+
+/*
+ * Attempt to attach to a database with the given filename. If the DB appears
+ * to be usable, then remove any client records with a timestamp before
+ * grace_start.
+ */
+static int
+clstate_remove_unreclaimed_db(const char *filename, time_t grace_start)
+{
+	int ret;
+	char *err = NULL;
+	char dbname[CLSTATE_MAX_ADDRESS_LEN];
+
+	/* get the dbname for the given filename */
+	if (!filename_to_dbname(filename, dbname))
+		return 0;
+
+	/* skip main.sqlite */
+	if (!strcmp(dbname, "main"))
+		return 0;
+
+	ret = clstate_db_attach(dbname);
+	if (ret != SQLITE_OK)
+		return ret;
+
+	ret = snprintf(buf, sizeof(buf),
+			"DELETE FROM '%s'.clients WHERE time < %ld",
+			dbname, grace_start);
+	if (ret < 0) {
+		goto out_detach;
+	} else if ((size_t)ret >= sizeof(buf)) {
+		ret = -EINVAL;
+		goto out_detach;
+	}
+
+	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
+	if (ret != SQLITE_OK)
+		xlog(D_GENERAL, "delete failed: %s", err);
+
+	xlog(D_GENERAL, "%s returning %d", __func__, ret);
+	sqlite3_free(err);
+
+out_detach:
+	clstate_db_detach(dbname);
+	return ret;
+}
+
+/*
+ * remove any client records that were not reclaimed since grace_start. We
+ * do this by attempting to attach to any database in the clstatedir and
+ * removing any records that have timestamps that are too old.
+ */
+int
+clstate_remove_unreclaimed(const time_t grace_start)
+{
+	int ret;
+	DIR *dir;
+	struct dirent *de;
+
+	dir = opendir(clstate_topdir);
+	if (!dir) {
+		xlog(L_ERROR, "Unable to opendir %s: %m", clstate_topdir);
+		return errno;
+	}
+
+	for (;;) {
+		errno = 0;
+		de = readdir(dir);
+		if (!de) {
+			ret = errno;
+			if (errno != 0)
+				xlog(L_ERROR, "readdir failed: %m");
+			break;
+		}
+
+		ret = clstate_remove_unreclaimed_db(de->d_name, grace_start);
+		if (ret)
+			xlog(L_ERROR, "Unable to clear old client records from "
+				      "%s: %d", de->d_name, ret);
+	}
+
+	closedir(dir);
+	return ret;
+}
diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
index 46d2f24..3fecf96 100644
--- a/utils/clstated/sqlite.h
+++ b/utils/clstated/sqlite.h
@@ -25,5 +25,6 @@ int clstate_maindb_init(void);
 int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
 int clstate_remove_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
 int clstate_check_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
+int clstate_remove_unreclaimed(const time_t grace_start);
 
 #endif /* _SQLITE_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 13:57 ` [PATCH 4/7] clstated: add routines for a sqlite backend database Jeff Layton
@ 2011-12-14 14:56   ` Chuck Lever
  2011-12-14 15:14     ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2011-12-14 14:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs


On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:

> Rather than roll our own "storage engine", use sqlite instead. It fits
> the bill nicely as it does:
> 
> - durable on-disk storage
> - the ability to constrain record uniqueness
> - a facility for collating and searching the host records

It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.

> ...it does add a build dependency to nfs-utils, but almost all modern
> distros provide those packages.
> 
> The current incarnation of this code dynamically links against a
> provided sqlite library, but we could also consider including their
> single-file "amalgamation" to reduce dependencies (though with all
> the caveats that that entails).
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> utils/clstated/Makefile.am |    4 +-
> utils/clstated/clstated.c  |   16 ++-
> utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
> utils/clstated/sqlite.h    |   27 ++++
> 4 files changed, 393 insertions(+), 5 deletions(-)
> create mode 100644 utils/clstated/sqlite.c
> create mode 100644 utils/clstated/sqlite.h
> 
> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
> index 9937353..d1cef3c 100644
> --- a/utils/clstated/Makefile.am
> +++ b/utils/clstated/Makefile.am
> @@ -6,9 +6,9 @@
> AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
> sbin_PROGRAMS	= clstated
> 
> -clstated_SOURCES = clstated.c
> +clstated_SOURCES = clstated.c sqlite.c
> 
> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
> 
> MAINTAINERCLEANFILES = Makefile.in
> 
> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> index aab09a7..95ac696 100644
> --- a/utils/clstated/clstated.c
> +++ b/utils/clstated/clstated.c
> @@ -36,6 +36,7 @@
> 
> #include "xlog.h"
> #include "nfslib.h"
> +#include "sqlite.h"
> 
> #ifndef PIPEFS_DIR
> #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
> static void
> clstate_create(struct clstate_client *clnt)
> {
> +	int ret;
> 	ssize_t bsize, wsize;
> 	struct clstate_msg *cmsg = &clnt->cl_msg;
> 
> -	xlog(D_GENERAL, "%s: create client record", __func__);
> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
> +			cmsg->cm_addr);
> 
> -	/* FIXME: create client record on storage here */
> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
> +					cmsg->cm_len);
> 
> 	/* set up reply */
> -	cmsg->cm_status = 0;
> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> 	cmsg->cm_len = 0;
> 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
> 		sizeof(cmsg->cm_u.cm_id));
> 
> 	bsize = sizeof(*cmsg);
> 
> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> 	if (wsize != bsize) {
> 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
> @@ -231,6 +236,11 @@ main(int argc, char **argv)
> 	}
> 
> 	/* set up storage db */
> +	rc = clstate_maindb_init();
> +	if (rc) {
> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
> +		goto out;
> +	}
> 
> 	/* set up event handler */
> 	fd = clstate_pipe_open(&clnt);
> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
> new file mode 100644
> index 0000000..ae83634
> --- /dev/null
> +++ b/utils/clstated/sqlite.c
> @@ -0,0 +1,351 @@
> +/*
> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> + * Boston, MA 02110-1301, USA.
> + */
> +
> +/*
> + * Explanation:
> + *
> + * This file contains the code to manage the sqlite backend database for the
> + * clstated upcall daemon.
> + *
> + * The main database is called main.sqlite and contains the following tables:
> + *
> + * parameters: simple key/value pairs for storing database info
> + *
> + * addresses: list of server-side addresses recorded in the db, along with
> + * 	      the filenames of their respective db files.
> + *
> + * The daemon attaches to each server-address database as needed. Each
> + * server-address database has the following tables:

Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.

> + *
> + * clients: one column containing a BLOB with the clstate as sent by the client
> + * 	    and a timestamp (in epoch seconds) of when the record was
> + * 	    established

I'm not yet clear on what's in cm_addr, but it seems to me that if you store each client's data as structured fields instead of a single BLOB, it will be much easier to use a generic tool like the sqlite3 command to debug the database.  That gives the observability and transparency of using a flat file with all the advantages of a database.

Plus, you get automatic data conversion; a one-to-one mapping between data on-disk and data in any endian-ness or word length.

> + *
> + * FIXME: should we also record the fsid being accessed?

I'm not exactly sure what additional data might be needed to support NFSv4 migration, for example.  However, using a database means it's relatively painless to add new columns in the future without having to provision them now.

> + */
> +
> +#ifdef HAVE_CONFIG_H
> +#include "config.h"
> +#endif /* HAVE_CONFIG_H */
> +
> +#include <errno.h>
> +#include <event.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <sqlite3.h>
> +#include <linux/limits.h>
> +#include <linux/nfsd/clstate.h>
> +
> +#include "xlog.h"
> +
> +#define CLSTATE_SQLITE_SCHEMA_VERSION 1
> +
> +#ifndef CLSTATE_DIR
> +#define CLSTATE_DIR NFS_STATEDIR "/clstate"
> +#endif
> +
> +/* in milliseconds */
> +#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
> +
> +/* private data structures */
> +
> +/* global variables */
> +
> +/* top level DB directory */
> +static char *clstate_topdir = CLSTATE_DIR;
> +
> +/* reusable pathname and sql command buffer */
> +static char buf[PATH_MAX];
> +
> +/* global database handle */
> +static sqlite3 *dbh;
> +
> +/* forward declarations */
> +
> +/* '.' is not allowed in dbnames -- convert them to '-' */
> +static void
> +addr_to_dbname(const char *src, char *dst)
> +{
> +	while(*src) {
> +		if (*src == '.')
> +			*dst = '-';
> +		else
> +			*dst = *src;
> +		++src;
> +		++dst;
> +	}
> +	*dst = '\0';
> +}
> +
> +void
> +clstate_set_topdir(char *topdir)
> +{
> +	clstate_topdir = topdir;
> +}
> +
> +/* make a directory, ignoring EEXIST errors unless it's not a directory */
> +static int
> +mkdir_if_not_exist(char *dirname)
> +{
> +	int ret;
> +	struct stat statbuf;
> +
> +	ret = mkdir(dirname, S_IRWXU);
> +	if (ret && errno != EEXIST)
> +		return -errno;
> +
> +	ret = stat(dirname, &statbuf);
> +	if (ret)
> +		return -errno;
> +
> +	if (!S_ISDIR(statbuf.st_mode))
> +		ret = -ENOTDIR;
> +
> +	return ret;
> +}
> +
> +/*
> + * Open the "main" database, and attempt to initialize it by creating the
> + * parameters table and inserting the schema version into it. Ignore any errors
> + * from that, and then attempt to select the version out of it again. If the
> + * version appears wrong, then assume that the DB is corrupt or has been
> + * upgraded, and return an error.
> + */
> +int
> +clstate_maindb_init(void)
> +{
> +	int ret;
> +	char *err = NULL;
> +	sqlite3_stmt *stmt = NULL;
> +
> +	ret = mkdir_if_not_exist(clstate_topdir);
> +	if (ret)
> +		return ret;
> +
> +	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
> +	if (ret < 0)
> +		return ret;
> +
> +	buf[PATH_MAX - 1] = '\0';
> +
> +	ret = sqlite3_open(buf, &dbh);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to open main database: %d", ret);
> +		return ret;
> +	}
> +
> +	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
> +		goto out_err;
> +	}
> +
> +	/* Try to create table */
> +	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
> +				"(key TEXT PRIMARY KEY, value TEXT);",
> +				NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
> +		goto out_err;
> +	}
> +
> +	/* insert version into table -- ignore error if it fails */
> +	ret = snprintf(buf, sizeof(buf),
> +		       "INSERT OR IGNORE INTO parameters values (\"version\", "
> +		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
> +	if (ret < 0) {
> +		goto out_err;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		ret = -EINVAL;
> +		goto out_err;
> +	}
> +
> +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
> +				ret);
> +		goto out_err;
> +	}
> +
> +	ret = sqlite3_prepare_v2(dbh,
> +		"SELECT value FROM parameters WHERE key == \"version\";",
> +		 -1, &stmt, NULL);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
> +		goto out_err;
> +	}
> +
> +	/* check schema version */
> +	ret = sqlite3_step(stmt);
> +	if (ret != SQLITE_ROW) {
> +		xlog(D_GENERAL, "Select statement execution failed: %s",
> +				sqlite3_errmsg(dbh));
> +		goto out_err;
> +	}
> +
> +	/* process SELECT result */
> +	ret = sqlite3_column_int(stmt, 0);
> +	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
> +		xlog(L_ERROR, "Unsupported database schema version! "
> +			"Expected %d, got %d.",
> +			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
> +		ret = -EINVAL;
> +		goto out_err;
> +	}
> +
> +	sqlite3_free(err);
> +	sqlite3_finalize(stmt);
> +	return 0;
> +
> +out_err:
> +	if (err) {
> +		xlog(L_ERROR, "sqlite error: %s", err);
> +		sqlite3_free(err);
> +	}
> +	sqlite3_finalize(stmt);
> +	sqlite3_close(dbh);
> +	return ret;
> +}
> +
> +static int
> +clstate_db_attach(char *dbname)
> +{
> +	int ret;
> +	char *err;
> +
> +	/* convert address string into valid filename */
> +	ret = snprintf(buf, sizeof(buf),
> +			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
> +			clstate_topdir, dbname, dbname);
> +	if (ret < 0) {
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		ret = -EINVAL;
> +		return ret;
> +	}
> +
> +	/* attach to new DB in filename */
> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
> +
> +	xlog(D_GENERAL, "Attached database %s", dbname);
> +	return ret;
> +}
> +
> +static int
> +clstate_db_detach(char *dbname)
> +{
> +	int ret;
> +	char *err;
> +
> +	/* convert address string into valid filename */
> +	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
> +	if (ret < 0) {
> +		return ret;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		ret = -EINVAL;
> +		return ret;
> +	}
> +
> +	/* attach to new DB in filename */
> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK)
> +		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
> +
> +	xlog(D_GENERAL, "Detached database %s", dbname);
> +	return ret;
> +}
> +
> +/*
> + * Create a client record
> + */
> +int
> +clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
> +			const size_t namelen)
> +{
> +	int ret;
> +	char *err = NULL;
> +	char dbname[CLSTATE_MAX_ADDRESS_LEN];
> +	sqlite3_stmt *stmt = NULL;
> +
> +	addr_to_dbname((const char *)addr, dbname);
> +
> +	ret = clstate_db_attach(dbname);
> +	if (ret != SQLITE_OK)
> +		return ret;
> +
> +	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
> +				   "(id BLOB PRIMARY KEY, time INTEGER);",
> +				   dbname);
> +	if (ret < 0) {
> +		goto out_detach;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		ret = -EINVAL;
> +		goto out_detach;
> +	}
> +
> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> +	if (ret != SQLITE_OK) {
> +		xlog(L_ERROR, "Unable to create table: %s", err);
> +		goto out_detach;
> +	}
> +
> +	ret = snprintf(buf, sizeof(buf),
> +			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
> +			"strftime('%%s', 'now'));", dbname);
> +	if (ret < 0) {
> +		goto out_detach;
> +	} else if ((size_t)ret >= sizeof(buf)) {
> +		ret = -EINVAL;
> +		goto out_detach;
> +	}
> +
> +	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
> +		goto out_err;
> +	}
> +
> +	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> +				SQLITE_STATIC);
> +	if (ret != SQLITE_OK) {
> +		xlog(D_GENERAL, "Bind blob failed: %d", ret);
> +		goto out_err;
> +	}
> +
> +	ret = sqlite3_step(stmt);
> +	if (ret == SQLITE_DONE)
> +		ret = SQLITE_OK;
> +	else
> +		xlog(D_GENERAL, "Unexpected return code from insert: %s",
> +				sqlite3_errmsg(dbh));
> +
> +out_err:
> +	xlog(D_GENERAL, "%s returning %d", __func__, ret);
> +	sqlite3_finalize(stmt);
> +	sqlite3_free(err);
> +out_detach:
> +	clstate_db_detach(dbname);
> +	return ret;
> +}
> diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
> new file mode 100644
> index 0000000..4f81745
> --- /dev/null
> +++ b/utils/clstated/sqlite.h
> @@ -0,0 +1,27 @@
> +/*
> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> + * Boston, MA 02110-1301, USA.
> + */
> +
> +#ifndef _SQLITE_H_
> +#define _SQLITE_H_
> +
> +void clstate_set_topdir(char *topdir);
> +int clstate_maindb_init(void);
> +int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
> +
> +#endif /* _SQLITE_H */
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 13:57 ` [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT Jeff Layton
@ 2011-12-14 15:09   ` Steve Dickson
  2011-12-14 15:19     ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 15:09 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 08:57 AM, Jeff Layton wrote:
> Generally, we want this daemon started before nfsd starts. Deal with the
> situation where the pipe hasn't shown up yet.
This can be done with your systemd start up script. Plus I'm not sure its 
a good idea to steal cpu cycles waiting for an event that may never happen...

steved.

> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  utils/clstated/clstated.c |   15 +++++++++++----
>  1 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> index 93a2710..aab09a7 100644
> --- a/utils/clstated/clstated.c
> +++ b/utils/clstated/clstated.c
> @@ -75,10 +75,17 @@ usage(char *progname)
>  static int
>  clstate_pipe_open(struct clstate_client *clnt)
>  {
> -	clnt->cl_fd = open(pipepath, O_RDWR, 0);
> -	if (clnt->cl_fd < 0) {
> -		xlog(L_ERROR, "%s: unable to open %s: %m", __func__, pipepath);
> -		return clnt->cl_fd;
> +	/* loop until the pipe is present or open fails for another reason */
> +	for (;;) {
> +		clnt->cl_fd = open(pipepath, O_RDWR, 0);
> +		if (clnt->cl_fd >= 0) {
> +			break;
> +		} else if (errno != ENOENT) {
> +			xlog(D_GENERAL, "%s: unable to open %s: %m", __func__,
> +					pipepath);
> +			return clnt->cl_fd;
> +		}
> +		sleep(1);
>  	}
>  
>  	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 14:56   ` Chuck Lever
@ 2011-12-14 15:14     ` Jeff Layton
  2011-12-14 15:47       ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 15:14 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

On Wed, 14 Dec 2011 09:56:32 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
> 
> > Rather than roll our own "storage engine", use sqlite instead. It fits
> > the bill nicely as it does:
> > 
> > - durable on-disk storage
> > - the ability to constrain record uniqueness
> > - a facility for collating and searching the host records
> 
> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
> 

Indeed. That's one of the main reasons I chose sqlite here.

> > ...it does add a build dependency to nfs-utils, but almost all modern
> > distros provide those packages.
> > 
> > The current incarnation of this code dynamically links against a
> > provided sqlite library, but we could also consider including their
> > single-file "amalgamation" to reduce dependencies (though with all
> > the caveats that that entails).
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > utils/clstated/Makefile.am |    4 +-
> > utils/clstated/clstated.c  |   16 ++-
> > utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
> > utils/clstated/sqlite.h    |   27 ++++
> > 4 files changed, 393 insertions(+), 5 deletions(-)
> > create mode 100644 utils/clstated/sqlite.c
> > create mode 100644 utils/clstated/sqlite.h
> > 
> > diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
> > index 9937353..d1cef3c 100644
> > --- a/utils/clstated/Makefile.am
> > +++ b/utils/clstated/Makefile.am
> > @@ -6,9 +6,9 @@
> > AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
> > sbin_PROGRAMS	= clstated
> > 
> > -clstated_SOURCES = clstated.c
> > +clstated_SOURCES = clstated.c sqlite.c
> > 
> > -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
> > +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
> > 
> > MAINTAINERCLEANFILES = Makefile.in
> > 
> > diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> > index aab09a7..95ac696 100644
> > --- a/utils/clstated/clstated.c
> > +++ b/utils/clstated/clstated.c
> > @@ -36,6 +36,7 @@
> > 
> > #include "xlog.h"
> > #include "nfslib.h"
> > +#include "sqlite.h"
> > 
> > #ifndef PIPEFS_DIR
> > #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> > @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
> > static void
> > clstate_create(struct clstate_client *clnt)
> > {
> > +	int ret;
> > 	ssize_t bsize, wsize;
> > 	struct clstate_msg *cmsg = &clnt->cl_msg;
> > 
> > -	xlog(D_GENERAL, "%s: create client record", __func__);
> > +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
> > +			cmsg->cm_addr);
> > 
> > -	/* FIXME: create client record on storage here */
> > +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
> > +					cmsg->cm_len);
> > 
> > 	/* set up reply */
> > -	cmsg->cm_status = 0;
> > +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> > 	cmsg->cm_len = 0;
> > 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
> > 		sizeof(cmsg->cm_u.cm_id));
> > 
> > 	bsize = sizeof(*cmsg);
> > 
> > +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> > 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> > 	if (wsize != bsize) {
> > 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
> > @@ -231,6 +236,11 @@ main(int argc, char **argv)
> > 	}
> > 
> > 	/* set up storage db */
> > +	rc = clstate_maindb_init();
> > +	if (rc) {
> > +		xlog(L_ERROR, "Failed to open main database: %d", rc);
> > +		goto out;
> > +	}
> > 
> > 	/* set up event handler */
> > 	fd = clstate_pipe_open(&clnt);
> > diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
> > new file mode 100644
> > index 0000000..ae83634
> > --- /dev/null
> > +++ b/utils/clstated/sqlite.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> > + * Boston, MA 02110-1301, USA.
> > + */
> > +
> > +/*
> > + * Explanation:
> > + *
> > + * This file contains the code to manage the sqlite backend database for the
> > + * clstated upcall daemon.
> > + *
> > + * The main database is called main.sqlite and contains the following tables:
> > + *
> > + * parameters: simple key/value pairs for storing database info
> > + *
> > + * addresses: list of server-side addresses recorded in the db, along with
> > + * 	      the filenames of their respective db files.
> > + *
> > + * The daemon attaches to each server-address database as needed. Each
> > + * server-address database has the following tables:
> 
> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
> 

As you pointed out above, sqlite allows us to share the db directory
across a cluster of nodes. Using different files here will allow us to
reduce lock contention in the cluster. If we put each server address DB
in a separate file, then only the node that currently owns that address
should need to touch it under normal circumstances.

> > + *
> > + * clients: one column containing a BLOB with the clstate as sent by the client
> > + * 	    and a timestamp (in epoch seconds) of when the record was
> > + * 	    established
> 
> I'm not yet clear on what's in cm_addr, but it seems to me that if you store each client's data as structured fields instead of a single BLOB, it will be much easier to use a generic tool like the sqlite3 command to debug the database.  That gives the observability and transparency of using a flat file with all the advantages of a database.
> 
> Plus, you get automatic data conversion; a one-to-one mapping between data on-disk and data in any endian-ness or word length.
> 

cm_addr is the *server's* address. One key point here is that (at least
in v4.0), clients do not SETCLIENTID against a server, but rather
against the server's address. If the client needs to talk to another
address it must call SETCLIENTID again. So, it's reasonable to track
the client names on a per-server address basis to reduce DB contention
in a clustered configuration.

> > + *
> > + * FIXME: should we also record the fsid being accessed?
> 
> I'm not exactly sure what additional data might be needed to support NFSv4 migration, for example.  However, using a database means it's relatively painless to add new columns in the future without having to provision them now.
> 

Right. It's just a matter of adding the correct column and somehow
populating them for existing entries. Making this extensible is a
design goal here. It's a complex enough problem, that I'm fairly
certain I'll get something wrong on the first pass...

> > + */
> > +
> > +#ifdef HAVE_CONFIG_H
> > +#include "config.h"
> > +#endif /* HAVE_CONFIG_H */
> > +
> > +#include <errno.h>
> > +#include <event.h>
> > +#include <stdbool.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <fcntl.h>
> > +#include <unistd.h>
> > +#include <sqlite3.h>
> > +#include <linux/limits.h>
> > +#include <linux/nfsd/clstate.h>
> > +
> > +#include "xlog.h"
> > +
> > +#define CLSTATE_SQLITE_SCHEMA_VERSION 1
> > +
> > +#ifndef CLSTATE_DIR
> > +#define CLSTATE_DIR NFS_STATEDIR "/clstate"
> > +#endif
> > +
> > +/* in milliseconds */
> > +#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
> > +
> > +/* private data structures */
> > +
> > +/* global variables */
> > +
> > +/* top level DB directory */
> > +static char *clstate_topdir = CLSTATE_DIR;
> > +
> > +/* reusable pathname and sql command buffer */
> > +static char buf[PATH_MAX];
> > +
> > +/* global database handle */
> > +static sqlite3 *dbh;
> > +
> > +/* forward declarations */
> > +
> > +/* '.' is not allowed in dbnames -- convert them to '-' */
> > +static void
> > +addr_to_dbname(const char *src, char *dst)
> > +{
> > +	while(*src) {
> > +		if (*src == '.')
> > +			*dst = '-';
> > +		else
> > +			*dst = *src;
> > +		++src;
> > +		++dst;
> > +	}
> > +	*dst = '\0';
> > +}
> > +
> > +void
> > +clstate_set_topdir(char *topdir)
> > +{
> > +	clstate_topdir = topdir;
> > +}
> > +
> > +/* make a directory, ignoring EEXIST errors unless it's not a directory */
> > +static int
> > +mkdir_if_not_exist(char *dirname)
> > +{
> > +	int ret;
> > +	struct stat statbuf;
> > +
> > +	ret = mkdir(dirname, S_IRWXU);
> > +	if (ret && errno != EEXIST)
> > +		return -errno;
> > +
> > +	ret = stat(dirname, &statbuf);
> > +	if (ret)
> > +		return -errno;
> > +
> > +	if (!S_ISDIR(statbuf.st_mode))
> > +		ret = -ENOTDIR;
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Open the "main" database, and attempt to initialize it by creating the
> > + * parameters table and inserting the schema version into it. Ignore any errors
> > + * from that, and then attempt to select the version out of it again. If the
> > + * version appears wrong, then assume that the DB is corrupt or has been
> > + * upgraded, and return an error.
> > + */
> > +int
> > +clstate_maindb_init(void)
> > +{
> > +	int ret;
> > +	char *err = NULL;
> > +	sqlite3_stmt *stmt = NULL;
> > +
> > +	ret = mkdir_if_not_exist(clstate_topdir);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	buf[PATH_MAX - 1] = '\0';
> > +
> > +	ret = sqlite3_open(buf, &dbh);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to open main database: %d", ret);
> > +		return ret;
> > +	}
> > +
> > +	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* Try to create table */
> > +	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
> > +				"(key TEXT PRIMARY KEY, value TEXT);",
> > +				NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* insert version into table -- ignore error if it fails */
> > +	ret = snprintf(buf, sizeof(buf),
> > +		       "INSERT OR IGNORE INTO parameters values (\"version\", "
> > +		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
> > +	if (ret < 0) {
> > +		goto out_err;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
> > +				ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_prepare_v2(dbh,
> > +		"SELECT value FROM parameters WHERE key == \"version\";",
> > +		 -1, &stmt, NULL);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	/* check schema version */
> > +	ret = sqlite3_step(stmt);
> > +	if (ret != SQLITE_ROW) {
> > +		xlog(D_GENERAL, "Select statement execution failed: %s",
> > +				sqlite3_errmsg(dbh));
> > +		goto out_err;
> > +	}
> > +
> > +	/* process SELECT result */
> > +	ret = sqlite3_column_int(stmt, 0);
> > +	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
> > +		xlog(L_ERROR, "Unsupported database schema version! "
> > +			"Expected %d, got %d.",
> > +			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
> > +		ret = -EINVAL;
> > +		goto out_err;
> > +	}
> > +
> > +	sqlite3_free(err);
> > +	sqlite3_finalize(stmt);
> > +	return 0;
> > +
> > +out_err:
> > +	if (err) {
> > +		xlog(L_ERROR, "sqlite error: %s", err);
> > +		sqlite3_free(err);
> > +	}
> > +	sqlite3_finalize(stmt);
> > +	sqlite3_close(dbh);
> > +	return ret;
> > +}
> > +
> > +static int
> > +clstate_db_attach(char *dbname)
> > +{
> > +	int ret;
> > +	char *err;
> > +
> > +	/* convert address string into valid filename */
> > +	ret = snprintf(buf, sizeof(buf),
> > +			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
> > +			clstate_topdir, dbname, dbname);
> > +	if (ret < 0) {
> > +		return ret;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		return ret;
> > +	}
> > +
> > +	/* attach to new DB in filename */
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
> > +
> > +	xlog(D_GENERAL, "Attached database %s", dbname);
> > +	return ret;
> > +}
> > +
> > +static int
> > +clstate_db_detach(char *dbname)
> > +{
> > +	int ret;
> > +	char *err;
> > +
> > +	/* convert address string into valid filename */
> > +	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
> > +	if (ret < 0) {
> > +		return ret;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		return ret;
> > +	}
> > +
> > +	/* attach to new DB in filename */
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK)
> > +		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
> > +
> > +	xlog(D_GENERAL, "Detached database %s", dbname);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Create a client record
> > + */
> > +int
> > +clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
> > +			const size_t namelen)
> > +{
> > +	int ret;
> > +	char *err = NULL;
> > +	char dbname[CLSTATE_MAX_ADDRESS_LEN];
> > +	sqlite3_stmt *stmt = NULL;
> > +
> > +	addr_to_dbname((const char *)addr, dbname);
> > +
> > +	ret = clstate_db_attach(dbname);
> > +	if (ret != SQLITE_OK)
> > +		return ret;
> > +
> > +	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
> > +				   "(id BLOB PRIMARY KEY, time INTEGER);",
> > +				   dbname);
> > +	if (ret < 0) {
> > +		goto out_detach;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(L_ERROR, "Unable to create table: %s", err);
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = snprintf(buf, sizeof(buf),
> > +			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
> > +			"strftime('%%s', 'now'));", dbname);
> > +	if (ret < 0) {
> > +		goto out_detach;
> > +	} else if ((size_t)ret >= sizeof(buf)) {
> > +		ret = -EINVAL;
> > +		goto out_detach;
> > +	}
> > +
> > +	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> > +				SQLITE_STATIC);
> > +	if (ret != SQLITE_OK) {
> > +		xlog(D_GENERAL, "Bind blob failed: %d", ret);
> > +		goto out_err;
> > +	}
> > +
> > +	ret = sqlite3_step(stmt);
> > +	if (ret == SQLITE_DONE)
> > +		ret = SQLITE_OK;
> > +	else
> > +		xlog(D_GENERAL, "Unexpected return code from insert: %s",
> > +				sqlite3_errmsg(dbh));
> > +
> > +out_err:
> > +	xlog(D_GENERAL, "%s returning %d", __func__, ret);
> > +	sqlite3_finalize(stmt);
> > +	sqlite3_free(err);
> > +out_detach:
> > +	clstate_db_detach(dbname);
> > +	return ret;
> > +}
> > diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
> > new file mode 100644
> > index 0000000..4f81745
> > --- /dev/null
> > +++ b/utils/clstated/sqlite.h
> > @@ -0,0 +1,27 @@
> > +/*
> > + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> > + * Boston, MA 02110-1301, USA.
> > + */
> > +
> > +#ifndef _SQLITE_H_
> > +#define _SQLITE_H_
> > +
> > +void clstate_set_topdir(char *topdir);
> > +int clstate_maindb_init(void);
> > +int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
> > +
> > +#endif /* _SQLITE_H */
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 15:09   ` Steve Dickson
@ 2011-12-14 15:19     ` Jeff Layton
  2011-12-14 15:29       ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 15:19 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:09:04 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> > Generally, we want this daemon started before nfsd starts. Deal with the
> > situation where the pipe hasn't shown up yet.
> This can be done with your systemd start up script. Plus I'm not sure its 
> a good idea to steal cpu cycles waiting for an event that may never happen...
> 

Presumably you just wouldn't start the daemon if you have no intent to
use it.

It does sleep 1s between each check, so the time is fairly minimal,
but I'm definitely open to doing this differently. What may be
reasonable is adding code to the daemon to check and see if the
v4recoverydir is present. If it is, then just exit. Otherwise, wait for
the pipe to show up.

> 
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  utils/clstated/clstated.c |   15 +++++++++++----
> >  1 files changed, 11 insertions(+), 4 deletions(-)
> > 
> > diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> > index 93a2710..aab09a7 100644
> > --- a/utils/clstated/clstated.c
> > +++ b/utils/clstated/clstated.c
> > @@ -75,10 +75,17 @@ usage(char *progname)
> >  static int
> >  clstate_pipe_open(struct clstate_client *clnt)
> >  {
> > -	clnt->cl_fd = open(pipepath, O_RDWR, 0);
> > -	if (clnt->cl_fd < 0) {
> > -		xlog(L_ERROR, "%s: unable to open %s: %m", __func__, pipepath);
> > -		return clnt->cl_fd;
> > +	/* loop until the pipe is present or open fails for another reason */
> > +	for (;;) {
> > +		clnt->cl_fd = open(pipepath, O_RDWR, 0);
> > +		if (clnt->cl_fd >= 0) {
> > +			break;
> > +		} else if (errno != ENOENT) {
> > +			xlog(D_GENERAL, "%s: unable to open %s: %m", __func__,
> > +					pipepath);
> > +			return clnt->cl_fd;
> > +		}
> > +		sleep(1);
> >  	}
> >  
> >  	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
                   ` (6 preceding siblings ...)
  2011-12-14 13:57 ` [PATCH 7/7] clstated: add function to remove unreclaimed client records Jeff Layton
@ 2011-12-14 15:23 ` Steve Dickson
  2011-12-14 15:32   ` Jeff Layton
  7 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 15:23 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 08:57 AM, Jeff Layton wrote:
> This patchset is the userspace portion of the knfsd client name tracking
> overhaul. See this patch series for an explanation:
> 
>     nfsd: overhaul the client name tracking code (RFC)
> 
> The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
> and handles the requests.
> 
> The data is stored using a sqlite database. The main reason for this is
> that it takes care of most of the fussy details and atomicity concerns
> of tracking the information on stable storage.
This will make nfs-utils dependent on the sqlite database... Any idea
what kinda of extra baggage this brings? 

> 
> For now, the daemon is only suitable for single-host configurations.
> The plan is to later extend this to be suitable for clustered
> configurations as well.
> 
> The code is still a little rough, so be gentle. It also lacks things
> like a manpage. I plan to add all that before doing a "formal" patch
> submission, but I wanted to get some early review of the overall design
> before to spend a lot of time knocking off the rough edges.
Finally the name of the daemon...  clstated does not make it clear that 
this is a nfs server daemon... maybe something like nfsdcld? 

steved.

> 
> Jeff Layton (7):
>   clstated: add clname tracking daemon stub
>   clstated: reattempt the pipe open if it fails on ENOENT
>   clstated: add autoconf goop for sqlite
>   clstated: add routines for a sqlite backend database
>   clstated: add remove functionality
>   clstated: add check/update functionality
>   clstated: add function to remove unreclaimed client records
> 
>  aclocal/libsqlite3.m4      |   33 +++
>  configure.ac               |   23 ++
>  utils/Makefile.am          |    4 +
>  utils/clstated/Makefile.am |   14 +
>  utils/clstated/clstated.c  |  355 +++++++++++++++++++++++++++
>  utils/clstated/sqlite.c    |  572 ++++++++++++++++++++++++++++++++++++++++++++
>  utils/clstated/sqlite.h    |   30 +++
>  7 files changed, 1031 insertions(+), 0 deletions(-)
>  create mode 100644 aclocal/libsqlite3.m4
>  create mode 100644 utils/clstated/Makefile.am
>  create mode 100644 utils/clstated/clstated.c
>  create mode 100644 utils/clstated/sqlite.c
>  create mode 100644 utils/clstated/sqlite.h
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 15:19     ` Jeff Layton
@ 2011-12-14 15:29       ` Steve Dickson
  2011-12-14 15:37         ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 15:29 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 10:19 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 10:09:04 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
>>> Generally, we want this daemon started before nfsd starts. Deal with the
>>> situation where the pipe hasn't shown up yet.
>> This can be done with your systemd start up script. Plus I'm not sure its 
>> a good idea to steal cpu cycles waiting for an event that may never happen...
>>
> 
> Presumably you just wouldn't start the daemon if you have no intent to
> use it.
> 
> It does sleep 1s between each check, so the time is fairly minimal,
> but I'm definitely open to doing this differently. What may be
> reasonable is adding code to the daemon to check and see if the
> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
> the pipe to show up.
Why just let the systemd scrips worry about the order of when to start
things up... To be honest, that is one thing systemd does do fairly well.

steved.

> 
>>
>>>
>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>> ---
>>>  utils/clstated/clstated.c |   15 +++++++++++----
>>>  1 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
>>> index 93a2710..aab09a7 100644
>>> --- a/utils/clstated/clstated.c
>>> +++ b/utils/clstated/clstated.c
>>> @@ -75,10 +75,17 @@ usage(char *progname)
>>>  static int
>>>  clstate_pipe_open(struct clstate_client *clnt)
>>>  {
>>> -	clnt->cl_fd = open(pipepath, O_RDWR, 0);
>>> -	if (clnt->cl_fd < 0) {
>>> -		xlog(L_ERROR, "%s: unable to open %s: %m", __func__, pipepath);
>>> -		return clnt->cl_fd;
>>> +	/* loop until the pipe is present or open fails for another reason */
>>> +	for (;;) {
>>> +		clnt->cl_fd = open(pipepath, O_RDWR, 0);
>>> +		if (clnt->cl_fd >= 0) {
>>> +			break;
>>> +		} else if (errno != ENOENT) {
>>> +			xlog(D_GENERAL, "%s: unable to open %s: %m", __func__,
>>> +					pipepath);
>>> +			return clnt->cl_fd;
>>> +		}
>>> +		sleep(1);
>>>  	}
>>>  
>>>  	event_set(&clnt->cl_event, clnt->cl_fd, EV_READ, clstatecb, clnt);
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 15:23 ` [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Steve Dickson
@ 2011-12-14 15:32   ` Jeff Layton
  2011-12-14 15:44     ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 15:32 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:23:15 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> > This patchset is the userspace portion of the knfsd client name tracking
> > overhaul. See this patch series for an explanation:
> > 
> >     nfsd: overhaul the client name tracking code (RFC)
> > 
> > The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
> > and handles the requests.
> > 
> > The data is stored using a sqlite database. The main reason for this is
> > that it takes care of most of the fussy details and atomicity concerns
> > of tracking the information on stable storage.
> This will make nfs-utils dependent on the sqlite database... Any idea
> what kinda of extra baggage this brings? 
> 

Depends on what you mean by "baggage". What is your concern?

For something like fedora, it'll mean a build-time dependency on
sqlite-devel, and a runtime dependency on sqlite. We could consider
embedding their single-file amalgamation code, but I'd prefer not to do
that unless we really have to.

> > 
> > For now, the daemon is only suitable for single-host configurations.
> > The plan is to later extend this to be suitable for clustered
> > configurations as well.
> > 
> > The code is still a little rough, so be gentle. It also lacks things
> > like a manpage. I plan to add all that before doing a "formal" patch
> > submission, but I wanted to get some early review of the overall design
> > before to spend a lot of time knocking off the rough edges.
> Finally the name of the daemon...  clstated does not make it clear that 
> this is a nfs server daemon... maybe something like nfsdcld? 
> 

Sure, we could change the name if that's desirable. Maybe
nfsd.clnamed ? At this point it's really just a client name tracking
daemon.

> steved.
> 
> > 
> > Jeff Layton (7):
> >   clstated: add clname tracking daemon stub
> >   clstated: reattempt the pipe open if it fails on ENOENT
> >   clstated: add autoconf goop for sqlite
> >   clstated: add routines for a sqlite backend database
> >   clstated: add remove functionality
> >   clstated: add check/update functionality
> >   clstated: add function to remove unreclaimed client records
> > 
> >  aclocal/libsqlite3.m4      |   33 +++
> >  configure.ac               |   23 ++
> >  utils/Makefile.am          |    4 +
> >  utils/clstated/Makefile.am |   14 +
> >  utils/clstated/clstated.c  |  355 +++++++++++++++++++++++++++
> >  utils/clstated/sqlite.c    |  572 ++++++++++++++++++++++++++++++++++++++++++++
> >  utils/clstated/sqlite.h    |   30 +++
> >  7 files changed, 1031 insertions(+), 0 deletions(-)
> >  create mode 100644 aclocal/libsqlite3.m4
> >  create mode 100644 utils/clstated/Makefile.am
> >  create mode 100644 utils/clstated/clstated.c
> >  create mode 100644 utils/clstated/sqlite.c
> >  create mode 100644 utils/clstated/sqlite.h
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 15:29       ` Steve Dickson
@ 2011-12-14 15:37         ` Jeff Layton
  2011-12-14 15:56           ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 15:37 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:29:49 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 10:19 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 10:09:04 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> >>> Generally, we want this daemon started before nfsd starts. Deal with the
> >>> situation where the pipe hasn't shown up yet.
> >> This can be done with your systemd start up script. Plus I'm not sure its 
> >> a good idea to steal cpu cycles waiting for an event that may never happen...
> >>
> > 
> > Presumably you just wouldn't start the daemon if you have no intent to
> > use it.
> > 
> > It does sleep 1s between each check, so the time is fairly minimal,
> > but I'm definitely open to doing this differently. What may be
> > reasonable is adding code to the daemon to check and see if the
> > v4recoverydir is present. If it is, then just exit. Otherwise, wait for
> > the pipe to show up.
> Why just let the systemd scrips worry about the order of when to start
> things up... To be honest, that is one thing systemd does do fairly well.
> 

Because not everyone uses systemd, and we have to deal with the
"legacy" case too for the transition phase.

It's generally preferable not to start up nfsd until everything it
needs is up. If we do what you suggest, then we're basically mandating
that this daemon can't start until nfsd is up and running.

Could you give some details on how you think this ought to work?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 15:32   ` Jeff Layton
@ 2011-12-14 15:44     ` Steve Dickson
  2011-12-14 16:05       ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 15:44 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 10:32 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 10:23:15 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
>>> This patchset is the userspace portion of the knfsd client name tracking
>>> overhaul. See this patch series for an explanation:
>>>
>>>     nfsd: overhaul the client name tracking code (RFC)
>>>
>>> The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
>>> and handles the requests.
>>>
>>> The data is stored using a sqlite database. The main reason for this is
>>> that it takes care of most of the fussy details and atomicity concerns
>>> of tracking the information on stable storage.
>> This will make nfs-utils dependent on the sqlite database... Any idea
>> what kinda of extra baggage this brings? 
>>
> 
> Depends on what you mean by "baggage". What is your concern?
In Fedora doing a 'yum install sqlite' which would require
a ton of other package needing to be install... The Required
for nfs-utils is getting pretty long at this point... 
 
> 
> For something like fedora, it'll mean a build-time dependency on
> sqlite-devel, and a runtime dependency on sqlite. We could consider
> embedding their single-file amalgamation code, but I'd prefer not to do
> that unless we really have to.
> 
>>>
>>> For now, the daemon is only suitable for single-host configurations.
>>> The plan is to later extend this to be suitable for clustered
>>> configurations as well.
>>>
>>> The code is still a little rough, so be gentle. It also lacks things
>>> like a manpage. I plan to add all that before doing a "formal" patch
>>> submission, but I wanted to get some early review of the overall design
>>> before to spend a lot of time knocking off the rough edges.
>> Finally the name of the daemon...  clstated does not make it clear that 
>> this is a nfs server daemon... maybe something like nfsdcld? 
>>
> 
> Sure, we could change the name if that's desirable. Maybe
> nfsd.clnamed ? At this point it's really just a client name tracking
> daemon.
How about nfsdcld ;-) short, to the point and easy to type... 8-) 

steved.

> 
>> steved.
>>
>>>
>>> Jeff Layton (7):
>>>   clstated: add clname tracking daemon stub
>>>   clstated: reattempt the pipe open if it fails on ENOENT
>>>   clstated: add autoconf goop for sqlite
>>>   clstated: add routines for a sqlite backend database
>>>   clstated: add remove functionality
>>>   clstated: add check/update functionality
>>>   clstated: add function to remove unreclaimed client records
>>>
>>>  aclocal/libsqlite3.m4      |   33 +++
>>>  configure.ac               |   23 ++
>>>  utils/Makefile.am          |    4 +
>>>  utils/clstated/Makefile.am |   14 +
>>>  utils/clstated/clstated.c  |  355 +++++++++++++++++++++++++++
>>>  utils/clstated/sqlite.c    |  572 ++++++++++++++++++++++++++++++++++++++++++++
>>>  utils/clstated/sqlite.h    |   30 +++
>>>  7 files changed, 1031 insertions(+), 0 deletions(-)
>>>  create mode 100644 aclocal/libsqlite3.m4
>>>  create mode 100644 utils/clstated/Makefile.am
>>>  create mode 100644 utils/clstated/clstated.c
>>>  create mode 100644 utils/clstated/sqlite.c
>>>  create mode 100644 utils/clstated/sqlite.h
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 15:14     ` Jeff Layton
@ 2011-12-14 15:47       ` Chuck Lever
  2011-12-14 16:15         ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2011-12-14 15:47 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs


On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:

> On Wed, 14 Dec 2011 09:56:32 -0500
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
>> 
>>> Rather than roll our own "storage engine", use sqlite instead. It fits
>>> the bill nicely as it does:
>>> 
>>> - durable on-disk storage
>>> - the ability to constrain record uniqueness
>>> - a facility for collating and searching the host records
>> 
>> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
>> 
> 
> Indeed. That's one of the main reasons I chose sqlite here.
> 
>>> ...it does add a build dependency to nfs-utils, but almost all modern
>>> distros provide those packages.
>>> 
>>> The current incarnation of this code dynamically links against a
>>> provided sqlite library, but we could also consider including their
>>> single-file "amalgamation" to reduce dependencies (though with all
>>> the caveats that that entails).
>>> 
>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>> ---
>>> utils/clstated/Makefile.am |    4 +-
>>> utils/clstated/clstated.c  |   16 ++-
>>> utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
>>> utils/clstated/sqlite.h    |   27 ++++
>>> 4 files changed, 393 insertions(+), 5 deletions(-)
>>> create mode 100644 utils/clstated/sqlite.c
>>> create mode 100644 utils/clstated/sqlite.h
>>> 
>>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
>>> index 9937353..d1cef3c 100644
>>> --- a/utils/clstated/Makefile.am
>>> +++ b/utils/clstated/Makefile.am
>>> @@ -6,9 +6,9 @@
>>> AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
>>> sbin_PROGRAMS	= clstated
>>> 
>>> -clstated_SOURCES = clstated.c
>>> +clstated_SOURCES = clstated.c sqlite.c
>>> 
>>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
>>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
>>> 
>>> MAINTAINERCLEANFILES = Makefile.in
>>> 
>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
>>> index aab09a7..95ac696 100644
>>> --- a/utils/clstated/clstated.c
>>> +++ b/utils/clstated/clstated.c
>>> @@ -36,6 +36,7 @@
>>> 
>>> #include "xlog.h"
>>> #include "nfslib.h"
>>> +#include "sqlite.h"
>>> 
>>> #ifndef PIPEFS_DIR
>>> #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
>>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
>>> static void
>>> clstate_create(struct clstate_client *clnt)
>>> {
>>> +	int ret;
>>> 	ssize_t bsize, wsize;
>>> 	struct clstate_msg *cmsg = &clnt->cl_msg;
>>> 
>>> -	xlog(D_GENERAL, "%s: create client record", __func__);
>>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
>>> +			cmsg->cm_addr);
>>> 
>>> -	/* FIXME: create client record on storage here */
>>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
>>> +					cmsg->cm_len);
>>> 
>>> 	/* set up reply */
>>> -	cmsg->cm_status = 0;
>>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>>> 	cmsg->cm_len = 0;
>>> 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
>>> 		sizeof(cmsg->cm_u.cm_id));
>>> 
>>> 	bsize = sizeof(*cmsg);
>>> 
>>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
>>> 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
>>> 	if (wsize != bsize) {
>>> 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
>>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
>>> 	}
>>> 
>>> 	/* set up storage db */
>>> +	rc = clstate_maindb_init();
>>> +	if (rc) {
>>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
>>> +		goto out;
>>> +	}
>>> 
>>> 	/* set up event handler */
>>> 	fd = clstate_pipe_open(&clnt);
>>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
>>> new file mode 100644
>>> index 0000000..ae83634
>>> --- /dev/null
>>> +++ b/utils/clstated/sqlite.c
>>> @@ -0,0 +1,351 @@
>>> +/*
>>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of the GNU General Public License
>>> + * as published by the Free Software Foundation; either version 2
>>> + * of the License, or (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public License
>>> + * along with this program; if not, write to the Free Software
>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
>>> + * Boston, MA 02110-1301, USA.
>>> + */
>>> +
>>> +/*
>>> + * Explanation:
>>> + *
>>> + * This file contains the code to manage the sqlite backend database for the
>>> + * clstated upcall daemon.
>>> + *
>>> + * The main database is called main.sqlite and contains the following tables:
>>> + *
>>> + * parameters: simple key/value pairs for storing database info
>>> + *
>>> + * addresses: list of server-side addresses recorded in the db, along with
>>> + * 	      the filenames of their respective db files.
>>> + *
>>> + * The daemon attaches to each server-address database as needed. Each
>>> + * server-address database has the following tables:
>> 
>> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
>> 
> 
> As you pointed out above, sqlite allows us to share the db directory
> across a cluster of nodes. Using different files here will allow us to
> reduce lock contention in the cluster. If we put each server address DB
> in a separate file, then only the node that currently owns that address
> should need to touch it under normal circumstances.

This seems like premature optimization.  A single file is much simpler and I'm guessing would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.

Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?

> 
>>> + *
>>> + * clients: one column containing a BLOB with the clstate as sent by the client
>>> + * 	    and a timestamp (in epoch seconds) of when the record was
>>> + * 	    established
>> 
>> I'm not yet clear on what's in cm_addr, but it seems to me that if you store each client's data as structured fields instead of a single BLOB, it will be much easier to use a generic tool like the sqlite3 command to debug the database.  That gives the observability and transparency of using a flat file with all the advantages of a database.
>> 
>> Plus, you get automatic data conversion; a one-to-one mapping between data on-disk and data in any endian-ness or word length.
>> 
> 
> cm_addr is the *server's* address.

What if a server has more than one address?  For example, an IPv4 and an IPv6 address?  Does it get two separate database files?  If so, how do you ensure that a client's nfs_client_id4 is recorded in both places atomically?  I'm not sure tying the server's identity to an IP address is wise.

Even if this is not a complex data item, I think opaque BLOBs should be avoided.  Store the presentation address in a text field.  Data conversion and observability are still strong reasons to do this.

> One key point here is that (at least
> in v4.0), clients do not SETCLIENTID against a server, but rather
> against the server's address. If the client needs to talk to another
> address it must call SETCLIENTID again. So, it's reasonable to track
> the client names on a per-server address basis to reduce DB contention
> in a clustered configuration.
> 
>>> + *
>>> + * FIXME: should we also record the fsid being accessed?
>> 
>> I'm not exactly sure what additional data might be needed to support NFSv4 migration, for example.  However, using a database means it's relatively painless to add new columns in the future without having to provision them now.
>> 
> 
> Right. It's just a matter of adding the correct column and somehow
> populating them for existing entries. Making this extensible is a
> design goal here. It's a complex enough problem, that I'm fairly
> certain I'll get something wrong on the first pass...

You don't necessarily need to populate new columns for existing rows by the way.  Simply add the column to the table.  Existing rows will return a special result meaning "no value."  Your code can choose a default value in that case.

Anyway, that's a detail.

> 
>>> + */
>>> +
>>> +#ifdef HAVE_CONFIG_H
>>> +#include "config.h"
>>> +#endif /* HAVE_CONFIG_H */
>>> +
>>> +#include <errno.h>
>>> +#include <event.h>
>>> +#include <stdbool.h>
>>> +#include <string.h>
>>> +#include <sys/stat.h>
>>> +#include <sys/types.h>
>>> +#include <fcntl.h>
>>> +#include <unistd.h>
>>> +#include <sqlite3.h>
>>> +#include <linux/limits.h>
>>> +#include <linux/nfsd/clstate.h>
>>> +
>>> +#include "xlog.h"
>>> +
>>> +#define CLSTATE_SQLITE_SCHEMA_VERSION 1
>>> +
>>> +#ifndef CLSTATE_DIR
>>> +#define CLSTATE_DIR NFS_STATEDIR "/clstate"
>>> +#endif
>>> +
>>> +/* in milliseconds */
>>> +#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
>>> +
>>> +/* private data structures */
>>> +
>>> +/* global variables */
>>> +
>>> +/* top level DB directory */
>>> +static char *clstate_topdir = CLSTATE_DIR;
>>> +
>>> +/* reusable pathname and sql command buffer */
>>> +static char buf[PATH_MAX];
>>> +
>>> +/* global database handle */
>>> +static sqlite3 *dbh;
>>> +
>>> +/* forward declarations */
>>> +
>>> +/* '.' is not allowed in dbnames -- convert them to '-' */
>>> +static void
>>> +addr_to_dbname(const char *src, char *dst)
>>> +{
>>> +	while(*src) {
>>> +		if (*src == '.')
>>> +			*dst = '-';
>>> +		else
>>> +			*dst = *src;
>>> +		++src;
>>> +		++dst;
>>> +	}
>>> +	*dst = '\0';
>>> +}
>>> +
>>> +void
>>> +clstate_set_topdir(char *topdir)
>>> +{
>>> +	clstate_topdir = topdir;
>>> +}
>>> +
>>> +/* make a directory, ignoring EEXIST errors unless it's not a directory */
>>> +static int
>>> +mkdir_if_not_exist(char *dirname)
>>> +{
>>> +	int ret;
>>> +	struct stat statbuf;
>>> +
>>> +	ret = mkdir(dirname, S_IRWXU);
>>> +	if (ret && errno != EEXIST)
>>> +		return -errno;
>>> +
>>> +	ret = stat(dirname, &statbuf);
>>> +	if (ret)
>>> +		return -errno;
>>> +
>>> +	if (!S_ISDIR(statbuf.st_mode))
>>> +		ret = -ENOTDIR;
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +/*
>>> + * Open the "main" database, and attempt to initialize it by creating the
>>> + * parameters table and inserting the schema version into it. Ignore any errors
>>> + * from that, and then attempt to select the version out of it again. If the
>>> + * version appears wrong, then assume that the DB is corrupt or has been
>>> + * upgraded, and return an error.
>>> + */
>>> +int
>>> +clstate_maindb_init(void)
>>> +{
>>> +	int ret;
>>> +	char *err = NULL;
>>> +	sqlite3_stmt *stmt = NULL;
>>> +
>>> +	ret = mkdir_if_not_exist(clstate_topdir);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	buf[PATH_MAX - 1] = '\0';
>>> +
>>> +	ret = sqlite3_open(buf, &dbh);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Unable to open main database: %d", ret);
>>> +		return ret;
>>> +	}
>>> +
>>> +	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	/* Try to create table */
>>> +	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
>>> +				"(key TEXT PRIMARY KEY, value TEXT);",
>>> +				NULL, NULL, &err);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	/* insert version into table -- ignore error if it fails */
>>> +	ret = snprintf(buf, sizeof(buf),
>>> +		       "INSERT OR IGNORE INTO parameters values (\"version\", "
>>> +		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
>>> +	if (ret < 0) {
>>> +		goto out_err;
>>> +	} else if ((size_t)ret >= sizeof(buf)) {
>>> +		ret = -EINVAL;
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
>>> +				ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	ret = sqlite3_prepare_v2(dbh,
>>> +		"SELECT value FROM parameters WHERE key == \"version\";",
>>> +		 -1, &stmt, NULL);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	/* check schema version */
>>> +	ret = sqlite3_step(stmt);
>>> +	if (ret != SQLITE_ROW) {
>>> +		xlog(D_GENERAL, "Select statement execution failed: %s",
>>> +				sqlite3_errmsg(dbh));
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	/* process SELECT result */
>>> +	ret = sqlite3_column_int(stmt, 0);
>>> +	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
>>> +		xlog(L_ERROR, "Unsupported database schema version! "
>>> +			"Expected %d, got %d.",
>>> +			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
>>> +		ret = -EINVAL;
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	sqlite3_free(err);
>>> +	sqlite3_finalize(stmt);
>>> +	return 0;
>>> +
>>> +out_err:
>>> +	if (err) {
>>> +		xlog(L_ERROR, "sqlite error: %s", err);
>>> +		sqlite3_free(err);
>>> +	}
>>> +	sqlite3_finalize(stmt);
>>> +	sqlite3_close(dbh);
>>> +	return ret;
>>> +}
>>> +
>>> +static int
>>> +clstate_db_attach(char *dbname)
>>> +{
>>> +	int ret;
>>> +	char *err;
>>> +
>>> +	/* convert address string into valid filename */
>>> +	ret = snprintf(buf, sizeof(buf),
>>> +			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
>>> +			clstate_topdir, dbname, dbname);
>>> +	if (ret < 0) {
>>> +		return ret;
>>> +	} else if ((size_t)ret >= sizeof(buf)) {
>>> +		ret = -EINVAL;
>>> +		return ret;
>>> +	}
>>> +
>>> +	/* attach to new DB in filename */
>>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
>>> +	if (ret != SQLITE_OK)
>>> +		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
>>> +
>>> +	xlog(D_GENERAL, "Attached database %s", dbname);
>>> +	return ret;
>>> +}
>>> +
>>> +static int
>>> +clstate_db_detach(char *dbname)
>>> +{
>>> +	int ret;
>>> +	char *err;
>>> +
>>> +	/* convert address string into valid filename */
>>> +	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
>>> +	if (ret < 0) {
>>> +		return ret;
>>> +	} else if ((size_t)ret >= sizeof(buf)) {
>>> +		ret = -EINVAL;
>>> +		return ret;
>>> +	}
>>> +
>>> +	/* attach to new DB in filename */
>>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
>>> +	if (ret != SQLITE_OK)
>>> +		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
>>> +
>>> +	xlog(D_GENERAL, "Detached database %s", dbname);
>>> +	return ret;
>>> +}
>>> +
>>> +/*
>>> + * Create a client record
>>> + */
>>> +int
>>> +clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
>>> +			const size_t namelen)
>>> +{
>>> +	int ret;
>>> +	char *err = NULL;
>>> +	char dbname[CLSTATE_MAX_ADDRESS_LEN];
>>> +	sqlite3_stmt *stmt = NULL;
>>> +
>>> +	addr_to_dbname((const char *)addr, dbname);
>>> +
>>> +	ret = clstate_db_attach(dbname);
>>> +	if (ret != SQLITE_OK)
>>> +		return ret;
>>> +
>>> +	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
>>> +				   "(id BLOB PRIMARY KEY, time INTEGER);",
>>> +				   dbname);
>>> +	if (ret < 0) {
>>> +		goto out_detach;
>>> +	} else if ((size_t)ret >= sizeof(buf)) {
>>> +		ret = -EINVAL;
>>> +		goto out_detach;
>>> +	}
>>> +
>>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(L_ERROR, "Unable to create table: %s", err);
>>> +		goto out_detach;
>>> +	}
>>> +
>>> +	ret = snprintf(buf, sizeof(buf),
>>> +			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
>>> +			"strftime('%%s', 'now'));", dbname);
>>> +	if (ret < 0) {
>>> +		goto out_detach;
>>> +	} else if ((size_t)ret >= sizeof(buf)) {
>>> +		ret = -EINVAL;
>>> +		goto out_detach;
>>> +	}
>>> +
>>> +	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
>>> +				SQLITE_STATIC);
>>> +	if (ret != SQLITE_OK) {
>>> +		xlog(D_GENERAL, "Bind blob failed: %d", ret);
>>> +		goto out_err;
>>> +	}
>>> +
>>> +	ret = sqlite3_step(stmt);
>>> +	if (ret == SQLITE_DONE)
>>> +		ret = SQLITE_OK;
>>> +	else
>>> +		xlog(D_GENERAL, "Unexpected return code from insert: %s",
>>> +				sqlite3_errmsg(dbh));
>>> +
>>> +out_err:
>>> +	xlog(D_GENERAL, "%s returning %d", __func__, ret);
>>> +	sqlite3_finalize(stmt);
>>> +	sqlite3_free(err);
>>> +out_detach:
>>> +	clstate_db_detach(dbname);
>>> +	return ret;
>>> +}
>>> diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
>>> new file mode 100644
>>> index 0000000..4f81745
>>> --- /dev/null
>>> +++ b/utils/clstated/sqlite.h
>>> @@ -0,0 +1,27 @@
>>> +/*
>>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of the GNU General Public License
>>> + * as published by the Free Software Foundation; either version 2
>>> + * of the License, or (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public License
>>> + * along with this program; if not, write to the Free Software
>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
>>> + * Boston, MA 02110-1301, USA.
>>> + */
>>> +
>>> +#ifndef _SQLITE_H_
>>> +#define _SQLITE_H_
>>> +
>>> +void clstate_set_topdir(char *topdir);
>>> +int clstate_maindb_init(void);
>>> +int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
>>> +
>>> +#endif /* _SQLITE_H */
>>> -- 
>>> 1.7.1
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 15:37         ` Jeff Layton
@ 2011-12-14 15:56           ` Steve Dickson
  2011-12-14 16:00             ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 15:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 10:37 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 10:29:49 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 10:19 AM, Jeff Layton wrote:
>>> On Wed, 14 Dec 2011 10:09:04 -0500
>>> Steve Dickson <SteveD@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
>>>>> Generally, we want this daemon started before nfsd starts. Deal with the
>>>>> situation where the pipe hasn't shown up yet.
>>>> This can be done with your systemd start up script. Plus I'm not sure its 
>>>> a good idea to steal cpu cycles waiting for an event that may never happen...
>>>>
>>>
>>> Presumably you just wouldn't start the daemon if you have no intent to
>>> use it.
>>>
>>> It does sleep 1s between each check, so the time is fairly minimal,
>>> but I'm definitely open to doing this differently. What may be
>>> reasonable is adding code to the daemon to check and see if the
>>> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
>>> the pipe to show up.
>> Why just let the systemd scrips worry about the order of when to start
>> things up... To be honest, that is one thing systemd does do fairly well.
>>
> 
> Because not everyone uses systemd, and we have to deal with the
> "legacy" case too for the transition phase.
> 
> It's generally preferable not to start up nfsd until everything it
> needs is up. If we do what you suggest, then we're basically mandating
> that this daemon can't start until nfsd is up and running.
Order has ways been a part of how and when things are started which
have always been handled by initscripts. That's their job, to start
things in the correct order. 

I understand you want to make the daemon bullet proof... but starting
things up in the wrong order is an error... IMHO... 

> 
> Could you give some details on how you think this ought to work?
> 
I would think a error message stating unable to open whatever and then 
say something like please make sure the nfs server is up and running,
would work... It seems to me this is a pretty common way of handling 
this type of situation.... although I can not come up with a 
explicit example, atm. 

steved.
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 15:56           ` Steve Dickson
@ 2011-12-14 16:00             ` Jeff Layton
  2011-12-14 16:28               ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 16:00 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:56:46 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 10:37 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 10:29:49 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 10:19 AM, Jeff Layton wrote:
> >>> On Wed, 14 Dec 2011 10:09:04 -0500
> >>> Steve Dickson <SteveD@redhat.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> >>>>> Generally, we want this daemon started before nfsd starts. Deal with the
> >>>>> situation where the pipe hasn't shown up yet.
> >>>> This can be done with your systemd start up script. Plus I'm not sure its 
> >>>> a good idea to steal cpu cycles waiting for an event that may never happen...
> >>>>
> >>>
> >>> Presumably you just wouldn't start the daemon if you have no intent to
> >>> use it.
> >>>
> >>> It does sleep 1s between each check, so the time is fairly minimal,
> >>> but I'm definitely open to doing this differently. What may be
> >>> reasonable is adding code to the daemon to check and see if the
> >>> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
> >>> the pipe to show up.
> >> Why just let the systemd scrips worry about the order of when to start
> >> things up... To be honest, that is one thing systemd does do fairly well.
> >>
> > 
> > Because not everyone uses systemd, and we have to deal with the
> > "legacy" case too for the transition phase.
> > 
> > It's generally preferable not to start up nfsd until everything it
> > needs is up. If we do what you suggest, then we're basically mandating
> > that this daemon can't start until nfsd is up and running.
> Order has ways been a part of how and when things are started which
> have always been handled by initscripts. That's their job, to start
> things in the correct order. 
> 
> I understand you want to make the daemon bullet proof... but starting
> things up in the wrong order is an error... IMHO... 
> 
> > 
> > Could you give some details on how you think this ought to work?
> > 
> I would think a error message stating unable to open whatever and then 
> say something like please make sure the nfs server is up and running,
> would work... It seems to me this is a pretty common way of handling 
> this type of situation.... although I can not come up with a 
> explicit example, atm. 
> 

That's doable simply by dropping this patch. I think it'll make this
more fragile, but if that's the consensus, I'll go along with it.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 15:44     ` Steve Dickson
@ 2011-12-14 16:05       ` Jeff Layton
  2011-12-14 16:26         ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 16:05 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:44:27 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 10:32 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 10:23:15 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> >>> This patchset is the userspace portion of the knfsd client name tracking
> >>> overhaul. See this patch series for an explanation:
> >>>
> >>>     nfsd: overhaul the client name tracking code (RFC)
> >>>
> >>> The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
> >>> and handles the requests.
> >>>
> >>> The data is stored using a sqlite database. The main reason for this is
> >>> that it takes care of most of the fussy details and atomicity concerns
> >>> of tracking the information on stable storage.
> >> This will make nfs-utils dependent on the sqlite database... Any idea
> >> what kinda of extra baggage this brings? 
> >>
> > 
> > Depends on what you mean by "baggage". What is your concern?
> In Fedora doing a 'yum install sqlite' which would require
> a ton of other package needing to be install... The Required
> for nfs-utils is getting pretty long at this point... 
>  

I don't think it pulls in that much. This is pretty minimal for dependencies:

$ ldd /usr/lib/libsqlite3.so.0.8.6 
	linux-gate.so.1 =>  (0x00e05000)
	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
	libc.so.6 => /lib/libc.so.6 (0x00e27000)
	/lib/ld-linux.so.2 (0x4e572000)

I think the command-line tools have some dependency on ncurses and
readline and such, but I don't think it's that much really...

In any case, perhaps it's time to split the nfs-utils packaging for
fedora into client and server pieces? Clients really only need
a few of the tools, but they can't install that on fedora without
getting all of the server pieces too.

That's a separate discussion though...

> > 
> > For something like fedora, it'll mean a build-time dependency on
> > sqlite-devel, and a runtime dependency on sqlite. We could consider
> > embedding their single-file amalgamation code, but I'd prefer not to do
> > that unless we really have to.
> > 
> >>>
> >>> For now, the daemon is only suitable for single-host configurations.
> >>> The plan is to later extend this to be suitable for clustered
> >>> configurations as well.
> >>>
> >>> The code is still a little rough, so be gentle. It also lacks things
> >>> like a manpage. I plan to add all that before doing a "formal" patch
> >>> submission, but I wanted to get some early review of the overall design
> >>> before to spend a lot of time knocking off the rough edges.
> >> Finally the name of the daemon...  clstated does not make it clear that 
> >> this is a nfs server daemon... maybe something like nfsdcld? 
> >>
> > 
> > Sure, we could change the name if that's desirable. Maybe
> > nfsd.clnamed ? At this point it's really just a client name tracking
> > daemon.
> How about nfsdcld ;-) short, to the point and easy to type... 8-) 
> 

Ok, I'm not religious on this issue. I'll plan to rename it to that.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 15:47       ` Chuck Lever
@ 2011-12-14 16:15         ` Jeff Layton
  2011-12-15 14:55           ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 16:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

On Wed, 14 Dec 2011 10:47:56 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:
> 
> > On Wed, 14 Dec 2011 09:56:32 -0500
> > Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> >> 
> >> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
> >> 
> >>> Rather than roll our own "storage engine", use sqlite instead. It fits
> >>> the bill nicely as it does:
> >>> 
> >>> - durable on-disk storage
> >>> - the ability to constrain record uniqueness
> >>> - a facility for collating and searching the host records
> >> 
> >> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
> >> 
> > 
> > Indeed. That's one of the main reasons I chose sqlite here.
> > 
> >>> ...it does add a build dependency to nfs-utils, but almost all modern
> >>> distros provide those packages.
> >>> 
> >>> The current incarnation of this code dynamically links against a
> >>> provided sqlite library, but we could also consider including their
> >>> single-file "amalgamation" to reduce dependencies (though with all
> >>> the caveats that that entails).
> >>> 
> >>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> >>> ---
> >>> utils/clstated/Makefile.am |    4 +-
> >>> utils/clstated/clstated.c  |   16 ++-
> >>> utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
> >>> utils/clstated/sqlite.h    |   27 ++++
> >>> 4 files changed, 393 insertions(+), 5 deletions(-)
> >>> create mode 100644 utils/clstated/sqlite.c
> >>> create mode 100644 utils/clstated/sqlite.h
> >>> 
> >>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
> >>> index 9937353..d1cef3c 100644
> >>> --- a/utils/clstated/Makefile.am
> >>> +++ b/utils/clstated/Makefile.am
> >>> @@ -6,9 +6,9 @@
> >>> AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
> >>> sbin_PROGRAMS	= clstated
> >>> 
> >>> -clstated_SOURCES = clstated.c
> >>> +clstated_SOURCES = clstated.c sqlite.c
> >>> 
> >>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
> >>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
> >>> 
> >>> MAINTAINERCLEANFILES = Makefile.in
> >>> 
> >>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> >>> index aab09a7..95ac696 100644
> >>> --- a/utils/clstated/clstated.c
> >>> +++ b/utils/clstated/clstated.c
> >>> @@ -36,6 +36,7 @@
> >>> 
> >>> #include "xlog.h"
> >>> #include "nfslib.h"
> >>> +#include "sqlite.h"
> >>> 
> >>> #ifndef PIPEFS_DIR
> >>> #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> >>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
> >>> static void
> >>> clstate_create(struct clstate_client *clnt)
> >>> {
> >>> +	int ret;
> >>> 	ssize_t bsize, wsize;
> >>> 	struct clstate_msg *cmsg = &clnt->cl_msg;
> >>> 
> >>> -	xlog(D_GENERAL, "%s: create client record", __func__);
> >>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
> >>> +			cmsg->cm_addr);
> >>> 
> >>> -	/* FIXME: create client record on storage here */
> >>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
> >>> +					cmsg->cm_len);
> >>> 
> >>> 	/* set up reply */
> >>> -	cmsg->cm_status = 0;
> >>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> >>> 	cmsg->cm_len = 0;
> >>> 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
> >>> 		sizeof(cmsg->cm_u.cm_id));
> >>> 
> >>> 	bsize = sizeof(*cmsg);
> >>> 
> >>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> >>> 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> >>> 	if (wsize != bsize) {
> >>> 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
> >>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
> >>> 	}
> >>> 
> >>> 	/* set up storage db */
> >>> +	rc = clstate_maindb_init();
> >>> +	if (rc) {
> >>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
> >>> +		goto out;
> >>> +	}
> >>> 
> >>> 	/* set up event handler */
> >>> 	fd = clstate_pipe_open(&clnt);
> >>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
> >>> new file mode 100644
> >>> index 0000000..ae83634
> >>> --- /dev/null
> >>> +++ b/utils/clstated/sqlite.c
> >>> @@ -0,0 +1,351 @@
> >>> +/*
> >>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> >>> + *
> >>> + * This program is free software; you can redistribute it and/or
> >>> + * modify it under the terms of the GNU General Public License
> >>> + * as published by the Free Software Foundation; either version 2
> >>> + * of the License, or (at your option) any later version.
> >>> + *
> >>> + * This program is distributed in the hope that it will be useful,
> >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>> + * GNU General Public License for more details.
> >>> + *
> >>> + * You should have received a copy of the GNU General Public License
> >>> + * along with this program; if not, write to the Free Software
> >>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> >>> + * Boston, MA 02110-1301, USA.
> >>> + */
> >>> +
> >>> +/*
> >>> + * Explanation:
> >>> + *
> >>> + * This file contains the code to manage the sqlite backend database for the
> >>> + * clstated upcall daemon.
> >>> + *
> >>> + * The main database is called main.sqlite and contains the following tables:
> >>> + *
> >>> + * parameters: simple key/value pairs for storing database info
> >>> + *
> >>> + * addresses: list of server-side addresses recorded in the db, along with
> >>> + * 	      the filenames of their respective db files.
> >>> + *
> >>> + * The daemon attaches to each server-address database as needed. Each
> >>> + * server-address database has the following tables:
> >> 
> >> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
> >> 
> > 
> > As you pointed out above, sqlite allows us to share the db directory
> > across a cluster of nodes. Using different files here will allow us to
> > reduce lock contention in the cluster. If we put each server address DB
> > in a separate file, then only the node that currently owns that address
> > should need to touch it under normal circumstances.
> 
> This seems like premature optimization.  A single file is much simpler and I'm guessing would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.
> 
> Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?
> 

Since the grace periods will need to be coordinated, then it's likely
on a server reboot that clients will be reestablishing state with
multiple nodes in the cluster at a time. That means at least some
contention for the DB. Whether it's problematic or not is an open
question at this point, but why not avoid that if we can?

> > 
> >>> + *
> >>> + * clients: one column containing a BLOB with the clstate as sent by the client
> >>> + * 	    and a timestamp (in epoch seconds) of when the record was
> >>> + * 	    established
> >> 
> >> I'm not yet clear on what's in cm_addr, but it seems to me that if you store each client's data as structured fields instead of a single BLOB, it will be much easier to use a generic tool like the sqlite3 command to debug the database.  That gives the observability and transparency of using a flat file with all the advantages of a database.
> >> 
> >> Plus, you get automatic data conversion; a one-to-one mapping between data on-disk and data in any endian-ness or word length.
> >> 
> > 
> > cm_addr is the *server's* address.
> 
> What if a server has more than one address?  For example, an IPv4 and an IPv6 address?  Does it get two separate database files? 

Yes.

> If so, how do you ensure that a client's nfs_client_id4 is recorded in both places atomically? I'm not sure tying the server's identity to an IP address is wise.

It doesn't get recorded in both places atomically. A SETCLIENTID is
done against an address, not against the whole server. If the server
has two different addresses and the client wants to talk to both, it
needs to do a SETCLIENTID against both addresses. The server should
treat those individually.

Granted, the existing code in the kernel doesn't make that distinction
but eventually it'll need to do so.
 
> 
> Even if this is not a complex data item, I think opaque BLOBs should be avoided.  Store the presentation address in a text field.  Data conversion and observability are still strong reasons to do this.

I'm not storing the address in the db. The address just determines the
name of the DB file in the current design. The client name string
however is stored as an opaque blob, since that's what it is. The Linux
client sends an ASCII string here, but we can't rely on all clients to
do so.

> 
> > One key point here is that (at least
> > in v4.0), clients do not SETCLIENTID against a server, but rather
> > against the server's address. If the client needs to talk to another
> > address it must call SETCLIENTID again. So, it's reasonable to track
> > the client names on a per-server address basis to reduce DB contention
> > in a clustered configuration.
> > 
> >>> + *
> >>> + * FIXME: should we also record the fsid being accessed?
> >> 
> >> I'm not exactly sure what additional data might be needed to support NFSv4 migration, for example.  However, using a database means it's relatively painless to add new columns in the future without having to provision them now.
> >> 
> > 
> > Right. It's just a matter of adding the correct column and somehow
> > populating them for existing entries. Making this extensible is a
> > design goal here. It's a complex enough problem, that I'm fairly
> > certain I'll get something wrong on the first pass...
> 
> You don't necessarily need to populate new columns for existing rows by the way.  Simply add the column to the table.  Existing rows will return a special result meaning "no value."  Your code can choose a default value in that case.
> 
> Anyway, that's a detail.
> 

Ok, good to know...

> > 
> >>> + */
> >>> +
> >>> +#ifdef HAVE_CONFIG_H
> >>> +#include "config.h"
> >>> +#endif /* HAVE_CONFIG_H */
> >>> +
> >>> +#include <errno.h>
> >>> +#include <event.h>
> >>> +#include <stdbool.h>
> >>> +#include <string.h>
> >>> +#include <sys/stat.h>
> >>> +#include <sys/types.h>
> >>> +#include <fcntl.h>
> >>> +#include <unistd.h>
> >>> +#include <sqlite3.h>
> >>> +#include <linux/limits.h>
> >>> +#include <linux/nfsd/clstate.h>
> >>> +
> >>> +#include "xlog.h"
> >>> +
> >>> +#define CLSTATE_SQLITE_SCHEMA_VERSION 1
> >>> +
> >>> +#ifndef CLSTATE_DIR
> >>> +#define CLSTATE_DIR NFS_STATEDIR "/clstate"
> >>> +#endif
> >>> +
> >>> +/* in milliseconds */
> >>> +#define CLSTATE_SQLITE_BUSY_TIMEOUT 10000
> >>> +
> >>> +/* private data structures */
> >>> +
> >>> +/* global variables */
> >>> +
> >>> +/* top level DB directory */
> >>> +static char *clstate_topdir = CLSTATE_DIR;
> >>> +
> >>> +/* reusable pathname and sql command buffer */
> >>> +static char buf[PATH_MAX];
> >>> +
> >>> +/* global database handle */
> >>> +static sqlite3 *dbh;
> >>> +
> >>> +/* forward declarations */
> >>> +
> >>> +/* '.' is not allowed in dbnames -- convert them to '-' */
> >>> +static void
> >>> +addr_to_dbname(const char *src, char *dst)
> >>> +{
> >>> +	while(*src) {
> >>> +		if (*src == '.')
> >>> +			*dst = '-';
> >>> +		else
> >>> +			*dst = *src;
> >>> +		++src;
> >>> +		++dst;
> >>> +	}
> >>> +	*dst = '\0';
> >>> +}
> >>> +
> >>> +void
> >>> +clstate_set_topdir(char *topdir)
> >>> +{
> >>> +	clstate_topdir = topdir;
> >>> +}
> >>> +
> >>> +/* make a directory, ignoring EEXIST errors unless it's not a directory */
> >>> +static int
> >>> +mkdir_if_not_exist(char *dirname)
> >>> +{
> >>> +	int ret;
> >>> +	struct stat statbuf;
> >>> +
> >>> +	ret = mkdir(dirname, S_IRWXU);
> >>> +	if (ret && errno != EEXIST)
> >>> +		return -errno;
> >>> +
> >>> +	ret = stat(dirname, &statbuf);
> >>> +	if (ret)
> >>> +		return -errno;
> >>> +
> >>> +	if (!S_ISDIR(statbuf.st_mode))
> >>> +		ret = -ENOTDIR;
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Open the "main" database, and attempt to initialize it by creating the
> >>> + * parameters table and inserting the schema version into it. Ignore any errors
> >>> + * from that, and then attempt to select the version out of it again. If the
> >>> + * version appears wrong, then assume that the DB is corrupt or has been
> >>> + * upgraded, and return an error.
> >>> + */
> >>> +int
> >>> +clstate_maindb_init(void)
> >>> +{
> >>> +	int ret;
> >>> +	char *err = NULL;
> >>> +	sqlite3_stmt *stmt = NULL;
> >>> +
> >>> +	ret = mkdir_if_not_exist(clstate_topdir);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ret = snprintf(buf, PATH_MAX - 1, "%s/main.sqlite", clstate_topdir);
> >>> +	if (ret < 0)
> >>> +		return ret;
> >>> +
> >>> +	buf[PATH_MAX - 1] = '\0';
> >>> +
> >>> +	ret = sqlite3_open(buf, &dbh);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Unable to open main database: %d", ret);
> >>> +		return ret;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_busy_timeout(dbh, CLSTATE_SQLITE_BUSY_TIMEOUT);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Unable to set sqlite busy timeout: %d", ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	/* Try to create table */
> >>> +	ret = sqlite3_exec(dbh, "CREATE TABLE IF NOT EXISTS parameters "
> >>> +				"(key TEXT PRIMARY KEY, value TEXT);",
> >>> +				NULL, NULL, &err);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Unable to create parameter table: %d", ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	/* insert version into table -- ignore error if it fails */
> >>> +	ret = snprintf(buf, sizeof(buf),
> >>> +		       "INSERT OR IGNORE INTO parameters values (\"version\", "
> >>> +		       "\"%d\");", CLSTATE_SQLITE_SCHEMA_VERSION);
> >>> +	if (ret < 0) {
> >>> +		goto out_err;
> >>> +	} else if ((size_t)ret >= sizeof(buf)) {
> >>> +		ret = -EINVAL;
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_exec(dbh, (const char *)buf, NULL, NULL, &err);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Unable to insert into parameter table: %d",
> >>> +				ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_prepare_v2(dbh,
> >>> +		"SELECT value FROM parameters WHERE key == \"version\";",
> >>> +		 -1, &stmt, NULL);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Unable to prepare select statement: %d", ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	/* check schema version */
> >>> +	ret = sqlite3_step(stmt);
> >>> +	if (ret != SQLITE_ROW) {
> >>> +		xlog(D_GENERAL, "Select statement execution failed: %s",
> >>> +				sqlite3_errmsg(dbh));
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	/* process SELECT result */
> >>> +	ret = sqlite3_column_int(stmt, 0);
> >>> +	if (ret != CLSTATE_SQLITE_SCHEMA_VERSION) {
> >>> +		xlog(L_ERROR, "Unsupported database schema version! "
> >>> +			"Expected %d, got %d.",
> >>> +			CLSTATE_SQLITE_SCHEMA_VERSION, ret);
> >>> +		ret = -EINVAL;
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	sqlite3_free(err);
> >>> +	sqlite3_finalize(stmt);
> >>> +	return 0;
> >>> +
> >>> +out_err:
> >>> +	if (err) {
> >>> +		xlog(L_ERROR, "sqlite error: %s", err);
> >>> +		sqlite3_free(err);
> >>> +	}
> >>> +	sqlite3_finalize(stmt);
> >>> +	sqlite3_close(dbh);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int
> >>> +clstate_db_attach(char *dbname)
> >>> +{
> >>> +	int ret;
> >>> +	char *err;
> >>> +
> >>> +	/* convert address string into valid filename */
> >>> +	ret = snprintf(buf, sizeof(buf),
> >>> +			"ATTACH DATABASE \"%s/%s.sqlite\" as \"%s\";",
> >>> +			clstate_topdir, dbname, dbname);
> >>> +	if (ret < 0) {
> >>> +		return ret;
> >>> +	} else if ((size_t)ret >= sizeof(buf)) {
> >>> +		ret = -EINVAL;
> >>> +		return ret;
> >>> +	}
> >>> +
> >>> +	/* attach to new DB in filename */
> >>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> >>> +	if (ret != SQLITE_OK)
> >>> +		xlog(L_ERROR, "Unable to attach database %s: %s", dbname, err);
> >>> +
> >>> +	xlog(D_GENERAL, "Attached database %s", dbname);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int
> >>> +clstate_db_detach(char *dbname)
> >>> +{
> >>> +	int ret;
> >>> +	char *err;
> >>> +
> >>> +	/* convert address string into valid filename */
> >>> +	ret = snprintf(buf, PATH_MAX - 1, "DETACH DATABASE \"%s\";", dbname);
> >>> +	if (ret < 0) {
> >>> +		return ret;
> >>> +	} else if ((size_t)ret >= sizeof(buf)) {
> >>> +		ret = -EINVAL;
> >>> +		return ret;
> >>> +	}
> >>> +
> >>> +	/* attach to new DB in filename */
> >>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> >>> +	if (ret != SQLITE_OK)
> >>> +		xlog(L_ERROR, "Unable to detach database %s: %s", dbname, err);
> >>> +
> >>> +	xlog(D_GENERAL, "Detached database %s", dbname);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Create a client record
> >>> + */
> >>> +int
> >>> +clstate_insert_client(const unsigned char *addr, const unsigned char *clname,
> >>> +			const size_t namelen)
> >>> +{
> >>> +	int ret;
> >>> +	char *err = NULL;
> >>> +	char dbname[CLSTATE_MAX_ADDRESS_LEN];
> >>> +	sqlite3_stmt *stmt = NULL;
> >>> +
> >>> +	addr_to_dbname((const char *)addr, dbname);
> >>> +
> >>> +	ret = clstate_db_attach(dbname);
> >>> +	if (ret != SQLITE_OK)
> >>> +		return ret;
> >>> +
> >>> +	snprintf(buf, sizeof(buf), "CREATE TABLE IF NOT EXISTS '%s'.clients "
> >>> +				   "(id BLOB PRIMARY KEY, time INTEGER);",
> >>> +				   dbname);
> >>> +	if (ret < 0) {
> >>> +		goto out_detach;
> >>> +	} else if ((size_t)ret >= sizeof(buf)) {
> >>> +		ret = -EINVAL;
> >>> +		goto out_detach;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_exec(dbh, buf, NULL, NULL, &err);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(L_ERROR, "Unable to create table: %s", err);
> >>> +		goto out_detach;
> >>> +	}
> >>> +
> >>> +	ret = snprintf(buf, sizeof(buf),
> >>> +			"INSERT OR REPLACE INTO '%s'.clients VALUES (?, "
> >>> +			"strftime('%%s', 'now'));", dbname);
> >>> +	if (ret < 0) {
> >>> +		goto out_detach;
> >>> +	} else if ((size_t)ret >= sizeof(buf)) {
> >>> +		ret = -EINVAL;
> >>> +		goto out_detach;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_prepare_v2(dbh, buf, ret + 1, &stmt, NULL);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Statement prepare failed: %d", ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_bind_blob(stmt, 1, (const void *)clname, namelen,
> >>> +				SQLITE_STATIC);
> >>> +	if (ret != SQLITE_OK) {
> >>> +		xlog(D_GENERAL, "Bind blob failed: %d", ret);
> >>> +		goto out_err;
> >>> +	}
> >>> +
> >>> +	ret = sqlite3_step(stmt);
> >>> +	if (ret == SQLITE_DONE)
> >>> +		ret = SQLITE_OK;
> >>> +	else
> >>> +		xlog(D_GENERAL, "Unexpected return code from insert: %s",
> >>> +				sqlite3_errmsg(dbh));
> >>> +
> >>> +out_err:
> >>> +	xlog(D_GENERAL, "%s returning %d", __func__, ret);
> >>> +	sqlite3_finalize(stmt);
> >>> +	sqlite3_free(err);
> >>> +out_detach:
> >>> +	clstate_db_detach(dbname);
> >>> +	return ret;
> >>> +}
> >>> diff --git a/utils/clstated/sqlite.h b/utils/clstated/sqlite.h
> >>> new file mode 100644
> >>> index 0000000..4f81745
> >>> --- /dev/null
> >>> +++ b/utils/clstated/sqlite.h
> >>> @@ -0,0 +1,27 @@
> >>> +/*
> >>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> >>> + *
> >>> + * This program is free software; you can redistribute it and/or
> >>> + * modify it under the terms of the GNU General Public License
> >>> + * as published by the Free Software Foundation; either version 2
> >>> + * of the License, or (at your option) any later version.
> >>> + *
> >>> + * This program is distributed in the hope that it will be useful,
> >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>> + * GNU General Public License for more details.
> >>> + *
> >>> + * You should have received a copy of the GNU General Public License
> >>> + * along with this program; if not, write to the Free Software
> >>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> >>> + * Boston, MA 02110-1301, USA.
> >>> + */
> >>> +
> >>> +#ifndef _SQLITE_H_
> >>> +#define _SQLITE_H_
> >>> +
> >>> +void clstate_set_topdir(char *topdir);
> >>> +int clstate_maindb_init(void);
> >>> +int clstate_insert_client(const unsigned char *addr, const unsigned char *clname, const size_t namelen);
> >>> +
> >>> +#endif /* _SQLITE_H */
> >>> -- 
> >>> 1.7.1
> >>> 
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> > 
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 16:05       ` Jeff Layton
@ 2011-12-14 16:26         ` Steve Dickson
  2011-12-14 16:34           ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 16:26 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 11:05 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 10:44:27 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 10:32 AM, Jeff Layton wrote:
>>> On Wed, 14 Dec 2011 10:23:15 -0500
>>> Steve Dickson <SteveD@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
>>>>> This patchset is the userspace portion of the knfsd client name tracking
>>>>> overhaul. See this patch series for an explanation:
>>>>>
>>>>>     nfsd: overhaul the client name tracking code (RFC)
>>>>>
>>>>> The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
>>>>> and handles the requests.
>>>>>
>>>>> The data is stored using a sqlite database. The main reason for this is
>>>>> that it takes care of most of the fussy details and atomicity concerns
>>>>> of tracking the information on stable storage.
>>>> This will make nfs-utils dependent on the sqlite database... Any idea
>>>> what kinda of extra baggage this brings? 
>>>>
>>>
>>> Depends on what you mean by "baggage". What is your concern?
>> In Fedora doing a 'yum install sqlite' which would require
>> a ton of other package needing to be install... The Required
>> for nfs-utils is getting pretty long at this point... 
>>  
> 
> I don't think it pulls in that much. This is pretty minimal for dependencies:
> 
> $ ldd /usr/lib/libsqlite3.so.0.8.6 
> 	linux-gate.so.1 =>  (0x00e05000)
> 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
> 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
> 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
> 	/lib/ld-linux.so.2 (0x4e572000)
> 
> I think the command-line tools have some dependency on ncurses and
> readline and such, but I don't think it's that much really...
> 
> In any case, perhaps it's time to split the nfs-utils packaging for
> fedora into client and server pieces? Clients really only need
> a few of the tools, but they can't install that on fedora without
> getting all of the server pieces too.
Yeah I thinking this is quickly becoming slippery slope... In the end I'm
all for looking into using dbs instead of flat files but I'm just
worrying about the size of nfs-utils footprint... Maybe unnecessarily 

Question, how would admin look at the list of client? Will that
even be needed? 

>>>>> For now, the daemon is only suitable for single-host configurations.
>>>>> The plan is to later extend this to be suitable for clustered
>>>>> configurations as well.
>>>>>
>>>>> The code is still a little rough, so be gentle. It also lacks things
>>>>> like a manpage. I plan to add all that before doing a "formal" patch
>>>>> submission, but I wanted to get some early review of the overall design
>>>>> before to spend a lot of time knocking off the rough edges.
>>>> Finally the name of the daemon...  clstated does not make it clear that 
>>>> this is a nfs server daemon... maybe something like nfsdcld? 
>>>>
>>>
>>> Sure, we could change the name if that's desirable. Maybe
>>> nfsd.clnamed ? At this point it's really just a client name tracking
>>> daemon.
>> How about nfsdcld ;-) short, to the point and easy to type... 8-) 
>>
> 
> Ok, I'm not religious on this issue. I'll plan to rename it to that.
> 
Thanks!

steved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 16:00             ` Jeff Layton
@ 2011-12-14 16:28               ` Steve Dickson
  2011-12-14 21:10                 ` J. Bruce Fields
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 16:28 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 11:00 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 10:56:46 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 10:37 AM, Jeff Layton wrote:
>>> On Wed, 14 Dec 2011 10:29:49 -0500
>>> Steve Dickson <SteveD@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On 12/14/2011 10:19 AM, Jeff Layton wrote:
>>>>> On Wed, 14 Dec 2011 10:09:04 -0500
>>>>> Steve Dickson <SteveD@redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
>>>>>>> Generally, we want this daemon started before nfsd starts. Deal with the
>>>>>>> situation where the pipe hasn't shown up yet.
>>>>>> This can be done with your systemd start up script. Plus I'm not sure its 
>>>>>> a good idea to steal cpu cycles waiting for an event that may never happen...
>>>>>>
>>>>>
>>>>> Presumably you just wouldn't start the daemon if you have no intent to
>>>>> use it.
>>>>>
>>>>> It does sleep 1s between each check, so the time is fairly minimal,
>>>>> but I'm definitely open to doing this differently. What may be
>>>>> reasonable is adding code to the daemon to check and see if the
>>>>> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
>>>>> the pipe to show up.
>>>> Why just let the systemd scrips worry about the order of when to start
>>>> things up... To be honest, that is one thing systemd does do fairly well.
>>>>
>>>
>>> Because not everyone uses systemd, and we have to deal with the
>>> "legacy" case too for the transition phase.
>>>
>>> It's generally preferable not to start up nfsd until everything it
>>> needs is up. If we do what you suggest, then we're basically mandating
>>> that this daemon can't start until nfsd is up and running.
>> Order has ways been a part of how and when things are started which
>> have always been handled by initscripts. That's their job, to start
>> things in the correct order. 
>>
>> I understand you want to make the daemon bullet proof... but starting
>> things up in the wrong order is an error... IMHO... 
>>
>>>
>>> Could you give some details on how you think this ought to work?
>>>
>> I would think a error message stating unable to open whatever and then 
>> say something like please make sure the nfs server is up and running,
>> would work... It seems to me this is a pretty common way of handling 
>> this type of situation.... although I can not come up with a 
>> explicit example, atm. 
>>
> 
> That's doable simply by dropping this patch. I think it'll make this
> more fragile, but if that's the consensus, I'll go along with it.
> 
If that is the only fragile part then I you are in very good shape! ;-) 

Thanks again! 

steved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 16:26         ` Steve Dickson
@ 2011-12-14 16:34           ` Jeff Layton
  2011-12-14 20:31             ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 16:34 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 11:26:00 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 11:05 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 10:44:27 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 10:32 AM, Jeff Layton wrote:
> >>> On Wed, 14 Dec 2011 10:23:15 -0500
> >>> Steve Dickson <SteveD@redhat.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> >>>>> This patchset is the userspace portion of the knfsd client name tracking
> >>>>> overhaul. See this patch series for an explanation:
> >>>>>
> >>>>>     nfsd: overhaul the client name tracking code (RFC)
> >>>>>
> >>>>> The daemon listens for upcalls on the rpc_pipefs pipe using libevent,
> >>>>> and handles the requests.
> >>>>>
> >>>>> The data is stored using a sqlite database. The main reason for this is
> >>>>> that it takes care of most of the fussy details and atomicity concerns
> >>>>> of tracking the information on stable storage.
> >>>> This will make nfs-utils dependent on the sqlite database... Any idea
> >>>> what kinda of extra baggage this brings? 
> >>>>
> >>>
> >>> Depends on what you mean by "baggage". What is your concern?
> >> In Fedora doing a 'yum install sqlite' which would require
> >> a ton of other package needing to be install... The Required
> >> for nfs-utils is getting pretty long at this point... 
> >>  
> > 
> > I don't think it pulls in that much. This is pretty minimal for dependencies:
> > 
> > $ ldd /usr/lib/libsqlite3.so.0.8.6 
> > 	linux-gate.so.1 =>  (0x00e05000)
> > 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
> > 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
> > 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
> > 	/lib/ld-linux.so.2 (0x4e572000)
> > 
> > I think the command-line tools have some dependency on ncurses and
> > readline and such, but I don't think it's that much really...
> > 
> > In any case, perhaps it's time to split the nfs-utils packaging for
> > fedora into client and server pieces? Clients really only need
> > a few of the tools, but they can't install that on fedora without
> > getting all of the server pieces too.
> Yeah I thinking this is quickly becoming slippery slope... In the end I'm
> all for looking into using dbs instead of flat files but I'm just
> worrying about the size of nfs-utils footprint... Maybe unnecessarily 
> 
> Question, how would admin look at the list of client? Will that
> even be needed? 
> 

Generally shouldn't be needed unless you're debugging. Today, we just
have a bunch of directories in the v4recoverydir with md5 hash names,
so I think moving to a DB is a step forward in this sense.

There's a sqlite3 tool that you can use to attach to the DB file(s).
Then you can run sql commands against the tables in it. The DB layout
at this point is very simple, so this shouldn't be too hard.

Note that I'm not dead-set on using sqlite for this if there's
something more suitable. What I really don't want to do though is to
reinvent the wheel. Storing info safely on stable storage is a solved
problem, IMO...

> >>>>> For now, the daemon is only suitable for single-host configurations.
> >>>>> The plan is to later extend this to be suitable for clustered
> >>>>> configurations as well.
> >>>>>
> >>>>> The code is still a little rough, so be gentle. It also lacks things
> >>>>> like a manpage. I plan to add all that before doing a "formal" patch
> >>>>> submission, but I wanted to get some early review of the overall design
> >>>>> before to spend a lot of time knocking off the rough edges.
> >>>> Finally the name of the daemon...  clstated does not make it clear that 
> >>>> this is a nfs server daemon... maybe something like nfsdcld? 
> >>>>
> >>>
> >>> Sure, we could change the name if that's desirable. Maybe
> >>> nfsd.clnamed ? At this point it's really just a client name tracking
> >>> daemon.
> >> How about nfsdcld ;-) short, to the point and easy to type... 8-) 
> >>
> > 
> > Ok, I'm not religious on this issue. I'll plan to rename it to that.
> > 
> Thanks!
> 
> steved.


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 16:34           ` Jeff Layton
@ 2011-12-14 20:31             ` Steve Dickson
  2011-12-14 21:06               ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 20:31 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 11:34 AM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 11:26:00 -0500
>>>>>>> The data is stored using a sqlite database. The main reason for this is
>>>>>>> that it takes care of most of the fussy details and atomicity concerns
>>>>>>> of tracking the information on stable storage.
>>>>>> This will make nfs-utils dependent on the sqlite database... Any idea
>>>>>> what kinda of extra baggage this brings? 
>>>>>>
>>>>>
>>>>> Depends on what you mean by "baggage". What is your concern?
>>>> In Fedora doing a 'yum install sqlite' which would require
>>>> a ton of other package needing to be install... The Required
>>>> for nfs-utils is getting pretty long at this point... 
>>>>  
>>>
>>> I don't think it pulls in that much. This is pretty minimal for dependencies:
>>>
>>> $ ldd /usr/lib/libsqlite3.so.0.8.6 
>>> 	linux-gate.so.1 =>  (0x00e05000)
>>> 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
>>> 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
>>> 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
>>> 	/lib/ld-linux.so.2 (0x4e572000)
>>>
>>> I think the command-line tools have some dependency on ncurses and
>>> readline and such, but I don't think it's that much really...
>>>
>>> In any case, perhaps it's time to split the nfs-utils packaging for
>>> fedora into client and server pieces? Clients really only need
>>> a few of the tools, but they can't install that on fedora without
>>> getting all of the server pieces too.
>> Yeah I thinking this is quickly becoming slippery slope... In the end I'm
>> all for looking into using dbs instead of flat files but I'm just
>> worrying about the size of nfs-utils footprint... Maybe unnecessarily 
>>
>> Question, how would admin look at the list of client? Will that
>> even be needed? 
>>
> 
> Generally shouldn't be needed unless you're debugging. Today, we just
> have a bunch of directories in the v4recoverydir with md5 hash names,
> so I think moving to a DB is a step forward in this sense.
Or possible an overkill?? How large do you expect these lists
to grow? 

Also, from what the Nfsd4_server_recovery wiki says all that is
really need is "stable storage". What makes sqlite more stable
than a simple write() followed by an fsync()?
> 
> There's a sqlite3 tool that you can use to attach to the DB file(s).
> Then you can run sql commands against the tables in it. The DB layout
> at this point is very simple, so this shouldn't be too hard.
> 
> Note that I'm not dead-set on using sqlite for this if there's
> something more suitable. What I really don't want to do though is to
> reinvent the wheel. Storing info safely on stable storage is a solved
> problem, IMO...
> 
I took a look there does not appear to be any dependencies at all...
So that worry was unfounded... Whats another dependency?? ;-) Lets just
make sure using db is not overkill... then I'm good to go...

steved.

steved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 20:31             ` Steve Dickson
@ 2011-12-14 21:06               ` Jeff Layton
  2011-12-14 22:27                 ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 21:06 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 15:31:25 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 11:34 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 11:26:00 -0500
> >>>>>>> The data is stored using a sqlite database. The main reason for this is
> >>>>>>> that it takes care of most of the fussy details and atomicity concerns
> >>>>>>> of tracking the information on stable storage.
> >>>>>> This will make nfs-utils dependent on the sqlite database... Any idea
> >>>>>> what kinda of extra baggage this brings? 
> >>>>>>
> >>>>>
> >>>>> Depends on what you mean by "baggage". What is your concern?
> >>>> In Fedora doing a 'yum install sqlite' which would require
> >>>> a ton of other package needing to be install... The Required
> >>>> for nfs-utils is getting pretty long at this point... 
> >>>>  
> >>>
> >>> I don't think it pulls in that much. This is pretty minimal for dependencies:
> >>>
> >>> $ ldd /usr/lib/libsqlite3.so.0.8.6 
> >>> 	linux-gate.so.1 =>  (0x00e05000)
> >>> 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
> >>> 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
> >>> 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
> >>> 	/lib/ld-linux.so.2 (0x4e572000)
> >>>
> >>> I think the command-line tools have some dependency on ncurses and
> >>> readline and such, but I don't think it's that much really...
> >>>
> >>> In any case, perhaps it's time to split the nfs-utils packaging for
> >>> fedora into client and server pieces? Clients really only need
> >>> a few of the tools, but they can't install that on fedora without
> >>> getting all of the server pieces too.
> >> Yeah I thinking this is quickly becoming slippery slope... In the end I'm
> >> all for looking into using dbs instead of flat files but I'm just
> >> worrying about the size of nfs-utils footprint... Maybe unnecessarily 
> >>
> >> Question, how would admin look at the list of client? Will that
> >> even be needed? 
> >>
> > 
> > Generally shouldn't be needed unless you're debugging. Today, we just
> > have a bunch of directories in the v4recoverydir with md5 hash names,
> > so I think moving to a DB is a step forward in this sense.
> Or possible an overkill?? How large do you expect these lists
> to grow? 
> 

At least one record per client. So probably on the order of a few
thousand per cluster node. Multiply that times the number of cluster
nodes.

> Also, from what the Nfsd4_server_recovery wiki says all that is
> really need is "stable storage". What makes sqlite more stable
> than a simple write() followed by an fsync()?

Eventually we need to share this DB in a cluster too. With that, we'll
have to coordinate these writes and fsyncs across cluster nodes to
ensure that they don't clobber each other. Now we have a simple
write/fsync plus fcntl locking. Things get less simple...

There are also atomicity concerns too. What happens if we try to write
out a record and only part of it gets written out before the node dies?

sqlite3 databases are journalled so that's not a problem there. The
transaction just didn't happen.

With flat files, you have to be able to deal with those cases on your
own. Witness the growth in code that we had to maintain when Chuck
moved from sqlite for new-statd to using the legacy statd flat file
format.

> > 
> > There's a sqlite3 tool that you can use to attach to the DB file(s).
> > Then you can run sql commands against the tables in it. The DB layout
> > at this point is very simple, so this shouldn't be too hard.
> > 
> > Note that I'm not dead-set on using sqlite for this if there's
> > something more suitable. What I really don't want to do though is to
> > reinvent the wheel. Storing info safely on stable storage is a solved
> > problem, IMO...
> > 
> I took a look there does not appear to be any dependencies at all...
> So that worry was unfounded... Whats another dependency?? ;-) Lets just
> make sure using db is not overkill... then I'm good to go...
> 

Again, I'm not set on sqlite. Maybe something else would be better?
There's bdb for instance, and samba has the tdb stuff. I haven't
looked at them in great detail but they might also work here too.

I really don't have any desire to write yet another flat file storage
engine for this however.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 16:28               ` Steve Dickson
@ 2011-12-14 21:10                 ` J. Bruce Fields
  2011-12-14 21:20                   ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: J. Bruce Fields @ 2011-12-14 21:10 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Jeff Layton, linux-nfs

On Wed, Dec 14, 2011 at 11:28:22AM -0500, Steve Dickson wrote:
> 
> 
> On 12/14/2011 11:00 AM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 10:56:46 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 10:37 AM, Jeff Layton wrote:
> >>> On Wed, 14 Dec 2011 10:29:49 -0500
> >>> Steve Dickson <SteveD@redhat.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 12/14/2011 10:19 AM, Jeff Layton wrote:
> >>>>> On Wed, 14 Dec 2011 10:09:04 -0500
> >>>>> Steve Dickson <SteveD@redhat.com> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> >>>>>>> Generally, we want this daemon started before nfsd starts. Deal with the
> >>>>>>> situation where the pipe hasn't shown up yet.
> >>>>>> This can be done with your systemd start up script. Plus I'm not sure its 
> >>>>>> a good idea to steal cpu cycles waiting for an event that may never happen...
> >>>>>>
> >>>>>
> >>>>> Presumably you just wouldn't start the daemon if you have no intent to
> >>>>> use it.
> >>>>>
> >>>>> It does sleep 1s between each check, so the time is fairly minimal,
> >>>>> but I'm definitely open to doing this differently. What may be
> >>>>> reasonable is adding code to the daemon to check and see if the
> >>>>> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
> >>>>> the pipe to show up.
> >>>> Why just let the systemd scrips worry about the order of when to start
> >>>> things up... To be honest, that is one thing systemd does do fairly well.
> >>>>
> >>>
> >>> Because not everyone uses systemd, and we have to deal with the
> >>> "legacy" case too for the transition phase.
> >>>
> >>> It's generally preferable not to start up nfsd until everything it
> >>> needs is up. If we do what you suggest, then we're basically mandating
> >>> that this daemon can't start until nfsd is up and running.
> >> Order has ways been a part of how and when things are started which
> >> have always been handled by initscripts. That's their job, to start
> >> things in the correct order. 
> >>
> >> I understand you want to make the daemon bullet proof... but starting
> >> things up in the wrong order is an error... IMHO... 
> >>
> >>>
> >>> Could you give some details on how you think this ought to work?
> >>>
> >> I would think a error message stating unable to open whatever and then 
> >> say something like please make sure the nfs server is up and running,
> >> would work... It seems to me this is a pretty common way of handling 
> >> this type of situation.... although I can not come up with a 
> >> explicit example, atm. 
> >>
> > 
> > That's doable simply by dropping this patch. I think it'll make this
> > more fragile, but if that's the consensus, I'll go along with it.
> > 
> If that is the only fragile part then I you are in very good shape! ;-) 
> 
> Thanks again! 

Please patch section 3.1 of nfs-utils/README while you're at it.

The start-up order *is* rather complicated, though, and easy to get
wrong.  And we'd rather not make nfsd wait unnecessarily for this dameon
to start up....

Why not use directory notifications like idmapd and gssd always have?

--b.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT
  2011-12-14 21:10                 ` J. Bruce Fields
@ 2011-12-14 21:20                   ` Jeff Layton
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-14 21:20 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Steve Dickson, linux-nfs

On Wed, 14 Dec 2011 16:10:39 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Wed, Dec 14, 2011 at 11:28:22AM -0500, Steve Dickson wrote:
> > 
> > 
> > On 12/14/2011 11:00 AM, Jeff Layton wrote:
> > > On Wed, 14 Dec 2011 10:56:46 -0500
> > > Steve Dickson <SteveD@redhat.com> wrote:
> > > 
> > >>
> > >>
> > >> On 12/14/2011 10:37 AM, Jeff Layton wrote:
> > >>> On Wed, 14 Dec 2011 10:29:49 -0500
> > >>> Steve Dickson <SteveD@redhat.com> wrote:
> > >>>
> > >>>>
> > >>>>
> > >>>> On 12/14/2011 10:19 AM, Jeff Layton wrote:
> > >>>>> On Wed, 14 Dec 2011 10:09:04 -0500
> > >>>>> Steve Dickson <SteveD@redhat.com> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 12/14/2011 08:57 AM, Jeff Layton wrote:
> > >>>>>>> Generally, we want this daemon started before nfsd starts. Deal with the
> > >>>>>>> situation where the pipe hasn't shown up yet.
> > >>>>>> This can be done with your systemd start up script. Plus I'm not sure its 
> > >>>>>> a good idea to steal cpu cycles waiting for an event that may never happen...
> > >>>>>>
> > >>>>>
> > >>>>> Presumably you just wouldn't start the daemon if you have no intent to
> > >>>>> use it.
> > >>>>>
> > >>>>> It does sleep 1s between each check, so the time is fairly minimal,
> > >>>>> but I'm definitely open to doing this differently. What may be
> > >>>>> reasonable is adding code to the daemon to check and see if the
> > >>>>> v4recoverydir is present. If it is, then just exit. Otherwise, wait for
> > >>>>> the pipe to show up.
> > >>>> Why just let the systemd scrips worry about the order of when to start
> > >>>> things up... To be honest, that is one thing systemd does do fairly well.
> > >>>>
> > >>>
> > >>> Because not everyone uses systemd, and we have to deal with the
> > >>> "legacy" case too for the transition phase.
> > >>>
> > >>> It's generally preferable not to start up nfsd until everything it
> > >>> needs is up. If we do what you suggest, then we're basically mandating
> > >>> that this daemon can't start until nfsd is up and running.
> > >> Order has ways been a part of how and when things are started which
> > >> have always been handled by initscripts. That's their job, to start
> > >> things in the correct order. 
> > >>
> > >> I understand you want to make the daemon bullet proof... but starting
> > >> things up in the wrong order is an error... IMHO... 
> > >>
> > >>>
> > >>> Could you give some details on how you think this ought to work?
> > >>>
> > >> I would think a error message stating unable to open whatever and then 
> > >> say something like please make sure the nfs server is up and running,
> > >> would work... It seems to me this is a pretty common way of handling 
> > >> this type of situation.... although I can not come up with a 
> > >> explicit example, atm. 
> > >>
> > > 
> > > That's doable simply by dropping this patch. I think it'll make this
> > > more fragile, but if that's the consensus, I'll go along with it.
> > > 
> > If that is the only fragile part then I you are in very good shape! ;-) 
> > 
> > Thanks again! 
> 
> Please patch section 3.1 of nfs-utils/README while you're at it.
> 
> The start-up order *is* rather complicated, though, and easy to get
> wrong.  And we'd rather not make nfsd wait unnecessarily for this dameon
> to start up....
> 
> Why not use directory notifications like idmapd and gssd always have?
> 

I suppose I could do that. I'll take a look at what they do.

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 21:06               ` Jeff Layton
@ 2011-12-14 22:27                 ` Steve Dickson
  2011-12-15  1:46                   ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Dickson @ 2011-12-14 22:27 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs



On 12/14/2011 04:06 PM, Jeff Layton wrote:
> On Wed, 14 Dec 2011 15:31:25 -0500
> Steve Dickson <SteveD@redhat.com> wrote:
> 
>>
>>
>> On 12/14/2011 11:34 AM, Jeff Layton wrote:
>>> On Wed, 14 Dec 2011 11:26:00 -0500
>>>>>>>>> The data is stored using a sqlite database. The main reason for this is
>>>>>>>>> that it takes care of most of the fussy details and atomicity concerns
>>>>>>>>> of tracking the information on stable storage.
>>>>>>>> This will make nfs-utils dependent on the sqlite database... Any idea
>>>>>>>> what kinda of extra baggage this brings? 
>>>>>>>>
>>>>>>>
>>>>>>> Depends on what you mean by "baggage". What is your concern?
>>>>>> In Fedora doing a 'yum install sqlite' which would require
>>>>>> a ton of other package needing to be install... The Required
>>>>>> for nfs-utils is getting pretty long at this point... 
>>>>>>  
>>>>>
>>>>> I don't think it pulls in that much. This is pretty minimal for dependencies:
>>>>>
>>>>> $ ldd /usr/lib/libsqlite3.so.0.8.6 
>>>>> 	linux-gate.so.1 =>  (0x00e05000)
>>>>> 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
>>>>> 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
>>>>> 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
>>>>> 	/lib/ld-linux.so.2 (0x4e572000)
>>>>>
>>>>> I think the command-line tools have some dependency on ncurses and
>>>>> readline and such, but I don't think it's that much really...
>>>>>
>>>>> In any case, perhaps it's time to split the nfs-utils packaging for
>>>>> fedora into client and server pieces? Clients really only need
>>>>> a few of the tools, but they can't install that on fedora without
>>>>> getting all of the server pieces too.
>>>> Yeah I thinking this is quickly becoming slippery slope... In the end I'm
>>>> all for looking into using dbs instead of flat files but I'm just
>>>> worrying about the size of nfs-utils footprint... Maybe unnecessarily 
>>>>
>>>> Question, how would admin look at the list of client? Will that
>>>> even be needed? 
>>>>
>>>
>>> Generally shouldn't be needed unless you're debugging. Today, we just
>>> have a bunch of directories in the v4recoverydir with md5 hash names,
>>> so I think moving to a DB is a step forward in this sense.
>> Or possible an overkill?? How large do you expect these lists
>> to grow? 
>>
> 
> At least one record per client. So probably on the order of a few
> thousand per cluster node. Multiply that times the number of cluster
> nodes.
> 
>> Also, from what the Nfsd4_server_recovery wiki says all that is
>> really need is "stable storage". What makes sqlite more stable
>> than a simple write() followed by an fsync()?
> 
> Eventually we need to share this DB in a cluster too. With that, we'll
> have to coordinate these writes and fsyncs across cluster nodes to
> ensure that they don't clobber each other. Now we have a simple
> write/fsync plus fcntl locking. Things get less simple...
True... but file locking is a well established tried and true way of 
keeping files in sync... Do we have that type of confidence with 
the file locking in sqlite? 
 
> 
> There are also atomicity concerns too. What happens if we try to write
> out a record and only part of it gets written out before the node dies?
> 
> sqlite3 databases are journalled so that's not a problem there. The
> transaction just didn't happen.
hmm... this seems like a lot of overhead... but interesting.... 
Those journaled writes... any idea what type of latency they introduces?
Probably not... It just seems there is a lot of technology there just to
keep list of clients... 
> 
> With flat files, you have to be able to deal with those cases on your
> own. Witness the growth in code that we had to maintain when Chuck
> moved from sqlite for new-statd to using the legacy statd flat file
> format.
But you dance with the one you brung! ;-) Meaning we know those cases,
sqlite is a black box.. 

> 
>>>
>>> There's a sqlite3 tool that you can use to attach to the DB file(s).
>>> Then you can run sql commands against the tables in it. The DB layout
>>> at this point is very simple, so this shouldn't be too hard.
>>>
>>> Note that I'm not dead-set on using sqlite for this if there's
>>> something more suitable. What I really don't want to do though is to
>>> reinvent the wheel. Storing info safely on stable storage is a solved
>>> problem, IMO...
>>>
>> I took a look there does not appear to be any dependencies at all...
>> So that worry was unfounded... Whats another dependency?? ;-) Lets just
>> make sure using db is not overkill... then I'm good to go...
>>
> 
> Again, I'm not set on sqlite. Maybe something else would be better?
> There's bdb for instance, and samba has the tdb stuff. I haven't
> looked at them in great detail but they might also work here too.
> 
> I really don't have any desire to write yet another flat file storage
> engine for this however.
> 
I agree... lets have this conversation and see where it leads... 

Just thinking out loud... Would it make sense to build an plug
and play API where it would not matter what the back-end db is
that every daemon in nfs-utils could migrate to?

steved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-14 22:27                 ` Steve Dickson
@ 2011-12-15  1:46                   ` Jeff Layton
  2011-12-15 23:34                     ` Steve Dickson
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Layton @ 2011-12-15  1:46 UTC (permalink / raw)
  To: Steve Dickson; +Cc: linux-nfs

On Wed, 14 Dec 2011 17:27:54 -0500
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> On 12/14/2011 04:06 PM, Jeff Layton wrote:
> > On Wed, 14 Dec 2011 15:31:25 -0500
> > Steve Dickson <SteveD@redhat.com> wrote:
> > 
> >>
> >>
> >> On 12/14/2011 11:34 AM, Jeff Layton wrote:
> >>> On Wed, 14 Dec 2011 11:26:00 -0500
> >>>>>>>>> The data is stored using a sqlite database. The main reason for this is
> >>>>>>>>> that it takes care of most of the fussy details and atomicity concerns
> >>>>>>>>> of tracking the information on stable storage.
> >>>>>>>> This will make nfs-utils dependent on the sqlite database... Any idea
> >>>>>>>> what kinda of extra baggage this brings? 
> >>>>>>>>
> >>>>>>>
> >>>>>>> Depends on what you mean by "baggage". What is your concern?
> >>>>>> In Fedora doing a 'yum install sqlite' which would require
> >>>>>> a ton of other package needing to be install... The Required
> >>>>>> for nfs-utils is getting pretty long at this point... 
> >>>>>>  
> >>>>>
> >>>>> I don't think it pulls in that much. This is pretty minimal for dependencies:
> >>>>>
> >>>>> $ ldd /usr/lib/libsqlite3.so.0.8.6 
> >>>>> 	linux-gate.so.1 =>  (0x00e05000)
> >>>>> 	libdl.so.2 => /lib/libdl.so.2 (0x001bd000)
> >>>>> 	libpthread.so.0 => /lib/libpthread.so.0 (0x00769000)
> >>>>> 	libc.so.6 => /lib/libc.so.6 (0x00e27000)
> >>>>> 	/lib/ld-linux.so.2 (0x4e572000)
> >>>>>
> >>>>> I think the command-line tools have some dependency on ncurses and
> >>>>> readline and such, but I don't think it's that much really...
> >>>>>
> >>>>> In any case, perhaps it's time to split the nfs-utils packaging for
> >>>>> fedora into client and server pieces? Clients really only need
> >>>>> a few of the tools, but they can't install that on fedora without
> >>>>> getting all of the server pieces too.
> >>>> Yeah I thinking this is quickly becoming slippery slope... In the end I'm
> >>>> all for looking into using dbs instead of flat files but I'm just
> >>>> worrying about the size of nfs-utils footprint... Maybe unnecessarily 
> >>>>
> >>>> Question, how would admin look at the list of client? Will that
> >>>> even be needed? 
> >>>>
> >>>
> >>> Generally shouldn't be needed unless you're debugging. Today, we just
> >>> have a bunch of directories in the v4recoverydir with md5 hash names,
> >>> so I think moving to a DB is a step forward in this sense.
> >> Or possible an overkill?? How large do you expect these lists
> >> to grow? 
> >>
> > 
> > At least one record per client. So probably on the order of a few
> > thousand per cluster node. Multiply that times the number of cluster
> > nodes.
> > 
> >> Also, from what the Nfsd4_server_recovery wiki says all that is
> >> really need is "stable storage". What makes sqlite more stable
> >> than a simple write() followed by an fsync()?
> > 
> > Eventually we need to share this DB in a cluster too. With that, we'll
> > have to coordinate these writes and fsyncs across cluster nodes to
> > ensure that they don't clobber each other. Now we have a simple
> > write/fsync plus fcntl locking. Things get less simple...
> True... but file locking is a well established tried and true way of 
> keeping files in sync... Do we have that type of confidence with 
> the file locking in sqlite? 
>  

Yes, it uses the same fcntl locking under the hood (whole files only).

> > 
> > There are also atomicity concerns too. What happens if we try to write
> > out a record and only part of it gets written out before the node dies?
> > 
> > sqlite3 databases are journalled so that's not a problem there. The
> > transaction just didn't happen.
> hmm... this seems like a lot of overhead... but interesting.... 
> Those journaled writes... any idea what type of latency they introduces?
> Probably not... It just seems there is a lot of technology there just to
> keep list of clients... 
> > 
> > With flat files, you have to be able to deal with those cases on your
> > own. Witness the growth in code that we had to maintain when Chuck
> > moved from sqlite for new-statd to using the legacy statd flat file
> > format.
> But you dance with the one you brung! ;-) Meaning we know those cases,
> sqlite is a black box.. 
> 

I'm not sure how to respond to this. The use of sqlite here is very
simple. I really don't anticipate any problems with what I'm doing with
it. It's used every day for far heavier fashions.

I'm fairly sure that if we find bugs in sqlite, then the maintainers
will be happy to fix them.

> > 
> >>>
> >>> There's a sqlite3 tool that you can use to attach to the DB file(s).
> >>> Then you can run sql commands against the tables in it. The DB layout
> >>> at this point is very simple, so this shouldn't be too hard.
> >>>
> >>> Note that I'm not dead-set on using sqlite for this if there's
> >>> something more suitable. What I really don't want to do though is to
> >>> reinvent the wheel. Storing info safely on stable storage is a solved
> >>> problem, IMO...
> >>>
> >> I took a look there does not appear to be any dependencies at all...
> >> So that worry was unfounded... Whats another dependency?? ;-) Lets just
> >> make sure using db is not overkill... then I'm good to go...
> >>
> > 
> > Again, I'm not set on sqlite. Maybe something else would be better?
> > There's bdb for instance, and samba has the tdb stuff. I haven't
> > looked at them in great detail but they might also work here too.
> > 
> > I really don't have any desire to write yet another flat file storage
> > engine for this however.
> > 
> I agree... lets have this conversation and see where it leads... 
> 
> Just thinking out loud... Would it make sense to build an plug
> and play API where it would not matter what the back-end db is
> that every daemon in nfs-utils could migrate to?
> 

That's possible. I wrote the daemon such that it wouldn't be too hard
to do that, at least at build time. The question is: who thinks its
worth the trouble to write other storage backends? If we write this
well enough to begin with, there's really no need for another one.

Not to put too fine a point on it, but so far most of what I'm hearing
is fear, uncertainty and doubt about using sqlite for this. Does anyone
have a real technical reason as to why it's not suitable for this
purpose?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-14 16:15         ` Jeff Layton
@ 2011-12-15 14:55           ` Chuck Lever
  2011-12-15 15:04             ` Jeff Layton
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2011-12-15 14:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs


On Dec 14, 2011, at 11:15 AM, Jeff Layton wrote:

> On Wed, 14 Dec 2011 10:47:56 -0500
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:
>> 
>>> On Wed, 14 Dec 2011 09:56:32 -0500
>>> Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>>> 
>>>> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
>>>> 
>>>>> Rather than roll our own "storage engine", use sqlite instead. It fits
>>>>> the bill nicely as it does:
>>>>> 
>>>>> - durable on-disk storage
>>>>> - the ability to constrain record uniqueness
>>>>> - a facility for collating and searching the host records
>>>> 
>>>> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
>>>> 
>>> 
>>> Indeed. That's one of the main reasons I chose sqlite here.
>>> 
>>>>> ...it does add a build dependency to nfs-utils, but almost all modern
>>>>> distros provide those packages.
>>>>> 
>>>>> The current incarnation of this code dynamically links against a
>>>>> provided sqlite library, but we could also consider including their
>>>>> single-file "amalgamation" to reduce dependencies (though with all
>>>>> the caveats that that entails).
>>>>> 
>>>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>>>> ---
>>>>> utils/clstated/Makefile.am |    4 +-
>>>>> utils/clstated/clstated.c  |   16 ++-
>>>>> utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
>>>>> utils/clstated/sqlite.h    |   27 ++++
>>>>> 4 files changed, 393 insertions(+), 5 deletions(-)
>>>>> create mode 100644 utils/clstated/sqlite.c
>>>>> create mode 100644 utils/clstated/sqlite.h
>>>>> 
>>>>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
>>>>> index 9937353..d1cef3c 100644
>>>>> --- a/utils/clstated/Makefile.am
>>>>> +++ b/utils/clstated/Makefile.am
>>>>> @@ -6,9 +6,9 @@
>>>>> AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
>>>>> sbin_PROGRAMS	= clstated
>>>>> 
>>>>> -clstated_SOURCES = clstated.c
>>>>> +clstated_SOURCES = clstated.c sqlite.c
>>>>> 
>>>>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
>>>>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
>>>>> 
>>>>> MAINTAINERCLEANFILES = Makefile.in
>>>>> 
>>>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
>>>>> index aab09a7..95ac696 100644
>>>>> --- a/utils/clstated/clstated.c
>>>>> +++ b/utils/clstated/clstated.c
>>>>> @@ -36,6 +36,7 @@
>>>>> 
>>>>> #include "xlog.h"
>>>>> #include "nfslib.h"
>>>>> +#include "sqlite.h"
>>>>> 
>>>>> #ifndef PIPEFS_DIR
>>>>> #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
>>>>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
>>>>> static void
>>>>> clstate_create(struct clstate_client *clnt)
>>>>> {
>>>>> +	int ret;
>>>>> 	ssize_t bsize, wsize;
>>>>> 	struct clstate_msg *cmsg = &clnt->cl_msg;
>>>>> 
>>>>> -	xlog(D_GENERAL, "%s: create client record", __func__);
>>>>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
>>>>> +			cmsg->cm_addr);
>>>>> 
>>>>> -	/* FIXME: create client record on storage here */
>>>>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
>>>>> +					cmsg->cm_len);
>>>>> 
>>>>> 	/* set up reply */
>>>>> -	cmsg->cm_status = 0;
>>>>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
>>>>> 	cmsg->cm_len = 0;
>>>>> 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
>>>>> 		sizeof(cmsg->cm_u.cm_id));
>>>>> 
>>>>> 	bsize = sizeof(*cmsg);
>>>>> 
>>>>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
>>>>> 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
>>>>> 	if (wsize != bsize) {
>>>>> 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
>>>>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
>>>>> 	}
>>>>> 
>>>>> 	/* set up storage db */
>>>>> +	rc = clstate_maindb_init();
>>>>> +	if (rc) {
>>>>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
>>>>> +		goto out;
>>>>> +	}
>>>>> 
>>>>> 	/* set up event handler */
>>>>> 	fd = clstate_pipe_open(&clnt);
>>>>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
>>>>> new file mode 100644
>>>>> index 0000000..ae83634
>>>>> --- /dev/null
>>>>> +++ b/utils/clstated/sqlite.c
>>>>> @@ -0,0 +1,351 @@
>>>>> +/*
>>>>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or
>>>>> + * modify it under the terms of the GNU General Public License
>>>>> + * as published by the Free Software Foundation; either version 2
>>>>> + * of the License, or (at your option) any later version.
>>>>> + *
>>>>> + * This program is distributed in the hope that it will be useful,
>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>> + * GNU General Public License for more details.
>>>>> + *
>>>>> + * You should have received a copy of the GNU General Public License
>>>>> + * along with this program; if not, write to the Free Software
>>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
>>>>> + * Boston, MA 02110-1301, USA.
>>>>> + */
>>>>> +
>>>>> +/*
>>>>> + * Explanation:
>>>>> + *
>>>>> + * This file contains the code to manage the sqlite backend database for the
>>>>> + * clstated upcall daemon.
>>>>> + *
>>>>> + * The main database is called main.sqlite and contains the following tables:
>>>>> + *
>>>>> + * parameters: simple key/value pairs for storing database info
>>>>> + *
>>>>> + * addresses: list of server-side addresses recorded in the db, along with
>>>>> + * 	      the filenames of their respective db files.
>>>>> + *
>>>>> + * The daemon attaches to each server-address database as needed. Each
>>>>> + * server-address database has the following tables:
>>>> 
>>>> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
>>>> 
>>> 
>>> As you pointed out above, sqlite allows us to share the db directory
>>> across a cluster of nodes. Using different files here will allow us to
>>> reduce lock contention in the cluster. If we put each server address DB
>>> in a separate file, then only the node that currently owns that address
>>> should need to touch it under normal circumstances.
>> 
>> This seems like premature optimization.  A single file is much simpler and I'm guessing would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.
>> 
>> Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?
>> 
> 
> Since the grace periods will need to be coordinated, then it's likely
> on a server reboot that clients will be reestablishing state with
> multiple nodes in the cluster at a time. That means at least some
> contention for the DB. Whether it's problematic or not is an open
> question at this point, but why not avoid that if we can?

Yes, there may be contention during reclaim.  However, I submit that we don't understand the details of that well enough to solidify a design for handling that workload scalably at this point.  I recommend some simple prototyping and careful modeling to understand exactly what's going to happen in such a scenario.  Then we can approach a client ID database design with good information about what needs to scale and what can be designed for convenience.  :-)

Just a thought.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] clstated: add routines for a sqlite backend database
  2011-12-15 14:55           ` Chuck Lever
@ 2011-12-15 15:04             ` Jeff Layton
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff Layton @ 2011-12-15 15:04 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

On Thu, 15 Dec 2011 09:55:59 -0500
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Dec 14, 2011, at 11:15 AM, Jeff Layton wrote:
> 
> > On Wed, 14 Dec 2011 10:47:56 -0500
> > Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> >> 
> >> On Dec 14, 2011, at 10:14 AM, Jeff Layton wrote:
> >> 
> >>> On Wed, 14 Dec 2011 09:56:32 -0500
> >>> Chuck Lever <chuck.lever@oracle.com> wrote:
> >>> 
> >>>> 
> >>>> On Dec 14, 2011, at 8:57 AM, Jeff Layton wrote:
> >>>> 
> >>>>> Rather than roll our own "storage engine", use sqlite instead. It fits
> >>>>> the bill nicely as it does:
> >>>>> 
> >>>>> - durable on-disk storage
> >>>>> - the ability to constrain record uniqueness
> >>>>> - a facility for collating and searching the host records
> >>>> 
> >>>> It should also allow multiple processes (possibly on multiple cluster nodes) to access the on-disk data safely and atomically.  No need to invent a file locking scheme from scratch.
> >>>> 
> >>> 
> >>> Indeed. That's one of the main reasons I chose sqlite here.
> >>> 
> >>>>> ...it does add a build dependency to nfs-utils, but almost all modern
> >>>>> distros provide those packages.
> >>>>> 
> >>>>> The current incarnation of this code dynamically links against a
> >>>>> provided sqlite library, but we could also consider including their
> >>>>> single-file "amalgamation" to reduce dependencies (though with all
> >>>>> the caveats that that entails).
> >>>>> 
> >>>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> >>>>> ---
> >>>>> utils/clstated/Makefile.am |    4 +-
> >>>>> utils/clstated/clstated.c  |   16 ++-
> >>>>> utils/clstated/sqlite.c    |  351 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>> utils/clstated/sqlite.h    |   27 ++++
> >>>>> 4 files changed, 393 insertions(+), 5 deletions(-)
> >>>>> create mode 100644 utils/clstated/sqlite.c
> >>>>> create mode 100644 utils/clstated/sqlite.h
> >>>>> 
> >>>>> diff --git a/utils/clstated/Makefile.am b/utils/clstated/Makefile.am
> >>>>> index 9937353..d1cef3c 100644
> >>>>> --- a/utils/clstated/Makefile.am
> >>>>> +++ b/utils/clstated/Makefile.am
> >>>>> @@ -6,9 +6,9 @@
> >>>>> AM_CFLAGS	+= -D_LARGEFILE64_SOURCE
> >>>>> sbin_PROGRAMS	= clstated
> >>>>> 
> >>>>> -clstated_SOURCES = clstated.c
> >>>>> +clstated_SOURCES = clstated.c sqlite.c
> >>>>> 
> >>>>> -clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT)
> >>>>> +clstated_LDADD = ../../support/nfs/libnfs.a $(LIBEVENT) $(LIBSQLITE)
> >>>>> 
> >>>>> MAINTAINERCLEANFILES = Makefile.in
> >>>>> 
> >>>>> diff --git a/utils/clstated/clstated.c b/utils/clstated/clstated.c
> >>>>> index aab09a7..95ac696 100644
> >>>>> --- a/utils/clstated/clstated.c
> >>>>> +++ b/utils/clstated/clstated.c
> >>>>> @@ -36,6 +36,7 @@
> >>>>> 
> >>>>> #include "xlog.h"
> >>>>> #include "nfslib.h"
> >>>>> +#include "sqlite.h"
> >>>>> 
> >>>>> #ifndef PIPEFS_DIR
> >>>>> #define PIPEFS_DIR NFS_STATEDIR "/rpc_pipefs"
> >>>>> @@ -118,21 +119,25 @@ clstate_pipe_reopen(struct clstate_client *clnt)
> >>>>> static void
> >>>>> clstate_create(struct clstate_client *clnt)
> >>>>> {
> >>>>> +	int ret;
> >>>>> 	ssize_t bsize, wsize;
> >>>>> 	struct clstate_msg *cmsg = &clnt->cl_msg;
> >>>>> 
> >>>>> -	xlog(D_GENERAL, "%s: create client record", __func__);
> >>>>> +	xlog(D_GENERAL, "%s: create client record. cm_addr=%s", __func__,
> >>>>> +			cmsg->cm_addr);
> >>>>> 
> >>>>> -	/* FIXME: create client record on storage here */
> >>>>> +	ret = clstate_insert_client(cmsg->cm_addr, cmsg->cm_u.cm_id,
> >>>>> +					cmsg->cm_len);
> >>>>> 
> >>>>> 	/* set up reply */
> >>>>> -	cmsg->cm_status = 0;
> >>>>> +	cmsg->cm_status = ret ? -EREMOTEIO : ret;
> >>>>> 	cmsg->cm_len = 0;
> >>>>> 	memset(cmsg->cm_addr, 0, sizeof(cmsg->cm_addr) +
> >>>>> 		sizeof(cmsg->cm_u.cm_id));
> >>>>> 
> >>>>> 	bsize = sizeof(*cmsg);
> >>>>> 
> >>>>> +	xlog(D_GENERAL, "Doing downcall with status %d", cmsg->cm_status);
> >>>>> 	wsize = atomicio((void *)write, clnt->cl_fd, cmsg, bsize);
> >>>>> 	if (wsize != bsize) {
> >>>>> 		xlog(L_ERROR, "%s: problem writing to clstate pipe (%ld): %m",
> >>>>> @@ -231,6 +236,11 @@ main(int argc, char **argv)
> >>>>> 	}
> >>>>> 
> >>>>> 	/* set up storage db */
> >>>>> +	rc = clstate_maindb_init();
> >>>>> +	if (rc) {
> >>>>> +		xlog(L_ERROR, "Failed to open main database: %d", rc);
> >>>>> +		goto out;
> >>>>> +	}
> >>>>> 
> >>>>> 	/* set up event handler */
> >>>>> 	fd = clstate_pipe_open(&clnt);
> >>>>> diff --git a/utils/clstated/sqlite.c b/utils/clstated/sqlite.c
> >>>>> new file mode 100644
> >>>>> index 0000000..ae83634
> >>>>> --- /dev/null
> >>>>> +++ b/utils/clstated/sqlite.c
> >>>>> @@ -0,0 +1,351 @@
> >>>>> +/*
> >>>>> + * Copyright (C) 2011  Red Hat, Jeff Layton <jlayton@redhat.com>
> >>>>> + *
> >>>>> + * This program is free software; you can redistribute it and/or
> >>>>> + * modify it under the terms of the GNU General Public License
> >>>>> + * as published by the Free Software Foundation; either version 2
> >>>>> + * of the License, or (at your option) any later version.
> >>>>> + *
> >>>>> + * This program is distributed in the hope that it will be useful,
> >>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>>>> + * GNU General Public License for more details.
> >>>>> + *
> >>>>> + * You should have received a copy of the GNU General Public License
> >>>>> + * along with this program; if not, write to the Free Software
> >>>>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor,
> >>>>> + * Boston, MA 02110-1301, USA.
> >>>>> + */
> >>>>> +
> >>>>> +/*
> >>>>> + * Explanation:
> >>>>> + *
> >>>>> + * This file contains the code to manage the sqlite backend database for the
> >>>>> + * clstated upcall daemon.
> >>>>> + *
> >>>>> + * The main database is called main.sqlite and contains the following tables:
> >>>>> + *
> >>>>> + * parameters: simple key/value pairs for storing database info
> >>>>> + *
> >>>>> + * addresses: list of server-side addresses recorded in the db, along with
> >>>>> + * 	      the filenames of their respective db files.
> >>>>> + *
> >>>>> + * The daemon attaches to each server-address database as needed. Each
> >>>>> + * server-address database has the following tables:
> >>>> 
> >>>> Why do you keep separate database files?  There are ways to store all of this data in a single database using multiple tables.  I'm happy to help you with table design if you like.
> >>>> 
> >>> 
> >>> As you pointed out above, sqlite allows us to share the db directory
> >>> across a cluster of nodes. Using different files here will allow us to
> >>> reduce lock contention in the cluster. If we put each server address DB
> >>> in a separate file, then only the node that currently owns that address
> >>> should need to touch it under normal circumstances.
> >> 
> >> This seems like premature optimization.  A single file is much simpler and I'm guessing would be a more standard arrangement of tables.  If cluster contention avoidance is a key design point, it's worth calling out in the documenting comment, because this multi-file arrangement is not a typical use of sqlite3.
> >> 
> >> Why do you predict that the database will be under contention?  Isn't it the case that sqlite3 can allow multiple accessors to a single database file?
> >> 
> > 
> > Since the grace periods will need to be coordinated, then it's likely
> > on a server reboot that clients will be reestablishing state with
> > multiple nodes in the cluster at a time. That means at least some
> > contention for the DB. Whether it's problematic or not is an open
> > question at this point, but why not avoid that if we can?
> 
> Yes, there may be contention during reclaim.  However, I submit that we don't understand the details of that well enough to solidify a design for handling that workload scalably at this point.  I recommend some simple prototyping and careful modeling to understand exactly what's going to happen in such a scenario.  Then we can approach a client ID database design with good information about what needs to scale and what can be designed for convenience.  :-)
> 
> Just a thought.
> 

Fair enough. That's probably reasonable and not using separate DB's
simplifies the code considerably. You also mentioned on IRC yesterday,
that segregating the records by server address would break the UCS
migration model, so I'm planning to abandon that model anyway.

I have a revised set of patches that I'm testing now that should
address these concerns. It also renames the daemon and kernel
structures to adhere more closely to what Steve would prefer. In the
absence of a NAK for solid technical reasons, I'm sticking with sqlite
for now.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC)
  2011-12-15  1:46                   ` Jeff Layton
@ 2011-12-15 23:34                     ` Steve Dickson
  0 siblings, 0 replies; 34+ messages in thread
From: Steve Dickson @ 2011-12-15 23:34 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

On 12/14/2011 08:46 PM, Jeff Layton wrote:
> Not to put too fine a point on it, but so far most of what I'm hearing
> is fear, uncertainty and doubt about using sqlite for this. Does anyone
> have a real technical reason as to why it's not suitable for this
> purpose?
> 
Nope... Go for it!

steved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-12-15 23:34 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-14 13:57 [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Jeff Layton
2011-12-14 13:57 ` [PATCH 1/7] clstated: add clname tracking daemon stub Jeff Layton
2011-12-14 13:57 ` [PATCH 2/7] clstated: reattempt the pipe open if it fails on ENOENT Jeff Layton
2011-12-14 15:09   ` Steve Dickson
2011-12-14 15:19     ` Jeff Layton
2011-12-14 15:29       ` Steve Dickson
2011-12-14 15:37         ` Jeff Layton
2011-12-14 15:56           ` Steve Dickson
2011-12-14 16:00             ` Jeff Layton
2011-12-14 16:28               ` Steve Dickson
2011-12-14 21:10                 ` J. Bruce Fields
2011-12-14 21:20                   ` Jeff Layton
2011-12-14 13:57 ` [PATCH 3/7] clstated: add autoconf goop for sqlite Jeff Layton
2011-12-14 13:57 ` [PATCH 4/7] clstated: add routines for a sqlite backend database Jeff Layton
2011-12-14 14:56   ` Chuck Lever
2011-12-14 15:14     ` Jeff Layton
2011-12-14 15:47       ` Chuck Lever
2011-12-14 16:15         ` Jeff Layton
2011-12-15 14:55           ` Chuck Lever
2011-12-15 15:04             ` Jeff Layton
2011-12-14 13:57 ` [PATCH 5/7] clstated: add remove functionality Jeff Layton
2011-12-14 13:57 ` [PATCH 6/7] clstated: add check/update functionality Jeff Layton
2011-12-14 13:57 ` [PATCH 7/7] clstated: add function to remove unreclaimed client records Jeff Layton
2011-12-14 15:23 ` [PATCH 0/7] clstated: add a daemon to track NFSv4 client names on stable storage (RFC) Steve Dickson
2011-12-14 15:32   ` Jeff Layton
2011-12-14 15:44     ` Steve Dickson
2011-12-14 16:05       ` Jeff Layton
2011-12-14 16:26         ` Steve Dickson
2011-12-14 16:34           ` Jeff Layton
2011-12-14 20:31             ` Steve Dickson
2011-12-14 21:06               ` Jeff Layton
2011-12-14 22:27                 ` Steve Dickson
2011-12-15  1:46                   ` Jeff Layton
2011-12-15 23:34                     ` Steve Dickson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.