All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] new statd [take 2]
@ 2009-08-05 14:45 Chuck Lever
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 14:45 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs

Apparently there is a 100KB per-email size limit imposed by
vger.kernel.org.  I've broken up the large statd patch, and
postponed the nfs(8) man page patches to another day.

Yesterday it was decided that:

  [PATCH 3/5] statd: change default state dir to /var/lib/nfs/statd

was overkill. That one has been dropped from this series.

The first two of the present patches introduce the .c files under
utils/new-statd/ .  The next patch adds actools infrastructure for
building the new statd and sm-notify executables.  The last patch
adds two new man pages for statd and sm-notify.

The old statd remains in utils/statd.  If sqlite3 and TI-RPC is
present on the build system, and TI-RPC is enabled during
configuration, the new statd is built and installed instead of the
old one.

See the description in the first patch for more details on the
features of the new implementation.

---

Chuck Lever (4):
      nfs-utils: introduce new statd implementation (4th part)
      nfs-utils: introduce new statd implementation (3rd part)
      nfs-utils: introduce new statd implementation (2nd part)
      nfs-utils: introduce new statd implementation (1st part)


 .gitignore                    |    4 
 Makefile.am                   |    4 
 aclocal/libcap.m4             |   15 +
 aclocal/libsqlite3.m4         |   33 ++
 aclocal/libtirpc.m4           |    2 
 configure.ac                  |   12 +
 utils/Makefile.am             |    7 
 utils/new-statd/Makefile.am   |  102 +++++
 utils/new-statd/file.c        |  779 ++++++++++++++++++++++++++++++++++++++
 utils/new-statd/hostname.c    |  520 +++++++++++++++++++++++++
 utils/new-statd/main.c        |  319 ++++++++++++++++
 utils/new-statd/nlmcall.c     |  525 ++++++++++++++++++++++++++
 utils/new-statd/sm-notify.c   |  155 ++++++++
 utils/new-statd/sm-notify.man |  286 ++++++++++++++
 utils/new-statd/sm_inter.x    |  107 +++++
 utils/new-statd/smncall.c     |  774 ++++++++++++++++++++++++++++++++++++++
 utils/new-statd/start-statd   |   12 +
 utils/new-statd/statd.h       |  135 +++++++
 utils/new-statd/statd.man     |  388 +++++++++++++++++++
 utils/new-statd/svc.c         |  841 +++++++++++++++++++++++++++++++++++++++++
 utils/statd/Makefile.am       |    9 
 21 files changed, 5024 insertions(+), 5 deletions(-)
 create mode 100644 aclocal/libcap.m4
 create mode 100644 aclocal/libsqlite3.m4
 create mode 100644 utils/new-statd/Makefile.am
 create mode 100644 utils/new-statd/file.c
 create mode 100644 utils/new-statd/hostname.c
 create mode 100644 utils/new-statd/main.c
 create mode 100644 utils/new-statd/nlmcall.c
 create mode 100644 utils/new-statd/sm-notify.c
 create mode 100644 utils/new-statd/sm-notify.man
 create mode 100644 utils/new-statd/sm_inter.x
 create mode 100644 utils/new-statd/smncall.c
 create mode 100644 utils/new-statd/start-statd
 create mode 100644 utils/new-statd/statd.h
 create mode 100644 utils/new-statd/statd.man
 create mode 100644 utils/new-statd/svc.c

-- 
Signature

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
@ 2009-08-05 14:45   ` Chuck Lever
       [not found]     ` <20090805144540.12866.22084.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
  2009-08-05 14:45   ` [PATCH 2/4] nfs-utils: introduce new statd implementation (2nd part) Chuck Lever
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 14:45 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs

Provide a new implementation of statd that supports IPv6.  The new
statd implementation resides under

  utils/new-statd/

The contents of this directory are built if --enable-tirpc is set
on the ./configure command line, and sqlite3 is available on the
build system.  Otherwise, the legacy version of statd, which still
resides under utils/statd/, is built.

The goals of this re-write are:

 o Support IPv6 networking

   Support interoperation with TI-RPC-based NSM implementations.
   Transport Independent RPC, or TI-RPC, provides IPv6 network support
   for Linux's NSM implementation.

   To support TI-RPC, open code to construct RPC requests in socket
   buffers and then schedule them has been replaced with standard
   library calls.

 o Support notification via TCP

   As a secondary benefit of using TI-RPC library calls, reboot
   notifications and NLM callbacks can now be sent via connection-
   oriented transport protocols.

   Note that lockd does not (yet) tell statd what transport protocol
   to use when sending reboot notifications.  statd/sm-notify will
   continue to use UDP for the time being.

 o Use an embedded database for storing on-disk callback data

   This whole exercise is for the purpose of crash robustness.  There
   are well-known deficiencies with simple create/rename/unlink
   disk storage schemes during system crashes.  Replace the current
   flat-file monitor list mechanism which uses sync(2) with sqlite3,
   which uses fsync(3).

 o Share code between sm-notify and statd

   Statd and sm-notify access the same set of on-disk data.  These
   separate programs now share the same code and implementation, with
   access to on-disk data serialized by sqlite3.  The two remain
   separate executables to allow other system facilities to send
   reboot notifications without poking statd.

 o Reduce impact of DNS outages

   The heuristics used by SM_NOTIFY to figure out which remote peer
   has rebooted are heavily dependent on DNS.  If the DNS service is
   slow or hangs, that will make the NSM listener unresponsive.
   Incoming SM_NOTIFY requests are now handled in a sidecar process
   to reduce the impact of DNS outages on the NSM service listener.

 o Proper my_name support

   The current version of statd uses gethostname(3) to generate the
   mon_name argument of SM_NOTIFY.  This value can change across a
   reboot.  The new version of statd records lockd's my_name, passed
   by every SM_MON request, and uses that when sending SM_NOTIFY.

   This can be useful for multi-homed and DHCP configured hosts.

 o Send SM_NOTIFY more aggressively

   It has been recommended that statd/sm-notify send SM_NOTIFY
   more aggressively (for example, to the entire list returned by
   getaddrinfo(3)).  Since SM_NOTIFY's reply is NULL, there's no
   way to tell whether the remote peer recognized the mon_name we
   sent.  More study is required, but this implementation attempts
   to send an SM_NOTIFY request to each address returned by
   getaddrinfo(3).

This re-implementation paves the way for a number of future
improvements.  However, it does not immediately address:

 o lockd/statd start-up serialization issues

   Sending reboot notifications, starting statd and lockd, and opening
   the lockd grace period are still determined independently in user
   space and the kernel.

 o Binding mon_names to caller IP addresses

   By default, lockd continues to send IP addresses as the mon_name
   argument of the SM_MON procedure.  This provides a better guarantee
   of being able to contact remote peers during a reboot, but means
   statd must continue to use heuristics to match incoming SM_NOTIFY
   requests with peers on the monitor list.

 o Distinct logic for NFS client- and server-side

   Client-side and server-side monitoring requirements are different.
   Statd continues to use the same logic for both NFS client and
   server, as the NSMv1 protocol does not provide any indication
   that a mon_name is for a client or server peer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 utils/new-statd/file.c     |  779 ++++++++++++++++++++++++++++++++++++++++++++
 utils/new-statd/hostname.c |  520 +++++++++++++++++++++++++++++
 utils/new-statd/nlmcall.c  |  525 ++++++++++++++++++++++++++++++
 3 files changed, 1824 insertions(+), 0 deletions(-)
 create mode 100644 utils/new-statd/file.c
 create mode 100644 utils/new-statd/hostname.c
 create mode 100644 utils/new-statd/nlmcall.c

diff --git a/utils/new-statd/file.c b/utils/new-statd/file.c
new file mode 100644
index 0000000..db54bf7
--- /dev/null
+++ b/utils/new-statd/file.c
@@ -0,0 +1,779 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * Callback information is stored in an sqlite database that provides
+ * transactional semantics, to preserve state in the event of a system
+ * crash, and can easily be updated to store additional data elements
+ * without need for versioning
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <sys/types.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/ipc.h>
+#include <sys/sem.h>
+
+#include <stdint.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <grp.h>
+#include <arpa/inet.h>
+
+#include "statd.h"
+
+#ifdef NFS_STATEDIR
+#define STATD_DEFAULT_STATEDIR		NFS_STATEDIR "/statd"
+#else	/* !defined(NFS_STATEDIR) */
+#define STATD_DEFAULT_STATEDIR		"/var/lib/nfs/statd"
+#endif	/* !defined(NFS_STATEDIR) */
+
+#define STATD_DATABASE_FILE		"statdb"
+
+static char statd_db_dirname[PATH_MAX];
+static char statd_db_filename[PATH_MAX];
+
+/**
+ * statd_open_db - open a database connection to statd's sqlite3 database
+ * @flags: flags passed to sqlite3_open_v2
+ *
+ * Returns an open database connection; otherwise NULL is returned if
+ * an error occurred.
+ */
+sqlite3 *
+statd_open_db(int flags)
+{
+	sqlite3 *db;
+	int rc;
+
+	rc = sqlite3_open_v2(statd_db_filename, &db, flags, NULL);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to open database in %s: %s",
+				statd_db_filename, sqlite3_errmsg(db));
+		xlog(L_ERROR, "Check that the state directory pathname "
+				"is correct and has proper permissions");
+		(void)sqlite3_close(db);
+		return NULL;
+	}
+
+	/*
+	 * Retry SQLITE_BUSY for 100 msec before returning an error.
+	 */
+	(void)sqlite3_busy_timeout(db, 100);
+
+	return db;
+}
+
+/**
+ * statd_close_db - close an open sqlite3 database connection
+ * @db: open sqlite3 database connection descriptor
+ *
+ */
+void
+statd_close_db(sqlite3 *db)
+{
+	if (sqlite3_close(db) != SQLITE_OK)
+		xlog(L_ERROR, "Failed to close database: %s",
+				sqlite3_errmsg(db));
+}
+
+/**
+ * statd_prepare_stmt - prepare a C string SQL statement
+ * @db: open database connection
+ * @stmt: OUT: pointer to prepared SQL statement object
+ * @sql: C string containing SQL statement to prepare
+ *
+ * Returns TRUE if the statement was prepared successfully; otherwise
+ * FALSE is returned and an error is logged.
+ */
+bool_t
+statd_prepare_stmt(sqlite3 *db, sqlite3_stmt **stmt, const char *sql)
+{
+	int rc;
+
+	rc = sqlite3_prepare_v2(db, sql, -1, stmt, NULL);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to compile SQL: %s",
+				sqlite3_errmsg(db));
+		xlog(L_ERROR, "SQL: %s", sql);
+		return FALSE;
+	}
+	return TRUE;
+}
+
+/**
+ * statd_finalize_stmt - release a prepared SQL statement
+ * @stmt: sqlite3_stmt to release
+ *
+ */
+void
+statd_finalize_stmt(sqlite3_stmt *stmt)
+{
+	sqlite3 *db = sqlite3_db_handle(stmt);
+	int rc;
+
+	rc = sqlite3_finalize(stmt);
+	switch(rc) {
+	case SQLITE_OK:
+	case SQLITE_ABORT:
+	case SQLITE_CONSTRAINT:
+		break;
+	default:
+		xlog(L_ERROR, "Failed to finalize SQL statement: %s",
+				sqlite3_errmsg(db));
+	}
+}
+
+/**
+ * statd_begin_transaction - start a sqlite3 transaction
+ * @db: open database connection
+ *
+ * Returns TRUE if the transaction was started; otherwise FALSE is
+ * returned if sqlite3 reported an error.
+ */
+bool_t
+statd_begin_transaction(sqlite3 *db)
+{
+	char *err_msg;
+	int rc;
+
+	err_msg = NULL;
+	rc = sqlite3_exec(db, "BEGIN IMMEDIATE TRANSACTION;", NULL, 0, &err_msg);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to start transaction: %s", err_msg);
+		sqlite3_free(err_msg);
+		return FALSE;
+	}
+
+	xlog(D_CALL, "Transaction started");
+	return TRUE;
+}
+
+/**
+ * statd_end_transaction - commit a sqlite3 transaction
+ * @db: open database connection
+ *
+ */
+void
+statd_end_transaction(sqlite3 *db)
+{
+	char *err_msg;
+	int rc;
+
+	err_msg = NULL;
+	rc = sqlite3_exec(db, "COMMIT TRANSACTION;", NULL, 0, &err_msg);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to commit transaction: %s", err_msg);
+		sqlite3_free(err_msg);
+		return;
+	}
+
+	xlog(D_CALL, "Transaction committed");
+}
+
+/**
+ * statd_rollback_transaction - rollback an active sqlite3 transaction
+ * @db: open database connection
+ *
+ */
+void
+statd_rollback_transaction(sqlite3 *db)
+{
+	char *err_msg;
+	int rc;
+
+	err_msg = NULL;
+	rc = sqlite3_exec(db, "ROLLBACK TRANSACTION;", NULL, 0, &err_msg);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to roll back transaction: %s", err_msg);
+		sqlite3_free(err_msg);
+		return;
+	}
+
+	xlog(D_CALL, "Transaction rolled back");
+}
+
+/**
+ * statd_check_pathname - set up pathname
+ * @parentdir: C string containing pathname to on-disk state
+ *
+ * Returns TRUE if pathname was valid; otherwise FALSE is returned.
+ */
+bool_t
+statd_check_pathname(const char *parentdir)
+{
+	struct stat st;
+	int len;
+
+	if (parentdir != NULL) {
+		if (strlen(parentdir) >= PATH_MAX) {
+			xlog(L_ERROR, "Invalid directory name");
+			return FALSE;
+		}
+	} else
+		parentdir = STATD_DEFAULT_STATEDIR;
+
+	if (lstat(parentdir, &st) == -1) {
+		xlog(L_ERROR, "Failed to stat %s: %m", parentdir);
+		return FALSE;
+	}
+
+	if (chdir(parentdir) == -1) {
+		xlog(L_ERROR, "Failed to change working directory to %s: %m",
+				parentdir);
+		return FALSE;
+	}
+
+	strncpy(statd_db_dirname, parentdir, sizeof(statd_db_dirname));
+
+	len = snprintf(statd_db_filename, sizeof(statd_db_filename),
+			"%s/%s", parentdir, STATD_DATABASE_FILE);
+	if (__error_check(len, sizeof(statd_db_filename))) {
+		xlog(L_ERROR, "Illegal directory name");
+		return FALSE;
+	}
+
+	return TRUE;
+}
+
+/*
+ * Returns TRUE if info table creation was successful, or if
+ * info table already exists; otherwise FALSE is returned if
+ * any error occurs.
+ */
+static bool_t
+statd_create_info_table(sqlite3 *db)
+{
+	sqlite3_stmt *stmt;
+	bool_t result;
+	int rc;
+
+	result = FALSE;
+
+	rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_INFO_TABLENAME
+					" (name TEXT PRIMARY KEY,"
+					" value INTEGER);",
+						-1, &stmt, NULL);
+	switch (rc) {
+	case SQLITE_OK:
+		rc = sqlite3_step(stmt);
+		if (rc != SQLITE_DONE) {
+			xlog(L_ERROR, "Failed to create table '"
+					STATD_INFO_TABLENAME "': %s",
+					sqlite3_errmsg(db));
+			break;
+		}
+
+		result = TRUE;
+		xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME
+				"' created successfully");
+		break;
+	case SQLITE_ERROR:
+		result = TRUE;
+		xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME "' already exists");
+		goto out;
+	default:
+		xlog(L_ERROR, "Failed to compile SQL: %s",
+				sqlite3_errmsg(db));
+		xlog(L_ERROR, "SQL: %s");
+		goto out;
+	}
+
+	statd_finalize_stmt(stmt);
+
+out:
+	return result;
+}
+
+/*
+ * Returns TRUE if the monitor table exists, or was created; otherwise
+ * FALSE is returned if an error occurs.
+ *
+ * This could be done in svc.c.  Since sm-notify links file.o, that
+ * would mean it would have to link svc.c as well.
+ */
+static bool_t
+statd_create_monitor_table(sqlite3 *db)
+{
+	sqlite3_stmt *stmt;
+	bool_t result;
+	int rc;
+
+	result = FALSE;
+
+	rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
+					" (priv BLOB,"
+					" mon_name TEXT NOT NULL,"
+					" my_name TEXT NOT NULL,"
+					" program INTEGER,"
+					" version INTEGER,"
+					" procedure INTEGER,"
+					" protocol TEXT NOT NULL,"
+					" state INTEGER,"
+					" UNIQUE(mon_name,my_name));",
+						-1, &stmt, NULL);
+	switch (rc) {
+	case SQLITE_OK:
+		rc = sqlite3_step(stmt);
+		if (rc != SQLITE_DONE) {
+			xlog(L_ERROR, "Failed to create table '"
+					STATD_MONITOR_TABLENAME "': %s",
+					sqlite3_errmsg(db));
+			break;
+		}
+
+		result = 1;
+		xlog(D_GENERAL, "Table '" STATD_MONITOR_TABLENAME
+				"' created successfully");
+		break;
+	case SQLITE_ERROR:
+		result = TRUE;
+		xlog(D_GENERAL, "Table '" STATD_MONITOR_TABLENAME
+				"' already exists");
+		goto out;
+	default:
+		xlog(L_ERROR, "Failed to compile SQL: %s",
+				sqlite3_errmsg(db));
+		xlog(L_ERROR, "SQL: %s");
+		goto out;
+	}
+
+	statd_finalize_stmt(stmt);
+
+out:
+	return result;
+}
+
+/*
+ * Ensure database file exists.  This should create the file
+ * with the proper ownership.  Also ensure tables are created.
+ *
+ * XXX: Is it enough to set journal_mode and synchronous just once?
+ */
+static bool_t
+statd_init_database(void)
+{
+	bool_t result;
+	char *err_msg;
+	sqlite3 *db;
+	int rc;
+
+	xlog(D_CALL, "Initializing database");
+
+	result = FALSE;
+	db = statd_open_db(SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+	if (db == NULL)
+		goto out;
+
+	/*
+	 * Don't delete the journal file after each transaction.
+	 * This provides better performance and crash robustness.
+	 */
+	err_msg = NULL;
+	rc = sqlite3_exec(db, "PRAGMA journal_mode=TRUNCATE;", NULL, 0, &err_msg);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to enable persistent journaling: %s",
+				err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+
+	/*
+	 * Protect against db corruption during crashes.
+	 */
+	err_msg = NULL;
+	rc = sqlite3_exec(db, "PRAGMA synchronous=FULL;", NULL, 0, &err_msg);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to enable full synchronous mode: %s",
+				err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+
+	if (!statd_create_info_table(db))
+		goto out_close;
+
+	if (!statd_create_monitor_table(db))
+		goto out_close;
+
+	result = TRUE;
+
+out_close:
+	statd_close_db(db);
+
+out:
+	return result;
+}
+
+/*
+ * Clear all capabilities but CAP_NET_BIND_SERVICE.  This permits
+ * statd to acquire privileged source ports, but all other
+ * capabilities are disallowed.
+ */
+static bool_t
+statd_clear_capabilities(void)
+{
+	bool_t result;
+	cap_t caps;
+
+	result = FALSE;
+
+	caps = cap_from_text("cap_net_bind_service=ep");
+	if (caps == NULL) {
+		xlog(L_ERROR, "Failed to allocate working storage: %m");
+		return result;
+	}
+
+	if (cap_set_proc(caps) == -1) {
+		xlog(L_ERROR, "Failed to set capability flags: %m");
+		goto out_free;
+	}
+
+	result = TRUE;
+
+out_free:
+	(void)cap_free(caps);
+	return result;
+}
+
+/**
+ * statd_drop_privileges - drop root privileges
+ * @keep_bind: caller sets this to keep CAP_NET_BIND_SERVICE
+ *
+ * Returns TRUE if successful, or FALSE if some error occurred.
+ *
+ * Set our effective UID and GID to that of our on-disk database.
+ */
+bool_t
+statd_drop_privileges(const int keep_bind)
+{
+	struct stat st;
+
+	/*
+	 * XXX: If we can't stat dirname, or if dirname is owned by
+	 *      root, we should use "statduser" instead, which is set up
+	 *      by configure.ac
+	 *
+	 *      Nothing in nfs-utils seems to use statduser, though.
+	 */
+	if (lstat(statd_db_dirname, &st) == -1) {
+		xlog(L_ERROR, "Failed to stat %s: %m", statd_db_dirname);
+		return FALSE;
+	}
+
+	if (st.st_uid == 0) {
+		xlog(L_WARNING,
+			"Running as root. chown %s to choose different user",
+				statd_db_dirname);
+		return TRUE;
+	}
+
+        /*
+	 * Don't clear capabilities when dropping root.
+	 */
+        if (keep_bind && prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
+                xlog(L_ERROR, "prctl(PR_SET_KEEPCAPS) failed: %m");
+		return FALSE;
+	}
+
+	if (setgroups(0, NULL) == -1) {
+		xlog(L_ERROR, "Failed to drop supplementary groups: %m");
+		return FALSE;
+	}
+
+	/*
+	 * ORDER
+	 *
+	 * setgid(2) first, as setuid(2) may remove privileges needed
+	 * to set the group id.
+	 */
+	if (setgid(st.st_gid) == -1 || setuid(st.st_uid) == -1) {
+		xlog(L_ERROR, "Failed to drop privileges: %m");
+		return FALSE;
+	}
+	xlog(D_GENERAL, "Effective UID, GID: %u, %u", st.st_uid, st.st_gid);
+
+	if (keep_bind && !statd_clear_capabilities())
+		return FALSE;
+
+	/*
+	 * ORDER
+	 *
+	 * If the on-disk database doesn't exist yet, it is created
+	 * here, after lowering our privileges.  This ensures that
+	 * statd can continue to access the database after privileges
+	 * are dropped.
+	 */
+	return statd_init_database();
+}
+
+/**
+ * statd_system_rebooted - check if we've run since the last reboot
+ *
+ * Returns TRUE if the system has rebooted since we last ran, or if
+ * we can't tell if we rebooted.  Otherwise FALSE is returned.
+ */
+bool_t
+statd_system_rebooted(void)
+{
+	key_t key;
+
+	key = ftok(statd_db_filename, 'r');	/* 'r' as in aRbitrary */
+
+	if (semget(key, 1, IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR) == -1) {
+		if (errno == EEXIST) {
+			xlog(D_GENERAL, "Already ran since last reboot");
+			return FALSE;
+		} else
+		xlog(D_GENERAL, "Failed to find shared semaphor: %m");
+	}
+
+	xlog(D_GENERAL, "First run since last reboot");
+	return TRUE;
+}
+
+/*
+ * Returns TRUE if successful, or FALSE if an error occurred.
+ */
+static bool_t
+statd_insert_info(sqlite3 *db, const char *name, const int64_t value)
+{
+	sqlite3_stmt *stmt;
+	bool_t result;
+	int rc;
+
+	result = FALSE;
+
+	if (!statd_prepare_stmt(db, &stmt, "INSERT OR REPLACE INTO "
+			STATD_INFO_TABLENAME " (name,value) VALUES(?,?);"))
+		goto out;
+
+	rc = sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to bind name %s: %s",
+				name, sqlite3_errmsg(db));
+		goto out_finalize;
+	}
+
+	rc = sqlite3_bind_int64(stmt, 2, (sqlite3_int64)value);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to bind value: %s",
+				sqlite3_errmsg(db));
+		goto out_finalize;
+	}
+
+	rc = sqlite3_step(stmt);
+	if (rc != SQLITE_DONE) {
+		xlog(L_ERROR, "Failed to write NSM state number: %s",
+				sqlite3_errmsg(db));
+		goto out_finalize;
+	}
+
+	result = TRUE;
+
+out_finalize:
+	statd_finalize_stmt(stmt);
+
+out:
+	return result;
+}
+
+/*
+ * Returns TRUE if successful, or FALSE if an error occurred.
+ */
+static bool_t
+statd_get_info(sqlite3 *db, const char *name, int64_t *value)
+{
+	sqlite3_stmt *stmt;
+	bool_t result;
+	int rc;
+
+	result = FALSE;
+
+	if (!statd_prepare_stmt(db, &stmt,
+			"SELECT * FROM " STATD_INFO_TABLENAME " WHERE name=?;"))
+		goto out;
+
+	rc = sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to bind name %s: %s",
+				name, sqlite3_errmsg(db));
+		goto out_finalize;
+	}
+
+	rc = sqlite3_step(stmt);
+	switch (rc) {
+	case SQLITE_ROW:
+		*value = (int64_t)sqlite3_column_int64(stmt, 1);
+		result = TRUE;
+		break;
+	case SQLITE_DONE:
+		xlog(D_GENERAL, "Table '" STATD_INFO_TABLENAME
+				"' did not contain a row for '%s'", name);
+		break;
+	default:
+		xlog(L_ERROR, "Failed to read value of '%s': %s",
+				name, sqlite3_errmsg(db));
+	}
+
+out_finalize:
+	statd_finalize_stmt(stmt);
+
+out:
+	return result;
+}
+
+/**
+ * statd_get_nsm_state - retrieve on-disk NSM state number
+ *
+ * Returns an odd NSM state number, or a starting NSM state number
+ * of 1 if some error occurred.
+ */
+int
+statd_get_nsm_state(void)
+{
+	int64_t state;
+	sqlite3 *db;
+
+	state = 1;
+	db = statd_open_db(SQLITE_OPEN_READONLY);
+	if (db == NULL)
+		goto out;
+
+	if (!statd_get_info(db, "state", &state))
+		goto out_close;
+
+	if (!(state & 1)) {
+		xlog(L_WARNING, "Database contained an even state number");
+		state++;
+	}
+
+out_close:
+	statd_close_db(db);
+
+out:
+	return (int)state;
+}
+
+/**
+ * statd_update_nsm_state - unconditionally update NSM state number
+ * @db: open database connection
+ *
+ * Returns TRUE if the on-disk NSM state number was updated successfully;
+ * otherwise FALSE.
+ */
+bool_t
+statd_update_nsm_state(sqlite3 *db)
+{
+	int64_t state;
+
+	if (!statd_get_info(db, "state", &state))
+		state = 1;
+	else
+		state += 2;
+
+	if (!statd_insert_info(db, "state", state))
+		return FALSE;
+
+	xlog(D_GENERAL, "Bumped NSM state, new value: %d", state);
+	return TRUE;
+}
+
+/**
+ * statd_dump_monitor_list - Display the monitor list
+ *
+ * For debugging.
+ */
+void
+statd_dump_monitor_list(void)
+{
+	int result, err, i, nrow, ncolumn;
+	char **resultp;
+	char *err_msg;
+	sqlite3 *db;
+
+	result = 0;
+
+	db = statd_open_db(SQLITE_OPEN_READONLY);
+	if (db == NULL)
+		return;
+
+	err_msg = NULL;
+	err = sqlite3_get_table(db, "SELECT"
+		" mon_name,my_name,state  FROM " STATD_MONITOR_TABLENAME ";",
+			&resultp, &nrow, &ncolumn, &err_msg);
+	if (err != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to get table '"
+				STATD_MONITOR_TABLENAME "': %s", err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+	if (ncolumn == 0) {
+		xlog(D_GENERAL, "The monitor list currently is empty");
+		goto out_free_table;
+	}
+	if (ncolumn != 3) {
+		xlog(L_ERROR, "Incorrect column count %d in SELECT result",
+				ncolumn);
+		goto out_free_table;
+	}
+
+	xlog(D_GENERAL, "Monitor list (mon_name, my_name, state)");
+	for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn)
+		xlog(D_GENERAL, "%s, %s, %d",
+			resultp[i], resultp[i + 1], resultp[i + 2]);
+
+out_free_table:
+	sqlite3_free_table(resultp);
+
+out_close:
+	statd_close_db(db);
+}
+
+/**
+ * statd_print_capabilities - dump process capability set to log
+ *
+ * For debugging.
+ */
+void
+statd_print_capabilities(void)
+{
+	cap_t caps;
+
+	caps = cap_get_proc();
+	if (caps == NULL) {
+		xlog(L_ERROR, "Failed to get process capaabilities: %m");
+		return;
+	}
+
+	xlog(L_NOTICE, "Capabilities: %s", cap_to_text(caps, NULL));
+
+	(void)cap_free(caps);
+}
diff --git a/utils/new-statd/hostname.c b/utils/new-statd/hostname.c
new file mode 100644
index 0000000..a9f1557
--- /dev/null
+++ b/utils/new-statd/hostname.c
@@ -0,0 +1,520 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * mon_name matching and verification functions.
+ *
+ * Multi-homed remotes usually send only a single SM_NOTIFY when they
+ * reboot, even though the local system may be accessing the remote
+ * by different names.
+ *
+ * statd maps such remotes to the same callback database entry by
+ * performing a series of steps to generate a canonical hostname
+ * from the mon_name passed to it by lockd, which it then uses as
+ * a key to index its database of NLM callback information.  The
+ * main assumption is that the names of the remotes are not volatile
+ * across local host reboots.
+ *
+ * There are some risks to this strategy:
+ *
+ *   1.  The external DNS database can change over time so that
+ *       this canonicalization process generates a different string,
+ *       preventing a clean match
+ *
+ *   2.  Local DNS resolution can produce different results than the
+ *       hostname the remote used to contact us, preventing a clean
+ *       match.
+ *
+ *   3.  Inconsistent forward and reverse DNS maps may prevent a clean
+ *       match.
+ *
+ *   4.  When DNS is slow or unavailable, statd, and thus mounting
+ *       and reboot recovery, hangs.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <ctype.h>
+#include <string.h>
+#include <stdlib.h>
+#include <netdb.h>
+#include <errno.h>
+
+#include <netinet/in.h>
+#include <arpa/inet.h>
+
+#include <rpc/rpc.h>
+#include <rpc/svc.h>
+
+#include "statd.h"
+
+#define IN_LOOPBACK(a)	((((long int)(a)) & 0xff000000) == 0x7f000000)
+
+static bool_t
+statd_report_nonlocal(const struct sockaddr *sap)
+{
+	const struct sockaddr_in *sin = (const struct sockaddr_in *)sap;
+	const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sap;
+	char buf[INET6_ADDRSTRLEN];
+
+	switch (sap->sa_family) {
+	case AF_INET:
+		if (inet_ntop(AF_INET, &sin->sin_addr.s_addr,
+						buf, sizeof(buf)) == NULL)
+			break;
+		xlog(L_WARNING, "Dropping non-local call from %s:%u",
+				buf, ntohs(sin->sin_port));
+		goto out;
+	case AF_INET6:
+		if (inet_ntop(AF_INET6, &sin6->sin6_addr,
+						buf, sizeof(buf)) == NULL)
+			break;
+		xlog(L_WARNING, "Dropping non-local call from %s:%u",
+				buf, ntohs(sin6->sin6_port));
+		goto out;
+	}
+
+	xlog(L_WARNING, "Received call from unrecognized address");
+
+out:
+	return FALSE;
+}
+
+static bool_t
+statd_is_loopback4(const struct sockaddr_in *sin)
+{
+	return IN_LOOPBACK(ntohl(sin->sin_addr.s_addr)) ? TRUE : FALSE;
+}
+
+static bool_t
+statd_is_loopback6(const struct sockaddr_in6 *sin6)
+{
+	return IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr) ? TRUE : FALSE;
+}
+
+static bool_t
+statd_is_loopback46(const struct sockaddr_in6 *sin6)
+{
+	if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr))
+		return IN_LOOPBACK(ntohl(sin6->sin6_addr.s6_addr32[3])) ?
+				TRUE : FALSE;
+	return statd_is_loopback6(sin6);
+}
+
+static bool_t
+statd_localhost4_caller_check(const struct sockaddr *sap)
+{
+	const struct sockaddr_in *sin = (struct sockaddr_in *)sap;
+
+	if (statd_is_loopback4(sin) && ntohs(sin->sin_port) < IPPORT_RESERVED)
+		return TRUE;
+	return statd_report_nonlocal(sap);
+}
+
+static bool_t
+statd_localhost6_caller_check(const struct sockaddr *sap)
+{
+	const struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
+
+	if (statd_is_loopback6(sin6) && ntohs(sin6->sin6_port) < IPPORT_RESERVED)
+		return TRUE;
+	return statd_report_nonlocal(sap);
+}
+
+/**
+ * statd_localhost_caller - ensure RPC caller is from loopback
+ * @rqstp: RPC request metadata
+ *
+ * Return TRUE if caller is localhost; otherwise return FALSE.
+ *
+ * Reject requests from non-loopback addresses in order
+ * to prevent attack described in CERT CA-99.05.  Also ensure
+ * that these calls are from a privileged port.
+ */
+bool_t
+statd_localhost_caller(const struct svc_req *rqstp)
+{
+	const struct netbuf *nbuf = svc_getrpccaller(rqstp->rq_xprt);
+	const struct sockaddr *sap = (struct sockaddr *)nbuf->buf;
+
+	switch (sap->sa_family) {
+	case AF_INET:
+		return statd_localhost4_caller_check(sap);
+	case AF_INET6:
+		return statd_localhost6_caller_check(sap);
+	}
+	return statd_report_nonlocal(sap);
+}
+
+/*
+ * Look up the hostname; report exceptional errors.  Caller must
+ * call freeaddrinfo(3) if a valid addrinfo is returned.
+ */
+static struct addrinfo *
+statd_get_addrinfo(const char *hostname, const struct addrinfo* gai_hint)
+{
+	struct addrinfo *gai_results;
+	int error;
+
+	error = getaddrinfo(hostname, NULL, gai_hint, &gai_results);
+	switch (error) {
+	case 0:
+		xlog(D_CALL, "%s: getaddrinfo('%s') successful",
+				__func__, hostname);
+		return gai_results;
+	case EAI_NONAME:
+		xlog(D_CALL, "%s: No hostname '%s' was found",
+				__func__, hostname);
+		break;
+	case EAI_SYSTEM:
+		xlog(L_ERROR, "Unexpected system error %m while "
+			"resolving host %s", hostname);
+		break;
+	default:
+		xlog(L_ERROR, "Unexpected getaddrinfo error %s while "
+			"resolving host %s", gai_strerror(error), hostname);
+	}
+
+	return NULL;
+}
+
+/**
+ * statd_forward_lookup - hostname or presentation address to sockaddr list
+ * @hostname: C string containing hostname or presentation address
+ * @protocol: IPPROTO_ number hint
+ *
+ * Returns addrinfo list, or NULL if an error occurs.  Caller must free
+ * addrinfo list via freeaddrinfo(3).
+ *
+ * We must be able to send an SM_NOTIFY later, so ensure that the
+ * provided mon_name either is an IP address, or can be resolved by DNS.
+ * Verify that the name does not contain illegal characters.  See CERT
+ * CA-96.09 for details.
+ *
+ * AI_ADDRCONFIG should prevent us from monitoring a host that we can't
+ * reach.  If IPv6 is not enabled on this system, then we don't want to
+ * monitor remote hosts that have only IPv6 addresses.
+ */
+struct addrinfo *
+statd_forward_lookup(const char *hostname, const int protocol)
+{
+	struct addrinfo gai_hint = {
+#ifdef IPV6_SUPPORTED
+		.ai_family	= AF_UNSPEC,
+		.ai_flags	= AI_CANONNAME | AI_ADDRCONFIG,
+#else	/* !IPV6_SUPPORTED */
+		.ai_family	= AF_INET,
+		.ai_flags	= AI_CANONNAME,
+#endif	/* !IPV6_SUPPORTED */
+	};
+
+	gai_hint.ai_protocol = protocol;
+	return statd_get_addrinfo(hostname, &gai_hint);
+}
+
+/*
+ * This module is built only if the local system has TI-RPC
+ * and sqlite3, so we are fairly sure it has a recent enough
+ * glibc that getnameinfo(3) is available.
+ *
+ * Returns a pointer to a C string containing a hostname (never
+ * a presentation address), or NULL if the resolution failed.
+ * Caller must free the string.
+ */
+static char *
+statd_reverse_lookup(const struct sockaddr *sap, const socklen_t salen,
+			const char *hostname)
+{
+	char buf[NI_MAXHOST];
+	int error;
+
+	error = getnameinfo(sap, salen, buf, sizeof(buf), NULL, 0, NI_NAMEREQD);
+	switch (error) {
+	case 0:
+		xlog(D_CALL, "Reverse lookup found hostname '%s'", buf);
+		return strdup(buf);
+	case EAI_NONAME:
+		xlog(D_CALL, "Reverse lookup found no name");
+		break;
+	case EAI_SYSTEM:
+		xlog(L_ERROR, "Unexpected system error %m while "
+			"resolving host %s", hostname);
+		break;
+	default:
+		xlog(L_ERROR, "Unexpected getnameinfo(3) error %s while "
+			"resolving host %s", gai_strerror(error), hostname);
+		break;
+	}
+
+	return NULL;
+}
+
+/**
+ * statd_get_address_list - Generate list of addresses for @hostname
+ * @hostname: C string containing hostname or presentation address
+ * @gai_hint: hint passed to getaddrinfo(3)
+ *
+ * If @hostname is a presentation address, first try to reverse map it
+ * to a canonical hostname.  This allows us to generate a list of
+ * equivalent addresses to try.  If the address in @hostname doesn't
+ * have a reverse map, just convert the address to an addrinfo.
+ *
+ * Returns addrinfo list, or NULL if an error occurs.  Caller must free
+ * addrinfo list via freeaddrinfo(3).
+ */
+struct addrinfo *
+statd_get_address_list(const char *hostname, const struct addrinfo *gai_hint)
+{
+	struct addrinfo reverse_hint, *gai_results;
+	char *name;
+	int error;
+
+	/* AI_NUMERICHOST means this one doesn't go on the wire */
+	reverse_hint = *gai_hint;
+	reverse_hint.ai_flags |= AI_NUMERICHOST;
+
+	error = getaddrinfo(hostname, NULL, &reverse_hint, &gai_results);
+	switch (error) {
+	case 0:
+		xlog(D_CALL, "Hostname '%s' is a presentation format address",
+				hostname);
+		name = statd_reverse_lookup(gai_results->ai_addr,
+						gai_results->ai_addrlen,
+						hostname);
+		if (name == NULL)
+			return gai_results;
+
+		freeaddrinfo(gai_results);
+		gai_results = statd_get_addrinfo(name, gai_hint);
+		free(name);
+		return gai_results;
+	case EAI_NONAME:
+		xlog(D_CALL, "Hostname '%s' is not a presentation format address",
+				hostname);
+		return statd_get_addrinfo(hostname, gai_hint);
+	case EAI_SYSTEM:
+		xlog(L_ERROR, "unexpected system error %m while "
+			"resolving hostname '%s'", hostname);
+		break;
+	default:
+		xlog(L_ERROR, "unexpected getaddrinfo(3) error %s while "
+			"resolving hostname '%s'", gai_strerror(error), hostname);
+	}
+
+	return NULL;
+}
+
+#ifdef IPV6_SUPPORTED
+static bool_t
+statd_compare_sockaddrs(const struct sockaddr *sa1,
+				const struct sockaddr *sa2)
+{
+	const struct sockaddr_in6 *sin1 = (const struct sockaddr_in6 *)sa1;
+	const struct sockaddr_in6 *sin2 = (const struct sockaddr_in6 *)sa2;
+
+	if (statd_is_loopback46(sin1) && statd_is_loopback46(sin2))
+		return TRUE;
+
+	if ((IN6_IS_ADDR_LINKLOCAL(&sin1->sin6_addr) &&
+	     IN6_IS_ADDR_LINKLOCAL(&sin2->sin6_addr)) ||
+	    (IN6_IS_ADDR_SITELOCAL(&sin1->sin6_addr) &&
+	     IN6_IS_ADDR_SITELOCAL(&sin2->sin6_addr)))
+		if (sin1->sin6_scope_id != sin2->sin6_scope_id)
+			return FALSE;
+
+	return IN6_ARE_ADDR_EQUAL(&sin1->sin6_addr, &sin2->sin6_addr) ?
+			TRUE : FALSE;
+}
+
+#else	/* !IPV6_SUPPORTED */
+static bool_t
+statd_compare_sockaddrs(const struct sockaddr *sa1,
+				const struct sockaddr *sa2)
+{
+	const struct sockaddr_in *sin1 = (const struct sockaddr_in *)sa1;
+	const struct sockaddr_in *sin2 = (const struct sockaddr_in *)sa2;
+
+	if (statd_is_loopback4(sin1) && statd_is_loopback4(sin2))
+		return TRUE;
+	return sin1->sin_addr.s_addr == sin2->sin_addr.s_addr ?
+			TRUE : FALSE;
+}
+#endif	/* !IPV6_SUPPORTED */
+
+static bool_t
+statd_compare_addrlists(struct addrinfo *ai1, struct addrinfo *ai2)
+{
+	xlog(D_CALL, "%s", __func__);
+
+	for (; ai1 != NULL; ai1 = ai1->ai_next) {
+		struct addrinfo *ai;
+		for (ai = ai2; ai != NULL; ai = ai->ai_next)
+			if (statd_compare_sockaddrs(ai1->ai_addr, ai->ai_addr))
+				return TRUE;
+	}
+	return FALSE;
+}
+
+/*
+ * If IPV6 is enabled, we always get back a list of only AF_INET6
+ * addresses (including mapped v4 addresses) from getaddrinfo(3).
+ * Thus we can always use IP6_ARE_ADDR_EQUAL to compare these addresses.
+ */
+static bool_t
+statd_matchhostname_dns(const char *hostname1, const char *hostname2)
+{
+	static const struct addrinfo gai_hint = {
+#ifdef IPV6_SUPPORTED
+		.ai_family	= AF_INET6,
+		.ai_flags	= AI_CANONNAME | AI_V4MAPPED,
+#else	/* !IPV6_SUPPORTED */
+		.ai_family	= AF_INET,
+		.ai_flags	= AI_CANONNAME,
+#endif	/* !IPV6_SUPPORTED */
+		.ai_protocol	= IPPROTO_UDP,
+	};
+	struct addrinfo *gai_results1, *gai_results2;
+	bool_t result;
+
+	result = FALSE;
+
+	/*
+	 * Acquire a list of associated IP addresses for each hostname.
+	 */
+	gai_results1 = statd_get_address_list(hostname1, &gai_hint);
+	if (gai_results1 == NULL)
+		return result;
+	gai_results2 = statd_get_address_list(hostname2, &gai_hint);
+	if (gai_results2 == NULL) {
+		freeaddrinfo(gai_results1);
+		return result;
+	}
+
+	/*
+	 * If the canonical hostnames match, or one address in each
+	 * list matches, then we have a positive result.
+	 */
+	if ((strcasecmp(gai_results1->ai_canonname,
+				gai_results2->ai_canonname) != 0) ||
+	    !statd_compare_addrlists(gai_results1, gai_results2))
+		goto out;
+
+	result = TRUE;
+
+out:
+	freeaddrinfo(gai_results2);
+	freeaddrinfo(gai_results1);
+	return result;
+}
+
+/**
+ * statd_match_hostname - check if two hostnames are equivalent
+ * @hostname1: C string containing hostname
+ * @hostname2: C string containing hostname
+ *
+ * Returns TRUE if the hostnames are the same, the hostnames resolve
+ * to the same canonical name, or the hostnames resolve to at least
+ * one address that is the same.  FALSE is returned if the hostnames
+ * do not match in any of these ways, if either hostname contains
+ * illegal characters, or if an error occurs.
+ */
+bool_t
+statd_match_hostname(const char *hostname1, const char *hostname2)
+{
+	bool_t result;
+
+	xlog(D_CALL, "Comparing '%s' and '%s'", hostname1, hostname2);
+
+	if (strcasecmp(hostname1, hostname2) == 0) {
+		xlog(D_GENERAL, "Exact match between '%s' and '%s'",
+				hostname1, hostname2);
+		return TRUE;
+	}
+
+	result = statd_matchhostname_dns(hostname1, hostname2);
+	if (result == TRUE)
+		xlog(D_GENERAL, "DNS match between '%s' and '%s'",
+				hostname1, hostname2);
+	else
+		xlog(D_GENERAL, "Failed to match '%s' and '%s':",
+				hostname1, hostname2);
+	return result;
+}
+
+/**
+ * statd_match_address - check if hostname and IP address are equivalent
+ * @hostname: C string containing hostname
+ * @sap: pointer to socket address
+ *
+ * Returns TRUE if the hostname resolves to @sap.  FALSE is returned if
+ * the hostname does not resolve to address, if @hostname contains
+ * illegal characters, or if an error occurs.
+ */
+bool_t
+statd_match_address(const char *hostname, const struct sockaddr *sap)
+{
+	struct addrinfo gai_hint = {
+		.ai_protocol	= IPPROTO_UDP,
+	};
+	struct addrinfo *gai_results, *ai;
+	char buf[INET6_ADDRSTRLEN];
+
+	if (opt_debug) {
+		const struct sockaddr_in *sin = (const struct sockaddr_in *)sap;
+		const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sap;
+
+		switch (sap->sa_family) {
+		case AF_INET:
+			(void)inet_ntop(AF_INET, &sin->sin_addr.s_addr,
+						buf, sizeof(buf));
+			break;
+		case AF_INET6:
+			(void)inet_ntop(AF_INET6, &sin6->sin6_addr,
+						buf, sizeof(buf));
+			break;
+		default:
+			xlog(L_NOTICE, "%s: Invalid address family", __func__);
+			return FALSE;
+		}
+
+		xlog(D_CALL, "Comparing '%s' and %s", hostname, buf);
+	}
+
+	gai_hint.ai_family = sap->sa_family;
+	gai_results = statd_get_address_list(hostname, &gai_hint);
+	if (gai_results == NULL)
+		goto out_fail;
+
+	for (ai = gai_results; ai != NULL; ai = ai->ai_next)
+		if (statd_compare_sockaddrs(ai->ai_addr, sap)) {
+			xlog(D_GENERAL, "DNS match between '%s' and %s",
+					hostname, buf);
+			return TRUE;
+		}
+
+	freeaddrinfo(gai_results);
+
+out_fail:
+	xlog(D_GENERAL, "Failed to match '%s' and %s", hostname, buf);
+	return FALSE;
+}
diff --git a/utils/new-statd/nlmcall.c b/utils/new-statd/nlmcall.c
new file mode 100644
index 0000000..3a7b3b0
--- /dev/null
+++ b/utils/new-statd/nlmcall.c
@@ -0,0 +1,525 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * Manage NLM callbacks.
+ *
+ * There are some complications.
+ *
+ * 1. Standard RPC library client calls that expect a reply are always
+ *    synchronous with the remote server, and use system calls that
+ *    cause the RPC caller process to block.
+ *
+ *    However, statd cannot block, because its NSM service is
+ *    implemented as a single thread.  Blocking would cause statd to
+ *    become unresponsive.
+ *
+ *    To allow the use of standard RPC client library calls to
+ *    implement our NLM callback, callbacks are done in a separate
+ *    process.  The statd service process hands NLM callback requests
+ *    to this process via a packet socket.
+ *
+ * 2. Statd drops its root privileges before creating listener
+ *    sockets to minimize exposure to network attacks.  Linux lockd
+ *    requires, however, that NLM callbacks are sent from a
+ *    privileged port, in order to prevent users from sending local
+ *    reboot notification spoofs.
+ *
+ *    After starting up, then, the statd process forks the NLM
+ *    callback child process before dropping its root privileges.  The
+ *    child NLM callback process retains its root privileges so it can
+ *    create sockets with privileged source ports as needed.
+ *
+ *    Each transport socket used for NLM callbacks is destroyed after
+ *    the callback finishes to minimize exposure to RPC reply spoofing.
+ *
+ * 3. A UDP socket is used to send an NLM callback.  This prevents
+ *    the callback process from leaving privileged ports in TIME_WAIT
+ *    after each callback request is finished.
+ *
+ * 4. The callback mechanism must tolerate the temporary absence of
+ *    lockd and/or rpcbind.  The NLM callback implementation uses
+ *    a long timeout and a large number of retries to improve the
+ *    chances that the local lockd will see the request.
+ *
+ */
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <sys/resource.h>
+#include <sys/time.h>
+
+#include <time.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <netdb.h>
+
+#include "statd.h"
+#include "nfsrpc.h"
+
+#if SIZEOF_SOCKLEN_T - 0 == 0
+#define socklen_t int
+#endif
+
+struct statd_callback {
+	int			state;
+	struct sockaddr_storage	address;
+	char			mon_name[SM_MAXSTRLEN];
+};
+
+static int statd_sv[2];
+
+/*
+ * Show the priv in hexadecimal.  For debugging only.
+ */
+#define STATD_SIZEOF_HEXBYTE	sizeof("ff")
+#define STATD_PRIVBUF_LEN	(SM_PRIV_SIZE * STATD_SIZEOF_HEXBYTE)
+static const char *
+statd_show_priv(const char *priv)
+{
+	static char buf[STATD_PRIVBUF_LEN];
+	unsigned int i;
+	char *p;
+
+	memset(buf, 0, sizeof(buf));
+	p = &buf[0];
+	for (i = 0; i < SM_PRIV_SIZE; i++)
+		p += sprintf(p, "%02x", 0xff & priv[i]);
+
+	return (const char *)buf;
+}
+
+/*
+ * Synchronously create a socket and RPC client, perform an rpcbind,
+ * send the callback, then shutdown the client and destroy the socket.
+ *
+ * Lockd requires a privileged port for these calls, as a simple form
+ * of authentication.  A UDP transport is used to prevent tying up a
+ * pair of privileged ports for 120 seconds in TIME_WAIT (one for the
+ * rpcbind and one for the call itself).
+ *
+ * Note that some recent kernels do not create a UDP listener for
+ * lockd.  In that case we try the client creation with TCP.
+ *
+ * Even though the reply is void, we leave our RPC client available
+ * to receive it.  The reply tells us the kernel actually saw and
+ * executed the request, thus we don't need to retransmit it.
+ *
+ * Returns TRUE if the call succeeded, or FALSE if caller should try again.
+ */
+static bool_t
+statd_send_one_nlm_callback(const struct mon *monp, const int state)
+{
+	const struct my_id *my_idp = &monp->mon_id.my_id;
+	const rpcprog_t program = (rpcprog_t)my_idp->my_prog;
+	const rpcvers_t version = (rpcvers_t)my_idp->my_vers;
+	const rpcproc_t procedure = (rpcproc_t)my_idp->my_proc;
+	struct timeval timeout = { STATD_NLM_CALLBACK_TIMEOUT, 0 };
+	struct nlm_reboot new_status = {
+		.mon_name	= monp->mon_id.mon_name,
+		.state		= (int)state,
+	};
+	enum clnt_stat status;
+	CLIENT *clnt;
+
+	rpc_createerr.cf_stat = 0;
+	clnt = clnt_create_timed("localhost", program, version,
+						"udp", &timeout);
+	/* Sometimes lockd starts only a TCP listener */
+	if (clnt == NULL && rpc_createerr.cf_stat == RPC_PROGNOTREGISTERED) {
+		rpc_createerr.cf_stat = 0;
+		clnt = clnt_create_timed("localhost", program,
+					version, "tcp", &timeout);
+	}
+	if (clnt == NULL) {
+		xlog(L_ERROR, "Failed to create RPC client for NLM callback: %s",
+				clnt_sperrno(rpc_createerr.cf_stat));
+		return FALSE;
+	}
+
+	memcpy(new_status.priv, monp->priv, sizeof(new_status.priv));
+	xlog(D_GENERAL, "Arguments for callback: %s, %u",
+			new_status.mon_name, new_status.state);
+
+	status = clnt_call(clnt, procedure,
+				(xdrproc_t)xdr_nlm_reboot, (caddr_t)&new_status,
+				(xdrproc_t)xdr_void, NULL, timeout);
+	switch (status) {
+	case RPC_SUCCESS:
+		xlog(D_GENERAL, "NLM callback succeeded");
+		break;
+	default:
+		xlog(D_GENERAL, "NLM callback failed: %s", clnt_sperrno(status));
+		break;
+	}
+
+	clnt_destroy(clnt);
+	return status == RPC_SUCCESS ? TRUE : FALSE;
+}
+
+static bool_t
+statd_send_nlm_callback(const struct mon *monp, const int state)
+{
+	unsigned int retries;
+
+	xlog(D_CALL, "Sending callback for %s/%s (%u, %u, %u) priv: %s",
+			monp->mon_id.mon_name, monp->mon_id.my_id.my_name,
+			monp->mon_id.my_id.my_prog,
+			monp->mon_id.my_id.my_vers,
+			monp->mon_id.my_id.my_proc,
+			statd_show_priv(monp->priv));
+
+	retries = STATD_NLM_CALLBACK_RETRIES;
+	while (retries--) {
+		if (statd_send_one_nlm_callback(monp, state))
+			return TRUE;
+		sleep(2);
+	}
+
+	return FALSE;
+}
+
+/*
+ * Extract the NLM callback arguments from our database,
+ * and send the the RPC.
+ *
+ * Returns TRUE if all went to plan, or FALSE if some error
+ * occurred.
+ */
+static bool_t
+statd_post_nlm_callback(sqlite3 *db, char *mon_name,
+		char *my_name, const int state)
+{
+	sqlite3_stmt *stmt;
+	struct mon mon;
+	bool_t result;
+	int rc, tmp;
+	char *sql;
+
+	memset(&mon, 0, sizeof(mon));
+	result = FALSE;
+
+	sql = sqlite3_mprintf("SELECT priv,program,version,procedure FROM "
+				STATD_MONITOR_TABLENAME
+				" WHERE mon_name='%q' and my_name='%q';",
+					mon_name, my_name);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate SQL command in %s", __func__);
+		return result;
+	}
+
+	rc = statd_prepare_stmt(db, &stmt, sql);
+	sqlite3_free(sql);
+	if (!rc)
+		goto out;
+
+	rc = sqlite3_step(stmt);
+	switch (rc) {
+	case SQLITE_ROW:
+		tmp = sqlite3_column_count(stmt);
+		if (tmp != 4) {
+			xlog(L_ERROR, "Incorrect column count %d "
+					"in SELECT result", tmp);
+			break;
+		}
+
+		tmp = sqlite3_column_bytes(stmt, 0);
+		if (tmp != SM_PRIV_SIZE) {
+			xlog(L_ERROR, "Incorrect priv cookie size %d "
+					"in SELECT result", tmp);
+			//break;
+		}
+		memcpy(mon.priv, sqlite3_column_blob(stmt, 0), sizeof(mon.priv));
+
+		mon.mon_id.mon_name = mon_name;
+		mon.mon_id.my_id.my_name = my_name;
+		mon.mon_id.my_id.my_prog = sqlite3_column_int(stmt, 1);
+		mon.mon_id.my_id.my_vers = sqlite3_column_int(stmt, 2);
+		mon.mon_id.my_id.my_proc = sqlite3_column_int(stmt, 3);
+
+		result = statd_send_nlm_callback(&mon, state);
+		break;
+	case SQLITE_DONE:
+		xlog(L_ERROR, "The monitor list does not contain an entry "
+				"for '%s'", mon_name);
+		break;
+	default:
+		xlog(L_ERROR, "Failed to find row for '%s' in table '"
+				STATD_MONITOR_TABLENAME "': %s",
+				mon_name, sqlite3_errmsg(db));
+	}
+
+	statd_finalize_stmt(stmt);
+
+out:
+	return result;
+}
+
+/*
+ * Remember the new NSM state for this mon_name/my_name.
+ */
+static bool_t
+statd_set_mon_state(sqlite3 *db, const char *mon_name,
+		const char *my_name, const int state)
+{
+	char *sql, *err_msg;
+	bool_t result;
+
+	result = FALSE;
+
+	sql = sqlite3_mprintf("UPDATE " STATD_MONITOR_TABLENAME " SET state=%d"
+				" WHERE mon_name='%q' and my_name='%q';",
+					state, mon_name, my_name);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate SQL command in %s", __func__);
+		return result;
+	}
+
+	err_msg = NULL;
+	if (sqlite3_exec(db, sql, NULL, 0, &err_msg) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to update state for mon_name '%s': %s",
+				mon_name, err_msg);
+		xlog(L_ERROR, "SQL: %s", sql);
+		sqlite3_free(err_msg);
+		goto out_free;
+	}
+
+	result = TRUE;
+	xlog(D_CALL, "Set state to %d for mon_name '%s'", state, mon_name);
+
+out_free:
+	sqlite3_free(sql);
+	return result;
+}
+
+/*
+ * Since it's not likely we will get an exact match on the mon_name
+ * (if the kernel sends us IP addresses to monitor instead of real
+ * hostnames, or the NFS server's actual hostname does not match
+ * the name sent to us by the local NFS client), we must use a
+ * heuristic to match incoming SM_NOTIFY arguments to rows in our
+ * monitor list.
+ */
+static void
+statd_match_monitored_host(const char *mon_name, const int state,
+		const struct sockaddr *sap)
+{
+	int err, i, nrow, ncolumn;
+	char **resultp;
+	char *err_msg;
+	bool_t result;
+	sqlite3 *db;
+
+	result = FALSE;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		goto out;
+
+	err_msg = NULL;
+	err = sqlite3_get_table(db, "SELECT"
+		" mon_name,my_name,state"
+		" FROM " STATD_MONITOR_TABLENAME ";",
+			&resultp, &nrow, &ncolumn, &err_msg);
+	if (err != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to get table '"
+				STATD_MONITOR_TABLENAME "': %s", err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+	if (ncolumn == 0) {
+		xlog(D_GENERAL, "The monitor list is empty");
+		goto out_free_table;
+	}
+	if (ncolumn != 3) {
+		xlog(L_ERROR, "Incorrect column count %d in SELECT result",
+				ncolumn);
+		goto out_free_table;
+	}
+
+	for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn) {
+		/*
+		 * Does the row's mon_name or hostname "match" the incoming
+		 * mon_name?  Does the row's hostname "match" the caller's
+		 * IP address?
+		 */
+		if (!statd_match_hostname(resultp[i], mon_name) &&
+		    !statd_match_address(resultp[i], sap))
+			continue;
+
+		/*
+		 * If we've been sent this state number update
+		 * already, don't send callback.
+		 */
+		if (state == atoi(resultp[i + 2])) {
+			xlog(D_GENERAL, "Already seen state %d from %s",
+					state, mon_name);
+			continue;
+		}
+
+		if (statd_post_nlm_callback(db, resultp[i], resultp[i + 1], state)) {
+			statd_set_mon_state(db, resultp[i], resultp[i + 1],
+						state);
+			result = TRUE;
+		}
+	}
+
+out_free_table:
+	sqlite3_free_table(resultp);
+
+out_close:
+	statd_close_db(db);
+
+out:
+	if (result == FALSE)
+		xlog(D_GENERAL, "SM_NOTIFY for %s call ignored", mon_name);
+}
+
+/*
+ * The statd child process that manages NLM callbacks.
+ *
+ * Read NLM callback requests from our socket, match them against
+ * our monitor list, and execute them.
+ */
+static void
+statd_nlm_callback_process(void)
+{
+	(void)setpriority(PRIO_PROCESS, 0, 19);
+
+	xlog(D_CALL, "NLM callback process %u running", getpid());
+
+	if (!statd_drop_privileges(1))
+		return;
+
+	for (;;) {
+		static struct statd_callback buf;
+
+		if (read(statd_sv[1], &buf, sizeof(buf)) < 1)
+			break;
+
+		xlog(D_CALL, "Received queued NLM callback request");
+
+		statd_match_monitored_host(buf.mon_name, buf.state,
+					(struct sockaddr *)&buf.address);
+	}
+
+	xlog(D_CALL, "Shutting down NLM callback process");
+}
+
+/**
+ * statd_queue_nlm_callback - queue an NLM callback
+ * @mon_name: C string containing mon_name argument of SM_NOTIFY request
+ * @state: NSM state number argument of SM_NOTIFY request
+ * @sap: pointer to IP address of caller
+ *
+ * Called by the statd RPC listener process when a remote has sent
+ * us an SM_NOTIFY for a host we're monitoring.
+ *
+ * Tests show a packet socket can queue over 3000 packets, independent
+ * of packet size.
+ */
+void
+statd_queue_nlm_callback(const char *mon_name, const int state,
+		const struct sockaddr *sap)
+{
+	static struct statd_callback args;
+	ssize_t len;
+
+	switch (sap->sa_family) {
+	case AF_INET:
+		memcpy(&args.address, sap, sizeof(struct sockaddr_in));
+		break;
+	case AF_INET6:
+		memcpy(&args.address, sap, sizeof(struct sockaddr_in6));
+		break;
+	default:
+		xlog(L_ERROR, "Unrecognized SM_NOTIFY caller address");
+		return;
+	}
+	args.state = state;
+	strncpy(args.mon_name, mon_name, sizeof(args.mon_name));
+
+	len = write(statd_sv[0], &args, sizeof(args));
+	if (__exact_error_check(len, sizeof(args))) {
+		if (errno == EAGAIN)
+			xlog(L_ERROR, "NLM callback process not responding");
+		else
+			xlog(L_ERROR, "NLM callback socket write failed: %m");
+		return;
+	}
+
+	xlog(D_CALL, "Queued callback request for %s", mon_name);
+}
+
+/**
+ * statd_init_nlm_callback - set up callback process
+ *
+ * Returns TRUE if all went to plan; otherwise returns FALSE.
+ */
+bool_t
+statd_init_nlm_callback(void)
+{
+	const struct sigaction statd_sigchld_action = {
+		.sa_handler	= SIG_DFL,
+		.sa_flags	= SA_NOCLDSTOP | SA_NOCLDWAIT,
+	};
+	bool_t result;
+
+	result = FALSE;
+
+	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, statd_sv) < 0) {
+		xlog(L_ERROR, "Failed to create NLM callback sockets: %m");
+		return result;
+	}
+
+	/*
+	 * Make the write end non-blocking so the NSM service doesn't
+	 * stall if the child end stops responding.
+	 */
+	if (fcntl(statd_sv[0], F_SETFL, O_NONBLOCK) == -1)
+		xlog(L_ERROR, "Failed to make NLM callback socket nonblocking: %m");
+	else {
+		switch (fork()) {
+		case -1:
+			xlog(L_ERROR, "Failed to fork NLM callback process: %m");
+			close(statd_sv[0]);
+			break;
+		case 0:
+			/* Child. */
+			close(statd_sv[0]);
+			statd_nlm_callback_process();
+			break;
+		default:
+			/* Parent. */
+			(void)sigaction(SIGCHLD, &statd_sigchld_action, NULL);
+			usleep(10000);	/* Let child run */
+			result = TRUE;
+		}
+	}
+
+	close(statd_sv[1]);
+	return result;
+}


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/4] nfs-utils: introduce new statd implementation (2nd part)
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
  2009-08-05 14:45   ` [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part) Chuck Lever
@ 2009-08-05 14:45   ` Chuck Lever
  2009-08-05 14:46   ` [PATCH 3/4] nfs-utils: introduce new statd implementation (3rd part) Chuck Lever
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 14:45 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs

Finish introducing source files for new statd.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 utils/new-statd/smncall.c |  774 +++++++++++++++++++++++++++++++++++++++++
 utils/new-statd/svc.c     |  841 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1615 insertions(+), 0 deletions(-)
 create mode 100644 utils/new-statd/smncall.c
 create mode 100644 utils/new-statd/svc.c

diff --git a/utils/new-statd/smncall.c b/utils/new-statd/smncall.c
new file mode 100644
index 0000000..e6aba47
--- /dev/null
+++ b/utils/new-statd/smncall.c
@@ -0,0 +1,774 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ */
+
+/*
+ * Send reboot notifications to remote peers on our notify list.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <unistd.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <errno.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+
+#include <netinet/in.h>
+#include <arpa/inet.h>
+
+#include "statd.h"
+#include "nfsrpc.h"
+
+static const char *statd_pgmtbl[] = {
+       "status",
+       NULL,
+};
+
+static void
+statd_delete_notified_host(const char *mon_name, const char *my_name)
+{
+	sqlite3_stmt *stmt;
+	sqlite3 *db;
+	char *sql;
+	bool_t rc;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		return;
+
+	sql = sqlite3_mprintf("DELETE FROM " STATD_NOTIFY_TABLENAME
+				" WHERE mon_name='%q' and my_name='%q'",
+					mon_name, my_name);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate DELETE FROM statement", __func__);
+		goto out_close;
+	}
+
+	rc = statd_prepare_stmt(db, &stmt, sql);
+	sqlite3_free(sql);
+	if (!rc)
+		goto out_close;
+
+	switch (sqlite3_step(stmt)) {
+	case SQLITE_DONE:
+		if (sqlite3_changes(db) == 0)
+			xlog(L_ERROR, "Mon_name '%s' not found "
+					"in the notify list", mon_name);
+		else
+			xlog(D_GENERAL, "Mon_name '%s' removed from notify list",
+					mon_name);
+		break;
+	default:
+		xlog(L_ERROR, "Failed to delete row for mon_name '%s': %s",
+				mon_name, sqlite3_errmsg(db));
+	}
+
+	statd_finalize_stmt(stmt);
+
+out_close:
+	statd_close_db(db);
+}
+
+/*
+ * Returns 1 if the table already exists, or was created; otherwise
+ * zero is returned if an error occurred.
+ */
+static bool_t
+statd_create_notify_table(sqlite3 *db)
+{
+	sqlite3_stmt *stmt;
+	bool_t result;
+	int rc;
+
+	result = FALSE;
+
+	rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_NOTIFY_TABLENAME
+					" (mon_name TEXT NOT NULL,"
+					" my_name TEXT NOT NULL,"
+					" protocol TEXT NOT NULL,"
+					" UNIQUE(mon_name, my_name));",
+						-1, &stmt, NULL);
+	switch (rc) {
+	case SQLITE_OK:
+		rc = sqlite3_step(stmt);
+		switch (rc) {
+		case SQLITE_DONE:
+			result = TRUE;
+			xlog(D_GENERAL, "Table '" STATD_NOTIFY_TABLENAME
+					"' created successfully");
+			break;
+		default:
+			xlog(L_ERROR, "Failed to create table '"
+					STATD_NOTIFY_TABLENAME "': %s",
+					sqlite3_errmsg(db));
+		}
+
+		statd_finalize_stmt(stmt);
+		break;
+	case SQLITE_ERROR:
+		result = TRUE;
+		xlog(D_GENERAL, "Table '" STATD_NOTIFY_TABLENAME
+				"' already exists");
+		break;
+	default:
+		xlog(L_ERROR, "Failed to compile SQL: %s",
+				sqlite3_errmsg(db));
+		xlog(L_ERROR, "SQL: %s");
+		break;
+	}
+
+	return result;
+}
+
+static bool_t
+statd_merge_notify_list(sqlite3 *db)
+{
+	char *err_msg;
+
+	if (sqlite3_exec(db,
+		"INSERT OR REPLACE INTO " STATD_NOTIFY_TABLENAME
+		" SELECT mon_name,my_name,protocol FROM " STATD_MONITOR_TABLENAME ";",
+					NULL, 0, &err_msg) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to merge notify list: %s", err_msg);
+		sqlite3_free(err_msg);
+		return FALSE;
+	}
+	return TRUE;
+}
+
+static bool_t
+statd_empty_monitor_list(sqlite3 *db)
+{
+	char *err_msg;
+
+	err_msg = NULL;
+	if (sqlite3_exec(db, "DELETE FROM " STATD_MONITOR_TABLENAME ,
+					NULL, 0, &err_msg) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to empty the monitor list: %s",
+					sqlite3_errmsg(db));
+		sqlite3_free(err_msg);
+		return FALSE;
+	}
+
+	xlog(D_GENERAL, "All monitor list entries were deleted");
+	return TRUE;
+}
+
+/*
+ * Returns 1 if monitor list was retired.  Returns zero if there
+ * were no hosts to notify.  Otherwise -1 is returned if some
+ * error occurred.
+ */
+static int
+statd_retire_monitor_list(void)
+{
+	sqlite3 *db;
+	int result;
+
+	result = 0;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		return result;
+
+	if (!statd_begin_transaction(db))
+		goto out_close;
+
+	if (!statd_update_nsm_state(db))
+		goto out_rollback;
+
+	if (!statd_create_notify_table(db))
+		goto out_rollback;
+
+	if (!statd_merge_notify_list(db))
+		goto out_rollback;
+
+	if (!statd_empty_monitor_list(db))
+		goto out_rollback;
+
+	statd_end_transaction(db);
+	result = 1;
+
+	xlog(D_CALL, "Retired monitor list");
+
+out_close:
+	statd_close_db(db);
+	return result;
+
+out_rollback:
+	statd_rollback_transaction(db);
+	goto out_close;
+}
+
+/*
+ * Ensure IP addresses and hostnames are mapped to a working address.
+ * If @hostname is NULL, generate an appropriate ANYADDR address instead.
+ *
+ * You might think the bind address is loop invariant (ie the same for
+ * all notified peers).  Unfortunately the DNS lookup here will fail if
+ * sm-notify is started before networking is up on the local host.  So
+ * we have to make statd_verify_bindaddr() part of the retry loop.
+ */
+static struct addrinfo *
+statd_verify_bindaddr(const char *hostname, const uint16_t port,
+		const int family, const int transport)
+{
+	struct addrinfo gai_hint = {
+		.ai_flags	= AI_NUMERICSERV,
+		.ai_family	= family,
+		.ai_protocol	= transport,
+	};
+	struct addrinfo	*gai_results;
+	char buf[8];
+	int error;
+
+	buf[0] = '\0';
+	if (port != 0)
+		(void)snprintf(buf, sizeof(buf), "%u", port);
+	if (hostname == NULL)
+		gai_hint.ai_flags = AI_PASSIVE;
+	error = getaddrinfo(hostname, buf, &gai_hint, &gai_results);
+	switch (error) {
+	case 0:
+		return gai_results;
+	case EAI_NONAME:
+		if (hostname != NULL)
+			xlog(L_ERROR, "The supplied source address \"%s\" "
+				"is invalid", hostname);
+		else
+			/* Shouldn't ever happen, but just in case... */
+			xlog(L_ERROR, "EAI_NONAME for a NULL hostname?");
+		break;
+	case EAI_SYSTEM:
+		xlog(L_ERROR, "While determining bind address: %m");
+		break;
+	default:
+		xlog(L_ERROR, "While determining bind address: %s",
+			gai_strerror(error));
+	}
+
+	return NULL;
+}
+
+static int
+statd_socket(const struct addrinfo *gai_results)
+{
+	int sock;
+
+	sock = socket(gai_results->ai_family, gai_results->ai_socktype, 0);
+	if (sock < 0) {
+		xlog(L_ERROR, "Failed to create RPC socket: %m");
+		return -1;
+	}
+
+	fcntl(sock, F_SETFL, O_NONBLOCK);
+	if (gai_results->ai_family == AF_INET6) {
+		const int zero = 0;
+		/* We want replies from both IPv4 and IPv6 remotes, please */
+		setsockopt(sock, SOL_IPV6, IPV6_V6ONLY,
+				(char *)&zero, sizeof(zero));
+	}
+
+	return sock;
+}
+
+static int
+statd_bind(int sock, struct addrinfo *gai_results)
+{
+	struct sockaddr *sap = gai_results->ai_addr;
+	socklen_t salen = gai_results->ai_addrlen;
+
+	/* port was set by getaddrinfo(3) */
+	if (bind(sock, sap, salen) == -1) {
+		xlog(L_ERROR, "Failed to bind RPC socket: %m");
+		close(sock);
+		sock = -1;
+	}
+
+	return sock;
+}
+
+/*
+ * @bindaddr can be NULL; in this case, an appropriate ANYADDR is used.
+ * @srcport can be zero; in this case, an appropriate reserved port
+ * is used.
+ *
+ * Returns a bound socket file descriptor, or -1 if an error occurs.
+ */
+static int
+statd_create_socket(const char *bindaddr, const uint16_t srcport,
+			const int family, const int protocol)
+{
+	struct addrinfo *gai_results;
+	char *transport;
+	int sock;
+
+	switch (protocol) {
+	case IPPROTO_UDP:
+		transport = "datagram";
+		break;
+	case IPPROTO_TCP:
+		transport = "stream";
+		break;
+	default:
+		transport = "unknown";
+	}
+	xlog(D_CALL, "Creating %s socket with bindaddr '%s' and port %u",
+			transport, bindaddr, srcport);
+
+	gai_results = statd_verify_bindaddr(bindaddr, srcport, family, protocol);
+	if (gai_results == NULL)
+		return -1;
+
+	sock = statd_socket(gai_results);
+	if (sock < 0)
+		return -1;
+
+	sock = statd_bind(sock, gai_results);
+
+	freeaddrinfo(gai_results);
+	return sock;
+}
+
+/*
+ * @timeout is modified to contain the time remaining (i.e. time
+ * provided minus time elasped).
+ *
+ * Returns zero on success, or returns -1 on error.  errno is
+ * set to reflect the nature of the error.
+ */
+static int
+statd_connect_socket(const int sock, const struct sockaddr *sap,
+			const socklen_t salen, struct timeval *timeout)
+{
+	int save, result;
+	fd_set rset;
+
+	result = -1;
+
+	save = fcntl(sock, F_GETFL, 0);
+	if (save == -1) {
+		xlog(L_ERROR, "Failed to get socket flags: %m");
+		goto out;
+	}
+
+	if (fcntl(sock, F_SETFL, save | O_NONBLOCK) == -1) {
+		xlog(L_ERROR, "Failed to set socket flags: %m");
+		goto out;
+	}
+
+	if (connect(sock, sap, salen) == -1) {
+		switch (errno) {
+		case EINPROGRESS:
+			break;
+		case EINVAL:
+			xlog(L_ERROR, "Invalid bind address");
+			goto out_restore;
+		default:
+			xlog(L_ERROR, "Failed to connect socket: %m");
+			goto out_restore;
+		}
+	} else {
+		result = 0;
+		goto out_restore;
+	}
+
+	FD_ZERO(&rset);
+	FD_SET(sock, &rset);
+
+	result = select(sock + 1, NULL, &rset, NULL, timeout);
+	if (result <= 0) {
+		if (result == 0)
+			errno = ETIMEDOUT;
+		xlog(L_ERROR, "Failed to select: %m");
+		result = -1;
+		goto out_restore;
+	}
+
+	if (FD_ISSET(sock, &rset)) {
+		socklen_t len = (socklen_t)sizeof(result);
+
+		if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &result, &len) == -1) {
+			xlog(L_ERROR, "Failed to get socket options: %m");
+			result = -1;
+			goto out_restore;
+		}
+
+		if (result != 0) {
+			xlog(L_ERROR, "Invalid socket options: %s",
+					strerror(result));
+			errno = result;
+			result = -1;
+		}
+	}
+
+out_restore:
+	(void)fcntl(sock, F_SETFL, save);
+
+out:
+	return result;
+}
+
+/*
+ * clnt_create_timed(3) could do all this for us, but we have to
+ * allow admins to set the socket's bind address and source port.
+ *
+ * @bindaddr is supposedly necessary because Linux's statd can
+ * scrape the source address of incoming SM_NOTIFY requests to
+ * figure out if it knows about this host.
+ */
+static CLIENT *
+statd_create_clnt(struct sockaddr *sap, const socklen_t salen,
+		const char *bindaddr, const unsigned short srcport,
+		const unsigned short protocol, struct timeval *timeout)
+{
+	CLIENT *clnt = NULL;
+	struct netbuf nbuf = {
+		.maxlen	= salen,
+		.len	= salen,
+		.buf	= sap,
+	};
+	rpcprot_t program;
+	uint16_t port;
+	int sock;
+
+	/*
+	 * Chicken and egg... rpcb_getaddr(3t) could give us the
+	 * address and port of the remote, but we would need to pass
+	 * it an nconf.  To get an nconf, we need the transport
+	 * protocol and the address family.  But we don't know until
+	 * we have the remote's address what address family to use.
+	 *
+	 * Lame.
+	 */
+	program = nfs_getrpcbyname(SM_PROG, statd_pgmtbl);
+	port = nfs_getport(nbuf.buf, nbuf.len, program, SM_VERS, protocol);
+	if (port == 0) {
+		xlog(L_WARNING, "Failed to get server's port: %s",
+				clnt_sperrno(rpc_createerr.cf_stat));
+		return NULL;
+	}
+	nfs_set_port(sap, port);
+
+	/*
+	 * ...and we can't create our transport socket until we know
+	 * which address family we need to use to contact the remote.
+	 */
+	sock = statd_create_socket(bindaddr, srcport,
+				sap->sa_family, protocol);
+	if (sock == -1)
+		return NULL;
+
+	if (protocol == IPPROTO_TCP)
+		if (statd_connect_socket(sock, nbuf.buf, nbuf.len,
+							timeout) == -1) {
+			close(sock);
+			return NULL;
+		}
+
+	clnt = clnt_tli_create(sock, NULL, &nbuf, program, SM_VERS,
+					RPCSMALLMSGSIZE, RPCSMALLMSGSIZE);
+	if (clnt == NULL) {
+		xlog(L_WARNING, "Failed to create RPC client: %s",
+				clnt_sperrno(rpc_createerr.cf_stat));
+		return NULL;
+	}
+
+	CLNT_CONTROL(clnt, CLSET_FD_CLOSE, NULL);
+	return clnt;
+}
+
+static unsigned int
+statd_sm_notify_call(struct sockaddr *sap, const socklen_t salen,
+		const char *bindaddr, const unsigned short srcport,
+		const unsigned short protocol,
+		const char *my_name, const int my_state)
+{
+	struct timeval timeout = { SMN_TIMEOUT, 0 };
+	static char hostbuf[NI_MAXHOST];
+	struct stat_chge new_status;
+	enum clnt_stat status;
+	CLIENT *clnt;
+
+	getnameinfo(sap, salen, hostbuf, sizeof(hostbuf), NULL, 0, NI_NUMERICHOST);
+	xlog(D_CALL, "Attempting to send notification to %s", hostbuf);
+
+	nfs_clear_rpc_createerr();
+	clnt = statd_create_clnt(sap, salen, bindaddr, srcport,
+						protocol, &timeout);
+	if (clnt == NULL)
+		return 0;
+
+	new_status.state = my_state;
+	new_status.mon_name = strdup(my_name);
+	status = clnt_call(clnt, SM_NOTIFY,
+				(xdrproc_t)xdr_stat_chge, (caddr_t)&new_status,
+				(xdrproc_t)xdr_void, NULL, timeout);
+	free(new_status.mon_name);
+	clnt_destroy(clnt);
+
+	if (status != RPC_SUCCESS) {
+		xlog(D_CALL, "Notification RPC failed: %s", clnt_sperrno(status));
+		return 0;
+	}
+
+	xlog(D_CALL, "Notification RPC succeeded");
+	return 1;
+}
+
+/*
+ * Attempt to send to each of this host's known IP addresses.
+ * This is a weak attempt to send from each interface and/or
+ * protocol family on the local host.
+ *
+ * We might want to beef this up by looking at what network
+ * interfaces exist on the system and trying to send from
+ * each of them.  Unfortunately in the world of DHCP and
+ * NetworkManager, we can't count on the local network
+ * configuration to be stable, especially at boot time when
+ * this function is typically invoked..
+ *
+ * If we get at least one through, consider it a success and
+ * return TRUE.
+ */
+static bool_t
+statd_notify_host(const char *hostname, const char *bindaddr,
+		const unsigned short srcport, const unsigned short protocol,
+		const char *my_name, const int my_state)
+{
+	struct addrinfo gai_hint = {
+#ifdef IPV6_SUPPORTED
+		.ai_family	= AF_UNSPEC,
+		.ai_flags	= AI_ADDRCONFIG,
+#else	/* !IPV6_SUPPORTED */
+		.ai_family	= AF_INET,
+#endif	/* !IPV6_SUPPORTED */
+	};
+	struct addrinfo	*gai_results, *ai;
+	char canon_name[NI_MAXHOST];
+	unsigned int count;
+
+	gai_hint.ai_protocol = protocol;
+	gai_results = statd_get_address_list(canon_name, &gai_hint);
+	if (gai_results == NULL)
+		return FALSE;
+
+	count = 0;
+	ai = gai_results;
+	while (ai != NULL) {
+		count += statd_sm_notify_call(gai_results->ai_addr,
+				gai_results->ai_addrlen, bindaddr, srcport,
+				protocol, my_name, my_state);
+		ai = ai->ai_next;
+	}
+	freeaddrinfo(gai_results);
+
+	if (count == 0) {
+		xlog(D_CALL, "Failed to notify Host %s", hostname);
+		return FALSE;
+	}
+
+	xlog(D_CALL, "Host %s notified successfully", hostname);
+	return TRUE;
+}
+
+/*
+ * Keep trying until we succeed or our timer expires
+ */
+static void
+statd_notification_child(const char *mon_name, const char *bindaddr,
+		const unsigned short srcport, const char *transport,
+		const char *my_name, const int my_state,
+		const time_t timeout)
+{
+	const struct sigaction statd_create_sigaction = {
+		.sa_handler	= SIG_IGN,
+	};
+	unsigned int sleep_time = SMN_TIMEOUT;
+	unsigned short protocol;
+
+	(void)setpriority(PRIO_PROCESS, 0, 19);
+
+	if (strcmp(transport, "tcp") == 0) {
+		/* Don't bother our parent if the other end
+		 * closes our TCP connection. */
+		(void)sigaction(SIGPIPE, &statd_create_sigaction, NULL);
+		protocol = IPPROTO_TCP;
+	} else if (strcmp(transport, "udp") == 0) {
+		protocol = IPPROTO_UDP;
+	} else {
+		xlog(L_ERROR, "Unknown transport protocol for notification");
+		return;
+	}
+
+	for (;;) {
+		if (statd_notify_host(mon_name, bindaddr, srcport,
+						protocol, my_name, my_state)) {
+			statd_delete_notified_host(mon_name, my_name);
+			break;
+		}
+
+		if (timeout != 0 && timeout < time(NULL)) {
+			xlog(L_WARNING, "Could not notify %s; giving up",
+				mon_name);
+			break;
+		}
+
+		(void)sleep(sleep_time);
+
+		sleep_time <<= 2;
+		if (sleep_time > 60)
+			sleep_time = 60;
+	}
+}
+
+/*
+ * Hosts that can't be contacted now shouldn't block notification
+ * of active hosts that may be farther down the notify list.  So
+ * fork off all notifications.
+ */
+static void
+statd_fork_and_notify(const char *mon_name, const char *bindaddr,
+		const unsigned short srcport, const char *transport,
+		const char *my_name, const int my_state,
+		const unsigned long max_retry_minutes)
+{
+	time_t timeout = 0;
+
+	if (max_retry_minutes != 0)
+		timeout = time(NULL) + max_retry_minutes * 60;
+
+	switch (fork()) {
+	case -1:
+		xlog(L_ERROR, "Failed to fork notification process: %m");
+		break;
+	case 0:
+		/* Child. */
+		statd_notification_child(mon_name, bindaddr, srcport,
+					transport, my_name, my_state, timeout);
+		_exit(EXIT_SUCCESS);
+	default:
+		/* Parent. */
+		usleep(100000);
+		break;
+	}
+}
+
+/*
+ * Returns TRUE if all listed notifications have been scheduled;
+ * otherwise FALSE is returned.
+ */
+static bool_t
+statd_iterate_notify_list(const char *bindaddr, const unsigned short srcport,
+		const unsigned long max_retry_minutes)
+{
+	int i, nrow, ncolumn, my_state;
+	char **resultp;
+	char *err_msg;
+	bool_t result;
+	sqlite3 *db;
+
+	result = FALSE;
+	my_state = statd_get_nsm_state();
+
+	db = statd_open_db(SQLITE_OPEN_READONLY);
+	if (db == NULL)
+		return result;
+
+	err_msg = NULL;
+	if (sqlite3_get_table(db, "SELECT mon_name,my_name,protocol FROM "
+			STATD_NOTIFY_TABLENAME ";",
+			&resultp, &nrow, &ncolumn, &err_msg) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to get '" STATD_NOTIFY_TABLENAME
+				"' table: %s", err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+
+	if (ncolumn == 0) {
+		xlog(D_GENERAL, "No hosts to notify");
+		result = TRUE;
+		goto out_free_table;
+	}
+	if (ncolumn != 3) {
+		xlog(L_ERROR, "Incorrect column count (%d) in SELECT result",
+				ncolumn);
+		goto out_free_table;
+	}
+
+	xlog(D_CALL, "Notification NSM state: %d", my_state);
+
+	for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn)
+		statd_fork_and_notify(resultp[i],
+				bindaddr, srcport, resultp[i + 2],
+				resultp[i + 1], my_state, max_retry_minutes);
+	result = TRUE;
+
+out_free_table:
+	sqlite3_free_table(resultp);
+
+out_close:
+	statd_close_db(db);
+	return result;
+}
+
+/**
+ * statd_notify - send SM_NOTIFY requests
+ * @bindaddr: C string containing DNS name or presentation address
+ *	of socket's source address
+ * @srcport: source port to use when sending reboot notifications
+ * @max_retry_minutes: minutes after which to stop attempting to contact peer
+ *
+ * Copy the on-disk monitored host list to a backup directory,
+ * update the local NSM state number, and fork off notification
+ * processes.
+ *
+ * Returns TRUE if notificaitions were started successfully,
+ * or FALSE if an error occurred.
+ */
+bool_t
+statd_notify(const char *bindaddr, const unsigned short srcport,
+		const unsigned long max_retry_minutes)
+{
+	const struct sigaction statd_sigchld_action = {
+		.sa_handler	= SIG_DFL,
+		.sa_flags	= SA_NOCLDSTOP | SA_NOCLDWAIT,
+	};
+
+	(void)sigaction(SIGCHLD, &statd_sigchld_action, NULL);
+
+	switch (statd_retire_monitor_list()) {
+	case 0:
+		return TRUE;
+	case 1:
+		return statd_iterate_notify_list(bindaddr, srcport,
+							max_retry_minutes);
+	default:
+		return FALSE;
+	}
+}
diff --git a/utils/new-statd/svc.c b/utils/new-statd/svc.c
new file mode 100644
index 0000000..fa9a517
--- /dev/null
+++ b/utils/new-statd/svc.c
@@ -0,0 +1,841 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ *
+ * Convert incoming NSM RPC requests into local function calls.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <memory.h>
+#include <signal.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/in.h>
+
+#include <sys/socket.h>
+#include <sys/resource.h>
+
+#include <rpc/rpc.h>
+#include <rpc/svc.h>
+
+#ifdef HAVE_TCP_WRAPPER
+#include "tcpwrapper.h"
+#endif
+
+#include "statd.h"
+#include "ha-callout.h"
+#include "rpcmisc.h"
+
+/*
+ * Show the priv in hexadecimal.  For debugging only.
+ */
+#define STATD_SIZEOF_HEXBYTE	sizeof("ff")
+#define STATD_PRIVBUF_LEN	(SM_PRIV_SIZE * STATD_SIZEOF_HEXBYTE)
+static const char *
+statd_show_priv(const char *priv)
+{
+	static char buf[STATD_PRIVBUF_LEN];
+	unsigned int i;
+	char *p;
+
+	memset(buf, 0, sizeof(buf));
+	p = &buf[0];
+	for (i = 0; i < SM_PRIV_SIZE; i++)
+		p += sprintf(p, "%02x", 0xff & priv[i]);
+
+	return (const char *)buf;
+}
+
+/*
+ * Simplistic method to generate a presentation format version
+ * of the RPC caller's address.  Used only for debugging messages.
+ */
+static void
+statd_caller(struct svc_req *rqstp, char *buf, const size_t buflen)
+{
+	const struct sockaddr *sap = (struct sockaddr *)
+					svc_getcaller(rqstp->rq_xprt);
+	const struct sockaddr_in *sin = (const struct sockaddr_in *)sap;
+	const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sap;
+
+	switch (sap->sa_family) {
+	case AF_INET:
+		if (inet_ntop(AF_INET, &sin->sin_addr.s_addr, buf, buflen) == NULL)
+			goto out_unknown;
+		break;
+	case AF_INET6:
+		if (inet_ntop(AF_INET6, &sin6->sin6_addr, buf, buflen) == NULL)
+			goto out_unknown;
+		break;
+	}
+
+	return;
+
+out_unknown:
+	(void)snprintf(buf, buflen, "unknown address");
+}
+
+static void
+statd_send_null_reply(SVCXPRT *xprt, const char *procname)
+{
+	if (!svc_sendreply(xprt, (xdrproc_t)xdr_void, NULL)) {
+		xlog(L_WARNING, "Failed to send %s reply", procname);
+		svcerr_systemerr(xprt);
+	}
+}
+
+/**
+ * statd_svc_null_1 - service a NULL request
+ *
+ * Currently the kernel NLM implementation does not send us
+ * this request, so this procedure is not implemented.
+ */
+static void
+statd_svc_null_1(struct svc_req *rqstp __attribute((unused)), SVCXPRT *xprt)
+{
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received NULLPROC request from %s", buf);
+	}
+
+	statd_send_null_reply(xprt, "NULLPROC");
+}
+
+/**
+ * statd_svc_stat_1 - service an SM_STAT request
+ *
+ * Returns stat_succ if the host can be monitored, otherwise stat_fail
+ * is returned.  Our current local NSM state is always returned.
+ */
+static void
+statd_svc_stat_1(struct svc_req *rqstp __attribute((unused)), SVCXPRT *xprt)
+{
+	struct addrinfo *gai_results;
+	struct sm_stat_res result;
+	struct sm_name arg;
+
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_STAT request from %s for %s",
+				buf, arg.mon_name);
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	if (!svc_getargs(xprt, (xdrproc_t)xdr_sm_name, (caddr_t)&arg)) {
+		xlog(L_WARNING, "Failed to decode SM_STAT arguments");
+		svcerr_decode(xprt);
+		return;
+	}
+
+	result.state = statd_get_nsm_state();
+	result.res_stat = stat_succ;
+	gai_results = statd_forward_lookup(arg.mon_name, IPPROTO_UDP);
+	if (gai_results == NULL) {
+		result.res_stat = stat_fail;
+		xlog(D_CALL, "SM_STAT fail for '%s'", arg.mon_name);
+	} else {
+		freeaddrinfo(gai_results);
+		xlog(D_CALL, "SM_STAT success for '%s', state=%d",
+				arg.mon_name, result.state);
+	}
+
+	if (!svc_sendreply(xprt, (xdrproc_t)xdr_sm_stat_res, &result)) {
+		xlog(L_WARNING, "Failed to send SM_STAT reply");
+		svcerr_systemerr(xprt);
+	}
+
+	if (!svc_freeargs(xprt, (xdrproc_t)xdr_mon, (caddr_t)&arg))
+		xlog(L_WARNING, "Failed to free SM_STAT arguments");
+}
+
+/*
+ * Insert a row for the corresponding host.  Returns 1 if the
+ * row was inserted, -1 if a row for this host already exists,
+ * or zero if an error occurs.
+ */
+static int
+statd_insert_row(sqlite3 *db, const struct mon *monp,
+		const char *protocol)
+{
+	sqlite3_stmt *stmt;
+	int result;
+	bool_t rc;
+	char *sql;
+
+	result = 0;
+
+	sql = sqlite3_mprintf("INSERT INTO " STATD_MONITOR_TABLENAME
+				" (priv,mon_name,my_name,"
+					"program,version,procedure,"
+					"protocol,state)"
+				" VALUES(?,'%q','%q',%d,%d,%d,'%q',0);",
+				monp->mon_id.mon_name,
+				monp->mon_id.my_id.my_name,
+				monp->mon_id.my_id.my_prog,
+				monp->mon_id.my_id.my_vers,
+				monp->mon_id.my_id.my_proc,
+				protocol);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate INSERT INTO statement");
+		return result;
+	}
+
+	rc = statd_prepare_stmt(db, &stmt, sql);
+	sqlite3_free(sql);
+	if (!rc)
+		return result;
+
+	if (sqlite3_bind_blob(stmt, 1, monp->priv, sizeof(monp->priv),
+					SQLITE_STATIC) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to bind priv: %s",
+				sqlite3_errmsg(db));
+		goto out_finalize;
+	}
+
+	switch (sqlite3_step(stmt)) {
+	case SQLITE_DONE:
+		result = 1;
+		xlog(D_GENERAL, "Inserted callback record for %s/%s "
+				"(%u, %u, %u) priv: %s",
+				monp->mon_id.mon_name,
+				monp->mon_id.my_id.my_name,
+				monp->mon_id.my_id.my_prog,
+				monp->mon_id.my_id.my_vers,
+				monp->mon_id.my_id.my_proc,
+				statd_show_priv(monp->priv));
+		break;
+	case SQLITE_CONSTRAINT:
+		xlog(D_GENERAL, "Duplicate record for mon_name '%s' "
+				"found in the monitor list",
+					monp->mon_id.mon_name);
+		result = -1;
+		break;
+	default:
+		xlog(L_ERROR, "Failed to insert record for mon_name '%s' "
+				"into the monitor list",
+					monp->mon_id.mon_name);
+		break;
+	}
+
+out_finalize:
+	statd_finalize_stmt(stmt);
+	return result;
+}
+
+static int
+statd_insert_monitored_host(const struct mon *monp, const char *protocol)
+{
+	sqlite3 *db;
+	int result;
+
+	result = 0;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		return result;
+
+	result = statd_insert_row(db, monp, protocol);
+	if (result)
+		ha_callout("add-client", monp->mon_id.mon_name,
+				monp->mon_id.my_id.my_name, -1);
+
+	statd_close_db(db);
+	return result;
+}
+
+/**
+ * statd_svc_mon_1 - service an SM_MON request
+ *
+ * Returns stat_succ if the host is monitored, otherwise stat_fail
+ * is returned.  Our current local NSM state is always returned.
+ */
+static void
+statd_svc_mon_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+	struct addrinfo *gai_results;
+	struct sm_stat_res result;
+	struct mon arg;
+
+	memset(&arg, 0, sizeof(arg));
+	if (!svc_getargs(xprt, (xdrproc_t)xdr_mon, (caddr_t)&arg)) {
+		xlog(L_WARNING, "Failed to decode SM_MON arguments");
+		svcerr_decode(xprt);
+		return;
+	}
+
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_MON request for %s from %s",
+				arg.mon_id.mon_name, buf);
+	}
+
+	/*
+	 * Reject requests from non-loopback addresses in order
+	 * to prevent attack described in CERT CA-99.05.
+	 */
+	if (!statd_localhost_caller(rqstp)) {
+		svcerr_auth(xprt, AUTH_FAILED);
+		return;
+	}
+
+	result.state = statd_get_nsm_state();
+	result.res_stat = stat_fail;
+
+	/*
+	 * Check that we can send the SM_NOTIFY later: ensure
+	 * that the mon_name can be resolved.
+	 *
+	 * XXX: Eventually we want to determine the transport
+	 *      protocol of the remote peer and use that instead.
+	 */
+	gai_results = statd_forward_lookup(arg.mon_id.mon_name, IPPROTO_UDP);
+	if (gai_results != NULL) {
+		if (statd_insert_monitored_host(&arg, "udp"))
+			result.res_stat = stat_succ;
+		freeaddrinfo(gai_results);
+	}
+
+	if (result.res_stat == stat_fail)
+		xlog(D_CALL, "Failed to monitor %s on behalf of %s",
+				arg.mon_id.mon_name, arg.mon_id.my_id.my_name);
+	else
+		xlog(D_CALL, "SM_MON success, state=%d", result.state);
+
+	if (!svc_sendreply(xprt, (xdrproc_t)xdr_sm_stat_res, &result)) {
+		xlog(L_WARNING, "Failed to send SM_MON reply");
+		svcerr_systemerr(xprt);
+	}
+
+	if (!svc_freeargs(xprt, (xdrproc_t)xdr_mon, (caddr_t)&arg))
+		xlog(L_WARNING, "Failed to free SM_MON arguments");
+}
+
+/*
+ * Delete the row for the corresponding host.  Returns TRUE if a row
+ * was deleted, otherwise FALSE if an error occurs or no row was found.
+ */
+static bool_t
+statd_delete_row(sqlite3 *db, const struct mon_id *mon_idp)
+{
+	char *err_msg;
+	bool_t result;
+	char *sql;
+
+	result = FALSE;
+
+	sql = sqlite3_mprintf("DELETE FROM " STATD_MONITOR_TABLENAME
+				" WHERE mon_name='%q' and my_name='%q' and"
+				" program=%d and version=%d and procedure=%d;",
+				mon_idp->mon_name, mon_idp->my_id.my_name,
+				mon_idp->my_id.my_prog,
+				mon_idp->my_id.my_vers,
+				mon_idp->my_id.my_proc);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate DELETE FROM statement");
+		return result;
+	}
+
+	err_msg = NULL;
+	if (sqlite3_exec(db, sql, NULL, 0, &err_msg) != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to delete mon_name '%s': %s",
+				mon_idp->mon_name, err_msg);
+		xlog(L_ERROR, "SQL: %s", sql);
+		sqlite3_free(err_msg);
+		goto out_free;
+	}
+
+	if (sqlite3_changes(db) == 0) {
+		xlog(D_GENERAL, "No record for mon_name '%s' was found",
+				mon_idp->mon_name);
+		goto out_free;
+	}
+
+	result = TRUE;
+	xlog(D_GENERAL, "Record for mon_name '%s' deleted",
+			mon_idp->mon_name);
+
+out_free:
+	sqlite3_free(sql);
+	return result;
+}
+
+static void
+statd_delete_monitored_host(const struct mon_id *mon_idp)
+{
+	sqlite3 *db;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		return;
+
+	if (statd_delete_row(db, mon_idp))
+		ha_callout("del-client", mon_idp->mon_name,
+				mon_idp->my_id.my_name, -1);
+
+	statd_close_db(db);
+}
+
+/**
+ * statd_svc_unmon_1 - service an SM_UNMON request
+ *
+ * Our current local NSM state is always returned.
+ */
+static void
+statd_svc_unmon_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+	struct mon_id arg;
+	struct sm_stat result;
+
+	memset(&arg, 0, sizeof(arg));
+	if (!svc_getargs(xprt, (xdrproc_t)xdr_mon_id, (caddr_t)&arg)) {
+		xlog(L_WARNING, "Failed to decode SM_UNMON arguments");
+		svcerr_decode(xprt);
+		return;
+	}
+
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_UNMON request for %s from %s",
+				arg.mon_name, buf);
+	}
+
+	/*
+	 * Reject requests from non-loopback addresses in order
+	 * to prevent attack described in CERT CA-99.05.
+	 */
+	if (!statd_localhost_caller(rqstp)) {
+		svcerr_auth(xprt, AUTH_FAILED);
+		return;
+	}
+
+	result.state = statd_get_nsm_state();
+	xlog(D_CALL, "SM_UNMON reply, state=%d", result.state);
+
+	if (!svc_sendreply(xprt, (xdrproc_t)xdr_sm_stat, &result)) {
+		xlog(L_WARNING, "Failed to send SM_UNMON reply");
+		svcerr_systemerr(xprt);
+	}
+
+	statd_delete_monitored_host(&arg);
+
+	if (!svc_freeargs(xprt, (xdrproc_t)xdr_mon_id, (caddr_t)&arg))
+		xlog(L_WARNING, "Failed to free SM_UNMON arguments");
+}
+
+static void
+statd_delete_monitored_hosts(const struct my_id *my_idp)
+{
+	int i, rc, nrow, ncolumn;
+	char **resultp;
+	char *err_msg;
+	sqlite3 *db;
+	char *sql;
+
+	db = statd_open_db(SQLITE_OPEN_READWRITE);
+	if (db == NULL)
+		return;
+
+	sql = sqlite3_mprintf("SELECT mon_name,my_name,program,version,procedure"
+				" FROM " STATD_MONITOR_TABLENAME
+				" WHERE my_name='%q' and"
+				" program=%d and version=%d and procedure=%d;",
+				my_idp->my_name, my_idp->my_prog,
+				my_idp->my_vers, my_idp->my_proc);
+	if (sql == NULL) {
+		xlog(L_ERROR, "Failed to generate SQL command in %s", __func__);
+		goto out_close;
+	}
+
+	err_msg = NULL;
+	rc = sqlite3_get_table(db, sql, &resultp, &nrow, &ncolumn, &err_msg);
+	sqlite3_free(sql);
+	if (rc != SQLITE_OK) {
+		xlog(L_ERROR, "Failed to get " STATD_MONITOR_TABLENAME
+				" table: %s", err_msg);
+		sqlite3_free(err_msg);
+		goto out_close;
+	}
+
+	if (ncolumn == 0) {
+		xlog(D_GENERAL, "No host records found");
+		goto out_free_table;
+	}
+	if (ncolumn != 5) {
+		xlog(L_ERROR, "Incorrect column count %d in SELECT result",
+				ncolumn);
+		goto out_free_table;
+	}
+
+	for (i = ncolumn; i < (nrow + 1) * ncolumn ; i += ncolumn) {
+		struct mon_id mon_id;
+
+		mon_id.mon_name = resultp[i];
+		mon_id.my_id.my_name = resultp[i + 1];
+		mon_id.my_id.my_prog = atoi(resultp[i + 2]);
+		mon_id.my_id.my_vers = atoi(resultp[i + 3]);
+		mon_id.my_id.my_proc = atoi(resultp[i + 4]);
+
+		if (statd_delete_row(db, &mon_id))
+			ha_callout("del-client", mon_id.mon_name,
+					mon_id.my_id.my_name, -1);
+	}
+
+out_free_table:
+	sqlite3_free_table(resultp);
+
+out_close:
+	statd_close_db(db);
+}
+
+/**
+ * statd_svc_unmon_all_1 - service an SM_UNMON_ALL request
+ *
+ * Our current local NSM state is always returned.
+ */
+static void
+statd_svc_unmon_all_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+	struct my_id arg;
+	struct sm_stat result;
+
+	/*
+	 * Reject requests from non-loopback addresses in order
+	 * to prevent attack described in CERT CA-99.05.
+	 */
+	if (!statd_localhost_caller(rqstp)) {
+		svcerr_auth(xprt, AUTH_FAILED);
+		return;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	if (!svc_getargs(xprt, (xdrproc_t)xdr_my_id, (caddr_t)&arg)) {
+		xlog(L_WARNING, "Failed to decode SM_UNMON_ALL arguments");
+		svcerr_decode(xprt);
+		return;
+	}
+
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_UNMON_ALL request for %s from %s",
+				arg.my_name, buf);
+	}
+
+	result.state = statd_get_nsm_state();
+	xlog(D_CALL, "SM_UNMON_ALL reply, state=%d", result.state);
+
+	if (!svc_sendreply(xprt, (xdrproc_t)xdr_sm_stat, &result)) {
+		xlog(L_WARNING, "Failed to send SM_UNMON_ALL reply");
+		svcerr_systemerr(xprt);
+	}
+
+	statd_delete_monitored_hosts(&arg);
+
+	if (!svc_freeargs(xprt, (xdrproc_t)xdr_my_id, (caddr_t)&arg))
+		xlog(L_WARNING, "Failed to free SM_UNMON_ALL arguments");
+}
+
+/**
+ * statd_svc_simu_crash_1 - service an SM_SIMU_CRASH request
+ *
+ */
+static void
+statd_svc_simu_crash_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_SIMU_CRASH request from %s", buf);
+	}
+
+	/*
+	 * Reject requests from non-loopback addresses in order
+	 * to prevent attack described in CERT CA-99.05.
+	 */
+	if (!statd_localhost_caller(rqstp)) {
+		svcerr_auth(xprt, AUTH_FAILED);
+		return;
+	}
+
+	statd_send_null_reply(xprt, "SM_SIMU_CRASH");
+
+	/*
+	 * This call may block temporarily while the monitor list
+	 * is retired.  Notifications will then proceed in a
+	 * separate process.
+	 */
+	(void)statd_notify(opt_name, opt_notify_port, SMN_MAX_RETRY);
+}
+
+/**
+ * statd_svc_notify_1 - service an SM_NOTIFY request
+ *
+ */
+static void
+statd_svc_notify_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+	const struct sockaddr *sap = (struct sockaddr *)
+					svc_getcaller(rqstp->rq_xprt);
+	struct stat_chge arg;
+
+	memset(&arg, 0, sizeof(arg));
+	if (!svc_getargs(xprt, (xdrproc_t)xdr_stat_chge, (caddr_t)&arg)) {
+		xlog(L_WARNING, "Failed to decode SM_NOTIFY arguments");
+		svcerr_decode(xprt);
+		return;
+	}
+
+	if (opt_debug) {
+		char buf[INET6_ADDRSTRLEN];
+
+		statd_caller(rqstp, buf, sizeof(buf));
+		xlog(D_CALL, "Received SM_NOTIFY request for %s "
+						"(new state %d) from %s",
+				arg.mon_name, arg.state, buf);
+	}
+
+	statd_send_null_reply(xprt, "SM_NOTIFY");
+
+	statd_queue_nlm_callback(arg.mon_name, arg.state, sap);
+
+	if (!svc_freeargs(xprt, (xdrproc_t)xdr_stat_chge, (caddr_t)&arg))
+		xlog(L_WARNING, "Failed to free SM_NOTIFY arguments");
+}
+
+/**
+ * statd_dispatch_1 - server-side entry point for RPC procedure calls
+ * @rqstp: information about the incoming request
+ * @xprt: the transport used to receive this request
+ *
+ * The NSM call is executed if allowed by TCP wrappers.
+ */
+static void
+statd_dispatch_1(struct svc_req *rqstp, SVCXPRT *xprt)
+{
+#ifdef HAVE_TCP_WRAPPER
+	struct sockaddr_in *sin = nfs_getrpccaller_in(xprt);
+
+	/*
+	 * XXX: Update this when tcp_wrappers supports AF_INET6
+	 */
+	if (sin->sin_family == AF_INET &&
+	    !check_default("statd", sin, rqstp->rq_proc, statd_nsm_program)) {
+		svcerr_auth(xprt, AUTH_FAILED);
+		return;
+	}
+#endif	/* HAVE_TCP_WRAPPER */
+
+	switch (rqstp->rq_proc) {
+	case NULLPROC:
+		statd_svc_null_1(rqstp, xprt);
+		break;
+	case SM_STAT:
+		statd_svc_stat_1(rqstp, xprt);
+		break;
+	case SM_MON:
+		statd_svc_mon_1(rqstp, xprt);
+		break;
+	case SM_UNMON:
+		statd_svc_unmon_1(rqstp, xprt);
+		break;
+	case SM_UNMON_ALL:
+		statd_svc_unmon_all_1(rqstp, xprt);
+		break;
+	case SM_SIMU_CRASH:
+		statd_svc_simu_crash_1(rqstp, xprt);
+		break;
+	case SM_NOTIFY:
+		statd_svc_notify_1(rqstp, xprt);
+		break;
+	default:
+		xlog(L_WARNING, "Unrecognized RPC procedure number %d",
+				rqstp->rq_proc);
+		svcerr_noproc(xprt);
+	}
+}
+
+void
+statd_unregister(void)
+{
+	svc_unreg(statd_nsm_program, SM_VERS);
+}
+
+/*
+ * Set up an appropriate bind address, given @port and @nconf.
+ *
+ * Returns getaddrinfo(3) results if successful.  Caller must
+ * invoke freeaddrinfo(3) on these results.
+ *
+ * Otherwise NULL is returned if an error occurs.
+ */
+static struct addrinfo *
+statd_svc_create_bindaddr(struct netconfig *nconf, const uint16_t port)
+{
+	struct addrinfo gai_hint = {
+		.ai_flags	= AI_PASSIVE | AI_NUMERICSERV,
+	};
+	struct addrinfo *gai_results;
+	char buf[8];
+	int error;
+
+	if (strcmp(nconf->nc_protofmly, NC_INET) == 0)
+		gai_hint.ai_family = AF_INET;
+#ifdef IPV6_SUPPORTED
+	else if (strcmp(nconf->nc_protofmly, NC_INET6) == 0)
+		gai_hint.ai_family = AF_INET6;
+#endif	/* IPV6_SUPPORTED */
+	else {
+		xlog(L_ERROR, "Unrecognized bind address family: %s",
+			nconf->nc_protofmly);
+		return NULL;
+	}
+
+	if (strcmp(nconf->nc_proto, NC_UDP) == 0)
+		gai_hint.ai_protocol = IPPROTO_UDP;
+	else if (strcmp(nconf->nc_proto, NC_TCP) == 0)
+		gai_hint.ai_protocol = IPPROTO_TCP;
+	else {
+		xlog(L_ERROR, "Unrecognized bind address protocol: %s",
+			nconf->nc_proto);
+		return NULL;
+	}
+
+	(void)snprintf(buf, sizeof(buf), "%u", port);
+	error = getaddrinfo(NULL, buf, &gai_hint, &gai_results);
+	switch (error) {
+	case 0:
+		return gai_results;
+	case EAI_SYSTEM:
+		xlog(L_ERROR, "Failed to construct bind address: %m");
+		break;
+	default:
+		xlog(L_ERROR, "Failed to construct bind address: %s",
+			gai_strerror(error));
+		break;
+	}
+
+	return NULL;
+}
+
+static int
+statd_svc_create_nconf(const rpcprog_t program, const rpcvers_t version,
+		const uint16_t port, struct netconfig *nconf)
+{
+	struct addrinfo *gai_results;
+	struct t_bind bindaddr;
+	SVCXPRT	*xprt;
+
+	gai_results = statd_svc_create_bindaddr(nconf, port);
+	if (gai_results == NULL)
+		return 0;
+
+	bindaddr.addr.buf = gai_results->ai_addr;
+	bindaddr.qlen = SOMAXCONN;
+
+	xprt = svc_tli_create(RPC_ANYFD, nconf, &bindaddr, 0, 0);
+	freeaddrinfo(gai_results);
+	if (xprt == NULL) {
+		xlog(L_ERROR, "Failed to create NSM xprt");
+		return 0;
+	}
+
+	if (!svc_reg(xprt, program, version, statd_dispatch_1, nconf)) {
+		/* svc_reg(3) destroys @xprt in this case */
+		xlog(L_ERROR, "Failed to register (statd, %u, %s).",
+				version, nconf->nc_proto);
+		return 0;
+	}
+
+	return 1;
+}
+
+/**
+ * statd_svc_create - start up NSM listeners
+ * @program: RPC program number to register
+ * @version: RPC version number to register
+ * @port: if not zero, transport listens on this port
+ *
+ * Sets up network transports for receiving NSM RPC requests, and starts
+ * the RPC dispatcher.  Normally does not return.
+ */
+void
+statd_svc_create(const rpcprog_t program, const rpcvers_t version,
+		const uint16_t port)
+{
+	const struct sigaction statd_create_sigaction = {
+		.sa_handler	= SIG_IGN,
+	};
+	unsigned int visible, up;
+	struct netconfig *nconf;
+	void *handlep;
+
+	/*
+	 * Ignore SIGPIPE to avoid exiting sideways when peers
+	 * close their TCP connection while we're trying to reply
+	 * to them.
+	 */
+	(void)sigaction(SIGPIPE, &statd_create_sigaction, NULL);
+
+	handlep = setnetconfig();
+	if (handlep == NULL) {
+		xlog(L_ERROR, "Failed to access local netconfig database: %s",
+			nc_sperror());
+		return;
+	}
+
+	visible = 0;
+	up = 0;
+	while ((nconf = getnetconfig(handlep)) != NULL) {
+		if (!(nconf->nc_flag & NC_VISIBLE))
+			continue;
+		visible++;
+		up += statd_svc_create_nconf(program, version, port, nconf);
+	}
+
+	if (visible == 0)
+		xlog(L_ERROR, "Failed to find any visible netconfig entries");
+
+	if (endnetconfig(handlep) == -1)
+		xlog(L_ERROR, "Failed to close local netconfig database: %s",
+			nc_sperror());
+
+	if (up > 0) {
+		atexit(statd_unregister);
+		svc_run();
+	}
+}


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/4] nfs-utils: introduce new statd implementation (3rd part)
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
  2009-08-05 14:45   ` [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part) Chuck Lever
  2009-08-05 14:45   ` [PATCH 2/4] nfs-utils: introduce new statd implementation (2nd part) Chuck Lever
@ 2009-08-05 14:46   ` Chuck Lever
  2009-08-05 14:46   ` [PATCH 4/4] nfs-utils: introduce new statd implementation (4th part) Chuck Lever
  2009-08-05 17:15   ` [PATCH 0/4] new statd [take 2] Jeff Layton
  4 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 14:46 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs

Introduce auxiliary files (Makefiles, etc.) and update actools to support
utils/new-statd.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 .gitignore                  |    4 +
 Makefile.am                 |    4 -
 aclocal/libcap.m4           |   15 ++
 aclocal/libsqlite3.m4       |   33 ++++
 aclocal/libtirpc.m4         |    2 
 configure.ac                |   12 ++
 utils/Makefile.am           |    7 +
 utils/new-statd/Makefile.am |  102 ++++++++++++++
 utils/new-statd/main.c      |  319 +++++++++++++++++++++++++++++++++++++++++++
 utils/new-statd/sm-notify.c |  155 +++++++++++++++++++++
 utils/new-statd/sm_inter.x  |  107 ++++++++++++++
 utils/new-statd/start-statd |   12 ++
 utils/new-statd/statd.h     |  135 ++++++++++++++++++
 utils/statd/Makefile.am     |    9 +
 14 files changed, 911 insertions(+), 5 deletions(-)
 create mode 100644 aclocal/libcap.m4
 create mode 100644 aclocal/libsqlite3.m4
 create mode 100644 utils/new-statd/Makefile.am
 create mode 100644 utils/new-statd/main.c
 create mode 100644 utils/new-statd/sm-notify.c
 create mode 100644 utils/new-statd/sm_inter.x
 create mode 100644 utils/new-statd/start-statd
 create mode 100644 utils/new-statd/statd.h

diff --git a/.gitignore b/.gitignore
index 632609e..30e9408 100644
--- a/.gitignore
+++ b/.gitignore
@@ -49,6 +49,8 @@ utils/rquotad/rquota.h
 utils/rquotad/rquota_xdr.c
 utils/showmount/showmount
 utils/statd/statd
+utils/new-statd/statd
+utils/new-statd/sm-notify
 tools/locktest/testlk
 tools/getiversion/getiversion
 support/export/mount.h
@@ -59,6 +61,8 @@ utils/statd/sm_inter.h
 utils/statd/sm_inter_clnt.c
 utils/statd/sm_inter_svc.c
 utils/statd/sm_inter_xdr.c
+utils/new-statd/sm_inter.h
+utils/new-statd/sm_inter_xdr.c
 # cscope database files
 cscope.*
 # generic editor backup et al
diff --git a/Makefile.am b/Makefile.am
index b3a6e91..d45dbde 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -54,10 +54,6 @@ install-data-hook:
 	touch $(DESTDIR)$(statedir)/xtab; chmod 644 $(DESTDIR)$(statedir)/xtab
 	touch $(DESTDIR)$(statedir)/etab; chmod 644 $(DESTDIR)$(statedir)/etab
 	touch $(DESTDIR)$(statedir)/rmtab; chmod 644 $(DESTDIR)$(statedir)/rmtab
-	mkdir -p $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak
-	touch $(DESTDIR)$(statedir)/state
-	chmod go-rwx $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak $(DESTDIR)$(statedir)/state
-	-chown $(statduser) $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak $(DESTDIR)$(statedir)/state
 
 uninstall-hook:
 	rm $(DESTDIR)$(statedir)/xtab
diff --git a/aclocal/libcap.m4 b/aclocal/libcap.m4
new file mode 100644
index 0000000..eabe507
--- /dev/null
+++ b/aclocal/libcap.m4
@@ -0,0 +1,15 @@
+dnl Checks for libcap.so
+dnl
+AC_DEFUN([AC_LIBCAP], [
+
+  dnl look for prctl
+  AC_CHECK_FUNC([prctl], , )
+
+  dnl look for the library; do not add to LIBS if found
+  AC_CHECK_LIB([cap], [cap_get_proc], [LIBCAP=-lcap], ,)
+  AC_SUBST(LIBCAP)
+
+  AC_CHECK_HEADERS([sys/capability.h], ,
+                   [AC_MSG_ERROR([libcap headers not found.])])
+
+])dnl
diff --git a/aclocal/libsqlite3.m4 b/aclocal/libsqlite3.m4
new file mode 100644
index 0000000..73d1e46
--- /dev/null
+++ b/aclocal/libsqlite3.m4
@@ -0,0 +1,33 @@
+dnl Checks for matching sqlite3 header and library, and
+dnl sufficient sqlite3 version.
+dnl
+AC_DEFUN([AC_SQLITE3_VERS], [
+  AC_CHECK_HEADERS([sqlite3.h], ,)
+
+  dnl look for the library; do not add to LIBS if found
+  AC_CHECK_LIB([sqlite3], [sqlite3_libversion_number], [LIBSQLITE=-lsqlite3], ,)
+  AC_SUBST(LIBSQLITE)
+
+  AC_MSG_CHECKING(for suitable sqlite3 version)
+
+  AC_CACHE_VAL([libsqlite3_cv_is_recent],
+   [
+    saved_LIBS="$LIBS"
+    LIBS=-lsqlite3
+    AC_TRY_RUN([
+	#include <stdio.h>
+	#include <sqlite3.h>
+	int main()
+	{
+		int vers = sqlite3_libversion_number();
+
+		return vers != SQLITE_VERSION_NUMBER ||
+			vers < 3003000;
+	}
+       ], [libsqlite3_cv_is_recent=yes], [libsqlite3_cv_is_recent=no],
+       [libsqlite3_cv_is_recent=unknown])
+    LIBS="$saved_LIBS"])
+
+  AC_MSG_RESULT($libsqlite3_cv_is_recent)
+  AM_CONDITIONAL(CONFIG_SQLITE3, [test "$libsqlite3_cv_is_recent" = "yes"])
+])dnl
diff --git a/aclocal/libtirpc.m4 b/aclocal/libtirpc.m4
index 9f0fde0..a20894c 100644
--- a/aclocal/libtirpc.m4
+++ b/aclocal/libtirpc.m4
@@ -37,4 +37,6 @@ AC_DEFUN([AC_LIBTIRPC], [
 
   fi
 
+  AM_CONDITIONAL([CONFIG_TIRPC], [test "$enable_tirpc" != "no"])
+
 ])dnl
diff --git a/configure.ac b/configure.ac
index e0ca70e..c9ef819 100644
--- a/configure.ac
+++ b/configure.ac
@@ -139,6 +139,9 @@ AC_ARG_ENABLE(ipv6,
 dnl Check for TI-RPC library and headers
 AC_LIBTIRPC
 
+dnl Check for -lcap
+AC_LIBCAP
+
 # Check whether user wants TCP wrappers support
 AC_TCP_WRAPPERS
 
@@ -256,6 +259,14 @@ fi
 dnl Check for IPv6 support
 AC_IPV6
 
+dnl Check for sqlite3
+AC_SQLITE3_VERS
+
+dnl Build new-statd if TI-RPC and a recent libsqlite3 is present
+if test "$enable_tirpc" = yes; then
+  AM_CONDITIONAL(CONFIG_NEWSTATD, [test "$libsqlite3_cv_is_recent" = "yes"])
+fi
+
 dnl *************************************************************
 dnl Check for headers
 dnl *************************************************************
@@ -389,6 +400,7 @@ AC_CONFIG_FILES([
 	utils/nfsd/Makefile
 	utils/nfsstat/Makefile
 	utils/showmount/Makefile
+	utils/new-statd/Makefile
 	utils/statd/Makefile])
 AC_OUTPUT
 
diff --git a/utils/Makefile.am b/utils/Makefile.am
index 8665183..3acae4c 100644
--- a/utils/Makefile.am
+++ b/utils/Makefile.am
@@ -2,6 +2,12 @@
 
 OPTDIRS =
 
+if CONFIG_NEWSTATD
+OPTDIRS += new-statd
+else
+OPTDIRS += statd
+endif
+
 if CONFIG_NFSV4
 OPTDIRS += idmapd
 endif
@@ -20,7 +26,6 @@ SUBDIRS = \
 	nfsd \
 	nfsstat \
 	showmount \
-	statd \
 	$(OPTDIRS)
 
 MAINTAINERCLEANFILES = Makefile.in
diff --git a/utils/new-statd/Makefile.am b/utils/new-statd/Makefile.am
new file mode 100644
index 0000000..281d54e
--- /dev/null
+++ b/utils/new-statd/Makefile.am
@@ -0,0 +1,102 @@
+## Process this file with automake to produce Makefile.in
+
+man8_MANS = statd.man sm-notify.man
+
+GENFILES_XDR	= sm_inter_xdr.c
+GENFILES_H	= sm_inter.h
+GENFILES	= $(GENFILES_XDR) $(GENFILES_H)
+
+RPCPREFIX	= rpc.
+KPREFIX		= @kprefix@
+sbin_PROGRAMS	= statd sm-notify
+dist_sbin_SCRIPTS	= start-statd
+
+statd_SOURCES = file.c hostname.c main.c nlmcall.c \
+		smncall.c svc.c statd.h file.h $(GENFILES)
+statd_CPPFLAGS = $(AM_CPPFLAGS) $(CPPFLAGS)
+statd_LDADD = $(top_builddir)/support/misc/libmisc.a \
+	      $(top_builddir)/support/nfs/libnfs.a \
+	      $(LIBWRAP) $(LIBSQLITE) $(LIBCAP)
+
+sm_notify_SOURCES = file.c hostname.c sm-notify.c smncall.c \
+		    statd.h file.h $(GENFILES)
+sm_notify_CPPFLAGS = $(AM_CPPFLAGS) $(CPPFLAGS)
+sm_notify_LDADD = $(top_builddir)/support/misc/libmisc.a \
+		  $(top_builddir)/support/nfs/libnfs.a \
+		  $(LIBSQLITE) $(LIBCAP)
+
+BUILT_SOURCES = $(GENFILES)
+
+EXTRA_DIST = $(man8_MANS) sm_inter.x
+
+dist-hook:
+	for f in $(GENFILES); do \
+	  rm ${distdir}/$$f; \
+	done
+
+if CONFIG_RPCGEN
+RPCGEN	= $(top_builddir)/tools/rpcgen/rpcgen
+$(RPCGEN):
+	make -C ../../tools/rpcgen all
+else
+RPCGEN = @RPCGEN_PATH@
+endif
+
+$(GENFILES_XDR): %_xdr.c: %.x $(RPCGEN)
+	test -f $@ && rm -rf $@ || true
+	$(RPCGEN) -c -o $@ $<
+
+$(GENFILES_H): %.h: %.x $(RPCGEN)
+	test -f $@ && rm -rf $@ || true
+	$(RPCGEN) -h -o $@ $<
+
+MAINTAINERCLEANFILES = Makefile.in
+
+CLEANFILES = $(GENFILES)
+
+#######################################################################
+# The following allows the current practice of having
+# daemons renamed during the install to include RPCPREFIX
+# and the KPREFIX
+# This could all be done much easier with program_transform_name
+# ( program_transform_name = s/^/$(RPCPREFIX)$(KPREFIX)/ )
+# but that also renames the man pages, which the current
+# practice does not do.
+#
+install-exec-hook:
+	(cd $(DESTDIR)$(sbindir) && \
+	  for p in $(sbin_PROGRAMS); do \
+	    [ $$p = sm-notify ] || mv -f $$p$(EXEEXT) $(RPCPREFIX)$(KPREFIX)$$p$(EXEEXT) ;\
+	  done)
+uninstall-hook:
+	(cd $(DESTDIR)$(sbindir) && \
+	  for p in $(sbin_PROGRAMS); do \
+	    [ $$p = sm-notify ] || rm -f $(RPCPREFIX)$(KPREFIX)$$p$(EXEEXT) ;\
+	  done)
+
+install-data-hook:
+	if [ ! -d $(DESTDIR)$(statedir) ]; then mkdir -p $(DESTDIR)$(statedir); chown $(statduser) $(DESTDIR)$(statedir); fi
+
+uninstall-hook:
+	rm $(DESTDIR)$(statedir)/statdb
+
+# XXX This makes some assumptions about what automake does.
+# XXX But there is no install-man-hook or install-man-local.
+#
+install-man: install-man8 install-man-links
+uninstall-man: uninstall-man8 uninstall-man-links
+
+install-man-links:
+	(cd $(DESTDIR)$(man8dir) && \
+	  for m in $(man8_MANS) $(dist_man8_MANS) $(nodist_man8_MANS); do \
+	    inst=`echo $$m | sed -e 's/man$$/8/'`; \
+	    rm -f $(RPCPREFIX)$$inst ; \
+	    $(LN_S) $$inst $(RPCPREFIX)$$inst ; \
+	  done)
+
+uninstall-man-links:
+	(cd $(DESTDIR)$(man8dir) && \
+	  for m in $(man8_MANS) $(dist_man8_MANS) $(nodist_man8_MANS); do \
+	    inst=`echo $$m | sed -e 's/man$$/8/'`; \
+	    rm -f $(RPCPREFIX)$$inst ; \
+	  done)
diff --git a/utils/new-statd/main.c b/utils/new-statd/main.c
new file mode 100644
index 0000000..f7b3cfa
--- /dev/null
+++ b/utils/new-statd/main.c
@@ -0,0 +1,319 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <sys/wait.h>
+
+#include <unistd.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <errno.h>
+#include <locale.h>
+
+#include "statd.h"
+#include "nfsrpc.h"
+
+static char statd_my_name[SM_MAXSTRLEN];
+static char statd_progname[PATH_MAX];
+static char statd_callout_prog[PATH_MAX];
+char *ha_callout_prog = NULL;
+
+int opt_debug, opt_foreground, opt_notify_port;
+char *opt_name = NULL;
+
+static const char *statd_nsm_pgmtbl[] = {
+	"status",
+	NULL,
+};
+rpcprog_t statd_nsm_program;
+
+static char statd_opts[] = "dFwh?H:n:p:o:P:";
+static struct option statd_longopts[] =
+{
+	{ "debug", 0, 0, 'd', },
+	{ "foreground", 0, 0, 'F', },
+	{ "ha-callout", 1, 0, 'H', },
+	{ "help", 0, 0, 'h', },
+	{ "name", 1, 0, 'n', },
+	{ "notify-port", 1, 0, 'o', },
+	{ "port", 1, 0, 'p', },
+	{ "state-directory-path", 1, 0, 'P', },
+	{ "warm-start", 0, 0, 'w', },
+	{ NULL, 0, 0, '\0', }
+};
+
+static void
+statd_usage(const char *progname)
+{
+	fprintf(stderr, "%s version " VERSION "\n", progname);
+	fprintf(stderr, "usage: %s [options]\n", progname);
+
+	fprintf(stderr, "\t-d, --debug          Enable verbose debug logging\n");
+	fprintf(stderr, "\t-F, --foreground     Foreground mode\n");
+	fprintf(stderr, "\t-H, --ha-callout     Specify a high-availability callout program\n");
+	fprintf(stderr, "\t-h, -?, --help       Print this help\n");
+	fprintf(stderr, "\t-n, --name           Mon_name for SM_NOTIFY requests\n");
+	fprintf(stderr, "\t-o, --notify-port    Source port for notifications\n");
+	fprintf(stderr, "\t-p, --port           Listener port for NSM protocol\n");
+	fprintf(stderr, "\t-P, --state-directory-path  State directory path\n");
+	fprintf(stderr, "\t-w, --warm-start     Do not notify\n");
+}
+
+static void
+statd_signal(int sig)
+{
+	xlog(L_WARNING, "Caught signal %d, unregistering and exiting", sig);
+
+	/*
+	 * POSIX says atexit function isn't called on signal.
+	 */
+	statd_unregister();
+
+	exit(EXIT_SUCCESS);
+}
+
+/*
+ * Prevent multiple copies of statd from running concurrently.
+ * The pid file is automatically unlocked when statd exits.
+ */
+static bool_t
+statd_lock_pidfile(void)
+{
+	const char *filename = STATD_PIDFILE;
+	struct flock pidlock = {
+		.l_type		= F_WRLCK,
+		.l_whence	= SEEK_SET,
+	};
+	char pidbuf[16];
+	ssize_t count;
+	bool_t result;
+	int len, fd;
+
+	result = FALSE;
+
+	len = snprintf(pidbuf, sizeof(pidbuf), "%u\n", getpid());
+	if (__error_check(len, sizeof(pidbuf))) {
+		xlog(L_ERROR, "Pid overflows write buffer");
+		return result;
+	}
+
+	fd = open(filename, O_CREAT|O_WRONLY|O_SYNC|O_NOFOLLOW,
+							S_IWUSR|S_IRUSR);
+	if (fd == -1) {
+		xlog(L_ERROR, "Failed to open %s: %m", filename);
+		return result;
+	}
+
+	if (fcntl(fd, F_SETLK, &pidlock) == -1) {
+		switch (errno) {
+		case EACCES:
+		case EAGAIN:
+			(void)fcntl(fd, F_GETLK, &pidlock);
+			xlog(L_ERROR, "Another statd is running, pid: %u",
+					pidlock.l_pid);
+			break;
+		default:
+			xlog(L_ERROR, "Failed to lock %s: %m", filename);
+		}
+		goto out;
+	}
+
+	if (ftruncate(fd, 0) == -1) {
+		xlog(L_ERROR, "Failed to truncate %s: %m", filename);
+		goto out;
+	}
+
+	count = write(fd, pidbuf, len);
+	if (__exact_error_check(count, len)) {
+		xlog(L_ERROR, "Failed to write %s: %m", filename);
+		goto out;
+	}
+
+	result = TRUE;
+	xlog(D_GENERAL, "Lock acquired on pidfile. Pid: %u", getpid());
+
+out:
+	return result;
+}
+
+int
+main(int argc, char **argv)
+{
+	const struct sigaction statd_sigaction = {
+		.sa_handler	= statd_signal,
+	};
+	char *opt_state_directory = NULL;
+	int opt_listen_port = 0;
+	int opt_warmstart = 0;
+	int arg;
+
+	(void)setlocale(LC_ALL, "");
+	(void)umask(S_IRWXO);
+	xlog_stderr(0);
+	xlog_syslog(1);
+
+	strncpy(statd_progname, basename(argv[0]), sizeof(statd_progname));
+
+	while ((arg = getopt_long(argc, argv,
+					statd_opts, statd_longopts, NULL)) != EOF) {
+		switch (arg) {
+		case 'd':
+			xlog_config(D_ALL, 1);
+			opt_debug++;
+			break;
+		case 'F':
+			xlog_stderr(1);
+			opt_foreground++;
+			break;
+		case 'H':
+			if (strlen(optarg) > sizeof(statd_callout_prog)) {
+				fprintf(stderr, "%s: Invalid HA callout program\n",
+					statd_progname);
+				exit(EXIT_FAILURE);
+			}
+			strncpy(statd_callout_prog, optarg,
+						sizeof(statd_callout_prog));
+			ha_callout_prog = statd_callout_prog;
+			break;
+		case '?':
+		case 'h':
+			statd_usage(statd_progname);
+			exit(EXIT_SUCCESS);
+		case 'n':
+			if (strlen(optarg) > sizeof(statd_my_name)) {
+				fprintf(stderr, "%s: mon_name too long\n",
+					statd_progname);
+				exit(EXIT_FAILURE);
+			}
+			strncpy(statd_my_name, optarg,
+						sizeof(statd_my_name));
+			opt_name = statd_my_name;
+			break;
+		case 'o':
+			opt_notify_port = atoi(optarg);
+			if (opt_notify_port < 1 || opt_notify_port > 65535) {
+				fprintf(stderr,
+					"%s: Invalid notify port number: %s\n",
+					statd_progname, optarg);
+				statd_usage(statd_progname);
+				exit(EXIT_FAILURE);
+			}
+			break;
+		case 'p':
+			opt_listen_port = atoi(optarg);
+			if (opt_listen_port < 1 || opt_listen_port > 65535) {
+				fprintf(stderr,
+					"%s: Invalid listener port number: %s\n",
+					statd_progname, optarg);
+				statd_usage(statd_progname);
+				exit(EXIT_FAILURE);
+			}
+			break;
+		case 'P':
+			opt_state_directory = optarg;
+			break;
+		case 'w':
+			opt_warmstart++;
+			break;
+		default:
+			statd_usage(statd_progname);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	if (opt_listen_port != 0 && opt_notify_port != 0 &&
+	    opt_listen_port == opt_notify_port) {
+		fprintf(stderr, "Listening and notification ports "
+				"cannot be the same\n");
+		exit(EXIT_FAILURE);
+	}
+
+	xlog_open(statd_progname);
+
+	if (!opt_foreground) {
+		if (daemon(1, 0) < 0) {
+			xlog(L_ERROR, "Failed to background: %m");
+			exit(EXIT_FAILURE);
+		}
+	}
+	xlog(L_NOTICE, "Version " VERSION " starting");
+
+	/*
+	 * ORDER
+	 *
+	 * Make sure the database pathname is validated and
+	 * set up before forking the NLM callback process, to
+	 * ensure it can access the database.
+	 */
+	if (!statd_check_pathname(opt_state_directory))
+		exit(EXIT_FAILURE);
+
+	if (!statd_lock_pidfile())
+		exit(EXIT_FAILURE);
+
+	(void)sigaction(SIGHUP, &statd_sigaction, NULL);
+	(void)sigaction(SIGINT, &statd_sigaction, NULL);
+	(void)sigaction(SIGTERM, &statd_sigaction, NULL);
+
+	if (!statd_init_nlm_callback())
+		exit(EXIT_FAILURE);
+
+	/*
+	 * ORDER
+	 *
+	 * Clear previous NSM listener registrations while we're still
+	 * root.  This should guarantee that we can rip out any previous
+	 * registrations, no matter who made them.
+	 */
+	statd_nsm_program = nfs_getrpcbyname(SM_PROG, statd_nsm_pgmtbl);
+	if (!rpcb_unset(statd_nsm_program, SM_VERS, NULL)) {
+		xlog(L_ERROR, "Failed to remove old NSM registrations");
+		exit(EXIT_FAILURE);
+	}
+
+	if (!statd_drop_privileges(0))
+		exit(EXIT_FAILURE);
+
+	/*
+	 * Note: only the client side needs to send notifications
+	 *       automatically during a system boot.  Server side
+	 *       should invoke statd with opt_warmstart.  More
+	 *       sophisticated logic is forthcoming.
+	 */
+	if (!opt_warmstart && statd_system_rebooted())
+		(void)statd_notify(opt_name, opt_notify_port, SMN_MAX_RETRY);
+	else {
+		xlog(L_NOTICE, "Warm start: preserving monitor list and NSM state");
+		if (opt_debug)
+			statd_dump_monitor_list();
+	}
+
+	statd_svc_create(statd_nsm_program, SM_VERS, opt_listen_port);
+
+	xlog(L_ERROR, "RPC service exited");
+	exit(EXIT_FAILURE);
+}
diff --git a/utils/new-statd/sm-notify.c b/utils/new-statd/sm-notify.c
new file mode 100644
index 0000000..0d3f4c3
--- /dev/null
+++ b/utils/new-statd/sm-notify.c
@@ -0,0 +1,155 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ */
+
+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <errno.h>
+#include <getopt.h>
+#include <locale.h>
+
+#include "statd.h"
+
+int opt_debug;
+
+static char smn_progname[PATH_MAX];
+
+static char smn_opts[] = "dfh?m:n:o:P:";
+static struct option smn_longopts[] = {
+	{ "debug", 0, NULL, 'd', },
+	{ "force", 0, NULL, 'f', },
+	{ "help", 0, NULL, 'h', },
+	{ "max-retry", 1, NULL, 'm', },
+	{ "name", 1, NULL, 'n', },
+	{ "notify-port", 1, NULL, 'o', },
+	{ "state-directory-path", 1, NULL, 'P', },
+	{ NULL, 0, NULL, 0, },
+};
+
+static void
+smn_usage(const char *progname)
+{
+	fprintf(stderr, "%s version " VERSION "\n", progname);
+	fprintf(stderr, "usage: %s [options]\n", progname);
+
+	fprintf(stderr, "\t-d, --debug          Enable verbose debug logging\n");
+	fprintf(stderr, "\t-f, --force          Force notification\n");
+	fprintf(stderr, "\t-h, -?, --help       Print this help\n");
+	fprintf(stderr, "\t-m, --max-retry      Minutes to run\n");
+	fprintf(stderr, "\t-n, --name           Specify bind address/mon_name\n");
+	fprintf(stderr, "\t-o, --notify-port    Source port for notifications\n");
+	fprintf(stderr, "\t-P, --state-directory-path  State directory path\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	unsigned long opt_max_retry = SMN_MAX_RETRY;;
+	char *opt_state_directory = NULL;
+	static char mon_name[NI_MAXHOST];
+	char *opt_mon_name = NULL;
+	int opt_notify_port = 0;
+	int opt_force = 0;
+	int arg, cap_net_bind;
+
+	(void)setlocale(LC_ALL, "");
+	(void)umask(S_IRWXO);
+	xlog_stderr(0);
+	xlog_syslog(1);
+
+	strncpy(smn_progname, basename(argv[0]), sizeof(smn_progname));
+
+	while ((arg = getopt_long(argc, argv,
+					smn_opts, smn_longopts, NULL)) != -1) {
+		switch (arg) {
+		case 'd':
+			xlog_config(D_ALL, 1);
+			xlog_stderr(1);
+			opt_debug++;
+			break;
+		case 'f':
+			opt_force++;
+			break;
+		case '?':
+		case 'h':
+			smn_usage(smn_progname);
+			exit(EXIT_SUCCESS);
+		case 'm':
+			opt_max_retry = atoi(optarg) * 60;
+			break;
+		case 'n':
+			if (strlen(optarg) > sizeof(mon_name)) {
+				fprintf(stderr, "%s: Invalid bind address\n",
+					smn_progname);
+				exit(EXIT_FAILURE);
+			}
+			strncpy(mon_name, optarg, sizeof(mon_name));
+			opt_mon_name = mon_name;
+			break;
+		case 'o':
+			opt_notify_port = atoi(optarg);
+			if (opt_notify_port < 1 || opt_notify_port > 65535) {
+				fprintf(stderr, "%s: Invalid port number: %s\n",
+					smn_progname, optarg);
+				smn_usage(smn_progname);
+				exit(EXIT_FAILURE);
+			}
+			break;
+		case 'P':
+			opt_state_directory = optarg;
+			break;
+		default:
+			smn_usage(smn_progname);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	xlog_open(smn_progname);
+
+	if (!opt_debug) {
+		if (daemon(1, 0) < 0) {
+			xlog(L_ERROR, "Failed to background: %m");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	if (!statd_check_pathname(opt_state_directory))
+		exit(EXIT_FAILURE);
+
+	cap_net_bind = (opt_notify_port > 0 && opt_notify_port < 1024);
+	if (!statd_drop_privileges(cap_net_bind))
+		exit(EXIT_FAILURE);
+
+	xlog(L_NOTICE, "Version " VERSION " starting");
+
+	if (opt_force || statd_system_rebooted()) {
+		if (!statd_notify(opt_mon_name, opt_notify_port, opt_max_retry))
+			exit(EXIT_FAILURE);
+	} else
+		xlog(L_NOTICE, "No hosts to notify");
+
+	exit(EXIT_SUCCESS);
+}
diff --git a/utils/new-statd/sm_inter.x b/utils/new-statd/sm_inter.x
new file mode 100644
index 0000000..413d3e5
--- /dev/null
+++ b/utils/new-statd/sm_inter.x
@@ -0,0 +1,107 @@
+%/*
+% * Copyright 2009 Oracle.  All rights reserved.
+% *
+% * This file is part of nfs-utils.
+% *
+% * nfs-utils is free software; you can redistribute it and/or modify
+% * it under the terms of the GNU General Public License as published by
+% * the Free Software Foundation; either version 2 of the License, or
+% * (at your option) any later version.
+% *
+% * nfs-utils is distributed in the hope that it will be useful,
+% * but WITHOUT ANY WARRANTY; without even the implied warranty of
+% * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+% * GNU General Public License for more details.
+% *
+% * You should have received a copy of the GNU General Public License
+% * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+% */
+
+%/*
+% * NSM for Linux.
+% */
+
+/*
+ * RPCL definition of Network Status Monitor, version 1, based on
+ * The Open Group's "Protocols for Interworking: XNFS, Version 3W"
+ */
+
+const SM_MAXSTRLEN = 1024;
+const SM_PRIV_SIZE = 16;
+
+enum res {
+	stat_succ = 0,
+	stat_fail = 1
+};
+
+struct sm_name {
+	string		mon_name<SM_MAXSTRLEN>;
+};
+
+struct sm_stat_res {
+	res		res_stat;
+	int		state;
+};
+
+struct sm_stat {
+	int		state;
+};
+
+struct my_id {
+	string		my_name<SM_MAXSTRLEN>;
+	int		my_prog;
+	int		my_vers;
+	int		my_proc;
+};
+
+struct mon_id {
+	string		mon_name<SM_MAXSTRLEN>;
+	struct		my_id my_id;
+};
+
+struct mon {
+	struct mon_id	mon_id;
+	opaque		priv[SM_PRIV_SIZE];
+};
+
+struct stat_chge {
+	string		mon_name<SM_MAXSTRLEN>;
+	int		state;
+};
+
+program SM_PROG {
+	version SM_VERS {
+		void
+		SM_NULL(void) = 0;
+
+		struct sm_stat_res
+		SM_STAT(struct sm_name) = 1;
+
+		struct sm_stat_res
+		SM_MON(struct mon) = 2;
+
+		struct sm_stat
+		SM_UNMON(struct mon_id) = 3;
+
+		struct sm_stat
+		SM_UNMON_ALL(struct my_id) = 4;
+
+		void
+		SM_SIMU_CRASH(void) = 5;
+
+		void
+		SM_NOTIFY(struct stat_chge) = 6;
+	} = 1;
+} = 100024;
+
+/*
+ * This data type is used for the argument of an unnamed dynamic
+ * NLM RPC procedure call (meaning lockd tells statd what procedure
+ * call number to use) that is private to the Linux lockd and
+ * statd implementation.
+ */
+struct nlm_reboot {
+	string		mon_name<SM_MAXSTRLEN>;
+	int		state;
+	opaque		priv[SM_PRIV_SIZE];
+};
diff --git a/utils/new-statd/start-statd b/utils/new-statd/start-statd
new file mode 100644
index 0000000..da0f65f
--- /dev/null
+++ b/utils/new-statd/start-statd
@@ -0,0 +1,12 @@
+#!/bin/sh -p
+#
+# Sample start-statd script
+#
+# mount.nfs execs this script when mounting a filesystem with locking
+# enabled, but when statd does not seem to be running
+#
+# It should run run statd with whatever flags are appropriate for
+# this site.
+#
+PATH=/sbin:/usr/sbin
+exec rpc.statd
diff --git a/utils/new-statd/statd.h b/utils/new-statd/statd.h
new file mode 100644
index 0000000..b762b03
--- /dev/null
+++ b/utils/new-statd/statd.h
@@ -0,0 +1,135 @@
+/*
+ * Copyright 2009 Oracle.  All rights reserved.
+ *
+ * This file is part of nfs-utils.
+ *
+ * nfs-utils is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * nfs-utils is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with nfs-utils.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * NSM for Linux.
+ */
+
+#ifndef _NFS_UTILS_STATD_STATD_H
+#define _NFS_UTILS_STATD_STATD_H
+
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include <netdb.h>
+#include <time.h>
+#include <sqlite3.h>
+
+#include <rpc/types.h>
+#include <rpc/rpc.h>
+#include <rpc/svc.h>
+
+#include "sm_inter.h"
+#include "xlog.h"
+
+#define STATD_MONITOR_TABLENAME		"monitor"
+#define STATD_NOTIFY_TABLENAME		"notify"
+#define STATD_INFO_TABLENAME		"info"
+
+#define STATD_KERNEL_NSM_STATE		"/proc/sys/fs/nfs/nsm_local_state"
+#define STATD_PIDFILE			"/var/run/rpc.statd.pid"
+
+/*
+ * Retry timeout for individual NLM callbacks
+ */
+#define STATD_NLM_CALLBACK_TIMEOUT	(30)
+
+/*
+ * Number of times to retry an NLM callback before giving up
+ */
+#define STATD_NLM_CALLBACK_RETRIES	(10)
+
+/*
+ * Retry timeout for individual notification requests
+ */
+#define SMN_TIMEOUT		(5)	/* in seconds */
+
+/*
+ * Timeout for notifying all hosts
+ */
+#define SMN_MAX_RETRY		(15)	/* in minutes */
+
+extern rpcprog_t	statd_nsm_program;
+extern int		statd_my_state;
+
+extern int		opt_debug, opt_foreground, opt_notify_port;
+extern char		*opt_name;
+
+/* file.c */
+extern bool_t	statd_check_pathname(const char *parentdir);
+extern bool_t	statd_drop_privileges(const int keep_bind);
+
+extern int	statd_get_nsm_state(void);
+extern bool_t	statd_update_nsm_state(sqlite3 *db);
+extern bool_t	statd_system_rebooted(void);
+
+extern sqlite3 *statd_open_db(int flags);
+extern void	statd_close_db(sqlite3 *db);
+extern bool_t	statd_prepare_stmt(sqlite3 *db, sqlite3_stmt **stmt,
+				const char *sql);
+extern void	statd_finalize_stmt(sqlite3_stmt *stmt);
+extern bool_t	statd_begin_transaction(sqlite3 *db);
+extern void	statd_end_transaction(sqlite3 *db);
+extern void	statd_rollback_transaction(sqlite3 *db);
+
+extern void	statd_dump_monitor_list(void);
+extern void	statd_print_capabilities(void);
+
+/* hostname.c */
+extern bool_t	statd_localhost_caller(const struct svc_req *rqstp);
+extern struct addrinfo *
+		statd_forward_lookup(const char *hostname, int protocol);
+extern struct addrinfo *
+		statd_get_address_list(const char *hostname,
+				const struct addrinfo *gai_hint);
+extern bool_t	statd_match_hostname(const char *hostname1,
+				const char *hostname2);
+extern bool_t	statd_match_address(const char *hostname,
+				const struct sockaddr *sap);
+
+/* nlmcall.c */
+extern void	statd_queue_nlm_callback(const char *mon_name,
+				const int state, const struct sockaddr *sap);
+extern bool_t	statd_init_nlm_callback(void);
+
+/*
+ * smncall.c
+ */
+extern bool_t	statd_notify(const char *bindaddr, const unsigned short srcport,
+				const unsigned long max_retry_minutes);
+
+/* svc.c */
+extern void	statd_unregister(void);
+extern void	statd_svc_create(const rpcprog_t program,
+				const rpcvers_t version,
+				const uint16_t port);
+
+static inline int
+__error_check(const int len, const size_t buflen)
+{
+	return (len < 0) || ((size_t)len >= buflen);
+}
+
+static inline int
+__exact_error_check(const ssize_t len, const size_t buflen)
+{
+	return (len < 0) || ((size_t)len != buflen);
+}
+
+#endif	/* !_NFS_UTILS_STATD_STATD_H */
diff --git a/utils/statd/Makefile.am b/utils/statd/Makefile.am
index 8a3ba4e..080bce1 100644
--- a/utils/statd/Makefile.am
+++ b/utils/statd/Makefile.am
@@ -75,6 +75,15 @@ uninstall-hook:
 	    [ $$p = sm-notify ] || rm -f $(RPCPREFIX)$(KPREFIX)$$p$(EXEEXT) ;\
 	  done)
 
+install-data-hook:
+	if [ ! -d $(DESTDIR)$(statedir) ]; then mkdir -p $(DESTDIR)$(statedir); fi
+	mkdir -p $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak
+	touch $(DESTDIR)$(statedir)/state
+	chmod go-rwx $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak $(DESTDIR)$(statedir)/state
+	-chown $(statduser) $(DESTDIR)$(statedir)/sm $(DESTDIR)$(statedir)/sm.bak $(DESTDIR)$(statedir)/state
+
+uninstall-hook:
+	rm $(DESTDIR)$(statedir)/state
 
 # XXX This makes some assumptions about what automake does.
 # XXX But there is no install-man-hook or install-man-local.


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/4] nfs-utils: introduce new statd implementation (4th part)
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-08-05 14:46   ` [PATCH 3/4] nfs-utils: introduce new statd implementation (3rd part) Chuck Lever
@ 2009-08-05 14:46   ` Chuck Lever
  2009-08-05 17:15   ` [PATCH 0/4] new statd [take 2] Jeff Layton
  4 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 14:46 UTC (permalink / raw)
  To: steved; +Cc: linux-nfs

Introduce new man pages for new statd and sm-notify.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 utils/new-statd/sm-notify.man |  286 ++++++++++++++++++++++++++++++
 utils/new-statd/statd.man     |  388 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 674 insertions(+), 0 deletions(-)
 create mode 100644 utils/new-statd/sm-notify.man
 create mode 100644 utils/new-statd/statd.man

diff --git a/utils/new-statd/sm-notify.man b/utils/new-statd/sm-notify.man
new file mode 100644
index 0000000..b38acf8
--- /dev/null
+++ b/utils/new-statd/sm-notify.man
@@ -0,0 +1,286 @@
+.\"@(#)sm-notify.8"
+.\"
+.\" Copyright 2009 Oracle.  All rights reserved.
+.\"
+.TH SM-NOTIFY 8 "1 July 2009
+.SH NAME
+sm-notify \- send reboot notification to NFS peers
+.SH SYNOPSIS
+.BI "/usr/sbin/sm-notify [-dfh] [-m " minutes "] [-n " name "] [-o " port "] [-P " path "]
+.SH DESCRIPTION
+Locks held on files are not part of persistent file system state.
+Lock state is thus lost when a host reboots.
+.PP
+NFS file locking must also detect when lock state is lost because a remote
+host has rebooted.
+After an NFS client reboots, an NFS server must release all locks held by
+applications that were running on that client.
+After a server reboots, a client must remind the
+server of locks held by applications running on that client.
+.PP
+For NFS version 2 and version 3, a network protocol known as the
+.I Network Status Monitor
+protocol (or NSM for short)
+is used to notify NFS peers of reboots.
+On Linux, two separate user-space components constitute the NSM service:
+.TP
+.B sm-notify
+A helper program that notifies NFS peers after the local system reboots
+.TP
+.B rpc.statd
+A daemon that listens for reboot notifications from other hosts, and
+manages the list of hosts to be notified when the local system reboots
+.PP
+When a system reboots, system initialization starts
+.BR rpc.statd ,
+which then typically invokes
+.B sm-notify
+automatically.
+The local NFS lock manager indicates to
+.B rpc.statd
+which remote hosts to monitor.
+.SS NSM operation
+The first file locking interaction between an NFS client and server causes
+the NFS lock managers on both peers to contact their local NSM service to
+store information about their respective remote peers.
+.PP
+On Linux,
+the lock manager contacts
+.BR rpc.statd ,
+which records on persistent storage information about each monitored NFS peer.
+This information describes how to contact a remote peer
+in case the local system reboots, and how to notify the local
+lockd when a monitored peer indicates to
+.B rpc.statd
+it has rebooted.
+.PP
+An NFS client sends a hostname in each file lock request to NFS servers
+that can be used to call the client back.
+This hostname is known as the client's
+.IR caller_name .
+An NFS server can use this hostname to send asynchronous GRANT
+calls to a client, and
+it can also use this hostname when notifying clients of reboots.
+An NFS server may use this hostname,
+or it may simply scrape the client's network address
+from the underlying transport and use that instead,
+when calling a client back.
+.PP
+The hostname or network address of the client is handed to
+.B rpc.statd
+during the first locking interaction between an NFS client and server.
+This is known as the monitored peer's
+.IR mon_name .
+In addition, the local lockd tells
+.B rpc.statd
+what it thinks its own hostname is.
+This hostname is known as
+.IR my_name .
+.PP
+NFS clients do not actually know what an NFS server's
+.I mon_name
+might be, so they usually record the hostname and/or
+network address used to mount the server.
+.SS Reboot notification
+After a reboot,
+.B sm-notify
+reads the list of monitored peers from persistent storage and
+sends an SM_NOTIFY request
+to the NSM service on each listed remote peer,
+using each
+.I mon_name
+as the destination, and the
+.I my_name
+string to identify which host has rebooted.
+.PP
+After receiving an SM_NOTIFY request from a remote peer,
+.B rpc.statd
+contacts its local lock manager,
+which initiates appropriate lock recovery.
+.B rpc.statd
+does not forward all reboot notifications it receives to the local lockd,
+however.
+It matches incoming SM_NOTIFY requests by their
+.I mon_name
+or network address to one or more peers on its own monitor list.
+If
+.B rpc.statd
+does not find a peer on its monitor list that matches
+an incoming SM_NOTIFY request, the request is ignored.
+.PP
+In addition, each peer has its own
+.IR "state number" ,
+which is a 32-bit integer that is bumped after each reboot.
+This state number is communicated between two peers so that they
+can distinguish between actual reboots and replayed notifications.
+.PP
+The value of state number is managed on Linux by
+.BR rpc.statd " and " sm-notify .
+The NSM state number is sent by NFS clients in each NLM LOCK request.
+The client does not discover the server's state number until the
+server sends an SM_NOTIFY request.
+.PP
+After reboot notifications have been
+sent to each peer on the list of monitored hosts,
+.B rpc.statd
+clears the monitor list, since, as noted above,
+lock state does not persist across reboots.
+Part of NFS lock recovery is rediscovering
+which peers need to be monitored again.
+.SH OPTIONS
+.TP
+.BR -d , " --debug
+Keeps
+.B sm-notify
+in the foreground while sending reboot notifications,
+and causes it to generate extra messages
+so that notification progress may be observed directly.
+.TP
+.BR -h , " --help
+Displays version and usage information on
+.IR stderr .
+.TP
+.BR -f , " --force
+Disables detection of system reboots, forcing notification to occur.
+If this option is not specified,
+.B sm-notify
+bumps the NSM state number, clears
+.BR rpc.statd 's
+monitor list, and
+sends reboot notifications only once after each system reboot;
+subsequent invocations of
+.B sm-notify
+become no-ops until the system is rebooted.
+.TP
+.BI "\-m, " "" "\-\-max-retry " retry-time
+Specifies the length of time, in minutes, to continue retrying
+notifications to unresponsive hosts.
+If this option is not specified,
+.B sm-notify
+tries to send notifications for 15 minutes.
+.IP
+Specifying a value of 0 causes
+.B sm-notify
+to continue sending notifications to unresponsive peers
+until it is manually killed.
+.TP
+.BI "\-n, " "" "\-\-name " ipaddr " | " hostname
+Specifies the source network address or hostname from which
+to send reboot notification requests.
+If this option is not specified,
+.B sm-notify
+uses an appropriate ANYADDR as the transport bind address,
+and sends the local lockd's
+.I caller_name
+as the
+.I mon_name
+argument to SM_NOTIFY.
+.IP
+This option can be useful in multi-homed configurations where
+the remote requires notification from a specific network address.
+.TP
+.BI "\-o, " "" "\-\-notify\-port " port
+Specifies the source port number for the network socket used to
+send reboot notifications.
+If this option is not specified, a randomly chosen ephemeral port is used.
+.IP
+This option can be used to traverse a firewall between client and server.
+.TP
+.BI "\-P, " "" \-\-state\-directory\-path " pathname
+Specifies a pathname to the parent directory
+on the local system where NSM state information is kept.
+If this option is not specified,
+.B sm-notify
+uses
+.I /var/lib/nfs/statd
+by default.
+.IP
+After starting,
+.B sm-notify
+attempts to set its effective UID and GID to the owner
+and group of this directory.
+.SH SECURITY
+The
+.B sm-notify
+program must be started as root to acquire privileges needed
+to access the state information database.
+It drops root privileges
+as soon as it starts up to reduce the risk of a privilege escalation attack.
+During normal operation,
+the effective user ID it chooses is the owner of the state directory.
+This allows it to continue to access files in that directory after it
+has dropped its root privileges.
+To control which user ID
+.B rpc.statd
+chooses, simply use
+.BR chown (1)
+to set the owner of
+the state directory.
+.SH NOTES
+Lock recovery after a reboot is critical to maintaining data integrity
+and preventing unnecessary application hangs.
+.PP
+To help
+.B rpc.statd
+match SM_NOTIFY requests to NLM requests, a number of best practices
+should be observed, including:
+.IP
+The UTS nodename of your systems should match the DNS names that NFS
+peers use to contact them
+.IP
+The nodenames should all be fully qualified domain names
+.IP
+The forward and reverse DNS mapping of the nodenames should be
+consistent
+.IP
+The hostname the client uses to contact the server should match the
+server's mon_name in SM_NOTIFY requests it sends
+.PP
+The use of network addresses for either argument should be avoided when
+interoperating with non-Linux NFS implementations.
+.PP
+Unmounting an NFS file system does not necessarily stop
+either the NFS client or server from monitoring each other.
+Both may continue monitoring each other for a time in case subsequent
+NFS traffic between the two results in fresh mounts and additional
+file locking.
+.PP
+On Linux, if the
+.B lockd
+kernel module is unloaded during normal operation,
+all remote NFS peers are unmonitored.
+This can happen on an NFS client, for example,
+if an automounter removes all NFS mount
+points due to inactivity.
+.SS IPv6 and TI-RPC support
+This version of
+.B sm-notify
+supports IPv6 networking.
+RPC over IPv6 requires TI-RPC, implemented on Linux via
+.BR libtirpc .
+.B libtirpc
+is required to be available on the local system for this version of
+.BR sm-notify .
+.PP
+.B sm-notify
+will choose an appropriate IPv4 or IPv6 transport
+based on the network address returned by DNS for each remote peer.
+It should be fully compatible with remote systems that do not support TI-RPC
+or IPv6.
+.SH FILES
+.TP 2.5i
+.I /var/lib/nfs/statd/statdb
+default NSM state information database
+.SH SEE ALSO
+.BR rpc.statd (8),
+.BR nfs (5),
+.BR uname (2),
+.BR hostname (7)
+.PP
+RFC 1094 - "NFS: Network File System Protocol Specification"
+.br
+RFC 1813 - "NFS Version 3 Protocol Specification"
+.br
+OpenGroup Protocols for Interworking: XNFS, Version 3W - Chapter 11
+.SH AUTHORS
+Chuck Lever <chuck.lever@oracle.com>
diff --git a/utils/new-statd/statd.man b/utils/new-statd/statd.man
new file mode 100644
index 0000000..66508db
--- /dev/null
+++ b/utils/new-statd/statd.man
@@ -0,0 +1,388 @@
+.\"@(#)rpc.statd.8"
+.\"
+.\" Copyright 2009 Oracle.  All rights reserved.
+.\"
+.TH RPC.STATD 8 "1 July 2009
+.SH NAME
+rpc.statd \- NSM service daemon
+.SH SYNOPSIS
+.BI "rpc.statd [-dhfw] [-H " prog "] [-n " my-name "] [-o " notify-port "] [-p " port "] [-P " path " ]
+.SH DESCRIPTION
+Locks held on files are not part of persistent file system state.
+Lock state is thus lost when a host reboots.
+.PP
+NFS file locking must also detect when lock state is lost because a remote
+host has rebooted.
+After an NFS client reboots, an NFS server must release all locks held by
+applications that were running on that client.
+After a server reboots, a client must remind the
+server of locks held by applications running on that client.
+.PP
+For NFS version 2 and version 3, a network protocol known as the
+.I Network Status Monitor
+protocol (or NSM for short)
+is used to notify NFS peers of reboots.
+On Linux, two separate user-space components constitute the NSM service:
+.TP
+.B rpc.statd
+A daemon that listens for reboot notifications from other hosts, and
+manages the list of hosts to be notified when the local system reboots
+.TP
+.B sm-notify
+A helper program that notifies NFS peers after the local system reboots
+.PP
+When a system reboots, system initialization starts
+.BR rpc.statd ,
+which then typically invokes
+.B sm-notify
+automatically.
+The local NFS lock manager indicates to
+.B rpc.statd
+which remote hosts to monitor.
+.SS NSM operation
+The first file locking interaction between an NFS client and server causes
+the NFS lock managers on both peers to contact their local NSM service to
+store information about their respective remote peers.
+.PP
+On Linux,
+the lock manager contacts
+.BR rpc.statd ,
+which records on persistent storage information about each monitored NFS peer.
+This information describes how to contact a remote peer
+in case the local system reboots, and how to notify the local
+lockd when a monitored peer indicates to
+.B rpc.statd
+it has rebooted.
+.PP
+An NFS client sends a hostname in each file lock request to NFS servers
+that can be used to call the client back.
+This hostname is known as the client's
+.IR caller_name .
+An NFS server can use this hostname to send asynchronous GRANT
+calls to a client, and
+it can also use this hostname when notifying clients of reboots.
+An NFS server may use this hostname,
+or it may simply scrape the client's network address
+from the underlying transport and use that instead,
+when calling a client back.
+.PP
+The hostname or network address of the client is handed to
+.B rpc.statd
+during the first locking interaction between an NFS client and server.
+This is known as the monitored peer's
+.IR mon_name .
+In addition, the local lockd tells
+.B rpc.statd
+what it thinks its own hostname is.
+This hostname is known as
+.IR my_name .
+.PP
+NFS clients do not actually know what an NFS server's
+.I mon_name
+might be, so they usually record the hostname and/or
+network address used to mount the server.
+.SS Reboot notification
+After a reboot,
+.B sm-notify
+reads the list of monitored peers from persistent storage and
+sends an SM_NOTIFY request
+to the NSM service on each listed remote peer,
+using each
+.I mon_name
+as the destination, and the
+.I my_name
+string to identify which host has rebooted.
+.PP
+After receiving an SM_NOTIFY request from a remote peer,
+.B rpc.statd
+contacts its local lock manager,
+which initiates appropriate lock recovery.
+.B rpc.statd
+does not forward all reboot notifications it receives to the local lockd,
+however.
+It matches incoming SM_NOTIFY requests by their
+.I mon_name
+or network address to one or more peers on its own monitor list.
+If
+.B rpc.statd
+does not find a peer on its monitor list that matches
+an incoming SM_NOTIFY request, the request is ignored.
+.PP
+In addition, each peer has its own
+.IR "state number" ,
+which is a 32-bit integer that is bumped after each reboot.
+This state number is communicated between two peers so that they
+can distinguish between actual reboots and replayed notifications.
+.PP
+The value of state number is managed on Linux by
+.BR rpc.statd " and " sm-notify .
+The NSM state number is sent by NFS clients in each NLM LOCK request.
+The client does not discover the server's state number until the
+server sends an SM_NOTIFY request.
+.PP
+After reboot notifications have been
+sent to each peer on the list of monitored hosts,
+.B rpc.statd
+clears the monitor list, since, as noted above,
+lock state does not persist across reboots.
+Part of NFS lock recovery is rediscovering
+which peers need to be monitored again.
+.SH OPTIONS
+.TP
+.BR -d , " --debug
+Causes
+.B rpc.statd
+to generate extra log messages so that NSM operation can be monitored directly.
+If this option is not specified,
+.B rpc.statd
+reports only significant errors.
+.TP
+.BR -h , " --help
+Displays version and usage information on
+.IR stderr .
+.TP
+.BR -F , " --foreground
+Keeps
+.B rpc.statd
+in attached to a controlling terminal so that NSM
+operation can be monitored directly or run under a debugger.
+If this option is not specified,
+.B rpc.statd
+backgrounds itself as soon as it starts.
+.TP
+.BR -w , " --warm-start
+Disables detection of system reboots.
+If this option is not specified,
+.B rpc.statd
+bumps the NSM state number and
+sends reboot notifications automatically once after each system reboot.
+.IP
+Specifying this option prevents
+.B rpc.statd
+from sending notifications automatically after a reboot,
+preserving the existing NSM state number and monitor list.
+This option can be used, for example, in configurations where
+some other program, such as
+.BR sm-notify ,
+is always used to send reboot notifications.
+.IP
+Note that
+.B --warm-start
+does not prevent
+.B rpc.statd
+from sending notifications and bumping the NSM state number
+after receiving an SM_SIMU_CRASH request
+from a privileged port on the local host.
+.TP
+.BI "\-H," "" " \-\-ha-callout " prog
+Specifies a high availability callout program.
+If this option is not specified, all callouts are skipped.
+.IP
+.B rpc.statd
+execs a callout program during processing of
+successful SM_MON, SM_UNMON, and SM_UNMON_ALL requests.
+Such a program may be used in a High Availability NFS (HA-NFS)
+environment to track lock state that may need to be migrated after
+a system restart.
+.IP
+The program is run with 3 arguments:
+The first is either
+.B add-client
+or
+.B del-client
+depending on the reason for the callout.
+The second is the
+.I mon_name
+of the monitored peer.
+The third is the
+.I caller_name
+of the requesting lock manager.
+.TP
+.BI "\-n, " "" "\-\-name " ipaddr " | " hostname
+Specifies the source network address or hostname from which
+to send reboot notification requests.
+If this option is not specified,
+.B rpc.statd
+uses an appropriate ANYADDR as the transport bind address,
+and sends the local lockd's
+.I caller_name
+as the
+.I mon_name
+argument to SM_NOTIFY.
+.IP
+This option can be useful in multi-homed configurations where
+the remote requires notification from a specific network address.
+.TP
+.BI "\-o," "" " \-\-notify\-port "  port
+Specifies a local port number that
+.B rpc.statd
+passes to the
+.B sm-notify
+program when sending reboot notifications.
+See
+.BR sm-notify (8)
+for details.
+.TP
+.BI "\-p," "" " \-\-port " port
+Specifies a port that
+.B rpc.statd
+should use when listening for incoming NSM requests.
+If this option is not specified,
+.B rpc.statd
+choses random ephemeral ports.
+.IP
+This option can be used to fix the port value of its listeners when
+SM_NOTIFY requests must traverse a firewall between clients and servers.
+.TP
+.BI "\-P, " "" \-\-state\-directory\-path " pathname
+Specifies a pathname to the parent directory
+on the local system where NSM state information is kept.
+If this option is not specified,
+.B rpc.statd
+uses
+.I /var/lib/nfs/statd
+by default.
+.IP
+After starting,
+.B rpc.statd
+attempts to set its effective UID and GID to the owner
+and group of this directory.
+.SH SECURITY
+The
+.B rpc.statd
+program must be started as root to acquire privileges needed
+to create sockets with privileged source ports, and to access the
+state information database.
+Because
+.B rpc.statd
+maintains a long-running network service, however, it drops root privileges
+as soon as it starts up to reduce the risk of a privilege escalation attack.
+During normal operation,
+the effective user ID it chooses is the owner of the state directory.
+This allows it to continue to access files in that directory after it
+has dropped its root privileges.
+To control which user ID
+.B rpc.statd
+chooses, simply use
+.BR chown (1)
+to set the owner of
+the state directory.
+.PP
+You can also protect your
+.B rpc.statd
+listeners using the
+.B tcp_wrapper
+library or
+.BR ip_tables .
+Note that the
+.B tcp_wrapper
+library supports only IPv4 networking.
+To use the
+.B tcp_wrapper
+library, list systems that should be allowed access in
+.IR /etc/hosts.allow .
+Use the daemon name
+.B statd
+even if the
+.B rpc.statd
+binary has a different name.
+For further information see the
+.BR tcpd (8)
+and
+.BR hosts_access (5)
+manual pages.
+.SH NOTES
+Lock recovery after a reboot is critical to maintaining data integrity
+and preventing unnecessary application hangs.
+.PP
+To help
+.B rpc.statd
+match SM_NOTIFY requests to NLM requests, a number of best practices
+should be observed, including:
+.IP
+The UTS nodename of your systems should match the DNS names that NFS
+peers use to contact them
+.IP
+The nodenames should all be fully qualified domain names
+.IP
+The forward and reverse DNS mapping of the nodenames should be
+consistent
+.IP
+The hostname the client uses to contact the server should match the
+server's mon_name in SM_NOTIFY requests it sends
+.PP
+The use of network addresses for either argument should be avoided when
+interoperating with non-Linux NFS implementations.
+.PP
+Unmounting an NFS file system does not necessarily stop
+either the NFS client or server from monitoring each other.
+Both may continue monitoring each other for a time in case subsequent
+NFS traffic between the two results in fresh mounts and additional
+file locking.
+.PP
+On Linux, if the
+.B lockd
+kernel module is unloaded during normal operation,
+all remote NFS peers are unmonitored.
+This can happen on an NFS client, for example,
+if an automounter removes all NFS mount
+points due to inactivity.
+.SS IPv6 and TI-RPC support
+This version of
+.B rpc.statd
+supports IPv6 networking.
+RPC over IPv6 requires TI-RPC, implemented on Linux via
+.BR libtirpc .
+.B libtirpc
+is required to be available on the local system for this version of
+.BR rpc.statd .
+.PP
+.B rpc.statd
+consults the local
+.I /etc/netconfig
+database to determine what protocol families and transport protocols
+to use for its listeners.
+It attempts to start listeners on network transports marked
+'visible' in
+.IR /etc/netconfig .
+As long as at least one listener can be started,
+.B rpc.statd
+will operate.
+Marking the "udp6" and "tcp6" netids non-visible in
+.IR /etc/netconfig ,
+for example, will prevent
+.B rpc.statd
+from listening on IPv6 transports.
+.PP
+.B rpc.statd
+resolves the hostname
+.I localhost
+to determine what network address to use when performing NLM callbacks.
+.I localhost
+must resolve to a valid loopback address.
+.SH FILES
+.TP 2.5i
+.I /var/lib/nfs/statd/statdb
+default NSM state information database
+.TP 2.5i
+.I /var/run/run.statd.pid
+pid file
+.TP 2.5i
+.I /etc/netconfig
+network transport capability database
+.SH SEE ALSO
+.BR sm-notify (8),
+.BR nfs (5),
+.BR rpc.nfsd (8),
+.BR rpcbind (8),
+.BR tcpd (8),
+.BR hosts_access (5)
+.BR netconfig (5)
+.sp
+RFC 1094 - "NFS: Network File System Protocol Specification"
+.br
+RFC 1813 - "NFS Version 3 Protocol Specification"
+.br
+OpenGroup Protocols for Interworking: XNFS, Version 3W - Chapter 11
+.SH AUTHORS
+Chuck Lever <chuck.lever@oracle.com>


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4] new statd [take 2]
       [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-08-05 14:46   ` [PATCH 4/4] nfs-utils: introduce new statd implementation (4th part) Chuck Lever
@ 2009-08-05 17:15   ` Jeff Layton
       [not found]     ` <20090805131554.67eb15bc-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  4 siblings, 1 reply; 39+ messages in thread
From: Jeff Layton @ 2009-08-05 17:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: steved, linux-nfs

On Wed, 05 Aug 2009 10:45:30 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Apparently there is a 100KB per-email size limit imposed by
> vger.kernel.org.  I've broken up the large statd patch, and
> postponed the nfs(8) man page patches to another day.
> 
> Yesterday it was decided that:
> 
>   [PATCH 3/5] statd: change default state dir to /var/lib/nfs/statd
> 
> was overkill. That one has been dropped from this series.
> 
> The first two of the present patches introduce the .c files under
> utils/new-statd/ .  The next patch adds actools infrastructure for
> building the new statd and sm-notify executables.  The last patch
> adds two new man pages for statd and sm-notify.
> 
> The old statd remains in utils/statd.  If sqlite3 and TI-RPC is
> present on the build system, and TI-RPC is enabled during
> configuration, the new statd is built and installed instead of the
> old one.
> 
> See the description in the first patch for more details on the
> features of the new implementation.
> 

Nice work, Chuck.

Rewriting statd from scratch is a pretty bold step, but I think it's
warranted.

I've been testing this new implementation over several iterations and
the recent ones have worked well for me so far. I've also walked
through the code and it all seems to make sense.

Acked-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]     ` <20090805144540.12866.22084.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
@ 2009-08-05 17:48       ` J. Bruce Fields
  2009-08-05 18:05         ` Chuck Lever
  0 siblings, 1 reply; 39+ messages in thread
From: J. Bruce Fields @ 2009-08-05 17:48 UTC (permalink / raw)
  To: Chuck Lever; +Cc: steved, linux-nfs

On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
> Provide a new implementation of statd that supports IPv6.  The new
> statd implementation resides under
> 
>   utils/new-statd/
> 
> The contents of this directory are built if --enable-tirpc is set
> on the ./configure command line, and sqlite3 is available on the
> build system.  Otherwise, the legacy version of statd, which still
> resides under utils/statd/, is built.
> 
> The goals of this re-write are:
> 
>  o Support IPv6 networking
> 
>    Support interoperation with TI-RPC-based NSM implementations.
>    Transport Independent RPC, or TI-RPC, provides IPv6 network support
>    for Linux's NSM implementation.
> 
>    To support TI-RPC, open code to construct RPC requests in socket
>    buffers and then schedule them has been replaced with standard
>    library calls.
> 
>  o Support notification via TCP
> 
>    As a secondary benefit of using TI-RPC library calls, reboot
>    notifications and NLM callbacks can now be sent via connection-
>    oriented transport protocols.
> 
>    Note that lockd does not (yet) tell statd what transport protocol
>    to use when sending reboot notifications.  statd/sm-notify will
>    continue to use UDP for the time being.
> 
>  o Use an embedded database for storing on-disk callback data
> 
>    This whole exercise is for the purpose of crash robustness.  There
>    are well-known deficiencies with simple create/rename/unlink
>    disk storage schemes during system crashes.  Replace the current
>    flat-file monitor list mechanism which uses sync(2) with sqlite3,
>    which uses fsync(3).

If someone wants to move around that data, is it still simple to do
that?  (Where is it kept on the filesystem?)

(I'm thinking of someone that shares it for high-availabity, as in:

	http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3

Or maybe somebody that just needs to move their /var partition to a
different disk one day.)

>  o Share code between sm-notify and statd
> 
>    Statd and sm-notify access the same set of on-disk data.  These
>    separate programs now share the same code and implementation, with
>    access to on-disk data serialized by sqlite3.  The two remain
>    separate executables to allow other system facilities to send
>    reboot notifications without poking statd.
> 
>  o Reduce impact of DNS outages
> 
>    The heuristics used by SM_NOTIFY to figure out which remote peer
>    has rebooted are heavily dependent on DNS.  If the DNS service is
>    slow or hangs, that will make the NSM listener unresponsive.
>    Incoming SM_NOTIFY requests are now handled in a sidecar process
>    to reduce the impact of DNS outages on the NSM service listener.
> 
>  o Proper my_name support
> 
>    The current version of statd uses gethostname(3) to generate the
>    mon_name argument of SM_NOTIFY.  This value can change across a
>    reboot.  The new version of statd records lockd's my_name, passed
>    by every SM_MON request, and uses that when sending SM_NOTIFY.
> 
>    This can be useful for multi-homed and DHCP configured hosts.
> 
>  o Send SM_NOTIFY more aggressively
> 
>    It has been recommended that statd/sm-notify send SM_NOTIFY
>    more aggressively (for example, to the entire list returned by
>    getaddrinfo(3)).  Since SM_NOTIFY's reply is NULL, there's no
>    way to tell whether the remote peer recognized the mon_name we
>    sent.  More study is required, but this implementation attempts
>    to send an SM_NOTIFY request to each address returned by
>    getaddrinfo(3).
> 
> This re-implementation paves the way for a number of future
> improvements.  However, it does not immediately address:
> 
>  o lockd/statd start-up serialization issues
> 
>    Sending reboot notifications, starting statd and lockd, and opening
>    the lockd grace period are still determined independently in user
>    space and the kernel.
> 
>  o Binding mon_names to caller IP addresses
> 
>    By default, lockd continues to send IP addresses as the mon_name
>    argument of the SM_MON procedure.  This provides a better guarantee
>    of being able to contact remote peers during a reboot, but means
>    statd must continue to use heuristics to match incoming SM_NOTIFY
>    requests with peers on the monitor list.
> 
>  o Distinct logic for NFS client- and server-side
> 
>    Client-side and server-side monitoring requirements are different.
>    Statd continues to use the same logic for both NFS client and
>    server, as the NSMv1 protocol does not provide any indication
>    that a mon_name is for a client or server peer.

Note we probably don't need to be limited by the protocol here, only by
kernel backwards-compatibility requirements, as long as this is just
kernel<->statd communication and not something that goes across the wire
to other statd implementations.

--b.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 17:48       ` J. Bruce Fields
@ 2009-08-05 18:05         ` Chuck Lever
  2009-08-05 18:15           ` J. Bruce Fields
  0 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 18:05 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: steved, linux-nfs

On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>> Provide a new implementation of statd that supports IPv6.  The new
>> statd implementation resides under
>>
>>  utils/new-statd/
>>
>> The contents of this directory are built if --enable-tirpc is set
>> on the ./configure command line, and sqlite3 is available on the
>> build system.  Otherwise, the legacy version of statd, which still
>> resides under utils/statd/, is built.
>>
>> The goals of this re-write are:
>>
>> o Support IPv6 networking
>>
>>   Support interoperation with TI-RPC-based NSM implementations.
>>   Transport Independent RPC, or TI-RPC, provides IPv6 network support
>>   for Linux's NSM implementation.
>>
>>   To support TI-RPC, open code to construct RPC requests in socket
>>   buffers and then schedule them has been replaced with standard
>>   library calls.
>>
>> o Support notification via TCP
>>
>>   As a secondary benefit of using TI-RPC library calls, reboot
>>   notifications and NLM callbacks can now be sent via connection-
>>   oriented transport protocols.
>>
>>   Note that lockd does not (yet) tell statd what transport protocol
>>   to use when sending reboot notifications.  statd/sm-notify will
>>   continue to use UDP for the time being.
>>
>> o Use an embedded database for storing on-disk callback data
>>
>>   This whole exercise is for the purpose of crash robustness.  There
>>   are well-known deficiencies with simple create/rename/unlink
>>   disk storage schemes during system crashes.  Replace the current
>>   flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>   which uses fsync(3).
>
> If someone wants to move around that data, is it still simple to do
> that?  (Where is it kept on the filesystem?)
>
> (I'm thinking of someone that shares it for high-availabity, as in:
>
> 	http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>
> Or maybe somebody that just needs to move their /var partition to a
> different disk one day.)

Statd's monitor lists and state number are stored in a single regular  
file, /var/lib/nfs/statd/statdb by default.  This file can be easily  
backed up, or used on other systems, if desired.  I would recommend  
ensuring the NSM state number is reset in the latter case, which can  
be done with the sqlite3 command.

I've had some dialog with Lon Hohberger about clustering  
requirements.  I think we are looking at crafting a separate utility  
that uses sqlite3 C function calls to extract data that's interesting  
to the clustering implementation.  Again, this could even be scripted  
with bash and the sqlite3 command, but perhaps a C program is more  
maintainable.

>> o Share code between sm-notify and statd
>>
>>   Statd and sm-notify access the same set of on-disk data.  These
>>   separate programs now share the same code and implementation, with
>>   access to on-disk data serialized by sqlite3.  The two remain
>>   separate executables to allow other system facilities to send
>>   reboot notifications without poking statd.
>>
>> o Reduce impact of DNS outages
>>
>>   The heuristics used by SM_NOTIFY to figure out which remote peer
>>   has rebooted are heavily dependent on DNS.  If the DNS service is
>>   slow or hangs, that will make the NSM listener unresponsive.
>>   Incoming SM_NOTIFY requests are now handled in a sidecar process
>>   to reduce the impact of DNS outages on the NSM service listener.
>>
>> o Proper my_name support
>>
>>   The current version of statd uses gethostname(3) to generate the
>>   mon_name argument of SM_NOTIFY.  This value can change across a
>>   reboot.  The new version of statd records lockd's my_name, passed
>>   by every SM_MON request, and uses that when sending SM_NOTIFY.
>>
>>   This can be useful for multi-homed and DHCP configured hosts.
>>
>> o Send SM_NOTIFY more aggressively
>>
>>   It has been recommended that statd/sm-notify send SM_NOTIFY
>>   more aggressively (for example, to the entire list returned by
>>   getaddrinfo(3)).  Since SM_NOTIFY's reply is NULL, there's no
>>   way to tell whether the remote peer recognized the mon_name we
>>   sent.  More study is required, but this implementation attempts
>>   to send an SM_NOTIFY request to each address returned by
>>   getaddrinfo(3).
>>
>> This re-implementation paves the way for a number of future
>> improvements.  However, it does not immediately address:
>>
>> o lockd/statd start-up serialization issues
>>
>>   Sending reboot notifications, starting statd and lockd, and opening
>>   the lockd grace period are still determined independently in user
>>   space and the kernel.
>>
>> o Binding mon_names to caller IP addresses
>>
>>   By default, lockd continues to send IP addresses as the mon_name
>>   argument of the SM_MON procedure.  This provides a better guarantee
>>   of being able to contact remote peers during a reboot, but means
>>   statd must continue to use heuristics to match incoming SM_NOTIFY
>>   requests with peers on the monitor list.
>>
>> o Distinct logic for NFS client- and server-side
>>
>>   Client-side and server-side monitoring requirements are different.
>>   Statd continues to use the same logic for both NFS client and
>>   server, as the NSMv1 protocol does not provide any indication
>>   that a mon_name is for a client or server peer.
>
> Note we probably don't need to be limited by the protocol here, only  
> by
> kernel backwards-compatibility requirements, as long as this is just
> kernel<->statd communication and not something that goes across the  
> wire
> to other statd implementations.

Agreed.

It would be possible to export the kernel's NSM host cache via sysfs,  
for instance.  An SM_MON upcall could cause statd to look in /sys for  
more information like whether the remote peer is a client or server,  
and what transport protocol and what network address the caller used  
to contact the local host.  This kind of scheme would work well for  
both old kernels running the new statd, and new kernels running the  
old statd.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4] new statd [take 2]
       [not found]     ` <20090805131554.67eb15bc-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2009-08-05 18:06       ` Chuck Lever
  0 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 18:06 UTC (permalink / raw)
  To: Jeff Layton; +Cc: steved, linux-nfs

On Aug 5, 2009, at 1:15 PM, Jeff Layton wrote:
> On Wed, 05 Aug 2009 10:45:30 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> Apparently there is a 100KB per-email size limit imposed by
>> vger.kernel.org.  I've broken up the large statd patch, and
>> postponed the nfs(8) man page patches to another day.
>>
>> Yesterday it was decided that:
>>
>>  [PATCH 3/5] statd: change default state dir to /var/lib/nfs/statd
>>
>> was overkill. That one has been dropped from this series.
>>
>> The first two of the present patches introduce the .c files under
>> utils/new-statd/ .  The next patch adds actools infrastructure for
>> building the new statd and sm-notify executables.  The last patch
>> adds two new man pages for statd and sm-notify.
>>
>> The old statd remains in utils/statd.  If sqlite3 and TI-RPC is
>> present on the build system, and TI-RPC is enabled during
>> configuration, the new statd is built and installed instead of the
>> old one.
>>
>> See the description in the first patch for more details on the
>> features of the new implementation.
>>
>
> Nice work, Chuck.
>
> Rewriting statd from scratch is a pretty bold step, but I think it's
> warranted.
>
> I've been testing this new implementation over several iterations and
> the recent ones have worked well for me so far. I've also walked
> through the code and it all seems to make sense.
>
> Acked-by: Jeff Layton <jlayton@redhat.com>

Independent testing was an important piece of this work.  Thanks to  
Jeff and Sankar Bose at NetApp for their assistance so far.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 18:05         ` Chuck Lever
@ 2009-08-05 18:15           ` J. Bruce Fields
  2009-08-05 18:26             ` Chuck Lever
  0 siblings, 1 reply; 39+ messages in thread
From: J. Bruce Fields @ 2009-08-05 18:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: steved, linux-nfs

On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>> Provide a new implementation of statd that supports IPv6.  The new
>>> statd implementation resides under
>>>
>>>  utils/new-statd/
>>>
>>> The contents of this directory are built if --enable-tirpc is set
>>> on the ./configure command line, and sqlite3 is available on the
>>> build system.  Otherwise, the legacy version of statd, which still
>>> resides under utils/statd/, is built.
>>>
>>> The goals of this re-write are:
>>>
>>> o Support IPv6 networking
>>>
>>>   Support interoperation with TI-RPC-based NSM implementations.
>>>   Transport Independent RPC, or TI-RPC, provides IPv6 network support
>>>   for Linux's NSM implementation.
>>>
>>>   To support TI-RPC, open code to construct RPC requests in socket
>>>   buffers and then schedule them has been replaced with standard
>>>   library calls.
>>>
>>> o Support notification via TCP
>>>
>>>   As a secondary benefit of using TI-RPC library calls, reboot
>>>   notifications and NLM callbacks can now be sent via connection-
>>>   oriented transport protocols.
>>>
>>>   Note that lockd does not (yet) tell statd what transport protocol
>>>   to use when sending reboot notifications.  statd/sm-notify will
>>>   continue to use UDP for the time being.
>>>
>>> o Use an embedded database for storing on-disk callback data
>>>
>>>   This whole exercise is for the purpose of crash robustness.  There
>>>   are well-known deficiencies with simple create/rename/unlink
>>>   disk storage schemes during system crashes.  Replace the current
>>>   flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>>   which uses fsync(3).
>>
>> If someone wants to move around that data, is it still simple to do
>> that?  (Where is it kept on the filesystem?)
>>
>> (I'm thinking of someone that shares it for high-availabity, as in:
>>
>> 	http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>
>> Or maybe somebody that just needs to move their /var partition to a
>> different disk one day.)
>
> Statd's monitor lists and state number are stored in a single regular  
> file, /var/lib/nfs/statd/statdb by default.  This file can be easily  
> backed up, or used on other systems, if desired.  I would recommend  
> ensuring the NSM state number is reset in the latter case, which can be 
> done with the sqlite3 command.
>
> I've had some dialog with Lon Hohberger about clustering requirements.  I 
> think we are looking at crafting a separate utility that uses sqlite3 C 
> function calls to extract data that's interesting to the clustering 
> implementation.  Again, this could even be scripted with bash and the 
> sqlite3 command, but perhaps a C program is more maintainable.

OK, good.

And for the simplest cases, it should still be enough to just copy
/var/lib/nfs/, right?

--b.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 18:15           ` J. Bruce Fields
@ 2009-08-05 18:26             ` Chuck Lever
  2009-08-05 21:22               ` Trond Myklebust
  0 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 18:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: steved, linux-nfs


On Aug 5, 2009, at 2:15 PM, J. Bruce Fields wrote:

> On Wed, Aug 05, 2009 at 02:05:44PM -0400, Chuck Lever wrote:
>> On Aug 5, 2009, at 1:48 PM, J. Bruce Fields wrote:
>>> On Wed, Aug 05, 2009 at 10:45:40AM -0400, Chuck Lever wrote:
>>>> Provide a new implementation of statd that supports IPv6.  The new
>>>> statd implementation resides under
>>>>
>>>> utils/new-statd/
>>>>
>>>> The contents of this directory are built if --enable-tirpc is set
>>>> on the ./configure command line, and sqlite3 is available on the
>>>> build system.  Otherwise, the legacy version of statd, which still
>>>> resides under utils/statd/, is built.
>>>>
>>>> The goals of this re-write are:
>>>>
>>>> o Support IPv6 networking
>>>>
>>>>  Support interoperation with TI-RPC-based NSM implementations.
>>>>  Transport Independent RPC, or TI-RPC, provides IPv6 network  
>>>> support
>>>>  for Linux's NSM implementation.
>>>>
>>>>  To support TI-RPC, open code to construct RPC requests in socket
>>>>  buffers and then schedule them has been replaced with standard
>>>>  library calls.
>>>>
>>>> o Support notification via TCP
>>>>
>>>>  As a secondary benefit of using TI-RPC library calls, reboot
>>>>  notifications and NLM callbacks can now be sent via connection-
>>>>  oriented transport protocols.
>>>>
>>>>  Note that lockd does not (yet) tell statd what transport protocol
>>>>  to use when sending reboot notifications.  statd/sm-notify will
>>>>  continue to use UDP for the time being.
>>>>
>>>> o Use an embedded database for storing on-disk callback data
>>>>
>>>>  This whole exercise is for the purpose of crash robustness.  There
>>>>  are well-known deficiencies with simple create/rename/unlink
>>>>  disk storage schemes during system crashes.  Replace the current
>>>>  flat-file monitor list mechanism which uses sync(2) with sqlite3,
>>>>  which uses fsync(3).
>>>
>>> If someone wants to move around that data, is it still simple to do
>>> that?  (Where is it kept on the filesystem?)
>>>
>>> (I'm thinking of someone that shares it for high-availabity, as in:
>>>
>>> 	http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat_p3
>>>
>>> Or maybe somebody that just needs to move their /var partition to a
>>> different disk one day.)
>>
>> Statd's monitor lists and state number are stored in a single regular
>> file, /var/lib/nfs/statd/statdb by default.  This file can be easily
>> backed up, or used on other systems, if desired.  I would recommend
>> ensuring the NSM state number is reset in the latter case, which  
>> can be
>> done with the sqlite3 command.
>>
>> I've had some dialog with Lon Hohberger about clustering  
>> requirements.  I
>> think we are looking at crafting a separate utility that uses  
>> sqlite3 C
>> function calls to extract data that's interesting to the clustering
>> implementation.  Again, this could even be scripted with bash and the
>> sqlite3 command, but perhaps a C program is more maintainable.
>
> OK, good.
>
> And for the simplest cases, it should still be enough to just copy
> /var/lib/nfs/, right?

I don't see why that wouldn't work, as long statd/sm-notify aren't  
updating the database at that moment.  For safety I think there is an  
sqlite3 backup mechanism for database files that respects the  
library's locking semantics.

sqlite3 doesn't do anything special under the covers.  It uses only  
POSIX file access and locking calls, as far as I know.  So I think  
hosting /var on most well-behaved clustering file systems won't have  
any problem with this arrangement.

One (admittedly minor) reason I did this is so we have some sample  
code to try for other NFS-related daemons that need to store  
information in /var robustly, potentially in clustered environments.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 18:26             ` Chuck Lever
@ 2009-08-05 21:22               ` Trond Myklebust
       [not found]                 ` <1249507356.5428.11.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Trond Myklebust @ 2009-08-05 21:22 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, steved, linux-nfs

On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:

> sqlite3 doesn't do anything special under the covers.  It uses only  
> POSIX file access and locking calls, as far as I know.  So I think  
> hosting /var on most well-behaved clustering file systems won't have  
> any problem with this arrangement.

So we're basically introducing a dependency on a completely new library
that will have to be added to boot partitions/nfsroot/etc, and we have
no real reason for doing it other than because we want to move from
using sync() to fsync()?

Sounds like a NACK to me...

Trond


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                 ` <1249507356.5428.11.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-08-05 22:24                   ` Chuck Lever
  2009-08-05 23:30                     ` Trond Myklebust
  0 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-08-05 22:24 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, steved, linux-nfs

On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>> sqlite3 doesn't do anything special under the covers.  It uses only
>> POSIX file access and locking calls, as far as I know.  So I think
>> hosting /var on most well-behaved clustering file systems won't have
>> any problem with this arrangement.
>
> So we're basically introducing a dependency on a completely new  
> library
> that will have to be added to boot partitions/nfsroot/etc, and we have
> no real reason for doing it other than because we want to move from
> using sync() to fsync()?
>
> Sounds like a NACK to me...

Which library are you talking about, libsqlite3 or libtirpc?  Because  
NEITHER of those is in /lib.

In any event, it's not just sync(2) that is a problem.  sync(2) by  
itself is a boot performance problem, but it's the combination of  
rename and sync that is known to be especially unreliable during  
system crashes.  Statd, being a crash monitor, shouldn't depend on  
rename/sync to maintain persistent data in the face of system  
instability.  I'd call that a real reason to use something more robust.

Can we try to be a little more constructive, please?  Asking the list  
(which includes distributors, who actually have to worry about such  
things) whether this would be a problem is significantly less abrasive  
then just saying "NACK" outright.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 22:24                   ` Chuck Lever
@ 2009-08-05 23:30                     ` Trond Myklebust
  2009-09-09 18:29                       ` Jeff Layton
  0 siblings, 1 reply; 39+ messages in thread
From: Trond Myklebust @ 2009-08-05 23:30 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, steved, linux-nfs

On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> >> sqlite3 doesn't do anything special under the covers.  It uses only
> >> POSIX file access and locking calls, as far as I know.  So I think
> >> hosting /var on most well-behaved clustering file systems won't have
> >> any problem with this arrangement.
> >
> > So we're basically introducing a dependency on a completely new  
> > library
> > that will have to be added to boot partitions/nfsroot/etc, and we have
> > no real reason for doing it other than because we want to move from
> > using sync() to fsync()?
> >
> > Sounds like a NACK to me...
> 
> Which library are you talking about, libsqlite3 or libtirpc?  Because  
> NEITHER of those is in /lib.

libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
established.

> In any event, it's not just sync(2) that is a problem.  sync(2) by  
> itself is a boot performance problem, but it's the combination of  
> rename and sync that is known to be especially unreliable during  
> system crashes.  Statd, being a crash monitor, shouldn't depend on  
> rename/sync to maintain persistent data in the face of system  
> instability.  I'd call that a real reason to use something more robust.

What are you talking about? Is this about the truncate + rename issue
leaving empty files upon a crash?
That issue is solved trivially by doing an fsync() before you rename the
file. That entire discussion was about whether or not existing
applications should be _required_ to do this kind of POSIX pedantry,
when previously they could get away without it.

IOW: that issue alone does not justify replacing the current simple file
based scheme.

> Can we try to be a little more constructive, please?  Asking the list  
> (which includes distributors, who actually have to worry about such  
> things) whether this would be a problem is significantly less abrasive  
> then just saying "NACK" outright.

It would be constructive if you could actually _justify_ these
backward-incompatible changes instead of hand waving, and accusing
others of being obstructionist.

     Trond


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-08-05 23:30                     ` Trond Myklebust
@ 2009-09-09 18:29                       ` Jeff Layton
       [not found]                         ` <20090909142945.755da393-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Jeff Layton @ 2009-09-09 18:29 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, J. Bruce Fields, steved, linux-nfs

On Wed, 05 Aug 2009 19:30:04 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > >> sqlite3 doesn't do anything special under the covers.  It uses only
> > >> POSIX file access and locking calls, as far as I know.  So I think
> > >> hosting /var on most well-behaved clustering file systems won't have
> > >> any problem with this arrangement.
> > >
> > > So we're basically introducing a dependency on a completely new  
> > > library
> > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > no real reason for doing it other than because we want to move from
> > > using sync() to fsync()?
> > >
> > > Sounds like a NACK to me...
> > 
> > Which library are you talking about, libsqlite3 or libtirpc?  Because  
> > NEITHER of those is in /lib.
> 
> libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> established.
>

Sorry to revive this so late, but I think we need to come to some
sort of resolution here. The only missing piece for client side IPv6
support is statd...

I'm not sure I understand the objection to using libsqlite3 here. We
certainly could roll our own routines to handle data storage, but why
would we want to do so? sqlite3 is quite good at what it does. Why
wouldn't we want to use it?

> > In any event, it's not just sync(2) that is a problem.  sync(2) by  
> > itself is a boot performance problem, but it's the combination of  
> > rename and sync that is known to be especially unreliable during  
> > system crashes.  Statd, being a crash monitor, shouldn't depend on  
> > rename/sync to maintain persistent data in the face of system  
> > instability.  I'd call that a real reason to use something more robust.
> 
> What are you talking about? Is this about the truncate + rename issue
> leaving empty files upon a crash?
> That issue is solved trivially by doing an fsync() before you rename the
> file. That entire discussion was about whether or not existing
> applications should be _required_ to do this kind of POSIX pedantry,
> when previously they could get away without it.
> 
> IOW: that issue alone does not justify replacing the current simple file
> based scheme.
>

There are other reasons, not to use the simple file-based scheme too...

Internationalized domain names will be easier to deal with via sqlite3,
for instance.

Certainly we could code this up ourselves, but what's the benefit to
doing that when we have a perfectly good data storage engine available?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                         ` <20090909142945.755da393-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2009-09-09 18:39                           ` Trond Myklebust
  2009-09-09 19:13                             ` Jeff Layton
  2009-09-09 19:17                             ` Chuck Lever
  0 siblings, 2 replies; 39+ messages in thread
From: Trond Myklebust @ 2009-09-09 18:39 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Chuck Lever, J. Bruce Fields, steved, linux-nfs

On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> On Wed, 05 Aug 2009 19:30:04 -0400
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > >> sqlite3 doesn't do anything special under the covers.  It uses only
> > > >> POSIX file access and locking calls, as far as I know.  So I think
> > > >> hosting /var on most well-behaved clustering file systems won't have
> > > >> any problem with this arrangement.
> > > >
> > > > So we're basically introducing a dependency on a completely new  
> > > > library
> > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > no real reason for doing it other than because we want to move from
> > > > using sync() to fsync()?
> > > >
> > > > Sounds like a NACK to me...
> > > 
> > > Which library are you talking about, libsqlite3 or libtirpc?  Because  
> > > NEITHER of those is in /lib.
> > 
> > libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> > established.
> >
> 
> Sorry to revive this so late, but I think we need to come to some
> sort of resolution here. The only missing piece for client side IPv6
> support is statd...
> 
> I'm not sure I understand the objection to using libsqlite3 here. We
> certainly could roll our own routines to handle data storage, but why
> would we want to do so? sqlite3 is quite good at what it does. Why
> wouldn't we want to use it?

Backwards compatibility is one major reason. statd already exists, and
is in use out there. I shouldn't be forced to reboot all my clients when
I upgrade the nfs-utils package on my server.

Simplicity is another reason. WTF do we need a full SQL database, when
all we want to do is store 2 pieces of data (a hostname and a cookie)?
It isn't as if this has been a major problem for us previously.

> > > In any event, it's not just sync(2) that is a problem.  sync(2) by  
> > > itself is a boot performance problem, but it's the combination of  
> > > rename and sync that is known to be especially unreliable during  
> > > system crashes.  Statd, being a crash monitor, shouldn't depend on  
> > > rename/sync to maintain persistent data in the face of system  
> > > instability.  I'd call that a real reason to use something more robust.
> > 
> > What are you talking about? Is this about the truncate + rename issue
> > leaving empty files upon a crash?
> > That issue is solved trivially by doing an fsync() before you rename the
> > file. That entire discussion was about whether or not existing
> > applications should be _required_ to do this kind of POSIX pedantry,
> > when previously they could get away without it.
> > 
> > IOW: that issue alone does not justify replacing the current simple file
> > based scheme.
> >
> 
> There are other reasons, not to use the simple file-based scheme too...
> 
> Internationalized domain names will be easier to deal with via sqlite3,
> for instance.

Please explain...

> Certainly we could code this up ourselves, but what's the benefit to
> doing that when we have a perfectly good data storage engine available?

Why change something that works???? Rewriting from scratch is _NOT_ the
Linux way, and has usually bitten us hard when we've done it.

The 2.6.19 rewrite of the kernel mount code springs to mind...

Trond


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 18:39                           ` Trond Myklebust
@ 2009-09-09 19:13                             ` Jeff Layton
       [not found]                               ` <20090909151354.504ccdf6-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2009-09-09 19:17                             ` Chuck Lever
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff Layton @ 2009-09-09 19:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, J. Bruce Fields, steved, linux-nfs

On Wed, 09 Sep 2009 14:39:59 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
> > On Wed, 05 Aug 2009 19:30:04 -0400
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > 
> > > On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
> > > > On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
> > > > > On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
> > > > >> sqlite3 doesn't do anything special under the covers.  It uses only
> > > > >> POSIX file access and locking calls, as far as I know.  So I think
> > > > >> hosting /var on most well-behaved clustering file systems won't have
> > > > >> any problem with this arrangement.
> > > > >
> > > > > So we're basically introducing a dependency on a completely new  
> > > > > library
> > > > > that will have to be added to boot partitions/nfsroot/etc, and we have
> > > > > no real reason for doing it other than because we want to move from
> > > > > using sync() to fsync()?
> > > > >
> > > > > Sounds like a NACK to me...
> > > > 
> > > > Which library are you talking about, libsqlite3 or libtirpc?  Because  
> > > > NEITHER of those is in /lib.
> > > 
> > > libsqlite is the problem. Unlike libtirpc, it's utility has yet to be
> > > established.
> > >
> > 
> > Sorry to revive this so late, but I think we need to come to some
> > sort of resolution here. The only missing piece for client side IPv6
> > support is statd...
> > 
> > I'm not sure I understand the objection to using libsqlite3 here. We
> > certainly could roll our own routines to handle data storage, but why
> > would we want to do so? sqlite3 is quite good at what it does. Why
> > wouldn't we want to use it?
> 
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients when
> I upgrade the nfs-utils package on my server.
> 

We could roll a conversion utility for this if it would help. nfs-utils
upgrades usually mean restarting statd anyway:

shut down old-statd
convert flat file db to sqlite
start up new-statd

...so I don't think we necessarily need to reboot the clients for this.
It should (in theory) be possible to do the reverse even.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.
> 

Been a little while since I took my initial look at new-statd, but I
see that Chuck is has this (just a for-instance):

        rc = sqlite3_prepare_v2(db, "CREATE TABLE " STATD_MONITOR_TABLENAME
                                        " (priv BLOB,"
                                        " mon_name TEXT NOT NULL,"
                                        " my_name TEXT NOT NULL,"
                                        " program INTEGER,"
                                        " version INTEGER,"
                                        " procedure INTEGER,"
                                        " protocol TEXT NOT NULL,"
                                        " state INTEGER,"
                                        " UNIQUE(mon_name,my_name));",
                                                -1, &stmt, NULL);


He's tracking some other info there too. Is this all necessary? Maybe
not now, but having a storage engine that can cope with tracking extra
info will make it easier to handle things like multihomed clients and
servers correctly (something that the existing statd is not very good
at at the moment).

> > > > In any event, it's not just sync(2) that is a problem.  sync(2) by  
> > > > itself is a boot performance problem, but it's the combination of  
> > > > rename and sync that is known to be especially unreliable during  
> > > > system crashes.  Statd, being a crash monitor, shouldn't depend on  
> > > > rename/sync to maintain persistent data in the face of system  
> > > > instability.  I'd call that a real reason to use something more robust.
> > > 
> > > What are you talking about? Is this about the truncate + rename issue
> > > leaving empty files upon a crash?
> > > That issue is solved trivially by doing an fsync() before you rename the
> > > file. That entire discussion was about whether or not existing
> > > applications should be _required_ to do this kind of POSIX pedantry,
> > > when previously they could get away without it.
> > > 
> > > IOW: that issue alone does not justify replacing the current simple file
> > > based scheme.
> > >
> > 
> > There are other reasons, not to use the simple file-based scheme too...
> > 
> > Internationalized domain names will be easier to deal with via sqlite3,
> > for instance.
> 
> Please explain...
> 

Well, we currently store statd info in flat files named with the
hostname. With an internationalized domain name, we may have a
multibyte character in that name. We could try to store that as an
ASCII or UTF8 name, but we'd have to roll conversion routines for it.
Why bother when we have a storage engine that does that work for us?

> > Certainly we could code this up ourselves, but what's the benefit to
> > doing that when we have a perfectly good data storage engine available?
> 
> Why change something that works???? Rewriting from scratch is _NOT_ the
> Linux way, and has usually bitten us hard when we've done it.
> 
> The 2.6.19 rewrite of the kernel mount code springs to mind...
> 

A good point. Given who I work for, I *really* hate regressions since
they tend to mean a lot of my time tends to get eaten up.

At some point however it becomes very difficult to patch up old code.
old-statd in particular hasn't seen the attention that it probably
should have.

I trust that Chuck will be willing to fix problems that come along. The
new code seems to be on par with the old in complexity so I don't think
we'd be taking on a great maintenance burden with it even if Chuck
isn't available.

Could we add IPv6 support as a patchset to the existing statd instead?
Sure. That patch would be smaller than Chuck's rewrite, but I still
think there are advantages to considering a major overhaul here.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 18:39                           ` Trond Myklebust
  2009-09-09 19:13                             ` Jeff Layton
@ 2009-09-09 19:17                             ` Chuck Lever
  2009-09-09 19:42                               ` Trond Myklebust
  1 sibling, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-09-09 19:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>> On Wed, 05 Aug 2009 19:30:04 -0400
>> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>>
>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>> sqlite3 doesn't do anything special under the covers.  It uses  
>>>>>> only
>>>>>> POSIX file access and locking calls, as far as I know.  So I  
>>>>>> think
>>>>>> hosting /var on most well-behaved clustering file systems won't  
>>>>>> have
>>>>>> any problem with this arrangement.
>>>>>
>>>>> So we're basically introducing a dependency on a completely new
>>>>> library
>>>>> that will have to be added to boot partitions/nfsroot/etc, and  
>>>>> we have
>>>>> no real reason for doing it other than because we want to move  
>>>>> from
>>>>> using sync() to fsync()?
>>>>>
>>>>> Sounds like a NACK to me...
>>>>
>>>> Which library are you talking about, libsqlite3 or libtirpc?   
>>>> Because
>>>> NEITHER of those is in /lib.
>>>
>>> libsqlite is the problem. Unlike libtirpc, it's utility has yet to  
>>> be
>>> established.
>>>
>>
>> Sorry to revive this so late, but I think we need to come to some
>> sort of resolution here. The only missing piece for client side IPv6
>> support is statd...
>>
>> I'm not sure I understand the objection to using libsqlite3 here. We
>> certainly could roll our own routines to handle data storage, but why
>> would we want to do so? sqlite3 is quite good at what it does. Why
>> wouldn't we want to use it?
>
> Backwards compatibility is one major reason. statd already exists, and
> is in use out there. I shouldn't be forced to reboot all my clients  
> when
> I upgrade the nfs-utils package on my server.

The old statd still exists in nfs-utils.  The new statd is an entirely  
separate component.  Distributions can continue to use the old statd  
as long as they want.  This is a red herring.

> Simplicity is another reason. WTF do we need a full SQL database, when
> all we want to do is store 2 pieces of data (a hostname and a cookie)?
> It isn't as if this has been a major problem for us previously.

Because we are not storing just a hostname and a cookie.  We are  
storing several different data items for each host, and we need to  
search over the records, and provide uniqueness constraints, and  
handle data conversion (for binary data like the cookie, for string  
data like the hostname, and for integers, like the prog/vers/proc  
tuple).  We need to store them durably on persistent storage to have  
some protection against crashes.  These are all things that an  
embedded database can do well, and that we therefore don't have to  
code ourselves.

Simplicity is in the eye of the beholder.  I can easily counterargue  
that it's simpler to rely on a library than to duplicate the  
functionality in statd.  If we think it is appropriate to use glibc  
for managing a memory heap rather than open coding it, and libtirpc  
for handling RPC calls rather than open coding it, then why shouldn't  
we use another library for managing our host records on disk, rather  
than open coding our data conversion and record searching logic?

Again, this is a red herring.

>>>> In any event, it's not just sync(2) that is a problem.  sync(2) by
>>>> itself is a boot performance problem, but it's the combination of
>>>> rename and sync that is known to be especially unreliable during
>>>> system crashes.  Statd, being a crash monitor, shouldn't depend on
>>>> rename/sync to maintain persistent data in the face of system
>>>> instability.  I'd call that a real reason to use something more  
>>>> robust.
>>>
>>> What are you talking about? Is this about the truncate + rename  
>>> issue
>>> leaving empty files upon a crash?
>>> That issue is solved trivially by doing an fsync() before you  
>>> rename the
>>> file. That entire discussion was about whether or not existing
>>> applications should be _required_ to do this kind of POSIX pedantry,
>>> when previously they could get away without it.
>>>
>>> IOW: that issue alone does not justify replacing the current  
>>> simple file
>>> based scheme.
>>>
>>
>> There are other reasons, not to use the simple file-based scheme  
>> too...
>>
>> Internationalized domain names will be easier to deal with via  
>> sqlite3,
>> for instance.
>
> Please explain...

IPv6 is used in Asia, where they almost certainly need to use non- 
ASCII characters in their hostnames.  Internationalized domain names  
are stored in double-wide character sets.  To provide reliable support  
for IDNs in statd, we will have to guarantee somehow that we can store  
an IDN as a file name (if we want to stay with the current scheme), no  
matter what file system is used for /var.

What's more, multi-homed host support will need to store multiple  
records for the same hostname.  The mon_name is the same, but my_name  
is different, for each of these records.  So we could do that by  
adding more than one line in each hostname file, but it's also a  
simple matter to set this up in SQL.

When we want to have statd remember things like multiple addresses for  
the same hostname, or whether the remote is a client or server, we  
will need to make more adjustments to the files.

As we get more and more new requirements, why lock ourselves into the  
current on-disk format?  Using statd means we can store new fields and  
new records without any backwards-compatibility issues.  It's all  
handled by the database code.  So, we can think about the high level  
problem of getting statd to behave correctly rather than worry about  
the details of exactly how we are going to get the next data item  
stored in our current files in a backward compatible way.

>> Certainly we could code this up ourselves, but what's the benefit to
>> doing that when we have a perfectly good data storage engine  
>> available?
>
> Why change something that works???? Rewriting from scratch is _NOT_  
> the
> Linux way, and has usually bitten us hard when we've done it.

Because we are adding a bunch of new feature requirements.   
Internationalized domain names, multi-homed host support, IPv6 and TI- 
RPC, fast boot times, keeping better track of remote host addresses,  
keeping track of which remotes are clients and which are servers, and  
support for sending notifications via TCP all require significant  
modifications to this code base.

At some point you have to look at the code you have, and decide it's  
simply not going to be adequate, going forward.

> The 2.6.19 rewrite of the kernel mount code springs to mind...

One can just as easily argue that we've been bitten hard precisely  
because we've let things rot, or because we have inadequate testing  
for these components.

Another red herring, and especially annoying because you've known I  
was rewriting statd for months.  Only now, when I'm done, do you say  
"rewriting is not the Linux way."

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                               ` <20090909151354.504ccdf6-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2009-09-09 19:19                                 ` Chuck Lever
  0 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-09 19:19 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Trond Myklebust, J. Bruce Fields, steved, linux-nfs


On Sep 9, 2009, at 3:13 PM, Jeff Layton wrote:

> On Wed, 09 Sep 2009 14:39:59 -0400
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
>> On Wed, 2009-09-09 at 14:29 -0400, Jeff Layton wrote:
>>> On Wed, 05 Aug 2009 19:30:04 -0400
>>> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>>>
>>>> On Wed, 2009-08-05 at 18:24 -0400, Chuck Lever wrote:
>>>>> On Aug 5, 2009, at 5:22 PM, Trond Myklebust wrote:
>>>>>> On Wed, 2009-08-05 at 14:26 -0400, Chuck Lever wrote:
>>>>>>> sqlite3 doesn't do anything special under the covers.  It uses  
>>>>>>> only
>>>>>>> POSIX file access and locking calls, as far as I know.  So I  
>>>>>>> think
>>>>>>> hosting /var on most well-behaved clustering file systems  
>>>>>>> won't have
>>>>>>> any problem with this arrangement.
>>>>>>
>>>>>> So we're basically introducing a dependency on a completely new
>>>>>> library
>>>>>> that will have to be added to boot partitions/nfsroot/etc, and  
>>>>>> we have
>>>>>> no real reason for doing it other than because we want to move  
>>>>>> from
>>>>>> using sync() to fsync()?
>>>>>>
>>>>>> Sounds like a NACK to me...
>>>>>
>>>>> Which library are you talking about, libsqlite3 or libtirpc?   
>>>>> Because
>>>>> NEITHER of those is in /lib.
>>>>
>>>> libsqlite is the problem. Unlike libtirpc, it's utility has yet  
>>>> to be
>>>> established.
>>>>
>>>
>>> Sorry to revive this so late, but I think we need to come to some
>>> sort of resolution here. The only missing piece for client side IPv6
>>> support is statd...
>>>
>>> I'm not sure I understand the objection to using libsqlite3 here. We
>>> certainly could roll our own routines to handle data storage, but  
>>> why
>>> would we want to do so? sqlite3 is quite good at what it does. Why
>>> wouldn't we want to use it?
>>
>> Backwards compatibility is one major reason. statd already exists,  
>> and
>> is in use out there. I shouldn't be forced to reboot all my clients  
>> when
>> I upgrade the nfs-utils package on my server.
>>
>
> We could roll a conversion utility for this if it would help. nfs- 
> utils
> upgrades usually mean restarting statd anyway:
>
> shut down old-statd
> convert flat file db to sqlite
> start up new-statd
>
> ...so I don't think we necessarily need to reboot the clients for  
> this.
> It should (in theory) be possible to do the reverse even.
>
>> Simplicity is another reason. WTF do we need a full SQL database,  
>> when
>> all we want to do is store 2 pieces of data (a hostname and a  
>> cookie)?
>> It isn't as if this has been a major problem for us previously.
>>
>
> Been a little while since I took my initial look at new-statd, but I
> see that Chuck is has this (just a for-instance):
>
>        rc = sqlite3_prepare_v2(db, "CREATE TABLE "  
> STATD_MONITOR_TABLENAME
>                                        " (priv BLOB,"
>                                        " mon_name TEXT NOT NULL,"
>                                        " my_name TEXT NOT NULL,"
>                                        " program INTEGER,"
>                                        " version INTEGER,"
>                                        " procedure INTEGER,"
>                                        " protocol TEXT NOT NULL,"
>                                        " state INTEGER,"
>                                        " UNIQUE(mon_name,my_name));",
>                                                -1, &stmt, NULL);
>
>
> He's tracking some other info there too. Is this all necessary? Maybe
> not now, but having a storage engine that can cope with tracking extra
> info will make it easier to handle things like multihomed clients and
> servers correctly (something that the existing statd is not very good
> at at the moment).
>
>>>>> In any event, it's not just sync(2) that is a problem.  sync(2) by
>>>>> itself is a boot performance problem, but it's the combination of
>>>>> rename and sync that is known to be especially unreliable during
>>>>> system crashes.  Statd, being a crash monitor, shouldn't depend on
>>>>> rename/sync to maintain persistent data in the face of system
>>>>> instability.  I'd call that a real reason to use something more  
>>>>> robust.
>>>>
>>>> What are you talking about? Is this about the truncate + rename  
>>>> issue
>>>> leaving empty files upon a crash?
>>>> That issue is solved trivially by doing an fsync() before you  
>>>> rename the
>>>> file. That entire discussion was about whether or not existing
>>>> applications should be _required_ to do this kind of POSIX  
>>>> pedantry,
>>>> when previously they could get away without it.
>>>>
>>>> IOW: that issue alone does not justify replacing the current  
>>>> simple file
>>>> based scheme.
>>>>
>>>
>>> There are other reasons, not to use the simple file-based scheme  
>>> too...
>>>
>>> Internationalized domain names will be easier to deal with via  
>>> sqlite3,
>>> for instance.
>>
>> Please explain...
>>
>
> Well, we currently store statd info in flat files named with the
> hostname. With an internationalized domain name, we may have a
> multibyte character in that name. We could try to store that as an
> ASCII or UTF8 name, but we'd have to roll conversion routines for it.
> Why bother when we have a storage engine that does that work for us?
>
>>> Certainly we could code this up ourselves, but what's the benefit to
>>> doing that when we have a perfectly good data storage engine  
>>> available?
>>
>> Why change something that works???? Rewriting from scratch is _NOT_  
>> the
>> Linux way, and has usually bitten us hard when we've done it.
>>
>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>
> A good point. Given who I work for, I *really* hate regressions since
> they tend to mean a lot of my time tends to get eaten up.
>
> At some point however it becomes very difficult to patch up old code.
> old-statd in particular hasn't seen the attention that it probably
> should have.
>
> I trust that Chuck will be willing to fix problems that come along.  
> The
> new code seems to be on par with the old in complexity so I don't  
> think
> we'd be taking on a great maintenance burden with it even if Chuck
> isn't available.
>
> Could we add IPv6 support as a patchset to the existing statd instead?
> Sure.

Not as easy as you might think.

The old statd open codes a bunch of the RPC client side, which means  
all of that would have to be rewritten anyway.

> That patch would be smaller than Chuck's rewrite, but I still
> think there are advantages to considering a major overhaul here.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 19:17                             ` Chuck Lever
@ 2009-09-09 19:42                               ` Trond Myklebust
  2009-09-09 22:18                                 ` Chuck Lever
  0 siblings, 1 reply; 39+ messages in thread
From: Trond Myklebust @ 2009-09-09 19:42 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
> The old statd still exists in nfs-utils.  The new statd is an entirely  
> separate component.  Distributions can continue to use the old statd  
> as long as they want.  This is a red herring.

Bullshit. If they are adding IPv6 support, then they will have to
upgrade at some point.

> > Simplicity is another reason. WTF do we need a full SQL database, when
> > all we want to do is store 2 pieces of data (a hostname and a cookie)?
> > It isn't as if this has been a major problem for us previously.
> 
> Because we are not storing just a hostname and a cookie.  We are  
> storing several different data items for each host, and we need to  
> search over the records, and provide uniqueness constraints, and  
> handle data conversion (for binary data like the cookie, for string  
> data like the hostname, and for integers, like the prog/vers/proc  
> tuple).  We need to store them durably on persistent storage to have  
> some protection against crashes.  These are all things that an  
> embedded database can do well, and that we therefore don't have to  
> code ourselves.

Speaking of red herrings. Why are we adding all this crap?

This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM from
scratch, just add minimal support for IPv6.

> >>>> In any event, it's not just sync(2) that is a problem.  sync(2) by
> >>>> itself is a boot performance problem, but it's the combination of
> >>>> rename and sync that is known to be especially unreliable during
> >>>> system crashes.  Statd, being a crash monitor, shouldn't depend on
> >>>> rename/sync to maintain persistent data in the face of system
> >>>> instability.  I'd call that a real reason to use something more  
> >>>> robust.
> >>>
> >>> What are you talking about? Is this about the truncate + rename  
> >>> issue
> >>> leaving empty files upon a crash?
> >>> That issue is solved trivially by doing an fsync() before you  
> >>> rename the
> >>> file. That entire discussion was about whether or not existing
> >>> applications should be _required_ to do this kind of POSIX pedantry,
> >>> when previously they could get away without it.
> >>>
> >>> IOW: that issue alone does not justify replacing the current  
> >>> simple file
> >>> based scheme.
> >>>
> >>
> >> There are other reasons, not to use the simple file-based scheme  
> >> too...
> >>
> >> Internationalized domain names will be easier to deal with via  
> >> sqlite3,
> >> for instance.
> >
> > Please explain...
> 
> IPv6 is used in Asia, where they almost certainly need to use non- 
> ASCII characters in their hostnames.  Internationalized domain names  
> are stored in double-wide character sets.  To provide reliable support  
> for IDNs in statd, we will have to guarantee somehow that we can store  
> an IDN as a file name (if we want to stay with the current scheme), no  
> matter what file system is used for /var.

So, what's stopping us? These are POSIX filesystems. They can store any
filename as long as it doesn't contain '/' or '\0'.

> What's more, multi-homed host support will need to store multiple  
> records for the same hostname.  The mon_name is the same, but my_name  
> is different, for each of these records.  So we could do that by  
> adding more than one line in each hostname file, but it's also a  
> simple matter to set this up in SQL.
> 
> When we want to have statd remember things like multiple addresses for  
> the same hostname, or whether the remote is a client or server, we  
> will need to make more adjustments to the files.
> 
> As we get more and more new requirements, why lock ourselves into the  
> current on-disk format?  Using statd means we can store new fields and  
> new records without any backwards-compatibility issues.  It's all  
> handled by the database code.  So, we can think about the high level  
> problem of getting statd to behave correctly rather than worry about  
> the details of exactly how we are going to get the next data item  
> stored in our current files in a backward compatible way.

Again. This is a legacy filesystem. Why are we adding requirements?

> >> Certainly we could code this up ourselves, but what's the benefit to
> >> doing that when we have a perfectly good data storage engine  
> >> available?
> >
> > Why change something that works???? Rewriting from scratch is _NOT_  
> > the
> > Linux way, and has usually bitten us hard when we've done it.
> 
> Because we are adding a bunch of new feature requirements.   
> Internationalized domain names, multi-homed host support, IPv6 and TI- 
> RPC, fast boot times, keeping better track of remote host addresses,  
> keeping track of which remotes are clients and which are servers, and  
> support for sending notifications via TCP all require significant  
> modifications to this code base.
> 
> At some point you have to look at the code you have, and decide it's  
> simply not going to be adequate, going forward.
> 
> > The 2.6.19 rewrite of the kernel mount code springs to mind...
> 
> One can just as easily argue that we've been bitten hard precisely  
> because we've let things rot, or because we have inadequate testing  
> for these components.
> 
> Another red herring, and especially annoying because you've known I  
> was rewriting statd for months.  Only now, when I'm done, do you say  
> "rewriting is not the Linux way."

I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
this crap on me a month ago, and I made my feelings quite clear then.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 19:42                               ` Trond Myklebust
@ 2009-09-09 22:18                                 ` Chuck Lever
  2009-09-09 23:15                                   ` Steve Dickson
                                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-09 22:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>> The old statd still exists in nfs-utils.  The new statd is an  
>> entirely
>> separate component.  Distributions can continue to use the old statd
>> as long as they want.  This is a red herring.
>
> Bullshit. If they are adding IPv6 support, then they will have to
> upgrade at some point.

I don't see a problem with a distribution upgrade using old statd and  
a fresh install using new statd.  You have to install a lot of new  
components to get NFS/IPv6 support.  It's not like the only thing that  
needs to change is statd.  People will install a new distribution to  
get IPv6 support.  With so many simple ways to install from scratch,  
the days of someone upgrading just a few pieces of an old system to  
get a new feature, especially one as extensive as NFS/IPv6, are long  
gone.

I don't hear a lot of distributors objecting to this idea.

And you have never clearly answered why it wouldn't be enough to add a  
little code to convert the current on-disk format to sqlite3 when  
upgrading to the new statd, if upgradability is truly an important  
requirement.  Possibly this is because it eliminates the only real  
technical objection you have to using sqlite3 here.

>>> Simplicity is another reason. WTF do we need a full SQL database,  
>>> when
>>> all we want to do is store 2 pieces of data (a hostname and a  
>>> cookie)?
>>> It isn't as if this has been a major problem for us previously.
>>
>> Because we are not storing just a hostname and a cookie.  We are
>> storing several different data items for each host, and we need to
>> search over the records, and provide uniqueness constraints, and
>> handle data conversion (for binary data like the cookie, for string
>> data like the hostname, and for integers, like the prog/vers/proc
>> tuple).  We need to store them durably on persistent storage to have
>> some protection against crashes.  These are all things that an
>> embedded database can do well, and that we therefore don't have to
>> code ourselves.
>
> Speaking of red herrings. Why are we adding all this crap?
>
> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM  
> from
> scratch, just add minimal support for IPv6.

You and Bruce brought up a number of work items related to statd,  
including having distinct statd behavior for remotes who are clients  
and remotes who are servers.  Tom Talpey suggested we needed to send  
multiple SM_NOTIFY requests to each host, and use TCP to do it when  
possible, and you even specifically encouraged me to read his  
connectathon presentation on this.  If Asian countries are driving the  
IPv6 requirement, why wouldn't they want IDN support as well?   
Interoperable NFS/IPv6 support requires TI-RPC.  Plus, NFS/IPv6  
practically requires multi-homed NLM/NSM support -- see Alex's RFC  
draft for details on that.

Which would you like me to drop?

Let me also point out that old statd is already broken in a number of  
ways, and I certainly haven't heard a lot of complaints about it.  Our  
client NLM has sent "0" as our NSM state number for years, for  
example.  Thus I hardly think there is a lot of risk in making changes  
here.  It can only get better.

>>>>>> In any event, it's not just sync(2) that is a problem.  sync(2)  
>>>>>> by
>>>>>> itself is a boot performance problem, but it's the combination of
>>>>>> rename and sync that is known to be especially unreliable during
>>>>>> system crashes.  Statd, being a crash monitor, shouldn't depend  
>>>>>> on
>>>>>> rename/sync to maintain persistent data in the face of system
>>>>>> instability.  I'd call that a real reason to use something more
>>>>>> robust.
>>>>>
>>>>> What are you talking about? Is this about the truncate + rename
>>>>> issue
>>>>> leaving empty files upon a crash?
>>>>> That issue is solved trivially by doing an fsync() before you
>>>>> rename the
>>>>> file. That entire discussion was about whether or not existing
>>>>> applications should be _required_ to do this kind of POSIX  
>>>>> pedantry,
>>>>> when previously they could get away without it.
>>>>>
>>>>> IOW: that issue alone does not justify replacing the current
>>>>> simple file
>>>>> based scheme.
>>>>>
>>>>
>>>> There are other reasons, not to use the simple file-based scheme
>>>> too...
>>>>
>>>> Internationalized domain names will be easier to deal with via
>>>> sqlite3,
>>>> for instance.
>>>
>>> Please explain...
>>
>> IPv6 is used in Asia, where they almost certainly need to use non-
>> ASCII characters in their hostnames.  Internationalized domain names
>> are stored in double-wide character sets.  To provide reliable  
>> support
>> for IDNs in statd, we will have to guarantee somehow that we can  
>> store
>> an IDN as a file name (if we want to stay with the current scheme),  
>> no
>> matter what file system is used for /var.
>
> So, what's stopping us? These are POSIX filesystems. They can store  
> any
> filename as long as it doesn't contain '/' or '\0'.

IDNs are UTF16.  /var therefore has to support UTF16 filenames; either  
byte in a double-byte character can be '/' or '\0'.  That means the  
underlying fs implementation has to support UTF16 (FAT32 anyone?), and  
the system's locale has to be configured correctly.  If we decide not  
to depend on the file system to support UTF16 filenames, then statd  
has to be intelligent enough to figure out how to deal with converting  
UTF16 hostnames before storing them as filenames.  Then, we have to  
teach matchhostname() and friends how to deal with double-byte  
character strings...

Or we just tell sqlite3 that this is a double-byte character string,  
and let it handle the collation and on-disk storage details for us.

The point is, this is yet another detail we have to either worry about  
and open code in statd, or we can simply rely on what's already  
provided in sqlite3.  No one, repeat NO ONE, is arguing that you can't  
implement these features without sqlite3.  My argument is that we  
quickly bury a whole bunch of details if we use sqlite3, and can then  
focus on larger issues.  That's the prime goal of software layering  
with libraries.

We can open code any or all of statd.  In fact the current statd open  
codes RPC request creation in socket buffers rather than using glibc's  
RPC API, and I think we agree that is not an optimal solution.  The  
question is: should we duplicate code and bugs by open coding statd's  
RPC and data storage?  Or should we pretend to be modern software  
engineers, and use widely used and known good code that other people  
have written already to handle these details?

>> What's more, multi-homed host support will need to store multiple
>> records for the same hostname.  The mon_name is the same, but my_name
>> is different, for each of these records.  So we could do that by
>> adding more than one line in each hostname file, but it's also a
>> simple matter to set this up in SQL.
>>
>> When we want to have statd remember things like multiple addresses  
>> for
>> the same hostname, or whether the remote is a client or server, we
>> will need to make more adjustments to the files.
>>
>> As we get more and more new requirements, why lock ourselves into the
>> current on-disk format?  Using statd means we can store new fields  
>> and
>> new records without any backwards-compatibility issues.  It's all
>> handled by the database code.  So, we can think about the high level
>> problem of getting statd to behave correctly rather than worry about
>> the details of exactly how we are going to get the next data item
>> stored in our current files in a backward compatible way.
>
> Again. This is a legacy filesystem. Why are we adding requirements?

Maybe you should ask the people who are requesting NFS/IPv6 from Red  
Hat and other distributors.  Or ask yourself why we would add an  
engine to allow NFSv2 to shift authentication flavors without  
remounting, since NFSv2 is a legacy filesystem.  Or ask why you think  
we should add support to statd to recognize the difference between  
remote clients and servers, if this is a legacy filesystem.

I don't object to any of those work items, but I do have trouble with  
you dropping in and saying "why diddle a legacy file system" when  
clearly that is not a show stopper in these other cases.

>>>> Certainly we could code this up ourselves, but what's the benefit  
>>>> to
>>>> doing that when we have a perfectly good data storage engine
>>>> available?
>>>
>>> Why change something that works???? Rewriting from scratch is _NOT_
>>> the
>>> Linux way, and has usually bitten us hard when we've done it.
>>
>> Because we are adding a bunch of new feature requirements.
>> Internationalized domain names, multi-homed host support, IPv6 and  
>> TI-
>> RPC, fast boot times, keeping better track of remote host addresses,
>> keeping track of which remotes are clients and which are servers, and
>> support for sending notifications via TCP all require significant
>> modifications to this code base.
>>
>> At some point you have to look at the code you have, and decide it's
>> simply not going to be adequate, going forward.
>>
>>> The 2.6.19 rewrite of the kernel mount code springs to mind...
>>
>> One can just as easily argue that we've been bitten hard precisely
>> because we've let things rot, or because we have inadequate testing
>> for these components.
>>
>> Another red herring, and especially annoying because you've known I
>> was rewriting statd for months.  Only now, when I'm done, do you say
>> "rewriting is not the Linux way."
>
> I have _NEVER_ agreed to a rewrite of the storage formats. You sprang
> this crap on me a month ago, and I made my feelings quite clear then.

"Rewriting is not the Linux way" is not the same as saying you don't  
want to change storage formats.  Don't change the subject.

The idea that "the Linux way" is the best and only way is ridiculous  
on its face, anyway.  I mean, what do you expect when we have no  
requirements and specification process, no formal testing, C coding  
style conventions based on 20-year old coding practices, a hit-or-miss  
review process that relies more on reviewers' personal preferences  
than any kind of standards, no static code analysis tools, no defect  
metrics or bug meta-analysis tools, kernel debuggers are verboten, a  
combative mailing list environment, and parts of our knowledge base  
and team history are lost every time a developer leaves (in this case,  
Olaf and Neil)?  It's no wonder we never change anything unless  
absolutely necessary!

You told me to implement IPv6 support in statd.  Now you are spitting  
on what I worked out without any guidance from you because you're too  
busy working on IETF standards and NFSv4.1 to bother discussing  
"legacy" code, other than to say "ewe!".  Just how should I react to  
that, pray tell?

Clearly you do not want to admit that even "minimal" IPv6 support is a  
significant effort, especially given how far behind the Linux RPC and  
NLM/NSM implementations are.  Yelling at me, throwing a bunch of  
generic objections up, and calling my code "crap" is not going to make  
the problem any simpler.

Cooperating with me to get your way and giving specific and  
constructive criticism will go a lot farther than your aggressive and  
disrespectful attitude and obnoxious tone.  I will happily sit down  
and discuss this with you in a rational tone, but I will no longer  
tolerate your unwarranted harassment on a public mailing list.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 22:18                                 ` Chuck Lever
@ 2009-09-09 23:15                                   ` Steve Dickson
  2009-09-10 15:01                                     ` Chuck Lever
  2009-09-10  8:44                                   ` NeilBrown
  2009-09-10 15:03                                   ` J. Bruce Fields
  2 siblings, 1 reply; 39+ messages in thread
From: Steve Dickson @ 2009-09-09 23:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, linux-nfs



On 09/09/2009 06:18 PM, Chuck Lever wrote:
> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>> The old statd still exists in nfs-utils.  The new statd is an entirely
>>> separate component.  Distributions can continue to use the old statd
>>> as long as they want.  This is a red herring.
>>
>> Bullshit. If they are adding IPv6 support, then they will have to
>> upgrade at some point.
> 
> I don't see a problem with a distribution upgrade using old statd and a
> fresh install using new statd.  You have to install a lot of new
> components to get NFS/IPv6 support.  
What new components that are not already being installed?? 

> It's not like the only thing that needs to change is statd.  
> People will install a new distribution to get IPv6 support. 
> With so many simple ways to install from scratch, the days of someone 
> upgrading just a few pieces of an old system to get a new feature, 
> especially one as extensive as NFS/IPv6, are long gone.
I'm not sure how people could only install bits and pieces of 
nfs-utils... Even 'make install' in the git tree installs everything...

> 
> And you have never clearly answered why it wouldn't be enough to add a
> little code to convert the current on-disk format to sqlite3 when
> upgrading to the new statd, if upgradability is truly an important
> requirement.  Possibly this is because it eliminates the only real
> technical objection you have to using sqlite3 here.
The issue I would have with using sqlite3 is it would add yet another
requirement on nfs-utils... I really don't know how big sqlite3 and/or
sqlite3-devel (possibly needed for builds) packages are but it just
one more thing will be need for nfs-utils to function... 
 
> 
>>>> Simplicity is another reason. WTF do we need a full SQL database, when
>>>> all we want to do is store 2 pieces of data (a hostname and a cookie)?
>>>> It isn't as if this has been a major problem for us previously.
>>>
>>> Because we are not storing just a hostname and a cookie.  We are
>>> storing several different data items for each host, and we need to
>>> search over the records, and provide uniqueness constraints, and
>>> handle data conversion (for binary data like the cookie, for string
>>> data like the hostname, and for integers, like the prog/vers/proc
>>> tuple).  We need to store them durably on persistent storage to have
>>> some protection against crashes.  These are all things that an
>>> embedded database can do well, and that we therefore don't have to
>>> code ourselves.
>>
>> Speaking of red herrings. Why are we adding all this crap?
>>
>> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM from
>> scratch, just add minimal support for IPv6.
> 
> You and Bruce brought up a number of work items related to statd,
> including having distinct statd behavior for remotes who are clients and
> remotes who are servers.  Tom Talpey suggested we needed to send
> multiple SM_NOTIFY requests to each host, and use TCP to do it when
> possible, and you even specifically encouraged me to read his
> connectathon presentation on this.  If Asian countries are driving the
> IPv6 requirement, why wouldn't they want IDN support as well? 
> Interoperable NFS/IPv6 support requires TI-RPC.  Plus, NFS/IPv6
> practically requires multi-homed NLM/NSM support -- see Alex's RFC draft
> for details on that.
So a database is needed to accomplish all this? 

> 
> Let me also point out that old statd is already broken in a number of
> ways, and I certainly haven't heard a lot of complaints about it.  Our
> client NLM has sent "0" as our NSM state number for years, for example. 
> Thus I hardly think there is a lot of risk in making changes here.  It
> can only get better.
> 
I can agree with you here... 

>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>> ASCII characters in their hostnames.  Internationalized domain names
>>> are stored in double-wide character sets.  To provide reliable support
>>> for IDNs in statd, we will have to guarantee somehow that we can store
>>> an IDN as a file name (if we want to stay with the current scheme), no
>>> matter what file system is used for /var.
>>
>> So, what's stopping us? These are POSIX filesystems. They can store any
>> filename as long as it doesn't contain '/' or '\0'.
> 
> IDNs are UTF16.  /var therefore has to support UTF16 filenames; either
> byte in a double-byte character can be '/' or '\0'.  That means the
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and
> the system's locale has to be configured correctly.  If we decide not to
> depend on the file system to support UTF16 filenames, then statd has to
> be intelligent enough to figure out how to deal with converting UTF16
> hostnames before storing them as filenames.  Then, we have to teach
> matchhostname() and friends how to deal with double-byte character
> strings...
Has this been a problem in the past? How are other implementations 
dealing with this? Have they gone to use a db as well?
 
> 
> Or we just tell sqlite3 that this is a double-byte character string, and
> let it handle the collation and on-disk storage details for us.
> 
> The point is, this is yet another detail we have to either worry about
> and open code in statd, or we can simply rely on what's already provided
> in sqlite3.  No one, repeat NO ONE, is arguing that you can't implement
> these features without sqlite3.  My argument is that we quickly bury a
> whole bunch of details if we use sqlite3, and can then focus on larger
> issues.  That's the prime goal of software layering with libraries.
What kind of performance hit will there be (if any)? The nice thing
about a file is you only have to read it once in to a cache verses
doing a number of queries... or can one also cache queries?

> 
> We can open code any or all of statd.  In fact the current statd open
> codes RPC request creation in socket buffers rather than using glibc's
> RPC API, and I think we agree that is not an optimal solution.  The
> question is: should we duplicate code and bugs by open coding statd's
> RPC and data storage?  Or should we pretend to be modern software
> engineers, and use widely used and known good code that other people
> have written already to handle these details?
I'm all for using moving forward with "modern software" but, as 
a common theme with me, I'm always worried about becoming 
needlessly complicated or over engineering... which might be
the case with having statd use a db... 

steved.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 22:18                                 ` Chuck Lever
  2009-09-09 23:15                                   ` Steve Dickson
@ 2009-09-10  8:44                                   ` NeilBrown
  2009-09-10 14:09                                     ` Chuck Lever
       [not found]                                     ` <9eae93545189a6be6eebe0460b860fc7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  2009-09-10 15:03                                   ` J. Bruce Fields
  2 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2009-09-10  8:44 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:

> The idea that "the Linux way" is the best and only way is ridiculous
> on its face, anyway.  I mean, what do you expect when we have no
> requirements and specification process, no formal testing, C coding
> style conventions based on 20-year old coding practices, a hit-or-miss
> review process that relies more on reviewers' personal preferences
> than any kind of standards, no static code analysis tools, no defect
> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> combative mailing list environment, and parts of our knowledge base
> and team history are lost every time a developer leaves (in this case,
> Olaf and Neil)?  It's no wonder we never change anything unless
> absolutely necessary!

And yet is largely works!
I could summarise a lot of your points by observing that the community
values people over process.  I really think that is the right place to
put value, because people are richer and more flexible than process.

I agree that combative mailing lists are a problem, but even there, I
believe most of the aggression is more perceived than real, and that
a graceful, humble, polite attitude can have a positive-feedback effect
too.

Yes, there are lots of practices that might improve things that we don't
have standardised.  But one practice we do have that has proven very
effective is incremental refinement.  It can be hard to understand what
order to make changes until after you have made them, but once you
understand what you want to do, going back and doing it in logical
order really is very effective.  It makes it easier for others to
review, it makes it easy for you to review yourself.  It means
less controversial bits can be included quickly leaving room for the
more controversial bits to be discussed in isolation.

I think that the switch from portmap to rpcbind was a bad idea,
and I think that a wholesale replacement of statd is probably a
bad idea too.  It might seem like the easiest way to get something
useful working, but you'll probably be paying the price for years as
little regression turn up because you didn't completely understand
the original statd (and face it, who does?)

As for the use of sql-lite ... I must admit that I wouldn't choose
it.  Maybe it is a good idea.  If it is, you probably need to merge
that change early with a clear argument and tools to make it manageable
(e.g. a developer will want to tool to be able to look inside the
database easily and make changes, without having to know sql).
It is much easier to discuss one thing at a time on these
combative mailing lists ;-)

NeilBrown

P.S.
And we do have static code analysis tools.  Both 'gcc' and 'sparse'
fit that description.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10  8:44                                   ` NeilBrown
@ 2009-09-10 14:09                                     ` Chuck Lever
  2009-09-14  7:08                                       ` Neil Brown
  2009-09-14 13:54                                       ` Trond Myklebust
       [not found]                                     ` <9eae93545189a6be6eebe0460b860fc7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-10 14:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
>> The idea that "the Linux way" is the best and only way is ridiculous
>> on its face, anyway.  I mean, what do you expect when we have no
>> requirements and specification process, no formal testing, C coding
>> style conventions based on 20-year old coding practices, a hit-or- 
>> miss
>> review process that relies more on reviewers' personal preferences
>> than any kind of standards, no static code analysis tools, no defect
>> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
>> combative mailing list environment, and parts of our knowledge base
>> and team history are lost every time a developer leaves (in this  
>> case,
>> Olaf and Neil)?  It's no wonder we never change anything unless
>> absolutely necessary!
>
> And yet is largely works!
> I could summarise a lot of your points by observing that the community
> values people over process.  I really think that is the right place to
> put value, because people are richer and more flexible than process.

Agreed, but there are risks to that approach as well, which are  
largely ignored by the Linux community.  The last point in my list is  
probably the biggest risk: when people leave, we are stuck with  
decades-old code that no-one understands.  Cf: statd.

My point is that many of the items I mentioned above are expressly  
designed to allow quicker, less risky change, precisely to decrease  
the amount of time and effort to get new features into our code.  Yet  
we turn our back on all of them in favor of an antique "don't touch  
that!" policy.  "Don't touch that!" is not a reasonable argument  
against replacing components that need to be replaced.

> I agree that combative mailing lists are a problem, but even there, I
> believe most of the aggression is more perceived than real, and that
> a graceful, humble, polite attitude can have a positive-feedback  
> effect
> too.

Years ago I believed that, but I have seen much evidence to the  
contrary in this community.  More often such an attitude is entirely  
ignored, or treated as an invitation for abuse, especially by people  
who have no interest in politeness.  This kind of approach has no  
effect on the leaders in the Linux community, who set an example of  
extreme rudeness and belligerence.

I've made an effort to stop arguing small points, and to make  
observations and not argue.  I still get e-mail full of "crap" this  
and "bullshit" that and "NACK!" with little explanation.

> Yes, there are lots of practices that might improve things that we  
> don't
> have standardised.  But one practice we do have that has proven very
> effective is incremental refinement.  It can be hard to understand  
> what
> order to make changes until after you have made them, but once you
> understand what you want to do, going back and doing it in logical
> order really is very effective.  It makes it easier for others to
> review, it makes it easy for you to review yourself.  It means
> less controversial bits can be included quickly leaving room for the
> more controversial bits to be discussed in isolation.

I am a fan of incremental refinement, and I use that approach as often  
as I can.  There are some things that incremental refinement cannot  
do, however.

> I think that the switch from portmap to rpcbind was a bad idea,
> and I think that a wholesale replacement of statd is probably a
> bad idea too.  It might seem like the easiest way to get something
> useful working, but you'll probably be paying the price for years as
> little regression turn up because you didn't completely understand
> the original statd (and face it, who does?)

Yes, but _why_ is it a bad idea?  All I hear is "this is a bad idea"  
and "you could do it some other way" but these are qualitative, not  
quantitative arguments.  They are religious statements, not specific  
technical criticisms.

This leaves me with the impression that folks are responding out of a  
fear of the unknown, and not out of a considered technical opinion.   
If sqlite3 is outside of people's comfort zone, that's OK.  Please  
let's be honest about it instead of slinging mud and throwing up a  
bunch of generic arguments that no one can rebut.

> As for the use of sql-lite ... I must admit that I wouldn't choose
> it.  Maybe it is a good idea.  If it is, you probably need to merge
> that change early with a clear argument and tools to make it  
> manageable
> (e.g. a developer will want to tool to be able to look inside the
> database easily and make changes, without having to know sql).
> It is much easier to discuss one thing at a time on these
> combative mailing lists ;-)

That's a fine and constructive comment, thanks.

There is already a tool for managing the data in the database:  
'sqlite3' the executable, which can be used in shell scripts.  There  
are also sqlite3 libraries for Python and Perl and C.  This is really  
not very different from POSIX file system calls and using 'cat'.  SQL  
is not difficult to learn, and a one or two page recipe document is  
easy to provide.  I certainly have not used any advanced features of  
SQL to implement new statd.

Given the complexity of the change, it makes it much easier to argue  
against sqlite3, however, if it is separated from the set of changes  
that motivate its use.  Bruce, for example, has stated to me  
specifically that he prefers having such changes and their  
motivational requirements included in the same patch.  I regularly  
code for several different maintainers, and each one has his own  
preferences, often contradicting other maintainers.  I don't think  
regular maintainers have any idea how confusing and challenging this is.

Note, however, I am not married to the specifics of sqlite3.  What I  
am attached to is the ideas that the current system is inadequate for  
the kinds of features we want to add to statd, and that statd should  
not worry about the details of data storage, or search and management  
of the host records, because we have other tools that are better at  
this.  If there is another solution that provides a durable and  
flexible way to store and search host records and offload the details  
to pre-existing code, I'm open.  However, sqlite3 is the most widely  
used embedded database on the planet, and is eminently suitable for  
this task.

> And we do have static code analysis tools.  Both 'gcc' and 'sparse'
> fit that description.

Yes, gcc can provide some static analysis, if the correct options are  
specified, and care is taken to eliminate the noise of false  
positives.  There is a prevailing attitude, however, that this is a  
worthless endeavor.  Witness the amount of noise that comes out when  
you build a Red Hat kernel or the nfs-utils package.  Witness also the  
sarcasm of Linus who repeatedly chides folks for not running sparse  
regularly.

Additionally, gcc is not the best tool for this job, given the often  
oblique way it calls out errors and warnings.  There are purify,  
fortify, and splint, just to name three, that are standard analysis  
tools we don't even consider.

My comment goes more to the point that static analysis is not  
considered of any value.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 23:15                                   ` Steve Dickson
@ 2009-09-10 15:01                                     ` Chuck Lever
  0 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-10 15:01 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, linux-nfs

On Sep 9, 2009, at 7:15 PM, Steve Dickson wrote:
> On 09/09/2009 06:18 PM, Chuck Lever wrote:
>> On Sep 9, 2009, at 3:42 PM, Trond Myklebust wrote:
>>> On Wed, 2009-09-09 at 15:17 -0400, Chuck Lever wrote:
>>>> On Sep 9, 2009, at 2:39 PM, Trond Myklebust wrote:
>>>> The old statd still exists in nfs-utils.  The new statd is an  
>>>> entirely
>>>> separate component.  Distributions can continue to use the old  
>>>> statd
>>>> as long as they want.  This is a red herring.
>>>
>>> Bullshit. If they are adding IPv6 support, then they will have to
>>> upgrade at some point.
>>
>> I don't see a problem with a distribution upgrade using old statd  
>> and a
>> fresh install using new statd.  You have to install a lot of new
>> components to get NFS/IPv6 support.
> What new components that are not already being installed??

You need a kernel that can do NFS/IPv6, you need to install rpcbind  
and libtirpc, you need the new mount command, you need all the user  
space network pieces to manage IPv6, you need to consider firewall and  
address distribution on your local network, and you need statd and  
mountd/exportfs to get NFS/IPv6 support.

Configuring a system for IPv6 support can also be nontrivial, and not  
something people will do on a whim.

I didn't mean to imply that some of these components are not already  
installed.  My point is that the required changes for NFS/IPv6 are  
wide spread, and that most people would opt for installing a new OS on  
their systems to get these features, rather than upgrade all of these  
items piecemeal.

>> And you have never clearly answered why it wouldn't be enough to  
>> add a
>> little code to convert the current on-disk format to sqlite3 when
>> upgrading to the new statd, if upgradability is truly an important
>> requirement.  Possibly this is because it eliminates the only real
>> technical objection you have to using sqlite3 here.
> The issue I would have with using sqlite3 is it would add yet another
> requirement on nfs-utils... I really don't know how big sqlite3 and/or
> sqlite3-devel (possibly needed for builds) packages are but it just
> one more thing will be need for nfs-utils to function...

sqlite3.org provides a single source file version of sqlite3 that is  
licensed and designed explicitly for folks to include in their own  
code, without the need for linking a library.  You can even disable a  
number of build time options to reduce object size.

This means that the libsqlite3 and libsqlite3-devel packages would not  
be required on either the build system or the end system, and it  
eliminates the issue of whether libsqlite3.so can be moved to /lib.

>>>>> Simplicity is another reason. WTF do we need a full SQL  
>>>>> database, when
>>>>> all we want to do is store 2 pieces of data (a hostname and a  
>>>>> cookie)?
>>>>> It isn't as if this has been a major problem for us previously.
>>>>
>>>> Because we are not storing just a hostname and a cookie.  We are
>>>> storing several different data items for each host, and we need to
>>>> search over the records, and provide uniqueness constraints, and
>>>> handle data conversion (for binary data like the cookie, for string
>>>> data like the hostname, and for integers, like the prog/vers/proc
>>>> tuple).  We need to store them durably on persistent storage to  
>>>> have
>>>> some protection against crashes.  These are all things that an
>>>> embedded database can do well, and that we therefore don't have to
>>>> code ourselves.
>>>
>>> Speaking of red herrings. Why are we adding all this crap?
>>>
>>> This is a legacy filesystem! We shouldn't not be rewriting NLM/NSM  
>>> from
>>> scratch, just add minimal support for IPv6.
>>
>> You and Bruce brought up a number of work items related to statd,
>> including having distinct statd behavior for remotes who are  
>> clients and
>> remotes who are servers.  Tom Talpey suggested we needed to send
>> multiple SM_NOTIFY requests to each host, and use TCP to do it when
>> possible, and you even specifically encouraged me to read his
>> connectathon presentation on this.  If Asian countries are driving  
>> the
>> IPv6 requirement, why wouldn't they want IDN support as well?
>> Interoperable NFS/IPv6 support requires TI-RPC.  Plus, NFS/IPv6
>> practically requires multi-homed NLM/NSM support -- see Alex's RFC  
>> draft
>> for details on that.
> So a database is needed to accomplish all this?

No, a database is not specifically required.

However, libsqlite3 is a library that contains all of the elements --  
durable on-disk storage, proper data conversion for binary blobs,  
single- and double-width character strings, integers, the ability to  
constrain record uniqueness, the ability to add new data items easily  
to each record, and a facility for collating and searching the host  
records.

sqlite3 is an embedded database, meaning the implementation is  
purposely smaller than a full SQL database, and is designed explicitly  
to have zero database administration requirements.  sqlite3 is  
designed for managing data for long-running network daemons, and it is  
widely used for that purpose.

If there is some other pre-existing code that can do this, I'm open to  
considering it.

>> Let me also point out that old statd is already broken in a number of
>> ways, and I certainly haven't heard a lot of complaints about it.   
>> Our
>> client NLM has sent "0" as our NSM state number for years, for  
>> example.
>> Thus I hardly think there is a lot of risk in making changes here.   
>> It
>> can only get better.
>>
> I can agree with you here...
>
>>>> IPv6 is used in Asia, where they almost certainly need to use non-
>>>> ASCII characters in their hostnames.  Internationalized domain  
>>>> names
>>>> are stored in double-wide character sets.  To provide reliable  
>>>> support
>>>> for IDNs in statd, we will have to guarantee somehow that we can  
>>>> store
>>>> an IDN as a file name (if we want to stay with the current  
>>>> scheme), no
>>>> matter what file system is used for /var.
>>>
>>> So, what's stopping us? These are POSIX filesystems. They can  
>>> store any
>>> filename as long as it doesn't contain '/' or '\0'.
>>
>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;  
>> either
>> byte in a double-byte character can be '/' or '\0'.  That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?),  
>> and
>> the system's locale has to be configured correctly.  If we decide  
>> not to
>> depend on the file system to support UTF16 filenames, then statd  
>> has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames.  Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
> Has this been a problem in the past? How are other implementations
> dealing with this? Have they gone to use a db as well?

No, IDNs are recent, but it is reasonable to think that  
internationalized domain names is a feature that would appeal to the  
same folks who are driving the IPv6 requirement.  This is not a hard  
requirement, but it is one reason why statd's current on-disk format  
is not adequate.

Yes, I understand that there are some statd implementations that use a  
database rather than flat files.  statd is nothing if not exactly a  
mechanism for storing structured data across system crashes.  That's  
exactly what databases are for.

>> Or we just tell sqlite3 that this is a double-byte character  
>> string, and
>> let it handle the collation and on-disk storage details for us.
>>
>> The point is, this is yet another detail we have to either worry  
>> about
>> and open code in statd, or we can simply rely on what's already  
>> provided
>> in sqlite3.  No one, repeat NO ONE, is arguing that you can't  
>> implement
>> these features without sqlite3.  My argument is that we quickly  
>> bury a
>> whole bunch of details if we use sqlite3, and can then focus on  
>> larger
>> issues.  That's the prime goal of software layering with libraries.
> What kind of performance hit will there be (if any)? The nice thing
> about a file is you only have to read it once in to a cache verses
> doing a number of queries... or can one also cache queries?

sqlite3's performance for the statd application would actually be  
better than what we have today.

Naturally the database is cached in memory, making queries as fast as  
memory reads.  The better performance comes with record insertion and  
deletion.  Today statd does a file create and then an O_SYNC write to  
that file.  This requires synchronous metadata updates to the file  
system to create the new file and create a new directory entry for  
it.  If the directory becomes large, creating a new directory entry  
becomes even slower.  Likewise for record deletion, multiple  
synchronous metadata updates are required to remove the directory  
entry and the file containing the host record.

With sqlite3 (or any database style solution) record insertion and  
deletion can usually be handled with a single O_SYNC write to the  
database file.

You could argue that using sqlite3 means more CPU and memory  
consumption.  Perhaps, but that's a less onerous resource requirement  
than synchronous disk activity, in my view.

>> We can open code any or all of statd.  In fact the current statd open
>> codes RPC request creation in socket buffers rather than using  
>> glibc's
>> RPC API, and I think we agree that is not an optimal solution.  The
>> question is: should we duplicate code and bugs by open coding statd's
>> RPC and data storage?  Or should we pretend to be modern software
>> engineers, and use widely used and known good code that other people
>> have written already to handle these details?
> I'm all for using moving forward with "modern software" but, as
> a common theme with me, I'm always worried about becoming
> needlessly complicated or over engineering... which might be
> the case with having statd use a db...

Consider what would happen if we open coded all of the details of on- 
disk storage and record searching into statd itself.  I think  
something like sqlite3 is a better and less complex solution than open  
coding because all these details are moved out of statd into a pre- 
existing library, thus making statd itself architecturally simpler,  
and therefore easier to understand and maintain.

The one weakness here is the dependence on SQL.  That makes the statd  
code uglier and more complex than I would like, and is something I  
want to address.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-09 22:18                                 ` Chuck Lever
  2009-09-09 23:15                                   ` Steve Dickson
  2009-09-10  8:44                                   ` NeilBrown
@ 2009-09-10 15:03                                   ` J. Bruce Fields
  2009-09-10 16:14                                     ` Chuck Lever
  2 siblings, 1 reply; 39+ messages in thread
From: J. Bruce Fields @ 2009-09-10 15:03 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
> IDNs are UTF16.  /var therefore has to support UTF16 filenames; either  
> byte in a double-byte character can be '/' or '\0'.  That means the  
> underlying fs implementation has to support UTF16 (FAT32 anyone?), and  
> the system's locale has to be configured correctly.  If we decide not to 
> depend on the file system to support UTF16 filenames, then statd has to 
> be intelligent enough to figure out how to deal with converting UTF16 
> hostnames before storing them as filenames.  Then, we have to teach 
> matchhostname() and friends how to deal with double-byte character 
> strings...

Googling around.... Is this accurate?:

	http://en.wikipedia.org/wiki/Internationalized_domain_name

That makes it sound like domain names are staying ascii, and they're
just adding something on top to allow encoding unicode using ascii,
which may optionally be used by applications.

--b.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                                     ` <9eae93545189a6be6eebe0460b860fc7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-09-10 15:05                                       ` bpm
  0 siblings, 0 replies; 39+ messages in thread
From: bpm @ 2009-09-10 15:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chuck Lever, Trond Myklebust, Jeff Layton, J. Bruce Fields,
	steved, linux-nfs

On Thu, Sep 10, 2009 at 06:44:00PM +1000, NeilBrown wrote:
> I could summarise a lot of your points by observing that the community
> values people over process.  I really think that is the right place to
> put value, because people are richer and more flexible than process.

What a refreshing point of view.  I hope it catches on.

> As for the use of sql-lite ... I must admit that I wouldn't choose
> it. 

A number of other projects support multiple backends.  Maybe that makes
sense here.

-Ben

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 15:03                                   ` J. Bruce Fields
@ 2009-09-10 16:14                                     ` Chuck Lever
  2009-09-10 16:23                                       ` J. Bruce Fields
  0 siblings, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-09-10 16:14 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;  
>> either
>> byte in a double-byte character can be '/' or '\0'.  That means the
>> underlying fs implementation has to support UTF16 (FAT32 anyone?),  
>> and
>> the system's locale has to be configured correctly.  If we decide  
>> not to
>> depend on the file system to support UTF16 filenames, then statd  
>> has to
>> be intelligent enough to figure out how to deal with converting UTF16
>> hostnames before storing them as filenames.  Then, we have to teach
>> matchhostname() and friends how to deal with double-byte character
>> strings...
>
> Googling around.... Is this accurate?:
>
> 	http://en.wikipedia.org/wiki/Internationalized_domain_name
>
> That makes it sound like domain names are staying ascii, and they're
> just adding something on top to allow encoding unicode using ascii,
> which may optionally be used by applications.

There is a mechanism that provides an ASCII-ized version of domain  
names that may contain non-ASCII characters, expressly for  
applications that need to perform DNS queries but can't be easily  
converted to handle double-byte character strings.  This can be  
adapted for statd, though I'm not sure if the converted ASCII version  
of such names specifically exclude '/'.

Internationalized domain names themselves are still expressed in  
UTF16, as far as I understand it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 16:14                                     ` Chuck Lever
@ 2009-09-10 16:23                                       ` J. Bruce Fields
  2009-09-10 16:43                                         ` Trond Myklebust
  2009-09-10 20:39                                         ` Chuck Lever
  0 siblings, 2 replies; 39+ messages in thread
From: J. Bruce Fields @ 2009-09-10 16:23 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;  
>>> either
>>> byte in a double-byte character can be '/' or '\0'.  That means the
>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),  
>>> and
>>> the system's locale has to be configured correctly.  If we decide  
>>> not to
>>> depend on the file system to support UTF16 filenames, then statd has 
>>> to
>>> be intelligent enough to figure out how to deal with converting UTF16
>>> hostnames before storing them as filenames.  Then, we have to teach
>>> matchhostname() and friends how to deal with double-byte character
>>> strings...
>>
>> Googling around.... Is this accurate?:
>>
>> 	http://en.wikipedia.org/wiki/Internationalized_domain_name
>>
>> That makes it sound like domain names are staying ascii, and they're
>> just adding something on top to allow encoding unicode using ascii,
>> which may optionally be used by applications.
>
> There is a mechanism that provides an ASCII-ized version of domain names 
> that may contain non-ASCII characters, expressly for applications that 
> need to perform DNS queries but can't be easily converted to handle 
> double-byte character strings.  This can be adapted for statd, though I'm 
> not sure if the converted ASCII version of such names specifically 
> exclude '/'.
>
> Internationalized domain names themselves are still expressed in UTF16, 
> as far as I understand it.

>From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
me that protocols (at the very least, any preexisting protocols) are all
expected to use the ascii representation on the wire, and that the
translation to unicode is meant by use for applications.

So in our case we'd continue to expect ascii domain names on the wire,
and I believe that's also what we should store in any database.  But if
someone were to write a gui administrative interface to that data, for
example, they might choose to use idna for display.

--b.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 16:23                                       ` J. Bruce Fields
@ 2009-09-10 16:43                                         ` Trond Myklebust
  2009-09-10 20:39                                         ` Chuck Lever
  1 sibling, 0 replies; 39+ messages in thread
From: Trond Myklebust @ 2009-09-10 16:43 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Chuck Lever, Jeff Layton, steved, linux-nfs

On Thu, 2009-09-10 at 12:23 -0400, J. Bruce Fields wrote:
> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears to
> me that protocols (at the very least, any preexisting protocols) are all
> expected to use the ascii representation on the wire, and that the
> translation to unicode is meant by use for applications.
> 
> So in our case we'd continue to expect ascii domain names on the wire,
> and I believe that's also what we should store in any database.  But if
> someone were to write a gui administrative interface to that data, for
> example, they might choose to use idna for display.

Right. There should be no need to translate anything. Just store as is
and use as is...

Trond


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 16:23                                       ` J. Bruce Fields
  2009-09-10 16:43                                         ` Trond Myklebust
@ 2009-09-10 20:39                                         ` Chuck Lever
  2009-09-10 20:49                                           ` J. Bruce Fields
  1 sibling, 1 reply; 39+ messages in thread
From: Chuck Lever @ 2009-09-10 20:39 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;
>>>> either
>>>> byte in a double-byte character can be '/' or '\0'.  That means the
>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>>> and
>>>> the system's locale has to be configured correctly.  If we decide
>>>> not to
>>>> depend on the file system to support UTF16 filenames, then statd  
>>>> has
>>>> to
>>>> be intelligent enough to figure out how to deal with converting  
>>>> UTF16
>>>> hostnames before storing them as filenames.  Then, we have to teach
>>>> matchhostname() and friends how to deal with double-byte character
>>>> strings...
>>>
>>> Googling around.... Is this accurate?:
>>>
>>> 	http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>
>>> That makes it sound like domain names are staying ascii, and they're
>>> just adding something on top to allow encoding unicode using ascii,
>>> which may optionally be used by applications.
>>
>> There is a mechanism that provides an ASCII-ized version of domain  
>> names
>> that may contain non-ASCII characters, expressly for applications  
>> that
>> need to perform DNS queries but can't be easily converted to handle
>> double-byte character strings.  This can be adapted for statd,  
>> though I'm
>> not sure if the converted ASCII version of such names specifically
>> exclude '/'.
>>
>> Internationalized domain names themselves are still expressed in  
>> UTF16,
>> as far as I understand it.
>
> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears  
> to
> me that protocols (at the very least, any preexisting protocols) are  
> all
> expected to use the ascii representation on the wire, and that the
> translation to unicode is meant by use for applications.
>
> So in our case we'd continue to expect ascii domain names on the wire,
> and I believe that's also what we should store in any database.  But  
> if
> someone were to write a gui administrative interface to that data, for
> example, they might choose to use idna for display.

That's a reasonable and specific objection to my claim that our  
current host record storage format is inadequate to support IDNA.   
I've also confirmed that ToAscii with the UseSTD3ASCIIRules flag set  
is not supposed to generate a domain label string with a '/' in it.   
My remaining concern here is that we could possibly see hostnames that  
are too long to be stored in directory entries of some file systems,  
especially considering that the ASCII-fied Unicode names will be  
longer than typical ASCII names we normally encounter today.

What about multi-homed host support?  The same mon_name can be used  
with more than one my_name, for multi-homed hosts.  Using the current  
on-disk scheme, statd turns that SM_MON request into a no-op.  So  
additional records for the same hostname can't be stored, or we have  
to resort to adding multiple lines in the same file.  This is possible  
to do with just POSIX file system calls, but it does add complexity to  
manage several lines in each hostname file without increasing the risk  
of corruption if a file update (especially the deletion of one record  
in the middle) is interrupted.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 20:39                                         ` Chuck Lever
@ 2009-09-10 20:49                                           ` J. Bruce Fields
  2009-09-10 21:26                                             ` Chuck Lever
  0 siblings, 1 reply; 39+ messages in thread
From: J. Bruce Fields @ 2009-09-10 20:49 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;
>>>>> either
>>>>> byte in a double-byte character can be '/' or '\0'.  That means the
>>>>> underlying fs implementation has to support UTF16 (FAT32 anyone?),
>>>>> and
>>>>> the system's locale has to be configured correctly.  If we decide
>>>>> not to
>>>>> depend on the file system to support UTF16 filenames, then statd  
>>>>> has
>>>>> to
>>>>> be intelligent enough to figure out how to deal with converting  
>>>>> UTF16
>>>>> hostnames before storing them as filenames.  Then, we have to teach
>>>>> matchhostname() and friends how to deal with double-byte character
>>>>> strings...
>>>>
>>>> Googling around.... Is this accurate?:
>>>>
>>>> 	http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>
>>>> That makes it sound like domain names are staying ascii, and they're
>>>> just adding something on top to allow encoding unicode using ascii,
>>>> which may optionally be used by applications.
>>>
>>> There is a mechanism that provides an ASCII-ized version of domain  
>>> names
>>> that may contain non-ASCII characters, expressly for applications  
>>> that
>>> need to perform DNS queries but can't be easily converted to handle
>>> double-byte character strings.  This can be adapted for statd,  
>>> though I'm
>>> not sure if the converted ASCII version of such names specifically
>>> exclude '/'.
>>>
>>> Internationalized domain names themselves are still expressed in  
>>> UTF16,
>>> as far as I understand it.
>>
>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears  
>> to
>> me that protocols (at the very least, any preexisting protocols) are  
>> all
>> expected to use the ascii representation on the wire, and that the
>> translation to unicode is meant by use for applications.
>>
>> So in our case we'd continue to expect ascii domain names on the wire,
>> and I believe that's also what we should store in any database.  But  
>> if
>> someone were to write a gui administrative interface to that data, for
>> example, they might choose to use idna for display.
>
> That's a reasonable and specific objection to my claim that our current 
> host record storage format is inadequate to support IDNA.  I've also 
> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not 
> supposed to generate a domain label string with a '/' in it.  My 
> remaining concern here is that we could possibly see hostnames that are 
> too long to be stored in directory entries of some file systems,  
> especially considering that the ASCII-fied Unicode names will be longer 
> than typical ASCII names we normally encounter today.

Googling around some more.... the normal limits for dns appear to be 63
bytes per component, and 255 for the whole string, and those limits are
still in force on the output of that mapping.  I suspect this isn't a
huge deal.

> What about multi-homed host support?  The same mon_name can be used with 
> more than one my_name, for multi-homed hosts.  Using the current on-disk 
> scheme, statd turns that SM_MON request into a no-op.

I don't know what we're supposed to do in that case.  You want to store
them all so you can send notifies to them all on reboot?

--b.

> So additional 
> records for the same hostname can't be stored, or we have to resort to 
> adding multiple lines in the same file.  This is possible to do with just 
> POSIX file system calls, but it does add complexity to manage several 
> lines in each hostname file without increasing the risk of corruption if 
> a file update (especially the deletion of one record in the middle) is 
> interrupted.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 20:49                                           ` J. Bruce Fields
@ 2009-09-10 21:26                                             ` Chuck Lever
  0 siblings, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-10 21:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Trond Myklebust, Jeff Layton, steved, linux-nfs

On Sep 10, 2009, at 4:49 PM, J. Bruce Fields wrote:
> On Thu, Sep 10, 2009 at 04:39:51PM -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 12:23 PM, J. Bruce Fields wrote:
>>> On Thu, Sep 10, 2009 at 12:14:27PM -0400, Chuck Lever wrote:
>>>> On Sep 10, 2009, at 11:03 AM, J. Bruce Fields wrote:
>>>>> On Wed, Sep 09, 2009 at 06:18:11PM -0400, Chuck Lever wrote:
>>>>>> IDNs are UTF16.  /var therefore has to support UTF16 filenames;
>>>>>> either
>>>>>> byte in a double-byte character can be '/' or '\0'.  That means  
>>>>>> the
>>>>>> underlying fs implementation has to support UTF16 (FAT32  
>>>>>> anyone?),
>>>>>> and
>>>>>> the system's locale has to be configured correctly.  If we decide
>>>>>> not to
>>>>>> depend on the file system to support UTF16 filenames, then statd
>>>>>> has
>>>>>> to
>>>>>> be intelligent enough to figure out how to deal with converting
>>>>>> UTF16
>>>>>> hostnames before storing them as filenames.  Then, we have to  
>>>>>> teach
>>>>>> matchhostname() and friends how to deal with double-byte  
>>>>>> character
>>>>>> strings...
>>>>>
>>>>> Googling around.... Is this accurate?:
>>>>>
>>>>> 	http://en.wikipedia.org/wiki/Internationalized_domain_name
>>>>>
>>>>> That makes it sound like domain names are staying ascii, and  
>>>>> they're
>>>>> just adding something on top to allow encoding unicode using  
>>>>> ascii,
>>>>> which may optionally be used by applications.
>>>>
>>>> There is a mechanism that provides an ASCII-ized version of domain
>>>> names
>>>> that may contain non-ASCII characters, expressly for applications
>>>> that
>>>> need to perform DNS queries but can't be easily converted to handle
>>>> double-byte character strings.  This can be adapted for statd,
>>>> though I'm
>>>> not sure if the converted ASCII version of such names specifically
>>>> exclude '/'.
>>>>
>>>> Internationalized domain names themselves are still expressed in
>>>> UTF16,
>>>> as far as I understand it.
>>>
>>> From a quick skim of http://www.ietf.org/rfc/rfc3490.txt, it appears
>>> to
>>> me that protocols (at the very least, any preexisting protocols) are
>>> all
>>> expected to use the ascii representation on the wire, and that the
>>> translation to unicode is meant by use for applications.
>>>
>>> So in our case we'd continue to expect ascii domain names on the  
>>> wire,
>>> and I believe that's also what we should store in any database.  But
>>> if
>>> someone were to write a gui administrative interface to that data,  
>>> for
>>> example, they might choose to use idna for display.
>>
>> That's a reasonable and specific objection to my claim that our  
>> current
>> host record storage format is inadequate to support IDNA.  I've also
>> confirmed that ToAscii with the UseSTD3ASCIIRules flag set is not
>> supposed to generate a domain label string with a '/' in it.  My
>> remaining concern here is that we could possibly see hostnames that  
>> are
>> too long to be stored in directory entries of some file systems,
>> especially considering that the ASCII-fied Unicode names will be  
>> longer
>> than typical ASCII names we normally encounter today.
>
> Googling around some more.... the normal limits for dns appear to be  
> 63
> bytes per component, and 255 for the whole string, and those limits  
> are
> still in force on the output of that mapping.  I suspect this isn't a
> huge deal.

I bring this up because NI_MAXHOST, declared in /usr/include/netdb.h,  
is 1025, not 255.  You are probably correct, practically speaking.

>> What about multi-homed host support?  The same mon_name can be used  
>> with
>> more than one my_name, for multi-homed hosts.  Using the current on- 
>> disk
>> scheme, statd turns that SM_MON request into a no-op.
>
> I don't know what we're supposed to do in that case.  You want to  
> store
> them all so you can send notifies to them all on reboot?

Something like that.  Basically I think we want to send SM_NOTIFY to  
the monitored host _from_ every registered my_name we have for that  
mon_name.  The kernel will probably use a separate nsm_host for each  
one of these, so statd should probably keep track of each of the  
cookies as well.

Remembering which my_names have been used is important because the  
remote often uses the sender's name or IP address to identify which  
monitored host has rebooted.

We could throw up our hands and just keep track of all the my_names  
that were used during the last reboot, and notify each mon_name from  
all of those.  That doesn't help with remembering the cookies, though,  
and makes sending all notifications at reboot take longer.

>> So additional
>> records for the same hostname can't be stored, or we have to resort  
>> to
>> adding multiple lines in the same file.  This is possible to do  
>> with just
>> POSIX file system calls, but it does add complexity to manage several
>> lines in each hostname file without increasing the risk of  
>> corruption if
>> a file update (especially the deletion of one record in the middle)  
>> is
>> interrupted.
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 14:09                                     ` Chuck Lever
@ 2009-09-14  7:08                                       ` Neil Brown
       [not found]                                         ` <19117.60405.793389.323010-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2009-09-15  2:45                                         ` Chuck Lever
  2009-09-14 13:54                                       ` Trond Myklebust
  1 sibling, 2 replies; 39+ messages in thread
From: Neil Brown @ 2009-09-14  7:08 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Thursday September 10, chuck.lever@oracle.com wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > On Thu, September 10, 2009 8:18 am, Chuck Lever wrote:
> >> The idea that "the Linux way" is the best and only way is ridiculous
> >> on its face, anyway.  I mean, what do you expect when we have no
> >> requirements and specification process, no formal testing, C coding
> >> style conventions based on 20-year old coding practices, a hit-or- 
> >> miss
> >> review process that relies more on reviewers' personal preferences
> >> than any kind of standards, no static code analysis tools, no defect
> >> metrics or bug meta-analysis tools, kernel debuggers are verboten, a
> >> combative mailing list environment, and parts of our knowledge base
> >> and team history are lost every time a developer leaves (in this  
> >> case,
> >> Olaf and Neil)?  It's no wonder we never change anything unless
> >> absolutely necessary!
> >
> > And yet is largely works!
> > I could summarise a lot of your points by observing that the community
> > values people over process.  I really think that is the right place to
> > put value, because people are richer and more flexible than process.
> 
> Agreed, but there are risks to that approach as well, which are  
> largely ignored by the Linux community.  The last point in my list is  
> probably the biggest risk: when people leave, we are stuck with  
> decades-old code that no-one understands.  Cf: statd.
> 

Much of my understanding of statd is embedded in the code in
nfs-utils, and in the change log (which is much better since we
changed to git).
By discarding all that and creating a new project you risk losing that
legacy.

There can certainly be a case of discarding and starting again.  I did
that with the support tools for raid (raidtools is no-more, mdadm
reigns :-).  But this is not a decision to be taken lightly and is not
without it's costs.  The fact that raidtools was by that time
essentially an unmaintained orphan made some of those costs
unavoidable in my case.

You obviously could do the same thing - you don't need anyone's
permission: create a new project called "statd" and make it whatever
you want and hope that distributors will pick it up.  If your statd
supports IPv6, and the nfs-utils one does not, then there is a good
chance that distros will pick it up as people like to see that big
tick next to "IPv6 support".

Maintaining and developing such a thing long enough to establish a
self-sustaining community would be a big effort.  You would need to
compare that with the effort of taking the incremental approach and
revising the code in nfs-utils and getting those revisions accepted.
That would probably have a higher development cost, but a lower
maintenance cost.  It is very hard to know up front which cost is
lower.

But you will leave one day.  How can you best make sure that you leave
something that others can maintain????


> My point is that many of the items I mentioned above are expressly  
> designed to allow quicker, less risky change, precisely to decrease  
> the amount of time and effort to get new features into our code.  Yet  
> we turn our back on all of them in favor of an antique "don't touch  
> that!" policy.  "Don't touch that!" is not a reasonable argument  
> against replacing components that need to be replaced.

The only "Don't touch that" which I am aware of relates to interfaces,
particularly with established code.
In the case of statd, the files in sm/ and sm.bak/ are a well
established interface.  Exactly how much is dependant on it is hard to
say.  Not much formal code I expect but maybe some obscure scripts and
lots of sysadmin knowledge.

Can you run them both in parallel??  i.e. have a database with all the
data, but also store it in the files (if the hostname can be
represented in ASCII)... It is hard to guess how easy that would be
and how worthwhile it would be.  And it doesn't answer the question of
whether sqlite is stable enough.

> 
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback  
> > effect
> > too.
> 
> Years ago I believed that, but I have seen much evidence to the  
> contrary in this community.  More often such an attitude is entirely  
> ignored, or treated as an invitation for abuse, especially by people  
> who have no interest in politeness.  This kind of approach has no  
> effect on the leaders in the Linux community, who set an example of  
> extreme rudeness and belligerence.

That last comment is interesting.  At the 2007 kernel summit (the last
one I was at) the topic of mailing list etiquette was discussed and
there seemed to be agreement that we, the leaders (I guess the kernel
summit attendees are the closest we have to group leadership) have a
role in damping down the fire, not building it up.  My feeling is that
most high profile people do quite well, but maybe I am too forgiving.

> 
> I've made an effort to stop arguing small points, and to make  
> observations and not argue.  I still get e-mail full of "crap" this  
> and "bullshit" that and "NACK!" with little explanation.

I certainly agree that sort of response is best not sent.  And I must
confess to a recent experience (on a different list) where I gave up
due to similar behaviour.   That is partly why I decided to join in
this discussion (though I'm not sure if I'm being helpful yet).

> 
> > Yes, there are lots of practices that might improve things that we  
> > don't
> > have standardised.  But one practice we do have that has proven very
> > effective is incremental refinement.  It can be hard to understand  
> > what
> > order to make changes until after you have made them, but once you
> > understand what you want to do, going back and doing it in logical
> > order really is very effective.  It makes it easier for others to
> > review, it makes it easy for you to review yourself.  It means
> > less controversial bits can be included quickly leaving room for the
> > more controversial bits to be discussed in isolation.
> 
> I am a fan of incremental refinement, and I use that approach as often  
> as I can.  There are some things that incremental refinement cannot  
> do, however.
> 
> > I think that the switch from portmap to rpcbind was a bad idea,
> > and I think that a wholesale replacement of statd is probably a
> > bad idea too.  It might seem like the easiest way to get something
> > useful working, but you'll probably be paying the price for years as
> > little regression turn up because you didn't completely understand
> > the original statd (and face it, who does?)
> 
> Yes, but _why_ is it a bad idea?  All I hear is "this is a bad idea"  
> and "you could do it some other way" but these are qualitative, not  
> quantitative arguments.  They are religious statements, not specific  
> technical criticisms.

It is a bad idea because it doesn't have the legacy of testing and
refinement.  Almost as soon as we started using it bugs were found -
or at least differences in behaviour to portmap (something about the
privilege level required to register a binding I think).

Now I admit that no one put their hand up to add IPv6 support to
portmap, arguably it could have been a worse idea to stay with portmap
as it meant no IPv6.  But changing was still a bad idea.

Had we (had the man power to) incrementally enhance portmap we would
have had a much more reviewable process, and a bisectable result which
would allow regression to be isolated more directly.

> 
> This leaves me with the impression that folks are responding out of a  
> fear of the unknown, and not out of a considered technical opinion.   
> If sqlite3 is outside of people's comfort zone, that's OK.  Please  
> let's be honest about it instead of slinging mud and throwing up a  
> bunch of generic arguments that no one can rebut.
> 
> > As for the use of sql-lite ... I must admit that I wouldn't choose
> > it.  Maybe it is a good idea.  If it is, you probably need to merge
> > that change early with a clear argument and tools to make it  
> > manageable
> > (e.g. a developer will want to tool to be able to look inside the
> > database easily and make changes, without having to know sql).
> > It is much easier to discuss one thing at a time on these
> > combative mailing lists ;-)
> 
> That's a fine and constructive comment, thanks.
> 
> There is already a tool for managing the data in the database:  
> 'sqlite3' the executable, which can be used in shell scripts.  There  
> are also sqlite3 libraries for Python and Perl and C.  This is really  
> not very different from POSIX file system calls and using 'cat'.  SQL  
> is not difficult to learn, and a one or two page recipe document is  
> easy to provide.  I certainly have not used any advanced features of  
> SQL to implement new statd.
> 
> Given the complexity of the change, it makes it much easier to argue  
> against sqlite3, however, if it is separated from the set of changes  
> that motivate its use.  Bruce, for example, has stated to me  
> specifically that he prefers having such changes and their  
> motivational requirements included in the same patch.  I regularly  
> code for several different maintainers, and each one has his own  
> preferences, often contradicting other maintainers.  I don't think  
> regular maintainers have any idea how confusing and challenging this is.
> 

:-)

I think there has to be room for balance here.  Certainly it is best
to use new functionality as soon as it is added so the use case is
clear.  But I can imagine that converting from a files database to a
sqlite database would be several patches in itself.  And then if you
want to start adding 'search' or 'utf-16' functionality, that would be
more patches still.  Combining all that in to one big patch just to
keep the change with the motivation seems unlikely to be a win from
anyone's perspective.
Certainly having a real big changelog comment early on that
explains the value of sqlite would be essential, and that patch would
not be merged until the subsequent ones were reviewed.


> Note, however, I am not married to the specifics of sqlite3.  What I  
> am attached to is the ideas that the current system is inadequate for  
> the kinds of features we want to add to statd, and that statd should  
> not worry about the details of data storage, or search and management  
> of the host records, because we have other tools that are better at  
> this.  If there is another solution that provides a durable and  
> flexible way to store and search host records and offload the details  
> to pre-existing code, I'm open.  However, sqlite3 is the most widely  
> used embedded database on the planet, and is eminently suitable for  
> this task.
> 
> > And we do have static code analysis tools.  Both 'gcc' and 'sparse'
> > fit that description.
> 
> Yes, gcc can provide some static analysis, if the correct options are  
> specified, and care is taken to eliminate the noise of false  
> positives.  There is a prevailing attitude, however, that this is a  
> worthless endeavor.  Witness the amount of noise that comes out when  
> you build a Red Hat kernel or the nfs-utils package.  Witness also the  
> sarcasm of Linus who repeatedly chides folks for not running sparse  
> regularly.
> 
> Additionally, gcc is not the best tool for this job, given the often  
> oblique way it calls out errors and warnings.  There are purify,  
> fortify, and splint, just to name three, that are standard analysis  
> tools we don't even consider.
> 
> My comment goes more to the point that static analysis is not  
> considered of any value.

You might be right....

I see there be a "law of diminishing returns" here.
sparse (aka "make C=1") has almost never shown me anything
interesting.  So while there is a (small) cost in running it, there is
almost no perceived value.

For my own C projects I always compile with "-Wall -Werror" to
remove the cost of running it and to artificially increase the value.

I just tried "make C=1" on a couple of bits of kernel code and it did
provide some vaguely interesting things that I'll probably fix.
But there was a lot of noise like:
  warning: potentially expensive pointer subtraction
which I don't think I want to fix, but don't know how to silence the
warning for just that case.  It would be nice of "C=1" was a default
and either the warning, or sparse, were fixed.   That might increase
the perception that this sort of thing was of value.

Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
and got rid of all of them -- except those generated by rpcgen. 
Have more been introduced?

NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-10 14:09                                     ` Chuck Lever
  2009-09-14  7:08                                       ` Neil Brown
@ 2009-09-14 13:54                                       ` Trond Myklebust
       [not found]                                         ` <1252936467.6866.54.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2009-09-15  1:29                                         ` Chuck Lever
  1 sibling, 2 replies; 39+ messages in thread
From: Trond Myklebust @ 2009-09-14 13:54 UTC (permalink / raw)
  To: Chuck Lever; +Cc: NeilBrown, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
> > I agree that combative mailing lists are a problem, but even there, I
> > believe most of the aggression is more perceived than real, and that
> > a graceful, humble, polite attitude can have a positive-feedback  
> > effect
> > too.
> 
> Years ago I believed that, but I have seen much evidence to the  
> contrary in this community.  More often such an attitude is entirely  
> ignored, or treated as an invitation for abuse, especially by people  
> who have no interest in politeness.  This kind of approach has no  
> effect on the leaders in the Linux community, who set an example of  
> extreme rudeness and belligerence.
> 
> I've made an effort to stop arguing small points, and to make  
> observations and not argue.  I still get e-mail full of "crap" this  
> and "bullshit" that and "NACK!" with little explanation.

As you said above, you've been part of the community for years. It is
not as if you haven't learned by now that a review might turn up issues
that may give rise to a NACK, and that you need to be open to changing
your code should this happen.
If you need more information about what needs to be changed, then you
know to ask.

That hasn't been your approach in this case, though, and the responses
you got were a direct consequence of that approach.
You tried reversing the burden of proof as to why we should change an
established interface instead of supplying adequate evidence justifying
that change.
When problems were pointed out to you (e.g. backward compatibility) your
response was to deny they existed instead of proposing a change to your
code.
Finally, you tried changing the thread into a discussion about mean rude
people obstructing you and failing to give you adequate guidance.



The problems with the code remain, and you will need to change it in
order to make it acceptable. The question I haven't seen you asking, and
that you should have be asking from the very start is "what would be the
minimal set of changes?".

I'm quite willing to discuss that with you.

Trond


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                                         ` <19117.60405.793389.323010-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-09-14 15:48                                           ` Steve Dickson
  0 siblings, 0 replies; 39+ messages in thread
From: Steve Dickson @ 2009-09-14 15:48 UTC (permalink / raw)
  To: Neil Brown
  Cc: Chuck Lever, Trond Myklebust, Jeff Layton, J. Bruce Fields, linux-nfs



On 09/14/2009 03:08 AM, Neil Brown wrote:
> 
> Some time ago Greg Banks went on a pursuit of warnings in nfs-utils
> and got rid of all of them -- except those generated by rpcgen. 
> Have more been introduced?
The only other warnings I'm struggling with are the 
     warning: dereferencing pointer 'local_addr' does break strict-aliasing rules

in sm-notify.c, which seem to be a bunch of noise... But I will like 
to get ride of them... Other than that, there are no warnings that 
I know of... 

steved.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
       [not found]                                         ` <1252936467.6866.54.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-09-14 15:55                                           ` Steve Dickson
  0 siblings, 0 replies; 39+ messages in thread
From: Steve Dickson @ 2009-09-14 15:55 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Chuck Lever, NeilBrown, Jeff Layton, J. Bruce Fields, linux-nfs

> The problems with the code remain, and you will need to change it in
> order to make it acceptable. The question I haven't seen you asking, and
> that you should have be asking from the very start is "what would be the
> minimal set of changes?".

I too would like to see what are minimal change set would be... Just
to see how much a db is really needed... If at all.. 

Now with that said... I've taken on the task of how to deal with
pNFS exports and it might make sense that a db is need to deal with
that complexity...  So I'm not totally against the idea of introducing 
a db to nfs-utils I just think we need to do it for the right reason....

steved.
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-14 13:54                                       ` Trond Myklebust
       [not found]                                         ` <1252936467.6866.54.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-09-15  1:29                                         ` Chuck Lever
  1 sibling, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-15  1:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: NeilBrown, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Sep 14, 2009, at 9:54 AM, Trond Myklebust wrote:
> On Thu, 2009-09-10 at 10:09 -0400, Chuck Lever wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:
>>> I agree that combative mailing lists are a problem, but even  
>>> there, I
>>> believe most of the aggression is more perceived than real, and that
>>> a graceful, humble, polite attitude can have a positive-feedback
>>> effect
>>> too.
>>
>> Years ago I believed that, but I have seen much evidence to the
>> contrary in this community.  More often such an attitude is entirely
>> ignored, or treated as an invitation for abuse, especially by people
>> who have no interest in politeness.  This kind of approach has no
>> effect on the leaders in the Linux community, who set an example of
>> extreme rudeness and belligerence.
>>
>> I've made an effort to stop arguing small points, and to make
>> observations and not argue.  I still get e-mail full of "crap" this
>> and "bullshit" that and "NACK!" with little explanation.
>
> As you said above, you've been part of the community for years. It is
> not as if you haven't learned by now that a review might turn up  
> issues
> that may give rise to a NACK, and that you need to be open to changing
> your code should this happen.

The tone of your vetoes is unnecessarily aggressive, and often the  
comments are entirely negative.  A mention of the pieces of new statd  
that you liked, for example, would have been welcome, and even useful  
for this conversation.

Naturally, you are free to disagree with me as much as you like.  But  
I wish you could be more constructive about it.  This is not a  
contest... we're supposed to be working together to improve the code  
base.

> If you need more information about what needs to be changed, then you
> know to ask.
>
> That hasn't been your approach in this case, though, and the responses
> you got were a direct consequence of that approach.
> You tried reversing the burden of proof as to why we should change an
> established interface instead of supplying adequate evidence  
> justifying
> that change.

I think it's reasonable to ask for evidence on both sides.

> When problems were pointed out to you (e.g. backward compatibility)  
> your
> response was to deny they existed instead of proposing a change to  
> your
> code.

The problem you described seemed to stem from a basic disagreement  
about how statd with IPv6 would be deployed -- an issue that neither  
of us (being upstream developers and not distributors) can resolve.   
Your objection still doesn't make sense to me.  But see below.

> Finally, you tried changing the thread into a discussion about mean  
> rude
> people obstructing you and failing to give you adequate guidance.

I was responding to Neil's comments, not changing the subject.

In any event, since all of the NFS maintainers have now passed their  
judgement, it's clear that I will have to withdraw new statd, and  
proceed with a re-write that uses the existing on-disk format.  I've  
never claimed that sqlite3 is a _requirement_ to solve these issues,  
but only that some changes would be necessary to the on-disk format,  
and that a database seemed an appropriate improvement.

No maintainers agree with that, so I will rework it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part)
  2009-09-14  7:08                                       ` Neil Brown
       [not found]                                         ` <19117.60405.793389.323010-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-09-15  2:45                                         ` Chuck Lever
  1 sibling, 0 replies; 39+ messages in thread
From: Chuck Lever @ 2009-09-15  2:45 UTC (permalink / raw)
  To: Neil Brown
  Cc: Trond Myklebust, Jeff Layton, J. Bruce Fields, steved, linux-nfs

On Sep 14, 2009, at 3:08 AM, Neil Brown wrote:
> On Thursday September 10, chuck.lever@oracle.com wrote:
>> On Sep 10, 2009, at 4:44 AM, NeilBrown wrote:

> But you will leave one day.  How can you best make sure that you leave
> something that others can maintain????

By writing code that is self-explanatory, providing lots of comments  
in the code, adding to the git log (as you suggested) and writing  
expansive man pages that describe the interfaces in as clear a manner  
as possible.  The review process is also part of that effort.

There is also the possibility of mentoring others, as FreeBSD does,  
and providing extensive written documentation and specifications in  
wikis.  Agile methodologies suggest that rewriting as a regular  
practice is a good way for a team to retain familiarity with a code  
base.  Having a full test suite that can be used to verify the  
behavior of new or existing code is also a way to codify requirements  
and create an institutional memory of regressions, as well as to  
insulate users from regressions in new code.

>> My point is that many of the items I mentioned above are expressly
>> designed to allow quicker, less risky change, precisely to decrease
>> the amount of time and effort to get new features into our code.  Yet
>> we turn our back on all of them in favor of an antique "don't touch
>> that!" policy.  "Don't touch that!" is not a reasonable argument
>> against replacing components that need to be replaced.
>
> The only "Don't touch that" which I am aware of relates to interfaces,
> particularly with established code.
> In the case of statd, the files in sm/ and sm.bak/ are a well
> established interface.  Exactly how much is dependant on it is hard to
> say.  Not much formal code I expect but maybe some obscure scripts and
> lots of sysadmin knowledge.

There is no documentation I'm aware of of statd's on-disk format as a  
formal interface.  I have had some recent conversations with Lon about  
this, to handle any dependencies his clustering scripts may have, and  
he didn't throw up any flags.  He told me that all we needed was to  
provide a mechanism to access this data from a shell script, which we  
would have in 'sqlite3' the executable.

So this is a new requirement (to me, anyway).  If these files  
constitute a formal interface, how can statd be modified to store  
additional data or new data types in these files?  Am I allowed to put  
IPv6 presentation addresses in these files in place of IPv4  
addresses?  Am I allowed to add new fields?  Not rhetorical  
questions... really... how should I go about doing this and testing  
the result?  You seem to be suggesting that the sm/* files can't be  
used for the kind of features we want to add.

> Can you run them both in parallel??  i.e. have a database with all the
> data, but also store it in the files (if the hostname can be
> represented in ASCII)... It is hard to guess how easy that would be
> and how worthwhile it would be.  And it doesn't answer the question of
> whether sqlite is stable enough.

Is it even a good thing to freeze the sm/* files as a formal  
interface, or should we go about providing a real documented  
programming interface for this, and migrate to it?  There is a real  
risk to maintaining undocumented interfaces like this, and that is  
that we can't make any change to this code without a significant  
possibility of breaking something.

>>> I think that the switch from portmap to rpcbind was a bad idea,
>>> and I think that a wholesale replacement of statd is probably a
>>> bad idea too.  It might seem like the easiest way to get something
>>> useful working, but you'll probably be paying the price for years as
>>> little regression turn up because you didn't completely understand
>>> the original statd (and face it, who does?)
>>
>> Yes, but _why_ is it a bad idea?  All I hear is "this is a bad idea"
>> and "you could do it some other way" but these are qualitative, not
>> quantitative arguments.  They are religious statements, not specific
>> technical criticisms.
>
> It is a bad idea because it doesn't have the legacy of testing and
> refinement.  Almost as soon as we started using it bugs were found -
> or at least differences in behaviour to portmap (something about the
> privilege level required to register a binding I think).
>
> Now I admit that no one put their hand up to add IPv6 support to
> portmap, arguably it could have been a worse idea to stay with portmap
> as it meant no IPv6.  But changing was still a bad idea.
>
> Had we (had the man power to) incrementally enhance portmap we would
> have had a much more reviewable process, and a bisectable result which
> would allow regression to be isolated more directly.

What we have now is an inherited body of code (with its own history of  
incremental improvement) that is shared with many other operating  
systems, which improves our ability to interoperate with them, and  
includes bug fixes that have been made to it over the years.

I think we would have had some bugs and regressions pursuing either  
path.  There are well-understood ways to manage these risks, either way.

But this is a sidebar.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-09-15  2:45 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-05 14:45 [PATCH 0/4] new statd [take 2] Chuck Lever
     [not found] ` <20090805143550.12866.8377.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
2009-08-05 14:45   ` [PATCH 1/4] nfs-utils: introduce new statd implementation (1st part) Chuck Lever
     [not found]     ` <20090805144540.12866.22084.stgit-RytpoXr2tKZ9HhUboXbp9zCvJB+x5qRC@public.gmane.org>
2009-08-05 17:48       ` J. Bruce Fields
2009-08-05 18:05         ` Chuck Lever
2009-08-05 18:15           ` J. Bruce Fields
2009-08-05 18:26             ` Chuck Lever
2009-08-05 21:22               ` Trond Myklebust
     [not found]                 ` <1249507356.5428.11.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-08-05 22:24                   ` Chuck Lever
2009-08-05 23:30                     ` Trond Myklebust
2009-09-09 18:29                       ` Jeff Layton
     [not found]                         ` <20090909142945.755da393-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2009-09-09 18:39                           ` Trond Myklebust
2009-09-09 19:13                             ` Jeff Layton
     [not found]                               ` <20090909151354.504ccdf6-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2009-09-09 19:19                                 ` Chuck Lever
2009-09-09 19:17                             ` Chuck Lever
2009-09-09 19:42                               ` Trond Myklebust
2009-09-09 22:18                                 ` Chuck Lever
2009-09-09 23:15                                   ` Steve Dickson
2009-09-10 15:01                                     ` Chuck Lever
2009-09-10  8:44                                   ` NeilBrown
2009-09-10 14:09                                     ` Chuck Lever
2009-09-14  7:08                                       ` Neil Brown
     [not found]                                         ` <19117.60405.793389.323010-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-09-14 15:48                                           ` Steve Dickson
2009-09-15  2:45                                         ` Chuck Lever
2009-09-14 13:54                                       ` Trond Myklebust
     [not found]                                         ` <1252936467.6866.54.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-09-14 15:55                                           ` Steve Dickson
2009-09-15  1:29                                         ` Chuck Lever
     [not found]                                     ` <9eae93545189a6be6eebe0460b860fc7.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-09-10 15:05                                       ` bpm
2009-09-10 15:03                                   ` J. Bruce Fields
2009-09-10 16:14                                     ` Chuck Lever
2009-09-10 16:23                                       ` J. Bruce Fields
2009-09-10 16:43                                         ` Trond Myklebust
2009-09-10 20:39                                         ` Chuck Lever
2009-09-10 20:49                                           ` J. Bruce Fields
2009-09-10 21:26                                             ` Chuck Lever
2009-08-05 14:45   ` [PATCH 2/4] nfs-utils: introduce new statd implementation (2nd part) Chuck Lever
2009-08-05 14:46   ` [PATCH 3/4] nfs-utils: introduce new statd implementation (3rd part) Chuck Lever
2009-08-05 14:46   ` [PATCH 4/4] nfs-utils: introduce new statd implementation (4th part) Chuck Lever
2009-08-05 17:15   ` [PATCH 0/4] new statd [take 2] Jeff Layton
     [not found]     ` <20090805131554.67eb15bc-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2009-08-05 18:06       ` Chuck Lever

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.