linux-kernel.vger.kernel.org archive mirror
* [patch 0/13] NFSACL protocol extension for NFSv3
@ 2005-01-22 20:34 Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
                   ` (12 more replies)
  0 siblings, 13 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

This patchset implements the NFSACL protocol extension, which consists
of the GETACL and SETACL RPCs. I would appreciate having these patches
in -mm to give them more exposure. (This has nothing to do with NFSv4
acls, by the way.)

The actual access decisions are performed using the ACCESS RPC, which is
part of NFSv3 proper and is independent of acls. The GETACL and SETACL
RPCs are mainly used by tools like getfacl and setfacl, and ls (which
merely displays whether or not a file's permissions go beyond the file
mode permission bits). In addition, for files created inside directories
that have a default acl, SETACL is used at file create time to implement
the POSIX ACL file create semantics (see the comment in
nfsacl-umask.diff for a detailed explanation).

We have been shipping a slightly older version of this patch in SuSE
Linux 8.2, 9.0, 9.1, and SLES9. Until recently acls were not cached on
the client side; we didn't see this as a huge performance issue.
Nevertheless, this version now also caches acls on the client.

The protocol is compatible with the Solaris version. Solaris has
slightly different acl semantics; they are based on an earlier
POSIX acl draft than the Linux version. Unlike Linux, Solaris does
not allow three-entry acls (user::, group::, other::). Where Linux has
three-entry acls, Solaris makes up a fourth mask:: entry with the same
permissions as the group:: entry. The NFSACL protocol follows the
Solaris semantics, so we also fake a fourth entry for three-entry
acls, and give it the group:: entry permissions. When receiving a
four-entry acl whose mask:: entry permissions equal the group:: entry
permissions, we cannot tell whether it started out as a three-entry or
a four-entry acl, and we treat it as a three-entry acl. (If the
permissions differ, we know we have a "real" four-entry acl.) This
incompatibility causes mask entries to be lost in very rare cases.
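
As an illustration only (this is not code from the patchset), the
mask:: handling described above can be sketched in userspace C. The
types and the names small_acl, fake_mask and drop_redundant_mask are
hypothetical; the kernel's posix_acl structures look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry tags; not the kernel's ACL_* constants. */
enum acl_tag { TAG_USER_OBJ, TAG_GROUP_OBJ, TAG_MASK, TAG_OTHER };

struct acl_entry {
	enum acl_tag tag;
	unsigned int perm;	/* rwx bits, 0..7 */
};

/* Room for the three base entries plus an optional mask:: entry. */
struct small_acl {
	size_t count;
	struct acl_entry e[4];
};

/* Return the first entry with the given tag, or NULL if there is none. */
static struct acl_entry *find_entry(struct small_acl *acl, enum acl_tag tag)
{
	size_t i;

	for (i = 0; i < acl->count; i++)
		if (acl->e[i].tag == tag)
			return &acl->e[i];
	return NULL;
}

/* Sending side: give a three-entry acl a mask:: copied from group::. */
static void fake_mask(struct small_acl *acl)
{
	struct acl_entry *group;

	assert(acl != NULL);
	if (acl->count != 3)
		return;
	group = find_entry(acl, TAG_GROUP_OBJ);
	if (group == NULL)
		return;
	acl->e[3].tag = TAG_MASK;
	acl->e[3].perm = group->perm;
	acl->count = 4;
}

/* Receiving side: drop a mask:: that is indistinguishable from a faked one. */
static void drop_redundant_mask(struct small_acl *acl)
{
	struct acl_entry *group, *mask;
	size_t i;

	assert(acl != NULL);
	if (acl->count != 4)
		return;
	group = find_entry(acl, TAG_GROUP_OBJ);
	mask = find_entry(acl, TAG_MASK);
	if (group == NULL || mask == NULL || group->perm != mask->perm)
		return;		/* a "real" four-entry acl: keep it */
	/* Close the gap left by the removed mask:: entry. */
	for (i = (size_t)(mask - acl->e); i + 1 < acl->count; i++)
		acl->e[i] = acl->e[i + 1];
	acl->count = 3;
}
```

In this sketch, a genuine four-entry acl whose mask:: happens to equal
group:: also comes back with three entries, which is the rare loss of a
mask entry mentioned above.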

Four-entry acls are extremely rare and not very useful; I have never
come across them in real life. Judging from that and from the experience
of shipping with this incompatibility for almost two years now, I think
we can safely continue to ignore this issue. It cannot be fixed within
NFSACL without breaking Solaris compatibility, anyway.

Regards,
--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 1/13] Qsort
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 21:00   ` vlobanov
                     ` (4 more replies)
  2005-01-22 20:34 ` [patch 2/13] Return -ENOSYS for RPC programs that are unavailable Andreas Gruenbacher
                   ` (11 subsequent siblings)
  12 siblings, 5 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/qsort --]
[-- Type: text/plain, Size: 18585 bytes --]

Add a quicksort from glibc as a kernel library function, and switch
xfs over to using it. The two implementations are functionally
equivalent. The nfsacl protocol also requires a sort function, so it
makes more sense to have one in common code.
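
For readers unfamiliar with the interface, here is a minimal userspace
sketch of a caller. It uses the C library's qsort(), which has the same
signature as the kernel function added by this patch; struct id_entry,
cmp_id and sort_entries are made-up names for illustration and do not
appear in the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record type: an id with some permission bits attached. */
struct id_entry {
	unsigned int id;
	unsigned int perm;
};

/* Comparison callback in the form qsort() expects: <0, 0 or >0. */
static int cmp_id(const void *a, const void *b)
{
	const struct id_entry *x = a;
	const struct id_entry *y = b;

	if (x->id < y->id)
		return -1;
	if (x->id > y->id)
		return 1;
	return 0;
}

/* Sort an array of entries by ascending id. */
static void sort_entries(struct id_entry *entries, size_t count)
{
	assert(entries != NULL || count == 0);
	if (count > 1)
		qsort(entries, count, sizeof(*entries), cmp_id);
}
```

The comparison function compares explicitly instead of returning
x->id - y->id, since the difference of two unsigned values does not
reliably fit the int return type.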

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/kernel.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/kernel.h
+++ linux-2.6.11-rc2/include/linux/kernel.h
@@ -93,6 +93,8 @@ extern int sscanf(const char *, const ch
 	__attribute__ ((format (scanf,2,3)));
 extern int vsscanf(const char *, const char *, va_list);
 
+extern void qsort(void *, size_t, size_t, int (*)(const void *,const void *));
+
 extern int get_option(char **str, int *pint);
 extern char *get_options(const char *str, int nints, int *ints);
 extern unsigned long long memparse(char *ptr, char **retptr);
Index: linux-2.6.11-rc2/lib/Kconfig
===================================================================
--- linux-2.6.11-rc2.orig/lib/Kconfig
+++ linux-2.6.11-rc2/lib/Kconfig
@@ -30,6 +30,9 @@ config LIBCRC32C
 	  require M here.  See Castagnoli93.
 	  Module will be libcrc32c.
 
+config QSORT
+	bool "Quick Sort"
+
 #
 # compression support is select'ed if needed
 #
Index: linux-2.6.11-rc2/lib/Makefile
===================================================================
--- linux-2.6.11-rc2.orig/lib/Makefile
+++ linux-2.6.11-rc2/lib/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_CRC_CCITT)	+= crc-ccitt.o
 obj-$(CONFIG_CRC32)	+= crc32.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_GENERIC_IOMAP) += iomap.o
+obj-$(CONFIG_QSORT)	+= qsort.o
 
 obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate/
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
Index: linux-2.6.11-rc2/lib/qsort.c
===================================================================
--- /dev/null
+++ linux-2.6.11-rc2/lib/qsort.c
@@ -0,0 +1,249 @@
+/* Copyright (C) 1991, 1992, 1996, 1997, 1999 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Written by Douglas C. Schmidt (schmidt@ics.uci.edu).
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+/* If you consider tuning this algorithm, you should consult first:
+   Engineering a sort function; Jon Bentley and M. Douglas McIlroy;
+   Software - Practice and Experience; Vol. 23 (11), 1249-1265, 1993.  */
+
+# include <linux/module.h>
+# include <linux/slab.h>
+# include <linux/string.h>
+
+MODULE_LICENSE("GPL");
+
+/* Byte-wise swap two items of size SIZE. */
+#define SWAP(a, b, size)						      \
+  do									      \
+    {									      \
+      register size_t __size = (size);					      \
+      register char *__a = (a), *__b = (b);				      \
+      do								      \
+	{								      \
+	  char __tmp = *__a;						      \
+	  *__a++ = *__b;						      \
+	  *__b++ = __tmp;						      \
+	} while (--__size > 0);						      \
+    } while (0)
+
+/* Discontinue quicksort algorithm when partition gets below this size.
+   This particular magic number was chosen to work best on a Sun 4/260. */
+#define MAX_THRESH 4
+
+/* Stack node declarations used to store unfulfilled partition obligations. */
+typedef struct
+  {
+    char *lo;
+    char *hi;
+  } stack_node;
+
+/* The next 5 #defines implement a very fast in-line stack abstraction. */
+/* The stack needs log (total_elements) entries (we could even subtract
+   log(MAX_THRESH)).  Since total_elements has type size_t, we get as
+   upper bound for log (total_elements):
+   bits per byte (CHAR_BIT) * sizeof(size_t).  */
+#define CHAR_BIT 8
+#define STACK_SIZE	(CHAR_BIT * sizeof(size_t))
+#define PUSH(low, high)	((void) ((top->lo = (low)), (top->hi = (high)), ++top))
+#define	POP(low, high)	((void) (--top, (low = top->lo), (high = top->hi)))
+#define	STACK_NOT_EMPTY	(stack < top)
+
+
+/* Order size using quicksort.  This implementation incorporates
+   four optimizations discussed in Sedgewick:
+
+   1. Non-recursive, using an explicit stack of pointers that store the
+      next array partition to sort.  To save time, this maximum amount
+      of space required to store an array of SIZE_MAX is allocated on the
+      stack.  Assuming a 32-bit (64 bit) integer for size_t, this needs
+      only 32 * sizeof(stack_node) == 256 bytes (for 64 bit: 1024 bytes).
+      Pretty cheap, actually.
+
+   2. Choose the pivot element using a median-of-three decision tree.
+      This reduces the probability of selecting a bad pivot value and
+      eliminates certain extraneous comparisons.
+
+   3. Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving
+      insertion sort to order the MAX_THRESH items within each partition.
+      This is a big win, since insertion sort is faster for small, mostly
+      sorted array segments.
+
+   4. The larger of the two sub-partitions is always pushed onto the
+      stack first, with the algorithm then concentrating on the
+      smaller partition.  This *guarantees* no more than log (total_elems)
+      stack size is needed (actually O(1) in this case)!  */
+
+void
+qsort(void *const pbase, size_t total_elems, size_t size,
+      int(*cmp)(const void *,const void *))
+{
+  register char *base_ptr = (char *) pbase;
+
+  const size_t max_thresh = MAX_THRESH * size;
+
+  if (total_elems == 0)
+    /* Avoid lossage with unsigned arithmetic below.  */
+    return;
+
+  if (total_elems > MAX_THRESH)
+    {
+      char *lo = base_ptr;
+      char *hi = &lo[size * (total_elems - 1)];
+      stack_node stack[STACK_SIZE];
+      stack_node *top = stack + 1;
+
+      while (STACK_NOT_EMPTY)
+        {
+          char *left_ptr;
+          char *right_ptr;
+
+	  /* Select median value from among LO, MID, and HI. Rearrange
+	     LO and HI so the three values are sorted. This lowers the
+	     probability of picking a pathological pivot value and
+	     skips a comparison for both the LEFT_PTR and RIGHT_PTR in
+	     the while loops. */
+
+	  char *mid = lo + size * ((hi - lo) / size >> 1);
+
+	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
+	    SWAP (mid, lo, size);
+	  if ((*cmp) ((void *) hi, (void *) mid) < 0)
+	    SWAP (mid, hi, size);
+	  else
+	    goto jump_over;
+	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
+	    SWAP (mid, lo, size);
+	jump_over:;
+
+	  left_ptr  = lo + size;
+	  right_ptr = hi - size;
+
+	  /* Here's the famous ``collapse the walls'' section of quicksort.
+	     Gotta like those tight inner loops!  They are the main reason
+	     that this algorithm runs much faster than others. */
+	  do
+	    {
+	      while ((*cmp) ((void *) left_ptr, (void *) mid) < 0)
+		left_ptr += size;
+
+	      while ((*cmp) ((void *) mid, (void *) right_ptr) < 0)
+		right_ptr -= size;
+
+	      if (left_ptr < right_ptr)
+		{
+		  SWAP (left_ptr, right_ptr, size);
+		  if (mid == left_ptr)
+		    mid = right_ptr;
+		  else if (mid == right_ptr)
+		    mid = left_ptr;
+		  left_ptr += size;
+		  right_ptr -= size;
+		}
+	      else if (left_ptr == right_ptr)
+		{
+		  left_ptr += size;
+		  right_ptr -= size;
+		  break;
+		}
+	    }
+	  while (left_ptr <= right_ptr);
+
+          /* Set up pointers for next iteration.  First determine whether
+             left and right partitions are below the threshold size.  If so,
+             ignore one or both.  Otherwise, push the larger partition's
+             bounds on the stack and continue sorting the smaller one. */
+
+          if ((size_t) (right_ptr - lo) <= max_thresh)
+            {
+              if ((size_t) (hi - left_ptr) <= max_thresh)
+		/* Ignore both small partitions. */
+                POP (lo, hi);
+              else
+		/* Ignore small left partition. */
+                lo = left_ptr;
+            }
+          else if ((size_t) (hi - left_ptr) <= max_thresh)
+	    /* Ignore small right partition. */
+            hi = right_ptr;
+          else if ((right_ptr - lo) > (hi - left_ptr))
+            {
+	      /* Push larger left partition indices. */
+              PUSH (lo, right_ptr);
+              lo = left_ptr;
+            }
+          else
+            {
+	      /* Push larger right partition indices. */
+              PUSH (left_ptr, hi);
+              hi = right_ptr;
+            }
+        }
+    }
+
+  /* Once the BASE_PTR array is partially sorted by quicksort the rest
+     is completely sorted using insertion sort, since this is efficient
+     for partitions below MAX_THRESH size. BASE_PTR points to the beginning
+     of the array to sort, and END_PTR points at the very last element in
+     the array (*not* one beyond it!). */
+
+  {
+    char *end_ptr = &base_ptr[size * (total_elems - 1)];
+    char *tmp_ptr = base_ptr;
+    char *thresh = min(end_ptr, base_ptr + max_thresh);
+    register char *run_ptr;
+
+    /* Find smallest element in first threshold and place it at the
+       array's beginning.  This is the smallest array element,
+       and the operation speeds up insertion sort's inner loop. */
+
+    for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
+      if ((*cmp) ((void *) run_ptr, (void *) tmp_ptr) < 0)
+        tmp_ptr = run_ptr;
+
+    if (tmp_ptr != base_ptr)
+      SWAP (tmp_ptr, base_ptr, size);
+
+    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
+
+    run_ptr = base_ptr + size;
+    while ((run_ptr += size) <= end_ptr)
+      {
+	tmp_ptr = run_ptr - size;
+	while ((*cmp) ((void *) run_ptr, (void *) tmp_ptr) < 0)
+	  tmp_ptr -= size;
+
+	tmp_ptr += size;
+        if (tmp_ptr != run_ptr)
+          {
+            char *trav;
+
+	    trav = run_ptr + size;
+	    while (--trav >= run_ptr)
+              {
+                char c = *trav;
+                char *hi, *lo;
+
+                for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
+                  *hi = *lo;
+                *hi = c;
+              }
+          }
+      }
+  }
+}
+EXPORT_SYMBOL(qsort);
Index: linux-2.6.11-rc2/fs/xfs/support/qsort.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/xfs/support/qsort.c
+++ /dev/null
@@ -1,155 +0,0 @@
-/*
- * Copyright (c) 1992, 1993
- *	The Regents of the University of California.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the University nor the names of its contributors
- *    may be used to endorse or promote products derived from this software
- *    without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#include <linux/kernel.h>
-#include <linux/string.h>
-
-/*
- * Qsort routine from Bentley & McIlroy's "Engineering a Sort Function".
- */
-#define swapcode(TYPE, parmi, parmj, n) { 		\
-	long i = (n) / sizeof (TYPE); 			\
-	register TYPE *pi = (TYPE *) (parmi); 		\
-	register TYPE *pj = (TYPE *) (parmj); 		\
-	do { 						\
-		register TYPE	t = *pi;		\
-		*pi++ = *pj;				\
-		*pj++ = t;				\
-        } while (--i > 0);				\
-}
-
-#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
-	es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
-
-static __inline void
-swapfunc(char *a, char *b, int n, int swaptype)
-{
-	if (swaptype <= 1) 
-		swapcode(long, a, b, n)
-	else
-		swapcode(char, a, b, n)
-}
-
-#define swap(a, b)					\
-	if (swaptype == 0) {				\
-		long t = *(long *)(a);			\
-		*(long *)(a) = *(long *)(b);		\
-		*(long *)(b) = t;			\
-	} else						\
-		swapfunc(a, b, es, swaptype)
-
-#define vecswap(a, b, n) 	if ((n) > 0) swapfunc(a, b, n, swaptype)
-
-static __inline char *
-med3(char *a, char *b, char *c, int (*cmp)(const void *, const void *))
-{
-	return cmp(a, b) < 0 ?
-	       (cmp(b, c) < 0 ? b : (cmp(a, c) < 0 ? c : a ))
-              :(cmp(b, c) > 0 ? b : (cmp(a, c) < 0 ? a : c ));
-}
-
-void
-qsort(void *aa, size_t n, size_t es, int (*cmp)(const void *, const void *))
-{
-	char *pa, *pb, *pc, *pd, *pl, *pm, *pn;
-	int d, r, swaptype, swap_cnt;
-	register char *a = aa;
-
-loop:	SWAPINIT(a, es);
-	swap_cnt = 0;
-	if (n < 7) {
-		for (pm = (char *)a + es; pm < (char *) a + n * es; pm += es)
-			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0;
-			     pl -= es)
-				swap(pl, pl - es);
-		return;
-	}
-	pm = (char *)a + (n / 2) * es;
-	if (n > 7) {
-		pl = (char *)a;
-		pn = (char *)a + (n - 1) * es;
-		if (n > 40) {
-			d = (n / 8) * es;
-			pl = med3(pl, pl + d, pl + 2 * d, cmp);
-			pm = med3(pm - d, pm, pm + d, cmp);
-			pn = med3(pn - 2 * d, pn - d, pn, cmp);
-		}
-		pm = med3(pl, pm, pn, cmp);
-	}
-	swap(a, pm);
-	pa = pb = (char *)a + es;
-
-	pc = pd = (char *)a + (n - 1) * es;
-	for (;;) {
-		while (pb <= pc && (r = cmp(pb, a)) <= 0) {
-			if (r == 0) {
-				swap_cnt = 1;
-				swap(pa, pb);
-				pa += es;
-			}
-			pb += es;
-		}
-		while (pb <= pc && (r = cmp(pc, a)) >= 0) {
-			if (r == 0) {
-				swap_cnt = 1;
-				swap(pc, pd);
-				pd -= es;
-			}
-			pc -= es;
-		}
-		if (pb > pc)
-			break;
-		swap(pb, pc);
-		swap_cnt = 1;
-		pb += es;
-		pc -= es;
-	}
-	if (swap_cnt == 0) {  /* Switch to insertion sort */
-		for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
-			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0; 
-			     pl -= es)
-				swap(pl, pl - es);
-		return;
-	}
-
-	pn = (char *)a + n * es;
-	r = min(pa - (char *)a, pb - pa);
-	vecswap(a, pb - r, r);
-	r = min((long)(pd - pc), (long)(pn - pd - es));
-	vecswap(pb, pn - r, r);
-	if ((r = pb - pa) > es)
-		qsort(a, r / es, es, cmp);
-	if ((r = pd - pc) > es) { 
-		/* Iterate rather than recurse to save stack space */
-		a = pn - r;
-		n = r / es;
-		goto loop;
-	}
-/*		qsort(pn - r, r / es, es, cmp);*/
-}
Index: linux-2.6.11-rc2/fs/xfs/Makefile
===================================================================
--- linux-2.6.11-rc2.orig/fs/xfs/Makefile
+++ linux-2.6.11-rc2/fs/xfs/Makefile
@@ -142,7 +142,6 @@ xfs-y				+= $(addprefix linux-2.6/, \
 xfs-y				+= $(addprefix support/, \
 				   debug.o \
 				   move.o \
-				   qsort.o \
 				   uuid.o)
 
 xfs-$(CONFIG_XFS_TRACE)		+= support/ktrace.o
Index: linux-2.6.11-rc2/fs/xfs/support/qsort.h
===================================================================
--- linux-2.6.11-rc2.orig/fs/xfs/support/qsort.h
+++ /dev/null
@@ -1,41 +0,0 @@
-/*
- * Copyright (c) 2000-2002 Silicon Graphics, Inc.  All Rights Reserved.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it would be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
- *
- * Further, this software is distributed without any warranty that it is
- * free of the rightful claim of any third person regarding infringement
- * or the like.  Any license provided herein, whether implied or
- * otherwise, applies only to this software file.  Patent licenses, if
- * any, provided herein do not apply to combinations of this program with
- * other software, or any other product whatsoever.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write the Free Software Foundation, Inc., 59
- * Temple Place - Suite 330, Boston MA 02111-1307, USA.
- *
- * Contact information: Silicon Graphics, Inc., 1600 Amphitheatre Pkwy,
- * Mountain View, CA  94043, or:
- *
- * http://www.sgi.com
- *
- * For further information regarding this notice, see:
- *
- * http://oss.sgi.com/projects/GenInfo/SGIGPLNoticeExplan/
- */
-
-#ifndef QSORT_H
-#define QSORT_H
-
-extern void qsort (void *const pbase,
-		    size_t total_elems,
-		    size_t size,
-		    int (*cmp)(const void *, const void *));
-
-#endif
Index: linux-2.6.11-rc2/fs/xfs/linux-2.6/xfs_linux.h
===================================================================
--- linux-2.6.11-rc2.orig/fs/xfs/linux-2.6/xfs_linux.h
+++ linux-2.6.11-rc2/fs/xfs/linux-2.6/xfs_linux.h
@@ -64,7 +64,6 @@
 #include <sema.h>
 #include <time.h>
 
-#include <support/qsort.h>
 #include <support/ktrace.h>
 #include <support/debug.h>
 #include <support/move.h>
Index: linux-2.6.11-rc2/fs/Kconfig
===================================================================
--- linux-2.6.11-rc2.orig/fs/Kconfig
+++ linux-2.6.11-rc2/fs/Kconfig
@@ -306,6 +306,7 @@ config FS_POSIX_ACL
 
 config XFS_FS
 	tristate "XFS filesystem support"
+	select QSORT
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform.  It is completely multi-threaded, can

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 2/13] Return -ENOSYS for RPC programs that are unavailable
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-02-15 17:04   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 3/13] Add missing -EOPNOTSUPP => NFS3ERR_NOTSUPP mapping in nfsd Andreas Gruenbacher
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/sunrpc-enosys-when-unavail --]
[-- Type: text/plain, Size: 2978 bytes --]

The issuer of an RPC call should be able to tell the difference
between an I/O error and program unavailable / program version
unavailable / procedure unavailable. Return -ENOSYS for unavailable
RPCs instead of -EIO.

Only issue a program unavailable warning for program numbers other
than the one for nfsacl: Clients with nfsacl support are quite
common already; no need to clutter the syslog.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/nfs.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs.h
+++ linux-2.6.11-rc2/include/linux/nfs.h
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 
 #define NFS_PROGRAM	100003
+#define NFSACL_PROGRAM	100227
 #define NFS_PORT	2049
 #define NFS_MAXDATA	8192
 #define NFS_MAXPATHLEN	1024
Index: linux-2.6.11-rc2/net/sunrpc/clnt.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/clnt.c
+++ linux-2.6.11-rc2/net/sunrpc/clnt.c
@@ -988,10 +988,12 @@ call_verify(struct rpc_task *task)
 				break;
 			case RPC_MISMATCH:
 				printk(KERN_WARNING "%s: RPC call version mismatch!\n", __FUNCTION__);
-				goto out_eio;
+				error = -ENOSYS;
+				goto out_err;
 			default:
 				printk(KERN_WARNING "%s: RPC call rejected, unknown error: %x\n", __FUNCTION__, n);
-				goto out_eio;
+				error = -ENOSYS;
+				goto out_err;
 		}
 		if (--len < 0)
 			goto out_overflow;
@@ -1041,23 +1043,28 @@ call_verify(struct rpc_task *task)
 	case RPC_SUCCESS:
 		return p;
 	case RPC_PROG_UNAVAIL:
-		printk(KERN_WARNING "RPC: call_verify: program %u is unsupported by server %s\n",
+		if (task->tk_client->cl_prog != NFSACL_PROGRAM) {
+			printk(KERN_WARNING "RPC: call_verify: program %u is unsupported by server %s\n",
 				(unsigned int)task->tk_client->cl_prog,
 				task->tk_client->cl_server);
-		goto out_eio;
+		}
+		error = -ENOSYS;
+		goto out_err;
 	case RPC_PROG_MISMATCH:
 		printk(KERN_WARNING "RPC: call_verify: program %u, version %u unsupported by server %s\n",
 				(unsigned int)task->tk_client->cl_prog,
 				(unsigned int)task->tk_client->cl_vers,
 				task->tk_client->cl_server);
-		goto out_eio;
+		error = -ENOSYS;
+		goto out_err;
 	case RPC_PROC_UNAVAIL:
 		printk(KERN_WARNING "RPC: call_verify: proc %p unsupported by program %u, version %u on server %s\n",
 				task->tk_msg.rpc_proc,
 				task->tk_client->cl_prog,
 				task->tk_client->cl_vers,
 				task->tk_client->cl_server);
-		goto out_eio;
+		error = -ENOSYS;
+		goto out_err;
 	case RPC_GARBAGE_ARGS:
 		dprintk("RPC: %4d %s: server saw garbage\n", task->tk_pid, __FUNCTION__);
 		break;			/* retry */
@@ -1075,7 +1082,6 @@ out_retry:
 		return NULL;
 	}
 	printk(KERN_WARNING "RPC %s: retry failed, exit EIO\n", __FUNCTION__);
-out_eio:
 	error = -EIO;
 out_err:
 	rpc_exit(task, error);

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 3/13] Add missing -EOPNOTSUPP => NFS3ERR_NOTSUPP mapping in nfsd
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 2/13] Return -ENOSYS for RPC programs that are unavailable Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 4/13] Allow multiple programs to listen on the same port Andreas Gruenbacher
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/sunrpc-enotsupp --]
[-- Type: text/plain, Size: 718 bytes --]

Add the missing NFS3ERR_NOTSUPP error code (defined in NFSv3) to the
system-to-protocol-error table in nfsd. The nfsacl extension uses
this error code.
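
The table-driven mapping can be pictured with the following userspace
sketch. It is a simplified stand-in for nfserrno(), not the nfsd
implementation: the SK_* constants are plain host-order copies of the
NFSv3 status values, and the table holds only a few entries.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* NFSv3 status values (RFC 1813); the SK_ names are local to this sketch. */
enum {
	SK_NFS_OK         = 0,
	SK_NFSERR_NOENT   = 2,
	SK_NFSERR_IO      = 5,
	SK_NFSERR_NOTSUPP = 10004
};

/* Pairs of protocol error and negative system errno. */
static const struct {
	int nfserr;
	int syserr;
} sk_errtbl[] = {
	{ SK_NFS_OK,         0 },
	{ SK_NFSERR_NOENT,   -ENOENT },
	{ SK_NFSERR_IO,      -EIO },
	{ SK_NFSERR_NOTSUPP, -EOPNOTSUPP },
};

/* Map a negative errno to a protocol error; unknown values become I/O errors. */
static int sk_nfserrno(int syserr)
{
	size_t i;

	assert(syserr <= 0);
	for (i = 0; i < sizeof(sk_errtbl) / sizeof(sk_errtbl[0]); i++)
		if (sk_errtbl[i].syserr == syserr)
			return sk_errtbl[i].nfserr;
	return SK_NFSERR_IO;
}
```

Without the -EOPNOTSUPP row, the lookup in this sketch falls through to
the generic I/O error, which is the gap the patch closes in the real
table.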

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/fs/nfsd/nfsproc.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/nfsproc.c
+++ linux-2.6.11-rc2/fs/nfsd/nfsproc.c
@@ -590,6 +590,7 @@ nfserrno (int errno)
 		{ nfserr_dropit, -EAGAIN },
 		{ nfserr_dropit, -ENOMEM },
 		{ nfserr_badname, -ESRCH },
+		{ nfserr_notsupp, -EOPNOTSUPP },
 		{ -1, -EIO }
 	};
 	int	i;

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 4/13] Allow multiple programs to listen on the same port
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2005-01-22 20:34 ` [patch 3/13] Add missing -EOPNOTSUPP => NFS3ERR_NOTSUPP mapping in nfsd Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 5/13] Allow multiple programs to share the same transport Andreas Gruenbacher
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/sunrpc-multiple-programs --]
[-- Type: text/plain, Size: 3263 bytes --]

The NFS and NFSACL programs run on the same RPC transport. This patch
adds server-side support for this by turning svc_program into a linked
list of programs.
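
The lookup this enables can be sketched in userspace C as follows.
struct sk_program is a cut-down, hypothetical stand-in for struct
svc_program, and sk_append and sk_find_program are illustrative helpers
that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct svc_program. */
struct sk_program {
	struct sk_program *next;	/* other programs on the same transport */
	unsigned int prog;		/* RPC program number */
	const char *name;
};

/* Chain an additional program to the end of the list starting at head. */
static void sk_append(struct sk_program *head, struct sk_program *extra)
{
	assert(head != NULL && extra != NULL);
	while (head->next != NULL)
		head = head->next;
	extra->next = NULL;
	head->next = extra;
}

/* Walk the list; return the matching program or NULL if none is registered. */
static struct sk_program *sk_find_program(struct sk_program *head,
					  unsigned int prog)
{
	struct sk_program *p;

	for (p = head; p != NULL; p = p->next)
		if (p->prog == prog)
			break;
	return p;
}
```

A NULL result corresponds to the err_bad_prog path in svc_process(),
where the server answers with "program unavailable".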

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/sunrpc/svc.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/sunrpc/svc.h
+++ linux-2.6.11-rc2/include/linux/sunrpc/svc.h
@@ -240,9 +240,10 @@ struct svc_deferred_req {
 };
 
 /*
- * RPC program
+ * List of RPC programs on the same transport endpoint
  */
 struct svc_program {
+	struct svc_program *	pg_next;	/* other programs (same xprt) */
 	u32			pg_prog;	/* program number */
 	unsigned int		pg_lovers;	/* lowest version */
 	unsigned int		pg_hivers;	/* lowest version */
Index: linux-2.6.11-rc2/net/sunrpc/svc.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/svc.c
+++ linux-2.6.11-rc2/net/sunrpc/svc.c
@@ -35,20 +35,24 @@ svc_create(struct svc_program *prog, uns
 	if (!(serv = (struct svc_serv *) kmalloc(sizeof(*serv), GFP_KERNEL)))
 		return NULL;
 	memset(serv, 0, sizeof(*serv));
+	serv->sv_name      = prog->pg_name;
 	serv->sv_program   = prog;
 	serv->sv_nrthreads = 1;
 	serv->sv_stats     = prog->pg_stats;
 	serv->sv_bufsz	   = bufsize? bufsize : 4096;
-	prog->pg_lovers = prog->pg_nvers-1;
 	xdrsize = 0;
-	for (vers=0; vers<prog->pg_nvers ; vers++)
-		if (prog->pg_vers[vers]) {
-			prog->pg_hivers = vers;
-			if (prog->pg_lovers > vers)
-				prog->pg_lovers = vers;
-			if (prog->pg_vers[vers]->vs_xdrsize > xdrsize)
-				xdrsize = prog->pg_vers[vers]->vs_xdrsize;
-		}
+	while (prog) {
+		prog->pg_lovers = prog->pg_nvers-1;
+		for (vers=0; vers<prog->pg_nvers ; vers++)
+			if (prog->pg_vers[vers]) {
+				prog->pg_hivers = vers;
+				if (prog->pg_lovers > vers)
+					prog->pg_lovers = vers;
+				if (prog->pg_vers[vers]->vs_xdrsize > xdrsize)
+					xdrsize = prog->pg_vers[vers]->vs_xdrsize;
+			}
+		prog = prog->pg_next;
+	}
 	serv->sv_xdrsize   = xdrsize;
 	INIT_LIST_HEAD(&serv->sv_threads);
 	INIT_LIST_HEAD(&serv->sv_sockets);
@@ -56,8 +60,6 @@ svc_create(struct svc_program *prog, uns
 	INIT_LIST_HEAD(&serv->sv_permsocks);
 	spin_lock_init(&serv->sv_lock);
 
-	serv->sv_name      = prog->pg_name;
-
 	/* Remove any stale portmap registrations */
 	svc_register(serv, 0, 0);
 
@@ -332,7 +334,10 @@ svc_process(struct svc_serv *serv, struc
 		goto sendit;
 	}
 		
-	if (prog != progp->pg_prog)
+	for (progp = serv->sv_program; progp; progp = progp->pg_next)
+		if (prog == progp->pg_prog)
+			break;
+	if (progp == NULL)
 		goto err_bad_prog;
 
 	if (vers >= progp->pg_nvers ||
@@ -445,9 +450,8 @@ err_bad_auth:
 
 err_bad_prog:
 #ifdef RPC_PARANOIA
-	if (prog != 100227 || progp->pg_prog != 100003)
-		printk("svc: unknown program %d (me %d)\n", prog, progp->pg_prog);
-	/* else it is just a Solaris client seeing if ACLs are supported */
+	if (prog != NFSACL_PROGRAM || serv->sv_program->pg_prog != NFS_PROGRAM)
+		printk("svc: unknown program %d\n", prog);
 #endif
 	serv->sv_stats->rpcbadfmt++;
 	svc_putu32(resv, rpc_prog_unavail);

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 5/13] Allow multiple programs to share the same transport
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2005-01-22 20:34 ` [patch 4/13] Allow multiple programs to listen on the same port Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 6/13] Lazy RPC receive buffer allocation Andreas Gruenbacher
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/sunrpc-change-program --]
[-- Type: text/plain, Size: 4160 bytes --]

Allow a clone of an RPC client (created with rpc_clone_client()) to
change to another program. This allows the NFS and NFSACL programs to
share the same transport.
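
For illustration only, the idea can be modelled in user space roughly as
follows. The struct and function names below are hypothetical,
simplified stand-ins, not the kernel's; the sketch returns -1 for a bad
version where the patch uses BUG_ON(), and the procedure counts are
merely illustrative.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's rpc_version,
 * rpc_program and rpc_clnt; only the fields touched here are modelled. */
struct version {
	unsigned int number;
	unsigned int nrprocs;
};

struct program {
	const char *name;
	unsigned int number;
	unsigned int nrvers;
	const struct version **version;
};

struct client {
	const char *protname;
	unsigned int prog, vers, maxproc;
	int *transport;		/* stands in for the shared transport */
};

/* Point an existing client at another program/version.  Returns -1 on a
 * bad version, 0 on success. */
static int change_program(struct client *clnt, const struct program *program,
			  unsigned int vers)
{
	const struct version *version;

	if (vers >= program->nrvers || !program->version[vers])
		return -1;
	version = program->version[vers];
	clnt->maxproc  = version->nrprocs;
	clnt->protname = program->name;
	clnt->prog     = program->number;
	clnt->vers     = version->number;
	return 0;
}

/* This sketch only registers version 3 of the second program. */
static const struct version nfsacl_v3 = { 3, 3 };
static const struct version *nfsacl_versions[] = {
	NULL, NULL, NULL, &nfsacl_v3
};
static const struct program nfsacl_program = {
	"nfsacl", 100227, 4, nfsacl_versions
};

/* Returns 1 if the clone talks NFSACL over the same transport as the
 * original NFS client, which itself is left unchanged. */
static int demo_shared_transport(void)
{
	int transport = 0;
	struct client nfs = { "nfs", 100003, 3, 22, &transport };
	struct client acl = nfs;	/* plays the role of rpc_clone_client() */

	if (change_program(&acl, &nfsacl_program, 3) != 0)
		return 0;
	return acl.transport == nfs.transport &&
	       acl.prog == 100227 && acl.vers == 3 && nfs.prog == 100003;
}

/* Asking for a version slot that is not populated is rejected. */
static int demo_bad_version(void)
{
	struct client c = { "nfs", 100003, 3, 22, NULL };

	return change_program(&c, &nfsacl_program, 2);
}
```

The point of the design is that only the per-program fields of the
clone change; the transport pointer is copied once and never touched.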

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/sunrpc/clnt.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/sunrpc/clnt.h
+++ linux-2.6.11-rc2/include/linux/sunrpc/clnt.h
@@ -22,6 +22,7 @@
  * This defines an RPC port mapping
  */
 struct rpc_portmap {
+	struct rpc_portmap	*pm_parent;
 	__u32			pm_prog;
 	__u32			pm_vers;
 	__u32			pm_prot;
@@ -116,6 +117,8 @@ struct rpc_clnt *rpc_clone_client(struct
 int		rpc_shutdown_client(struct rpc_clnt *);
 int		rpc_destroy_client(struct rpc_clnt *);
 void		rpc_release_client(struct rpc_clnt *);
+void		rpc_change_program(struct rpc_clnt *, struct rpc_program *,
+				   int);
 void		rpc_getport(struct rpc_task *, struct rpc_clnt *);
 int		rpc_register(u32, u32, int, unsigned short, int *);
 
Index: linux-2.6.11-rc2/net/sunrpc/clnt.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/clnt.c
+++ linux-2.6.11-rc2/net/sunrpc/clnt.c
@@ -139,6 +139,7 @@ rpc_create_client(struct rpc_xprt *xprt,
 	clnt->cl_maxproc  = version->nrprocs;
 	clnt->cl_protname = program->name;
 	clnt->cl_pmap	  = &clnt->cl_pmap_default;
+	clnt->cl_pmap->pm_parent = clnt->cl_pmap;
 	clnt->cl_port     = xprt->addr.sin_port;
 	clnt->cl_prog     = program->number;
 	clnt->cl_vers     = version->number;
@@ -207,6 +208,9 @@ rpc_clone_client(struct rpc_clnt *clnt)
 	rpc_init_rtt(&new->cl_rtt_default, clnt->cl_xprt->timeout.to_initval);
 	if (new->cl_auth)
 		atomic_inc(&new->cl_auth->au_count);
+	new->cl_pmap		= &new->cl_pmap_default;
+	new->cl_pmap->pm_parent = clnt->cl_pmap->pm_parent;
+	rpc_init_wait_queue(&new->cl_pmap_default.pm_bindwait, "bindwait");
 	return new;
 out_no_clnt:
 	printk(KERN_INFO "RPC: out of memory in %s\n", __FUNCTION__);
@@ -296,6 +300,25 @@ rpc_release_client(struct rpc_clnt *clnt
 }
 
 /*
+ * Change the program of a (usually cloned) client
+ */
+void
+rpc_change_program(struct rpc_clnt *clnt, struct rpc_program *program,
+		   int vers)
+{
+	struct rpc_version *version;
+
+	BUG_ON(vers >= program->nrvers || !program->version[vers]);
+	version = program->version[vers];
+	clnt->cl_procinfo = version->procs;
+	clnt->cl_maxproc  = version->nrprocs;
+	clnt->cl_protname = program->name;
+	clnt->cl_prog     = program->number;
+	clnt->cl_vers     = version->number;
+	clnt->cl_stats    = program->stats;
+}
+
+/*
  * Default callback for async RPC calls
  */
 static void
Index: linux-2.6.11-rc2/net/sunrpc/pmap_clnt.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/pmap_clnt.c
+++ linux-2.6.11-rc2/net/sunrpc/pmap_clnt.c
@@ -41,7 +41,7 @@ static DEFINE_SPINLOCK(pmap_lock);
 void
 rpc_getport(struct rpc_task *task, struct rpc_clnt *clnt)
 {
-	struct rpc_portmap *map = clnt->cl_pmap;
+	struct rpc_portmap *map = clnt->cl_pmap->pm_parent;
 	struct sockaddr_in *sap = &clnt->cl_xprt->addr;
 	struct rpc_message msg = {
 		.rpc_proc	= &pmap_procedures[PMAP_GETPORT],
@@ -132,7 +132,7 @@ static void
 pmap_getport_done(struct rpc_task *task)
 {
 	struct rpc_clnt	*clnt = task->tk_client;
-	struct rpc_portmap *map = clnt->cl_pmap;
+	struct rpc_portmap *map = clnt->cl_pmap->pm_parent;
 
 	dprintk("RPC: %4d pmap_getport_done(status %d, port %d)\n",
 			task->tk_pid, task->tk_status, clnt->cl_port);
Index: linux-2.6.11-rc2/net/sunrpc/sunrpc_syms.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/sunrpc_syms.c
+++ linux-2.6.11-rc2/net/sunrpc/sunrpc_syms.c
@@ -42,6 +42,7 @@ EXPORT_SYMBOL(rpc_release_task);
 /* RPC client functions */
 EXPORT_SYMBOL(rpc_create_client);
 EXPORT_SYMBOL(rpc_clone_client);
+EXPORT_SYMBOL(rpc_change_program);
 EXPORT_SYMBOL(rpc_destroy_client);
 EXPORT_SYMBOL(rpc_shutdown_client);
 EXPORT_SYMBOL(rpc_release_client);

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


* [patch 6/13] Lazy RPC receive buffer allocation
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2005-01-22 20:34 ` [patch 5/13] Allow multiple programs to share the same transport Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 7/13] Encode and decode arbitrary XDR arrays Andreas Gruenbacher
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfsacl-lazy-alloc --]
[-- Type: text/plain, Size: 4787 bytes --]

Allow pages in the receive buffer to be allocated lazily. This is used
for the GETACL RPC, which has a large maximum reply size but a small
average reply size.
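
A rough user-space sketch of the same idea, assuming a 4096-byte page
and using malloc() in place of alloc_page(); all names here are
hypothetical and not part of the patch. Slots exist for the maximum
reply, but a page is only allocated when data actually lands in it.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ   4096u		/* assumed page size for the sketch */
#define MAX_PAGES 8u		/* room for the largest possible reply */

/* Hypothetical receive buffer: every page pointer starts out NULL. */
struct lazy_buf {
	char *pages[MAX_PAGES];
};

/* Copy len bytes to offset base, allocating each page on first touch.
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int lazy_copy(struct lazy_buf *buf, size_t base,
		     const char *src, size_t len)
{
	while (len) {
		size_t idx = base / PAGE_SZ;
		size_t off = base % PAGE_SZ;
		size_t n = PAGE_SZ - off;

		if (idx >= MAX_PAGES)
			return -1;
		if (n > len)
			n = len;
		if (!buf->pages[idx]) {
			buf->pages[idx] = malloc(PAGE_SZ);
			if (!buf->pages[idx])
				return -1;
		}
		memcpy(buf->pages[idx] + off, src, n);
		base += n;
		src += n;
		len -= n;
	}
	return 0;
}

/* Count the pages that were actually allocated. */
static unsigned int lazy_pages_used(const struct lazy_buf *buf)
{
	unsigned int i, used = 0;

	for (i = 0; i < MAX_PAGES; i++)
		if (buf->pages[i])
			used++;
	return used;
}

static void lazy_free(struct lazy_buf *buf)
{
	unsigned int i;

	for (i = 0; i < MAX_PAGES; i++) {
		free(buf->pages[i]);
		buf->pages[i] = NULL;
	}
}

/* Receive a reply of the given size and report how many pages it cost,
 * or -1 if the copy failed. */
static int demo_pages_for_reply(size_t reply_len)
{
	struct lazy_buf buf = { { NULL } };
	char *reply = calloc(1, reply_len ? reply_len : 1);
	int used = -1;

	if (reply && lazy_copy(&buf, 0, reply, reply_len) == 0)
		used = (int)lazy_pages_used(&buf);
	free(reply);
	lazy_free(&buf);
	return used;
}
```

With these assumptions a 100-byte reply costs one page and a 5000-byte
reply two, while a reply larger than the buffer is reported as an error
to the caller, much as the patch propagates -ENOMEM.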

Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.11-rc2/include/linux/sunrpc/xdr.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/sunrpc/xdr.h
+++ linux-2.6.11-rc2/include/linux/sunrpc/xdr.h
@@ -160,7 +160,7 @@ typedef struct {
 
 typedef size_t (*skb_read_actor_t)(skb_reader_t *desc, void *to, size_t len);
 
-extern void xdr_partial_copy_from_skb(struct xdr_buf *, unsigned int,
+extern int xdr_partial_copy_from_skb(struct xdr_buf *, unsigned int,
 		skb_reader_t *, skb_read_actor_t);
 
 struct socket;
Index: linux-2.6.11-rc2/net/sunrpc/xdr.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/xdr.c
+++ linux-2.6.11-rc2/net/sunrpc/xdr.c
@@ -176,7 +176,7 @@ xdr_inline_pages(struct xdr_buf *xdr, un
 	xdr->buflen += len;
 }
 
-void
+int
 xdr_partial_copy_from_skb(struct xdr_buf *xdr, unsigned int base,
 			  skb_reader_t *desc,
 			  skb_read_actor_t copy_actor)
@@ -190,7 +190,7 @@ xdr_partial_copy_from_skb(struct xdr_buf
 		len -= base;
 		ret = copy_actor(desc, (char *)xdr->head[0].iov_base + base, len);
 		if (ret != len || !desc->count)
-			return;
+			return 0;
 		base = 0;
 	} else
 		base -= len;
@@ -210,6 +210,13 @@ xdr_partial_copy_from_skb(struct xdr_buf
 	do {
 		char *kaddr;
 
+		/* ACL likes to be lazy in allocating pages - ACLs
+		 * are small by default but can get huge. */
+		if (unlikely(*ppage == NULL)) {
+			if (!(*ppage = alloc_page(GFP_ATOMIC)))
+				return -ENOMEM;
+		}
+
 		len = PAGE_CACHE_SIZE;
 		kaddr = kmap_atomic(*ppage, KM_SKB_SUNRPC_DATA);
 		if (base) {
@@ -226,13 +233,15 @@ xdr_partial_copy_from_skb(struct xdr_buf
 		flush_dcache_page(*ppage);
 		kunmap_atomic(kaddr, KM_SKB_SUNRPC_DATA);
 		if (ret != len || !desc->count)
-			return;
+			return 0;
 		ppage++;
 	} while ((pglen -= len) != 0);
 copy_tail:
 	len = xdr->tail[0].iov_len;
 	if (base < len)
 		copy_actor(desc, (char *)xdr->tail[0].iov_base + base, len - base);
+
+	return 0;
 }
 
 
Index: linux-2.6.11-rc2/net/sunrpc/xprt.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/xprt.c
+++ linux-2.6.11-rc2/net/sunrpc/xprt.c
@@ -725,7 +725,8 @@ csum_partial_copy_to_xdr(struct xdr_buf 
 		goto no_checksum;
 
 	desc.csum = csum_partial(skb->data, desc.offset, skb->csum);
-	xdr_partial_copy_from_skb(xdr, 0, &desc, skb_read_and_csum_bits);
+	if (xdr_partial_copy_from_skb(xdr, 0, &desc, skb_read_and_csum_bits) < 0)
+		return -1;
 	if (desc.offset != skb->len) {
 		unsigned int csum2;
 		csum2 = skb_checksum(skb, desc.offset, skb->len - desc.offset, 0);
@@ -737,7 +738,8 @@ csum_partial_copy_to_xdr(struct xdr_buf 
 		return -1;
 	return 0;
 no_checksum:
-	xdr_partial_copy_from_skb(xdr, 0, &desc, skb_read_bits);
+	if (xdr_partial_copy_from_skb(xdr, 0, &desc, skb_read_bits) < 0)
+		return -1;
 	if (desc.count)
 		return -1;
 	return 0;
@@ -907,6 +909,7 @@ tcp_read_request(struct rpc_xprt *xprt, 
 	struct rpc_rqst *req;
 	struct xdr_buf *rcvbuf;
 	size_t len;
+	int r;
 
 	/* Find and lock the request corresponding to this xid */
 	spin_lock(&xprt->sock_lock);
@@ -927,16 +930,30 @@ tcp_read_request(struct rpc_xprt *xprt, 
 		len = xprt->tcp_reclen - xprt->tcp_offset;
 		memcpy(&my_desc, desc, sizeof(my_desc));
 		my_desc.count = len;
-		xdr_partial_copy_from_skb(rcvbuf, xprt->tcp_copied,
+		r = xdr_partial_copy_from_skb(rcvbuf, xprt->tcp_copied,
 					  &my_desc, tcp_copy_data);
 		desc->count -= len;
 		desc->offset += len;
 	} else
-		xdr_partial_copy_from_skb(rcvbuf, xprt->tcp_copied,
+		r = xdr_partial_copy_from_skb(rcvbuf, xprt->tcp_copied,
 					  desc, tcp_copy_data);
 	xprt->tcp_copied += len;
 	xprt->tcp_offset += len;
 
+	if (r < 0) {
+		/* Error when copying to the receive buffer,
+		 * usually because we weren't able to allocate
+		 * additional buffer pages. All we can do now
+		 * is turn off XPRT_COPY_DATA, so the request
+		 * will not receive any additional updates,
+		 * and time out.
+		 * Any remaining data from this record will
+		 * be discarded.
+		 */
+		xprt->tcp_flags &= ~XPRT_COPY_DATA;
+		goto out;
+	}
+
 	if (xprt->tcp_copied == req->rq_private_buf.buflen)
 		xprt->tcp_flags &= ~XPRT_COPY_DATA;
 	else if (xprt->tcp_offset == xprt->tcp_reclen) {
@@ -949,6 +966,7 @@ tcp_read_request(struct rpc_xprt *xprt, 
 				req->rq_task->tk_pid);
 		xprt_complete_rqst(xprt, req, xprt->tcp_copied);
 	}
+out:
 	spin_unlock(&xprt->sock_lock);
 	tcp_check_recm(xprt);
 }

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


* [patch 7/13] Encode and decode arbitrary XDR arrays
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2005-01-22 20:34 ` [patch 6/13] Lazy RPC receive buffer allocation Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-02-15 19:17   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 8/13] Add noacl nfs mount option Andreas Gruenbacher
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/sunrpc-xdr-arrays --]
[-- Type: text/plain, Size: 8972 bytes --]

Add xdr_encode_array2 and xdr_decode_array2 functions for encoding
and decoding arrays with arbitrary entries, such as acl entries. The
goal here is to do this without allocating a contiguous temporary
buffer.
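
To illustrate the technique (decode side only), here is a simplified
user-space sketch. It walks a list of segments instead of an xdr_buf's
head, pages and tail, and all names are hypothetical. Elements that lie
entirely within one segment are handed to the callback in place; only an
element that straddles a boundary is assembled in a single-element
bounce buffer.

```c
#include <stddef.h>
#include <string.h>

#define ELEM_SIZE 12	/* three 32-bit XDR words per element */

/* One contiguous piece of the (non-contiguous) wire data. */
struct seg {
	const unsigned char *base;
	size_t len;
};

typedef int (*elem_fn)(void *ctx, const unsigned char *elem);

/* Call fn once per element.  Returns 0 on success, the callback's error,
 * or -1 if the data ends in the middle of an element. */
static int decode_elems(const struct seg *segs, size_t nsegs,
			elem_fn fn, void *ctx)
{
	unsigned char bounce[ELEM_SIZE];
	size_t copied = 0, i;
	int err;

	for (i = 0; i < nsegs; i++) {
		const unsigned char *c = segs[i].base;
		size_t avail = segs[i].len;

		if (copied) {	/* finish an element begun earlier */
			size_t l = ELEM_SIZE - copied;

			if (l > avail)
				l = avail;
			memcpy(bounce + copied, c, l);
			copied += l;
			c += l;
			avail -= l;
			if (copied < ELEM_SIZE)
				continue;
			err = fn(ctx, bounce);
			if (err)
				return err;
			copied = 0;
		}
		while (avail >= ELEM_SIZE) {	/* whole elements, in place */
			err = fn(ctx, c);
			if (err)
				return err;
			c += ELEM_SIZE;
			avail -= ELEM_SIZE;
		}
		if (avail) {	/* start of a straddling element */
			memcpy(bounce, c, avail);
			copied = avail;
		}
	}
	return copied ? -1 : 0;
}

struct tally {
	unsigned int count;
	unsigned int perm_sum;
};

/* Read a big-endian 32-bit word bytewise, so alignment does not matter. */
static unsigned int get_be32(const unsigned char *p)
{
	return ((unsigned int)p[0] << 24) | ((unsigned int)p[1] << 16) |
	       ((unsigned int)p[2] << 8) | (unsigned int)p[3];
}

/* Example callback: count elements and sum their third word. */
static int tally_entry(void *ctx, const unsigned char *elem)
{
	struct tally *t = ctx;

	t->count++;
	t->perm_sum += get_be32(elem + 8);
	return 0;
}

/* Two 12-byte elements split 5/10/9 so that both straddle a boundary.
 * Returns 1 if both were decoded correctly. */
static int demo_split_decode(void)
{
	static const unsigned char wire[24] = {
		0, 0, 0, 1,  0, 0, 0, 0,  0, 0, 0, 7,
		0, 0, 0, 4,  0, 0, 0, 0,  0, 0, 0, 5,
	};
	struct seg segs[3] = {
		{ wire, 5 }, { wire + 5, 10 }, { wire + 15, 9 }
	};
	struct tally t = { 0, 0 };

	if (decode_elems(segs, 3, tally_entry, &t) != 0)
		return 0;
	return t.count == 2 && t.perm_sum == 12;
}
```

The temporary storage is bounded by one element regardless of the array
length, which is the property the patch is after.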

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/sunrpc/xdr.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/sunrpc/xdr.h
+++ linux-2.6.11-rc2/include/linux/sunrpc/xdr.h
@@ -146,7 +146,8 @@ extern void xdr_shift_buf(struct xdr_buf
 extern void xdr_buf_from_iov(struct kvec *, struct xdr_buf *);
 extern int xdr_buf_subsegment(struct xdr_buf *, struct xdr_buf *, int, int);
 extern int xdr_buf_read_netobj(struct xdr_buf *, struct xdr_netobj *, int);
-extern int read_bytes_from_xdr_buf(struct xdr_buf *buf, int base, void *obj, int len);
+extern int read_bytes_from_xdr_buf(struct xdr_buf *, int, void *, int);
+extern int write_bytes_to_xdr_buf(struct xdr_buf *, int, void *, int);
 
 /*
  * Helper structure for copying from an sk_buff.
@@ -168,6 +169,22 @@ struct sockaddr;
 extern int xdr_sendpages(struct socket *, struct sockaddr *, int,
 		struct xdr_buf *, unsigned int, int);
 
+extern int xdr_encode_word(struct xdr_buf *, int, u32);
+extern int xdr_decode_word(struct xdr_buf *, int, u32 *);
+
+struct xdr_array2_desc;
+typedef int (*xdr_xcode_elem_t)(struct xdr_array2_desc *desc, void *elem);
+struct xdr_array2_desc {
+	unsigned int elem_size;
+	unsigned int array_len;
+	xdr_xcode_elem_t xcode;
+};
+
+extern int xdr_decode_array2(struct xdr_buf *buf, unsigned int base,
+                             struct xdr_array2_desc *desc);
+extern int xdr_encode_array2(struct xdr_buf *buf, unsigned int base,
+			     struct xdr_array2_desc *desc);
+
 /*
  * Provide some simple tools for XDR buffer overflow-checking etc.
  */
Index: linux-2.6.11-rc2/net/sunrpc/sunrpc_syms.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/sunrpc_syms.c
+++ linux-2.6.11-rc2/net/sunrpc/sunrpc_syms.c
@@ -129,6 +129,10 @@ EXPORT_SYMBOL(xdr_encode_netobj);
 EXPORT_SYMBOL(xdr_encode_pages);
 EXPORT_SYMBOL(xdr_inline_pages);
 EXPORT_SYMBOL(xdr_shift_buf);
+EXPORT_SYMBOL(xdr_encode_word);
+EXPORT_SYMBOL(xdr_decode_word);
+EXPORT_SYMBOL(xdr_encode_array2);
+EXPORT_SYMBOL(xdr_decode_array2);
 EXPORT_SYMBOL(xdr_buf_from_iov);
 EXPORT_SYMBOL(xdr_buf_subsegment);
 EXPORT_SYMBOL(xdr_buf_read_netobj);
Index: linux-2.6.11-rc2/net/sunrpc/xdr.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/xdr.c
+++ linux-2.6.11-rc2/net/sunrpc/xdr.c
@@ -868,8 +868,34 @@ out:
 	return status;
 }
 
-static int
-read_u32_from_xdr_buf(struct xdr_buf *buf, int base, u32 *obj)
+/* obj is assumed to point to allocated memory of size at least len: */
+int
+write_bytes_to_xdr_buf(struct xdr_buf *buf, int base, void *obj, int len)
+{
+	struct xdr_buf subbuf;
+	int this_len;
+	int status;
+
+	status = xdr_buf_subsegment(buf, &subbuf, base, len);
+	if (status)
+		goto out;
+	this_len = min(len, (int)subbuf.head[0].iov_len);
+	memcpy(subbuf.head[0].iov_base, obj, this_len);
+	len -= this_len;
+	obj += this_len;
+	this_len = min(len, (int)subbuf.page_len);
+	if (this_len)
+		_copy_to_pages(subbuf.pages, subbuf.page_base, obj, this_len);
+	len -= this_len;
+	obj += this_len;
+	this_len = min(len, (int)subbuf.tail[0].iov_len);
+	memcpy(subbuf.tail[0].iov_base, obj, this_len);
+out:
+	return status;
+}
+
+int
+xdr_decode_word(struct xdr_buf *buf, int base, u32 *obj)
 {
 	u32	raw;
 	int	status;
@@ -881,6 +907,14 @@ read_u32_from_xdr_buf(struct xdr_buf *bu
 	return 0;
 }
 
+int
+xdr_encode_word(struct xdr_buf *buf, int base, u32 obj)
+{
+	u32	raw = htonl(obj);
+
+	return write_bytes_to_xdr_buf(buf, base, &raw, sizeof(obj));
+}
+
 /* If the netobj starting offset bytes from the start of xdr_buf is contained
  * entirely in the head or the tail, set object to point to it; otherwise
  * try to find space for it at the end of the tail, copy it there, and
@@ -891,7 +925,7 @@ xdr_buf_read_netobj(struct xdr_buf *buf,
 	u32	tail_offset = buf->head[0].iov_len + buf->page_len;
 	u32	obj_end_offset;
 
-	if (read_u32_from_xdr_buf(buf, offset, &obj->len))
+	if (xdr_decode_word(buf, offset, &obj->len))
 		goto out;
 	obj_end_offset = offset + 4 + obj->len;
 
@@ -924,3 +958,194 @@ xdr_buf_read_netobj(struct xdr_buf *buf,
 out:
 	return -1;
 }
+
+/* Returns 0 on success, or else a negative error code. */
+static int
+xdr_xcode_array2(struct xdr_buf *buf, unsigned int base,
+		 struct xdr_array2_desc *desc, int encode)
+{
+	char elem[desc->elem_size], *c;
+	unsigned int copied = 0, todo, avail_here;
+	struct page **ppages = NULL;
+	int err = 0;
+
+	if (encode) {
+		if (xdr_encode_word(buf, base, desc->array_len) != 0)
+			return -EINVAL;
+	} else {
+		if (xdr_decode_word(buf, base, &desc->array_len) != 0 ||
+		    (unsigned long) base + 4 + desc->array_len *
+				    desc->elem_size > buf->len)
+			return -EINVAL;
+	}
+	base += 4;
+
+	if (!desc->xcode)
+		return 0;
+
+	todo = desc->array_len * desc->elem_size;
+	
+	/* process head */
+	if (todo && base < buf->head->iov_len) {
+		c = buf->head->iov_base + base;
+		avail_here = min_t(unsigned int, todo,
+				   buf->head->iov_len - base);
+		todo -= avail_here;
+
+		while (avail_here >= desc->elem_size) {
+			err = desc->xcode(desc, c);
+			if (err)
+				goto out;
+			c += desc->elem_size;
+			avail_here -= desc->elem_size;
+		}
+		if (avail_here) {
+			if (encode) {
+				err = desc->xcode(desc, elem);
+				if (err)
+					goto out;
+				memcpy(c, elem, avail_here);
+			} else
+				memcpy(elem, c, avail_here);
+			copied = avail_here;
+		}
+		base = buf->head->iov_len;  /* align to start of pages */
+	}
+
+	/* process pages array */
+	base -= buf->head->iov_len;
+	if (todo && base < buf->page_len) {
+		avail_here = min(todo, buf->page_len - base);
+		todo -= avail_here;
+
+		base += buf->page_base;
+		ppages = buf->pages + (base >> PAGE_CACHE_SHIFT);
+		base &= ~PAGE_CACHE_MASK;
+		unsigned int avail_page = min_t(unsigned int,
+			PAGE_CACHE_SIZE - base, avail_here);
+		c = kmap(*ppages) + base;
+
+		while (avail_here) {
+			avail_here -= avail_page;
+			if (copied || avail_page < desc->elem_size) {
+				unsigned int l = min(avail_page,
+					desc->elem_size - copied);
+				if (encode) {
+					if (!copied) {
+						err = desc->xcode(desc, elem);
+						if (err)
+							goto out;
+					}
+					memcpy(c, elem + copied, l);
+					copied += l;
+					if (copied == desc->elem_size)
+						copied = 0;
+				} else {
+					memcpy(elem + copied, c, l);
+					copied += l;
+					if (copied == desc->elem_size) {
+						err = desc->xcode(desc, elem);
+						if (err)
+							goto out;
+						copied = 0;
+					}
+				}
+				avail_page -= l;
+				c += l;
+			}
+			while (avail_page >= desc->elem_size) {
+				err = desc->xcode(desc, c);
+				if (err)
+					goto out;
+				c += desc->elem_size;
+				avail_page -= desc->elem_size;
+			}
+			if (avail_page) {
+				unsigned int l = min(avail_page,
+					    desc->elem_size - copied);
+				if (encode) {
+					if (!copied) {
+						err = desc->xcode(desc, elem);
+						if (err)
+							goto out;
+					}
+					memcpy(c, elem + copied, l);
+					copied += l;
+					if (copied == desc->elem_size)
+						copied = 0;
+				} else {
+					memcpy(elem + copied, c, l);
+					copied += l;
+					if (copied == desc->elem_size) {
+						err = desc->xcode(desc, elem);
+						if (err)
+							goto out;
+						copied = 0;
+					}
+				}
+			}
+			if (avail_here) {
+				kunmap(*ppages);
+				ppages++;
+				c = kmap(*ppages);
+			}
+
+			avail_page = min(avail_here,
+				 (unsigned int) PAGE_CACHE_SIZE);
+		}
+		base = buf->page_len;  /* align to start of tail */
+	}
+
+	/* process tail */
+	base -= buf->page_len;
+	if (todo) {
+		c = buf->tail->iov_base + base;
+		if (copied) {
+			unsigned int l = desc->elem_size - copied;
+
+			if (encode)
+				memcpy(c, elem + copied, l);
+			else {
+				memcpy(elem + copied, c, l);
+				err = desc->xcode(desc, elem);
+				if (err)
+					goto out;
+			}
+			todo -= l;
+			c += l;
+		}
+		while (todo) {
+			err = desc->xcode(desc, c);
+			if (err)
+				goto out;
+			c += desc->elem_size;
+			todo -= desc->elem_size;
+		}
+	}
+	
+out:
+	if (ppages)
+		kunmap(*ppages);
+	return err;
+}
+
+int
+xdr_decode_array2(struct xdr_buf *buf, unsigned int base,
+		  struct xdr_array2_desc *desc)
+{
+	if (base >= buf->len)
+		return -EINVAL;
+
+	return xdr_xcode_array2(buf, base, desc, 0);
+}
+
+int
+xdr_encode_array2(struct xdr_buf *buf, unsigned int base,
+		  struct xdr_array2_desc *desc)
+{
+	if ((unsigned long) base + 4 + desc->array_len * desc->elem_size >
+	    buf->head->iov_len + buf->page_len + buf->tail->iov_len)
+		return -EINVAL;
+
+	return xdr_xcode_array2(buf, base, desc, 1);
+}

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


* [patch 8/13] Add noacl nfs mount option
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2005-01-22 20:34 ` [patch 7/13] Encode and decode arbitrary XDR arrays Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-02-15 17:24   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 9/13] Infrastructure and server side of nfsacl Andreas Gruenbacher
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfs-access-acl --]
[-- Type: text/plain, Size: 2469 bytes --]

With the noacl mount option, nfs clients stop using the ACCESS RPC
which they usually use to get an access decision from the server.
Instead, they make the decision based on the file ownership and
file mode permission bits.

Security-wise, using this option can lead to illicit read access to data
cached locally on the client if the server uses POSIX ACLs.  Local
access decisions are only correct as long as the server does not support
POSIX access control lists.
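
As a rough illustration of what such a local decision looks like, and
why it can be wrong, consider the sketch below. The names are
hypothetical, and the check ignores root and supplementary groups, which
a real permission check would handle. It assumes Linux POSIX ACL
semantics, where the group bits of the file mode reflect the mask entry
rather than the owning group's entry.

```c
/* Permission bits as used in each three-bit class of the file mode. */
#define MAY_EXEC  1u
#define MAY_WRITE 2u
#define MAY_READ  4u

/* What the client knows about a file without asking the server. */
struct file_attrs {
	unsigned int mode;
	unsigned int uid;
	unsigned int gid;
};

/* Decide from ownership and mode bits alone: pick the owner, group or
 * other class and test the requested bits against it. */
static int mode_permits(const struct file_attrs *f, unsigned int uid,
			unsigned int gid, unsigned int mask)
{
	unsigned int bits = f->mode;

	if (uid == f->uid)
		bits >>= 6;
	else if (gid == f->gid)
		bits >>= 3;
	return (bits & 7u & mask) == mask;
}

/*
 * Suppose the server holds this ACL for a file owned by uid 0, gid 100:
 *
 *	user::rw-  user:1001:r--  group::---  mask::r--  other::---
 *
 * The mode derived from it is 0640, because the group bits show the
 * mask.  The server would deny read access to a member of group 100
 * (group:: is ---), yet the local check below grants it.
 */
static int demo_group_member_read(void)
{
	struct file_attrs f = { 0640, 0, 100 };

	return mode_permits(&f, 1002, 100, MAY_READ);
}

/* An unrelated user is refused by the mode bits as expected. */
static int demo_other_read(void)
{
	struct file_attrs f = { 0640, 0, 100 };

	return mode_permits(&f, 1002, 200, MAY_READ);
}
```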

This approach was discussed with Trond Myklebust <trond.myklebust@fys.uio.no>
and Olaf Kirch <okir@suse.de>. Requires a patch to mount (util-linux).

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc2/fs/nfs/dir.c
@@ -1497,6 +1497,7 @@ out:
 
 int nfs_permission(struct inode *inode, int mask, struct nameidata *nd)
 {
+	struct nfs_server *server = NFS_SERVER(inode);
 	struct rpc_cred *cred;
 	int res;
 
@@ -1515,7 +1516,7 @@ int nfs_permission(struct inode *inode, 
 
 	lock_kernel();
 
-	if (!NFS_PROTO(inode)->access)
+	if ((server->flags & NFS_MOUNT_NOACL) || !NFS_PROTO(inode)->access)
 		goto out_notsup;
 
 	cred = rpcauth_lookupcred(NFS_CLIENT(inode)->cl_auth, 0);
Index: linux-2.6.11-rc2/fs/nfs/inode.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/inode.c
+++ linux-2.6.11-rc2/fs/nfs/inode.c
@@ -539,6 +539,7 @@ static int nfs_show_options(struct seq_f
 		{ NFS_MOUNT_NOAC, ",noac", "" },
 		{ NFS_MOUNT_NONLM, ",nolock", ",lock" },
 		{ NFS_MOUNT_BROKEN_SUID, ",broken_suid", "" },
+		{ NFS_MOUNT_NOACL, ",noacl", "" },
 		{ 0, NULL, NULL }
 	};
 	struct proc_nfs_info *nfs_infop;
Index: linux-2.6.11-rc2/include/linux/nfs_mount.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_mount.h
+++ linux-2.6.11-rc2/include/linux/nfs_mount.h
@@ -58,6 +58,7 @@ struct nfs_mount_data {
 #define NFS_MOUNT_KERBEROS	0x0100	/* 3 */
 #define NFS_MOUNT_NONLM		0x0200	/* 3 */
 #define NFS_MOUNT_BROKEN_SUID	0x0400	/* 4 */
+#define NFS_MOUNT_NOACL		0x0800  /* 4 */
 #define NFS_MOUNT_STRICTLOCK	0x1000	/* reserved for NFSv4 */
 #define NFS_MOUNT_SECFLAVOUR	0x2000	/* 5 */
 #define NFS_MOUNT_FLAGMASK	0xFFFF

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


* [patch 9/13] Infrastructure and server side of nfsacl
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (7 preceding siblings ...)
  2005-01-22 20:34 ` [patch 8/13] Add noacl nfs mount option Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 10/13] Solaris nfsacl workaround Andreas Gruenbacher
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfsd-acl --]
[-- Type: text/plain, Size: 28303 bytes --]

This adds functions for encoding and decoding POSIX ACLs for the
NFSACL protocol extension, and the GETACL and SETACL RPCs. The
implementation is compatible with NFSACL in Solaris.
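
The main compatibility wrinkle is the fourth mask entry that Solaris
expects for minimal acls. A simplified user-space sketch of the
conversion in both directions follows; the names are hypothetical, the
tag values are chosen to mirror the kernel's ACL_* constants, and the
sorting step done on the decode side is left out.

```c
#include <stddef.h>
#include <string.h>

enum {
	TAG_USER_OBJ  = 0x01,
	TAG_USER      = 0x02,
	TAG_GROUP_OBJ = 0x04,
	TAG_GROUP     = 0x08,
	TAG_MASK      = 0x10,
	TAG_OTHER     = 0x20
};

struct ace {
	int tag;
	unsigned int id;
	unsigned int perm;
};

/* Copy n entries to out (which must hold at least 4) and, for a
 * three-entry acl, append a mask entry carrying the group:: permissions.
 * Returns the number of entries written. */
static size_t acl_to_wire(const struct ace *in, size_t n, struct ace *out)
{
	unsigned int group_perm = 7;
	size_t i;

	memcpy(out, in, n * sizeof(*in));
	if (n != 3)
		return n;
	for (i = 0; i < n; i++) {
		if (in[i].tag == TAG_GROUP_OBJ) {
			group_perm = in[i].perm & 7;
			break;
		}
	}
	out[3].tag = TAG_MASK;
	out[3].id = 0;
	out[3].perm = group_perm;
	return 4;
}

/* Drop the mask from a four-entry acl if it merely repeats group::.
 * Returns the new number of entries. */
static size_t acl_from_wire(struct ace *acl, size_t n)
{
	struct ace *group_obj = NULL, *mask = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (acl[i].tag == TAG_GROUP_OBJ)
			group_obj = &acl[i];
		else if (acl[i].tag == TAG_MASK)
			mask = &acl[i];
	}
	if (n == 4 && group_obj && mask && mask->perm == group_obj->perm) {
		memmove(mask, mask + 1,
			(3 - (size_t)(mask - acl)) * sizeof(*acl));
		return 3;
	}
	return n;
}

/* A minimal acl grows to four entries on the wire and shrinks back. */
static int demo_round_trip(void)
{
	const struct ace minimal[3] = {
		{ TAG_USER_OBJ, 0, 6 }, { TAG_GROUP_OBJ, 0, 4 },
		{ TAG_OTHER, 0, 0 }
	};
	struct ace wire[4];
	size_t n = acl_to_wire(minimal, 3, wire);

	if (n != 4 || wire[3].tag != TAG_MASK || wire[3].perm != 4)
		return 0;
	n = acl_from_wire(wire, n);
	return n == 3 && memcmp(wire, minimal, sizeof(minimal)) == 0;
}

/* A mask that differs from group:: carries information and is kept. */
static int demo_keeps_real_mask(void)
{
	struct ace acl[4] = {
		{ TAG_USER_OBJ, 0, 6 }, { TAG_GROUP_OBJ, 0, 6 },
		{ TAG_MASK, 0, 4 }, { TAG_OTHER, 0, 0 }
	};

	return acl_from_wire(acl, 4) == 4;
}
```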

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/include/linux/sunrpc/svc.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/sunrpc/svc.h
+++ linux-2.6.11-rc2/include/linux/sunrpc/svc.h
@@ -185,6 +185,27 @@ xdr_ressize_check(struct svc_rqst *rqstp
 	return vec->iov_len <= PAGE_SIZE;
 }
 
+#if 0
+static inline struct page *
+svc_take_arg_page(struct svc_rqst *rqstp)
+{
+	if (rqstp->rq_arghi <= rqstp->rq_argused)
+		return NULL;
+	return rqstp->rq_argpages[rqstp->rq_argused++];
+}
+#endif
+
+static inline struct page *
+svc_take_res_page(struct svc_rqst *rqstp)
+{
+	if (rqstp->rq_arghi <= rqstp->rq_argused)
+		return NULL;
+	rqstp->rq_arghi--;
+	rqstp->rq_respages[rqstp->rq_resused] =
+		rqstp->rq_argpages[rqstp->rq_arghi];
+	return rqstp->rq_respages[rqstp->rq_resused++];
+}
+
 static inline int svc_take_page(struct svc_rqst *rqstp)
 {
 	if (rqstp->rq_arghi <= rqstp->rq_argused)
Index: linux-2.6.11-rc2/fs/Kconfig
===================================================================
--- linux-2.6.11-rc2.orig/fs/Kconfig
+++ linux-2.6.11-rc2/fs/Kconfig
@@ -1477,6 +1477,7 @@ config NFSD
 	depends on INET
 	select LOCKD
 	select SUNRPC
+	select NFS_ACL_SUPPORT if NFSD_ACL
 	help
 	  If you want your Linux box to act as an NFS *server*, so that other
 	  computers on your local network which support NFS can access certain
@@ -1507,6 +1508,19 @@ config NFSD_V3
 	  If you would like to include the NFSv3 server as well as the NFSv2
 	  server, say Y here.  If unsure, say Y.
 
+config NFSD_ACL
+	bool "NFS_ACL protocol extension"
+	depends on NFSD_V3
+	select QSORT
+	help
+	  Implement the NFS_ACL protocol extension for manipulating POSIX
+	  Access Control Lists on exported file systems.  The clients must
+	  also implement the NFS_ACL protocol extension; see the
+	  CONFIG_NFS_ACL option.  If unsure, say N.
+
+config NFS_ACL_SUPPORT
+	tristate
+
 config NFSD_V4
 	bool "Provide NFSv4 server support (EXPERIMENTAL)"
 	depends on NFSD_V3 && EXPERIMENTAL
Index: linux-2.6.11-rc2/fs/Makefile
===================================================================
--- linux-2.6.11-rc2.orig/fs/Makefile
+++ linux-2.6.11-rc2/fs/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_BINFMT_FLAT)	+= binfmt_flat
 
 obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
 obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o xattr_acl.o
+obj-$(CONFIG_NFS_ACL_SUPPORT)	+= nfsacl.o
 
 obj-$(CONFIG_QUOTA)		+= dquot.o
 obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
Index: linux-2.6.11-rc2/fs/nfsacl.c
===================================================================
--- /dev/null
+++ linux-2.6.11-rc2/fs/nfsacl.c
@@ -0,0 +1,254 @@
+/*
+ * fs/nfsacl.c
+ *
+ *  Copyright (C) 2002-2003 Andreas Gruenbacher <agruen@suse.de>
+ */
+
+/*
+ * The Solaris nfsacl protocol represents some ACLs slightly differently
+ * than POSIX 1003.1e draft 17 does (and we do):
+ *
+ *  - Minimal ACLs always have an ACL_MASK entry, so they have
+ *    four instead of three entries.
+ *  - The ACL_MASK entry in such minimal ACLs always has the same
+ *    permissions as the ACL_GROUP_OBJ entry. (In extended ACLs
+ *    the ACL_MASK and ACL_GROUP_OBJ entries may differ.)
+ *  - The identifier fields of the ACL_USER_OBJ and ACL_GROUP_OBJ
+ *    entries contain the identifiers of the owner and owning group.
+ *    (In POSIX ACLs we always set them to ACL_UNDEFINED_ID).
+ *  - ACL entries in the kernel are kept sorted in ascending order
+ *    of (e_tag, e_id). Solaris ACLs are unsorted.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/sunrpc/xdr.h>
+#include <linux/nfsacl.h>
+#include <linux/nfs3.h>
+
+MODULE_LICENSE("GPL");
+
+EXPORT_SYMBOL(nfsacl_encode);
+EXPORT_SYMBOL(nfsacl_decode);
+
+struct nfsacl_encode_desc {
+	struct xdr_array2_desc desc;
+	unsigned int count;
+	struct posix_acl *acl;
+	int typeflag;
+	uid_t uid;
+	gid_t gid;
+};
+
+static int
+xdr_nfsace_encode(struct xdr_array2_desc *desc, void *elem)
+{
+	struct nfsacl_encode_desc *nfsacl_desc =
+		(struct nfsacl_encode_desc *) desc;
+	u32 *p = (u32 *) elem;
+
+	if (nfsacl_desc->count < nfsacl_desc->acl->a_count) {
+		struct posix_acl_entry *entry =
+			&nfsacl_desc->acl->a_entries[nfsacl_desc->count++];
+
+		*p++ = htonl(entry->e_tag | nfsacl_desc->typeflag);
+		switch(entry->e_tag) {
+			case ACL_USER_OBJ:
+				*p++ = htonl(nfsacl_desc->uid);
+				break;
+			case ACL_GROUP_OBJ:
+				*p++ = htonl(nfsacl_desc->gid);
+				break;
+			case ACL_USER:
+			case ACL_GROUP:
+				*p++ = htonl(entry->e_id);
+				break;
+			default:  /* Solaris depends on that! */
+				*p++ = 0;
+				break;
+		}
+		*p++ = htonl(entry->e_perm & S_IRWXO);
+	} else {
+		const struct posix_acl_entry *pa, *pe;
+		int group_obj_perm = ACL_READ|ACL_WRITE|ACL_EXECUTE;
+
+		FOREACH_ACL_ENTRY(pa, nfsacl_desc->acl, pe) {
+			if (pa->e_tag == ACL_GROUP_OBJ) {
+				group_obj_perm = pa->e_perm & S_IRWXO;
+				break;
+			}
+		}
+		/* fake up ACL_MASK entry */
+		*p++ = htonl(ACL_MASK | nfsacl_desc->typeflag);
+		*p++ = htonl(0);
+		*p++ = htonl(group_obj_perm);
+	}
+
+	return 0;
+}
+
+unsigned int
+nfsacl_encode(struct xdr_buf *buf, unsigned int base, struct inode *inode,
+	      struct posix_acl *acl, int encode_entries, int typeflag)
+{
+	int entries = (acl && acl->a_count) ? max_t(int, acl->a_count, 4) : 0;
+	struct nfsacl_encode_desc nfsacl_desc = {
+		.desc = {
+			.elem_size = 12,
+			.array_len = encode_entries ? entries : 0,
+			.xcode = xdr_nfsace_encode,
+		},
+		.acl = acl,
+		.typeflag = typeflag,
+		.uid = inode->i_uid,
+		.gid = inode->i_gid,
+	};
+	int err;
+
+	if (entries > NFS3_ACL_MAX_ENTRIES ||
+	    xdr_encode_word(buf, base, entries))
+		return -EINVAL;
+	err = xdr_encode_array2(buf, base + 4, &nfsacl_desc.desc);
+	if (!err)
+		err = 8 + nfsacl_desc.desc.elem_size *
+			  nfsacl_desc.desc.array_len;
+	return err;
+}
+
+struct nfsacl_decode_desc {
+	struct xdr_array2_desc desc;
+	unsigned int count;
+	struct posix_acl *acl;
+};
+
+static int
+xdr_nfsace_decode(struct xdr_array2_desc *desc, void *elem)
+{
+	struct nfsacl_decode_desc *nfsacl_desc =
+		(struct nfsacl_decode_desc *) desc;
+	u32 *p = (u32 *) elem;
+	struct posix_acl_entry *entry;
+
+	if (!nfsacl_desc->acl) {
+		if (desc->array_len > NFS3_ACL_MAX_ENTRIES)
+			return -EINVAL;
+		nfsacl_desc->acl = posix_acl_alloc(desc->array_len, GFP_KERNEL);
+		if (!nfsacl_desc->acl)
+			return -ENOMEM;
+		nfsacl_desc->count = 0;
+	}
+
+	entry = &nfsacl_desc->acl->a_entries[nfsacl_desc->count++];
+	entry->e_tag = ntohl(*p++) & ~NFS3_ACL_DEFAULT;
+	entry->e_id = ntohl(*p++);
+	entry->e_perm = ntohl(*p++);
+
+	switch(entry->e_tag) {
+		case ACL_USER_OBJ:
+		case ACL_USER:
+		case ACL_GROUP_OBJ:
+		case ACL_GROUP:
+		case ACL_OTHER:
+			if (entry->e_perm & ~S_IRWXO)
+				return -EINVAL;
+			break;
+		case ACL_MASK:
+			/* Solaris sometimes sets additional bits in the mask */
+			entry->e_perm &= S_IRWXO;
+			break;
+		default:
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int
+cmp_acl_entry(const struct posix_acl_entry *a, const struct posix_acl_entry *b)
+{
+	if (a->e_tag != b->e_tag)
+		return a->e_tag - b->e_tag;
+	else if (a->e_id > b->e_id)
+		return 1;
+	else if (a->e_id < b->e_id)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Convert from a Solaris ACL to a POSIX 1003.1e draft 17 ACL.
+ */
+static int
+posix_acl_from_nfsacl(struct posix_acl *acl)
+{
+	struct posix_acl_entry *pa, *pe,
+	       *group_obj = NULL, *mask = NULL;
+
+	if (!acl)
+		return 0;
+
+	qsort(acl->a_entries, acl->a_count, sizeof(struct posix_acl_entry),
+	      (int(*)(const void *,const void *))cmp_acl_entry);
+
+	/* Clear undefined identifier fields and find the ACL_GROUP_OBJ
+	   and ACL_MASK entries. */
+	FOREACH_ACL_ENTRY(pa, acl, pe) {
+		switch(pa->e_tag) {
+			case ACL_USER_OBJ:
+				pa->e_id = ACL_UNDEFINED_ID;
+				break;
+			case ACL_GROUP_OBJ:
+				pa->e_id = ACL_UNDEFINED_ID;
+				group_obj = pa;
+				break;
+			case ACL_MASK:
+				mask = pa;
+				/* fall through */
+			case ACL_OTHER:
+				pa->e_id = ACL_UNDEFINED_ID;
+				break;
+		}
+	}
+	if (acl->a_count == 4 && group_obj && mask &&
+	    mask->e_perm == group_obj->e_perm) {
+		/* remove bogus ACL_MASK entry */
+		memmove(mask, mask+1, (3 - (mask - acl->a_entries)) *
+				      sizeof(struct posix_acl_entry));
+		acl->a_count = 3;
+	}
+	return 0;
+}
+
+unsigned int
+nfsacl_decode(struct xdr_buf *buf, unsigned int base, unsigned int *aclcnt,
+	      struct posix_acl **pacl)
+{
+	struct nfsacl_decode_desc nfsacl_desc = {
+		.desc = {
+			.elem_size = 12,
+			.xcode = pacl ? xdr_nfsace_decode : NULL,
+		},
+	};
+	u32 entries;
+	int err;
+
+	if (xdr_decode_word(buf, base, &entries) ||
+	    entries > NFS3_ACL_MAX_ENTRIES)
+		return -EINVAL;
+	err = xdr_decode_array2(buf, base + 4, &nfsacl_desc.desc);
+	if (err)
+		return err;
+	if (pacl) {
+		if (entries != nfsacl_desc.desc.array_len ||
+		    posix_acl_from_nfsacl(nfsacl_desc.acl) != 0) {
+			posix_acl_release(nfsacl_desc.acl);
+			return -EINVAL;
+		}
+		*pacl = nfsacl_desc.acl;
+	}
+	if (aclcnt)
+		*aclcnt = entries;
+	return 8 + nfsacl_desc.desc.elem_size *
+		   nfsacl_desc.desc.array_len;
+}
Index: linux-2.6.11-rc2/fs/nfsd/nfs3proc.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/nfs3proc.c
+++ linux-2.6.11-rc2/fs/nfsd/nfs3proc.c
@@ -24,6 +24,7 @@
 #include <linux/nfsd/cache.h>
 #include <linux/nfsd/xdr3.h>
 #include <linux/nfs3.h>
+#include <linux/nfsacl.h>
 
 #define NFSDDBG_FACILITY		NFSDDBG_PROC
 
@@ -630,6 +631,105 @@ nfsd3_proc_commit(struct svc_rqst * rqst
 	RETURN_STATUS(nfserr);
 }
 
+#ifdef CONFIG_NFSD_ACL
+/*
+ * Get the Access and/or Default ACL of a file.
+ */
+static int
+nfsd3_proc_getacl(struct svc_rqst * rqstp, struct nfsd3_getaclargs *argp,
+					   struct nfsd3_getaclres *resp)
+{
+	svc_fh *fh;
+	struct posix_acl *acl;
+	int nfserr = 0;
+
+	fh = fh_copy(&resp->fh, &argp->fh);
+	if ((nfserr = fh_verify(rqstp, &resp->fh, 0, MAY_NOP)))
+		RETURN_STATUS(nfserr_inval);
+	
+	if (argp->mask & ~(NFS3_ACL|NFS3_ACLCNT|NFS3_DFACL|NFS3_DFACLCNT))
+		RETURN_STATUS(nfserr_inval);
+	resp->mask = argp->mask;
+
+	if (resp->mask & (NFS3_ACL|NFS3_ACLCNT)) {
+		acl = nfsd_get_posix_acl(fh, ACL_TYPE_ACCESS);
+		if (IS_ERR(acl)) {
+			int err = PTR_ERR(acl);
+
+			if (err == -ENODATA || err == -EOPNOTSUPP)
+				acl = NULL;
+			else {
+				nfserr = nfserrno(err);
+				goto fail;
+			}
+		}
+		if (acl == NULL) {
+			/* Solaris returns the inode's minimum ACL. */
+
+			struct inode *inode = fh->fh_dentry->d_inode;
+			acl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
+		}
+		resp->acl_access = acl;
+	}
+	if (resp->mask & (NFS3_DFACL|NFS3_DFACLCNT)) {
+		/* Check how Solaris handles requests for the Default ACL
+		   of a non-directory! */
+
+		acl = nfsd_get_posix_acl(fh, ACL_TYPE_DEFAULT);
+		if (IS_ERR(acl)) {
+			int err = PTR_ERR(acl);
+
+			if (err == -ENODATA || err == -EOPNOTSUPP)
+				acl = NULL;
+			else {
+				nfserr = nfserrno(err);
+				goto fail;
+			}
+		}
+		resp->acl_default = acl;
+	}
+
+	/* resp->acl_{access,default} are released in nfs3svc_release_getacl. */
+	RETURN_STATUS(0);
+
+fail:
+	posix_acl_release(resp->acl_access);
+	posix_acl_release(resp->acl_default);
+	RETURN_STATUS(nfserr);
+}
+#endif  /* CONFIG_NFSD_ACL */
+
+#ifdef CONFIG_NFSD_ACL
+/*
+ * Set the Access and/or Default ACL of a file.
+ */
+static int
+nfsd3_proc_setacl(struct svc_rqst * rqstp, struct nfsd3_setaclargs *argp,
+					   struct nfsd3_attrstat *resp)
+{
+	svc_fh *fh;
+	int nfserr = 0;
+
+	fh = fh_copy(&resp->fh, &argp->fh);
+	nfserr = fh_verify(rqstp, &resp->fh, 0, MAY_NOP);
+	
+	if (!nfserr) {
+		nfserr = nfserrno( nfsd_set_posix_acl(
+			fh, ACL_TYPE_ACCESS, argp->acl_access) );
+	}
+	if (!nfserr) {
+		nfserr = nfserrno( nfsd_set_posix_acl(
+			fh, ACL_TYPE_DEFAULT, argp->acl_default) );
+	}
+
+	/* argp->acl_{access,default} may have been allocated in
+	   nfs3svc_decode_setaclargs. */
+	posix_acl_release(argp->acl_access);
+	posix_acl_release(argp->acl_default);
+	RETURN_STATUS(nfserr);
+}
+#endif  /* CONFIG_NFSD_ACL */
+
 
 /*
  * NFSv3 Server procedures.
@@ -647,6 +747,7 @@ nfsd3_proc_commit(struct svc_rqst * rqst
 #define nfsd3_attrstatres		nfsd3_attrstat
 #define nfsd3_wccstatres		nfsd3_attrstat
 #define nfsd3_createres			nfsd3_diropres
+#define nfsd3_setaclres			nfsd3_attrstat
 #define nfsd3_voidres			nfsd3_voidargs
 struct nfsd3_voidargs { int dummy; };
 
@@ -667,6 +768,7 @@ struct nfsd3_voidargs { int dummy; };
 #define AT 21		/* attributes */
 #define pAT (1+AT)	/* post attributes - conditional */
 #define WC (7+pAT)	/* WCC attributes */
+#define ACL (1+NFS3_ACL_MAX_ENTRIES*3)  /* Access Control List */
 
 static struct svc_procedure		nfsd_procedures3[22] = {
   PROC(null,	 void,		void,		void,	  RC_NOCACHE, ST),
@@ -700,3 +802,19 @@ struct svc_version	nfsd_version3 = {
 		.vs_dispatch	= nfsd_dispatch,
 		.vs_xdrsize	= NFS3_SVC_XDRSIZE,
 };
+
+#ifdef CONFIG_NFSD_ACL
+struct svc_procedure		nfsd_acl_procedures3[] = {
+  PROC(null,	void,		void,		void,	  RC_NOCACHE, ST),
+  PROC(getacl,	getacl,		getacl,		getacl,	  RC_NOCACHE, ST+1+2*(1+ACL)),
+  PROC(setacl,	setacl,		setacl,		fhandle,  RC_NOCACHE, ST+pAT),
+};
+
+struct svc_version	nfsd_acl_version3 = {
+		.vs_vers	= 3,
+		.vs_nproc	= 3,
+		.vs_proc	= nfsd_acl_procedures3,
+		.vs_dispatch	= nfsd_dispatch,
+		.vs_xdrsize	= NFS3_SVC_XDRSIZE,
+};
+#endif  /* CONFIG_NFSD_ACL */
Index: linux-2.6.11-rc2/fs/nfsd/nfs3xdr.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/nfs3xdr.c
+++ linux-2.6.11-rc2/fs/nfsd/nfs3xdr.c
@@ -21,6 +21,7 @@
 #include <linux/sunrpc/svc.h>
 #include <linux/nfsd/nfsd.h>
 #include <linux/nfsd/xdr3.h>
+#include <linux/nfsacl.h>
 
 #define NFSDDBG_FACILITY		NFSDDBG_XDR
 
@@ -583,6 +584,47 @@ nfs3svc_decode_commitargs(struct svc_rqs
 	return xdr_argsize_check(rqstp, p);
 }
 
+#ifdef CONFIG_NFSD_ACL
+int
+nfs3svc_decode_getaclargs(struct svc_rqst *rqstp, u32 *p,
+			  struct nfsd3_getaclargs *args)
+{
+	if (!(p = decode_fh(p, &args->fh)))
+		return 0;
+	args->mask = ntohl(*p); p++;
+
+	return xdr_argsize_check(rqstp, p);
+}
+#endif  /* CONFIG_NFSD_ACL */
+
+#ifdef CONFIG_NFSD_ACL
+int
+nfs3svc_decode_setaclargs(struct svc_rqst *rqstp, u32 *p,
+			  struct nfsd3_setaclargs *args)
+{
+	struct kvec *head = rqstp->rq_arg.head;
+	unsigned int base;
+	int n;
+
+	if (!(p = decode_fh(p, &args->fh)))
+		return 0;
+	args->mask = ntohl(*p++);
+	if (args->mask & ~(NFS3_ACL|NFS3_ACLCNT|NFS3_DFACL|NFS3_DFACLCNT) ||
+	    !xdr_argsize_check(rqstp, p))
+		return 0;
+
+	base = (char *)p - (char *)head->iov_base;
+	n = nfsacl_decode(&rqstp->rq_arg, base, NULL,
+			  (args->mask & NFS3_ACL) ?
+			  &args->acl_access : NULL);
+	if (n > 0)
+		n = nfsacl_decode(&rqstp->rq_arg, base + n, NULL,
+				  (args->mask & NFS3_DFACL) ?
+				  &args->acl_default : NULL);
+	return (n > 0);
+}
+#endif  /* CONFIG_NFSD_ACL */
+
 /*
  * XDR encode functions
  */
@@ -1066,6 +1108,66 @@ nfs3svc_encode_commitres(struct svc_rqst
 	return xdr_ressize_check(rqstp, p);
 }
 
+#ifdef CONFIG_NFSD_ACL
+/* GETACL */
+int
+nfs3svc_encode_getaclres(struct svc_rqst *rqstp, u32 *p,
+			 struct nfsd3_getaclres *resp)
+{
+	struct dentry *dentry = resp->fh.fh_dentry;
+
+	p = encode_post_op_attr(rqstp, p, &resp->fh);
+	if (resp->status == 0 && dentry && dentry->d_inode) {
+		struct inode *inode = dentry->d_inode;
+		int w = nfsacl_size(
+			(resp->mask & NFS3_ACL)   ? resp->acl_access  : NULL,
+			(resp->mask & NFS3_DFACL) ? resp->acl_default : NULL);
+		struct kvec *head = rqstp->rq_res.head;
+		unsigned int base;
+		int n;
+
+		*p++ = htonl(resp->mask);
+		if (!xdr_ressize_check(rqstp, p))
+			return 0;
+		base = (char *)p - (char *)head->iov_base;
+
+		rqstp->rq_res.page_len = w;
+		while (w > 0) {
+			if (!svc_take_res_page(rqstp))
+				return 0;
+			w -= PAGE_SIZE;
+		}
+
+		n = nfsacl_encode(&rqstp->rq_res, base, inode,
+				  resp->acl_access,
+				  resp->mask & NFS3_ACL, 0);
+		if (n > 0)
+			n = nfsacl_encode(&rqstp->rq_res, base + n, inode,
+					  resp->acl_default,
+					  resp->mask & NFS3_DFACL,
+					  NFS3_ACL_DEFAULT);
+		if (n <= 0)
+			return 0;
+	} else
+		if (!xdr_ressize_check(rqstp, p))
+			return 0;
+
+	return 1;
+}
+#endif  /* CONFIG_NFSD_ACL */
+
+#ifdef CONFIG_NFSD_ACL
+/* SETACL */
+int
+nfs3svc_encode_setaclres(struct svc_rqst *rqstp, u32 *p,
+			 struct nfsd3_attrstat *resp)
+{
+	p = encode_post_op_attr(rqstp, p, &resp->fh);
+
+	return xdr_ressize_check(rqstp, p);
+}
+#endif  /* CONFIG_NFSD_ACL */
+
 /*
  * XDR release functions
  */
@@ -1085,3 +1187,15 @@ nfs3svc_release_fhandle2(struct svc_rqst
 	fh_put(&resp->fh2);
 	return 1;
 }
+
+#ifdef CONFIG_NFSD_ACL
+int
+nfs3svc_release_getacl(struct svc_rqst *rqstp, u32 *p,
+		       struct nfsd3_getaclres *resp)
+{
+	fh_put(&resp->fh);
+	posix_acl_release(resp->acl_access);
+	posix_acl_release(resp->acl_default);
+	return 1;
+}
+#endif  /* CONFIG_NFSD_ACL */
Index: linux-2.6.11-rc2/fs/nfsd/nfssvc.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/nfssvc.c
+++ linux-2.6.11-rc2/fs/nfsd/nfssvc.c
@@ -49,6 +49,9 @@
 #define	SIG_NOCLEAN	SIGHUP
 
 extern struct svc_program	nfsd_program;
+#ifdef CONFIG_NFSD_ACL
+extern struct svc_program	nfsd_acl_program;
+#endif
 static void			nfsd(struct svc_rqst *rqstp);
 struct timeval			nfssvc_boot;
 static struct svc_serv 		*nfsd_serv;
@@ -370,8 +373,29 @@ static struct svc_version *	nfsd_version
 #endif
 };
 
+#ifdef CONFIG_NFSD_ACL
+extern struct svc_version nfsd_acl_version3;
+
+static struct svc_version *	nfsd_acl_version[] = {
+	[3] = &nfsd_acl_version3,
+};
+
+#define NFSD_ACL_NRVERS		(sizeof(nfsd_acl_version)/sizeof(nfsd_acl_version[0]))
+struct svc_program		nfsd_acl_program = {
+	.pg_prog		= NFS3_ACL_PROGRAM,
+	.pg_nvers		= NFSD_ACL_NRVERS,
+	.pg_vers		= nfsd_acl_version,
+	.pg_name		= "nfsd",
+	.pg_stats		= &nfsd_acl_svcstats,
+};
+# define nfsd_acl_program_p &nfsd_acl_program
+#else
+# define nfsd_acl_program_p NULL
+#endif
+
 #define NFSD_NRVERS		(sizeof(nfsd_version)/sizeof(nfsd_version[0]))
 struct svc_program		nfsd_program = {
+	.pg_next		= nfsd_acl_program_p,
 	.pg_prog		= NFS_PROGRAM,		/* program number */
 	.pg_nvers		= NFSD_NRVERS,		/* nr of entries in nfsd_version */
 	.pg_vers		= nfsd_version,		/* version table */
Index: linux-2.6.11-rc2/fs/nfsd/stats.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/stats.c
+++ linux-2.6.11-rc2/fs/nfsd/stats.c
@@ -40,6 +40,12 @@ struct svc_stat		nfsd_svcstats = {
 	.program	= &nfsd_program,
 };
 
+#ifdef CONFIG_NFSD_ACL
+struct svc_stat	nfsd_acl_svcstats = {
+	.program	= &nfsd_acl_program,
+};
+#endif
+
 static int nfsd_proc_show(struct seq_file *seq, void *v)
 {
 	int i;
Index: linux-2.6.11-rc2/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfsd/vfs.c
+++ linux-2.6.11-rc2/fs/nfsd/vfs.c
@@ -45,6 +45,7 @@
 #include <linux/nfsd/nfsfh.h>
 #include <linux/quotaops.h>
 #include <linux/dnotify.h>
+#include <linux/xattr_acl.h>
 #ifdef CONFIG_NFSD_V4
 #include <linux/posix_acl.h>
 #include <linux/posix_acl_xattr.h>
@@ -1814,3 +1815,109 @@ nfsd_racache_init(int cache_size)
 	nfsdstats.ra_size = cache_size;
 	return 0;
 }
+
+#ifdef CONFIG_NFSD_ACL
+struct posix_acl *
+nfsd_get_posix_acl(struct svc_fh *fhp, int type)
+{
+	struct inode *inode = fhp->fh_dentry->d_inode;
+	char *name;
+	void *value = NULL;
+	ssize_t size;
+	struct posix_acl *acl;
+
+	if (!IS_POSIXACL(inode) || !inode->i_op || !inode->i_op->getxattr)
+		return ERR_PTR(-EOPNOTSUPP);
+	switch(type) {
+		case ACL_TYPE_ACCESS:
+			name = XATTR_NAME_ACL_ACCESS;
+			break;
+		case ACL_TYPE_DEFAULT:
+			name = XATTR_NAME_ACL_DEFAULT;
+			break;
+		default:
+			return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	size = inode->i_op->getxattr(fhp->fh_dentry, name, NULL, 0);
+
+	if (size < 0) {
+		acl = ERR_PTR(size);
+		goto getout;
+	} else if (size > 0) {
+		value = kmalloc(size, GFP_KERNEL);
+		if (!value) {
+			acl = ERR_PTR(-ENOMEM);
+			goto getout;
+		}
+		size = inode->i_op->getxattr(fhp->fh_dentry, name, value, size);
+		if (size < 0) {
+			acl = ERR_PTR(size);
+			goto getout;
+		}
+	}
+	acl = posix_acl_from_xattr(value, size);
+
+getout:
+	kfree(value);
+	return acl;
+}
+#endif  /* CONFIG_NFSD_ACL */
+
+#ifdef CONFIG_NFSD_ACL
+int
+nfsd_set_posix_acl(struct svc_fh *fhp, int type, struct posix_acl *acl)
+{
+	struct inode *inode = fhp->fh_dentry->d_inode;
+	char *name;
+	void *value = NULL;
+	size_t size;
+	int error;
+
+	if (!IS_POSIXACL(inode) || !inode->i_op ||
+	    !inode->i_op->setxattr || !inode->i_op->removexattr)
+		return -EOPNOTSUPP;
+	switch(type) {
+		case ACL_TYPE_ACCESS:
+			name = XATTR_NAME_ACL_ACCESS;
+			break;
+		case ACL_TYPE_DEFAULT:
+			name = XATTR_NAME_ACL_DEFAULT;
+			break;
+		default:
+			return -EOPNOTSUPP;
+	}
+
+	if (acl && acl->a_count) {
+		size = xattr_acl_size(acl->a_count);
+		value = kmalloc(size, GFP_KERNEL);
+		if (!value)
+			return -ENOMEM;
+		size = posix_acl_to_xattr(acl, value, size);
+		if (size < 0) {
+			error = size;
+			goto getout;
+		}
+	} else
+		size = 0;
+
+	if (!fhp->fh_locked)
+		fh_lock(fhp);  /* unlocking is done automatically */
+	if (size)
+		error = inode->i_op->setxattr(fhp->fh_dentry, name,
+					      value, size, 0);
+	else {
+		if (!S_ISDIR(inode->i_mode) && type == ACL_TYPE_DEFAULT)
+			error = 0;
+		else {
+			error = inode->i_op->removexattr(fhp->fh_dentry, name);
+			if (error == -ENODATA)
+				error = 0;
+		}
+	}
+
+getout:
+	kfree(value);
+	return error;
+}
+#endif  /* CONFIG_NFSD_ACL */
Index: linux-2.6.11-rc2/include/linux/nfs3.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs3.h
+++ linux-2.6.11-rc2/include/linux/nfs3.h
@@ -37,6 +37,15 @@ enum nfs3_createmode {
 	NFS3_CREATE_EXCLUSIVE = 2
 };
 
+/* Flags for the getacl/setacl mode */
+#define NFS3_ACL		0x0001
+#define NFS3_ACLCNT		0x0002
+#define NFS3_DFACL		0x0004
+#define NFS3_DFACLCNT		0x0008
+
+/* Flag for Default ACL entries */
+#define NFS3_ACL_DEFAULT	0x1000
+
 /* NFSv3 file system properties */
 #define NFS3_FSF_LINK		0x0001
 #define NFS3_FSF_SYMLINK	0x0002
@@ -88,6 +97,10 @@ struct nfs3_fh {
 #define NFS3PROC_PATHCONF	20
 #define NFS3PROC_COMMIT		21
 
+#define NFS3_ACL_PROGRAM	100227
+#define NFS3PROC_GETACL		1
+#define NFS3PROC_SETACL		2
+
 #define NFS_MNT3_PROGRAM	100005
 #define NFS_MNT3_VERSION	3
 #define MOUNTPROC3_NULL		0
Index: linux-2.6.11-rc2/include/linux/nfsacl.h
===================================================================
--- /dev/null
+++ linux-2.6.11-rc2/include/linux/nfsacl.h
@@ -0,0 +1,37 @@
+/*
+ * File: linux/nfsacl.h
+ *
+ * (C) 2003 Andreas Gruenbacher <agruen@suse.de>
+ */
+
+
+#ifndef __LINUX_NFSACL_H
+#define __LINUX_NFSACL_H
+
+#include <linux/posix_acl.h>
+
+/* Maximum number of ACL entries over NFS */
+#define NFS3_ACL_MAX_ENTRIES	1024
+
+#define NFSACL_MAXWORDS		(2*(2+3*NFS3_ACL_MAX_ENTRIES))
+#define NFSACL_MAXPAGES		((2*(8+12*NFS3_ACL_MAX_ENTRIES) + PAGE_SIZE-1) \
+				 >> PAGE_SHIFT)
+
+static inline unsigned int
+nfsacl_size(struct posix_acl *acl_access, struct posix_acl *acl_default)
+{
+	unsigned int w = 16;
+	w += max(acl_access ? (int)acl_access->a_count : 3, 4) * 12;
+	if (acl_default)
+		w += max((int)acl_default->a_count, 4) * 12;
+	return w;
+}
+
+extern unsigned int
+nfsacl_encode(struct xdr_buf *buf, unsigned int base, struct inode *inode,
+	      struct posix_acl *acl, int encode_entries, int typeflag);
+extern unsigned int
+nfsacl_decode(struct xdr_buf *buf, unsigned int base, unsigned int *aclcnt,
+	      struct posix_acl **pacl);
+
+#endif  /* __LINUX_NFSACL_H */
Index: linux-2.6.11-rc2/include/linux/nfsd/nfsd.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfsd/nfsd.h
+++ linux-2.6.11-rc2/include/linux/nfsd/nfsd.h
@@ -15,6 +15,7 @@
 #include <linux/unistd.h>
 #include <linux/dirent.h>
 #include <linux/fs.h>
+#include <linux/posix_acl.h>
 #include <linux/mount.h>
 
 #include <linux/nfsd/debug.h>
@@ -60,6 +61,8 @@ extern struct svc_program	nfsd_program;
 extern struct svc_version	nfsd_version2, nfsd_version3,
 				nfsd_version4;
 
+extern struct svc_program	nfsd_acl_program;
+extern struct svc_version	nfsd_acl_version3;
 /*
  * Function prototypes.
  */
@@ -124,6 +127,22 @@ int		nfsd_statfs(struct svc_rqst *, stru
 int		nfsd_notify_change(struct inode *, struct iattr *);
 int		nfsd_permission(struct svc_export *, struct dentry *, int);
 
+#ifdef CONFIG_NFSD_ACL
+struct posix_acl *nfsd_get_posix_acl(struct svc_fh *, int);
+int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);
+#else
+static inline struct posix_acl *
+nfsd_get_posix_acl(struct svc_fh *fhp, int acl_type)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline int
+nfsd_set_posix_acl(struct svc_fh *fhp, int type, struct posix_acl *acl)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 
 /* 
  * NFSv4 State
Index: linux-2.6.11-rc2/include/linux/nfsd/stats.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfsd/stats.h
+++ linux-2.6.11-rc2/include/linux/nfsd/stats.h
@@ -36,6 +36,7 @@ struct nfsd_stats {
 
 extern struct nfsd_stats	nfsdstats;
 extern struct svc_stat		nfsd_svcstats;
+extern struct svc_stat		nfsd_acl_svcstats;
 
 void	nfsd_stat_init(void);
 void	nfsd_stat_shutdown(void);
Index: linux-2.6.11-rc2/include/linux/nfsd/xdr3.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfsd/xdr3.h
+++ linux-2.6.11-rc2/include/linux/nfsd/xdr3.h
@@ -10,6 +10,7 @@
 #define _LINUX_NFSD_XDR3_H
 
 #include <linux/nfsd/xdr.h>
+#include <linux/posix_acl.h>
 
 struct nfsd3_sattrargs {
 	struct svc_fh		fh;
@@ -110,6 +111,18 @@ struct nfsd3_commitargs {
 	__u32			count;
 };
 
+struct nfsd3_getaclargs {
+	struct svc_fh		fh;
+	int			mask;
+};
+
+struct nfsd3_setaclargs {
+	struct svc_fh		fh;
+	int			mask;
+	struct posix_acl	*acl_access;
+	struct posix_acl	*acl_default;
+};
+
 struct nfsd3_attrstat {
 	__u32			status;
 	struct svc_fh		fh;
@@ -209,6 +222,14 @@ struct nfsd3_commitres {
 	struct svc_fh		fh;
 };
 
+struct nfsd3_getaclres {
+	__u32			status;
+	struct svc_fh		fh;
+	int			mask;
+	struct posix_acl	*acl_access;
+	struct posix_acl	*acl_default;
+};
+
 /* dummy type for release */
 struct nfsd3_fhandle_pair {
 	__u32			dummy;
@@ -241,6 +262,7 @@ union nfsd3_xdrstore {
 	struct nfsd3_fsinfores		fsinfores;
 	struct nfsd3_pathconfres	pathconfres;
 	struct nfsd3_commitres		commitres;
+	struct nfsd3_getaclres		getaclres;
 };
 
 #define NFS3_SVC_XDRSIZE		sizeof(union nfsd3_xdrstore)
@@ -276,6 +298,10 @@ int nfs3svc_decode_readdirplusargs(struc
 				struct nfsd3_readdirargs *);
 int nfs3svc_decode_commitargs(struct svc_rqst *, u32 *,
 				struct nfsd3_commitargs *);
+int nfs3svc_decode_getaclargs(struct svc_rqst *, u32 *,
+			      struct nfsd3_getaclargs *);
+int nfs3svc_decode_setaclargs(struct svc_rqst *, u32 *,
+			      struct nfsd3_setaclargs *);
 int nfs3svc_encode_voidres(struct svc_rqst *, u32 *, void *);
 int nfs3svc_encode_attrstat(struct svc_rqst *, u32 *,
 				struct nfsd3_attrstat *);
@@ -305,11 +331,17 @@ int nfs3svc_encode_pathconfres(struct sv
 				struct nfsd3_pathconfres *);
 int nfs3svc_encode_commitres(struct svc_rqst *, u32 *,
 				struct nfsd3_commitres *);
+int nfs3svc_encode_getaclres(struct svc_rqst *, u32 *,
+			     struct nfsd3_getaclres *);
+int nfs3svc_encode_setaclres(struct svc_rqst *, u32 *,
+			     struct nfsd3_attrstat *);
 
 int nfs3svc_release_fhandle(struct svc_rqst *, u32 *,
 				struct nfsd3_attrstat *);
 int nfs3svc_release_fhandle2(struct svc_rqst *, u32 *,
 				struct nfsd3_fhandle_pair *);
+int nfs3svc_release_getacl(struct svc_rqst *rqstp, u32 *p,
+			   struct nfsd3_getaclres *resp);
 int nfs3svc_encode_entry(struct readdir_cd *, const char *name,
 				int namlen, loff_t offset, ino_t ino,
 				unsigned int);
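
As a reading aid for the nfsacl_size() helper added in linux/nfsacl.h
above, here is a minimal userspace sketch of the same arithmetic. The
name sketch_nfsacl_size() and the convention of passing -1 for "no ACL"
are assumptions made for illustration; they are not part of the patch.
The 16 bytes of fixed overhead appear to correspond to the two 8-byte
headers that nfsacl_encode() accounts for, one per ACL.

```c
/*
 * Hypothetical stand-in for nfsacl_size(); takes entry counts instead of
 * struct posix_acl pointers. A count of -1 means "no ACL present".
 */
static unsigned int
sketch_nfsacl_size(int access_count, int default_count)
{
	unsigned int w = 16;	/* fixed overhead, as in nfsacl_size() */
	int n = access_count >= 0 ? access_count : 3;

	/* Access ACL: at least four 12-byte entries (room for a mask). */
	w += (n > 4 ? n : 4) * 12;
	/* Default ACL: only counted when one is present. */
	if (default_count >= 0)
		w += (default_count > 4 ? default_count : 4) * 12;
	return w;
}
```

With no default ACL, a missing or three-entry access ACL gives 64
bytes, since the access ACL is always sized for at least four entries.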

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 10/13] Solaris nfsacl workaround
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (8 preceding siblings ...)
  2005-01-22 20:34 ` [patch 9/13] Infrastructure and server side of nfsacl Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-02-15 17:29   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 11/13] Client side of nfsacl Andreas Gruenbacher
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfsd-acl-v2-solaris --]
[-- Type: text/plain, Size: 1016 bytes --]

If the nfs_acl program is available, Solaris clients expect both
version 2 and version 3 to be available; RPC_PROG_MISMATCH leads to a
mount failure. Fake RPC_PROG_UNAVAIL when asked for nfs_acl version 2.
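
The decision described above can be sketched in plain C as follows.
This is only an illustration, assuming the server registers nothing
but version 3 of the nfs_acl program; the function name and the SK_*
constants are hypothetical, and the numeric values follow the
accept_stat definitions of the ONC RPC specification.

```c
#include <stdint.h>

/* accept_stat values from the ONC RPC specification (subset). */
enum sk_accept_stat {
	SK_SUCCESS       = 0,
	SK_PROG_UNAVAIL  = 1,
	SK_PROG_MISMATCH = 2,
};

/* Program number taken from NFS3_ACL_PROGRAM in this patchset. */
#define SK_NFSACL_PROGRAM 100227u

/* Hypothetical helper: which status would a request for (prog, vers) get? */
static enum sk_accept_stat
sketch_nfsacl_version_status(uint32_t prog, uint32_t vers)
{
	if (prog != SK_NFSACL_PROGRAM)
		return SK_PROG_UNAVAIL;
	if (vers == 3)
		return SK_SUCCESS;
	/* Workaround: for version 2, pretend the program does not exist. */
	if (vers == 2)
		return SK_PROG_UNAVAIL;
	return SK_PROG_MISMATCH;
}
```

Only version 2 is special-cased; any other unknown version still gets
the regular mismatch reply.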

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/net/sunrpc/svc.c
===================================================================
--- linux-2.6.11-rc2.orig/net/sunrpc/svc.c
+++ linux-2.6.11-rc2/net/sunrpc/svc.c
@@ -458,6 +458,13 @@ err_bad_prog:
 	goto sendit;
 
 err_bad_vers:
+	if (prog == NFSACL_PROGRAM && vers == 2) {
+		/* If the nfs_acl program is available, Solaris clients expect
+		   both version 2 and version 3 to be available;
+		   RPC_PROG_MISMATCH leads to a mount failure. Fake
+		   RPC_PROG_UNAVAIL when asked for nfs_acl version 2. */
+		goto err_bad_prog;
+	}
 #ifdef RPC_PARANOIA
 	printk("svc: unknown version (%d)\n", vers);
 #endif

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 11/13] Client side of nfsacl
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (9 preceding siblings ...)
  2005-01-22 20:34 ` [patch 10/13] Solaris nfsacl workaround Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-02-15 17:49   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 12/13] ACL umask handling workaround in nfs client Andreas Gruenbacher
  2005-01-22 20:34 ` [patch 13/13] Cache acls on the nfs client side Andreas Gruenbacher
  12 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfs-acl --]
[-- Type: text/plain, Size: 24552 bytes --]

This adds acl support for nfs clients via the NFSACL protocol extension,
by implementing the getxattr, listxattr, setxattr, and removexattr iops
for the system.posix_acl_access and system.posix_acl_default attributes.
This patch implements a dumb version that uses no caching (and thus adds
some overhead). (Another patch in this patchset adds caching as well.)
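
To illustrate the mapping the xattr iops have to perform, here is a
minimal sketch that turns one of the two attribute names into an ACL
type. The helper name is hypothetical, and the constants are assumed
to mirror ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT from linux/posix_acl.h.
How the real xattr.c handles other attribute names is not shown in
this message, so the -1 return is only a placeholder.

```c
#include <string.h>

/* Assumed to mirror ACL_TYPE_ACCESS / ACL_TYPE_DEFAULT in the kernel. */
#define SKETCH_ACL_TYPE_ACCESS  0x8000
#define SKETCH_ACL_TYPE_DEFAULT 0x4000

/* Hypothetical helper: ACL type for an xattr name, or -1 if unrelated. */
static int
sketch_acl_type_from_xattr(const char *name)
{
	if (strcmp(name, "system.posix_acl_access") == 0)
		return SKETCH_ACL_TYPE_ACCESS;
	if (strcmp(name, "system.posix_acl_default") == 0)
		return SKETCH_ACL_TYPE_DEFAULT;
	return -1;	/* placeholder: not an ACL attribute */
}
```

From user space, the same names are what getfacl and setfacl pass to
getxattr(2) and setxattr(2).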

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/fs/Kconfig
===================================================================
--- linux-2.6.11-rc2.orig/fs/Kconfig
+++ linux-2.6.11-rc2/fs/Kconfig
@@ -1393,6 +1393,7 @@ config NFS_FS
 	depends on INET
 	select LOCKD
 	select SUNRPC
+	select NFS_ACL_SUPPORT if NFS_ACL
 	help
 	  If you are connected to some other (usually local) Unix computer
 	  (using SLIP, PLIP, PPP or Ethernet) and want to mount files residing
@@ -1435,6 +1436,17 @@ config NFS_V3
 
 	  If unsure, say Y.
 
+config NFS_ACL
+	bool "NFS_ACL protocol extension"
+	depends on NFS_V3
+	select QSORT
+	help
+	  Implement the NFS_ACL protocol extension for manipulating POSIX
+	  Access Control Lists.  The server must also implement the NFS_ACL
+	  protocol extension; see the CONFIG_NFSD_ACL option.
+
+	  If unsure, say N.
+
 config NFS_V4
 	bool "Provide NFSv4 client support (EXPERIMENTAL)"
 	depends on NFS_FS && EXPERIMENTAL
Index: linux-2.6.11-rc2/fs/nfs/Makefile
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/Makefile
+++ linux-2.6.11-rc2/fs/nfs/Makefile
@@ -8,6 +8,7 @@ nfs-y 			:= dir.o file.o inode.o nfs2xdr
 			   proc.o read.o symlink.o unlink.o write.o
 nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o mount_clnt.o      
 nfs-$(CONFIG_NFS_V3)	+= nfs3proc.o nfs3xdr.o
+nfs-$(CONFIG_NFS_ACL)	+= xattr.o
 nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 			   delegation.o idmap.o \
 			   callback.o callback_xdr.o callback_proc.o
Index: linux-2.6.11-rc2/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc2/fs/nfs/dir.c
@@ -72,6 +72,10 @@ struct inode_operations nfs_dir_inode_op
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
+	.listxattr	= nfs_listxattr,
+	.getxattr	= nfs_getxattr,
+	.setxattr	= nfs_setxattr,
+	.removexattr	= nfs_removexattr,
 };
 
 #ifdef CONFIG_NFS_V4
Index: linux-2.6.11-rc2/fs/nfs/file.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/file.c
+++ linux-2.6.11-rc2/fs/nfs/file.c
@@ -65,6 +65,10 @@ struct inode_operations nfs_file_inode_o
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
+	.listxattr	= nfs_listxattr,
+	.getxattr	= nfs_getxattr,
+	.setxattr	= nfs_setxattr,
+	.removexattr	= nfs_removexattr,
 };
 
 /* Hack for future NFS swap support */
Index: linux-2.6.11-rc2/fs/nfs/inode.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/inode.c
+++ linux-2.6.11-rc2/fs/nfs/inode.c
@@ -104,6 +104,21 @@ struct rpc_program		nfs_program = {
 	.pipe_dir_name		= "/nfs",
 };
 
+#ifdef CONFIG_NFS_ACL
+static struct rpc_stat		nfsacl_rpcstat = { &nfsacl_program };
+static struct rpc_version *	nfsacl_version[] = {
+	[3]			= &nfsacl_version3,
+};
+
+struct rpc_program		nfsacl_program = {
+	.name =			"nfsacl",
+	.number =		NFS3_ACL_PROGRAM,
+	.nrvers =		sizeof(nfsacl_version) / sizeof(nfsacl_version[0]),
+	.version =		nfsacl_version,
+	.stats =		&nfsacl_rpcstat,
+};
+#endif  /* CONFIG_NFS_ACL */
+
 static inline unsigned long
 nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
 {
@@ -165,6 +180,10 @@ nfs_umount_begin(struct super_block *sb)
 	/* -EIO all pending I/O */
 	if ((rpc = server->client) != NULL)
 		rpc_killall_tasks(rpc);
+#ifdef CONFIG_NFS_ACL
+	if ((rpc = server->client_acl) != NULL)
+		rpc_killall_tasks(rpc);
+#endif  /* CONFIG_NFS_ACL */
 }
 
 
@@ -453,7 +472,21 @@ nfs_fill_super(struct super_block *sb, s
 		atomic_inc(&server->client->cl_count);
 		server->client_sys = server->client;
 	}
+#ifdef CONFIG_NFS_ACL
+	if (server->flags & NFS_MOUNT_VER3) {
+		struct rpc_clnt *clnt = rpc_clone_client(server->client);
 
+		if (IS_ERR(clnt)) {
+			rpc_release_client(server->client_sys);
+			server->client_sys = NULL;
+			return PTR_ERR(clnt);
+		}
+		rpc_change_program(clnt, &nfsacl_program, 3);
+		server->client_acl = clnt;
+		/* Initially assume the nfsacl program is supported */
+		server->flags |= NFSACL;
+	}
+#endif
 	if (server->flags & NFS_MOUNT_VER3) {
 		if (server->namelen == 0 || server->namelen > NFS3_MAXNAMLEN)
 			server->namelen = NFS3_MAXNAMLEN;
@@ -640,6 +673,20 @@ nfs_init_locked(struct inode *inode, voi
 /* Don't use READDIRPLUS on directories that we believe are too large */
 #define NFS_LIMIT_READDIRPLUS (8*PAGE_SIZE)
 
+#ifdef CONFIG_NFS_ACL
+static struct inode_operations nfs_special_inode_operations[] = {{
+	.permission =	nfs_permission,
+	.getattr =	nfs_getattr,
+	.setattr =	nfs_setattr,
+	.listxattr =	nfs_listxattr,
+	.getxattr =	nfs_getxattr,
+	.setxattr =	nfs_setxattr,
+	.removexattr =	nfs_removexattr,
+}};
+#else
+#define nfs_special_inode_operations NULL
+#endif  /* CONFIG_NFS_ACL */
+
 /*
  * This is our front-end to iget that looks up inodes by file handle
  * instead of inode number.
@@ -693,8 +740,10 @@ nfs_fhget(struct super_block *sb, struct
 				NFS_FLAGS(inode) |= NFS_INO_ADVISE_RDPLUS;
 		} else if (S_ISLNK(inode->i_mode))
 			inode->i_op = &nfs_symlink_inode_operations;
-		else
+		else {
+			inode->i_op = nfs_special_inode_operations;
 			init_special_inode(inode, inode->i_mode, fattr->rdev);
+		}
 
 		nfsi->read_cache_jiffies = fattr->timestamp;
 		inode->i_atime = fattr->atime;
@@ -1458,6 +1507,10 @@ static void nfs_kill_super(struct super_
 		rpc_shutdown_client(server->client);
 	if (server->client_sys != NULL && !IS_ERR(server->client_sys))
 		rpc_shutdown_client(server->client_sys);
+#ifdef CONFIG_NFS_ACL
+	if (server->client_acl != NULL && !IS_ERR(server->client_acl))
+		rpc_shutdown_client(server->client_acl);
+#endif
 
 	if (!(server->flags & NFS_MOUNT_NONLM))
 		lockd_down();	/* release rpc.lockd */
Index: linux-2.6.11-rc2/fs/nfs/nfs3proc.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/nfs3proc.c
+++ linux-2.6.11-rc2/fs/nfs/nfs3proc.c
@@ -17,6 +17,7 @@
 #include <linux/nfs_page.h>
 #include <linux/lockd/bind.h>
 #include <linux/smp_lock.h>
+#include <linux/nfs_mount.h>
 
 #define NFSDBG_FACILITY		NFSDBG_PROC
 
@@ -45,7 +46,7 @@ static inline int
 nfs3_rpc_call_wrapper(struct rpc_clnt *clnt, u32 proc, void *argp, void *resp, int flags)
 {
 	struct rpc_message msg = {
-		.rpc_proc	= &nfs3_procedures[proc],
+		.rpc_proc	= &clnt->cl_procinfo[proc],
 		.rpc_argp	= argp,
 		.rpc_resp	= resp,
 	};
@@ -714,6 +715,213 @@ nfs3_proc_pathconf(struct nfs_server *se
 	return status;
 }
 
+#ifdef CONFIG_NFS_ACL
+static struct posix_acl *
+nfs3_proc_getacl(struct inode *inode, int type)
+{
+	struct nfs_server *server = NFS_SERVER(inode);
+	struct nfs_fattr fattr;
+	struct page *pages[NFSACL_MAXPAGES] = { };
+	struct nfs3_getaclargs args = {
+		/* The xdr layer may allocate pages here. */
+		.pages =	pages,
+	};
+	struct nfs3_getaclres res = {
+		.fattr =	&fattr,
+	};
+	struct posix_acl *acl = NULL;
+	int status, count;
+
+	if (!(server->flags & NFSACL) || (server->flags & NFS_MOUNT_NOACL))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	switch (type) {
+		case ACL_TYPE_ACCESS:
+			args.mask = NFS3_ACLCNT|NFS3_ACL;
+			break;
+
+		case ACL_TYPE_DEFAULT:
+			if (!S_ISDIR(inode->i_mode))
+				return NULL;
+			args.mask = NFS3_DFACLCNT|NFS3_DFACL;
+			break;
+
+		default:
+			return ERR_PTR(-EINVAL);
+	}
+	args.fh = NFS_FH(inode);
+
+	dprintk("NFS call getacl\n");
+	status = rpc_call(server->client_acl, NFS3PROC_GETACL,
+			  &args, &res, 0);
+	dprintk("NFS reply getacl: %d\n", status);
+
+	/* pages may have been allocated at the xdr layer. */
+	for (count = 0; count < NFSACL_MAXPAGES && args.pages[count]; count++)
+		__free_page(args.pages[count]);
+
+	if (status) {
+		if (status == -ENOSYS) {
+			dprintk("NFS_ACL extension not supported; disabling\n");
+			server->flags &= ~NFSACL;
+			status = -EOPNOTSUPP;
+		} else if (status == -ENOTSUPP)
+			status = -EOPNOTSUPP;
+		goto getout;
+	}
+	if ((args.mask & res.mask) != args.mask) {
+		status = -EIO;
+		goto getout;
+	}
+
+	status = nfs_refresh_inode(inode, &fattr);
+	if (res.acl_access) {
+		if (posix_acl_equiv_mode(res.acl_access, NULL) == 0) {
+			posix_acl_release(res.acl_access);
+			res.acl_access = NULL;
+		}
+	}
+
+	switch(type) {
+		case ACL_TYPE_ACCESS:
+			acl = res.acl_access;
+			res.acl_access = NULL;
+			break;
+
+		case ACL_TYPE_DEFAULT:
+			acl = res.acl_default;
+			res.acl_default = NULL;
+			break;
+	}
+
+getout:
+	posix_acl_release(res.acl_access);
+	posix_acl_release(res.acl_default);
+
+	if (status) {
+		posix_acl_release(acl);
+		acl = ERR_PTR(status);
+	}
+	return acl;
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+static int
+nfs3_proc_setacls(struct inode *inode, struct posix_acl *acl,
+		  struct posix_acl *dfacl)
+{
+	struct nfs_server *server = NFS_SERVER(inode);
+	struct nfs_fattr fattr;
+	struct page *pages[NFSACL_MAXPAGES] = { };
+	struct nfs3_setaclargs args = {
+		.pages = pages,
+	};
+	int status, count;
+
+	if (!(server->flags & NFSACL) || (server->flags & NFS_MOUNT_NOACL))
+		return -EOPNOTSUPP;
+
+	/* We are doing this here, because XDR marshalling can only
+	   return -ENOMEM. */
+	if (acl && acl->a_count > NFS3_ACL_MAX_ENTRIES)
+		return -ENOSPC;
+	if (dfacl && dfacl->a_count > NFS3_ACL_MAX_ENTRIES)
+		return -ENOSPC;
+	args.inode = inode;
+	args.mask = NFS3_ACL;
+	args.acl_access = acl;
+	if (S_ISDIR(inode->i_mode)) {
+		args.mask |= NFS3_DFACL;
+		args.acl_default = dfacl;
+	}
+
+	dprintk("NFS call setacl\n");
+	status = rpc_call(server->client_acl, NFS3PROC_SETACL,
+			  &args, &fattr, 0);
+	dprintk("NFS reply setacl: %d\n", status);
+
+	/* pages may have been allocated at the xdr layer. */
+	for (count = 0; count < NFSACL_MAXPAGES && args.pages[count]; count++)
+		__free_page(args.pages[count]);
+
+	if (status) {
+		if (status == -ENOSYS) {
+			dprintk("NFS_ACL SETACL RPC not supported "
+				"(will not retry)\n");
+			server->flags &= ~NFSACL;
+			status = -EOPNOTSUPP;
+		} else if (status == -ENOTSUPP)
+			status = -EOPNOTSUPP;
+	} else {
+		NFS_FLAGS(inode) |= NFS_INO_INVALID_ACCESS;
+		if (acl) {
+			/*
+			 * Updating the access acl modifies the file
+			 * mode permission bits, so update the icache.
+			 */
+			mode_t mode = inode->i_mode;
+			int error = posix_acl_equiv_mode(acl, &mode);
+			if (error >= 0)
+				inode->i_mode = mode;
+			if (error == 0) {
+				/*
+				 * The acl is equivalent to the file mode
+				 * permission bits. No need to cache it.
+				 */
+				acl = NULL;
+			}
+		}
+		status = nfs_refresh_inode(inode, &fattr);
+	}
+
+	return status;
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+static int
+nfs3_proc_setacl(struct inode *inode, int type, struct posix_acl *acl)
+{
+	struct posix_acl *alloc = NULL, *dfacl = NULL;
+	int status;
+
+	if (S_ISDIR(inode->i_mode)) {
+		switch(type) {
+			case ACL_TYPE_ACCESS:
+				alloc = dfacl = NFS_PROTO(inode)->
+					getacl(inode, ACL_TYPE_DEFAULT);
+				if (IS_ERR(alloc))
+					goto fail;
+				break;
+
+			case ACL_TYPE_DEFAULT:
+				dfacl = acl;
+				alloc = acl = NFS_PROTO(inode)->
+					getacl(inode, ACL_TYPE_ACCESS);
+				if (IS_ERR(alloc))
+					goto fail;
+				break;
+
+			default:
+				return -EINVAL;
+		}
+	} else if (type != ACL_TYPE_ACCESS)
+		return -EINVAL;
+
+	if (acl == NULL) {
+		alloc = acl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
+		if (IS_ERR(alloc))
+			goto fail;
+	}
+	status = nfs3_proc_setacls(inode, acl, dfacl);
+	posix_acl_release(alloc);
+	return status;
+
+fail:
+	return PTR_ERR(alloc);
+}
+#endif  /* CONFIG_NFS_ACL */
+
 extern u32 *nfs3_decode_dirent(u32 *, struct nfs_entry *, int);
 
 static void
@@ -868,4 +1076,9 @@ struct nfs_rpc_ops	nfs_v3_clientops = {
 	.file_open	= nfs_open,
 	.file_release	= nfs_release,
 	.lock		= nfs3_proc_lock,
+#ifdef CONFIG_NFS_ACL
+	.getacl		= nfs3_proc_getacl,
+	.setacl		= nfs3_proc_setacl,
+	.setacls	= nfs3_proc_setacls,
+#endif  /* CONFIG_NFS_ACL */
 };
Index: linux-2.6.11-rc2/fs/nfs/nfs3xdr.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/nfs3xdr.c
+++ linux-2.6.11-rc2/fs/nfs/nfs3xdr.c
@@ -21,6 +21,7 @@
 #include <linux/nfs.h>
 #include <linux/nfs3.h>
 #include <linux/nfs_fs.h>
+#include <linux/nfsacl.h>
 
 #define NFSDBG_FACILITY		NFSDBG_XDR
 
@@ -62,6 +63,8 @@ extern int			nfs_stat_to_errno(int);
 #define NFS3_linkargs_sz		(NFS3_fh_sz+NFS3_diropargs_sz)
 #define NFS3_readdirargs_sz	(NFS3_fh_sz+2)
 #define NFS3_commitargs_sz	(NFS3_fh_sz+3)
+#define NFS3_getaclargs_sz	(NFS3_fh_sz+1)
+#define NFS3_setaclargs_sz	(NFS3_fh_sz+1+2*(2+5*3))
 
 #define NFS3_attrstat_sz	(1+NFS3_fattr_sz)
 #define NFS3_wccstat_sz		(1+NFS3_wcc_data_sz)
@@ -78,6 +81,8 @@ extern int			nfs_stat_to_errno(int);
 #define NFS3_fsinfores_sz	(1+NFS3_post_op_attr_sz+12)
 #define NFS3_pathconfres_sz	(1+NFS3_post_op_attr_sz+6)
 #define NFS3_commitres_sz	(1+NFS3_wcc_data_sz+2)
+#define NFS3_getaclres_sz	(1+NFS3_post_op_attr_sz+1+2*(2+5*3))
+#define NFS3_setaclres_sz	(1+NFS3_post_op_attr_sz)
 
 /*
  * Map file type to S_IFMT bits
@@ -627,6 +632,76 @@ nfs3_xdr_commitargs(struct rpc_rqst *req
 	return 0;
 }
 
+#ifdef CONFIG_NFS_ACL
+/*
+ * Encode GETACL arguments
+ */
+static int
+nfs3_xdr_getaclargs(struct rpc_rqst *req, u32 *p,
+		    struct nfs3_getaclargs *args)
+{
+	struct rpc_auth *auth = req->rq_task->tk_auth;
+	unsigned int replen;
+
+	p = xdr_encode_fhandle(p, args->fh);
+	*p++ = htonl(args->mask);
+	req->rq_slen = xdr_adjust_iovec(req->rq_svec, p);
+
+	if (args->mask & (NFS3_ACL | NFS3_DFACL)) {
+		/* Inline the page array */
+		replen = (RPC_REPHDRSIZE + auth->au_rslack +
+			  NFS3_getaclres_sz) << 2;
+		xdr_inline_pages(&req->rq_rcv_buf, replen, args->pages, 0,
+				 NFSACL_MAXPAGES << PAGE_SHIFT);
+	}
+	return 0;
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+/*
+ * Encode SETACL arguments
+ */
+static int
+nfs3_xdr_setaclargs(struct rpc_rqst *req, u32 *p,
+                   struct nfs3_setaclargs *args)
+{
+	struct xdr_buf *buf = &req->rq_snd_buf;
+	unsigned int base, len_in_head, len = nfsacl_size(
+		(args->mask & NFS3_ACL)   ? args->acl_access  : NULL,
+		(args->mask & NFS3_DFACL) ? args->acl_default : NULL);
+	int count, err;
+
+	p = xdr_encode_fhandle(p, NFS_FH(args->inode));
+	*p++ = htonl(args->mask);
+	base = (char *)p - (char *)buf->head->iov_base;
+	/* put as much of the acls into head as possible. */
+	len_in_head = min_t(unsigned int, buf->head->iov_len - base, len);
+	len -= len_in_head;
+	req->rq_slen = xdr_adjust_iovec(req->rq_svec, p + len_in_head);
+
+	for (count = 0; (count << PAGE_SHIFT) < len; count++) {
+		args->pages[count] = alloc_page(GFP_KERNEL);
+		if (!args->pages[count]) {
+			while (count)
+				__free_page(args->pages[--count]);
+			return -ENOMEM;
+		}
+	}
+	xdr_encode_pages(buf, args->pages, 0, len);
+
+	err = nfsacl_encode(buf, base, args->inode,
+			    (args->mask & NFS3_ACL) ?
+			    args->acl_access : NULL, 1, 0);
+	if (err > 0)
+		err = nfsacl_encode(buf, base + err, args->inode,
+				    (args->mask & NFS3_DFACL) ?
+				    args->acl_default : NULL, 1,
+				    NFS3_ACL_DEFAULT);
+	return (err > 0) ? 0 : err;
+}
+#endif  /* CONFIG_NFS_ACL */
+
 /*
  * NFS XDR decode functions
  */
@@ -978,6 +1053,56 @@ nfs3_xdr_commitres(struct rpc_rqst *req,
 	return 0;
 }
 
+#ifdef CONFIG_NFS_ACL
+/*
+ * Decode GETACL reply
+ */
+static int
+nfs3_xdr_getaclres(struct rpc_rqst *req, u32 *p,
+		   struct nfs3_getaclres *res)
+{
+	struct xdr_buf *buf = &req->rq_rcv_buf;
+	int status = ntohl(*p++);
+	struct posix_acl **acl;
+	unsigned int *aclcnt;
+	int err, base;
+	
+	if (status != 0)
+		return -nfs_stat_to_errno(status);
+	p = xdr_decode_post_op_attr(p, res->fattr);
+	res->mask = ntohl(*p++);
+	if (res->mask & ~(NFS3_ACL|NFS3_ACLCNT|NFS3_DFACL|NFS3_DFACLCNT))
+		return -EINVAL;
+	base = (char *)p - (char *)req->rq_rcv_buf.head->iov_base;
+	
+	acl = (res->mask & NFS3_ACL) ? &res->acl_access : NULL;
+	aclcnt = (res->mask & NFS3_ACLCNT) ? &res->acl_access_count : NULL;
+	err = nfsacl_decode(buf, base, aclcnt, acl);
+
+	acl = (res->mask & NFS3_DFACL) ? &res->acl_default : NULL;
+	aclcnt = (res->mask & NFS3_DFACLCNT) ? &res->acl_default_count : NULL;
+	if (err > 0)
+		err = nfsacl_decode(buf, base + err, aclcnt, acl);
+	return (err > 0) ? 0 : err;
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+/*
+ * Decode setacl reply.
+ */
+static int
+nfs3_xdr_setaclres(struct rpc_rqst *req, u32 *p, struct nfs_fattr *fattr)
+{
+	int status = ntohl(*p++);
+
+	if (status)
+		return -nfs_stat_to_errno(status);
+	xdr_decode_post_op_attr(p, fattr);
+	return 0;
+}
+#endif  /* CONFIG_NFS_ACL */
+
 #ifndef MAX
 # define MAX(a, b)	(((a) > (b))? (a) : (b))
 #endif
@@ -1021,3 +1146,16 @@ struct rpc_version		nfs_version3 = {
 	.procs			= nfs3_procedures
 };
 
+#ifdef CONFIG_NFS_ACL
+static struct rpc_procinfo	nfs3_acl_procedures[] = {
+  PROC(GETACL,		getaclargs,	getaclres, 1),
+  PROC(SETACL,		setaclargs,	setaclres, 0),
+};
+
+struct rpc_version		nfsacl_version3 = {
+	.number			= 3,
+	.nrprocs		= sizeof(nfs3_acl_procedures)/
+				  sizeof(nfs3_acl_procedures[0]),
+	.procs			= nfs3_acl_procedures,
+};
+#endif  /* CONFIG_NFS_ACL */
Index: linux-2.6.11-rc2/fs/nfs/xattr.c
===================================================================
--- /dev/null
+++ linux-2.6.11-rc2/fs/nfs/xattr.c
@@ -0,0 +1,125 @@
+#include <linux/fs.h>
+#include <linux/nfs.h>
+#include <linux/nfs3.h>
+#include <linux/nfs_fs.h>
+#include <linux/xattr_acl.h>
+
+ssize_t
+nfs_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	struct inode *inode = dentry->d_inode;
+	struct posix_acl *acl;
+	int pos=0, len=0;
+
+	if (NFS_PROTO(inode)->version != 3 || !NFS_PROTO(inode)->getacl)
+		return -EOPNOTSUPP;
+
+#	define output(s) do {						\
+			if (pos + sizeof(s) <= size) {			\
+				memcpy(buffer + pos, s, sizeof(s));	\
+				pos += sizeof(s);			\
+			}						\
+			len += sizeof(s);				\
+		} while(0)
+
+	acl = NFS_PROTO(inode)->getacl(inode, ACL_TYPE_ACCESS);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	if (acl) {
+		output("system.posix_acl_access");
+		posix_acl_release(acl);
+	}
+
+	if (S_ISDIR(inode->i_mode)) {
+		acl = NFS_PROTO(inode)->getacl(inode, ACL_TYPE_DEFAULT);
+		if (IS_ERR(acl))
+			return PTR_ERR(acl);
+		if (acl) {
+			output("system.posix_acl_default");
+			posix_acl_release(acl);
+		}
+	}
+
+#	undef output
+
+	if (!buffer || len <= size)
+		return len;
+	return -ERANGE;
+}
+
+ssize_t
+nfs_getxattr(struct dentry *dentry, const char *name, void *buffer, size_t size)
+{
+	struct inode *inode = dentry->d_inode;
+	struct posix_acl *acl;
+	int type, error = 0;
+
+	if (strcmp(name, XATTR_NAME_ACL_ACCESS) == 0)
+		type = ACL_TYPE_ACCESS;
+	else if (strcmp(name, XATTR_NAME_ACL_DEFAULT) == 0)
+		type = ACL_TYPE_DEFAULT;
+	else
+		return -EOPNOTSUPP;
+
+	acl = ERR_PTR(-EOPNOTSUPP);
+	if (NFS_PROTO(inode)->version == 3 && NFS_PROTO(inode)->getacl)
+		acl = NFS_PROTO(inode)->getacl(inode, type);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	else if (acl) {
+		if (type == ACL_TYPE_ACCESS && acl->a_count == 0)
+			error = -ENODATA;
+		else
+			error = posix_acl_to_xattr(acl, buffer, size);
+		posix_acl_release(acl);
+	} else
+		error = -ENODATA;
+
+	return error;
+}
+
+int
+nfs_setxattr(struct dentry *dentry, const char *name,
+	     const void *value, size_t size, int flags)
+{
+	struct inode *inode = dentry->d_inode;
+	struct posix_acl *acl;
+	int type, error;
+
+	if (strcmp(name, XATTR_NAME_ACL_ACCESS) == 0)
+		type = ACL_TYPE_ACCESS;
+	else if (strcmp(name, XATTR_NAME_ACL_DEFAULT) == 0)
+		type = ACL_TYPE_DEFAULT;
+	else
+		return -EOPNOTSUPP;
+	if (NFS_PROTO(inode)->version != 3 || !NFS_PROTO(inode)->setacl)
+		return -EOPNOTSUPP;
+
+	acl = posix_acl_from_xattr(value, size);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	error = NFS_PROTO(inode)->setacl(inode, type, acl);
+	posix_acl_release(acl);
+
+	return error;
+}
+
+int
+nfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct inode *inode = dentry->d_inode;
+	int error, type;
+
+	if (strcmp(name, XATTR_NAME_ACL_ACCESS) == 0)
+		type = ACL_TYPE_ACCESS;
+	else if (strcmp(name, XATTR_NAME_ACL_DEFAULT) == 0)
+		type = ACL_TYPE_DEFAULT;
+	else
+		return -EOPNOTSUPP;
+
+	error = -EOPNOTSUPP;
+	if (NFS_PROTO(inode)->version == 3 && NFS_PROTO(inode)->setacl)
+		error = NFS_PROTO(inode)->setacl(inode, type, NULL);
+
+	return error;
+}
Index: linux-2.6.11-rc2/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_fs.h
+++ linux-2.6.11-rc2/include/linux/nfs_fs.h
@@ -329,6 +329,22 @@ static inline struct rpc_cred *nfs_file_
 }
 
 /*
+ * linux/fs/nfs/xattr.c
+ */
+#ifdef CONFIG_NFS_ACL
+extern ssize_t nfs_listxattr(struct dentry *, char *, size_t);
+extern ssize_t nfs_getxattr(struct dentry *, const char *, void *, size_t);
+extern int nfs_setxattr(struct dentry *, const char *,
+			const void *, size_t, int);
+extern int nfs_removexattr (struct dentry *, const char *name);
+#else
+# define nfs_listxattr NULL
+# define nfs_getxattr NULL
+# define nfs_setxattr NULL
+# define nfs_removexattr NULL
+#endif
+
+/*
  * linux/fs/nfs/direct.c
  */
 extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t,
Index: linux-2.6.11-rc2/include/linux/nfs_fs_sb.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_fs_sb.h
+++ linux-2.6.11-rc2/include/linux/nfs_fs_sb.h
@@ -10,6 +10,9 @@
 struct nfs_server {
 	struct rpc_clnt *	client;		/* RPC client handle */
 	struct rpc_clnt *	client_sys;	/* 2nd handle for FSINFO */
+#ifdef CONFIG_NFS_ACL
+	struct rpc_clnt *	client_acl;	/* ACL RPC client handle */
+#endif  /* CONFIG_NFS_ACL */
 	struct nfs_rpc_ops *	rpc_ops;	/* NFS protocol vector */
 	struct backing_dev_info	backing_dev_info;
 	int			flags;		/* various flags */
Index: linux-2.6.11-rc2/include/linux/nfs_mount.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_mount.h
+++ linux-2.6.11-rc2/include/linux/nfs_mount.h
@@ -63,4 +63,7 @@ struct nfs_mount_data {
 #define NFS_MOUNT_SECFLAVOUR	0x2000	/* 5 */
 #define NFS_MOUNT_FLAGMASK	0xFFFF
 
+/* Feature flag for the NFS_ACL protocol extension */
+#define NFSACL			0x10000
+
 #endif
Index: linux-2.6.11-rc2/include/linux/nfs_xdr.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_xdr.h
+++ linux-2.6.11-rc2/include/linux/nfs_xdr.h
@@ -2,6 +2,7 @@
 #define _LINUX_NFS_XDR_H
 
 #include <linux/sunrpc/xprt.h>
+#include <linux/nfsacl.h>
 
 struct nfs4_fsid {
 	__u64 major;
@@ -354,6 +355,20 @@ struct nfs_readdirargs {
 	struct page **		pages;
 };
 
+struct nfs3_getaclargs {
+	struct nfs_fh *		fh;
+	int			mask;
+	struct page **		pages;
+};
+
+struct nfs3_setaclargs {
+	struct inode *		inode;
+	int			mask;
+	struct posix_acl *	acl_access;
+	struct posix_acl *	acl_default;
+	struct page **		pages;
+};
+
 struct nfs_diropok {
 	struct nfs_fh *		fh;
 	struct nfs_fattr *	fattr;
@@ -477,6 +492,15 @@ struct nfs3_readdirres {
 	int			plus;
 };
 
+struct nfs3_getaclres {
+	struct nfs_fattr *	fattr;
+	int			mask;
+	unsigned int		acl_access_count;
+	unsigned int		acl_default_count;
+	struct posix_acl *	acl_access;
+	struct posix_acl *	acl_default;
+};
+
 #ifdef CONFIG_NFS_V4
 
 typedef u64 clientid4;
@@ -713,6 +737,11 @@ struct nfs_rpc_ops {
 	int	(*file_open)   (struct inode *, struct file *);
 	int	(*file_release) (struct inode *, struct file *);
 	int	(*lock)(struct file *, int, struct file_lock *);
+#ifdef CONFIG_NFS_ACL
+	struct posix_acl * (*getacl)(struct inode *, int);
+	int	(*setacl)(struct inode *, int, struct posix_acl *);
+	int	(*setacls)(struct inode *, struct posix_acl *, struct posix_acl *);
+#endif  /* CONFIG_NFS_ACL */
 };
 
 /*
@@ -734,4 +763,7 @@ extern struct rpc_version	nfs_version4;
 extern struct rpc_program	nfs_program;
 extern struct rpc_stat		nfs_rpcstat;
 
+extern struct rpc_version	nfsacl_version3;
+extern struct rpc_program	nfsacl_program;
+
 #endif

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 12/13] ACL umask handling workaround in nfs client
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (10 preceding siblings ...)
  2005-01-22 20:34 ` [patch 11/13] Client side of nfsacl Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  2005-01-25  1:20   ` Andreas Gruenbacher
  2005-02-15 18:04   ` Trond Myklebust
  2005-01-22 20:34 ` [patch 13/13] Cache acls on the nfs client side Andreas Gruenbacher
  12 siblings, 2 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfsacl-umask.diff --]
[-- Type: text/plain, Size: 4296 bytes --]

NFSv3 has no concept of a umask on the server side: The client applies
the umask locally, and sends the effective permissions to the server.
This behavior is wrong when files are created in a directory that has
a default ACL. In this case, the umask is supposed to be ignored, and
only the default ACL determines the file's effective permissions.

Usually it's the server's task to conditionally apply the umask. But
since the server knows nothing about the umask, we have to do it on the
client side. This patch tries to fetch the parent directory's default
ACL before creating a new file, computes the appropriate create mode to
send to the server, and finally sets the new file's access and default
acl appropriately.
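
The create-mode decision described above can be sketched in userspace
roughly as follows. This is only an illustration, not the kernel code:
the names create_mode and dfacl_perms are hypothetical, and the default
ACL is reduced to the nine permission bits it would grant, which is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: compute the permission bits a new file should get.
 *   mode        - mode requested by the application (e.g. 0666)
 *   umask_bits  - the process umask (e.g. 022)
 *   has_dfacl   - whether the parent directory has a default ACL
 *   dfacl_perms - permissions granted by that default ACL, folded into
 *                 rwxrwxrwx bits (an assumption made to keep this short)
 */
static mode_t create_mode(mode_t mode, mode_t umask_bits,
			  bool has_dfacl, mode_t dfacl_perms)
{
	if (has_dfacl)
		return mode & dfacl_perms;	/* umask is ignored */
	return mode & ~umask_bits;		/* classic umask behaviour */
}
```

With a requested mode of 0666 and a umask of 022, this yields 0644 when
there is no default ACL, and 0660 when the directory's default ACL
grants 0660.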

Many thanks to Buck Huppmann <buchk@pobox.com> for sending the initial
version of this patch, as well as for arguing why we need this change.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc2/fs/nfs/dir.c
@@ -31,6 +31,7 @@
 #include <linux/pagemap.h>
 #include <linux/smp_lock.h>
 #include <linux/namei.h>
+#include <linux/posix_acl.h>
 
 #include "delegation.h"
 
@@ -976,6 +977,38 @@ out_err:
 	return error;
 }
 
+static int nfs_set_default_acl(struct inode *dir, struct inode *inode,
+			       mode_t mode)
+{
+#ifdef CONFIG_NFS_ACL
+	struct posix_acl *dfacl, *acl;
+	int error = 0;
+
+	dfacl = NFS_PROTO(dir)->getacl(dir, ACL_TYPE_DEFAULT);
+	if (IS_ERR(dfacl)) {
+		error = PTR_ERR(dfacl);
+		return (error == -EOPNOTSUPP) ? 0 : error;
+	}
+	if (!dfacl)
+		return 0;
+	acl = posix_acl_clone(dfacl, GFP_KERNEL);
+	error = -ENOMEM;
+	if (!acl)
+		goto out;
+	error = posix_acl_create_masq(acl, &mode);
+	if (error < 0)
+		goto out;
+	error = NFS_PROTO(inode)->setacls(inode, acl, S_ISDIR(inode->i_mode) ?
+						      dfacl : NULL);
+out:
+	posix_acl_release(acl);
+	posix_acl_release(dfacl);
+	return error;
+#else
+	return 0;
+#endif
+}
+
 /*
  * Following a failed create operation, we drop the dentry rather
  * than retain a negative dentry. This avoids a problem in the event
@@ -993,7 +1026,7 @@ static int nfs_create(struct inode *dir,
 	dfprintk(VFS, "NFS: create(%s/%ld, %s\n", dir->i_sb->s_id, 
 		dir->i_ino, dentry->d_name.name);
 
-	attr.ia_mode = mode;
+	attr.ia_mode = mode & ~current->fs->umask;
 	attr.ia_valid = ATTR_MODE;
 
 	if (nd && (nd->flags & LOOKUP_CREATE))
@@ -1007,7 +1040,7 @@ static int nfs_create(struct inode *dir,
 		d_instantiate(dentry, inode);
 		nfs_renew_times(dentry);
 		nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
-		error = 0;
+		error = nfs_set_default_acl(dir, inode, mode);
 	} else {
 		error = PTR_ERR(inode);
 		d_drop(dentry);
@@ -1033,7 +1066,7 @@ nfs_mknod(struct inode *dir, struct dent
 	if (!new_valid_dev(rdev))
 		return -EINVAL;
 
-	attr.ia_mode = mode;
+	attr.ia_mode = mode & ~current->fs->umask;
 	attr.ia_valid = ATTR_MODE;
 
 	lock_kernel();
@@ -1045,6 +1078,8 @@ nfs_mknod(struct inode *dir, struct dent
 		error = nfs_instantiate(dentry, &fhandle, &fattr);
 	else
 		d_drop(dentry);
+	if (!error)
+		error = nfs_set_default_acl(dir, dentry->d_inode, mode);
 	unlock_kernel();
 	return error;
 }
@@ -1063,7 +1098,7 @@ static int nfs_mkdir(struct inode *dir, 
 		dir->i_ino, dentry->d_name.name);
 
 	attr.ia_valid = ATTR_MODE;
-	attr.ia_mode = mode | S_IFDIR;
+	attr.ia_mode = (mode & ~current->fs->umask) | S_IFDIR;
 
 	lock_kernel();
 #if 0
@@ -1083,6 +1118,8 @@ static int nfs_mkdir(struct inode *dir, 
 		error = nfs_instantiate(dentry, &fhandle, &fattr);
 	else
 		d_drop(dentry);
+	if (!error)
+		error = nfs_set_default_acl(dir, dentry->d_inode, mode);
 	unlock_kernel();
 	return error;
 }
Index: linux-2.6.11-rc2/fs/nfs/inode.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/inode.c
+++ linux-2.6.11-rc2/fs/nfs/inode.c
@@ -1494,6 +1494,8 @@ static struct super_block *nfs_get_sb(st
 		return ERR_PTR(error);
 	}
 	s->s_flags |= MS_ACTIVE;
+	/* The nfs client applies the umask itself when needed. */
+	s->s_flags |= MS_POSIXACL;
 	return s;
 }
 

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* [patch 13/13] Cache acls on the nfs client side
  2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
                   ` (11 preceding siblings ...)
  2005-01-22 20:34 ` [patch 12/13] ACL umask handling workaround in nfs client Andreas Gruenbacher
@ 2005-01-22 20:34 ` Andreas Gruenbacher
  12 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 20:34 UTC (permalink / raw)
  To: linux-kernel, Neil Brown, Trond Myklebust
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

[-- Attachment #1: patches.suse/nfsacl-client-cache.diff --]
[-- Type: text/plain, Size: 8066 bytes --]

Attach acls to inodes in the icache to avoid unnecessary GETACL RPC
round-trips. As long as the client doesn't retrieve any acls itself,
only the default acls of existing directories and the default and access
acls of new directories will end up in the cache, which saves some
memory compared to always caching the access and default acl of all
files.
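
As a rough userspace model of the caching idea (a sketch under
assumptions, not the code in this patch): a sentinel distinguishes
"nothing cached" from "cached, and there is no acl" (NULL), and entries
expire after a timeout. The patch uses ERR_PTR(-EAGAIN) as the sentinel;
the sketch swaps in the address of a static marker instead. All names
here (struct acl_cache, cache_fill, cache_lookup) are hypothetical, and
reference counting and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct acl { int refcount; };		/* stand-in for struct posix_acl */

/* Sentinel meaning "nothing cached"; NULL means "cached: no acl". */
static struct acl not_cached_marker;
#define NOT_CACHED (&not_cached_marker)

struct acl_cache {
	struct acl *access;
	struct acl *dflt;
	unsigned long timestamp;	/* time of the last fill or expiry */
	unsigned long timeout;		/* validity period */
};

static void cache_init(struct acl_cache *c, unsigned long timeout)
{
	c->access = NOT_CACHED;
	c->dflt = NOT_CACHED;
	c->timestamp = 0;
	c->timeout = timeout;
}

/* Remember both acls, as obtained from a GETACL or SETACL round-trip. */
static void cache_fill(struct acl_cache *c, struct acl *access,
		       struct acl *dflt, unsigned long now)
{
	c->access = access;
	c->dflt = dflt;
	c->timestamp = now;
}

/* Returns NOT_CACHED when the caller has to ask the server. */
static struct acl *cache_lookup(struct acl_cache *c, bool want_default,
				unsigned long now)
{
	if (now - c->timestamp > c->timeout) {
		/* Entry expired: forget both acls. */
		c->access = NOT_CACHED;
		c->dflt = NOT_CACHED;
		c->timestamp = now;
		return NOT_CACHED;
	}
	return want_default ? c->dflt : c->access;
}
```

A caller would issue the RPC only when cache_lookup() returns
NOT_CACHED, and then store the result with cache_fill().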

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>

Index: linux-2.6.11-rc2/fs/nfs/nfs3proc.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/nfs3proc.c
+++ linux-2.6.11-rc2/fs/nfs/nfs3proc.c
@@ -729,25 +729,29 @@ nfs3_proc_getacl(struct inode *inode, in
 	struct nfs3_getaclres res = {
 		.fattr =	&fattr,
 	};
-	struct posix_acl *acl = NULL;
+	struct posix_acl *acl;
 	int status, count;
 
 	if (!(server->flags & NFSACL) || (server->flags & NFS_MOUNT_NOACL))
 		return ERR_PTR(-EOPNOTSUPP);
 
-	switch (type) {
-		case ACL_TYPE_ACCESS:
-			args.mask = NFS3_ACLCNT|NFS3_ACL;
-			break;
-
-		case ACL_TYPE_DEFAULT:
-			if (!S_ISDIR(inode->i_mode))
-				return NULL;
-			args.mask = NFS3_DFACLCNT|NFS3_DFACL;
-			break;
-		default:
-			return ERR_PTR(-EINVAL);
-	}
+	acl = nfs_get_cached_acl(inode, type);
+	if (acl != ERR_PTR(-EAGAIN))
+		return acl;
+	acl = NULL;
+
+	/*
+	 * Only get the access acl when explicitly requested: We don't
+	 * need it for access decisions, and only some applications use
+	 * it. Applications which request the access acl first are not
+	 * penalized by this optimization.
+	 */
+	if (type == ACL_TYPE_ACCESS)
+		args.mask |= NFS3_ACLCNT|NFS3_ACL;
+	if (S_ISDIR(inode->i_mode))
+		args.mask |= NFS3_DFACLCNT|NFS3_DFACL;
+	if (!args.mask)
+		return NULL;
 	args.fh = NFS_FH(inode);
 
 	dprintk("NFS call getacl\n");
@@ -780,6 +784,7 @@ nfs3_proc_getacl(struct inode *inode, in
 			res.acl_access = NULL;
 		}
 	}
+	nfs_cache_acls(inode, res.acl_access, res.acl_default);
 
 	switch(type) {
 		case ACL_TYPE_ACCESS:
@@ -871,6 +876,7 @@ nfs3_proc_setacls(struct inode *inode, s
 				acl = NULL;
 			}
 		}
+		nfs_cache_acls(inode, acl, dfacl);
 		status = nfs_refresh_inode(inode, &fattr);
 	}
 
Index: linux-2.6.11-rc2/fs/nfs/inode.c
===================================================================
--- linux-2.6.11-rc2.orig/fs/nfs/inode.c
+++ linux-2.6.11-rc2/fs/nfs/inode.c
@@ -64,6 +64,19 @@ static void nfs_umount_begin(struct supe
 static int  nfs_statfs(struct super_block *, struct kstatfs *);
 static int  nfs_show_options(struct seq_file *, struct vfsmount *);
 
+#ifdef CONFIG_NFS_ACL
+static void nfs_forget_cached_acls(struct inode *);
+static void __nfs_forget_cached_acls(struct nfs_inode *nfsi);
+#else
+static inline void nfs_forget_cached_acls(struct inode *inode)
+{
+}
+
+static inline void __nfs_forget_cached_acls(struct nfs_inode *nfsi)
+{
+}
+#endif
+
 static struct super_operations nfs_sops = { 
 	.alloc_inode	= nfs_alloc_inode,
 	.destroy_inode	= nfs_destroy_inode,
@@ -617,6 +630,7 @@ nfs_zap_caches(struct inode *inode)
 		nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA|NFS_INO_INVALID_ACCESS;
 	else
 		nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS;
+	nfs_forget_cached_acls(inode);
 }
 
 /*
@@ -1159,6 +1173,81 @@ void nfs_end_data_update_defer(struct in
 	}
 }
 
+#ifdef CONFIG_NFS_ACL
+static void __nfs_forget_cached_acls(struct nfs_inode *nfsi)
+{
+	if (nfsi->acl_access != ERR_PTR(-EAGAIN)) {
+		posix_acl_release(nfsi->acl_access);
+		nfsi->acl_access = ERR_PTR(-EAGAIN);
+	}
+	if (nfsi->acl_default != ERR_PTR(-EAGAIN)) {
+		posix_acl_release(nfsi->acl_default);
+		nfsi->acl_default = ERR_PTR(-EAGAIN);
+	}
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+static void nfs_forget_cached_acls(struct inode *inode)
+{
+	dprintk("NFS: nfs_forget_cached_acls(%s/%ld)\n", inode->i_sb->s_id,
+		inode->i_ino);
+	spin_lock(&inode->i_lock);
+	__nfs_forget_cached_acls(NFS_I(inode));
+	spin_unlock(&inode->i_lock);
+}
+#endif
+
+#ifdef CONFIG_NFS_ACL
+struct posix_acl *nfs_get_cached_acl(struct inode *inode, int type)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+	struct posix_acl *acl = ERR_PTR(-EAGAIN);
+
+	spin_lock(&inode->i_lock);
+	if (time_after(jiffies, nfsi->acl_timestamp + nfsi->attrtimeo)) {
+		__nfs_forget_cached_acls(nfsi);
+		nfsi->acl_timestamp = jiffies;
+	} else switch(type) {
+		case ACL_TYPE_ACCESS:
+			acl = nfsi->acl_access;
+			break;
+
+		case ACL_TYPE_DEFAULT:
+			acl = nfsi->acl_default;
+			break;
+
+		default:
+			acl = ERR_PTR(-EINVAL);
+			break;
+	}
+	/* Take a reference only on a real acl, never on an error marker. */
+	if (!IS_ERR(acl))
+		acl = posix_acl_dup(acl);
+	spin_unlock(&inode->i_lock);
+	dprintk("NFS: nfs_get_cached_acl(%s/%ld, %d) = %p\n", inode->i_sb->s_id,
+		inode->i_ino, type, acl);
+	return acl;
+}
+#endif  /* CONFIG_NFS_ACL */
+
+#ifdef CONFIG_NFS_ACL
+void nfs_cache_acls(struct inode *inode, struct posix_acl *acl,
+		    struct posix_acl *dfacl)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	dprintk("nfs_cache_acls(%s/%ld, %p, %p)\n", inode->i_sb->s_id,
+		inode->i_ino, acl, dfacl);
+	spin_lock(&inode->i_lock);
+	__nfs_forget_cached_acls(NFS_I(inode));
+	nfsi->acl_access = posix_acl_dup(acl);
+	nfsi->acl_default = posix_acl_dup(dfacl);
+	nfsi->acl_timestamp = jiffies;
+	spin_unlock(&inode->i_lock);
+}
+#endif  /* CONFIG_NFS_ACL */
+
 /**
  * nfs_refresh_inode - verify consistency of the inode attribute cache
  * @inode - pointer to inode
@@ -1219,8 +1308,10 @@ int nfs_refresh_inode(struct inode *inod
 	/* Have any file permissions changed? */
 	if ((inode->i_mode & S_IALLUGO) != (fattr->mode & S_IALLUGO)
 			|| inode->i_uid != fattr->uid
-			|| inode->i_gid != fattr->gid)
+			|| inode->i_gid != fattr->gid) {
 		nfsi->flags |= NFS_INO_INVALID_ATTR | NFS_INO_INVALID_ACCESS;
+		nfs_forget_cached_acls(inode);
+	}
 
 	/* Has the link count changed? */
 	if (inode->i_nlink != fattr->nlink)
@@ -1337,8 +1428,10 @@ static int nfs_update_inode(struct inode
 
 	if ((inode->i_mode & S_IALLUGO) != (fattr->mode & S_IALLUGO) ||
 	    inode->i_uid != fattr->uid ||
-	    inode->i_gid != fattr->gid)
+	    inode->i_gid != fattr->gid) {
 		invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS;
+		nfs_forget_cached_acls(inode);
+	}
 
 	inode->i_mode = fattr->mode;
 	inode->i_nlink = fattr->nlink;
@@ -1912,6 +2005,7 @@ static struct inode *nfs_alloc_inode(str
 
 static void nfs_destroy_inode(struct inode *inode)
 {
+	__nfs_forget_cached_acls(NFS_I(inode));
 	kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
 }
 
@@ -1932,6 +2026,10 @@ static void init_once(void * foo, kmem_c
 		nfsi->ncommit = 0;
 		nfsi->npages = 0;
 		init_waitqueue_head(&nfsi->nfs_i_wait);
+#ifdef CONFIG_NFS_ACL
+		nfsi->acl_access = ERR_PTR(-EAGAIN);
+		nfsi->acl_default = ERR_PTR(-EAGAIN);
+#endif
 		nfs4_init_once(nfsi);
 	}
 }
Index: linux-2.6.11-rc2/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.11-rc2.orig/include/linux/nfs_fs.h
+++ linux-2.6.11-rc2/include/linux/nfs_fs.h
@@ -104,6 +104,8 @@ struct nfs_open_context {
  */
 struct nfs_delegation;
 
+struct posix_acl;
+
 /*
  * nfs fs inode data in memory
  */
@@ -158,6 +160,11 @@ struct nfs_inode {
 	atomic_t		data_updates;
 
 	struct nfs_access_entry	cache_access;
+#ifdef CONFIG_NFS_ACL
+	unsigned long		acl_timestamp;
+	struct posix_acl	*acl_access;
+	struct posix_acl	*acl_default;
+#endif
 
 	/*
 	 * This is the cookie verifier used for NFSv3 readdir
@@ -284,6 +291,8 @@ static inline int nfs_verify_change_attr
 extern void nfs_zap_caches(struct inode *);
 extern struct inode *nfs_fhget(struct super_block *, struct nfs_fh *,
 				struct nfs_fattr *);
+extern struct posix_acl *nfs_get_cached_acl(struct inode *, int);
+extern void nfs_cache_acls(struct inode *, struct posix_acl *, struct posix_acl *);
 extern int nfs_refresh_inode(struct inode *, struct nfs_fattr *);
 extern int nfs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
 extern int nfs_permission(struct inode *, int, struct nameidata *);

--
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH



* Re: [patch 1/13] Qsort
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
@ 2005-01-22 21:00   ` vlobanov
  2005-01-23  2:03     ` Felipe Alfaro Solana
  2005-01-23 21:24     ` Richard Henderson
       [not found]   ` <1106431568.4153.154.camel@laptopd505.fenrus.org>
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 85+ messages in thread
From: vlobanov @ 2005-01-22 21:00 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

Hi,

I was just reading over the patch, and had a quick question/comment upon
the SWAP macro defined below. I think it's possible to do a tiny bit
better (better, of course, being subjective), as follows:

#define SWAP(a, b, size)			\
    do {					\
	register size_t __size = (size);	\
	register char * __a = (a), * __b = (b);	\
	do {					\
	    *__a ^= *__b;			\
	    *__b ^= *__a;			\
	    *__a ^= *__b;			\
	    __a++;				\
	    __b++;				\
	} while ((--__size) > 0);		\
    } while (0)

What do you think? :)
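
For comparison, a minimal sketch of the same byte-wise exchange done
through a temporary; swap_bytes is a hypothetical name, not something
from the patch. Unlike the XOR form above, it leaves the data intact
when both pointers refer to the same element (XOR-ing a byte with
itself yields zero), and it does nothing when the size is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exchange size bytes between a and b. */
static void swap_bytes(void *a, void *b, size_t size)
{
	unsigned char *pa = a, *pb = b;

	while (size-- > 0) {
		unsigned char tmp = *pa;

		*pa++ = *pb;
		*pb++ = tmp;
	}
}
```

Whether this is faster or slower than the XOR variant depends on the
compiler and architecture; the sketch is only meant to show the
aliasing and zero-size behaviour.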

-Vadim Lobanov

On Sat, 22 Jan 2005, Andreas Gruenbacher wrote:

> Add a quicksort from glibc as a kernel library function, and switch
> xfs over to using it. The implementations are equivalent. The nfsacl
> protocol also requires a sort function, so it makes more sense in
> the common code.
>
> Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
> Acked-by: Olaf Kirch <okir@suse.de>
>
> Index: linux-2.6.11-rc2/include/linux/kernel.h
> ===================================================================
> --- linux-2.6.11-rc2.orig/include/linux/kernel.h
> +++ linux-2.6.11-rc2/include/linux/kernel.h
> @@ -93,6 +93,8 @@ extern int sscanf(const char *, const ch
>  	__attribute__ ((format (scanf,2,3)));
>  extern int vsscanf(const char *, const char *, va_list);
>
> +extern void qsort(void *, size_t, size_t, int (*)(const void *,const void *));
> +
>  extern int get_option(char **str, int *pint);
>  extern char *get_options(const char *str, int nints, int *ints);
>  extern unsigned long long memparse(char *ptr, char **retptr);
> Index: linux-2.6.11-rc2/lib/Kconfig
> ===================================================================
> --- linux-2.6.11-rc2.orig/lib/Kconfig
> +++ linux-2.6.11-rc2/lib/Kconfig
> @@ -30,6 +30,9 @@ config LIBCRC32C
>  	  require M here.  See Castagnoli93.
>  	  Module will be libcrc32c.
>
> +config QSORT
> +	bool "Quick Sort"
> +
>  #
>  # compression support is select'ed if needed
>  #
> Index: linux-2.6.11-rc2/lib/Makefile
> ===================================================================
> --- linux-2.6.11-rc2.orig/lib/Makefile
> +++ linux-2.6.11-rc2/lib/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_CRC_CCITT)	+= crc-ccitt.o
>  obj-$(CONFIG_CRC32)	+= crc32.o
>  obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
>  obj-$(CONFIG_GENERIC_IOMAP) += iomap.o
> +obj-$(CONFIG_QSORT)	+= qsort.o
>
>  obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate/
>  obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
> Index: linux-2.6.11-rc2/lib/qsort.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.11-rc2/lib/qsort.c
> @@ -0,0 +1,249 @@
> +/* Copyright (C) 1991, 1992, 1996, 1997, 1999 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +   Written by Douglas C. Schmidt (schmidt@ics.uci.edu).
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, write to the Free
> +   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
> +   02111-1307 USA.  */
> +
> +/* If you consider tuning this algorithm, you should consult first:
> +   Engineering a sort function; Jon Bentley and M. Douglas McIlroy;
> +   Software - Practice and Experience; Vol. 23 (11), 1249-1265, 1993.  */
> +
> +# include <linux/module.h>
> +# include <linux/slab.h>
> +# include <linux/string.h>
> +
> +MODULE_LICENSE("GPL");
> +
> +/* Byte-wise swap two items of size SIZE. */
> +#define SWAP(a, b, size)						      \
> +  do									      \
> +    {									      \
> +      register size_t __size = (size);					      \
> +      register char *__a = (a), *__b = (b);				      \
> +      do								      \
> +	{								      \
> +	  char __tmp = *__a;						      \
> +	  *__a++ = *__b;						      \
> +	  *__b++ = __tmp;						      \
> +	} while (--__size > 0);						      \
> +    } while (0)
> +
> +/* Discontinue quicksort algorithm when partition gets below this size.
> +   This particular magic number was chosen to work best on a Sun 4/260. */
> +#define MAX_THRESH 4
> +
> +/* Stack node declarations used to store unfulfilled partition obligations. */
> +typedef struct
> +  {
> +    char *lo;
> +    char *hi;
> +  } stack_node;
> +
> +/* The next 5 #defines implement a very fast in-line stack abstraction. */
> +/* The stack needs log (total_elements) entries (we could even subtract
> +   log(MAX_THRESH)).  Since total_elements has type size_t, we get as
> +   upper bound for log (total_elements):
> +   bits per byte (CHAR_BIT) * sizeof(size_t).  */
> +#define CHAR_BIT 8
> +#define STACK_SIZE	(CHAR_BIT * sizeof(size_t))
> +#define PUSH(low, high)	((void) ((top->lo = (low)), (top->hi = (high)), ++top))
> +#define	POP(low, high)	((void) (--top, (low = top->lo), (high = top->hi)))
> +#define	STACK_NOT_EMPTY	(stack < top)
> +
> +
> +/* Order size using quicksort.  This implementation incorporates
> +   four optimizations discussed in Sedgewick:
> +
> +   1. Non-recursive, using an explicit stack of pointer that store the
> +      next array partition to sort.  To save time, this maximum amount
> +      of space required to store an array of SIZE_MAX is allocated on the
> +      stack.  Assuming a 32-bit (64 bit) integer for size_t, this needs
> +      only 32 * sizeof(stack_node) == 256 bytes (for 64 bit: 1024 bytes).
> +      Pretty cheap, actually.
> +
> +   2. Chose the pivot element using a median-of-three decision tree.
> +      This reduces the probability of selecting a bad pivot value and
> +      eliminates certain extraneous comparisons.
> +
> +   3. Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving
> +      insertion sort to order the MAX_THRESH items within each partition.
> +      This is a big win, since insertion sort is faster for small, mostly
> +      sorted array segments.
> +
> +   4. The larger of the two sub-partitions is always pushed onto the
> +      stack first, with the algorithm then concentrating on the
> +      smaller partition.  This *guarantees* no more than log (total_elems)
> +      stack size is needed (actually O(1) in this case)!  */
> +
> +void
> +qsort(void *const pbase, size_t total_elems, size_t size,
> +      int(*cmp)(const void *,const void *))
> +{
> +  register char *base_ptr = (char *) pbase;
> +
> +  const size_t max_thresh = MAX_THRESH * size;
> +
> +  if (total_elems == 0)
> +    /* Avoid lossage with unsigned arithmetic below.  */
> +    return;
> +
> +  if (total_elems > MAX_THRESH)
> +    {
> +      char *lo = base_ptr;
> +      char *hi = &lo[size * (total_elems - 1)];
> +      stack_node stack[STACK_SIZE];
> +      stack_node *top = stack + 1;
> +
> +      while (STACK_NOT_EMPTY)
> +        {
> +          char *left_ptr;
> +          char *right_ptr;
> +
> +	  /* Select median value from among LO, MID, and HI. Rearrange
> +	     LO and HI so the three values are sorted. This lowers the
> +	     probability of picking a pathological pivot value and
> +	     skips a comparison for both the LEFT_PTR and RIGHT_PTR in
> +	     the while loops. */
> +
> +	  char *mid = lo + size * ((hi - lo) / size >> 1);
> +
> +	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
> +	    SWAP (mid, lo, size);
> +	  if ((*cmp) ((void *) hi, (void *) mid) < 0)
> +	    SWAP (mid, hi, size);
> +	  else
> +	    goto jump_over;
> +	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
> +	    SWAP (mid, lo, size);
> +	jump_over:;
> +
> +	  left_ptr  = lo + size;
> +	  right_ptr = hi - size;
> +
> +	  /* Here's the famous ``collapse the walls'' section of quicksort.
> +	     Gotta like those tight inner loops!  They are the main reason
> +	     that this algorithm runs much faster than others. */
> +	  do
> +	    {
> +	      while ((*cmp) ((void *) left_ptr, (void *) mid) < 0)
> +		left_ptr += size;
> +
> +	      while ((*cmp) ((void *) mid, (void *) right_ptr) < 0)
> +		right_ptr -= size;
> +
> +	      if (left_ptr < right_ptr)
> +		{
> +		  SWAP (left_ptr, right_ptr, size);
> +		  if (mid == left_ptr)
> +		    mid = right_ptr;
> +		  else if (mid == right_ptr)
> +		    mid = left_ptr;
> +		  left_ptr += size;
> +		  right_ptr -= size;
> +		}
> +	      else if (left_ptr == right_ptr)
> +		{
> +		  left_ptr += size;
> +		  right_ptr -= size;
> +		  break;
> +		}
> +	    }
> +	  while (left_ptr <= right_ptr);
> +
> +          /* Set up pointers for next iteration.  First determine whether
> +             left and right partitions are below the threshold size.  If so,
> +             ignore one or both.  Otherwise, push the larger partition's
> +             bounds on the stack and continue sorting the smaller one. */
> +
> +          if ((size_t) (right_ptr - lo) <= max_thresh)
> +            {
> +              if ((size_t) (hi - left_ptr) <= max_thresh)
> +		/* Ignore both small partitions. */
> +                POP (lo, hi);
> +              else
> +		/* Ignore small left partition. */
> +                lo = left_ptr;
> +            }
> +          else if ((size_t) (hi - left_ptr) <= max_thresh)
> +	    /* Ignore small right partition. */
> +            hi = right_ptr;
> +          else if ((right_ptr - lo) > (hi - left_ptr))
> +            {
> +	      /* Push larger left partition indices. */
> +              PUSH (lo, right_ptr);
> +              lo = left_ptr;
> +            }
> +          else
> +            {
> +	      /* Push larger right partition indices. */
> +              PUSH (left_ptr, hi);
> +              hi = right_ptr;
> +            }
> +        }
> +    }
> +
> +  /* Once the BASE_PTR array is partially sorted by quicksort the rest
> +     is completely sorted using insertion sort, since this is efficient
> +     for partitions below MAX_THRESH size. BASE_PTR points to the beginning
> +     of the array to sort, and END_PTR points at the very last element in
> +     the array (*not* one beyond it!). */
> +
> +  {
> +    char *end_ptr = &base_ptr[size * (total_elems - 1)];
> +    char *tmp_ptr = base_ptr;
> +    char *thresh = min(end_ptr, base_ptr + max_thresh);
> +    register char *run_ptr;
> +
> +    /* Find smallest element in first threshold and place it at the
> +       array's beginning.  This is the smallest array element,
> +       and the operation speeds up insertion sort's inner loop. */
> +
> +    for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
> +      if ((*cmp) ((void *) run_ptr, (void *) tmp_ptr) < 0)
> +        tmp_ptr = run_ptr;
> +
> +    if (tmp_ptr != base_ptr)
> +      SWAP (tmp_ptr, base_ptr, size);
> +
> +    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
> +
> +    run_ptr = base_ptr + size;
> +    while ((run_ptr += size) <= end_ptr)
> +      {
> +	tmp_ptr = run_ptr - size;
> +	while ((*cmp) ((void *) run_ptr, (void *) tmp_ptr) < 0)
> +	  tmp_ptr -= size;
> +
> +	tmp_ptr += size;
> +        if (tmp_ptr != run_ptr)
> +          {
> +            char *trav;
> +
> +	    trav = run_ptr + size;
> +	    while (--trav >= run_ptr)
> +              {
> +                char c = *trav;
> +                char *hi, *lo;
> +
> +                for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
> +                  *hi = *lo;
> +                *hi = c;
> +              }
> +          }
> +      }
> +  }
> +}
> +EXPORT_SYMBOL(qsort);
> Index: linux-2.6.11-rc2/fs/xfs/support/qsort.c
> ===================================================================
> --- linux-2.6.11-rc2.orig/fs/xfs/support/qsort.c
> +++ /dev/null
> @@ -1,155 +0,0 @@
> -/*
> - * Copyright (c) 1992, 1993
> - *	The Regents of the University of California.  All rights reserved.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
> - * are met:
> - * 1. Redistributions of source code must retain the above copyright
> - *    notice, this list of conditions and the following disclaimer.
> - * 2. Redistributions in binary form must reproduce the above copyright
> - *    notice, this list of conditions and the following disclaimer in the
> - *    documentation and/or other materials provided with the distribution.
> - * 3. Neither the name of the University nor the names of its contributors
> - *    may be used to endorse or promote products derived from this software
> - *    without specific prior written permission.
> - *
> - * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> - * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> - * SUCH DAMAGE.
> - */
> -
> -#include <linux/kernel.h>
> -#include <linux/string.h>
> -
> -/*
> - * Qsort routine from Bentley & McIlroy's "Engineering a Sort Function".
> - */
> -#define swapcode(TYPE, parmi, parmj, n) { 		\
> -	long i = (n) / sizeof (TYPE); 			\
> -	register TYPE *pi = (TYPE *) (parmi); 		\
> -	register TYPE *pj = (TYPE *) (parmj); 		\
> -	do { 						\
> -		register TYPE	t = *pi;		\
> -		*pi++ = *pj;				\
> -		*pj++ = t;				\
> -        } while (--i > 0);				\
> -}
> -
> -#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
> -	es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
> -
> -static __inline void
> -swapfunc(char *a, char *b, int n, int swaptype)
> -{
> -	if (swaptype <= 1)
> -		swapcode(long, a, b, n)
> -	else
> -		swapcode(char, a, b, n)
> -}
> -
> -#define swap(a, b)					\
> -	if (swaptype == 0) {				\
> -		long t = *(long *)(a);			\
> -		*(long *)(a) = *(long *)(b);		\
> -		*(long *)(b) = t;			\
> -	} else						\
> -		swapfunc(a, b, es, swaptype)
> -
> -#define vecswap(a, b, n) 	if ((n) > 0) swapfunc(a, b, n, swaptype)
> -
> -static __inline char *
> -med3(char *a, char *b, char *c, int (*cmp)(const void *, const void *))
> -{
> -	return cmp(a, b) < 0 ?
> -	       (cmp(b, c) < 0 ? b : (cmp(a, c) < 0 ? c : a ))
> -              :(cmp(b, c) > 0 ? b : (cmp(a, c) < 0 ? a : c ));
> -}
> -
> -void
> -qsort(void *aa, size_t n, size_t es, int (*cmp)(const void *, const void *))
> -{
> -	char *pa, *pb, *pc, *pd, *pl, *pm, *pn;
> -	int d, r, swaptype, swap_cnt;
> -	register char *a = aa;
> -
> -loop:	SWAPINIT(a, es);
> -	swap_cnt = 0;
> -	if (n < 7) {
> -		for (pm = (char *)a + es; pm < (char *) a + n * es; pm += es)
> -			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0;
> -			     pl -= es)
> -				swap(pl, pl - es);
> -		return;
> -	}
> -	pm = (char *)a + (n / 2) * es;
> -	if (n > 7) {
> -		pl = (char *)a;
> -		pn = (char *)a + (n - 1) * es;
> -		if (n > 40) {
> -			d = (n / 8) * es;
> -			pl = med3(pl, pl + d, pl + 2 * d, cmp);
> -			pm = med3(pm - d, pm, pm + d, cmp);
> -			pn = med3(pn - 2 * d, pn - d, pn, cmp);
> -		}
> -		pm = med3(pl, pm, pn, cmp);
> -	}
> -	swap(a, pm);
> -	pa = pb = (char *)a + es;
> -
> -	pc = pd = (char *)a + (n - 1) * es;
> -	for (;;) {
> -		while (pb <= pc && (r = cmp(pb, a)) <= 0) {
> -			if (r == 0) {
> -				swap_cnt = 1;
> -				swap(pa, pb);
> -				pa += es;
> -			}
> -			pb += es;
> -		}
> -		while (pb <= pc && (r = cmp(pc, a)) >= 0) {
> -			if (r == 0) {
> -				swap_cnt = 1;
> -				swap(pc, pd);
> -				pd -= es;
> -			}
> -			pc -= es;
> -		}
> -		if (pb > pc)
> -			break;
> -		swap(pb, pc);
> -		swap_cnt = 1;
> -		pb += es;
> -		pc -= es;
> -	}
> -	if (swap_cnt == 0) {  /* Switch to insertion sort */
> -		for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
> -			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0;
> -			     pl -= es)
> -				swap(pl, pl - es);
> -		return;
> -	}
> -
> -	pn = (char *)a + n * es;
> -	r = min(pa - (char *)a, pb - pa);
> -	vecswap(a, pb - r, r);
> -	r = min((long)(pd - pc), (long)(pn - pd - es));
> -	vecswap(pb, pn - r, r);
> -	if ((r = pb - pa) > es)
> -		qsort(a, r / es, es, cmp);
> -	if ((r = pd - pc) > es) {
> -		/* Iterate rather than recurse to save stack space */
> -		a = pn - r;
> -		n = r / es;
> -		goto loop;
> -	}
> -/*		qsort(pn - r, r / es, es, cmp);*/
> -}
> Index: linux-2.6.11-rc2/fs/xfs/Makefile
> ===================================================================
> --- linux-2.6.11-rc2.orig/fs/xfs/Makefile
> +++ linux-2.6.11-rc2/fs/xfs/Makefile
> @@ -142,7 +142,6 @@ xfs-y				+= $(addprefix linux-2.6/, \
>  xfs-y				+= $(addprefix support/, \
>  				   debug.o \
>  				   move.o \
> -				   qsort.o \
>  				   uuid.o)
>
>  xfs-$(CONFIG_XFS_TRACE)		+= support/ktrace.o
> Index: linux-2.6.11-rc2/fs/xfs/support/qsort.h
> ===================================================================
> --- linux-2.6.11-rc2.orig/fs/xfs/support/qsort.h
> +++ /dev/null
> @@ -1,41 +0,0 @@
> -/*
> - * Copyright (c) 2000-2002 Silicon Graphics, Inc.  All Rights Reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of version 2 of the GNU General Public License as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it would be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> - *
> - * Further, this software is distributed without any warranty that it is
> - * free of the rightful claim of any third person regarding infringement
> - * or the like.  Any license provided herein, whether implied or
> - * otherwise, applies only to this software file.  Patent licenses, if
> - * any, provided herein do not apply to combinations of this program with
> - * other software, or any other product whatsoever.
> - *
> - * You should have received a copy of the GNU General Public License along
> - * with this program; if not, write the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston MA 02111-1307, USA.
> - *
> - * Contact information: Silicon Graphics, Inc., 1600 Amphitheatre Pkwy,
> - * Mountain View, CA  94043, or:
> - *
> - * http://www.sgi.com
> - *
> - * For further information regarding this notice, see:
> - *
> - * http://oss.sgi.com/projects/GenInfo/SGIGPLNoticeExplan/
> - */
> -
> -#ifndef QSORT_H
> -#define QSORT_H
> -
> -extern void qsort (void *const pbase,
> -		    size_t total_elems,
> -		    size_t size,
> -		    int (*cmp)(const void *, const void *));
> -
> -#endif
> Index: linux-2.6.11-rc2/fs/xfs/linux-2.6/xfs_linux.h
> ===================================================================
> --- linux-2.6.11-rc2.orig/fs/xfs/linux-2.6/xfs_linux.h
> +++ linux-2.6.11-rc2/fs/xfs/linux-2.6/xfs_linux.h
> @@ -64,7 +64,6 @@
>  #include <sema.h>
>  #include <time.h>
>
> -#include <support/qsort.h>
>  #include <support/ktrace.h>
>  #include <support/debug.h>
>  #include <support/move.h>
> Index: linux-2.6.11-rc2/fs/Kconfig
> ===================================================================
> --- linux-2.6.11-rc2.orig/fs/Kconfig
> +++ linux-2.6.11-rc2/fs/Kconfig
> @@ -306,6 +306,7 @@ config FS_POSIX_ACL
>
>  config XFS_FS
>  	tristate "XFS filesystem support"
> +	select QSORT
>  	help
>  	  XFS is a high performance journaling filesystem which originated
>  	  on the SGI IRIX platform.  It is completely multi-threaded, can
>
> --
> Andreas Gruenbacher <agruen@suse.de>
> SUSE Labs, SUSE LINUX PRODUCTS GMBH
>

* Re: [patch 1/13] Qsort
       [not found]   ` <1106431568.4153.154.camel@laptopd505.fenrus.org>
@ 2005-01-22 22:10     ` Andreas Gruenbacher
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-22 22:10 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

On Sat, 2005-01-22 at 23:06, Arjan van de Ven wrote:
> since you took the glibc one.. the glibc authors have repeatedly asked
> if glibc code that goes into the kernel will be export_symbol_gpl only
> due to their view of the gpl and lgpl

Sure, no big deal. We could equally well take the xfs one instead.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


* Re: [patch 1/13] Qsort
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
  2005-01-22 21:00   ` vlobanov
       [not found]   ` <1106431568.4153.154.camel@laptopd505.fenrus.org>
@ 2005-01-22 23:28   ` Matt Mackall
  2005-01-23  0:21     ` Matt Mackall
  2005-01-23  5:08     ` Andreas Gruenbacher
  2005-01-24  3:48   ` Horst von Brand
  2005-01-24 20:15   ` [PATCH] lib/qsort Matt Mackall
  4 siblings, 2 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-22 23:28 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

On Sat, Jan 22, 2005 at 09:34:01PM +0100, Andreas Gruenbacher wrote:
> Add a quicksort from glibc as a kernel library function, and switch
> xfs over to using it. The implementations are equivalent. The nfsacl
> protocol also requires a sort function, so it makes more sense in
> the common code.

Please update this to kernel formatting standards and try to modernize
it a bit.

> +/* Byte-wise swap two items of size SIZE. */
> +#define SWAP(a, b, size)						      \
> +  do									      \
> +    {									      \
> +      register size_t __size = (size);					      \
> +      register char *__a = (a), *__b = (b);				      \
> +      do								      \
> +	{								      \
> +	  char __tmp = *__a;						      \
> +	  *__a++ = *__b;						      \
> +	  *__b++ = __tmp;						      \
> +	} while (--__size > 0);						      \
> +    } while (0)

Inline, please? Register keyword?!
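
As a rough sketch of what the inline request could look like (the name swap_bytes is an assumption, not something from the patch):

```c
#include <stddef.h>

/* Byte-wise swap of two items of SIZE bytes, written as an inline
   function instead of a macro.  Hypothetical name.  Unlike the
   do/while macro, this also tolerates size == 0. */
static inline void swap_bytes(void *a, void *b, size_t size)
{
	char *pa = a;
	char *pb = b;

	while (size--) {
		char tmp = *pa;

		*pa++ = *pb;
		*pb++ = tmp;
	}
}
```

Taking void * arguments also removes the casts at the call sites, and the compiler is free to pick registers without the register keyword.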

> +typedef struct
> +  {
> +    char *lo;
> +    char *hi;
> +  } stack_node;

void *, please

> +
> +/* The next 5 #defines implement a very fast in-line stack abstraction. */
> +/* The stack needs log (total_elements) entries (we could even subtract
> +   log(MAX_THRESH)).  Since total_elements has type size_t, we get as
> +   upper bound for log (total_elements):
> +   bits per byte (CHAR_BIT) * sizeof(size_t).  */
> +#define CHAR_BIT 8
> +#define STACK_SIZE	(CHAR_BIT * sizeof(size_t))

So the stack is going to be either 256 or 1024 bytes. Seems like we
ought to kmalloc it.
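
The two figures follow from STACK_SIZE entries of two pointers each. A small sketch of the arithmetic, assuming the usual ILP32 and LP64 models (the struct and function names are made up for illustration):

```c
#include <limits.h>
#include <stddef.h>

/* Same layout as stack_node in the patch; hypothetical name. */
struct stack_node_sketch {
	char *lo;
	char *hi;
};

/* Bytes of on-stack storage the explicit partition stack needs:
   CHAR_BIT * sizeof(size_t) entries of two pointers each. */
static size_t qsort_stack_bytes(void)
{
	return CHAR_BIT * sizeof(size_t) * sizeof(struct stack_node_sketch);
}
```

With 4-byte size_t and pointers this is 32 * 8 = 256 bytes; with 8-byte size_t and pointers it is 64 * 16 = 1024 bytes.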

> +#define PUSH(low, high)	((void) ((top->lo = (low)), (top->hi = (high)), ++top))
> +#define	POP(low, high)	((void) (--top, (low = top->lo), (high = top->hi)))
> +#define	STACK_NOT_EMPTY	(stack < top)

There's only one usage of POP, one of STACK_NOT_EMPTY and two of PUSH
that can trivially be made one. Please kill these macros.

> +   3. Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving
> +      insertion sort to order the MAX_THRESH items within each partition.
> +      This is a big win, since insertion sort is faster for small, mostly
> +      sorted array segments.

This observation may be dated; instruction cache issues may dominate now.

> +	  char *mid = lo + size * ((hi - lo) / size >> 1);

Get rid of all this char* stuff, please. It makes for lots of ugly and
unnecessary casting.

> +	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
> +	    SWAP (mid, lo, size);

cmp(mid, lo)

> +	  if ((*cmp) ((void *) hi, (void *) mid) < 0)
> +	    SWAP (mid, hi, size);
> +	  else
> +	    goto jump_over;
> +	  if ((*cmp) ((void *) mid, (void *) lo) < 0)
> +	    SWAP (mid, lo, size);
> +	jump_over:;

?!
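
The same median-of-three ordering can be written without the goto by nesting the third comparison. A minimal sketch on an int array for brevity (the helper names are assumptions; the real code works on opaque elements through cmp):

```c
#include <stddef.h>

static void swap_int(int *a, int *b)
{
	int t = *a;

	*a = *b;
	*b = t;
}

/* Arrange v[lo] <= v[mid] <= v[hi] and return the index of the
   median, which then serves as the pivot.  Hypothetical helper. */
static size_t median_of_three(int *v, size_t lo, size_t hi)
{
	size_t mid = lo + (hi - lo) / 2;

	if (v[mid] < v[lo])
		swap_int(&v[mid], &v[lo]);
	if (v[hi] < v[mid]) {
		swap_int(&v[mid], &v[hi]);
		/* Only needed when the second swap moved a new
		   value into the middle slot. */
		if (v[mid] < v[lo])
			swap_int(&v[mid], &v[lo]);
	}
	return mid;
}
```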

> +  /* Once the BASE_PTR array is partially sorted by quicksort the rest
> +     is completely sorted using insertion sort, since this is efficient
> +     for partitions below MAX_THRESH size. BASE_PTR points to the beginning
> +     of the array to sort, and END_PTR points at the very last element in
> +     the array (*not* one beyond it!). */
> +
> +  {
> +    char *end_ptr = &base_ptr[size * (total_elems - 1)];
> +    char *tmp_ptr = base_ptr;
> +    char *thresh = min(end_ptr, base_ptr + max_thresh);
> +    register char *run_ptr;

Move these vars to the top or better yet, split this into two functions.

-- 
Mathematics is the supreme nostalgia of our time.

* Re: [patch 1/13] Qsort
  2005-01-22 23:28   ` Matt Mackall
@ 2005-01-23  0:21     ` Matt Mackall
  2005-01-23  5:08     ` Andreas Gruenbacher
  1 sibling, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-23  0:21 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: linux-kernel

On Sat, Jan 22, 2005 at 03:28:14PM -0800, Matt Mackall wrote:
> On Sat, Jan 22, 2005 at 09:34:01PM +0100, Andreas Gruenbacher wrote:
> > Add a quicksort from glibc as a kernel library function, and switch
> > xfs over to using it. The implementations are equivalent. The nfsacl
> > protocol also requires a sort function, so it makes more sense in
> > the common code.
> 
> Please update this to kernel formatting standards and try to modernize
> it a bit.

I started working on this with an eye to doing some performance
testing of the insertion sort threshold in userspace, but I'm about to
head out for the day. Here's what I've got so far, compiles but
untested. Note the insertion sort at the end really ought to be using
memmove as well.
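
A rough userspace sketch of what a memmove-based insertion step might look like, not the code from the patch. The function name, the comparison helper, and the 64-byte cap on element size are assumptions made to keep the example self-contained:

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_ELEM 64	/* hypothetical cap on element size */

/* Example comparison callback for int elements. */
static int cmp_int_sketch(const void *a, const void *b)
{
	int x = *(const int *)a;
	int y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Insertion sort that shifts the displaced run with one memmove
   instead of moving it byte by byte.  Returns 0 on success, -1 if
   the element size does not fit the temporary buffer. */
static int insertion_sort_memmove(void *base, size_t num, size_t size,
				  int (*cmp)(const void *, const void *))
{
	char *b = base;
	char tmp[SKETCH_MAX_ELEM];
	size_t i, j;

	if (size == 0 || size > SKETCH_MAX_ELEM)
		return -1;

	for (i = 1; i < num; i++) {
		/* Scan back to the slot where element i belongs. */
		j = i;
		while (j > 0 && cmp(b + i * size, b + (j - 1) * size) < 0)
			j--;
		if (j == i)
			continue;

		/* Save element i, slide [j, i) up by one, drop it in. */
		memcpy(tmp, b + i * size, size);
		memmove(b + (j + 1) * size, b + j * size, (i - j) * size);
		memcpy(b + j * size, tmp, size);
	}
	return 0;
}
```

The strict "< 0" test keeps equal elements in their original order, so this variant is stable.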

/* Copyright (C) 1991, 1992, 1996, 1997, 1999 Free Software Foundation, Inc.
   Written by Douglas C. Schmidt (schmidt@ics.uci.edu).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/* If you consider tuning this algorithm, you should consult first:
   Engineering a sort function; Jon Bentley and M. Douglas McIlroy;
   Software - Practice and Experience; Vol. 23 (11), 1249-1265, 1993.  */

#include <unistd.h>
#include <stdlib.h>

#define min(x,y) ({ \
	typeof(x) _x = (x);	\
	typeof(y) _y = (y);	\
	(void) (&_x == &_y);		\
	_x < _y ? _x : _y; })

/* Byte-wise swap two items of size SIZE. */
#define SWAP(a, b, size)						      \
  do									      \
    {									      \
      size_t __size = (size);					      \
      char *__a = (a), *__b = (b);				      \
      do								      \
	{								      \
	  char __tmp = *__a;						      \
	  *__a++ = *__b;						      \
	  *__b++ = __tmp;						      \
	} while (--__size > 0);						      \
    } while (0)

/* Discontinue quicksort algorithm when partition gets below this size.
   This particular magic number was chosen to work best on a Sun 4/260. */
#define MAX_THRESH 4

/* Stack node declarations used to store unfulfilled partition obligations. */
typedef struct {
	void *lo;
	void *hi;
} stack_node;

/* Order size using quicksort.  This implementation incorporates
   four optimizations discussed in Sedgewick:

   1. Non-recursive, using an explicit stack of pointer that store the
      next array partition to sort.  To save time, this maximum amount
      of space required to store an array of SIZE_MAX is allocated on the
      stack.  Assuming a 32-bit (64 bit) integer for size_t, this needs
      only 32 * sizeof(stack_node) == 256 bytes (for 64 bit: 1024 bytes).
      Pretty cheap, actually.

   2. Chose the pivot element using a median-of-three decision tree.
      This reduces the probability of selecting a bad pivot value and
      eliminates certain extraneous comparisons.

   3. Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving
      insertion sort to order the MAX_THRESH items within each partition.
      This is a big win, since insertion sort is faster for small, mostly
      sorted array segments.

   4. The larger of the two sub-partitions is always pushed onto the
      stack first, with the algorithm then concentrating on the
      smaller partition.  This *guarantees* no more than log (total_elems)
      stack size is needed (actually O(1) in this case)!  */

void qsort(void *base, size_t num, size_t size,
	   int (*cmp) (const void *, const void *))
{
	const size_t max_thresh = MAX_THRESH * size;
	void *hi, *lo, *mid, *left, *right;
	void *end = base + (size * (num - 1));
	void *tmp = base;
	void *thresh = min(end, base + max_thresh);
	void *run, *trav;
	stack_node *stack, *top;

	if (num == 0)
		return;

	lo = base;
	hi = lo + size * (num - 1);
	if (num > MAX_THRESH) {
		stack = malloc(8 * sizeof(size_t) * sizeof(stack_node));
		top = stack + 1;

		while (stack < top) {
			/* Select median value from among LO, MID, and
			   HI. Rearrange LO and HI so the three values
			   are sorted. This lowers the probability of
			   picking a pathological pivot value and
			   skips a comparison for both the LEFT
			   and RIGHT in the while loops. */

			mid = lo + size * ((hi - lo) / size >> 1);

			if (cmp(mid, lo) < 0)
				SWAP(mid, lo, size);
			if (cmp(hi, mid) < 0) {
				SWAP(mid, hi, size);
				if (cmp(mid, lo) < 0)
					SWAP(mid, lo, size);
			}

			left = lo + size;
			right = hi - size;

			/* Here's the famous ``collapse the walls''
			   section of quicksort. Gotta like those
			   tight inner loops! They are the main reason
			   that this algorithm runs much faster than
			   others. */

			do {
				while (cmp(left, mid) < 0)
					left += size;
				while (cmp(mid, right) < 0)
					right -= size;

				if (left < right) {
					SWAP(left, right, size);
					if (mid == left)
						mid = right;
					else if (mid == right)
						mid = left;
					left += size;
					right -= size;
				} else if (left == right) {
					left += size;
					right -= size;
					break;
				}
			}
			while (left <= right);

			/* Set up pointers for next iteration. First
			   determine whether left and right partitions
			   are below the threshold size. If so, ignore
			   one or both. Otherwise, push the larger
			   partition's bounds on the stack and
			   continue sorting the smaller one. */

			if ((right - lo) <= max_thresh) {
				if ((hi - left) <= max_thresh) {
					/* Ignore both small partitions. */
					--top;
					lo = top->lo;
					hi = top->hi;
				} else
					/* Ignore small left partition. */
					lo = left;
			} else if ((hi - left) <= max_thresh)
				/* Ignore small right partition. */
				hi = right;
			else if ((right - lo) > (hi - left)) {
				/* Push larger left partition indices. */
				top->lo = lo;
				top->hi = right;
				top++;
				lo = left;
			} else {
				/* Push larger right partition indices. */
				top->lo = left;
				top->hi = hi;
				top++;
				hi = right;
			}
		}

		free(stack);
	}

	/* Once the BASE array is partially sorted by quicksort
	   the rest is completely sorted using insertion sort, since
	   this is efficient for partitions below MAX_THRESH size.
	   BASE points to the beginning of the array to sort, and
	   END points at the very last element in the array (*not*
	   one beyond it!). */

	/* Find smallest element in first threshold and place it at
	   the array's beginning. This is the smallest array element,
	   and the operation speeds up insertion sort's inner loop. */

	for (run = tmp + size; run <= thresh; run += size)
		if (cmp(run, tmp) < 0)
			tmp = run;

	if (tmp != base)
		SWAP(tmp, base, size);

	/* Insertion sort, running from left-hand-side up to
	   right-hand-side.  */

	run = base + size;
	while ((run += size) <= end) {
		tmp = run - size;
		while (cmp(run, tmp) < 0)
			tmp -= size;

		tmp += size;
		if (tmp != run) {
			trav = run + size;
			while (--trav >= run) {
				char c = *(char *)trav;
				for (hi = lo = trav;
				     (lo -= size) >= tmp; hi = lo)
					*(char *)hi = *(char *)lo;
				*(char *)hi = c;
			}
		}
	}
}



-- 
Mathematics is the supreme nostalgia of our time.

* Re: [patch 1/13] Qsort
  2005-01-22 21:00   ` vlobanov
@ 2005-01-23  2:03     ` Felipe Alfaro Solana
  2005-01-23  2:39       ` Andi Kleen
                         ` (2 more replies)
  2005-01-23 21:24     ` Richard Henderson
  1 sibling, 3 replies; 85+ messages in thread
From: Felipe Alfaro Solana @ 2005-01-23  2:03 UTC (permalink / raw)
  To: vlobanov
  Cc: Trond Myklebust, linux-kernel, Buck Huppmann, Neil Brown,
	Andreas Gruenbacher, Andries E. Brouwer, Andrew Morton,
	Olaf Kirch

On 22 Jan 2005, at 22:00, vlobanov wrote:

> Hi,
>
> I was just reading over the patch, and had a quick question/comment 
> upon
> the SWAP macro defined below. I think it's possible to do a tiny bit
> better (better, of course, being subjective), as follows:
>
> #define SWAP(a, b, size)			\
>     do {					\
> 	register size_t __size = (size);	\
> 	register char * __a = (a), * __b = (b);	\
> 	do {					\
> 	    *__a ^= *__b;			\
> 	    *__b ^= *__a;			\
> 	    *__a ^= *__b;			\
> 	    __a++;				\
> 	    __b++;				\
> 	} while ((--__size) > 0);		\
>     } while (0)
>
> What do you think? :)

AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
operations. Also, since the original patch uses 3 MOVs to perform the
swapping, and your version uses 3 XOR operations, I don't see any
gains.

Am I missing something?



* Re: [patch 1/13] Qsort
  2005-01-23  2:03     ` Felipe Alfaro Solana
@ 2005-01-23  2:39       ` Andi Kleen
  2005-01-23  3:02         ` Jesper Juhl
                           ` (2 more replies)
  2005-01-23  4:22       ` Matt Mackall
  2005-01-23  5:44       ` Willy Tarreau
  2 siblings, 3 replies; 85+ messages in thread
From: Andi Kleen @ 2005-01-23  2:39 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: Trond Myklebust, linux-kernel, Buck Huppmann, Neil Brown,
	Andreas Gruenbacher, Andries E. Brouwer, Andrew Morton,
	Olaf Kirch

Felipe Alfaro Solana <lkml@mac.com> writes:
>
> AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
> operations. Also, since the original patch uses 3 MOVs to perform the
> swapping, and your version uses 3 XOR operations, I don't see any
> gains.

Both are one cycle latency for register<->register on all x86 cores
I've looked at. What makes you think differently?

-Andi (who thinks the glibc qsort is vast overkill for kernel purposes
where there are only small data sets and it would be better to use a 
simpler one optimized for code size)



* Re: [patch 1/13] Qsort
  2005-01-23  2:39       ` Andi Kleen
@ 2005-01-23  3:02         ` Jesper Juhl
  2005-01-23  4:46           ` Andi Kleen
  2005-01-23  4:29         ` Matt Mackall
  2005-01-23  4:58         ` Felipe Alfaro Solana
  2 siblings, 1 reply; 85+ messages in thread
From: Jesper Juhl @ 2005-01-23  3:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Sun, 23 Jan 2005, Andi Kleen wrote:

> Felipe Alfaro Solana <lkml@mac.com> writes:
> >
> > AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
> > operations. Also, since the original patch uses 3 MOVs to perform the
> > swapping, and your version uses 3 XOR operations, I don't see any
> > gains.
> 
> Both are one cycle latency for register<->register on all x86 cores
> I've looked at. What makes you think differently?
> 
> -Andi (who thinks the glibc qsort is vast overkill for kernel purposes
> where there are only small data sets and it would be better to use a 
> simpler one optimized for code size)
> 
How about a shell sort?  If the data is mostly sorted, shell sort often
beats qsort, and since the data sets are often small in-kernel, shell
sort's O(n^2) behaviour won't harm it too much. Shell sort is also
faster if the data is already completely sorted. Shell sort is certainly
not the simplest algorithm around, but I think (without having done any
tests) that it would probably do pretty well for in-kernel use... Then
again, I've been known to be wrong :)
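
For reference, a minimal shell sort sketch with a qsort-style interface.
This is only an illustration: the names shell_sort, swap_bytes and
cmp_int are made up here, and the gap sequence is the plain n/2, n/4,
..., 1 halving, not a tuned one. On input that is already sorted each
gap pass only compares and never swaps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte-wise swap of two size-byte elements. */
static void swap_bytes(char *a, char *b, size_t size)
{
	while (size--) {
		char tmp = *a;
		*a++ = *b;
		*b++ = tmp;
	}
}

/* Shell sort with the simple gap sequence n/2, n/4, ..., 1.
 * Iterative and in place: no recursion, no auxiliary stack. */
static void shell_sort(void *base, size_t nmemb, size_t size,
		       int (*cmp)(const void *, const void *))
{
	char *b = base;
	size_t gap, i, j;

	assert(size > 0);
	for (gap = nmemb / 2; gap > 0; gap /= 2)
		for (i = gap; i < nmemb; i++)
			/* Gapped insertion: move element i down its chain. */
			for (j = i; j >= gap &&
			     cmp(b + (j - gap) * size, b + j * size) > 0;
			     j -= gap)
				swap_bytes(b + (j - gap) * size,
					   b + j * size, size);
}

/* Example comparison callback for int elements. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}
```

A call looks just like qsort: shell_sort(v, n, sizeof(v[0]), cmp_int).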


-- 
Jesper Juhl



* Re: [patch 1/13] Qsort
  2005-01-23  2:03     ` Felipe Alfaro Solana
  2005-01-23  2:39       ` Andi Kleen
@ 2005-01-23  4:22       ` Matt Mackall
  2005-01-23  5:44       ` Willy Tarreau
  2 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-23  4:22 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: vlobanov, Trond Myklebust, linux-kernel, Buck Huppmann,
	Neil Brown, Andreas Gruenbacher, Andries E. Brouwer,
	Andrew Morton, Olaf Kirch

On Sun, Jan 23, 2005 at 03:03:32AM +0100, Felipe Alfaro Solana wrote:
> On 22 Jan 2005, at 22:00, vlobanov wrote:
> 
> >Hi,
> >
> >I was just reading over the patch, and had a quick question/comment 
> >upon
> >the SWAP macro defined below. I think it's possible to do a tiny bit
> >better (better, of course, being subjective), as follows:
> >
> >#define SWAP(a, b, size)			\
> >    do {					\
> >	register size_t __size = (size);	\
> >	register char * __a = (a), * __b = (b);	\
> >	do {					\
> >	    *__a ^= *__b;			\
> >	    *__b ^= *__a;			\
> >	    *__a ^= *__b;			\
> >	    __a++;				\
> >	    __b++;				\
> >	} while ((--__size) > 0);		\
> >    } while (0)
> >
> >What do you think? :)
> 
> AFAIK, XOR is quite expensive on IA32 when compared to simple MOV 
> operations. Also, since the original patch uses 3 MOVs to perform the
> swapping, and your version uses 3 XOR operations, I don't see any 
> gains.
> 
> Am I missing something?

No temporary variable needed in the xor version. mov and xor are
roughly the same speed, but xor modifies flags.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [patch 1/13] Qsort
  2005-01-23  2:39       ` Andi Kleen
  2005-01-23  3:02         ` Jesper Juhl
@ 2005-01-23  4:29         ` Matt Mackall
  2005-01-24  0:21           ` Nathan Scott
  2005-01-24  4:02           ` Horst von Brand
  2005-01-23  4:58         ` Felipe Alfaro Solana
  2 siblings, 2 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-23  4:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Sun, Jan 23, 2005 at 03:39:34AM +0100, Andi Kleen wrote:
> Felipe Alfaro Solana <lkml@mac.com> writes:
> >
> > AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
> > operations. Also, since the original patch uses 3 MOVs to perform the
> > swapping, and your version uses 3 XOR operations, I don't see any
> > gains.
> 
> Both are one cycle latency for register<->register on all x86 cores
> I've looked at. What makes you think differently?
> 
> -Andi (who thinks the glibc qsort is vast overkill for kernel purposes
> where there are only small data sets and it would be better to use a 
> simpler one optimized for code size)

Mostly agreed. Except:

a) the glibc version is not actually all that optimized
b) it's nice that it's not recursive
c) the three-way median selection does help avoid worst-case O(n^2)
behavior, which might potentially be triggerable by users in places
like XFS where this is used
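
To illustrate point c), here is a rough sketch of median-of-three pivot
selection. It is not the glibc code (which, as far as I can tell, sorts
the three candidates in place); median3, pick_pivot and cmp_int are
hypothetical names used only for this example. It makes the sorted and
reverse-sorted inputs harmless, but it does not remove the O(n^2) worst
case entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return whichever of lo, mid, hi holds the median. */
static char *median3(char *lo, char *mid, char *hi,
		     int (*cmp)(const void *, const void *))
{
	if (cmp(lo, mid) < 0) {
		if (cmp(mid, hi) < 0)
			return mid;		/* lo < mid < hi */
		return cmp(lo, hi) < 0 ? hi : lo;
	}
	if (cmp(lo, hi) < 0)
		return lo;			/* mid <= lo < hi */
	return cmp(mid, hi) < 0 ? hi : mid;
}

/* Pick a pivot from the first, middle and last elements of the array. */
static char *pick_pivot(char *base, size_t nmemb, size_t size,
			int (*cmp)(const void *, const void *))
{
	assert(nmemb > 0);
	return median3(base, base + (nmemb / 2) * size,
		       base + (nmemb - 1) * size, cmp);
}

/* Example comparison callback for int elements. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}
```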

I'll probably whip up a simpler version tomorrow or Monday and do some
size/space benchmarking. I've been meaning to contribute a qsort for
doubly-linked lists I've got lying around as well.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [patch 1/13] Qsort
  2005-01-23  3:02         ` Jesper Juhl
@ 2005-01-23  4:46           ` Andi Kleen
  2005-01-23  5:05             ` Jesper Juhl
  2005-01-24 22:04             ` Mike Waychison
  0 siblings, 2 replies; 85+ messages in thread
From: Andi Kleen @ 2005-01-23  4:46 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

> How about a shell sort?  If the data is mostly sorted, shell sort often
> beats qsort, and since the data sets are often small in-kernel, shell
> sort's O(n^2) behaviour won't harm it too much. Shell sort is also
> faster if the data is already completely sorted. Shell sort is certainly
> not the simplest algorithm around, but I think (without having done any
> tests) that it would probably do pretty well for in-kernel use... Then
> again, I've been known to be wrong :)

I like shell sort for small data sets too. And I agree it would be
appropriate for the kernel.

-Andi


* Re: [patch 1/13] Qsort
  2005-01-23  2:39       ` Andi Kleen
  2005-01-23  3:02         ` Jesper Juhl
  2005-01-23  4:29         ` Matt Mackall
@ 2005-01-23  4:58         ` Felipe Alfaro Solana
  2005-01-24 21:20           ` Matt Mackall
  2 siblings, 1 reply; 85+ messages in thread
From: Felipe Alfaro Solana @ 2005-01-23  4:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Brown, linux-kernel, Buck Huppmann, Trond Myklebust,
	Andreas Gruenbacher, Andries E. Brouwer, Andrew Morton,
	Olaf Kirch

On 23 Jan 2005, at 03:39, Andi Kleen wrote:

> Felipe Alfaro Solana <lkml@mac.com> writes:
>>
>> AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
>> operations. Also, since the original patch uses 3 MOVs to perform the
>> swapping, and your version uses 3 XOR operations, I don't see any
>> gains.
>
> Both are one cycle latency for register<->register on all x86 cores
> I've looked at. What makes you think differently?

I thought XOR was more expensive. Anyway, I still don't see any
advantage in replacing 3 MOVs with 3 XORs.



* Re: [patch 1/13] Qsort
  2005-01-23  4:46           ` Andi Kleen
@ 2005-01-23  5:05             ` Jesper Juhl
  2005-01-23 10:37               ` Rafael J. Wysocki
                                 ` (2 more replies)
  2005-01-24 22:04             ` Mike Waychison
  1 sibling, 3 replies; 85+ messages in thread
From: Jesper Juhl @ 2005-01-23  5:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jesper Juhl, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Sun, 23 Jan 2005, Andi Kleen wrote:

> > How about a shell sort?  If the data is mostly sorted, shell sort often
> > beats qsort, and since the data sets are often small in-kernel, shell
> > sort's O(n^2) behaviour won't harm it too much. Shell sort is also
> > faster if the data is already completely sorted. Shell sort is certainly
> > not the simplest algorithm around, but I think (without having done any
> > tests) that it would probably do pretty well for in-kernel use... Then
> > again, I've been known to be wrong :)
> 
> I like shell sort for small data sets too. And I agree it would be 
> appropriate for the kernel.
> 
Even with large data sets that are mostly unsorted, shell sort's
performance is close to qsort, and there's an optimization that gives it
O(n^(3/2)) runtime (IIRC). Another nice property is that it's iterative,
so it doesn't eat up stack space (as opposed to qsort, which is recursive
and eats stack like ****)...
Yeah, I think shell sort would be good for the kernel.


-- 
Jesper Juhl





* Re: [patch 1/13] Qsort
  2005-01-22 23:28   ` Matt Mackall
  2005-01-23  0:21     ` Matt Mackall
@ 2005-01-23  5:08     ` Andreas Gruenbacher
  2005-01-23  5:32       ` Matt Mackall
  1 sibling, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23  5:08 UTC (permalink / raw)
  To: Matt Mackall
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

On Sunday 23 January 2005 00:28, Matt Mackall wrote:
> So the stack is going to be either 256 or 1024 bytes. Seems like we
> ought to kmalloc it.

This will do. I didn't check if the +1 is strictly needed.

-      stack_node stack[STACK_SIZE];
+      stack_node stack[fls(size) - fls(MAX_THRESH) + 1];
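
To get a feel for the numbers, here is a small userspace sketch that
evaluates the expression above. Assumptions: fls_sketch stands in for
the kernel's fls() (1-based index of the highest set bit, 0 for 0),
MAX_THRESH is taken to be 4 as in glibc, stack_slots is a made-up helper
name, and the bound is computed from the number of elements. It only
evaluates the formula; it does not prove the bound is sufficient.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_THRESH 4	/* assumed value; glibc's qsort uses 4 */

/* Userspace stand-in for the kernel's fls(): 1-based index of the
 * most significant set bit, 0 when x is 0. */
static int fls_sketch(size_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Hypothetical helper: stack slots for an array of nmemb elements. */
static int stack_slots(size_t nmemb)
{
	int slots = fls_sketch(nmemb) - fls_sketch(MAX_THRESH) + 1;

	/* Never more than the fixed CHAR_BIT * sizeof(size_t) stack. */
	assert(slots <= (int)(CHAR_BIT * sizeof(size_t)));
	return slots > 0 ? slots : 1;	/* keep the array size positive */
}
```

For 1024 elements this gives 11 - 3 + 1 = 9 slots instead of a fixed 32
or 64.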

-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


* Re: [patch 1/13] Qsort
  2005-01-23  5:08     ` Andreas Gruenbacher
@ 2005-01-23  5:32       ` Matt Mackall
  2005-01-23 12:22         ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Matt Mackall @ 2005-01-23  5:32 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

On Sun, Jan 23, 2005 at 06:08:36AM +0100, Andreas Gruenbacher wrote:
> On Sunday 23 January 2005 00:28, Matt Mackall wrote:
> > So the stack is going to be either 256 or 1024 bytes. Seems like we
> > ought to kmalloc it.
> 
> This will do. I didn't check if the +1 is strictly needed.
> 
> -      stack_node stack[STACK_SIZE];
> +      stack_node stack[fls(size) - fls(MAX_THRESH) + 1];

Yes, indeed. Though I think even here, we'd prefer to use kmalloc
because gcc generates suboptimal code for variable-sized stack vars.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [patch 1/13] Qsort
  2005-01-23  2:03     ` Felipe Alfaro Solana
  2005-01-23  2:39       ` Andi Kleen
  2005-01-23  4:22       ` Matt Mackall
@ 2005-01-23  5:44       ` Willy Tarreau
  2 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2005-01-23  5:44 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: vlobanov, Trond Myklebust, linux-kernel, Buck Huppmann,
	Neil Brown, Andreas Gruenbacher, Andries E. Brouwer,
	Andrew Morton, Olaf Kirch

Hi,

On Sun, Jan 23, 2005 at 03:03:32AM +0100, Felipe Alfaro Solana wrote:
> On 22 Jan 2005, at 22:00, vlobanov wrote:
> >#define SWAP(a, b, size)			\
> >    do {					\
> >	register size_t __size = (size);	\
> >	register char * __a = (a), * __b = (b);	\
> >	do {					\
> >	    *__a ^= *__b;			\
> >	    *__b ^= *__a;			\
> >	    *__a ^= *__b;			\
> >	    __a++;				\
> >	    __b++;				\
> >	} while ((--__size) > 0);		\
> >    } while (0)
> >
> >What do you think? :)
> 
> AFAIK, XOR is quite expensive on IA32 when compared to simple MOV 
> operations. Also, since the original patch uses 3 MOVs to perform the
> swapping, and your version uses 3 XOR operations, I don't see any 
> gains.

It will be even worse because we are accessing memory, and most architectures
will not be able to use a memory reference for both operands of the XOR.
Basically, what will be generated will look like this:

  tmp = *b
  *a ^= tmp
  tmp ^= *a
  *b = tmp
  *a ^= tmp

which is 5 cycles, or 4 if the last two instructions get merged. And there
are 3 memory reads + 3 memory writes (assuming that the CPU will be smart
enough to reuse *a without accessing memory at instruction 3).

The move version is considerably faster:

   tmp1 = *a
   tmp2 = *b
   *a = tmp2
   *b = tmp1

This is 4 cycles on simple CPUs, or even 2 cycles on most of today's CPUs,
which can do the first two fetches at once and the last two writes at once.
And there are only two reads and two writes.

Clearly this one is better.
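
For anyone who wants to compare the generated code themselves, both
variants can be written as plain C functions along these lines. The
names swap_tmp and swap_xor are made up for this example and are not
taken from the patch. Besides the extra memory traffic, the XOR form has
one more trap: if both pointers refer to the same element, it zeroes it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helpers only; these names do not appear in the patch. */

/* Swap via a temporary: two loads and two stores per byte. */
static void swap_tmp(char *a, char *b, size_t size)
{
	assert(size > 0);
	do {
		char tmp = *a;

		*a++ = *b;
		*b++ = tmp;
	} while (--size > 0);
}

/* XOR swap: no temporary in the source, but each step is a
 * read-modify-write on memory, and a == b zeroes the element. */
static void swap_xor(char *a, char *b, size_t size)
{
	assert(size > 0);
	do {
		*a ^= *b;
		*b ^= *a;
		*a ^= *b;
		a++;
		b++;
	} while (--size > 0);
}
```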

Regards,
Willy



* Re: [patch 1/13] Qsort
  2005-01-23  5:05             ` Jesper Juhl
@ 2005-01-23 10:37               ` Rafael J. Wysocki
  2005-01-24  4:29                 ` Horst von Brand
  2005-01-24 15:45               ` Alan Cox
  2005-01-24 17:10               ` H. Peter Anvin
  2 siblings, 1 reply; 85+ messages in thread
From: Rafael J. Wysocki @ 2005-01-23 10:37 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: Andi Kleen, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Sunday, 23 of January 2005 06:05, Jesper Juhl wrote:
> On Sun, 23 Jan 2005, Andi Kleen wrote:
> 
> > > How about a shell sort?  If the data is mostly sorted, shell sort often
> > > beats qsort, and since the data sets are often small in-kernel, shell
> > > sort's O(n^2) behaviour won't harm it too much. Shell sort is also
> > > faster if the data is already completely sorted. Shell sort is certainly
> > > not the simplest algorithm around, but I think (without having done any
> > > tests) that it would probably do pretty well for in-kernel use... Then
> > > again, I've been known to be wrong :)
> > 
> > I like shell sort for small data sets too. And I agree it would be 
> > appropriate for the kernel.
> > 
> Even with large data sets that are mostly unsorted, shell sort's performance
> is close to qsort, and there's an optimization that gives it O(n^(3/2)) 
> runtime (IIRC),

Yes, there is.

> and another nice property is that it's iterative so it  
> doesn't eat up stack space (as opposed to qsort which is recursive and eats
> stack like ****)...

To be precise, one needs ~(log N) of stack space for qsort, and frankly, one
should use something like the shell (or should I say Shell?) sort for sorting
small sets of elements in qsort as well.
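
To make the ~(log N) bound concrete, here is a sketch of a non-recursive
quicksort that always defers the larger partition and keeps working on
the smaller one, so the explicit stack never needs more entries than
there are bits in size_t. It sorts plain ints with a simple last-element
pivot to keep the example short; quicksort_iter, partition and struct
range are names invented for this illustration, not the patch's code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define STACK_SLOTS (CHAR_BIT * sizeof(size_t))

struct range {
	size_t lo, hi;		/* half-open: [lo, hi) */
};

static void swap_int(int *a, int *b)
{
	int t = *a;

	*a = *b;
	*b = t;
}

/* Lomuto partition of [lo, hi) around the last element;
 * returns the pivot's final index. */
static size_t partition(int *v, size_t lo, size_t hi)
{
	int pivot = v[hi - 1];
	size_t i = lo, j;

	for (j = lo; j + 1 < hi; j++)
		if (v[j] < pivot)
			swap_int(&v[i++], &v[j]);
	swap_int(&v[i], &v[hi - 1]);
	return i;
}

static void quicksort_iter(int *v, size_t n)
{
	struct range stack[STACK_SLOTS];
	size_t top = 0, lo = 0, hi = n;

	for (;;) {
		while (hi - lo > 1) {
			size_t p = partition(v, lo, hi);

			assert(top < STACK_SLOTS);
			if (p - lo < hi - (p + 1)) {
				/* Right part is larger: defer it. */
				stack[top].lo = p + 1;
				stack[top].hi = hi;
				hi = p;
			} else {
				/* Left part is larger or equal: defer it. */
				stack[top].lo = lo;
				stack[top].hi = p;
				lo = p + 1;
			}
			top++;
		}
		if (top == 0)
			break;
		top--;
		lo = stack[top].lo;
		hi = stack[top].hi;
	}
}
```

Because the range being worked on at least halves with every push, the
stack depth stays logarithmic even when the running time degrades to
O(n^2) on bad input.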

> Yeah, I think shell sort would be good for the kernel.

I agree.

Greets,
RJW


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"


* Re: [patch 1/13] Qsort
  2005-01-23  5:32       ` Matt Mackall
@ 2005-01-23 12:22         ` Andreas Gruenbacher
  2005-01-23 16:49           ` Matt Mackall
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-23 12:22 UTC (permalink / raw)
  To: Matt Mackall
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Andrew Morton

On Sunday 23 January 2005 06:32, Matt Mackall wrote:
> Yes, indeed. Though I think even here, we'd prefer to use kmalloc
> because gcc generates suboptimal code for variable-sized stack vars.

That's ridiculous. kmalloc isn't even close to whatever suboptimal code gcc 
might produce here. Also I'm not convinced that gcc generates bad code in the 
first place. The code I get makes perfect sense.

-- Andreas.


* Re: [patch 1/13] Qsort
  2005-01-23 12:22         ` Andreas Gruenbacher
@ 2005-01-23 16:49           ` Matt Mackall
  0 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-23 16:49 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Andrew Morton

On Sun, Jan 23, 2005 at 01:22:13PM +0100, Andreas Gruenbacher wrote:
> On Sunday 23 January 2005 06:32, Matt Mackall wrote:
> > Yes, indeed. Though I think even here, we'd prefer to use kmalloc
> > because gcc generates suboptimal code for variable-sized stack vars.
> 
> That's ridiculous. kmalloc isn't even close to whatever suboptimal
> code gcc might produce here. Also I'm not convinced that gcc
> generates bad code in the first place. The code I get makes perfect
> sense.

Fixed-sized slab-based kmalloc is O(1) (and pretty darn fast). If we
take a constant overhead for every local variable lookup in qsort,
that's O(n log n). Putting the stack vars last might fix that, but I
think it needs testing. I'll try it.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [patch 1/13] Qsort
  2005-01-22 21:00   ` vlobanov
  2005-01-23  2:03     ` Felipe Alfaro Solana
@ 2005-01-23 21:24     ` Richard Henderson
  1 sibling, 0 replies; 85+ messages in thread
From: Richard Henderson @ 2005-01-23 21:24 UTC (permalink / raw)
  To: vlobanov
  Cc: Andreas Gruenbacher, linux-kernel, Neil Brown, Trond Myklebust,
	Olaf Kirch, Andries E. Brouwer, Buck Huppmann, Andrew Morton

On Sat, Jan 22, 2005 at 01:00:24PM -0800, vlobanov wrote:
> #define SWAP(a, b, size)			\
>     do {					\
> 	register size_t __size = (size);	\
> 	register char * __a = (a), * __b = (b);	\
> 	do {					\
> 	    *__a ^= *__b;			\
> 	    *__b ^= *__a;			\
> 	    *__a ^= *__b;			\
> 	    __a++;				\
> 	    __b++;				\
> 	} while ((--__size) > 0);		\
>     } while (0)
> 
> What do you think? :)

I think you'll confuse the compiler and get worse results.


r~


* Re: [patch 1/13] Qsort
  2005-01-23  4:29         ` Matt Mackall
@ 2005-01-24  0:21           ` Nathan Scott
  2005-01-24  2:57             ` Matt Mackall
  2005-01-24  4:02           ` Horst von Brand
  1 sibling, 1 reply; 85+ messages in thread
From: Nathan Scott @ 2005-01-24  0:21 UTC (permalink / raw)
  To: Matt Mackall, Andreas Gruenbacher
  Cc: Andi Kleen, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andries E. Brouwer, Andrew Morton,
	Olaf Kirch

On Sat, Jan 22, 2005 at 08:29:30PM -0800, Matt Mackall wrote:
> On Sun, Jan 23, 2005 at 03:39:34AM +0100, Andi Kleen wrote:
> 
> c) the three-way median selection does help avoid worst-case O(n^2)
> behavior, which might potentially be triggerable by users in places
> like XFS where this is used

XFS's needs are simple - we're just sorting dirents within a
single directory block or smaller, and sorting EA lists/ACLs -
all of which are small arrays, so a qsort optimised for small
arrays suits XFS well.  Take care not to put any arrays on the
stack though, else the CONFIG_4KSTACKS punters won't be happy.

cheers.

-- 
Nathan


* Re: [patch 1/13] Qsort
  2005-01-24  0:21           ` Nathan Scott
@ 2005-01-24  2:57             ` Matt Mackall
  0 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-24  2:57 UTC (permalink / raw)
  To: Nathan Scott
  Cc: Andreas Gruenbacher, Andi Kleen, Felipe Alfaro Solana,
	Trond Myklebust, linux-kernel, Buck Huppmann, Neil Brown,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Mon, Jan 24, 2005 at 11:21:29AM +1100, Nathan Scott wrote:
> On Sat, Jan 22, 2005 at 08:29:30PM -0800, Matt Mackall wrote:
> > On Sun, Jan 23, 2005 at 03:39:34AM +0100, Andi Kleen wrote:
> > 
> > c) the three-way median selection does help avoid worst-case O(n^2)
> > behavior, which might potentially be triggerable by users in places
> > like XFS where this is used
> 
> XFS's needs are simple - we're just sorting dirents within a
> single directory block or smaller, and sorting EA lists/ACLs -
> all of which are small arrays, so a qsort optimised for small
> arrays suits XFS well. 

Ok, I've worked up a much smaller, cleaner version that wins on lists
of 10000 entries or less and is still within 5% at 1M entries (i.e. well
past what any kernel code has any business doing). More after I've
fiddled around a bit more with the benchmarks.

> Take care not to put any arrays on the
> stack though, else the CONFIG_4KSTACKS punters won't be happy.

I'm afraid I'm one of those punters - 4k stacks were getting cleaned up and
tested in my -tiny tree long before mainline.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [patch 1/13] Qsort
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
                     ` (2 preceding siblings ...)
  2005-01-22 23:28   ` Matt Mackall
@ 2005-01-24  3:48   ` Horst von Brand
  2005-01-24 20:15   ` [PATCH] lib/qsort Matt Mackall
  4 siblings, 0 replies; 85+ messages in thread
From: Horst von Brand @ 2005-01-24  3:48 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

Andreas Gruenbacher <agruen@suse.de> said:
> Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
> Acked-by: Olaf Kirch <okir@suse.de>

[...]

> +/* Order size using quicksort.  This implementation incorporates
> +   four optimizations discussed in Sedgewick:
> +
> +   1. Non-recursive, using an explicit stack of pointer that store the
> +      next array partition to sort.  To save time, this maximum amount
> +      of space required to store an array of SIZE_MAX is allocated on the
> +      stack.  Assuming a 32-bit (64 bit) integer for size_t, this needs
> +      only 32 * sizeof(stack_node) == 256 bytes (for 64 bit: 1024 bytes).
> +      Pretty cheap, actually.

Not really, given the strict size restrictions in-kernel.

Has there been any comparison between the original and this one? Code size,
stack use, speed, ...?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: [patch 1/13] Qsort
  2005-01-23  4:29         ` Matt Mackall
  2005-01-24  0:21           ` Nathan Scott
@ 2005-01-24  4:02           ` Horst von Brand
  2005-01-24 21:57             ` Matt Mackall
  1 sibling, 1 reply; 85+ messages in thread
From: Horst von Brand @ 2005-01-24  4:02 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

Matt Mackall <mpm@selenic.com> said:
> On Sun, Jan 23, 2005 at 03:39:34AM +0100, Andi Kleen wrote:

[...]

> > -Andi (who thinks the glibc qsort is vast overkill for kernel purposes
> > where there are only small data sets and it would be better to use a 
> > simpler one optimized for code size)

> Mostly agreed. Except:
> 
> a) the glibc version is not actually all that optimized
> b) it's nice that it's not recursive
> c) the three-way median selection does help avoid worst-case O(n^2)
> behavior, which might potentially be triggerable by users in places
> like XFS where this is used

Shellsort is much simpler, and not much slower for small datasets. Plus no
extra space for stacks.

> I'll probably whip up a simpler version tomorrow or Monday and do some
> size/space benchmarking. I've been meaning to contribute a qsort for
> doubly-linked lists I've got lying around as well.

Qsort is OK as long as you have direct access to each element. In case of
lists, it is better to just use mergesort.
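
A minimal sketch of what that looks like for a singly-linked list, since
merging only ever walks the lists front to back. struct node, merge and
list_sort_sketch are hypothetical names for this example, and the
top-down split is recursive (depth about log2 of the list length), so a
kernel version would probably want a bottom-up variant instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration. */
struct node {
	int key;
	struct node *next;
};

/* Merge two sorted lists; stable, taking from a on ties. */
static struct node *merge(struct node *a, struct node *b)
{
	struct node head, *tail = &head;

	while (a && b) {
		if (b->key < a->key) {
			tail->next = b;
			b = b->next;
		} else {
			tail->next = a;
			a = a->next;
		}
		tail = tail->next;
	}
	tail->next = a ? a : b;
	return head.next;
}

/* Top-down merge sort: split with slow/fast pointers, sort both
 * halves, then merge them. */
static struct node *list_sort_sketch(struct node *list)
{
	struct node *slow, *fast, *second;

	if (!list || !list->next)
		return list;
	slow = list;
	fast = list->next;
	while (fast && fast->next) {
		slow = slow->next;
		fast = fast->next->next;
	}
	second = slow->next;
	assert(second != NULL);	/* at least two nodes, so never empty */
	slow->next = NULL;
	return merge(list_sort_sketch(list), list_sort_sketch(second));
}
```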
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: [patch 1/13] Qsort
  2005-01-23 10:37               ` Rafael J. Wysocki
@ 2005-01-24  4:29                 ` Horst von Brand
  0 siblings, 0 replies; 85+ messages in thread
From: Horst von Brand @ 2005-01-24  4:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jesper Juhl, Andi Kleen, Felipe Alfaro Solana, Trond Myklebust,
	linux-kernel, Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

"Rafael J. Wysocki" <rjw@sisk.pl> said:

[...]

> To be precise, one needs ~(log N) of stack space for qsort, and frankly, one
> should use something like the shell (or should I say Shell?)

Shell. It is named for a person.

>                                                              sort for sorting
> small sets of elements in qsort as well.

It makes no sense for smallish sets; insertion sort is better.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: [patch 1/13] Qsort
  2005-01-23  5:05             ` Jesper Juhl
  2005-01-23 10:37               ` Rafael J. Wysocki
@ 2005-01-24 15:45               ` Alan Cox
  2005-01-24 17:10               ` H. Peter Anvin
  2 siblings, 0 replies; 85+ messages in thread
From: Alan Cox @ 2005-01-24 15:45 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: Andi Kleen, Felipe Alfaro Solana, Trond Myklebust,
	Linux Kernel Mailing List, Buck Huppmann, Neil Brown,
	Andreas Gruenbacher, Andries E. Brouwer, Andrew Morton,
	Olaf Kirch

On Sun, 2005-01-23 at 05:05, Jesper Juhl wrote:
> On Sun, 23 Jan 2005, Andi Kleen wrote:
> Even with large data sets that are mostly unsorted, shell sort's performance
> is close to qsort, and there's an optimization that gives it O(n^(3/2)) 
> runtime (IIRC), and another nice property is that it's iterative so it 
> doesn't eat up stack space (as opposed to qsort which is recursive and eats
> stack like ****)...

qsort also has bad worst case performance which matters if you are
sorting data provided by a hostile source.



* Re: [patch 1/13] Qsort
  2005-01-23  5:05             ` Jesper Juhl
  2005-01-23 10:37               ` Rafael J. Wysocki
  2005-01-24 15:45               ` Alan Cox
@ 2005-01-24 17:10               ` H. Peter Anvin
  2005-01-25  0:43                 ` Horst von Brand
  2 siblings, 1 reply; 85+ messages in thread
From: H. Peter Anvin @ 2005-01-24 17:10 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.61.0501230600070.2748@dragon.hygekrogen.localhost>
By author:    Jesper Juhl <juhl-lkml@dif.dk>
In newsgroup: linux.dev.kernel
>
> On Sun, 23 Jan 2005, Andi Kleen wrote:
> 
> > > How about a shell sort?  If the data is mostly sorted, shell sort often
> > > beats qsort, and since the data sets are often small in-kernel, shell
> > > sort's O(n^2) behaviour won't harm it too much. Shell sort is also
> > > faster if the data is already completely sorted. Shell sort is certainly
> > > not the simplest algorithm around, but I think (without having done any
> > > tests) that it would probably do pretty well for in-kernel use... Then
> > > again, I've been known to be wrong :)
> > 
> > I like shell sort for small data sets too. And I agree it would be 
> > appropriate for the kernel.
> > 
> Even with large data sets that are mostly unsorted, shell sort's performance
> is close to qsort, and there's an optimization that gives it O(n^(3/2)) 
> runtime (IIRC), and another nice property is that it's iterative so it 
> doesn't eat up stack space (as opposed to qsort which is recursive and eats
> stack like ****)...
> Yeah, I think shell sort would be good for the kernel.
> 

In klibc, I use combsort:

/*
 * qsort.c
 *
 * This is actually combsort.  It's an O(n log n) algorithm with
 * simplicity/small code size being its main virtue.
 */

#include <stddef.h>
#include <string.h>

static inline size_t newgap(size_t gap)
{
  gap = (gap*10)/13;
  if ( gap == 9 || gap == 10 )
    gap = 11;

  if ( gap < 1 )
    gap = 1;
  return gap;
}

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *))
{
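  size_t gap = nmemb;
  size_t i, j;
  char *p1, *p2;
  int swapped;

  if ( !nmemb )
    return;   /* otherwise nmemb-gap below would wrap around */

  do {
    gap = newgap(gap);
    swapped = 0;

    /* One pass: compare each element with the one gap slots ahead */
    for ( i = 0, p1 = base ; i < nmemb-gap ; i++, p1 += size ) {
      j = i+gap;
      if ( compar(p1, p2 = (char *)base+j*size) > 0 ) {
        memswap(p1, p2, size);   /* klibc helper: swap two memory areas */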
        swapped = 1;
      }
    }
  } while ( gap > 1 || swapped );
}


* [PATCH] lib/qsort
  2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
                     ` (3 preceding siblings ...)
  2005-01-24  3:48   ` Horst von Brand
@ 2005-01-24 20:15   ` Matt Mackall
  2005-01-24 23:09     ` Andrew Morton
  2005-01-25  4:11     ` Matt Mackall
  4 siblings, 2 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-24 20:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Andreas Gruenbacher

This patch introduces an implementation of qsort to lib/.

I've benchmarked many variants of quicksort implementations on array sizes
from 100 elements to >1M elements with an eye to reducing instruction
and branch count to bring out performance on modern processors. The
version below was the clear winner in both size and performance. 

Here are some benchmarks of cycle count averages for 10 runs on the
same random datasets, interrupts disabled. Percentages are performance
relative to the glibc algorithm. A bunch of other variants dropped for
brevity.

name      size  description
qsort:     916  glibc algorithm
qsort2:    613  glibc algorithm without insertion sort pass
qsort_s:   324  simple version with variable-sized automatic stack
qsort_sf2: 247  simple version with pluggable swap routine (this patch)
qsort_c3:  573  simple version with median of 3 and "collapse the walls"

P4M 1.8GHz -O2 -march=i686 (cycle counts almost identical to P4 Xeon 3.2GHz):

100:
           qsort: 11568 100.00%
          qsort2: 11822 97.85%
         qsort_s: 8356 138.43%
       qsort_sf2: 4542 254.70%
        qsort_c3: 11248 102.84%
200:
           qsort: 27005 100.00%
          qsort2: 28337 95.30%
         qsort_s: 24672 109.46%
       qsort_sf2: 13387 201.72%
        qsort_c3: 28776 93.85%
400:
           qsort: 60464 100.00%
          qsort2: 63134 95.77%
         qsort_s: 54791 110.35%
       qsort_sf2: 31677 190.88%
        qsort_c3: 68228 88.62%
800:
           qsort: 144190 100.00%
          qsort2: 149240 96.62%
         qsort_s: 137439 104.91%
       qsort_sf2: 82340 175.12%
        qsort_c3: 151487 95.18%
1600:
           qsort: 315813 100.00%
          qsort2: 329444 95.86%
         qsort_s: 356588 88.57%
       qsort_sf2: 195203 161.79%
        qsort_c3: 360908 87.51%
3200:
           qsort: 725993 100.00%
          qsort2: 738060 98.37%
         qsort_s: 752978 96.42%
       qsort_sf2: 444705 163.25%
        qsort_c3: 814500 89.13%
6400:
           qsort: 1564310 100.00%
          qsort2: 1603845 97.53%
         qsort_s: 1746958 89.54%
       qsort_sf2: 1011510 154.65%
        qsort_c3: 1800720 86.87%
12800:
           qsort: 3502147 100.00%
          qsort2: 3507643 99.84%
         qsort_s: 4078681 85.86%
       qsort_sf2: 2397432 146.08%
        qsort_c3: 3976366 88.07%
25600:
           qsort: 7618627 100.00%
          qsort2: 7661898 99.44%
         qsort_s: 8708923 87.48%
       qsort_sf2: 5288890 144.05%
        qsort_c3: 8637922 88.20%
51200:
           qsort: 16009766 100.00%
          qsort2: 16339192 97.98%
         qsort_s: 18949571 84.49%
       qsort_sf2: 11511438 139.08%
        qsort_c3: 18578005 86.18%
102400:
           qsort: 34594524 100.00%
          qsort2: 35163198 98.38%
         qsort_s: 42052914 82.26%
       qsort_sf2: 25638424 134.93%
        qsort_c3: 40474691 85.47%

Opteron 1.4GHz 32-bit -O2 -march=athlon:

100:
           qsort: 8125 100.00%
          qsort2: 5531 146.90%
         qsort_s: 4534 179.18%
       qsort_sf2: 1714 474.06%
        qsort_c3: 5841 139.09%
200:
           qsort: 16019 100.00%
          qsort2: 12259 130.67%
         qsort_s: 12540 127.75%
       qsort_sf2: 4432 361.42%
        qsort_c3: 14156 113.16%
400:
           qsort: 34523 100.00%
          qsort2: 26789 128.87%
         qsort_s: 27058 127.59%
       qsort_sf2: 10152 340.05%
        qsort_c3: 33008 104.59%
800:
           qsort: 78279 100.00%
          qsort2: 61667 126.94%
         qsort_s: 65749 119.06%
       qsort_sf2: 25454 307.53%
        qsort_c3: 72988 107.25%
1600:
           qsort: 166172 100.00%
          qsort2: 135495 122.64%
         qsort_s: 169073 98.28%
       qsort_sf2: 60248 275.81%
        qsort_c3: 173264 95.91%
3200:
           qsort: 362308 100.00%
          qsort2: 302439 119.80%
         qsort_s: 361346 100.27%
       qsort_sf2: 134529 269.31%
        qsort_c3: 387407 93.52%
6400:
           qsort: 780260 100.00%
          qsort2: 651574 119.75%
         qsort_s: 855666 91.19%
       qsort_sf2: 306348 254.70%
        qsort_c3: 852795 91.49%
12800:
           qsort: 1686017 100.00%
          qsort2: 1420488 118.69%
         qsort_s: 1992462 84.62%
       qsort_sf2: 726466 232.08%
        qsort_c3: 1898620 88.80%
25600:
           qsort: 3642061 100.00%
          qsort2: 3093633 117.73%
         qsort_s: 4161486 87.52%
       qsort_sf2: 1653795 220.22%
        qsort_c3: 4120878 88.38%
51200:
           qsort: 7724747 100.00%
          qsort2: 6649277 116.17%
         qsort_s: 9248117 83.53%
       qsort_sf2: 3653293 211.45%
        qsort_c3: 8917153 86.63%
102400:
           qsort: 16478170 100.00%
          qsort2: 14305384 115.19%
         qsort_s: 20574011 80.09%
       qsort_sf2: 8322403 198.00%
        qsort_c3: 19511628 84.45%

Signed-off-by: Matt Mackall <mpm@selenic.com>

Index: mm2qs/lib/Makefile
===================================================================
--- mm2qs.orig/lib/Makefile	2005-01-20 22:11:01.000000000 -0800
+++ mm2qs/lib/Makefile	2005-01-24 01:24:17.000000000 -0800
@@ -21,6 +21,7 @@
   lib-y += dec_and_lock.o
 endif
 
+obj-$(CONFIG_QSORT) += qsort.o
 obj-$(CONFIG_CRC_CCITT)	+= crc-ccitt.o
 obj-$(CONFIG_CRC32)	+= crc32.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
Index: mm2qs/lib/qsort.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mm2qs/lib/qsort.c	2005-01-24 10:41:57.000000000 -0800
@@ -0,0 +1,94 @@
+/*
+ * A fast, small, non-recursive quicksort for the Linux kernel
+ *
+ * Jan 23 2005  Matt Mackall <mpm@selenic.com>
+ *
+ * Inspired by quicksort code from glibc and K&R
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+
+/*
+ * qsort - sort an array of elements with quicksort
+ * @base: pointer to data to sort
+ * @num: number of elements
+ * @size: size of each element
+ * @cmp: pointer to comparison function
+ * @swap: pointer to swap function
+ * @flags: allocation type for kmalloc
+ *
+ * This function does a quicksort on the given array. It is primarily
+ * tuned for small arrays, trading optimal compare and swap count for
+ * code simplicity and instruction/branch count. You can either use
+ * the qsort_swap function or, where appropriate, provide a routine
+ * optimized for your element size.
+ *
+ * This function allocates an internal stack of 256 or 512 bytes to
+ * avoid recursion overhead and may return -ENOMEM if allocation
+ * fails.
+ */
+
+int qsort(void *base, size_t num, size_t size,
+	  int (*cmp)(const void *, const void *),
+	  void (*swap)(const void *, const void *, int), int flags)
+{
+	void *i, *p, *l = base, *r = base + num * size;
+	struct stack {
+		void *l, *r;
+	} *stack, *top;
+
+	stack = top = kmalloc(8 * sizeof(int) * sizeof(struct stack), flags);
+	if (!stack)
+		return -ENOMEM;
+
+	do {
+		if (l + size >= r) {
+			/* empty sub-array, pop */
+			l = top->l;
+			r = top->r;
+			--top;
+		} else {
+			/* position the pivot element */
+			for(i = l + size, p = l; i != r; i += size)
+				if (cmp(i, l) < 0) {
+					p += size;
+					swap(i, p, size);
+				}
+			swap(l, p, size);
+
+			/* save the bigger half on the stack */
+			top++;
+			if (p - l < r - p) {
+				top->l = p + size;
+				top->r = r;
+				r = p;
+			} else {
+				top->l = l;
+				top->r = p;
+				l = p + size;
+			}
+		}
+	} while (top >= stack);
+
+	kfree(stack);
+	return 0;
+}
+
+void qsort_swap(void *a, void *b, int size)
+{
+	char t;
+
+	do {
+		t = *(char *)a;
+		*(char *)a++ = *(char *)b;
+		*(char *)b++ = t;
+	} while (--size > 0);
+}
+
+EXPORT_SYMBOL_GPL(qsort);
+EXPORT_SYMBOL_GPL(qsort_swap);
+
+MODULE_LICENSE("GPL");
Index: mm2qs/include/linux/qsort.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mm2qs/include/linux/qsort.h	2005-01-24 10:43:46.000000000 -0800
@@ -0,0 +1,10 @@
+#ifndef _LINUX_QSORT_H
+#define _LINUX_QSORT_H
+
+int qsort(void *base, size_t num, size_t size,
+	  int (*cmp)(const void *, const void *),
+	  void (*swap)(const void *, const void *, int), int flags);
+
+void qsort_swap(void *a, void *b, int size);
+
+#endif
Index: mm2qs/lib/Kconfig
===================================================================
--- mm2qs.orig/lib/Kconfig	2005-01-19 22:53:44.000000000 -0800
+++ mm2qs/lib/Kconfig	2005-01-24 10:33:20.000000000 -0800
@@ -57,5 +57,8 @@
 config REED_SOLOMON_DEC16
 	boolean
 
+config QSORT
+	tristate
+
 endmenu
 


-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-23  4:58         ` Felipe Alfaro Solana
@ 2005-01-24 21:20           ` Matt Mackall
  2005-01-24 21:50             ` vlobanov
  0 siblings, 1 reply; 85+ messages in thread
From: Matt Mackall @ 2005-01-24 21:20 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: Andi Kleen, Neil Brown, linux-kernel, Buck Huppmann,
	Trond Myklebust, Andreas Gruenbacher, Andries E. Brouwer,
	Andrew Morton, Olaf Kirch

On Sun, Jan 23, 2005 at 05:58:00AM +0100, Felipe Alfaro Solana wrote:
> On 23 Jan 2005, at 03:39, Andi Kleen wrote:
> 
> >Felipe Alfaro Solana <lkml@mac.com> writes:
> >>
> >>AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
> >>operations. Also, since the original patch uses 3 MOVs to perform the
> >>swapping, and your version uses 3 XOR operations, I don't see any
> >>gains.
> >
> >Both are one cycle latency for register<->register on all x86 cores
> >I've looked at. What makes you think differently?
> 
> I thought XOR was more expensive. Anyways, I still don't see any 
> advantage in replacing 3 MOVs with 3 XORs.

Again, no temporaries needed.

But I benched it and it was quite a bit slower.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-24 21:20           ` Matt Mackall
@ 2005-01-24 21:50             ` vlobanov
  0 siblings, 0 replies; 85+ messages in thread
From: vlobanov @ 2005-01-24 21:50 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Felipe Alfaro Solana, Andi Kleen, Neil Brown, linux-kernel,
	Buck Huppmann, Trond Myklebust, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Mon, 24 Jan 2005, Matt Mackall wrote:

> On Sun, Jan 23, 2005 at 05:58:00AM +0100, Felipe Alfaro Solana wrote:
> > On 23 Jan 2005, at 03:39, Andi Kleen wrote:
> >
> > >Felipe Alfaro Solana <lkml@mac.com> writes:
> > >>
> > >>AFAIK, XOR is quite expensive on IA32 when compared to simple MOV
> > >>operations. Also, since the original patch uses 3 MOVs to perform the
> > >>swapping, and your version uses 3 XOR operations, I don't see any
> > >>gains.
> > >
> > >Both are one cycle latency for register<->register on all x86 cores
> > >I've looked at. What makes you think differently?
> >
> > I thought XOR was more expensive. Anyways, I still don't see any
> > advantage in replacing 3 MOVs with 3 XORs.
>
> Again, no temporaries needed.
>
> But I benched it and it was quite a bit slower.

Yep, it's a difference of four instructions (when using one or two
temporary variables and swapping using assignments) versus six
instructions (when using xors, since IA32 can't do an xor with both
arguments in memory).

I originally pitched this idea out to the list just for discussion
purposes. Most considered it, and said that the advantages don't
outweigh the disadvantages. And that's fine -- it means that the chosen
way is that much better considered. Always a good thing. :)

-Vadim Lobanov
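
To make the comparison concrete, here is a small sketch of both swap styles.
The function names are made up for illustration. Besides the instruction
count, the XOR form has a correctness pitfall: if both pointers refer to the
same object, it zeroes the value. That case matters in a quicksort partition
loop that calls swap(i, p, size) after advancing p, because p can then be
equal to i.

```c
#include <stddef.h>

/* Swap through a temporary: works even when a == b. */
void swap_tmp(unsigned int *a, unsigned int *b)
{
	unsigned int t = *a;

	*a = *b;
	*b = t;
}

/* XOR swap: no temporary, but zeroes the value when a == b. */
void swap_xor(unsigned int *a, unsigned int *b)
{
	*a ^= *b;
	*b ^= *a;
	*a ^= *b;
}

/* XOR swap guarded against aliasing, at the cost of an extra branch. */
void swap_xor_safe(unsigned int *a, unsigned int *b)
{
	if (a != b)
		swap_xor(a, b);
}
```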


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-24  4:02           ` Horst von Brand
@ 2005-01-24 21:57             ` Matt Mackall
  0 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-24 21:57 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Andi Kleen, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch

On Mon, Jan 24, 2005 at 01:02:44AM -0300, Horst von Brand wrote:
> Matt Mackall <mpm@selenic.com> said:
> > On Sun, Jan 23, 2005 at 03:39:34AM +0100, Andi Kleen wrote:
> 
> [...]
> 
> > > -Andi (who thinks the glibc qsort is vast overkill for kernel purposes
> > > where there are only small data sets and it would be better to use a 
> > > simpler one optimized for code size)
> 
> > Mostly agreed. Except:
> > 
> > a) the glibc version is not actually all that optimized
> > b) it's nice that it's not recursive
> > c) the three-way median selection does help avoid worst-case O(n^2)
> > behavior, which might potentially be triggerable by users in places
> > like XFS where this is used
> 
> Shellsort is much simpler, and not much slower for small datasets. Plus no
> extra space for stacks.
> 
> > I'll probably whip up a simpler version tomorrow or Monday and do some
> > size/space benchmarking. I've been meaning to contribute a qsort for
> > doubly-linked lists I've got lying around as well.
> 
> Qsort is OK as long as you have direct access to each element. In case of
> lists, it is better to just use mergesort.

Qsort does not need to do random access. I posted an efficient
doubly-linked list version here four years ago:

template<class T>
void list<T>::qsort(iter l, iter r, cmpfunc *cmp, void *data)
{
        if(l==r) return;

        iter i(l), p(l);

        for(i++; i!=r; i++)
                if(cmp(*i, *l, data)<0)
                        i.swap(++p);

        l.swap(p);
        qsort(l, p, cmp, data);
        qsort(++p, r, cmp, data);
}
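
The same idea in plain C, as a sketch under assumptions: the node layout and
the names are made up, and elements are "swapped" by exchanging payloads
rather than relinking nodes, which may differ from what iter::swap() does
above. The range is half-open, [l, r), with r == NULL meaning the end of the
list. Only next pointers are followed, which is the point being made: the
partition step needs sequential access, not random access.

```c
#include <stddef.h>

struct node {
	struct node *prev, *next;	/* prev is not needed by the sort */
	int val;
};

static void swap_val(struct node *a, struct node *b)
{
	int t = a->val;

	a->val = b->val;
	b->val = t;
}

/* Sort the values stored in the half-open node range [l, r). */
void list_qsort(struct node *l, struct node *r)
{
	struct node *i, *p;

	if (l == r)
		return;

	/* partition: values smaller than the pivot l->val collect in (l, p] */
	for (p = l, i = l->next; i != r; i = i->next)
		if (i->val < l->val) {
			p = p->next;
			swap_val(i, p);
		}
	swap_val(l, p);		/* pivot lands in its final position */

	list_qsort(l, p);
	list_qsort(p->next, r);
}
```

Note that this recursive form can still recurse O(n) deep on sorted input.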

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-23  4:46           ` Andi Kleen
  2005-01-23  5:05             ` Jesper Juhl
@ 2005-01-24 22:04             ` Mike Waychison
  2005-01-25  6:51               ` Andi Kleen
  1 sibling, 1 reply; 85+ messages in thread
From: Mike Waychison @ 2005-01-24 22:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jesper Juhl, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch, Tim Hockin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:
>>How about a shell sort?  if the data is mostly sorted shell sort beats 
>>qsort lots of times, and since the data sets are often small in-kernel, 
>>shell sort's O(n^2) behaviour won't harm it too much, shell sort is also 
>>faster if the data is already completely sorted. Shell sort is certainly 
>>not the simplest algorithm around, but I think (without having done any 
>>tests) that it would probably do pretty well for in-kernel use... Then 
>>again, I've been known to be wrong :)
> 
> 
> I like shell sort for small data sets too. And I agree it would be 
> appropriate for the kernel.
> 

FWIW, we already have a Shell sort for the ngroups stuff in
kernel/sys.c:groups_sort() that could be made generic.

- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE:  The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFB9XDzdQs4kOxk3/MRAs2ZAJ4if1XRFAiWsgb1wvTInFLUVGHesgCfWxCJ
Efyrr4PkG/KrqefAVAQjt+c=
=/OPh
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] lib/qsort
  2005-01-24 20:15   ` [PATCH] lib/qsort Matt Mackall
@ 2005-01-24 23:09     ` Andrew Morton
  2005-01-24 23:30       ` Matt Mackall
  2005-01-25  4:11     ` Matt Mackall
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Morton @ 2005-01-24 23:09 UTC (permalink / raw)
  To: Matt Mackall
  Cc: linux-kernel, neilb, trond.myklebust, okir, Andries.Brouwer, agruen

Matt Mackall <mpm@selenic.com> wrote:
>
> This patch introduces an implementation of qsort to lib/.

It screws me over right proper.  Can we stick with Andreas's known-working
patch for now, and do the sorting stuff as a separate, later activity?

It would involve:

- Removal of the old sort code

- Introduction of the new sort code

- Migration of the NFS ACL code, XFS and group code over to the new
  implementation.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] lib/qsort
  2005-01-24 23:09     ` Andrew Morton
@ 2005-01-24 23:30       ` Matt Mackall
  0 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-24 23:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, neilb, trond.myklebust, okir, Andries.Brouwer, agruen

On Mon, Jan 24, 2005 at 03:09:40PM -0800, Andrew Morton wrote:
> Matt Mackall <mpm@selenic.com> wrote:
> >
> > This patch introduces an implementation of qsort to lib/.
> 
> It screws me over right proper.  Can we stick with Andreas's known-working
> patch for now, and do the sorting stuff as a separate, later activity?
> 
> It would involve:
> 
> - Removal of the old sort code
> 
> - Introduction of the new sort code
> 
> - Migration of the NFS ACL code, XFS and group code over to the new
>   implementation.

Ok, will do after mm++.

FYI, I'm going to submit a heapsort variant instead with similar
performance. It gets rid of the potentially exploitable worst-case
behavior of qsort as well as the extra stack space (and the resultant
need for error handling).

Apparently the glibc folks wanted this to be EXPORT_SYMBOL_GPL the
last time around, btw.
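
As an illustration of why heapsort avoids both problems, here is a minimal
userspace sketch. It is not the variant announced above; the names heap_sort,
sift_down, swap_bytes and cmp_int are made up. Heapsort works in place with
O(n log n) comparisons in the worst case and needs no auxiliary stack, so
there is no allocation and nothing that can fail.

```c
#include <stddef.h>

/* Byte-wise swap of two elements. */
static void swap_bytes(void *a, void *b, size_t size)
{
	unsigned char *p = a, *q = b;

	while (size--) {
		unsigned char t = *p;
		*p++ = *q;
		*q++ = t;
	}
}

/* Restore the max-heap property below element `root` in base[0..n). */
static void sift_down(char *base, size_t root, size_t n, size_t size,
		      int (*cmp)(const void *, const void *))
{
	for (;;) {
		size_t child = 2 * root + 1;

		if (child >= n)
			break;
		/* pick the larger of the two children */
		if (child + 1 < n &&
		    cmp(base + child * size, base + (child + 1) * size) < 0)
			child++;
		if (cmp(base + root * size, base + child * size) >= 0)
			break;
		swap_bytes(base + root * size, base + child * size, size);
		root = child;
	}
}

void heap_sort(void *base, size_t num, size_t size,
	       int (*cmp)(const void *, const void *))
{
	char *b = base;
	size_t i;

	/* build the heap bottom-up */
	for (i = num / 2; i-- > 0; )
		sift_down(b, i, num, size, cmp);

	/* repeatedly move the maximum to the end and shrink the heap */
	for (i = num; i-- > 1; ) {
		swap_bytes(b, b + i * size, size);
		sift_down(b, 0, i, size, cmp);
	}
}

/* Example comparison callback for int elements. */
int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}
```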

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-24 17:10               ` H. Peter Anvin
@ 2005-01-25  0:43                 ` Horst von Brand
  2005-01-25  4:06                   ` Eric St-Laurent
  0 siblings, 1 reply; 85+ messages in thread
From: Horst von Brand @ 2005-01-25  0:43 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1331 bytes --]

hpa@zytor.com (H. Peter Anvin) said:

[...]

> In klibc, I use combsort:
> 
> /*
>  * qsort.c
>  *
>  * This is actually combsort.  It's an O(n log n) algorithm with
>  * simplicity/small code size being its main virtue.
>  */
> 
> #include <stddef.h>
> #include <string.h>
> 
> static inline size_t newgap(size_t gap)
> {
>   gap = (gap*10)/13;
>   if ( gap == 9 || gap == 10 )
>     gap = 11;
> 
>   if ( gap < 1 )
>     gap = 1;
>   return gap;
> }
> 
> void qsort(void *base, size_t nmemb, size_t size,
>            int (*compar)(const void *, const void *))
> {
>   size_t gap = nmemb;
>   size_t i, j;
>   char *p1, *p2;
>   int swapped;
> 
>   do {
>     gap = newgap(gap);
>     swapped = 0;
> 
>     for ( i = 0, p1 = base ; i < nmemb-gap ; i++, p1 += size ) {
>       j = i+gap;
>       if ( compar(p1, p2 = (char *)base+j*size) > 0 ) {
>         memswap(p1, p2, size);
>         swapped = 1;
>       }
>     }
>   } while ( gap > 1 || swapped );
> }

AFAICS, this is just a badly implemented Shellsort (the 10/13 increment
sequence starting with the number of elements is probably not very good;
besides, swapping stuff is inefficient (just juggling like Shellsort does
gives you almost a third fewer copies)).

Have you found a proof for the O(n log n) claim?

I'd write as attached (careful, a local element on stack!)


[-- Attachment #2: Shellsort --]
[-- Type: text/x-c, Size: 2186 bytes --]

/*
 * shellsort.c: Shell sort
 *
 * Copyright (c) 2005, Horst H. von Brand <vonbrand@inf.utfsm.cl>
 * All rights reserved.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above
 *       copyright notice, this list of conditions and the following
 *       disclaimer in the documentation and/or other materials provided
 *       with the distribution.
 *     * Neither the name of Horst H. von Brand nor the names of its
 *       contributors may be used to endorse or promote products derived
 *       from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include <string.h>

void qsort(void *base, size_t nmemb, size_t size, 
           int (*compar)(const void *, const void *))

{
  int i, j, h;
  char tmp[size];
    
  for(h = 1; h < nmemb; h = 3 * h + 1)
    ;
    
  do {
    h /= 3;
    for(i = h; i < nmemb; i++) {
      memcpy(tmp, base + i * size, size);
      for(j = i - h; j >= 0 && compar(tmp, base + j * size) < 0; j -= h)
	memcpy(base + (j + h) * size, base + j * size, size);
      memcpy(base + (j + h) * size, tmp, size);
    }
  } while(h > 1);
}

[-- Attachment #3: Type: text/plain, Size: 276 bytes --]

-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 12/13] ACL umask handling workaround in nfs client
  2005-01-22 20:34 ` [patch 12/13] ACL umask handling workaround in nfs client Andreas Gruenbacher
@ 2005-01-25  1:20   ` Andreas Gruenbacher
  2005-02-15 18:04   ` Trond Myklebust
  1 sibling, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25  1:20 UTC (permalink / raw)
  To: Andrew Morton, Neil Brown, Trond Myklebust, linux-kernel
  Cc: Olaf Kirch, Andries E. Brouwer, Buck Huppmann

Hello,

this patch has an NFSv2 problem that I haven't tripped over until today. The 
fix is this:

------- 8< -------
Fix NFSv2 null pointer access

With NFSv2 we would try to follow a NULL getacl and setacl function
pointer here. Add the missing checks.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>

Index: linux-2.6.10/fs/nfs/dir.c
===================================================================
--- linux-2.6.10.orig/fs/nfs/dir.c
+++ linux-2.6.10/fs/nfs/dir.c
@@ -984,6 +984,9 @@ static int nfs_set_default_acl(struct in
 	struct posix_acl *dfacl, *acl;
 	int error = 0;
 
+	if (NFS_PROTO(inode)->version != 3 ||
+	    !NFS_PROTO(dir)->getacl || !NFS_PROTO(inode)->setacls)
+		return 0;
 	dfacl = NFS_PROTO(dir)->getacl(dir, ACL_TYPE_DEFAULT);
 	if (IS_ERR(dfacl)) {
 		error = PTR_ERR(dfacl);


Regards,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25  0:43                 ` Horst von Brand
@ 2005-01-25  4:06                   ` Eric St-Laurent
  0 siblings, 0 replies; 85+ messages in thread
From: Eric St-Laurent @ 2005-01-25  4:06 UTC (permalink / raw)
  To: Horst von Brand; +Cc: H. Peter Anvin, linux-kernel

On Mon, 2005-01-24 at 21:43 -0300, Horst von Brand wrote:
> AFAICS, this is just a badly implemented Shellsort (the 10/13 increment
> sequence starting with the number of elements is probably not very good;
> besides, swapping stuff is inefficient (just juggling like Shellsort does
> gives you almost a third fewer copies)).
> 
> Have you found a proof for the O(n log n) claim?

"Why a Comb Sort is NOT a Shell Sort

A shell sort completely sorts the data for each gap size. A comb sort
takes a more optimistic approach and doesn't require data be completely
sorted at a gap size. The comb sort assumes that out-of-order data will
be cleaned-up by smaller gap sizes as the sort proceeds. "

Reference:

http://world.std.com/~jdveale/combsort.htm

Another good reference:

http://yagni.com/combsort/index.php

Personally, I've used it in the past because of its small size.  With
C++ templates you can have a copy of the routine generated for a
specific datatype, thus skipping the costly function call used for each
compare.  With some C macro magic, I presume something similar can be
done, for time-critical applications.
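
A rough sketch of what such macro magic could look like. The macro name
DEFINE_COMBSORT, the LESS parameter and the generated function name are all
made up for illustration. The comparison is expanded inline into the
generated function, so no function pointer is called per compare.

```c
#include <stddef.h>

/*
 * Expand to a comb sort specialised for one element type.
 * LESS(a, b) must evaluate to non-zero when a sorts before b.
 */
#define DEFINE_COMBSORT(name, type, LESS)                       \
void name(type *v, size_t n)                                    \
{                                                               \
	size_t gap = n, i;                                      \
	int swapped;                                            \
                                                                \
	if (n < 2)                                              \
		return;                                         \
	do {                                                    \
		gap = gap * 10 / 13;                            \
		if (gap == 9 || gap == 10)                      \
			gap = 11;                               \
		if (gap < 1)                                    \
			gap = 1;                                \
		swapped = 0;                                    \
		for (i = 0; i + gap < n; i++) {                 \
			if (LESS(v[i + gap], v[i])) {           \
				type t = v[i];                  \
				v[i] = v[i + gap];              \
				v[i + gap] = t;                 \
				swapped = 1;                    \
			}                                       \
		}                                               \
	} while (gap > 1 || swapped);                           \
}

#define UINT_LESS(a, b) ((a) < (b))

/* Instantiates: void sort_uint(unsigned int *v, size_t n) */
DEFINE_COMBSORT(sort_uint, unsigned int, UINT_LESS)
```

The price is one copy of the routine per element type, which is the usual
trade-off against a single generic routine with a comparison callback.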

Best regards,

Eric St-Laurent



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] lib/qsort
  2005-01-24 20:15   ` [PATCH] lib/qsort Matt Mackall
  2005-01-24 23:09     ` Andrew Morton
@ 2005-01-25  4:11     ` Matt Mackall
  1 sibling, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2005-01-25  4:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Neil Brown, Trond Myklebust, Olaf Kirch,
	Andries E. Brouwer, Andreas Gruenbacher

On Mon, Jan 24, 2005 at 12:15:27PM -0800, Matt Mackall wrote:
> Here are some benchmarks of cycle count averages for 10 runs on the
> same random datasets, interrupts disabled. Percentages are performance
> relative to the glibc algorithm. A bunch of other variants dropped for
> brevity.

I've discovered a bug in this benchmark that gives a big advantage to
a couple of variants I tried. Corrected benchmarks later.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-24 22:04             ` Mike Waychison
@ 2005-01-25  6:51               ` Andi Kleen
  2005-01-25 10:12                 ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Andi Kleen @ 2005-01-25  6:51 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Jesper Juhl, Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andreas Gruenbacher,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch, Tim Hockin

> FWIW, we already have a Shell sort for the ngroups stuff in
> kernel/sys.c:groups_sort() that could be made generic.

Sounds like a good plan. Any takers?

-Andi

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25  6:51               ` Andi Kleen
@ 2005-01-25 10:12                 ` Andreas Gruenbacher
  2005-01-25 12:00                   ` Andi Kleen
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 10:12 UTC (permalink / raw)
  To: Andi Kleen, Nathan Scott
  Cc: Mike Waychison, Jesper Juhl, Felipe Alfaro Solana,
	Trond Myklebust, linux-kernel, Buck Huppmann, Neil Brown,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch, Tim Hockin

On Tuesday 25 January 2005 07:51, Andi Kleen wrote:
> > FWIW, we already have a Shell sort for the ngroups stuff in
> > kernel/sys.c:groups_sort() that could be made generic.
>
> Sounds like a good plan. Any takers?

It would slow down the groups case (unless we leave the specialized version 
in). Gcc doesn't inline a cmp function pointer, and a C preprocessor 
templatized version would be really ugly. A variant of this routine with a 
qsort-like interface should be good enough for nfsacl and xfs though.

Nevertheless, xfs and nfsacl have very similar requirements:

nfsacl: at most 1024 elements; 8-byte elements (16 on 64-bit archs)

xfs (from Nathan): at most 1024 elements (with 64K blocksize); 8-byte or 
larger elements

Cheers.
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 10:12                 ` Andreas Gruenbacher
@ 2005-01-25 12:00                   ` Andi Kleen
  2005-01-25 12:05                     ` Olaf Kirch
  0 siblings, 1 reply; 85+ messages in thread
From: Andi Kleen @ 2005-01-25 12:00 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Nathan Scott, Mike Waychison, Jesper Juhl, Felipe Alfaro Solana,
	Trond Myklebust, linux-kernel, Buck Huppmann, Neil Brown,
	Andries E. Brouwer, Andrew Morton, Olaf Kirch, Tim Hockin

> It would slow down the groups case (unless we leave the specialized version 
> in). Gcc doesn't inline a cmp function pointer, and a C preprocessor 
> templatized version would be really ugly. A variant of this routine with a 
> qsort-like interface should be good enough for nfsacl and xfs though.

group initialization is not time critical, it typically only happens
at login.  Also it's doubtful you'll even be able to measure the difference.

-Andi

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 12:00                   ` Andi Kleen
@ 2005-01-25 12:05                     ` Olaf Kirch
  2005-01-25 16:52                       ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Olaf Kirch @ 2005-01-25 12:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andreas Gruenbacher, Nathan Scott, Mike Waychison, Jesper Juhl,
	Felipe Alfaro Solana, Trond Myklebust, linux-kernel,
	Buck Huppmann, Neil Brown, Andries E. Brouwer, Andrew Morton,
	Tim Hockin

On Tue, Jan 25, 2005 at 01:00:23PM +0100, Andi Kleen wrote:
> group initialization is not time critical, it typically only happens
> at login.  Also it's doubtful you'll even be able to measure the difference.

nfsd updates its group list for every request it processes, so you don't want
to make that too slow.

Olaf
-- 
Olaf Kirch     | Things that make Monday morning interesting, #2:
okir@suse.de   |        "We have 8,000 NFS mount points, why do we keep
---------------+ 	 running out of privileged ports?"

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 12:05                     ` Olaf Kirch
@ 2005-01-25 16:52                       ` Trond Myklebust
  2005-01-25 16:53                         ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-01-25 16:52 UTC (permalink / raw)
  To: Olaf Kirch
  Cc: Andi Kleen, Andreas Gruenbacher, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 25.01.2005 at 13:05 (+0100), Olaf Kirch wrote:
> On Tue, Jan 25, 2005 at 01:00:23PM +0100, Andi Kleen wrote:
> > group initialization is not time critical, it typically only happens
> > at login.  Also it's doubtful you'll even be able to measure the difference.
> 
> nfsd updates its group list for every request it processes, so you don't want
> to make that too slow.

So here's an iconoclastic question or two:

  Why can't clients sort the list in userland, before they call down to
the kernel?

  If clients are sorting their lists, why would we need to sort the same
list on the server side? Detecting out-of-order list entries is much
less of a hassle than actually sorting, so if the protocol calls for
sorted elements, you can return an EINVAL or something in the case where
some client sends an unsorted list.
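
A minimal sketch of such an order check, written as plain userspace C.
The struct and function names are hypothetical stand-ins, not the
kernel's posix_acl types; the only assumption is that entries are
ordered by tag first and by id second. One linear pass is enough, and
because the comparison is strict it also rejects duplicate entries.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an ACL entry. */
struct sketch_acl_entry {
	unsigned short tag;	/* entry type; smaller value sorts first */
	unsigned int   id;	/* uid or gid, used as the secondary key */
};

/*
 * Return 0 if the entries are in strictly ascending (tag, id) order,
 * -EINVAL otherwise.  Single pass, no allocation, no reordering.
 */
static int sketch_acl_check_order(const struct sketch_acl_entry *e, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++) {
		if (e[i - 1].tag > e[i].tag)
			return -EINVAL;
		if (e[i - 1].tag == e[i].tag && e[i - 1].id >= e[i].id)
			return -EINVAL;
	}
	return 0;
}
```

The check costs O(n) comparisons, against O(n log n) for a sort.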

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 16:52                       ` Trond Myklebust
@ 2005-01-25 16:53                         ` Andreas Gruenbacher
  2005-01-25 17:03                           ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 16:53 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 2005-01-25 at 17:52, Trond Myklebust wrote:
> So here's an iconoclastic question or two:
> 
>   Why can't clients sort the list in userland, before they call down to
> the kernel?

Tell that to Sun Microsystems.

Regards,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 16:53                         ` Andreas Gruenbacher
@ 2005-01-25 17:03                           ` Trond Myklebust
  2005-01-25 17:16                             ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-01-25 17:03 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 25.01.2005 at 17:53 (+0100), Andreas Gruenbacher wrote:
> On Tue, 2005-01-25 at 17:52, Trond Myklebust wrote:
> > So here's an iconoclastic question or two:
> > 
> >   Why can't clients sort the list in userland, before they call down to
> > the kernel?
> 
> Tell that to Sun Microsystems.

Whatever Sun chooses to do or not do changes nothing to the question of
why our client would want to do a quicksort in the kernel.

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 17:03                           ` Trond Myklebust
@ 2005-01-25 17:16                             ` Andreas Gruenbacher
  2005-01-25 17:37                               ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 17:16 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 2005-01-25 at 18:03, Trond Myklebust wrote:
> On Tue, 25.01.2005 at 17:53 (+0100), Andreas Gruenbacher wrote:
> > On Tue, 2005-01-25 at 17:52, Trond Myklebust wrote:
> > > So here's an iconoclastic question or two:
> > > 
> > >   Why can't clients sort the list in userland, before they call down to
> > > the kernel?
> > 
> > Tell that to Sun Microsystems.
> 
> Whatever Sun chooses to do or not do changes nothing to the question of
> why our client would want to do a quicksort in the kernel.

Well, it determines what we must accept, both on the server side and the
client side.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 17:16                             ` Andreas Gruenbacher
@ 2005-01-25 17:37                               ` Trond Myklebust
  2005-01-25 18:12                                 ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-01-25 17:37 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 25.01.2005 at 18:16 (+0100), Andreas Gruenbacher wrote:

> > Whatever Sun chooses to do or not do changes nothing to the question of
> > why our client would want to do a quicksort in the kernel.
> 
> Well, it determines what we must accept, both on the server side and the
> client side.

I can see why you might want it on the server side, but I repeat: why
does the client need to do this in the kernel? The client code should
not be overriding the server when it comes to what is acceptable or not
acceptable. That's just wrong...

I can also see that if the server _must_ have a sorted list, then doing
a sort on the client is a good thing since it will cut down on the work
that said server will need to do, and so it will scale better with the
number of clients (though note that, conversely, this server will scale
poorly with the Sun clients or others if they do not sort the lists).

I'm asking 'cos if the client doesn't need this code, then it seems to
me you can move helper routines like the quicksort and posix checking
routines into the nfsd module rather than having to keep it in the
VFS (unless you foresee that other modules will want to use the same
routines???).

Cheers,
 Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 17:37                               ` Trond Myklebust
@ 2005-01-25 18:12                                 ` Andreas Gruenbacher
  2005-01-25 19:33                                   ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 18:12 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 2005-01-25 at 18:37, Trond Myklebust wrote:
> On Tue, 25.01.2005 at 18:16 (+0100), Andreas Gruenbacher wrote:
> 
> > > Whatever Sun chooses to do or not do changes nothing to the question of
> > > why our client would want to do a quicksort in the kernel.
> > 
> > Well, it determines what we must accept, both on the server side and the
> > client side.
> 
> I can see why you might want it on the server side, but I repeat: why
> does the client need to do this in the kernel? The client code should
> not be overriding the server when it comes to what is acceptable or not
> acceptable. That's just wrong...

Ah, I see now what you mean. The setxattr syscall only accepts
well-formed acls (that is, sorted plus a few other restrictions), and
user-space is expected to take care of that. In turn, getxattr returns
only well-formed acls. We could lift that guarantee specifically for
nfs, but I don't think it would be a good idea. Entry order in POSIX
acls doesn't convey a meaning by the way, and the nfs client never
rejects what the server sends.

> I can also see that if the server _must_ have a sorted list, then doing
> a sort on the client is a good thing since it will cut down on the work
> that said server will need to do, and so it will scale better with the
> number of clients (though note that, conversely, this server will scale
> poorly with the Sun clients or others if they do not sort the lists).

The server must have sorted lists. Linux clients send well-formed acls
except when they fake up a mask entry; they insert the mask entry at the
end instead of in the right position (this is the three-entry acl
problem I described in [patch 0/13]). We could insert the mask in the
right position, but the protocol doesn't require it. We must sort on the
server anyway, and the server can as easily swap the two entries.
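
The mask handling described above can be sketched roughly as follows.
This is an illustration in userspace C, not the patch's code: the
struct and function names are hypothetical, and the tag constants are
assumed to mirror the Linux ACL_* tag values, whose numeric order is
also the canonical entry order.

```c
#include <stddef.h>

/* Assumed to mirror the Linux ACL_* tag values. */
enum {
	SK_USER_OBJ  = 0x01,
	SK_GROUP_OBJ = 0x04,
	SK_MASK      = 0x10,
	SK_OTHER     = 0x20
};

/* Hypothetical, simplified entry: tag plus rwx permission bits. */
struct sk_entry {
	unsigned short tag;
	unsigned short perm;
};

/*
 * Client side: turn a three-entry acl (user::, group::, other::) into
 * the four-entry form by appending a mask:: entry that carries the
 * group:: permissions.  Returns the new entry count.
 */
static size_t sk_fake_mask(const struct sk_entry in[3], struct sk_entry out[4])
{
	out[0] = in[0];
	out[1] = in[1];
	out[2] = in[2];
	out[3].tag = SK_MASK;
	out[3].perm = in[1].perm;
	return 4;
}

/*
 * Server side: if the list ends in (other::, mask::), swap the two
 * entries so that mask:: precedes other:: as the canonical order wants.
 */
static void sk_fix_mask_position(struct sk_entry *e, size_t n)
{
	if (n >= 2 && e[n - 2].tag == SK_OTHER && e[n - 1].tag == SK_MASK) {
		struct sk_entry tmp = e[n - 2];

		e[n - 2] = e[n - 1];
		e[n - 1] = tmp;
	}
}
```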

> I'm asking 'cos if the client doesn't need this code, then it seems to
> me you can move helper routines like the quicksort and posix checking
> routines into the nfsd module rather than having to keep it in the
> VFS (unless you foresee that other modules will want to use the same
> routines???).

That would cause getxattr to return an "invalid" result. libacl doesn't
care, but other users might exist that rely on the current format. In
addition, comparing acls becomes non-trivial: currently xattr values are
equal iff acls are equal.
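
To illustrate the last point, here is a small userspace sketch. The
names are hypothetical; the entry layout (16-bit tag, 16-bit perm,
32-bit id) is assumed to resemble the xattr entry format and has no
padding, so a byte-wise compare is meaningful. Once both lists are in
canonical order, equality of the acls reduces to a memcmp.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry layout: 2 + 2 + 4 bytes, no padding. */
struct sketch_entry {
	unsigned short tag;
	unsigned short perm;
	unsigned int   id;
};

/* Order by tag first, then by id. */
static int sketch_entry_cmp(const void *a, const void *b)
{
	const struct sketch_entry *x = a, *y = b;

	if (x->tag != y->tag)
		return x->tag < y->tag ? -1 : 1;
	if (x->id != y->id)
		return x->id < y->id ? -1 : 1;
	return 0;
}

/* Bring a list into canonical (sorted) form. */
static void sketch_canonicalize(struct sketch_entry *e, size_t n)
{
	qsort(e, n, sizeof(*e), sketch_entry_cmp);
}

/* Byte-wise comparison; only reliable when both lists are canonical. */
static int sketch_acl_equal(const struct sketch_entry *a, size_t na,
			    const struct sketch_entry *b, size_t nb)
{
	return na == nb && memcmp(a, b, na * sizeof(*a)) == 0;
}
```

Without the canonical form, a comparison would have to match entries
pairwise regardless of their position.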

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 18:12                                 ` Andreas Gruenbacher
@ 2005-01-25 19:33                                   ` Trond Myklebust
  2005-01-25 19:49                                     ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-01-25 19:33 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 25.01.2005 at 19:12 (+0100), Andreas Gruenbacher wrote:

> Ah, I see now what you mean. The setxattr syscall only accepts
> well-formed acls (that is, sorted plus a few other restrictions), and
> user-space is expected to take care of that. In turn, getxattr returns
> only well-formed acls. We could lift that guarantee specifically for
> nfs, but I don't think it would be a good idea.

Note that if you really want to add a qsort to the kernel you might as
well drop the setxattr sorting requirement too. If the kernel can qsort
for getxattr, then might as well do it for the case of setxattr too.

> Entry order in POSIX acls doesn't convey a meaning by the way.

Precisely. Are there really any existing programs out there that are
using the raw xattr output and making assumptions about entry order?

> The server must have sorted lists.

So, I realize that the on-disk format is already defined, but looking at
routines like posix_acl_permission(), it looks like the only order the
kernel (at least) actually cares about is that of the "e_tag" field.
Unless I missed something, nothing there cares about the order of the
"e_id" fields.
Given that you only have 6 possible values there, it seems a shame in
hindsight that we didn't choose to just use a 6 bucket hashtable (the
value of e_id being the hash value), and leave the order of the e_id
fields undefined. 8-(

Cheers,
  Trond


-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 1/13] Qsort
  2005-01-25 19:33                                   ` Trond Myklebust
@ 2005-01-25 19:49                                     ` Andreas Gruenbacher
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-01-25 19:49 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Olaf Kirch, Andi Kleen, Nathan Scott, Mike Waychison,
	Jesper Juhl, Felipe Alfaro Solana, linux-kernel, Buck Huppmann,
	Neil Brown, Andries E. Brouwer, Andrew Morton, Tim Hockin

On Tue, 2005-01-25 at 20:33, Trond Myklebust wrote:
> On Tue, 25.01.2005 at 19:12 (+0100), Andreas Gruenbacher wrote:
> 
> > Ah, I see now what you mean. The setxattr syscall only accepts
> > well-formed acls (that is, sorted plus a few other restrictions), and
> > user-space is expected to take care of that. In turn, getxattr returns
> > only well-formed acls. We could lift that guarantee specifically for
> > nfs, but I don't think it would be a good idea.
> 
> Note that if you really want to add a qsort to the kernel you might as
> well drop the setxattr sorting requirement too. If the kernel can qsort
> for getxattr, then might as well do it for the case of setxattr too.

There is no need to sort anything in the kernel for acls except for the
NFSACL case, so that's where we need it, and nowhere else. What would be
the point in making setxattr accept unsorted acls? It's just not
necessary; userspace can do it just as well.

> > Entry order in POSIX acls doesn't convey a meaning by the way.
> 
> Precisely. Are there really any existing programs out there that are
> using the raw xattr output and making assumptions about entry order?

I don't know. Anyway, it's a nice feature to have a unique canonical
form.

> > The server must have sorted lists.
> 
> So, I realize that the on-disk format is already defined, but looking at
> routines like posix_acl_permission(), it looks like the only order the
> kernel (at least) actually cares about is that of the "e_tag" field.
> Unless I missed something, nothing there cares about the order of the
> "e_id" fields.

Correct. But posix_acl_valid() does care about the e_id order as well.

> Given that you only have 6 possible values there, it seems a shame in
> hindsight that we didn't choose to just use a 6 bucket hashtable (the
> value of e_id being the hash value), and leave the order of the e_id
> fields undefined. 8-(

Checking for duplicate e_id fields would become expensive. I really
don't see any benefit.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 2/13] Return -ENOSYS for RPC programs that are unavailable
  2005-01-22 20:34 ` [patch 2/13] Return -ENOSYS for RPC programs that are unavailable Andreas Gruenbacher
@ 2005-02-15 17:04   ` Trond Myklebust
  2005-02-16 15:32     ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 17:04 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

No hacks in sunrpc, please: i.e. get rid of that NFSACL_PROGRAM
exception...
If you want to kill those warnings, please just convert them to
dprintks().

Also, why are you converting "unknown error" into ENOSYS?

Finally, it might make sense to distinguish between "program" and
"procedure" errors. How about converting that RPC_PROC_UNAVAIL error
into EOPNOTSUPP (like we already do in the NFS layer itself).
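
The suggested distinction could look roughly like this. It is a
standalone sketch, not the sunrpc code: the enum and function names are
hypothetical, and only the numeric values are taken from the accept_stat
definition in the ONC RPC specification. "Program" failures map to
-ENOSYS, a missing procedure maps to -EOPNOTSUPP, and everything else
stays -EIO (the real call_verify() retries on garbage args rather than
failing at once, which this sketch ignores).

```c
#include <errno.h>

/* accept_stat values from the ONC RPC specification. */
enum sketch_accept_stat {
	SK_SUCCESS       = 0,
	SK_PROG_UNAVAIL  = 1,	/* program not available on the server */
	SK_PROG_MISMATCH = 2,	/* program known, version not supported */
	SK_PROC_UNAVAIL  = 3,	/* procedure not implemented */
	SK_GARBAGE_ARGS  = 4,
	SK_SYSTEM_ERR    = 5
};

/* Map an RPC accept status to a negative errno value. */
static int sketch_accept_stat_to_errno(enum sketch_accept_stat stat)
{
	switch (stat) {
	case SK_SUCCESS:
		return 0;
	case SK_PROG_UNAVAIL:
	case SK_PROG_MISMATCH:
		return -ENOSYS;		/* the whole program is unusable */
	case SK_PROC_UNAVAIL:
		return -EOPNOTSUPP;	/* only this procedure is missing */
	default:
		return -EIO;		/* anything else is a plain I/O error */
	}
}
```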

Cheers,
  Trond

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> plain text document attachment (patches.suse)
> The issuer of an RPC call should be able to tell the difference
> between an I/O error and program unavailable / program version
> unavailable / procedure unavailable. Return -ENOSYS for unavailable
> RPCs instead of -EIO.
> 
> Only issue a program unavailable warning for program numbers other
> than the one for nfsacl: Clients with nfsacl support are quite
> common already; no need to clutter the syslog.
> 
> Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
> Signed-off-by: Olaf Kirch <okir@suse.de>
> 
> Index: linux-2.6.11-rc2/include/linux/nfs.h
> ===================================================================
> --- linux-2.6.11-rc2.orig/include/linux/nfs.h
> +++ linux-2.6.11-rc2/include/linux/nfs.h
> @@ -11,6 +11,7 @@
>  #include <linux/string.h>
>  
>  #define NFS_PROGRAM	100003
> +#define NFSACL_PROGRAM	100227
>  #define NFS_PORT	2049
>  #define NFS_MAXDATA	8192
>  #define NFS_MAXPATHLEN	1024
> Index: linux-2.6.11-rc2/net/sunrpc/clnt.c
> ===================================================================
> --- linux-2.6.11-rc2.orig/net/sunrpc/clnt.c
> +++ linux-2.6.11-rc2/net/sunrpc/clnt.c
> @@ -988,10 +988,12 @@ call_verify(struct rpc_task *task)
>  				break;
>  			case RPC_MISMATCH:
>  				printk(KERN_WARNING "%s: RPC call version mismatch!\n", __FUNCTION__);
> -				goto out_eio;
> +				error = -ENOSYS;
> +				goto out_err;
>  			default:
>  				printk(KERN_WARNING "%s: RPC call rejected, unknown error: %x\n", __FUNCTION__, n);
> -				goto out_eio;
> +				error = -ENOSYS;
> +				goto out_err;
>  		}
>  		if (--len < 0)
>  			goto out_overflow;
> @@ -1041,23 +1043,28 @@ call_verify(struct rpc_task *task)
>  	case RPC_SUCCESS:
>  		return p;
>  	case RPC_PROG_UNAVAIL:
> -		printk(KERN_WARNING "RPC: call_verify: program %u is unsupported by server %s\n",
> +		if (task->tk_client->cl_prog != NFSACL_PROGRAM) {
> +			printk(KERN_WARNING "RPC: call_verify: program %u is unsupported by server %s\n",
>  				(unsigned int)task->tk_client->cl_prog,
>  				task->tk_client->cl_server);
> -		goto out_eio;
> +		}
> +		error = -ENOSYS;
> +		goto out_err;
>  	case RPC_PROG_MISMATCH:
>  		printk(KERN_WARNING "RPC: call_verify: program %u, version %u unsupported by server %s\n",
>  				(unsigned int)task->tk_client->cl_prog,
>  				(unsigned int)task->tk_client->cl_vers,
>  				task->tk_client->cl_server);
> -		goto out_eio;
> +		error = -ENOSYS;
> +		goto out_err;
>  	case RPC_PROC_UNAVAIL:
>  		printk(KERN_WARNING "RPC: call_verify: proc %p unsupported by program %u, version %u on server %s\n",
>  				task->tk_msg.rpc_proc,
>  				task->tk_client->cl_prog,
>  				task->tk_client->cl_vers,
>  				task->tk_client->cl_server);
> -		goto out_eio;
> +		error = -ENOSYS;
> +		goto out_err;
>  	case RPC_GARBAGE_ARGS:
>  		dprintk("RPC: %4d %s: server saw garbage\n", task->tk_pid, __FUNCTION__);
>  		break;			/* retry */
> @@ -1075,7 +1082,6 @@ out_retry:
>  		return NULL;
>  	}
>  	printk(KERN_WARNING "RPC %s: retry failed, exit EIO\n", __FUNCTION__);
> -out_eio:
>  	error = -EIO;
>  out_err:
>  	rpc_exit(task, error);
> 
> --
> Andreas Gruenbacher <agruen@suse.de>
> SUSE Labs, SUSE LINUX PRODUCTS GMBH
> 
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 8/13] Add noacl nfs mount option
  2005-01-22 20:34 ` [patch 8/13] Add noacl nfs mount option Andreas Gruenbacher
@ 2005-02-15 17:24   ` Trond Myklebust
  2005-02-16 16:10     ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 17:24 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> plain text document attachment (patches.suse)
> With the noacl mount option, nfs clients stop using the ACCESS RPC
> which they usually use to get an access decision from the server.
> Instead, they make the decision based on the file ownership and
> file mode permission bits.

I still hate that name "noacl".

It isn't just that "no acls are being used on the server". It is "no
acls and no *uid/gid mapping* is being used on the server".

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-01-22 20:34 ` [patch 10/13] Solaris nfsacl workaround Andreas Gruenbacher
@ 2005-02-15 17:29   ` Trond Myklebust
  2005-02-15 20:35     ` Olivier Galibert
  2005-02-16 16:17     ` Andreas Gruenbacher
  0 siblings, 2 replies; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 17:29 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> Solaris nfsacl workaround

NACK. No hacks.

Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 11/13] Client side of nfsacl
  2005-01-22 20:34 ` [patch 11/13] Client side of nfsacl Andreas Gruenbacher
@ 2005-02-15 17:49   ` Trond Myklebust
  2005-02-22 13:41     ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 17:49 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> plain text document attachment (patches.suse)
> This adds acl support for nfs clients via the NFSACL protocol extension,
> by implementing the getxattr, listxattr, setxattr, and removexattr iops
> for the system.posix_acl_access and system.posix_acl_default attributes.
> This patch implements a dumb version that uses no caching (and thus adds
> some overhead). (Another patch in this patchset adds caching as well.)

Why are you adding a POSIX-ACL specific function to the nfs_xdr
functions? It is never going to be used for either NFSv2 or NFSv4.

I suggest you rather do the same thing we're doing for the NFSv4 acls,
and provide an nfsv3-specific struct inode_operations that points to
nfsv3-specific {get,set,list}xattr functions.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 12/13] ACL umask handling workaround in nfs client
  2005-01-22 20:34 ` [patch 12/13] ACL umask handling workaround in nfs client Andreas Gruenbacher
  2005-01-25  1:20   ` Andreas Gruenbacher
@ 2005-02-15 18:04   ` Trond Myklebust
  2005-02-22 16:47     ` Andreas Gruenbacher
  1 sibling, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 18:04 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> plain text document attachment (patches.suse)
> NFSv3 has no concept of a umask on the server side: The client applies
> the umask locally, and sends the effective permissions to the server.
> This behavior is wrong when files are created in a directory that has
> a default ACL. In this case, the umask is supposed to be ignored, and
> only the default ACL determines the file's effective permissions.
> 
> Usually it's the server's task to conditionally apply the umask. But
> since the server knows nothing about the umask, we have to do it on the
> client side. This patch tries to fetch the parent directory's default
> ACL before creating a new file, computes the appropriate create mode to
> send to the server, and finally sets the new file's access and default
> acl appropriately.
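
The umask part of that computation can be sketched as below. This is a
simplified illustration with hypothetical names, not the code from the
patch: it only decides whether the client applies the umask before
sending the create mode, and it leaves out the later step in which the
inherited ACL further restricts the permissions via SETACL.

```c
#include <sys/types.h>

/*
 * Compute the mode the client would send in the create request.
 * With a default ACL on the parent directory the umask is ignored;
 * otherwise the usual "mode & ~umask" rule applies.
 */
static mode_t sketch_create_mode(mode_t requested, mode_t umask_bits,
				 int dir_has_default_acl)
{
	if (dir_has_default_acl)
		return requested;
	return requested & ~umask_bits;
}
```

For a request of 0666 under umask 022 this gives 0644 without a default
ACL and 0666 with one.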


Firstly, this sort of code belongs in the NFSv3-specific code. POSIX
acls have no business whatsoever in the generic NFS code.

Secondly, what is the point of doing all this *after* you have created
the file with the wrong permissions? How are you avoiding races?

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 7/13] Encode and decode arbitrary XDR arrays
  2005-01-22 20:34 ` [patch 7/13] Encode and decode arbitrary XDR arrays Andreas Gruenbacher
@ 2005-02-15 19:17   ` Trond Myklebust
  2005-02-16 16:08     ` Andreas Gruenbacher
  2005-02-17 14:12     ` Adrian Bunk
  0 siblings, 2 replies; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 19:17 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> plain text document attachment (patches.suse)
> Add xdr_encode_array2 and xdr_decode_array2 functions for encoding
> and decoding arrays with arbitrary entries, such as acl entries. The
> goal here is to do this without allocating a contiguous temporary
> buffer.

net/sunrpc/xdr.c:1024:3: warning: mixing declarations and code
net/sunrpc/xdr.c:967:16: warning: bad constant expression

Please don't use these gcc extensions in the kernel.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 17:29   ` Trond Myklebust
@ 2005-02-15 20:35     ` Olivier Galibert
  2005-02-15 22:43       ` Trond Myklebust
  2005-02-16 16:17     ` Andreas Gruenbacher
  1 sibling, 1 reply; 85+ messages in thread
From: Olivier Galibert @ 2005-02-15 20:35 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 15, 2005 at 12:29:06PM -0500, Trond Myklebust wrote:
> On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > Solaris nfsacl workaround
> 
> NACK. No hacks.

That's the second time I see you refusing an interoperability patch
without bothering to say what would be acceptable.  Do we need a fork
between knfsd-pure and knfsd-actually-works-in-the-real-world or what?

  OG.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 20:35     ` Olivier Galibert
@ 2005-02-15 22:43       ` Trond Myklebust
  2005-02-15 23:02         ` Olivier Galibert
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 22:43 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux-kernel

On Tue, 15.02.2005 at 21:35 (+0100), Olivier Galibert wrote:
> That's the second time I see you refusing an interoperability patch
> without bothering to say what would be acceptable.  Do we need a fork
> between knfsd-pure and knfsd-actually-works-in-the-real-world or what?

You appear to be under the misguided impression that if a patch is
reviewed, and rejected, then somehow the responsibility for resolving
your problem (and cleaning up the code) falls to the reviewer.

I'm not aware of any such rule.


Feel free to apply as many hacks as you like to your own private fork,
but the mainline kernel has to be maintained by a community of people
most of which will not be aware, when debugging the ACL code, of the
little special cases peppered around the RPC server code in an entirely
different section of the kernel.

In this particular case, there are 100s of solutions that do not involve
putting NFSv3 ACL code in the generic RPC layer. If there really is a
generic need for RPC programs to override the RFC-specified error that
is returned to the client, then one obvious solution is to add a
callback that allows them to do so.

Even a flag called RPC_SVC_DO_NOT_EVER_RETURN_BAD_VERS would be easier
to maintain.

  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 22:43       ` Trond Myklebust
@ 2005-02-15 23:02         ` Olivier Galibert
  2005-02-15 23:37           ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Olivier Galibert @ 2005-02-15 23:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On Tue, Feb 15, 2005 at 05:43:24PM -0500, Trond Myklebust wrote:
> On Tue, 15.02.2005 at 21:35 (+0100), Olivier Galibert wrote:
> > That's the second time I see you refusing an interoperability patch
> > without bothering to say what would be acceptable.  Do we need a fork
> > between knfsd-pure and knfsd-actually-works-in-the-real-world or what?
> 
> You appear to be under the misguided impression that if a patch is
> reviewed, and rejected, then somehow the responsibility for resolving
> your problem (and cleaning up the code) falls to the reviewer.
> 
> I'm not aware of any such rule.

Resolving the problem and/or cleaning the code, no.  Telling what kind
of patch would be acceptable is your responsibility, yes.  That's
where the difference is between a reviewer and a naysayer.

  OG.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 23:02         ` Olivier Galibert
@ 2005-02-15 23:37           ` Trond Myklebust
  2005-02-15 23:43             ` Olivier Galibert
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-15 23:37 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux-kernel

On Wed, 16.02.2005 at 00:02 (+0100), Olivier Galibert wrote:

> Resolving the problem and/or cleaning the code, no.  Telling what kind
> of patch would be acceptable is your responsibility, yes.

Read the patch, read the earlier patch [2/13] in which the same hack
appeared in the client code, and see my response. I'm sure Andreas knows
exactly what was meant by the comment and why.

   Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 23:37           ` Trond Myklebust
@ 2005-02-15 23:43             ` Olivier Galibert
  0 siblings, 0 replies; 85+ messages in thread
From: Olivier Galibert @ 2005-02-15 23:43 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On Tue, Feb 15, 2005 at 06:37:19PM -0500, Trond Myklebust wrote:
> On Wed, 16.02.2005 at 00:02 (+0100), Olivier Galibert wrote:
> 
> > Resolving the problem and/or cleaning the code, no.  Telling what kind
> > of patch would be acceptable is your responsibility, yes.
> 
> Read the patch, read the earlier patch [2/13] in which the same hack
> appeared in the client code, and see my response. I'm sure Andreas knows
> exactly what was meant by the comment and why.

Ok.  You have my apologies then.

  OG.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 2/13] Return -ENOSYS for RPC programs that are unavailable
  2005-02-15 17:04   ` Trond Myklebust
@ 2005-02-16 15:32     ` Andreas Gruenbacher
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-16 15:32 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 663 bytes --]

First, thanks for your feedback.

On Tue, 2005-02-15 at 18:04, Trond Myklebust wrote:
> No hacks in sunrpc, please: i.e. get rid of that NFSACL_PROGRAM
> exception...
> If you want to kill those warnings, please just convert them to
> dprintks().

Fine with me.

> Also, why are you converting "unknown error" into ENOSYS?

That's a bug.

> Finally, it might make sense to distinguish between "program" and
> "procedure" errors. How about converting that RPC_PROC_UNAVAIL error
> into EOPNOTSUPP (like we already do in the NFS layer itself).

Okay, that shouldn't hurt. Fixes attached.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH

[-- Attachment #2: nfsacl-return-enosys-for-rpc-programs-that-are-unavailable-fix.patch --]
[-- Type: text/x-patch, Size: 1461 bytes --]

Index: linux-2.6.11-rc3/net/sunrpc/clnt.c
===================================================================
--- linux-2.6.11-rc3.orig/net/sunrpc/clnt.c
+++ linux-2.6.11-rc3/net/sunrpc/clnt.c
@@ -988,11 +988,11 @@ call_verify(struct rpc_task *task)
 				break;
 			case RPC_MISMATCH:
 				printk(KERN_WARNING "%s: RPC call version mismatch!\n", __FUNCTION__);
-				error = -ENOSYS;
+				error = -EIO;
 				goto out_err;
 			default:
 				printk(KERN_WARNING "%s: RPC call rejected, unknown error: %x\n", __FUNCTION__, n);
-				error = -ENOSYS;
+				error = -EIO;
 				goto out_err;
 		}
 		if (--len < 0)
@@ -1043,10 +1043,8 @@ call_verify(struct rpc_task *task)
 	case RPC_SUCCESS:
 		return p;
 	case RPC_PROG_UNAVAIL:
-		if (task->tk_client->cl_prog != NFSACL_PROGRAM) {
-			printk(KERN_WARNING "RPC: call_verify: program %u is unsupported by server %s\n",
-				(unsigned int)task->tk_client->cl_prog,
-				task->tk_client->cl_server);
-		}
+		dprintk("RPC: call_verify: program %u is unsupported by server %s\n",
+			(unsigned int)task->tk_client->cl_prog,
+			task->tk_client->cl_server);
 		error = -ENOSYS;
 		goto out_err;
@@ -1063,7 +1061,7 @@ call_verify(struct rpc_task *task)
 				task->tk_client->cl_prog,
 				task->tk_client->cl_vers,
 				task->tk_client->cl_server);
-		error = -ENOSYS;
+		error = -EOPNOTSUPP;
 		goto out_err;
 	case RPC_GARBAGE_ARGS:
 		dprintk("RPC: %4d %s: server saw garbage\n", task->tk_pid, __FUNCTION__);

[-- Attachment #3: nfsacl-client-side-of-nfsacl-fix2.patch --]
[-- Type: text/x-patch, Size: 805 bytes --]

Index: linux-2.6.11-rc3/fs/nfs/nfs3proc.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/nfs3proc.c
+++ linux-2.6.11-rc3/fs/nfs/nfs3proc.c
@@ -760,7 +760,7 @@ nfs3_proc_getacl(struct inode *inode, in
 		__free_page(args.pages[count]);
 
 	if (status) {
-		if (status == -ENOSYS) {
+		if (status == -ENOSYS || status == -EOPNOTSUPP) {
 			dprintk("NFS_ACL extension not supported; disabling\n");
 			server->flags &= ~NFSACL;
 			status = -EOPNOTSUPP;
@@ -845,7 +845,7 @@ nfs3_proc_setacls(struct inode *inode, s
 		__free_page(args.pages[count]);
 
 	if (status) {
-		if (status == -ENOSYS) {
+		if (status == -ENOSYS || status == -EOPNOTSUPP) {
 			dprintk("NFS_ACL SETACL RPC not supported"
 				"(will not retry)\n");
 			server->flags &= ~NFSACL;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 7/13] Encode and decode arbitrary XDR arrays
  2005-02-15 19:17   ` Trond Myklebust
@ 2005-02-16 16:08     ` Andreas Gruenbacher
  2005-02-17 14:12     ` Adrian Bunk
  1 sibling, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-16 16:08 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

On Tue, 2005-02-15 at 20:17, Trond Myklebust wrote:
> On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > plain text document attachment (patches.suse)
> > Add xdr_encode_array2 and xdr_decode_array2 functions for encoding
> > and decoding arrays with arbitrary entries, such as acl entries. The
> > goal here is to do this without allocating a contiguous temporary
> > buffer.
> 
> net/sunrpc/xdr.c:1024:3: warning: mixing declarations and code
> net/sunrpc/xdr.c:967:16: warning: bad constant expression
> 
> Please don't use these gcc extensions in the kernel.

Andrew has already fixed the "mixing declarations and code" thing. The
attached patch kmallocs the buffer if needed. This uglifies the code
quite a bit though...
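
The lazy-allocation idea in the attached patch can be illustrated outside the kernel. The following is only a user-space sketch under stated assumptions: `struct segment`, `for_each_elem()` and `collect_elem()` are hypothetical names, `malloc()` stands in for `kmalloc()`, and a plain array of segments stands in for the pieces of an xdr_buf. The scratch buffer is allocated only when an element straddles a segment boundary, so the common case needs no allocation at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* One contiguous piece of a segmented buffer (hypothetical stand-in). */
struct segment {
	const unsigned char *data;
	size_t len;
};

typedef int (*elem_fn)(const void *elem, void *ctx);

/* Example callback: append each element to a flat output buffer. */
struct collect {
	unsigned char *out;
	size_t pos;
	size_t elem_size;
};

int collect_elem(const void *elem, void *ctx)
{
	struct collect *c = ctx;

	memcpy(c->out + c->pos, elem, c->elem_size);
	c->pos += c->elem_size;
	return 0;
}

/* Advance to the next segment that still has unread bytes. */
static int skip_empty(const struct segment *segs, size_t nsegs,
		      size_t *seg, size_t *off)
{
	while (*seg < nsegs && *off == segs[*seg].len) {
		(*seg)++;
		*off = 0;
	}
	return *seg < nsegs ? 0 : -EINVAL;
}

int for_each_elem(const struct segment *segs, size_t nsegs,
		  size_t elem_size, size_t nelems, elem_fn fn, void *ctx)
{
	unsigned char *scratch = NULL;	/* allocated only if needed */
	size_t seg = 0, off = 0, i, copied, l;
	int err = 0;

	assert(elem_size > 0);
	for (i = 0; i < nelems; i++) {
		err = skip_empty(segs, nsegs, &seg, &off);
		if (err)
			goto out;
		if (segs[seg].len - off >= elem_size) {
			/* Fast path: the element is contiguous, no copy. */
			err = fn(segs[seg].data + off, ctx);
			off += elem_size;
		} else {
			/* Slow path: the element straddles a boundary. */
			if (!scratch) {
				scratch = malloc(elem_size);
				if (!scratch) {
					err = -ENOMEM;
					goto out;
				}
			}
			for (copied = 0; copied < elem_size; copied += l) {
				err = skip_empty(segs, nsegs, &seg, &off);
				if (err)
					goto out;
				l = segs[seg].len - off;
				if (l > elem_size - copied)
					l = elem_size - copied;
				memcpy(scratch + copied,
				       segs[seg].data + off, l);
				off += l;
			}
			err = fn(scratch, ctx);
		}
		if (err)
			goto out;
	}
out:
	free(scratch);	/* free(NULL) is a no-op */
	return err;
}
```

The single exit label mirrors the `out:` path in the patch: every error path releases the scratch buffer in one place.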

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH

[-- Attachment #2: nfsacl-encode-and-decode-arbitrary-xdr-arrays-fix2.patch --]
[-- Type: text/x-patch, Size: 1873 bytes --]

Index: linux-2.6.11-rc3/net/sunrpc/xdr.c
===================================================================
--- linux-2.6.11-rc3.orig/net/sunrpc/xdr.c
+++ linux-2.6.11-rc3/net/sunrpc/xdr.c
@@ -964,10 +964,10 @@ static int
 xdr_xcode_array2(struct xdr_buf *buf, unsigned int base,
 		 struct xdr_array2_desc *desc, int encode)
 {
-	char elem[desc->elem_size], *c;
+	char *elem = NULL, *c;
 	unsigned int copied = 0, todo, avail_here;
 	struct page **ppages = NULL;
-	int err = 0;
+	int err;
 
 	if (encode) {
 		if (xdr_encode_word(buf, base, desc->array_len) != 0)
@@ -1000,6 +1000,12 @@ xdr_xcode_array2(struct xdr_buf *buf, un
 			avail_here -= desc->elem_size;
 		}
 		if (avail_here) {
+			if (!elem) {
+				elem = kmalloc(desc->elem_size, GFP_KERNEL);
+				err = -ENOMEM;
+				if (!elem)
+					goto out;
+			}
 			if (encode) {
 				err = desc->xcode(desc, elem);
 				if (err)
@@ -1032,6 +1038,13 @@ xdr_xcode_array2(struct xdr_buf *buf, un
 			if (copied || avail_page < desc->elem_size) {
 				unsigned int l = min(avail_page,
 					desc->elem_size - copied);
+				if (!elem) {
+					elem = kmalloc(desc->elem_size,
+						       GFP_KERNEL);
+					err = -ENOMEM;
+					if (!elem)
+						goto out;
+				}
 				if (encode) {
 					if (!copied) {
 						err = desc->xcode(desc, elem);
@@ -1065,6 +1078,13 @@ xdr_xcode_array2(struct xdr_buf *buf, un
 			if (avail_page) {
 				unsigned int l = min(avail_page,
 					    desc->elem_size - copied);
+				if (!elem) {
+					elem = kmalloc(desc->elem_size,
+						       GFP_KERNEL);
+					err = -ENOMEM;
+					if (!elem)
+						goto out;
+				}
 				if (encode) {
 					if (!copied) {
 						err = desc->xcode(desc, elem);
@@ -1124,8 +1144,11 @@ xdr_xcode_array2(struct xdr_buf *buf, un
 			todo -= desc->elem_size;
 		}
 	}
+	err = 0;
 
 out:
+	if (elem)
+		kfree(elem);
 	if (ppages)
 		kunmap(*ppages);
 	return err;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 8/13] Add noacl nfs mount option
  2005-02-15 17:24   ` Trond Myklebust
@ 2005-02-16 16:10     ` Andreas Gruenbacher
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-16 16:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Tue, 2005-02-15 at 18:24, Trond Myklebust wrote:
> On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > plain text document attachment (patches.suse)
> > With the noacl mount option, nfs clients stop using the ACCESS RPC
> > which they usually use to get an access decision from the server.
> > Instead, they make the decision based on the file ownership and
> > file mode permission bits.
> 
> I still hate that name "noacl".

So what would you call it? I'm not religious about the name.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-15 17:29   ` Trond Myklebust
  2005-02-15 20:35     ` Olivier Galibert
@ 2005-02-16 16:17     ` Andreas Gruenbacher
  2005-02-16 17:05       ` Trond Myklebust
  1 sibling, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-16 16:17 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Tue, 2005-02-15 at 18:29, Trond Myklebust wrote:
> On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > Solaris nfsacl workaround
> 
> NACK. No hacks.

Well, I'm not in a position to fix Solaris. It would be possible to
implement NFSACL for NFSv2 (Solaris has it), but I doubt that we need
it. Your NACK probably means we'll have to carry it around as a vendor
patch.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-16 16:17     ` Andreas Gruenbacher
@ 2005-02-16 17:05       ` Trond Myklebust
  2005-02-16 17:39         ` Andreas Gruenbacher
  0 siblings, 1 reply; 85+ messages in thread
From: Trond Myklebust @ 2005-02-16 17:05 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Wed, 16.02.2005 at 17:17 (+0100), Andreas Gruenbacher wrote:
> On Tue, 2005-02-15 at 18:29, Trond Myklebust wrote:
> > On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > > Solaris nfsacl workaround
> > 
> > NACK. No hacks.
> 
> Well, I'm not in a position to fix Solaris. It would be possible to
> implement NFSACL for NFSv2 (Solaris has it), but I doubt that we need
> it. Your NACK probably means we'll have to carry it around as a vendor
> patch.

See the thread between Olivier Galibert and me. There are ways of doing
this which do not involve putting nfsacl code in the generic sunrpc
layer.
Either a callback or a flag in the "struct svc_program" to override the
standard RPC server reply (instead of checking the ACL program number)
would be fine as far as I'm concerned. I can't speak for Neil's
preferences, though.
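
One way to picture the callback variant is the sketch below. It is not the sunrpc API: `struct program_desc`, `check_version()` and `pretend_unavailable()` are hypothetical names, and the choice of PROG_UNAVAIL as the overriding reply is an assumption made for illustration. Only the numeric accept_stat values come from the ONC RPC specification. The generic code never looks at the program number; a program that wants a non-standard reply registers a hook.

```c
#include <assert.h>
#include <stddef.h>

/* accept_stat values as defined by the ONC RPC specification. */
enum accept_stat {
	ACCEPT_SUCCESS       = 0,
	ACCEPT_PROG_UNAVAIL  = 1,
	ACCEPT_PROG_MISMATCH = 2
};

struct program_desc;
typedef enum accept_stat (*bad_vers_fn)(const struct program_desc *p,
					unsigned int vers);

/* Hypothetical, reduced program descriptor. */
struct program_desc {
	unsigned int prog;	/* RPC program number */
	unsigned int lovers;	/* lowest supported version */
	unsigned int hivers;	/* highest supported version */
	bad_vers_fn bad_vers;	/* optional override, may be NULL */
};

/* Generic layer: no knowledge of any particular program number. */
enum accept_stat check_version(const struct program_desc *p,
			       unsigned int vers)
{
	assert(p != NULL);
	if (vers >= p->lovers && vers <= p->hivers)
		return ACCEPT_SUCCESS;
	if (p->bad_vers)
		return p->bad_vers(p, vers);
	return ACCEPT_PROG_MISMATCH;	/* the standard reply */
}

/* Program-specific hook (assumed behaviour, for illustration only). */
enum accept_stat pretend_unavailable(const struct program_desc *p,
				     unsigned int vers)
{
	(void)p;
	(void)vers;
	return ACCEPT_PROG_UNAVAIL;
}
```

A flag in the descriptor would work the same way; the callback is simply the more general of the two options mentioned above.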

I am, however, surprised when you say that Solaris has problems with
this. The PROG_MISMATCH error also tells the client the minimum and
maximum supported versions, so if all is working well, it should
recognize that we support version 3 only. It seems weird that they should
then choose to treat that as a mount failure.
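
For illustration, a client could use the low/high pair from a PROG_MISMATCH reply roughly as follows. This is a sketch, not code from any real client: `pick_version()` is a hypothetical helper, and it assumes that valid version numbers start at 1, so that 0 can mean "no common version".

```c
#include <assert.h>

/*
 * Return the highest version supported by both sides, or 0 if the
 * client range [cl_min, cl_max] and the server range [sv_low, sv_high]
 * do not overlap.
 */
unsigned int pick_version(unsigned int cl_min, unsigned int cl_max,
			  unsigned int sv_low, unsigned int sv_high)
{
	unsigned int lo, hi;

	assert(cl_min <= cl_max && sv_low <= sv_high);
	lo = cl_min > sv_low ? cl_min : sv_low;
	hi = cl_max < sv_high ? cl_max : sv_high;
	return lo <= hi ? hi : 0;
}
```

A client that speaks versions 2 and 3 and is told the server supports only 3 would retry with version 3 instead of giving up.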

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 10/13] Solaris nfsacl workaround
  2005-02-16 17:05       ` Trond Myklebust
@ 2005-02-16 17:39         ` Andreas Gruenbacher
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-16 17:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Wed, 2005-02-16 at 18:05, Trond Myklebust wrote:
> I am, however, surprised when you say that Solaris has problems with
> this. The PROG_MISMATCH error also tells the client the minimum and
> maximum supported versions, so if all is working well, it should
> recognize that we support version 3 only. It seems weird that they should
> then choose to treat that as a mount failure.

Well, yes. It's a weird bug.

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 7/13] Encode and decode arbitrary XDR arrays
  2005-02-15 19:17   ` Trond Myklebust
  2005-02-16 16:08     ` Andreas Gruenbacher
@ 2005-02-17 14:12     ` Adrian Bunk
  1 sibling, 0 replies; 85+ messages in thread
From: Adrian Bunk @ 2005-02-17 14:12 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andreas Gruenbacher, linux-kernel, Neil Brown, Olaf Kirch,
	Andries E. Brouwer, Buck Huppmann, Andrew Morton

On Tue, Feb 15, 2005 at 02:17:18PM -0500, Trond Myklebust wrote:
> 
> net/sunrpc/xdr.c:1024:3: warning: mixing declarations and code
>...
> Please don't use these gcc extensions in the kernel.

Just for the record:
This is not a gcc extension - this is C99 but not supported by
gcc 2.95 (which is a supported compiler for kernel 2.6).
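
To make the distinction concrete, here is the same function written both ways. The names are made up for this sketch. The first form needs a C99 compiler; the second keeps all declarations at the start of the block and is the form that compiles as C89.

```c
#include <assert.h>
#include <stddef.h>

/* C99 style: declarations appear after a statement and in the for loop. */
long sum_c99(const int *v, size_t n)
{
	assert(v != NULL || n == 0);
	long total = 0;			/* declaration after code */
	for (size_t i = 0; i < n; i++)	/* declaration inside for */
		total += v[i];
	return total;
}

/* C89 style: every declaration precedes the first statement. */
long sum_c89(const int *v, size_t n)
{
	long total = 0;
	size_t i;

	assert(v != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}
```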

> Cheers,
>   Trond

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 11/13] Client side of nfsacl
  2005-02-15 17:49   ` Trond Myklebust
@ 2005-02-22 13:41     ` Andreas Gruenbacher
  2005-02-22 14:13       ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-22 13:41 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 404 bytes --]

On Tue, 2005-02-15 at 18:49, Trond Myklebust wrote:
> I suggest you rather do the same thing we're doing for the NFSv4 acls,
> and provide an nfsv3-specific struct inode_operations that points to
> nfsv3-specific {get,set,list}xattr functions.

Okay, that requires iops for file, dir, and others. How about the
attached patch?
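
The design being discussed, per-protocol operation tables instead of version checks inside each function, can be sketched in plain C. Everything here is hypothetical and heavily reduced: `struct node_ops` and `struct proto_ops` only stand in for the role that struct inode_operations and struct nfs_rpc_ops play in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced operations table. */
struct node_ops {
	int (*getxattr)(const char *name);	/* NULL: not supported */
};

/* Hypothetical per-protocol descriptor. */
struct proto_ops {
	int version;
	const struct node_ops *file_ops;
	const struct node_ops *special_ops;	/* may be NULL */
};

static int v3_getxattr(const char *name)
{
	(void)name;
	return 0;	/* pretend the attribute was fetched */
}

const struct node_ops generic_ops = { NULL };
const struct node_ops v3_ops = { v3_getxattr };

const struct proto_ops v2_proto = { 2, &generic_ops, NULL };
const struct proto_ops v3_proto = { 3, &v3_ops, &v3_ops };

/* Chosen once, when the inode is set up; falls back if unset. */
const struct node_ops *special_ops_for(const struct proto_ops *p)
{
	assert(p != NULL);
	return p->special_ops ? p->special_ops : &generic_ops;
}

/* Callers dispatch through the table and never test the version. */
int do_getxattr(const struct node_ops *ops, const char *name)
{
	if (!ops->getxattr)
		return -EOPNOTSUPP;
	return ops->getxattr(name);
}
```

With this shape the version-specific functions can drop their own version checks, because they are only reachable through a table that belongs to the right protocol.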

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH

[-- Attachment #2: nfsacl-client-side-of-nfsacl-fix3.patch --]
[-- Type: text/x-patch, Size: 8071 bytes --]

Index: linux-2.6.11-rc3/fs/nfs/nfs3proc.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/nfs3proc.c
+++ linux-2.6.11-rc3/fs/nfs/nfs3proc.c
@@ -1045,7 +1045,9 @@ nfs3_proc_lock(struct file *filp, int cm
 struct nfs_rpc_ops	nfs_v3_clientops = {
 	.version	= 3,			/* protocol version */
 	.dentry_ops	= &nfs_dentry_operations,
-	.dir_inode_ops	= &nfs_dir_inode_operations,
+	.file_inode_ops	= &nfs3_file_inode_operations,
+	.dir_inode_ops	= &nfs3_dir_inode_operations,
+	.special_inode_ops = &nfs3_special_inode_operations,
 	.getroot	= nfs3_proc_get_root,
 	.getattr	= nfs3_proc_getattr,
 	.setattr	= nfs3_proc_setattr,
Index: linux-2.6.11-rc3/fs/nfs/inode.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/inode.c
+++ linux-2.6.11-rc3/fs/nfs/inode.c
@@ -674,7 +674,7 @@ nfs_init_locked(struct inode *inode, voi
 #define NFS_LIMIT_READDIRPLUS (8*PAGE_SIZE)
 
 #ifdef CONFIG_NFS_ACL
-static struct inode_operations nfs_special_inode_operations = {
+struct inode_operations nfs3_special_inode_operations = {
 	.permission =	nfs_permission,
 	.getattr =	nfs_getattr,
 	.setattr =	nfs_setattr,
@@ -725,7 +725,7 @@ nfs_fhget(struct super_block *sb, struct
 		/* Why so? Because we want revalidate for devices/FIFOs, and
 		 * that's precisely what we have in nfs_file_inode_operations.
 		 */
-		inode->i_op = &nfs_file_inode_operations;
+		inode->i_op = NFS_SB(sb)->rpc_ops->file_inode_ops;
 		if (S_ISREG(inode->i_mode)) {
 			inode->i_fop = &nfs_file_operations;
 			inode->i_data.a_ops = &nfs_file_aops;
@@ -739,9 +739,9 @@ nfs_fhget(struct super_block *sb, struct
 		} else if (S_ISLNK(inode->i_mode))
 			inode->i_op = &nfs_symlink_inode_operations;
 		else {
-#ifdef CONFIG_NFS_ACL
-			inode->i_op = &nfs_special_inode_operations;
-#endif
+			if (NFS_SB(sb)->rpc_ops->special_inode_ops)
+				inode->i_op = NFS_SB(sb)->rpc_ops->
+						       special_inode_ops;
 			init_special_inode(inode, inode->i_mode, fattr->rdev);
 		}
 
Index: linux-2.6.11-rc3/fs/nfs/nfs4proc.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/nfs4proc.c
+++ linux-2.6.11-rc3/fs/nfs/nfs4proc.c
@@ -2596,6 +2596,7 @@ nfs4_proc_lock(struct file *filp, int cm
 struct nfs_rpc_ops	nfs_v4_clientops = {
 	.version	= 4,			/* protocol version */
 	.dentry_ops	= &nfs4_dentry_operations,
+	.file_inode_ops	= &nfs_file_inode_operations,
 	.dir_inode_ops	= &nfs4_dir_inode_operations,
 	.getroot	= nfs4_proc_get_root,
 	.getattr	= nfs4_proc_getattr,
Index: linux-2.6.11-rc3/fs/nfs/proc.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/proc.c
+++ linux-2.6.11-rc3/fs/nfs/proc.c
@@ -619,6 +619,7 @@ nfs_proc_lock(struct file *filp, int cmd
 struct nfs_rpc_ops	nfs_v2_clientops = {
 	.version	= 2,		       /* protocol version */
 	.dentry_ops	= &nfs_dentry_operations,
+	.file_inode_ops	= &nfs_file_inode_operations,
 	.dir_inode_ops	= &nfs_dir_inode_operations,
 	.getroot	= nfs_proc_get_root,
 	.getattr	= nfs_proc_getattr,
Index: linux-2.6.11-rc3/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc3/fs/nfs/dir.c
@@ -72,11 +72,28 @@ struct inode_operations nfs_dir_inode_op
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
+};
+
+#ifdef CONFIG_NFS_V3
+struct inode_operations nfs3_dir_inode_operations = {
+	.create		= nfs_create,
+	.lookup		= nfs_lookup,
+	.link		= nfs_link,
+	.unlink		= nfs_unlink,
+	.symlink	= nfs_symlink,
+	.mkdir		= nfs_mkdir,
+	.rmdir		= nfs_rmdir,
+	.mknod		= nfs_mknod,
+	.rename		= nfs_rename,
+	.permission	= nfs_permission,
+	.getattr	= nfs_getattr,
+	.setattr	= nfs_setattr,
 	.listxattr	= nfs_listxattr,
 	.getxattr	= nfs_getxattr,
 	.setxattr	= nfs_setxattr,
 	.removexattr	= nfs_removexattr,
 };
+#endif  /* CONFIG_NFS_V3 */
 
 #ifdef CONFIG_NFS_V4
 
Index: linux-2.6.11-rc3/include/linux/nfs_xdr.h
===================================================================
--- linux-2.6.11-rc3.orig/include/linux/nfs_xdr.h
+++ linux-2.6.11-rc3/include/linux/nfs_xdr.h
@@ -689,7 +689,9 @@ struct nfs_access_entry;
 struct nfs_rpc_ops {
 	int	version;		/* Protocol version */
 	struct dentry_operations *dentry_ops;
+	struct inode_operations *file_inode_ops;
 	struct inode_operations *dir_inode_ops;
+	struct inode_operations *special_inode_ops;
 
 	int	(*getroot) (struct nfs_server *, struct nfs_fh *,
 			    struct nfs_fsinfo *);
Index: linux-2.6.11-rc3/fs/nfs/xattr.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/xattr.c
+++ linux-2.6.11-rc3/fs/nfs/xattr.c
@@ -11,9 +11,6 @@ nfs_listxattr(struct dentry *dentry, cha
 	struct posix_acl *acl;
 	int pos=0, len=0;
 
-	if (NFS_PROTO(inode)->version != 3 || !NFS_PROTO(inode)->getacl)
-		return -EOPNOTSUPP;
-
 #	define output(s) do {						\
 			if (pos + sizeof(s) <= size) {			\
 				memcpy(buffer + pos, s, sizeof(s));	\
@@ -61,9 +58,7 @@ nfs_getxattr(struct dentry *dentry, cons
 	else
 		return -EOPNOTSUPP;
 
-	acl = ERR_PTR(-EOPNOTSUPP);
-	if (NFS_PROTO(inode)->version == 3 && NFS_PROTO(inode)->getacl)
-		acl = NFS_PROTO(inode)->getacl(inode, type);
+	acl = NFS_PROTO(inode)->getacl(inode, type);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	else if (acl) {
@@ -92,8 +87,6 @@ nfs_setxattr(struct dentry *dentry, cons
 		type = ACL_TYPE_DEFAULT;
 	else
 		return -EOPNOTSUPP;
-	if (NFS_PROTO(inode)->version != 3 || !NFS_PROTO(inode)->setacl)
-		return -EOPNOTSUPP;
 
 	acl = posix_acl_from_xattr(value, size);
 	if (IS_ERR(acl))
@@ -108,7 +101,7 @@ int
 nfs_removexattr(struct dentry *dentry, const char *name)
 {
 	struct inode *inode = dentry->d_inode;
-	int error, type;
+	int type;
 
 	if (strcmp(name, XATTR_NAME_ACL_ACCESS) == 0)
 		type = ACL_TYPE_ACCESS;
@@ -117,9 +110,5 @@ nfs_removexattr(struct dentry *dentry, c
 	else
 		return -EOPNOTSUPP;
 
-	error = -EOPNOTSUPP;
-	if (NFS_PROTO(inode)->version == 3 && NFS_PROTO(inode)->setacl)
-		error = NFS_PROTO(inode)->setacl(inode, type, NULL);
-
-	return error;
+	return NFS_PROTO(inode)->setacl(inode, type, NULL);
 }
Index: linux-2.6.11-rc3/fs/nfs/file.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/file.c
+++ linux-2.6.11-rc3/fs/nfs/file.c
@@ -68,11 +68,19 @@ struct inode_operations nfs_file_inode_o
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
+};
+
+#ifdef CONFIG_NFS_V3
+struct inode_operations nfs3_file_inode_operations = {
+	.permission	= nfs_permission,
+	.getattr	= nfs_getattr,
+	.setattr	= nfs_setattr,
 	.listxattr	= nfs_listxattr,
 	.getxattr	= nfs_getxattr,
 	.setxattr	= nfs_setxattr,
 	.removexattr	= nfs_removexattr,
 };
+#endif  /* CONFIG_NFS_V3 */
 
 /* Hack for future NFS swap support */
 #ifndef IS_SWAPFILE
Index: linux-2.6.11-rc3/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.11-rc3.orig/include/linux/nfs_fs.h
+++ linux-2.6.11-rc3/include/linux/nfs_fs.h
@@ -281,6 +281,8 @@ static inline int nfs_verify_change_attr
 /*
  * linux/fs/nfs/inode.c
  */
+extern struct inode_operations nfs3_special_inode_operations;
+
 extern void nfs_zap_caches(struct inode *);
 extern struct inode *nfs_fhget(struct super_block *, struct nfs_fh *,
 				struct nfs_fattr *);
@@ -314,6 +316,7 @@ extern u32 root_nfs_parse_addr(char *nam
  * linux/fs/nfs/file.c
  */
 extern struct inode_operations nfs_file_inode_operations;
+extern struct inode_operations nfs3_file_inode_operations;
 extern struct file_operations nfs_file_operations;
 extern struct address_space_operations nfs_file_aops;
 
@@ -358,6 +361,7 @@ extern ssize_t nfs_file_direct_write(str
  * linux/fs/nfs/dir.c
  */
 extern struct inode_operations nfs_dir_inode_operations;
+extern struct inode_operations nfs3_dir_inode_operations;
 extern struct file_operations nfs_dir_operations;
 extern struct dentry_operations nfs_dentry_operations;
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 11/13] Client side of nfsacl
  2005-02-22 13:41     ` Andreas Gruenbacher
@ 2005-02-22 14:13       ` Trond Myklebust
  0 siblings, 0 replies; 85+ messages in thread
From: Trond Myklebust @ 2005-02-22 14:13 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer,
	Buck Huppmann, Andrew Morton

On Tue, 22.02.2005 at 14:41 (+0100), Andreas Gruenbacher wrote:
> On Tue, 2005-02-15 at 18:49, Trond Myklebust wrote:
> > I suggest you rather do the same thing we're doing for the NFSv4 acls,
> > and provide an nfsv3-specific struct inode_operations that points to
> > nfsv3-specific {get,set,list}xattr functions.
> 
> Okay, that requires iops for file, dir, and others. How about the
> attached patch?

That's fine by me.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 12/13] ACL umask handling workaround in nfs client
  2005-02-15 18:04   ` Trond Myklebust
@ 2005-02-22 16:47     ` Andreas Gruenbacher
  2005-02-22 17:43       ` Trond Myklebust
  0 siblings, 1 reply; 85+ messages in thread
From: Andreas Gruenbacher @ 2005-02-22 16:47 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]

On Tue, 2005-02-15 at 19:04, Trond Myklebust wrote:
> On Sat, 22.01.2005 at 21:34 (+0100), Andreas Gruenbacher wrote:
> > plain text document attachment (patches.suse)
> > NFSv3 has no concept of a umask on the server side: The client applies
> > the umask locally, and sends the effective permissions to the server.
> > This behavior is wrong when files are created in a directory that has
> > a default ACL. In this case, the umask is supposed to be ignored, and
> > only the default ACL determines the file's effective permissions.
> > 
> > Usually it's the server's task to conditionally apply the umask. But
> > since the server knows nothing about the umask, we have to do it on the
> > client side. This patch tries to fetch the parent directory's default
> > ACL before creating a new file, computes the appropriate create mode to
> > send to the server, and finally sets the new file's access and default
> > acl appropriately.
> 
> Firstly, this sort of code belongs in the NFSv3-specific code. POSIX
> acls have no business whatsoever in the generic NFS code.

See attached patch.

NOTE:

  During testing I noticed that without
  nfsacl-cache-acls-on-the-nfs-client-side.patch, no directories or
  devices can be created. It's probably a problem with
  nfs_set_default_acl(). I'll have to debug this tomorrow.

> Secondly, what is the point of doing all this *after* you have created
> the file with the wrong permissions? How are you avoiding races?

Well, everything but the umask is always correct; that is guaranteed by
the server. The initial create sets permissions that may be more
restrictive than necessary, and then the SETACL RPC sets up the final,
correct permissions. I don't believe that a race-free solution is
possible.
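
The argument can be checked with a deliberately simplified model. Assumptions, all mine: the default ACL is reduced to a single 9-bit permission mask `dfacl_perms` (real ACLs have per-user and per-group entries plus a mask entry), the server intersects the create mode with that mask, and the function names are hypothetical. Under these assumptions the permissions after the initial create are always a subset of the permissions after SETACL.

```c
#include <assert.h>

#define PERM_MASK 0777u

/*
 * Permissions right after CREATE: the client has already applied the
 * umask, and the server intersects the result with the default ACL.
 */
unsigned int perms_after_create(unsigned int requested,
				unsigned int umask_bits,
				unsigned int dfacl_perms)
{
	assert((dfacl_perms & ~PERM_MASK) == 0);
	return requested & ~umask_bits & dfacl_perms & PERM_MASK;
}

/*
 * Permissions after the follow-up SETACL: the umask is ignored, only
 * the requested mode and the default ACL count.
 */
unsigned int perms_after_setacl(unsigned int requested,
				unsigned int dfacl_perms)
{
	return requested & dfacl_perms & PERM_MASK;
}

/* True if every permission bit in a is also set in b. */
int is_subset(unsigned int a, unsigned int b)
{
	return (a & ~b) == 0;
}
```

For a requested mode of 0666, a umask of 022 and a default ACL equivalent to 0664, the file is briefly 0644 and ends up 0664: the intermediate state only ever withholds permissions.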

Cheers,
-- 
Andreas Gruenbacher <agruen@suse.de>
SUSE Labs, SUSE LINUX GMBH

[-- Attachment #2: nfsacl-acl-umask-handling-workaround-in-nfs-client-fix2.patch --]
[-- Type: text/x-patch, Size: 4260 bytes --]

Index: linux-2.6.11-rc3/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc3.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc3/fs/nfs/dir.c
@@ -42,12 +42,15 @@ static int nfs_opendir(struct inode *, s
 static int nfs_readdir(struct file *, void *, filldir_t);
 static struct dentry *nfs_lookup(struct inode *, struct dentry *, struct nameidata *);
 static int nfs_create(struct inode *, struct dentry *, int, struct nameidata *);
+static int nfs3_create(struct inode *, struct dentry *, int, struct nameidata *);
 static int nfs_mkdir(struct inode *, struct dentry *, int);
+static int nfs3_mkdir(struct inode *, struct dentry *, int);
 static int nfs_rmdir(struct inode *, struct dentry *);
 static int nfs_unlink(struct inode *, struct dentry *);
 static int nfs_symlink(struct inode *, struct dentry *, const char *);
 static int nfs_link(struct dentry *, struct inode *, struct dentry *);
 static int nfs_mknod(struct inode *, struct dentry *, int, dev_t);
+static int nfs3_mknod(struct inode *, struct dentry *, int, dev_t);
 static int nfs_rename(struct inode *, struct dentry *,
 		      struct inode *, struct dentry *);
 static int nfs_fsync_dir(struct file *, struct dentry *, int);
@@ -77,14 +80,14 @@ struct inode_operations nfs_dir_inode_op
 
 #ifdef CONFIG_NFS_V3
 struct inode_operations nfs3_dir_inode_operations = {
-	.create		= nfs_create,
+	.create		= nfs3_create,
 	.lookup		= nfs_lookup,
 	.link		= nfs_link,
 	.unlink		= nfs_unlink,
 	.symlink	= nfs_symlink,
-	.mkdir		= nfs_mkdir,
+	.mkdir		= nfs3_mkdir,
 	.rmdir		= nfs_rmdir,
-	.mknod		= nfs_mknod,
+	.mknod		= nfs3_mknod,
 	.rename		= nfs_rename,
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
@@ -994,16 +997,14 @@ out_err:
 	return error;
 }
 
-static int nfs_set_default_acl(struct inode *dir, struct inode *inode,
-			       mode_t mode)
+#ifdef CONFIG_NFS_V3
+static int nfs3_set_default_acl(struct inode *dir, struct inode *inode,
+				mode_t mode)
 {
 #ifdef CONFIG_NFS_ACL
 	struct posix_acl *dfacl, *acl;
 	int error = 0;
 
-	if (NFS_PROTO(inode)->version != 3 ||
-	    !NFS_PROTO(dir)->getacl || !NFS_PROTO(inode)->setacls)
-		return 0;
 	dfacl = NFS_PROTO(dir)->getacl(dir, ACL_TYPE_DEFAULT);
 	if (IS_ERR(dfacl)) {
 		error = PTR_ERR(dfacl);
@@ -1028,6 +1029,7 @@ out:
 	return 0;
 #endif
 }
+#endif
 
 /*
  * Following a failed create operation, we drop the dentry rather
@@ -1060,7 +1062,7 @@ static int nfs_create(struct inode *dir,
 		d_instantiate(dentry, inode);
 		nfs_renew_times(dentry);
 		nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
-		error = nfs_set_default_acl(dir, inode, mode);
+		error = 0;
 	} else {
 		error = PTR_ERR(inode);
 		d_drop(dentry);
@@ -1069,6 +1071,22 @@ static int nfs_create(struct inode *dir,
 	return error;
 }
 
+#ifdef CONFIG_NFS_V3
+static int nfs3_create(struct inode *dir, struct dentry *dentry, int mode,
+		       struct nameidata *nd)
+{
+	int error;
+
+	lock_kernel();
+	error = nfs_create(dir, dentry, mode, nd);
+	if (!error)
+		error = nfs3_set_default_acl(dir, dentry->d_inode, mode);
+	unlock_kernel();
+
+	return error;
+}
+#endif
+
 /*
  * See comments for nfs_proc_create regarding failed operations.
  */
@@ -1098,9 +1116,21 @@ nfs_mknod(struct inode *dir, struct dent
 		error = nfs_instantiate(dentry, &fhandle, &fattr);
 	else
 		d_drop(dentry);
+	unlock_kernel();
+	return error;
+}
+
+static int nfs3_mknod(struct inode *dir, struct dentry *dentry, int mode,
+		      dev_t rdev)
+{
+	int error;
+
+	lock_kernel();
+	error = nfs_mknod(dir, dentry, mode, rdev);
 	if (!error)
-		error = nfs_set_default_acl(dir, dentry->d_inode, mode);
+		error = nfs3_set_default_acl(dir, dentry->d_inode, mode);
 	unlock_kernel();
+
 	return error;
 }
 
@@ -1138,9 +1168,20 @@ static int nfs_mkdir(struct inode *dir, 
 		error = nfs_instantiate(dentry, &fhandle, &fattr);
 	else
 		d_drop(dentry);
+	unlock_kernel();
+	return error;
+}
+
+static int nfs3_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int error;
+
+	lock_kernel();
+	error = nfs_mkdir(dir, dentry, mode);
 	if (!error)
-		error = nfs_set_default_acl(dir, dentry->d_inode, mode);
+		error = nfs3_set_default_acl(dir, dentry->d_inode, mode);
 	unlock_kernel();
+
 	return error;
 }
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch 12/13] ACL umask handling workaround in nfs client
  2005-02-22 16:47     ` Andreas Gruenbacher
@ 2005-02-22 17:43       ` Trond Myklebust
  0 siblings, 0 replies; 85+ messages in thread
From: Trond Myklebust @ 2005-02-22 17:43 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-kernel, Neil Brown, Olaf Kirch, Andries E. Brouwer, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]

On Tue, 22.02.2005 at 17:47 (+0100), Andreas Gruenbacher wrote:

> See attached patch.
> 

It would be very nice if we could rather hide the calls to
nfs_set_default_acl() inside nfs3_proc_create(), nfs3_proc_mknod() and
nfs3_proc_mkdir(). Besides avoiding the need for the wrapper functions,
that also means that all calls to ->mknod(), ->mkdir() and ->create()
are guaranteed to do the right thing.

How about if we change the interfaces for NFS_PROTO()->mknod(), and
mkdir, so that they take a dentry argument instead of the struct qstr,
and then have them instantiate the dentry? The appended (untested) patch
tries to do this for mknod()...
(This is a cleanup we have pretty much already done for ->create() BTW)
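
The shape of that interface change can be sketched in user space. This is only an illustration with made-up types: `struct entry` stands in for a dentry, `fake_rpc_mknod()` for the RPC call, and `instantiate()` for nfs_instantiate(). The point is that once the protocol-level function receives the entry and instantiates it itself, any protocol-specific follow-up step has a natural home inside that function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct name {
	const char *name;
	unsigned int len;
};

/* Hypothetical stand-in for a dentry. */
struct entry {
	struct name d_name;
	int instantiated;
	unsigned int handle;
};

/* Stand-in for the RPC: rejects empty names, returns a fake handle. */
static int fake_rpc_mknod(const struct name *n, unsigned int *handle)
{
	if (n->len == 0)
		return -EINVAL;
	*handle = 42;
	return 0;
}

static int instantiate(struct entry *e, unsigned int handle)
{
	e->handle = handle;
	e->instantiated = 1;
	return 0;
}

/* Old shape: the caller gets a handle back and must instantiate. */
int mknod_by_name(const struct name *n, unsigned int *handle)
{
	return fake_rpc_mknod(n, handle);
}

/* New shape: takes the entry and finishes the job itself. */
int mknod_by_entry(struct entry *e)
{
	unsigned int handle;
	int status;

	assert(e != NULL);
	status = fake_rpc_mknod(&e->d_name, &handle);
	if (status == 0)
		status = instantiate(e, handle);
	/*
	 * A protocol-specific follow-up (for example setting an ACL)
	 * could go here without touching any caller.
	 */
	return status;
}
```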

> Well, everything but the umask is always correct; that is guaranteed by
> the server. The initial create sets permissions that may be more
> restrictive than necessary, and then the SETACL RPC sets up the final,
> correct permissions. I don't believe that a race-free solution is
> possible.

If the permissions are guaranteed always to be more restrictive, then
that's OK (in fact the NFS protocol mandates that we do something
similar when it comes to O_EXCL opens).
I just wanted to be sure this was indeed the case.

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

[-- Attachment #2: linux-2.6.11-cleanup_mknod.dif --]
[-- Type: text/plain, Size: 8320 bytes --]

 fs/nfs/dir.c            |   20 +++++++++-----------
 fs/nfs/nfs3proc.c       |   23 +++++++++++++----------
 fs/nfs/nfs4proc.c       |   25 +++++++++++++------------
 fs/nfs/proc.c           |   24 ++++++++++++++----------
 include/linux/nfs_fs.h  |    2 ++
 include/linux/nfs_xdr.h |    4 ++--
 6 files changed, 53 insertions(+), 45 deletions(-)

Index: linux-2.6.11-rc4/fs/nfs/nfs3proc.c
===================================================================
--- linux-2.6.11-rc4.orig/fs/nfs/nfs3proc.c
+++ linux-2.6.11-rc4/fs/nfs/nfs3proc.c
@@ -639,23 +639,24 @@ nfs3_proc_readdir(struct dentry *dentry,
 }
 
 static int
-nfs3_proc_mknod(struct inode *dir, struct qstr *name, struct iattr *sattr,
-		dev_t rdev, struct nfs_fh *fh, struct nfs_fattr *fattr)
+nfs3_proc_mknod(struct inode *dir, struct dentry *dentry, struct iattr *sattr,
+		dev_t rdev)
 {
-	struct nfs_fattr	dir_attr;
+	struct nfs_fh fh;
+	struct nfs_fattr fattr, dir_attr;
 	struct nfs3_mknodargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= name->name,
-		.len		= name->len,
+		.name		= dentry->d_name.name,
+		.len		= dentry->d_name.len,
 		.sattr		= sattr,
 		.rdev		= rdev
 	};
 	struct nfs3_diropres	res = {
 		.dir_attr	= &dir_attr,
-		.fh		= fh,
-		.fattr		= fattr
+		.fh		= &fh,
+		.fattr		= &fattr
 	};
-	int			status;
+	int status;
 
 	switch (sattr->ia_mode & S_IFMT) {
 	case S_IFBLK:	arg.type = NF3BLK;  break;
@@ -665,12 +666,14 @@ nfs3_proc_mknod(struct inode *dir, struc
 	default:	return -EINVAL;
 	}
 
-	dprintk("NFS call  mknod %s %u:%u\n", name->name,
+	dprintk("NFS call  mknod %s %u:%u\n", dentry->d_name.name,
 			MAJOR(rdev), MINOR(rdev));
 	dir_attr.valid = 0;
-	fattr->valid = 0;
+	fattr.valid = 0;
 	status = rpc_call(NFS_CLIENT(dir), NFS3PROC_MKNOD, &arg, &res, 0);
 	nfs_refresh_inode(dir, &dir_attr);
+	if (status == 0)
+		status = nfs_instantiate(dentry, &fh, &fattr);
 	dprintk("NFS reply mknod: %d\n", status);
 	return status;
 }
Index: linux-2.6.11-rc4/fs/nfs/nfs4proc.c
===================================================================
--- linux-2.6.11-rc4.orig/fs/nfs/nfs4proc.c
+++ linux-2.6.11-rc4/fs/nfs/nfs4proc.c
@@ -1630,22 +1630,23 @@ static int nfs4_proc_readdir(struct dent
 	return err;
 }
 
-static int _nfs4_proc_mknod(struct inode *dir, struct qstr *name,
-		struct iattr *sattr, dev_t rdev, struct nfs_fh *fh,
-		struct nfs_fattr *fattr)
+static int _nfs4_proc_mknod(struct inode *dir, struct dentry *dentry,
+		struct iattr *sattr, dev_t rdev)
 {
 	struct nfs_server *server = NFS_SERVER(dir);
+	struct nfs_fh fh;
+	struct nfs_fattr fattr;
 	struct nfs4_create_arg arg = {
 		.dir_fh = NFS_FH(dir),
 		.server = server,
-		.name = name,
+		.name = &dentry->d_name,
 		.attrs = sattr,
 		.bitmask = server->attr_bitmask,
 	};
 	struct nfs4_create_res res = {
 		.server = server,
-		.fh = fh,
-		.fattr = fattr,
+		.fh = &fh,
+		.fattr = &fattr,
 	};
 	struct rpc_message msg = {
 		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_CREATE],
@@ -1655,7 +1656,7 @@ static int _nfs4_proc_mknod(struct inode
 	int			status;
 	int                     mode = sattr->ia_mode;
 
-	fattr->valid = 0;
+	fattr.valid = 0;
 
 	BUG_ON(!(sattr->ia_valid & ATTR_MODE));
 	BUG_ON(!S_ISFIFO(mode) && !S_ISBLK(mode) && !S_ISCHR(mode) && !S_ISSOCK(mode));
@@ -1677,19 +1678,19 @@ static int _nfs4_proc_mknod(struct inode
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, 0);
 	if (!status)
 		update_changeattr(dir, &res.dir_cinfo);
+	if (status == 0)
+		status = nfs_instantiate(dentry, &fh, &fattr);
 	return status;
 }
 
-static int nfs4_proc_mknod(struct inode *dir, struct qstr *name,
-		struct iattr *sattr, dev_t rdev, struct nfs_fh *fh,
-		struct nfs_fattr *fattr)
+static int nfs4_proc_mknod(struct inode *dir, struct dentry *dentry,
+		struct iattr *sattr, dev_t rdev)
 {
 	struct nfs4_exception exception = { };
 	int err;
 	do {
 		err = nfs4_handle_exception(NFS_SERVER(dir),
-				_nfs4_proc_mknod(dir, name, sattr, rdev,
-					fh, fattr),
+				_nfs4_proc_mknod(dir, dentry, sattr, rdev),
 				&exception);
 	} while (exception.retry);
 	return err;
Index: linux-2.6.11-rc4/fs/nfs/proc.c
===================================================================
--- linux-2.6.11-rc4.orig/fs/nfs/proc.c
+++ linux-2.6.11-rc4/fs/nfs/proc.c
@@ -248,22 +248,24 @@ nfs_proc_create(struct inode *dir, struc
  * In NFSv2, mknod is grafted onto the create call.
  */
 static int
-nfs_proc_mknod(struct inode *dir, struct qstr *name, struct iattr *sattr,
-	       dev_t rdev, struct nfs_fh *fhandle, struct nfs_fattr *fattr)
+nfs_proc_mknod(struct inode *dir, struct dentry *dentry, struct iattr *sattr,
+	       dev_t rdev)
 {
+	struct nfs_fh fhandle;
+	struct nfs_fattr fattr;
 	struct nfs_createargs	arg = {
 		.fh		= NFS_FH(dir),
-		.name		= name->name,
-		.len		= name->len,
+		.name		= dentry->d_name.name,
+		.len		= dentry->d_name.len,
 		.sattr		= sattr
 	};
 	struct nfs_diropok	res = {
-		.fh		= fhandle,
-		.fattr		= fattr
+		.fh		= &fhandle,
+		.fattr		= &fattr
 	};
-	int			status, mode;
+	int status, mode;
 
-	dprintk("NFS call  mknod %s\n", name->name);
+	dprintk("NFS call  mknod %s\n", dentry->d_name.name);
 
 	mode = sattr->ia_mode;
 	if (S_ISFIFO(mode)) {
@@ -274,14 +276,16 @@ nfs_proc_mknod(struct inode *dir, struct
 		sattr->ia_size = new_encode_dev(rdev);/* get out your barf bag */
 	}
 
-	fattr->valid = 0;
+	fattr.valid = 0;
 	status = rpc_call(NFS_CLIENT(dir), NFSPROC_CREATE, &arg, &res, 0);
 
 	if (status == -EINVAL && S_ISFIFO(mode)) {
 		sattr->ia_mode = mode;
-		fattr->valid = 0;
+		fattr.valid = 0;
 		status = rpc_call(NFS_CLIENT(dir), NFSPROC_CREATE, &arg, &res, 0);
 	}
+	if (status == 0)
+		status = nfs_instantiate(dentry, &fhandle, &fattr);
 	dprintk("NFS reply mknod: %d\n", status);
 	return status;
 }
Index: linux-2.6.11-rc4/fs/nfs/dir.c
===================================================================
--- linux-2.6.11-rc4.orig/fs/nfs/dir.c
+++ linux-2.6.11-rc4/fs/nfs/dir.c
@@ -938,7 +938,7 @@ static struct dentry *nfs_readdir_lookup
 /*
  * Code common to create, mkdir, and mknod.
  */
-static int nfs_instantiate(struct dentry *dentry, struct nfs_fh *fhandle,
+int nfs_instantiate(struct dentry *dentry, struct nfs_fh *fhandle,
 				struct nfs_fattr *fattr)
 {
 	struct inode *inode;
@@ -1019,9 +1019,7 @@ static int
 nfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t rdev)
 {
 	struct iattr attr;
-	struct nfs_fattr fattr;
-	struct nfs_fh fhandle;
-	int error;
+	int status;
 
 	dfprintk(VFS, "NFS: mknod(%s/%ld, %s\n", dir->i_sb->s_id,
 		dir->i_ino, dentry->d_name.name);
@@ -1034,15 +1032,15 @@ nfs_mknod(struct inode *dir, struct dent
 
 	lock_kernel();
 	nfs_begin_data_update(dir);
-	error = NFS_PROTO(dir)->mknod(dir, &dentry->d_name, &attr, rdev,
-					&fhandle, &fattr);
+	status = NFS_PROTO(dir)->mknod(dir, dentry, &attr, rdev);
 	nfs_end_data_update(dir);
-	if (!error)
-		error = nfs_instantiate(dentry, &fhandle, &fattr);
-	else
-		d_drop(dentry);
 	unlock_kernel();
-	return error;
+	if (status != 0)
+		goto out_err;
+	return 0;
+out_err:
+	d_drop(dentry);
+	return status;
 }
 
 /*
Index: linux-2.6.11-rc4/include/linux/nfs_xdr.h
===================================================================
--- linux-2.6.11-rc4.orig/include/linux/nfs_xdr.h
+++ linux-2.6.11-rc4/include/linux/nfs_xdr.h
@@ -698,8 +698,8 @@ struct nfs_rpc_ops {
 	int	(*rmdir)   (struct inode *, struct qstr *);
 	int	(*readdir) (struct dentry *, struct rpc_cred *,
 			    u64, struct page *, unsigned int, int);
-	int	(*mknod)   (struct inode *, struct qstr *, struct iattr *,
-			    dev_t, struct nfs_fh *, struct nfs_fattr *);
+	int	(*mknod)   (struct inode *, struct dentry *, struct iattr *,
+			    dev_t);
 	int	(*statfs)  (struct nfs_server *, struct nfs_fh *,
 			    struct nfs_fsstat *);
 	int	(*fsinfo)  (struct nfs_server *, struct nfs_fh *,
Index: linux-2.6.11-rc4/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.11-rc4.orig/include/linux/nfs_fs.h
+++ linux-2.6.11-rc4/include/linux/nfs_fs.h
@@ -345,6 +345,8 @@ extern struct inode_operations nfs_dir_i
 extern struct file_operations nfs_dir_operations;
 extern struct dentry_operations nfs_dentry_operations;
 
+extern int nfs_instantiate(struct dentry *dentry, struct nfs_fh *fh, struct nfs_fattr *fattr);
+
 /*
  * linux/fs/nfs/symlink.c
  */

end of thread, other threads:[~2005-02-22 17:44 UTC | newest]

Thread overview: 85+ messages
2005-01-22 20:34 [patch 0/13] NFSACL protocol extension for NFSv3 Andreas Gruenbacher
2005-01-22 20:34 ` [patch 1/13] Qsort Andreas Gruenbacher
2005-01-22 21:00   ` vlobanov
2005-01-23  2:03     ` Felipe Alfaro Solana
2005-01-23  2:39       ` Andi Kleen
2005-01-23  3:02         ` Jesper Juhl
2005-01-23  4:46           ` Andi Kleen
2005-01-23  5:05             ` Jesper Juhl
2005-01-23 10:37               ` Rafael J. Wysocki
2005-01-24  4:29                 ` Horst von Brand
2005-01-24 15:45               ` Alan Cox
2005-01-24 17:10               ` H. Peter Anvin
2005-01-25  0:43                 ` Horst von Brand
2005-01-25  4:06                   ` Eric St-Laurent
2005-01-24 22:04             ` Mike Waychison
2005-01-25  6:51               ` Andi Kleen
2005-01-25 10:12                 ` Andreas Gruenbacher
2005-01-25 12:00                   ` Andi Kleen
2005-01-25 12:05                     ` Olaf Kirch
2005-01-25 16:52                       ` Trond Myklebust
2005-01-25 16:53                         ` Andreas Gruenbacher
2005-01-25 17:03                           ` Trond Myklebust
2005-01-25 17:16                             ` Andreas Gruenbacher
2005-01-25 17:37                               ` Trond Myklebust
2005-01-25 18:12                                 ` Andreas Gruenbacher
2005-01-25 19:33                                   ` Trond Myklebust
2005-01-25 19:49                                     ` Andreas Gruenbacher
2005-01-23  4:29         ` Matt Mackall
2005-01-24  0:21           ` Nathan Scott
2005-01-24  2:57             ` Matt Mackall
2005-01-24  4:02           ` Horst von Brand
2005-01-24 21:57             ` Matt Mackall
2005-01-23  4:58         ` Felipe Alfaro Solana
2005-01-24 21:20           ` Matt Mackall
2005-01-24 21:50             ` vlobanov
2005-01-23  4:22       ` Matt Mackall
2005-01-23  5:44       ` Willy Tarreau
2005-01-23 21:24     ` Richard Henderson
     [not found]   ` <1106431568.4153.154.camel@laptopd505.fenrus.org>
2005-01-22 22:10     ` Andreas Gruenbacher
2005-01-22 23:28   ` Matt Mackall
2005-01-23  0:21     ` Matt Mackall
2005-01-23  5:08     ` Andreas Gruenbacher
2005-01-23  5:32       ` Matt Mackall
2005-01-23 12:22         ` Andreas Gruenbacher
2005-01-23 16:49           ` Matt Mackall
2005-01-24  3:48   ` Horst von Brand
2005-01-24 20:15   ` [PATCH] lib/qsort Matt Mackall
2005-01-24 23:09     ` Andrew Morton
2005-01-24 23:30       ` Matt Mackall
2005-01-25  4:11     ` Matt Mackall
2005-01-22 20:34 ` [patch 2/13] Return -ENOSYS for RPC programs that are unavailable Andreas Gruenbacher
2005-02-15 17:04   ` Trond Myklebust
2005-02-16 15:32     ` Andreas Gruenbacher
2005-01-22 20:34 ` [patch 3/13] Add missing -EOPNOTSUPP => NFS3ERR_NOTSUPP mapping in nfsd Andreas Gruenbacher
2005-01-22 20:34 ` [patch 4/13] Allow multiple programs to listen on the same port Andreas Gruenbacher
2005-01-22 20:34 ` [patch 5/13] Allow multiple programs to share the same transport Andreas Gruenbacher
2005-01-22 20:34 ` [patch 6/13] Lazy RPC receive buffer allocation Andreas Gruenbacher
2005-01-22 20:34 ` [patch 7/13] Encode and decode arbitrary XDR arrays Andreas Gruenbacher
2005-02-15 19:17   ` Trond Myklebust
2005-02-16 16:08     ` Andreas Gruenbacher
2005-02-17 14:12     ` Adrian Bunk
2005-01-22 20:34 ` [patch 8/13] Add noacl nfs mount option Andreas Gruenbacher
2005-02-15 17:24   ` Trond Myklebust
2005-02-16 16:10     ` Andreas Gruenbacher
2005-01-22 20:34 ` [patch 9/13] Infrastructure and server side of nfsacl Andreas Gruenbacher
2005-01-22 20:34 ` [patch 10/13] Solaris nfsacl workaround Andreas Gruenbacher
2005-02-15 17:29   ` Trond Myklebust
2005-02-15 20:35     ` Olivier Galibert
2005-02-15 22:43       ` Trond Myklebust
2005-02-15 23:02         ` Olivier Galibert
2005-02-15 23:37           ` Trond Myklebust
2005-02-15 23:43             ` Olivier Galibert
2005-02-16 16:17     ` Andreas Gruenbacher
2005-02-16 17:05       ` Trond Myklebust
2005-02-16 17:39         ` Andreas Gruenbacher
2005-01-22 20:34 ` [patch 11/13] Client side of nfsacl Andreas Gruenbacher
2005-02-15 17:49   ` Trond Myklebust
2005-02-22 13:41     ` Andreas Gruenbacher
2005-02-22 14:13       ` Trond Myklebust
2005-01-22 20:34 ` [patch 12/13] ACL umask handling workaround in nfs client Andreas Gruenbacher
2005-01-25  1:20   ` Andreas Gruenbacher
2005-02-15 18:04   ` Trond Myklebust
2005-02-22 16:47     ` Andreas Gruenbacher
2005-02-22 17:43       ` Trond Myklebust
2005-01-22 20:34 ` [patch 13/13] Cache acls on the nfs client side Andreas Gruenbacher
