linux-kernel.vger.kernel.org archive mirror
* RFC - tree quotas for Linux (2.4.12, ext2)
@ 2001-10-18  5:06 Neil Brown
  2001-10-18  5:53 ` Ben Greear
  2001-10-24 15:16 ` Jan Kara
  0 siblings, 2 replies; 76+ messages in thread
From: Neil Brown @ 2001-10-18  5:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


Hi,
 In my ongoing effort to provide centralised file storage that I can
 be proud of, I have put together some code to implement tree quotas.

 The idea of a tree quota is that the block and inode usage of a file
 is charged to the (owner of the root of the) tree rather than the
 owner (or group owner) of the file.
 This will (I hope) make life easier for me.  There are several
 reasons that I have documented (see URL below) but a good one is that
 they are transparent and predictable.  du -s $HOME should *always*
 match your usage according to "quota".

 I have written a patch which is included below, but also is at
    http://www.cse.unsw.edu.au/~neilb/patches/linux/

 which defines a third type of quotas for Linux, named "treequotas".
 The patch supports these quotas for ext2 by borrowing (or is that
 stealing) i_reserved2 from the on-disc inode to store the "tid",
 which is the uid of the ultimate non-root parent of the file.

 There are obvious issues with hardlinks between trees with different
 tree-ids, but they can be easily restricted to root who should know
 better.

 The patch introduces the concept of a "Treeid" or "tid" which is
 inherited from the parent, if not zero, or set from the uid
 otherwise.
 Thus if root creates a directory near the top of a filesystem and
 chowns it to someone, all files created beneath that directory,
 independent of ownership, get charged to that someone (for the purpose
 of treequotaing).
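
 As a rough illustration of the rule just described, here is a small
 user-space sketch.  It is not code from the patch: struct toy_inode,
 toy_tid(), toy_create() and toy_chown() are hypothetical names, and
 the chown behaviour is only my reading of what the notify_change()
 hunk below ends up doing when tree quotas are enabled.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the inode fields the rule needs. */
struct toy_inode {
	uint32_t uid;	/* owner of the object */
	uint32_t tid;	/* tree-id the object is charged to; 0 = no tree yet */
};

/*
 * Tree-id for an object living in `dir` and owned by `uid`:
 * inherit the parent's tid if it is non-zero, otherwise the
 * object starts a new tree charged to `uid`.
 */
static uint32_t toy_tid(const struct toy_inode *dir, uint32_t uid)
{
	return dir->tid ? dir->tid : uid;
}

/* Create a child of `dir` owned by `uid`, applying the rule above. */
static struct toy_inode toy_create(const struct toy_inode *dir, uint32_t uid)
{
	struct toy_inode child = { .uid = uid, .tid = toy_tid(dir, uid) };
	return child;
}

/* chown: change the owner and re-derive the tid from the parent (assumed). */
static void toy_chown(struct toy_inode *inode, const struct toy_inode *parent,
		      uint32_t uid)
{
	inode->uid = uid;
	inode->tid = toy_tid(parent, uid);
}
```

 In this model, a directory created by root under a tid-0 parent and
 then chowned to uid 1000 ends up with tid 1000, and a file later
 created inside it by uid 2000 keeps uid 2000 but is charged to
 tid 1000.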

 More notes can be found at:

   http://www.cse.unsw.edu.au/~neilb/wiki/?TreeQuotas

 Comments more than welcome.

 I should admit that I haven't actually tried this version of the
 patch.  I got a working version.  Did some testing.  Realised some
 problems.  Updated the patch.  Checked that it compiled.  But haven't
 tested it yet.  But I'm most interested in comments from people
 reading it, not people running it, at this stage.

 Patches are required to the user-space tools to allow them to see
 treequotas.  I don't have any such patches and have no immediate
 plans as I use my own quota tools.  However I might do it at some
 stage and would certainly be happy to host such patches if someone
 else did them.

NeilBrown



--- ./fs/ext2/inode.c	2001/10/15 22:51:39	1.1
+++ ./fs/ext2/inode.c	2001/10/18 04:34:27	1.2
@@ -928,6 +928,7 @@
 	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
 	inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
 	inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
+	inode->i_tid = (uid_t)le32_to_cpu(raw_inode->i_e2_tid);
 	if(!(test_opt (inode->i_sb, NO_UID32))) {
 		inode->i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
 		inode->i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
@@ -1088,6 +1089,7 @@
 		raw_inode->i_uid_high = 0;
 		raw_inode->i_gid_high = 0;
 	}
+	raw_inode->i_e2_tid = cpu_to_le32(inode->i_tid);
 	raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
 	raw_inode->i_size = cpu_to_le32(inode->i_size);
 	raw_inode->i_atime = cpu_to_le32(inode->i_atime);
--- ./fs/ext2/ialloc.c	2001/10/16 04:37:09	1.1
+++ ./fs/ext2/ialloc.c	2001/10/18 04:34:27	1.2
@@ -421,6 +421,7 @@
 	mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
 	sb->s_dirt = 1;
 	inode->i_uid = current->fsuid;
+	inode->i_tid = treequota_tid(dir, inode->i_uid);
 	if (test_opt (sb, GRPID))
 		inode->i_gid = dir->i_gid;
 	else if (dir->i_mode & S_ISGID) {
--- ./fs/dquot.c	2001/10/15 07:23:34	1.1
+++ ./fs/dquot.c	2001/10/18 04:34:27	1.2
@@ -12,7 +12,7 @@
  * based on one of the several variants of the LINUX inode-subsystem
  * with added complexity of the diskquota system.
  * 
- * Version: $Id: dquot.c,v 6.3 1996/11/17 18:35:34 mvw Exp mvw $
+ * Version: $Id: dquot.c,v 1.1 2001/10/15 07:23:34 neilb Exp neilb $
  * 
  * Author:	Marco van Wieringen <mvw@planets.elm.net>
  *
@@ -64,7 +64,7 @@
 
 #include <asm/uaccess.h>
 
-#define __DQUOT_VERSION__	"dquot_6.4.0"
+#define __DQUOT_VERSION__	"dquot_6.4.0t"
 
 int nr_dquots, nr_free_dquots;
 
@@ -127,6 +127,9 @@
 			return((dqopt->flags & DQUOT_USR_ENABLED) != 0);
 		case GRPQUOTA:
 			return((dqopt->flags & DQUOT_GRP_ENABLED) != 0);
+		case TREEQUOTA:
+			return((dqopt->flags & DQUOT_TREE_ENABLED) != 0);
+
 	}
 	return(0);
 }
@@ -689,6 +692,7 @@
 			return current->fsuid == dquot->dq_id && !(dquot->dq_flags & flag);
 		case GRPQUOTA:
 			return in_group_p(dquot->dq_id) && !(dquot->dq_flags & flag);
+		/* FIXME TREEQUOTA */
 	}
 	return 0;
 }
@@ -988,6 +992,9 @@
 				case GRPQUOTA:
 					id = inode->i_gid;
 					break;
+				case TREEQUOTA:
+					id = inode->i_tid;
+					break;
 			}
 			dquot[cnt] = dqget(inode->i_sb, id, cnt);
 		}
@@ -1152,6 +1159,8 @@
 	struct dquot *transfer_to[MAXQUOTAS];
 	int cnt, ret = NO_QUOTA, chuid = (iattr->ia_valid & ATTR_UID) && inode->i_uid != iattr->ia_uid,
 	    chgid = (iattr->ia_valid & ATTR_GID) && inode->i_gid != iattr->ia_gid;
+	int chtreeid = (iattr->ia_valid & ATTR_TID) && inode->i_tid != iattr->ia_tid;
+	
 	char warntype[MAXQUOTAS];
 
 	/* Clear the arrays */
@@ -1174,6 +1183,11 @@
 					continue;
 				transfer_to[cnt] = dqget(inode->i_sb, iattr->ia_gid, cnt);
 				break;
+			case TREEQUOTA:
+				if (!chtreeid)
+					continue;
+				transfer_to[cnt] = dqget(inode->i_sb, iattr->ia_tid, cnt);
+				break;
 		}
 	}
 	/* NOBLOCK START: From now on we shouldn't block */
@@ -1186,6 +1200,8 @@
 		transfer_from[cnt] = dqduplicate(inode->i_dquot[cnt]);
 		if (transfer_from[cnt] == NODQUOT)	/* Can happen on quotafiles (quota isn't initialized on them)... */
 			continue;
+		if (iattr->ia_valid & ATTR_FORCE)
+			continue;			/* don't check, just do */
 		if (check_idq(transfer_to[cnt], 1, warntype+cnt) == NO_QUOTA ||
 		    check_bdq(transfer_to[cnt], blocks, 0, warntype+cnt) == NO_QUOTA)
 			goto warn_put_all;
@@ -1262,6 +1278,9 @@
 		case GRPQUOTA:
 			dqopt->flags |= DQUOT_GRP_ENABLED;
 			break;
+		case TREEQUOTA:
+			dqopt->flags |= DQUOT_TREE_ENABLED;
+			break;
 	}
 }
 
@@ -1274,6 +1293,9 @@
 		case GRPQUOTA:
 			dqopt->flags &= ~DQUOT_GRP_ENABLED;
 			break;
+		case TREEQUOTA:
+			dqopt->flags &= ~DQUOT_TREE_ENABLED;
+			break;
 	}
 }
 
@@ -1413,7 +1435,7 @@
 			break;
 		case Q_GETQUOTA:
 			if (((type == USRQUOTA && current->euid != id) ||
-			     (type == GRPQUOTA && !in_egroup_p(id))) &&
+			     (type == GRPQUOTA && !in_egroup_p(id))) && /* FIXME TREEQUOTA */
 			    !capable(CAP_SYS_ADMIN))
 				goto out;
 			break;
--- ./fs/attr.c	2001/10/15 22:54:34	1.1
+++ ./fs/attr.c	2001/10/18 04:34:27	1.2
@@ -73,6 +73,8 @@
 		inode->i_uid = attr->ia_uid;
 	if (ia_valid & ATTR_GID)
 		inode->i_gid = attr->ia_gid;
+	if (ia_valid & ATTR_TID)
+		inode->i_tid = attr->ia_tid;
 	if (ia_valid & ATTR_ATIME)
 		inode->i_atime = attr->ia_atime;
 	if (ia_valid & ATTR_MTIME)
@@ -127,6 +129,16 @@
 	if (!(ia_valid & ATTR_MTIME_SET))
 		attr->ia_mtime = now;
 
+	if (!(ia_valid & ATTR_TID)
+	    && (ia_valid & ATTR_UID)
+	    && !treequota_parent_uid_ok(inode, dentry->d_parent->d_inode,
+					attr->ia_uid)) {
+
+		attr->ia_tid = treequota_tid(dentry->d_parent->d_inode,
+					     attr->ia_uid);
+		ia_valid |= ATTR_TID;
+		attr->ia_valid = ia_valid;
+	}
 	lock_kernel();
 	if (inode->i_op && inode->i_op->setattr) 
 		error = inode->i_op->setattr(dentry, attr);
@@ -134,6 +146,7 @@
 		error = inode_change_ok(inode, attr);
 		if (!error) {
 			if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
+			    (ia_valid & ATTR_TID && attr->ia_tid != inode->i_tid) ||
 			    (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
 				error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
 			if (!error)
--- ./fs/namei.c	2001/10/15 23:04:31	1.1
+++ ./fs/namei.c	2001/10/18 04:34:27	1.2
@@ -524,6 +524,7 @@
 			if (IS_ERR(dentry))
 				break;
 		}
+		treequota_check(dentry);
 		/* Check mountpoints.. */
 		while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
 			;
@@ -1586,6 +1587,8 @@
 	if (dir->i_dev != inode->i_dev)
 		goto exit_lock;
 
+	if (!treequota_parent_ok(inode, dir))
+		goto exit_lock;
 	/*
 	 * A link to an append-only or immutable file cannot be created.
 	 */
@@ -1693,6 +1696,7 @@
 {
 	int error;
 	struct inode *target;
+	struct iattr attr;
 
 	if (old_dentry->d_inode == new_dentry->d_inode)
 		return 0;
@@ -1704,6 +1708,10 @@
 	if (new_dir->i_dev != old_dir->i_dev)
 		return -EXDEV;
 
+	if (!treequota_parent_ok(old_dentry->d_inode, new_dir)
+	    && !capable(CAP_CHOWN))
+		return -EXDEV;
+	
 	if (!new_dentry->d_inode)
 		error = may_create(new_dir, new_dentry);
 	else
@@ -1743,11 +1751,16 @@
 	} else
 		double_down(&old_dir->i_zombie,
 			    &new_dir->i_zombie);
+	attr.ia_valid = ATTR_TID;
+	attr.ia_tid = treequota_tid(new_dir, old_dentry->d_inode->i_uid);
 	if (IS_DEADDIR(old_dir)||IS_DEADDIR(new_dir))
 		error = -ENOENT;
 	else if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
 		error = -EBUSY;
-	else 
+	else if (!treequota_parent_ok(old_dentry->d_inode, new_dir)
+		 && (error = notify_change(old_dentry, &attr)))
+		;
+	else
 		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
 	if (target) {
 		if (!error)
@@ -1799,8 +1812,20 @@
 	double_down(&old_dir->i_zombie, &new_dir->i_zombie);
 	if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
 		error = -EBUSY;
-	else
-		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
+	else {
+		error = 0;
+		if (!treequota_parent_ok(old_dentry->d_inode, new_dir)) {
+			struct iattr attr;
+			if (old_dentry->d_inode->i_nlink > 1)
+				return -EXDEV;
+			attr.ia_valid = ATTR_TID;
+			attr.ia_tid = treequota_tid(new_dir,
+					    old_dentry->d_inode->i_uid);
+			error = notify_change(old_dentry, &attr);
+		}
+		if (!error) 
+			error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
+	}
 	double_up(&old_dir->i_zombie, &new_dir->i_zombie);
 	if (error)
 		return error;
--- ./fs/stat.c	2001/10/16 06:44:17	1.1
+++ ./fs/stat.c	2001/10/18 04:34:27	1.2
@@ -78,6 +78,7 @@
 	tmp.st_nlink = inode->i_nlink;
 	SET_STAT_UID(tmp, inode->i_uid);
 	SET_STAT_GID(tmp, inode->i_gid);
+	tmp.__unused5 = inode->i_tid;
 	tmp.st_rdev = kdev_t_to_nr(inode->i_rdev);
 #if BITS_PER_LONG == 32
 	if (inode->i_size > MAX_NON_LFS)
--- ./include/linux/quota.h	2001/10/15 07:04:25	1.1
+++ ./include/linux/quota.h	2001/10/18 04:34:27	1.2
@@ -33,7 +33,7 @@
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
- * Version: $Id: quota.h,v 2.0 1996/11/17 16:48:14 mvw Exp mvw $
+ * Version: $Id: quota.h,v 1.1 2001/10/15 07:04:25 neilb Exp neilb $
  */
 
 #ifndef _LINUX_QUOTA_
@@ -65,9 +65,10 @@
 #define MAX_IQ_TIME  604800	/* (7*24*60*60) 1 week */
 #define MAX_DQ_TIME  604800	/* (7*24*60*60) 1 week */
 
-#define MAXQUOTAS 2
+#define MAXQUOTAS 3
 #define USRQUOTA  0		/* element used for user quotas */
 #define GRPQUOTA  1		/* element used for group quotas */
+#define	TREEQUOTA 2		/* element used for tree quotas */
 
 /*
  * Definitions for the default names of the quotas files.
@@ -75,6 +76,7 @@
 #define INITQFNAMES { \
 	"user",    /* USRQUOTA */ \
 	"group",   /* GRPQUOTA */ \
+	"tree",    /* TREEQUOTA */ \
 	"undefined", \
 };
 
--- ./include/linux/quotaops.h	2001/10/15 07:22:03	1.1
+++ ./include/linux/quotaops.h	2001/10/18 04:34:27	1.2
@@ -4,7 +4,7 @@
  *
  * Author:  Marco van Wieringen <mvw@planets.elm.net>
  *
- * Version: $Id: quotaops.h,v 1.2 1998/01/15 16:22:26 ecd Exp $
+ * Version: $Id: quotaops.h,v 1.1 2001/10/15 07:22:03 neilb Exp neilb $
  *
  */
 #ifndef _LINUX_QUOTAOPS_
@@ -36,7 +36,7 @@
 /*
  * Operations supported for diskquotas.
  */
-#define sb_any_quota_enabled(sb) ((sb)->s_dquot.flags & (DQUOT_USR_ENABLED | DQUOT_GRP_ENABLED))
+#define sb_any_quota_enabled(sb) ((sb)->s_dquot.flags & (DQUOT_USR_ENABLED | DQUOT_GRP_ENABLED | DQUOT_TREE_ENABLED))
 
 static __inline__ void DQUOT_INIT(struct inode *inode)
 {
@@ -162,6 +162,53 @@
 #define DQUOT_SYNC(dev)	sync_dquots(dev, -1)
 #define DQUOT_OFF(sb)	quota_off(sb, -1)
 
+static __inline__ int treequota_parent_uid_ok(struct inode *inode, struct inode *dir, uid_t uid)
+{
+	if (!(inode->i_sb->s_dquot.flags & DQUOT_TREE_ENABLED))
+		return 1;
+	if (dir->i_tid
+	    ? (inode->i_tid ==   dir->i_tid)
+	    : (inode->i_tid ==   uid))
+		return 1;
+	return 0;
+}
+
+static __inline__ int treequota_parent_ok(struct inode *inode, struct inode *dir)
+{
+	return treequota_parent_uid_ok(inode,dir, inode->i_uid);
+}
+
+static __inline__ int treequota_tid(struct inode *dir, uid_t uid)
+{
+	if (!(dir->i_sb->s_dquot.flags & DQUOT_TREE_ENABLED))
+		return 0;
+	return dir->i_tid
+		? dir->i_tid
+		: uid;
+}
+
+static __inline__ void treequota_check(struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct iattr attr;
+	if (!(inode->i_sb->s_dquot.flags & DQUOT_TREE_ENABLED))
+		return;
+	if (treequota_parent_ok(inode, dentry->d_parent->d_inode))
+		return;
+
+	attr.ia_valid = ATTR_FORCE | ATTR_TID;
+	attr.ia_tid = treequota_tid(dentry->d_parent->d_inode,
+				    inode->i_uid);
+	if (!S_ISDIR(inode->i_mode)
+	    && inode->i_nlink > 1) {
+		printk(KERN_WARNING "treequota: file with multiple links has wrong tree-id\n");
+		printk(KERN_WARNING "  dev=%x ino=%ld dino=%ld\n",
+		       inode->i_dev, inode->i_ino,
+		       dentry->d_parent->d_inode->i_ino);
+		printk(KERN_WARNING "  basename=%s\n", dentry->d_name.name);
+	}
+	notify_change(dentry, &attr);
+}
 #else
 
 /*
@@ -216,6 +263,10 @@
 	DQUOT_FREE_BLOCK_NODIRTY(inode, nr);
 	mark_inode_dirty(inode);
 }	
+
+#define treequota_parent_uid_ok(inode,dir,uid) (1)
+#define	treequota_parent_ok(inode,dir) (1)
+#define treequota_tid(inode,uid) (0)
 
 #endif /* CONFIG_QUOTA */
 #endif /* _LINUX_QUOTAOPS_ */
--- ./include/linux/fs.h	2001/10/15 07:22:35	1.1
+++ ./include/linux/fs.h	2001/10/18 04:34:27	1.2
@@ -327,6 +327,7 @@
 #define ATTR_MTIME_SET	256
 #define ATTR_FORCE	512	/* Not a change, but a change it */
 #define ATTR_ATTR_FLAG	1024
+#define	ATTR_TID	2048
 
 /*
  * This is the Inode Attributes structure, used for notify_change().  It
@@ -347,6 +348,7 @@
 	time_t		ia_mtime;
 	time_t		ia_ctime;
 	unsigned int	ia_attr_flags;
+	uid_t		ia_tid;
 };
 
 /*
@@ -430,6 +432,7 @@
 	nlink_t			i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
+	uid_t			i_tid;	/* tree-id for quotas */
 	kdev_t			i_rdev;
 	loff_t			i_size;
 	time_t			i_atime;
@@ -636,6 +639,7 @@
 
 #define DQUOT_USR_ENABLED	0x01		/* User diskquotas enabled */
 #define DQUOT_GRP_ENABLED	0x02		/* Group diskquotas enabled */
+#define DQUOT_TREE_ENABLED	0x04		/* Tree diskquotas enabled */
 
 struct quota_mount_options
 {
--- ./include/linux/ext2_fs.h	2001/10/15 22:45:41	1.1
+++ ./include/linux/ext2_fs.h	2001/10/18 04:34:27	1.2
@@ -249,7 +249,7 @@
 			__u16	i_pad1;
 			__u16	l_i_uid_high;	/* these 2 fields    */
 			__u16	l_i_gid_high;	/* were reserved2[0] */
-			__u32	l_i_reserved2;
+			__u32	l_i_tid;	/* tree-id for quotas, no longer l_i_reserved2 */
 		} linux2;
 		struct {
 			__u8	h_i_frag;	/* Fragment number */
@@ -278,7 +278,7 @@
 #define i_gid_low	i_gid
 #define i_uid_high	osd2.linux2.l_i_uid_high
 #define i_gid_high	osd2.linux2.l_i_gid_high
-#define i_reserved2	osd2.linux2.l_i_reserved2
+#define i_e2_tid	osd2.linux2.l_i_tid
 #endif
 
 #ifdef	__hurd__





* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18  5:06 RFC - tree quotas for Linux (2.4.12, ext2) Neil Brown
@ 2001-10-18  5:53 ` Ben Greear
  2001-10-18  8:38   ` James Sutherland
  2001-10-24 15:16 ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Ben Greear @ 2001-10-18  5:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-fsdevel, linux-kernel

Neil Brown wrote:
> 
> Hi,
>  In my ongoing effort to provide centralised file storage that I can
>  be proud of, I have put together some code to implement tree quotas.
> 
>  The idea of a tree quota is that the block and inode usage of a file
>  is charged to the (owner of the root of the) tree rather than the
>  owner (or group owner) of the file.
>  This will (I hope) make life easier for me.  There are several
>  reasons that I have documented (see URL below) but a good one is that
>  they are transparent and predictable.  du -s $HOME should *always*
>  match your usage according to "quota".

Err, except maybe when you also own a file in /home/idiot/idiots_unprotected_storage_dir
(This relates not at all to your patch/comments.)

-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18  5:53 ` Ben Greear
@ 2001-10-18  8:38   ` James Sutherland
  2001-10-18 20:20     ` Mike Fedyk
  0 siblings, 1 reply; 76+ messages in thread
From: James Sutherland @ 2001-10-18  8:38 UTC (permalink / raw)
  To: Ben Greear; +Cc: Neil Brown, linux-fsdevel, linux-kernel

On Wed, 17 Oct 2001, Ben Greear wrote:

> Neil Brown wrote:
> >
> > Hi,
> >  In my ongoing effort to provide centralised file storage that I can
> >  be proud of, I have put together some code to implement tree quotas.
> >
> >  The idea of a tree quota is that the block and inode usage of a file
> >  is charged to the (owner of the root of the) tree rather than the
> >  owner (or group owner) of the file.
> >  This will (I hope) make life easier for me.  There are several
> >  reasons that I have documented (see URL below) but a good one is that
> >  they are transparent and predictable.  du -s $HOME should *always*
> >  match your usage according to "quota".
>
> Err, except maybe when you also own a file in /home/idiot/idiots_unprotected_storage_dir
> (This relates not at all to your patch/comments.)

No - "the ... usage of a file is charged to the tree, RATHER THAN THE
OWNER OF THE FILE". So, in this case, if you own a file in ~idiot/foo,
idiot's quota is charged for the file, not you.


James.
-- 
"Our attitude with TCP/IP is, `Hey, we'll do it, but don't make a big
system, because we can't fix it if it breaks -- nobody can.'"

"TCP/IP is OK if you've got a little informal club, and it doesn't make
any difference if it takes a while to fix it."
		-- Ken Olson, in Digital News, 1988



* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18  8:38   ` James Sutherland
@ 2001-10-18 20:20     ` Mike Fedyk
  2001-10-18 20:47       ` Tim Walberg
                         ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-10-18 20:20 UTC (permalink / raw)
  To: James Sutherland; +Cc: Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

On Thu, Oct 18, 2001 at 09:38:47AM +0100, James Sutherland wrote:
> On Wed, 17 Oct 2001, Ben Greear wrote:
> 
> > Neil Brown wrote:
> > >
> > > Hi,
> > >  In my ongoing effort to provide centralised file storage that I can
> > >  be proud of, I have put together some code to implement tree quotas.
> > >
> > >  The idea of a tree quota is that the block and inode usage of a file
> > >  is charged to the (owner of the root of the) tree rather than the
> > >  owner (or group owner) of the file.
> > >  This will (I hope) make life easier for me.  There are several
> > >  reasons that I have documented (see URL below) but a good one is that
> > >  they are transparent and predictable.  du -s $HOME should *always*
> > >  match your usage according to "quota".
> >
> > Err, except maybe when you also own a file in /home/idiot/idiots_unprotected_storage_dir
> > (This relates not at all to your patch/comments.)
> 
> No - "the ... usage of a file is charged to the tree, RATHER THAN THE
> OWNER OF THE FILE". So, in this case, if you own a file in ~idiot/foo,
> idiot's quota is charged for the file, not you.

Actually, it looks like Neil is creating a two level Quota system.  In the
normal quota system, if you own a file anywhere, it is attributed to you.
But, in the tree quota system, it is attributed to the owner of the tree...

Neil, how do you plan to notify someone that their tree quota has been
exceeded instead of their normal quota?


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 20:20     ` Mike Fedyk
@ 2001-10-18 20:47       ` Tim Walberg
  2001-10-19  1:07         ` Neil Brown
  2001-10-18 21:17       ` Andreas Dilger
  2001-10-19  0:53       ` Neil Brown
  2 siblings, 1 reply; 76+ messages in thread
From: Tim Walberg @ 2001-10-18 20:47 UTC (permalink / raw)
  To: James Sutherland, Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]

A semi-random thought on the tree-quota concept:

Does it really make sense to charge a tree quota to a single specific
user? I haven't really looked into what would be required to implement
it, but my mental picture of a tree quota is somewhat divorced from the
user concept, other than maybe the quota table containing a pointer to
a contact for quota violations. The bookkeeping might be easier if each
tree quota root just held a cumulative total of allocated space, and
maybe just a user name for contacts (or on the fancier side, a hook
to execute something...).

I know it's kinda half-baked, but that's my $0.015...

				tw

On 10/18/2001 13:20 -0700, Mike Fedyk wrote:
>>	Actually, it looks like Neil is creating a two level Quota system.  In the
>>	normal quota system, if you own a file anywhere, it is attributed to you.
>>	But, in the tree quota system, it is attributed to the owner of the tree...
>>	
>>	Neil, how do you plan to notify someone that their tree quota has been
>>	exceeded instead of their normal quota?
End of included message



-- 
twalberg@mindspring.com

[-- Attachment #2: Type: application/pgp-signature, Size: 175 bytes --]


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 20:20     ` Mike Fedyk
  2001-10-18 20:47       ` Tim Walberg
@ 2001-10-18 21:17       ` Andreas Dilger
  2001-10-18 22:56         ` Mike Fedyk
  2001-10-19  1:13         ` Neil Brown
  2001-10-19  0:53       ` Neil Brown
  2 siblings, 2 replies; 76+ messages in thread
From: Andreas Dilger @ 2001-10-18 21:17 UTC (permalink / raw)
  To: James Sutherland, Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

On Oct 18, 2001  13:20 -0700, Mike Fedyk wrote:
> On Thu, Oct 18, 2001 at 09:38:47AM +0100, James Sutherland wrote:
> > No - "the ... usage of a file is charged to the tree, RATHER THAN THE
> > OWNER OF THE FILE". So, in this case, if you own a file in ~idiot/foo,
> > idiot's quota is charged for the file, not you.

However, this means that if anyone has write permission into a tree, they
can "offload" their quota to another user and keep more files than they
ought to.  Also, depending on the permissions of the file/directory, the
"tree" owner may not even be able to delete the files that are causing
their quota to be exceeded.

> Actually, it looks like Neil is creating a two level Quota system.  In the
> normal quota system, if you own a file anywhere, it is attributed to you.
> But, in the tree quota system, it is attributed to the owner of the tree...

Hmm, we already have group quotas, and (excluding ACLs) you would need to
have group write permission into the tree to be able to write there.  How
does the tree quota help us in the end?  Either users are "nice" and you
don't need quotas, or users are "not nice" and you don't want them to be
able to dump their files into an area that doesn't keep them "in check" as
quotas are designed to do.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert



* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 21:17       ` Andreas Dilger
@ 2001-10-18 22:56         ` Mike Fedyk
  2001-10-19  0:14           ` Horst von Brand
  2001-10-19  1:13         ` Neil Brown
  1 sibling, 1 reply; 76+ messages in thread
From: Mike Fedyk @ 2001-10-18 22:56 UTC (permalink / raw)
  To: James Sutherland, Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

On Thu, Oct 18, 2001 at 03:17:18PM -0600, Andreas Dilger wrote:
> On Oct 18, 2001  13:20 -0700, Mike Fedyk wrote:
> > Actually, it looks like Neil is creating a two level Quota system.  In the
> > normal quota system, if you own a file anywhere, it is attributed to you.
> > But, in the tree quota system, it is attributed to the owner of the tree...
> 
> Hmm, we already have group quotas, and (excluding ACLs) you would need to
> have group write permission into the tree to be able to write there.  How
> does the tree quota help us in the end?  Either users are "nice" and you
> don't need quotas, or users are "not nice" and you don't want them to be
> able to dump their files into an area that doesn't keep them "in check" as
> quotas are designed to do.
> 

Hmm, I think I just thought of a use for the tree quota concept.

Let's say that you have about 50GB of space, but you only want to allow 20GB
for a certain tree (possibly mp3s), and you want to keep user ownerships of
the files they contribute.

Now try to use the group quota idea.

User makes mp3
User can chgrp to any group that they are a member of...
copy to /mp3s.

Now the group (and quota) that was set up for mp3s has been circumvented.

With the tree quota, an entire tree could be assigned to a certain group,
and then use the group quota tools...

The only other way I can see to fix this would be a cron job to walk the
tree and set the group to whatever has been set up, but that looks like a
hack to me.

Is there another way to fix this besides putting all mp3s on a separate
partition?

Mike


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 22:56         ` Mike Fedyk
@ 2001-10-19  0:14           ` Horst von Brand
  2001-10-19  0:51             ` Mike Fedyk
  0 siblings, 1 reply; 76+ messages in thread
From: Horst von Brand @ 2001-10-19  0:14 UTC (permalink / raw)
  To: James Sutherland, Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

Mike Fedyk <mfedyk@matchmail.com> said:
> Let's say that you have about 50GB of space, but you only want to allow 20GB
> for a certain tree (possibly mp3s), and you want to keep user ownerships of
> the files they contribute.

Then they just copy the mp3's wherever they want, and symlink them into
the tree. No (meaningful) charge.

BTW, you get (almost exactly) the same effect by mounting a partition of
20Gb under /mp3
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Vin~a del Mar, Chile                               +56 32 672616


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-19  0:14           ` Horst von Brand
@ 2001-10-19  0:51             ` Mike Fedyk
  0 siblings, 0 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-10-19  0:51 UTC (permalink / raw)
  To: Horst von Brand
  Cc: James Sutherland, Ben Greear, Neil Brown, linux-fsdevel, linux-kernel

On Thu, Oct 18, 2001 at 09:14:45PM -0300, Horst von Brand wrote:
> Mike Fedyk <mfedyk@matchmail.com> said:
> > Let's say that you have about 50GB of space, but you only want to allow 20GB
> > for a certain tree (possibly mp3s), and you want to keep user ownerships of
> > the files they contribute.
> 
> Then they just copy the mp3's wherever they want, and symlink them into
> the tree. No (meaningful) charge.
> 
> BTW, you get (almost exactly) the same effect by mounting a partition of
> 20Gb under /mp3

Yep, unless you're sharing with nfs, and the path is different on the client
than on the server...  But it would work with ftp, http, smb, or anything
that follows the symlink on the server.

What we need is a quota based on file type; no, not the extension, but the return
value of `file`.  j/k

Can anyone come up with a useful case where a treequota would help?

Mike


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 20:20     ` Mike Fedyk
  2001-10-18 20:47       ` Tim Walberg
  2001-10-18 21:17       ` Andreas Dilger
@ 2001-10-19  0:53       ` Neil Brown
  2 siblings, 0 replies; 76+ messages in thread
From: Neil Brown @ 2001-10-19  0:53 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: James Sutherland, Ben Greear, linux-fsdevel, linux-kernel

On Thursday October 18, mfedyk@matchmail.com wrote:
> 
> Actually, it looks like Neil is creating a two level Quota system.  In the
> normal quota system, if you own a file anywhere, it is attributed to you.
> But, in the tree quota system, it is attributed to the owner of the
> tree...

Well, actually it is three level.  Though I wouldn't call them
levels.  They are really alternates.
The space usage of a filesystem object can be charged to
  1/ the owner of the file
  2/ the group-owner of the file
  3/ the owner of the tree containing the file.

I added the third.
You could conceivably impose quotas of all three sorts, but I suspect
that would cause unfortunate interactions and be a management
headache. I would recommend only using one at a time.
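
To make the "alternates" concrete, here is a small user-space sketch of
which id gets charged for each quota type.  It mirrors the switch that
the patch extends in fs/dquot.c, but the TQ_* constants, struct tq_inode
and tq_charged_id() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for USRQUOTA, GRPQUOTA and TREEQUOTA. */
enum tq_type { TQ_USRQUOTA = 0, TQ_GRPQUOTA = 1, TQ_TREEQUOTA = 2 };

/* Only the three ids that quota accounting cares about. */
struct tq_inode {
	uint32_t uid;	/* owner of the file */
	uint32_t gid;	/* group-owner of the file */
	uint32_t tid;	/* owner of the tree containing the file */
};

/* Return the id whose quota is charged for `inode` under `type`. */
static uint32_t tq_charged_id(const struct tq_inode *inode, enum tq_type type)
{
	switch (type) {
	case TQ_USRQUOTA:
		return inode->uid;
	case TQ_GRPQUOTA:
		return inode->gid;
	case TQ_TREEQUOTA:
		return inode->tid;
	}
	return 0;	/* unknown type: nothing sensible to charge */
}
```

So a file owned by uid 2000, group 100, sitting in uid 1000's tree is
charged to 2000, 100 or 1000 depending on which sort of quota is being
looked at.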

> 
> Neil, how do you plan to notify someone that their tree quota has been
> exceeded instead of their normal quota?

In what sense?  The kernel prints a warning when you go over-quota.
It only does so if the process that causes the quota to be exceeded is
responsible for the quota.  This is determined in "need_print_warning"
in fs/dquot.c.
I haven't added a TREEQUOTA branch to that yet (just a FIXME comment).
A few moments reflection suggests that I should just "return 1" for
TREEQUOTA, so anyone who exceeds the quota gets the warning, not just 
the owner.
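
A user-space sketch of that decision, with the TREEQUOTA branch filled
in the way I am suggesting.  This is a model, not the kernel function:
struct warn_task, warn_in_group() and warn_needed() are hypothetical
names, and keeping the "already warned" check for tree quotas is an
assumption on my part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum warn_type { WARN_USRQUOTA, WARN_GRPQUOTA, WARN_TREEQUOTA };

/* Hypothetical stand-in for the bits of `current` that matter here. */
struct warn_task {
	uint32_t fsuid;
	const uint32_t *groups;
	size_t ngroups;
};

/* True if the task is a member of group `gid`. */
static bool warn_in_group(const struct warn_task *t, uint32_t gid)
{
	for (size_t i = 0; i < t->ngroups; i++)
		if (t->groups[i] == gid)
			return true;
	return false;
}

/*
 * Should this task see a warning for the quota with id `dq_id`?
 * `already_warned` models the per-dquot flag that suppresses repeats.
 */
static bool warn_needed(const struct warn_task *t, enum warn_type type,
			uint32_t dq_id, bool already_warned)
{
	if (already_warned)
		return false;
	switch (type) {
	case WARN_USRQUOTA:
		return t->fsuid == dq_id;
	case WARN_GRPQUOTA:
		return warn_in_group(t, dq_id);
	case WARN_TREEQUOTA:
		return true;	/* whoever pushed the tree over gets told */
	}
	return false;
}
```

With this, uid 2000 writing into uid 1000's tree would be warned about
the tree quota, even though the same write would not trigger a warning
about uid 1000's user quota.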

However, all my customers access their files via NFS and so don't get
these warnings.
I have a nightly job that sends email to people who are over quota,
and a global login script that prints a warning if the person is over
quota.
So they don't know the moment that they exceed their quota, but
should find out soon enough....

I wonder if NFSv4 should have a "Soft-Quota-Exceeded" non-error return
state so that clients could warn their users.... It doesn't yet.

NeilBrown


* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 20:47       ` Tim Walberg
@ 2001-10-19  1:07         ` Neil Brown
  2001-10-19  3:03           ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: Neil Brown @ 2001-10-19  1:07 UTC (permalink / raw)
  To: Tim Walberg; +Cc: James Sutherland, Ben Greear, linux-fsdevel, linux-kernel

On Thursday October 18, twalberg@mindspring.com wrote:
> A semi-random thought on the tree-quota concept:
> 
> Does it really make sense to charge a tree quota to a single specific
> user? I haven't really looked into what would be required to implement
> it, but my mental picture of a tree quota is somewhat divorced from the
> user concept, other than maybe the quota table containing a pointer to
> a contact for quota violations. The bookkeeping might be easier if each
> tree quota root just held a cumulative total of allocated space, and
> maybe a just a user name for contacts (or on the fancier side, a hook
> to execute something...).

My original thought was that the "Treeid" in each inode would be the
inode number of the root of the quota-tree.  That would work and allow
treequotas to use a separate number space.

However I actually want to charge usage to users.
There is a natural mapping from users to directory trees via the
concept of the home-directory.  It is home directories that I want to
impose quotas on.  So it seems natural to charge space usage to
users.

Certainly there are entities that need space allocation that are not
users in the traditional sense of the word.  Groups (as in collections
of people, not necessarily as in unix groups) is an obvious example.

So instead of "users", let's call them "accounts".
Each account has
   a name
   a home directory
   a space quota

Some also have passwords and shells that allow people to log into
them.
(Each account also has an expiry date, a printer allocation, an
internet-usage allocation .... but that's another story).

So for me, quotas are not at all divorced from the "Account" concept.

The idea of keeping the cumulative total of usage in the root of the
quota tree is appealing, but is frustrated by hard links.  Though we
can try to avoid them, they will happen and there has to be a clear
way to handle them.  Recording with each inode the information about
who is charged for that inode is the simplest by far.

Or possibly you meant that each directory should contain the
cumulative sum of usage beneath it.. Even if that were well defined,
it would be a performance problem updating lots of directory inodes for
each change.

NeilBrown

> 
> I know it's kinda half-baked, but that's my $0.015...
> 
> 				tw
> 
> On 10/18/2001 13:20 -0700, Mike Fedyk wrote:
> >>	Actually, it looks like Niel is creating a two level Quota system.  In ther
> >>	normal quota system, if you own a file anywhere, it is attributed to you.
> >>	But, in the tree quota system, it is attributed to the owner of the tree...
> >>	
> >>	Niel, how do you plan to notify someone that their tree quota has been
> >>	exceeded instead of their normal quota?
> >>	-
> >>	To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>	the body of a message to majordomo@vger.kernel.org
> >>	More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>	Please read the FAQ at  http://www.tux.org/lkml/
> End of included message
> 
> 
> 
> -- 
> twalberg@mindspring.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18 21:17       ` Andreas Dilger
  2001-10-18 22:56         ` Mike Fedyk
@ 2001-10-19  1:13         ` Neil Brown
  1 sibling, 0 replies; 76+ messages in thread
From: Neil Brown @ 2001-10-19  1:13 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: James Sutherland, Ben Greear, linux-fsdevel, linux-kernel

On Thursday October 18, adilger@turbolabs.com wrote:
> On Oct 18, 2001  13:20 -0700, Mike Fedyk wrote:
> > On Thu, Oct 18, 2001 at 09:38:47AM +0100, James Sutherland wrote:
> > > No - "the ... usage of a file is charged to the tree, RATHER THAN THE
> > > OWNER OF THE FILE". So, in this case, if you own a file in ~idiot/foo,
> > > idiot's quota is charged for the file, not you.
> 
> However, this means that if anyone has write permission into a tree, they
> can "offload" their quota to another user and keep more files than they
> ought to.  Also, depending on the permissions of the file/directory, the
> "tree" owner may not even be able to delete the files that are causing
> their quota to be exceeded.

Exactly the same is true of group based quotas.
We have a set-uid tool that allows people in a group to do
 "chmod g+rwx" on any directory in that group's home directory. 

> 
> > Actually, it looks like Niel is creating a two level Quota system.  In ther
> > normal quota system, if you own a file anywhere, it is attributed to you.
> > But, in the tree quota system, it is attributed to the owner of the tree...
> 
> Hmm, we already have group quotas, and (excluding ACLs) you would need to
> have group write permission into the tree to be able to write there.  How
> does the tree quota help us in the end?  Either users are "nice" and you
> don't need quotas, or users are "not nice" and you don't want them to be
> able to dump their files into an area that doesn't keep them "in check" as
> quotas are designed to do.

People need to agree to be "nice" to other people in their group.
Tree quotas force them, as a group, to be "nice" to everyone else.

I wrote a little blurb on why I want tree quotas at:

  http://www.cse.unsw.edu.au/~neilb/wiki/?WhyTreeQuotas

I include it below.

NeilBrown

----------------------------------------------------------------
Why would we want tree-based quotas?

It is reasonable to ask why user or group based quotas are not
enough. 

My answer is not a general rationale, but rather an answer as to why
they aren't enough for me in my situation. If you have a similar
situation, the reasons might apply to you too.. Or they might not.

We provide centralised home directories for a wide variety of users
(students and academics mostly). These home directories are stored on
a number of different filesystems on a number of different hosts.

We wish to impose clear, predictable, repeatable restrictions on disc
space usage on these home directories so as to protect the various
users from one another. It would be nice if the restrictions were also
fair and equitable, but that is not a technological issue.

Thus we need a clear way to identify who each file should be charged
to, and to make sure that the total of files charged to a user (or
other entity - a "who") is controlled by their stated quota (with
allowances for soft and hard limits, and grace periods etc).

We also have people who wish to, and people who are required to, work
collaboratively. Thus they may need to work on files in their own home
directory, and also files in some other home directory, such as a
group directory or a co-worker's directory.

We also have people who want to make use of discretionary access
control (DAC) and give access to certain files to certain groups of
people.

Given all of this, quotas based on the "owner" of a file cannot
work. This is because (due to group work) an individual may own files
in multiple filesystems, and unix style quotas are per-filesystem
based. People would need to have their assigned quota shared among
various filesystems, and this would be awkward to manage. It also
makes it hard to find all files that you are being charged for, so
that you can clean up.

Similarly, quotas based on the "group-owner" of a file cannot work. This
is because some groups are used for DAC only and do not justify having
any quota.

We have worked with a combination of these schemes for a while:
user-based quotas for some people, and group based quotas for
others. However this only reduces some of the problems, and doesn't
completely remove any.

Tree based quotas provide an answer to all of this. Each person or
entity that merits some storage space (e.g. groups) is given a home
directory. All files in this home directory get charged to that
entity, no matter who the owner and group-owner are. This clearly
separates access control from usage charges.

It is possible, when sharing access, for one person to use up lots of
storage that gets charged to another person (or a group) such that a
person who is affected by the charge does not have access permission
to delete some of the files that they are being charged for. This is
also possible with group based quotas.

To resolve this, any person who is being charged for space must be
given access to remove files consuming that space. This means that
they must be able to get read/write/execute permission on any
directory that they are being charged for.

This can be done in user-space with a simple set-uid tool.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-19  1:07         ` Neil Brown
@ 2001-10-19  3:03           ` Rik van Riel
  2001-10-19 11:50             ` Horst von Brand
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-10-19  3:03 UTC (permalink / raw)
  To: Neil Brown
  Cc: Tim Walberg, James Sutherland, Ben Greear, linux-fsdevel, linux-kernel

On Fri, 19 Oct 2001, Neil Brown wrote:
> On Thursday October 18, twalberg@mindspring.com wrote:
> > A semi-random thought on the tree-quota concept:
> >
> > Does it really make sense to charge a tree quota to a single specific
> > user? I haven't really looked into what would be required to implement
> > it, but my mental picture of a tree quota is somewhat divorced from the
> > user concept,

> However I actually want to charge usage to users.
> There is a natural mapping from users to directory trees via the
> concept of the home-directory.

Say ... /home/students   ?


Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/  (volunteers needed)

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-19  3:03           ` Rik van Riel
@ 2001-10-19 11:50             ` Horst von Brand
  2001-10-19 17:00               ` Mike Fedyk
  0 siblings, 1 reply; 76+ messages in thread
From: Horst von Brand @ 2001-10-19 11:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Neil Brown, linux-kernel

Rik van Riel <riel@conectiva.com.br> said:

[...]

> Say ... /home/students   ?

User + group quota.
-- 
Dr. Horst H. von Brand                Usuario #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-19 11:50             ` Horst von Brand
@ 2001-10-19 17:00               ` Mike Fedyk
  0 siblings, 0 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-10-19 17:00 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Rik van Riel, Neil Brown, linux-kernel

On Fri, Oct 19, 2001 at 08:50:32AM -0300, Horst von Brand wrote:
> Rik van Riel <riel@conectiva.com.br> said:
> 
> [...]
> 
> > Say ... /home/students   ?
> 
> User + group quota.

chgrp

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-18  5:06 RFC - tree quotas for Linux (2.4.12, ext2) Neil Brown
  2001-10-18  5:53 ` Ben Greear
@ 2001-10-24 15:16 ` Jan Kara
  2001-10-24 15:34   ` James Sutherland
  2001-10-24 21:24   ` Neil Brown
  1 sibling, 2 replies; 76+ messages in thread
From: Jan Kara @ 2001-10-24 15:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-fsdevel, linux-kernel

  Hello,

>  In my ongoing effort to provide centralised file storage that I can
>  be proud of, I have put together some code to implement tree quotas.
> 
>  The idea of a tree quota is that the block and inode usage of a file
>  is charged to the (owner of the root of the) tree rather than the
>  owner (or group owner) of the file.
>  This will (I hope) make life easier for me.  There are several
>  reasons that I have documented (see URL below) but a good one is that
>  they are transparent and predictable.  du -s $HOME should *always*
>  match your usage according to "quota".
> 
>  I have written a patch which is included below, but also is at
>     htttp://www.cse.unsw.edu.au/~neilb/patches/linux/
> 
>  which defines a third type of quotas for Linux, named "treequotas".
>  The patch supports these quotas for ext2 by borrowing (or is that
>  stealing) i_reserved2 from the on-disc inode to store the "tid",
>  which is the uid of the ultimate non-root parent of the file.
> 
>  There are obvious issues with hardlinks between trees with different
>  tree-ids, but they can be easily restricted to root who should know
>  better.
> 
>  The patch introduces the concept of a "Treeid" or "tid" which is
>  inherited from the parent, if not zero, or set from the uid
>  otherwise.
>  Thus if root creates a directory near the top of a filesystem and
>  chowns it to someone, all files created beneath that directory,
>  independant of ownership, get charged to the someone (for the purpose
>  of treequotaing).
  But how do you solve the following: mv <dir> <some_other_dir>
The parent changes. You need to go through all the subdirs of <dir> and change
the TID. This is really hard to get right and to avoid deadlocks
and races... At least it seems to me so.

									Honza

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:16 ` Jan Kara
@ 2001-10-24 15:34   ` James Sutherland
  2001-10-24 15:39     ` Jan Kara
  2001-10-26 11:25     ` Pavel Machek
  2001-10-24 21:24   ` Neil Brown
  1 sibling, 2 replies; 76+ messages in thread
From: James Sutherland @ 2001-10-24 15:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Neil Brown, linux-fsdevel, linux-kernel

On Wed, 24 Oct 2001, Jan Kara wrote:

>   But how do you solve the following: mv <dir> <some_other_dir>
> The parent changes. You need to go through all the subdirs of <dir> and change
> the TID. This is really hard to get right and to avoid deadlocks
> and races... At least it seems to me so.

Provided you are tracking the total size in each directory, it's just a
matter of subtracting dir's size from the old parent, and adding it to the
new parent. (With suitable checks beforehand to avoid a result which
exceeds quota.)


James.
-- 
"Our attitude with TCP/IP is, `Hey, we'll do it, but don't make a big
system, because we can't fix it if it breaks -- nobody can.'"

"TCP/IP is OK if you've got a little informal club, and it doesn't make
any difference if it takes a while to fix it."
		-- Ken Olson, in Digital News, 1988


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:34   ` James Sutherland
@ 2001-10-24 15:39     ` Jan Kara
  2001-10-24 15:50       ` James Sutherland
  2001-10-26 11:25     ` Pavel Machek
  1 sibling, 1 reply; 76+ messages in thread
From: Jan Kara @ 2001-10-24 15:39 UTC (permalink / raw)
  To: James Sutherland; +Cc: Neil Brown, linux-fsdevel, linux-kernel

> On Wed, 24 Oct 2001, Jan Kara wrote:
> 
> >   But how do you solve the following: mv <dir> <some_other_dir>
> > The parent changes. You need to go through all the subdirs of <dir> and change
> > the TID. This is really hard to get right and to avoid deadlocks
> > and races... At least it seems to me so.
> 
> Provided you are tracking the total size in each directory, it's just a
> matter of subtracting dir's size from the old parent, and adding it to the
> new parent. (With suitable checks beforehand to avoid a result which
> exceeds quota.)
  Nope. If you'd just keep usage in the directory then you need to go all
the way up and decrease the usage and then go all the way down in the new
directory. It's simpler but also nontrivial...

									Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:39     ` Jan Kara
@ 2001-10-24 15:50       ` James Sutherland
  2001-10-24 17:41         ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: James Sutherland @ 2001-10-24 15:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: Neil Brown, linux-fsdevel, linux-kernel

On Wed, 24 Oct 2001, Jan Kara wrote:
> > On Wed, 24 Oct 2001, Jan Kara wrote:
> >
> > >   But how do you solve the following: mv <dir> <some_other_dir>
> > > The parent changes. You need to go through all the subdirs of <dir> and change
> > > the TID. This is really hard to get right and to avoid deadlocks
> > > and races... At least it seems to me so.
> >
> > Provided you are tracking the total size in each directory, it's just a
> > matter of subtracting dir's size from the old parent, and adding it to the
> > new parent. (With suitable checks beforehand to avoid a result which
> > exceeds quota.)
>   Nope. If you'd just keep usage in directory than you need to go all the way
> up and decrease the usage and then go all the way down in the new directory.
> It's simplier but also nontrivial...

Yep, you're right: you'd need to ascend the target directory tree,
increasing the cumulative size all the way up, then do the move and
decrement the old location's totals in the same way. All wrapped up in a
transaction (on journalled FSs) or have fsck rebuild the totals on a dirty
mount. Fairly clean and painless on a JFS, but a bit of a mess on
others - still, quite workable, and the performance hit shouldn't be too
bad. Better than walking all the way DOWN the tree, anyway...
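
The upward walk described here could be sketched roughly as follows.
This is a toy in-memory model, not filesystem code: dir_node,
adjust_ancestors and move_subtree are hypothetical names, and locking,
journalling, quota checks and the "moving a directory into its own
subtree" case are all left out.

```c
#include <stddef.h>

/* Hypothetical in-memory directory with a cumulative usage total. */
struct dir_node {
    struct dir_node *parent;    /* NULL at the filesystem root */
    long long subtree_blocks;   /* blocks used by this dir and all below it */
};

/* Add delta to 'start' and every ancestor of 'start'. */
static void adjust_ancestors(struct dir_node *start, long long delta)
{
    for (struct dir_node *d = start; d != NULL; d = d->parent)
        d->subtree_blocks += delta;
}

/*
 * Move 'dir' under 'new_parent': first charge the new location all the
 * way up, then uncharge the old location the same way, then re-parent.
 * Common ancestors receive +size and -size, so they end up unchanged.
 */
static void move_subtree(struct dir_node *dir, struct dir_node *new_parent)
{
    long long size = dir->subtree_blocks;

    adjust_ancestors(new_parent, size);
    adjust_ancestors(dir->parent, -size);
    dir->parent = new_parent;
}
```

The cost is proportional to the depth of the two parent chains, not to
the number of files under the moved directory.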


James.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:50       ` James Sutherland
@ 2001-10-24 17:41         ` Rik van Riel
  2001-10-24 18:08           ` James Sutherland
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-10-24 17:41 UTC (permalink / raw)
  To: James Sutherland; +Cc: Jan Kara, Neil Brown, linux-fsdevel, linux-kernel

On Wed, 24 Oct 2001, James Sutherland wrote:

> Yep, you're right: you'd need to ascend the target directory tree,
> increasing the cumulative size all the way up, then do the move and
> decrement the old location's totals in the same way. All wrapped up in a
> transaction (on journalled FSs) or have fsck rebuild the totals on a dirty
> mount. Fairly clean and painless on a JFS,

It's only clean and painless when you have infinite journal
space. When your filesystem's journal isn't big enough to
keep track of all the quota updates from an arbitrarily deep
directory tree, you're in big trouble.

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 17:41         ` Rik van Riel
@ 2001-10-24 18:08           ` James Sutherland
  0 siblings, 0 replies; 76+ messages in thread
From: James Sutherland @ 2001-10-24 18:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jan Kara, Neil Brown, linux-fsdevel, linux-kernel

On Wed, 24 Oct 2001, Rik van Riel wrote:
> On Wed, 24 Oct 2001, James Sutherland wrote:
>
> > Yep, you're right: you'd need to ascend the target directory tree,
> > increasing the cumulative size all the way up, then do the move and
> > decrement the old location's totals in the same way. All wrapped up in a
> > transaction (on journalled FSs) or have fsck rebuild the totals on a dirty
> > mount. Fairly clean and painless on a JFS,
>
> It's only clean and painless when you have infinite journal
> space. When your filesystem's journal isn't big enough to
> keep track of all the quota updates from an arbitrarily deep
> directory tree, you're in big trouble.

Good point. You should be able to do it in constant space, though:
identify the directory being modified, and the "height" to which you have
ascended so far. That'll allow you to back out or redo the transaction
later, which is enough I think?


James.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:16 ` Jan Kara
  2001-10-24 15:34   ` James Sutherland
@ 2001-10-24 21:24   ` Neil Brown
  2001-10-25 15:48     ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Neil Brown @ 2001-10-24 21:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel

On Wednesday October 24, jack@suse.cz wrote:
>   Hello,
> 
> >  In my ongoing effort to provide centralised file storage that I can
> >  be proud of, I have put together some code to implement tree quotas.
> > 

> >                                         du -s $HOME should *always*
> >  match your usage according to "quota".

>   But how do you solve the following: mv <dir> <some_other_dir>
> The parent changes. You need to go through all the subdirs of <dir> and change
> the TID. This is really hard to get right and to avoid deadlocks
> and races... At least it seems to me so.
> 

It is possible that at times not all objects in a tree have the same
tree-id.  This can happen in a number of ways.  One is moving a
directory between quota-trees.  Another is changing the owner of the
top directory in a quota tree.  Another is enabling tree quotas for
the first time in a filesystem (TID is not kept up-to-date if
treequotas are not enabled).  However:

1/ Non-root users (actually non-CAP_CHOWN processes)  cannot create
   such a situation. e.g. If the directory move would change the TID,
   then it is forbidden (EXDEV).
2/ At every lookup in a path_walk, the TID is checked against the
   parent.  If it is wrong, it is changed.  This causes TID's to tend
   towards correctness.
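
As an illustration of rule 1/, here is a small user-space sketch of
the tid rule and the rename restriction.  It is not taken from the
patch: node, expected_tid and check_rename are hypothetical names, and
has_cap_chown simply stands in for "the process has CAP_CHOWN".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two inode fields that matter here. */
struct node {
    unsigned int uid;   /* file owner */
    unsigned int tid;   /* tree id the usage is charged to */
};

/*
 * The tid a child of 'parent' ought to carry: inherited from the
 * parent if the parent's tid is non-zero, otherwise the child's uid.
 */
static unsigned int expected_tid(const struct node *parent,
                                 unsigned int child_uid)
{
    return parent->tid != 0 ? parent->tid : child_uid;
}

/*
 * Returns 0 if the move may go ahead, or -EXDEV if it would change
 * the tid and the caller is not privileged.
 */
static int check_rename(const struct node *victim,
                        const struct node *new_parent,
                        bool has_cap_chown)
{
    if (expected_tid(new_parent, victim->uid) == victim->tid)
        return 0;                       /* stays in the same quota tree */
    return has_cap_chown ? 0 : -EXDEV;  /* cross-tree move */
}
```

Returning EXDEV means that a user-level mv would typically fall back
to copy-and-delete, which charges the new tree in the ordinary way.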


So if you move a directory between quota trees, then the usages will
be wrong in the first instance.  But only root can make this happen.
However, there is an easy way to fix it: just run a find or a du in
the new tree. 

If you get a situation where a file is linked into two different
quota-trees (which non-CAP_CHOWN processes  cannot do, but "root"
could achieve in several ways), then its usage charge will effectively
bounce between the two trees as it is accessed from either side.
Every time this happens, a KERN_WARNING message gets logged.

It is not a 'perfect' solution, as sometimes the real tree usage will
not match the recorded tree usages.

It is an 'acceptable' solution.  It keeps the goal that if you do a
"du" and then look at your quota usage, they will match (though the
other way round could in unusual circumstances not match).  It also
prevents non-root users from creating problematic situations.

It is, I think, the 'best' solution that is possible.

Note that the automatic re-assignment of quota that happens on lookup
if the TID is wrong by-passes quota checks.  It will always succeed
no matter who is doing the lookup (I found a use for ATTR_FORCE!!).

Also the patch that I posted before had a few bugs.

  http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.13-pre6/patch-A-TreeQuotas

has those bugs removed.

NeilBrown

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 21:24   ` Neil Brown
@ 2001-10-25 15:48     ` Jan Kara
  2001-10-26  4:36       ` Neil Brown
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2001-10-25 15:48 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-fsdevel, linux-kernel

  Hello,

> > >  In my ongoing effort to provide centralised file storage that I can
> > >  be proud of, I have put together some code to implement tree quotas.
> > > 
> 
> > >                                         du -s $HOME should *always*
> > >  match your usage according to "quota".
> 
> >   But how do you solve the following: mv <dir> <some_other_dir>
> > The parent changes. You need to go through all the subdirs of <dir> and change
> > the TID. This is really hard to get right and to avoid deadlocks
> > and races... At least it seems to me so.
> > 
> 
> It is possible that at times not all objects in a tree have the same
> tree-id.  This can happen in a number of ways.  One is moving a
> directory between quota-trees.  Another is changing the owner of the
> top directory in a quota tree.  Another is enabling tree quotas for
> the first time in a filesystem (TID is not kept up-to-date if
> treequotas are not enabled).  However:
> 
> 1/ Non-root users (actually non-CAP_CHOWN processes)  cannot create
>    such a situation. e.g. If the directory move would change the TID,
>    then it is forbidden (EXDEV).
> 2/ At every lookup in a path_walk, the TID is checked against the
>    parent.  If it is wrong, it is changed.  This causes TID's to tend
>    towards correctness.
> 
> 
> So if you move a directory between quota trees, then the usages will
> be wrong in the first instance.  But only root can make this happen.
> However, there is an easy way to fix it: just run a find or a du in
> the new tree. 
  Umm.. I'm not sure about one thing: When you move the dir between the
trees, when do you update the TIDs? I understood that not during the move...
So the only possibility that I see is that each time you read the inode
you check whether its TID is OK. But that means going through dirs every time
you read some inode which doesn't look nice to me...

> If you get a situation where a file is linked into two different
> quota-trees (which non-CAP_CHOWN processes  cannot do, but "root"
> could achieve in several ways), then its usage charge will effectively
> bounce between the two trees as it is accessed from either side.
> Every time this happens, a KERN_WARNING message gets logged.
> 
> It is not a 'perfect' solution, as some times the real tree usage will
> not match the recorded tree usages.
> 
> It is an 'acceptable' solution.  It keeps the goal that if you do a
> "du" and then look at your quota usage, they will match (though the
> other way round could in unusual circumstances not match).  It also
> prevents non-root users from creating problematic situations.
> 
> It is, I think, the 'best' solution that is possible.
  I also don't see a better solution but I'm not sure this solution is good
enough to be implemented (to me it looks more like a hack than a regular
part of system...).

									Honza

--
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-25 15:48     ` Jan Kara
@ 2001-10-26  4:36       ` Neil Brown
  2001-10-29 14:06         ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Neil Brown @ 2001-10-26  4:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel

On Thursday October 25, jack@suse.cz wrote:
> > So if you move a directory between quota trees, then the usages will
> > be wrong in the first instance.  But only root can make this happen.
> > However, there is an easy way to fix it: just run a find or a du in
> > the new tree. 
>   Umm.. I'm not sure about one thing: When you move the dir between the
> trees when you update TID's? I understood that not during the move...

That is right.  Not during the move.
Well... the tid of the directory itself changes during the move.  The
tid's of descendants change later.

> So the only possibility that I see is that each time you read the inode
> you check whether its TID is OK. But that means going through dirs everytime
> you read some inode which doesn't look nice to me...
> 

Have a look at the code and see where treequota_check is called.

It is called every time a "lookup" is done, whether the result is in
the cache or not.  If the lookup found something, then you have an
inode and its parent right there in the cache.  treequota_check
checks that the tid of the child matches that of the parent, and
changes it if not.  So the overhead is very small for the common case
where the tid is correct.

It just tests:
   is inode NULL
   are treequotas enabled for this inode
   does the tid of the child match that of the parent (or the uid of
           the child if parent.tid==0)
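
The three tests can be pictured with a toy user-space model like the
one below.  It is only a sketch under stated assumptions, not the
treequota_check from the patch: fs_model, inode_model and
treequota_check_model are hypothetical, and usage is kept in a tiny
array indexed by tid (so tids must be below 4 here).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filesystem state: switch plus per-tid charged blocks. */
struct fs_model {
    bool treequota_on;
    long long charged[4];       /* toy table, valid for tid 0..3 only */
};

struct inode_model {
    struct fs_model *fs;
    unsigned int uid;
    unsigned int tid;
    long long blocks;
};

/*
 * Run after a lookup of 'child' in directory 'parent'.
 * Returns true if the child's tid had to be repaired.
 */
static bool treequota_check_model(const struct inode_model *parent,
                                  struct inode_model *child)
{
    if (child == NULL)                  /* lookup found nothing */
        return false;
    if (!child->fs->treequota_on)       /* tree quotas not enabled */
        return false;

    unsigned int want = parent->tid != 0 ? parent->tid : child->uid;
    if (child->tid == want)             /* common case: nothing to do */
        return false;

    /* Move the charge from the old tree to the new one. */
    child->fs->charged[child->tid] -= child->blocks;
    child->fs->charged[want] += child->blocks;
    child->tid = want;
    return true;
}
```

In the common case the function costs two flag tests and one integer
comparison, which is the point being made about overhead.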

> > 
> > It is, I think, the 'best' solution that is possible.
>   I also don't see a better solution but I'm not sure this solution is good
> enough to be implemented (to me it looks more like a hack than a regular
> part of system...).

I accept that it does look like a bit of a hack.
But I think it is simple, understandable, and predictable.
And I think that (for me) the value of tree quotas is more than enough
to offset that cost.

NeilBrown


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 15:34   ` James Sutherland
  2001-10-24 15:39     ` Jan Kara
@ 2001-10-26 11:25     ` Pavel Machek
  1 sibling, 0 replies; 76+ messages in thread
From: Pavel Machek @ 2001-10-26 11:25 UTC (permalink / raw)
  To: James Sutherland; +Cc: Jan Kara, Neil Brown, linux-fsdevel, linux-kernel

Hi!
> 
> >   But how do you solve the following: mv <dir> <some_other_dir>
> > The parent changes. You need to go through all the subdirs of <dir> and change
> > the TID. This is really hard to get right and to avoid deadlocks
> > and races... At least it seems to me so.
> 
> Provided you are tracking the total size in each directory, it's just a
> matter of subtracting dir's size from the old parent, and adding it to the
> new parent. (With suitable checks beforehand to avoid a result which
> exceeds quota.)

And what about hardlinks?
								Pavel
-- 
STOP THE WAR! Someone killed innocent Americans. That does not give
U.S. right to kill people in Afganistan.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-26  4:36       ` Neil Brown
@ 2001-10-29 14:06         ` Jan Kara
  2001-10-29 23:23           ` Neil Brown
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2001-10-29 14:06 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-fsdevel, linux-kernel

> > So the only possibility that I see is that each time you read the inode
> > you check whether its TID is OK. But that means going through dirs everytime
> > you read some inode which doesn't look nice to me...
> > 
> 
> Have a look at the code and see where treequota_check is called.
> 
> It is called every time a "lookup" is done, whether the result is in
> the cache or not.  If the lookup found something, then you have a
> inode and it's parent right there in the cache.  treequota_check
> checks that the tid of the child matches that of the parent, and
> changes it if not.  So the overhead is very small for the common case
> where the tid is correct.
> 
> It just tests:
>    is inode NULL
>    are treequotas enabled for this inode
>    does the tid of the child match that of the parent (or the uid of
>            the child if parent.tid==0
  OK. I've seen the code and I agree it's not real problem.

> > > 
> > > It is, I think, the 'best' solution that is possible.
> >   I also don't see a better solution but I'm not sure this solution is good
> > enough to be implemented (to me it looks more like a hack than a regular
> > part of system...).
> 
> I accept that it does look like a bit of a hack.
> But I think it is simple, understandable, and predictable.
> And I think that (for me) the value of tree quotas is more than enough
> to offset that cost.
  I just don't like the idea that when you do lookup you can suddenly get
Disk quota exceeded... I'd consider this behaviour a bit nonintuitive. I agree
that if root makes lookup of every file after moving directories then this
doesn't happen but still I don't like the design :).

									Honza

--
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-29 14:06         ` Jan Kara
@ 2001-10-29 23:23           ` Neil Brown
  2001-10-30 12:33             ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Neil Brown @ 2001-10-29 23:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel

On Monday October 29, jack@suse.cz wrote:
> > 
> > I accept that it does look like a bit of a hack.
> > But I think it is simple, understandable, and predictable.
> > And I think that (for me) the value of tree quotas is more than enough
> > to offset that cost.
>   I just don't like the idea that when you do lookup you can suddenly get
> Disk quota exceeded... I'd concern this behaviour a bit nonintuitive. I agree
> that if root makes lookup of every file after moving directories then this
> doesn't happen but still I don't like the design :).
> 

You cannot get "Disk quota exceeded" on a lookup. If treequota_check
finds a discrepancy it fixes it with "notify_change" with
ia_valid set to ATTR_FORCE | ATTR_TID.
I changed quota_transfer to take ATTR_FORCE to mean "just do it, even
if it exceeds quota, and don't give an error".   Given that ATTR_FORCE
is not actually used at all in the current kernel, I felt fairly free
to interpret it how I wanted.
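
To make the two steps above concrete, here is a small user-space model in C.
It is not the code from the patch: struct toy_inode, struct toy_quota,
toy_quota_transfer() and toy_treequota_check() are hypothetical names made up
for illustration. Only the rule itself comes from this thread: a file takes
its parent's tid if that is non-zero, otherwise its own uid, and a forced
transfer goes through even when it pushes the target over its limit.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_ID 16

/* Hypothetical per-filesystem tree-quota state (illustration only). */
struct toy_quota {
    int  enabled;             /* are treequotas switched on? */
    long used[TOY_MAX_ID];    /* blocks currently charged to each tid */
    long limit[TOY_MAX_ID];   /* block limit for each tid */
};

/* Hypothetical, heavily simplified inode. */
struct toy_inode {
    unsigned uid;
    unsigned tid;
    long blocks;
    const struct toy_inode *parent;
};

/* The tid an inode should carry: the parent's tid if non-zero, else its uid. */
unsigned toy_expected_tid(const struct toy_inode *inode)
{
    if (inode->parent && inode->parent->tid != 0)
        return inode->parent->tid;
    return inode->uid;
}

/* Move an inode's charge to new_tid.  Without force the transfer is refused
 * with -EDQUOT when it would exceed the limit; with force it always succeeds. */
int toy_quota_transfer(struct toy_quota *q, struct toy_inode *inode,
                       unsigned new_tid, int force)
{
    if (new_tid >= TOY_MAX_ID || inode->tid >= TOY_MAX_ID)
        return -EINVAL;
    if (!force && q->used[new_tid] + inode->blocks > q->limit[new_tid])
        return -EDQUOT;
    q->used[inode->tid] -= inode->blocks;
    q->used[new_tid] += inode->blocks;
    inode->tid = new_tid;
    return 0;
}

/* The lookup-time check: cheap when the tid is already right, and a forced
 * (never failing) transfer when it is not. */
int toy_treequota_check(struct toy_quota *q, struct toy_inode *inode)
{
    unsigned want;

    if (inode == NULL || !q->enabled)
        return 0;
    want = toy_expected_tid(inode);
    if (inode->tid == want)
        return 0;
    return toy_quota_transfer(q, inode, want, 1);
}
```

In this model an ordinary (unforced) transfer into a nearly full tree fails
with -EDQUOT, while the same move done by the lookup-time check succeeds and
simply leaves the tree over its limit.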

So the only non-intuitive thing that can happen is that you find your
usage mysteriously changes.  However this can only happen after
administrator intervention, and with uid quotas administrator
intervention (e.g. chown -R) can equally cause mysterious changes of
usage.


However I'm not particularly trying to convince anyone to use or
approve of tree-quotas.  I was after comments to make sure that I
hadn't missed something in thinking through the issues.  I thank you
and others for your comments.  The fact that I am comfortable with my
answers (though you may not be) encourages me that I haven't missed
anything.

I will be using treequotas locally next year and will keep the
patches on my web-page up-to-date.  I have heard from at least one
person who thinks they might be useful, so there are probably a few
dozen who might find it useful.
In 6-12 months, if my experience is all positive, I might try
suggesting that they get included in a "standard" kernel (assuming
that 2.5 has opened by then:-).

Thanks again,
NeilBrown

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-29 23:23           ` Neil Brown
@ 2001-10-30 12:33             ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2001-10-30 12:33 UTC (permalink / raw)
  To: Neil Brown; +Cc: Jan Kara, linux-fsdevel, linux-kernel

> On Monday October 29, jack@suse.cz wrote:
> > > 
> > > I accept that it does look like a bit of a hack.
> > > But I think it is simple, understandable, and predictable.
> > > And I think that (for me) the value of tree quotas is more than enough
> > > to offset that cost.
> >   I just don't like the idea that when you do lookup you can suddenly get
> > Disk quota exceeded... I'd consider this behaviour a bit nonintuitive. I agree
> > that if root makes lookup of every file after moving directories then this
> > doesn't happen but still I don't like the design :).
> > 
> 
> You cannot get "Disk quota exceeded" on a lookup. If treequota_check
> finds a discrepancy it fixes it with "notify_change" with
> ia_valid set to ATTR_FORCE | ATTR_TID.
> I changed quota_transfer to take ATTR_FORCE to mean "just do it, even
> if it exceeds quota, and don't give an error".   Given that ATTR_FORCE
> is not actually used at all in the current kernel, I felt fairly free
> to interpret it how I wanted.
  Hmm.. I should have read your patch more carefully.. Sorry.

> So the only non-intuitive thing that can happen is that you find your
> usage mysteriously changes.  However this can only happen after
> administrator intervention, and with uid quotas administrator
> intervention (e.g. chown -R) can equally cause mysterious changes of
> usage.
> 
> However I'm not particularly trying to convince anyone to use or
> approve of tree-quotas.  I was after comments to make sure that I
> hadn't missed something in thinking through the issues.  I thank you
> and others for your comments.  The fact that I am comfortable with my
> answers (though you may not be) encourages me that I haven't missed
> anything.
> 
> I will be using treequotas locally next year and will keep the
> patches on my web-page up-to-date.  I have heard from at least one
> person who thinks they might be useful, so there are probably a few
> dozen who might find it useful.
  :) I also think tree quotas are useful, I'd just like to think of some
nicer solution...:)

								Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
       [not found] ` <fa.heevhav.sjs8an@ifi.uio.no>
@ 2001-11-18 22:15   ` Dan Maas
  2001-11-18 22:43     ` Swap François Cami
  0 siblings, 1 reply; 76+ messages in thread
From: Dan Maas @ 2001-11-18 22:15 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel, war, stilgar2k

> >Yep. There's a reason for that: the kernel is *ALWAYS*
> >able to swap pages out to disk - even without "swap space".
> >Disabling swapspace simply forces the kernel to swap out
> >more code, since it cannot swap out any data.
>
> Sure ??? Where ?? What disk space uses it to swap pages to ?

The executables and binaries on your regular filesystems... Even with no
swap space, the kernel can "page out" (i.e. drop from memory) read-only file
mappings, since they can always be reloaded from disk if needed.

In other words, there is still a big difference between running without swap
space, and having every program do an mlockall() (which *really* forces all
pages to be permanently resident in RAM).
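
For reference, a minimal sketch of the mlockall() case in C. The wrapper
names lock_all_memory() and unlock_all_memory() are made up for this example;
mlockall(), munlockall() and the MCL_CURRENT / MCL_FUTURE flags are the
standard calls. Expect the lock to fail for an unprivileged process whose
locked-memory limit is small, which is why the wrapper hands back errno
instead of assuming success.

```c
#include <errno.h>
#include <sys/mman.h>

/* Try to pin every current and future page of this process in RAM.
 * Returns 0 on success, or the errno value reported by mlockall(). */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}

/* Release the locks again.  Returns 0 on success, or the errno value
 * reported by munlockall(). */
int unlock_all_memory(void)
{
    if (munlockall() != 0)
        return errno;
    return 0;
}
```

A process that has not done this can still lose its read-only file-backed
pages under memory pressure, swap space or not.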

Still, it puzzles me why a system with no swap space would appear to be more
responsive than one with swap (assuming their working sets are quite a bit
smaller than total amount of RAM)... Can you do a controlled test somehow,
to rule out any sort of placebo effect?

Regards,
Dan


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-18 22:15   ` Swap Dan Maas
@ 2001-11-18 22:43     ` François Cami
  2001-11-19  9:18       ` Swap James A Sutherland
  2001-11-19 10:03       ` Swap Tim Connors
  0 siblings, 2 replies; 76+ messages in thread
From: François Cami @ 2001-11-18 22:43 UTC (permalink / raw)
  To: Dan Maas; +Cc: linux-kernel

Dan Maas wrote:


> Still, it puzzles me why a system with no swap space would appear to be more
> responsive than one with swap (assuming their working sets are quite a bit
> smaller than total amount of RAM)... Can you do a controlled test somehow,
> to rule out any sort of placebo effect?

It's pretty simple... Try putting as much progs as you can into RAM
(but less than total RAM size) when you have RAM+swap.
Switching from one prog to another now takes time, because if you need
to go e.g. from mozilla to openoffice for example, if openoffice has
been swapped, it'll take ages.

Another good example is launching X and a few heavy X apps, going back
to console, doing a few things, like compiling different kernel trees.
If you have swap, the X + X apps will be swapped. going back to X will
take ages, because all that data + code has to be moved out to RAM to
cache the data in the two kernel trees.
If you don't have swap, maybe one, or both of the two kernel trees
will end up being not cached into main memory, depending on how much
RAM left you have. but going back to X will take 1 second instead of 20,
and thus the system will be more responsive.

It depends clearly on the situation you're in. I believe running with
swap is beneficial when your memory load is more than 75% of total
RAM, and less so when you have a few hundred megs of RAM left with all
useful apps loaded into RAM (which is not too unlikely these days,
due to the low price of SD/DDR RAM).

François


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-18 22:43     ` Swap François Cami
@ 2001-11-19  9:18       ` James A Sutherland
  2001-11-19 10:51         ` Swap Remco Post
  2001-11-19 10:03       ` Swap Tim Connors
  1 sibling, 1 reply; 76+ messages in thread
From: James A Sutherland @ 2001-11-19  9:18 UTC (permalink / raw)
  To: François Cami, Dan Maas; +Cc: linux-kernel

On Sunday 18 November 2001 10:43 pm, François Cami wrote:
> Dan Maas wrote:
> > Still, it puzzles me why a system with no swap space would appear to be
> > more responsive than one with swap (assuming their working sets are quite
> > a bit smaller than total amount of RAM)... Can you do a controlled test
> > somehow, to rule out any sort of placebo effect?
>
> It's pretty simple... Try putting as much progs as you can into RAM
> (but less than total RAM size) when you have RAM+swap.
> Switching from one prog to another now takes time, because if you need
> to go e.g. from mozilla to openoffice for example, if openoffice has
> been swapped, it'll take ages.

Except that openoffice and mozilla can be swapped out in BOTH cases: the 
kernel can discard mapped pages and reread as needed, whether you have a swap 
partition or not.

> Another good example is launching X and a few heavy X apps, going back
> to console, doing a few things, like compiling different kernel trees.
> If you have swap, the X + X apps will be swapped. going back to X will
> take ages, because all that data + code has to be moved out to RAM to
> cache the data in the two kernel trees.

Whereas without swapspace, only the read-only mapped pages can be swapped out.

> If you don't have swap, maybe one, or both of the two kernel trees
> will end up being not cached into main memory, depending on how much
> RAM left you have. but going back to X will take 1 second instead of 20,
> and thus the system will be more responsive.

You're trading throughput for responsiveness, here: you save 19 seconds 
switching to/from X, but walking through the two kernel trees will be slowed 
down by more than that amount... By most metrics, keeping X+apps in memory 
and forcing your kernel tree accesses to hit the disk is the WRONG strategy.

(Making X mlock() some or all of itself into RAM might make sense here, 
perhaps?)

> It depends clearly on the situation you're in. I believe running with
> swap is beneficial when your memory load is more than 75% of total
> RAM, and less so when you have a few hundred megs of RAM left with all
> useful apps loaded into RAM (which is not too unlikely these days,
> due to the low price of SD/DDR RAM).

Provided the VM is doing its job properly, adding swap will always be a net 
win for efficiency: the kernel is able to dump unused pages to make more room 
for others. Of course, you tend to "feel" the response times to interactive 
events, rather than the overall throughput, so a change which slows the 
system down but makes it more "responsive" to mouse clicks etc feels like a 
net win...


James.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-18 22:43     ` Swap François Cami
  2001-11-19  9:18       ` Swap James A Sutherland
@ 2001-11-19 10:03       ` Tim Connors
  2001-11-19 10:16         ` Swap Dan Maas
  1 sibling, 1 reply; 76+ messages in thread
From: Tim Connors @ 2001-11-19 10:03 UTC (permalink / raw)
  To: François Cami; +Cc: Dan Maas, linux-kernel


On Sun, 18 Nov 2001, François Cami wrote:

> Dan Maas wrote:
> 
> 
> > Still, it puzzles me why a system with no swap space would appear to be more
> > responsive than one with swap (assuming their working sets are quite a bit
> > smaller than total amount of RAM)... Can you do a controlled test somehow,
> > to rule out any sort of placebo effect?
> 
> It's pretty simple... Try putting as much progs as you can into RAM
> (but less than total RAM size) when you have RAM+swap.
> Switching from one prog to another now takes time, because if you need
> to go e.g. from mozilla to openoffice for example, if openoffice has
> been swapped, it'll take ages.
> 
> Another good example is launching X and a few heavy X apps, going back
> to console, doing a few things, like compiling different kernel trees.
> If you have swap, the X + X apps will be swapped. going back to X will
> take ages, because all that data + code has to be moved out to RAM to
> cache the data in the two kernel trees.
> If you don't have swap, maybe one, or both of the two kernel trees
> will end up being not cached into main memory, depending on how much
> RAM left you have. but going back to X will take 1 second instead of 20,
> and thus the system will be more responsive.
> 
> It depends clearly on the situation you're in. I believe running with
> swap is beneficial when your memory load is more than 75% of total
> RAM, and less so when you have a few hundred megs of RAM left with all
> useful apps loaded into RAM (which is not too unlikely these days,
> due to the low price of SD/DDR RAM).

A perfect example of why a system _needs_ tuning knobs - this view of
Linus's that we need a self tuning system is idiotic, because some of us
don't care how long a kernel compile takes (or even how long it takes to
serve a couple of web pages per hour), but _do_ care about the general
system responsiveness. The system cannot predict what *I* the user wants
out of it. Hence we need /proc interfaces to the VM that say this is a
compiling machine, or this is a desktop machine.....

-- 
TimC -- http://www.physics.usyd.edu.au/~tcon/

cat ~/.signature
Passing cosmic ray (core dumped)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 10:03       ` Swap Tim Connors
@ 2001-11-19 10:16         ` Dan Maas
  0 siblings, 0 replies; 76+ messages in thread
From: Dan Maas @ 2001-11-19 10:16 UTC (permalink / raw)
  To: Tim Connors, François Cami; +Cc: linux-kernel

> > If you don't have swap, maybe one, or both of the two
> > kernel trees will end up being not cached into main
> > memory, depending on how much RAM left you have. but going
> > back to X will take 1 second instead of 20,
> > and thus the system will be more responsive.

> A perfect example of why a system _needs_ tuning knobs - this view of
> Linus's that we need a self tuning system is idiotic, because some of us
> don't care how long a kernel compile takes (or even how long it takes to
> serve a couple of web pages per hour), but _do_ care about the general
> system responsiveness.

For what it's worth, I heartily agree...

Linus et al might very well say "if you care so much about keeping X in RAM,
just mlock() it." This is certainly worth a shot. (though I'd much prefer a
configurable 'weight' or 'stickiness' for file mappings vs. cached buffers).

Of course this sort of second-order tuning mechanism is a lot less important
than having a VM that doesn't crash or suck badly for common loads =)...
(not that the VM has been bad at all lately; I haven't had any problems
since 2.4.9-ac10 or 2.4.14, knock on wood...)

Regards,
Dan


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19  9:18       ` Swap James A Sutherland
@ 2001-11-19 10:51         ` Remco Post
  2001-11-19 13:33           ` Swap James A Sutherland
  0 siblings, 1 reply; 76+ messages in thread
From: Remco Post @ 2001-11-19 10:51 UTC (permalink / raw)
  To: linux-kernel


--8<--

> Except that openoffice and mozilla can be swapped out in BOTH cases: the 
> kernel can discard mapped pages and reread as needed, whether you have a swap 
> partition or not.
>
No they can't without swap, nothing can be SWAPPED out. The code pages can be 
paged out (discarded), but no SWAPPING takes place.
 

> Whereas without swapspace, only the read-only mapped pages can be swapped out.

Again, pages do not get swapped out, only applications can get swapped out.
Swapping is per definition the process of removing all pages used by one 
application from RAM, and moving ALL pages to swap.


> Provided the VM is doing its job properly, adding swap will always be a net 
> win for efficiency: the kernel is able to dump unused pages to make more room 
> for others. Of course, you tend to "feel" the response times to interactive 
> events, rather than the overall throughput, so a change which slows the 
> system down but makes it more "responsive" to mouse clicks etc feels like a 
> net win...
> 
> 
> James.

With any properly sized system, it will NEVER SWAP. Paging is a completely 
different thing. A little paging is not a problem. Up to 70 pagescans/s on 
occasion is quite acceptable. If paging activity grows above that, you may
have a real problem. I don't know about the current VM, but with most unixes 
when you hit this mark, the system actually starts swapping, and your 
responsiveness goes down the drain....


-- 
Met vriendelijke groeten,

Remco Post

SARA - Stichting Academisch Rekencentrum Amsterdam
High Performance Computing  Tel. +31 20 592 8008    Fax. +31 20 668 3167

"I really didn't foresee the Internet. But then, neither did the computer
industry. Not that that tells us very much of course - the computer industry
didn't even foresee that the century was going to end." -- Douglas Adams



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 10:51         ` Swap Remco Post
@ 2001-11-19 13:33           ` James A Sutherland
  2001-11-19 13:46             ` Swap Remco Post
                               ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: James A Sutherland @ 2001-11-19 13:33 UTC (permalink / raw)
  To: Remco Post, linux-kernel

On Monday 19 November 2001 10:51 am, Remco Post wrote:
> --8<--
>
> > Except that openoffice and mozilla can be swapped out in BOTH cases: the
> > kernel can discard mapped pages and reread as needed, whether you have a
> > swap partition or not.
>
> No they can't without swap, nothing can be SWAPPED out. The code pages can
> be paged out (discarded), but no SWAPPING takes place.

OK, s/swapped/paged/.

> > Whereas without swapspace, only the read-only mapped pages can be swapped
> > out.
>
> Again, pages do not get swapped out, only applications can get swapped out.
> Swapping is per definition the process of removing all pages used by one
> application from RAM, and moving ALL pages to swap.

So in effect, Linux never ever swaps. At all. Under any circumstances. (Using 
your interpretation of the word). Which does raise the question of WTF that 
"swap space" is for, and why it's really used for "paging"...

> > Provided the VM is doing its job properly, adding swap will always be a
> > net win for efficiency: the kernel is able to dump unused pages to make
> > more room for others. Of course, you tend to "feel" the response times to
> > interactive events, rather than the overall throughput, so a change which
> > slows the system down but makes it more "responsive" to mouse clicks etc
> > feels like a net win...
>
> With any properly sized system, it will NEVER SWAP. Paging is a completely
> different thing. A little paging is not a problem. Up to 70 pagescans/s on
> occasion is quite acceptable. If paging activity grows above that, you may
> have a real problem. I don't know about the current VM, but with most
> unixes when you hit this mark, the system actually starts swapping, and
> your responsiveness goes down the drain....

By your definition, Linux does not swap, ever. It only "pages". This is what 
I was referring to as swapping, since this involves the SWAPspace/partition, 
rather than PAGEfile :)


James.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 13:33           ` Swap James A Sutherland
@ 2001-11-19 13:46             ` Remco Post
  2001-11-19 16:58               ` Swap Rik van Riel
  2001-11-19 16:36             ` Swap Jesse Pollard
  2001-11-20 14:51             ` Swap J.A. Magallon
  2 siblings, 1 reply; 76+ messages in thread
From: Remco Post @ 2001-11-19 13:46 UTC (permalink / raw)
  To: James A Sutherland; +Cc: Remco Post, linux-kernel, remco

> On Monday 19 November 2001 10:51 am, Remco Post wrote:
> > --8<--
> >
> > > Except that openoffice and mozilla can be swapped out in BOTH cases: the
> > > kernel can discard mapped pages and reread as needed, whether you have a
> > > swap partition or not.
> >
> > No they can't without swap, nothing can be SWAPPED out. The code pages can
> > be paged out (discarded), but no SWAPPING takes place.
> 
> OK, s/swapped/paged/.
> 
> > > Whereas without swapspace, only the read-only mapped pages can be swapped
> > > out.
> >
> > Again, pages do not get swapped out, only applications can get swapped out.
> > Swapping is per definition the process of removing all pages used by one
> > application from RAM, and moving ALL pages to swap.
> 
> So in effect, Linux never ever swaps. At all. Under any circumstances. (Using 
> your interpretation of the word). Which does raise the question of WTF that 
> "swap space" is for, and why it's really used for "paging"...
> 
Linux does swap (I guess), swapping is a very extreme measure, "I need memory 
now, and the paging algorithm does not work any more", this is quite rare, but 
a few runaway netscape processes can easily cause this....


> > > Provided the VM is doing its job properly, adding swap will always be a
> > > net win for efficiency: the kernel is able to dump unused pages to make
> > > more room for others. Of course, you tend to "feel" the response times to
> > > interactive events, rather than the overall throughput, so a change which
> > > slows the system down but makes it more "responsive" to mouse clicks etc
> > > feels like a net win...
> >
> > With any properly sized system, it will NEVER SWAP. Paging is a completely
> > different thing. A little paging is not a problem. Up to 70 pagescans/s on
> > occasion is quite acceptable. If paging activity grows above that, you may
> > have a real problem. I don't know about the current VM, but with most
> > unixes when you hit this mark, the system actually starts swapping, and
> > your responsiveness goes down the drain....
> 
> By your definition, Linux does not swap, ever. It only "pages". This is what 
> I was referring to as swapping, since this involves the SWAPspace/partition, 
> rather than PAGEfile :)
> 
> 
> James.
> 

It is quite a common mistake. When discussing the VM, it is important to make 
the distinction. In the old days (about the time when I was born ;) swapping 
was the only thing Unixes ever did, no paging, which is quite a recent 
invention. As you'd expect, this is why you have a swapspace that is now also 
used for paging. As a test, you could quite simply build an application that 
uses so much memory (not only malloc it, but also USE it)  that your system 
will start swapping, try using any interactive application after that, and 
you'll feel why you really don't want a system to swap...
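
A minimal sketch of such a test program in C follows. The function name
touch_pages() is made up for this example. The point it illustrates is that
on a typical Linux configuration malloc() alone mostly reserves address
space, and it is the write to each page that forces the kernel to find
physical memory for it. Be careful with the size you pass in: asking for more
than the machine's RAM is exactly the experiment described above and will
make the box very unpleasant to use.

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` of memory and write one byte into every page of it, so
 * that each page really has to be backed by RAM (or swap).
 * Returns the number of pages touched, or 0 if nothing was allocated. */
size_t touch_pages(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *buf;
    size_t off, touched = 0;

    if (page <= 0 || bytes == 0)
        return 0;

    buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    /* volatile stops the compiler from optimising the writes away */
    for (off = 0; off < bytes; off += (size_t)page) {
        buf[off] = 1;
        touched++;
    }

    free((void *)buf);
    return touched;
}
```

To actually hold the memory while you try an interactive application, a real
test would keep the buffer allocated (and keep re-touching it) instead of
freeing it straight away as this sketch does.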



-- 
Met vriendelijke groeten,

Remco Post

SARA - Stichting Academisch Rekencentrum Amsterdam
High Performance Computing  Tel. +31 20 592 8008    Fax. +31 20 668 3167

"I really didn't foresee the Internet. But then, neither did the computer
industry. Not that that tells us very much of course - the computer industry
didn't even foresee that the century was going to end." -- Douglas Adams



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 13:33           ` Swap James A Sutherland
  2001-11-19 13:46             ` Swap Remco Post
@ 2001-11-19 16:36             ` Jesse Pollard
  2001-11-20 14:51             ` Swap J.A. Magallon
  2 siblings, 0 replies; 76+ messages in thread
From: Jesse Pollard @ 2001-11-19 16:36 UTC (permalink / raw)
  To: jas88, Remco Post, linux-kernel

James A Sutherland <jas88@cam.ac.uk>:
> On Monday 19 November 2001 10:51 am, Remco Post wrote:
> > --8<--
> >
> > > Except that openoffice and mozilla can be swapped out in BOTH cases: the
> > > kernel can discard mapped pages and reread as needed, whether you have a
> > > swap partition or not.
> >
> > No they can't without swap, nothing can be SWAPPED out. The code pages can
> > be paged out (discarded), but no SWAPPING takes place.
> 
> OK, s/swapped/paged/.
> 
> > > Whereas without swapspace, only the read-only mapped pages can be swapped
> > > out.
> >
> > Again, pages do not get swapped out, only applications can get swapped out.
> > Swapping is per definition the process of removing all pages used by one
> > application from RAM, and moving ALL pages to swap.
> 
> So in effect, Linux never ever swaps. At all. Under any circumstances. (Using 
> your interpretation of the word). Which does raise the question of WTF that 
> "swap space" is for, and why it's really used for "paging"...

Linux doesn't - but some UNIX systems do swap. This is when the kernel pages
out the process header, page tables, process kernel stack ...

At this point the process is in the equivalent state as that of the system
that only does "swapping".

The swap space is used when more physical memory is required than is available
for user data. The modified pages of user data are written to the swap space
and the physical page re-used for another purpose. Effectively "swapping" the
use of the page... :-)

> > > Provided the VM is doing its job properly, adding swap will always be a
> > > net win for efficiency: the kernel is able to dump unused pages to make
> > > more room for others. Of course, you tend to "feel" the response times to
> > > interactive events, rather than the overall throughput, so a change which
> > > slows the system down but makes it more "responsive" to mouse clicks etc
> > > feels like a net win...
> >
> > With any properly sized system, it will NEVER SWAP. Paging is a completely
> > different thing. A little paging is not a problem. Up to 70 pagescans/s on
> > occasion is quite acceptable. If paging activity grows above that, you may
> > have a real problem. I don't know about the current VM, but with most
> > unixes when you hit this mark, the system actually starts swapping, and
> > your responsiveness goes down the drain....
> 
> By your definition, Linux does not swap, ever. It only "pages". This is what 
> I was referring to as swapping, since this involves the SWAPspace/partition, 
> rather than PAGEfile :)

The problem is determining "properly sized system". Second - ALL linux systems
will page in (or swap in) executables, if only at the start of execution
(easiest/fastest way to load the program... mmap is quick, even if it does
blur the distinction between process pages and I/O cache)

Linux uses RAM+SWAP for virtual memory operation, and swaps pages used for
data to the "swap space" to use different "swapped pages" to load back into
physical memory. Since this is effectively hidden from most activity (and
measures), it becomes easy to oversubscribe memory, causing thrashing (lots
of page activity for little gain), where a system with mixed paging + swapping
(page out entire processes and disable scheduling them) CAN make significant
progress.

The other use of RAM is for data caching. It is usually faster to keep file data
loaded into RAM for use by programs. Runtime libraries are frequently where
the majority of CPU time is spent - Instead of waiting for data to be
transferred to RAM for use, Linux tries to "read ahead" accomplishing more
throughput that way by not forcing the active process to wait for the data.

The tricky part is determining the balance between the data cache, and process
memory.

The systems that use a combined paging + swapping use a variety of measures
to decide what should be paged or swapped. Some characteristics used by these
systems are:

 1. number of page faults/sec (swap if > watermark - reduces thrashing)
 2. time elapsed since last completed I/O (if greater than some watermark, swap
	- makes more RAM available)
 3. idle processes (wait time > watermark, swap - discard executable pages,
	swap out data pages)
 4. batch processes (operate at a lower priority - swap non-interactive
	processes - makes more RAM available)
 5. High memory requirements (reduce resident set size; which invokes item 1)
 6. Users priority (swap lower priority processes - make more RAM available)

Of course the sys admin must have control over all of the watermarks and/or
resource allocations. These are more characteristics of a general computation
or batch system than they are of a single user workstation, which is where
Linux started.
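
As a rough illustration of how the first three items in the list combine with
admin-set watermarks, here is a sketch in C. Everything in it is hypothetical:
struct proc_stats, struct swap_watermarks and should_swap_out() are names
invented for this example and are not taken from any real kernel. Items 4 to
6 (batch status, memory requirements, user priority) are left out to keep it
short.

```c
#include <stdbool.h>

/* Hypothetical per-process measurements. */
struct proc_stats {
    double faults_per_sec;   /* item 1: page fault rate */
    double secs_since_io;    /* item 2: time since last completed I/O */
    double secs_waiting;     /* item 3: how long the process has been idle */
};

/* Hypothetical watermarks, meant to be tunable by the sys admin. */
struct swap_watermarks {
    double max_faults_per_sec;
    double max_secs_since_io;
    double max_secs_waiting;
};

/* Decide whether a whole process is a candidate for being swapped out:
 * any single measurement above its watermark is enough. */
bool should_swap_out(const struct proc_stats *p,
                     const struct swap_watermarks *w)
{
    if (p->faults_per_sec > w->max_faults_per_sec)
        return true;    /* thrashing: swapping it out reduces the fault load */
    if (p->secs_since_io > w->max_secs_since_io)
        return true;    /* no recent I/O completion: free its RAM */
    if (p->secs_waiting > w->max_secs_waiting)
        return true;    /* idle: discard code pages, swap out data pages */
    return false;
}
```

The watermark values themselves are site policy; a real system would also
weigh the remaining items before picking a victim.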

Hope I've helped clear up some things.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 13:46             ` Swap Remco Post
@ 2001-11-19 16:58               ` Rik van Riel
       [not found]                 ` <Pine.LNX.4.33L.0111191458150.1491-100000@duckman.distro.conectiva>
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-11-19 16:58 UTC (permalink / raw)
  To: Remco Post; +Cc: James A Sutherland, linux-kernel, remco

On Mon, 19 Nov 2001, Remco Post wrote:

> Linux does swap (I guess), swapping is a very extreme measure, "I need
> memory now, and the paging algorithm does not work any more", this is
> quite rare, but a few runaway netscape processes can easily cause
> this....

Guess again.  Linux doesn't have load control implemented ...

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
       [not found]                 ` <Pine.LNX.4.33L.0111191458150.1491-100000@duckman.distro.conectiva>
@ 2001-11-19 21:13                   ` Alex Bligh - linux-kernel
  2001-11-19 21:17                     ` Swap Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-11-19 21:13 UTC (permalink / raw)
  To: Rik van Riel, Remco Post
  Cc: James A Sutherland, linux-kernel, remco, Alex Bligh - linux-kernel



--On Monday, 19 November, 2001 2:58 PM -0200 Rik van Riel 
<riel@conectiva.com.br> wrote:

> Guess again.  Linux doesn't have load control implemented ...

Out of interest, is received wisdom that this is a good/bad
thing?

--
Alex Bligh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 21:13                   ` Swap Alex Bligh - linux-kernel
@ 2001-11-19 21:17                     ` Rik van Riel
       [not found]                       ` <Pine.LNX.4.33L.0111191917000.1491-100000@duckman.distro.conectiva>
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-11-19 21:17 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel
  Cc: Remco Post, James A Sutherland, linux-kernel, remco

On Mon, 19 Nov 2001, Alex Bligh - linux-kernel wrote:
> --On Monday, 19 November, 2001 2:58 PM -0200 Rik van Riel
> <riel@conectiva.com.br> wrote:
>
> > Guess again.  Linux doesn't have load control implemented ...
>
> Out of interest, is received wisdom that this is a good/bad
> thing?

Load control is a good thing since it means the box
gets slower in a controlled way instead of running
fine one minute and horribly falling over the next
minute.

I'm certainly planning to implement some load control
measures for 2.5.

regards,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
       [not found]                       ` <Pine.LNX.4.33L.0111191917000.1491-100000@duckman.distro.conectiva>
@ 2001-11-19 21:52                         ` Alex Bligh - linux-kernel
  0 siblings, 0 replies; 76+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-11-19 21:52 UTC (permalink / raw)
  To: Rik van Riel, Alex Bligh - linux-kernel
  Cc: Remco Post, James A Sutherland, linux-kernel, remco,
	Alex Bligh - linux-kernel

Rik,

--On Monday, 19 November, 2001 7:17 PM -0200 Rik van Riel 
<riel@conectiva.com.br> wrote:

>> Out of interest, is received wisdom that this is a good/bad
>> thing?
>
> Load control is a good thing since it means the box
> gets slower in a controlled way instead of running
> fine one minute and horribly falling over the next
> minute.
>
> I'm certainly planning to implement some load control
> measures for 2.5.

OK another potentially dumb question on this:

I had previously (mis?)understood load control to mean (say)
clustering page out requests to pages from specific
processes, then altering the scheduler to avoid scheduling these
processes for extended periods of time, then moving onto the next
set of processes to victimize, and so forth; i.e. increasing
scheduler granularity to cope with increased average virtual
memory access times by decreasing VM footprint used per second.

The original poster seemed to be talking about the old-UNIX
definition of swapping, which, if I remember right, was releasing
/all/ clean pages for an app (I guess this has already been done
by the time we want to do this) and paging /all/ dirty pages
& freeing the memory there and then.

I'd have thought swapping was a pretty coarsely-grained
form of load control (and difficult with shared mem etc.);
do you believe there is a requirement to implement (old UNIX)
swapping per-se, or merely to intelligently tweak the scheduler
to cope better with high VM system loads? [the absence of the
former was what I was suggesting might have been considered
a good thing]

--
Alex Bligh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-19 13:33           ` Swap James A Sutherland
  2001-11-19 13:46             ` Swap Remco Post
  2001-11-19 16:36             ` Swap Jesse Pollard
@ 2001-11-20 14:51             ` J.A. Magallon
  2001-11-20 16:01               ` Swap Wolfgang Rohdewald
  2001-11-20 20:58               ` Swap Mike Fedyk
  2 siblings, 2 replies; 76+ messages in thread
From: J.A. Magallon @ 2001-11-20 14:51 UTC (permalink / raw)
  To: James A Sutherland; +Cc: Remco Post, linux-kernel


On 20011119 James A Sutherland wrote:
>On Monday 19 November 2001 10:51 am, Remco Post wrote:
>> --8<--
>>
>> > Except that openoffice and mozilla can be swapped out in BOTH cases: the
>> > kernel can discard mapped pages and reread as needed, whether you have a
>> > swap partition or not.
>>
>> No they can't without swap, nothing can be SWAPPED out. The code pages can
>> be paged out (discarded), but no SWAPPING takes place.
>
>OK, s/swapped/paged/.
>

Not so OK.

AFAIK, that is all a question of names. All is the same. Old systems
like MacOS do SWAP, because when they send something to disk they send the
whole app with its data space to disk. Linux does not send a whole app to
disk, but individual pages, so it does SWAP AT PAGE LEVEL, or paging. When
a page is deleted for one executable (because we can re-read it from on-disk
binary), it is discarded, not paged out. A page is paged-out if it is written
to disk.
So _swapping_ and _paging_ are the same, but with different granularity.

(of course, flame and correct me if I'm wrong...)
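
The distinction above can be sketched as a toy model in Python. This is only
an illustration of the terminology, not the kernel's reclaim algorithm; the
`Page` class, its fields, and `reclaim_action` are hypothetical names, and the
model ignores shared writable file mappings (which would be written back to
their file rather than to swap).

```python
from dataclasses import dataclass


@dataclass
class Page:
    backed_by_file: bool  # contents can be re-read from a file on disk
    modified: bool        # changed in memory since it was read in


def reclaim_action(page, swap_available):
    """Return what this toy model does to free the page's memory."""
    if page.backed_by_file and not page.modified:
        return "discard"      # nothing is written; re-read from the binary later
    if swap_available:
        return "page out"     # written to the swap partition or swap file
    return "keep in RAM"      # without swap there is nowhere to put it


if __name__ == "__main__":
    print(reclaim_action(Page(backed_by_file=True, modified=False), False))
    print(reclaim_action(Page(backed_by_file=False, modified=True), True))
    print(reclaim_action(Page(backed_by_file=False, modified=True), False))
```

In this model a clean text page can be dropped whether or not swap exists,
while modified data can only leave RAM when swap space is configured.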

>> > Whereas without swapspace, only the read-only mapped pages can be swapped
>> > out.
>>

They are not swapped-out, just discarded to be re-read.

>
>By your definition, Linux does not swap, ever. It only "pages". This is what 
>I was referring to as swapping, since this involves the SWAPspace/partition, 
>rather than PAGEfile :)
>

It is the same. You can page-out (because Linux never does swap, in the sense
of sending a whole app to disk) to a specially formatted partition or to
a file. If you are going to be pedantic, Linux really uses _page_partitions_
and _page_files_, instead of swap-partitions and swap-files.

BTW, there is software for the Mac that changes the swap algorithm from app
level to page level; they called it "RamDoubler", and people still think it's
magic...

-- 
J.A. Magallon                           #  Let the source be with you...        
mailto:jamagallon@able.es
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.15-pre6-beo #1 SMP Sun Nov 18 10:25:01 CET 2001 i686

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 14:51             ` Swap J.A. Magallon
@ 2001-11-20 16:01               ` Wolfgang Rohdewald
  2001-11-20 16:06                 ` Swap Remco Post
                                   ` (2 more replies)
  2001-11-20 20:58               ` Swap Mike Fedyk
  1 sibling, 3 replies; 76+ messages in thread
From: Wolfgang Rohdewald @ 2001-11-20 16:01 UTC (permalink / raw)
  To: J.A. Magallon, James A Sutherland; +Cc: Remco Post, linux-kernel

On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> When a page is deleted for one executable (because we can re-read it from
> on-disk binary), it is discarded, not paged out.

What happens if the on-disk binary has changed since loading the program?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 16:01               ` Swap Wolfgang Rohdewald
@ 2001-11-20 16:06                 ` Remco Post
  2001-11-20 16:12                 ` Swap Nick LeRoy
  2001-11-20 16:20                 ` Swap Richard B. Johnson
  2 siblings, 0 replies; 76+ messages in thread
From: Remco Post @ 2001-11-20 16:06 UTC (permalink / raw)
  To: wr6; +Cc: J.A. Magallon, James A Sutherland, linux-kernel

> On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > When a page is deleted for one executable (because we can re-read it from
> > on-disk binary), it is discarded, not paged out.
> 
> What happens if the on-disk binary has changed since loading the program?
> 
The application usually crashes, but in theory it may run with just some 
'strange' behaviour. (Don't worry, apps usually just crash ;)


-- 
Met vriendelijke groeten,

Remco Post

SARA - Stichting Academisch Rekencentrum Amsterdam
High Performance Computing  Tel. +31 20 592 8008    Fax. +31 20 668 3167

"I really didn't foresee the Internet. But then, neither did the computer
industry. Not that that tells us very much of course - the computer industry
didn't even foresee that the century was going to end." -- Douglas Adams



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 16:01               ` Swap Wolfgang Rohdewald
  2001-11-20 16:06                 ` Swap Remco Post
@ 2001-11-20 16:12                 ` Nick LeRoy
  2001-11-20 16:20                 ` Swap Richard B. Johnson
  2 siblings, 0 replies; 76+ messages in thread
From: Nick LeRoy @ 2001-11-20 16:12 UTC (permalink / raw)
  To: wr6, J.A. Magallon, James A Sutherland; +Cc: Remco Post, linux-kernel

On Tuesday 20 November 2001 10:01, Wolfgang Rohdewald wrote:
> On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > When a page is deleted for one executable (because we can re-read it from
> > on-disk binary), it is discarded, not paged out.
>
> What happens if the on-disk binary has changed since loading the program?

In general, you can't...  You get an ETXTBSY 'text file busy' error.  If you 
try to do this over NFS (where the system can't stop you), the running image 
will almost certainly crash if it tries to page in text.
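
The local-filesystem behaviour can be checked with a short Python sketch. It
assumes a Linux machine, a `sleep` binary on the PATH, and a temporary
directory that allows execution; the function name is made up for this
example. It works on a private copy, so no installed binary is touched.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def write_errno_for_running_binary():
    """Return the errno raised when opening a running executable for writing.

    Returns None if the open unexpectedly succeeds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Copy the binary so the experiment never touches a system file.
        exe = os.path.join(tmp, "sleep")
        shutil.copy(shutil.which("sleep"), exe)
        os.chmod(exe, 0o755)
        proc = subprocess.Popen([exe, "30"])  # the copy is now being executed
        try:
            with open(exe, "r+b"):
                return None
        except OSError as err:
            return err.errno
        finally:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    code = write_errno_for_running_binary()
    print(errno.errorcode.get(code, code))
```

The same experiment against a file on an NFS mount, written from another
client or on the server, would not be refused, which is the case described
above.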

-Nick

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 16:01               ` Swap Wolfgang Rohdewald
  2001-11-20 16:06                 ` Swap Remco Post
  2001-11-20 16:12                 ` Swap Nick LeRoy
@ 2001-11-20 16:20                 ` Richard B. Johnson
  2001-11-20 17:14                   ` Swap Christopher Friesen
  2 siblings, 1 reply; 76+ messages in thread
From: Richard B. Johnson @ 2001-11-20 16:20 UTC (permalink / raw)
  To: Wolfgang Rohdewald
  Cc: J.A. Magallon, James A Sutherland, Remco Post, linux-kernel

On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:

> On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > When a page is deleted for one executable (because we can re-read it from
> > on-disk binary), it is discarded, not paged out.
> 
> What happens if the on-disk binary has changed since loading the program?
> -

It can't. That's the reason for `install` and other methods of changing
executable files (mv exe-file exe-file.old ; cp newfile exe-file).
The currently open, and possibly mapped file can be re-named, but it
can't be overwritten.
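
Why renaming is safe can be shown with a small Python sketch. It is a
variation on the mv/cp recipe (the new version is staged under a temporary
name and renamed over the target), and the open descriptor merely stands in
for the kernel's mapping of a running program; the file names and the function
name are illustrative only.

```python
import os
import tempfile


def replace_while_open():
    """Replace a file by rename while an old descriptor is still open.

    Returns (bytes seen through the old descriptor, bytes seen by a new open).
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "exe-file")
        with open(target, "wb") as f:
            f.write(b"old image")

        running = open(target, "rb")  # stands in for the mapped executable

        # Write the new version under another name, then rename it over
        # the target.  The old inode is never written to.
        staged = os.path.join(tmp, "exe-file.new")
        with open(staged, "wb") as f:
            f.write(b"new image")
        os.replace(staged, target)

        with running:
            old_view = running.read()
        with open(target, "rb") as f:
            new_view = f.read()
        return old_view, new_view


if __name__ == "__main__":
    print(replace_while_open())
```

The old descriptor keeps reading the old contents because the name now points
at a different inode; only new opens see the new file.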


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 16:20                 ` Swap Richard B. Johnson
@ 2001-11-20 17:14                   ` Christopher Friesen
  2001-11-20 17:40                     ` Swap Richard B. Johnson
                                       ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Christopher Friesen @ 2001-11-20 17:14 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

"Richard B. Johnson" wrote:
> 
> On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> 
> > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > When a page is deleted for one executable (because we can re-read it from
> > > on-disk binary), it is discarded, not paged out.
> >
> > What happens if the on-disk binary has changed since loading the program?
> > -
> 
> It can't. That's the reason for `install` and other methods of changing
> executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> The currently open, and possibly mapped file can be re-named, but it
> can't be overwritten.

Actually, with NFS (and probably others) it can.  Suppose I change the file on
the server, and it's swapped out on a client that has it mounted.  When it swaps
back in, it can get the new information.

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10  
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:14                   ` Swap Christopher Friesen
@ 2001-11-20 17:40                     ` Richard B. Johnson
  2001-11-20 18:14                       ` Swap Nick LeRoy
                                         ` (2 more replies)
  2001-11-20 17:58                     ` Swap Wolfgang Rohdewald
  2001-11-20 21:05                     ` Swap Steffen Persvold
  2 siblings, 3 replies; 76+ messages in thread
From: Richard B. Johnson @ 2001-11-20 17:40 UTC (permalink / raw)
  To: Christopher Friesen; +Cc: linux-kernel

On Tue, 20 Nov 2001, Christopher Friesen wrote:

> "Richard B. Johnson" wrote:
> > 
> > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > 
> > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > When a page is deleted for one executable (because we can re-read it from
> > > > on-disk binary), it is discarded, not paged out.
> > >
> > > What happens if the on-disk binary has changed since loading the program?
> > > -
> > 
> > It can't. That's the reason for `install` and other methods of changing
> > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > The currently open, and possibly mapped file can be re-named, but it
> > can't be overwritten.
> 
> Actually, with NFS (and probably others) it can.  Suppose I change the file on
> the server, and it's swapped out on a client that has it mounted.  When it swaps
> back in, it can get the new information.
> 
> Chris

I note that NFS files don't currently return ETXTBSY, but this is a bug.
It is 'known' to the OS that the NFS mounted file-system is busy because
you can't unmount the file-system while an executable is running. If
you can trash it (as you can on Linux), it is surely a bug.

Alan explained a few years ago that NFS was "stateless". Nevertheless
it is still a bug.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:14                   ` Swap Christopher Friesen
  2001-11-20 17:40                     ` Swap Richard B. Johnson
@ 2001-11-20 17:58                     ` Wolfgang Rohdewald
  2001-11-26 21:51                       ` [Linux-abi-devel] Swap Christoph Hellwig
  2001-11-20 21:05                     ` Swap Steffen Persvold
  2 siblings, 1 reply; 76+ messages in thread
From: Wolfgang Rohdewald @ 2001-11-20 17:58 UTC (permalink / raw)
  To: Christopher Friesen, root; +Cc: linux-kernel, linux-abi-devel

On Tuesday 20 November 2001 18:14, Christopher Friesen wrote:
> "Richard B. Johnson" wrote:
> > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > When a page is deleted for one executable (because we can re-read it
> > > > from on-disk binary), it is discarded, not paged out.
> > >
> > > What happens if the on-disk binary has changed since loading the
> > > program? -
> >
> > It can't. That's the reason for `install` and other methods of changing
> > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > The currently open, and possibly mapped file can be re-named, but it
> > can't be overwritten.
>
> Actually, with NFS (and probably others) it can.  Suppose I change the file
> on the server, and it's swapped out on a client that has it mounted.  When
> it swaps back in, it can get the new information.

I am quite sure this is also possible if the binary is emulated by the linux-abi
modules like my old SCO binaries. I just cannot check right now because I did
not yet get linux-abi working with 2.4.15-pre7 (worked with 2.4.15-pre4, but
pre4 had a seemingly VM related OOPS when starting VMware3 which is gone with pre7)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:40                     ` Swap Richard B. Johnson
@ 2001-11-20 18:14                       ` Nick LeRoy
  2001-11-21 10:17                         ` Swap Helge Hafting
  2001-11-20 23:20                       ` Swap Luigi Genoni
  2001-11-21 16:44                       ` Swap Remco Post
  2 siblings, 1 reply; 76+ messages in thread
From: Nick LeRoy @ 2001-11-20 18:14 UTC (permalink / raw)
  To: root, Christopher Friesen; +Cc: linux-kernel

<snip>
> I note that NFS files don't currently return ETXTBSY, but this is a bug.
> It is 'known' to the OS that the NFS mounted file-system is busy because
> you can't unmount the file-system while an executable is running. If
> you can trash it (as you can on Linux), it is surely a bug.
>
> Alan explained a few years ago that NFS was "stateless". Nevertheless
> it is still a bug.

Correct me if I'm wrong, but I think that it's more a bug in the NFS protocol 
than in the Linux (or Solaris, etc) NFS implementation.  The problem is that 
NFS itself just doesn't pass that information along.  The NFS server has no 
idea that the 'text' file is being executed, so it doesn't know that it 
should "return" ETXTBSY.

Now, this might be different in NFS v3, but I'm pretty sure that this applies 
for v2, at least.

-Nick

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 14:51             ` Swap J.A. Magallon
  2001-11-20 16:01               ` Swap Wolfgang Rohdewald
@ 2001-11-20 20:58               ` Mike Fedyk
  1 sibling, 0 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-11-20 20:58 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: James A Sutherland, Remco Post, linux-kernel

On Tue, Nov 20, 2001 at 03:51:43PM +0100, J.A. Magallon wrote:
> BTW, there is software for the Mac that changes the swap algorithm from app
> level to page level; they called it "RamDoubler", and people still think it's
> magic...
> 

Ahh, so that's what it does, in addition to compression...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:14                   ` Swap Christopher Friesen
  2001-11-20 17:40                     ` Swap Richard B. Johnson
  2001-11-20 17:58                     ` Swap Wolfgang Rohdewald
@ 2001-11-20 21:05                     ` Steffen Persvold
  2001-11-20 21:18                       ` Swap Mike Fedyk
                                         ` (2 more replies)
  2 siblings, 3 replies; 76+ messages in thread
From: Steffen Persvold @ 2001-11-20 21:05 UTC (permalink / raw)
  To: Christopher Friesen; +Cc: root, linux-kernel

Christopher Friesen wrote:
> 
> "Richard B. Johnson" wrote:
> >
> > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> >
> > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > When a page is deleted for one executable (because we can re-read it from
> > > > on-disk binary), it is discarded, not paged out.
> > >
> > > What happens if the on-disk binary has changed since loading the program?
> > > -
> >
> > It can't. That's the reason for `install` and other methods of changing
> > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > The currently open, and possibly mapped file can be re-named, but it
> > can't be overwritten.
> 
> Actually, with NFS (and probably others) it can.  Suppose I change the file on
> the server, and it's swapped out on a client that has it mounted.  When it swaps
> back in, it can get the new information.
> 

This sounds really dangerous... What about shared libraries ??

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best   
 mailto:sp@scali.no  |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.12.2 -         
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >300MBytes/s and <4uS latency

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:05                     ` Swap Steffen Persvold
@ 2001-11-20 21:18                       ` Mike Fedyk
  2001-11-20 21:33                         ` Swap Nick LeRoy
  2001-11-20 21:43                         ` Swap Richard B. Johnson
  2001-11-20 21:19                       ` Swap Nick LeRoy
  2001-11-21 16:48                       ` Swap Remco Post
  2 siblings, 2 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-11-20 21:18 UTC (permalink / raw)
  To: Steffen Persvold; +Cc: Christopher Friesen, root, linux-kernel

On Tue, Nov 20, 2001 at 10:05:37PM +0100, Steffen Persvold wrote:
> Christopher Friesen wrote:
> > 
> > "Richard B. Johnson" wrote:
> > >
> > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > >
> > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > When a page is deleted for one executable (because we can re-read it from
> > > > > on-disk binary), it is discarded, not paged out.
> > > >
> > > > What happens if the on-disk binary has changed since loading the program?
> > > > -
> > >
> > > It can't. That's the reason for `install` and other methods of changing
> > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > The currently open, and possibly mapped file can be re-named, but it
> > > can't be overwritten.
> > 
> > Actually, with NFS (and probably others) it can.  Suppose I change the file on
> > the server, and it's swapped out on a client that has it mounted.  When it swaps
> > back in, it can get the new information.
> > 
> 
> This sounds really dangerous... What about shared libraries ??
> 

IIRC (if wrong flame...)

When you delete an open file, the entry is removed from the directory, but
not unlinked until the file is closed.  This is a standard UNIX semantic.

Now, if you have a set of processes with shared memory, and one closes, and
another is created to replace, the new process will get the new libraries,
or even new version of the process.  This could/will bring down the entire
set of processes.

Apps like samba come to mind...

Mike

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:05                     ` Swap Steffen Persvold
  2001-11-20 21:18                       ` Swap Mike Fedyk
@ 2001-11-20 21:19                       ` Nick LeRoy
  2001-11-21 16:48                       ` Swap Remco Post
  2 siblings, 0 replies; 76+ messages in thread
From: Nick LeRoy @ 2001-11-20 21:19 UTC (permalink / raw)
  To: Steffen Persvold, Christopher Friesen; +Cc: root, linux-kernel

On Tuesday 20 November 2001 15:05, Steffen Persvold wrote:
> Christopher Friesen wrote:
> > "Richard B. Johnson" wrote:
> > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > When a page is deleted for one executable (because we can re-read
> > > > > it from on-disk binary), it is discarded, not paged out.
> > > >
> > > > What happens if the on-disk binary has changed since loading the
> > > > program? -
> > >
> > > It can't. That's the reason for `install` and other methods of changing
> > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > The currently open, and possibly mapped file can be re-named, but it
> > > can't be overwritten.
> >
> > Actually, with NFS (and probably others) it can.  Suppose I change the
> > file on the server, and it's swapped out on a client that has it mounted.
> >  When it swaps back in, it can get the new information.
>
> This sounds really dangerous... What about shared libraries ??

It is.  Usually it ends with a loud 'boom': the process crashes & burns.

-Nick

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:18                       ` Swap Mike Fedyk
@ 2001-11-20 21:33                         ` Nick LeRoy
  2001-11-20 21:44                           ` Swap Mike Fedyk
  2001-11-20 21:43                         ` Swap Richard B. Johnson
  1 sibling, 1 reply; 76+ messages in thread
From: Nick LeRoy @ 2001-11-20 21:33 UTC (permalink / raw)
  To: Mike Fedyk, Steffen Persvold; +Cc: Christopher Friesen, root, linux-kernel

On Tuesday 20 November 2001 15:18, Mike Fedyk wrote:
> On Tue, Nov 20, 2001 at 10:05:37PM +0100, Steffen Persvold wrote:
> > Christopher Friesen wrote:
> > > "Richard B. Johnson" wrote:
> > > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > > When a page is deleted for one executable (because we can re-read
> > > > > > it from on-disk binary), it is discarded, not paged out.
> > > > >
> > > > > What happens if the on-disk binary has changed since loading the
> > > > > program? -
> > > >
> > > > It can't. That's the reason for `install` and other methods of
> > > > > changing executable files (mv exe-file exe-file.old ; cp newfile
> > > > exe-file). The currently open, and possibly mapped file can be
> > > > re-named, but it can't be overwritten.
> > >
> > > Actually, with NFS (and probably others) it can.  Suppose I change the
> > > file on the server, and it's swapped out on a client that has it
> > > mounted.  When it swaps back in, it can get the new information.
> >
> > This sounds really dangerous... What about shared libraries ??
>
> IIRC (if wrong flame...)
>
> When you delete an open file, the entry is removed from the directory, but
> not unlinked until the file is closed.  This is a standard UNIX semantic.
>
> Now, if you have a set of processes with shared memory, and one closes, and
> another is created to replace, the new process will get the new libraries,
> or even new version of the process.  This could/will bring down the entire
> set of processes.
>
> Apps like samba come to mind...

*Any* time that you write to an executing executable, all bets are off.  The 
most likely outcome is a big 'ol crash & burn.  With a local FS, Unix 
prevents you from shooting yourself in the foot, but with NFS, fire away..  
I've done it.  It *does* let you, but...

Solution:  Don't do that.  Shut them all down, on all clients, upgrade the 
binaries, then restart the processes on the clients.

As far as the scenario that you've described, I *think* that it would 
actually work.  When the new process is fork()ed, it gets a copy of the file 
descriptors from its parent, so the file is still open to it.  If it then 
exec()s, the new image no longer has any real ties to its parent (at least, 
not that are relevant to this).

If it's created via clone(), then, once again, it's got its parent's 
descriptors still open, so no problem.
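
The descriptor-inheritance point can be illustrated with a minimal Python
sketch for a POSIX system. It only demonstrates fork() on a local file whose
name has already been removed; it says nothing about exec() or shared
libraries, and the function name is invented for the example.

```python
import os
import tempfile


def child_reads_deleted_file():
    """Open a file, delete its name, fork, and let the child read it."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"still here")
    os.unlink(path)                  # the name is gone, the inode is not
    os.lseek(fd, 0, os.SEEK_SET)

    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it inherited fd, so the file is still open to it.
        try:
            os.close(read_end)
            os.write(write_end, os.read(fd, 100))
        finally:
            os._exit(0)

    # Parent: collect what the child managed to read.
    os.close(write_end)
    data = os.read(read_end, 100)
    os.close(read_end)
    os.waitpid(pid, 0)
    os.close(fd)
    return data


if __name__ == "__main__":
    print(child_reads_deleted_file())
```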

I think the real problems only exist over NFS and NFS-like scenarios.

Did I miss something here, or am I actually correct?  I was correct once, 
let's see...  Ooops.  That was a mistake too.

-Nick

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:18                       ` Swap Mike Fedyk
  2001-11-20 21:33                         ` Swap Nick LeRoy
@ 2001-11-20 21:43                         ` Richard B. Johnson
  2001-11-20 21:50                           ` NFS, Paging & Installing [was: Re: Swap] Mike Fedyk
  1 sibling, 1 reply; 76+ messages in thread
From: Richard B. Johnson @ 2001-11-20 21:43 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Steffen Persvold, Christopher Friesen, linux-kernel

On Tue, 20 Nov 2001, Mike Fedyk wrote:

> On Tue, Nov 20, 2001 at 10:05:37PM +0100, Steffen Persvold wrote:
> > Christopher Friesen wrote:
> > > 
> > > "Richard B. Johnson" wrote:
> > > >
> > > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > >
> > > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > > When a page is deleted for one executable (because we can re-read it from
> > > > > > on-disk binary), it is discarded, not paged out.
> > > > >
> > > > > What happens if the on-disk binary has changed since loading the program?
> > > > > -
> > > >
> > > > It can't. That's the reason for `install` and other methods of changing
> > > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > > The currently open, and possibly mapped file can be re-named, but it
> > > > can't be overwritten.
> > > 
> > > Actually, with NFS (and probably others) it can.  Suppose I change the file on
> > > the server, and it's swapped out on a client that has it mounted.  When it swaps
> > > back in, it can get the new information.
> > > 
> > 
> > This sounds really dangerous... What about shared libraries ??
> > 
> 
> IIRC (if wrong flame...)
> 
> When you delete an open file, the entry is removed from the directory, but
> not unlinked until the file is closed.  This is a standard UNIX semantic.
> 
> Now, if you have a set of processes with shared memory, and one closes, and
> another is created to replace, the new process will get the new libraries,
> or even new version of the process.  This could/will bring down the entire
> set of processes.
> 
> Apps like samba come to mind...
> 
> Mike

If the file is local, everything is fine. A file won't actually
be deleted until the last access is closed. However, the long-standing
problem with NFS is that it's `phony`. Basically, we send a message
to a server that says "Give me a directory listing...". The server
does the `opendir()` etc., and returns the results. If I want to
open a file on the server, the server has no knowledge of the `open`.
The client's software just emulated a file-system open(). When the
client wants to read data from a server's file, it sends a message;
"Gimmie data from file xxx, offset x, length y.". The server responds
with that data. To get that data, the server did an open/lseek/read/close.

So, as far as the server is concerned, that file is closed. Somebody
else (with privilege) can delete the file and replace it. The client,
the one that got the data for an executable, doesn't even know it.
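
The request pattern described above can be sketched in Python. This is a
local simulation only, assuming a server that handles every read with its own
open/seek/read/close; real NFS identifies files by file handle rather than by
path and adds client-side caching, and both function names are hypothetical.

```python
import os
import tempfile


def stateless_read(path, offset, length):
    """Serve one read request: open, seek, read, close.

    Nothing is remembered between calls.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def client_sees_mixed_image():
    """Fetch a file in two requests while it is replaced in between."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "program")
        with open(path, "wb") as f:
            f.write(b"AAAABBBB")          # version 1 of the "executable"

        first_half = stateless_read(path, 0, 4)

        # Someone replaces the file on the server between the two requests.
        with open(path, "wb") as f:
            f.write(b"CCCCDDDD")          # version 2

        second_half = stateless_read(path, 4, 4)
        return first_half + second_half


if __name__ == "__main__":
    print(client_sees_mixed_image())
```

The client ends up with the first half of one version and the second half of
another, which is the failure mode for a program paging in text over NFS.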

This is 'nice' for the server, it doesn't have the overhead of maintaining
a file-system state. That's why servers are supposed to be read-only.
However, somebody has got to write the stuff to the file-system that's
going to (eventually) be read-only. Beware when such access occurs.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:33                         ` Swap Nick LeRoy
@ 2001-11-20 21:44                           ` Mike Fedyk
  2001-11-20 22:00                             ` Swap Nick LeRoy
  2001-11-21 16:53                             ` Swap Remco Post
  0 siblings, 2 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-11-20 21:44 UTC (permalink / raw)
  To: Nick LeRoy; +Cc: Steffen Persvold, Christopher Friesen, root, linux-kernel

On Tue, Nov 20, 2001 at 03:33:28PM -0600, Nick LeRoy wrote:
> On Tuesday 20 November 2001 15:18, Mike Fedyk wrote:
> > On Tue, Nov 20, 2001 at 10:05:37PM +0100, Steffen Persvold wrote:
> > > Christopher Friesen wrote:
> > > > "Richard B. Johnson" wrote:
> > > > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > > > When a page is deleted for one executable (because we can re-read
> > > > > > > it from on-disk binary), it is discarded, not paged out.
> > > > > >
> > > > > > What happens if the on-disk binary has changed since loading the
> > > > > > program? -
> > > > >
> > > > > It can't. That's the reason for `install` and other methods of
> > > > > > changing executable files (mv exe-file exe-file.old ; cp newfile
> > > > > exe-file). The currently open, and possibly mapped file can be
> > > > > re-named, but it can't be overwritten.
> > > >
> > > > Actually, with NFS (and probably others) it can.  Suppose I change the
> > > > file on the server, and it's swapped out on a client that has it
> > > > mounted.  When it swaps back in, it can get the new information.
> > >
> > > This sounds really dangerous... What about shared libraries ??
> >
> > IIRC (if wrong flame...)
> >
> > When you delete an open file, the entry is removed from the directory, but
> > not unlinked until the file is closed.  This is a standard UNIX semantic.
> >
> > Now, if you have a set of processes with shared memory, and one closes, and
> > another is created to replace, the new process will get the new libraries,
> > or even new version of the process.  This could/will bring down the entire
> > set of processes.
> >
> > Apps like samba come to mind...
> 
> *Any* time that you write to an executing executable, all bets are off.  The 
> most likely outcome is a big 'ol crash & burn.  With a local FS, Unix 
> prevents you from shooting yourself in the foot, but with NFS, fire away..  
> I've done it.  It *does* let you, but...
> 
> Solution:  Don't do that.  Shut them all down, on all clients, upgrade the 
> binaries, then restart the processes on the clients.
> 
> As far as the scenario that you've described, I *think* that it would 
> actually work.  When the new process is fork()ed, it gets a copy of the file 
> descriptors from its parent, so the file is still open to it.  If it then 
> exec()s, the new image no longer has any real ties to its parent (at least, 
> not that are relevant to this).
> 

What about processes with shared memory such as samba 2.0?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* NFS, Paging & Installing [was: Re: Swap]
  2001-11-20 21:43                         ` Swap Richard B. Johnson
@ 2001-11-20 21:50                           ` Mike Fedyk
  2001-11-21  1:22                             ` Horst von Brand
  0 siblings, 1 reply; 76+ messages in thread
From: Mike Fedyk @ 2001-11-20 21:50 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Steffen Persvold, Christopher Friesen, linux-kernel

On Tue, Nov 20, 2001 at 04:43:01PM -0500, Richard B. Johnson wrote:
> On Tue, 20 Nov 2001, Mike Fedyk wrote:
> > IIRC (if wrong flame...)
> > 
> > When you delete an open file, the entry is removed from the directory, but
> > not unlinked until the file is closed.  This is a standard UNIX semantic.
> > 
> > Now, if you have a set of processes with shared memory, and one closes, and
> > another is created to replace, the new process will get the new libraries,
> > or even new version of the process.  This could/will bring down the entire
> > set of processes.
> > 
> > Apps like samba come to mind...
> > 
> > Mike
> 
> If the file is local, everything is fine. A file won't actually
> be deleted until the last access is closed. However, the long-standing
> problem with NFS is that it's `phony`. Basically, we send a message
> to a server that says "Give me a directory listing...". The server
> does the `opendir()` etc., and returns the results. If I want to
> open a file on the server, the server has no knowledge of the `open`.
> The client's software just emulated a file-system open(). When the
> client wants to read data from a server's file, it sends a message;
> "Gimmie data from file xxx, offset x, length y.". The server responds
> with that data. To get that data, the server did an open/lseek/read/close.
> 
> So, as far as the server is concerned, that file is closed. Somebody
> else (with privilege) can delete the file and replace it. The client,
> the one that got the data for an executable, doesn't even know it.
> 
> This is 'nice' for the server, it doesn't have the overhead of maintaining
> a file-system state. That's why servers are supposed to be read-only.
> However, somebody has got to write the stuff to the file-system that's
> going to (eventually) be read-only. Beware when such access occurs.
> 

Do any newer versions of NFS fix the stateless server problem?

If not, are there any drop-in (at least for Linux) replacements that do keep
state on the server?

SMB is out because it doesn't propagate the Unix uid/gid.

Stripped-down (auth-wise) AFS?

InterMezzo?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:44                           ` Swap Mike Fedyk
@ 2001-11-20 22:00                             ` Nick LeRoy
  2001-11-21 16:53                             ` Swap Remco Post
  1 sibling, 0 replies; 76+ messages in thread
From: Nick LeRoy @ 2001-11-20 22:00 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Steffen Persvold, Christopher Friesen, root, linux-kernel

On Tuesday 20 November 2001 15:44, Mike Fedyk wrote:

<SNIP>

> > *Any* time that you write to an executing executable, all bets are off. 
> > The most likely outcome is a big 'ol crash & burn.  With a local FS, Unix
> > prevents you from shooting yourself in the foot, but with NFS, fire
> > away.. I've done it.  It *does* let you, but...
> >
> > Solution:  Don't do that.  Shut them all down, on all clients, upgrade
> > the binaries, then restart the processes on the clients.
> >
> > As far as the scenario that you've described, I *think* that it would
> > actually work.  When the new process is fork()ed, it gets a copy of the
> > file descriptors from its parent, so the file is still open to it.  If
> > it then exec()s, the new image no longer has any real ties to its parent
> > (at least, not that are relevant to this).
>
> What about processes with shared memory such as samba 2.0?

fork()ed processes are *identical* to their parents except for the return 
value from fork().  They have the same shared memory handles, file 
descriptors, etc.  The kernel "knows" that there's an extra copy of each, and 
updates its reference counts, etc.

Actually, the real point is that it'll still be the old executable running 
with the old libraries, until you shut down the whole group.  Each of the 
processes is "linked" to the original file, so the new version will never 
run 'til the whole group is restarted.

It should just work.  I can't think of any reason why it shouldn't.
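
The descriptor inheritance described above can be sketched in a few lines of
Python.  This is an illustration only, not code from the thread: the file
name is made up, and it shows a plain fork() without exec() on a local POSIX
filesystem.  The parent opens a file, the file is then replaced on disk, and
a child forked afterwards still reads the old contents through the inherited
descriptor.

```python
import os
import tempfile

def child_sees_old_file():
    """Return what a forked child reads via a descriptor opened before the upgrade."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "data")      # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"old version")
    fd = os.open(path, os.O_RDONLY)           # opened by the parent before the upgrade

    os.unlink(path)                           # "upgrade": drop the old directory entry
    with open(path, "wb") as f:               # and create a new file under the same name
        f.write(b"new version")

    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                              # child: use the inherited descriptor
        try:
            os.close(r)
            os.write(w, os.read(fd, 100))
        finally:
            os._exit(0)                       # never fall through into the parent's code
    os.close(w)
    seen_by_child = os.read(r, 100)
    os.waitpid(pid, 0)
    os.close(r)
    os.close(fd)
    os.remove(path)
    os.rmdir(workdir)
    return seen_by_child

if __name__ == "__main__":
    print(child_sees_old_file())
```

On NFS the same sequence may behave differently, which is the point made
elsewhere in this thread.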

-Nick

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:40                     ` Swap Richard B. Johnson
  2001-11-20 18:14                       ` Swap Nick LeRoy
@ 2001-11-20 23:20                       ` Luigi Genoni
  2001-11-21 16:44                       ` Swap Remco Post
  2 siblings, 0 replies; 76+ messages in thread
From: Luigi Genoni @ 2001-11-20 23:20 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Christopher Friesen, linux-kernel



On Tue, 20 Nov 2001, Richard B. Johnson wrote:

> On Tue, 20 Nov 2001, Christopher Friesen wrote:
>
> > "Richard B. Johnson" wrote:
> > >
> > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > >
> > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > When a page is deleted for one executable (because we can re-read it from
> > > > > on-disk binary), it is discarded, not paged out.
> > > >
> > > > What happens if the on-disk binary has changed since loading the program?
> > > > -
> > >
> > > It can't. That's the reason for `install` and other methods of changing
> > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > The currently open, and possibly mapped file can be re-named, but it
> > > can't be overwritten.
> >
> > Actually, with NFS (and probably others) it can.  Suppose I change the file on
> > the server, and it's swapped out on a client that has it mounted.  When it swaps
> > back in, it can get the new information.
> >
> > Chris
>
> I note that NFS files don't currently return ETXTBSY, but this is a bug.
> It is 'known' to the OS that the NFS mounted file-system is busy because
> you can't unmount the file-system while an executable is running. If
> you can trash it (as you can on Linux), it is surely a bug.
>
In most cases, the process on the client simply dies...




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-20 21:50                           ` NFS, Paging & Installing [was: Re: Swap] Mike Fedyk
@ 2001-11-21  1:22                             ` Horst von Brand
  2001-11-21  1:46                               ` Mike Fedyk
  0 siblings, 1 reply; 76+ messages in thread
From: Horst von Brand @ 2001-11-21  1:22 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: Richard B. Johnson, Steffen Persvold, Christopher Friesen, linux-kernel

Mike Fedyk <mfedyk@matchmail.com> said:
> Do any newer versions of NFS fix the stateless server problem?

This is an _extremely_ hard problem: The server has to know somehow what
the client thinks the state is... and either one (or both) may have been
rebooted in between without the other one knowing.
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Vin~a del Mar, Chile                               +56 32 672616

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-21  1:22                             ` Horst von Brand
@ 2001-11-21  1:46                               ` Mike Fedyk
  2001-11-21 10:55                                 ` Trond Myklebust
  0 siblings, 1 reply; 76+ messages in thread
From: Mike Fedyk @ 2001-11-21  1:46 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Richard B. Johnson, Steffen Persvold, Christopher Friesen, linux-kernel

On Tue, Nov 20, 2001 at 10:22:58PM -0300, Horst von Brand wrote:
> Mike Fedyk <mfedyk@matchmail.com> said:
> > Do any newer versions of NFS fix the stateless server problem?
> 
> This is an _extremely_ hard problem: The server has to know somehow what
> the client thinks the state is... and either one (or both) may have been
> rebooted in between without the other one knowing.

Yep, but there are protocols in use today (SMB, for one) that do that, though
not necessarily in a Unix way.

Are there any that do this now with Linux, i.e. locking over the network just
like it is done locally?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 18:14                       ` Swap Nick LeRoy
@ 2001-11-21 10:17                         ` Helge Hafting
  2001-11-21 11:17                           ` Swap Alan Cox
  0 siblings, 1 reply; 76+ messages in thread
From: Helge Hafting @ 2001-11-21 10:17 UTC (permalink / raw)
  To: Nick LeRoy; +Cc: linux-kernel

Nick LeRoy wrote:

> > Alan explained a few years ago that NFS was "stateless". Nevertheless
> > it is still a bug.
> 
> Correct me if I'm wrong, but I think that it's more a bug in the NFS protocol
> than in the Linux (or Solaris, etc) NFS implementation.  The problem is that
> NFS itself just doesn't pass that information along.  The NFS server has no
> idea that the 'text' file is being executed, so it doesn't know that it
> should "return" ETXTBSY.
> 
> Now, this might be different in NFS v3, but I'm pretty sure that this applies
> for v2, at least.

Consider the above mentioned statelessness.  You can't get what you
want as long as you want a stateless server - it is simply impossible.

Your client can be tweaked so that you can't write via NFS to a
file executing on the same host - but nothing can prevent another
client from writing to that file - because the server is stateless.

A stateless server means it doesn't actually know whether a file is
opened by anyone.  The good part of this is that the server
may crash and reboot, and the client will only see a delay.
Open files will still work as soon as the server comes back up.
No state was lost in the crash - because there was no
state at all.  But then you can't block writes, because
you don't know that someone is executing the file.

It is not a design bug - it is a design tradeoff.  A stateful
server might work if you have years of uptime or at least
no unplanned downtime.  But such implementations tend to force
clients to remount if the server ever goes down.  That may
be really annoying if you're accessing lots of servers.
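
A minimal Python sketch of the tradeoff, under loud assumptions: a real NFS
server addresses files by file handle and speaks RPC, whereas this toy uses a
path and a plain function call.  The names stateless_read and demo are
invented for the example.  Each request is served as open/seek/read/close,
as described earlier in the thread, so nothing on the "server" can notice
that the file was replaced between two requests.

```python
import os
import tempfile

def stateless_read(path, offset, length):
    """Serve one read request: open, seek, read, close. Nothing is kept between calls."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def demo():
    """Read one file in two requests while another writer replaces its contents."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "binary")    # hypothetical exported file
    with open(path, "wb") as f:
        f.write(b"AAAAAAAA")

    first_half = stateless_read(path, 0, 4)

    # Another client overwrites the file; the server holds no open state
    # that would let it refuse the write.
    with open(path, "wb") as f:
        f.write(b"BBBBBBBB")

    second_half = stateless_read(path, 4, 4)

    os.remove(path)
    os.rmdir(workdir)
    return first_half + second_half

if __name__ == "__main__":
    print(demo())
```

The client ends up with a mix of old and new data, which is roughly what
happens to a demand-paged binary that changes underneath a running process.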

Helge Hafting

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-21  1:46                               ` Mike Fedyk
@ 2001-11-21 10:55                                 ` Trond Myklebust
  2001-11-22  5:16                                   ` Bernd Eckenfels
  2001-11-23 19:33                                   ` Mike Fedyk
  0 siblings, 2 replies; 76+ messages in thread
From: Trond Myklebust @ 2001-11-21 10:55 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: linux-kernel

>>>>> " " == Mike Fedyk <mfedyk@matchmail.com> writes:

     > On Tue, Nov 20, 2001 at 10:22:58PM -0300, Horst von Brand
     > wrote:
    >> Mike Fedyk <mfedyk@matchmail.com> said:
    >> > Do any newer versions of NFS fix the stateless server
    >> > problem?
    >>
    >> This is an _extremely_ hard problem: The server has to know
    >> somehow what the client thinks the state is... and either one
    >> (or both) may have been rebooted in between without the other
    >> one knowing.

     > Yep, but there are currently protocols (SMB) that do that, but
     > not necessarily in a unix way.

<Cough, choke>

  Exactly how, pray tell, does SMB cope with recovering the full state
info after client/server crashes?

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-21 10:17                         ` Swap Helge Hafting
@ 2001-11-21 11:17                           ` Alan Cox
  0 siblings, 0 replies; 76+ messages in thread
From: Alan Cox @ 2001-11-21 11:17 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Nick LeRoy, linux-kernel

> It is not a design bug - it is a design tradeoff.  A stateful
> server might work if you have years of uptime or at least
> no unplanned downtime.  But such implementations tend to force
> clients to remount if the server ever goes down.  That may
> be really annoying if you're accessing lots of servers.

NFS is at best "imitation stateless". You can do good stateful servers that
recover across both client and server machine failure. You can do far better
with them than with NFS - it's just a bit harder.

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 17:40                     ` Swap Richard B. Johnson
  2001-11-20 18:14                       ` Swap Nick LeRoy
  2001-11-20 23:20                       ` Swap Luigi Genoni
@ 2001-11-21 16:44                       ` Remco Post
  2 siblings, 0 replies; 76+ messages in thread
From: Remco Post @ 2001-11-21 16:44 UTC (permalink / raw)
  To: root; +Cc: Christopher Friesen, linux-kernel

> On Tue, 20 Nov 2001, Christopher Friesen wrote:
> 
> > "Richard B. Johnson" wrote:
> > > 
> > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > 
> > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > When a page is deleted for one executable (because we can re-read it from
> > > > > on-disk binary), it is discarded, not paged out.
> > > >
> > > > What happens if the on-disk binary has changed since loading the program?
> > > > -
> > > 
> > > It can't. That's the reason for `install` and other methods of changing
> > > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > The currently open, and possibly mapped file can be re-named, but it
> > > can't be overwritten.
> > 
> > Actually, with NFS (and probably others) it can.  Suppose I change the file on
> > the server, and it's swapped out on a client that has it mounted.  When it swaps
> > back in, it can get the new information.
> > 
> > Chris
> 
> I note that NFS files don't currently return ETXTBSY, but this is a bug.
> It is 'known' to the OS that the NFS mounted file-system is busy because
> you can't unmount the file-system while an executable is running. If
> you can trash it (as you can on Linux), it is surely a bug.
> 
> Alan explained a few years ago that NFS was "stateless". Nevertheless
> it is still a bug.
> 
> Cheers,
> Dick Johnson
> 

The client OS knows the fs is busy, the server does not, so from the server 
side, I can change a file, unmount parts of the exported fs (NFS does not see 
fs boundaries), or even mount a completely different fs on the exported fs, 
breaking the NFS client and the NFS server. Been there, done that. Yes, this 
is not user-friendly, but then again, NFS is not the best networked filesystem 
in the world, nor was it designed to be handled by non-administrators (and I 
think it shouldn't have to be).


-- 
Kind regards,

Remco Post

SARA - Stichting Academisch Rekencentrum Amsterdam
High Performance Computing  Tel. +31 20 592 8008    Fax. +31 20 668 3167

"I really didn't foresee the Internet. But then, neither did the computer
industry. Not that that tells us very much of course - the computer industry
didn't even foresee that the century was going to end." -- Douglas Adams




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:05                     ` Swap Steffen Persvold
  2001-11-20 21:18                       ` Swap Mike Fedyk
  2001-11-20 21:19                       ` Swap Nick LeRoy
@ 2001-11-21 16:48                       ` Remco Post
  2 siblings, 0 replies; 76+ messages in thread
From: Remco Post @ 2001-11-21 16:48 UTC (permalink / raw)
  To: Steffen Persvold; +Cc: Christopher Friesen, root, linux-kernel

> Christopher Friesen wrote:
> > 
> > "Richard B. Johnson" wrote:
> > >
> > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > >
> > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > When a page is deleted for one executable (because we can re-read it from
> > > > > on-disk binary), it is discarded, not paged out.
> > > >
> > > > What happens if the on-disk binary has changed since loading the program?
> > > > -
> > >
> > > It can't. That's the reason for `install` and other methods of changing
> > > > executable files (mv exe-file exe-file.old ; cp newfile exe-file).
> > > The currently open, and possibly mapped file can be re-named, but it
> > > can't be overwritten.
> > 
> > Actually, with NFS (and probably others) it can.  Suppose I change the file on
> > the server, and it's swapped out on a client that has it mounted.  When it swaps
> > back in, it can get the new information.
> > 
> 
> This sounds really dangerous... What about shared libraries ??
> 

Same problem. This is why most Unix distros tell you to reboot after each 
applied patch and each OS upgrade: just to be sure that everything using 
mmapped files and demand-paged binaries gets restarted.
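
For the local case, the "mv exe-file exe-file.old ; cp newfile exe-file"
approach quoted above can be sketched in Python as follows.  This is an
illustration under assumptions, not anyone's actual installer: the function
names are invented, and the inode comparison only demonstrates local
filesystem behaviour.  The old inode survives under the .old name, so
anything still mapping it is undisturbed, while the original name points at
a fresh inode.

```python
import os
import tempfile

def install_by_rename(target, new_contents):
    """Move the current file aside, then create the new one under the old name."""
    backup = target + ".old"
    os.rename(target, backup)           # the old inode survives under a new name
    with open(target, "wb") as f:       # a fresh inode takes over the original name
        f.write(new_contents)
    return backup

def demo():
    """Return (backup kept the old inode, target got a different inode)."""
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "exe-file")
    with open(target, "wb") as f:
        f.write(b"old")
    old_inode = os.stat(target).st_ino

    backup = install_by_rename(target, b"new")
    result = (os.stat(backup).st_ino == old_inode,
              os.stat(target).st_ino != old_inode)

    os.remove(backup)
    os.remove(target)
    os.rmdir(workdir)
    return result

if __name__ == "__main__":
    print(demo())
```

Overwriting the file in place would instead reuse the same inode, which is
exactly the case that hurts running processes.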


-- 
Kind regards,

Remco Post


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: Swap
  2001-11-20 21:44                           ` Swap Mike Fedyk
  2001-11-20 22:00                             ` Swap Nick LeRoy
@ 2001-11-21 16:53                             ` Remco Post
  1 sibling, 0 replies; 76+ messages in thread
From: Remco Post @ 2001-11-21 16:53 UTC (permalink / raw)
  To: Nick LeRoy, Steffen Persvold, Christopher Friesen, root, linux-kernel

> On Tue, Nov 20, 2001 at 03:33:28PM -0600, Nick LeRoy wrote:
> > On Tuesday 20 November 2001 15:18, Mike Fedyk wrote:
> > > On Tue, Nov 20, 2001 at 10:05:37PM +0100, Steffen Persvold wrote:
> > > > Christopher Friesen wrote:
> > > > > "Richard B. Johnson" wrote:
> > > > > > On Tue, 20 Nov 2001, Wolfgang Rohdewald wrote:
> > > > > > > On Tuesday 20 November 2001 15:51, J.A. Magallon wrote:
> > > > > > > > When a page is deleted for one executable (because we can re-read
> > > > > > > > it from on-disk binary), it is discarded, not paged out.
> > > > > > >
> > > > > > > What happens if the on-disk binary has changed since loading the
> > > > > > > program? -
> > > > > >
> > > > > > It can't. That's the reason for `install` and other methods of
> > > > > > > changing executable files (mv exe-file exe-file.old ; cp newfile
> > > > > > exe-file). The currently open, and possibly mapped file can be
> > > > > > re-named, but it can't be overwritten.
> > > > >
> > > > > Actually, with NFS (and probably others) it can.  Suppose I change the
> > > > > file on the server, and it's swapped out on a client that has it
> > > > > mounted.  When it swaps back in, it can get the new information.
> > > >
> > > > This sounds really dangerous... What about shared libraries ??
> > >
> > > IIRC (if wrong flame...)
> > >
> > > When you delete an open file, the entry is removed from the directory, but
> > > not unlinked until the file is closed.  This is a standard UNIX semantic.
> > >
> > > Now, if you have a set of processes with shared memory, and one closes, and
> > > another is created to replace, the new process will get the new libraries,
> > > or even new version of the process.  This could/will bring down the entire
> > > set of processes.
> > >
> > > Apps like samba come to mind...
> > 
> > *Any* time that you write to an executing executable, all bets are off.  The 
> > most likely outcome is a big 'ol crash & burn.  With a local FS, Unix 
> > prevents you from shooting yourself in the foot, but with NFS, fire away..  
> > I've done it.  It *does* let you, but...
> > 
> > Solution:  Don't do that.  Shut them all down, on all clients, upgrade the 
> > binaries, then restart the processes on the clients.
> > 
> > As far as the scenario that you've described, I *think* that it would 
> > actually work.  When the new process is fork()ed, it gets a copy of the file 
> > descriptors from its parent, so the file is still open to it.  If it then 
> > exec()s, the new image no longer has any real ties to its parent (at least, 
> > not that are relevant to this).
> > 
> 
> What about processes with shared memory such as samba 2.0?


Cool, isn't it? Thinking of 1000 ways to crash apps. As long as the meaning of 
the bits and bytes in the shm segment does not change with a newer version of 
the app, you're safe. Upgrading in single-user mode makes things a lot safer 
(yes, I too usually like to live dangerously...).
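
One way an application could guard against a changed shm layout is a small
versioned header at the start of the segment.  The Python sketch below is
hypothetical throughout: the magic value, the version number and the function
names are invented, a bytearray stands in for the shared segment, and nothing
here claims to describe what samba actually does.

```python
import struct

# Hypothetical header at offset 0 of the segment: 4-byte magic + layout version.
HEADER = struct.Struct("<4sI")
MAGIC = b"SHM0"
LAYOUT_VERSION = 2

def init_segment(buf):
    """Stamp a freshly created segment with the current layout version."""
    HEADER.pack_into(buf, 0, MAGIC, LAYOUT_VERSION)

def can_attach(buf, my_version=LAYOUT_VERSION):
    """A process attaches only if the segment uses the layout it was built for."""
    magic, version = HEADER.unpack_from(buf, 0)
    return magic == MAGIC and version == my_version

if __name__ == "__main__":
    segment = bytearray(64)                     # stand-in for a real shm segment
    init_segment(segment)
    print(can_attach(segment))                  # same layout version
    print(can_attach(segment, my_version=3))    # a newer binary with a new layout
```

A new binary with an incompatible layout would then refuse to attach instead
of misreading the old processes' data.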


-- 
Kind regards,

Remco Post


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-21 10:55                                 ` Trond Myklebust
@ 2001-11-22  5:16                                   ` Bernd Eckenfels
  2001-11-22 12:19                                     ` Trond Myklebust
  2001-11-23 19:33                                   ` Mike Fedyk
  1 sibling, 1 reply; 76+ messages in thread
From: Bernd Eckenfels @ 2001-11-22  5:16 UTC (permalink / raw)
  To: linux-kernel

In article <shs3d38xuk4.fsf@charged.uio.no> you wrote:
>  Exactly how, pray tell, does SMB cope with recovering the full state
> info after client/server crashes?

Not doing that is the better solution.

Greetings
Bernd

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-22  5:16                                   ` Bernd Eckenfels
@ 2001-11-22 12:19                                     ` Trond Myklebust
  0 siblings, 0 replies; 76+ messages in thread
From: Trond Myklebust @ 2001-11-22 12:19 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

>>>>> " " == Bernd Eckenfels <ecki@lina.inka.de> writes:

     > In article <shs3d38xuk4.fsf@charged.uio.no> you wrote:
    >> Exactly how, pray tell, does SMB cope with recovering the full
    >> state info after client/server crashes?

     > Not doing that is the better solution.

...and is why stateless filesystems are the norm. The claim that SMB
was different wasn't mine.

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: NFS, Paging & Installing [was: Re: Swap]
  2001-11-21 10:55                                 ` Trond Myklebust
  2001-11-22  5:16                                   ` Bernd Eckenfels
@ 2001-11-23 19:33                                   ` Mike Fedyk
  1 sibling, 0 replies; 76+ messages in thread
From: Mike Fedyk @ 2001-11-23 19:33 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On Wed, Nov 21, 2001 at 11:55:07AM +0100, Trond Myklebust wrote:
> >>>>> " " == Mike Fedyk <mfedyk@matchmail.com> writes:
> 
>      > On Tue, Nov 20, 2001 at 10:22:58PM -0300, Horst von Brand
>      > wrote:
>     >> Mike Fedyk <mfedyk@matchmail.com> said:
>     >> > Do any newer versions of NFS fix the stateless server
>     >> > problem?
>     >>
>     >> This is an _extremely_ hard problem: The server has to know
>     >> somehow what the client thinks the state is... and either one
>     >> (or both) may have been rebooted in between without the other
>     >> one knowing.
> 
>      > Yep, but there are currently protocols (SMB) that do that, but
>      > not necessarily in a unix way.
> 
> <Cough, choke>
> 
>   Exactly how, pray tell, does SMB cope with recovering the full state
> info after client/server crashes?
> 

No, I wasn't claiming that SMB will recover from a server crash gracefully.
If your SMB server goes down (upgrade being likely with samba instead of
crash...) for whatever reason, any open file connections are hosed.

I was just stating that there are network filesystems that are stateful, and
work well when the server stays up.

As stated by Alan, you can make a stateful Net FS that deals gracefully with
crash recovery, it's just harder.

Also, SMB deals with crashed clients pretty well most of the time by
querying the client with the write lock to see if it's still there...

Mike

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Linux-abi-devel] Re: Swap
  2001-11-20 17:58                     ` Swap Wolfgang Rohdewald
@ 2001-11-26 21:51                       ` Christoph Hellwig
  0 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-11-26 21:51 UTC (permalink / raw)
  To: Wolfgang Rohdewald
  Cc: Christopher Friesen, root, linux-kernel, linux-abi-devel

On Tue, Nov 20, 2001 at 06:58:03PM +0100, Wolfgang Rohdewald wrote:
> I am quite sure this is also possible if the binary is emulated by
> the linux-abi modules like my old SCO binaries.

Linux-ABI mmaps binaries if they are page-aligned, otherwise they
are read completely at startup.  Note that Linux-ABI uses the normal
binfmt_elf for foreign ELF binaries, so the above applies only
to COFF and X.out (Microsoft x.out) binaries.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-24 18:44 Jesse Pollard
@ 2001-10-25  6:15 ` Albert D. Cahalan
  0 siblings, 0 replies; 76+ messages in thread
From: Albert D. Cahalan @ 2001-10-25  6:15 UTC (permalink / raw)
  To: Jesse Pollard
  Cc: jas88, Rik van Riel, Jan Kara, Neil Brown, linux-fsdevel, linux-kernel

Jesse Pollard writes:

> There still remains the problem of hard links... They could be counted
> in two or more trees as long as two or more trees exist on one filesystem.

Obvious fix: prohibit hard links across tree quota boundaries,
including any that might be created by a rename.

It is an admin error to enable tree quotas on trees that
have existing hard links.
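
A sketch of the proposed check, in Python for brevity rather than kernel C.
Everything here is an assumption for illustration: the function name, the
idea of comparing a per-inode tree id ("tid", as in the original RFC) with
the tid of the destination directory, and the choice of EXDEV as the error.
The thread does not name an error code; EXDEV is borrowed by analogy with
links across devices.

```python
import errno

def check_tree_boundary(source_tid, target_dir_tid):
    """Return 0 if link()/rename() stays inside one quota tree, else an errno value."""
    if source_tid == target_dir_tid:
        return 0
    return errno.EXDEV      # assumed error code, not taken from the patch

if __name__ == "__main__":
    print(check_tree_boundary(1000, 1000))   # same tree: allowed
    print(check_tree_boundary(1000, 1001))   # crosses a boundary: refused
```

The same test would apply to a rename, since moving a name into another
tree has the same accounting effect as creating a link there.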

While doing that, a sysctl or mount option to enable/disable hard
linking to other people's files would be nice. Default to stopping
the "feature" IMHO.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
@ 2001-10-24 18:44 Jesse Pollard
  2001-10-25  6:15 ` Albert D. Cahalan
  0 siblings, 1 reply; 76+ messages in thread
From: Jesse Pollard @ 2001-10-24 18:44 UTC (permalink / raw)
  To: jas88, Rik van Riel; +Cc: Jan Kara, Neil Brown, linux-fsdevel, linux-kernel

James Sutherland <jas88@cam.ac.uk>:
> On Wed, 24 Oct 2001, Rik van Riel wrote:
> > On Wed, 24 Oct 2001, James Sutherland wrote:
> >
> > > Yep, you're right: you'd need to ascend the target directory tree,
> > > increasing the cumulative size all the way up, then do the move and
> > > decrement the old location's totals in the same way. All wrapped up in a
> > > transaction (on journalled FSs) or have fsck rebuild the totals on a dirty
> > > mount. Fairly clean and painless on a JFS,
> >
> > It's only clean and painless when you have infinite journal
> > space. When your filesystem's journal isn't big enough to
> > keep track of all the quota updates from an arbitrarily deep
> > directory tree, you're in big trouble.
> 
> Good point. You should be able to do it in constant space, though:
> identify the directory being modified, and the "height" to which you have
> ascended so far. That'll allow you to back out or redo the transaction
> later, which is enough I think?

There still remains the problem of hard links... They could be counted
in two or more trees as long as two or more trees exist on one filesystem.
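
The double counting is easy to reproduce on a local filesystem.  The Python
sketch below is only an illustration: tree_usage is a naive du-like sum
invented for the example, not the accounting used by any quota code, and the
directory names are made up.  One 1000-byte file is hard-linked into a second
tree, and both trees then report it.

```python
import os
import shutil
import tempfile

def tree_usage(root):
    """Naive per-tree accounting: sum the size of every file found under root."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total

def demo():
    """Return the usage of two trees that share one hard-linked file."""
    base = tempfile.mkdtemp()
    tree_a = os.path.join(base, "tree_a")     # hypothetical quota trees
    tree_b = os.path.join(base, "tree_b")
    os.mkdir(tree_a)
    os.mkdir(tree_b)

    original = os.path.join(tree_a, "file")
    with open(original, "wb") as f:
        f.write(b"x" * 1000)
    os.link(original, os.path.join(tree_b, "alias"))   # hard link across trees

    usage = (tree_usage(tree_a), tree_usage(tree_b))
    shutil.rmtree(base)
    return usage

if __name__ == "__main__":
    print(demo())
```

Only 1000 bytes are allocated on disk, yet the two totals add up to 2000.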

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
  2001-10-19  3:00 Toivo Pedaste
@ 2001-10-19  4:39 ` Neil Brown
  0 siblings, 0 replies; 76+ messages in thread
From: Neil Brown @ 2001-10-19  4:39 UTC (permalink / raw)
  To: Toivo Pedaste, Rik van Riel; +Cc: linux-kernel

On Friday October 19, toivo@eleusis.ucs.uwa.edu.au wrote:
> 
> 
> >However I actually want to charge usage to users.
> >There is a natural mapping from users to directory trees via the
> >concept of the home-directory.  It is home directories that I want to
> >impose quotas on.  So it seems natural to charge space usage to
> >users.
> 
> 
> The use I can see for tree quotas would be quite divorced from
> accounts or users. Currently if you want to limit the amount of
> space that, say, /tmp, /home or /var/mail uses you need to put
> it on a separate partition, but if you could put a quota 
> on a tree you'd have a much more flexible system administration
> tool to control the disk space used by each particular function.

This relates to Rik's idea of having a treequota on "/home/students"
which would apply to all students, not any one user.

One issue here is: how do you tell the quota system what constitutes a
   tree, for quota purposes?

Network Appliance have had treequotas on their filers for quite some
time, and I believe that you have to create quota trees explicitly
with "qtree create".

I would rather not have to add such a new command if I can avoid it.

For the above scenarios, I would simply create an account called "tmp"
or "home" or "mail" (you might have that one already) or "student",
assign a quota to that account, and chown the directory appropriately.
After all, there is no real reason why /tmp should be owned by "root".
Any "system" account should be fine.
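
This works because of the inheritance rule from the original RFC message:
a new inode takes its tid from the parent if that is not zero, and from the
creator's uid otherwise.  A minimal Python sketch of that rule follows.  The
function name and the uid values are made up, and it assumes the chown has
already given the directory the system account's tid.

```python
def tid_for_new_inode(parent_tid, creator_uid):
    """Inherit the parent's tid unless it is zero; otherwise use the creator's uid."""
    return parent_tid if parent_tid != 0 else creator_uid

if __name__ == "__main__":
    TMP_ACCOUNT = 500    # hypothetical uid of a "tmp" system account
    SOME_USER = 1000     # hypothetical ordinary user

    # /tmp has been chowned to the "tmp" account, so its tid is assumed to be 500;
    # a file created there by uid 1000 is charged to tid 500.
    print(tid_for_new_inode(TMP_ACCOUNT, SOME_USER))

    # Under a directory whose tid is still 0, the creator's uid is used instead.
    print(tid_for_new_inode(0, SOME_USER))
```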

Can anyone else see a good way to flag an inode as "root-of-a-qtree"
that does not require a new command and does not relate to uids?

NeilBrown


> 
> I quite like the idea of the quota being related to an inode.
> -- 
>  Toivo Pedaste                        Email:  toivo@ucs.uwa.edu.au
>  University Communications Services,  Phone:  +61 8 9 380 2605
>  University of Western Australia      Fax:    +61 8 9 380 1109
> "The time has come", the Walrus said, "to talk of many things"...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: RFC - tree quotas for Linux (2.4.12, ext2)
@ 2001-10-19  3:00 Toivo Pedaste
  2001-10-19  4:39 ` Neil Brown
  0 siblings, 1 reply; 76+ messages in thread
From: Toivo Pedaste @ 2001-10-19  3:00 UTC (permalink / raw)
  To: linux-kernel




>On Thursday October 18, twalberg@mindspring.com wrote:
>> A semi-random thought on the tree-quota concept:
>> 
>> Does it really make sense to charge a tree quota to a single specific
>> user? I haven't really looked into what would be required to implement
>> it, but my mental picture of a tree quota is somewhat divorced from the
>> user concept, other than maybe the quota table containing a pointer to
>> a contact for quota violations. The bookkeeping might be easier if each
>> tree quota root just held a cumulative total of allocated space, and
>> maybe just a user name for contacts (or on the fancier side, a hook
>> to execute something...).



>However I actually want to charge usage to users.
>There is a natural mapping from users to directory trees via the
>concept of the home-directory.  It is home directories that I want to
>impose quotas on.  So it seems natural to charge space usage to
>users.


The use I can see for tree quotas would be quite divorced from
accounts or users. Currently if you want to limit the amount of
space that, say, /tmp, /home or /var/mail uses you need to put
it on a separate partition, but if you could put a quota 
on a tree you'd have a much more flexible system administration
tool to control the disk space used by each particular function.

I quite like the idea of the quota being related to an inode.
-- 
 Toivo Pedaste                        Email:  toivo@ucs.uwa.edu.au
 University Communications Services,  Phone:  +61 8 9 380 2605
 University of Western Australia      Fax:    +61 8 9 380 1109
"The time has come", the Walrus said, "to talk of many things"...

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2001-11-26 21:55 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-10-18  5:06 RFC - tree quotas for Linux (2.4.12, ext2) Neil Brown
2001-10-18  5:53 ` Ben Greear
2001-10-18  8:38   ` James Sutherland
2001-10-18 20:20     ` Mike Fedyk
2001-10-18 20:47       ` Tim Walberg
2001-10-19  1:07         ` Neil Brown
2001-10-19  3:03           ` Rik van Riel
2001-10-19 11:50             ` Horst von Brand
2001-10-19 17:00               ` Mike Fedyk
2001-10-18 21:17       ` Andreas Dilger
2001-10-18 22:56         ` Mike Fedyk
2001-10-19  0:14           ` Horst von Brand
2001-10-19  0:51             ` Mike Fedyk
2001-10-19  1:13         ` Neil Brown
2001-10-19  0:53       ` Neil Brown
2001-10-24 15:16 ` Jan Kara
2001-10-24 15:34   ` James Sutherland
2001-10-24 15:39     ` Jan Kara
2001-10-24 15:50       ` James Sutherland
2001-10-24 17:41         ` Rik van Riel
2001-10-24 18:08           ` James Sutherland
2001-10-26 11:25     ` Pavel Machek
2001-10-24 21:24   ` Neil Brown
2001-10-25 15:48     ` Jan Kara
2001-10-26  4:36       ` Neil Brown
2001-10-29 14:06         ` Jan Kara
2001-10-29 23:23           ` Neil Brown
2001-10-30 12:33             ` Jan Kara
2001-10-19  3:00 Toivo Pedaste
2001-10-19  4:39 ` Neil Brown
2001-10-24 18:44 Jesse Pollard
2001-10-25  6:15 ` Albert D. Cahalan
     [not found] <fa.inl6g6v.1mmbp4@ifi.uio.no>
     [not found] ` <fa.heevhav.sjs8an@ifi.uio.no>
2001-11-18 22:15   ` Swap Dan Maas
2001-11-18 22:43     ` Swap François Cami
2001-11-19  9:18       ` Swap James A Sutherland
2001-11-19 10:51         ` Swap Remco Post
2001-11-19 13:33           ` Swap James A Sutherland
2001-11-19 13:46             ` Swap Remco Post
2001-11-19 16:58               ` Swap Rik van Riel
     [not found]                 ` <Pine.LNX.4.33L.0111191458150.1491-100000@duckman.distro.conectiva>
2001-11-19 21:13                   ` Swap Alex Bligh - linux-kernel
2001-11-19 21:17                     ` Swap Rik van Riel
     [not found]                       ` <Pine.LNX.4.33L.0111191917000.1491-100000@duckman.distro.conectiva>
2001-11-19 21:52                         ` Swap Alex Bligh - linux-kernel
2001-11-19 16:36             ` Swap Jesse Pollard
2001-11-20 14:51             ` Swap J.A. Magallon
2001-11-20 16:01               ` Swap Wolfgang Rohdewald
2001-11-20 16:06                 ` Swap Remco Post
2001-11-20 16:12                 ` Swap Nick LeRoy
2001-11-20 16:20                 ` Swap Richard B. Johnson
2001-11-20 17:14                   ` Swap Christopher Friesen
2001-11-20 17:40                     ` Swap Richard B. Johnson
2001-11-20 18:14                       ` Swap Nick LeRoy
2001-11-21 10:17                         ` Swap Helge Hafting
2001-11-21 11:17                           ` Swap Alan Cox
2001-11-20 23:20                       ` Swap Luigi Genoni
2001-11-21 16:44                       ` Swap Remco Post
2001-11-20 17:58                     ` Swap Wolfgang Rohdewald
2001-11-26 21:51                       ` [Linux-abi-devel] Swap Christoph Hellwig
2001-11-20 21:05                     ` Swap Steffen Persvold
2001-11-20 21:18                       ` Swap Mike Fedyk
2001-11-20 21:33                         ` Swap Nick LeRoy
2001-11-20 21:44                           ` Swap Mike Fedyk
2001-11-20 22:00                             ` Swap Nick LeRoy
2001-11-21 16:53                             ` Swap Remco Post
2001-11-20 21:43                         ` Swap Richard B. Johnson
2001-11-20 21:50                           ` NFS, Paging & Installing [was: Re: Swap] Mike Fedyk
2001-11-21  1:22                             ` Horst von Brand
2001-11-21  1:46                               ` Mike Fedyk
2001-11-21 10:55                                 ` Trond Myklebust
2001-11-22  5:16                                   ` Bernd Eckenfels
2001-11-22 12:19                                     ` Trond Myklebust
2001-11-23 19:33                                   ` Mike Fedyk
2001-11-20 21:19                       ` Swap Nick LeRoy
2001-11-21 16:48                       ` Swap Remco Post
2001-11-20 20:58               ` Swap Mike Fedyk
2001-11-19 10:03       ` Swap Tim Connors
2001-11-19 10:16         ` Swap Dan Maas
