linux-kernel.vger.kernel.org archive mirror
* [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
@ 2001-05-19  6:23 Ben LaHaise
  2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
                   ` (6 more replies)
  0 siblings, 7 replies; 161+ messages in thread
From: Ben LaHaise @ 2001-05-19  6:23 UTC (permalink / raw)
  To: torvalds; +Cc: viro, linux-kernel, linux-fsdevel

Hey folks,

The work-in-progress patch for-demonstration-purposes-only below consists
of 3 major components, and is meant to start discussion about the future
direction of device naming and its interaction with the block layer.  The main
motivations here are the wasting of minor numbers for partitions, and the
duplication of code between user and kernel space in areas such as
partition detection, uuid location, lvm setup, mount by label, journal
replay, and so on...

1. Generic lookup method and argument parsing (fs/lookupargs.c)

	This code implements a lookup function which is for demonstration
	purposes used in fs/block_dev.c.  The general idea is to pass
	additional parameters to device drivers on open via a comma
	separated list of options following the device's name.  Sample
	uses:

		/dev/sda/raw		-> open sda in raw mode.
		/dev/sda/limit=102400	-> open sda with a limit of 100K
		/dev/sda/offset=1024,limit=2048
			-> open a device that gives a view of sda at an
			   offset of 1KB to 2KB

	The arguments are defined in a table (fs/block_dev.c:660), which
	defines the name and type of argument to parse.  This table is
	used at lookup time to determine if an option name is valid
	(resulting in a positive dentry) or invalid.  Potential uses for
	this are numerous: opening a control channel to a device,
	specifying a graphics mode for a framebuffer on open, replacing
	ioctls, ... lots of options.  Please separate comments on this
	portion from the other parts of the patch.
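
A rough userspace sketch of the option syntax described in part 1, for
illustration only.  It is not the kernel code in fs/lookupargs.c: the
function name parse_blkdev_args and its strictness rules (digits only,
unknown names rejected) are assumptions.  The struct and its defaults
mirror struct blkdev_param and its initializer in blkdev_open() from the
patch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Userspace mirror of struct blkdev_param from the patch. */
struct blkdev_param {
	unsigned long long offset;
	unsigned long long limit;
	int raw;
};

/*
 * Hypothetical helper: parse a name such as "offset=1024,limit=2048"
 * or "raw".  Returns 0 on success, -EINVAL on an unknown or malformed
 * option.  Defaults match the patch: offset 0, limit ~0ULL, raw 0.
 */
int parse_blkdev_args(const char *name, struct blkdev_param *p)
{
	assert(name != NULL && p != NULL);

	p->offset = 0;
	p->limit = ~0ULL;
	p->raw = 0;

	while (*name) {
		size_t len = strcspn(name, ",");	/* this option only */
		const char *eq = memchr(name, '=', len);
		size_t klen = eq ? (size_t)(eq - name) : len;

		if (!eq && klen == 3 && !memcmp(name, "raw", 3)) {
			p->raw = 1;
		} else if (eq && ((klen == 6 && !memcmp(name, "offset", 6)) ||
				  (klen == 5 && !memcmp(name, "limit", 5)))) {
			char *end;
			unsigned long long v;

			/* Reject empty values, signs and whitespace. */
			if (!isdigit((unsigned char)eq[1]))
				return -EINVAL;
			errno = 0;
			v = strtoull(eq + 1, &end, 10);
			if (errno || end != name + len)
				return -EINVAL;
			if (klen == 6)
				p->offset = v;
			else
				p->limit = v;
		} else {
			return -EINVAL;
		}

		name += len;
		if (*name == ',')
			name++;		/* step over the separator */
	}
	return 0;
}
```

Unlike the demonstration patch, this sketch range-checks the numeric
value, which is the part marked TODO in fill_arg_data().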

2. Restricted block device (drivers/block/blkrestrict.c)

	This is a quick-n-dirty implementation of a simple md-like block
	device that adds an offset to sector requests and limits the
	maximum offset on the device.  The idea here is to replace the
	special case minor numbers used for the partitioning code with
	a generic runtime allocated translation node.  The idea will work
	best once its data can be stored in a kdev_t structure.  The API
	for use is simple:

		kdev_t restrict_create_dev(kdev_t dev,
				unsigned long long offset,
				unsigned long long limit)

	The associated cleanup of the startup code is not addressed here.
	Comments on this part (I know the implementation is ugly, talk
	about the ideas please)?

3. Userspace partition code proposal

	Given the above two bits, here's a brief explanation of a
	proposal to move management of the partitioning scheme into
	userspace, along with portions of raid startup, lvm, uuid and
	mount by label code needed for mounting the root filesystem.

	Consider that the device node currently known as /dev/hda5 can
	also be viewed as /dev/hda at offset 512000 with a limit of 10GB.
	With the extensions in fs/block_dev.c, you could replace /dev/hda5
	with /dev/hda/offset=512000,limit=10240000000.  Now, by putting
	the partition parsing code into a libpart and binding mount to a
	libpart, the root filesystem mounting code can be run out of an
	initrd image.  The use of mount gives us the ability to mount
	filesystems by UUID, by label or other exotic schemes without
	having to add any additional code to the kernel.
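
As a sketch of what a userspace partition library might hand to mount,
the helper below turns a partition's start and size into a device name
in the proposed syntax.  Everything here is hypothetical: no libpart
interface exists in this mail, the name format_view_name is invented,
and it assumes limit is the absolute end of the partition in bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<disk>/offset=<start>,limit=<end>" into
 * buf.  start and size are in bytes; overflow of start + size is not
 * checked in this sketch.  Returns 0 on success, -1 if the disk name
 * is empty or the result does not fit in buf.
 */
int format_view_name(char *buf, size_t len, const char *disk,
		     unsigned long long start, unsigned long long size)
{
	int n;

	assert(buf != NULL && disk != NULL);

	if (strlen(disk) == 0)
		return -1;

	n = snprintf(buf, len, "%s/offset=%llu,limit=%llu",
		     disk, start, start + size);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* encoding error or truncated output */
	return 0;
}
```

A caller would pass the resulting string to open(2) or mount in place
of a fixed node such as /dev/hda5.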

I'm going to stop writing this now.  I need sleep...

Folks, please let me know your opinions on the ideas presented herein, and
do attempt to keep the bits of code that are useful.  Cheers,

		-ben

[23:34:07] <viro> bcrl: you are sick.
[23:41:13] <viro> bcrl: you _are_ sick.
[23:43:24] <viro> bcrl: you are _fscking_ sick.

here starts v2.4.5-pre3_bdev_naming-A0.diff
diff -urN kernels/2.4/v2.4.5-pre3/Makefile bdev_naming/Makefile
--- kernels/2.4/v2.4.5-pre3/Makefile	Thu May 17 18:09:42 2001
+++ bdev_naming/Makefile	Sat May 19 01:33:39 2001
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 4
 SUBLEVEL = 5
-EXTRAVERSION =-pre3
+EXTRAVERSION =-pre3-sick-test

 KERNELRELEASE=$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION)

diff -urN kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh bdev_naming/arch/i386/boot/install.sh
--- kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh	Tue Jan  3 06:57:26 1995
+++ bdev_naming/arch/i386/boot/install.sh	Fri May 18 20:24:36 2001
@@ -21,6 +21,7 @@

 # User may have a custom install script

+if [ -x ~/bin/installkernel ]; then exec ~/bin/installkernel "$@"; fi
 if [ -x /sbin/installkernel ]; then exec /sbin/installkernel "$@"; fi

 # Default install - same as make zlilo
diff -urN kernels/2.4/v2.4.5-pre3/drivers/block/Makefile bdev_naming/drivers/block/Makefile
--- kernels/2.4/v2.4.5-pre3/drivers/block/Makefile	Fri Dec 29 17:07:21 2000
+++ bdev_naming/drivers/block/Makefile	Sat May 19 00:29:08 2001
@@ -12,7 +12,7 @@

 export-objs	:= ll_rw_blk.o blkpg.o loop.o DAC960.o

-obj-y	:= ll_rw_blk.o blkpg.o genhd.o elevator.o
+obj-y	:= ll_rw_blk.o blkpg.o genhd.o elevator.o blkrestrict.o

 obj-$(CONFIG_MAC_FLOPPY)	+= swim3.o
 obj-$(CONFIG_BLK_DEV_FD)	+= floppy.o
diff -urN kernels/2.4/v2.4.5-pre3/drivers/block/blkrestrict.c bdev_naming/drivers/block/blkrestrict.c
--- kernels/2.4/v2.4.5-pre3/drivers/block/blkrestrict.c	Wed Dec 31 19:00:00 1969
+++ bdev_naming/drivers/block/blkrestrict.c	Sat May 19 01:17:36 2001
@@ -0,0 +1,105 @@
+/* drivers/block/blkrestrict.c - written by Benjamin LaHaise
+ *	Block device limit enforcer.  Designed to implement partition
+ *	tables under control of other code.
+ *
+ *	Copyright 2001 Red Hat, Inc.
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation; either version 2 of
+ *	the License, or (at your option) any later version.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *	GNU General Public License for more details.
+ *
+ *	You should have received a copy of the GNU General Public License
+ *	along with this program; if not, write to the Free Software
+ *	Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/init.h>
+
+static char major_name[] = "restrict";
+static int major_nr;	/* signed: register_blkdev() can return a negative errno */
+static unsigned int minor_nr;	/* next free minor number */
+
+static struct restrict_info {
+	unsigned long	offset;
+	unsigned long	limit;
+	kdev_t		dev;
+} restrict_info[256];	/* FIXME: stupid */
+
+static int restrict_blk_size[256];	/* grr */
+
+kdev_t restrict_create_dev(kdev_t dev, unsigned long long offset, unsigned long long limit)
+{
+	unsigned int minor = minor_nr++;	/* FIXME: overflow/smp/fish */
+	struct restrict_info *info = &restrict_info[minor];
+
+	info->offset = offset / 512;
+	info->limit = limit / 512;
+	info->dev = dev;
+
+	restrict_blk_size[minor] = info->limit - info->offset;
+
+	printk("restrict_create_dev: (0x%02x, 0x%02x) offset=0x%lx limit=0x%lx on (0x%04x)\n", major_nr, minor, info->offset, info->limit, info->dev);	/* FIXME: duh */
+
+	return MKDEV(major_nr, minor);
+}
+
+static int restrict_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static int restrict_release(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static int restrict_make_req(request_queue_t *q, int rw, struct buffer_head *bh)
+{
+	struct restrict_info *info = &restrict_info[MINOR(bh->b_rdev)];
+	unsigned long new_sector = bh->b_rsector + info->offset;
+
+	if (new_sector >= info->limit || new_sector < bh->b_rsector) {
+		printk("restrict_make_req: 0x%lx beyond limit on 0x%x (0x%lx,0x%lx)\n", bh->b_rsector, bh->b_rdev, info->offset, info->limit);
+		buffer_IO_error(bh);
+		return 0;
+	}
+
+	bh->b_rdev = info->dev;
+	bh->b_rsector += info->offset;
+
+	return 1;
+}
+
+static struct block_device_operations restrict_bdops = {
+	open:		restrict_open,
+	release:	restrict_release,
+};
+
+static int __init blkrestrict_init(void)
+{
+	major_nr = register_blkdev(0, major_name, &restrict_bdops);
+	if (major_nr < 0)
+		return major_nr;
+
+	printk("blkrestrict_init: got major %u\n", major_nr);
+
+	blk_queue_make_request(BLK_DEFAULT_QUEUE(major_nr), restrict_make_req);
+	blk_size[major_nr] = restrict_blk_size;
+
+	return 0;
+}
+
+static void __exit blkrestrict_exit(void)
+{
+	unregister_blkdev(major_nr, major_name);
+}
+
+module_init(blkrestrict_init);
+module_exit(blkrestrict_exit);
diff -urN kernels/2.4/v2.4.5-pre3/fs/Makefile bdev_naming/fs/Makefile
--- kernels/2.4/v2.4.5-pre3/fs/Makefile	Thu Apr  5 11:53:44 2001
+++ bdev_naming/fs/Makefile	Fri May 18 18:49:49 2001
@@ -12,7 +12,7 @@

 obj-y :=	open.o read_write.o devices.o file_table.o buffer.o \
 		super.o  block_dev.o stat.o exec.o pipe.o namei.o fcntl.o \
-		ioctl.o readdir.o select.o fifo.o locks.o \
+		ioctl.o readdir.o select.o fifo.o locks.o lookupargs.o \
 		dcache.o inode.o attr.o bad_inode.o file.o iobuf.o dnotify.o \
 		filesystems.o

diff -urN kernels/2.4/v2.4.5-pre3/fs/block_dev.c bdev_naming/fs/block_dev.c
--- kernels/2.4/v2.4.5-pre3/fs/block_dev.c	Thu May 17 18:09:42 2001
+++ bdev_naming/fs/block_dev.c	Sat May 19 01:31:51 2001
@@ -14,9 +14,12 @@
 #include <linux/major.h>
 #include <linux/devfs_fs_kernel.h>
 #include <linux/smp_lock.h>
+#include <linux/lookupargs.h>

 #include <asm/uaccess.h>

+extern kdev_t restrict_create_dev(kdev_t dev, unsigned long long offset, unsigned long long limit);
+
 extern int *blk_size[];
 extern int *blksize_size[];

@@ -648,10 +651,52 @@
 	return ret;
 }

+struct blkdev_param {
+	unsigned long long	offset,
+				limit;
+	int			raw;
+};
+
+arg_format_t blkdev_arg_fmt[] = {
+	{ "offset",	Arg_ull,	offsetof(struct blkdev_param, offset) },
+	{ "limit",	Arg_ull,	offsetof(struct blkdev_param, limit) },
+	{ "raw",	Arg_bool,	offsetof(struct blkdev_param, raw) },
+	{ NULL }
+};
+
+static struct dentry *blkdev_lookup(struct inode *inode, struct dentry *dentry)
+{
+	return generic_parse_lookup(inode, dentry, blkdev_arg_fmt);
+}
+
 int blkdev_open(struct inode * inode, struct file * filp)
 {
-	int ret = -ENXIO;
+	int ret;
 	struct block_device *bdev = inode->i_bdev;
+	struct dentry *dentry = filp->f_dentry;
+	struct blkdev_param param = { 0ULL, ~0ULL, 0 };
+
+	if (dentry && dentry->d_parent &&
+	    dentry->d_inode == dentry->d_parent->d_inode) {
+		printk("blkdev_open: args='%.*s'\n", (int)dentry->d_name.len, dentry->d_name.name);
+		ret = generic_parse_args(&dentry->d_name, blkdev_arg_fmt, &param);
+		if (ret)
+			return ret;
+		printk("blkdev_open: offset=0x%Lx limit=0x%Lx raw=%d\n",
+			param.offset, param.limit, param.raw);
+
+		if (param.offset || ~param.limit) {
+			struct inode *old_inode = inode;
+			inode = get_empty_inode();
+			inode->i_rdev = restrict_create_dev(old_inode->i_rdev, param.offset, param.limit);
+			bdev = inode->i_bdev = bdget(inode->i_rdev);
+			filp->f_dentry = d_alloc_root(inode);
+			/* FIXME: error handling, dangling dentry/inode */
+		}
+	}
+
+	ret = -ENXIO;
+
 	down(&bdev->bd_sem);
 	lock_kernel();
 	if (!bdev->bd_op)
@@ -721,6 +766,10 @@
 	write:		block_write,
 	fsync:		block_fsync,
 	ioctl:		blkdev_ioctl,
+};
+
+struct inode_operations def_blk_iops = {
+	lookup:		blkdev_lookup,
 };

 const char * bdevname(kdev_t dev)
diff -urN kernels/2.4/v2.4.5-pre3/fs/devices.c bdev_naming/fs/devices.c
--- kernels/2.4/v2.4.5-pre3/fs/devices.c	Sun Oct  1 23:35:16 2000
+++ bdev_naming/fs/devices.c	Fri May 18 18:41:00 2001
@@ -205,6 +205,7 @@
 		inode->i_rdev = to_kdev_t(rdev);
 	} else if (S_ISBLK(mode)) {
 		inode->i_fop = &def_blk_fops;
+		inode->i_op = &def_blk_iops;
 		inode->i_rdev = to_kdev_t(rdev);
 		inode->i_bdev = bdget(rdev);
 	} else if (S_ISFIFO(mode))
diff -urN kernels/2.4/v2.4.5-pre3/fs/lookupargs.c bdev_naming/fs/lookupargs.c
--- kernels/2.4/v2.4.5-pre3/fs/lookupargs.c	Wed Dec 31 19:00:00 1969
+++ bdev_naming/fs/lookupargs.c	Sat May 19 00:26:31 2001
@@ -0,0 +1,156 @@
+/* fs/lookupargs.c - written by Benjamin LaHaise
+ *	Support for comma separated argument lists via a lookup method.
+ *	Useful for device drivers and other filesystem entities.
+ *
+ *	Copyright 2001 Red Hat, Inc.
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation; either version 2 of
+ *	the License, or (at your option) any later version.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *	GNU General Public License for more details.
+ *
+ *	You should have received a copy of the GNU General Public License
+ *	along with this program; if not, write to the Free Software
+ *	Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/fs.h>
+#include <linux/lookupargs.h>
+
+/* Returns the format if arg is in the list of options */
+static const struct parsed_arg_format *find_arg_fmt(
+	const struct parsed_arg *arg, arg_format_t *fmt)
+{
+	if (fmt)
+	for (; fmt->name; fmt++) {
+		const char *opt = fmt->name;
+
+		if (!memcmp(arg->arg_start, opt, arg->arg_len) &&
+		    strlen(opt) == arg->arg_len) {
+			return fmt;
+		}
+	}
+
+	return NULL;
+}
+
+/* TODO: fix it to actually validate the argument */
+int generic_check_arg(const struct parsed_arg *arg, arg_format_t *fmt)
+{
+	return !find_arg_fmt(arg, fmt);
+}
+
+static int parse_arg(const struct qstr *qstr, int offset, struct parsed_arg *arg)
+{
+	const char *str = qstr->name + offset;
+	int left = qstr->len - offset;
+
+	arg->arg_start = NULL;
+	arg->arg_len = 0;
+	arg->option_start = NULL;
+	arg->option_len = 0;
+
+	if (offset < 0)
+		return -1;
+
+	if (left <= 0)
+		return -1;
+
+	/* First off, scan for the argument name -> ends at end of string,
+	 * an equals sign or comma.
+	 */
+	arg->arg_start = str;
+	for (; left > 0 && (*str != '=') && (*str != ',');
+	     left--,str++)
+		;
+
+	arg->arg_len = str - arg->arg_start;
+
+	/* This argument ends if there's nothing left or we've hit a comma. */
+	if (left <= 0)
+		goto out;
+
+	left--;
+	if (*str++ == ',')
+		goto out;
+
+	/* Second part: scan the option looking for the end: ends at
+	 * end of string or a comma.
+	 */
+	arg->option_start = str;
+	for (; left > 0 && (*str != ',');
+	     left--,str++) {
+		/* Eat the escaped character */
+		if (*str == '\\' && left > 1)
+			left--, str++;
+	}
+
+	arg->option_len = str - arg->option_start;
+
+out:
+	return str - (const char *)qstr->name;
+}
+
+/* TODO: FIXME: proper range checking!!! */
+static int fill_arg_data(const struct parsed_arg *arg, arg_format_t *fmt, char *data)
+{
+	char *end;
+
+	data += fmt->offset;
+
+	switch (fmt->type) {
+	case Arg_bool:
+		*(int *)data = 1;
+		return 0;
+	case Arg_ull:
+		if (!arg->option_start || !arg->option_len)
+			break;
+		*(unsigned long long *)data = simple_strtoull(arg->option_start, &end, 10);
+		return 0;
+	}
+	return -EINVAL;
+}
+
+int generic_parse_args(const struct qstr *str, arg_format_t *fmt_list, void *data)
+{
+	int ret = 0;
+	for_each_parsed_arg(str) {
+		arg_format_t *fmt = find_arg_fmt(&arg, fmt_list);
+		ret = -EINVAL;
+		if (!fmt)
+			break;
+		ret = fill_arg_data(&arg, fmt, (char *)data);
+		if (ret)
+			break;
+	}
+	return ret;
+}
+
+struct dentry *generic_parse_lookup(
+	struct inode *inode,
+	struct dentry *dentry,
+	arg_format_t *fmt_list)
+{
+	/* Application compatibility: report -ENOTDIR on "." and ".." */
+	if (dentry->d_name.name[0] == '.' &&
+	    ((dentry->d_name.len == 1) ||
+	     (dentry->d_name.name[1] == '.' && dentry->d_name.len == 2)))
+		return ERR_PTR(-ENOTDIR);
+
+	/* Make sure all the arguments are okay */
+	{ for_each_parsed_arg(&dentry->d_name) {
+		arg_format_t *fmt = find_arg_fmt(&arg, fmt_list);
+		if (!fmt || generic_check_arg(&arg, fmt)) {
+			inode = NULL;
+			break;
+		}
+	}}
+
+	d_add(dentry, inode);
+	return NULL;
+}
+
diff -urN kernels/2.4/v2.4.5-pre3/fs/namei.c bdev_naming/fs/namei.c
--- kernels/2.4/v2.4.5-pre3/fs/namei.c	Thu May  3 11:22:16 2001
+++ bdev_naming/fs/namei.c	Fri May 18 22:38:50 2001
@@ -470,7 +470,8 @@
 		 * to be able to know about the current root directory and
 		 * parent relationships.
 		 */
-		if (this.name[0] == '.') switch (this.len) {
+		if (this.name[0] == '.' && S_ISDIR(nd->dentry->d_inode->i_mode))
+			switch (this.len) {
 			default:
 				break;
 			case 2:
@@ -538,7 +539,8 @@
 last_component:
 		if (lookup_flags & LOOKUP_PARENT)
 			goto lookup_parent;
-		if (this.name[0] == '.') switch (this.len) {
+		if (this.name[0] == '.' && S_ISDIR(nd->dentry->d_inode->i_mode))
+			switch (this.len) {
 			default:
 				break;
 			case 2:
@@ -593,7 +595,7 @@
 lookup_parent:
 		nd->last = this;
 		nd->last_type = LAST_NORM;
-		if (this.name[0] != '.')
+		if (this.name[0] != '.' || !S_ISDIR(nd->dentry->d_inode->i_mode))
 			goto return_base;
 		if (this.len == 1)
 			nd->last_type = LAST_DOT;
diff -urN kernels/2.4/v2.4.5-pre3/include/linux/fs.h bdev_naming/include/linux/fs.h
--- kernels/2.4/v2.4.5-pre3/include/linux/fs.h	Thu May 17 18:09:42 2001
+++ bdev_naming/include/linux/fs.h	Fri May 18 20:10:50 2001
@@ -984,6 +984,7 @@
 extern void bdput(struct block_device *);
 extern int blkdev_open(struct inode *, struct file *);
 extern struct file_operations def_blk_fops;
+extern struct inode_operations def_blk_iops;
 extern struct file_operations def_fifo_fops;
 extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
 extern int blkdev_get(struct block_device *, mode_t, unsigned, int);
diff -urN kernels/2.4/v2.4.5-pre3/include/linux/lookupargs.h bdev_naming/include/linux/lookupargs.h
--- kernels/2.4/v2.4.5-pre3/include/linux/lookupargs.h	Wed Dec 31 19:00:00 1969
+++ bdev_naming/include/linux/lookupargs.h	Fri May 18 23:06:56 2001
@@ -0,0 +1,48 @@
+/* include/linux/lookupargs.h
+ */
+struct parsed_arg {
+	const char	*arg_start;
+	const char	*option_start;
+	int		arg_len;
+	int		option_len;
+};
+
+enum parsed_arg_type {
+	Arg_bool,	/* really an int */
+	Arg_ull,
+#if 0
+	//Arg_str,	/* really a char */
+	Arg_c,
+	Arg_uc,
+	Arg_s,
+	Arg_us,
+	Arg_i,
+	Arg_ui,
+	Arg_l,
+	Arg_ul,
+	Arg_ll,
+	Arg_u32,
+	Arg_u64,
+#endif
+};
+
+typedef const struct parsed_arg_format {
+	const char		*name;
+	enum parsed_arg_type	type;
+	size_t			offset;
+} arg_format_t;
+
+#define for_each_parsed_arg(str)\
+	struct parsed_arg arg;	\
+	int __offset = 0;		\
+	while ((__offset = parse_arg((str), __offset, &arg)) > 0)
+
+struct dentry;
+struct inode;
+struct qstr;
+
+extern int generic_parse_args(
+	const struct qstr *str, arg_format_t *fmt, void *data);
+extern struct dentry *generic_parse_lookup(
+	struct inode *inode, struct dentry *dentry, arg_format_t *fmt);
+
diff -urN kernels/2.4/v2.4.5-pre3/include/linux/raid/md_k.h bdev_naming/include/linux/raid/md_k.h
--- kernels/2.4/v2.4.5-pre3/include/linux/raid/md_k.h	Thu May 17 18:09:42 2001
+++ bdev_naming/include/linux/raid/md_k.h	Sat May 19 01:13:18 2001
@@ -36,6 +36,7 @@
 		case RAID5:		return 5;
 	}
 	panic("pers_to_level()");
+	return 0;
 }

 extern inline int level_to_pers (int level)


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
@ 2001-05-19  6:57 ` Andrew Clausen
  2001-05-19  7:04   ` Alexander Viro
  2001-05-19  7:58   ` Ben LaHaise
  2001-05-19  9:42 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Christer Weinigel
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 161+ messages in thread
From: Andrew Clausen @ 2001-05-19  6:57 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: torvalds, viro, linux-kernel, linux-fsdevel

Ben LaHaise wrote:
> The work-in-progress patch for-demonstration-purposes-only below consists
> of 3 major components, and is meant to start discussion about the future
> direction of device naming and its interaction with the block layer.  The main
> motivations here are the wasting of minor numbers for partitions, and the
> duplication of code between user and kernel space in areas such as
> partition detection, uuid location, lvm setup, mount by label, journal
> replay, and so on...

(1) these issues are independent.  The partition parsing could
be done in user space, today, by blkpg, if I read the code correctly
;-)  (there's an ioctl for [un]registering partitions)  Never
tried it though ;-)

(2) what about bootstrapping?  how do you find the root device?
Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.

(3) how does this work for LVM and RAID?

(4) <propaganda>libparted already has a fair bit of partition
scanning code, etc.  Should be trivial to hack it up... That said,
it should be split up into .so modules... 200k is a bit heavy just
for mounting partitions (most of the bulk is file system stuff).
</propaganda>

(5) what happens to /etc/fstab?  User-space ([u]mount?) translates
/dev/hda1 into /dev/hda/offset=63,limit=1235823, and back?

Andrew Clausen


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
@ 2001-05-19  7:04   ` Alexander Viro
  2001-05-19  7:23     ` Andrew Clausen
  2001-05-19  9:11     ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Morton
  2001-05-19  7:58   ` Ben LaHaise
  1 sibling, 2 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19  7:04 UTC (permalink / raw)
  To: Andrew Clausen; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Andrew Clausen wrote:

> (1) these issues are independent.  The partition parsing could
> be done in user space, today, by blkpg, if I read the code correctly
> ;-)  (there's an ioctl for [un]registering partitions)  Never
> tried it though ;-)

ioctls are even more evil than encoding limits into the name. Cease
and desist, please.

> (2) what about bootstrapping?  how do you find the root device?
> Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.

Ben's patch makes initrd mandatory.

> (3) how does this work for LVM and RAID?

It doesn't.
 
> (4) <propaganda>libparted already has a fair bit of partition
> scanning code, etc.  Should be trivial to hack it up... That said,
> it should be split up into .so modules... 200k is a bit heavy just
> for mounting partitions (most of the bulk is file system stuff).
> </propaganda>

We will be much better off providing a sane API from the kernel. And
dropping the layout-aware code from fdisk, parted, yodda, yodda.

Libraries do not remove code duplication. You still need to relink the
stuff you keep statically linked, etc. Otherwise you get version skew.
Big way. Besides, you can't use library from a script, etc.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  7:04   ` Alexander Viro
@ 2001-05-19  7:23     ` Andrew Clausen
  2001-05-19  8:30       ` Alexander Viro
  2001-05-19  9:11     ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Morton
  1 sibling, 1 reply; 161+ messages in thread
From: Andrew Clausen @ 2001-05-19  7:23 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> On Sat, 19 May 2001, Andrew Clausen wrote:
> 
> > (1) these issues are independent.  The partition parsing could
> > be done in user space, today, by blkpg, if I read the code correctly
> > ;-)  (there's an ioctl for [un]registering partitions)  Never
> > tried it though ;-)
> 
> ioctls are even more evil than encoding limits into the name.

Why?  Encoding sounds funky... you don't get normal
ls behaviour, etc.

I think I'd prefer something like /dev/hda/table, where
table is something like /proc/partitions for /dev/hda, but
it's read/write ;-)

> Cease and desist, please.

Hehe
 
> > (4) <propaganda>libparted already has a fair bit of partition
> > scanning code, etc.  Should be trivial to hack it up... That said,
> > it should be split up into .so modules... 200k is a bit heavy just
> > for mounting partitions (most of the bulk is file system stuff).
> > </propaganda>
> 
> We will be much better off providing a sane API from the kernel. And
> dropping the layout-aware code from fdisk, parted, yodda, yodda.

What about partition editing on other OSs?  There's no reason
why fdisk/parted/etc. should be Linux only.  Why should the kernel
need to know how to write partition tables?

Also, different partition table formats have different alignment
constraints (which is relevant for creating partitions).  These
mainly need to be respected for other braindead OS's and/or BIOSes.

Communicating those between user/kernel space doesn't excite me.

Any ideas?
 
> Libraries do not remove code duplication. You still need to relink the
> stuff you keep statically linked, etc. Otherwise you get version skew.
> Big way. Besides, you can't use library from a script, etc.

Libtool & friends deal with version skew (ugly, but it works...)

You can write wrappers for libraries.

That said, a kernel API would be nice, if it was doable...

Andrew Clausen


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
  2001-05-19  7:04   ` Alexander Viro
@ 2001-05-19  7:58   ` Ben LaHaise
  2001-05-19  8:10     ` Alexander Viro
  1 sibling, 1 reply; 161+ messages in thread
From: Ben LaHaise @ 2001-05-19  7:58 UTC (permalink / raw)
  To: Andrew Clausen; +Cc: torvalds, viro, linux-kernel, linux-fsdevel

On Sat, 19 May 2001, Andrew Clausen wrote:

> (1) these issues are independent.  The partition parsing could
> be done in user space, today, by blkpg, if I read the code correctly
> ;-)  (there's an ioctl for [un]registering partitions)  Never
> tried it though ;-)

I tried to imply that through the use of the word component.  Yes,
they're independent, but the code is pretty meaningless without a
demonstration of how it's used. ;-)

> (2) what about bootstrapping?  how do you find the root device?
> Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.

root= becomes a parameter to mount, and initrd becomes mandatory.  I'd be
all for including all of the bits needed to build the initrd boot code in
the tree, but it's completely in the air.

> (3) how does this work for LVM and RAID?

It's not done yet, but similar techniques would be applied.  I envision
that a raid device would support operations such as
open("/dev/md0/slot=5,hot-add=/dev/sda")

> (4) <propaganda>libparted already has a fair bit of partition
> scanning code, etc.  Should be trivial to hack it up... That said,
> it should be split up into .so modules... 200k is a bit heavy just
> for mounting partitions (most of the bulk is file system stuff).
> </propaganda>

Good.  Less work to do.

> (5) what happens to /etc/fstab?  User-space ([u]mount?) translates
> /dev/hda1 into /dev/hda/offset=63,limit=1235823, and back?

I'd just create a symlink to /dev/hda1 at mount time, although that really
isn't what the user wants to see: the label or uuid is more useful.

		-ben



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  7:58   ` Ben LaHaise
@ 2001-05-19  8:10     ` Alexander Viro
  2001-05-19  8:16       ` Ben LaHaise
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-19  8:10 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Andrew Clausen, torvalds, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Ben LaHaise wrote:

> It's not done yet, but similar techniques would be applied.  I envision
> that a raid device would support operations such as
> open("/dev/md0/slot=5,hot-add=/dev/sda")

Think for a moment and you'll see why it's not only ugly as hell, but simply
won't work.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  8:10     ` Alexander Viro
@ 2001-05-19  8:16       ` Ben LaHaise
  2001-05-19  8:32         ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Ben LaHaise @ 2001-05-19  8:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andrew Clausen, linux-kernel, linux-fsdevel

On Sat, 19 May 2001, Alexander Viro wrote:

> On Sat, 19 May 2001, Ben LaHaise wrote:
>
> > It's not done yet, but similar techniques would be applied.  I envision
> > that a raid device would support operations such as
> > open("/dev/md0/slot=5,hot-add=/dev/sda")
>
> Think for a moment and you'll see why it's not only ugly as hell, but simply
> won't work.

Yeah, I shouldn't be replying to email anymore in my bleary-eyed state.
=) Of course slash separated data doesn't work, so it would have to be
hot-add=<filedescriptor> or somesuch.  Gah, that's why the options are all
parsed from a single lookup name anyways...

		-ben (who's going to sleep)




* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  7:23     ` Andrew Clausen
@ 2001-05-19  8:30       ` Alexander Viro
  2001-05-19 10:13         ` Andrew Clausen
                           ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19  8:30 UTC (permalink / raw)
  To: Andrew Clausen; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Andrew Clausen wrote:

> Alexander Viro wrote:
> > On Sat, 19 May 2001, Andrew Clausen wrote:
> > 
> > > (1) these issues are independent.  The partition parsing could
> > > be done in user space, today, by blkpg, if I read the code correctly
> > > ;-)  (there's an ioctl for [un]registering partitions)  Never
> > > tried it though ;-)
> > 
> > ioctls are even more evil than encoding limits into the name.
> 
> Why?  Encoding sounds funky... you don't get normal
> ls behaviour, etc.

ioctls are evil, period. At least with these names you can use normal
scripting and don't need any special tools. Every ioctl means a binary
that has no business to exist.

> What about partition editing on other OSs?  There's no reason
> why fdisk/parted/etc. should be Linux only.  Why should the kernel
> need to know how to write partition tables?

It needs to read them. Writing doesn't add much. I'd rather see
trivial partitioning tools that consist only of UI code in case
of Linux.

> Also, different partition table formats have different alignment
> constraints (which is relevant for creating partitions).  These
> mainly need to be respected for other braindead OS's and/or BIOSes.
> 
> Communicating those between user/kernel space doesn't excite me.

So don't communicate them.
 
> Libtool & friends deals with version skew (ugly, but it works...)

With statically linked binaries? How?
 
> You can write wrappers for libraries.

Uh-huh. And you can write them for ioctls. We had been busily doing that
for years. Results are not pretty, to put it very mildly.

OK, let me put it that way: I know how to do it in the kernel with
no code duplication and less impact on userland. BTW, most of the
code can very well sit in the userland, but that's another story
(userland filesystems). Anyway, there's only one way to settle such
stuff - sit down and write the patch. Which is what I'm going to do.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  8:16       ` Ben LaHaise
@ 2001-05-19  8:32         ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19  8:32 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Andrew Clausen, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Ben LaHaise wrote:

> On Sat, 19 May 2001, Alexander Viro wrote:
> 
> > On Sat, 19 May 2001, Ben LaHaise wrote:
> >
> > > It's not done yet, but similar techniques would be applied.  I envision
> > > that a raid device would support operations such as
> > > open("/dev/md0/slot=5,hot-add=/dev/sda")
> >
> > Think for a moment and you'll see why it's not only ugly as hell, but simply
> > won't work.
> 
> Yeah, I shouldn't be replying to email anymore in my bleary-eyed state.
> =) Of course slash separated data doesn't work, so it would have to be
> hot-add=<filedescriptor> or somesuch.  Gah, that's why the options are all
> parsed from a single lookup name anyways...

That's why you want to use write(2) to pass that information instead of
encoding it into open(2).


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  7:04   ` Alexander Viro
  2001-05-19  7:23     ` Andrew Clausen
@ 2001-05-19  9:11     ` Andrew Morton
  2001-05-19  9:20       ` Alexander Viro
  1 sibling, 1 reply; 161+ messages in thread
From: Andrew Morton @ 2001-05-19  9:11 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrew Clausen, Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> 
> > (2) what about bootstrapping?  how do you find the root device?
> > Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.
> 
> Ben's patch makes initrd mandatory.
> 

Can this be fixed?  I've *never* had to futz with initrd.
Probably most systems are the same.  It seems a step
backward to make it necessary.

-

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  9:11     ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Morton
@ 2001-05-19  9:20       ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19  9:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrew Clausen, Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Andrew Morton wrote:

> Alexander Viro wrote:
> > 
> > > (2) what about bootstrapping?  how do you find the root device?
> > > Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.
> > 
> > Ben's patch makes initrd mandatory.
> > 
> 
> Can this be fixed?  I've *never* had to futz with initrd.
> Probably most systems are the same.  It seems a step
> backward to make it necessary.

Well, if you remove partition table parsing from the kernel - you've
got to boot with root on unpartitioned device (e.g. /dev/ram0) and
either stay that way or bring the userland code that understands
partitioning on that device...


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
  2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
@ 2001-05-19  9:42 ` Christer Weinigel
  2001-05-19  9:51 ` Christer Weinigel
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Christer Weinigel @ 2001-05-19  9:42 UTC (permalink / raw)
  To: bcrl; +Cc: linux-kernel

In article <Pine.LNX.4.33.0105190138150.6079-100000@toomuch.toronto.redhat.com> you write:
>3. Userspace partition code proposal
>
>	Given the above two bits, here's a brief explaination of a
>	proposal to move management of the partitioning scheme into
>	userspace, along with portions of raid startup, lvm, uuid and
>	mount by label code needed for mounting the root filesystem.
>
>	Consider that the device node currently known as /dev/hda5 can
>	also be viewed as /dev/hda at offset 512000 with a limit of 10GB.
>	With the extensions in fs/block_dev.c, you could replace /dev/hda5
>	with /dev/hda/offset=512000,limit=10240000000.  Now, by putting
>	the partition parsing code into a libpart and binding mount to a
>	libpart, the root filesystem mounting code can be run out of an
>	initrd image.  The use of mount gives us the ability to mount
>	filesystems by UUID, by label or other exotic schemes without
>	having to add any additional code to the kernel.

The only problem I can see with this is that it removes one useful thing,
the ability to give a user access to a whole partition.

    chown wingel /dev/hda5

won't work anymore since there is no such device node.  

  /Christer
-- 
"Just how much can I get away with and still go to heaven?"

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
  2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
  2001-05-19  9:42 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Christer Weinigel
@ 2001-05-19  9:51 ` Christer Weinigel
  2001-05-19 11:37 ` Eric W. Biederman
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Christer Weinigel @ 2001-05-19  9:51 UTC (permalink / raw)
  To: linux-kernel

In article <20010519094224.AD5A236DDC@hog.ctrl-c.liu.se> I wrote:
>The only problem I can see with this is that it removes one useful thing,
>the ability to give a user access to a whole partition.
>
>    chown wingel /dev/hda5
>
>won't work anymore since there is no such device node.  

Apologies, this should have gone to linux-fsdevel; I entered the mail
address by hand and by reflex typed the wrong thing.  

*going back to sleep*

  /Christer

-- 
"Just how much can I get away with and still go to heaven?"

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code  inuserspace
  2001-05-19  8:30       ` Alexander Viro
@ 2001-05-19 10:13         ` Andrew Clausen
  2001-05-19 14:02         ` [RFD w/info-PATCH] device arguments from lookup, partion code Alan Cox
  2001-05-19 18:51         ` Richard Gooch
  2 siblings, 0 replies; 161+ messages in thread
From: Andrew Clausen @ 2001-05-19 10:13 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> ioctls are evil, period. At least with these names you can use normal
> scripting and don't need any special tools. Every ioctl means a binary
> that has no business to exist.

Special names are butt-ugly.

ioctl's can be replaced with games on /proc or whatever, which are
better than special names.

> > What about partition editing on other OSs?  There's no reason
> > why fdisk/parted/etc. should be Linux only.  Why should the kernel
> > need to know how to write partition tables?
> 
> It needs to read them. Writing doesn't add much.

Wrong.  When you read, you throw out 90% of the useless crap.
When you write, you need to know about it, and provide
interfaces for it.

> I'd rather see
> trivial partitioning tools that consist only of UI code in case
> of Linux.

Some stuff friendly partition tools should have, IMHO:
(1) ability to predict what's going to happen.  That way, you can
play around until it looks nice, and hit the friendly commit
button.
(2) ability to do data recovery (eg: probe for signatures where
it expects the start of partitions to occur.  You can be
intelligent/quick about it, by knowing about alignment stuff,
for example)
(3) ability to convert between partition table types (and
even LVM ;-)  This can be tricky because of alignment stuff.

So:
(1) could be done in-kernel by being able to discard changes,
and re-reading, I guess.
(2) and (3) really only need alignment stuff.

Also, you need to be able to deal with legacy stuff, like
setting magic flags for booting.

> > Also, different partition table formats have different alignment
> > constraints (which is relevant for creating partitions).  These
> > mainly need to be respected for other braindead OS's and/or BIOSes.
> >
> > Communicating those between user/kernel space doesn't excite me.
> 
> So don't communicate them.

So, what do you do?

Sometimes, you want to force alignment violations (eg: recovering
an accidentally deleted partition)

The real problem happens when you want to resize file systems, and
you need to simultaneously satisfy resizer and partition table
constraints.  (there are currently no resizers like this, but
an ext2-resize-the-start and NTFS-resize-the-start would definitely
be like this... when I get time to write them.  It's pure luck
that you don't need this for FAT, but this causes all sorts of
headaches for Linux...)

Anyway, you have one constraint in user space, and one in the
kernel... how do you find the intersection?
 
> > Libtool & friends deals with version skew (ugly, but it works...)
> 
> With statically linked binaries? How?

Why do we need them?
 
> > You can write wrappers for libraries.
> 
> Uh-huh. And you can write them for ioctls. We had been busily doing that
> for years. Results are not pretty, to put it very mildly.

If you can get everything into a nice file system interface,
then you've convinced me.

> BTW, most of the
> code can very well sit in the userland, but that's another story
> (userland filesystems). Anyway, there's only one way to settle such
> stuff - sit down and write the patch. Which is what I'm going to do.

Have fun.

So, my patch will be about 50 lines in parted, to call blkpg,
and provide a "kernelread" command... But, philosophy essay to
write... :-(  (you have to wait until Monday)

Then you can rm -r fs/partitions

But, I don't see how patches will settle anything, when we're
arguing over interfaces & stuff needed for partition tools.  Or
are you writing patches for Parted as well?

Andrew Clausen

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
                   ` (2 preceding siblings ...)
  2001-05-19  9:51 ` Christer Weinigel
@ 2001-05-19 11:37 ` Eric W. Biederman
  2001-05-19 14:25   ` Daniel Phillips
  2001-05-19 13:53 ` Daniel Phillips
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 161+ messages in thread
From: Eric W. Biederman @ 2001-05-19 11:37 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: torvalds, viro, linux-kernel, linux-fsdevel

Ben LaHaise <bcrl@redhat.com> writes:

> Hey folks,
> 
> The work-in-progress patch for-demonstration-purposes-only below consists
> of 3 major components, and is meant to start discussion about the future
> direction of device naming and its interaction block layer.  The main
> motivations here are the wasting of minor numbers for partitions, and the
> duplication of code between user and kernel space in areas such as
> partition detection, uuid location, lvm setup, mount by label, journal
> replay, and so on...

> 
> 1. Generic lookup method and argument parsiing (fs/lookupargs.c)
> 
> 	This code implements a lookup function which is for demonstration
> 	purposes used in fs/block_dev.c.  The general idea is to pass
> 	additional parameters to device drivers on open via a comma
> 	seperated list of options following the device's name.  Sample
> 	uses:
> 
> 		/dev/sda/raw		-> open sda in raw mode.
> 		/dev/sda/limit=102400	-> open sda with a limit of 100K
> 		/dev/sda/offset=1024,limit=2048
> 			-> open a device that gives a view of sda at an
> 			   offset of 1KB to 2KB

GAhh!!!!!!

Ben please think /proc/sys.  One value per ``file''.

> 3. Userspace partition code proposal
> 
> 	Given the above two bits, here's a brief explaination of a
> 	proposal to move management of the partitioning scheme into
> 	userspace, along with portions of raid startup, lvm, uuid and
> 	mount by label code needed for mounting the root filesystem.
> 
> 	Consider that the device node currently known as /dev/hda5 can
> 	also be viewed as /dev/hda at offset 512000 with a limit of 10GB.
> 	With the extensions in fs/block_dev.c, you could replace /dev/hda5
> 	with /dev/hda/offset=512000,limit=10240000000.  Now, by putting
> 	the partition parsing code into a libpart and binding mount to a
> 	libpart, the root filesystem mounting code can be run out of an
> 	initrd image.  The use of mount gives us the ability to mount
> 	filesystems by UUID, by label or other exotic schemes without
> 	having to add any additional code to the kernel.

But you need to use uclibc or a similar library to get the code size down
small enough, so you don't quadruple the size of your boot image.

As for wasting minors: if you are going to rework partitions, they
should have dynamic device numbers that are assigned when the
partition is discovered by the system.  I admit a hot-plug partition
sounds incongruous, but it should be fairly simple to implement.

If your real root is on a ``hot-plug'' device then it does look
like you need an initrd to help select your root partition.  Hmm, the
code is simple enough that keeping it in the kernel shouldn't be bad.
And the interface can be simple as well.

Have:
/dev/sda/partitions/1
/dev/sda/partitions/2
/dev/sda/partitions/3
/dev/sda/partitions/4
/dev/sda/partitions/5
and also
/dev/sda/partitions/1/uuid
/dev/sda/partitions/1/label
/dev/sda/partitions/1/offset
/dev/sda/partitions/1/limit

to expose what the kernel found in its initial scan of the partitions.

For creating partitions you might want to do:
echo 1024 2048 > /dev/sda/newpartition
Though if you could do it with create that would be nicer, and writes
to offset and limit, that would be a little nicer.

Al, would it work to have the lookup method for /dev/sda automatically
mount an instance of scsifs on /dev/hda (from an internal mount), and
then have dput drop that mount?  I skimmed the code and it looks
possible.  

Soft mounting a fs isn't strictly necessary for the case above, but
it looks simplest to keep the list of partitions permanently in the
dcache.  We would also need to modify permission to take a vfsmnt
argument so your permissions to a device file could vary depending on
which device file you start with.

Eric




^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
                   ` (3 preceding siblings ...)
  2001-05-19 11:37 ` Eric W. Biederman
@ 2001-05-19 13:53 ` Daniel Phillips
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
  2001-05-19 18:31 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Linus Torvalds
  6 siblings, 0 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-19 13:53 UTC (permalink / raw)
  To: Ben LaHaise, torvalds; +Cc: viro, linux-kernel, linux-fsdevel

On Saturday 19 May 2001 08:23, Ben LaHaise wrote:
>  /dev/sda/offset=1024,limit=2048
>                 -> open a device that gives a view of sda at an
>                       offset of 1KB to 2KB

Whatever we end up with, can we express it in terms of base, size,
please?

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
                   ` (4 preceding siblings ...)
  2001-05-19 13:53 ` Daniel Phillips
@ 2001-05-19 13:57 ` Alexander Viro
  2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
                     ` (3 more replies)
  2001-05-19 18:31 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Linus Torvalds
  6 siblings, 4 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 13:57 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: torvalds, linux-kernel, linux-fsdevel

	Folks, before you get all excited about cramming side effects into
open(2), consider the following case:

1) opening "/dev/zero/start_nuclear_war" has a certain side effect.

2) Local user does the following:
	ln -sf /dev/zero/start_nuclear_war bar
	while true; do
		mkdir foo
		rmdir foo
		ln -sf bar foo
		rm foo
	done

3) Comes the night and root runs (from crontab) updatedb(8). Said beast
includes find(1). With sufficiently bad timing find _will_ be tricked
into attempting to open foo. It will honestly lstat() it, all right. But
there's no way to make sure that subsequent open() on the found directory
will get the same object.

4) Side effect happens...

Similar scenarios can be found for other programs run by/as root, but I
think that the point is obvious - side effects on open() are not a good
idea. Yes, we can play with checking for O_DIRECTORY, yodda, yodda, but
I wouldn't bet a dime on security of a system with such side effects.
A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
no-op. Breaking that assumption is a Bad Thing(tm).
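
For illustration, this is roughly what a careful traverser would have to do to narrow the window between lstat() and open(). It is a sketch only, assuming O_NOFOLLOW and O_DIRECTORY are available; open_dir_checked() is a hypothetical helper, not something find(1) is known to do. It only protects programs that opt in, so any tool doing a plain open() would still trigger the side effect.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path` only if it is a real directory and not a symlink.
 * Returns a file descriptor, or -1 on any failure.
 */
int open_dir_checked(const char *path)
{
	struct stat before, after;
	int fd;

	if (lstat(path, &before) != 0 || !S_ISDIR(before.st_mode))
		return -1;

	/*
	 * O_NOFOLLOW makes open() fail if the last component has been
	 * swapped for a symlink; O_DIRECTORY rejects non-directories.
	 */
	fd = open(path, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
	if (fd < 0)
		return -1;

	/* Check that we opened the same object that lstat() looked at. */
	if (fstat(fd, &after) != 0 ||
	    after.st_dev != before.st_dev ||
	    after.st_ino != before.st_ino) {
		close(fd);
		return -1;
	}
	return fd;
}
```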


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19  8:30       ` Alexander Viro
  2001-05-19 10:13         ` Andrew Clausen
@ 2001-05-19 14:02         ` Alan Cox
  2001-05-19 16:48           ` Erik Mouw
  2001-05-19 18:51         ` Richard Gooch
  2 siblings, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-19 14:02 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrew Clausen, Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

> ioctls are evil, period. At least with these names you can use normal
> scripting and don't need any special tools. Every ioctl means a binary
> that has no business to exist.

That is not IMHO a rational argument. It isn't my fault that your shell does
not support ioctls usefully. If you used perl as your login shell you would
have no problem there.

Alan



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19 11:37 ` Eric W. Biederman
@ 2001-05-19 14:25   ` Daniel Phillips
  2001-05-21  8:14     ` Lars Marowsky-Bree
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-19 14:25 UTC (permalink / raw)
  To: Eric W. Biederman, Ben LaHaise
  Cc: torvalds, viro, linux-kernel, linux-fsdevel

On Saturday 19 May 2001 13:37, Eric W. Biederman wrote:
> For creating partitions you might want to do:
> 	echo 1024 2048 > /dev/sda/newpartition

How about:

  # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
  # ls /dev/mypartition
  base	size	device	type
  # cat /dev/mypartition/size
  1048576
  # cat /dev/mypartition/device
  /dev/sda
  # mke2fs /dev/mypartition

The information that was specified is persistent in /dev.  We can 
rearrange our physical devices any way we want without affecting
the name we chose in /dev.  When the kernel enumerates devices
at startup, our persistent information better match or we will have
to take some corrective action.

Generally, we shouldn't care which order the kernel enumerates
devices in or which device number gets assigned internally.  If we
did need to care, we'd just do:

  # echo 666 >/dev/mypartition/number

setting a persistent device minor number.  The major number is
inherited via the partition's /device property.

To set the minor number back to 'don't care':

  # rm /dev/mypartition/number

By taking the physical device off the top of the food chain we
gain the flexibility of being able to move the device from bus to 
bus for example, and only the partition's device property
changes, nothing in our fstab.  It's no great leap to set things
up so that not even the /device property would need to
change.

Note that we can have a hierarchy of partitions this way if 
we want to, since /dev/mypartition is just another block
device.

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
@ 2001-05-19 15:10   ` Abramo Bagnara
  2001-05-19 15:18     ` Alexander Viro
  2001-05-19 16:01     ` Willem Konynenberg
  2001-05-19 18:13   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Linus Torvalds
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 161+ messages in thread
From: Abramo Bagnara @ 2001-05-19 15:10 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> 
>         Folks, before you get all excited about cramming side effects into
> open(2), consider the following case:
> 
> 1) opening "/dev/zero/start_nuclear_war" has a certain side effect.
> 
> 2) Local user does the following:
>         ln -sf /dev/zero/start_nuclear_war bar
>         while true; do
>                 mkdir foo
>                 rmdir foo
>                 ln -sf bar foo
>                 rm foo
>         done
> 
> 3) Comes the night and root runs (from crontab) updatedb(8). Said beast
> includes find(1). With sufficiently bad timing find _will_ be tricked
> into attempt to open foo. It will honestly lstat() it, all right. But
> there's no way to make sure that subsequent open() on the found directory
> will get the same object.
> 
> 4) Side effect happens...
> 
> Similar scenarios can be found for other programs run by/as root, but I
> think that the point is obvious - side effects on open() are not a good
> idea. Yes, we can play with checking for O_DIRECTORY, yodda, yodda, but
> I wouldn't bet a dime on security of a system with such side effects.
> A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
> no-op. Breaking that assumption is a Bad Thing(tm).

Can't this easily be avoided if the needed action is not

< /dev/zero/start_nuclear_war 
or
> /dev/zero/start_nuclear_war

but

echo "I'm evil" > /dev/zero/start_nuclear_war

?

-- 
Abramo Bagnara                       mailto:abramo@alsa-project.org

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project               http://www.alsa-project.org
It sounds good!

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
@ 2001-05-19 15:18     ` Alexander Viro
  2001-05-19 16:01     ` Willem Konynenberg
  1 sibling, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 15:18 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Abramo Bagnara wrote:

> Can't this easily be avoided if the needed action is not
> 
> < /dev/zero/start_nuclear_war 
> or
> > /dev/zero/start_nuclear_war
> 
> but
> 
> echo "I'm evil" > /dev/zero/start_nuclear_war

Sure. And that's the right thing to do (not the implied action, that is -
_that_ would be too messy).


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
  2001-05-19 15:18     ` Alexander Viro
@ 2001-05-19 16:01     ` Willem Konynenberg
  2001-05-20 20:52       ` Pavel Machek
  2001-05-20 20:53       ` Pavel Machek
  1 sibling, 2 replies; 161+ messages in thread
From: Willem Konynenberg @ 2001-05-19 16:01 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: linux-kernel, linux-fsdevel

Abramo Bagnara wrote:
> Alexander Viro wrote:
> >         Folks, before you get all excited about cramming side effects into
> > open(2), consider the following case:
> > 
> > 1) opening "/dev/zero/start_nuclear_war" has a certain side effect.
[...]
> Can't this easily be avoided if the needed action is not
> 
> < /dev/zero/start_nuclear_war 
> or
> > /dev/zero/start_nuclear_war
> 
> but
> 
> echo "I'm evil" > /dev/zero/start_nuclear_war
> 
> ?

Yes, and that is exactly the difference between having a side effect
on the open(2), versus having the effect as a result of a write(2).

Unfortunately, there are already some cases where an open
on a device can have unexpected results.  If you don't want
to get blocked waiting for the carrier-detect signal from the
modem when opening a tty device, you had better specify the
O_NONBLOCK option on the open.  If you don't want this flag
to be active during the actual I/O operations, then you would
have to do an fcntl to clear the O_NONBLOCK again after the open.
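
The open-then-fcntl dance described above can be sketched as follows. This is a minimal example using only open(2) and fcntl(2); open_then_block() is a hypothetical helper name, and the caller supplies the device path and access mode.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` without blocking in open() itself (for example while
 * waiting for carrier on a tty), then switch the descriptor back to
 * blocking mode for the actual I/O.
 * Returns a file descriptor, or -1 on failure.
 */
int open_then_block(const char *path, int flags)
{
	int fd, fl;

	fd = open(path, flags | O_NONBLOCK);
	if (fd < 0)
		return -1;

	/* Clear O_NONBLOCK again so read() and write() block as usual. */
	fl = fcntl(fd, F_GETFL);
	if (fl < 0 || fcntl(fd, F_SETFL, fl & ~O_NONBLOCK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```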

So I guess things have already been a bit messy in this
area for many years, even before linux even existed, and
in some cases you can't really do anything about it because
the behaviour is mandated by the applicable standards, like
POSIX, SUS, or whatever.
(The blocking of the open on a tty device is explicitly
 documented in my copy of the X/Open specification.)

Fortunately, blocking the nightly backup program by making it
accidentally open a tty is not quite as catastrophic as having
it start a nuclear war, or format the disks, or something,
just because a user was playing games with symlinks.

-- 
     Willem Konynenberg <wfk@xos.nl>
I am not able rightly to apprehend the kind of confusion of ideas
that could provoke such a question  --  Charles Babbage

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 14:02         ` [RFD w/info-PATCH] device arguments from lookup, partion code Alan Cox
@ 2001-05-19 16:48           ` Erik Mouw
  2001-05-19 17:45             ` Aaron Lehmann
  0 siblings, 1 reply; 161+ messages in thread
From: Erik Mouw @ 2001-05-19 16:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexander Viro, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 03:02:47PM +0100, Alan Cox wrote:
> > ioctls are evil, period. At least with these names you can use normal
> > scripting and don't need any special tools. Every ioctl means a binary
> > that has no business to exist.
> 
> That is not IMHO a rational argument. It isn't my fault that your shell does
> not support ioctls usefully. If you used perl as your login shell you would
> have no problem there.

Sure, you're right, but Al's point is that you shouldn't need to.

One of the fundamentals of Unix is that "everything is a file" and that
you can do everything by reading or writing that file. The devices are
the big exception, they need ioctls to control them. With Al's proposal
we can get rid of the ioctls and let the devices behave like normal
files.


Erik
[who remembers a discussion like this years ago on comp.os.linux.kernel
 with similar conclusion: ioctls are bad, we should get rid of them]

-- 
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031,  2600 GA Delft, The Netherlands
Phone: +31-15-2783635  Fax: +31-15-2781843  Email: J.A.K.Mouw@its.tudelft.nl
WWW: http://www-ict.its.tudelft.nl/~erik/

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 16:48           ` Erik Mouw
@ 2001-05-19 17:45             ` Aaron Lehmann
  2001-05-19 19:38               ` Erik Mouw
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lehmann @ 2001-05-19 17:45 UTC (permalink / raw)
  To: Erik Mouw; +Cc: Alexander Viro, Ben LaHaise, linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 06:48:19PM +0200, Erik Mouw wrote:
> One of the fundamentals of Unix is that "everything is a file" and that
> you can do everything by reading or writing that file.

But /dev/sda/offset=234234,limit=626737537 isn't a file! ls it and see
if it's there. writing to files that aren't shown in directory listings
is plain evil. I really don't want to explain why. It's extremely
messy and unintuitive.

It would be better to do this with a file that does exist, for example
writing something to /proc/disks/sda/arguments. Then again, I don't
even think much of dynamic file systems in the first place.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
  2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
@ 2001-05-19 18:13   ` Linus Torvalds
  2001-05-19 23:19     ` Alexander Viro
  2001-05-19 23:52   ` Edgar Toernig
  2001-05-20 20:23   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Pavel Machek
  3 siblings, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-19 18:13 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ben LaHaise, linux-kernel, linux-fsdevel


On Sat, 19 May 2001, Alexander Viro wrote:
>
> 	Folks, before you get all excited about cramming side effects into
> open(2), consider the following case:

Your argument is stupid, imnsho.

Side-effects are perfectly fine if they are _local_ to the file
descriptor. Your example is contrived and idiotic.

Filename extensions would not replace ioctl's. But they are wonderful ways
to avoid unnecessary binary name-spaces, like the ones we have with
"callout" TTY names, and the one that the fb people had.

For example, do a "ls -l /dev/fd0*", and ponder. Also, realize that we
have these hard-coded names in _addition_ to the magic ioctl to set even
more parameters. These are all stupid and bad, and it would have been a
_lot_ cleaner to be able to do

	open("/dev/fd0/H1440", O_RDWR)..

or

	open("/dev/fd0/HD,18,85", O_RDWR)

to open special non-standard high-density modes.

We already did this, in a very limited and stupid way, by encoding the
minor number and generating a standard naming scheme. We can do the same
thing in a _much_ more generic way by just realizing that we wanted the
open to be name-based in the first place.

These are _not_ side effects. They are very much naming conventions. If I
want to open the floppy in one of the special extended modes, it makes a
LOT more sense to just open it with the naming, than to open a "generic"
floppy device only to then use a magic and very unreadable ioctl to set
the mode of the device.

In short, I don't buy your arguments for one single second.

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
                   ` (5 preceding siblings ...)
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
@ 2001-05-19 18:31 ` Linus Torvalds
  6 siblings, 0 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-19 18:31 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: viro, linux-kernel, linux-fsdevel


On Sat, 19 May 2001, Ben LaHaise wrote:
> 
> 1. Generic lookup method and argument parsiing (fs/lookupargs.c)

Looks sane.

> 2. Restricted block device (drivers/block/blkrestrict.c)

This is not very user-friendly, but along with symlinks this makes perfect
sense. It would make partition handling a _lot_ simpler.

Note, however, that I think the "restricted block device" is a much more
generic issue than just block devices. I've already discussed with Alan
the possibility of making _all_ file descriptors have the notion of
"restrictions", notably the "start, end" kind of things.

It is very useful for other things too - imagine opening /dev/mem, and
wanting to pass a restricted portion of it to other processes with the
standard file descriptor passing facilities (think "secure DGA" for the X
server, but also think untrusted users that can read parts of shared files
etc - a suid program that opens a file, restricts it, drops privileges and
knows that the program can only access a specific part of the file).
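
The "start, end" restriction can be sketched in user space as below. This
shows the bounds arithmetic only: struct restricted_fd, window_clamp() and
restricted_pread() are hypothetical names, and a user-space wrapper gives none
of the security described above, since that requires the kernel to enforce
the window on the descriptor itself.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a restricted descriptor:
 * a view of bytes [start, start + len) of the underlying file. */
struct restricted_fd {
	int fd;
	off_t start;
	off_t len;
};

/* Clamp a request for `count` bytes at window offset `off`.
 * Returns how many bytes may be transferred (0 at or past the end). */
static size_t window_clamp(const struct restricted_fd *r, off_t off,
			   size_t count)
{
	if (off < 0 || off >= r->len)
		return 0;
	if (count > (size_t)(r->len - off))
		count = (size_t)(r->len - off);
	return count;
}

/* Read through the window; offsets are relative to the window start. */
static ssize_t restricted_pread(const struct restricted_fd *r, void *buf,
				size_t count, off_t off)
{
	count = window_clamp(r, off, count);
	if (count == 0)
		return 0;
	return pread(r->fd, buf, count, r->start + off);
}
```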

> 3. Userspace partition code proposal

Yes and no.

I absolutely think the idea of users actually _using_ these names is a
horrible one, and fraught with potential for much too easy mistakes that
end up being disastrous.

But having symlinks that are created by a special program would be ok.

[ Also, note how symlinks would make the point of initrd completely
  moot. You don't have to have initrd to initialize the thing, you can
  initialize the thing at installation time and when doing fdisk, and the
  symlinks would act as the permanent markers. ]

HOWEVER, you have to realize that there are serious security and
maintenance issues here, and I think your idea breaks down completely
because of that.

The thing is, you only have permissions on a "per-object" basis, and it's
common practice to have different permissions for different partitions.
Your scheme does not allow this. Which means that it is fundamentally
broken. Sorry.

So don't go overboard. The name-based thing is useful, but it's useful for
only certain things. And you must _never_ forget the security and
management issues.

For example, if you can open a serial port in the first place, you can set
its baud-rate. So it's ok to make baud-rate part of the name. And once you
have permission to read /dev/fd0 it doesn't make sense to limit you to one
particular format. So it's ok to have the disk format be part of the name.

But it's not possible to make the partition be a "name" issue. Because
while you obviously need different names, you _also_ need different
permissions.

		Linus



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19  8:30       ` Alexander Viro
  2001-05-19 10:13         ` Andrew Clausen
  2001-05-19 14:02         ` [RFD w/info-PATCH] device arguments from lookup, partion code Alan Cox
@ 2001-05-19 18:51         ` Richard Gooch
  2001-05-20  2:18           ` Matthew Wilcox
                             ` (3 more replies)
  2 siblings, 4 replies; 161+ messages in thread
From: Richard Gooch @ 2001-05-19 18:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexander Viro, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

Alan Cox writes:
> > ioctls are evil, period. At least with these names you can use normal
> > scripting and don't need any special tools. Every ioctl means a binary
> > that has no business to exist.
> 
> That is not IMHO a rational argument. It isn't my fault that your
> shell does not support ioctls usefully. If you used perl as your
> login shell you would have no problem there.

There is another reason to use ioctl(2): when you need to send data to
the kernel/driver and wait for a response. It supports transactions,
which read(2) and write(2) cannot. Therefore it remains useful.

Al, if you really want to kill ioctl(2), then perhaps you should
implement a transaction(2) syscall. Something like:
    int transaction (int fd, void *rbuf, size_t rlen,
		     void *wbuf, size_t wlen);

Of course, there wouldn't be any practical gain, since we already have
ioctl(2). Any gain would be aesthetic.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 17:45             ` Aaron Lehmann
@ 2001-05-19 19:38               ` Erik Mouw
  2001-05-19 20:53                 ` Steven Walter
  0 siblings, 1 reply; 161+ messages in thread
From: Erik Mouw @ 2001-05-19 19:38 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: Alexander Viro, Ben LaHaise, linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 10:45:11AM -0700, Aaron Lehmann wrote:
> On Sat, May 19, 2001 at 06:48:19PM +0200, Erik Mouw wrote:
> > One of the fundamentals of Unix is that "everything is a file" and that
> > you can do everything by reading or writing that file.
> 
> But /dev/sda/offset=234234,limit=626737537 isn't a file! ls it and see
> if it's there. writing to files that aren't shown in directory listings
> is plain evil. I really don't want to explain why. It's extremely
> messy and unintuitive.
> 
> It would be better to do this with a file that does exist, for example
> writing something to /proc/disks/sda/arguments. Then again, I don't
> even think much of dynamic file systems in the first place.

A network socket also isn't a file in a filesystem, you can't do ls on
it, it doesn't even exist until you create one, but still you use it as
a file by reading and writing it. I don't see any difference in the way
you create /dev/sda/offset=234234,limit=626737537 by just using it.


Erik

-- 
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031,  2600 GA Delft, The Netherlands
Phone: +31-15-2783635  Fax: +31-15-2781843  Email: J.A.K.Mouw@its.tudelft.nl
WWW: http://www-ict.its.tudelft.nl/~erik/


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 19:38               ` Erik Mouw
@ 2001-05-19 20:53                 ` Steven Walter
  0 siblings, 0 replies; 161+ messages in thread
From: Steven Walter @ 2001-05-19 20:53 UTC (permalink / raw)
  To: Erik Mouw; +Cc: linux-kernel

On Sat, May 19, 2001 at 09:38:03PM +0200, Erik Mouw wrote:
> > But /dev/sda/offset=234234,limit=626737537 isn't a file! ls it and see
> > if it's there. writing to files that aren't shown in directory listings
> > is plain evil. I really don't want to explain why. It's extremely
> > messy and unintuitive.
> > 
> > It would be better to do this with a file that does exist, for example
> > writing something to /proc/disks/sda/arguments. Then again, I don't
> > even think much of dynamic file systems in the first place.
> 
> A network socket also isn't a file in a filesystem, you can't do ls on
> it, it doesn't even exist until you create one, but still you use it as
> a file by reading and writing it. I don't see any difference in the way
> you create /dev/sda/offset=234234,limit=626737537 by just using it.

I think you're kind of missing the point.  Aaron is saying that, by the
path, it appears to be a file, even though it isn't listed as a file in
the directory /dev/sda.  Network sockets don't have a path, unless it's a
Unix domain socket, and then you /can/ 'ls' it.

My opinion is that putting options directly in the open is no nicer than
an ioctl.  I think that where this scheme really shines, though, is
where there are multiple logical channels to a device, as in the
/dev/fb0/control example.  I like that.  What could be done, therefore,
is have a /dev/ttyS0/control file, where you could "echo
'baud=19200,parity=odd' > /dev/ttyS0/control" or even "echo '19200' >
/dev/ttyS0/baud" and "echo 'odd' > /dev/ttyS0/parity".  That seems to me
to be the cleanest and most logical solution.
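
A rough sketch of what the receiving end of such a control file might do
with the string written to it. Everything here is hypothetical: no
/dev/ttyS0/control exists, and struct tty_settings and parse_control() are
invented names. The code only shows that a "key=value,key=value" command is
cheap to parse.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical settings a control file might accept. */
struct tty_settings {
	long baud;
	char parity[8];   /* e.g. "odd", "even", "none" */
};

/* Parse "key=value,key=value" in place (the buffer is modified).
 * Returns 0 on success, -1 on an unknown key or malformed value. */
static int parse_control(char *cmd, struct tty_settings *s)
{
	char *tok = cmd;

	cmd[strcspn(cmd, "\n")] = '\0';   /* echo appends a newline */

	while (tok && *tok) {
		char *next = strchr(tok, ',');
		char *eq, *end;

		if (next)
			*next++ = '\0';
		eq = strchr(tok, '=');
		if (!eq || eq[1] == '\0')
			return -1;
		*eq++ = '\0';

		if (strcmp(tok, "baud") == 0) {
			s->baud = strtol(eq, &end, 10);
			if (*end != '\0' || s->baud <= 0)
				return -1;
		} else if (strcmp(tok, "parity") == 0) {
			if (strlen(eq) >= sizeof(s->parity))
				return -1;
			strcpy(s->parity, eq);
		} else {
			return -1;
		}
		tok = next;
	}
	return 0;
}
```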

As for this partition stuff, it seems a bad example to me.  Maybe I'm
just spoiled, but I think partitions are something that the kernel can
and should abstract.  None of this /dev/sda/offset=12345,limit=45678
madness.
-- 
-Steven
In a time of universal deceit, telling the truth is a revolutionary act.
			-- George Orwell


* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 18:13   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Linus Torvalds
@ 2001-05-19 23:19     ` Alexander Viro
  2001-05-19 23:31       ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Jeff Garzik
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 23:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben LaHaise, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Linus Torvalds wrote:

> 
> On Sat, 19 May 2001, Alexander Viro wrote:
> >
> > 	Folks, before you get all excited about cramming side effects into
> > open(2), consider the following case:
> 
> Your argument is stupid, imnsho.
> 
> Side-effects are perfectly fine if they are _local_ to the file
> descriptor. Your example is contrived and idiotic.

Linus, would you _look_ at the uses of open() proposed upthread?

Would you like to argue that close(open("/bin/ls,-l,/etc/passwd", O_RDONLY));
as an equivalent of spawn(3) is _not_ contrived and idiotic?

Would you like to argue that close(open("/dev/md0/..add-...=/foo/bar",O_RDONLY))
as a way to add stripes is not contrived and idiotic?
 
> These are _not_ side effects. They are very much naming conventions. If I

I would say that both examples above (both really proposed) _are_ side
effects by any definition.

> want to open the floppy in one of the special extended modes, it makes a
> LOT more sense to just open it with the naming, than to open a "generic"
> floppy device only to then use a magic and very unreadable ioctl to set
> the mode of the device.

Who argues for ioctls? I'm perfectly OK with the stuff that affects future
IO on the descriptor you've opened. That's what open() is for, after all.
However, IMNSHO examples of abusing open() (see above, grep your mailbox if
you think that I'm making it up) posted to that thread _are_ side effects
- ugly as hell, contrived and bound to be a source of exploits.



* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 23:19     ` Alexander Viro
@ 2001-05-19 23:31       ` Jeff Garzik
  2001-05-19 23:32         ` Jeff Garzik
  2001-05-19 23:39         ` Alexander Viro
  0 siblings, 2 replies; 161+ messages in thread
From: Jeff Garzik @ 2001-05-19 23:31 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel

Are we talking about device arguments just for chrdevs and blkdevs? 
(ie. drivers)  or for regular files too?

Speaking about drivers specifically, a controlling miscdev, one per
device or one per group of devices depending on your needs, is a much
cleaner solution for passing ioctl-type data.  You are free to come
up with whatever method of communication with the driver is most
efficient for your needs -- without perverting open(2).

Notice also a "metadata miscdev" solves the problem of passing options
on open -- just pass those options to the miscdev before you open it...

metadata miscdevs are a clean solution to what procfs hacks and ioctls
are trying to accomplish.

	Jeff


-- 
Jeff Garzik      | "Do you have to make light of everything?!"
Building 1024    | "I'm extremely serious about nailing your
MandrakeSoft     |  step-daughter, but other than that, yes."


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 23:31       ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Jeff Garzik
@ 2001-05-19 23:32         ` Jeff Garzik
  2001-05-19 23:39         ` Alexander Viro
  1 sibling, 0 replies; 161+ messages in thread
From: Jeff Garzik @ 2001-05-19 23:32 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel

Jeff Garzik wrote:
> Notice also a "metadata miscdev" solves the problem of passing options
> on open -- just pass those options to the miscdev before you open it...

to be more clear, "it" == the data device, not the metadata miscdev

-- 
Jeff Garzik      | "Do you have to make light of everything?!"
Building 1024    | "I'm extremely serious about nailing your
MandrakeSoft     |  step-daughter, but other than that, yes."


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 23:31       ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Jeff Garzik
  2001-05-19 23:32         ` Jeff Garzik
@ 2001-05-19 23:39         ` Alexander Viro
  2001-05-20 15:47           ` F_CTRLFD (was Re: Why side-effects on open(2) are evil.) Edgar Toernig
  2001-05-21 17:16           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Oliver Xymoron
  1 sibling, 2 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 23:39 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Jeff Garzik wrote:

> Are we talking about device arguments just for chrdevs and blkdevs? 
> (ie. drivers)  or for regular files too?

Let's distinguish between per-fd effects (that's what the name in open(name, flags)
is for - you are asking for a descriptor and telling what behaviour you
want for IO on it) and system-wide side effects.

IMO encoding the former into the name is perfectly fine, and no write to
another file can be sanely used for that purpose. For the latter, though,
we need to write commands into files, and here your miscdevices (or procfs
files, or /dev/foo/ctl - whatever) are needed.




* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
  2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
  2001-05-19 18:13   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Linus Torvalds
@ 2001-05-19 23:52   ` Edgar Toernig
  2001-05-20  0:18     ` Alexander Viro
  2001-05-20 20:23   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Pavel Machek
  3 siblings, 1 reply; 161+ messages in thread
From: Edgar Toernig @ 2001-05-19 23:52 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

nitpicking: a system call without side effects would be pretty useless.

Alexander Viro wrote:
> A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
> no-op. Breaking that assumption is a Bad Thing(tm).

That assumption is totally bogus.  Even for regular files you have side
effects (atime); for anything else they're unpredictable.

Ciao, ET.


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 23:52   ` Edgar Toernig
@ 2001-05-20  0:18     ` Alexander Viro
  2001-05-20  0:32       ` Linus Torvalds
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-20  0:18 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Edgar Toernig wrote:

> That assumption is totally bogus.  Even for regular files you have side
> effects (atime); for anything else they're unpredictable.

That means only one thing: safe backups are possible only in single-user
mode. For values of safe being "not triggering these side effects on
arbitrary files outside of the area you are trying to back up". You can't
pin an object down until you open it. You can check that it's the same
object you think it is, but that will require fstat(). I.e. opening the
thing.

If all effects of open() either disappear on close() or are something you
don't care about - fine. Otherwise you have a problem. On any UNIX.
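
The ordering problem can be sketched as follows, assuming a backup tool that
only wants regular files. open_verified() is a hypothetical helper, not taken
from any real backup program. The point to notice is that the identity check
can only run after open() has returned, so any side effect of the open has
already happened by the time a mismatch is detected.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` read-only and confirm with fstat() that the descriptor
 * refers to the object lstat() saw. Returns an fd, or -1. */
static int open_verified(const char *path)
{
	struct stat before, after;
	int fd;

	/* Advisory only: the object can be replaced after this call. */
	if (lstat(path, &before) < 0 || !S_ISREG(before.st_mode))
		return -1;

	/* Whatever open() triggers on the object happens here. */
	fd = open(path, O_RDONLY | O_NOFOLLOW);
	if (fd < 0)
		return -1;

	/* Only now can the object be pinned down and compared. */
	if (fstat(fd, &after) < 0 ||
	    after.st_dev != before.st_dev ||
	    after.st_ino != before.st_ino) {
		close(fd);
		return -1;
	}
	return fd;
}
```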



* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  0:18     ` Alexander Viro
@ 2001-05-20  0:32       ` Linus Torvalds
  2001-05-20  0:52         ` Jeff Garzik
                           ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20  0:32 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel


On Sat, 19 May 2001, Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Edgar Toernig wrote:
> 
> > That assumption is totally bogus.  Even for regular files you have side
> > effects (atime); for anything else they're unpredictable.
> 
> That means only one thing: safe backups are possible only in single-user
> mode.

There are some strong arguments that we should have filesystem
"backdoors" for maintenance purposes, including backup. 

You can, of course, do parts of this on a LVM level, and doing backups
with "disk snapshots" may be a valid approach. However, even that is
debatable: there is very little that says that the disk image has to be
up-to-date at any particular point in time, so even with a disk snapshot
capability (which is not necessarily reasonable under all circumstances)
there are arguments for maintenance interfaces.

Things like "lazy fsck" (ie fsck while already running the filesystem) and
defragmentation simply are not feasible on a LVM level.

		Linus



* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  0:32       ` Linus Torvalds
@ 2001-05-20  0:52         ` Jeff Garzik
  2001-05-20  1:03         ` Jeff Garzik
  2001-05-22 18:41         ` Andreas Dilger
  2 siblings, 0 replies; 161+ messages in thread
From: Jeff Garzik @ 2001-05-20  0:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1992 bytes --]

Linus Torvalds wrote:
> There are some strong arguments that we should have filesystem
> "backdoors" for maintenance purposes, including backup.

I think I agree with something Al said over IRC, that fs-level snapshots
are preferred over block level snapshots.

fs-level snapshots should become easy if you have a generic transaction
layer.  The OS spits out file ops, which get processed into a set of fs
transactions.  (remember that fs-level stuff like "change this block
bitmap" is also a transaction, just like the more generic "update this
inode's mtime")

Also, I think there should be generic block allocation strategies that
fs's can use.  Implementing fs-specific strategies such as ext2's
readahead or XFS's delayed allocation is not a solution, IMHO, but
working towards solving the real problem.
</ramble>


> You can, of course, do parts of this on a LVM level, and doing backups
> with "disk snapshots" may be a valid approach. However, even that is
> debatable: there is very little that says that the disk image has to be
> up-to-date at any particular point in time, so even with a disk snapshot
> capability (which is not necessarily reasonable under all circumstances)
> there are arguments for maintenance interfaces.

I've been hacking on the attached, a snapshot block device driver, which
doesn't require LVM at all.  (warning: compiled and updated per outside
review, but very alpha...  do not apply)

The point of the driver is to provide a sync point at snapshot time, at
which all metadata and data is flushed to the block device.

My question... is there a fundamental flaw in this plan?  Ideally when
userspace says "start snapshot", the fsync_dev occurs [a
simplification].  At that point, userspace can safely run dump or tar or
whatever on the virtual snapshot device.
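
The copy-on-write bookkeeping behind such a snapshot device can be sketched
in plain C as a one-bit-per-block bitmap. This is an illustration, not an
excerpt from the attached driver: bitmap_test(), bitmap_set() and
needs_snapshot() are hypothetical names, and the real code works on buffer
heads read from the log device rather than on an in-memory array.

```c
#include <limits.h>

/* One bit per block: set once the block's old contents are in the log. */
static int bitmap_test(const unsigned char *map, unsigned long bit)
{
	return (map[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
}

static void bitmap_set(unsigned char *map, unsigned long bit)
{
	map[bit / CHAR_BIT] |= (unsigned char)(1u << (bit % CHAR_BIT));
}

/* Called for each write to `blocknr` on the source device.
 * Returns 1 if the old contents must be copied to the log before the
 * write proceeds, 0 if an earlier write already preserved them. */
static int needs_snapshot(unsigned char *map, unsigned long blocknr)
{
	if (bitmap_test(map, blocknr))
		return 0;
	bitmap_set(map, blocknr);
	return 1;
}
```

Note the byte index is the bit number divided by the number of bits per byte
(CHAR_BIT), not by the size of a byte in bytes.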

-- 
Jeff Garzik      | "Do you have to make light of everything?!"
Building 1024    | "I'm extremely serious about nailing your
MandrakeSoft     |  step-daughter, but other than that, yes."

[-- Attachment #2: snap.patch --]
[-- Type: text/plain, Size: 30762 bytes --]

Index: linux_2_4/drivers/block/Config.in
diff -u linux_2_4/drivers/block/Config.in:1.1.1.44 linux_2_4/drivers/block/Config.in:1.1.1.44.4.1
--- linux_2_4/drivers/block/Config.in:1.1.1.44	Tue May 15 04:43:24 2001
+++ linux_2_4/drivers/block/Config.in	Wed May 16 15:44:59 2001
@@ -46,4 +46,6 @@
 fi
 dep_bool '  Initial RAM disk (initrd) support' CONFIG_BLK_DEV_INITRD $CONFIG_BLK_DEV_RAM
 
+tristate 'Snapshot device support' CONFIG_BLK_DEV_SNAP
+
 endmenu
Index: linux_2_4/drivers/block/Makefile
diff -u linux_2_4/drivers/block/Makefile:1.1.1.46 linux_2_4/drivers/block/Makefile:1.1.1.46.4.1
--- linux_2_4/drivers/block/Makefile:1.1.1.46	Tue May 15 04:43:24 2001
+++ linux_2_4/drivers/block/Makefile	Wed May 16 15:44:59 2001
@@ -31,6 +31,7 @@
 obj-$(CONFIG_BLK_DEV_DAC960)	+= DAC960.o
 
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
+obj-$(CONFIG_BLK_DEV_SNAP)	+= snap.o
 
 subdir-$(CONFIG_PARIDE) += paride
 
Index: linux_2_4/drivers/block/snap.c
diff -u /dev/null linux_2_4/drivers/block/snap.c:1.1.6.10
--- /dev/null	Sat May 19 17:36:30 2001
+++ linux_2_4/drivers/block/snap.c	Thu May 17 11:48:54 2001
@@ -0,0 +1,1055 @@
+/*
+   Copyright 2001 Jeff Garzik <jgarzik@mandrakesoft.com>
+   Copyright (C) 2000 Jens Axboe <axboe@suse.de>
+  
+   May be copied or modified under the terms of the GNU General Public
+   License.  See linux/COPYING for more information.
+  
+   Several ideas and some code taken from Jens Axboe's pktcdvd.c 0.0.2j.
+  
+   To-Do list:
+   * Write support.  It's easy, and might be useful in isolated circumstances.
+   * Convert MAX_SNAPDEVS to a module parameter.
+   * Wrap use of "%" operator, to prepare for 64-bit-sized blockdevs on 
+     32-bit processors
+  
+ */
+
+#define VERSION_CODE	"v0.5.0-take6  17 May 2001  Jeff Garzik <jgarzik@mandrakesoft.com>"
+#define MODNAME		"snap"
+#define PFX		MODNAME ": "
+#define MAX_SNAPDEVS	16 
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/file.h>
+#include <linux/blk.h>
+#include <linux/blkpg.h>
+#include <linux/init.h>
+#include <linux/snap.h>
+#include <asm/uaccess.h>
+
+static int *snap_sizes;
+static int *snap_blksize;
+static int *snap_readahead;
+static struct snap_device *snap_devs;
+static int snap_major = -1;
+static spinlock_t snap_lock = SPIN_LOCK_UNLOCKED;
+
+
+/*
+ * a bit of a kludge, but we want to be able to pass source, log,
+ * or snap dev and get the right one.
+ */
+static struct snap_device *snap_find_dev(kdev_t dev)
+{
+	int i, j;
+	struct snap_device *sd;
+
+	spin_lock(&snap_lock);
+
+	for (i = 0; i < MAX_SNAPDEVS; i++) {
+		sd = &snap_devs[i];
+		if ((sd->src.dev == dev) || (sd->snap_dev == dev))
+			goto out;
+		for (j = 0; j < sd->n_logs; j++)
+			if (sd->logs[j].dev == dev)
+				goto out;
+	}
+	sd = NULL;
+
+out:
+	spin_unlock(&snap_lock);
+	return sd;
+}
+
+static request_queue_t *snap_get_queue(kdev_t dev)
+{
+	struct snap_device *sd = snap_find_dev(dev);
+
+	if (!sd)
+		return NULL;
+	return &sd->q;
+}
+
+/* run through the block bitmap on each log device,
+ * checking the most recently appended log first,
+ * to see if the requested block has been remapped and stored
+ * on a log device.
+ * If block was remapped, return log device index
+ */
+static int snap_find_blk(struct snap_device *sd, unsigned long blocknr)
+{
+	int i;	/* signed: the loop below counts down past zero */
+
+	for (i = (int) sd->active_logs - 1; i >= 0; i--) {
+		unsigned long bitmap_blk = blocknr / sd->bits_per_block;
+		unsigned int bitmap_bits = blocknr % sd->bits_per_block;
+		struct buffer_head *bit_bh = bread(sd->logs[i].dev, bitmap_blk, sd->blksz);
+		u8 *buf;
+		unsigned int in_log;
+		if (!bit_bh) {
+			printk(KERN_ERR "%s: cannot read log %d bitmap_blk %lu\n",
+			       sd->name, i, bitmap_blk);
+			return -2;
+		}
+		lock_buffer(bit_bh);
+		buf = bit_bh->b_data;
+		in_log = (buf[bitmap_bits / 8] & (1 << (bitmap_bits % 8)));
+		unlock_buffer(bit_bh);
+		brelse(bit_bh);
+		if (in_log)
+			return i;
+	}
+	
+	return -1;
+}
+
+/* we have never stored a block located at src_bh->b_rsector before.
+ * allocate space on a log device, and store it.
+ */
+static int snapshot_blk(struct snap_device *sd, request_queue_t *q,
+		        struct buffer_head *src_bh)
+{
+	unsigned int log;
+	struct buffer_head old_bh_d;
+	struct buffer_head *old_bh = &old_bh_d;
+	struct buffer_head *bit_bh, *map_bh, *data_bh;
+	unsigned long bitmap_blk, map_blk, data_blk;
+	unsigned int bitmap_bits, map_ofs;
+	unsigned long blocknr = src_bh->b_rsector;
+	unsigned long *map;
+	u8 *buf;
+
+	/* get index of last active log */
+	if (sd->active_logs == 0) {
+		log = 0;
+		sd->active_logs++;
+	} else {
+		log = sd->active_logs - 1;
+	}
+	
+	/* if no free blocks in current log, move on to the next log. */
+	if (sd->logs[log].free == 0) {
+	
+		/* if no more logs, end snapshot */
+		if (sd->active_logs == sd->n_logs) {
+			request_queue_t *sq = sd->src.q;
+			spin_lock_irq(&io_request_lock);
+
+			sq->make_request_fn = sd->src.make_request_fn;
+			sd->src.make_request_fn = NULL;
+			clear_bit(SNAP_ACTIVE, &sd->flags);
+
+			spin_unlock_irq(&io_request_lock);
+			printk(KERN_WARNING "%s: device full, ending snapshot\n", sd->name);
+			return q->make_request_fn(q, WRITE, src_bh);
+		}
+
+		sd->active_logs++;
+		log++;
+	}
+
+	/* read old data from source device */
+	/* call ll_rw_blk directly on a custom bh to avoid some conflicts */
+	init_buffer(old_bh, NULL, NULL);
+	atomic_set(&old_bh->b_count, 1);
+	old_bh->b_rdev = sd->src.dev;
+	old_bh->b_rsector = blocknr;
+	old_bh->b_state = 1 << BH_Mapped;
+	old_bh->b_size = sd->blksz;
+	old_bh->b_data = kmalloc(sd->blksz, GFP_KERNEL);
+	if (!old_bh->b_data) {
+		printk(KERN_ERR "%s: memory alloc fail on srcdev blk %lu\n",
+		       sd->name, blocknr);
+		goto end_io;
+	}
+	ll_rw_block(READ, 1, &old_bh);
+	wait_on_buffer(old_bh);
+	lock_buffer(old_bh);
+
+	/* read bitmap block from log device */
+	bitmap_blk = blocknr / sd->bits_per_block;
+	bit_bh = bread(sd->logs[log].dev, bitmap_blk, sd->blksz);
+	if (!bit_bh) {
+		brelse(old_bh);
+		printk(KERN_ERR "%s: cannot read log %u bitmap_blk %lu\n",
+		       sd->name, log, bitmap_blk);
+		goto end_io;
+	}
+	lock_buffer(bit_bh);
+
+	/* read map block from log device */
+	map_blk = sd->map_base + (blocknr / sd->maps_per_block);
+	map_bh = bread(sd->logs[log].dev, map_blk, sd->blksz);
+	if (!map_bh) {
+		brelse(old_bh);
+		unlock_buffer(bit_bh);
+		brelse(bit_bh);
+		printk(KERN_ERR "%s: cannot read log %u map_blk %lu\n",
+		       sd->name, log, map_blk);
+		goto end_io;
+	}
+	lock_buffer(map_bh);
+
+	/* getblk data block from log device */
+	data_blk = sd->data_base + (sd->logs[log].data_blocks - sd->logs[log].free);
+	data_bh = getblk(sd->logs[log].dev, data_blk, sd->blksz);
+	if (!data_bh) {
+		brelse(old_bh);
+		unlock_buffer(bit_bh);
+		brelse(bit_bh);
+		unlock_buffer(map_bh);
+		brelse(map_bh);
+		printk(KERN_ERR "%s: cannot getblk log %u data_blk %lu\n",
+		       sd->name, log, data_blk);
+		goto end_io;
+	}
+	lock_buffer(data_bh);
+
+	/* write snapshot of source block to log device */
+	memcpy(data_bh->b_data, old_bh->b_data, sd->blksz);
+	mark_buffer_dirty(data_bh);
+	mark_buffer_uptodate(data_bh, 1);
+	unlock_buffer(data_bh);
+	brelse(data_bh);
+
+	unlock_buffer(old_bh);
+	brelse(old_bh);
+	kfree(old_bh->b_data);
+	
+	/* update block map on logdev to point to snapshot'd block */
+	map_ofs = blocknr % sd->maps_per_block;
+	map = (unsigned long *) map_bh->b_data;
+	map[map_ofs] = data_blk;
+	mark_buffer_dirty(map_bh);
+	mark_buffer_uptodate(map_bh, 1);
+	unlock_buffer(map_bh);
+	brelse(map_bh);
+	
+	/* update bitmap on logdev to indicate remapped block
+	 * is stored on this log device
+	 */
+	bitmap_bits = blocknr % sd->bits_per_block;
+	buf = bit_bh->b_data;
+	buf[bitmap_bits / 8] |= (1 << (bitmap_bits % 8));
+	mark_buffer_dirty(bit_bh);
+	mark_buffer_uptodate(bit_bh, 1);
+	unlock_buffer(bit_bh);
+	brelse(bit_bh);
+	
+	/* update free count */
+	sd->logs[log].free--;
+	
+	/* finally, pass the write down to the source device */
+	return sd->src.make_request_fn(q, WRITE, src_bh);
+	
+end_io:
+	buffer_IO_error(src_bh);
+	return 0;
+}
+
+/*
+ * our replacement for the source device's make_request_fn
+ */
+static int src_make_request(request_queue_t *q, int rw, struct buffer_head *bh)
+{
+	struct snap_device *sd;
+	int log;
+
+	sd = snap_find_dev(bh->b_rdev);
+	
+	/*
+	 * various sanity checks
+	 */
+
+	if (sd == NULL) {
+		printk(KERN_ERR PFX "src request routed to us by unknown snap device\n");
+		goto end_io;
+	}
+	
+	if (bh->b_size != sd->blksz) {
+		printk(KERN_ERR "%s: wrong bh size\n", sd->name);
+		goto end_io;
+	}
+
+	/* read is easy - simply pass it on through to
+	 * underlying block device
+	 */
+	if (rw == READ || rw == READA)
+		return sd->src.make_request_fn(q, rw, bh);
+
+	/* sanity check */	
+	if (rw != WRITE) {
+		printk(KERN_ERR "%s: unknown rw mode %d\n", sd->name, rw);
+		goto end_io;
+	}
+
+	/* if block was already remapped, let the write proceed
+	 * as-is.  No more work needs to be done on our part.
+	 */
+	log = snap_find_blk(sd, bh->b_rsector);
+	if (log < -1)
+		goto end_io;
+	if (log >= 0)
+		return sd->src.make_request_fn(q, rw, bh);
+
+	/* block was not already remapped, write to a log device */
+	return snapshot_blk(sd, q, bh);
+	
+end_io:
+	buffer_IO_error(bh);
+	return 0;
+}
+
+/*
+ * turns snapshotting on or off, AFTER the device has been set up
+ */
+static int snap_mode(struct snap_device *sd, unsigned int starting)
+{
+	request_queue_t *sq;
+	
+	if (!sd->src.q)
+		return -EINVAL;
+	if (starting && test_bit(SNAP_ACTIVE, &sd->flags))
+		return -EINVAL;
+	if (!starting && !test_bit(SNAP_ACTIVE, &sd->flags))
+		return -EINVAL;
+
+	sd->src.q = sq = blk_get_queue(sd->src.dev);
+	
+	/* flush outstanding I/Os for the source device.
+	 * Note that this is a race -IF- we were attempting
+	 * to ensure no I/Os occur between the call to fsync_dev()
+	 * and when we swap the make_request functions, below.
+	 * However, we don't care about this race since this
+	 * is just a sync point, so life is good.
+	 */
+	fsync_dev(sd->src.dev);
+
+	/* swap srcdev make_request_fn with our own
+	 */
+	spin_lock_irq(&io_request_lock);
+
+	if (starting) {
+		set_bit(SNAP_ACTIVE, &sd->flags);
+		sd->src.make_request_fn = sq->make_request_fn;
+		sq->make_request_fn = src_make_request;
+	} else {
+		sq->make_request_fn = sd->src.make_request_fn;
+		sd->src.make_request_fn = NULL;
+		clear_bit(SNAP_ACTIVE, &sd->flags);
+	}
+
+	spin_unlock_irq(&io_request_lock);
+
+	printk(KERN_INFO "%s: %sing snapshot of %s\n",
+	       sd->name, starting ? "start" : "end", bdevname(sd->src.dev));
+
+	return 0;
+}
+
+/*
+ * a read has occurred on the snapshot blkdev, and we know
+ * the block has been remapped, and we know it is on log
+ * device 'log'.  Remap the read into a read of the log device.
+ */
+static int remap_to_logdev(struct snap_device *sd, unsigned int log,
+			  request_queue_t *q, int rw, struct buffer_head *bh)
+{
+	unsigned long map_blk;
+	unsigned int map_ofs;
+	struct buffer_head *map_bh;
+	unsigned long *map;
+
+	/*
+	 * calc the position in the vector of unsigned long
+	 * block numbers where remapped block number is stored
+	 */
+	map_blk = sd->map_base + (bh->b_rsector / sd->maps_per_block);
+	map_ofs = bh->b_rsector % sd->maps_per_block;
+
+	/* read the block from the vector */
+	map_bh = bread(sd->logs[log].dev, map_blk, sd->blksz);
+	if (!map_bh) {
+		printk(KERN_ERR "%s: unable to read log %u block %lu\n",
+		       sd->name, log, map_blk);
+		goto end_io;
+	}
+	lock_buffer(map_bh);
+
+	/* read remapped block number from vector */
+	map = (unsigned long *) map_bh->b_data;
+	map_blk = map[map_ofs];
+	unlock_buffer(map_bh);
+	brelse(map_bh);
+
+	/* do the remap */
+	bh->b_rdev = sd->logs[log].dev;
+	bh->b_rsector = map_blk;
+	return 1;
+
+end_io:
+	buffer_IO_error(bh);
+	return 0;
+}
+
+/*
+ * the make_request_fn for our virtual snapshot blkdev.
+ * All reads are remapped to the source device or a log device,
+ * after some I/O occurs to determine where to remap the read.
+ * Writes are not currently supported.
+ */
+static int snap_make_request(request_queue_t *q, int rw, struct buffer_head *bh)
+{
+	struct snap_device *sd;
+	int log;
+
+	if (MINOR(bh->b_rdev) >= MAX_SNAPDEVS) {
+		printk(KERN_ERR PFX "%s out of range\n", kdevname(bh->b_rdev));
+		goto end_io;
+	}
+
+	sd = &snap_devs[MINOR(bh->b_rdev)];
+
+	/*
+	 * various sanity checks
+	 */
+
+	if (!sd->src.dev) {
+		printk(KERN_ERR "%s: request received for non-active sd\n", sd->name);
+		goto end_io;
+	}
+
+	if (rw != READ && rw != READA) {
+		printk(KERN_ERR "%s: non-READ[A] request\n", sd->name);
+		goto end_io;
+	}
+
+	if (bh->b_size != sd->blksz) {
+		printk(KERN_ERR "%s: wrong bh size\n", sd->name);
+		goto end_io;
+	}
+
+	log = snap_find_blk(sd, bh->b_rsector);
+	if (log < -1)
+		goto end_io;
+	if (log >= 0)
+		return remap_to_logdev(sd, log, q, rw, bh);
+	
+	/*
+	 * ok, the block was not remapped onto a log device,
+	 * so remap the request to request from the original
+	 * source device.  Since block numbers are the same
+	 * on the virtual snapdev and the srcdev, we need
+	 * only to change the device to which the read request
+	 * is targeted.
+	 */
+	bh->b_rdev = sd->src.dev;
+	return 1;
+
+end_io:
+	buffer_IO_error(bh);
+	return 0;
+}
+
+/*
+ * initialize the request queue for the snapshot blkdev
+ */
+static void snap_init_queue(struct snap_device *sd)
+{
+	request_queue_t *q = &sd->q;
+
+	blk_queue_make_request(q, snap_make_request);
+	blk_queue_headactive(q, 0);
+}
+
+/**********************************************************************
+ *
+ * Block device ioctl operation and associated functions.
+ *
+ */
+ 
+/* initialize a log device.  all we need to do is zero the bitmap
+ * that appears at the beginning of the disk
+ */
+static void snap_mkfs (struct snap_device *sd, unsigned int logidx,
+		       unsigned int bitmap_blocks)
+{
+	unsigned int i, blksz = sd->blksz;
+	
+	for (i = 0; i < bitmap_blocks; i++) {
+		struct buffer_head *bh = getblk(sd->logs[logidx].dev, i, blksz);
+		if (bh) {
+			lock_buffer(bh);
+			memset(bh->b_data, 0, blksz);
+			mark_buffer_dirty(bh);
+			mark_buffer_uptodate(bh, 1);
+			unlock_buffer(bh);
+			brelse(bh);
+		} else {
+			printk(KERN_ERR "%s: cannot get blk %u\n", sd->name, i);
+		}
+	}
+}
+
+/*
+ * the source and log devices have been opened and sanity-checked
+ * at this point.  initialize 'sd' based on the information provided
+ */
+static int snap_new_dev(struct snap_device *sd,
+			kdev_t src_dev, struct block_device *src_bdev,
+			unsigned int n_log_fds, struct snap_dev_init_info *sdii)
+{
+	unsigned int i;
+	unsigned long map_blocks, bitmap_blocks;
+	int ret;
+	void *log_buf;
+
+	MOD_INC_USE_COUNT;
+
+	log_buf = kmalloc(sizeof(struct snap_log) * n_log_fds, GFP_KERNEL);
+	if (!log_buf) {
+		ret = -ENOMEM;
+		goto out_bufs;
+	}
+
+	spin_lock(&snap_lock);
+
+	memset(sd, 0, sizeof(struct snap_device));
+
+	sd->src.dev = src_dev;
+	sd->src.q = blk_get_queue (sd->src.dev);
+	sd->blksz = blksize_size[MAJOR(src_dev)][MINOR(src_dev)];
+	sd->src.bdev = src_bdev;
+	sd->src.blocks = blk_size[MAJOR(src_dev)][MINOR(src_dev)];
+	
+	sd->n_logs = n_log_fds;
+	sd->logs = log_buf;
+	memset(sd->logs, 0, sizeof(struct snap_log) * n_log_fds);
+
+	sd->bits_per_block = sd->blksz * 8;
+	sd->maps_per_block = sd->blksz / (sizeof(unsigned long) * 2);
+
+	bitmap_blocks = sd->src.blocks / sd->bits_per_block;
+	if (sd->src.blocks % sd->bits_per_block)
+		bitmap_blocks++;
+
+	map_blocks = sd->src.blocks / sd->maps_per_block;
+	if (sd->src.blocks % sd->maps_per_block)
+		map_blocks++;
+	
+	sd->map_base = bitmap_blocks;
+	sd->data_base = bitmap_blocks + map_blocks;
+
+	for (i = 0; i < n_log_fds; i++) {
+		kdev_t logdev = sdii[i].inode->i_rdev;
+		sd->logs[i].dev = logdev;
+		sd->logs[i].size = blk_size[MAJOR(logdev)][MINOR(logdev)];
+
+		if (sd->logs[i].size < (sd->data_base * 2)) {
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		sd->logs[i].data_blocks =
+		sd->logs[i].free =
+			sd->logs[i].size - bitmap_blocks - map_blocks;
+		
+		set_blocksize(logdev, sd->blksz);
+	}
+
+	sd->snap_dev = MKDEV(snap_major, sd - snap_devs);
+	sprintf(sd->name, "snap%d", (int)(sd - snap_devs));
+	atomic_set(&sd->refcnt, 0);
+
+	spin_unlock(&snap_lock);
+
+	snap_init_queue(sd);
+
+	for (i = 0; i < n_log_fds; i++)
+		snap_mkfs (sd, i, bitmap_blocks);
+
+	DPRINTK(PFX "dev %s successfully registered\n", sd->name);
+	return 0;
+
+out_free:
+	spin_unlock(&snap_lock);
+	kfree(sd->logs);
+out_bufs:
+	MOD_DEC_USE_COUNT;
+	return ret;
+}
+
+/* clean up in the event of snap_setup_dev error exit */
+static void snap_setup_dev_cleanup (unsigned int n_log_fds, struct file *src,
+				    struct snap_dev_init_info *sdii)
+{
+	unsigned int i;
+
+	if (sdii) {
+		for (i = 0; i < n_log_fds; i++)
+			if (sdii[i].file) {
+				blkdev_put(sdii[i].inode->i_bdev, BDEV_FILE);
+				fput(sdii[i].file);
+			}
+		kfree(sdii);
+	}
+
+	if (src) {
+		blkdev_put(src->f_dentry->d_inode->i_bdev, BDEV_FILE);
+		fput(src);
+	}
+}
+
+/*
+ * called from ioctl.  verifies the passed file descriptors to
+ * be open and valid block devices, then calls snap_new_dev
+ * to actually initialize the device
+ */
+static int snap_setup_dev(struct snap_device *sd, struct snap_setup *setup)
+{
+	struct snap_dev_init_info *sdii = NULL;
+	struct inode *inode_src;
+	struct file *src;
+	int ret;
+	unsigned int i, n_log_fds = setup->n_log_fds;
+
+	if ((src = fget(setup->src_fd)) == NULL) {
+		printk(KERN_ERR "%s: bad file descriptor %d passed\n", sd->name, setup->src_fd);
+		return -EBADF;
+	}
+	if ((inode_src = src->f_dentry->d_inode) == NULL) {
+		printk(PFX "huh? file descriptor %d contains no inode?\n", setup->src_fd);
+		fput(src);
+		return -EINVAL;
+	}
+	if (!S_ISBLK(inode_src->i_mode)) {
+		printk(PFX "device is not a block device (duh)\n");
+		fput(src);
+		return -ENOTBLK;
+	}
+	if (snap_find_dev(inode_src->i_rdev)) {
+		printk(PFX "source device associated with another snapshot dev\n");
+		fput(src);
+		return -EBUSY;
+	}
+	ret = blkdev_get(inode_src->i_bdev, src->f_mode, src->f_flags, BDEV_FILE);
+	if (ret) {
+		fput(src);
+		return ret;
+	}
+
+	sdii = kmalloc(sizeof(*sdii) * n_log_fds, GFP_KERNEL);
+	if (!sdii) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	memset(sdii, 0, sizeof(*sdii) * n_log_fds);
+
+	for (i = 0; i < n_log_fds; i++) {
+		struct file *logf;
+		struct inode *logi;
+
+		logf = sdii[i].file = fget(setup->log_fd[i]);
+		if (!logf) {
+			printk(PFX "bad file descriptor %d passed\n",
+			       setup->log_fd[i]);
+			ret = -EBADF;
+			goto out;
+		}
+		logi = sdii[i].inode = sdii[i].file->f_dentry->d_inode;
+		if (!logi) {
+			printk(PFX "huh? file descriptor %d contains no inode?\n",
+			       setup->log_fd[i]);
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!S_ISBLK(logi->i_mode)) {
+			printk(PFX "device is not a block device\n");
+			ret = -ENOTBLK;
+			goto out;
+		}
+		if (snap_find_dev(logi->i_rdev)) {
+			printk(PFX "log device %d associated with another snapshot dev\n", i);
+			ret = -EBUSY;
+			goto out;
+		}
+		if (IS_RDONLY(logi)) {
+			printk(PFX "Can't write to read-only dev\n");
+			ret = -EROFS;
+			goto out;
+		}
+		ret = blkdev_get(logi->i_bdev, logf->f_mode,
+				 logf->f_flags, BDEV_FILE);
+		if (ret)
+			goto out;
+	}
+	
+	if ((ret = snap_new_dev(sd, inode_src->i_rdev, inode_src->i_bdev,
+				n_log_fds, sdii))) {
+		printk(PFX "all booked up\n");
+		goto out;
+	}
+	
+	sd->src.dentry = dget(src->f_dentry);
+	for (i = 0; i < n_log_fds; i++)
+		sd->logs[i].dentry = dget(sdii[i].file->f_dentry);
+	atomic_inc(&sd->refcnt);
+
+	ret = 0;
+out:
+	snap_setup_dev_cleanup(n_log_fds, src, sdii);
+	return ret;
+}
+
+/*
+ * called from ioctl.  tears down the currently
+ * setup snapshot device, and closes all open resources
+ */
+static int snap_remove_dev(struct snap_device *sd)
+{
+	unsigned int i;
+
+	/* terminate any on-going snapshot, if any */
+	snap_mode(sd, 0);
+
+	blkdev_put(sd->src.dentry->d_inode->i_bdev, BDEV_FILE);
+	dput(sd->src.dentry);
+
+	for (i = 0; i < sd->n_logs; i++) {
+		blkdev_put(sd->logs[i].dentry->d_inode->i_bdev, BDEV_FILE);
+		dput(sd->logs[i].dentry);
+		invalidate_buffers(sd->logs[i].dev);
+	}
+	kfree(sd->logs);
+
+	invalidate_buffers(sd->snap_dev);
+
+	for (i = 0; i < sd->n_logs; i++)
+		blk_cleanup_queue(blk_get_queue(sd->logs[i].dev));
+	blk_cleanup_queue(blk_get_queue(sd->src.dev));
+
+	DPRINTK(PFX "dev %s unregistered\n", sd->name);
+
+	memset(sd, 0, sizeof(struct snap_device));
+
+	MOD_DEC_USE_COUNT;
+	return 0;
+}
+
+static int snap_ioctl(struct inode *inode, struct file *file,
+		     unsigned int cmd, unsigned long arg)
+{
+	struct snap_device *sd;
+	int err;
+
+	if (MINOR(inode->i_rdev) >= MAX_SNAPDEVS)
+		return -EINVAL;
+
+	sd = &snap_devs[MINOR(inode->i_rdev)];
+
+	if ((cmd != SNAP_SETUP_DEV) && !sd->src.dev) {
+		DPRINTK(PFX "dev not setup\n");
+		return -ENXIO;
+	}
+
+	switch (cmd) {
+	case SNAP_GET_STATS:
+		return copy_to_user((void *)arg, &sd->stats,
+				    sizeof(struct snap_stats)) ? -EFAULT : 0;
+
+	case SNAP_SETUP_DEV: {
+		struct snap_setup se, *sp;
+		unsigned int alloc_size;
+		
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		if (sd->src.dev) {
+			printk(PFX "dev already setup\n");
+			return -EBUSY;
+		}
+		if (copy_from_user(&se, (void *)arg, sizeof(se)))
+			return -EFAULT;
+		if (se.n_log_fds == 0 || se.n_log_fds > MAX_LOGDEVS)
+			return -EINVAL;
+		alloc_size = sizeof(se) + (sizeof(int) * se.n_log_fds);
+		sp = kmalloc(alloc_size, GFP_KERNEL);
+		if (!sp)
+			return -ENOMEM;
+		if (copy_from_user(sp, (void *)arg, alloc_size)) {
+			kfree(sp);
+			return -EFAULT;
+		}
+		err = snap_setup_dev(sd, sp);
+		kfree(sp);
+		return err;
+	}
+
+	case SNAP_TEARDOWN_DEV:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		fsync_dev(sd->src.dev); /* HACK! minimize, but not close, race */
+		if (atomic_read(&sd->refcnt) != 1)
+			return -EBUSY;
+		return snap_remove_dev(sd);
+
+	case SNAP_START:
+		return snap_mode(sd, 1);
+	case SNAP_END:
+		return snap_mode(sd, 0);
+		
+	case BLKGETSIZE:
+		return put_user(blk_size[snap_major][MINOR(inode->i_rdev)] << 1, (long *)arg);
+
+	case BLKROSET:
+	case BLKROGET:
+	case BLKSSZGET:
+	case BLKRASET:
+	case BLKRAGET:
+	case BLKFLSBUF:
+		return blk_ioctl(inode->i_rdev, cmd, arg);
+
+	/*
+	 * forward all other ioctls to src blkdev
+	 */
+	default:
+		return ioctl_by_bdev(sd->src.bdev, cmd, arg);
+	}
+
+	return 0;
+}
+
+/**********************************************************************
+ *
+ * Block device operations (open/close/check_media_change),
+ * and associated functions.
+ *
+ */
+ 
+static inline void snap_mark_readonly(struct snap_device *sd, int on)
+{
+	if (on)
+		set_bit(SNAP_READONLY, &sd->flags);
+	else
+		clear_bit(SNAP_READONLY, &sd->flags);
+}
+
+static int snap_open_dev(struct snap_device *sd, int write)
+{
+	unsigned long dev_size;
+
+	if (!sd->src.dev)
+		return 0;
+
+	dev_size = blk_size[MAJOR(sd->src.dev)][MINOR(sd->src.dev)];
+	snap_sizes[MINOR(sd->snap_dev)] = dev_size;
+
+	if (write) {
+#if 0 /* no write support */
+		if ((ret = snap_open_write(sd)))
+			return ret;
+		snap_mark_readonly(sd, 0);
+#else
+		return -EINVAL;
+#endif
+	} else {
+		snap_mark_readonly(sd, 1);
+	}
+
+	if (write)
+		printk(PFX "%lukB available on disc\n", dev_size);
+
+	return 0;
+}
+
+static int snap_open(struct inode *inode, struct file *file)
+{
+	struct snap_device *sd = NULL;
+	int ret;
+
+	VPRINTK(PFX "entering open\n");
+
+	/* remove when write mode is supported */
+	if (file->f_mode & FMODE_WRITE) {
+		printk(PFX "write mode not supported\n");
+		return -EINVAL;
+	}
+
+	MOD_INC_USE_COUNT;
+
+	if (MINOR(inode->i_rdev) >= MAX_SNAPDEVS) {
+		printk(PFX "max %d snapdevs supported\n", MAX_SNAPDEVS);
+		ret = -ENODEV;
+		goto out;
+	}
+
+	/*
+	 * either device is not configured, or pktsetup is old and doesn't
+	 * use O_CREAT to create device
+	 */
+	sd = &snap_devs[MINOR(inode->i_rdev)];
+	if (!sd->src.dev && !(file->f_flags & O_CREAT)) {
+		VPRINTK(PFX "not configured and O_CREAT not set\n");
+		ret = -ENXIO;
+		goto out;
+	}
+
+	atomic_inc(&sd->refcnt);
+	if ((atomic_read(&sd->refcnt) > 1) && (file->f_mode & FMODE_WRITE)) {
+		VPRINTK(PFX "busy open for write\n");
+		ret = -EBUSY;
+		goto out_dec;
+	}
+
+	ret = snap_open_dev(sd, file->f_mode & FMODE_WRITE);
+	if (ret)
+		goto out_dec;
+
+	/*
+	 * needed here as well, since ext2 (among others) may change
+	 * the blocksize at mount time
+	 */
+	set_blocksize(sd->snap_dev, sd->blksz);
+	return 0;
+
+out_dec:
+	atomic_dec(&sd->refcnt);
+out:
+	VPRINTK(PFX "failed open (%d)\n", ret);
+	MOD_DEC_USE_COUNT;
+	return ret;
+}
+
+static void snap_release_dev(struct snap_device *sd)
+{
+	fsync_dev(sd->snap_dev);
+	invalidate_buffers(sd->snap_dev);
+
+	atomic_dec(&sd->refcnt);
+}
+
+static int snap_close(struct inode *inode, struct file *file)
+{
+	struct snap_device *sd = &snap_devs[MINOR(inode->i_rdev)];
+
+	if (sd->src.dev)
+		snap_release_dev(sd);
+
+	MOD_DEC_USE_COUNT;
+	return 0;
+}
+
+/* FIXME: -really- handle media change, by disabling active snapshot
+ * if source media changes, or handle fatal errors if target media
+ * changes
+ */
+static int snap_media_change(kdev_t dev)
+{
+	struct snap_device *sd = &snap_devs[MINOR(dev)];
+	return sd->src.bdev->bd_op->check_media_change(dev);
+}
+
+static struct block_device_operations snap_ops = {
+	open:			snap_open,
+	release:		snap_close,
+	ioctl:			snap_ioctl,
+	check_media_change:	snap_media_change,
+};
+
+/**********************************************************************
+ *
+ * Module initializations and cleanup
+ *
+ */
+ 
+static inline void snap_cleanup (void)
+{
+	if (snap_devs)
+		kfree(snap_devs);
+	if (snap_sizes)
+		kfree(snap_sizes);
+	if (snap_blksize)
+		kfree(snap_blksize);
+	if (snap_readahead)
+		kfree(snap_readahead);
+
+	snap_devs = NULL;
+	snap_sizes = NULL;
+	snap_blksize = NULL;
+	snap_readahead = NULL;
+	blk_size[snap_major] = NULL;
+	blksize_size[snap_major] = NULL;
+	max_readahead[snap_major] = NULL;
+	blk_dev[snap_major].queue = NULL;
+}
+
+static int __init snap_init(void)
+{
+	snap_major = devfs_register_blkdev(0, MODNAME, &snap_ops);
+	if (snap_major <= 0) { /* zero is invalid value b/c we need a major */
+		printk("unable to register snap device\n");
+		return -EIO;
+	}
+	devfs_register(NULL, MODNAME, 0, DEVFS_FL_DEFAULT, snap_major,
+		       S_IFBLK | S_IRUSR | S_IWUSR, &snap_ops, NULL);
+
+	snap_sizes = kmalloc(MAX_SNAPDEVS * sizeof(int), GFP_KERNEL);
+	if (snap_sizes == NULL)
+		goto err;
+
+	snap_blksize = kmalloc(MAX_SNAPDEVS * sizeof(int), GFP_KERNEL);
+	if (snap_blksize == NULL)
+		goto err;
+
+	snap_readahead = kmalloc(MAX_SNAPDEVS * sizeof(int), GFP_KERNEL);
+	if (snap_readahead == NULL)
+		goto err;
+
+	snap_devs = kmalloc(MAX_SNAPDEVS * sizeof(struct snap_device), GFP_KERNEL);
+	if (snap_devs == NULL)
+		goto err;
+
+	memset(snap_devs, 0, MAX_SNAPDEVS * sizeof(struct snap_device));
+	memset(snap_sizes, 0, MAX_SNAPDEVS * sizeof(int));
+	memset(snap_blksize, 0, MAX_SNAPDEVS * sizeof(int));
+	memset(snap_readahead, 0, MAX_SNAPDEVS * sizeof(int));
+
+	blk_size[snap_major] = snap_sizes;
+	blksize_size[snap_major] = snap_blksize;
+	max_readahead[snap_major] = snap_readahead;
+	read_ahead[snap_major] = 128;
+
+	blk_dev[snap_major].queue = snap_get_queue;
+
+	DPRINTK(PFX "%s\n", VERSION_CODE);
+	return 0;
+
+err:
+	printk(PFX "out of memory\n");
+	devfs_unregister(devfs_find_handle(NULL, MODNAME, 0, 0,
+		 	 DEVFS_SPECIAL_BLK, 0));
+	devfs_unregister_blkdev(snap_major, MODNAME);
+	snap_cleanup ();
+	return -ENOMEM;
+}
+
+static void __exit snap_exit(void)
+{
+	devfs_unregister(devfs_find_handle(NULL, MODNAME, 0, 0,
+		 	 DEVFS_SPECIAL_BLK, 0));
+	devfs_unregister_blkdev(snap_major, MODNAME);
+
+	snap_cleanup ();
+	snap_major = 0;
+}
+
+MODULE_DESCRIPTION("Snapshot block device");
+MODULE_AUTHOR("Jeff Garzik <jgarzik@mandrakesoft.com>");
+
+module_init(snap_init);
+module_exit(snap_exit);
Index: linux_2_4/include/linux/snap.h
diff -u /dev/null linux_2_4/include/linux/snap.h:1.1.6.5
--- /dev/null	Sat May 19 17:36:31 2001
+++ linux_2_4/include/linux/snap.h	Thu May 17 11:09:38 2001
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2001 Jeff Garzik <jgarzik@mandrakesoft.com>
+ * Copyright (C) 2000 Jens Axboe <axboe@suse.de>
+ *
+ * May be copied or modified under the terms of the GNU General Public
+ * License.  See linux/COPYING for more information.
+ *
+ */
+#ifndef __LINUX_SNAP_H
+#define __LINUX_SNAP_H
+
+/*
+ * 1 for normal debug messages, 2 is very verbose. 0 to turn it off.
+ */
+#define SNAP_DEBUG		1
+
+/*
+ * No user-serviceable parts beyond this point ->
+ */
+
+#if SNAP_DEBUG
+#define DPRINTK(fmt, args...) printk(KERN_NOTICE fmt, ##args)
+#else
+#define DPRINTK(fmt, args...)
+#endif
+
+#if SNAP_DEBUG > 1
+#define VPRINTK(fmt, args...) printk(KERN_NOTICE fmt, ##args)
+#else
+#define VPRINTK(fmt, args...)
+#endif
+
+/* bh list unique identifier */
+#define SNAP_BUF_LIST		0x93
+
+/*
+ * flags
+ */
+#define SNAP_READONLY		1	/* read only dev */
+#define SNAP_ACTIVE		2	/* snapshot dev is active */
+
+/*
+ * Very unused stats for now
+ */
+struct snap_stats
+{
+	unsigned long		bh_s;
+	unsigned long		bh_e;
+	unsigned long		bh_cache_hits;
+	unsigned long		page_cache_hits;
+	unsigned long		secs_w;
+	unsigned long		secs_r;
+};
+
+#define MAX_LOGDEVS		512 /* arbitrary limit */
+
+struct snap_setup
+{
+	int			src_fd;
+	unsigned int		n_log_fds;
+	int			log_fd[0];
+};
+
+/*
+ * snapshot ioctls
+ */
+#define SNAP_IOCTL_MAGIC	('N')
+#define SNAP_GET_STATS		_IOR(SNAP_IOCTL_MAGIC, 0xE0, struct snap_stats)
+#define SNAP_SETUP_DEV		_IOW(SNAP_IOCTL_MAGIC, 0xE1, struct snap_setup)
+#define SNAP_TEARDOWN_DEV	_IOW(SNAP_IOCTL_MAGIC, 0xE2, unsigned int)
+#define SNAP_START		_IO(SNAP_IOCTL_MAGIC, 0xE3)
+#define SNAP_END		_IO(SNAP_IOCTL_MAGIC, 0xE4)
+
+#ifdef __KERNEL__
+#include <linux/blkdev.h>
+
+struct snap_dev_init_info
+{
+	struct inode *inode;
+	struct file *file;
+};
+
+struct snap_src
+{
+	request_queue_t			*q;
+	kdev_t				dev;
+
+	make_request_fn			*make_request_fn;
+
+	struct block_device		*bdev;
+	struct dentry			*dentry;
+	unsigned long			blocks;
+};
+
+struct snap_log
+{
+	kdev_t				dev;
+	struct dentry			*dentry;
+	unsigned long			size;
+	unsigned long			free;
+	unsigned long			data_blocks;
+};
+
+struct snap_device
+{
+	struct snap_src			src;
+	struct snap_log			*logs;
+	unsigned int			n_logs;
+	unsigned int			active_logs;
+	
+	request_queue_t			q;
+	atomic_t			refcnt;
+	kdev_t				snap_dev;
+	unsigned long			flags;
+	char				name[20];
+	unsigned int			bits_per_block;
+	unsigned int			maps_per_block;
+	unsigned long			map_base;
+	unsigned long			data_base;
+	unsigned int			blksz;
+	struct snap_stats		stats;
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* __LINUX_SNAP_H */
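
For anyone trying to follow the log-device layout that snap_new_dev()
sets up and remap_to_logdev() consumes, here is a minimal userspace sketch
of the same arithmetic. It is not part of the patch: struct snap_layout,
div_round_up(), snap_layout_calc() and snap_map_slot() are hypothetical
names, and the map entry size is passed in explicitly instead of being
taken from sizeof(unsigned long).

```c
#include <stddef.h>

/* Hypothetical mirror of the layout fields kept in struct snap_device. */
struct snap_layout {
	unsigned long bits_per_block;	/* bitmap bits held by one block */
	unsigned long maps_per_block;	/* map entries held by one block */
	unsigned long bitmap_blocks;	/* blocks used by the bitmap */
	unsigned long map_blocks;	/* blocks used by the remap vector */
	unsigned long map_base;		/* first block of the remap vector */
	unsigned long data_base;	/* first block of copied data */
};

static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Same arithmetic as snap_new_dev() in the patch.  Returns 0 on success,
 * -1 if the inputs would make one of the divisors zero.
 */
static int snap_layout_calc(struct snap_layout *l, unsigned long src_blocks,
			    unsigned long blksz, unsigned long entry_size)
{
	if (!l || !blksz || !entry_size || blksz < entry_size * 2)
		return -1;

	l->bits_per_block = blksz * 8;
	l->maps_per_block = blksz / (entry_size * 2);
	l->bitmap_blocks = div_round_up(src_blocks, l->bits_per_block);
	l->map_blocks = div_round_up(src_blocks, l->maps_per_block);
	l->map_base = l->bitmap_blocks;
	l->data_base = l->bitmap_blocks + l->map_blocks;
	return 0;
}

/* Where remap_to_logdev() would look for the entry of source block blk. */
static void snap_map_slot(const struct snap_layout *l, unsigned long blk,
			  unsigned long *map_blk, unsigned long *map_ofs)
{
	*map_blk = l->map_base + blk / l->maps_per_block;
	*map_ofs = blk % l->maps_per_block;
}
```

With a 1024-byte block size, 8-byte entries and a 100000-block source,
this puts the remap vector at block 13 and the data area at block 1576,
which also shows how much of each log device goes to metadata before any
data is copied.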


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  0:32       ` Linus Torvalds
  2001-05-20  0:52         ` Jeff Garzik
@ 2001-05-20  1:03         ` Jeff Garzik
  2001-05-20 19:41           ` Why side-effects on open(2) are evil. (was Re: [RFD Alan Cox
                             ` (3 more replies)
  2001-05-22 18:41         ` Andreas Dilger
  2 siblings, 4 replies; 161+ messages in thread
From: Jeff Garzik @ 2001-05-20  1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel

Here's a dumb question, and I apologize if I am questioning computer
science dogma...

Why are LVM and EVMS (a competing LVM project) needed at all?

Surely the same can be accomplished with
* md
* snapshot blkdev (attached in previous e-mail)
* giving partitions and blkdevs the ability to grow and shrink
* giving filesystems the ability to grow and shrink

On-line optimization (defrag, etc) shouldn't be hard once you have the
ability to move blocks and files around, which would come with the
ability to grow and shrink blkdevs and fs's.

-- 
Jeff Garzik      | "Do you have to make light of everything?!"
Building 1024    | "I'm extremely serious about nailing your
MandrakeSoft     |  step-daughter, but other than that, yes."


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 18:51         ` Richard Gooch
@ 2001-05-20  2:18           ` Matthew Wilcox
  2001-05-20  2:22           ` Richard Gooch
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-20  2:18 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Alan Cox, Alexander Viro, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote:
> Al, if you really want to kill ioctl(2), then perhaps you should
> implement a transaction(2) syscall. Something like:
>     int transaction (int fd, void *rbuf, size_t rlen,
> 		     void *wbuf, size_t wlen);
> 
> Of course, there wouldn't be any practical gain, since we already have
> ioctl(2). Any gain would be aesthetic.

I can tell you haven't had to write any 32-bit ioctl emulation code for
a 64-bit kernel recently.

-- 
Revolutions do not require corporate support.


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 18:51         ` Richard Gooch
  2001-05-20  2:18           ` Matthew Wilcox
@ 2001-05-20  2:22           ` Richard Gooch
  2001-05-20  2:34             ` Matthew Wilcox
                               ` (3 more replies)
  2001-05-20  2:31           ` Alexander Viro
  2001-05-20 16:57           ` David Woodhouse
  3 siblings, 4 replies; 161+ messages in thread
From: Richard Gooch @ 2001-05-20  2:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, Alexander Viro, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

Matthew Wilcox writes:
> On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote:
> > Al, if you really want to kill ioctl(2), then perhaps you should
> > implement a transaction(2) syscall. Something like:
> >     int transaction (int fd, void *rbuf, size_t rlen,
> > 		     void *wbuf, size_t wlen);
> > 
> > Of course, there wouldn't be any practical gain, since we already have
> > ioctl(2). Any gain would be aesthetic.
> 
> I can tell you haven't had to write any 32-bit ioctl emulation code for
> a 64-bit kernel recently.

The transaction(2) syscall can be just as easily abused as ioctl(2) in
this respect. People can pass pointers to ill-designed structures very
easily. The main advantage of transaction(2) is that hopefully, people
will not be so bone-headed as to forget to pass sizeof *structptr
as the size field. So perhaps some error trapping is possible.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 18:51         ` Richard Gooch
  2001-05-20  2:18           ` Matthew Wilcox
  2001-05-20  2:22           ` Richard Gooch
@ 2001-05-20  2:31           ` Alexander Viro
  2001-05-20 16:57           ` David Woodhouse
  3 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20  2:31 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Alan Cox, Andrew Clausen, Ben LaHaise, torvalds, linux-kernel,
	linux-fsdevel



On Sat, 19 May 2001, Richard Gooch wrote:

> There is another reason to use ioctl(2): when you need to send data to
> the kernel/driver and wait for a response. It supports transactions,
> which read(2) and write(2) cannot. Therefore it remains useful.

Somebody, run to database vendors and tell them that they were selling
snake oil all that time - Richard had just shown that support of remote
transactions is impossible. Can't do that with read() and write(),
dontcha know?

Richard, I hate to break it to you, but
	fd = open(foo, 2);
		/* kernel creates a new struct file, as usual */
	write(fd, data, len);
		/* kernel starts the operation */
	read(fd, reply, size);
		/* we block */
		/* operation is completed */
		/* kernel passes reply to user and wakes it up */
_is_ support for transactions. And yes, we can trivially distinguish
between requests from different sources - struct file * passed to
->write() is more than enough for that. Moreover, we can easily block
other writers until the action is completed.

Please, get a bloody clue. There are reasons for and against ioctls, but
the need to send data and wait for a response is _NOT_ one of them.
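
A minimal userspace sketch of the write()-then-read() transaction described
above, with a forked child over a socketpair standing in for the driver.
Everything here is an assumption made for illustration: struct sum_req,
struct sum_reply, serve_one(), transact() and sum_via_fd() are hypothetical
names, and the messages are small enough that short reads on the stream
socket are not handled.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical fixed-size request and reply; no embedded pointers. */
struct sum_req   { uint32_t a, b; };
struct sum_reply { uint32_t sum; };

/* Stand-in for the driver side: take one request, send one reply. */
static int serve_one(int fd)
{
	struct sum_req req;
	struct sum_reply rep;

	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	rep.sum = req.a + req.b;
	return write(fd, &rep, sizeof(rep)) == (ssize_t)sizeof(rep) ? 0 : -1;
}

/* The transaction: write the request, then block in read() for the reply. */
static int transact(int fd, const struct sum_req *req, struct sum_reply *rep)
{
	if (write(fd, req, sizeof(*req)) != (ssize_t)sizeof(*req))
		return -1;
	if (read(fd, rep, sizeof(*rep)) != (ssize_t)sizeof(*rep))
		return -1;
	return 0;
}

/* Run one transaction against a forked server; 0 on success, fills *sum. */
static int sum_via_fd(uint32_t a, uint32_t b, uint32_t *sum)
{
	struct sum_req req = { a, b };
	struct sum_reply rep;
	int sv[2], status, ret;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	if (pid == 0) {			/* child plays the driver */
		close(sv[0]);
		_exit(serve_one(sv[1]) ? 1 : 0);
	}

	close(sv[1]);
	ret = transact(sv[0], &req, &rep);
	close(sv[0]);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (ret || !WIFEXITED(status) || WEXITSTATUS(status))
		return -1;
	*sum = rep.sum;
	return 0;
}
```

The caller blocks in read() until the other side has produced the reply,
which is the whole "send data and wait" pattern with nothing but a file
descriptor in between.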



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:22           ` Richard Gooch
@ 2001-05-20  2:34             ` Matthew Wilcox
  2001-05-20  2:36             ` Alexander Viro
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-20  2:34 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Matthew Wilcox, Alan Cox, Alexander Viro, Andrew Clausen,
	Ben LaHaise, torvalds, linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> The transaction(2) syscall can be just as easily abused as ioctl(2) in
> this respect.

But read() and write() cannot.

-- 
Revolutions do not require corporate support.


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:22           ` Richard Gooch
  2001-05-20  2:34             ` Matthew Wilcox
@ 2001-05-20  2:36             ` Alexander Viro
  2001-05-20  2:48             ` Richard Gooch
  2001-05-20  2:51             ` Richard Gooch
  3 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20  2:36 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Matthew Wilcox, Alan Cox, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel



On Sat, 19 May 2001, Richard Gooch wrote:

> The transaction(2) syscall can be just as easily abused as ioctl(2) in
> this respect. People can pass pointers to ill-designed structures very

Right. Moreover, it's not needed. The same functionality can be trivially
implemented by write() and read(). As a matter of fact, it has been done
in userland context for decades. Go and buy Stevens. Read it. Then come
back.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:22           ` Richard Gooch
  2001-05-20  2:34             ` Matthew Wilcox
  2001-05-20  2:36             ` Alexander Viro
@ 2001-05-20  2:48             ` Richard Gooch
  2001-05-20  3:26               ` Linus Torvalds
  2001-05-20  2:51             ` Richard Gooch
  3 siblings, 1 reply; 161+ messages in thread
From: Richard Gooch @ 2001-05-20  2:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, Alexander Viro, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

Matthew Wilcox writes:
> On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> > The transaction(2) syscall can be just as easily abused as ioctl(2) in
> > this respect.
> 
> But read() and write() cannot.

Sure they can. I can pass a pointer to a structure to either of them.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:22           ` Richard Gooch
                               ` (2 preceding siblings ...)
  2001-05-20  2:48             ` Richard Gooch
@ 2001-05-20  2:51             ` Richard Gooch
  2001-05-20 21:13               ` Pavel Machek
  3 siblings, 1 reply; 161+ messages in thread
From: Richard Gooch @ 2001-05-20  2:51 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Matthew Wilcox, Alan Cox, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

Alexander Viro writes:
> 
> 
> On Sat, 19 May 2001, Richard Gooch wrote:
> 
> > The transaction(2) syscall can be just as easily abused as ioctl(2) in
> > this respect. People can pass pointers to ill-designed structures very
> 
> Right. Moreover, it's not needed. The same functionality can be
> trivially implemented by write() and read(). As a matter of fact,
> it has been done in userland context for decades. Go and buy
> Stevens. Read it. Then come back.

I don't need to read it. Don't be insulting. Sure, you *can* use a
write(2)/read(2) cycle. But that's two syscalls compared to one with
ioctl(2) or transaction(2). That can matter to some applications.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:48             ` Richard Gooch
@ 2001-05-20  3:26               ` Linus Torvalds
  2001-05-20 10:23                 ` Russell King
  0 siblings, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20  3:26 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Matthew Wilcox, Alan Cox, Alexander Viro, Andrew Clausen,
	Ben LaHaise, linux-kernel, linux-fsdevel


On Sat, 19 May 2001, Richard Gooch wrote:
>
> Matthew Wilcox writes:
> > On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> > > The transaction(2) syscall can be just as easily abused as ioctl(2) in
> > > this respect.
> > 
> > But read() and write() cannot.
> 
> Sure they can. I can pass a pointer to a structure to either of them.

You're missing the point.

It's ok to do "read()/write()" on structures. In fact, people do that all
the time (and then they complain about the file not being portable ;)

The problem with ioctl is that not only are people passing ioctl's
pointers to structures, but:
 - they're not telling how big the structure is
 - the structure can have pointers to other places
 - sometimes it modifies the structure passed in

None of which are "network-nice". Basically, ioctl() is historically used
as a "pass any crap into driver xxxx, and the driver - and ONLY the driver
- will know what to do with it".

And when _only_ a driver knows what the arguments mean, upper layers can't
encapsulate them. Upper layers cannot make a packet of the argument and
send it over the network to another machine. Upper layers cannot do
sanity-checking on things like "is this argument a valid pointer". Which
means, for example, that not only can you not send the ioctl arguments
anywhere, but ioctl's have also historically been a hot-bed of bugs.

Example traditional ioctl bugs: use kernel pointers to access the argument
(because it just happens to work on x86, never mind the fact that if the
argument is bad you'll get a kernel oops and/or a serious security error).
Other example: different drivers/filesystems implementing the same ioctl,
but disagreeing on what the argument means (is it a pointer to an integer
argument, or the integer itself?).

Now, the advantage of using read()/write() is (a) that it's unambiguous
where the argument comes from and how big it is and (b) because of that
the _psychology_ is different. You don't get into this "pass random crap
around, let the kernel modify user data structures directly" mentality.

And psychology is important.

		Linus

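To make the (a)/(b) argument above concrete, here is a minimal sketch of
a command that could be handed to write(2) as one flat buffer: explicit
length, fixed-width fields, no embedded pointers. The struct ctl_msg
layout and the ctl_msg_pack() helper are hypothetical names invented for
illustration; nothing like them appears in the patch under discussion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout: everything the receiver needs is inside
 * the buffer itself, so an upper layer can copy it or forward it
 * without knowing what the command means. */
struct ctl_msg {
	uint32_t cmd;		/* command code, meaning is driver-defined */
	uint32_t len;		/* number of valid bytes in payload[] */
	uint8_t  payload[56];	/* inline data, never a pointer */
};

/* Fill *msg and return the number of bytes to pass to write(2),
 * or 0 if the data does not fit in the inline payload. */
static size_t ctl_msg_pack(struct ctl_msg *msg, uint32_t cmd,
			   const void *data, size_t len)
{
	if (len > sizeof(msg->payload))
		return 0;
	memset(msg, 0, sizeof(*msg));
	msg->cmd = cmd;
	msg->len = (uint32_t)len;
	if (len)
		memcpy(msg->payload, data, len);
	return offsetof(struct ctl_msg, payload) + len;
}
```

A caller would do something like n = ctl_msg_pack(&m, 1, "eject", 5)
followed by write(fd, &m, n). The size travels with the data, which is
exactly what an ioctl argument does not guarantee.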

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  3:26               ` Linus Torvalds
@ 2001-05-20 10:23                 ` Russell King
  2001-05-20 10:35                   ` Alexander Viro
  2001-05-20 18:46                   ` Linus Torvalds
  0 siblings, 2 replies; 161+ messages in thread
From: Russell King @ 2001-05-20 10:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Gooch, Matthew Wilcox, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel

On Sat, May 19, 2001 at 08:26:20PM -0700, Linus Torvalds wrote:
> You're missing the point.

I don't think Richard is actually.  I think Richard has hit a nail
dead on its head.

> It's ok to do "read()/write()" on structures.

Ok, we can read()/write() structures.  So someone invents the following
structure:

	struct foo {
		int cmd;
		void *data;
	} foo;

Now they use write(fd, &foo, sizeof(foo)); Haven't they just swapped
the ioctl() interface for write() instead?

Ok, let's hope that humanity isn't that stupid, so let's take another
example:

	struct bar {
		int in_size;
		void *in_data;
		int out_size;
		void *out_data;
	};

	struct foo {
		int cmd;
		struct bar bar1;
	} foo;

Same write call, but ok, we have a structure of known size.  It's still
the same problem.

What I'm trying to say is that I think that read+write is open to more
or the same abuse that ioctl has been, not less.

However, it does have one good thing going for it - you can support
poll on blocking "ioctls" like TIOCMIWAIT.

> None of which are "network-nice". Basically, ioctl() is historically used
> as a "pass any crap into driver xxxx, and the driver - and ONLY the driver
> - will know what to do with it".

I still see read()/write() being a "pass any crap" interface.  The
implementer of the target for read()/write() will probably still be
a driver which will need to decode what it's given, whether it's in
ASCII or binary.

And driver writers are already used to writing ioctl-like interfaces.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 10:23                 ` Russell King
@ 2001-05-20 10:35                   ` Alexander Viro
  2001-05-20 18:46                   ` Linus Torvalds
  1 sibling, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 10:35 UTC (permalink / raw)
  To: Russell King
  Cc: Linus Torvalds, Richard Gooch, Matthew Wilcox, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Russell King wrote:

> I still see read()/write() being a "pass any crap" interface.  The
> implementer of the target for read()/write() will probably still be
> a driver which will need to decode what it's given, whether it's in
> ASCII or binary.
> 
> And driver writers are already used to writing ioctl-like interfaces.

You _are_ missing the point. write() doesn't have that history of wild
abuse. It's easier to whack the driver writer's balls for abusing it.
I'm more than willing to play Narn Bat Squad and I'm pretty sure that
I'm not alone at that.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)
  2001-05-19 23:39         ` Alexander Viro
@ 2001-05-20 15:47           ` Edgar Toernig
  2001-05-20 16:20             ` Alexander Viro
  2001-05-21 17:16           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Oliver Xymoron
  1 sibling, 1 reply; 161+ messages in thread
From: Edgar Toernig @ 2001-05-20 15:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Jeff Garzik, Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> 
> For the latter, though,
> we need to write commands into files and here your miscdevices (or procfs
> files, or /dev/foo/ctl - whatever) is needed.

IMHO any scheme that requires a special name to perform ioctl like
functions will not work.  Often you don't know the name of the
device you're talking to and then you're lost.

So, if you want an additional communication channel to a device why
not introduce an fcntl or system call like

	ctrlfd = fcntl(fd, F_CTRLFD)    or  openctrl(fd)  ?

That way you can always get access to the control channel and use
regular read/write for communication [1].  To make it more versatile,
you may want to extend the shell syntax, i.e. a '@' in redirection
operators gets the control fd:

	echo "eject" >@/dev/cdrom
	{ echo "b19200,onlcr" >@1 ; echo "Hello World!" ; } >/dev/ttyS0

Yes, requires support in user space apps but doesn't mess around
with the file namespace.  It's too precious to sacrifice ;-)

I don't know how much infrastructure in the kernel is required for this 
- i.e. add readctrl/writectrl methods or create virtual inodes/devices
on the fly?  There are more capable people than me to judge on that...

Ciao, ET.


[1] If you want you can even allow this flag as an open mode to
open the ctrl channel without opening the dev.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)
  2001-05-20 15:47           ` F_CTRLFD (was Re: Why side-effects on open(2) are evil.) Edgar Toernig
@ 2001-05-20 16:20             ` Alexander Viro
  2001-05-20 19:01               ` Edgar Toernig
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 16:20 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Jeff Garzik, Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Edgar Toernig wrote:

> IMHO any scheme that requires a special name to perform ioctl like
> functions will not work.  Often you don't know the name of the
> device you're talking to and then you're lost.

ls -l /proc/self/fd/<n>

and think of the results. We can export that as a syscall (fpath(2)), BTW.

Again, folks, there are two things that are not going to happen:

	1) sys_ioctl() going away from syscall table. Binary compatibility
with existing userland stuff that deals with networking ioctls. Unlike
special-case device ones, they really have a lot of users. Standard rules
are "2 stable releases until we remove a syscall".

	2) semi-automatic conversion of existing applications. To hell with
the way we are finding descriptor, we need to deal with arguments themselves.
And no extra logics in libc will help - the whole problem is that ioctls
have rather irregular arguments.

So "make it look as similar to ioctl() as possible" is not a good goal.
It would be, if we were preparing to do mass switching to new mechanism
with minimal changes to existing codebase. Not realistic.

What we need is "make it sane", not "inherit as many things from the
old API as possible". And obvious first target is Linux-specific
device ioctls, simply because they have fewer programs using them.

Networking ioctls are there to stay for quite a while - we'll need
at the very least to implement old ones in a userland library.
Portability issues will be nasty, since _that_ stuff is used by
tons of programs.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-19 18:51         ` Richard Gooch
                             ` (2 preceding siblings ...)
  2001-05-20  2:31           ` Alexander Viro
@ 2001-05-20 16:57           ` David Woodhouse
  2001-05-20 19:02             ` Linus Torvalds
  3 siblings, 1 reply; 161+ messages in thread
From: David Woodhouse @ 2001-05-20 16:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Richard Gooch, Alan Cox, Alexander Viro, Andrew Clausen,
	Ben LaHaise, torvalds, linux-kernel, linux-fsdevel


matthew@wil.cx said:
>  I can tell you haven't had to write any 32-bit ioctl emulation code
> for a 64-bit kernel recently.

If that had been done right the first time, you wouldn't have had to either.
For that matter, it's often the case that if the ioctl had been done right
the first time, nobody would have had to fix it up for any architecture.

I made the mistake of using machine-specific types in some ioctls, but 
fixed them as soon as I realised some poor sod was going to have to write 
and maintain the ugly conversion code.

For pointers, sometimes it's justified. Often however, as in my case, it
was just stupidity on the part of the original coder and should be fixed.
Although I suppose I have the advantage that I don't have to worry too much
about binary compatibility for the things I changed.

--
dwmw2

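As an illustration of the machine-specific types problem described
above, here is a small sketch contrasting an argument structure whose
layout changes between 32-bit and 64-bit userland with one that does
not. Both struct names are made up for this example and do not come
from any real driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout depends on the ABI: "unsigned long" is 4 bytes for 32-bit
 * userland and 8 bytes for 64-bit userland on Linux, so a 64-bit
 * kernel needs conversion code to serve 32-bit callers. */
struct region_native {
	unsigned long start;
	unsigned long length;
};

/* Fixed-width fields, ordered so that no padding is needed: the same
 * 16 bytes whether the caller is 32-bit or 64-bit. */
struct region_fixed {
	uint64_t start;
	uint32_t length;
	uint32_t flags;
};

/* Compile-time check that the layout is what the comment claims. */
_Static_assert(sizeof(struct region_fixed) == 16,
	       "region_fixed must not depend on the ABI");
```

With the second form the 32-bit and 64-bit callers hand the kernel
byte-identical buffers, so no emulation layer has to translate them.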


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 10:23                 ` Russell King
  2001-05-20 10:35                   ` Alexander Viro
@ 2001-05-20 18:46                   ` Linus Torvalds
  2001-05-20 18:57                     ` Russell King
  1 sibling, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 18:46 UTC (permalink / raw)
  To: Russell King
  Cc: Richard Gooch, Matthew Wilcox, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


On Sun, 20 May 2001, Russell King wrote:
>
> On Sat, May 19, 2001 at 08:26:20PM -0700, Linus Torvalds wrote:
> > You're missing the point.
> 
> I don't think Richard is actually.  I think Richard has hit a nail
> dead on its head.
> 
> > It's ok to do "read()/write()" on structures.
> 
> Ok, we can read()/write() structures.  So someone invents the following
> structure:
> 
> 	struct foo {
> 		int cmd;
> 		void *data;
> 	} foo;
> 
> Now they use write(fd, &foo, sizeof(foo)); Haven't they just swapped
> the ioctl() interface for write() instead?

Wrong.

Nobody will expect the above to work, and everybody will agree that the
above is a BUG if the read() call will actually follow the pointer.

Read my email. And read the last line: "psychology is important".

Step #1 in programming: understand people.

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 18:46                   ` Linus Torvalds
@ 2001-05-20 18:57                     ` Russell King
  2001-05-20 19:10                       ` Linus Torvalds
  0 siblings, 1 reply; 161+ messages in thread
From: Russell King @ 2001-05-20 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Gooch, Matthew Wilcox, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel

On Sun, May 20, 2001 at 11:46:33AM -0700, Linus Torvalds wrote:
> Nobody will expect the above to work, and everybody will agree that the
> above is a BUG if the read() call will actually follow the pointer.

I didn't say anything about read().  I said write().  Obviously it
wouldn't work for read()!

> Read my email. And read the last line: "psychology is important".

I did.  I also know that if you give the world enough rope, someone
will hang themselves.

(Note that because I've thought of a way of misusing this in the same
way as ioctl, you can bet your bottom dollar that other people will).

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)
  2001-05-20 16:20             ` Alexander Viro
@ 2001-05-20 19:01               ` Edgar Toernig
  2001-05-20 19:30                 ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Edgar Toernig @ 2001-05-20 19:01 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Jeff Garzik, Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel

Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Edgar Toernig wrote:
> 
> > IMHO any scheme that requires a special name to perform ioctl like
> > functions will not work.  Often you don't know the name of the
> > device you're talking to and then you're lost.
> 
> ls -l /proc/self/fd/<n>

Oh come on.  You made most of the VFS and should know better.  Since when
is it possible to always get a "usable" name for an fd???  The ls -l will
give me "deleted", "socket", "...".  If I try to access the name given
by procfs I may get EPERM, etc etc.  And then, it's pretty strange to append
a "ctl" to some arbitrary name and I get a control device for that name???
No.  Using names is __wrong__!

> [not going to happen:]
>         1) sys_ioctl() going away from syscall table.

I would never suggest that.

>         2) semi-automatic conversion of existing applications.

Same.  Much too dangerous.

> To hell with
> the way we are finding descriptor, we need to deal with arguments themselves.
> And no extra logics in libc will help - the whole problem is that ioctls
> have rather irregular arguments.

Don Quijote II.? ;-)

IMHO any similarly powerful (and versatile) interface will see the same
problems.  Enforcing a read/write like interface (and rejecting drivers
that pass ptrs through this interface) may give you some knowledge about
the kernel/userspace communication.  But the data that flows around will
become the same mess that is present with the current ioctl.  Every driver
invents its own sets of commands, its own rules of argument parsing, ...
Maybe it's no longer strange binary data but readable ASCII strings but
that's all.  Look at how many different "styles" of /proc files there are.

> What we need is "make it sane", not "inherit as many things from the
> old API as possible". And obvious first target is Linux-specific
> device ioctls, simply because they have fewer programs using them.

You can impose some rules like "must support" commands, something of
how arguments are encoded, errors reported and so on.  But I wouldn't
like to see an SNMP like mess...

IMHO what's needed is a definition for "sane" in this context.  Trying
to limit the kind of actions performed by ioctls is not "sane".  Then
people will always revert back to old ioctl.  "Sane" could be: network
transparent, architecture independent, usable with generic tools and non
C-like languages.

Ciao, ET.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 16:57           ` David Woodhouse
@ 2001-05-20 19:02             ` Linus Torvalds
  2001-05-20 19:11               ` Alexander Viro
                                 ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 19:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Matthew Wilcox, Richard Gooch, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


On Sun, 20 May 2001, David Woodhouse wrote:
> 
> If that had been done right the first time, you wouldn't have had to either.
> For that matter, it's often the case that if the ioctl had been done right
> the first time, nobody would have had to fix it up for any architecture.

The problem with ioctl's is, let me repeat, not technology.

It's people.

ioctl's are a way to do ugly things. That's what they have ALWAYS been.
And because of that, people don't care about following the rules - if
ioctl's followed the rules, they wouldn't _be_ ioctls in the first place,
but instead have a good interface (say, read()/write()).

Basically, ioctl's will _never_ be done right, because of the way people
think about them. They are a back door. They are by design typeless and
without rules. They are, in fact, the Microsoft of UNIX.

The only way to fix ioctl's is to force people to think about them in
another way. Because if you don't, there is always going to be another
driver writer who adds his own ioctl because it's the easy way to do
whatever he wants without giving it a second of _design_ thought.

Now, a good way to force the issue may be to just remove the "ioctl"
function pointer from the file operations structure altogether. We don't
have to force people to use "read/write" - we can just make it clear that
ioctl's _have_ to be wrapped, and that the only ioctl's that are
acceptable are the ones that are well-designed enough to be wrappable. So
we'd have a "linux/fs/ioctl.c" that would do all the wrapping, and would
also be able to do all the stuff that is currently done by pretty much
every single architecture out there (ie emulation of ioctl's for different
native modes).

It would probably not be that horrible. Many ioctl's are probably not all
that much used by any real programs any more. The most common ones by far
are the tty ones - and the truly generic ones like "FIONREAD" that it
actually would make sense to generalize more.

Catching stuff like EJECT at a higher layer and turning THOSE kinds of
things into real block device operations would clean up drivers and make
them more uniform.

Would fs/ioctl.c be an ugly mess of some special cases? Yes. But would
that make the ugliness explicit and possibly easier to try to manage and
fix? Very probably. And it would mean that driver writers could not just
say "fuck design, I'm going to do this my own really ugly way". 

			Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 18:57                     ` Russell King
@ 2001-05-20 19:10                       ` Linus Torvalds
  2001-05-20 19:42                         ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 19:10 UTC (permalink / raw)
  To: Russell King
  Cc: Richard Gooch, Matthew Wilcox, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


On Sun, 20 May 2001, Russell King wrote:
>
> On Sun, May 20, 2001 at 11:46:33AM -0700, Linus Torvalds wrote:
> > Nobody will expect the above to work, and everybody will agree that the
> > above is a BUG if the read() call will actually follow the pointer.
> 
> I didn't say anything about read().  I said write().  Obviously it
> wouldn't work for read()!

No, but the point is, everybody _would_ consider it a bug if a
low-level driver "write()" did anything but touch the explicit buffer.

Code like that would not pass through anybody's yuck-o-meter. People would
point fingers and say "That is not a legal write() function". Anybody who
tried to make write() follow pointers would be laughed at as a stupid git.

Anybody who makes "ioctl()" do the same is just following years of
standard practice, and the yuck-o-meter doesn't even register.

THAT is the importance of psychology.

Technology is meaningless. What matters is how people _think_ of it.

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:02             ` Linus Torvalds
@ 2001-05-20 19:11               ` Alexander Viro
  2001-05-20 19:18                 ` Matthew Wilcox
  2001-05-20 19:27                 ` Linus Torvalds
  2001-05-20 19:57               ` David Woodhouse
  2001-05-21 13:57               ` Ingo Oeser
  2 siblings, 2 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Matthew Wilcox, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Linus Torvalds wrote:

> Now, a good way to force the issue may be to just remove the "ioctl"
> function pointer from the file operations structure altogether. We don't
> have to force people to use "read/write" - we can just make it clear that
> ioctl's _have_ to be wrapped, and that the only ioctl's that are
> acceptable are the ones that are well-designed enough to be wrappable. So
> we'd have a "linux/fs/ioctl.c" that would do all the wrapping, and would
> also be able to do all the stuff that is currently done by pretty much
> every single architecture out there (ie emulation of ioctl's for different
> native modes).

Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?

> It would probably not be that horrible. Many ioctl's are probably not all
> that much used by any real programs any more. The most common ones by far
> are the tty ones - and the truly generic ones like "FIONREAD" that it
> actually would make sense to generalize more.

Networking stuff. It _is_ used.
 
> Catching stuff like EJECT at a higher layer and turning THOSE kinds of
> things into real block device operations would clean up drivers and make
> them more uniform.
> 
> Would fs/ioctl.c be an ugly mess of some special cases? Yes. But would
> that make the ugliness explicit and possibly easier to try to manage and
> fix? Very probably. And it would mean that driver writers could not just
> say "fuck design, I'm going to do this my own really ugly way". 

How about moratorium on new ioctls in the meanwhile? Whatever we do in
fs/ioctl.c, it _will_ take time.
								Al


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:11               ` Alexander Viro
@ 2001-05-20 19:18                 ` Matthew Wilcox
  2001-05-20 19:24                   ` Alexander Viro
  2001-05-20 19:27                 ` Linus Torvalds
  1 sibling, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-20 19:18 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, David Woodhouse, Matthew Wilcox, Richard Gooch,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel

On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
> Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?

No.

$ ls -l arch/*/kernel/ioctl32*.c
-rw-r--r--    1 willy    willy       22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c
-rw-r--r--    1 willy    willy      109475 May 18 16:39 arch/parisc/kernel/ioctl32.c
-rw-r--r--    1 willy    willy      117605 Feb  1 20:35 arch/sparc64/kernel/ioctl32.c

only about 100k.

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:18                 ` Matthew Wilcox
@ 2001-05-20 19:24                   ` Alexander Viro
  2001-05-20 19:34                     ` Linus Torvalds
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 19:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, David Woodhouse, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Matthew Wilcox wrote:

> On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
> > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?
> 
> No.
> 
> $ ls -l arch/*/kernel/ioctl32*.c
> -rw-r--r--    1 willy    willy       22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c
> -rw-r--r--    1 willy    willy      109475 May 18 16:39 arch/parisc/kernel/ioctl32.c
> -rw-r--r--    1 willy    willy      117605 Feb  1 20:35 arch/sparc64/kernel/ioctl32.c
> 
> only about 100k.

You are missing all x86-only drivers.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:11               ` Alexander Viro
  2001-05-20 19:18                 ` Matthew Wilcox
@ 2001-05-20 19:27                 ` Linus Torvalds
  2001-05-20 19:33                   ` Alexander Viro
  1 sibling, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 19:27 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David Woodhouse, Matthew Wilcox, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


On Sun, 20 May 2001, Alexander Viro wrote:
> 
> Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?

I agree. But it would certainly force people to think about this. And it
may turn out that a lot of it can be streamlined, and not that much ends
up being used very much.

It would also allow a single place of catching the generic ones, and as
such be a place to try to make things like the network ioctl's more
regular: setting things like network device duplex with _real_ interfaces
instead of hiding it in ioctl routines.

Also, note that many ioctl's actually do have fairly regular meaning, and
that it _is_ possible to catch a number of them with those regular
things:

	switch (_IOC_TYPE(number)) {
	case 'x':
		xfs_ioctl(..);

and actually try to enforce the things that Documentation/ioctl-number.txt
tries to document. And make the clashes _explicit_ and thus make people
have more incentive to really try to fix it.

> How about moratorium on new ioctls in the meanwhile? Whatever we do in
> fs/ioctl.c, it _will_ take time.

Ehh.. Telling people "don't do that" simply doesn't work. Not if they can
do it easily anyway. Things really don't get fixed unless people have a
certain pain-level to induce it to get fixed.

		Linus

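The _IOC_TYPE() dispatch outlined above can be tried from userspace,
since the encoding macros are exported in <linux/ioctl.h>. This is only
a sketch of the idea, assuming Linux kernel headers are installed: the
DEMO_* command numbers and the ioctl_owner() function are invented for
the example, and the 'x' to "xfs" mapping simply follows the fragment
in the message.

```c
#include <linux/ioctl.h>

/* Hypothetical command numbers built with the standard encoding
 * macros: a type letter, a sequence number, and an argument type. */
#define DEMO_X_FREEZE	_IO('x', 1)
#define DEMO_Q_QUERY	_IOR('q', 2, int)

/* Route a command by its type letter alone, the way a central
 * fs/ioctl.c could before any driver sees the request. */
static const char *ioctl_owner(unsigned int cmd)
{
	switch (_IOC_TYPE(cmd)) {
	case 'x':
		return "xfs";
	default:
		return "unclaimed";
	}
}
```

Because the type letter, number and argument size are all recoverable
from the command word (_IOC_TYPE, _IOC_NR, _IOC_SIZE), two subsystems
claiming the same letter become visible as a duplicate case label in
one switch rather than a silent clash.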

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)
  2001-05-20 19:01               ` Edgar Toernig
@ 2001-05-20 19:30                 ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 19:30 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Jeff Garzik, Linus Torvalds, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Edgar Toernig wrote:

> IMHO any similarly powerful (and versatile) interface will see the same
> problems.  Enforcing a read/write like interface (and rejecting drivers
> that pass ptrs through this interface) may give you some knowledge about
> the kernel/userspace communication.  But the data that flows around will
> become the same mess that is present with the current ioctl.  Every driver
> invents its own sets of commands, its own rules of argument parsing, ...
> Maybe it's no longer strange binary data but readable ASCII strings but
> that's all.  Look at how many different "styles" of /proc files there are.

Too many people who don't know C and manage to get their crap into the
tree. Shame, but that is _not_ a technical problem.

> IMHO what's needed is a definition for "sane" in this context.  Trying
> to limit the kind of actions performed by ioctls is not "sane".  Then
> people will always revert back to old ioctl.  "Sane" could be: network
> transparent, architecture independant, usable with generic tools and non
> C-like languages.

/me points to UNIX-like OS that had done that. BTW, network-transparent means
"no pointers"...


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:27                 ` Linus Torvalds
@ 2001-05-20 19:33                   ` Alexander Viro
  2001-05-20 19:38                     ` Linus Torvalds
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Matthew Wilcox, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Linus Torvalds wrote:

> > How about moratorium on new ioctls in the meanwhile? Whatever we do in
> > fs/ioctl.c, it _will_ take time.
> 
> Ehh.. Telling people "don't do that" simply doesn't work. Not if they can
> do it easily anyway. Things really don't get fixed unless people have a
> certain pain-level to induce it to get fixed.

Umm... How about the following:  you hit delete on patches that introduce
new ioctls, I help to provide required level of pain.  Deal?

BTW, -pre4 got new bunch of ioctls. On procfs, no less.



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:24                   ` Alexander Viro
@ 2001-05-20 19:34                     ` Linus Torvalds
  0 siblings, 0 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 19:34 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Matthew Wilcox, David Woodhouse, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


On Sun, 20 May 2001, Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Matthew Wilcox wrote:
> 
> > On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
> > > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?
> > 
> > No.
> > 
> > $ ls -l arch/*/kernel/ioctl32*.c
> > -rw-r--r--    1 willy    willy       22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c
> > -rw-r--r--    1 willy    willy      109475 May 18 16:39 arch/parisc/kernel/ioctl32.c
> > -rw-r--r--    1 willy    willy      117605 Feb  1 20:35 arch/sparc64/kernel/ioctl32.c
> > 
> > only about 100k.
> 
> You are missing all x86-only drivers.

Now, the point is that it _is_ doable, and by doing it in one standard
place (instead of letting each architecture fight it on its own) we'd
expose the problem better, and maybe get rid of some of those
architecture-specific ones.

For example, right now the fact that part of the work _has_ been done by
things like Sparc64 has not actually had any advantages: the sparc64 work
has not allowed people to say "let's try to merge this work", because it
has not been globally relevant, and a sparc64-only file has not been a
single point of contact that could be used to clean up things.

In contrast, a generic file has the possibility of creating new VFS or
device-level interfaces. You can catch block device ioctl's and turn them
into proper block device requests - and send them down the right request
queue. Suddenly a block device driver doesn't just get READ/WRITE
requests, it gets EJECT/SERIALIZE requests too. Without having to add
magic ioctl's that are specific to just one device driver. 

So by having a common point of access, you can actually encourage _fixing_
some of the problems. Historically, sparc64 etc have not been able to do
that - they can only try to convert different ioctl's into another format
and then re-submit them.

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:33                   ` Alexander Viro
@ 2001-05-20 19:38                     ` Linus Torvalds
  0 siblings, 0 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-20 19:38 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David Woodhouse, Matthew Wilcox, Richard Gooch, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel,
	David S. Miller


Davem, check the last thing, please.

On Sun, 20 May 2001, Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Linus Torvalds wrote:
> 
> > > How about moratorium on new ioctls in the meanwhile? Whatever we do in
> > > fs/ioctl.c, it _will_ take time.
> > 
> > Ehh.. Telling people "don't do that" simply doesn't work. Not if they can
> > do it easily anyway. Things really don't get fixed unless people have a
> > certain pain-level to induce it to get fixed.
> 
> Umm... How about the following:  you hit delete on patches that introduce
> new ioctls, I help to provide required level of pain.  Deal?

It still doesn't work.

That only makes people complain about my fascist tendencies. See the
thread about device numbers, where Alan just says "ok, I'll do it without
Linus then". 

The whole point of open source is that I don't have that kind of power. I
can only guide, but the most powerful guide is by guiding the _design_,
not micro-managing.

> BTW, -pre4 got new bunch of ioctls. On procfs, no less.

I know. David has zero taste. 

Davem, why didn't you just make new entries in /proc/bus/pci and let
people do "mmap(/proc/bus/pci/xxxx/mem)" instead of having idiotic ioctl's
to set "this is an IO handle" and "this is a MEM handle"? This particular
braindamage is not too late to fix..

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD
  2001-05-20  1:03         ` Jeff Garzik
@ 2001-05-20 19:41           ` Alan Cox
  2001-05-21  9:45           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Andrew Clausen
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 161+ messages in thread
From: Alan Cox @ 2001-05-20 19:41 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Alexander Viro, Edgar Toernig, Ben LaHaise,
	linux-kernel, linux-fsdevel

> Why are LVM and EVMS(competing LVM project) needed at all?

I prefer to think of it the other way around

> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink

How about 'partitions are an inferior legacy form of LVM'


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:10                       ` Linus Torvalds
@ 2001-05-20 19:42                         ` Alexander Viro
  2001-05-20 20:07                           ` Alan Cox
                                             ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King, Richard Gooch, Matthew Wilcox, Alan Cox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Linus Torvalds wrote:

> No, but the point is, everybody _would_ consider it a bug if a
> low-level driver "write()" did anything but touched the explicit buffer.
> 
> Code like that would not pass through anybody's yuck-o-meter. People would
> point fingers and say "That is not a legal write() function". Anybody who
> tried to make write() follow pointers would be laughed at as a stupid git.

Linus, as much as I'd like to agree with you, you are a hopeless optimist.
90% of drivers contain code written by stupid gits.
 
> Anybody who makes "ioctl()" do the same is just following years of
> standard practice, and the yuck-o-meter doesn't even register.

Nobody reads the drivers. Because otherwise yuck-o-meters would go off-scale.

How about yuck value of the
	* removing a file by writing "-1" into it?
	* mkdir() populating directory.
	* unlink() not working in said directory.
	* rmdir() happily removing it. And freeing all associated structures.
Opened files? What opened files? Whaddya mean, "oops"?

How about sprintf(s + strlen(s), foo)?

How about a collection of b0rken strtoul() implementations? Including one
that contains
	switch (...) {
		case 48:
		case 49:
	(all 22 cases)

How about declaring global array and comparing it with NULL?

How about the whole binfmt_misc.c?

Ehh...

Linus, I've been doing exactly that (reading through the large parts of
tree) and trust me, yuck-o-meter was off-scale almost permanently. Level
of idiocy in the obvious bugs is such that I bet you anything that code
had never been really read through by anyone who knew C.

I would love it if more people actually cared to read the fscking code.
Too few are doing that.

And yes, it's a psychological problem, not a technical one. Oh, well...

Sorry about the rant - I've just spent a couple of hours wading through
the piles of excrements in drivers/*. Ouch.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:02             ` Linus Torvalds
  2001-05-20 19:11               ` Alexander Viro
@ 2001-05-20 19:57               ` David Woodhouse
  2001-05-21 13:57               ` Ingo Oeser
  2 siblings, 0 replies; 161+ messages in thread
From: David Woodhouse @ 2001-05-20 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Richard Gooch, Alan Cox, Alexander Viro,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel


torvalds@transmeta.com said:
> Now, a good way to force the issue may be to just remove the "ioctl"
> function pointer from the file operations structure altogether. We
> don't have to force people to use "read/write" - we can just make it
> clear that ioctl's _have_ to be wrapped, and that the only ioctl's
> that are acceptable are the ones that are well-designed enough to be
> wrappable. So we'd have a "linux/fs/ioctl.c" that would do all the
> wrapping,

I have so far resisted adding an 'ioctl' method to the MTD structure. Yet
because userspace needs to be able to request an erase, request information
about the erasesize of the device, etc., I've got an ioctl wrapper much as
you describe in drivers/mtd/mtdchar.c. It calls _real_ functions like 
->erase() in the underlying MTD device, which can't easily be exposed to
userspace (unless we do something silly like using CORBA :)
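
A rough sketch of that wrapper arrangement in plain C. This is not the
code in drivers/mtd/mtdchar.c; struct demo_mtd, the DEMO_* command numbers
and demo_chardev_ioctl() are hypothetical names, assumed only to show a
character-device wrapper calling real methods on the underlying device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device structure: real methods, no ioctl method. */
struct demo_mtd {
	unsigned long erasesize;
	int (*erase)(struct demo_mtd *mtd, unsigned long addr, unsigned long len);
	unsigned long erased_bytes;	/* bookkeeping for the demo only */
};

/* Hypothetical command numbers and argument block. */
enum { DEMO_MEMGETERASESIZE = 1, DEMO_MEMERASE = 2 };

struct demo_erase_args {
	unsigned long start;
	unsigned long length;
};

/* The wrapper: the only place that knows about command numbers. */
int demo_chardev_ioctl(struct demo_mtd *mtd, unsigned int cmd, void *arg)
{
	assert(mtd != NULL);
	switch (cmd) {
	case DEMO_MEMGETERASESIZE:
		if (!arg)
			return -1;
		*(unsigned long *)arg = mtd->erasesize;
		return 0;
	case DEMO_MEMERASE: {
		struct demo_erase_args *ea = arg;

		if (!ea || !mtd->erase)
			return -1;
		return mtd->erase(mtd, ea->start, ea->length);
	}
	default:
		return -1;
	}
}

/* A demo ->erase() method: accepts only erase-block-aligned ranges. */
int demo_erase(struct demo_mtd *mtd, unsigned long addr, unsigned long len)
{
	if (mtd->erasesize == 0 || len == 0)
		return -1;
	if (addr % mtd->erasesize || len % mtd->erasesize)
		return -1;
	mtd->erased_bytes += len;
	return 0;
}
```

The underlying device never sees a command number here; it only exposes
typed methods, and the wrapper is the single translation point.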

I can see the advantage of doing what you suggest - add methods to the
struct block_device for the sensible things like HDIO_GETGEO, BLKGETSIZE,
etc. (and anyone suggesting that it's sensible to have the physical block
device driver at all involved in BLKRRPART shall be summarily shot).

But please don't _actually_ put all the ioctl wrappers in fs/ioctl.c. It'd 
be a nightmare for the maintainers of the various sections of it. 

Besides, what on earth does it have to do with file systems?

Maybe abi/ioctl/{blkdev,mtd,usb,scsi,...}.c ?

Having it outside the directories which are traditionally owned by the 
respective subsystem maintainers means that it's far easier to be fascist 
about what's added, too.

On a related note - I was actually beginning to consider a dev-private ioctl
for MTD devices, actually for reasons of taste - some stuff like turning 
on/off the automatic hardware ECC on the DiskOnChip devices I consider ugly
enough that I didn't want to deal with it in generic code. At least a
dev-private ioctl seemed like it would banish the ugliness into the
offending driver, and be vaguely reusable if any other device turned out to
require such ugliness.

--
dwmw2



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:42                         ` Alexander Viro
@ 2001-05-20 20:07                           ` Alan Cox
  2001-05-20 20:33                             ` Alexander Viro
  2001-05-20 23:59                             ` Paul Fulghum
  2001-05-20 20:07                           ` Alan Cox
  2001-05-20 23:46                           ` Ingo Molnar
  2 siblings, 2 replies; 161+ messages in thread
From: Alan Cox @ 2001-05-20 20:07 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Russell King, Richard Gooch, Matthew Wilcox,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel

> Linus, as much as I'd like to agree with you, you are a hopeless optimist.
> 90% of drivers contain code written by stupid gits.

I think that's a very arrogant and very mistaken view of the problem. 90%
of the drivers are written by people who are

	-	Copying from other drivers
	-	Using the existing API's to make their job easy
	-	Working to timescales
	-	Just want it to work

So if you take ioctl away from them they will implement ioctl emulation by
writing ioctl structs to an fd.

If you want to make these things work well you have to provide a good working
infrastructure. You don't change anything (except the maintainer) by causing
pain. Instead you provide the mechanisms - the generic parsing code so that
people don't screw up on procfs parsing - the generic ioctl alternatives etc.

Ditto with the major numbers. You win that battle by getting enough people to
believe it is the right answer that they write the nice code for managing 
resources and naming assignment - which is already beginning to occur. Then
even if I'm still maintaining a major number list in 2 years nobody can quite
remember why, and people are heard murmuring 'You should have tried Linux two
years ago, you had to actually make device files yourself sometimes'

Alan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:42                         ` Alexander Viro
  2001-05-20 20:07                           ` Alan Cox
@ 2001-05-20 20:07                           ` Alan Cox
  2001-05-20 23:46                           ` Ingo Molnar
  2 siblings, 0 replies; 161+ messages in thread
From: Alan Cox @ 2001-05-20 20:07 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Russell King, Richard Gooch, Matthew Wilcox,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel

> How about sprintf(s + strlen(s), foo)?

Solar Designer said two years ago we should be using snprintf in the kernel.
He was most decidedly right 8)
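
A small sketch of the bounded alternative, in standard C. The helper
name append_bounded() is hypothetical. It assumes the objection to
sprintf(s + strlen(s), foo) is twofold: foo is used as the format string,
and nothing limits how much is written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append src to the NUL-terminated string in buf, which has room for
 * cap bytes in total. Returns 0 on success, or -1 if the result would
 * not fit, in which case buf is left as it was.
 */
int append_bounded(char *buf, size_t cap, const char *src)
{
	assert(buf != NULL && src != NULL);

	size_t used = strlen(buf);
	if (used >= cap)
		return -1;

	/* "%s" keeps src from ever being interpreted as a format string. */
	int n = snprintf(buf + used, cap - used, "%s", src);
	if (n < 0 || (size_t)n >= cap - used) {
		buf[used] = '\0';	/* undo the truncated append */
		return -1;
	}
	return 0;
}
```

snprintf() returns the length the full output would have had, so a
return value of at least the remaining space means truncation.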

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
                     ` (2 preceding siblings ...)
  2001-05-19 23:52   ` Edgar Toernig
@ 2001-05-20 20:23   ` Pavel Machek
  2001-05-21 20:38     ` Alexander Viro
  3 siblings, 1 reply; 161+ messages in thread
From: Pavel Machek @ 2001-05-20 20:23 UTC (permalink / raw)
  To: Alexander Viro, Ben LaHaise; +Cc: torvalds, linux-kernel, linux-fsdevel

Hi!

> A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
> no-op. Breaking that assumption is a Bad Thing(tm).

Then we have a problem. Just opening /dev/ttyS0 currently *has* side
effects (it is visible on modem lines from serial port; it can block
you forever). 

If this assumption is somewhere, we should fix that place... Or fix
serial ports.
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 20:07                           ` Alan Cox
@ 2001-05-20 20:33                             ` Alexander Viro
  2001-05-20 23:59                             ` Paul Fulghum
  1 sibling, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-20 20:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Russell King, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Alan Cox wrote:

> > Linus, as much as I'd like to agree with you, you are a hopeless optimist.
> > 90% of drivers contain code written by stupid gits.
                   ^^^^^^^
> 
> I think that's a very arrogant and very mistaken view of the problem. 90%
> of the drivers are written by people who are

written by != contain code written by. Stuff initially written by sane
people tends to get all sorts of crap into it. Unfortunately.

The problem being: very few people actually read the code in drivers/*.
And crap accumulates. The messier it is, the faster it gets shitted.

So relying on the people finding crappy ->write() instances and ridiculing
the authors in public is... well, somewhat naive. There's more than enough
crap already and that simply doesn't happen. It _can_ be done, but it will
take more than just having the code sitting there.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-19 16:01     ` Willem Konynenberg
@ 2001-05-20 20:52       ` Pavel Machek
  2001-05-20 20:53       ` Pavel Machek
  1 sibling, 0 replies; 161+ messages in thread
From: Pavel Machek @ 2001-05-20 20:52 UTC (permalink / raw)
  To: Willem Konynenberg, Abramo Bagnara; +Cc: linux-kernel, linux-fsdevel

Hi!

> Yes, and that is exactly the difference between having a side effect
> on the open(2), versus having the effect as a result of a write(2).
> 
> Unfortunately, there are already some cases where an open
> on a device can have unexpected results.  If you don't want
> to get blocked waiting for the carrier-detect signal from the
> modem when opening a tty device, you had better specify the
> O_NONBLOCK option on the open.  If you don't want this flag
> to be active during the actual I/O operations, then you would
> have to do an fcntl to clear the O_NONBLOCK again after the open.
> 
> So I guess things have already been a bit messy in this
> area for many years, even before linux even existed, and
> in some cases you can't really do anything about it because
> the behaviour is mandated by the applicable standards, like
> POSIX, SUS, or whatever.
> (The blocking of the open on a tty device is explicitly
>  documented in my copy of the X/Open specification.)
> 
> Fortunately, blocking the nightly backup program by making it
> accidentally open a tty is not quite as catastrophic as having
> it start a nuclear war, or format the disks, or something,
> just because a user was playing games with symlinks.

Maybe not *as* catastrophic, but a security hole anyway. A user should
not be able to block system backups.
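
The O_NONBLOCK workaround described in the quoted text can be sketched
roughly as follows, using only POSIX calls. The helper name
open_then_block() is hypothetical, and adding O_NOCTTY is an assumption
of this sketch rather than something the quoted text asks for.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path without letting open() itself block (for example waiting
 * for carrier on a tty), then clear O_NONBLOCK so that later read()
 * and write() calls block normally. Returns the fd, or -1 on error.
 */
int open_then_block(const char *path, int access_mode)
{
	assert(path != NULL);

	/* O_NOCTTY: do not let the device become our controlling tty. */
	int fd = open(path, access_mode | O_NONBLOCK | O_NOCTTY);
	if (fd < 0)
		return -1;

	int flags = fcntl(fd, F_GETFL);
	if (flags < 0 || fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

This only helps a program that knows in advance it may be handed a
device node; a backup tool walking arbitrary paths would have to apply
it to every open.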

Small demonstration for bugtraq, anyone?
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-19 16:01     ` Willem Konynenberg
  2001-05-20 20:52       ` Pavel Machek
@ 2001-05-20 20:53       ` Pavel Machek
  1 sibling, 0 replies; 161+ messages in thread
From: Pavel Machek @ 2001-05-20 20:53 UTC (permalink / raw)
  To: Willem Konynenberg, Abramo Bagnara; +Cc: linux-kernel, linux-fsdevel

Hi!

> So I guess things have already been a bit messy in this
> area for many years, even before linux even existed, and
> in some cases you can't really do anything about it because
> the behaviour is mandated by the applicable standards, like
> POSIX, SUS, or whatever.
> (The blocking of the open on a tty device is explicitly
>  documented in my copy of the X/Open specification.)

If X/Open documents a security hole, then, I guess, X/Open will have to
be changed.
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20  2:51             ` Richard Gooch
@ 2001-05-20 21:13               ` Pavel Machek
  2001-05-21 20:20                 ` Alan Cox
  0 siblings, 1 reply; 161+ messages in thread
From: Pavel Machek @ 2001-05-20 21:13 UTC (permalink / raw)
  To: Richard Gooch, Alexander Viro
  Cc: Matthew Wilcox, Alan Cox, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

Hi!

> > > The transaction(2) syscall can be just as easily abused as ioctl(2) in
> > > this respect. People can pass pointers to ill-designed structures very
> > 
> > Right. Moreover, it's not needed. The same functionality can be
> > trivially implemented by write() and read(). As the matter of fact,
> > had been done in userland context for decades. Go and buy
> > Stevens. Read it. Then come back.
> 
> I don't need to read it. Don't be insulting. Sure, you *can* use a
> write(2)/read(2) cycle. But that's two syscalls compared to one with
> ioctl(2) or transaction(2). That can matter to some applications.

I just don't think so. Where did you see performance-critical use of
ioctl()?
							       Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:42                         ` Alexander Viro
  2001-05-20 20:07                           ` Alan Cox
  2001-05-20 20:07                           ` Alan Cox
@ 2001-05-20 23:46                           ` Ingo Molnar
  2001-05-21  0:32                             ` Alexander Viro
  2001-05-21  3:12                             ` Linus Torvalds
  2 siblings, 2 replies; 161+ messages in thread
From: Ingo Molnar @ 2001-05-20 23:46 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Russell King, Richard Gooch, Matthew Wilcox,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel


On Sun, 20 May 2001, Alexander Viro wrote:

> Linus, as much as I'd like to agree with you, you are a hopeless
> optimist. 90% of drivers contain code written by stupid gits.

90% of drivers contain code written by people who do driver development in
their spare time, with limited resources, most of the time serving as a
learning exercise. And they do this freely and for fun. Accusing them of
being 'stupid gits' is just mischaracterising the situation. People do not
get born as VFS hackers, there is a very steep learning curve, and only a
few make it to have knowledge like you. Much of the learning curve of
various people has traces in drivers/*, it's more like the history of
Linux than some coherent image of people's capabilities.

	Ingo


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 20:07                           ` Alan Cox
  2001-05-20 20:33                             ` Alexander Viro
@ 2001-05-20 23:59                             ` Paul Fulghum
  2001-05-21  0:36                               ` Alexander Viro
  1 sibling, 1 reply; 161+ messages in thread
From: Paul Fulghum @ 2001-05-20 23:59 UTC (permalink / raw)
  To: linux-kernel

>> 90% of drivers contain code written by stupid gits.
>
> From: "Alan Cox"
> I think that's a very arrogant and very mistaken view of the problem. 90%
> of the drivers are written by people who are
> 
> - Copying from other drivers
> - Using the existing API's to make their job easy
> - Working to timescales
> - Just want it to work

I'll be the first to admit there is some ugliness in my driver.

Some originates from accepted methods when the
driver originated. (points 1 and 2 above)

Some comes from doing new things with only the
existing infrastructure, because changing the infrastructure
is deemed too intrusive. (points 3 and 4 above)
Stable infrastructure is good, but sometimes ugliness results.

Some is the result of genuine mistakes (people who
have written nothing but perfect code flame away).
I fix these as they are found through use and review,
and the code improves. (I *really do* want my driver to work!)

As new facilities and guidelines are made available,
I *gladly* and *gratefully* use them, and the code improves.

Calling driver writers stupid and devising punitive measures
to 'cause them pain' seems less useful.

Paul Fulghum paulkf@microgate.com
Microgate Corporation http://www.microgate.com



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 23:46                           ` Ingo Molnar
@ 2001-05-21  0:32                             ` Alexander Viro
  2001-05-21  3:12                             ` Linus Torvalds
  1 sibling, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-21  0:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Russell King, Richard Gooch, Matthew Wilcox,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel



On Mon, 21 May 2001, Ingo Molnar wrote:

> 
> On Sun, 20 May 2001, Alexander Viro wrote:
> 
> > Linus, as much as I'd like to agree with you, you are a hopeless
> > optimist. 90% of drivers contain code written by stupid gits.
> 
> 90% of drivers contain code written by people who do driver development in
> their spare time, with limited resources, most of the time serving as a
> learning exercise. And they do this freely and for fun. Accusing them of

Probably 100% of drivers contain code from more than one author.

> being 'stupid gits' is just mischaracterising the situation. People do not
> get born as VFS hackers, there is a very steep learning curve, and only a
> few make it to have knowledge like you. Much of the learning curve of
> various people has traces in drivers/*, it's more like the history of
> Linux than some coherent image of people's capabilities.

Grrr... Ingo, could you read what I said? I'm not talking about problems
coming from lack of knowledge about the kernel. I'm not saying that authors
of drivers are stupid gits (in the cases when they in all probability are
such they are usually anonymous - FUBAR Acme Inc. is all you see). I'm
not saying that 90% of code in drivers is crap.

What I am saying is that in a lot of drivers you can find code that is
the result of plain and simple lack of knowledge about basics of C. And I mean
the basics, not the nontrivial parts.

"Oh, look, I don't know C, here's that project, let's write something and
submit the patch" looks pretty stupid to me.

I'm not talking about bugs. I'm not talking about stupid interfaces.
I'm not talking about typos. I'm not talking about people doing strlen()
on arrays that came from unverified source. I'm talking about the code
that was obviously written by somebody who considers C as voodoo.

The things Linus referred to pale on that background. On the bogosity
scale we have a lot of code that is way higher. Since it manages to
stay unnoticed for years...

And no, I don't think that it's an arrogance. BTW, I don't know who
the authors of these pieces are. I know that problems they had could
be cured by reading any book on C (K&R, Bolsky, whatever) and considering
how long some of that stuff had been in the tree... Well, doesn't speak
highly of the intellect of those who'd written it.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 23:59                             ` Paul Fulghum
@ 2001-05-21  0:36                               ` Alexander Viro
  2001-05-21  3:08                                 ` Paul Fulghum
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-21  0:36 UTC (permalink / raw)
  To: Paul Fulghum; +Cc: linux-kernel



On Sun, 20 May 2001, Paul Fulghum wrote:

> I'll be the first to admit there is some ugliness in my driver.

So will anyone here regarding his or her code. Count me in, BTW.

Could you reread the posting you are referring to?


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21  0:36                               ` Alexander Viro
@ 2001-05-21  3:08                                 ` Paul Fulghum
  0 siblings, 0 replies; 161+ messages in thread
From: Paul Fulghum @ 2001-05-21  3:08 UTC (permalink / raw)
  To: linux-kernel

From: "Alexander Viro" <viro@math.psu.edu>
> On Sun, 20 May 2001, Paul Fulghum wrote:
> > I'll be the first to admit there is some ugliness in my driver.
>
> So will anyone here regarding his or her code. Count me in, BTW.
>
> Could you reread the posting you are referring to?

Sorry if I misunderstood. My post was as much in
response to several current threads revolving around
device major numbers and ioctl calls (I use both!).

Many postings seem to imply driver writers must be flawed for
using these flawed facilities. Driver writers don't use device
major numbers and ioctl calls because they are brain damaged, they use
them because they are accepted practice and they work (albeit imperfectly).

I have no problem moving to better solutions *as they become available*.

But I have seen multiple references to 'causing pain' for people
by restricting their use while alternatives (only now being discussed
and decided) are years away in the next stable kernel.

All I hope for is a reasonable path to get there (better alternatives)
from here.
My 2 cents, with no intent to offend anyone.

Paul Fulghum paulkf@microgate.com
Microgate Corporation http://www.microgate.com






^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 23:46                           ` Ingo Molnar
  2001-05-21  0:32                             ` Alexander Viro
@ 2001-05-21  3:12                             ` Linus Torvalds
  2001-05-21 19:32                               ` Kai Henningsen
  2001-05-23  1:15                               ` Albert D. Cahalan
  1 sibling, 2 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-21  3:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexander Viro, Russell King, Richard Gooch, Matthew Wilcox,
	Alan Cox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel


On Mon, 21 May 2001, Ingo Molnar wrote:
> 
> On Sun, 20 May 2001, Alexander Viro wrote:
> 
> > Linus, as much as I'd like to agree with you, you are a hopeless
> > optimist. 90% of drivers contain code written by stupid gits.
> 
> 90% of drivers contain code written by people who do driver development in
> their spare time, with limited resources, most of the time serving as a
> learning exercise. And they do this freely and for fun. Accusing them of
> being 'stupid gits' is just mischaracterising the situation.

I would disagree with both of you.

The problem is not whether people do it with limited resources or time, or
whether they are stupid or not.

The problem is that if you expect to get nice code, you have to have nice
interfaces and infrastructure. And ioctl's aren't it.

The reason we _can_ write beautiful filesystems these days is that the VFS
layer _supports_ it. In fact, the VFS layer has tons of infrastructure and
structure that makes it _hard_ to write bad filesystem code (which is not
to say that we don't have ugly code there - but much of it is due to
historically not having had quite the same level of infrastructure).

If we had nice infrastructure to make ioctl's more palatable, we could
probably make do even with the current binary-number interfaces, simply
because people would use the infrastructure without ever even _seeing_ how
lacking the user-level accesses are.

But that absolutely _requires_ that the driver writers should never see
the silly "pass a random number and a random argument type" kind of
interface with no structure or infrastructure in place.

Because right now even _good_ programmers make a mess of the fact that
they get passed a bad interface.

Think of it this way: the user interface to opening a file is
"open()" with pathnames and magic flags. But a filesystem never even
_sees_ that interface, it sees a very nicely structured setup where all
the argument parsing and locking has already been done for it, and the
magic flags don't even exist any more as far as the low-level FS is
concerned. Which is why filesystems _can_ be clean.

In contrast, ioctl's are passed through directly, with no help to make
them clean. 
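
A toy illustration of that layering, in plain C rather than real VFS
code. struct demo_open_req, struct demo_fs_ops and demo_generic_open()
are hypothetical names; the point is only that the generic layer decodes
the raw flag word once and the low-level code receives typed fields.
Rejecting O_TRUNC without write access is a choice made for this sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* What the hypothetical low-level filesystem sees: no raw flag word. */
struct demo_open_req {
	int want_read;
	int want_write;
	int truncate;
};

struct demo_fs_ops {
	int (*open)(const struct demo_open_req *req);
};

/* Generic layer: validate and decode the user-visible flags once. */
int demo_generic_open(const struct demo_fs_ops *ops, int flags)
{
	assert(ops != NULL && ops->open != NULL);

	struct demo_open_req req = { 0, 0, 0 };

	switch (flags & O_ACCMODE) {
	case O_RDONLY:
		req.want_read = 1;
		break;
	case O_WRONLY:
		req.want_write = 1;
		break;
	case O_RDWR:
		req.want_read = 1;
		req.want_write = 1;
		break;
	default:
		return -1;
	}
	if (flags & O_TRUNC) {
		if (!req.want_write)
			return -1;	/* rejected in this sketch */
		req.truncate = 1;
	}
	return ops->open(&req);
}

/* A demo low-level open: encodes what it was asked for as a bitmask. */
int demo_fs_open(const struct demo_open_req *req)
{
	return req->want_read + 2 * req->want_write + 4 * req->truncate;
}
```

An ioctl handler gets no such help: the command number and the argument
reach the driver exactly as userspace passed them.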

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re:  [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-19 14:25   ` Daniel Phillips
@ 2001-05-21  8:14     ` Lars Marowsky-Bree
  2001-05-22  9:07       ` Daniel Phillips
  0 siblings, 1 reply; 161+ messages in thread
From: Lars Marowsky-Bree @ 2001-05-21  8:14 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Eric W. Biederman, Ben LaHaise, torvalds, viro, linux-kernel,
	linux-fsdevel

On 2001-05-19T16:25:47,
   Daniel Phillips <phillips@bonn-fries.net> said:

> How about:
> 
>   # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
>   # ls /dev/mypartition
>   base	size	device	type
>   # cat /dev/mypartition/size
>   1048576
>   # cat /dev/mypartition/device
>   /dev/sda
>   # mke2fs /dev/mypartition

Ek. You want to run mke2fs on a _directory_ ?

If anything, /dev/mypartition/realdev

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Perfection is our goal, excellence will be tolerated. -- J. Yahl


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  1:03         ` Jeff Garzik
  2001-05-20 19:41           ` Why side-effects on open(2) are evil. (was Re: [RFD Alan Cox
@ 2001-05-21  9:45           ` Andrew Clausen
  2001-05-21 17:22           ` Oliver Xymoron
  2001-05-22 18:53           ` Andreas Dilger
  3 siblings, 0 replies; 161+ messages in thread
From: Andrew Clausen @ 2001-05-21  9:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Alexander Viro, Edgar Toernig, Ben LaHaise,
	linux-kernel, linux-fsdevel

Jeff Garzik wrote:
> 
> Here's a dumb question, and I apologize if I am questioning computer
> science dogma...
> 
> Why are LVM and EVMS(competing LVM project) needed at all?

EVMS and LVM aren't really competing projects, BTW.  EVMS is
"competing" more with MD.  EVMS will probably use LVM.  (I have
been "out of it" for a month, damned uni assignments...!)
 
> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink

This last one has little to do with LVM/EVMS.  (it's largely
the same for partitions)  The only difference is you don't need
to handle the resize-the-start case (see below)

> On-line optimization (defrag, etc) shouldn't be hard once you have the
> ability to move blocks and files around, which would come with the
> ability to grow and shrink blkdevs and fs's.

(1) traditional partition implementations tend to have bad
implementations (small static limits on # partitions, etc.)
In other words, partition tables weren't designed for lots
of partitions, which is useful.  (For example, when you expand
a logical volume, you don't need partitions to be "next to each
other"... but the cost is you need to create another partition.
Existing partition table formats tend to starve you)

(2) layering MD on top of partitions means it's impossible to
get redundancy (across disks) on partition table metadata.  So,
if you lose your partition table on one disk, that makes that
whole disk useless.

(3) probably not a good reason: the tools to manage LVM
are more convenient than maintaining partition tables +
MD.

Andrew Clausen

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 19:02             ` Linus Torvalds
  2001-05-20 19:11               ` Alexander Viro
  2001-05-20 19:57               ` David Woodhouse
@ 2001-05-21 13:57               ` Ingo Oeser
  2 siblings, 0 replies; 161+ messages in thread
From: Ingo Oeser @ 2001-05-21 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Matthew Wilcox, Richard Gooch, Alan Cox,
	Alexander Viro, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel

On Sun, May 20, 2001 at 12:02:35PM -0700, Linus Torvalds wrote:
> The problem with ioctl's is, let me repeat, not technology.
> 
> It's people.
> 
> ioctl's are a way to do ugly things. That's what they have ALWAYS been.
> And because of that, people don't care about following the rules - if
> ioctl's followed the rules, they wouldn't _be_ ioctls in the first place,
> but instead have a good interface (say, read()/write()).
> 
> Basically, ioctl's will _never_ be done right, because of the way people
> think about them. They are a back door. They are by design typeless and
> without rules. They are, in fact, the Microsoft of UNIX.
 
Yes, they are. Why? Because we cannot fit all the behavior of a
device _cleanly_ into read/write/mmap/lseek.

If we did, we would need different device views (which implies
aliasing of devices, which HPA does not like) and it would
still not be that clean, because reading from readonly gives a
stream and writing gives a stream too, with no particular order
required until now.

[good points]

> Would fs/ioctl.c be an ugly mess of some special cases? Yes. But would
> that make the ugliness explicit and possibly easier to try to manage and
> fix? Very probably. And it would mean that driver writers could not just
> say "fuck design, I'm going to do this my own really ugly way". 

OK, then let me give you a real-world example where I have been
idly fighting with design for nearly 2 years.

A freely programmable DSP (or set of DSPs) with several kinds of
memory and additional optional devices (like DAC/ADC, ISDN frames
and something like that) on it. This DSP is attached via some glue
logic to the parallel port, PCI, ISA or (soon to come) USB.

This thingie can (once programmed) act as a data sink, data
source or data processing pipe.

OTOH it should be randomly accessible via debuggers and program
loaders. It is also resettable/rebootable, has discontinuous
memory of certain kinds (possibly Harvard architecture) and much
more funny stuff. And it needs to upload software.

I am trying to unify all of this into a "Generic Processing Device
Layer" for Linux.

Now I would like to be shown how I should fit this into a clean
design that:

   - uses NO ioctls (Linus)
   - has only one device per DSP (H.P.A)
   - Does not emulate ioctls via read/write transactions (which I
     consider bogus)

Theory is nice, but until someone can show me a clean design for
this (admittedly heavy ;-)) example, I just don't buy your
arguments. 

A *better* ioctl would be nice, but we still need a "catch all
exceptional accesses" interface, IMNSHO.


Regards

Ingo Oeser
-- 
To the systems programmer,
users and applications serve only to provide a test load.


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-21 17:16           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Oliver Xymoron
@ 2001-05-21 16:26             ` David Lang
  2001-05-21 18:04               ` Oliver Xymoron
  2001-05-21 20:14             ` Daniel Phillips
  1 sibling, 1 reply; 161+ messages in thread
From: David Lang @ 2001-05-21 16:26 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

what makes you think it's safe to say there's only one floppy drive?

David Lang

On Mon, 21 May 2001, Oliver Xymoron wrote:

> On Sat, 19 May 2001, Alexander Viro wrote:
>
> > Let's distinguish between per-fd effects (that's what name in
> > open(name, flags) is for - you are asking for descriptor and telling
> > what behaviour do you want for IO on it) and system-wide side effects.
> >
> > IMO encoding the former into name is perfectly fine, and no write on
> > another file can be sanely used for that purpose. For the latter, though,
> > we need to write commands into files and here your miscdevices (or procfs
> > files, or /dev/foo/ctl - whatever) is needed.
>
> I'm a little skeptical about the necessity of these per-fd effects in the
> first place - after all, Plan 9 does without them.  There's only one
> floppy drive, yes? No concurrent users of serial ports? The counter that
> comes to mind is sound devices supporting multiple opens, but I think
> esound and friends are a better solution to that problem.
>
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost always
> two: raw and ctl) and their legacy device numbers with a single userspace
> callout that does whatever /dev/ creation needs to be done. Thus, naming
> and permissions live in user space. No "device node is also a directory"
> weirdness which is overkill in the vast majority of cases. No kernel names
> or permissions leaking into userspace.
>
> - An unregister_devices that does the same, giving userspace a
> chance to persist permissions, etc.
>
> - A userspace program that keeps a mapping of kernel names to /dev/ names,
> permissions, etc.
>
> - An autofs hook that does the reverse mapping for running with modules
> (possibly calling modprobe directly)
>
> Possible future extension:
>
> - Allow exporting proc as a large collection of devices. Manage /proc in
> userspace on a tmpfs.
>
> --
>  "Love the dolphins," she advised him. "Write by W.A.S.T.E.."
>


* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-19 23:39         ` Alexander Viro
  2001-05-20 15:47           ` F_CTRLFD (was Re: Why side-effects on open(2) are evil.) Edgar Toernig
@ 2001-05-21 17:16           ` Oliver Xymoron
  2001-05-21 16:26             ` David Lang
  2001-05-21 20:14             ` Daniel Phillips
  1 sibling, 2 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-21 17:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, 19 May 2001, Alexander Viro wrote:

> Let's distinguish between per-fd effects (that's what name in
> open(name, flags) is for - you are asking for descriptor and telling
> what behaviour do you want for IO on it) and system-wide side effects.
>
> IMO encoding the former into name is perfectly fine, and no write on
> another file can be sanely used for that purpose. For the latter, though,
> we need to write commands into files and here your miscdevices (or procfs
> files, or /dev/foo/ctl - whatever) is needed.

I'm a little skeptical about the necessity of these per-fd effects in the
first place - after all, Plan 9 does without them.  There's only one
floppy drive, yes? No concurrent users of serial ports? The counter that
comes to mind is sound devices supporting multiple opens, but I think
esound and friends are a better solution to that problem.

What I'd like to see:

- An interface for registering an array of related devices (almost always
two: raw and ctl) and their legacy device numbers with a single userspace
callout that does whatever /dev/ creation needs to be done. Thus, naming
and permissions live in user space. No "device node is also a directory"
weirdness which is overkill in the vast majority of cases. No kernel names
or permissions leaking into userspace.

- An unregister_devices that does the same, giving userspace a
chance to persist permissions, etc.

- A userspace program that keeps a mapping of kernel names to /dev/ names,
permissions, etc.

- An autofs hook that does the reverse mapping for running with modules
(possibly calling modprobe directly)

Possible future extension:

- Allow exporting proc as a large collection of devices. Manage /proc in
userspace on a tmpfs.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  1:03         ` Jeff Garzik
  2001-05-20 19:41           ` Why side-effects on open(2) are evil. (was Re: [RFD Alan Cox
  2001-05-21  9:45           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Andrew Clausen
@ 2001-05-21 17:22           ` Oliver Xymoron
  2001-05-22 18:53           ` Andreas Dilger
  3 siblings, 0 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-21 17:22 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, linux-fsdevel

On Sat, 19 May 2001, Jeff Garzik wrote:

> Why are LVM and EVMS(competing LVM project) needed at all?
>
> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink

You can migrate data off disks while the filesystems on top of them are
live. Add disk b, migrate a->b, remove disk a. Perhaps this is intrinsic
in the above somehow but I don't see it.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-21 16:26             ` David Lang
@ 2001-05-21 18:04               ` Oliver Xymoron
  0 siblings, 0 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-21 18:04 UTC (permalink / raw)
  To: David Lang; +Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Mon, 21 May 2001, David Lang wrote:

> what makes you think it's safe to say there's only one floppy drive?

Read as: it doesn't make sense to have per-fd state on a single floppy
device given that there's only one actual hardware instance associated
with it and multiple openers don't make sense. Opening a floppy at
different densities with magic filenames was an example Linus used earlier
in the thread. Surely there can be more than one drive and more than one
serial port.

> On Mon, 21 May 2001, Oliver Xymoron wrote:
>
> > On Sat, 19 May 2001, Alexander Viro wrote:
> >
> > > Let's distinguish between per-fd effects (that's what name in
> > > open(name, flags) is for - you are asking for descriptor and telling
> > > what behaviour do you want for IO on it) and system-wide side effects.
> > >
> > > IMO encoding the former into name is perfectly fine, and no write on
> > > another file can be sanely used for that purpose. For the latter, though,
> > > we need to write commands into files and here your miscdevices (or procfs
> > > files, or /dev/foo/ctl - whatever) is needed.
> >
> > I'm a little skeptical about the necessity of these per-fd effects in the
> > first place - after all, Plan 9 does without them.  There's only one
> > floppy drive, yes? No concurrent users of serial ports? The counter that
> > comes to mind is sound devices supporting multiple opens, but I think
> > esound and friends are a better solution to that problem.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21  3:12                             ` Linus Torvalds
@ 2001-05-21 19:32                               ` Kai Henningsen
  2001-05-23  1:15                               ` Albert D. Cahalan
  1 sibling, 0 replies; 161+ messages in thread
From: Kai Henningsen @ 2001-05-21 19:32 UTC (permalink / raw)
  To: torvalds; +Cc: linux-fsdevel, linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 20.05.01 in <Pine.LNX.4.21.0105202005070.8426-100000@penguin.transmeta.com>:

> If we had nice infrastructure to make ioctl's more palatable, we could
> probably make do even with the current binary-number interfaces, simply
> because people would use the infrastructure without ever even _seeing_ how
> lacking the user-level accesses are.
>
> But that absolutely _requires_ that the driver writers should never see
> the silly "pass a random number and a random argument type" kind of
> interface with no structure or infrastructure in place.

Hmm.

So would it be worthwhile to invent some infrastructure - possibly
including macros, possibly even including a (very small) code generator, I
don't really have any details clear at this point - that allows you to
specify an interface in a sane way (for example, but not necessarily, as a
C function definition, though that may be too hard to parse), and have the
infrastructure generate

1. some code to call ioctl() with these arguments
2. some other code to pick apart the ioctl buffer and call the actual
   function with these arguments

preferably so that (a) the code from 1 is suitable for use in libc or
similar places, (b) the code from 2 is suitable for the kernel, (c) most
(all would be better but may not be practical) existing ioctls could be
described that way?

(If so, the first task would obviously be to analyze existing code in  
those places, and the actual structure of existing ioctls, to find out  
what sort of stuff needs to be supported, before trying to design the  
mechanism to support it.)

A variant possibility (that I suspect you'll like significantly less)  
would be a data structure to describe the ioctl that gets interpreted at  
runtime. I think I prefer specific code for that job. At least *some*  
ioctls are in hot spots, and interpreting is slow. And that hypothetical  
encapsulation certainly should not know the difference between fast and  
slow interrupts^Wioctls.
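
[Editorial sketch, not part of the original message.] The generated-stub
idea above can be illustrated in plain C, assuming a macro is enough of a
"code generator" for a two-argument call. Everything here is hypothetical:
the macro DEFINE_IOCTL2, the set_volume example, and the direct function
call that stands in for the real ioctl() transport.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical generator macro. From one description (name plus two
 * argument types) it emits:
 *   1. name##_call():   caller-side stub that packs the arguments
 *   2. name##_unpack(): callee-side code that picks the buffer apart
 *                       and calls the typed function name##_impl()
 */
#define DEFINE_IOCTL2(name, T1, T2)                                  \
    int name##_impl(T1 a, T2 b);                                     \
    struct name##_args { T1 a; T2 b; };                              \
    static int name##_unpack(void *buf)                              \
    {                                                                \
        struct name##_args args;                                     \
        assert(buf != NULL);                                         \
        memcpy(&args, buf, sizeof(args));                            \
        return name##_impl(args.a, args.b);                          \
    }                                                                \
    static int name##_call(T1 a, T2 b)                               \
    {                                                                \
        struct name##_args args;                                     \
        unsigned char buf[sizeof(struct name##_args)];               \
        memset(&args, 0, sizeof(args));                              \
        args.a = a;                                                  \
        args.b = b;                                                  \
        memcpy(buf, &args, sizeof(args));                            \
        /* a real stub would pass buf through ioctl() here */        \
        return name##_unpack(buf);                                   \
    }

DEFINE_IOCTL2(set_volume, int, int)

static int last_channel, last_level;

/* The driver author writes only this typed function. */
int set_volume_impl(int channel, int level)
{
    if (level < 0 || level > 100)
        return -1;
    last_channel = channel;
    last_level = level;
    return 0;
}
```

A caller would then write set_volume_call(2, 75) and never touch the
buffer layout. Whether existing ioctls fit such a fixed shape is exactly
the analysis the message says would have to come first.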

MfG Kai


* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-21 17:16           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Oliver Xymoron
  2001-05-21 16:26             ` David Lang
@ 2001-05-21 20:14             ` Daniel Phillips
  2001-05-22 15:24               ` Oliver Xymoron
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-21 20:14 UTC (permalink / raw)
  To: Oliver Xymoron, Alexander Viro
  Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost
> always two: raw and ctl) and their legacy device numbers with a
> single userspace callout that does whatever /dev/ creation needs to
> be done. Thus, naming and permissions live in user space. No "device
> node is also a directory" weirdness...

Could you be specific about what is weird about it?

> ...which is overkill in the vast majority of cases.

--
Daniel



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-20 21:13               ` Pavel Machek
@ 2001-05-21 20:20                 ` Alan Cox
  2001-05-21 20:41                   ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-21 20:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Richard Gooch, Alexander Viro, Matthew Wilcox, Alan Cox,
	Andrew Clausen, Ben LaHaise, torvalds, linux-kernel,
	linux-fsdevel

> > I don't need to read it. Don't be insulting. Sure, you *can* use a
> > write(2)/read(2) cycle. But that's two syscalls compared to one with
> > ioctl(2) or transaction(2). That can matter to some applications.
> 
> I just don't think so. Where did you see performance-critical use of
> ioctl()?

AGP, video4linux,...

Alan



* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-20 20:23   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Pavel Machek
@ 2001-05-21 20:38     ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-21 20:38 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Sun, 20 May 2001, Pavel Machek wrote:

> Hi!
> 
> > A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
> > no-op. Breaking that assumption is a Bad Thing(tm).
> 
> Then we have a problem. Just opening /dev/ttyS0 currently *has* side
> effects (it is visible on modem lines from serial port; it can block
> you forever). 
> 
> If this assumption is somewhere, we should fix that place... Or fix
> serial ports.

There is no way to fix it. If process A has ability to create and remove
files in directory foo, then process B has no way to know what file it
will actually open upon the attempt to open file in foo.

	Example: you want to open /home/luser/barf and /home is on the root
filesystem (too many systems have such a setup, and braindead as it is
it _is_ valid). Luser creates a link to his tty (currently owned by
luser, so no bullshit about "let's restrict link(2) to the case when
target is owned by caller", please). After that he renames that link
to barf.

	If you've just decided to open it and rename() comes when you
enter open(3) (in libc, still in userland), you _will_ end up opening
luser's tty.

	OTOH, behaviour of serial ports is required by standards.

All we can do is open it in non-blocking mode and then check whether
we've got what we wanted. You _must_ call fstat(2) after opening a file
that could be replaced under you. If you are not doing that (and open
a file in a directory controlled by somebody else) - you have an exploitable
race. However, fstat() is too late to avoid side-effects of open() itself.

For serial ports O_NDELAY is enough to avoid that side effect. For something
where it's not enough - well, too bad. Don't do it.
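
[Editorial sketch, not part of the original message.] A minimal version of
the open-then-fstat pattern described above, assuming a POSIX system. The
helper name open_regular_nonblock is hypothetical, and "must be a regular
file" is only one possible policy; a caller expecting a particular device
would check st_rdev instead.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` without blocking, then verify on the
 * descriptor itself (not on the name) that we got a regular file.
 * Returns the fd on success, -1 with errno set otherwise.
 */
int open_regular_nonblock(const char *path)
{
    struct stat st;
    int fd;

    assert(path != NULL);

    /* O_NONBLOCK is the POSIX spelling of O_NDELAY: it keeps a tty or
     * FIFO from blocking us inside open(). */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* fstat() examines the object we actually opened, so a rename()
     * racing with open() cannot fool this check. */
    if (fstat(fd, &st) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (!S_ISREG(st.st_mode)) {
        close(fd);
        errno = EINVAL;
        return -1;
    }
    return fd;
}
```

As the message says, the check comes after open() has already run, so it
detects a substituted file but cannot undo any side effect the open itself
had.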



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 20:20                 ` Alan Cox
@ 2001-05-21 20:41                   ` Alexander Viro
  2001-05-21 21:29                     ` Alan Cox
  0 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-21 20:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: Pavel Machek, Richard Gooch, Matthew Wilcox, Andrew Clausen,
	Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Mon, 21 May 2001, Alan Cox wrote:

> > > I don't need to read it. Don't be insulting. Sure, you *can* use a
> > > write(2)/read(2) cycle. But that's two syscalls compared to one with
> > > ioctl(2) or transaction(2). That can matter to some applications.
> > 
> > I just don't think so. Where did you see performance-critical use of
> > ioctl()?
> 
> AGP, video4linux,...

Which, BTW, is a wonderful reason for having multiple channels. Instead
of write(fd, "critical_command", 8); read(fd,....); you read from the right fd.
Opened before you enter the hotspot. Less overhead than ioctl() would
have...



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 20:41                   ` Alexander Viro
@ 2001-05-21 21:29                     ` Alan Cox
  2001-05-21 21:51                       ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-21 21:29 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Alan Cox, Pavel Machek, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, torvalds, linux-kernel,
	linux-fsdevel

> Which, BTW, is a wonderful reason for having multiple channels. Instead
> of write(fd, "critical_command", 8); read(fd,....); you read from the right fd.
> Opened before you enter the hotspot. Less overhead than ioctl() would
> have...

The ioctl is one syscall, the read/write pair are two. I'm not sure that ioctl
is going to be more overhead there. In the video4linux case the high overhead
is acking frames received by mmap, so it might conceivably be considered one-way.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 21:29                     ` Alan Cox
@ 2001-05-21 21:51                       ` Alexander Viro
  2001-05-21 21:56                         ` Alan Cox
  2001-05-22  0:22                         ` Ingo Oeser
  0 siblings, 2 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-21 21:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Pavel Machek, Richard Gooch, Matthew Wilcox, Andrew Clausen,
	Ben LaHaise, torvalds, linux-kernel, linux-fsdevel



On Mon, 21 May 2001, Alan Cox wrote:

> > Which, BTW, is a wonderful reason for having multiple channels. Instead
> > of write(fd, "critical_command", 8); read(fd,....); you read from the right fd.
> > Opened before you enter the hotspot. Less overhead than ioctl() would
> > have...
> 
> The ioctl is one syscall, the read/write pair are two. I'm not sure that ioctl
> is going to be more overhead there. In the video4linux case the high overhead
> is acking frames received by mmap so might conceivably be considered one way

Sure. But we have to do two syscalls only if ioctl has both in- and out-
arguments that way. Moreover, we are talking about non-trivial in- arguments.
How many of these are in hotspots?



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 21:51                       ` Alexander Viro
@ 2001-05-21 21:56                         ` Alan Cox
  2001-05-21 22:10                           ` Linus Torvalds
  2001-05-22  0:22                         ` Ingo Oeser
  1 sibling, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-21 21:56 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Alan Cox, Pavel Machek, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, torvalds, linux-kernel,
	linux-fsdevel

> Sure. But we have to do two syscalls only if ioctl has both in- and out-
> arguments that way. Moreover, we are talking about non-trivial in- arguments.
> How many of these are in hotspots?

There is also a second question. How do you ensure the read is for the right
data when you are sharing a file handle with another thread?

ioctl() has the nice property that an in/out ioctl is implicitly synchronized.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 21:56                         ` Alan Cox
@ 2001-05-21 22:10                           ` Linus Torvalds
  2001-05-21 22:22                             ` Alexander Viro
                                               ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-21 22:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexander Viro, Pavel Machek, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Mon, 21 May 2001, Alan Cox wrote:
>
> > Sure. But we have to do two syscalls only if ioctl has both in- and out-
> > arguments that way. Moreover, we are talking about non-trivial in- arguments.
> > How many of these are in hotspots?
>
> There is also a second question. How do you ensure the read is for the right
> data when you are sharing a file handle with another thread..
>
> ioctl() has the nice property that an in/out ioctl is implicitly synchronized

I don't think we can generically replace ioctl's with read-write, and we
shouldn't bend over backwards even _trying_.

The important thing would be to give them more structure, and as far as
I'm personally concerned I don't even care if the user-level access method
ends up being the same old thing. After all, we have magic numbers
everywhere: even a system call uses magic numbers for the syscall entry
numbering. The thing that makes system call numbers nice is that the
number gets turned into a more structured thing with proper type checking
and well-defined semantics very very early on indeed.

It shouldn't be impossible to do the same thing to ioctl numbers. Nastier,
yes. No question about it. But we don't necessarily have to redesign the
whole approach - we only want to re-design the internal kernel interfaces.

That, in turn, might be as simple as changing the ioctl incoming arguments
of <cmd,arg> into a structure like <type,cmd,inbuf,inlen,outbuf,outlen>.

		Linus
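
[Editorial sketch, not part of the original message.] One way to picture
the <type,cmd,inbuf,inlen,outbuf,outlen> idea is a userspace model in which
a central dispatcher checks sizes and pointers against a table before any
handler runs. All names below (struct ioctl_req, struct ioctl_desc,
dispatch_ioctl, the doubling handler) are hypothetical and not kernel
interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structured request, one field per element of the tuple. */
struct ioctl_req {
    unsigned int type;    /* subsystem or driver class */
    unsigned int cmd;     /* command within that type */
    const void  *inbuf;   /* data from the caller, may be NULL */
    size_t       inlen;
    void        *outbuf;  /* space for the reply, may be NULL */
    size_t       outlen;
};

/* Hypothetical table entry: declares the sizes a command expects. */
struct ioctl_desc {
    unsigned int type, cmd;
    size_t inlen, outlen;
    int (*handler)(const void *in, void *out);
};

/* Example handler: doubles an int. */
static int double_int(const void *in, void *out)
{
    int v;

    memcpy(&v, in, sizeof(v));
    v *= 2;
    memcpy(out, &v, sizeof(v));
    return 0;
}

static const struct ioctl_desc table[] = {
    { 1, 1, sizeof(int), sizeof(int), double_int },
};

/* Central dispatch: size and pointer checks happen once, here, so a
 * handler never sees a malformed request. */
int dispatch_ioctl(const struct ioctl_req *req)
{
    size_t i;

    assert(req != NULL);
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        const struct ioctl_desc *d = &table[i];

        if (d->type != req->type || d->cmd != req->cmd)
            continue;
        if (req->inlen != d->inlen || req->outlen != d->outlen)
            return -EINVAL;
        if ((d->inlen && !req->inbuf) || (d->outlen && !req->outbuf))
            return -EFAULT;
        return d->handler(req->inbuf, req->outbuf);
    }
    return -ENOTTY; /* unknown command */
}
```

A flat pair of buffers like this cannot describe arguments that themselves
contain pointers, which is the objection raised in the reply that follows.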



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 22:10                           ` Linus Torvalds
@ 2001-05-21 22:22                             ` Alexander Viro
  2001-05-22 15:41                               ` Oliver Xymoron
  2001-05-22  2:28                             ` Paul Mackerras
  2001-05-22 13:33                             ` Jan Harkes
  2 siblings, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-21 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Pavel Machek, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, linux-kernel, linux-fsdevel



On Mon, 21 May 2001, Linus Torvalds wrote:

> It shouldn't be impossible to do the same thing to ioctl numbers. Nastier,
> yes. No question about it. But we don't necessarily have to redesign the
> whole approach - we only want to re-design the internal kernel interfaces.
> 
> That, in turn, might be as simple as changing the ioctl incoming arguments
> of <cmd,arg> into a structure like <type,cmd,inbuf,inlen,outbuf,outlen>.

drivers/net/ppp_generic.c:
ppp_set_compress(struct ppp *ppp, unsigned long arg)
{
        int err;
        struct compressor *cp;
        struct ppp_option_data data;
        void *state;
        unsigned char ccp_option[CCP_MAX_OPTION_LENGTH];
#ifdef CONFIG_KMOD
        char modname[32];
#endif

        err = -EFAULT;
        if (copy_from_user(&data, (void *) arg, sizeof(data))
            || (data.length <= CCP_MAX_OPTION_LENGTH
                && copy_from_user(ccp_option, data.ptr, data.length)))
                goto out;

And that's far from being uncommon. They _do_ follow pointers. Some - more
than once.

We _will_ have to support ioctls for long. No questions about that. And
there is no magic trick that would work for all of them, simply because
many are too disgusting to be left alive. Let's clean the groups that can
be cleaned and see what's left.



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 21:51                       ` Alexander Viro
  2001-05-21 21:56                         ` Alan Cox
@ 2001-05-22  0:22                         ` Ingo Oeser
  2001-05-22  0:57                           ` Matthew Wilcox
  1 sibling, 1 reply; 161+ messages in thread
From: Ingo Oeser @ 2001-05-22  0:22 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Alan Cox, Pavel Machek, Richard Gooch, Matthew Wilcox,
	Andrew Clausen, Ben LaHaise, torvalds, linux-kernel,
	linux-fsdevel

On Mon, May 21, 2001 at 05:51:08PM -0400, Alexander Viro wrote:
> Sure. But we have to do two syscalls only if ioctl has both in- and out-
> arguments that way. Moreover, we are talking about non-trivial in- arguments.
> How many of these are in hotspots?

ioctl actually has 4 semantics:

command only
command + read
command + write
command + rw-transaction

Separating these would be a first step. And yes, I consider each
of them useful.

command only: reset drive
command + rw-transaction: "dear device please mangle this data"
   (crypto processors come to mind...)

The other two are obviously needed and already accepted by all of
you.

Hotspots: crypto hardware or generally DSPs.
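
[Editorial sketch, not part of the original message.] The four cases above
can be written down as two direction bits. The enum and both functions are
hypothetical names made up for this illustration; the syscall count encodes
the argument made earlier in the thread that a read()/write() replacement
needs two syscalls only when a command has both in- and out-arguments.

```c
#include <assert.h>

/* Hypothetical direction flags, as seen from the caller. */
enum xfer_dir {
    XFER_NONE  = 0,                      /* command only              */
    XFER_READ  = 1 << 0,                 /* command + read            */
    XFER_WRITE = 1 << 1,                 /* command + write           */
    XFER_RW    = XFER_READ | XFER_WRITE  /* command + rw-transaction  */
};

/* Name of the semantic class, matching the list in the message. */
const char *xfer_name(enum xfer_dir dir)
{
    assert((dir & ~XFER_RW) == 0);
    switch (dir) {
    case XFER_NONE:  return "command only";
    case XFER_READ:  return "command + read";
    case XFER_WRITE: return "command + write";
    default:         return "command + rw-transaction";
    }
}

/* Syscalls needed if the command is expressed as read()/write() on a
 * dedicated channel instead of a single ioctl(). */
int syscalls_without_ioctl(enum xfer_dir dir)
{
    assert((dir & ~XFER_RW) == 0);
    return dir == XFER_RW ? 2 : 1;
}
```

With this split, "reset drive" is XFER_NONE and "please mangle this data"
is XFER_RW, the only class where the single-syscall property of ioctl()
is actually at stake.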


Regards

Ingo Oeser
-- 
To the systems programmer,
users and applications serve only to provide a test load.


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22  0:22                         ` Ingo Oeser
@ 2001-05-22  0:57                           ` Matthew Wilcox
  2001-05-22  1:13                             ` Linus Torvalds
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-22  0:57 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Alexander Viro, Alan Cox, Pavel Machek, Richard Gooch,
	Matthew Wilcox, Andrew Clausen, Ben LaHaise, torvalds,
	linux-kernel, linux-fsdevel

On Tue, May 22, 2001 at 02:22:34AM +0200, Ingo Oeser wrote:
> ioctl has actually 4 semantics:
> 
> command only
> command + read
> command + write
> command + rw-transaction
> 
> Separating these would be a first step. And yes, I consider each
> of them useful.
> 
> command only: reset drive

echo 'reset' >/dev/sg0ctl

> command + rw-transaction: "dear device please mangle this data"
>    (crypto processors come to mind...)

I can't think of a reasonable tool-based approach to this, but I can
definitely see that a program could use this well.  It simply requires
that you use the filp to store your state.

fd = open("/dev/crypto", O_RDWR);	/* creates filp */
write(fd, "Death to all fanatics!\n", 23);
	/* calls crypto device, stores result in private data structure */
sleep(100);
read(fd, buf, sizeof(buf));
	/* buf now holds "Qrngu gb nyy snangvpf!\n"; frees data structure */

[You'll note the advanced design of my crypto processor.]

Clearly, this is open to abuse by persons never calling read() and passing in
far too much to write().  I think this can be alleviated by refusing to accept
more than (say) 4k at a time, or bean-counter.

A sick way would be to allow the ->write() call to have its buffer
modified.  But I don't think we want to go down that path.
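
[Editorial sketch, not part of the original message.] The write-then-read
scheme above can be modelled in userspace, assuming the per-open state
lives in a small struct that stands in for the filp's private data. The
names crypto_file, crypto_write and crypto_read are hypothetical, the
transform is the same ROT13 as in the example, and the 4096-byte cap is
the "(say) 4k" limit from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WRITE 4096 /* the "(say) 4k" cap from the text */

/* Hypothetical per-open state, standing in for the filp's private data. */
struct crypto_file {
    char  *result;
    size_t len;
};

static char rot13(char c)
{
    if (c >= 'a' && c <= 'z')
        return (char)('a' + (c - 'a' + 13) % 26);
    if (c >= 'A' && c <= 'Z')
        return (char)('A' + (c - 'A' + 13) % 26);
    return c;
}

/* write(): transform the input and park the result on the open file. */
long crypto_write(struct crypto_file *f, const char *buf, size_t n)
{
    size_t i;
    char *out;

    assert(f != NULL && buf != NULL);
    if (n > MAX_WRITE)
        return -EINVAL;
    if (f->result)
        return -EBUSY;      /* previous result has not been read yet */
    out = malloc(n ? n : 1);
    if (!out)
        return -ENOMEM;
    for (i = 0; i < n; i++)
        out[i] = rot13(buf[i]);
    f->result = out;
    f->len = n;
    return (long)n;
}

/* read(): hand back the stored result and free the private data.
 * A short read discards whatever did not fit in the caller's buffer. */
long crypto_read(struct crypto_file *f, char *buf, size_t n)
{
    size_t got;

    assert(f != NULL && buf != NULL);
    if (!f->result)
        return 0;           /* nothing pending */
    got = f->len < n ? f->len : n;
    memcpy(buf, f->result, got);
    free(f->result);
    f->result = NULL;
    f->len = 0;
    return (long)got;
}
```

The -EBUSY path is where sharing shows up: if two users hold the same open
file, one of them can find a result pending that it never wrote, or read
back the other's data.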

-- 
Revolutions do not require corporate support.


* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22  0:57                           ` Matthew Wilcox
@ 2001-05-22  1:13                             ` Linus Torvalds
  2001-05-22  1:18                               ` Matthew Wilcox
  0 siblings, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-22  1:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Oeser, Alexander Viro, Alan Cox, Pavel Machek,
	Richard Gooch, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel



On Tue, 22 May 2001, Matthew Wilcox wrote:
>
> > command + rw-transaction: "dear device please mangle this data"
> >    (crypto processors come to mind...)
>
> I can't think of a reasonable tool-based approach to this, but I can
> definitely see that a program could use this well.  It simply requires
> that you use the filp to store your state.

Nope. You can (and people do, quite often) share filps. So you can't
associate state with it.

		Linus



* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22  1:13                             ` Linus Torvalds
@ 2001-05-22  1:18                               ` Matthew Wilcox
  2001-05-22  7:49                                 ` Alan Cox
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-22  1:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Ingo Oeser, Alexander Viro, Alan Cox,
	Pavel Machek, Richard Gooch, Andrew Clausen, Ben LaHaise,
	linux-kernel, linux-fsdevel

On Mon, May 21, 2001 at 06:13:18PM -0700, Linus Torvalds wrote:
> Nope. You can (and people do, quite often) share filps. So you can't
> associate state with it.

For _devices_, though?  I don't expect my mouse to work if gpm and xfree
both try to consume device events from the same filp.  Heck, it doesn't
even work when they try to consume events from the same inode :-)  I think
this is a reasonable restriction for the class of devices in question.

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 22:10                           ` Linus Torvalds
  2001-05-21 22:22                             ` Alexander Viro
@ 2001-05-22  2:28                             ` Paul Mackerras
  2001-05-22 13:33                             ` Jan Harkes
  2 siblings, 0 replies; 161+ messages in thread
From: Paul Mackerras @ 2001-05-22  2:28 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

Alexander Viro writes:

> drivers/net/ppp_generic.c:
> ppp_set_compress(struct ppp *ppp, unsigned long arg)
> {
[snip]
>         if (copy_from_user(&data, (void *) arg, sizeof(data))
>             || (data.length <= CCP_MAX_OPTION_LENGTH
>                 && copy_from_user(ccp_option, data.ptr, data.length)))
>                 goto out;
> 
> And that's far from being uncommon. They _do_ follow pointers. Some - more
> than once.

:) That particular example is one that would probably be much cleaner
as a write on a control fd.  What is there currently is just a
relatively ugly way of getting a variable-sized lump of data from
usermode into the kernel.
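
A rough sketch of the write-on-a-control-fd shape, with a pipe standing in
for the control fd. Everything here is hypothetical (send_option,
ctl_receive, MAX_OPTION_LEN); it is not the ppp_generic interface, only an
illustration that the length travels with the write() itself.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define MAX_OPTION_LEN 64	/* hypothetical, stands in for CCP_MAX_OPTION_LENGTH */

/* "Driver" side: one read() picks up the whole blob and its return value
 * is the length, so no separate <ptr,length> struct has to be followed. */
static ssize_t ctl_receive(int ctl_rd, unsigned char *opt, size_t max)
{
	return read(ctl_rd, opt, max);
}

/* User side: hand over a variable-sized blob with a single write().
 * Returns the number of bytes the receiving side saw, or -1. */
ssize_t send_option(const unsigned char *opt, size_t len)
{
	unsigned char kbuf[MAX_OPTION_LEN];
	int p[2];
	ssize_t got = -1;

	assert(len > 0 && len <= MAX_OPTION_LEN);
	if (pipe(p) < 0)
		return -1;
	if (write(p[1], opt, len) == (ssize_t)len)
		got = ctl_receive(p[0], kbuf, sizeof(kbuf));
	if (got >= 0 && memcmp(kbuf, opt, (size_t)got) != 0)
		got = -1;		/* payload did not arrive intact */
	close(p[0]);
	close(p[1]);
	return got;
}
```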

Paul.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22  1:18                               ` Matthew Wilcox
@ 2001-05-22  7:49                                 ` Alan Cox
  2001-05-22 15:31                                   ` Matthew Wilcox
  0 siblings, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-22  7:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Matthew Wilcox, Ingo Oeser, Alexander Viro,
	Alan Cox, Pavel Machek, Richard Gooch, Andrew Clausen,
	Ben LaHaise, linux-kernel, linux-fsdevel

> For _devices_, though?  I don't expect my mouse to work if gpm and xfree
> both try to consume device events from the same filp.  Heck, it doesn't
> even work when they try to consume events from the same inode :-)  I think
> this is a reasonable restriction for the class of devices in question.

Not really. Think about basic things like full duplex audio with two threads.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
  2001-05-21  8:14     ` Lars Marowsky-Bree
@ 2001-05-22  9:07       ` Daniel Phillips
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-22  9:07 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Eric W. Biederman, Ben LaHaise, torvalds, viro, linux-kernel,
	linux-fsdevel

On Monday 21 May 2001 10:14, Lars Marowsky-Bree wrote:
> On 2001-05-19T16:25:47,
>
>    Daniel Phillips <phillips@bonn-fries.net> said:
> > How about:
> >
> >   # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
> >   # ls /dev/mypartition
> >   base	size	device	type
> >   # cat /dev/mypartition/size
> >   1048576
> >   # cat /dev/mypartition/device
> >   /dev/sda
> >   # mke2fs /dev/mypartition
>
> Ek. You want to run mke2fs on a _directory_ ?

Could you be specific about what is wrong with that?  Assuming that
this device directory lives on a special purpose filesystem?

> If anything, /dev/mypartition/realdev

Then every fstab in the world has to change, not to mention adding
verbosity to interactive commands.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 22:10                           ` Linus Torvalds
  2001-05-21 22:22                             ` Alexander Viro
  2001-05-22  2:28                             ` Paul Mackerras
@ 2001-05-22 13:33                             ` Jan Harkes
  2001-05-22 16:30                               ` Linus Torvalds
  2 siblings, 1 reply; 161+ messages in thread
From: Jan Harkes @ 2001-05-22 13:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Alexander Viro, Pavel Machek, Richard Gooch,
	Matthew Wilcox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel

On Mon, May 21, 2001 at 03:10:32PM -0700, Linus Torvalds wrote:
> That, in turn, might be as simple as changing the ioctl incoming arguments
> of <cmd,arg> into a structure like <type,cmd,inbuf,inlen,outbuf,outlen>.

At least make sure that the 'kioctl' returns the number of bytes placed
into the output buffer, as userspace doesn't necessarily know how much
data would be returned. Coda's kernel module forwards control data up to
userspace and uses a reasonably messy 'pioctl' wrapper (also used by AFS
afaik) around an ioctl to inform the kernel module of how much data to
copy through.

something like,

    ssize_t kioctl(int fd, int type, int cmd, void *inbuf, size_t inlen,
		   void *outbuf, size_t outlen);

As far as functionality and errors go, it works like read/write in a single
call, pretty much what Richard proposed earlier with a new 'transaction'
syscall. Maybe type is not needed, and cmd can be part of the inbuf in
which case it would be identical. I guess that type is introduced to
resolve existing ioctl number collisions.
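
To make the return-value convention concrete, here is a user-space mock.
No such syscall exists; kioctl_mock, its single echo command and the
kernel-style negative error returns are all assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical mock of the proposed call.  The only command, <1,1>,
 * echoes its input.  Errors come back as negative errno values here;
 * a real syscall wrapper would return -1 and set errno instead. */
ssize_t kioctl_mock(int fd, int type, int cmd, void *inbuf, size_t inlen,
		    void *outbuf, size_t outlen)
{
	(void)fd;			/* the mock has no real device behind it */
	assert(inbuf != NULL && outbuf != NULL);

	if (type != 1 || cmd != 1)
		return -EINVAL;		/* unknown <type,cmd> pair */
	if (inlen > outlen)
		return -ENOSPC;		/* reply would not fit in outbuf */
	memcpy(outbuf, inbuf, inlen);
	return (ssize_t)inlen;		/* number of bytes placed in outbuf */
}
```

The caller sizes outbuf for the worst case and learns the actual reply
length from the return value, the same way it would with read().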

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-21 20:14             ` Daniel Phillips
@ 2001-05-22 15:24               ` Oliver Xymoron
  2001-05-22 16:51                 ` Daniel Phillips
  0 siblings, 1 reply; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-22 15:24 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Mon, 21 May 2001, Daniel Phillips wrote:

> On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > What I'd like to see:
> >
> > - An interface for registering an array of related devices (almost
> > always two: raw and ctl) and their legacy device numbers with a
> > single userspace callout that does whatever /dev/ creation needs to
> > be done. Thus, naming and permissions live in user space. No "device
> > node is also a directory" weirdness...
>
> Could you be specific about what is weird about it?

*boggle*

Without precedent in any other UNIX? Or other operating systems, for that
matter? Can you honestly say it doesn't strike you as weird? It's beating
the least surprise rule with a big stick, fercryinoutloud.

Ok, so technically UNIX directories were once just files. But it's been a
long time since people thought exposing that implementation detail was a
good idea, and anyway, it's the opposite situation (and no longer true on
modern fses).

I don't think it's likely to be even workable. Just consider the directory
entry for a moment - is it going to be marked d or [cb]? If it doesn't
have the directory bit set, Midnight commander won't let me look at it,
and I wouldn't blame cd or ls for complaining. If it does have the 'd' bit
set, I wouldn't blame cp, tar, find, or a million other programs if they
did the wrong thing. They've had 30 years to expect that files aren't
directories. They're going to act weird.
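
For what it's worth, this is roughly the check those programs make. The
sketch below uses only stat(2) and the standard S_IS*() macros; the
function name type_letter is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return the type letter ls -l would show for `path`, or '?' if the
 * path cannot be stat()ed or is some other kind of node. */
char type_letter(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return '?';
	if (S_ISDIR(st.st_mode))
		return 'd';
	if (S_ISCHR(st.st_mode))
		return 'c';
	if (S_ISBLK(st.st_mode))
		return 'b';
	if (S_ISREG(st.st_mode))
		return '-';
	return '?';
}
```

The file type is a single field in st_mode, so a node reports exactly one
of these; a tool that branches on it has to take one path or the other.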

Linus has been kicking this idea around for a couple years now and it's
still a cute solution looking for a problem. It just doesn't belong in
UNIX.

More importantly, there's no call for the weirdness. Look, we've already
got to have a userspace callout for new devices so that we can do config,
firmware downloading, automounting, etc. There's no reason we can't stick
the rest of the dynamic /dev/ magic in userspace with the same mechanism.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22  7:49                                 ` Alan Cox
@ 2001-05-22 15:31                                   ` Matthew Wilcox
  2001-05-22 15:31                                     ` Alan Cox
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-22 15:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Wilcox, Linus Torvalds, Ingo Oeser, Alexander Viro,
	Pavel Machek, Richard Gooch, Andrew Clausen, Ben LaHaise,
	linux-kernel, linux-fsdevel

On Tue, May 22, 2001 at 08:49:04AM +0100, Alan Cox wrote:
> > For _devices_, though?  I don't expect my mouse to work if gpm and xfree
> > both try to consume device events from the same filp.  Heck, it doesn't
> > even work when they try to consume events from the same inode :-)  I think
> > this is a reasonable restriction for the class of devices in question.
> 
> Not really. Think about basic things like full duplex audio with two threads

`the class of devices in question' was cryptographic devices, and possibly
other transactional DSPs.  I don't consider audio to be transactional.
In any case, you can do transactional things with two threads, as long
as they each have their own fd on the device.  Think of the fd as your
transaction handle.

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22 15:31                                   ` Matthew Wilcox
@ 2001-05-22 15:31                                     ` Alan Cox
  2001-05-22 15:38                                       ` Matthew Wilcox
  0 siblings, 1 reply; 161+ messages in thread
From: Alan Cox @ 2001-05-22 15:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, Matthew Wilcox, Linus Torvalds, Ingo Oeser,
	Alexander Viro, Pavel Machek, Richard Gooch, Andrew Clausen,
	Ben LaHaise, linux-kernel, linux-fsdevel

> `the class of devices in question' was cryptographic devices, and possibly
> other transactional DSPs.  I don't consider audio to be transactional.
> in any case, you can do transactional things with two threads, as long
> as they each have their own fd on the device.  Think of the fd as your
> transaction handle.

That's a bit pathetic. So I have to fill my app with expensive pthread locks
or hack all the drivers and totally change the multi-open semantics in the ABI.

I think I'll stick to a cleaned-up ioctl.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22 15:31                                     ` Alan Cox
@ 2001-05-22 15:38                                       ` Matthew Wilcox
  2001-05-22 15:42                                         ` Alan Cox
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-22 15:38 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Wilcox, Linus Torvalds, Ingo Oeser, Alexander Viro,
	Pavel Machek, Richard Gooch, Andrew Clausen, Ben LaHaise,
	linux-kernel, linux-fsdevel

On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote:
> > `the class of devices in question' was cryptographic devices, and possibly
> > other transactional DSPs.  I don't consider audio to be transactional.
> > in any case, you can do transactional things with two threads, as long
> > as they each have their own fd on the device.  Think of the fd as your
> > transaction handle.
> 
> That's a bit pathetic. So I have to fill my app with expensive pthread locks
> or hack all the drivers and totally change the multi-open semantics in the ABI

huh?

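#include <fcntl.h>	/* open(), O_RDWR */

void thread_init(void) {
	/* one open() per thread gives each thread its own fd: no lock needed */
	int fd = open("/dev/crypto", O_RDWR);
	real_thread_init(fd);
}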

where was that lock again?

and notice this idea is only for transactional things -- what
transactional things do sound drivers do?

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21 22:22                             ` Alexander Viro
@ 2001-05-22 15:41                               ` Oliver Xymoron
  0 siblings, 0 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-22 15:41 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel

On Mon, 21 May 2001, Alexander Viro wrote:

> On Mon, 21 May 2001, Linus Torvalds wrote:
>
> > It shouldn't be impossible to do the same thing to ioctl numbers. Nastier,
> > yes. No question about it. But we don't necessarily have to redesign the
> > whole approach - we only want to re-design the internal kernel interfaces.
> >
> > That, in turn, might be as simple as changing the ioctl incoming arguments
> > of <cmd,arg> into a structure like <type,cmd,inbuf,inlen,outbuf,outlen>.
>
> drivers/net/ppp_generic.c:
>
> And that's far from being uncommon. They _do_ follow pointers. Some - more
> than once.

Doesn't matter. If we make doing it right substantially easier than doing
it wrong, then people will quit doing it wrong. And it'll be much easier
to spot the ugly hacks in patches.

I actually wrote the above a while back.. lessee.. where's that thread..

http://mlarchive.ima.com/linux-kernel/1999/Jan/4932.html

The end result was using the resource trees to hold pointers to functions
with prototypes like the above (plus file handle info). No more giant
grotty switch statements (though you could keep those if you wanted - just
point all your ioctls to the same old function). Why trees? To implement
inheritance of ioctls through the device hierarchy.
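
A small user-space sketch of the shape I mean. It is not the code from
that old thread; the struct names, the command numbers and the two toy
handlers are invented, and a plain parent pointer stands in for the
resource tree.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*ioctl_fn)(const void *in, size_t inlen, void *out, size_t outlen);

struct ioctl_entry {
	int cmd;
	ioctl_fn fn;
};

struct dev_node {
	const struct dev_node *parent;	/* NULL at the root of the hierarchy */
	const struct ioctl_entry *tbl;
	size_t n;
};

/* Walk from the device towards the root.  The first match wins, so a
 * device can override a handler or simply inherit it from its parent. */
ioctl_fn find_ioctl(const struct dev_node *dev, int cmd)
{
	size_t i;

	for (; dev != NULL; dev = dev->parent)
		for (i = 0; i < dev->n; i++)
			if (dev->tbl[i].cmd == cmd)
				return dev->tbl[i].fn;
	return NULL;
}

/* Toy handlers: they ignore their buffers and return a marker value. */
static long blk_get_size(const void *in, size_t inlen, void *out, size_t outlen)
{
	(void)in; (void)inlen; (void)out; (void)outlen;
	return 1;
}

static long cdrom_eject(const void *in, size_t inlen, void *out, size_t outlen)
{
	(void)in; (void)inlen; (void)out; (void)outlen;
	return 2;
}

static const struct ioctl_entry blk_tbl[] = { { 1, blk_get_size } };
static const struct ioctl_entry cdrom_tbl[] = { { 2, cdrom_eject } };

const struct dev_node blk_class = { NULL, blk_tbl, 1 };
const struct dev_node cdrom = { &blk_class, cdrom_tbl, 1 };

/* Dispatch without a switch statement; -1 means "no such ioctl here". */
long call_ioctl(const struct dev_node *dev, int cmd)
{
	ioctl_fn fn;

	assert(dev != NULL);
	fn = find_ioctl(dev, cmd);
	return fn ? fn(NULL, 0, NULL, 0) : -1;
}
```

Here the cdrom node answers command 2 itself and picks up command 1 from
the block class above it.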

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22 15:38                                       ` Matthew Wilcox
@ 2001-05-22 15:42                                         ` Alan Cox
  0 siblings, 0 replies; 161+ messages in thread
From: Alan Cox @ 2001-05-22 15:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, Matthew Wilcox, Linus Torvalds, Ingo Oeser,
	Alexander Viro, Pavel Machek, Richard Gooch, Andrew Clausen,
	Ben LaHaise, linux-kernel, linux-fsdevel

> > That's a bit pathetic. So I have to fill my app with expensive pthread locks
> > or hack all the drivers and totally change the multi-open semantics in the ABI
> huh?

For the sound. And remember that each open of /dev/audio is potentially a
different channel (i.e. it's a factory).


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-22 13:33                             ` Jan Harkes
@ 2001-05-22 16:30                               ` Linus Torvalds
  0 siblings, 0 replies; 161+ messages in thread
From: Linus Torvalds @ 2001-05-22 16:30 UTC (permalink / raw)
  To: Jan Harkes
  Cc: Alan Cox, Alexander Viro, Pavel Machek, Richard Gooch,
	Matthew Wilcox, Andrew Clausen, Ben LaHaise, linux-kernel,
	linux-fsdevel


On Tue, 22 May 2001, Jan Harkes wrote:
> 
> something like,
> 
>     ssize_t kioctl(int fd, int type, int cmd, void *inbuf, size_t inlen,
> 		   void *outbuf, size_t outlen);
> 
> As far as functionality and errors it works like read/write in a single
> call, pretty much what Richard proposed earlier with a new 'transaction'
> syscall. Maybe type is not needed, and cmd can be part of the inbuf in
> which case it would be identical.

I'd rather have type and cmd there, simply because right now the
"cmd" passed in to the ioctl is not well-defined, as several different
drivers use the same numbers for different things (which is why I want to
expand that to <type,cmd> to get uniqueness).

Also, I think the cmd is separate from the data, so I don't think it
necessarily makes sense to mix the two. Even if we want to have an ASCII
command, I'd think that should be separate from the arguments, ie we'd
have 

	ssize_t kioctl(int fd, const char *cmd, const void *inbuf ...

instead of trying to mix them. This is especially true as the
"inbuf" would be a user-mode pointer, while "cmd" would come from kernel
space (whether in the form of a <type,subcmd> number pair or as a kernel
string).
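
As a sketch of what the <type,cmd> pair buys, with made-up type numbers
and descriptions (nothing here corresponds to an existing table in the
kernel):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kcmd_entry {
	int type;		/* hypothetical subsystem id */
	int cmd;		/* command number, unique only within a type */
	const char *name;
};

/* Two subsystems both use command number 1, the way drivers reuse
 * ioctl numbers today.  The type keeps them apart. */
static const struct kcmd_entry kcmd_table[] = {
	{ 1, 1, "tty: set line discipline" },
	{ 2, 1, "block: get device size" },
};

const char *kcmd_lookup(int type, int cmd)
{
	size_t i;

	assert(type >= 0 && cmd >= 0);	/* this sketch uses non-negative ids */
	for (i = 0; i < sizeof(kcmd_table) / sizeof(kcmd_table[0]); i++)
		if (kcmd_table[i].type == type && kcmd_table[i].cmd == cmd)
			return kcmd_table[i].name;
	return NULL;
}
```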

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 15:24               ` Oliver Xymoron
@ 2001-05-22 16:51                 ` Daniel Phillips
  2001-05-22 17:49                   ` Oliver Xymoron
  2001-05-23  4:19                   ` Edgar Toernig
  0 siblings, 2 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-22 16:51 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> On Mon, 21 May 2001, Daniel Phillips wrote:
> > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > What I'd like to see:
> > >
> > > - An interface for registering an array of related devices
> > > (almost always two: raw and ctl) and their legacy device numbers
> > > with a single userspace callout that does whatever /dev/ creation
> > > needs to be done. Thus, naming and permissions live in user
> > > space. No "device node is also a directory" weirdness...
> >
> > Could you be specific about what is weird about it?
>
> *boggle*
>
>[general sense of unease]
>
> I don't think it's likely to be even workable. Just consider the
> directory entry for a moment - is it going to be marked d or [cb]?

It's going to be marked 'd', it's a directory, not a file.

> If it doesn't have the directory bit set, Midnight commander won't
> let me look at it, and I wouldn't blame cd or ls for complaining. If it
> does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> million other programs if they did the wrong thing. They've had 30
> years to expect that files aren't directories. They're going to act
> weird.

No problem, it's a directory.

> Linus has been kicking this idea around for a couple years now and
> it's still a cute solution looking for a problem. It just doesn't
> belong in UNIX.

Hmm, ok, do we still have any *technical* reasons?

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 16:51                 ` Daniel Phillips
@ 2001-05-22 17:49                   ` Oliver Xymoron
  2001-05-22 20:22                     ` Daniel Phillips
  2001-05-23  4:19                   ` Edgar Toernig
  1 sibling, 1 reply; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-22 17:49 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Tue, 22 May 2001, Daniel Phillips wrote:

> > I don't think it's likely to be even workable. Just consider the
> > directory entry for a moment - is it going to be marked d or [cb]?
>
> It's going to be marked 'd', it's a directory, not a file.

Are we talking about the same proposal?  The one where I can open /dev/dsp
and /dev/dsp/ctl? But I can still do 'cat /dev/hda > /dev/dsp'?

It's still a file. If it's not a file anymore, it ain't UNIX.

> > If it doesn't have the directory bit set, Midnight commander won't
> > let me look at it, and I wouldn't blame cd or ls for complaining. If it
> > does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> > million other programs if they did the wrong thing. They've had 30
> > years to expect that files aren't directories. They're going to act
> > weird.
>
> No problem, it's a directory.
>
> > Linus has been kicking this idea around for a couple years now and
> > it's still a cute solution looking for a problem. It just doesn't
> > belong in UNIX.
>
> Hmm, ok, do we still have any *technical* reasons?

If you define *technical* to not include design, sure. Oh, did I
mention unnecessary, solvable in userspace?

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  0:32       ` Linus Torvalds
  2001-05-20  0:52         ` Jeff Garzik
  2001-05-20  1:03         ` Jeff Garzik
@ 2001-05-22 18:41         ` Andreas Dilger
  2001-05-22 19:06           ` Linus Torvalds
  2 siblings, 1 reply; 161+ messages in thread
From: Andreas Dilger @ 2001-05-22 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel

Linus writes:
> There are some strong arguments that we should have filesystem
> "backdoors" for maintenance purposes, including backup. 
> 
> You can, of course, do parts of this on an LVM level, and doing backups
> with "disk snapshots" may be a valid approach. However, even that is
> debatable: there is very little that says that the disk image has to be
> up-to-date at any particular point in time, so even with a disk snapshot
> capability (which is not necessarily reasonable under all circumstances)
> there are arguments for maintenance interfaces.

Actually, the LVM snapshot interface has (optional) hooks into the filesystem
to ensure that it is consistent at the time the snapshot is created.  For
most filesystems, it will call fsync_dev(dev) so that all buffers are written
to disk.  However, for journalled filesystems, LVM needs to write out the
journal and mark the filesystem clean because the snapshot is a read-only
block device.  In this case it calls fsync_dev_lockfs(dev) which will call
the write_super_lockfs() method for the filesystem (if it exists) which
tells the filesystem to flush the journal, block transactions, and mark the
filesystem clean until the unlockfs() method is called.
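
The call sequence, reduced to a user-space mock. The two hook names are
the ones mentioned above; the structs, the counters and snapshot_demo()
are invented for illustration and are not the LVM or VFS code.

```c
#include <assert.h>
#include <stddef.h>

struct fs_mock;

struct fs_ops_mock {
	void (*write_super_lockfs)(struct fs_mock *fs);	/* optional hook */
	void (*unlockfs)(struct fs_mock *fs);			/* optional hook */
};

struct fs_mock {
	const struct fs_ops_mock *ops;
	int dirty_buffers;	/* buffers not yet written to disk */
	int journal_records;	/* journal entries not yet flushed */
	int frozen;		/* new transactions are blocked */
};

static void journal_lock(struct fs_mock *fs)
{
	fs->journal_records = 0;	/* flush the journal ... */
	fs->frozen = 1;			/* ... and block new transactions */
}

static void journal_unlock(struct fs_mock *fs)
{
	fs->frozen = 0;
}

static const struct fs_ops_mock journalled_ops = { journal_lock, journal_unlock };
static const struct fs_ops_mock plain_ops = { NULL, NULL };

/* Returns 1 if everything was on disk at the moment the snapshot would
 * be cut and the filesystem is running again afterwards, else 0. */
int snapshot_demo(const struct fs_ops_mock *ops, int journal_records)
{
	struct fs_mock fs = { ops, 3, journal_records, 0 };
	int clean;

	assert(ops != NULL);
	fs.dirty_buffers = 0;			/* the fsync_dev() part */
	if (fs.ops->write_super_lockfs)
		fs.ops->write_super_lockfs(&fs);

	clean = fs.dirty_buffers == 0 && fs.journal_records == 0;
	/* ... the snapshot would be created here ... */

	if (fs.ops->unlockfs)
		fs.ops->unlockfs(&fs);
	return clean && !fs.frozen;
}
```

A filesystem with pending journal records and no hook comes out unclean
in this model, which is why the plain fsync_dev() path is not enough for
the journalled case.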

Reiserfs and XFS both use this to make consistent snapshots of the live
filesystem.  Unfortunately, XFS checks filesystem UUIDs at mount time,
which means you can't mount two copies of the same filesystem (even read-only).

> Things like "lazy fsck" (ie fsck while already running the filesystem) and
> defragmentation simply is not feasible on a LVM level.

Yes, with consistent LVM snapshots you can do fsck on the read-only copy.
In 99.9*% of cases you will not detect any errors and you can continue.  If
you _do_ detect an error you probably want to stop everything and fix it
(fsck repairing an in-use filesystem is too twisted and dangerous, IMHO,
and a huge amount of effort for an extremely rare situation).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-20  1:03         ` Jeff Garzik
                             ` (2 preceding siblings ...)
  2001-05-21 17:22           ` Oliver Xymoron
@ 2001-05-22 18:53           ` Andreas Dilger
  2001-05-24  9:20             ` Malcolm Beattie
  3 siblings, 1 reply; 161+ messages in thread
From: Andreas Dilger @ 2001-05-22 18:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Alexander Viro, Edgar Toernig, Ben LaHaise,
	linux-kernel, linux-fsdevel

Jeff writes:
> Here's a dumb question, and I apologize if I am questioning computer
> science dogma...
> 
> Why are LVM and EVMS(competing LVM project) needed at all?
> 
> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink
> 
> On-line optimization (defrag, etc) shouldn't be hard once you have the
> ability to move blocks and files around, which would come with the
> ability to grow and shrink blkdevs and fs's.

You're missing virtual->physical block mapping allowing you to move parts
of the device around, freedom from the need for contiguous disk space.

In the end, what you've described above is pretty much what LVM does (and
EVMS does better).  Having the various components inside a single layer
like EVMS gives you a lot more flexibility, IMHO.  You also don't have
the issue of wasted minor numbers for unused partitions, or too few minor
numbers in other cases.

For example, with MD RAID you still need devices of equal size to create
a RAID 1 mirror, or part of one device is wasted.  With EVMS you can (in
the future, or right now with AIX/HPUX LVM) do the RAID 1 mirroring on a
per-logical-extent basis and you get your physical extents from any device.
Because your virtual->physical mapping is already abstract, it also allows
you to add mirroring to any existing LVM device without interruption.
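
A toy version of that virtual->physical mapping, to show why contiguous
space is not needed. The extent size, struct names and the three-extent
demo volume are assumptions for the example, not the LVM or EVMS on-disk
format.

```c
#include <assert.h>
#include <stddef.h>

#define EXTENT_BLOCKS 1024UL	/* hypothetical extent size, in blocks */

struct phys_extent {
	int dev;		/* which physical device */
	unsigned long pe;	/* which extent on that device */
};

struct logical_volume {
	const struct phys_extent *map;	/* indexed by logical extent number */
	size_t n_extents;
};

/* Translate a logical block into (device, physical block).
 * Returns 0 on success, -1 if the block lies beyond the volume. */
int lv_map(const struct logical_volume *lv, unsigned long lblock,
	   int *dev, unsigned long *pblock)
{
	unsigned long le = lblock / EXTENT_BLOCKS;

	assert(lv != NULL && dev != NULL && pblock != NULL);
	if (le >= lv->n_extents)
		return -1;
	*dev = lv->map[le].dev;
	*pblock = lv->map[le].pe * EXTENT_BLOCKS + lblock % EXTENT_BLOCKS;
	return 0;
}

/* Three logical extents scattered over two disks, in no particular order. */
static const struct phys_extent demo_map[] = { { 0, 7 }, { 1, 2 }, { 0, 3 } };
const struct logical_volume demo_lv = { demo_map, 3 };

/* Convenience wrappers around lv_map() for the demo volume. */
int demo_dev(unsigned long lblock)
{
	int dev;
	unsigned long pblock;

	return lv_map(&demo_lv, lblock, &dev, &pblock) == 0 ? dev : -1;
}

unsigned long demo_pblock(unsigned long lblock)
{
	int dev;
	unsigned long pblock;

	return lv_map(&demo_lv, lblock, &dev, &pblock) == 0 ? pblock : 0;
}
```

Moving an extent, or mirroring it, then comes down to changing or
duplicating one table entry while the logical block numbers stay put.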

Cheers, Andreas

PS - I used to think shrinking a filesystem online was useful, but there
     are a huge number of problems with this and very few real-life
     benefits, as long as you can at least do offline shrinking.  With
     proper LVM usage, the need to shrink a filesystem never really
     happens in practice, unlike the partition case where you always
     have to guess in advance how big a filesystem needs to be, and then
     add 10% for a safety margin.  With LVM you just create the minimal
     sized device you need now, and freely grow it in the future.
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 18:41         ` Andreas Dilger
@ 2001-05-22 19:06           ` Linus Torvalds
  2001-05-22 19:16             ` Peter J. Braam
  0 siblings, 1 reply; 161+ messages in thread
From: Linus Torvalds @ 2001-05-22 19:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Alexander Viro, Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel


On Tue, 22 May 2001, Andreas Dilger wrote:
> 
> Actually, the LVM snapshot interface has (optional) hooks into the filesystem
> to ensure that it is consistent at the time the snapshot is created.

Note that this is still fundamentally a broken interface: the filesystem
may not _have_ a block device underneath it, yet you might very well like
to do defragmentation and backup none-the-less.

Also, lvm snapshots are fundamentally limited to read-only data, which
means that the LVM interfaces cannot be used for defragmentation and lazy
fsck etc anyway. You _have_ to do those at a filesystem level.

Disk snapshots are useful, but they are not the answer.

		Linus


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 19:06           ` Linus Torvalds
@ 2001-05-22 19:16             ` Peter J. Braam
  2001-05-22 20:10               ` Andreas Dilger
  2001-05-23  9:13               ` Stephen C. Tweedie
  0 siblings, 2 replies; 161+ messages in thread
From: Peter J. Braam @ 2001-05-22 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Dilger, Alexander Viro, Edgar Toernig, Ben LaHaise,
	linux-kernel, linux-fsdevel


On Tue, 22 May 2001, Linus Torvalds wrote:
>


> On Tue, 22 May 2001, Andreas Dilger wrote:
> > Actually, the LVM snapshot interface has (optional) hooks into the
> > filesystem to ensure that it is consistent at the time the snapshot
> > is created.

But I think that LVM is implemented "the wrong way around".

File system journal recovery can corrupt a snapshot, because it copies
data that needs to be preserved in a snapshot. During journal replay such
data may be copied again, but the source can have new data already.

Most LVM snapshot systems write the new data to a separate volume and
don't copy the old data. That eliminates this problem (and also eliminates
the copy of data, but introduces a data copy when a snapshot is removed).

- Peter -


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 19:16             ` Peter J. Braam
@ 2001-05-22 20:10               ` Andreas Dilger
  2001-05-22 20:59                 ` Peter J. Braam
  2001-05-24 21:07                 ` Daniel Phillips
  2001-05-23  9:13               ` Stephen C. Tweedie
  1 sibling, 2 replies; 161+ messages in thread
From: Andreas Dilger @ 2001-05-22 20:10 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Linus Torvalds, Andreas Dilger, Alexander Viro, Edgar Toernig,
	Ben LaHaise, linux-kernel, linux-fsdevel

Peter Braam writes:
> On Tue, 22 May 2001, Andreas Dilger wrote:
> > Actually, the LVM snapshot
> > interface has (optional) hooks into the filesystem to ensure that it
> > is consistent at the time the snapshot is created.
> 
> File system journal recovery can corrupt a snapshot, because it copies
> data that needs to be preserved in a snapshot. During journal replay such
> data may be copied again, but the source can have new data already.

The way it is implemented in reiserfs is to wait for existing transactions
to complete, entirely flush the journal and block all new transactions from
starting.  Stephen implemented a journal flush API to do this for ext3, but
the hooks to call it from LVM are not in place yet.  This way the journal is
totally empty at the time the snapshot is done, so the read-only copy does
not need to do journal recovery, so no problems can arise.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 17:49                   ` Oliver Xymoron
@ 2001-05-22 20:22                     ` Daniel Phillips
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-22 20:22 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Alexander Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Tuesday 22 May 2001 19:49, Oliver Xymoron wrote:
> On Tue, 22 May 2001, Daniel Phillips wrote:
> > > I don't think it's likely to be even workable. Just consider the
> > > directory entry for a moment - is it going to be marked d or
> > > [cb]?
> >
> > It's going to be marked 'd', it's a directory, not a file.
>
> Are we talking about the same proposal?  The one where I can open
> /dev/dsp and /dev/dsp/ctl? But I can still do 'cat /dev/hda >
> /dev/dsp'?

We already support read/write on directories in the VFS, that's not a
problem.

> It's still a file. If it's not a file anymore, it ain't UNIX.

It's a file with the directory bit set, I believe that's UNIX.

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
> >
> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> If you define *technical* to not include design, sure.

Sorry, I don't see what you mean, do you mean the design is
difficult?

> Oh, did I mention unnecessary, solvable in userspace?

That's exactly the point: the generic filesystem allows all the
funny-shaped stuff to be dealt with in user space.  The
filesystem itself is lovely and clean.

BTW, I didn't realize I was reinventing Linus's wheel, this just
seemed very obvious and natural to me.  So I had to believe
there's a technical obstacle somewhere.

Has anyone written code to demonstrate the idea?

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 20:10               ` Andreas Dilger
@ 2001-05-22 20:59                 ` Peter J. Braam
  2001-05-23  9:23                   ` Stephen C. Tweedie
  2001-05-24 21:07                 ` Daniel Phillips
  1 sibling, 1 reply; 161+ messages in thread
From: Peter J. Braam @ 2001-05-22 20:59 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Linus Torvalds, Alexander Viro, Edgar Toernig, Ben LaHaise,
	linux-kernel, linux-fsdevel, linux-lvm


Andreas,


I think that the issue is something different.  Suppose the snapshot has
been created. I know that this can be done safely with the APIs you
allude to. Life goes on and the journal FS keeps changing the file system
and if the system doesn't crash, everything is fine: blocks get copied
correctly from the primary volume to the snapshot volume.

Now consider a crash -- not during snapshot creation, but way after that
when "life is going on".  Suppose there is a two block transaction that
has made it to the journal and after writing one block to the fs location
the system crashes.  The journal replay will try to write that block
again.

But during recovery, LVM cannot possibly know if the whole process of
copying out the data from the current to the snapshot area completed
during the previous run. Yes, LVM updates the redirection table first and
then copies, but, still, you don't know _where exactly_ the writes stopped
happening and in particular you don't know if the block was copied already
or not.

So during replay it is quite possible that LVM corrupts the snapshot.

It's better to keep the snapshot in the old volume and write the new data
to a separate area (that's what most commercial systems do I think).  It
avoids redirections and copying upon write.  When you delete the snapshot
you have to copy, but you can do that as a low priority process.
Finally, as you pointed out a full volume is handled better too in that
way, since you don't terminate the snapshot but you tell the current
volume that it is full.

Hmm, I was expecting a storm of email explaining what I have
misunderstood, but it has in fact been rather quiet...

- Peter -






On Tue, 22 May 2001, Andreas Dilger wrote:

> Peter Braam writes:
> > On Tue, 22 May 2001, Andreas Dilger wrote:
> > > Actually, the LVM snapshot
> > > interface has (optional) hooks into the filesystem to ensure that it
> > > is consistent at the time the snapshot is created.
> >
> > File system journal recovery can corrupt a snapshot, because it copies
> > data that needs to be preserved in a snapshot. During journal replay such
> > data may be copied again, but the source can have new data already.
>
> The way it is implemented in reiserfs is to wait for existing transactions
> to complete, entirely flush the journal and block all new transactions from
> starting.  Stephen implemented a journal flush API to do this for ext3, but
> the hooks to call it from LVM are not in place yet.  This way the journal is
> totally empty at the time the snapshot is done, so the read-only copy does
> not need to do journal recovery, so no problems can arise.
>
> Cheers, Andreas
>

-- 


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFD w/info-PATCH] device arguments from lookup, partion code
  2001-05-21  3:12                             ` Linus Torvalds
  2001-05-21 19:32                               ` Kai Henningsen
@ 2001-05-23  1:15                               ` Albert D. Cahalan
  1 sibling, 0 replies; 161+ messages in thread
From: Albert D. Cahalan @ 2001-05-23  1:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Alexander Viro, Russell King, Richard Gooch,
	Matthew Wilcox, Alan Cox, Andrew Clausen, Ben LaHaise,
	linux-kernel, linux-fsdevel

Linus Torvalds writes:

> The problem is that if you expect to get nice code, you have to have nice
> interfaces and infrastructure. And ioctl's aren't it.
...
> But that absolutely _requires_ that the driver writers should never see
> the silly "pass a random number and a random argument type" kind of
> interface with no structure or infrastructure in place.
>
> Because right now even _good_ programmers make a mess of the fact that
> they get passed a bad interface.
>
> Think of it this way: the user interface to opening a file is
> "open()" with pathnames and magic flags. But a filesystem never even
> _sees_ that interface, it sees a very nicely structured setup where all
> the argument parsing and locking has already been done for it, and the
> magic flags don't even exist any more as far as the low-level FS is
> concerned. Which is why filesystems _can_ be clean.
>
> In contrast, ioctl's are passed through directly, with no help to make
> them clean.

You want a well-defined interface, allowing over-network use?
Well, here you go, the CORBA ORB patch for Linux 2.4 kernels:
http://korbit.sourceforge.net/

Do you want that against 2.4.5-pre5 or what? Plain ASCII email?

:-)

The really sick thing is that I could actually use this too.
It handles the DSP problem well.



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 16:51                 ` Daniel Phillips
  2001-05-22 17:49                   ` Oliver Xymoron
@ 2001-05-23  4:19                   ` Edgar Toernig
  2001-05-23  4:50                     ` Alexander Viro
                                       ` (2 more replies)
  1 sibling, 3 replies; 161+ messages in thread
From: Edgar Toernig @ 2001-05-23  4:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips wrote:
> 
> On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > What I'd like to see:
> > > >
> > > > - An interface for registering an array of related devices
> > > > (almost always two: raw and ctl) and their legacy device numbers
> > > > with a single userspace callout that does whatever /dev/ creation
> > > > needs to be done. Thus, naming and permissions live in user
> > > > space. No "device node is also a directory" weirdness...
> > >
> > > Could you be specific about what is weird about it?
> >
> > *boggle*
> >
> >[general sense of unease]

I fully agree with Oliver.  It's an abomination.

> > I don't think it's likely to be even workable. Just consider the
> > directory entry for a moment - is it going to be marked d or [cb]?
> 
> It's going to be marked 'd', it's a directory, not a file.

Aha.  So you lose the S_ISCHR/BLK attribute.

> > If it doesn't have the directory bit set, Midnight commander won't
> > let me look at it, and I wouldn't blame cd or ls for complaining. If it
> > does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> > million other programs if they did the wrong thing. They've had 30
> > years to expect that files aren't directories. They're going to act
> > weird.
> 
> No problem, it's a directory.

Directories are not allowed to be read from/written to.  The VFS may
support it, but it's not (current) UNIX.

> > Linus has been kicking this idea around for a couple years now and
> > it's still a cute solution looking for a problem. It just doesn't
> > belong in UNIX.
> 
> Hmm, ok, do we still have any *technical* reasons?

So with your definition, I have a fs-object that is marked as a directory
but opening it opens a device.  Pretty nice.  How am I supposed to list
its contents?  open+readdir?  But the open has nasty side effects.
So you have a directory that you are not allowed to list (because of the
possible side effects) but is allowed to be read from/written to, maybe
even issue ioctls to?  And you call that sane???

IMO the whole idea of arguments following the device name is junk (incl
a "/ctrl").

Just think about the implications of the original "/dev/ttyS0/19200"
suggestion.  It sounds nice and tempting.  But which programs will
benefit?  Which get confused?  What will be cleaned up?  After some
thought you'll find out that it's useless ;-)

And with special "ctrl" devices (ie /dev/ttyS0 and /dev/ttyS0ctrl):
This _may_ work for some kind of devices.  But serial ports are one
example where it simply will _not_.  It requires that you know the
name of the device.  For ttys this is often not the case.  Even if
you manage to get some name for stdin for example - now I should
simply attach a "ctrl" to that name to get a control channel???
At least dangerous.  If I'm lucky I only get an EPERM...

Ciao, ET.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-23  4:19                   ` Edgar Toernig
@ 2001-05-23  4:50                     ` Alexander Viro
  2001-05-23 13:50                     ` Daniel Phillips
  2001-05-23 13:50                     ` Daniel Phillips
  2 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-23  4:50 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Daniel Phillips, Oliver Xymoron, linux-kernel, linux-fsdevel



On Wed, 23 May 2001, Edgar Toernig wrote:

> And with special "ctrl" devices (ie /dev/ttyS0 and /dev/ttyS0ctrl):
> This _may_ work for some kind of devices.  But serial ports are one
> example where it simply will _not_.  It requires that you know the

That's quite funny, you know...

------------------------------------------------------------------------
From: Dennis Ritchie (dmr@bell-labs.com)
Subject: Re: Plan 9 (was Re: Rubouts)
Newsgroups: alt.folklore.computers
Date: 1998/10/12
   
Neil Franklin wrote:
>
> No ioctl()s?
>
> Something like:    echo "38400,8,n,1" > /ioctrl/ttyS0    ?
>
> Now that would be cool.
>
Exactly like that, though it would be /dev/eia80ctl .
No ioctl().

> Is there anyone who has an URL about Plan 9. Code download?
>

 http://plan9.bell-labs.com/plan9


        Dennis
------------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 19:16             ` Peter J. Braam
  2001-05-22 20:10               ` Andreas Dilger
@ 2001-05-23  9:13               ` Stephen C. Tweedie
  1 sibling, 0 replies; 161+ messages in thread
From: Stephen C. Tweedie @ 2001-05-23  9:13 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Linus Torvalds, Andreas Dilger, Alexander Viro, Edgar Toernig,
	Ben LaHaise, linux-kernel, linux-fsdevel, Stephen Tweedie

Hi,

On Tue, May 22, 2001 at 01:16:42PM -0600, Peter J. Braam wrote:
 
> File system journal recovery can corrupt a snapshot, because it copies
> data that needs to be preserved in a snapshot.

Journal recovery may move data from the journal to other locations on
the device, yes, but that doesn't change the logical contents of the
filesystem.  I don't see how that results in "corruption": the
snapshot is (or at least, ought to be!) fully independent of the
original version of the data, so such recovery should only be taking
the snapshot from one consistent state to a different but equivalent
state.

> During journal replay such
> data may be copied again, but the source can have new data already.

Only if you are recovering a live volume, surely?  And that is
*guaranteed* to cause problems.  

--Stephen

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 20:59                 ` Peter J. Braam
@ 2001-05-23  9:23                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 161+ messages in thread
From: Stephen C. Tweedie @ 2001-05-23  9:23 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Andreas Dilger, Linus Torvalds, Alexander Viro, Edgar Toernig,
	Ben LaHaise, linux-kernel, linux-fsdevel, linux-lvm,
	Stephen Tweedie

Hi,

On Tue, May 22, 2001 at 02:59:32PM -0600, Peter J. Braam wrote:
 
> But during recovery, LVM cannot possibly know if the whole process of
> copying out the data from the current to the snapshot area completed
> during the previous run. Yes, LVM updates the redirection table first and
> then copies, but, still, you don't know _where exactly_ the writes stopped
> happening and in particular you don't know if the block was copied already
> or not.

LVM updates the snapshot redirection without knowing that the new
redirection location has been written?  So if I write to a LVM
snapshot and take a crash, I might not actually get either the old or
the new data, but in fact some previous random contents of a new
block?  Eek.  Journaling will not like that.  Databases won't like
that.  Anything that relies on fsync to ensure some write ordering on
disk will be potentially upset by that.

> It's better to keep the snapshot in the old volume and write the new data
> to a separate area (that's what most commercial systems do I think).

No.  The commercial systems write snapshots to a new area, usually.
There are two very good reasons for that --- when you come to delete a
snapshot, there's no IO involved; and you avoid fragmenting the
original root volume.  

In systems I'm familiar with, the copy-out is always done in the same
direction with the snapshot getting the new block.  This even happens
if the snapshot is writable: regardless of whether it is the snapshot
or the root being written, the copy-out always results in the snapshot
getting moved, not the root.
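
The copy-out direction described above can be sketched with a toy
in-memory model.  This is an illustration only, not LVM's actual code:
struct cow_volume, the function names and the block sizes are all
hypothetical.  The one point it tries to capture is the ordering: the
old data reaches the snapshot area before the redirection table is
updated, and it is always the snapshot side that gets the new location,
whichever side is being written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8   /* hypothetical volume size, in blocks */
#define BLKSZ   16  /* hypothetical block size, in bytes */

/* Toy model: an origin volume plus a snapshot area and redirection table. */
struct cow_volume {
    char origin[NBLOCKS][BLKSZ];     /* the root volume, never relocated */
    char snap_area[NBLOCKS][BLKSZ];  /* where copied-out blocks live */
    int  copied[NBLOCKS];            /* redirection table: 1 = use snap_area */
};

/* Take a fresh snapshot: nothing has been copied out yet. */
void cow_take_snapshot(struct cow_volume *v)
{
    memset(v->copied, 0, sizeof v->copied);
}

/* Copy the origin block to the snapshot area, then record the redirection. */
static void copy_out(struct cow_volume *v, int blk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    if (v->copied[blk])
        return;
    memcpy(v->snap_area[blk], v->origin[blk], BLKSZ); /* 1. data first */
    v->copied[blk] = 1;                               /* 2. then the table */
}

/* Write to the root: the snapshot gets the copy, the root stays in place. */
void cow_write_origin(struct cow_volume *v, int blk, const char *data)
{
    copy_out(v, blk);
    snprintf(v->origin[blk], BLKSZ, "%s", data);
}

/* Write to a writable snapshot: still the snapshot side that moves. */
void cow_write_snapshot(struct cow_volume *v, int blk, const char *data)
{
    copy_out(v, blk);
    snprintf(v->snap_area[blk], BLKSZ, "%s", data);
}

/* Read the snapshot's view of a block. */
const char *cow_read_snapshot(const struct cow_volume *v, int blk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    return v->copied[blk] ? v->snap_area[blk] : v->origin[blk];
}
```

In this model a crash between steps 1 and 2 leaves the table pointing
at the origin block, which still holds the old data, so the snapshot
never sees a half-initialised block.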

Cheers, 
 Stephen

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-23  4:19                   ` Edgar Toernig
  2001-05-23  4:50                     ` Alexander Viro
@ 2001-05-23 13:50                     ` Daniel Phillips
  2001-05-23 13:50                     ` Daniel Phillips
  2 siblings, 0 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-23 13:50 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> IMO the whole idea of arguments following the device name is junk
> (incl a "/ctrl").

You know I didn't suggest that, right?  I find it pretty strange too, but
I'm listening to hear the technical arguments.

> Just think about the implications of the original "/dev/ttyS0/19200"
> suggestion.  It sounds nice and tempting.  But which programs will
> benefit?  Which get confused?  What will be cleaned up?  After some
> thought you'll find out that it's useless ;-)

You know I didn't suggest that either, right?  But I'm with you, I don't
like it at all, not least because we might change baud rate on the fly.

> And with special "ctrl" devices (ie /dev/ttyS0 and /dev/ttyS0ctrl):
> This _may_ work for some kind of devices.  But serial ports are one
> example where it simply will _not_.  It requires that you know the
> name of the device.  For ttys this is often not the case.
> Even if you manage to get some name for stdin for example - now I 
> should simply attach a "ctrl" to that name to get a control channel???
> At least dangerous.  If I'm lucky I only get an EPERM...

Again, I'll provide a sympathetic ear, but it wasn't my suggestion.

> Ciao, ET.

And you were referring to who?

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-23  4:19                   ` Edgar Toernig
  2001-05-23  4:50                     ` Alexander Viro
  2001-05-23 13:50                     ` Daniel Phillips
@ 2001-05-23 13:50                     ` Daniel Phillips
  2001-05-23 15:58                       ` Oliver Xymoron
  2001-05-24  0:23                       ` Edgar Toernig
  2 siblings, 2 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-23 13:50 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > > What I'd like to see:
> > > > >
> > > > > - An interface for registering an array of related devices
> > > > > (almost always two: raw and ctl) and their legacy device
> > > > > numbers with a single userspace callout that does whatever
> > > > > /dev/ creation needs to be done. Thus, naming and permissions
> > > > > live in user space. No "device node is also a directory"
> > > > > weirdness...
> > > >
> > > > Could you be specific about what is weird about it?
> > >
> > > *boggle*
> > >
> > >[general sense of unease]
>
> I fully agree with Oliver.  It's an abomination.

We are, or at least, I am, investigating this question purely on
technical grounds - name calling is a noop.  I'd be happy to find a
real reason why this is a bad idea but so far none has been
presented.

Don't get me wrong, the fact that people I respect have reservations
about the idea does mean something to me, but this still needs to be
investigated properly.  Now on to the technical content...

> > > I don't think it's likely to be even workable. Just consider the
> > > directory entry for a moment - is it going to be marked d or
> > > [cb]?
> >
> > It's going to be marked 'd', it's a directory, not a file.
>
> Aha.  So you lose the S_ISCHR/BLK attribute.

Readdir fills in a directory type, so ls sees it as a directory and does
the right thing.  On the other hand, we know we're on a device 
filesystem so we will next open the name as a regular file, and find
ISCHR or ISBLK: good.

The rule for this filesystem is: if you open with O_DIRECTORY then
directory operations are permitted, nothing else.  If you open without
O_DIRECTORY then directory operations are forbidden (as
usual) and normal device semantics apply.

If there is weirdness anywhere, it's right here with this rule.  The
question is: what if anything breaks?
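
Stated as code, the rule amounts to a small predicate.  This is only a
sketch of the proposal as described in this message, not existing
kernel code; devdir_op_allowed() and the enum devdir_op names are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* make sure O_DIRECTORY is visible */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical operation classes on a "device that is also a directory". */
enum devdir_op {
    DEVDIR_READDIR,  /* getdents()/readdir() */
    DEVDIR_READ,     /* read() from the device */
    DEVDIR_WRITE,    /* write() to the device */
    DEVDIR_IOCTL     /* ioctl() on the device */
};

/*
 * Proposed rule: an fd opened with O_DIRECTORY permits directory
 * operations and nothing else; an fd opened without it permits device
 * operations and no directory operations.
 */
bool devdir_op_allowed(int open_flags, enum devdir_op op)
{
    assert(op <= DEVDIR_IOCTL);

    bool opened_as_dir = (open_flags & O_DIRECTORY) != 0;
    bool is_dir_op = (op == DEVDIR_READDIR);

    return opened_as_dir == is_dir_op;
}
```

For example, devdir_op_allowed(O_RDONLY, DEVDIR_READDIR) is false under
this rule, which is exactly the case that matters for programs that
list a directory through a plain open().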

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
>
> Directories are not allowed to be read from/written to.  The VFS may
> support it, but it's not (current) UNIX.

Here, we obey this rule: if you open it with O_DIRECTORY then you
can't read from or write to it.

> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> So with your definition, I have a fs-object that is marked as a
> directory but opening it opens a device.  Pretty nice..

No, you have to open it without O_DIRECTORY to get your device
fd handle.

> How am I supposed to list its contents?  open+readdir?

Nothing breaks here, ls works as it always did.

This is what ls does:

open("foobar", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(0x3, 0x2, 0x1, 0x2)             = -1 ENOSYS (Function not implemented)
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
brk(0x805b000)                          = 0x805b000
getdents64(0x3, 0x8058270, 0x1000, 0x26) = -1 ENOSYS (Function not implemented)
getdents(3, /* 2 entries */, 2980)      = 28
getdents(3, /* 0 entries */, 2980)      = 0
close(3)                                = 0

Note that ls doesn't do anything as inconvenient as opening 
foobar as a normal file first, expecting that operation to fail.

> But the open has nasty side effects.
> So you have a directory that you are not allowed
> to list (because of the possible side effects) but is allowed to be
> read from/written to, maybe even issue ioctls to?

No, you would get side effects only if you open as a regular file.
I'd agree that that sucks, but that's not what we're trying to fix
just now.

> And you call that sane???

I would hope it seems saner now, after the clarification.
Please, if you know something that actually breaks, tell me.

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-23 13:50                     ` Daniel Phillips
@ 2001-05-23 15:58                       ` Oliver Xymoron
  2001-05-24  0:23                       ` Edgar Toernig
  1 sibling, 0 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-23 15:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Edgar Toernig, linux-kernel, linux-fsdevel

On Wed, 23 May 2001, Daniel Phillips wrote:

> > > > *boggle*
> > > >
> > > >[general sense of unease]
> >
> > I fully agree with Oliver.  It's an abomination.
>
> We are, or at least, I am, investigating this question purely on
> technical grounds - name calling is a noop.  I'd be happy to find a
> real reason why this is a bad idea but so far none has been
> presented.

I will agree that the thing can be done in principle. You're not going to
find anyone who's going to argue that part. All other things being equal,
I actually think it's a neat idea.

The part that is a problem is people, namely people who write programs.
They've had decades to expect that directories are not also files, and if
they happen to do things like check whether a file is not a directory
before opening it, it's _our fault_ if they get confused.

Consider the recent subtle change to fork() that was reversed because it
uncovered an unforeseen bug in bash. The proposed change is not at all
subtle, is entirely without precedent, and is likely to break much.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-23 13:50                     ` Daniel Phillips
  2001-05-23 15:58                       ` Oliver Xymoron
@ 2001-05-24  0:23                       ` Edgar Toernig
  2001-05-24  7:47                         ` Marko Kreen
  2001-05-24 17:25                         ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Daniel Phillips
  1 sibling, 2 replies; 161+ messages in thread
From: Edgar Toernig @ 2001-05-24  0:23 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips wrote:
> On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> > Daniel Phillips wrote:
> > > On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > > > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > > > What I'd like to see:
> > > > > >
> > > > > > - An interface for registering an array of related devices
> > > > > > (almost always two: raw and ctl) and their legacy device
> > > > > > numbers with a single userspace callout that does whatever
> > > > > > /dev/ creation needs to be done. Thus, naming and permissions
> > > > > > live in user space. No "device node is also a directory"
> > > > > > weirdness...
> > > > >
> > > > > Could you be specific about what is weird about it?
> > > >
> > > > *boggle*
> > > >
> > > >[general sense of unease]
> >
> > I fully agree with Oliver.  It's an abomination.
> 
> We are, or at least, I am, investigating this question purely on
> technical grounds - name calling is a noop.

Right.  But sometimes new ideas raise these kinds of feelings ;)

> > > It's going to be marked 'd', it's a directory, not a file.
> >
> > Aha.  So you lose the S_ISCHR/BLK attribute.
> 
> Readdir fills in a directory type, so ls sees it as a directory and does
> the right thing.  On the other hand, we know we're on a device
> filesystem so we will next open the name as a regular file, and find
> ISCHR or ISBLK: good.

??? The kernel may know it, but the app?  Or do you really want to
give different stat data on stat(2) and fstat(2)?  These flags are
currently used by archive/backup prgs.  It's a hint that these files
are not regular files and shouldn't be opened for reading.
Having a 'd' would mean that they would really try to enter the
directory and save its contents.  Don't know what happens in this
case to your "special" files ;-)

> The rule for this filesystem is: if you open with O_DIRECTORY then
> directory operations are permitted, nothing else.  If you open without
> O_DIRECTORY then directory operations are forbidden (as
> usual) and normal device semantics apply.

As usual?  I think you've just changed the rules for O_DIRECTORY.  Up
to now it's only a flag that tells open it should fail if the name
does not refer to a directory.  Nothing else.  It was introduced to
remove a race condition in user space applications.  In particular, it
is optional - everything works the same whether you give the flag
or not (except the race avoidance of course).  And there are a lot
of programs that do not use O_DIRECTORY (it's a Linux private flag,
not even mentioned in POSIX).  Every program that does:

	fd = open(foo, O_RDONLY);
	fchdir(fd);
	x = opendir(".");

will break.  And that is POSIX conformant.  And I know that there are
programs that use this when recursively scanning directories (avoids
name mangling and repeated name lookups of the directory on later
stat calls).
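
A fuller, compilable version of that pattern, as a sketch: the helper
name count_entries_via_fchdir() is hypothetical, and error handling is
kept minimal.  The directory is opened with plain O_RDONLY, without
O_DIRECTORY, and is then used for fchdir() and opendir(".").

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* make sure fchdir() is declared */
#endif
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Count the entries of `path` (including "." and "..") using
 * open + fchdir + opendir(".").  Returns -1 on any error.  The working
 * directory is changed temporarily and restored before returning.
 */
int count_entries_via_fchdir(const char *path)
{
    assert(path != NULL);

    int saved = open(".", O_RDONLY);    /* remember where we started */
    if (saved < 0)
        return -1;

    int fd = open(path, O_RDONLY);      /* note: no O_DIRECTORY */
    if (fd < 0) {
        close(saved);
        return -1;
    }

    int count = -1;
    if (fchdir(fd) == 0) {
        DIR *dir = opendir(".");
        if (dir != NULL) {
            count = 0;
            while (readdir(dir) != NULL)
                count++;
            closedir(dir);
        }
        if (fchdir(saved) != 0)         /* go back to the start */
            count = -1;
    }

    close(fd);
    close(saved);
    return count;
}
```

Under a rule where a plain open() of the name yields a device fd, the
fchdir() step here would no longer be a directory operation on a
directory fd, which is the breakage being described.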

> > Directories are not allowed to be read from/written to.  The VFS may
> > support it, but it's not (current) UNIX.
> 
> Here, we obey this rule: if you open it with O_DIRECTORY then you
> can't read from or write to it.

IMHO you've just invented opendir(2).

> Nothing breaks here, ls works as it always did.
> 
> This is what ls does:
> 
> open("foobar", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fcntl64(0x3, 0x2, 0x1, 0x2)             = -1 ENOSYS (Function not implemented)
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> brk(0x805b000)                          = 0x805b000
> getdents64(0x3, 0x8058270, 0x1000, 0x26) = -1 ENOSYS (Function not implemented)
> getdents(3, /* 2 entries */, 2980)      = 28
> getdents(3, /* 0 entries */, 2980)      = 0
> close(3)                                = 0
> 
> Note that ls doesn't do anything as inconvenient as opening
> foobar as a normal file first, expecting that operation to fail.

Well, your ls does not work "as it always did".  Here's an strace of
my libc5 system ls:

open(".", O_RDONLY)                     = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
getdents(3, /* 64 entries */, 4096)     = 1216
getdents(3, /* 9 entries */, 4096)      = 168
getdents(3, /* 0 entries */, 4096)      = 0
close(3)                                = 0

And my find(1) does:

open(".", O_RDONLY)                     = 3
[scan all dirs]
fchdir(3)                               = 0

to return to its initial dir.  Will break too.

> No, you would get side effects only if you open as a regular file.

IMHO your assumption that opening a dir _requires_ O_DIRECTORY is
wrong.  You've introduced new semantics that were not there before and
that will break programs and POSIX conformance.
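
The current behaviour is easy to check from user space.  A minimal
sketch (try_open() is a hypothetical helper; the paths in the note
below assume an ordinary Linux system with /dev/null present):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* make sure O_DIRECTORY is visible */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` read-only with `extra_flags` added.  Returns 0 on
 * success, or the errno value from open() on failure.
 */
int try_open(const char *path, int extra_flags)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | extra_flags);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

Here try_open("/", 0) and try_open("/", O_DIRECTORY) both succeed,
while try_open("/dev/null", O_DIRECTORY) fails with ENOTDIR.  The flag
only adds a failure case for non-directories; it is not needed to open
a directory.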

> Please, if you know something that actually breaks, tell me.

Yeah, see above ;)

Ciao, ET.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24  0:23                       ` Edgar Toernig
@ 2001-05-24  7:47                         ` Marko Kreen
  2001-05-24 14:39                           ` Oliver Xymoron
  2001-05-24 17:25                         ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Daniel Phillips
  1 sibling, 1 reply; 161+ messages in thread
From: Marko Kreen @ 2001-05-24  7:47 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Daniel Phillips, Oliver Xymoron, linux-kernel, linux-fsdevel

On Thu, May 24, 2001 at 02:23:27AM +0200, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > > > It's going to be marked 'd', it's a directory, not a file.
> > >
> > > Aha.  So you lose the S_ISCHR/BLK attribute.
> > 
> > Readdir fills in a directory type, so ls sees it as a directory and does
> > the right thing.  On the other hand, we know we're on a device
> > filesystem so we will next open the name as a regular file, and find
> > ISCHR or ISBLK: good.
> 
> ??? The kernel may know it, but the app?  Or do you really want to
> give different stat data on stat(2) and fstat(2)?  These flags are
> currently used by archive/backup prgs.  It's a hint that these files
> are not regular files and shouldn't be opened for reading.
> Having a 'd' would mean that they would really try to enter the
> directory and save its contents.  Don't know what happens in this
> case to your "special" files ;-)

IMHO the CHR/BLK is not needed.  Think of /proc.  In the future,
the backup tools will be told to ignore /dev, that's all.

-- 
marko


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-22 18:53           ` Andreas Dilger
@ 2001-05-24  9:20             ` Malcolm Beattie
  2001-05-24 19:15               ` Andreas Dilger
  0 siblings, 1 reply; 161+ messages in thread
From: Malcolm Beattie @ 2001-05-24  9:20 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-fsdevel

[cc list reduced]

Andreas Dilger writes:
> PS - I used to think shrinking a filesystem online was useful, but there
>      are a huge amount of problems with this and very few real-life
>      benefits, as long as you can at least do offline shrinking.  With
>      proper LVM usage, the need to shrink a filesystem never really
>      happens in practise, unlike the partition case where you always
>      have to guess in advance how big a filesystem needs to be, and then
>      add 10% for a safety margin.  With LVM you just create the minimal
>      sized device you need now, and freely grow it in the future.

In an attempt to nudge you back towards your previous opinion: consider
a system-wide spool or tmp filesystem. It would be nice to be able to
add in a few extra volumes for a busy period but then shrink it down
again when usage returns to normal. In the absence of the ability to
shrink a live filesystem, storage management becomes a much harder job.
You can't throw in a spare volume or two where it's needed without
careful thought because you'll be ratchetting up the space on that one
filesystem without being able to change your mind and reduce it again
later. You'll end up with stingy storage admins who refuse to give you
a bunch of extra filesystem space for a while because they can't get it
back again afterwards.

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24  7:47                         ` Marko Kreen
@ 2001-05-24 14:39                           ` Oliver Xymoron
  2001-05-24 15:20                             ` CHR/BLK needed? was: Re: Why side-effects on open Marko Kreen
  2001-05-24 17:12                             ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device Albert D. Cahalan
  0 siblings, 2 replies; 161+ messages in thread
From: Oliver Xymoron @ 2001-05-24 14:39 UTC (permalink / raw)
  To: Marko Kreen; +Cc: Edgar Toernig, Daniel Phillips, linux-kernel, linux-fsdevel

On Thu, 24 May 2001, Marko Kreen wrote:

> On Thu, May 24, 2001 at 02:23:27AM +0200, Edgar Toernig wrote:
> > Daniel Phillips wrote:
> > > > > It's going to be marked 'd', it's a directory, not a file.
> > > >
> > > > Aha.  So you lose the S_ISCHR/BLK attribute.
> > >
> > > Readdir fills in a directory type, so ls sees it as a directory and does
> > > the right thing.  On the other hand, we know we're on a device
> > > filesystem so we will next open the name as a regular file, and find
> > > ISCHR or ISBLK: good.
> >
> > ??? The kernel may know it, but the app?  Or do you really want to
> > give different stat data on stat(2) and fstat(2)?  These flags are
> > currently used by archive/backup prgs.  It's a hint that these files
> > are not regular files and shouldn't be opened for reading.
> > Having a 'd' would mean that they would really try to enter the
> > directory and save its contents.  Don't know what happens in this
> > case to your "special" files ;-)
>
> IMHO the CHR/BLK is not needed.  Think of /proc.  In the future,
> the backup tools will be told to ignore /dev, that's all.

The /dev dir should not be special. At least not to the kernel. I have
device files in places other than /dev, and you probably do too (hint:
anonymous FTP).

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


^ permalink raw reply	[flat|nested] 161+ messages in thread

* CHR/BLK needed?   was: Re: Why side-effects on open...
  2001-05-24 14:39                           ` Oliver Xymoron
@ 2001-05-24 15:20                             ` Marko Kreen
  2001-05-24 17:12                             ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device Albert D. Cahalan
  1 sibling, 0 replies; 161+ messages in thread
From: Marko Kreen @ 2001-05-24 15:20 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Edgar Toernig, Daniel Phillips, linux-kernel, linux-fsdevel

On Thu, May 24, 2001 at 09:39:35AM -0500, Oliver Xymoron wrote:
> On Thu, 24 May 2001, Marko Kreen wrote:
> > IMHO the CHR/BLK is not needed.  Think of /proc.  In the future,
> > the backup tools will be told to ignore /dev, that's all.
> 
> The /dev dir should not be special. At least not to the kernel. I have
> device files in places other than /dev, and you probably do too (hint:
> anonymous FTP).

So?  Do you allow downloading from/to /dev in your chrooted ftp?

Of course this is not hard-wired or anything.  You tell devfsd
to put devs somewhere.  The next moment you edit the backup config
and tell it to ignore that /somewhere.  As I said: like /proc
currently is.  Or should the current /proc be converted to CHR devices?

My idea is (well, 'devfs2' - I have the core almost working now)
that the 'devices' will be VFS-only objects - they live
only in the inode cache (on ramfs).  So the CHR/BLK flags are only
backwards compatibility for supporting major:minors for /dev on
e.g. ext2.  Currently I am thinking of exposing device inodes as
ordinary files (or dirs if needed), so they look like any file to
programs.  Will this break too much?  Another variant would be
to expose them as S_IFDEV - which probably breaks even more.


-- 
marko


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device
  2001-05-24 14:39                           ` Oliver Xymoron
  2001-05-24 15:20                             ` CHR/BLK needed? was: Re: Why side-effects on open Marko Kreen
@ 2001-05-24 17:12                             ` Albert D. Cahalan
  1 sibling, 0 replies; 161+ messages in thread
From: Albert D. Cahalan @ 2001-05-24 17:12 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Marko Kreen, Edgar Toernig, Daniel Phillips, linux-kernel, linux-fsdevel

Oliver Xymoron writes:

> The /dev dir should not be special. At least not to the kernel. I have
> device files in places other than /dev, and you probably do too (hint:
> anonymous FTP).

This is a horribly broken FTP server.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24  0:23                       ` Edgar Toernig
  2001-05-24  7:47                         ` Marko Kreen
@ 2001-05-24 17:25                         ` Daniel Phillips
  2001-05-24 20:59                           ` Edgar Toernig
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-24 17:25 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Thursday 24 May 2001 02:23, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > > > It's going to be marked 'd', it's a directory, not a file.
> > >
> > > Aha.  So you lose the S_ISCHR/BLK attribute.
> >
> > Readdir fills in a directory type, so ls sees it as a directory and
> > does the right thing.  On the other hand, we know we're on a device
> > filesystem so we will next open the name as a regular file, and
> > find ISCHR or ISBLK: good.
>
> ??? The kernel may know it, but the app?  Or do you really want to
> give different stat data on stat(2) and fstat(2)?  These flags are
> currently used by archive/backup prgs.  It's a hint that these files
> are not regular files and shouldn't be opened for reading.
> Having a 'd' would mean that they would really try to enter the
> directory and save its contents.  Don't know what happens in this
> case to your "special" files ;-)

I guess that's much like the question 'what happens in proc?'.

Recursively entering the device directory is ok as long as everything
inside it is ok.  I tried zipping /proc/bus with -r, and what I got is
what I'd expect if I'd cat'ed every non-directory entry.  Maybe it's not
right - zipping /proc/kcore is kind of interesting.  Regardless, we are
no worse than proc here.  In fact,
since we don't anticipate putting an elephant like kcore in as a
device property, we're a little nicer to get along with.

Correct me if I'm wrong, but what we learn from the proc example
is that tarring your whole source tree starting at / is not something
you want to do.  Just extend that idea to /dev.  If you do it anyway,
though, it will produce pretty reasonable results.

What *won't* happen is, you won't get side effects from opening
your serial ports (you'd have to open them without O_DIRECTORY
to get that) so that seems like a little step forward.
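
For reference, a minimal sketch of the flag under discussion as it
behaves on a stock kernel (this is not the proposed magicdev
semantics): with O_DIRECTORY, open(2) fails with ENOTDIR unless the
name is a directory, so a device node is never actually opened.  The
helper name probe_directory() is made up for illustration.

```c
#define _XOPEN_SOURCE 700	/* for O_DIRECTORY under strict compilers */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open `path` strictly as a directory.
 * Returns 0 if that works, or a negative errno value if it does not.
 * The descriptor is closed again at once; only the outcome matters.
 */
int probe_directory(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -errno;	/* -ENOTDIR for a device node or regular file */
	close(fd);
	return 0;
}
```

Calling probe_directory("/") should return 0, while pointing it at a
device node such as a serial port should fail without touching the
device.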

I'm still thinking about some of your other comments.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-24  9:20             ` Malcolm Beattie
@ 2001-05-24 19:15               ` Andreas Dilger
  0 siblings, 0 replies; 161+ messages in thread
From: Andreas Dilger @ 2001-05-24 19:15 UTC (permalink / raw)
  To: Malcolm Beattie; +Cc: Andreas Dilger, linux-kernel, linux-fsdevel

Malcolm Beattie writes:
> Andreas Dilger writes:
> > PS - I used to think shrinking a filesystem online was useful, but there
> >      are a huge number of problems with this and very few real-life
> >      benefits, as long as you can at least do offline shrinking.  With
> >      proper LVM usage, the need to shrink a filesystem never really
> >      happens in practice, unlike the partition case where you always
> >      have to guess in advance how big a filesystem needs to be, and then
> >      add 10% for a safety margin.  With LVM you just create the minimal
> >      sized device you need now, and freely grow it in the future.
> 
> In an attempt to nudge you back towards your previous opinion: consider
> a system-wide spool or tmp filesystem. It would be nice to be able to
> add in a few extra volumes for a busy period but then shrink it down
> again when usage returns to normal. In the absence of the ability to
> shrink a live filesystem, storage management becomes a much harder job.
> You can't throw in a spare volume or two where it's needed without
> careful thought because you'll be ratchetting up the space on that one
> filesystem without being able to change your mind and reduce it again
> later. You'll end up with stingy storage admins who refuse to give you
> a bunch of extra filesystem space for a while because they can't get it
> back again afterwards.

I suppose it depends a bit on how your system is administered.  On LVM
systems, I tend to allocate new volumes for special situations like this.
When the special need is gone, you simply remove the whole thing.  Yes,
this is a bit of a hack for not having online shrinking, but I have not
really had a _big_ need to do that.

The only time I've really needed online shrinking is when someone
screwed up and made / or /var way too huge for some (bad) reason and
you can't unmount it conveniently.  Under AIX, you can't shrink JFS
even unmounted so it meant backup/restore.  Even so, having empty
space in a filesystem is not a reason to panic, while having no free
space in a filesystem _is_ a reason to panic, hence online growing
of ext2.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-24 17:25                         ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Daniel Phillips
@ 2001-05-24 20:59                           ` Edgar Toernig
  2001-05-24 21:26                             ` Alexander Viro
  2001-05-25 11:00                             ` Daniel Phillips
  0 siblings, 2 replies; 161+ messages in thread
From: Edgar Toernig @ 2001-05-24 20:59 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips wrote:
> 
> > > Readdir fills in a directory type, so ls sees it as a directory and
> > > does the right thing.  On the other hand, we know we're on a device
> > > filesystem so we will next open the name as a regular file, and
> > > find ISCHR or ISBLK: good.
> >
> > ??? The kernel may know it, but the app?  Or do you really want to
> > give different stat data on stat(2) and fstat(2)?  These flags are
> > currently used by archive/backup prgs.  It's a hint that these files
> > are not regular files and shouldn't be opened for reading.
> > Having a 'd' would mean that they would really try to enter the
> > directory and save its contents.  Don't know what happens in this
> > case to your "special" files ;-)
> 
> I guess that's much like the question 'what happens in proc?'.

And that's already bad enough.  Most of the "files" in proc should
be fifos!  And using proc as an excuse to introduce another set of
magic dirs?  No, thanks.

> Correct me if I'm wrong, but what we learn from the proc example
> is that tarring your whole source tree starting at / is not something
> you want to do.

IMHO it would be better to fix proc instead of adding more magic.  At
the moment you have to exclude /proc.  You want to add /dev.  And next?
Exclude all $HOME/dev (in case process name spaces get added)?  Or make
fifos magic too and add all of them to the exclude list?  But there's
no central place for fifos.  So lets add more magic :-(

> What *won't* happen is, you won't get side effects from opening
> your serial ports (you'd have to open them without O_DIRECTORY
> to get that) so that seems like a little step forward.

As already said: depending on O_DIRECTORY breaks POSIX compliance
and that alone should kill this idea...

Over and out, ET.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-22 20:10               ` Andreas Dilger
  2001-05-22 20:59                 ` Peter J. Braam
@ 2001-05-24 21:07                 ` Daniel Phillips
  2001-05-24 22:00                   ` Hans Reiser
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-24 21:07 UTC (permalink / raw)
  To: Andreas Dilger, Peter J. Braam
  Cc: Linus Torvalds, Andreas Dilger, Alexander Viro, Edgar Toernig,
	Ben LaHaise, linux-kernel, linux-fsdevel

On Tuesday 22 May 2001 22:10, Andreas Dilger wrote:
> Peter Braam writes:
> > File system journal recovery can corrupt a snapshot, because it
> > copies data that needs to be preserved in a snapshot. During
> > journal replay such data may be copied again, but the source can
> > have new data already.
>
> The way it is implemented in reiserfs is to wait for existing
> transactions to complete, entirely flush the journal and block all
> new transactions from starting.  Stephen implemented a journal flush
> API to do this for ext3, but the hooks to call it from LVM are not in
> place yet.  This way the journal is totally empty at the time the
> snapshot is done, so the read-only copy does not need to do journal
> recovery, so no problems can arise.

I suppose I'm just reiterating the obvious, but we should eventually
have a generic filesystem transaction API at the VFS level, once we
have enough data points to know what the One True API should be.

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-24 20:59                           ` Edgar Toernig
@ 2001-05-24 21:26                             ` Alexander Viro
  2001-05-25  1:03                               ` Daniel Phillips
  2001-05-25 11:00                             ` Daniel Phillips
  1 sibling, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-24 21:26 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Daniel Phillips, Oliver Xymoron, linux-kernel, linux-fsdevel



On Thu, 24 May 2001, Edgar Toernig wrote:

> > What *won't* happen is, you won't get side effects from opening
> > your serial ports (you'd have to open them without O_DIRECTORY
> > to get that) so that seems like a little step forward.
> 
> As already said: depending on O_DIRECTORY breaks POSIX compliance
> and that alone should kill this idea...

What really kills that idea is the fact that you can trick applications
into opening your serial ports _without_ O_DIRECTORY.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-24 21:07                 ` Daniel Phillips
@ 2001-05-24 22:00                   ` Hans Reiser
  2001-05-25 10:56                     ` Daniel Phillips
  0 siblings, 1 reply; 161+ messages in thread
From: Hans Reiser @ 2001-05-24 22:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andreas Dilger, Peter J. Braam, Linus Torvalds, Alexander Viro,
	Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel,
	Josh MacDonald, reiserfs-list

Daniel Phillips wrote:
> 
> On Tuesday 22 May 2001 22:10, Andreas Dilger wrote:
> > Peter Braam writes:
> > > File system journal recovery can corrupt a snapshot, because it
> > > copies data that needs to be preserved in a snapshot. During
> > > journal replay such data may be copied again, but the source can
> > > have new data already.
> >
> > The way it is implemented in reiserfs is to wait for existing
> > transactions to complete, entirely flush the journal and block all
> > new transactions from starting.  Stephen implemented a journal flush
> > API to do this for ext3, but the hooks to call it from LVM are not in
> > place yet.  This way the journal is totally empty at the time the
> > snapshot is done, so the read-only copy does not need to do journal
> > recovery, so no problems can arise.
> 
> I suppose I'm just reiterating the obvious, but we should eventually
> have a generic filesystem transaction API at the VFS level, once we
> have enough data points to know what the One True API should be.
> 
> --
> Daniel


Daniel, implementing transactions is not a trivial thing as you probably know. 
It requires that you resolve such issues as, what happens if the user forgets to
close the transaction, issues of lock/transaction duration, of transaction
batching, of levels of isolation, of concurrent transactions modifying global fs
metadata and some but not all of those concurrent transactions receiving a
rollback, and of permissions relating to keeping transactions open.  I would
encourage you to participate in the reiser4 design discussion we will be having
over the next 6 months, and give us your opinions.  Josh will be leading that
design effort for the ReiserFS team.

Hans

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24 21:26                             ` Alexander Viro
@ 2001-05-25  1:03                               ` Daniel Phillips
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-25  1:03 UTC (permalink / raw)
  To: Alexander Viro, Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Thursday 24 May 2001 23:26, Alexander Viro wrote:
> On Thu, 24 May 2001, Edgar Toernig wrote:
> > > What *won't* happen is, you won't get side effects from opening
> > > your serial ports (you'd have to open them without O_DIRECTORY
> > > to get that) so that seems like a little step forward.
> >
> > As already said: depending on O_DIRECTORY breaks POSIX compliance
> > and that alone should kill this idea...
>
> What really kills that idea is the fact that you can trick
> applications into opening your serial ports _without_ O_DIRECTORY.

Err, I thought we already had that problem, but worse: an ordinary
ls -l will do it.  This way, we harmlessly list the device's properties 
instead.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24 22:00                   ` Hans Reiser
@ 2001-05-25 10:56                     ` Daniel Phillips
  2001-06-01  3:24                       ` [reiserfs-list] " Hans Reiser
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-25 10:56 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Andreas Dilger, Peter J. Braam, Linus Torvalds, Alexander Viro,
	Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel,
	Josh MacDonald, reiserfs-list

On Friday 25 May 2001 00:00, Hans Reiser wrote:
> Daniel Phillips wrote:
> > I suppose I'm just reiterating the obvious, but we should
> > eventually have a generic filesystem transaction API at the VFS
> > level, once we have enough data points to know what the One True
> > API should be.
>
> Daniel, implementing transactions is not a trivial thing as you
> probably know. It requires that you resolve such issues as, what
> happens if the user forgets to close the transaction, issues of
> lock/transaction duration, of transaction batching, of levels of
> isolation, of concurrent transactions modifying global fs metadata
> and some but not all of those concurrent transactions receiving a
> rollback, and of permissions relating to keeping transactions open. 
> I would encourage you to participate in the reiser4 design discussion
> we will be having over the next 6 months, and give us your opinions. 
> Josh will be leading that design effort for the ReiserFS team.

Graciously accepted.  Coming up with something sensible in a mere 6 
months would be a minor miracle. ;-)

- what happens if the user forgets to close the transaction?

   I plan to set a checkpoint there (because the transaction got
   too big) and log the fact that it's open.

- issues of lock/transaction duration

   Once again relying on checkpoints, when the transaction gets
   uncomfortably big for cache, set a checkpoint.  I haven't thought
   about locks.

- transaction batching

   1) Explicit transaction batch close.  2) Cache gets past a certain
   fullness.  In both cases, no new transactions are allowed to start
   and as soon as all current ones are closed we close the batch.

- of levels of isolation
- concurrent transactions modifying global fs metadata
   and some but not all of those concurrent transactions receiving a
   rollback

   First I was going to write 'huh?' here, then I realized you're
   talking about real database ops, not just filesystem ops.  I had
   in mind something more modest: transactions are 'mv', 'read/write'
   (if the 'atomic read/write' is set), other filesystem operations I've
   forgotten, and anything the user puts between open_xact and
   close_xact.  You are raising the ante a little ;-)

   In my case (Tux2) I could do an efficient rollback to the beginning
   of the batch (phase), but then I would have to keep an
   in-memory log of the transactions for selective replay.  With a
   journal log you can obviously do the same thing, but perhaps more
   efficiently if your journal design supports undo/redo.

   The above is a pure flight of fancy, we won't be seeing anything
   so fancy as an API across filesystems.

- permissions relating to keeping transactions open. 
   We can see this one in the light of a simple filesystem              
   transaction: what happens if we are in the middle of a mv and        
   someone changes the permissions?  Go with the starting or
   ending permissions?

Well, the database side of this is really interesting, but to get 
something generic across filesystems, the scope pretty well has to be 
limited to journal-type transactions, don't you think?
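
To make the batching rule above concrete (a batch starts closing on an
explicit request or when the cache passes a certain fullness, refuses
new transactions from then on, and commits once the last open one is
closed), here is a rough user-space sketch.  Only open_xact and
close_xact are names from this mail; the struct, its fields, and the
other helpers are assumptions made up for the illustration, and
locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct xact_batch {
	int open_xacts;			/* transactions currently open */
	bool closing;			/* batch no longer admits new ones */
	size_t cache_used;		/* dirty cache charged to this batch */
	size_t cache_limit;		/* the "certain fullness" threshold */
	unsigned closed_batches;	/* batches committed so far */
};

/* Commit the batch once it is closing and nothing is left open. */
static void maybe_commit(struct xact_batch *b)
{
	if (b->closing && b->open_xacts == 0) {
		b->closed_batches++;
		b->closing = false;	/* the next batch starts empty */
		b->cache_used = 0;
	}
}

/* Returns false if the caller has to wait for the next batch. */
bool open_xact(struct xact_batch *b)
{
	assert(b != NULL);
	if (b->closing)
		return false;
	b->open_xacts++;
	return true;
}

/* Trigger 2: the cache gets past a certain fullness. */
void charge_cache(struct xact_batch *b, size_t bytes)
{
	assert(b != NULL);
	b->cache_used += bytes;
	if (b->cache_used >= b->cache_limit)
		b->closing = true;
}

/* Trigger 1: explicit transaction batch close. */
void close_batch(struct xact_batch *b)
{
	assert(b != NULL);
	b->closing = true;
	maybe_commit(b);
}

void close_xact(struct xact_batch *b)
{
	assert(b != NULL && b->open_xacts > 0);
	b->open_xacts--;
	maybe_commit(b);
}
```

Nothing here says what a commit writes or how rollback works; it only
shows when a batch stops admitting transactions and when it closes.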

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-24 20:59                           ` Edgar Toernig
  2001-05-24 21:26                             ` Alexander Viro
@ 2001-05-25 11:00                             ` Daniel Phillips
  2001-05-26  3:07                               ` Edgar Toernig
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-25 11:00 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Thursday 24 May 2001 22:59, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > > > Readdir fills in a directory type, so ls sees it as a directory
> > > > and does the right thing.  On the other hand, we know we're on
> > > > a device filesystem so we will next open the name as a regular
> > > > file, and find ISCHR or ISBLK: good.
> > >
> > > ??? The kernel may know it, but the app?  Or do you really want
> > > to give different stat data on stat(2) and fstat(2)?  These flags
> > > are currently used by archive/backup prgs.  It's a hint that
> > > these files are not regular files and shouldn't be opened for
> > > reading. Having a 'd' would mean that they would really try to
> > > enter the directory and save its contents.  Don't know what
> > > happens in this case to your "special" files ;-)
> >
> > I guess that's much like the question 'what happens in proc?'.
>
> And that's already bad enough.  Most of the "files" in proc should
> be fifos!  And using proc as an excuse to introduce another set of
> magic dirs?  No, thanks.

Wait a second, I thought proc was here to stay.  Wait another
second, device nodes are already magic.  Magic is magic, just
choose your color ;-)

This set of magic dirs is supposed to clean things up, not mess things
up.  We already saw how the side-effects-on-open problem in ls -l goes
away.  There's a much bigger problem I'd love to deal with: the 'no
hierarchy can please everybody' problem.  In database terms, a hierarchy
is an insufficiently general model for real-world problems; in other
words, hierarchies never worked.  Tables work.  That's where I'm trying
to go with this, so please bear with me.  This is not just a solution in
search of a problem.

> > Correct me if I'm wrong, but what we learn from the proc example
> > is that tarring your whole source tree starting at / is not
> > something you want to do.
>
> IMHO it would be better to fix proc instead of adding more magic.  At
> the moment you have to exclude /proc.  You want to add /dev.

Well, actually no, ls -R, tar, zip, etc, work pretty well with the 
scheme I've described.

> And
> next? Exclude all $HOME/dev (in case process name spaces get added)? 
> Or make fifos magic too and add all of them to the exclude list?  But
> there's no central place for fifos.  So lets add more magic :-(

No, no, no, agreed and sometimes magic is good.  It's not deep magic.  
The only new thing here is the interpretation of the O_DIRECTORY flag, 
or rather, the lack of it.

> > What *won't* happen is, you won't get side effects from opening
> > your serial ports (you'd have to open them without O_DIRECTORY
> > to get that) so that seems like a little step forward.
>
> As already said: depending on O_DIRECTORY breaks POSIX compliance
> and that alone should kill this idea...

Thanks, two good points:
  - libc5 will get confused when doing ls in /magicdev
  - POSIX specifically forbids this

I'll put this away until I've specifically dug into both of them.  OK, 
over and out, thanks for your commentary.

/me peruses man pages

Oops, oh wait, there's already another open point: your breakage 
examples both rely on opening ".".  You're right, "." should always be 
a directory and I believe that's enforced by the VFS.  So we don't have 
an example of breakage yet.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-25 11:00                             ` Daniel Phillips
@ 2001-05-26  3:07                               ` Edgar Toernig
  2001-05-26 22:36                                 ` Daniel Phillips
  0 siblings, 1 reply; 161+ messages in thread
From: Edgar Toernig @ 2001-05-26  3:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips wrote:
> 
> Oops, oh wait, there's already another open point: your breakage
> examples both rely on opening ".".  You're right, "." should always be
> a directory and I believe that's enforced by the VFS.  So we don't have
> an example of breakage yet.

That's just because I did a simple "ls".  But it doesn't make a
difference.  The magicdevs _are_ directories and

	chdir("magicdev");
	open(".", O_RDONLY);

shouldn't open the device.

Ciao, ET.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-26  3:07                               ` Edgar Toernig
@ 2001-05-26 22:36                                 ` Daniel Phillips
  2001-05-27 13:32                                   ` Edgar Toernig
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-26 22:36 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Saturday 26 May 2001 05:07, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > Oops, oh wait, there's already another open point: your breakage
> > examples both rely on opening ".".  You're right, "." should always
> > be a directory and I believe that's enforced by the VFS.  So we
> > don't have an example of breakage yet.
>
> That's just because I did a simple "ls".  But it doesn't make a
> difference.  The magicdevs _are_ directories and
>
> 	chdir("magicdev");
> 	open(".", O_RDONLY);
>
> shouldn't open the device.

It won't, the open for "." is handled in the VFS, not the filesystem - 
it will open the directory.  (Without needing to be told it's a 
directory via O_DIRECTORY.)  If you do open("magicdev") you'll get the 
device, because that's handled by magicdevfs.

I'm not claiming there isn't breakage somewhere, just that we didn't 
find it on this attempt.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-26 22:36                                 ` Daniel Phillips
@ 2001-05-27 13:32                                   ` Edgar Toernig
  2001-05-27 20:40                                     ` Ben LaHaise
  2001-05-27 20:45                                     ` Daniel Phillips
  0 siblings, 2 replies; 161+ messages in thread
From: Edgar Toernig @ 2001-05-27 13:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips wrote:
> 
> It won't, the open for "." is handled in the VFS, not the filesystem -
> it will open the directory.  (Without needing to be told it's a
> directory via O_DIRECTORY.)  If you do open("magicdev") you'll get the
> device, because that's handled by magicdevfs.

You really mean that "magicdev" is a directory and:

	open("magicdev/.", O_RDONLY);
	open("magicdev", O_RDONLY);

would both succeed but open different objects?

> I'm not claiming there isn't breakage somewhere,

you break UNIX fundamentals.  But I'm quite relieved now because I'm
pretty sure that something like that will never go into the kernel.

Ciao, ET.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD  w/info-PATCH]device arguments from lookup)
  2001-05-27 13:32                                   ` Edgar Toernig
@ 2001-05-27 20:40                                     ` Ben LaHaise
  2001-05-27 20:45                                     ` Daniel Phillips
  1 sibling, 0 replies; 161+ messages in thread
From: Ben LaHaise @ 2001-05-27 20:40 UTC (permalink / raw)
  To: Edgar Toernig
  Cc: Daniel Phillips, Oliver Xymoron, linux-kernel, linux-fsdevel

On Sun, 27 May 2001, Edgar Toernig wrote:

> You really mean that "magicdev" is a directory and:
>
> 	open("magicdev/.", O_RDONLY);

At least for the patch I posted, that would return -ENOTDIR, and exactly
for the reason that not doing so would break find.  I've been convinced
that we really need to be careful which, if any, options are permitted in
this fashion.

		-ben


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-27 13:32                                   ` Edgar Toernig
  2001-05-27 20:40                                     ` Ben LaHaise
@ 2001-05-27 20:45                                     ` Daniel Phillips
  2001-05-27 21:50                                       ` Marko Kreen
  2001-05-28  1:26                                       ` Horst von Brand
  1 sibling, 2 replies; 161+ messages in thread
From: Daniel Phillips @ 2001-05-27 20:45 UTC (permalink / raw)
  To: Edgar Toernig; +Cc: Oliver Xymoron, linux-kernel, linux-fsdevel

On Sunday 27 May 2001 15:32, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > It won't, the open for "." is handled in the VFS, not the
> > filesystem - it will open the directory.  (Without needing to be
> > told it's a directory via O_DIRECTORY.)  If you do open("magicdev")
> > you'll get the device, because that's handled by magicdevfs.
>
> You really mean that "magicdev" is a directory and:
>
> 	open("magicdev/.", O_RDONLY);
> 	open("magicdev", O_RDONLY);
>
> would both succeed but open different objects?

Yes, and:

        open("magicdev/.", O_RDONLY | O_DIRECTORY);
        open("magicdev", O_RDONLY | O_DIRECTORY);

will both succeed and open the same object.

> > I'm not claiming there isn't breakage somewhere,
>
> you break UNIX fundamentals.  But I'm quite relieved now because I'm
> pretty sure that something like that will never go into the kernel.

OK, I'll take that as "I couldn't find a piece of code that breaks, so 
it's on to the legal issues".

SUS doesn't seem to have a lot to say about this.  The nearest thing to 
a ruling I found was "The special filename dot refers to the directory 
specified by its predecessor".  Which is not the same thing as:

   open("foo", O_RDONLY) == open ("foo/.", O_RDONLY)

I don't know about POSIX (I don't have it: a pox on standards 
organizations that don't make their standards freely available) but SUS 
doesn't seem to forbid this.
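
Whether two opens reach the same object can be checked from userspace by comparing device and inode numbers. The sketch below is only an illustration; same_object() is a hypothetical helper, and "magicdev" is a proposal in this thread, not an existing interface.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both paths open to the same object (same device and
 * inode), 0 if they open to different objects, and -1 if either
 * open(2) or fstat(2) fails. */
int same_object(const char *a, const char *b)
{
	struct stat sa, sb;
	int fa, fb, ret = -1;

	fa = open(a, O_RDONLY);
	if (fa < 0)
		return -1;
	fb = open(b, O_RDONLY);
	if (fb < 0) {
		close(fa);
		return -1;
	}
	if (fstat(fa, &sa) == 0 && fstat(fb, &sb) == 0)
		ret = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);
	close(fa);
	close(fb);
	return ret;
}
```

For an ordinary directory, same_object("foo", "foo/.") reports 1. Under the scheme discussed here it would presumably report 0 for a magicdev, which is exactly the behaviour being argued about.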

--
Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-27 20:45                                     ` Daniel Phillips
@ 2001-05-27 21:50                                       ` Marko Kreen
  2001-05-28  1:26                                       ` Horst von Brand
  1 sibling, 0 replies; 161+ messages in thread
From: Marko Kreen @ 2001-05-27 21:50 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Edgar Toernig, Oliver Xymoron, linux-kernel, linux-fsdevel

On Sun, May 27, 2001 at 10:45:17PM +0200, Daniel Phillips wrote:
> On Sunday 27 May 2001 15:32, Edgar Toernig wrote:
> > Daniel Phillips wrote:
> > > I'm not claiming there isn't breakage somewhere,
> >
> > you break UNIX fundamentals.  But I'm quite relieved now because I'm
> > pretty sure that something like that will never go into the kernel.
> 
> OK, I'll take that as "I couldn't find a piece of code that breaks, so 
> it's on to the legal issues".
> 
> SUS doesn't seem to have a lot to say about this.  The nearest thing to 
> a ruling I found was "The special filename dot refers to the directory 
> specified by its predecessor".  Which is not the same thing as:
> 
>    open("foo", O_RDONLY) == open ("foo/.", O_RDONLY)
> 
> I don't know about POSIX (I don't have it: a pox on standards 
> organizations that don't make their standards freely available) but SUS 
> doesn't seem to forbid this.

My question is: Is it needed?  You are advocating quite
non-obvious behaviour on a UNIX-like fs.  Can't the end result
be achieved in a more obvious manner?

I see at most 3 types of magic files:

1) regular file - nothing special.  Whether it has CHR/BLK set
   or not is irrelevant.

2) file with subdevs.  As 1) but you can access dev/something
   for subdev 'something'.  Permissions should probably be taken
   from 'dev'.  Of course you can't do 'ls' on the thing.

3) magicdev as directory.  Act as ordinary directory.  Only
   reason is to group devices.

And all those should be manageable by devfsd, so you can tell
devfsd to take subdev and create it as file somewhere else.
So 2) and 3) are more like 'defaults'.

So: is there additional type required with non-obvious file/dir
behaviour mix?

-- 
marko


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-27 20:45                                     ` Daniel Phillips
  2001-05-27 21:50                                       ` Marko Kreen
@ 2001-05-28  1:26                                       ` Horst von Brand
  2001-05-29 10:54                                         ` Daniel Phillips
  1 sibling, 1 reply; 161+ messages in thread
From: Horst von Brand @ 2001-05-28  1:26 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Edgar Toernig, Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips <phillips@bonn-fries.net> said:
> On Sunday 27 May 2001 15:32, Edgar Toernig wrote:

[...]

> > you break UNIX fundamentals.  But I'm quite relieved now because I'm
> > pretty sure that something like that will never go into the kernel.

> OK, I'll take that as "I couldn't find a piece of code that breaks, so 
> it's on to the legal issues".

It boggles my (perhaps underdeveloped) mind to have things that are files
_and_ directories at the same time. The last time this was discussed was
for handling forks (a la Mac et al) in files, and it was shot down.

> SUS doesn't seem to have a lot to say about this.  The nearest thing to 
> a ruling I found was "The special filename dot refers to the directory 
> specified by its predecessor".  Which is not the same thing as:
> 
>    open("foo", O_RDONLY) == open ("foo/.", O_RDONLY)

It says "foo" and "foo/." are the same _directory_, where "foo" is a
directory as otherwise "foo/<something>" makes no sense, AFAICS. Is there
any mention on a _file_ "bar" and going "bar/" or "bar/<something>"?
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Vin~a del Mar, Chile                               +56 32 672616

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-28  1:26                                       ` Horst von Brand
@ 2001-05-29 10:54                                         ` Daniel Phillips
  2001-05-29 13:54                                           ` Horst von Brand
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel Phillips @ 2001-05-29 10:54 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Edgar Toernig, Oliver Xymoron, linux-kernel, linux-fsdevel

On Monday 28 May 2001 03:26, Horst von Brand wrote:
> Daniel Phillips <phillips@bonn-fries.net> said:
> > On Sunday 27 May 2001 15:32, Edgar Toernig wrote:
>
> [...]
>
> > > you break UNIX fundamentals.  But I'm quite relieved now because
> > > I'm pretty sure that something like that will never go into the
> > > kernel.
> >
> > OK, I'll take that as "I couldn't find a piece of code that breaks,
> > so it's on to the legal issues".
>
> It boggles my (perhaps underdeveloped) mind to have things that are
> files _and_ directories at the same time.

They are not, the device file and the directory are different objects 
that have the same name.  In C, "foo" and "struct foo" can appear in 
the same scope but they are different objects.  This must have seemed 
to be a strange idea at first.  Here we have "foo" (a device) and 
"directory foo" (the device's properties).

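The C rule referred to here can be shown in a few lines: struct tags and ordinary identifiers live in separate name spaces. This is only an illustration of the analogy; the names and values are made up.

```c
/* "foo" the tag and "foo" the ordinary identifier live in different
 * name spaces, so both declarations are legal in the same scope. */
struct foo {
	int properties;
};

static int foo = 42;

int foo_demo(void)
{
	struct foo props = { 7 };	/* the tag, selected by "struct" */

	return foo + props.properties;	/* the plain identifier */
}
```

The compiler tells the two apart because the keyword "struct" selects the tag name space explicitly. A pathname has no equivalent qualifier unless something like O_DIRECTORY plays that role.
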
When I first saw Linus mention the idea I did a double-take, I thought 
it was a strange idea and my first reaction was, it would break all 
kinds of things.  But when I started examining cases I was unable to 
find any real problems.  When I asked for code examples of breakage, none of 
the supplied examples survived scrutiny.  Then, when I looked through 
SUS I didn't find any prohibition.

> The last time this was
> discussed was for handling forks (a la Mac et al) in files, and it
> was shot down.

Do you have the subject line?  It might save us some time ;-)

I seem to recall that the fork idea died because it was thought to 
require changes to userspace programs such as tar and find.  The 
magicdev idea doesn't require such changes, none that I've seen so far.

> > SUS doesn't seem to have a lot to say about this.  The nearest
> > thing to a ruling I found was "The special filename dot refers to
> > the directory specified by its predecessor".  Which is not the same
> > thing as:
> >
> >    open("foo", O_RDONLY) == open ("foo/.", O_RDONLY)
>
> It says "foo" and "foo/." are the same _directory_, where "foo" is a
> directory as otherwise "foo/<something>" makes no sense, AFAICS. Is
> there any mention on a _file_ "bar" and going "bar/" or
> "bar/<something>"?

In SUS I didn't find anything, one way or the other.  I don't know 
about POSIX.

--
Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
  2001-05-29 10:54                                         ` Daniel Phillips
@ 2001-05-29 13:54                                           ` Horst von Brand
  0 siblings, 0 replies; 161+ messages in thread
From: Horst von Brand @ 2001-05-29 13:54 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Edgar Toernig, Oliver Xymoron, linux-kernel, linux-fsdevel

Daniel Phillips <phillips@bonn-fries.net> said:
> On Monday 28 May 2001 03:26, Horst von Brand wrote:
> > Daniel Phillips <phillips@bonn-fries.net> said:
> > > On Sunday 27 May 2001 15:32, Edgar Toernig wrote:
> >
> > [...]
> >
> > > > you break UNIX fundamentals.  But I'm quite relieved now because
> > > > I'm pretty sure that something like that will never go into the
> > > > kernel.
> > >
> > > OK, I'll take that as "I couldn't find a piece of code that breaks,
> > > so it's on to the legal issues".
> >
> > It boggles my (perhaps underdeveloped) mind to have things that are
> > files _and_ directories at the same time.

> They are not, the device file and the directory are different objects 
> that have the same name.  In C, "foo" and "struct foo" can appear in 
> the same scope but they are different objects.  This must have seemed 
> to be a strange idea at first.  Here we have "foo" (a device) and 
> "directory foo" (the device's properties).

They have the exact same name, how is anybody going to distinguish them?

> When I first saw Linus mention the idea I did a double-take, I thought 
> it was a strange idea and my first reaction was, it would break all 
> kinds of things.  But when I started examining cases I was unable to 
> find any real problems.  When I asked for code examples of breakage, none of 
> the supplied examples survived scrutiny.  Then, when I looked through 
> SUS I didn't find any prohibition.

It isn't allowed either...

> > The last time this was
> > discussed was for handling forks (a la Mac et al) in files, and it
> > was shot down.
> 
> Do you have the subject line?  It might save us some time ;-)

Nope, sorry.

> I seem to recall that the fork idea died because it was thought to 
> require changes to userspace programs such as tar and find.  The 
> magicdev idea doesn't require such changes, none that I've seen so far.

tar(1) of /dev should blow up in exactly the same way, AFAICS...

Everybody just knows a device is a device, a file is a file, and a
directory is a directory. Standards notwithstanding, this is how things
work, and have worked for a _long_ time; with absolutely no warning that
the assumption might become wrong sometime or be wrong on some strange
beast (you didn't find anything in your search). I'd suspect nobody
bothered to cast this in stone because nobody even considered such a
twisted possibility.

Take it up with somebody on the standards committees, they (should) have
looked long and hard at the nooks and crannies in the standard, and so are
in a better position to comment than we here are.
-- 
Dr. Horst H. von Brand                       mailto:vonbrand@inf.utfsm.cl
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [reiserfs-list] Re: Why side-effects on open(2) are evil. (was Re:  [RFD w/info-PATCH]device arguments from lookup)
  2001-05-25 10:56                     ` Daniel Phillips
@ 2001-06-01  3:24                       ` Hans Reiser
  0 siblings, 0 replies; 161+ messages in thread
From: Hans Reiser @ 2001-06-01  3:24 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andreas Dilger, Peter J. Braam, Linus Torvalds, Alexander Viro,
	Edgar Toernig, Ben LaHaise, linux-kernel, linux-fsdevel,
	Josh MacDonald, reiserfs-list

Daniel Phillips wrote:

> Graciously accepted.  Coming up with something sensible in a mere 6
> months would be a minor miracle. ;-)
> 
> - what happens if the user forgets to close the transaction?

then the user has branched into his own version, or at least that would be my
take on it.  Another possible method is to expire transactions by persons who
lack permission to keep them open indefinitely.  I suppose one could expire them
to abort, or expire them to commit, both being valid under some circumstances.  


> 
>    I plan to set a checkpoint there (because the transaction got
>    too big) and log the fact that it's open.
> 
> - issues of lock/transaction duration
> 
>    Once again relying on checkpoints, when the transaction gets
>    uncomfortably big for cache, set a checkpoint.  I haven't thought
>    about locks
> 
> - transaction batching
> 
>    1) Explicit transaction batch close 2) Cache gets past a certain
>    fullness.  In both cases, no new transactions are allowed to start
>    and as soon as all current ones are closed we close the batch.
> 
> - of levels of isolation
> - concurrent transactions modifying global fs metadata
>    and some but not all of those concurrent transactions receiving a
>    rollback
> 
>    First I was going to write 'huh?' here, then I realized you're
>    talking about real database ops, not just filesystem ops.  I had
>    in mind something more modest: transactions are 'mv', 'read/write'
>    (if the 'atomic read/write' is set), other filesystem operations I've
>    forgotten, and anything the user puts between open_xact and
>    close_xact.  You are raising the ante a little ;-)
> 
>    In my case (Tux2) I could do an efficient rollback to the beginning
>    of the batch (phase), then I would have had to have kept an
>    in-memory log of the transactions for selective replay.  With a
>    journal log you can obviously do the same thing, but perhaps more
>    efficiently if your journal design supports undo/redo.
> 
>    The above is a pure flight of fancy, we won't be seeing anything
>    so fancy as an API across filesystems.

It is just a matter of time, and we will.  I think that the major release AFTER
2.6 will see it.  First we have to get a prototype done in time for 2.6....

> 
> - permissions relating to keeping transactions open.
>    We can see this one in the light of a simple filesystem
>    transaction: what happens if we are in the middle of a mv and
>    someone changes the permissions?  Go with the starting or
>    ending permissions?
> 
> Well, the database side of this is really interesting, but to get
> something generic across filesystems, the scope pretty well has to be
> limited to journal-type transactions, don't you think?

don't know what a journal-type transaction is and how it differs from a database
transaction.

> 
> --
> Daniel

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 16:41 Andries.Brouwer
  2001-05-19 16:51 ` Alexander Viro
@ 2001-05-20 11:18 ` Matthew Kirkwood
  1 sibling, 0 replies; 161+ messages in thread
From: Matthew Kirkwood @ 2001-05-20 11:18 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: viro, linux-fsdevel, linux-kernel, Linus Torvalds

On Sat, 19 May 2001 Andries.Brouwer@cwi.nl wrote:

> One would like to have a version of the open() call that was
> guaranteed free of side effects, and gave a fd only -
> perhaps for stat(), perhaps for ioctl().

I did this a while ago, after some discussion.  The
implementation may suck, but I think it's a useful
facility.

http://web.gnu.walfield.org/mail-archive/linux-fsdevel/2000-March/0230.html

Matthew.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 17:14   ` Matthew Wilcox
@ 2001-05-19 23:24     ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 23:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andries.Brouwer, bcrl, linux-fsdevel, linux-kernel, torvalds



On Sat, 19 May 2001, Matthew Wilcox wrote:

> On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote:
> > clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add
> > unopened descriptors. I.e. no IO until you open it (turning the thing into
> > opened one), but we can do lookups (move to child), we can clone and
> > kill them and we can stat them.
> 
> Those who would like a more detailed explanation can find one at
> http://plan9.bell-labs.com/sys/man/5/INDEX.html

Umm... Yes, it's an allusion to 9P, but no, I'm not serious about exporting
that to userland.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 16:51 ` Alexander Viro
@ 2001-05-19 17:14   ` Matthew Wilcox
  2001-05-19 23:24     ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Wilcox @ 2001-05-19 17:14 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andries.Brouwer, bcrl, linux-fsdevel, linux-kernel, torvalds

On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote:
> clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add
> unopened descriptors. I.e. no IO until you open it (turning the thing into
> opened one), but we can do lookups (move to child), we can clone and
> kill them and we can stat them.

Those who would like a more detailed explanation can find one at
http://plan9.bell-labs.com/sys/man/5/INDEX.html

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 16:41 Andries.Brouwer
@ 2001-05-19 16:51 ` Alexander Viro
  2001-05-19 17:14   ` Matthew Wilcox
  2001-05-20 11:18 ` Matthew Kirkwood
  1 sibling, 1 reply; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 16:51 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: bcrl, linux-fsdevel, linux-kernel, torvalds



On Sat, 19 May 2001 Andries.Brouwer@cwi.nl wrote:

> One would like to have a version of the open() call that was
> guaranteed free of side effects, and gave a fd only -
> perhaps for stat(), perhaps for ioctl().
> This guarantee could perhaps be obtained by omitting the
> 	f->f_op->open(inode,f);
> call in dentry_open() when the open call is
> 	open("file", O_FDONLY);
> Of course it may be that we afterwards decide that fd must
> be used, and then it needs upgrading:
> 	fd = f_open(fd, O_RDWR);

clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add
unopened descriptors. I.e. no IO until you open it (turning the thing into
opened one), but we can do lookups (move to child), we can clone and
kill them and we can stat them.

It makes tree traversals much easier, but AFAIK nobody had exported that
API directly to userland. Might be a good idea, but it's completely
non-portable...


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
@ 2001-05-19 16:41 Andries.Brouwer
  2001-05-19 16:51 ` Alexander Viro
  2001-05-20 11:18 ` Matthew Kirkwood
  0 siblings, 2 replies; 161+ messages in thread
From: Andries.Brouwer @ 2001-05-19 16:41 UTC (permalink / raw)
  To: Andries.Brouwer, viro; +Cc: bcrl, linux-fsdevel, linux-kernel, torvalds

>> Opening device files often has interesting side effects.

> Too bad. They can be triggered by similar races between attacker
> changing the type of object (file<->symlink) and backup.

Yes. This is a well-known security problem.
Doing
	stat("file", &s);
	if (action desired) {
		action("file");
	}
is no good because there is a race.
But doing
	fd = open("file", flags);
	fstat(fd, &s);
	if (action desired) {
		f_action(fd);
	}
is no good either because the open() has unknown side effects.
It helps to add flags like O_NONBLOCK and perhaps O_NOCTTY,
but that is not quite good enough.
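
Spelled out in compilable form, the open-then-fstat pattern looks roughly like the sketch below. open_if_regular() is a hypothetical name, and "regular file" stands in for whatever check the caller really wants.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open first, then inspect the descriptor: the check and the later
 * action refer to the same object, so there is no stat()/open() race.
 * The open itself may still have side effects on device nodes.
 * Returns an fd for a regular file, or -1 otherwise. */
int open_if_regular(const char *path)
{
	struct stat st;
	int fd;

	fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

O_NONBLOCK keeps the open from blocking on FIFOs and some devices, and O_NOCTTY keeps a tty from becoming the controlling terminal. Neither flag stops a driver's open method from running.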

One would like to have a version of the open() call that was
guaranteed free of side effects, and gave a fd only -
perhaps for stat(), perhaps for ioctl().
This guarantee could perhaps be obtained by omitting the
	f->f_op->open(inode,f);
call in dentry_open() when the open call is
	open("file", O_FDONLY);
Of course it may be that we afterwards decide that fd must
be used, and then it needs upgrading:
	fd = f_open(fd, O_RDWR);
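
O_FDONLY and f_open() above are proposals, not existing interfaces. Much later Linux kernels gained a flag in a similar spirit, O_PATH (2.6.39 and up), which yields a descriptor without calling the driver's open method. The sketch below uses it under several assumptions: a kernel new enough for fstat() on such a descriptor (3.6 or later), /proc mounted, and a hypothetical helper name, probe_then_open().

```c
#define _GNU_SOURCE		/* for O_PATH on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000	/* asm-generic value; differs on a few arches */
#endif

/* Look at "path" without invoking the driver's open method, and only
 * open it for reading if it turns out to be a regular file.
 * Returns a readable fd, or -1 if the object is not a regular file
 * or any step fails. */
int probe_then_open(const char *path)
{
	char link[64];
	struct stat st;
	int pfd, fd = -1;

	pfd = open(path, O_PATH | O_NOFOLLOW);
	if (pfd < 0)
		return -1;

	if (fstat(pfd, &st) == 0 && S_ISREG(st.st_mode)) {
		/* "Upgrade": reopen the very same object through /proc. */
		snprintf(link, sizeof(link), "/proc/self/fd/%d", pfd);
		fd = open(link, O_RDONLY);
	}
	close(pfd);
	return fd;
}
```

The reopen through /proc/self/fd plays the role of the f_open() upgrade step. Because it starts from the already resolved descriptor rather than the pathname, swapping the file for a symlink in between does not change which object gets opened.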

Andries

[Such a construction allows various cleanups.
But no doubt it has problems that I have not yet thought of.]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
  2001-05-19 14:19 Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Andries.Brouwer
@ 2001-05-19 14:58 ` Alexander Viro
  0 siblings, 0 replies; 161+ messages in thread
From: Alexander Viro @ 2001-05-19 14:58 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: bcrl, linux-fsdevel, linux-kernel, torvalds



On Sat, 19 May 2001 Andries.Brouwer@cwi.nl wrote:

> > A lot of stuff relies on the fact that close(open(foo, O_RDONLY))
> > is a no-op. Breaking that assumption is a Bad Thing(tm).
> 
> Also here I would like to agree. Unfortunately this is false.
> Opening device files often has interesting side effects.

Too bad. They can be triggered by similar races between attacker
changing the type of object (file<->symlink) and backup.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
@ 2001-05-19 14:19 Andries.Brouwer
  2001-05-19 14:58 ` Alexander Viro
  0 siblings, 1 reply; 161+ messages in thread
From: Andries.Brouwer @ 2001-05-19 14:19 UTC (permalink / raw)
  To: bcrl, viro; +Cc: linux-fsdevel, linux-kernel, torvalds

Alexander Viro writes:

> Folks, before you get all excited about cramming side effects
> into open(2), consider ...

I agree completely.

> A lot of stuff relies on the fact that close(open(foo, O_RDONLY))
> is a no-op. Breaking that assumption is a Bad Thing(tm).

Also here I would like to agree. Unfortunately this is false.
Opening device files often has interesting side effects.

Andries

^ permalink raw reply	[flat|nested] 161+ messages in thread

end of thread, other threads:[~2001-06-01  3:29 UTC | newest]

Thread overview: 161+ messages
2001-05-19  6:23 [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Ben LaHaise
2001-05-19  6:57 ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Clausen
2001-05-19  7:04   ` Alexander Viro
2001-05-19  7:23     ` Andrew Clausen
2001-05-19  8:30       ` Alexander Viro
2001-05-19 10:13         ` Andrew Clausen
2001-05-19 14:02         ` [RFD w/info-PATCH] device arguments from lookup, partion code Alan Cox
2001-05-19 16:48           ` Erik Mouw
2001-05-19 17:45             ` Aaron Lehmann
2001-05-19 19:38               ` Erik Mouw
2001-05-19 20:53                 ` Steven Walter
2001-05-19 18:51         ` Richard Gooch
2001-05-20  2:18           ` Matthew Wilcox
2001-05-20  2:22           ` Richard Gooch
2001-05-20  2:34             ` Matthew Wilcox
2001-05-20  2:36             ` Alexander Viro
2001-05-20  2:48             ` Richard Gooch
2001-05-20  3:26               ` Linus Torvalds
2001-05-20 10:23                 ` Russell King
2001-05-20 10:35                   ` Alexander Viro
2001-05-20 18:46                   ` Linus Torvalds
2001-05-20 18:57                     ` Russell King
2001-05-20 19:10                       ` Linus Torvalds
2001-05-20 19:42                         ` Alexander Viro
2001-05-20 20:07                           ` Alan Cox
2001-05-20 20:33                             ` Alexander Viro
2001-05-20 23:59                             ` Paul Fulghum
2001-05-21  0:36                               ` Alexander Viro
2001-05-21  3:08                                 ` Paul Fulghum
2001-05-20 20:07                           ` Alan Cox
2001-05-20 23:46                           ` Ingo Molnar
2001-05-21  0:32                             ` Alexander Viro
2001-05-21  3:12                             ` Linus Torvalds
2001-05-21 19:32                               ` Kai Henningsen
2001-05-23  1:15                               ` Albert D. Cahalan
2001-05-20  2:51             ` Richard Gooch
2001-05-20 21:13               ` Pavel Machek
2001-05-21 20:20                 ` Alan Cox
2001-05-21 20:41                   ` Alexander Viro
2001-05-21 21:29                     ` Alan Cox
2001-05-21 21:51                       ` Alexander Viro
2001-05-21 21:56                         ` Alan Cox
2001-05-21 22:10                           ` Linus Torvalds
2001-05-21 22:22                             ` Alexander Viro
2001-05-22 15:41                               ` Oliver Xymoron
2001-05-22  2:28                             ` Paul Mackerras
2001-05-22 13:33                             ` Jan Harkes
2001-05-22 16:30                               ` Linus Torvalds
2001-05-22  0:22                         ` Ingo Oeser
2001-05-22  0:57                           ` Matthew Wilcox
2001-05-22  1:13                             ` Linus Torvalds
2001-05-22  1:18                               ` Matthew Wilcox
2001-05-22  7:49                                 ` Alan Cox
2001-05-22 15:31                                   ` Matthew Wilcox
2001-05-22 15:31                                     ` Alan Cox
2001-05-22 15:38                                       ` Matthew Wilcox
2001-05-22 15:42                                         ` Alan Cox
2001-05-20  2:31           ` Alexander Viro
2001-05-20 16:57           ` David Woodhouse
2001-05-20 19:02             ` Linus Torvalds
2001-05-20 19:11               ` Alexander Viro
2001-05-20 19:18                 ` Matthew Wilcox
2001-05-20 19:24                   ` Alexander Viro
2001-05-20 19:34                     ` Linus Torvalds
2001-05-20 19:27                 ` Linus Torvalds
2001-05-20 19:33                   ` Alexander Viro
2001-05-20 19:38                     ` Linus Torvalds
2001-05-20 19:57               ` David Woodhouse
2001-05-21 13:57               ` Ingo Oeser
2001-05-19  9:11     ` [RFD w/info-PATCH] device arguments from lookup, partion code inuserspace Andrew Morton
2001-05-19  9:20       ` Alexander Viro
2001-05-19  7:58   ` Ben LaHaise
2001-05-19  8:10     ` Alexander Viro
2001-05-19  8:16       ` Ben LaHaise
2001-05-19  8:32         ` Alexander Viro
2001-05-19  9:42 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Christer Weinigel
2001-05-19  9:51 ` Christer Weinigel
2001-05-19 11:37 ` Eric W. Biederman
2001-05-19 14:25   ` Daniel Phillips
2001-05-21  8:14     ` Lars Marowsky-Bree
2001-05-22  9:07       ` Daniel Phillips
2001-05-19 13:53 ` Daniel Phillips
2001-05-19 13:57 ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Alexander Viro
2001-05-19 15:10   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Abramo Bagnara
2001-05-19 15:18     ` Alexander Viro
2001-05-19 16:01     ` Willem Konynenberg
2001-05-20 20:52       ` Pavel Machek
2001-05-20 20:53       ` Pavel Machek
2001-05-19 18:13   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Linus Torvalds
2001-05-19 23:19     ` Alexander Viro
2001-05-19 23:31       ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device " Jeff Garzik
2001-05-19 23:32         ` Jeff Garzik
2001-05-19 23:39         ` Alexander Viro
2001-05-20 15:47           ` F_CTRLFD (was Re: Why side-effects on open(2) are evil.) Edgar Toernig
2001-05-20 16:20             ` Alexander Viro
2001-05-20 19:01               ` Edgar Toernig
2001-05-20 19:30                 ` Alexander Viro
2001-05-21 17:16           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Oliver Xymoron
2001-05-21 16:26             ` David Lang
2001-05-21 18:04               ` Oliver Xymoron
2001-05-21 20:14             ` Daniel Phillips
2001-05-22 15:24               ` Oliver Xymoron
2001-05-22 16:51                 ` Daniel Phillips
2001-05-22 17:49                   ` Oliver Xymoron
2001-05-22 20:22                     ` Daniel Phillips
2001-05-23  4:19                   ` Edgar Toernig
2001-05-23  4:50                     ` Alexander Viro
2001-05-23 13:50                     ` Daniel Phillips
2001-05-23 13:50                     ` Daniel Phillips
2001-05-23 15:58                       ` Oliver Xymoron
2001-05-24  0:23                       ` Edgar Toernig
2001-05-24  7:47                         ` Marko Kreen
2001-05-24 14:39                           ` Oliver Xymoron
2001-05-24 15:20                             ` CHR/BLK needed? was: Re: Why side-effects on open Marko Kreen
2001-05-24 17:12                             ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device Albert D. Cahalan
2001-05-24 17:25                         ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Daniel Phillips
2001-05-24 20:59                           ` Edgar Toernig
2001-05-24 21:26                             ` Alexander Viro
2001-05-25  1:03                               ` Daniel Phillips
2001-05-25 11:00                             ` Daniel Phillips
2001-05-26  3:07                               ` Edgar Toernig
2001-05-26 22:36                                 ` Daniel Phillips
2001-05-27 13:32                                   ` Edgar Toernig
2001-05-27 20:40                                     ` Ben LaHaise
2001-05-27 20:45                                     ` Daniel Phillips
2001-05-27 21:50                                       ` Marko Kreen
2001-05-28  1:26                                       ` Horst von Brand
2001-05-29 10:54                                         ` Daniel Phillips
2001-05-29 13:54                                           ` Horst von Brand
2001-05-19 23:52   ` Edgar Toernig
2001-05-20  0:18     ` Alexander Viro
2001-05-20  0:32       ` Linus Torvalds
2001-05-20  0:52         ` Jeff Garzik
2001-05-20  1:03         ` Jeff Garzik
2001-05-20 19:41           ` Why side-effects on open(2) are evil. (was Re: [RFD Alan Cox
2001-05-21  9:45           ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup) Andrew Clausen
2001-05-21 17:22           ` Oliver Xymoron
2001-05-22 18:53           ` Andreas Dilger
2001-05-24  9:20             ` Malcolm Beattie
2001-05-24 19:15               ` Andreas Dilger
2001-05-22 18:41         ` Andreas Dilger
2001-05-22 19:06           ` Linus Torvalds
2001-05-22 19:16             ` Peter J. Braam
2001-05-22 20:10               ` Andreas Dilger
2001-05-22 20:59                 ` Peter J. Braam
2001-05-23  9:23                   ` Stephen C. Tweedie
2001-05-24 21:07                 ` Daniel Phillips
2001-05-24 22:00                   ` Hans Reiser
2001-05-25 10:56                     ` Daniel Phillips
2001-06-01  3:24                       ` [reiserfs-list] " Hans Reiser
2001-05-23  9:13               ` Stephen C. Tweedie
2001-05-20 20:23   ` Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device " Pavel Machek
2001-05-21 20:38     ` Alexander Viro
2001-05-19 18:31 ` [RFD w/info-PATCH] device arguments from lookup, partion code in userspace Linus Torvalds
2001-05-19 14:19 Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup) Andries.Brouwer
2001-05-19 14:58 ` Alexander Viro
2001-05-19 16:41 Andries.Brouwer
2001-05-19 16:51 ` Alexander Viro
2001-05-19 17:14   ` Matthew Wilcox
2001-05-19 23:24     ` Alexander Viro
2001-05-20 11:18 ` Matthew Kirkwood
