All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06  7:26 [PATCH] Futex Asynchronous Interface Rusty Russell
@ 2002-06-02  0:10 ` Pavel Machek
  2002-06-10  6:57   ` Rusty Russell
  2002-06-06 16:36 ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Pavel Machek @ 2002-06-02  0:10 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, frankeh, alan, torvalds

Hi!

> Name: Waker can unpin page, rather than waiting process
> Author: Rusty Russell
> Status: Tested in 2.5.20
> Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
> Depends: Futex/waitq.patch.gz
> 
> D: This changes the implementation so that the waker actually unpins
> D: the page.  This is preparation for the async interface, where the
> D: process which registered interest is not in the kernel.

What is it? Nice header, did you generate it by hand?
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Futex Asynchronous Interface
@ 2002-06-06  7:26 Rusty Russell
  2002-06-02  0:10 ` Pavel Machek
  2002-06-06 16:36 ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Rusty Russell @ 2002-06-06  7:26 UTC (permalink / raw)
  To: linux-kernel, frankeh; +Cc: alan, torvalds

These two patches (requiring the other patches I sent to the list
which can also be found on my kernel.org page) add the ability to tie
a futex to a file descriptor, for use with poll/select or SIGIO
(required by NGPT).

The method is: open /dev/futex, use sys_futex(FUTEX_AWAIT) to attach
it to a particular futex, then use select or poll (or set the fd up
for sigio signals, and expect a SIGIO).

You need to use FUTEX_AWAIT again after poll succeeds or SIGIO
(ie. it's oneshot).  Calling it while a futex is already outstanding
forgets about the old futex.

The reason for this method is that it's pretty convenient for
programs, and since each one pins a page down, tying that to a struct
file * means we have an implicit limit.

Code below.  Feedback welcome.
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Name: Waker can unpin page, rather than waiting process
Author: Rusty Russell
Status: Tested in 2.5.20
Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz

D: This changes the implementation so that the waker actually unpins
D: the page.  This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.


diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.19104/kernel/futex.c linux-2.5.20.19104.updated/kernel/futex.c
--- linux-2.5.20.19104/kernel/futex.c	Thu Jun  6 17:13:46 2002
+++ linux-2.5.20.19104.updated/kernel/futex.c	Thu Jun  6 17:14:30 2002
@@ -98,11 +98,13 @@
 		if (this->page == page && this->offset == offset) {
 			list_del_init(i);
 			tell_waiter(this);
+			unpin_page(this->page);
 			num_woken++;
 			if (num_woken >= num) break;
 		}
 	}
 	spin_unlock(&futex_lock);
+	unpin_page(page);
 	return num_woken;
 }
 
@@ -192,9 +194,10 @@
 	}
  out:
 	set_current_state(TASK_RUNNING);
-	/* Were we woken up anyway? */
+	/* Were we woken up anyway?  If so, it unpinned page. */
 	if (!unqueue_me(&q))
 		return 0;
+	unpin_page(page);
 	return ret;
 }
 
@@ -225,6 +228,7 @@
 	if (IS_ERR(page))
 		return PTR_ERR(page);
 
+	/* On success, these routines unpin the pages themselves. */
 	head = hash_futex(page, pos_in_page);
 	switch (op) {
 	case FUTEX_WAIT:
@@ -236,7 +240,8 @@
 	default:
 		ret = -EINVAL;
 	}
-	unpin_page(page);
+	if (ret < 0)
+		unpin_page(page);
 
 	return ret;
 }
Name: Asynchronous interface for futexes
Author: Rusty Russell
Status: Tested on 2.5.20
Depends: Futex/comment-fix.patch.gz Futex/copy-from-user.patch.gz
Depends: Futex/no-write-needed.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz Futex/waker-unpin-page.patch.gz

D: This patch adds a FUTEX_AWAIT and /dev/futex, for attaching futexes
D: to file descriptors, which can be used with poll, select or SIGIO.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.15557/include/linux/futex.h linux-2.5.20.15557.updated/include/linux/futex.h
--- linux-2.5.20.15557/include/linux/futex.h	Sat May 25 14:34:59 2002
+++ linux-2.5.20.15557.updated/include/linux/futex.h	Wed Jun  5 22:01:44 2002
@@ -4,5 +4,6 @@
 /* Second argument to futex syscall */
 #define FUTEX_WAIT (0)
 #define FUTEX_WAKE (1)
+#define FUTEX_AWAIT (2)
 
 #endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.15557/kernel/futex.c linux-2.5.20.15557.updated/kernel/futex.c
--- linux-2.5.20.15557/kernel/futex.c	Wed Jun  5 22:01:41 2002
+++ linux-2.5.20.15557.updated/kernel/futex.c	Wed Jun  5 22:02:09 2002
@@ -34,6 +34,10 @@
 #include <linux/highmem.h>
 #include <linux/time.h>
 #include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/poll.h>
 #include <asm/uaccess.h>
 
 /* Simple "sleep if unchanged" interface. */
@@ -41,11 +45,18 @@
 /* FIXME: This may be way too small. --RR */
 #define FUTEX_HASHBITS 6
 
+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
 /* We use this instead of a normal wait_queue_t, so we can wake only
    the relevent ones (hashed queues may be shared) */
 struct futex_q {
 	struct list_head list;
 	wait_queue_head_t waiters;
+
+	/* For AWAIT, sigio sent using these. */
+	int fd;
+	struct file *filp;
+
 	/* Page struct and offset within it. */
 	struct page *page;
 	unsigned int offset;
@@ -54,6 +65,7 @@
 /* The key for the hash is the address + index + offset within page */
 static struct list_head futex_queues[1<<FUTEX_HASHBITS];
 static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+extern struct file_operations futex_fops;
 
 static inline struct list_head *hash_futex(struct page *page,
 					   unsigned long offset)
@@ -73,9 +85,12 @@
 	page_cache_release(page);
 }
 
+/* Waiter may be sitting in FUTEX_WAIT or poll, or async */
 static inline void tell_waiter(struct futex_q *q)
 {
 	wake_up_all(&q->waiters);
+	if (q->fd != -1)
+		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
 }
 
 static int futex_wake(struct list_head *head,
@@ -113,6 +128,7 @@
 	add_wait_queue(&q->waiters, wait);
 	q->page = page;
 	q->offset = offset;
+	q->fd = -1;
 
 	spin_lock(&futex_lock);
 	list_add_tail(&q->list, head);
@@ -196,6 +212,38 @@
 	return ret;
 }
 
+static int futex_await(struct list_head *head,
+		       struct page *page,
+		       int offset,
+		       int fd)
+{
+	struct file *filp;
+	struct futex_q *q;
+
+	filp = fget(fd);
+	if (!filp || filp->f_op != &futex_fops)
+		return -EBADF;
+	q = filp->private_data;
+
+	spin_lock(&futex_lock);
+	/* Eliminate any old notification, wake any pollers, release page. */
+	if (!list_empty(&q->list)) {
+		list_del(&q->list);
+		wake_up_all(&q->waiters);
+		unpin_page(q->page);
+	}
+
+	q->filp = filp;
+	q->fd = fd;
+	q->page = page;
+	q->offset = offset;
+	list_add_tail(&q->list, head);
+	spin_unlock(&futex_lock);
+	fput(filp);
+
+	return 0;
+}
+
 asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
 {
 	int ret;
@@ -229,6 +277,9 @@
 	case FUTEX_WAIT:
 		ret = futex_wait(head, page, pos_in_page, val, uaddr, time);
 		break;
+	case FUTEX_AWAIT:
+		ret = futex_await(head, page, pos_in_page, val);
+		break;
 	case FUTEX_WAKE:
 		ret = futex_wake(head, page, pos_in_page, val);
 		break;
@@ -241,12 +292,68 @@
 	return ret;
 }
 
+static int futex_open(struct inode *inode, struct file *filp)
+{
+	struct futex_q *q;
+
+	q = kmalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&q->list);
+	init_waitqueue_head(&q->waiters);
+
+	filp->private_data = q;
+	return 0;
+}
+
+static int futex_close(struct inode *inode, struct file *filp)
+{
+	struct futex_q *q = filp->private_data;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list)) {
+		list_del(&q->list);
+		unpin_page(q->page);
+		BUG_ON(waitqueue_active(&q->waiters));
+	}
+	spin_unlock(&futex_lock);
+	kfree(filp->private_data);
+	return 0;
+}
+
+/* You need to do a FUTEX_AWAIT to arm this after each successful poll */
+static unsigned int futex_poll(struct file *filp,
+			       struct poll_table_struct *wait)
+{
+	struct futex_q *q = filp->private_data;
+	int ret = 0;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list))
+		poll_wait(filp, &q->waiters, wait);
+	else
+		ret = POLLIN | POLLRDNORM;
+	spin_unlock(&futex_lock);
+
+	return ret;
+}
+
+static struct file_operations futex_fops = {
+	open:		futex_open,
+	release:	futex_close,
+	poll:		futex_poll,
+};
+
 static int __init init(void)
 {
+	int futex_major;
 	unsigned int i;
 
 	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
 		INIT_LIST_HEAD(&futex_queues[i]);
+	futex_major = devfs_register_chrdev(0, "futex", &futex_fops);
+	devfs_register(NULL, "futex", DEVFS_FL_NONE, futex_major,
+		       0, S_IFCHR | 0666, &futex_fops, NULL);
 	return 0;
 }
 __initcall(init);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06  7:26 [PATCH] Futex Asynchronous Interface Rusty Russell
  2002-06-02  0:10 ` Pavel Machek
@ 2002-06-06 16:36 ` Linus Torvalds
  2002-06-06 19:27   ` Alan Cox
                     ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-06 16:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, frankeh, alan



On Thu, 6 Jun 2002, Rusty Russell wrote:
>
> The method is: open /dev/futex

STOP!

What madness is this?

You have a damn mutex system call, don't introduce mode crap in /dev.

Do we create pipes by opening /dev/pipe? No. Do we have major and minor
numbers for sockets and populate /dev with them? No. And as a result,
there has _never_ been any sysadmin problems with either.

You already have to have a system call to bind the particular fd to the
futex _anyway_, so do the only sane thing, and allocate the fd _there_,
and get rid of that stupid and horrible /dev/futed which only buys you
pain, system administration, extra code, and a black star for being
stupid.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06 16:36 ` Linus Torvalds
@ 2002-06-06 19:27   ` Alan Cox
  2002-06-06 23:21   ` Rusty Russell
  2002-06-07  9:06   ` Rusty Russell
  2 siblings, 0 replies; 45+ messages in thread
From: Alan Cox @ 2002-06-06 19:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, frankeh

On Thu, 2002-06-06 at 17:36, Linus Torvalds wrote:
> You have a damn mutex system call, don't introduce mode crap in /dev.
> 
> Do we create pipes by opening /dev/pipe? No. Do we have major and minor
> numbers for sockets and populate /dev with them? No. And as a result,
> there has _never_ been any sysadmin problems with either.

Its a bitch trying to create sockets in shell scripts isnt it. I wonder
if there is a connection 8)

Alan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06 16:36 ` Linus Torvalds
  2002-06-06 19:27   ` Alan Cox
@ 2002-06-06 23:21   ` Rusty Russell
  2002-06-07  8:33     ` Peter Wächtler
  2002-06-07  9:06   ` Rusty Russell
  2 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2002-06-06 23:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, frankeh, alan

In message <Pine.LNX.4.44.0206060930240.5920-100000@home.transmeta.com> you wri
te:
> Do we have major and minor numbers for sockets and populate /dev
> with them? No. And as a result, there has _never_ been any sysadmin
> problems with either.

Ummm... you don't do much network programming, do you Linus?  Don't
confuse familiarity with fondness: the socket API is *not* a good
model to copy.

> You already have to have a system call to bind the particular fd to the
> futex _anyway_, so do the only sane thing, and allocate the fd _there_,
> and get rid of that stupid and horrible /dev/futed which only buys you
> pain, system administration, extra code, and a black star for being
> stupid.

Yet another special way to create a special fd?  Hmm...

That might be better than what I proposed, but it's not the epitomy of
taste either.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06 23:21   ` Rusty Russell
@ 2002-06-07  8:33     ` Peter Wächtler
  2002-06-08 22:28       ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Wächtler @ 2002-06-07  8:33 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Linus Torvalds, linux-kernel, frankeh, alan

Rusty Russell wrote:
> In message <Pine.LNX.4.44.0206060930240.5920-100000@home.transmeta.com> you wri
> te:
> 
>>Do we have major and minor numbers for sockets and populate /dev
>>with them? No. And as a result, there has _never_ been any sysadmin
>>problems with either.
>>
> 
> Ummm... you don't do much network programming, do you Linus?  Don't
> confuse familiarity with fondness: the socket API is *not* a good
> model to copy.
> 
> 
>>You already have to have a system call to bind the particular fd to the
>>futex _anyway_, so do the only sane thing, and allocate the fd _there_,
>>and get rid of that stupid and horrible /dev/futed which only buys you
>>pain, system administration, extra code, and a black star for being
>>stupid.
>>
> 
> Yet another special way to create a special fd?  Hmm...
> 
> That might be better than what I proposed, but it's not the epitomy of
> taste either.
> 

What about /proc/futex then? Less adminstrative work, clean interface
(also for shell scripts like Alan suggested).
Al Viro would like this, it's more like Plan9 or QNX6. :)

Give it an entry in the namespace, why not with sockets (unix and ip) also?




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06 16:36 ` Linus Torvalds
  2002-06-06 19:27   ` Alan Cox
  2002-06-06 23:21   ` Rusty Russell
@ 2002-06-07  9:06   ` Rusty Russell
  2002-06-08 22:42     ` Linus Torvalds
  2 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2002-06-07  9:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, frankeh, alan, viro

On Thu, 6 Jun 2002 09:36:23 -0700 (PDT)
Linus Torvalds <torvalds@transmeta.com> wrote:
> You already have to have a system call to bind the particular fd to the
> futex _anyway_, so do the only sane thing, and allocate the fd _there_,

But Ma, I don't *want* to write a filesystem!

Linus, Al, is there an easier way to do this?  I stole this from sockfs,
but I balked at another 50 lines for a proper inode creation, so I just use
the same dentry and inode over and over.

It's still an awful lot of irrelevant code: what can I cut?

/* Everyone needs a dentry and inode */
static struct dentry *futex_dentry;

/* Signal allows caller to avoid the race which would occur if they
   set the sigio stuff up afterwards. */
static int futex_fd(struct list_head *head,
		    struct page *page,
		    int offset,
		    int signal)
{
	int fd;
	struct futex_q *q;
	struct file *filp;

	if (signal < 0 || signal > _NSIG)
		return -EINVAL;

	fd = get_unused_fd();
	if (fd < 0)
		return fd;
	filp = get_empty_filp();
	if (!filp) {
		put_unused_fd(fd);
		return -ENFILE;
	}
	filp->f_op = &futex_fops;
	filp->f_dentry = dget(futex_dentry);

	if (signal) {
		filp->f_owner.pid = current->pid;
		filp->f_owner.uid = current->uid;
		filp->f_owner.euid = current->euid;
		filp->f_owner.signum = signal;
	}

	q = kmalloc(sizeof(*q), GFP_KERNEL);
	if (!q) {
		put_unused_fd(fd);
		put_filp(filp);
		return -ENOMEM;
	}

	/* Initialize queue structure */
	init_waitqueue_head(&q->waiters);
	filp->private_data = q;

	/* Go for it... */
	queue_me(head, q, page, offset, fd, filp);

	/* Now we map fd to filp, so userspace can access it */
	fd_install(fd, filp);
	return 0;
}

/* Oh yeah, makes sense to write a filesystem... */
static struct super_operations futexfs_ops = {
	statfs:		simple_statfs,
};

static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct inode *root;
	sb->s_blocksize = 1024;
	sb->s_blocksize_bits = 10;
	sb->s_magic = 0xFBAD1DEA;
	sb->s_op = &futexfs_ops;
	root = new_inode(sb);
	if (!root)
		return -ENOMEM;
	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
	root->i_uid = root->i_gid = 0;
	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
	sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
	if (!sb->s_root) {
		iput(root);
		return -ENOMEM;
	}
	sb->s_root->d_sb = sb;
	sb->s_root->d_parent = sb->s_root;
	d_instantiate(sb->s_root, root);
	return 0;
}

static struct super_block *
futexfs_get_sb(struct file_system_type *fs_type,
	       int flags, char *dev_name, void *data)
{
	return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
}

static struct file_system_type futex_fs_type = {
	name:		"futexfs",
	get_sb:		futexfs_get_sb,
	kill_sb:	kill_anon_super,
	fs_flags:	FS_NOMOUNT,
};

static int __init init(void)
{
	unsigned int i;
	struct qstr name = { .name = "futex", .len = 5, .hash = 0 };
	struct vfsmount *futex_mnt;

	register_filesystem(&futex_fs_type);
	futex_mnt = kern_mount(&futex_fs_type);
	futex_dentry = d_alloc(NULL, &name);
	futex_dentry->d_inode = new_inode(futex_mnt->mnt_sb);

	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
		INIT_LIST_HEAD(&futex_queues[i]);
	return 0;
}
__initcall(init);

-- 
   there are those who do and those who hang on and you don't see too
   many doers quoting their contemporaries.  -- Larry McVoy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-07  8:33     ` Peter Wächtler
@ 2002-06-08 22:28       ` Linus Torvalds
  2002-06-09  9:49         ` Kai Henningsen
  2002-06-09 10:07         ` Peter Wächtler
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-08 22:28 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, linux-kernel, frankeh, alan



On Fri, 7 Jun 2002, Peter Wächtler wrote:
>
> What about /proc/futex then?

Why?

Tell me _one_ advantage from having the thing exposed as a filename?

The whole point with "everything is a file" is not that you have some
random filename (indeed, sockets and pipes show that "file" and "filename"
have nothing to do with each other), but the fact that you can use common
tools to operate on different things.

But there's absolutely no point in opening /dev/futex from a shell script
or similar, because you don't get anything from it. You still have to bind
the fd to it's real object.

In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
There's no point to it. It has no life outside the FUTEX system call, and
the only thing that you can do by exposing it as a name is to cause
problems for people who don't want to mount /proc, or who do not happen to
have that device node in their /dev.

> Give it an entry in the namespace, why not with sockets (unix and ip) also?

Perhaps because you cannot enumerate sockets and pipes? They don't _have_
names before they are created. Same as futexes, btw.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-07  9:06   ` Rusty Russell
@ 2002-06-08 22:42     ` Linus Torvalds
  2002-06-11  9:15       ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2002-06-08 22:42 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, frankeh, alan, viro



On Fri, 7 Jun 2002, Rusty Russell wrote:
>
> Linus, Al, is there an easier way to do this?  I stole this from sockfs,
> but I balked at another 50 lines for a proper inode creation, so I just use
> the same dentry and inode over and over.

There's nothing inherently wrong with re-using the inode and dentry -
that's what /dev/futex would do too, of course.

> It's still an awful lot of irrelevant code: what can I cut?

I don't think it's a matter of cutting, as much as possibly a matter of
tryign to share some common code. pipefs, sockfs and now this: they all do
pretty much exactly the same thing, and there is nothing that says that
they should have separate super_operations, for example, since they are
all identical.

And once you have the same super_operations, you really have the same
"fill_super" functions too. The only thing that separates these
superblocks is the root name, so that /proc gets nice output. So it should
be fine to just have

	sb = create_anon_fs("futex");

and share all of the setup code across futex/pipes/sockfs.

Which still leaves you with the

	get_unused_fd();
	get_empty_filp();
	filp->f_dentry = dget(sb->s_root);
	.. fill it ..
	fd_install(fd, filp);

but by then we're talking single lines of overhead.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-08 22:28       ` Linus Torvalds
@ 2002-06-09  9:49         ` Kai Henningsen
  2002-06-09 18:09           ` Linus Torvalds
  2002-06-09 10:07         ` Peter Wächtler
  1 sibling, 1 reply; 45+ messages in thread
From: Kai Henningsen @ 2002-06-09  9:49 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 08.06.02 in <Pine.LNX.4.44.0206081523410.11630-100000@home.transmeta.com>:

> On Fri, 7 Jun 2002, Peter Waechtler wrote:
> >
> > What about /proc/futex then?
>
> Why?
>
> Tell me _one_ advantage from having the thing exposed as a filename?

None, of course - the shell can't do the other futex ops, either. Futex  
file handles mean you can implement select() on them, but that's about all  
they have in common with files - there is certainly no read() or write()  
operation here!

> > Give it an entry in the namespace, why not with sockets (unix and ip)
> > also?
>
> Perhaps because you cannot enumerate sockets and pipes? They don't _have_
> names before they are created. Same as futexes, btw.

Now *there* I disagree, at least for sockets. First of all, there's  
absolutely no need to be able to enumerate unopened sockets to justify  
putting them into the namespace - you can create them (in a special fs, of  
course) at the moment they are opened. (As in fact is done, in /proc/<pid>/ 
fd/.) And second, you *can* read() and write() there.

However, I don't think that's all that important. What I'd rather see is  
making the network devices into namespace nodes. The situation of eth0 and  
friends, from a Unix perspective, is utterly unnatural.

MfG Kai

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-08 22:28       ` Linus Torvalds
  2002-06-09  9:49         ` Kai Henningsen
@ 2002-06-09 10:07         ` Peter Wächtler
  2002-06-09 17:49           ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Wächtler @ 2002-06-09 10:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, frankeh, alan

Linus Torvalds schrieb:
> 
> On Fri, 7 Jun 2002, Peter Wächtler wrote:
> >
> > What about /proc/futex then?
> 
> Why?
> 
> Tell me _one_ advantage from having the thing exposed as a filename?
> 

There is no enforcing advantage for this.

newbie question: how to provide file operations like poll
without an entry in the filesystem? (in the meantime I will try
to answer this myself :)

> The whole point with "everything is a file" is not that you have some
> random filename (indeed, sockets and pipes show that "file" and "filename"
> have nothing to do with each other), but the fact that you can use common
> tools to operate on different things.
> 
> But there's absolutely no point in opening /dev/futex from a shell script
> or similar, because you don't get anything from it. You still have to bind
> the fd to it's real object.
> 
> In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
> There's no point to it. It has no life outside the FUTEX system call, and
> the only thing that you can do by exposing it as a name is to cause
> problems for people who don't want to mount /proc, or who do not happen to
> have that device node in their /dev.
> 
> > Give it an entry in the namespace, why not with sockets (unix and ip) also?
> 
> Perhaps because you cannot enumerate sockets and pipes? They don't _have_
> names before they are created. Same as futexes, btw.
> 

Still you can open a file in the namespace and write some commands to it.
Then it turns out to be a socket on port 25:

fd=open("/dev/socket",O_RDWR);
write(fd,"connect stream 25\n",sizeof(..));
write(fd,"helo mail.my.com\n",..);
...


Just one more question: would it be possible to specify a poll operation
for /proc/blah?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09 10:07         ` Peter Wächtler
@ 2002-06-09 17:49           ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-09 17:49 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, linux-kernel, frankeh, alan



On Sun, 9 Jun 2002, Peter Wächtler wrote:
>
> Still you can open a file in the namespace and write some commands to it.
> Then it turns out to be a socket on port 25:
>
> fd=open("/dev/socket",O_RDWR);
> write(fd,"connect stream 25\n",sizeof(..));
> write(fd,"helo mail.my.com\n",..);

Yes, obviously you can avoid system calls entirely, and replace all of
them with read/write of commands.

This is not even a very uncommon idea: the above is basically message
passing, and is largely how many microkernels work. Except they don't call
it read/write, they tend to call it send/recv, and they aren't "file
descriptors", they are "ports".

It has advantages: because you only have one set of primitives, it's more
easily abtracted at that level, meaning that you can (and people do) make
it distributed etc without having to worry about local semantics.

It has disadvantages too: performance tends to be bad (you have to copy
around and parse the commands that are no longer implicit in the system
call number), and while there is a high level of abstraction on one level
("everything is a 'port' that can receive or send messages), at some point
the proverbial shit hits the fan and you've moved the details behind the
abstraction down (and now the data stream is no longer just bytes, but has
a meaning in itself).

But yes, the sequences

	open("/dev/socket")		->	socket()
	write(fd,"connect stream 25")	->	connect()

are obviously "equivalent". It's not my personal favourite equivalence,
though. I'd much rather add the information at _open_ time, and make it a
name-space issue, so that you'd do something like

	open("//sockfs/dst=123.45.67.89:25", O_RDWR);

instead. Which is _also_ entirely equivalent, of course (the "namespace"
approach does require that you be able to do "fd-relative" lookups, so
that you could also do

	sk = open("//sockfs", O_RDWR);
	sk2 = fd_open(sk, "dst=123.45.67.89:25", O_RDWR);

which is actually useful even in regular files too, just as a way of doing
directory-relative file opens without having to do a "chdir()").

HOWEVER, the fact is that exactly because they are equivalent, there is no
real difference between them. So you might as well just use the old UNIX
behaviour, and if you want to open sockets from a script, you use any of
the already _existing_ socket script helpers. For port 25, you have one
called "sendmail". For port 80, you have things like "lynx -source".

And you have tons of things like "netpipes", for doing generic scripting
of sockets.

The fact is, trying to come up with new ways to do the same old thing is
_not_ a good idea. It may look cool to expose sockets in the namespace,
but what's the actual added advantage over existing standard practices?
Unless that can be shown, there's just no point.

Do a google search for "netpipes", I'm sure you'll find it can do what you
wanted.

Sorry to rain on the "cool feature" parade, but I want to see some
_advantage_ from exposing new names in the namespace.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09  9:49         ` Kai Henningsen
@ 2002-06-09 18:09           ` Linus Torvalds
  2002-06-09 19:06             ` Thunder from the hill
                               ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-09 18:09 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: linux-kernel



On 9 Jun 2002, Kai Henningsen wrote:
>
> However, I don't think that's all that important. What I'd rather see is
> making the network devices into namespace nodes. The situation of eth0 and
> friends, from a Unix perspective, is utterly unnatural.

But what would you _do_ with them? What would be the advantage as compared
to the current situation?

Now, to configure a device, you get a fd to the device the same way you
get a fd _anyway_ - with "socket()".

And anybody who says that "socket()" is utterly unnatural to the UNIX way
is quite far out to lunch. It may be unnatural to the Plan-9 way of
"everything is a namespace", but that was never the UNIX way. The UNIX way
is "everything is a file descriptor or a process", but that was never
about namespaces.

Yes, some old-timers could argue that original UNIX didn't have sockets,
and that the BSD interface is ugly and an abomination and that it _should_
have been a namespace thing, but that argument falls flat on its face when
you realize that the "pipe()" system call _was_ in original UNIX, and has
all the same issues.

Don't get hung up about names.

			Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09 18:09           ` Linus Torvalds
@ 2002-06-09 19:06             ` Thunder from the hill
  2002-06-10  6:39             ` Kai Henningsen
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Thunder from the hill @ 2002-06-09 19:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel

Hi,

On Sun, 9 Jun 2002, Linus Torvalds wrote:
> Yes, some old-timers could argue that original UNIX didn't have sockets,

Not to mention MULTICS...

IMO this all looks like "exporting program variables to filesystem", would 
you do that? (Then we'll need /dev/memory/16k/5362337156/blah, etc.) The 
next issue would be "how do we stop other processes from using our 
sockets, semaphores, etc., ending up where we started.

Sockets are a good implementation as long as they don't fall down for some 
particular purpose. This isn't given yet. Semaphores didn't yet fall down, 
either. So what do you want more? The "old" system might look crappy to 
you, but it works! It works, even if it's a little more abstract than the 
Plan-9. Are we Plan-9?

Regards,
Thunder

PS. this mail was sent through a happily working lot of sockets.
-- 
German attitude becoming        |	Thunder from the hill at ngforever
rightaway popular:		|
       "Get outa my way,  	|	free inhabitant not directly
    for I got a mobile phone!"	|	belonging anywhere


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09 18:09           ` Linus Torvalds
  2002-06-09 19:06             ` Thunder from the hill
@ 2002-06-10  6:39             ` Kai Henningsen
  2002-06-10  7:55             ` Helge Hafting
  2002-06-10 20:57             ` H. Peter Anvin
  3 siblings, 0 replies; 45+ messages in thread
From: Kai Henningsen @ 2002-06-10  6:39 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 09.06.02 in <Pine.LNX.4.44.0206091056550.13459-100000@home.transmeta.com>:

> On 9 Jun 2002, Kai Henningsen wrote:
> >
> > However, I don't think that's all that important. What I'd rather see is
> > making the network devices into namespace nodes. The situation of eth0 and
> > friends, from a Unix perspective, is utterly unnatural.
>
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?

For one, enumerate them. Network interfaces, as opposed to sockets, are  
fairly static. And currently I have to either call /sbin/ifconfig or  
figure out where this information is hidden in /proc. (And there is a  
kernel interface, I think, some ioctl() stuff - I have mostly forgotten  
all I learned about that, it was so ugly.)

Doesn't need to be /dev/xxx nodes (that *would* be the traditional Unix  
way) - these days I'd argue for a special network device fs.

> Now, to configure a device, you get a fd to the device the same way you
> get a fd _anyway_ - with "socket()".

>From what little I remember, though, you don't bind your socket to some  
address "eth0" - which might have been halfway reasonable - but instead  
you say *that* with some ioctl().

> And anybody who says that "socket()" is utterly unnatural to the UNIX way
> is quite far out to lunch. It may be unnatural to the Plan-9 way of
> "everything is a namespace", but that was never the UNIX way. The UNIX way
> is "everything is a file descriptor or a process", but that was never
> about namespaces.
>
> Yes, some old-timers could argue that original UNIX didn't have sockets,
> and that the BSD interface is ugly and an abomination and that it _should_
> have been a namespace thing, but that argument falls flat on its face when
> you realize that the "pipe()" system call _was_ in original UNIX, and has
> all the same issues.

On the other hand, pipe() is about *anonymous* files. You could in fact  
argue that it would be nice to have an interface for anonymous disk files  
as well, instead of all the current acrobatics for safe temp file  
creation.

> Don't get hung up about names.

Especially not if your argument is then about situations where you atually  
don't want a name.

See, the problem with eth0 and friends is that they already *have* names -  
their names just aren't in the standard namespace. *That* is the real  
problem here.

Unix tradition or not, I propose the rule:

*Any kernel object that has a text name, shall appear,*
*under this name, somewhere in the filesystem.*

Or in other words, *there can only be one namespace*. (Well, only one  
global namespace, anyway.)

MfG Kai

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-02  0:10 ` Pavel Machek
@ 2002-06-10  6:57   ` Rusty Russell
  0 siblings, 0 replies; 45+ messages in thread
From: Rusty Russell @ 2002-06-10  6:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

In message <20020602001018.A35@toy.ucw.cz> you write:
> Hi!
> 
> > Name: Waker can unpin page, rather than waiting process
> > Author: Rusty Russell
> > Status: Tested in 2.5.20
> > Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
> > Depends: Futex/waitq.patch.gz
> > 
> > D: This changes the implementation so that the waker actually unpins
> > D: the page.  This is preparation for the async interface, where the
> > D: process which registered interest is not in the kernel.
> 
> What is it? Nice header, did you generate it by hand?

It's a format I made up for all the patches on my web page.  Most
important is the Depends: line which my patch scripts can deal with
quite nicely: see my kernel.org page.

Down with Bit Keeper! Long live patch! 8)
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09 18:09           ` Linus Torvalds
  2002-06-09 19:06             ` Thunder from the hill
  2002-06-10  6:39             ` Kai Henningsen
@ 2002-06-10  7:55             ` Helge Hafting
  2002-06-10 14:10               ` Thunder from the hill
  2002-06-10 15:11               ` Linus Torvalds
  2002-06-10 20:57             ` H. Peter Anvin
  3 siblings, 2 replies; 45+ messages in thread
From: Helge Hafting @ 2002-06-10  7:55 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> 
> On 9 Jun 2002, Kai Henningsen wrote:
> >
> > However, I don't think that's all that important. What I'd rather see is
> > making the network devices into namespace nodes. The situation of eth0 and
> > friends, from a Unix perspective, is utterly unnatural.
> 
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?

Not much, but 
ls /dev/net
eth0 eth1 eth2 ippp0
would be a convenient way to see what net devices exists.
This already works for other devices, when using devfs.

I guess this isn't reason enough though.

Helge Hafting

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-10  7:55             ` Helge Hafting
@ 2002-06-10 14:10               ` Thunder from the hill
  2002-06-10 20:46                 ` Kai Henningsen
  2002-06-10 15:11               ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Thunder from the hill @ 2002-06-10 14:10 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Linus Torvalds, linux-kernel

Hi,

On Mon, 10 Jun 2002, Helge Hafting wrote:
> ls /dev/net
> eth0 eth1 eth2 ippp0

What is it worth? You have a few more files which you can't do anything 
with, and ifconfig output is much more greppable etc.

I remember these network devices from Solaris. There wasn't any good about 
them IIRC, the only sane way of working with them was to work around them, 
i.e. ignoring. Do you want a /dev/ignoreme directory?

We shouldn't introduce to ignore.

Regards,
Thunder
-- 
German attitude becoming        |	Thunder from the hill at ngforever
rightaway popular:		|
       "Get outa my way,  	|	free inhabitant not directly
    for I got a mobile phone!"	|	belonging anywhere


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-10  7:55             ` Helge Hafting
  2002-06-10 14:10               ` Thunder from the hill
@ 2002-06-10 15:11               ` Linus Torvalds
  2002-06-11 15:06                 ` Eric W. Biederman
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2002-06-10 15:11 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel



On Mon, 10 Jun 2002, Helge Hafting wrote:
>
> Not much, but
> ls /dev/net
> eth0 eth1 eth2 ippp0
> would be a convenient way to see what net devices exists.
> This already works for other devices, when using devfs.

You might as well do

	cat /proc/net/dev

instead.

Which works with existing kernels, going back to whatever..

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-10 14:10               ` Thunder from the hill
@ 2002-06-10 20:46                 ` Kai Henningsen
  2002-06-11 14:14                   ` john slee
  0 siblings, 1 reply; 45+ messages in thread
From: Kai Henningsen @ 2002-06-10 20:46 UTC (permalink / raw)
  To: thunder; +Cc: linux-kernel, torvalds

thunder@ngforever.de (Thunder from the hill)  wrote on 10.06.02 in <Pine.LNX.4.44.0206100808180.6159-100000@hawkeye.luckynet.adm>:

> On Mon, 10 Jun 2002, Helge Hafting wrote:
> > ls /dev/net
> > eth0 eth1 eth2 ippp0
>
> What is it worth? You have a few more files which you can't do anything
> with, and ifconfig output is much more greppable etc.

Ifconfig output is *WHAT*?!

Ifconfig output, to be parsed by a script, is one of the shittiest  
interfaces possible.

Look at this, and then tell me again that "ifconfig output is much more  
greppable"!

# ifconfig eth0
eth0      Protokoll:Ethernet  Hardware Adresse 00:50:FC:0C:63:69
          inet Adresse:10.0.41.2  Bcast:10.255.255.255  Maske:255.255.255.0
          EtherTalk Phase 2 Adresse:65280/237
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26841078 errors:4240 dropped:0 overruns:0 frame:0
          TX packets:26134055 errors:0 dropped:0 overruns:0 carrier:0
          Kollisionen:8481
          RX bytes:60708618 (57.8 MiB)  TX bytes:2654812652 (2.4 GiB)

# LANG= ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:50:FC:0C:63:69
          inet addr:10.0.41.2  Bcast:10.255.255.255  Mask:255.255.255.0
          EtherTalk Phase 2 addr:65280/237
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26841182 errors:4240 dropped:0 overruns:0 frame:0
          TX packets:26134181 errors:0 dropped:0 overruns:0 carrier:0
          collisions:8481
          RX bytes:60727233 (57.9 MiB)  TX bytes:2654827939 (2.4 GiB)

#

Even rooting around in /proc is better than this!

> I remember these network devices from Solaris. There wasn't any good about
> them IIRC, the only sane way of working with them was to work around them,
> i.e. ignoring. Do you want a /dev/ignoreme directory?

I have no idea what Solaris did, nor do I necessarily want to know, but I  
*do* have experience with what Linux does, and I'm certainly not  
impressed.

Given that I've fairly often been irritated about not having these things  
be filesystem nodes, I'd expect there to *be* benefit in having them in  
the filesystem if this is done halfway reasonable.

MfG Kai

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-09 18:09           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2002-06-10  7:55             ` Helge Hafting
@ 2002-06-10 20:57             ` H. Peter Anvin
  3 siblings, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2002-06-10 20:57 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.44.0206091056550.13459-100000@home.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> But what would you _do_ with them? What would be the advantage as compared
> to the current situation?
> 
> Now, to configure a device, you get a fd to the device the same way you
> get a fd _anyway_ - with "socket()".
> 
> And anybody who says that "socket()" is utterly unnatural to the UNIX way
> is quite far out to lunch. It may be unnatural to the Plan-9 way of
> "everything is a namespace", but that was never the UNIX way. The UNIX way
> is "everything is a file descriptor or a process", but that was never
> about namespaces.
> 
> Yes, some old-timers could argue that original UNIX didn't have sockets,
> and that the BSD interface is ugly and an abomination and that it _should_
> have been a namespace thing, but that argument falls flat on its face when
> you realize that the "pipe()" system call _was_ in original UNIX, and has
> all the same issues.
> 

The biggest problem with the socket API is that it makes it impossible
to use in a shell script context or other context where it isn't
practical to operate on binary data.  The second-biggest problem with
the API is that it at least historically makes it basically impossible
to write address-family-independent code... something like "telnet"
shouldn't need to know if it's operating over TCP on IPv4, TCP on
IPv6, SPX on IPX...; it should be able to resolve a string to a name and
then create a connection, and the address family stuff should be
encasulated in the library.

The first problem a namespace-like solution would deal with... then
one could open /dev/sock/stream/ipv4/206.189.222.9/23 using a normal
open() call.  The second is to resolve "poetry.vogon.net","telnet" to
the above string, but that's a user-space issue.

At least it would be nice if open() on a Unix domain socket would
connect to that socket.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-08 22:42     ` Linus Torvalds
@ 2002-06-11  9:15       ` Rusty Russell
  2002-06-11 16:53         ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2002-06-11  9:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, frankeh, alan, viro

In message <Pine.LNX.4.44.0206081531550.11630-100000@home.transmeta.com> you wr
ite:
> 
> 
> On Fri, 7 Jun 2002, Rusty Russell wrote:
> >
> > Linus, Al, is there an easier way to do this?  I stole this from sockfs,
> > but I balked at another 50 lines for a proper inode creation, so I just use
> > the same dentry and inode over and over.
> 
> There's nothing inherently wrong with re-using the inode and dentry -
> that's what /dev/futex would do too, of course.
> 
> > It's still an awful lot of irrelevant code: what can I cut?
> 
> I don't think it's a matter of cutting, as much as possibly a matter of
> tryign to share some common code. pipefs, sockfs and now this: they all do
> pretty much exactly the same thing, and there is nothing that says that
> they should have separate super_operations, for example, since they are
> all identical.

Yeah, that's a nice idea.  But sockfs is sufficiently different that
it's probably not worth it.  Left for Al's larger brain.

Here's the new (tested) futex patch (actually 4 minipatches but patch
-p1 < msg works fine).

Updated futex library at:
  http://www.[cc].kernel.org/pub/linux/kernel/people/rusty/futex-2.2.tar.gz

Name: Remove requirement for page to be writable.
Author: Rusty Russell
Status: Trivial

D: It's possible that someone could be passively monitoring: write
D: access is no longer strictly required (since new futex implementation).

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.20/kernel/futex.c working-2.5.20-afutex/kernel/futex.c
--- linux-2.5.20/kernel/futex.c	Sat May 25 14:35:00 2002
+++ working-2.5.20-afutex/kernel/futex.c	Wed Jun  5 18:42:17 2002
@@ -128,9 +158,9 @@
 	int err;
 
 	down_read(&mm->mmap_sem);
-	err = get_user_pages(current, current->mm, page_start,
+	err = get_user_pages(current, mm, page_start,
 			     1 /* one page */,
-			     1 /* writable */,
+			     0 /* writable not important */,
 			     0 /* don't force */,
 			     &page,
 			     NULL /* don't return vmas */);

Name: Unpin page bug fix.
Author: Rusty Russell
Status: Tested on 2.5.20

D: This uses page_cache_release() instead of put_page(), as it might
D: be a pagecache page.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21.23289/kernel/futex.c linux-2.5.21.23289.updated/kernel/futex.c
--- linux-2.5.21.23289/kernel/futex.c	Mon Jun 10 16:03:56 2002
+++ linux-2.5.21.23289.updated/kernel/futex.c	Mon Jun 10 17:10:10 2002
@@ -70,6 +70,14 @@
 	wake_up_all(&q->waiters);
 }
 
+static inline void unpin_page(struct page *page)
+{
+	/* Avoid releasing the page which is on the LRU list.  I don't
+           know if this is correct, but it stops the BUG() in
+           __free_pages_ok(). */
+	page_cache_release(page);
+}
+
 static int futex_wake(struct list_head *head,
 		      struct page *page,
 		      unsigned int offset,
@@ -223,7 +231,7 @@
 	default:
 		ret = -EINVAL;
 	}
-	page_cache_release(page);
+	unpin_page(page);
 
 	return ret;
 }

Name: Waker can unpin page, rather than waiting process
Author: Rusty Russell
Status: Tested in 2.5.20
Depends: Futex/copy-from-user.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz

D: This changes the implementation so that the waker actually unpins
D: the page.  This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.


diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.20.19104/kernel/futex.c linux-2.5.20.19104.updated/kernel/futex.c
--- linux-2.5.20.19104/kernel/futex.c	Thu Jun  6 17:13:46 2002
+++ linux-2.5.20.19104.updated/kernel/futex.c	Thu Jun  6 17:14:30 2002
@@ -98,11 +98,13 @@
 		if (this->page == page && this->offset == offset) {
 			list_del_init(i);
 			tell_waiter(this);
+			unpin_page(this->page);
 			num_woken++;
 			if (num_woken >= num) break;
 		}
 	}
 	spin_unlock(&futex_lock);
+	unpin_page(page);
 	return num_woken;
 }
 
@@ -192,9 +194,10 @@
 	}
  out:
 	set_current_state(TASK_RUNNING);
-	/* Were we woken up anyway? */
+	/* Were we woken up anyway?  If so, it unpinned page. */
 	if (!unqueue_me(&q))
 		return 0;
+	unpin_page(page);
 	return ret;
 }
 
@@ -225,6 +228,7 @@
 	if (IS_ERR(page))
 		return PTR_ERR(page);
 
+	/* On success, these routines unpin the pages themselves. */
 	head = hash_futex(page, pos_in_page);
 	switch (op) {
 	case FUTEX_WAIT:
@@ -236,7 +240,8 @@
 	default:
 		ret = -EINVAL;
 	}
-	unpin_page(page);
+	if (ret < 0)
+		unpin_page(page);
 
 	return ret;
 }

Name: New asynchronous interface for futexes
Author: Rusty Russell
Status: Tested on 2.5.20
Depends: Futex/comment-fix.patch.gz Futex/copy-from-user.patch.gz
Depends: Futex/no-write-needed.patch.gz Futex/unpin-page-fix.patch.gz
Depends: Futex/waitq.patch.gz Futex/waker-unpin-page.patch.gz

D: This patch adds a FUTEX_FD call, for opening a file descriptor 
D: attached to a futex, which can be used with poll, select or SIGIO.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.21-pre-afutex/include/linux/futex.h working-2.5.21-afutex/include/linux/futex.h
--- working-2.5.21-pre-afutex/include/linux/futex.h	Sat May 25 14:34:59 2002
+++ working-2.5.21-afutex/include/linux/futex.h	Tue Jun 11 13:25:52 2002
@@ -4,5 +4,6 @@
 /* Second argument to futex syscall */
 #define FUTEX_WAIT (0)
 #define FUTEX_WAKE (1)
+#define FUTEX_FD (2)
 
 #endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.21-pre-afutex/kernel/futex.c working-2.5.21-afutex/kernel/futex.c
--- working-2.5.21-pre-afutex/kernel/futex.c	Tue Jun 11 13:20:06 2002
+++ working-2.5.21-afutex/kernel/futex.c	Tue Jun 11 13:26:46 2002
@@ -34,6 +34,10 @@
 #include <linux/highmem.h>
 #include <linux/time.h>
 #include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
 #include <asm/uaccess.h>
 
 /* Simple "sleep if unchanged" interface. */
@@ -41,6 +45,11 @@
 /* FIXME: This may be way too small. --RR */
 #define FUTEX_HASHBITS 6
 
+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
+/* Everyone needs a dentry and inode */
+static struct dentry *futex_dentry;
+
 /* We use this instead of a normal wait_queue_t, so we can wake only
    the relevent ones (hashed queues may be shared) */
 struct futex_q {
@@ -49,6 +58,9 @@
 	/* Page struct and offset within it. */
 	struct page *page;
 	unsigned int offset;
+	/* For fd, sigio sent using these. */
+	int fd;
+	struct file *filp;
 };
 
 /* The key for the hash is the address + index + offset within page */
@@ -65,9 +77,12 @@
 	return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
 }
 
+/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
 static inline void tell_waiter(struct futex_q *q)
 {
 	wake_up_all(&q->waiters);
+	if (q->filp)
+		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
 }
 
 static inline void unpin_page(struct page *page)
@@ -105,14 +120,16 @@
 
 /* Add at end to avoid starvation */
 static inline void queue_me(struct list_head *head,
-			    wait_queue_t *wait,
 			    struct futex_q *q,
 			    struct page *page,
-			    unsigned int offset)
+			    unsigned int offset,
+			    int fd,
+			    struct file *filp)
 {
-	add_wait_queue(&q->waiters, wait);
 	q->page = page;
 	q->offset = offset;
+	q->fd = fd;
+	q->filp = filp;
 
 	spin_lock(&futex_lock);
 	list_add_tail(&q->list, head);
@@ -166,7 +183,9 @@
 	int ret = 0;
 
 	set_current_state(TASK_INTERRUPTIBLE);
-	queue_me(head, &wait, &q, page, offset);
+	init_waitqueue_head(&q.waiters);
+	add_wait_queue(&q.waiters, &wait);
+	queue_me(head, &q, page, offset, -1, NULL);
 
 	/* Page is pinned, but may no longer be in this address space. */
 	if (get_user(curval, uaddr) != 0) {
@@ -196,6 +215,95 @@
 	return ret;
 }
 
+static int futex_close(struct inode *inode, struct file *filp)
+{
+	struct futex_q *q = filp->private_data;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list)) {
+		list_del(&q->list);
+		unpin_page(q->page);
+		/* Noone can be polling on us now. */
+		BUG_ON(waitqueue_active(&q->waiters));
+	}
+	spin_unlock(&futex_lock);
+	kfree(filp->private_data);
+	return 0;
+}
+
+/* This is one-shot: once it's gone off you need a new fd */
+static unsigned int futex_poll(struct file *filp,
+			       struct poll_table_struct *wait)
+{
+	struct futex_q *q = filp->private_data;
+	int ret = 0;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list))
+		poll_wait(filp, &q->waiters, wait);
+	else
+		ret = POLLIN | POLLRDNORM;
+	spin_unlock(&futex_lock);
+
+	return ret;
+}
+
+static struct file_operations futex_fops = {
+	release:	futex_close,
+	poll:		futex_poll,
+};
+
+/* Signal allows caller to avoid the race which would occur if they
+   set the sigio stuff up afterwards. */
+static int futex_fd(struct list_head *head,
+		    struct page *page,
+		    int offset,
+		    int signal)
+{
+	int fd;
+	struct futex_q *q;
+	struct file *filp;
+
+	if (signal < 0 || signal > _NSIG)
+		return -EINVAL;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+	filp = get_empty_filp();
+	if (!filp) {
+		put_unused_fd(fd);
+		return -ENFILE;
+	}
+	filp->f_op = &futex_fops;
+	filp->f_dentry = dget(futex_dentry);
+
+	if (signal) {
+		filp->f_owner.pid = current->pid;
+		filp->f_owner.uid = current->uid;
+		filp->f_owner.euid = current->euid;
+		filp->f_owner.signum = signal;
+	}
+
+	q = kmalloc(sizeof(*q), GFP_KERNEL);
+	if (!q) {
+		put_unused_fd(fd);
+		put_filp(filp);
+		return -ENOMEM;
+	}
+
+	/* Initialize queue structure */
+	init_waitqueue_head(&q->waiters);
+	filp->private_data = q;
+
+	/* Go for it... */
+	queue_me(head, q, page, offset, fd, filp);
+
+	/* Now we map fd to filp, so userspace can access it */
+	fd_install(fd, filp);
+	return fd;
+}
+
 asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
 {
 	int ret;
@@ -232,6 +340,10 @@
 	case FUTEX_WAKE:
 		ret = futex_wake(head, page, pos_in_page, val);
 		break;
+	case FUTEX_FD:
+		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
+		ret = futex_fd(head, page, pos_in_page, val);
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -241,9 +353,55 @@
 	return ret;
 }
 
+/* FIXME: Oh yeah, makes sense to write a filesystem... */
+static struct super_operations futexfs_ops = { statfs: simple_statfs };
+
+/* Don't check error returns: we're dead if they happen */
+static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *root;
+
+	sb->s_blocksize = 1024;
+	sb->s_blocksize_bits = 10;
+	sb->s_magic = 0xBAD1DEA;
+	sb->s_op = &futexfs_ops;
+
+	root = new_inode(sb);
+	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+	root->i_uid = root->i_gid = 0;
+	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+
+	sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
+	sb->s_root->d_sb = sb;
+	sb->s_root->d_parent = sb->s_root;
+	d_instantiate(sb->s_root, root);
+
+	/* We never let this drop to zero. */
+	futex_dentry = dget(sb->s_root);
+
+	return 0;
+}
+
+static struct super_block *
+futexfs_get_sb(struct file_system_type *fs_type,
+	       int flags, char *dev_name, void *data)
+{
+	return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
+}
+
+static struct file_system_type futex_fs_type = {
+	name:		"futexfs",
+	get_sb:		futexfs_get_sb,
+	kill_sb:	kill_anon_super,
+	fs_flags:	FS_NOMOUNT,
+};
+
 static int __init init(void)
 {
 	unsigned int i;
+
+	register_filesystem(&futex_fs_type);
+	kern_mount(&futex_fs_type);
 
 	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
 		INIT_LIST_HEAD(&futex_queues[i]);

--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-10 20:46                 ` Kai Henningsen
@ 2002-06-11 14:14                   ` john slee
  0 siblings, 0 replies; 45+ messages in thread
From: john slee @ 2002-06-11 14:14 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: thunder, linux-kernel, torvalds

On Mon, Jun 10, 2002 at 10:46:00PM +0200, Kai Henningsen wrote:
> thunder@ngforever.de (Thunder from the hill)  wrote on 10.06.02 in <Pine.LNX.4.44.0206100808180.6159-100000@hawkeye.luckynet.adm>:
> 
> > On Mon, 10 Jun 2002, Helge Hafting wrote:
> > > ls /dev/net
> > > eth0 eth1 eth2 ippp0
> >
> > What is it worth? You have a few more files which you can't do anything
> > with, and ifconfig output is much more greppable etc.
> 
> Ifconfig output is *WHAT*?!
> 
> Ifconfig output, to be parsed by a script, is one of the shittiest  
> interfaces possible.

it may be but you can use 'ip' instead which _is_ much more script
friendly and seems to have at least ifconfig functionality.

j.

-- 
toyota power: http://indigoid.net/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-10 15:11               ` Linus Torvalds
@ 2002-06-11 15:06                 ` Eric W. Biederman
  0 siblings, 0 replies; 45+ messages in thread
From: Eric W. Biederman @ 2002-06-11 15:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Helge Hafting, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Mon, 10 Jun 2002, Helge Hafting wrote:
> >
> > Not much, but
> > ls /dev/net
> > eth0 eth1 eth2 ippp0
> > would be a convenient way to see what net devices exists.
> > This already works for other devices, when using devfs.
> 
> You might as well do
> 
> 	cat /proc/net/dev
> 
> instead.
> 
> Which works with existing kernels, going back to whatever..

Gap, puke.  

Sorry I have built kernels where space was tight and I only built in
/proc so I could read /proc/net/dev.  And since with /proc everything
is all in one basket it is very hard to turn off unneeded features.

/proc might be nice to user space but as it is implemented it is
nasty to work with.

So a netdevfs or some solution that factors better would really
be nice.

Eric

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-11  9:15       ` Rusty Russell
@ 2002-06-11 16:53         ` Linus Torvalds
  2002-06-12  5:32           ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2002-06-11 16:53 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, frankeh, alan, viro


Rusty,
 this makes no sense:

D: This changes the implementation so that the waker actually unpins
D: the page.  This is preparation for the async interface, where the
D: process which registered interest is not in the kernel.

Whazzup? The closing of the fd will unpin the page, the waker has no
reason to do so. It is very much against the linux philosophy (and a
design disaster anyway) to have the waker muck with the data structures of
anything waiting.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-11 16:53         ` Linus Torvalds
@ 2002-06-12  5:32           ` Rusty Russell
  2002-06-12  9:16             ` Peter Wächtler
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2002-06-12  5:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, frankeh, alan, viro

In message <Pine.LNX.4.44.0206110951380.2712-100000@home.transmeta.com> you wri
te:
> 
> Rusty,
>  this makes no sense:
> 
> D: This changes the implementation so that the waker actually unpins
> D: the page.  This is preparation for the async interface, where the
> D: process which registered interest is not in the kernel.
> 
> Whazzup? The closing of the fd will unpin the page, the waker has no
> reason to do so. It is very much against the linux philosophy (and a
> design disaster anyway) to have the waker muck with the data structures of
> anything waiting.

Good catch: now the fd is a "one-shot" thing anyway, making close
unpin the page makes more sense.  Tested patch below (against 2.5.21).

FYI: I already violate this philosophy as I remove the waiter from the
queue when I wake them: this allows them to tell that they were woken
(waker does a list_del_init() on the waiting entry, so waiting knows
if (list_empty()) I was woken).

It would be more natural for the waiter to examine the futex value,
and if it's still unchanged go back to sleep.  But this makes
assumptions about what they're using the futex value for.  For
example, we "PASS_THIS_DIRECTLY" value into the futex.  This requires
that one (and ONLY one) process waiting actually wakes up.

This is why coming up with a primitive which allowed us to build posix
threads and fair queueing as well as "normal" unfair semantics took so
damn long.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21/include/linux/futex.h working-2.5.21-afutex/include/linux/futex.h
--- linux-2.5.21/include/linux/futex.h	Sat May 25 14:34:59 2002
+++ working-2.5.21-afutex/include/linux/futex.h	Wed Jun 12 12:32:18 2002
@@ -4,5 +4,6 @@
 /* Second argument to futex syscall */
 #define FUTEX_WAIT (0)
 #define FUTEX_WAKE (1)
+#define FUTEX_FD (2)
 
 #endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.21/kernel/futex.c working-2.5.21-afutex/kernel/futex.c
--- linux-2.5.21/kernel/futex.c	Mon Jun 10 16:03:56 2002
+++ working-2.5.21-afutex/kernel/futex.c	Wed Jun 12 14:10:41 2002
@@ -34,6 +34,10 @@
 #include <linux/highmem.h>
 #include <linux/time.h>
 #include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
 #include <asm/uaccess.h>
 
 /* Simple "sleep if unchanged" interface. */
@@ -41,6 +45,11 @@
 /* FIXME: This may be way too small. --RR */
 #define FUTEX_HASHBITS 6
 
+extern void send_sigio(struct fown_struct *fown, int fd, int band);
+
+/* Everyone needs a dentry and inode */
+static struct dentry *futex_dentry;
+
 /* We use this instead of a normal wait_queue_t, so we can wake only
    the relevent ones (hashed queues may be shared) */
 struct futex_q {
@@ -49,6 +58,9 @@
 	/* Page struct and offset within it. */
 	struct page *page;
 	unsigned int offset;
+	/* For fd, sigio sent using these. */
+	int fd;
+	struct file *filp;
 };
 
 /* The key for the hash is the address + index + offset within page */
@@ -65,9 +77,20 @@
 	return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
 }
 
+/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
 static inline void tell_waiter(struct futex_q *q)
 {
 	wake_up_all(&q->waiters);
+	if (q->filp)
+		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+}
+
+static inline void unpin_page(struct page *page)
+{
+	/* Avoid releasing the page which is on the LRU list.  I don't
+           know if this is correct, but it stops the BUG() in
+           __free_pages_ok(). */
+	page_cache_release(page);
 }
 
 static int futex_wake(struct list_head *head,
@@ -95,14 +118,16 @@
 
 /* Add at end to avoid starvation */
 static inline void queue_me(struct list_head *head,
-			    wait_queue_t *wait,
 			    struct futex_q *q,
 			    struct page *page,
-			    unsigned int offset)
+			    unsigned int offset,
+			    int fd,
+			    struct file *filp)
 {
-	add_wait_queue(&q->waiters, wait);
 	q->page = page;
 	q->offset = offset;
+	q->fd = fd;
+	q->filp = filp;
 
 	spin_lock(&futex_lock);
 	list_add_tail(&q->list, head);
@@ -130,9 +155,9 @@
 	int err;
 
 	down_read(&mm->mmap_sem);
-	err = get_user_pages(current, current->mm, page_start,
+	err = get_user_pages(current, mm, page_start,
 			     1 /* one page */,
-			     1 /* writable */,
+			     0 /* writable not important */,
 			     0 /* don't force */,
 			     &page,
 			     NULL /* don't return vmas */);
@@ -156,7 +181,9 @@
 	int ret = 0;
 
 	set_current_state(TASK_INTERRUPTIBLE);
-	queue_me(head, &wait, &q, page, offset);
+	init_waitqueue_head(&q.waiters);
+	add_wait_queue(&q.waiters, &wait);
+	queue_me(head, &q, page, offset, -1, NULL);
 
 	/* Page is pinned, but may no longer be in this address space. */
 	if (get_user(curval, uaddr) != 0) {
@@ -185,6 +212,93 @@
 	return ret;
 }
 
+static int futex_close(struct inode *inode, struct file *filp)
+{
+	struct futex_q *q = filp->private_data;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list)) {
+		list_del(&q->list);
+		/* Noone can be polling on us now. */
+		BUG_ON(waitqueue_active(&q->waiters));
+	}
+	spin_unlock(&futex_lock);
+	unpin_page(q->page);
+	kfree(filp->private_data);
+	return 0;
+}
+
+/* This is one-shot: once it's gone off you need a new fd */
+static unsigned int futex_poll(struct file *filp,
+			       struct poll_table_struct *wait)
+{
+	struct futex_q *q = filp->private_data;
+	int ret = 0;
+
+	spin_lock(&futex_lock);
+	if (!list_empty(&q->list))
+		poll_wait(filp, &q->waiters, wait);
+	else
+		ret = POLLIN | POLLRDNORM;
+	spin_unlock(&futex_lock);
+
+	return ret;
+}
+
+static struct file_operations futex_fops = {
+	release:	futex_close,
+	poll:		futex_poll,
+};
+
+/* Signal allows caller to avoid the race which would occur if they
+   set the sigio stuff up afterwards. */
+static int futex_fd(struct list_head *head,
+		    struct page *page,
+		    int offset,
+		    int signal)
+{
+	int fd;
+	struct futex_q *q;
+	struct file *filp;
+
+	if (signal < 0 || signal > _NSIG)
+		return -EINVAL;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+	filp = get_empty_filp();
+	if (!filp) {
+		put_unused_fd(fd);
+		return -ENFILE;
+	}
+	filp->f_op = &futex_fops;
+	filp->f_dentry = dget(futex_dentry);
+
+	if (signal) {
+		filp->f_owner.pid = current->pid;
+		filp->f_owner.uid = current->uid;
+		filp->f_owner.euid = current->euid;
+		filp->f_owner.signum = signal;
+	}
+
+	q = kmalloc(sizeof(*q), GFP_KERNEL);
+	if (!q) {
+		put_unused_fd(fd);
+		put_filp(filp);
+		return -ENOMEM;
+	}
+
+	/* Initialize queue structure, and add to hash table. */
+	filp->private_data = q;
+	init_waitqueue_head(&q->waiters);
+	queue_me(head, q, page, offset, fd, filp);
+
+	/* Now we map fd to filp, so userspace can access it */
+	fd_install(fd, filp);
+	return fd;
+}
+
 asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime)
 {
 	int ret;
@@ -220,17 +334,70 @@
 	case FUTEX_WAKE:
 		ret = futex_wake(head, page, pos_in_page, val);
 		break;
+	case FUTEX_FD:
+		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
+		ret = futex_fd(head, page, pos_in_page, val);
+		if (ret >= 0)
+			/* Leave page pinned (attached to fd). */
+			return ret;
+		break;
 	default:
 		ret = -EINVAL;
 	}
-	page_cache_release(page);
+	unpin_page(page);
 
 	return ret;
 }
 
+/* FIXME: Oh yeah, makes sense to write a filesystem... */
+static struct super_operations futexfs_ops = { statfs: simple_statfs };
+
+/* Don't check error returns: we're dead if they happen */
+static int futexfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *root;
+
+	sb->s_blocksize = 1024;
+	sb->s_blocksize_bits = 10;
+	sb->s_magic = 0xBAD1DEA;
+	sb->s_op = &futexfs_ops;
+
+	root = new_inode(sb);
+	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+	root->i_uid = root->i_gid = 0;
+	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+
+	sb->s_root = d_alloc(NULL, &(const struct qstr) { "futex", 5, 0 });
+	sb->s_root->d_sb = sb;
+	sb->s_root->d_parent = sb->s_root;
+	d_instantiate(sb->s_root, root);
+
+	/* We never let this drop to zero. */
+	futex_dentry = dget(sb->s_root);
+
+	return 0;
+}
+
+static struct super_block *
+futexfs_get_sb(struct file_system_type *fs_type,
+	       int flags, char *dev_name, void *data)
+{
+	return get_sb_nodev(fs_type, flags, data, futexfs_fill_super);
+}
+
+static struct file_system_type futex_fs_type = {
+	name:		"futexfs",
+	get_sb:		futexfs_get_sb,
+	kill_sb:	kill_anon_super,
+	fs_flags:	FS_NOMOUNT,
+};
+
 static int __init init(void)
 {
 	unsigned int i;
+
+	register_filesystem(&futex_fs_type);
+	kern_mount(&futex_fs_type);
 
 	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
 		INIT_LIST_HEAD(&futex_queues[i]);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12  5:32           ` Rusty Russell
@ 2002-06-12  9:16             ` Peter Wächtler
  2002-06-12 14:19               ` Hubertus Franke
                                 ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Peter Wächtler @ 2002-06-12  9:16 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Linus Torvalds, linux-kernel, frankeh

Rusty Russell wrote:
> In message <Pine.LNX.4.44.0206110951380.2712-100000@home.transmeta.com> you wri
> te:
> 
>>Rusty,
>> this makes no sense:
>>
>>D: This changes the implementation so that the waker actually unpins
>>D: the page.  This is preparation for the async interface, where the
>>D: process which registered interest is not in the kernel.
>>
>>Whazzup? The closing of the fd will unpin the page, the waker has no
>>reason to do so. It is very much against the linux philosophy (and a
>>design disaster anyway) to have the waker muck with the data structures of
>>anything waiting.
>>
> 
> Good catch: now the fd is a "one-shot" thing anyway, making close
> unpin the page makes more sense.  Tested patch below (against 2.5.21).
> 
> FYI: I already violate this philosophy as I remove the waiter from the
> queue when I wake them: this allows them to tell that they were woken
> (waker does a list_del_init() on the waiting entry, so waiting knows
> if (list_empty()) I was woken).
> 
> It would be more natural for the waiter to examine the futex value,
> and if it's still unchanged go back to sleep.  But this makes
> assumptions about what they're using the futex value for.  For
> example, we "PASS_THIS_DIRECTLY" value into the futex.  This requires
> that one (and ONLY one) process waiting actually wakes up.
> 
> This is why coming up with a primitive which allowed us to build posix
> threads and fair queueing as well as "normal" unfair semantics took so
> damn long.
> 

What are the plans on how to deal with a waiter when the lock holder
dies abnormally?

What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
setting errno to a reasonable value (EIO?)

I couldn't find anything in susv3



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12  9:16             ` Peter Wächtler
@ 2002-06-12 14:19               ` Hubertus Franke
  2002-06-12 16:50                 ` Peter Wächtler
  2002-06-12 15:39               ` Linus Torvalds
  2002-06-13  1:32               ` Rusty Russell
  2 siblings, 1 reply; 45+ messages in thread
From: Hubertus Franke @ 2002-06-12 14:19 UTC (permalink / raw)
  To: Peter Wächtler, Rusty Russell; +Cc: Linus Torvalds, linux-kernel

On Wednesday 12 June 2002 05:16 am, Peter Wächtler wrote:
> Rusty Russell wrote:
> > In message <Pine.LNX.4.44.0206110951380.2712-100000@home.transmeta.com>
> > you wri
> >
> > te:
> >>Rusty,
> >> this makes no sense:
> >>
> >>D: This changes the implementation so that the waker actually unpins
> >>D: the page.  This is preparation for the async interface, where the
> >>D: process which registered interest is not in the kernel.
> >>
> >>Whazzup? The closing of the fd will unpin the page, the waker has no
> >>reason to do so. It is very much against the linux philosophy (and a
> >>design disaster anyway) to have the waker muck with the data structures
> >> of anything waiting.
> >
> > Good catch: now the fd is a "one-shot" thing anyway, making close
> > unpin the page makes more sense.  Tested patch below (against 2.5.21).
> >
> > FYI: I already violate this philosophy as I remove the waiter from the
> > queue when I wake them: this allows them to tell that they were woken
> > (waker does a list_del_init() on the waiting entry, so waiting knows
> > if (list_empty()) I was woken).
> >
> > It would be more natural for the waiter to examine the futex value,
> > and if it's still unchanged go back to sleep.  But this makes
> > assumptions about what they're using the futex value for.  For
> > example, we "PASS_THIS_DIRECTLY" value into the futex.  This requires
> > that one (and ONLY one) process waiting actually wakes up.
> >
> > This is why coming up with a primitive which allowed us to build posix
> > threads and fair queueing as well as "normal" unfair semantics took so
> > damn long.
>
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?
>
> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)
>
> I couldn't find anything in susv3

I thing this was decided some time ago that we won't deal with this situation.
If we need to, the following needs to happen.

A) we need to follow a protocol to register the PID/TID id within the lock.
    Example of this is described in 
	"Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
            Proceedings of the 7th IEEE Symposium on Parallel and Distributed 
            Systems, 1995.

B) this again requires that some entity (kernel ?) knows about all locks, 
whether waited on in the kernel or not.

C) I believe we need a deamon that occasinally identifies dead locks.

Is it worth all this trouble? Or do we need two versions of the ?

The paper describes that for a Sun SS20/61 the safe spin locks obtained
only 25% of the performance of spinlocks.


-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12  9:16             ` Peter Wächtler
  2002-06-12 14:19               ` Hubertus Franke
@ 2002-06-12 15:39               ` Linus Torvalds
  2002-06-12 16:29                 ` Peter Wächtler
  2002-06-13  1:32               ` Rusty Russell
  2 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2002-06-12 15:39 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, linux-kernel, frankeh


On Wed, 12 Jun 2002, Peter Wächtler wrote:
> 
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?

That's why they are called FUTEX'es - they're fast. They're NOT SysV 
semaphores, and they are done 99% in user space. The kernel doesn't even 
_know_ about them until contention happens, and even then only in a rather 
dim "somebody wants me to do this, but I don't know _what_ he is doing" 
way.

> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)

There's just nothing the kernel _can_ do. The common case (by far) is that
the kernel has never seen the futex at all, since many uses are likely to
not have much contention. So when a user program dies holding such a 
uncontended lock, the kernel simply _cannot_ do anything.

(The kernel also cannot do anything even for the contended locks, because 
the whole interface is designed for speed and with the knowledge that the 
kernel won't be able to fix stuff up, so the kernel doesn't actually have 
enough information even in the contention case. See the "dim notion" 
above).

Besides, if you have a threads package that uses some lock for mutual 
exclusion, and a thread dies while holding the lock, there's nothign sane 
anybody can do about it anyway. The data structures are likely to be in an 
invalid state, and just making every other thread block on the lock until 
you can attach a debugger is probably the closest to a _right_ thing you 
can do.

In short: it's not a bug, it's a design feature, and it's very much 
designed for efficiency.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 15:39               ` Linus Torvalds
@ 2002-06-12 16:29                 ` Peter Wächtler
  2002-06-12 16:52                   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Wächtler @ 2002-06-12 16:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, frankeh

Linus Torvalds wrote:
> On Wed, 12 Jun 2002, Peter Wächtler wrote:
> 
>>What are the plans on how to deal with a waiter when the lock holder
>>dies abnormally?
>>
> 
> That's why they are called FUTEX'es - they're fast. They're NOT SysV 
> semaphores, and they are done 99% in user space. The kernel doesn't even 
> _know_ about them until contention happens, and even then only in a rather 
> dim "somebody wants me to do this, but I don't know _what_ he is doing" 
> way.
> 
> 
>>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
>>setting errno to a reasonable value (EIO?)
>>
> 
> There's just nothing the kernel _can_ do. The common case (by far) is that
> the kernel has never seen the futex at all, since many uses are likely to
> not have much contention. So when a user program dies holding such a 
> uncontended lock, the kernel simply _cannot_ do anything.
> 
For the uncontended case: their is no blocked process...

Huh, I think you misunderstood me.

One (or more) process is blocked in a waitqueue in the kernel - waiting
for a futex to be released.

The lock holder crashes - say with SIGSEGV. Now if we don't release the
waiters, they wait until reboot or user/admin kills them with a signal -
assuming they are interruptible sleeping.

I know that the kernel can't do anything about the aborted critical section.
But the waiters should be "freed" - and now we can discuss if we kill them
or report an error and let them deal with that.

So we surely have a process_exit_cleanup function (where FDs are closed etc).
There we would have to add a check if that process is holding a futex, the
waitqueue for that and "release" all waiters.

Can't be done? I don't think that this would add a performance hit
since it's only done on exit (and especially "abnormal" exit).

There is no way to check if a process holds a futex and which processes
are blocked on the associated waitqueue?

The waitqueue is built upon linking a struct futex_q list on the blocked
processes stack.

The entries to these lists are in a static array
struct list_head futex_queues[1<<FUTEX_HASHBITS].
At least we could search them on "exit due to fatal signal" when exiting.
Perhaps spending a bit in task_struct WHEN they got a lock - so we don't
have to search on every process exit.

Yes, searching the hash array lists could last a long time, but:

Is process exit time that important?

> (The kernel also cannot do anything even for the contended locks, because 
> the whole interface is designed for speed and with the knowledge that the 
> kernel won't be able to fix stuff up, so the kernel doesn't actually have 
> enough information even in the contention case. See the "dim notion" 
> above).
> 
> Besides, if you have a threads package that uses some lock for mutual 
> exclusion, and a thread dies while holding the lock, there's nothign sane 
> anybody can do about it anyway. The data structures are likely to be in an 
> invalid state, and just making every other thread block on the lock until 
> you can attach a debugger is probably the closest to a _right_ thing you 
> can do.
> 
> In short: it's not a bug, it's a design feature, and it's very much 
> designed for efficiency.
> 

And leave dangling processes (lost futex zombies)?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 14:19               ` Hubertus Franke
@ 2002-06-12 16:50                 ` Peter Wächtler
  2002-06-12 18:15                   ` Vladimir Zidar
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Wächtler @ 2002-06-12 16:50 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, Linus Torvalds, linux-kernel

Hubertus Franke wrote:
> On Wednesday 12 June 2002 05:16 am, Peter Wächtler wrote:
> 
>>Rusty Russell wrote:
>>
>>>In message <Pine.LNX.4.44.0206110951380.2712-100000@home.transmeta.com>
>>>you wri
>>>
>>>te:
>>>
>>>>Rusty,
>>>>this makes no sense:
>>>>
>>>>D: This changes the implementation so that the waker actually unpins
>>>>D: the page.  This is preparation for the async interface, where the
>>>>D: process which registered interest is not in the kernel.
>>>>
>>>>Whazzup? The closing of the fd will unpin the page, the waker has no
>>>>reason to do so. It is very much against the linux philosophy (and a
>>>>design disaster anyway) to have the waker muck with the data structures
>>>>of anything waiting.
>>>>
>>>Good catch: now the fd is a "one-shot" thing anyway, making close
>>>unpin the page makes more sense.  Tested patch below (against 2.5.21).
>>>
>>>FYI: I already violate this philosophy as I remove the waiter from the
>>>queue when I wake them: this allows them to tell that they were woken
>>>(waker does a list_del_init() on the waiting entry, so waiting knows
>>>if (list_empty()) I was woken).
>>>
>>>It would be more natural for the waiter to examine the futex value,
>>>and if it's still unchanged go back to sleep.  But this makes
>>>assumptions about what they're using the futex value for.  For
>>>example, we "PASS_THIS_DIRECTLY" value into the futex.  This requires
>>>that one (and ONLY one) process waiting actually wakes up.
>>>
>>>This is why coming up with a primitive which allowed us to build posix
>>>threads and fair queueing as well as "normal" unfair semantics took so
>>>damn long.
>>>
>>What are the plans on how to deal with a waiter when the lock holder
>>dies abnormally?
>>
>>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
>>setting errno to a reasonable value (EIO?)
>>
>>I couldn't find anything in susv3
>>
> 
> I thing this was decided some time ago that we won't deal with this situation.
> If we need to, the following needs to happen.
> 
> A) we need to follow a protocol to register the PID/TID id within the lock.
>     Example of this is described in 
> 	"Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
>             Proceedings of the 7th IEEE Symposium on Parallel and Distributed 
>             Systems, 1995.
> 
> B) this again requires that some entity (kernel ?) knows about all locks, 
> whether waited on in the kernel or not.
> 
> C) I believe we need a deamon that occasinally identifies dead locks.
> 
> Is it worth all this trouble? Or do we need two versions of the ?
> 
> The paper describes that for a Sun SS20/61 the safe spin locks obtained
> only 25% of the performance of spinlocks.
> 

Oops, I see it already myself. We lack a way for saying who is/was owning
the futex and so we can hardly tell who is waiting on this "unknown" lock.
:-(








^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:29                 ` Peter Wächtler
@ 2002-06-12 16:52                   ` Linus Torvalds
  2002-06-12 17:07                     ` Peter Wächtler
                                       ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-12 16:52 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, linux-kernel, frankeh




On Wed, 12 Jun 2002, Peter Wächtler wrote:
>
> For the uncontended case: their is no blocked process...

Wrong.

The process that holds the lock can die _before_ it gets contended.

When another thread comes in, it now is contended, but the kernel doesn't
know about anything.

> One (or more) process is blocked in a waitqueue in the kernel - waiting
> for a futex to be released.
>
> The lock holder crashes - say with SIGSEGV.

The lock holder may have crashed long before the waiting process even
started waiting.

Besides, the kernel only knows about those processes that see contention.
Even if the contention happened _before_ the lock holder crashed, the
kernel doesn't know about the lock holder itself - it only knows about the
process that caused the contention. The kernel will get to know about the
lock holder only when it tris to resolve the contention, and since that
one has crashed, that will never happen.

> I know that the kernel can't do anything about the aborted critical section.
> But the waiters should be "freed" - and now we can discuss if we kill them
> or report an error and let them deal with that.

The waiters should absolutely _not_ be freed. There's nothign they can do
about it. The data inside the critical region is no longer valid, and

> Can't be done? I don't think that this would add a performance hit
> since it's only done on exit (and especially "abnormal" exit).

But the point is not that it would be a performance hit on "exit()", but
that WE DON'T TRACK THE LOCKS in the kernel in the first place.

Right now the kernel does _zero_ work for a lock that isn't contended. It
doesn't know _anything_ about the process that got the lock initially.

Any amount of tracking would be _extremely_ expensive. Right now getting
an uncontended lock is about 15 CPU cycles in user space.

Tryin to tell the kernel about gettign that lock takes about 1us on a P4
(system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
minimum. Compared to the current 15 cycles. That's more than three orders
of magnitude slower than the current code, and you just lost the whole
point of doing this all in user space in the first place.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:52                   ` Linus Torvalds
@ 2002-06-12 17:07                     ` Peter Wächtler
  2002-06-12 18:32                     ` Saurabh Desai
                                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Peter Wächtler @ 2002-06-12 17:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, frankeh

Linus Torvalds wrote:
> 
> 
> On Wed, 12 Jun 2002, Peter Wächtler wrote:
> 
>>For the uncontended case: their is no blocked process...
>>
> 
> Wrong.
> 
> The process that holds the lock can die _before_ it gets contended.
> 
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.
> 
> 
>>One (or more) process is blocked in a waitqueue in the kernel - waiting
>>for a futex to be released.
>>
>>The lock holder crashes - say with SIGSEGV.
>>
> 
> The lock holder may have crashed long before the waiting process even
> started waiting.
> 
> Besides, the kernel only knows about those processes that see contention.
> Even if the contention happened _before_ the lock holder crashed, the
> kernel doesn't know about the lock holder itself - it only knows about the
> process that caused the contention. The kernel will get to know about the
> lock holder only when it tris to resolve the contention, and since that
> one has crashed, that will never happen.
> 
> 
>>I know that the kernel can't do anything about the aborted critical section.
>>But the waiters should be "freed" - and now we can discuss if we kill them
>>or report an error and let them deal with that.
>>
> 
> The waiters should absolutely _not_ be freed. There's nothign they can do
> about it. The data inside the critical region is no longer valid, and
> 
> 
>>Can't be done? I don't think that this would add a performance hit
>>since it's only done on exit (and especially "abnormal" exit).
>>
> 
> But the point is not that it would be a performance hit on "exit()", but
> that WE DON'T TRACK THE LOCKS in the kernel in the first place.
> 
> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
> 
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
> 
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.
> 

Thanks for this patient explanation. I see the problem now clearly.

To Frank: I will read the (already downloaded) paper ;-)

And to all: Did you notice the "nutex" approach
http://marc.theaimsgroup.com/?l=linux-kernel&m=102373047428621&w=2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:50                 ` Peter Wächtler
@ 2002-06-12 18:15                   ` Vladimir Zidar
  0 siblings, 0 replies; 45+ messages in thread
From: Vladimir Zidar @ 2002-06-12 18:15 UTC (permalink / raw)
  To: Peter Wächtler
  Cc: Hubertus Franke, Rusty Russell, Linus Torvalds, linux-kernel

On Wed, 2002-06-12 at 18:50, Peter Wächtler wrote:

> >>What are the plans on how to deal with a waiter when the lock holder
> >>dies abnormally?
> >>
> >>What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> >>setting errno to a reasonable value (EIO?)
> >>
> >>I couldn't find anything in susv3
> >>
> > 
> > I thing this was decided some time ago that we won't deal with this situation.
> > If we need to, the following needs to happen.
> > 
> > A) we need to follow a protocol to register the PID/TID id within the lock.
> >     Example of this is described in 
> > 	"Recoverable User-Level Mutual Exclusion" by Phiilip Bohannon
> >             Proceedings of the 7th IEEE Symposium on Parallel and Distributed 
> >             Systems, 1995.
> > 
> > B) this again requires that some entity (kernel ?) knows about all locks, 
> > whether waited on in the kernel or not.
> > 
> > C) I believe we need a deamon that occasinally identifies dead locks.
> > 
> > Is it worth all this trouble? Or do we need two versions of the ?
> > 
> > The paper describes that for a Sun SS20/61 the safe spin locks obtained
> > only 25% of the performance of spinlocks.
> > 
> 
> Oops, I see it already myself. We lack a way for saying who is/was owning
> the futex and so we can hardly tell who is waiting on this "unknown" lock.
> :-(

 Which is not the case in 'nutex' implementation.

 Take look:

 http://www.mindnever.org/~mr_w/nutex_mod.tar.gz


-- 
Bye,

 and have a very nice day !




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:52                   ` Linus Torvalds
  2002-06-12 17:07                     ` Peter Wächtler
@ 2002-06-12 18:32                     ` Saurabh Desai
  2002-06-12 20:05                     ` Oliver Xymoron
                                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Saurabh Desai @ 2002-06-12 18:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Wächtler, Rusty Russell, linux-kernel, frankeh

Linus Torvalds wrote:
> 
> On Wed, 12 Jun 2002, Peter Wächtler wrote:
> >
> > For the uncontended case: their is no blocked process...
> 
> Wrong.
> 
> The process that holds the lock can die _before_ it gets contended.
> 
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.
> 
> > One (or more) process is blocked in a waitqueue in the kernel - waiting
> > for a futex to be released.
> >
> > The lock holder crashes - say with SIGSEGV.
> 
> The lock holder may have crashed long before the waiting process even
> started waiting.

	Is it possible to check whether the holder process exist before
start waiting? may be in user-space? 

> 
> Besides, the kernel only knows about those processes that see contention.
> Even if the contention happened _before_ the lock holder crashed, the
> kernel doesn't know about the lock holder itself - it only knows about the
> process that caused the contention. The kernel will get to know about the
> lock holder only when it tris to resolve the contention, and since that
> one has crashed, that will never happen.

	Just curious: is there anything to register a cleanup handler
when a process holds a lock, so when that goes away, run the cleanup and
wakeup all waiters?
 
> 
> > I know that the kernel can't do anything about the aborted critical section.
> > But the waiters should be "freed" - and now we can discuss if we kill them
> > or report an error and let them deal with that.
> 
> The waiters should absolutely _not_ be freed. There's nothign they can do
> about it. The data inside the critical region is no longer valid, and
> 
> > Can't be done? I don't think that this would add a performance hit
> > since it's only done on exit (and especially "abnormal" exit).
> 
> But the point is not that it would be a performance hit on "exit()", but
> that WE DON'T TRACK THE LOCKS in the kernel in the first place.
> 
> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
> 
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
> 
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.
> 
>                 Linus
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:52                   ` Linus Torvalds
  2002-06-12 17:07                     ` Peter Wächtler
  2002-06-12 18:32                     ` Saurabh Desai
@ 2002-06-12 20:05                     ` Oliver Xymoron
  2002-06-12 20:16                       ` Linus Torvalds
  2002-06-13  2:57                     ` Rusty Russell
  2002-06-13 16:38                     ` Gabriel Paubert
  4 siblings, 1 reply; 45+ messages in thread
From: Oliver Xymoron @ 2002-06-12 20:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Wächtler, Rusty Russell, linux-kernel, frankeh

On Wed, 12 Jun 2002, Linus Torvalds wrote:

> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
>
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.

That doesn't rule out approaches like storing a cookie alongside the lock
once it's acquired (or in a parallel space). Which can easily be done with
a wrapper around lock acquisition. And stale lock detection needn't be
done in kernel space either.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 20:05                     ` Oliver Xymoron
@ 2002-06-12 20:16                       ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-12 20:16 UTC (permalink / raw)
  To: Oliver Xymoron; +Cc: Peter Wächtler, Rusty Russell, linux-kernel, frankeh


On Wed, 12 Jun 2002, Oliver Xymoron wrote:
> 
> That doesn't rule out approaches like storing a cookie alongside the lock
> once it's acquired (or in a parallel space). Which can easily be done with
> a wrapper around lock acquisition. And stale lock detection needn't be
> done in kernel space either.

Oh, agreed, you can do debugging locks in user-space, but it won't be the
kernel that does anything, it will instead have to depend on something
like "if I time out on the lock, I can explicitly test if the previous
holder (who saved his thread ID in memory after getting the lock) is still
alive and try to clean up after him".

This is what a lot of the filesystem locking code does for the things in
/var/lock/xxx, of course.

No kernel necessarily involved. 

But it's going to have to depend on the politeness of all threads that 
partake in the locking.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12  9:16             ` Peter Wächtler
  2002-06-12 14:19               ` Hubertus Franke
  2002-06-12 15:39               ` Linus Torvalds
@ 2002-06-13  1:32               ` Rusty Russell
  2 siblings, 0 replies; 45+ messages in thread
From: Rusty Russell @ 2002-06-13  1:32 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Linus Torvalds, linux-kernel, frankeh

In message <3D071153.9020607@loewe-komp.de> you write:
> What are the plans on how to deal with a waiter when the lock holder
> dies abnormally?
> 
> What about sending a signal (SIGTRAP or SIGLOST), returning -1 and
> setting errno to a reasonable value (EIO?)

This is becoming an FAQ:

No, you can't.  These locks are fast because they don't go through the
kernel in the contented case -> the kernel doesn't know who has it.

You can deal with this in userspace, though, if you make certain
assumptions.  Have a daemon which say, polls the futexes every second
and when a futex is taken for two seconds in a row, tries to grab it
itself with a 1 second timeout: if it fails it assumes the lock is
stuck and take action (note: anyone can lock or unlock, there's no
lock "owner" except by convention).

This is not perfect, but neither is the fcntl lock case of "process
dies, lock vanishes, noone gets told".  And since the cleanup is
app-specific, making the whole thing app-specific is not so crazy.

Hope that clarifies,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:52                   ` Linus Torvalds
                                       ` (2 preceding siblings ...)
  2002-06-12 20:05                     ` Oliver Xymoron
@ 2002-06-13  2:57                     ` Rusty Russell
  2002-06-13  9:37                       ` Peter Wächtler
  2002-06-13 16:38                     ` Gabriel Paubert
  4 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2002-06-13  2:57 UTC (permalink / raw)
  To: Linus Torvalds, Peter Wächtler; +Cc: Rusty Russell, linux-kernel, frankeh

In message <Pine.LNX.4.44.0206120946100.22189-100000@home.transmeta.com> you wr
ite:
> 
> 
> 
> On Wed, 12 Jun 2002, Peter W=E4chtler wrote:
> >
> > For the uncontended case: their is no blocked process...
> 
> Wrong.
> 
> The process that holds the lock can die _before_ it gets contended.
> 
> When another thread comes in, it now is contended, but the kernel doesn't
> know about anything.

Note also: this is a feature.

I have a little helper program which can grab or release a futex in a
(mmapped) file.  It's great for shell scripts to grab locks.  In this
case the helper exits with the lock held, and a later invocation
releases a lock it never held.

*AND* the lock is persistent across reboots, since it's in a file.
How cool is that!

This is the *third* major thread on this subject, BTW.
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-13  2:57                     ` Rusty Russell
@ 2002-06-13  9:37                       ` Peter Wächtler
  2002-06-13  9:55                         ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Wächtler @ 2002-06-13  9:37 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, frankeh

Rusty Russell wrote:
> In message <Pine.LNX.4.44.0206120946100.22189-100000@home.transmeta.com> you wr
> ite:
> 
>>
>>
>>On Wed, 12 Jun 2002, Peter W=E4chtler wrote:
>>
>>>For the uncontended case: their is no blocked process...
>>>
>>Wrong.
>>
>>The process that holds the lock can die _before_ it gets contended.
>>
>>When another thread comes in, it now is contended, but the kernel doesn't
>>know about anything.
>>
> 
> Note also: this is a feature.
> 
> I have a little helper program which can grab or release a futex in a
> (mmapped) file.  It's great for shell scripts to grab locks.  In this
> case the helper exits with the lock held, and a later invocation
> releases a lock it never held.
> 
> *AND* the lock is persistent across reboots, since it's in a file.
> How cool is that!
> 

Don't want to bugg you: but you would have to clean them up in any case
when you restart your system of cooperating programs.

Posix shmem would be a nice place to store your mmaped file so that it's
gone after reboot - but gives "kernel life time".

And not that I want to put the futexes down: but now I understand why
the PROCESS_SHARED locks on Irix live in the kernel. Yes, perhaps we
should provide both and the app can choose what suits best.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-13  9:37                       ` Peter Wächtler
@ 2002-06-13  9:55                         ` Rusty Russell
  0 siblings, 0 replies; 45+ messages in thread
From: Rusty Russell @ 2002-06-13  9:55 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: linux-kernel

In message <3D0867EA.8090809@loewe-komp.de> you write:
> Rusty Russell wrote:
> > *AND* the lock is persistent across reboots, since it's in a file.
> > How cool is that!
> > 
> 
> Don't want to bugg you: but you would have to clean them up in any case
> when you restart your system of cooperating programs.

Sorry, I should have added that I don't think it's cool in a *useful*
way.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-12 16:52                   ` Linus Torvalds
                                       ` (3 preceding siblings ...)
  2002-06-13  2:57                     ` Rusty Russell
@ 2002-06-13 16:38                     ` Gabriel Paubert
  2002-06-13 16:40                       ` Linus Torvalds
  4 siblings, 1 reply; 45+ messages in thread
From: Gabriel Paubert @ 2002-06-13 16:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:

> Right now the kernel does _zero_ work for a lock that isn't contended. It
> doesn't know _anything_ about the process that got the lock initially.
> 
> Any amount of tracking would be _extremely_ expensive. Right now getting
> an uncontended lock is about 15 CPU cycles in user space.
> 
> Tryin to tell the kernel about gettign that lock takes about 1us on a P4
> (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles
> minimum. Compared to the current 15 cycles. That's more than three orders
> of magnitude slower than the current code, and you just lost the whole
> point of doing this all in user space in the first place.


Please tell us where you got your 18GHz CPU (18000 cycles/microsecond) ;-)

I want one, badly!

	Gabriel


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-13 16:38                     ` Gabriel Paubert
@ 2002-06-13 16:40                       ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2002-06-13 16:40 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linux-kernel



On Thu, 13 Jun 2002, Gabriel Paubert wrote:
>
> Please tell us where you got your 18GHz CPU (18000 cycles/microsecond) ;-)

Ok, ok, I can't do math. Rub it in.

		Linus


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
  2002-06-06 16:08 Martin Wirth
@ 2002-06-06 22:59 ` Rusty Russell
  0 siblings, 0 replies; 45+ messages in thread
From: Rusty Russell @ 2002-06-06 22:59 UTC (permalink / raw)
  To: Martin Wirth; +Cc: linux-kernel

In message <3CFF8904.9010703@dlr.de> you write:
> Hi Rusty,
> 
> >if (this->page == page && this->offset == offset) {
> > 			list_del_init(i);
> > 			tell_waiter(this);
> >+			unpin_page(this->page);
> > 			num_woken++;
> > 			if (num_woken >= num) break;
> > 		}
> > 	}
> > 	spin_unlock(&futex_lock);
> >+	unpin_page(page);
> > 	return num_woken;
> 
> If I understand right you shouldn't unpin the page if you are not sure that
> all waiters for a specific (page,offset)-combination are woken up and deleted
> from the waitqueue. Otherwise a second call to futex_wake may look on the wro
ng
> hash_queue or wake the wrong waiters.

No, we delete them from the list (list_del_init) so we can't find it
again: ie. futex_wake can't be invoked twice on the same thing without
someone putting it back on the queue...

> In general, I think fast userspace synchronization primitives and
> asynchronous notification are different enough to keep them
> logically more separated.  Your double use of the hashed wait queues
> and sys_call make the code difficult to grasp and thus open for
> subtle error.

I don't think the complexity is much worse: async is inherently
complex given that there are no real in-kernel primitives to do offer
both sync and async cleanly.  Having two incompatible primitives would
be painful, too.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Futex Asynchronous Interface
@ 2002-06-06 16:08 Martin Wirth
  2002-06-06 22:59 ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Wirth @ 2002-06-06 16:08 UTC (permalink / raw)
  To: Rusty Russell, linux-kernel

Hi Rusty,

>if (this->page == page && this->offset == offset) {
> 			list_del_init(i);
> 			tell_waiter(this);
>+			unpin_page(this->page);
> 			num_woken++;
> 			if (num_woken >= num) break;
> 		}
> 	}
> 	spin_unlock(&futex_lock);
>+	unpin_page(page);
> 	return num_woken;

If I understand right you shouldn't unpin the page if you are not sure that
all waiters for a specific (page,offset)-combination are woken up and deleted
from the waitqueue. Otherwise a second call to futex_wake may look on the wrong
hash_queue or wake the wrong waiters.

In general, I think fast userspace synchronization primitives and asynchronous 
notification are different enough to keep them logically more separated. 
Your double use of the hashed wait queues and sys_call make the code difficult
to grasp and thus open for subtle error.

Martin

Martin 



^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2002-06-13 16:40 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-06-06  7:26 [PATCH] Futex Asynchronous Interface Rusty Russell
2002-06-02  0:10 ` Pavel Machek
2002-06-10  6:57   ` Rusty Russell
2002-06-06 16:36 ` Linus Torvalds
2002-06-06 19:27   ` Alan Cox
2002-06-06 23:21   ` Rusty Russell
2002-06-07  8:33     ` Peter Wächtler
2002-06-08 22:28       ` Linus Torvalds
2002-06-09  9:49         ` Kai Henningsen
2002-06-09 18:09           ` Linus Torvalds
2002-06-09 19:06             ` Thunder from the hill
2002-06-10  6:39             ` Kai Henningsen
2002-06-10  7:55             ` Helge Hafting
2002-06-10 14:10               ` Thunder from the hill
2002-06-10 20:46                 ` Kai Henningsen
2002-06-11 14:14                   ` john slee
2002-06-10 15:11               ` Linus Torvalds
2002-06-11 15:06                 ` Eric W. Biederman
2002-06-10 20:57             ` H. Peter Anvin
2002-06-09 10:07         ` Peter Wächtler
2002-06-09 17:49           ` Linus Torvalds
2002-06-07  9:06   ` Rusty Russell
2002-06-08 22:42     ` Linus Torvalds
2002-06-11  9:15       ` Rusty Russell
2002-06-11 16:53         ` Linus Torvalds
2002-06-12  5:32           ` Rusty Russell
2002-06-12  9:16             ` Peter Wächtler
2002-06-12 14:19               ` Hubertus Franke
2002-06-12 16:50                 ` Peter Wächtler
2002-06-12 18:15                   ` Vladimir Zidar
2002-06-12 15:39               ` Linus Torvalds
2002-06-12 16:29                 ` Peter Wächtler
2002-06-12 16:52                   ` Linus Torvalds
2002-06-12 17:07                     ` Peter Wächtler
2002-06-12 18:32                     ` Saurabh Desai
2002-06-12 20:05                     ` Oliver Xymoron
2002-06-12 20:16                       ` Linus Torvalds
2002-06-13  2:57                     ` Rusty Russell
2002-06-13  9:37                       ` Peter Wächtler
2002-06-13  9:55                         ` Rusty Russell
2002-06-13 16:38                     ` Gabriel Paubert
2002-06-13 16:40                       ` Linus Torvalds
2002-06-13  1:32               ` Rusty Russell
2002-06-06 16:08 Martin Wirth
2002-06-06 22:59 ` Rusty Russell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.