* [RFC][PATCH] inotify 0.9
@ 2004-09-15 15:52 John McCutchan
  2004-09-15 18:00 ` Robert Love
  2004-09-16 15:07 ` Bill Davidsen
  0 siblings, 2 replies; 21+ messages in thread
From: John McCutchan @ 2004-09-15 15:52 UTC (permalink / raw)
  To: linux-kernel, nautilus-list, gamin-list, xml, viro, akpm, iggy

[-- Attachment #1: Type: text/plain, Size: 4884 bytes --]

Hello,

I am releasing a new version of inotify. Attached is a patch for
2.6.8.1.

I am interested in getting inotify included in the mm tree. 

Inotify is designed as a replacement for dnotify. The key differences
are that inotify does not require the file to be opened in order to
watch it; that something you are watching with inotify can go away (if
its path is unmounted), in which case you are sent an event telling you
it is gone; and that events are delivered over a fd rather than by
signals.

New in this version:
Driver now supports reading more than one event at a time
Bump maximum number of watches per device from 64 to 8192
Bump maximum number of queued events per device from 64 to 256

--COMPLEXITY--

I have been asked what the complexity of inotify is. Inotify has
2 code paths where complexity could be an issue:

Adding a watcher to a device
	This code has to check whether the inode is already being
	watched by the device. This is O(1), since an inode has at most
	one watcher per device and the maximum number of devices is
	limited to 8.


Removing a watch from a device
	This code has to search all watches on the device to find the
	watch descriptor that it has been asked to remove. This is a
	linear search, but it should not really be an issue because it
	is limited to 8192 entries. If this does turn into a concern, I
	would replace the list of watches on the device with a sorted
	binary tree, so that the search takes logarithmic rather than
	linear time.
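
As an illustration of the removal path described above (not code from
the patch), the lookup by watch descriptor amounts to walking a list
until the wd matches. The sketch below is a user-space stand-in: the
struct and function names are hypothetical, and a plain singly linked
list replaces the kernel's list_head.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for one entry on a device's watch list. */
struct watch {
	int wd;                 /* watch descriptor handed back to the caller */
	unsigned long mask;     /* events the caller asked for */
	struct watch *next;     /* next watch on the same device */
};

/*
 * Linear search for a watch by descriptor. The cost grows with the
 * number of watches on the device, which the patch caps at 8192.
 * Returns NULL if no watch on the list has the given wd.
 */
struct watch *find_watch(struct watch *head, int wd)
{
	struct watch *w;

	for (w = head; w != NULL; w = w->next) {
		if (w->wd == wd)
			return w;
	}
	return NULL;
}
```

A tree keyed on wd would replace this walk with a logarithmic lookup, at
the cost of more bookkeeping when watches are added and removed.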


The calls to inotify from the VFS code have a complexity of O(1), so
inotify does not affect the speed of VFS operations.

--MEMORY USAGE--

The inotify data structures are lightweight:

inotify watch is 40 bytes
inotify device is 68 bytes
inotify event is 272 bytes

So assuming a device has 8192 watches, the watch structures are only
going to consume 320 KB of memory. With a maximum of 8 devices allowed
to exist at a time, this is still only 2.5 MB.

Each device can also have 256 events queued at a time, which sums to
68 KB per device, and only about 0.5 MB if all devices are opened and
have a full event queue.

So approximately 3 MB of memory is used in the rare case of
everything open and full.

Each inotify watch pins the inode of a directory/file in memory.
The size of an inode differs per file system, but let's assume
that it is 512 bytes.

So assuming the maximum number of global watches (8 devices x 8192
watches) is active, this would pin down 32 MB of inodes in the inode
cache. Again, not a problem on a modern system.

On smaller systems, the maximum watches / events could be lowered
to provide a smaller footprint.
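
The figures in this section can be re-derived with a few lines of
arithmetic. The following is a minimal sketch, not part of the patch:
the limits are the ones the driver defines, the structure sizes are the
ones quoted above (they depend on architecture and configuration), the
512-byte inode is the assumption already stated, and the function names
are made up for illustration.

```c
#include <stddef.h>

/* Limits taken from the patch. */
enum {
	MAX_DEVS          = 8,     /* MAX_INOTIFY_DEVS */
	MAX_DEV_WATCHERS  = 8192,  /* MAX_INOTIFY_DEV_WATCHERS */
	MAX_QUEUED_EVENTS = 256    /* MAX_INOTIFY_QUEUED_EVENTS */
};

/* Sizes quoted in this mail; assumptions, not measured here. */
enum {
	WATCH_BYTES = 40,
	EVENT_BYTES = 272,
	INODE_BYTES = 512          /* assumed, varies by file system */
};

/* Bytes used by the watch structures of one completely full device. */
size_t watch_bytes_per_dev(void)
{
	return (size_t)MAX_DEV_WATCHERS * WATCH_BYTES;
}

/* Bytes used by one completely full event queue. */
size_t event_bytes_per_dev(void)
{
	return (size_t)MAX_QUEUED_EVENTS * EVENT_BYTES;
}

/* Bytes of inodes pinned when every device holds its maximum of watches. */
size_t pinned_inode_bytes(void)
{
	return (size_t)MAX_DEVS * MAX_DEV_WATCHERS * INODE_BYTES;
}
```

With these inputs the three functions return 327680 (320 KB), 69632
(68 KB) and 33554432 (32 MB) respectively; multiplying the first two by
8 devices gives the 2.5 MB and roughly 0.5 MB totals.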

Older release notes:
I am resubmitting inotify for comments and review. Inotify has
changed drastically from the earlier proposal that Al Viro did not
approve of. There is no longer any use of (device number, inode number)
pairs. Please give this version of inotify a fresh view.


Inotify is a character device that, when opened, offers 2 ioctls.
(It actually has 4, but the other 2 are used for debugging.)

INOTIFY_WATCH:
        Takes a path and an event mask and returns an integer (wd
        [watcher descriptor] from here on), unique to this instance
        of the driver, that is a 1:1 mapping to the path passed.
        Inotify gets the inode for the path (taking a reference on
        it) and adds an inotify_watcher structure to the inode's
        list of watchers. If this instance of the driver is already
        watching the path, the event mask is updated and the
        original wd is returned.

INOTIFY_IGNORE:
        Takes an integer (that you got from INOTIFY_WATCH)
        representing a wd that you are no longer interested in
        watching. This will:

        send an IGNORED event to the device
        remove the inotify_watcher structure from the device and
        from the inode, and unref the inode.
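
The semantics of the two ioctls as described above can be modelled in
user space. This is only a sketch under stated assumptions, not the
driver code: the model_* names are hypothetical, an integer stands in
for the pinned inode, a small fixed table replaces the 8192-entry limit,
and no events are generated.

```c
#include <stddef.h>

#define MODEL_MAX_WATCHES 64    /* small stand-in for the 8192 limit */

struct model_watch {
	int in_use;             /* nonzero while this wd is allocated */
	unsigned long inode;    /* stand-in for the pinned struct inode */
	unsigned long mask;     /* event mask requested by the caller */
};

struct model_dev {
	struct model_watch watches[MODEL_MAX_WATCHES];  /* indexed by wd */
};

/*
 * Models INOTIFY_WATCH. If the inode is already watched on this device,
 * only the mask is updated and the existing wd is returned. Otherwise
 * the lowest free wd is allocated. Returns -1 when the device is full.
 */
int model_watch(struct model_dev *dev, unsigned long inode, unsigned long mask)
{
	int wd, free_wd = -1;

	for (wd = 0; wd < MODEL_MAX_WATCHES; wd++) {
		struct model_watch *w = &dev->watches[wd];

		if (w->in_use && w->inode == inode) {
			w->mask = mask;
			return wd;
		}
		if (!w->in_use && free_wd < 0)
			free_wd = wd;   /* remember the lowest free wd */
	}
	if (free_wd < 0)
		return -1;

	dev->watches[free_wd].in_use = 1;
	dev->watches[free_wd].inode = inode;
	dev->watches[free_wd].mask = mask;
	return free_wd;
}

/* Models INOTIFY_IGNORE. Returns 0 on success, -1 if wd is not active. */
int model_ignore(struct model_dev *dev, int wd)
{
	if (wd < 0 || wd >= MODEL_MAX_WATCHES || !dev->watches[wd].in_use)
		return -1;
	dev->watches[wd].in_use = 0;
	return 0;
}
```

Watching the same inode twice yields the same wd with the newer mask,
and a wd released through model_ignore becomes available for reuse.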
        

After you are watching 1 or more paths, you can read from the fd
and get events. The events are struct inotify_event. If you are
watching a directory and something happens to a file in the directory
the event will contain the filename (just the filename not the full
path).
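
Since read() hands out only whole, fixed-size events, a reader can treat
the buffer as an array of records. The sketch below shows one way to
pull the event at a given index out of such a buffer. The struct layout
is an assumption inferred from the fields the driver fills in (wd, mask,
filename); the authoritative definition is struct inotify_event in the
patch's linux/inotify.h, and the names model_event and event_at are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout; check struct inotify_event in linux/inotify.h. */
struct model_event {
	int  wd;                /* watch descriptor the event belongs to */
	int  mask;              /* which event happened */
	char filename[256];     /* name within the watched directory, or "" */
};

/*
 * Copy the event at position 'index' out of a buffer of 'len' bytes
 * that was filled by read(). Returns 0 on success, or -1 if the buffer
 * does not hold a complete event at that index.
 */
int event_at(const char *buf, size_t len, size_t index, struct model_event *out)
{
	if (index >= len / sizeof(*out))
		return -1;

	/* memcpy avoids assuming that buf is suitably aligned. */
	memcpy(out, buf + index * sizeof(*out), sizeof(*out));

	/* Defensive: make sure the name is always terminated. */
	out->filename[sizeof(out->filename) - 1] = '\0';
	return 0;
}
```

A caller would loop with index 0, 1, 2, ... until event_at returns -1,
using the byte count returned by read() as len.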

Aside from the inotify character device driver, the changes to the
kernel are very minor.

The first change is adding calls to inotify_inode_queue_event and
inotify_dentry_parent_queue_event from the various VFS functions. This
is identical to what dnotify does.

The second change is more serious: it adds a call to
inotify_super_block_umount inside generic_shutdown_superblock. What
inotify_super_block_umount does is:

find all of the inodes that are on the super block being shut down,
send each watcher on each inode the UNMOUNT and IGNORED events,
remove the watcher structures from each instance of the device driver
and from each inode,
unref the inode.

I have tested this code on my system for over three weeks now and have
not had problems. I would appreciate design review, code review and
testing.

John


[-- Attachment #2: inotify-0.9.patch --]
[-- Type: text/x-patch, Size: 36562 bytes --]

diff -urN linux-2.6.8.1/drivers/char/inotify.c ../linux/drivers/char/inotify.c
--- linux-2.6.8.1/drivers/char/inotify.c	1969-12-31 19:00:00.000000000 -0500
+++ ../linux/drivers/char/inotify.c	2004-09-15 11:22:59.000000000 -0400
@@ -0,0 +1,975 @@
+/*
+ * Inode based directory notifications for Linux.
+ *
+ * Copyright (C) 2004 John McCutchan
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Signed-off-by: John McCutchan ttb@tentacle.dhs.org
+ */
+
+/* TODO: 
+ * use rb tree so looking up watcher by watcher descriptor is faster.
+ */
+
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/miscdevice.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/writeback.h>
+#include <linux/inotify.h>
+
+#define MAX_INOTIFY_DEVS 8 /* We only support X open devices */
+#define MAX_INOTIFY_DEV_WATCHERS 8192 /* A dev can only have Y watchers */
+#define MAX_INOTIFY_QUEUED_EVENTS 256 /* Only the first Z events will be queued */
+#define __BITMASK_SIZE (MAX_INOTIFY_DEV_WATCHERS / 8)
+
+#define INOTIFY_DEV_TIMER_TIME jiffies + (HZ/4)
+
+static atomic_t watcher_count; // number of open devices, <= MAX_INOTIFY_DEVS
+
+static kmem_cache_t *watcher_cache;
+static kmem_cache_t *kevent_cache;
+
+/* For debugging */
+static int event_object_count;
+static int watcher_object_count;
+static int inode_ref_count;
+static int inotify_debug_flags;
+#define iprintk(f, str...) if (inotify_debug_flags & f) printk (KERN_ALERT str)
+
+/* For each inotify device we need to keep a list of events queued on it,
+ * a list of inodes that we are watching and other stuff.
+ */
+struct inotify_device {
+	struct list_head 	events;
+	atomic_t		event_count;
+	struct list_head 	watchers;
+	int			watcher_count;
+	wait_queue_head_t 	wait;
+	struct timer_list	timer;
+	char			read_state;
+	spinlock_t		lock;
+	void *			bitmask;
+};
+#define inotify_device_event_list(pos) list_entry((pos), struct inotify_event, list)
+
+struct inotify_watcher {
+	int 			wd; // watcher descriptor
+	unsigned long		mask;
+	struct inode *		inode;
+	struct inotify_device *	dev;
+	struct list_head	d_list; // device list
+	struct list_head	i_list; // inode list
+	struct list_head	u_list; // unmount list 
+};
+#define inotify_watcher_d_list(pos) list_entry((pos), struct inotify_watcher, d_list)
+#define inotify_watcher_i_list(pos) list_entry((pos), struct inotify_watcher, i_list)
+#define inotify_watcher_u_list(pos) list_entry((pos), struct inotify_watcher, u_list)
+
+/* A list of these is attached to each instance of the driver
+ * when the drivers read() gets called, this list is walked and
+ * all events that can fit in the buffer get delivered
+ */
+struct inotify_kernel_event {
+        struct list_head        list;
+	struct inotify_event	event;
+};
+#define list_to_inotify_kernel_event(pos) list_entry((pos), struct inotify_kernel_event, list)
+
+static int find_inode (const char __user *dirname, struct inode **inode)
+{
+	struct nameidata nd;
+	int error;
+
+	error = __user_walk (dirname, LOOKUP_FOLLOW, &nd);
+	if (error) {
+		iprintk(INOTIFY_DEBUG_INODE, "could not find inode\n");
+		goto out;
+	}
+
+	*inode = nd.dentry->d_inode;
+	__iget (*inode);
+	iprintk(INOTIFY_DEBUG_INODE, "ref'd inode\n");
+	inode_ref_count++;
+	path_release(&nd);
+out:
+	return error;
+}
+
+static void unref_inode (struct inode *inode) {
+	inode_ref_count--;
+	iprintk(INOTIFY_DEBUG_INODE, "unref'd inode\n");
+	iput (inode);
+}
+
+struct inotify_kernel_event *kernel_event (int wd, int mask, const char *filename) {
+	struct inotify_kernel_event *kevent;
+
+	kevent = kmem_cache_alloc (kevent_cache, GFP_ATOMIC);
+
+
+	if (!kevent) {
+		iprintk(INOTIFY_DEBUG_ALLOC, "failed to alloc kevent (%d,%d)\n", wd, mask);
+		goto out;
+	}
+
+	iprintk(INOTIFY_DEBUG_ALLOC, "alloced kevent %p (%d,%d)\n", kevent, wd, mask);
+
+	kevent->event.wd = wd;
+	kevent->event.mask = mask;
+	INIT_LIST_HEAD(&kevent->list);
+
+	if (filename) {
+		iprintk(INOTIFY_DEBUG_FILEN, "filename for event was %p %s\n", filename, filename);
+		strncpy (kevent->event.filename, filename, 256);
+		kevent->event.filename[255] = '\0';
+		iprintk(INOTIFY_DEBUG_FILEN, "filename after copying %s\n", kevent->event.filename);
+	} else {
+		iprintk(INOTIFY_DEBUG_FILEN, "no filename for event\n");
+		kevent->event.filename[0] = '\0';
+	}
+	
+	event_object_count++;
+
+out:
+	return kevent;
+}
+
+void delete_kernel_event (struct inotify_kernel_event *kevent) {
+	if (!kevent) return;
+
+	event_object_count--;
+
+	INIT_LIST_HEAD(&kevent->list);
+	kevent->event.wd = -1;
+	kevent->event.mask = 0;
+
+	iprintk(INOTIFY_DEBUG_ALLOC, "free'd kevent %p\n", kevent);
+	kmem_cache_free (kevent_cache, kevent);
+}
+
+#define inotify_dev_has_events(dev) (!list_empty(&dev->events))
+#define inotify_dev_get_event(dev) (list_to_inotify_kernel_event(dev->events.next))
+/* Does this events mask get sent to the watcher ? */
+#define event_and(event_mask,watchers_mask) 	((event_mask == IN_UNMOUNT) || \
+						(event_mask == IN_IGNORED) || \
+						(event_mask & watchers_mask))
+
+
+/* dev->lock == locked before calling */
+static void inotify_dev_queue_event (struct inotify_device *dev, struct inotify_watcher *watcher, int mask, const char *filename) {
+	struct inotify_kernel_event *kevent;
+
+	if (atomic_read(&dev->event_count) == MAX_INOTIFY_QUEUED_EVENTS) {
+		iprintk(INOTIFY_DEBUG_EVENTS, "event queue for %p overflowed\n", dev);
+		return;
+	}
+
+	if (!event_and(mask, watcher->inode->watchers_mask)||!event_and(mask, watcher->mask)) {
+		return;
+	}
+
+	kevent = kernel_event (watcher->wd, mask, filename);
+
+	if (!kevent) {
+		iprintk(INOTIFY_DEBUG_EVENTS, "failed to queue event %x for %p\n", mask, dev);
+		return; /* allocation failed, nothing to queue */
+	}
+
+	atomic_inc(&dev->event_count);
+	list_add_tail(&kevent->list, &dev->events);
+
+	iprintk(INOTIFY_DEBUG_EVENTS, "queued event %x for %p\n", mask, dev);
+}
+
+
+
+
+static void inotify_dev_event_dequeue (struct inotify_device *dev) {
+	struct inotify_kernel_event *kevent;
+
+	if (!inotify_dev_has_events (dev)) {
+		return;
+	}
+
+	kevent = inotify_dev_get_event(dev);
+
+	list_del(&kevent->list);
+	atomic_dec(&dev->event_count);
+
+	delete_kernel_event (kevent);
+
+	iprintk(INOTIFY_DEBUG_EVENTS, "dequeued event on %p\n", dev);
+}
+
+static int inotify_dev_get_wd (struct inotify_device *dev)
+{
+	int wd;
+
+	wd = -1;
+
+	if (!dev)
+		return -1;
+
+	if (dev->watcher_count == MAX_INOTIFY_DEV_WATCHERS) {
+		return -1;
+	}
+
+	dev->watcher_count++;
+
+	wd = find_first_zero_bit (dev->bitmask, MAX_INOTIFY_DEV_WATCHERS); /* size is in bits */
+
+	set_bit (wd, dev->bitmask);
+
+	return wd;
+}
+
+static int inotify_dev_put_wd (struct inotify_device *dev, int wd)
+{
+	if (!dev||wd < 0)
+		return -1;
+
+	dev->watcher_count--;
+
+	clear_bit (wd, dev->bitmask);
+
+	return 0;
+}
+
+
+static struct inotify_watcher *create_watcher (struct inotify_device *dev, int mask, struct inode *inode) {
+	struct inotify_watcher *watcher;
+
+	watcher = kmem_cache_alloc (watcher_cache, GFP_KERNEL);
+
+	if (!watcher) {
+		iprintk(INOTIFY_DEBUG_ALLOC, "failed to allocate watcher (%p,%d)\n", inode, mask);
+		return NULL;
+	}
+
+	watcher->wd = -1;
+	watcher->mask = mask;
+	watcher->inode = inode;
+	watcher->dev = dev;
+	INIT_LIST_HEAD(&watcher->d_list);
+	INIT_LIST_HEAD(&watcher->i_list);
+	INIT_LIST_HEAD(&watcher->u_list);
+
+	spin_lock(&dev->lock);
+		watcher->wd = inotify_dev_get_wd (dev);
+	spin_unlock(&dev->lock);
+
+	if (watcher->wd < 0) {
+		iprintk(INOTIFY_DEBUG_ERRORS, "Could not get wd for watcher %p\n", watcher);
+		iprintk(INOTIFY_DEBUG_ALLOC, "free'd watcher %p\n", watcher);
+		kmem_cache_free (watcher_cache, watcher);
+		watcher = NULL;
+		return watcher;
+	}
+
+	watcher_object_count++;
+	return watcher;
+}
+
+/* Must be called with dev->lock held */
+static void delete_watcher (struct inotify_device *dev, struct inotify_watcher *watcher) {
+	inotify_dev_put_wd (dev, watcher->wd);
+
+	iprintk(INOTIFY_DEBUG_ALLOC, "free'd watcher %p\n", watcher);
+
+	kmem_cache_free (watcher_cache, watcher);
+
+	watcher_object_count--;
+}
+
+
+static struct inotify_watcher *inode_find_dev (struct inode *inode, struct inotify_device *dev) {
+	struct inotify_watcher *watcher;
+
+	watcher = NULL;
+
+	list_for_each_entry (watcher, &inode->watchers, i_list) {
+		if (watcher->dev == dev) {
+			return watcher;
+		}
+
+	}
+	return NULL;
+}
+
+static struct inotify_watcher *dev_find_wd (struct inotify_device *dev, int wd)
+{
+	struct inotify_watcher *watcher;
+
+	list_for_each_entry (watcher, &dev->watchers, d_list) {
+		if (watcher->wd == wd) {
+			return watcher;
+		}
+	}
+	return NULL;
+}
+
+static int inotify_dev_is_watching_inode (struct inotify_device *dev, struct inode *inode) {
+	struct inotify_watcher *watcher;
+
+	list_for_each_entry (watcher, &dev->watchers, d_list) {
+		if (watcher->inode == inode) {
+			return 1;
+		}
+	}
+	
+	return 0;
+}
+
+static int inotify_dev_add_watcher (struct inotify_device *dev, struct inotify_watcher *watcher) {
+	int error;
+
+	error = 0;
+
+	if (!dev||!watcher) {
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (dev_find_wd (dev, watcher->wd)) {
+		error = -EINVAL;
+		goto out;
+	}
+
+
+	if (dev->watcher_count == MAX_INOTIFY_DEV_WATCHERS) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+	dev->watcher_count++;
+	list_add(&watcher->d_list, &dev->watchers);
+out:
+	return error;
+}
+
+static int inotify_dev_rm_watcher (struct inotify_device *dev, struct inotify_watcher *watcher) {
+	int error;
+
+	error = -EINVAL;
+
+	if (watcher) {
+		inotify_dev_queue_event (dev, watcher, IN_IGNORED, NULL);
+
+		list_del(&watcher->d_list);
+
+		dev->watcher_count--;
+
+		error = 0;
+	} 
+
+	return error;
+}
+
+void inode_update_watchers_mask (struct inode *inode)
+{
+	struct inotify_watcher *watcher;
+	unsigned long new_mask;
+
+	new_mask = 0;
+	list_for_each_entry(watcher, &inode->watchers, i_list) {
+		new_mask |= watcher->mask;
+	}
+	inode->watchers_mask = new_mask;
+}
+
+static int inode_add_watcher (struct inode *inode, struct inotify_watcher *watcher) {
+	if (!inode||!watcher||inode_find_dev (inode, watcher->dev))
+		return -EINVAL;
+
+	list_add(&watcher->i_list, &inode->watchers);
+	inode->watcher_count++;
+
+	inode_update_watchers_mask (inode);
+
+
+	return 0;
+}
+
+static int inode_rm_watcher (struct inode *inode, struct inotify_watcher *watcher) {
+	if (!inode||!watcher)
+		return -EINVAL;
+
+	list_del(&watcher->i_list);
+	inode->watcher_count--;
+
+	inode_update_watchers_mask (inode);
+
+	return 0;
+}
+
+/* Kernel API */
+
+void inotify_inode_queue_event (struct inode *inode, unsigned long mask, const char *filename) {
+	struct inotify_watcher *watcher;
+
+	spin_lock(&inode->i_lock);
+
+		list_for_each_entry (watcher, &inode->watchers, i_list) {
+			spin_lock(&watcher->dev->lock);
+				inotify_dev_queue_event (watcher->dev, watcher, mask, filename);
+			spin_unlock(&watcher->dev->lock);
+		}
+
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL_GPL(inotify_inode_queue_event);
+
+void inotify_dentry_parent_queue_event(struct dentry *dentry, unsigned long mask, const char *filename)
+{
+	struct dentry *parent;
+
+	spin_lock(&dentry->d_lock);
+	dget (dentry->d_parent);
+	parent = dentry->d_parent;
+	inotify_inode_queue_event(parent->d_inode, mask, filename);
+	dput (parent);
+	spin_unlock(&dentry->d_lock);
+}
+EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);
+
+static void ignore_helper (struct inotify_watcher *watcher, int event) {
+	struct inotify_device *dev;
+	struct inode *inode;
+
+	spin_lock(&watcher->dev->lock);
+	spin_lock(&watcher->inode->i_lock);
+
+	inode = watcher->inode;
+	dev = watcher->dev;
+
+	if (event)
+		inotify_dev_queue_event (dev, watcher, event, NULL);
+
+	inode_rm_watcher (inode, watcher);
+	inotify_dev_rm_watcher (watcher->dev, watcher);
+	list_del(&watcher->u_list);
+
+
+	spin_unlock(&inode->i_lock);
+
+	delete_watcher(dev, watcher);
+
+	spin_unlock(&dev->lock);
+
+	unref_inode (inode);
+}
+
+static void process_umount_list (struct list_head *umount) {
+	struct inotify_watcher *watcher, *next;
+
+	list_for_each_entry_safe (watcher, next, umount, u_list) {
+		ignore_helper (watcher, IN_UNMOUNT);
+	}
+}
+
+static void build_umount_list (struct list_head *head, struct super_block *sb, struct list_head *umount) {
+	struct inode *	inode;
+
+	list_for_each_entry (inode, head, i_list) {
+		struct inotify_watcher *watcher;
+
+		if (inode->i_sb != sb)
+			continue;
+
+		spin_lock(&inode->i_lock);
+
+		list_for_each_entry (watcher, &inode->watchers, i_list) {
+			list_add (&watcher->u_list, umount);
+		}
+
+		spin_unlock(&inode->i_lock);
+	}
+}
+
+void inotify_super_block_umount (struct super_block *sb)
+{
+	struct list_head umount;
+
+	INIT_LIST_HEAD(&umount);
+
+	spin_lock(&inode_lock);
+		build_umount_list (&inode_in_use, sb, &umount);
+	spin_unlock(&inode_lock);
+
+	process_umount_list (&umount);
+}
+EXPORT_SYMBOL_GPL(inotify_super_block_umount);
+
+/* The driver interface is implemented below */
+
+static unsigned int inotify_poll(struct file *file, poll_table *wait) {
+        struct inotify_device *dev;
+
+        dev = file->private_data;
+
+
+        poll_wait(file, &dev->wait, wait);
+
+        if (inotify_dev_has_events(dev)) {
+                return POLLIN | POLLRDNORM;
+	}
+
+        return 0;
+}
+
+#define MAX_EVENTS_AT_ONCE 20
+static ssize_t inotify_read(struct file *file, __user char *buf,
+			   size_t count, loff_t *pos) {
+	size_t out;
+	struct inotify_event eventbuf[MAX_EVENTS_AT_ONCE];
+	struct inotify_kernel_event *kevent;
+	struct inotify_device *dev;
+	char *obuf;
+	int err;
+	DECLARE_WAITQUEUE(wait, current);
+
+	int events;
+	int event_count;
+
+	events = 0;
+	event_count = 0;
+	out = 0;
+	err = 0;
+
+	obuf = buf;
+
+	dev = file->private_data;
+
+	/* We only hand out full inotify events */
+	if (count < sizeof(struct inotify_event)) {
+		out = -EINVAL;
+		goto out;
+	}
+
+	events = count / sizeof(struct inotify_event);
+
+	if (events > MAX_EVENTS_AT_ONCE) events = MAX_EVENTS_AT_ONCE;
+
+	if (!inotify_dev_has_events(dev)) {
+		if (file->f_flags & O_NONBLOCK) {
+			out = -EAGAIN;
+			goto out;
+		}
+	}
+
+	spin_lock_irq(&dev->lock);
+
+	add_wait_queue(&dev->wait, &wait);
+repeat:
+	if (signal_pending(current)) {
+		spin_unlock_irq (&dev->lock);
+		out = -ERESTARTSYS;
+		set_current_state (TASK_RUNNING);
+		remove_wait_queue(&dev->wait, &wait);
+		goto out;
+	}
+	set_current_state(TASK_INTERRUPTIBLE);
+	if (!inotify_dev_has_events (dev)) {
+		spin_unlock_irq (&dev->lock);
+		schedule ();
+		spin_lock_irq (&dev->lock);
+		goto repeat;
+	}
+
+	set_current_state (TASK_RUNNING);
+	remove_wait_queue (&dev->wait, &wait);
+
+	err = !access_ok(VERIFY_WRITE, (void *)buf, sizeof(struct inotify_event));
+	if (err) {
+		spin_unlock_irq (&dev->lock); /* don't return with the lock held */
+		out = -EFAULT;
+		goto out;
+	}
+
+	/* Copy all the events we can to the event buffer */
+	for (event_count = 0; event_count < events && inotify_dev_has_events (dev); event_count++) {
+		kevent = inotify_dev_get_event (dev);
+		eventbuf[event_count] = kevent->event;
+		inotify_dev_event_dequeue (dev);
+	}
+
+	spin_unlock_irq (&dev->lock);
+
+	/* Send only the events actually dequeued to user space */
+	err = copy_to_user (buf, &eventbuf[0], event_count * sizeof(struct inotify_event));
+
+	buf += sizeof(struct inotify_event) * event_count;
+
+	out = buf - obuf;
+
+out:
+	return out;
+}
+
+static void inotify_dev_timer (unsigned long data) {
+	struct inotify_device *dev = (struct inotify_device *)data;
+
+	if (!data) return;
+
+	// reset the timer
+	mod_timer(&dev->timer, INOTIFY_DEV_TIMER_TIME);
+
+	// wake up anything waiting on poll
+	if (inotify_dev_has_events (dev)) {
+		wake_up_interruptible(&dev->wait);
+	}
+}
+
+static int inotify_open(struct inode *inode, struct file *file) {
+	struct inotify_device *dev;
+
+	if (atomic_read(&watcher_count) == MAX_INOTIFY_DEVS)
+		return -ENODEV;
+
+	atomic_inc(&watcher_count);
+
+	dev = kmalloc(sizeof(struct inotify_device), GFP_KERNEL);
+	dev->bitmask = kmalloc(__BITMASK_SIZE, GFP_KERNEL);
+	memset(dev->bitmask, 0, __BITMASK_SIZE);
+
+	INIT_LIST_HEAD(&dev->events);
+	INIT_LIST_HEAD(&dev->watchers);
+	init_timer(&dev->timer);
+	init_waitqueue_head(&dev->wait);
+
+	atomic_set(&dev->event_count, 0);
+	dev->watcher_count = 0;
+	dev->lock = SPIN_LOCK_UNLOCKED;
+	dev->read_state = 0;
+
+	file->private_data = dev;
+
+	dev->timer.data = (unsigned long) dev;
+	dev->timer.function = inotify_dev_timer;
+	dev->timer.expires = INOTIFY_DEV_TIMER_TIME;
+
+	add_timer(&dev->timer);
+
+	printk(KERN_ALERT "inotify device opened\n");
+
+	return 0;
+}
+
+static void inotify_release_all_watchers (struct inotify_device *dev)
+{
+	struct inotify_watcher *watcher,*next;
+
+	list_for_each_entry_safe(watcher, next, &dev->watchers, d_list) {
+		ignore_helper (watcher, 0);
+	}
+}
+
+static void inotify_release_all_events (struct inotify_device *dev)
+{
+	spin_lock(&dev->lock);
+	while (inotify_dev_has_events(dev)) {
+		inotify_dev_event_dequeue(dev);
+	}
+	spin_unlock(&dev->lock);
+}
+
+
+static int inotify_release(struct inode *inode, struct file *file)
+{
+	if (file->private_data) {
+		struct inotify_device *dev;
+
+		dev = (struct inotify_device *)file->private_data;
+
+		del_timer(&dev->timer);
+
+		inotify_release_all_watchers(dev);
+
+		inotify_release_all_events(dev);
+
+		kfree (dev->bitmask);
+		kfree (dev);
+
+	}
+
+	printk(KERN_ALERT "inotify device released\n");
+
+	atomic_dec(&watcher_count);
+	return 0;
+}
+
+static int inotify_watch (struct inotify_device *dev, struct inotify_watch_request *request)
+{
+	int err;
+	struct inode *inode;
+	struct inotify_watcher *watcher;
+	err = 0;
+
+	err = find_inode (request->dirname, &inode);
+
+	if (err)
+		goto exit;
+
+	if (!S_ISDIR(inode->i_mode)) {
+		iprintk(INOTIFY_DEBUG_ERRORS, "watching file\n");
+	}
+
+	spin_lock(&dev->lock);
+	spin_lock(&inode->i_lock);
+
+	/* This handles the case of re-adding a directory we are already
+	 * watching, we just update the mask and return 0
+	 */
+	if (inotify_dev_is_watching_inode (dev, inode)) {
+		iprintk(INOTIFY_DEBUG_ERRORS, "adjusting event mask for inode %p\n", inode);
+		struct inotify_watcher *owatcher; // the old watcher
+
+		owatcher = inode_find_dev (inode, dev);
+
+		owatcher->mask = request->mask;
+
+		inode_update_watchers_mask (inode);
+
+		spin_unlock (&inode->i_lock);
+		spin_unlock (&dev->lock);
+
+		unref_inode (inode);
+
+		return 0;
+	}
+
+	spin_unlock (&inode->i_lock);
+	spin_unlock (&dev->lock);
+
+
+	watcher = create_watcher (dev, request->mask, inode);
+
+	if (!watcher) {
+		unref_inode (inode);
+		return -ENOSPC;
+	}
+
+	spin_lock(&dev->lock);
+	spin_lock(&inode->i_lock);
+
+	/* We can't add anymore watchers to this device */
+	if (inotify_dev_add_watcher (dev, watcher) == -ENOSPC) {
+		iprintk(INOTIFY_DEBUG_ERRORS, "can't add watcher dev is full\n");
+		spin_unlock (&inode->i_lock);
+		delete_watcher (dev, watcher);
+		spin_unlock (&dev->lock);
+
+		unref_inode (inode);
+		return -ENOSPC;
+	}
+
+	inode_add_watcher (inode, watcher);
+
+	/* We keep a reference on the inode */
+	if (!err) {
+		err = watcher->wd;
+	}
+
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&dev->lock);
+exit:
+	return err;
+}
+
+static int inotify_ignore(struct inotify_device *dev, int wd)
+{
+	struct inotify_watcher *watcher;
+
+	watcher = dev_find_wd (dev, wd);
+
+	if (!watcher) {
+		return -EINVAL;
+	}
+
+	ignore_helper (watcher, 0);
+
+	return 0;
+}
+
+static void inotify_print_stats (struct inotify_device *dev)
+{
+	int sizeof_inotify_watcher;
+	int sizeof_inotify_device;
+	int sizeof_inotify_kernel_event;
+
+	sizeof_inotify_watcher = sizeof (struct inotify_watcher);
+	sizeof_inotify_device = sizeof (struct inotify_device);
+	sizeof_inotify_kernel_event = sizeof (struct inotify_kernel_event);
+
+	printk (KERN_ALERT "GLOBAL INOTIFY STATS\n");
+	printk (KERN_ALERT "watcher_count = %d\n", atomic_read(&watcher_count));
+	printk (KERN_ALERT "event_object_count = %d\n", event_object_count);
+	printk (KERN_ALERT "watcher_object_count = %d\n", watcher_object_count);
+	printk (KERN_ALERT "inode_ref_count = %d\n", inode_ref_count);
+
+	printk (KERN_ALERT "sizeof(struct inotify_watcher) = %d\n", sizeof_inotify_watcher);
+	printk (KERN_ALERT "sizeof(struct inotify_device) = %d\n", sizeof_inotify_device);
+	printk (KERN_ALERT "sizeof(struct inotify_kernel_event) = %d\n", sizeof_inotify_kernel_event);
+
+	spin_lock(&dev->lock);
+
+	printk (KERN_ALERT "inotify device: %p\n", dev);
+	printk (KERN_ALERT "inotify event_count: %d\n", atomic_read(&dev->event_count));
+	printk (KERN_ALERT "inotify watch_count: %d\n", dev->watcher_count);
+
+	spin_unlock(&dev->lock);
+}
+
+static int inotify_ioctl(struct inode *ip, struct file *fp,
+			 unsigned int cmd, unsigned long arg) {
+	int err;
+	struct inotify_device *dev;
+	struct inotify_watch_request request;
+	int wid;
+
+	dev = fp->private_data;
+
+	err = 0;
+
+	if (_IOC_TYPE(cmd) != INOTIFY_IOCTL_MAGIC) return -EINVAL;
+	if (_IOC_NR(cmd) > INOTIFY_IOCTL_MAXNR) return -EINVAL;
+
+	if (_IOC_DIR(cmd) & _IOC_READ)
+		err = !access_ok(VERIFY_READ, (void *)arg, _IOC_SIZE(cmd));
+
+	if (err) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	if (_IOC_DIR(cmd) & _IOC_WRITE)
+		err = !access_ok(VERIFY_WRITE, (void *)arg, _IOC_SIZE(cmd));
+
+	if (err) {
+		err = -EFAULT;
+		goto out;
+	}
+
+
+	err = -EINVAL;
+
+	switch (cmd) {
+		case INOTIFY_WATCH:
+			iprintk(INOTIFY_DEBUG_ERRORS, "INOTIFY_WATCH ioctl\n");
+			if (copy_from_user(&request, (void *)arg, sizeof(struct inotify_watch_request))) {
+				err = -EFAULT;
+				goto out;
+			}
+
+			err = inotify_watch(dev, &request);
+		break;
+		case INOTIFY_IGNORE:
+			iprintk(INOTIFY_DEBUG_ERRORS, "INOTIFY_IGNORE ioctl\n");
+			if (copy_from_user(&wid, (void *)arg, sizeof(int))) {
+				err = -EFAULT;
+				goto out;
+			}
+
+			err = inotify_ignore(dev, wid);
+		break;
+		case INOTIFY_STATS:
+			iprintk(INOTIFY_DEBUG_ERRORS, "INOTIFY_STATS ioctl\n");
+			inotify_print_stats (dev);
+			err = 0;
+		break;
+		case INOTIFY_SETDEBUG:
+			iprintk(INOTIFY_DEBUG_ERRORS, "INOTIFY_SETDEBUG ioctl\n");
+			if (copy_from_user(&inotify_debug_flags, (void *)arg, sizeof(int))) {
+				err = -EFAULT;
+				goto out;
+			}
+		break;
+	}
+
+out:
+	return err;
+}
+
+static struct file_operations inotify_fops = {
+	.owner		= THIS_MODULE,
+	.poll		= inotify_poll,
+	.read		= inotify_read,
+	.open		= inotify_open,
+	.release	= inotify_release,
+	.ioctl		= inotify_ioctl,
+};
+
+struct miscdevice inotify_device = {
+	.minor  = MISC_DYNAMIC_MINOR, // automatic
+	.name	= "inotify",
+	.fops	= &inotify_fops,
+};
+
+
+static int __init inotify_init (void)
+{
+	int ret;
+
+	ret = misc_register(&inotify_device);
+
+	if (ret) {
+		goto out;
+	}
+
+	inotify_debug_flags = INOTIFY_DEBUG_NONE;
+
+	watcher_cache = kmem_cache_create ("watcher_cache", 
+			sizeof(struct inotify_watcher), 0, SLAB_PANIC, NULL, NULL);
+
+	if (!watcher_cache) {
+		misc_deregister (&inotify_device);
+	}
+	kevent_cache = kmem_cache_create ("kevent_cache", 
+			sizeof(struct inotify_kernel_event), 0, SLAB_PANIC, NULL, NULL);
+
+	if (!kevent_cache) {
+		misc_deregister (&inotify_device);
+		kmem_cache_destroy (watcher_cache);
+	}
+
+	printk(KERN_ALERT "inotify 0.9 minor=%d\n", inotify_device.minor);
+out:
+	return ret;
+}
+
+static void inotify_exit (void)
+{
+	kmem_cache_destroy (kevent_cache);
+	kmem_cache_destroy (watcher_cache);
+	misc_deregister (&inotify_device);
+	printk(KERN_ALERT "inotify shutdown ec=%d wc=%d ic=%d\n", event_object_count, watcher_object_count, inode_ref_count);
+}
+
+MODULE_AUTHOR("John McCutchan <ttb@tentacle.dhs.org>");
+MODULE_DESCRIPTION("Inode event driver");
+MODULE_LICENSE("GPL");
+
+module_init (inotify_init);
+module_exit (inotify_exit);
diff -urN linux-2.6.8.1/drivers/char/Makefile ../linux/drivers/char/Makefile
--- linux-2.6.8.1/drivers/char/Makefile	2004-08-14 06:56:22.000000000 -0400
+++ ../linux/drivers/char/Makefile	2004-08-19 00:11:52.000000000 -0400
@@ -7,7 +7,7 @@
 #
 FONTMAPFILE = cp437.uni
 
-obj-y	 += mem.o random.o tty_io.o n_tty.o tty_ioctl.o pty.o misc.o
+obj-y	 += mem.o random.o tty_io.o n_tty.o tty_ioctl.o pty.o misc.o inotify.o
 
 obj-$(CONFIG_VT)		+= vt_ioctl.o vc_screen.o consolemap.o \
 				   consolemap_deftbl.o selection.o keyboard.o
diff -urN linux-2.6.8.1/fs/attr.c ../linux/fs/attr.c
--- linux-2.6.8.1/fs/attr.c	2004-08-14 06:54:50.000000000 -0400
+++ ../linux/fs/attr.c	2004-08-19 00:11:52.000000000 -0400
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 #include <linux/smp_lock.h>
 #include <linux/dnotify.h>
+#include <linux/inotify.h>
 #include <linux/fcntl.h>
 #include <linux/quotaops.h>
 #include <linux/security.h>
@@ -185,8 +186,11 @@
 	}
 	if (!error) {
 		unsigned long dn_mask = setattr_mask(ia_valid);
-		if (dn_mask)
+		if (dn_mask) {
 			dnotify_parent(dentry, dn_mask);
+			inotify_inode_queue_event (dentry->d_inode, dn_mask, NULL);
+			inotify_dentry_parent_queue_event (dentry, dn_mask, dentry->d_name.name);
+		}
 	}
 	return error;
 }
diff -urN linux-2.6.8.1/fs/inode.c ../linux/fs/inode.c
--- linux-2.6.8.1/fs/inode.c	2004-08-14 06:56:23.000000000 -0400
+++ ../linux/fs/inode.c	2004-08-19 00:11:52.000000000 -0400
@@ -114,6 +114,7 @@
 	if (inode) {
 		struct address_space * const mapping = &inode->i_data;
 
+		INIT_LIST_HEAD (&inode->watchers);
 		inode->i_sb = sb;
 		inode->i_blkbits = sb->s_blocksize_bits;
 		inode->i_flags = 0;
diff -urN linux-2.6.8.1/fs/namei.c ../linux/fs/namei.c
--- linux-2.6.8.1/fs/namei.c	2004-08-14 06:55:10.000000000 -0400
+++ ../linux/fs/namei.c	2004-08-19 00:11:52.000000000 -0400
@@ -22,6 +22,7 @@
 #include <linux/quotaops.h>
 #include <linux/pagemap.h>
 #include <linux/dnotify.h>
+#include <linux/inotify.h>
 #include <linux/smp_lock.h>
 #include <linux/personality.h>
 #include <linux/security.h>
@@ -1221,6 +1222,7 @@
 	error = dir->i_op->create(dir, dentry, mode, nd);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);
+		inotify_inode_queue_event(dir, IN_CREATE, dentry->d_name.name);
 		security_inode_post_create(dir, dentry, mode);
 	}
 	return error;
@@ -1535,6 +1537,7 @@
 	error = dir->i_op->mknod(dir, dentry, mode, dev);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);
+		inotify_inode_queue_event(dir, IN_CREATE, dentry->d_name.name);
 		security_inode_post_mknod(dir, dentry, mode, dev);
 	}
 	return error;
@@ -1608,6 +1611,7 @@
 	error = dir->i_op->mkdir(dir, dentry, mode);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);
+		inotify_inode_queue_event(dir, IN_CREATE, dentry->d_name.name);
 		security_inode_post_mkdir(dir,dentry, mode);
 	}
 	return error;
@@ -1703,6 +1707,8 @@
 	up(&dentry->d_inode->i_sem);
 	if (!error) {
 		inode_dir_notify(dir, DN_DELETE);
+		inotify_inode_queue_event(dir, IN_DELETE, dentry->d_name.name);
+		inotify_inode_queue_event(dentry->d_inode, IN_DELETE, NULL);
 		d_delete(dentry);
 	}
 	dput(dentry);
@@ -1775,8 +1781,9 @@
 
 	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
 	if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
-		d_delete(dentry);
 		inode_dir_notify(dir, DN_DELETE);
+		inotify_inode_queue_event(dir, IN_DELETE, dentry->d_name.name);
+		d_delete(dentry);
 	}
 	return error;
 }
@@ -1853,6 +1860,7 @@
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);
+		inotify_inode_queue_event(dir, IN_CREATE, dentry->d_name.name);
 		security_inode_post_symlink(dir, dentry, oldname);
 	}
 	return error;
@@ -1926,6 +1934,7 @@
 	up(&old_dentry->d_inode->i_sem);
 	if (!error) {
 		inode_dir_notify(dir, DN_CREATE);
+		inotify_inode_queue_event(dir, IN_CREATE, new_dentry->d_name.name);
 		security_inode_post_link(old_dentry, dir, new_dentry);
 	}
 	return error;
@@ -2115,12 +2124,15 @@
 	else
 		error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);
 	if (!error) {
-		if (old_dir == new_dir)
+		if (old_dir == new_dir) {
 			inode_dir_notify(old_dir, DN_RENAME);
-		else {
+		} else {
 			inode_dir_notify(old_dir, DN_DELETE);
 			inode_dir_notify(new_dir, DN_CREATE);
 		}
+
+		inotify_inode_queue_event(old_dir, IN_DELETE, old_dentry->d_name.name);
+		inotify_inode_queue_event(new_dir, IN_CREATE, new_dentry->d_name.name);
 	}
 	return error;
 }
diff -urN linux-2.6.8.1/fs/open.c ../linux/fs/open.c
--- linux-2.6.8.1/fs/open.c	2004-08-14 06:54:48.000000000 -0400
+++ ../linux/fs/open.c	2004-08-19 00:11:52.000000000 -0400
@@ -11,6 +11,7 @@
 #include <linux/smp_lock.h>
 #include <linux/quotaops.h>
 #include <linux/dnotify.h>
+#include <linux/inotify.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/tty.h>
@@ -955,6 +956,8 @@
 			error = PTR_ERR(f);
 			if (IS_ERR(f))
 				goto out_error;
+			inotify_inode_queue_event (f->f_dentry->d_inode, IN_OPEN, NULL);
+			inotify_dentry_parent_queue_event (f->f_dentry, IN_OPEN, f->f_dentry->d_name.name);
 			fd_install(fd, f);
 		}
 out:
@@ -1034,6 +1037,8 @@
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
+	inotify_dentry_parent_queue_event (filp->f_dentry, IN_CLOSE, filp->f_dentry->d_name.name);
+	inotify_inode_queue_event (filp->f_dentry->d_inode, IN_CLOSE, NULL);
 	return filp_close(filp, files);
 
 out_unlock:
diff -urN linux-2.6.8.1/fs/read_write.c ../linux/fs/read_write.c
--- linux-2.6.8.1/fs/read_write.c	2004-08-14 06:55:35.000000000 -0400
+++ ../linux/fs/read_write.c	2004-08-19 00:11:52.000000000 -0400
@@ -11,6 +11,7 @@
 #include <linux/uio.h>
 #include <linux/smp_lock.h>
 #include <linux/dnotify.h>
+#include <linux/inotify.h>
 #include <linux/security.h>
 #include <linux/module.h>
 
@@ -216,8 +217,11 @@
 				ret = file->f_op->read(file, buf, count, pos);
 			else
 				ret = do_sync_read(file, buf, count, pos);
-			if (ret > 0)
+			if (ret > 0) {
 				dnotify_parent(file->f_dentry, DN_ACCESS);
+				inotify_dentry_parent_queue_event(file->f_dentry, IN_ACCESS, file->f_dentry->d_name.name);
+				inotify_inode_queue_event (file->f_dentry->d_inode, IN_ACCESS, NULL);
+			}
 		}
 	}
 
@@ -260,8 +264,11 @@
 				ret = file->f_op->write(file, buf, count, pos);
 			else
 				ret = do_sync_write(file, buf, count, pos);
-			if (ret > 0)
+			if (ret > 0) {
 				dnotify_parent(file->f_dentry, DN_MODIFY);
+				inotify_dentry_parent_queue_event(file->f_dentry, IN_MODIFY, file->f_dentry->d_name.name);
+				inotify_inode_queue_event (file->f_dentry->d_inode, IN_MODIFY, NULL);
+			}
 		}
 	}
 
@@ -493,9 +500,13 @@
 out:
 	if (iov != iovstack)
 		kfree(iov);
-	if ((ret + (type == READ)) > 0)
+	if ((ret + (type == READ)) > 0) {
 		dnotify_parent(file->f_dentry,
 				(type == READ) ? DN_ACCESS : DN_MODIFY);
+		inotify_dentry_parent_queue_event(file->f_dentry, 
+				(type == READ) ? IN_ACCESS : IN_MODIFY, file->f_dentry->d_name.name);
+		inotify_inode_queue_event (file->f_dentry->d_inode, (type == READ) ? IN_ACCESS : IN_MODIFY, NULL);
+	}
 	return ret;
 }
 
diff -urN linux-2.6.8.1/fs/super.c ../linux/fs/super.c
--- linux-2.6.8.1/fs/super.c	2004-08-14 06:55:22.000000000 -0400
+++ ../linux/fs/super.c	2004-08-19 00:11:52.000000000 -0400
@@ -36,6 +36,7 @@
 #include <linux/writeback.h>		/* for the emergency remount stuff */
 #include <linux/idr.h>
 #include <asm/uaccess.h>
+#include <linux/inotify.h>
 
 
 void get_filesystem(struct file_system_type *fs);
@@ -204,6 +205,7 @@
 
 	if (root) {
 		sb->s_root = NULL;
+		inotify_super_block_umount (sb);
 		shrink_dcache_parent(root);
 		shrink_dcache_anon(&sb->s_anon);
 		dput(root);
diff -urN linux-2.6.8.1/include/linux/fs.h ../linux/include/linux/fs.h
--- linux-2.6.8.1/include/linux/fs.h	2004-08-14 06:55:09.000000000 -0400
+++ ../linux/include/linux/fs.h	2004-08-19 00:11:52.000000000 -0400
@@ -458,6 +458,10 @@
 	unsigned long		i_dnotify_mask; /* Directory notify events */
 	struct dnotify_struct	*i_dnotify; /* for directory notifications */
 
+	struct list_head	watchers;
+	unsigned long		watchers_mask;
+	int			watcher_count;
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
diff -urN linux-2.6.8.1/include/linux/inotify.h ../linux/include/linux/inotify.h
--- linux-2.6.8.1/include/linux/inotify.h	1969-12-31 19:00:00.000000000 -0500
+++ ../linux/include/linux/inotify.h	2004-08-19 00:11:52.000000000 -0400
@@ -0,0 +1,74 @@
+/*
+ * Inode based directory notification for Linux
+ *
+ * Copyright (C) 2004 John McCutchan
+ *
+ * Signed-off-by: John McCutchan ttb@tentacle.dhs.org
+ */
+
+#ifndef _LINUX_INOTIFY_H
+#define _LINUX_INOTIFY_H
+
+struct inode;
+struct dentry;
+struct super_block;
+
+struct inotify_event {
+	int wd;
+	int mask;
+	char filename[256];
+	/* When you are watching a directory you will get the filenames
+	 * for events like IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, etc.. 
+	 */
+};
+/* When reading from the device you must provide a buffer 
+ * that is a multiple of the sizeof(inotify_event)
+ */
+
+#define IN_ACCESS	0x00000001	/* File was accessed */
+#define IN_MODIFY	0x00000002	/* File was modified */
+#define IN_CREATE	0x00000004	/* File was created */
+#define IN_DELETE	0x00000008	/* File was deleted */
+#define IN_RENAME	0x00000010	/* File was renamed */
+#define IN_ATTRIB	0x00000020	/* File changed attributes */
+#define IN_MOVE		0x00000040	/* File was moved */
+#define IN_UNMOUNT	0x00000080	/* Device file was on, was unmounted */
+#define IN_CLOSE	0x00000100	/* File was closed */
+#define IN_OPEN		0x00000200	/* File was opened */
+#define IN_IGNORED	0x00000400	/* File was ignored */
+#define IN_ALL_EVENTS	0xffffffff	/* All the events */
+
+/* ioctl */
+
+/* Fill this and pass it to INOTIFY_WATCH ioctl */
+struct inotify_watch_request {
+	char *dirname; // directory name
+	unsigned long mask; // event mask
+};
+
+#define INOTIFY_IOCTL_MAGIC 'Q'
+#define INOTIFY_IOCTL_MAXNR 4
+
+#define INOTIFY_WATCH  		_IOR(INOTIFY_IOCTL_MAGIC, 1, struct inotify_watch_request)
+#define INOTIFY_IGNORE 		_IOR(INOTIFY_IOCTL_MAGIC, 2, int)
+#define INOTIFY_STATS		_IOR(INOTIFY_IOCTL_MAGIC, 3, int)
+#define INOTIFY_SETDEBUG	_IOR(INOTIFY_IOCTL_MAGIC, 4, int)
+
+#define INOTIFY_DEBUG_NONE   0x00000000
+#define INOTIFY_DEBUG_ALLOC  0x00000001
+#define INOTIFY_DEBUG_EVENTS 0x00000002
+#define INOTIFY_DEBUG_INODE  0x00000004
+#define INOTIFY_DEBUG_ERRORS 0x00000008
+#define INOTIFY_DEBUG_FILEN  0x00000010
+#define INOTIFY_DEBUG_ALL    0xffffffff
+
+/* Kernel API */
+/* Adds events to all watchers on inode that are interested in mask */
+void inotify_inode_queue_event (struct inode *inode, unsigned long mask, const char *filename);
+/* Same as above but uses dentry's inode */
+void inotify_dentry_parent_queue_event (struct dentry *dentry, unsigned long mask, const char *filename);
+/* This will remove all watchers from all inodes on the superblock */
+void inotify_super_block_umount (struct super_block *sb);
+
+#endif
+
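
For readers following the header above, here is a minimal userspace sketch
of how a buffer of events read from the device might be decoded. It is not
part of the patch. The struct layout and mask values are copied from the
header; every sketch_* name is hypothetical, and the assumption that read()
hands back whole events follows the header comment about buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Layout copied from the patch's struct inotify_event (illustrative copy). */
struct sketch_event {
	int wd;
	int mask;
	char filename[256];
};

/* Mask values copied from the IN_* defines in the patch. */
static const struct {
	int bit;
	const char *name;
} sketch_names[] = {
	{ 0x00000001, "IN_ACCESS" },  { 0x00000002, "IN_MODIFY" },
	{ 0x00000004, "IN_CREATE" },  { 0x00000008, "IN_DELETE" },
	{ 0x00000010, "IN_RENAME" },  { 0x00000020, "IN_ATTRIB" },
	{ 0x00000040, "IN_MOVE" },    { 0x00000080, "IN_UNMOUNT" },
	{ 0x00000100, "IN_CLOSE" },   { 0x00000200, "IN_OPEN" },
	{ 0x00000400, "IN_IGNORED" },
};

/* Name of the lowest known event bit set in mask, or "?" if none is set. */
const char *sketch_mask_name(int mask)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_names) / sizeof(sketch_names[0]); i++)
		if (mask & sketch_names[i].bit)
			return sketch_names[i].name;
	return "?";
}

/* Number of whole events in nbytes; the buffer must hold whole events only. */
size_t sketch_event_count(size_t nbytes)
{
	assert(nbytes % sizeof(struct sketch_event) == 0);
	return nbytes / sizeof(struct sketch_event);
}

/* Print one line per event: watch descriptor, event name, filename. */
void sketch_print_events(const struct sketch_event *ev, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		printf("wd=%d %s %s\n", ev[i].wd,
		       sketch_mask_name(ev[i].mask), ev[i].filename);
}
```

In real use the buffer would come from read() on the opened inotify device
after an INOTIFY_WATCH ioctl; the decoding step would be the same.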


* Re: [RFC][PATCH] inotify 0.9
  2004-09-15 15:52 [RFC][PATCH] inotify 0.9 John McCutchan
@ 2004-09-15 18:00 ` Robert Love
  2004-09-16 15:07 ` Bill Davidsen
  1 sibling, 0 replies; 21+ messages in thread
From: Robert Love @ 2004-09-15 18:00 UTC (permalink / raw)
  To: John McCutchan; +Cc: linux-kernel, nautilus-list, gamin-list, viro, akpm, iggy

On Wed, 2004-09-15 at 11:52 -0400, John McCutchan wrote:

> I am interested in getting inotify included in the mm tree. 
> 
> Inotify is designed as a replacement for dnotify. The key difference's
> are that inotify does not require the file to be opened to watch it,
> when you are watching something with inotify it can go away (if path
> is unmounted) and you will be sent an event telling you it is gone and
> events are delivered over a fd not by using signals.

I want to expand on why dnotify is awful and why inotify is a great
replacement, because dnotify's limitations are really showing up on
modern desktop systems.

Some technical issues with dnotify and why inotify solves the problem:

        - dnotify requires one fd per watched directory.  this results
        in a lot of file descriptors if you are trying to do anything
        creative.  inotify solves this by only having one open file
        descriptor.
        
        - with dnotify, you open the fd on the directory to watch, which
        pins the directory.  this makes unmounting the backing
        filesystem impossible and means using dnotify on removable
        devices is nontrivial.  This is a problem with desktop systems.
        Not only does inotify solve this problem (by not requiring an
        open of each watched directory), but it even sends an "unmount"
        event when the watched directory is unmounted.
        
        - Using dnotify is, uh, interesting.  I mean, fcntl(2) and
        SIGIO?  You end up needing to use real-time signals.  Gross
        gross gross.  This does not work well with modern event-
        driven applications that use mainloops.  You end up needing a
        complicated daemon like FAM.  We don't want FAM, and in fact we
        should not even need a daemon (although we might want one).
        Conversely, inotify is trivial to use and integrates well and is
        select()-able.
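
To make the last point concrete, here is a rough sketch of waiting on a
single descriptor with select(), which is how a mainloop would wait on an
inotify fd. The helper name sketch_wait_readable is made up for this
illustration; only select() and the FD_* macros are real interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int sketch_wait_readable(int fd, int timeout_ms)
{
	fd_set rfds;
	struct timeval tv;
	int rc;

	assert(fd >= 0 && fd < FD_SETSIZE);

	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	rc = select(fd + 1, &rfds, NULL, NULL, &tv);
	if (rc < 0)
		return -1;
	return rc > 0 && FD_ISSET(fd, &rfds);
}
```

Any descriptor works for trying this out (a pipe, for instance); with
inotify the fd would be the one returned by opening the device, and a
readable result means at least one event is queued.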

I have been going over the code for a while now, and it looks good.  I
would really like to hear Al's opinion so we can move on fixing any
possible issues that he has.

Best,

	Robert Love




* Re: [RFC][PATCH] inotify 0.9
  2004-09-15 15:52 [RFC][PATCH] inotify 0.9 John McCutchan
  2004-09-15 18:00 ` Robert Love
@ 2004-09-16 15:07 ` Bill Davidsen
  2004-09-16 16:27   ` Chris Friesen
                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Bill Davidsen @ 2004-09-16 15:07 UTC (permalink / raw)
  To: linux-kernel

John McCutchan wrote:
> Hello,
> 
> I am releasing a new version of inotify. Attached is a patch for
> 2.6.8.1.
> 
> I am interested in getting inotify included in the mm tree. 
> 
> Inotify is designed as a replacement for dnotify. The key difference's
> are that inotify does not require the file to be opened to watch it,
> when you are watching something with inotify it can go away (if path
> is unmounted) and you will be sent an event telling you it is gone and
> events are delivered over a fd not by using signals.
> 
> New in this version:
> Driver now supports reading more than one event at a time
> Bump maximum number of watches per device from 64 to 8192
> Bump maximum number of queued events per device from 64 to 256
> 
> --COMPLEXITY--
> 
> I have been asked what the complexity of inotify is. Inotify has
> 2 path codes where complexity could be an issue:
> 
> Adding a watcher to a device
> 	This code has to check if the inode is already being watched 
> 	by the device, this is O(1) since the maximum number of 
> 	devices is limited to 8.
> 
> 
> Removing a watch from a device
> 	This code has to do a search of all watches on the device to
> 	find the watch descriptor that is being asked to remove.
> 	This involves a linear search, but should not really be an issue
> 	because it is limited to 8192 entries. If this does turn in to
> 	a concern, I would replace the list of watches on the device
> 	with a sorted binary tree, so that the search could be done
> 	very quickly.
> 
> 
> The calls to inotify from the VFS code has a complexity of O(1) so
> inotify does not affect the speed of VFS operations.
> 
> --MEMORY USAGE--
> 
> The inotify data structures are light weight:
> 
> inotify watch is 40 bytes
> inotify device is 68 bytes
> inotify event is 272 bytes
> 
> So assuming a device has 8192 watches, the structures are only going
> to consume 320KB of memory. With a maximum number of 8 devices allowed
> to exist at a time, this is still only 2.5 MB
> 
> Each device can also have 256 events queued at a time, which sums to
> 68KB per device. And only .5 MB if all devices are opened and have
> a full event queue.
> 
> So approximately 3 MB of memory are used in the rare case of 
> everything open and full.
> 
> Each inotify watch pins the inode of a directory/file in memory,
> the size of an inode is different per file system but lets assume
> that it is 512 byes. 
> 
> So assuming the maximum number of global watches are active, this would
> pin down 32 MB of inodes in the inode cache. Again not a problem
> on a modern system. 

Did you work for Microsoft? Bloat doesn't count? And is this going to be 
low memory you pin? And is every file create or delete (or update of 
atime) going to blast this mess through cache looking for people to notify?
> 
> On smaller systems, the maximum watches / events could be lowered
> to provide a smaller foot print.

Let's rethink this and say the max is zero by default, and by use of proc or 
sys or whatever's in vogue today you can enable the feature by setting a 
non-zero value.
> 
> Older release notes:
> I am resubmitting inotify for comments and review. Inotify has
> changed drastically from the earlier proposal that Al Viro did not
> approve of. There is no longer any use of (device number, inode number)
> pairs. Please give this version of inotify a fresh view.

We are hacking all over the kernel to save 4k in stack size and you want 
to pin up to 32MB?
> 
> 
> Inotify is a character device that when opened offers 2 IOCTL's.
> (It actually has 4 but the other 2 are used for debugging)
> 
> INOTIFY_WATCH:
>         Which takes a path and event mask and returns a unique 
>         (to the instance of the driver) integer (wd [watcher descriptor]
>         from here on) that is a 1:1 mapping to the path passed. 
>         What happens is inotify gets the inode (and ref's the inode)
>         for the path and adds a inotify_watcher structure to the inodes
>         list of watchers. If this instance of the driver is already
>         watching the path, the event mask will be updated and
>         the original wd will be returned.
> 
> INOTIFY_IGNORE:
>         Which takes an integer (that you got from INOTIFY_WATCH) 
>         representing a wd that you are not interested in watching
>         anymore. This will:
> 
>         send an IGNORE event to the device
>         remove the inotify_watcher structure from the device and 
>         from the inode and unref the inode.
>         
> 
> After you are watching 1 or more paths, you can read from the fd
> and get events. The events are struct inotify_event. If you are
> watching a directory and something happens to a file in the directory
> the event will contain the filename (just the filename not the full
> path).
> 
> Aside from the inotify character device driver. 
> The changes to the kernel are very minor. 
> 
> The first change is adding calls to inotify_inode_queue_event and
> inotify_dentry_parent_queue_event from the various vfs functions. This
> is identical to dnotify.
> 
> The second change is more serious, it adds a call to
> inotify_super_block_umount
> inside generic_shutdown_superblock. What inotify_super_block_umount does
> is:
> 
> find all of the inodes that are on the super block being shut down,
> sends each watcher on each inode the UNMOUNT and IGNORED event
> removes the watcher structures from each instance of the device driver 
> and each inode.
> unref's the inode.
> 
> I have tested this code on my system for over three weeks now and have
> not had problems. I would appreciate design review, code review and
> testing.
> 
> John

If I were doing this, and I admit I may not understand all of the 
features, I would have a bitmap per filesystem of inodes being watched, 
and anything which did an action which might require notify would check 
the bit. If the bit were set the filesystem and inode info would be 
passed to user space which could do anything it wanted. Use of the 
netlink is an example of ways to do this.
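
As a rough illustration of the bitmap idea only: one bit per inode number,
set when a watch is added and tested on each operation. All names here are
hypothetical, there is no locking, and nothing below says how the bitmap
would be kept in step with inode lifetimes or how events would reach user
space.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* One bit per inode number on a single filesystem (illustrative only). */
struct sketch_fs_bitmap {
	unsigned long ninodes;
	unsigned char *bits;
};

/* Allocate a zeroed bitmap; returns 0 on success, -1 on allocation failure. */
int sketch_bitmap_init(struct sketch_fs_bitmap *bm, unsigned long ninodes)
{
	assert(ninodes > 0);
	bm->ninodes = ninodes;
	bm->bits = calloc((ninodes + CHAR_BIT - 1) / CHAR_BIT, 1);
	return bm->bits ? 0 : -1;
}

/* Mark inode number ino as watched. */
void sketch_bitmap_watch(struct sketch_fs_bitmap *bm, unsigned long ino)
{
	assert(ino < bm->ninodes);
	bm->bits[ino / CHAR_BIT] |= (unsigned char)(1u << (ino % CHAR_BIT));
}

/* Constant-time test used on every operation that might need a notify. */
int sketch_bitmap_is_watched(const struct sketch_fs_bitmap *bm,
			     unsigned long ino)
{
	assert(ino < bm->ninodes);
	return (bm->bits[ino / CHAR_BIT] >> (ino % CHAR_BIT)) & 1;
}

/* Release the bitmap storage. */
void sketch_bitmap_free(struct sketch_fs_bitmap *bm)
{
	free(bm->bits);
	bm->bits = NULL;
	bm->ninodes = 0;
}
```

The storage cost is one bit per inode the filesystem can hold, whether or
not anything is being watched.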

Then the user program could do whatever it wanted in nice pageable 
space, allow as many watchers as it wished, and be flexible to anything 
a site wanted, scalable, could use semaphores, fifos, network 
monitoring, message queues... in other words low impact, scalable, and 
flexible.

Feel free to tell me there is some urgent need for this feature to be 
present and fast, I learn new things every day.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 15:07 ` Bill Davidsen
@ 2004-09-16 16:27   ` Chris Friesen
  2004-09-16 22:48     ` Bill Davidsen
  2004-09-16 16:39   ` Robert Love
  2004-09-16 16:46   ` Jan Kara
  2 siblings, 1 reply; 21+ messages in thread
From: Chris Friesen @ 2004-09-16 16:27 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

Bill Davidsen wrote:

> If I were doing this, and I admit I may not understand all of the 
> features, I would have a bitmap per filesystem of inodes being watched, 
> and anything which did an action which might require notify would check 
> the bit. If the bit were set the filesystem and inode info would be 
> passed to user space which could do anything it wanted.

How do you identify the filesystem?  Whose mount namespace do you use if you 
have multiple processes in different namespaces watching what is really the same 
file?

Chris


* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 15:07 ` Bill Davidsen
  2004-09-16 16:27   ` Chris Friesen
@ 2004-09-16 16:39   ` Robert Love
  2004-09-20 20:16     ` Bill Davidsen
  2004-09-16 16:46   ` Jan Kara
  2 siblings, 1 reply; 21+ messages in thread
From: Robert Love @ 2004-09-16 16:39 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Thu, 2004-09-16 at 11:07 -0400, Bill Davidsen wrote:

> Did you work for Microsoft? Bloat doesn't count? And is this going to be 
>   low memory you pin? And is every file create or delete (or update of 
> atime) going to blast this mess through cache looking for people to notify?

No.  I suggest looking at the source.

We are pinning the very inodes we are using.  So,

	(a) There are no cache effects because the inodes are already
	    in use.  So when you go to, say, write to a file the kernel
	    already has the inode handy, and we just check in O(1) to
	    see if the inode has a watcher on it.  We never walk a list
	    of inodes (why would you ever do that?  how would you do
	    that?).
	(b) Many of the pinned inodes are already in memory, cached,
	    since the overlap between used inodes and watched inodes
	    is high.  Right now, on a system without inotify, I have
	    60MB of inodes in memory.
	(c) The inodes are pinned to prevent races.  Or, don't even
	    look at it like this.  Just look at it as elevating the
	    ref count on the data structure while we are using it.
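
A sketch of what the O(1) check in (a) could look like, using stand-ins
for the fields the patch adds to struct inode (watchers_mask and
watcher_count). The function name is hypothetical, and it is an assumption
here that watchers_mask holds the OR of every watcher's event mask; the
real test lives in the driver source.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the fields the patch adds to struct inode. */
struct sketch_inode {
	unsigned long watchers_mask;	/* assumed: OR of all watcher masks */
	int watcher_count;		/* number of watchers on this inode */
};

/*
 * Constant-time test done on the inode the VFS already holds:
 * does any watcher on it care about the events in mask?
 */
int sketch_inode_wants(const struct sketch_inode *inode, unsigned long mask)
{
	assert(inode != NULL);
	return inode->watcher_count > 0 &&
	       (inode->watchers_mask & mask) != 0;
}
```

Only when this test passes would the per-inode watcher list need to be
walked, and that list is bounded by the 8-device limit from the
announcement.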

But here is the kicker: I don't think this pinning behavior is any
different than dnotify.  So this is a total utter nonissue.

> > Older release notes:
> > I am resubmitting inotify for comments and review. Inotify has
> > changed drastically from the earlier proposal that Al Viro did not
> > approve of. There is no longer any use of (device number, inode number)
> > pairs. Please give this version of inotify a fresh view.
> 
> We are hacking all over the kernel to save 4k in stack size and you want 
> to pin up to 32MB?

The 4K is 4K per process, and it is done not to save 4K once (or even
4K*number of processes) but because order-1 allocations (two contiguous
pages, 8KB on x86) become hard to satisfy as memory becomes fragmented.

I bet on most modern systems there is already much more than 32MB of
inodes in memory, and you have to explicitly add watches anyhow.

> If I were doing this, and I admit I may not understand all of the 
> features, I would have a bitmap per filesystem of inodes being watched, 
> and anything which did an action which might require notify would check 
> the bit. If the bit were set the filesystem and inode info would be 
> passed to user space which could do anything it wanted. Use of the 
> netlink is an example of ways to do this.

Race, race, race, if even possible to implement "a bitmap per filesystem
of inodes" in a sane way.

> Then the user program could do whatever it wanted in nice pageable 
> space, allow as many watchers as it wished, and be flexible to anything 
> a site wanted, scalable, could use semaphores, fifos, network 
> monitoring, message queues... in other words low impact, scalable, and 
> flexible.

If you assume that you have to pin the inodes while you watch them (and
you do), then inotify really is this minimum abstraction that you talk
of.

> Feel free to tell me there is some urgent need for this feature to be 
> present and fast, I learn new things every day.

You act like file notification is something new.  Every operating system
provides this feature.  Linux currently does, too: dnotify.

But dnotify sucks, and modern systems are hitting its numerous limits.
So, enter inotify.

Fondest regards,

	Robert Love




* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 15:07 ` Bill Davidsen
  2004-09-16 16:27   ` Chris Friesen
  2004-09-16 16:39   ` Robert Love
@ 2004-09-16 16:46   ` Jan Kara
  2004-09-16 22:34     ` Bill Davidsen
  2 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2004-09-16 16:46 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

> John McCutchan wrote:
> >Hello,
> >
> >I am releasing a new version of inotify. Attached is a patch for
> >2.6.8.1.
<snip>

> >--MEMORY USAGE--
> >
> >The inotify data structures are light weight:
> >
> >inotify watch is 40 bytes
> >inotify device is 68 bytes
> >inotify event is 272 bytes
> >
> >So assuming a device has 8192 watches, the structures are only going
> >to consume 320KB of memory. With a maximum number of 8 devices allowed
> >to exist at a time, this is still only 2.5 MB
> >
> >Each device can also have 256 events queued at a time, which sums to
> >68KB per device. And only .5 MB if all devices are opened and have
> >a full event queue.
> >
> >So approximately 3 MB of memory are used in the rare case of 
> >everything open and full.
> >
> >Each inotify watch pins the inode of a directory/file in memory,
> >the size of an inode is different per file system but lets assume
> >that it is 512 byes. 
> >
> >So assuming the maximum number of global watches are active, this would
> >pin down 32 MB of inodes in the inode cache. Again not a problem
> >on a modern system. 
> 
> Did you work for Microsoft? Bloat doesn't count? And is this going to be 
>  low memory you pin? And is every file create or delete (or update of 
> atime) going to blast this mess through cache looking for people to notify?
> >
> >On smaller systems, the maximum watches / events could be lowered
> >to provide a smaller foot print.
> 
> Let's rethink this and say the max is by default and by use of proc or 
> sys or whatever's in vogue today you can enable the feature by setting a 
> non-zero value.
  As I understand the patch, it won't have any nontrivial memory
footprint if you don't use inotify. Only when someone wants to
watch an inode is the appropriate structure allocated, the inode
pinned, etc. The numbers above are for the case where you watch the
maximum possible number of inodes.
  Maybe you should not be so fast in using your flamethrower;)
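
The worst-case figures quoted above can be re-derived with a few
multiplications. This is only a sketch: the sizes and limits are the ones
from the announcement (512 bytes per inode is the announcement's own
assumption), and the SK_* and sketch_* names are invented here.

```c
#include <assert.h>

/* Sizes and limits as stated in the announcement. */
enum {
	SK_WATCH_BYTES = 40,	/* one inotify watch */
	SK_EVENT_BYTES = 272,	/* one queued inotify event */
	SK_INODE_BYTES = 512,	/* assumed size of a pinned inode */
	SK_MAX_WATCHES = 8192,	/* watches per device */
	SK_MAX_EVENTS  = 256,	/* queued events per device */
	SK_MAX_DEVICES = 8	/* open devices */
};

/* Bytes used by watch structures for the given usage. */
unsigned long sketch_watch_bytes(unsigned long devices, unsigned long watches)
{
	assert(devices <= SK_MAX_DEVICES && watches <= SK_MAX_WATCHES);
	return devices * watches * SK_WATCH_BYTES;
}

/* Bytes used by queued events for the given usage. */
unsigned long sketch_event_bytes(unsigned long devices, unsigned long events)
{
	assert(devices <= SK_MAX_DEVICES && events <= SK_MAX_EVENTS);
	return devices * events * SK_EVENT_BYTES;
}

/* Bytes of inodes pinned, one inode per watch. */
unsigned long sketch_pinned_inode_bytes(unsigned long devices,
					unsigned long watches)
{
	assert(devices <= SK_MAX_DEVICES && watches <= SK_MAX_WATCHES);
	return devices * watches * SK_INODE_BYTES;
}
```

With everything at its maximum this gives 2.5 MB of watches, 544 KB of
queued events (the ".5 MB" above) and 32 MB of pinned inodes; with no
devices open all three come out to zero.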

								Bye
									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 16:46   ` Jan Kara
@ 2004-09-16 22:34     ` Bill Davidsen
  2004-09-16 22:57       ` David Lang
  2004-09-16 23:22       ` Robert Love
  0 siblings, 2 replies; 21+ messages in thread
From: Bill Davidsen @ 2004-09-16 22:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel

On Thu, 16 Sep 2004, Jan Kara wrote:

> > John McCutchan wrote:
> > >Hello,
> > >
> > >I am releasing a new version of inotify. Attached is a patch for
> > >2.6.8.1.
> <snip>
> 
> > >--MEMORY USAGE--
> > >
> > >The inotify data structures are light weight:
> > >
> > >inotify watch is 40 bytes
> > >inotify device is 68 bytes
> > >inotify event is 272 bytes
> > >
> > >So assuming a device has 8192 watches, the structures are only going
> > >to consume 320KB of memory. With a maximum number of 8 devices allowed
> > >to exist at a time, this is still only 2.5 MB
> > >
> > >Each device can also have 256 events queued at a time, which sums to
> > >68KB per device. And only .5 MB if all devices are opened and have
> > >a full event queue.
> > >
> > >So approximately 3 MB of memory are used in the rare case of 
> > >everything open and full.
> > >
> > >Each inotify watch pins the inode of a directory/file in memory,
> > >the size of an inode is different per file system but lets assume
> > >that it is 512 byes. 
> > >
> > >So assuming the maximum number of global watches are active, this would
> > >pin down 32 MB of inodes in the inode cache. Again not a problem
> > >on a modern system. 
> > 
> > Did you work for Microsoft? Bloat doesn't count? And is this going to be 
> >  low memory you pin? And is every file create or delete (or update of 
> > atime) going to blast this mess through cache looking for people to notify?
> > >
> > >On smaller systems, the maximum watches / events could be lowered
> > >to provide a smaller foot print.
> > 
> > Let's rethink this and say the max is by default and by use of proc or 
> > sys or whatever's in vogue today you can enable the feature by setting a 
> > non-zero value.
>   As I understand the patch it won't have any nontrivial memory
> footprint in case you won't use inotify. Only in case someone wants to
> watch inode, appropriate structure is allocated, inode pined etc. The
> numbers above are in the case you watch maximum possible number of
> inodes etc...

The point I was making is that this doesn't scale well, because it eats
resources which may be unavailable on many systems, and which others are
trying to conserve. Since this may limit the use it presents a problem
with usefulness.

>   Maybe you should not be so fast in using your flamethrower;)

I didn't intend this as a flame, but I do feel this implementation doesn't
scale. I offered another approach off the top of my head, which appears to
me to be more scalable. I claimed no expertise, I just made a suggestion,
based on my first thought on how I would attack the problem in a way which
appears more scalable.

If we are going to 4k stack because larger memory blocks are hard to find,
I have to suspect that anything which locks up blocks sized in MB is going
to cause problems. I didn't even ask what would happen on NUMA machines,
because that's not my usual concern.

I'm still horrified by the memory requirements :-(

> -- 
> Jan Kara <jack@suse.cz>
> SuSE CR Labs
> 

Thanks for taking the time to note that my tone may have been harsh even
if my point was valid.


-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 16:27   ` Chris Friesen
@ 2004-09-16 22:48     ` Bill Davidsen
  0 siblings, 0 replies; 21+ messages in thread
From: Bill Davidsen @ 2004-09-16 22:48 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

On Thu, 16 Sep 2004, Chris Friesen wrote:

> Bill Davidsen wrote:
> 
> > If I were doing this, and I admit I may not understand all of the 
> > features, I would have a bitmap per filesystem of inodes being watched, 
> > and anything which did an action which might require notify would check 
> > the bit. If the bit were set the filesystem and inode info would be 
> > passed to user space which could do anything it wanted.
> 
> How do you identify the filesystem?  Whose mount namespace do you use if you 
> have multiple processes in different namespaces watching what is really the same 
> file?

You're asking for implementation details on something I threw out off the
top of my head? My first thought is "not by name" since if this is an
unmount that's not going to work well. Since I'm making this up, let's say
a filesystem number and inode number. Then when the watch is set the system
just has to have a unique "filesystem number" identifier which is shared
by every watch request against the f/s.

I haven't looked at how the original proposal handles things like the same
f/s mounted multiple times, etc, so I wouldn't venture to improve on it.
If I were actually going to write something like this, I'd want to start
with a description of functional requirements and response time, and go
from there, trying to move as much as possible out of unpageable memory.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 22:34     ` Bill Davidsen
@ 2004-09-16 22:57       ` David Lang
  2004-09-16 23:22       ` Robert Love
  1 sibling, 0 replies; 21+ messages in thread
From: David Lang @ 2004-09-16 22:57 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Jan Kara, linux-kernel

On Thu, 16 Sep 2004, Bill Davidsen wrote:

> On Thu, 16 Sep 2004, Jan Kara wrote:
>
>>> John McCutchan wrote:
>>>> Hello,
>>>>
>>>> I am releasing a new version of inotify. Attached is a patch for
>>>> 2.6.8.1.
>> <snip>
>>
>>>> --MEMORY USAGE--
>>>>
>>>> The inotify data structures are light weight:
>>>>
>>>> inotify watch is 40 bytes
>>>> inotify device is 68 bytes
>>>> inotify event is 272 bytes
>>>>
>>>> So assuming a device has 8192 watches, the structures are only going
>>>> to consume 320KB of memory. With a maximum number of 8 devices allowed
>>>> to exist at a time, this is still only 2.5 MB
>>>>
>>>> Each device can also have 256 events queued at a time, which sums to
>>>> 68KB per device. And only .5 MB if all devices are opened and have
>>>> a full event queue.
>>>>
>>>> So approximately 3 MB of memory are used in the rare case of
>>>> everything open and full.
>>>>
>>>> Each inotify watch pins the inode of a directory/file in memory,
>>>> the size of an inode is different per file system but let's assume
>>>> that it is 512 bytes.
>>>>
>>>> So assuming the maximum number of global watches are active, this would
>>>> pin down 32 MB of inodes in the inode cache. Again not a problem
>>>> on a modern system.
>>>
>>> Did you work for Microsoft? Bloat doesn't count? And is this going to be
>>>  low memory you pin? And is every file create or delete (or update of
>>> atime) going to blast this mess through cache looking for people to notify?
>>>>
>>>> On smaller systems, the maximum watches / events could be lowered
>>>> to provide a smaller foot print.
>>>
>>> Let's rethink this and say the max is zero by default and by use of proc or
>>> sys or whatever's in vogue today you can enable the feature by setting a
>>> non-zero value.
>>   As I understand the patch it won't have any nontrivial memory
>> footprint in case you won't use inotify. Only in case someone wants to
>> watch inode, appropriate structure is allocated, inode pinned etc. The
>> numbers above are in the case you watch maximum possible number of
>> inodes etc...
>
> The point I was making is that this doesn't scale well, because it eats
> resources which may be unavailable on many systems, and which others are
> trying to conserve. Since this may limit the use it presents a problem
> with usefulness.
>
>>   Maybe you should not be so fast in using your flamethrower;)
>
> I didn't intend this as a flame, but I do feel this implementation doesn't
> scale. I offered another approach off the top of my head, which appears to
> me to be more scalable. I claimed no expertise, I just made a suggestion,
> based on my first thought on how I would attack the problem in a way which
> appears more scalable.

IIRC you suggested a bitmap of all the inodes on a filesystem.

on my desktop this is what I see for inodes
dlang@dlang:~$ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda3            1048576  168266  880310   17% /
/dev/sda5            2097152  158128 1939024    8% /home
/dev/sda1             524288   41797  482491    8% /mnt

so at 8 per byte you are talking about ~500K just to store the info about
which ones someone is interested in watching (and note that this is only a
9GB drive, think about what happens on multi TB systems), then you have to
have another structure to track the events and which node each event goes
to (and what programs are interested in watching which inodes)

I don't think that a bitmap of all possible inodes is going to be the 
right thing either.

now it's very possible that you were meaning something else, but it's not 
clear what so please try again to restate your idea.

> If we are going to 4k stack because larger memory blocks are hard to find,
> I have to suspect that anything which locks up blocks size in MB is going
> to cause problems. I didn't even ask what would happen on NUMA machines,
> because that's not my usual concern.

actually the memory for this doesn't need to be a contiguous block so it
doesn't run into this problem

> I'm still horrified by the memory requirements :-(

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 22:34     ` Bill Davidsen
  2004-09-16 22:57       ` David Lang
@ 2004-09-16 23:22       ` Robert Love
  2004-09-16 23:35         ` Alan Cox
  1 sibling, 1 reply; 21+ messages in thread
From: Robert Love @ 2004-09-16 23:22 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Jan Kara, linux-kernel

On Thu, 2004-09-16 at 18:34 -0400, Bill Davidsen wrote:

> >   Maybe you should not be so fast in using your flamethrower;)
> 
> I didn't intend this as a flame, but I do feel this implementation doesn't
> scale. I offered another approach off the top of my head, which appears to
> me to be more scalable. I claimed no expertise, I just made a suggestion,
> based on my first thought on how I would attack the problem in a way which
> appears more scalable.

The thing you are missing is that you absolutely have to pin something
or you have multiple VFS races.  Your bitmap suggestion, while cute,
really shows a lack of understanding of the problem space.

dnotify had to do it, inotify has to do it.

Do you want to go down the lets-find-a-race path with Al Viro? ;-)

> If we are going to 4k stack because larger memory blocks are hard to find,
> I have to suspect that anything which locks up blocks size in MB is going
> to cause problems. I didn't even ask what would happen on NUMA machines,
> because that's not my usual concern.

It is not the total size that is the concern, but the per-allocation
size, which has to be contiguous.  A first order allocation is hard to
do.  You can only find two contiguous free pages in physical memory so
often.

Inodes come from the slabcache.  NONE of this is an issue there.

Plus, as I have said, the slabcache is probably caching much of what you
are pinning.  So memory consumption is not changed.  Finally, these
numbers are WORST case.  Watch only a handful of files and you have a
handful of hundreds of bytes pinned.

	Robert Love



* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 23:22       ` Robert Love
@ 2004-09-16 23:35         ` Alan Cox
  2004-09-17  2:29           ` Robert Love
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Cox @ 2004-09-16 23:35 UTC (permalink / raw)
  To: Robert Love; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Gwe, 2004-09-17 at 00:22, Robert Love wrote:
> The thing you are missing is that you absolutely have to pin something
> or you have multiple VFS races.  Your bitmap suggestion, while cute,
> really shows a lack of understanding of the problem space.

How many of the races matter? There seem to be several different
problems here and mixing them up might be a mistake. 

1.	I absolutely need to get the right file at the right moment, please
pass me a descriptor to the file as the user closes it so I always get
it right (indexer, virus checker)

2.	If something happens bug me and I'll have a look (eg file manager)

Also it varies between "This file" and "everything in this subtree".
An indexer for example really wants to know "this file, this path" for
entire subtrees and to index the right object (if the path changes that's
less of an issue). 

Alan


* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 23:35         ` Alan Cox
@ 2004-09-17  2:29           ` Robert Love
  2004-09-17  3:08             ` Nicholas Miell
  2004-09-17 14:39             ` Alan Cox
  0 siblings, 2 replies; 21+ messages in thread
From: Robert Love @ 2004-09-17  2:29 UTC (permalink / raw)
  To: Alan Cox; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Fri, 2004-09-17 at 00:35 +0100, Alan Cox wrote:

> How many of the races matter? There seem to be several different
> problems here and mixing them up might be a mistake. 
> 
> 1.	I absolutely need to get the right file at the right moment, please
> pass me a descriptor to the file as the user closes it so I always get
> it right (indexer, virus checker)
> 
> 2.	If something happens bug me and I'll have a look (eg file manager)

I think we want a solution that works well for both cases.

E.g., we have a few different needs:

	- Stuff like Spotlight-esque automatic Indexers.
	- File manager notifications
	- Other GUI notifications (desktop, menus, etc.)
	- To prevent polling (e.g. /proc/mtab)
	- Existing dnotify users

dnotify is pretty lame for any of the above situations.  Even for
something as trivial as watching the current open directory in Nautilus,
look at the hoops we have to jump through with FAM.

And dnotify utterly falls apart on removable media or for any "large"
sort of job, e.g. indexing.

	Robert Love



* Re: [RFC][PATCH] inotify 0.9
  2004-09-17  2:29           ` Robert Love
@ 2004-09-17  3:08             ` Nicholas Miell
  2004-09-17 14:39             ` Alan Cox
  1 sibling, 0 replies; 21+ messages in thread
From: Nicholas Miell @ 2004-09-17  3:08 UTC (permalink / raw)
  To: Robert Love; +Cc: Alan Cox, Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Thu, 2004-09-16 at 19:29, Robert Love wrote:
> I think we want a solution that works well for both cases.
> 
> E.g., we have a few different needs:
> 
> 	- Stuff like Spotlight-esque automatic Indexers.
> 	- File manager notifications
> 	- Other GUI notifications (desktop, menus, etc.)
> 	- To prevent polling (e.g. /proc/mtab)
> 	- Existing dnotify users
> 
> dnotify is pretty lame for any of the above situations.  Even for
> something as trivial as watching the current open directory in Nautilus,
> look at the hoops we have to jump through with FAM.
> 
> And dnotify utterly falls apart on removable media or for any "large"
> sort of job, e.g. indexing.

Isn't this the problem that XDSM/DMAPI is supposed to solve? Or is that
one of those specs that's too ugly to be implemented?


* Re: [RFC][PATCH] inotify 0.9
  2004-09-17  2:29           ` Robert Love
  2004-09-17  3:08             ` Nicholas Miell
@ 2004-09-17 14:39             ` Alan Cox
  2004-09-17 15:48               ` Robert Love
  1 sibling, 1 reply; 21+ messages in thread
From: Alan Cox @ 2004-09-17 14:39 UTC (permalink / raw)
  To: Robert Love; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Gwe, 2004-09-17 at 03:29, Robert Love wrote:
> I think we want a solution that works well for both cases.

Why does it have to be "a" solution, not different things for different
tasks?

> And dnotify utterly falls apart on removable media or for any "large"
> sort of job, e.g. indexing.

Agreed


* Re: [RFC][PATCH] inotify 0.9
  2004-09-17 15:48               ` Robert Love
@ 2004-09-17 14:51                 ` Alan Cox
  2004-09-17 15:55                   ` Robert Love
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Cox @ 2004-09-17 14:51 UTC (permalink / raw)
  To: Robert Love; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Gwe, 2004-09-17 at 16:48, Robert Love wrote:
> I've looked into more "indexing" specific solutions and you see both
> races and security issues when you move away from the subscribe-to-
> watch-each-inode model.

For the file change case I'm unconvinced, although it looks like it
could be done with the security module hooks and without kernel mods
beyond that.



* Re: [RFC][PATCH] inotify 0.9
  2004-09-17 14:39             ` Alan Cox
@ 2004-09-17 15:48               ` Robert Love
  2004-09-17 14:51                 ` Alan Cox
  0 siblings, 1 reply; 21+ messages in thread
From: Robert Love @ 2004-09-17 15:48 UTC (permalink / raw)
  To: Alan Cox; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Fri, 2004-09-17 at 15:39 +0100, Alan Cox wrote:

> Why does it have to be "a" solution, not different things for different
> tasks?

I have hopes that a single solution can happily solve all the cases.  At
their core, all of these tasks are essentially the same - file change
notification - and it seems redundant to implement multiple file change
systems in the kernel.

I've looked into more "indexing" specific solutions and you see both
races and security issues when you move away from the subscribe-to-
watch-each-inode model.

That said, I personally don't have any reason for wanting a single
solution, except that because it is cleaner/simpler/smaller/etc it has a
better chance of success.  If you have code that speaks different, then
great!

	Robert Love



* Re: [RFC][PATCH] inotify 0.9
  2004-09-17 14:51                 ` Alan Cox
@ 2004-09-17 15:55                   ` Robert Love
  0 siblings, 0 replies; 21+ messages in thread
From: Robert Love @ 2004-09-17 15:55 UTC (permalink / raw)
  To: Alan Cox; +Cc: Bill Davidsen, Jan Kara, Linux Kernel Mailing List

On Fri, 2004-09-17 at 15:51 +0100, Alan Cox wrote:

> For the file change case I'm unconvinced, although it looks like it
> could be done with the security module hooks and without kernel mods
> beyond that.

Everyone keeps telling me this.  I am unconvinced, too. ;-)

It should get more attention, though..

	Robert Love



* Re: [RFC][PATCH] inotify 0.9
  2004-09-16 16:39   ` Robert Love
@ 2004-09-20 20:16     ` Bill Davidsen
  2004-09-20 21:05       ` Robert Love
  0 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2004-09-20 20:16 UTC (permalink / raw)
  To: linux-kernel

Robert Love wrote:
> On Thu, 2004-09-16 at 11:07 -0400, Bill Davidsen wrote:
> 
> 
>>Did you work for Microsoft? Bloat doesn't count? And is this going to be 
>>  low memory you pin? And is every file create or delete (or update of 
>>atime) going to blast this mess through cache looking for people to notify?
> 
> 
> No.  I suggest looking at the source.
> 
> We are pinning the very inodes we are using.  So,

Well, I guess I misread the intent, I was assuming an inode could be 
watched even if it wasn't (at the time of watch) being used. So while I 
may want to know when any inode in a directory is used, I don't 
particularly desire to have them all pinned in memory.

If you say that's the only way, then clearly only huge systems will be 
able to do that type of monitoring.
> 
> 	(a) There are no cache effects because the inodes are already
> 	    in use.  So when you go to, say, write to a file the kernel
> 	    already has the inode handy, and we just check in O(1) to
> 	    see if the inode has a watcher on it.  We never walk a list
> 	    of inodes (why would you ever do that?  how would you do
> 	    that?).
> 	(b) Many of the pinned inodes are already in memory, cached,
> 	    since the overlap between used inodes and watched inodes
> 	    is high.  Right now, on a system without inotify, I have
> 	    60MB of inodes in memory.
> 	(c) The inodes are pinned to prevent races.  Or, don't even
> 	    look at it like this.  Just look at it as elevating the
> 	    ref count on the data structure while we are using it.

I'm not clear on what race you would get sending a notify to a user mode 
process that an inode had changed, but if you say there could be one I 
can't argue.
> 
> But here is the kicker: I don't think this pinning behavior is any
> different than dnotify.  So this is a total utter nonissue.

If you assume you are going to create the same resource demands doing 
one thing as another then it becomes a non-issue. I was suggesting that 
it would be desirable not to use as many resources.
> 
> 
>>>Older release notes:
>>>I am resubmitting inotify for comments and review. Inotify has
>>>changed drastically from the earlier proposal that Al Viro did not
>>>approve of. There is no longer any use of (device number, inode number)
>>>pairs. Please give this version of inotify a fresh view.
>>
>>We are hacking all over the kernel to save 4k in stack size and you want 
>>to pin up to 32MB?
> 
> 
> The 4K is 4K per process, and it is done not to save 4K once (or even
> 4K*number of processes) but because first order allocations (8KB on x86)
> become nontrivial as memory becomes fragmented.
> 
> I bet on most modern systems there is already much more than 32MB of
> inodes in memory, and you have to explicitly add watches anyhow.

If by modern you mean huge memory servers, you are right. If you mean 
modest desktops which might be able to identify problems by watching a 
set of inodes, I suspect the inode usage is lower.
> 
> 
>>If I were doing this, and I admit I may not understand all of the 
>>features, I would have a bitmap per filesystem of inodes being watched, 
>>and anything which did an action which might require notify would check 
>>the bit. If the bit were set the filesystem and inode info would be 
>>passed to user space which could do anything it wanted. Use of the 
>>netlink is an example of ways to do this.
> 
> 
> Race, race, race, if even possible to implement "a bitmap per filesystem
> of inodes" in a sane way.
> 
> 
>>Then the user program could do whatever it wanted in nice pageable 
>>space, allow as many watchers as it wished, and be flexible to anything 
>>a site wanted, scalable, could use semaphores, fifos, network 
>>monitoring, message queues... in other words low impact, scalable, and 
>>flexible.
> 
> 
> If you assume that you have to pin the inodes while you watch them (and
> you do), then inotify really is this minimum abstraction that you talk
> of.

As I said, if you assume pinning the inodes you can't make any 
significant reduction in memory use.
> 
> 
>>Feel free to tell me there is some urgent need for this feature to be 
>>present and fast, I learn new things every day.
> 
> 
> You act like file notification is something new.  Every operating system
> provides this feature.  Linux currently does, too: dnotify.
> 
> But dnotify sucks, and modern systems are hitting its numerous limits.
> So, enter inotify.

I guess all of us running laptops and the like with memory in MB rather 
than GB just aren't modern... the limit we hit is mostly memory size.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

* Re: [RFC][PATCH] inotify 0.9
  2004-09-20 20:16     ` Bill Davidsen
@ 2004-09-20 21:05       ` Robert Love
  2004-09-20 22:59         ` Bill Davidsen
  0 siblings, 1 reply; 21+ messages in thread
From: Robert Love @ 2004-09-20 21:05 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Mon, 2004-09-20 at 16:16 -0400, Bill Davidsen wrote:

> Well, I guess I misread the intent, I was assuming an inode could be 
> watched even if it wasn't (at the time of watch) being used. So while I 
> may want to know when any inode in a directory is used, I don't 
> particularly desire to have them all pinned in memory.
> 
> If you say that's the only way, then clearly only huge systems will be 
> able to do that type of monitoring.

You can pin just a directory and retrieve all of the events therein.
You do not need to pin every single inode on your machine.  This is the
same as dnotify - except inotify also allows you to watch individual
files.

> I'm not clear on what race you would get sending a notify to a user mode 
> process that an inode had changed, but if you say there could be one I 
> can't argue.

If you cannot track the lifecycle of the object being watched, you
essentially cannot watch it.  To track the lifetime of an inode, you
need to ensure that it remains in the icache.

> If by modern you mean huge memory servers, you are right. If you mean 
> modest desktops which might be able to identify problems by watching a 
> set of inodes, I suspect the inode usage is lower.
>
> I guess all of us running laptops and the like with memory in MB rather 
> than GB just aren't modern... the limit we hit is mostly memory size.

John showed that the absolute worst case is ~30MB in your icache.  I
have 77MB of ext3 inodes in cache right now on my desktop.  Assuming a
decent overlap between watched and cached inodes, there is little
change.

But the 30MB is worst case.  Expect something in the single digits.

Look, Bill: Conjecturing about a potential problem in a space you are
unfamiliar with does nothing but obstruct Linux development and act as
Stop Energy.  Constructive, well-informed opinions are money.
Everything else is just liking the sound of your voice.

Thanks!

Best,

	Robert Love



* Re: [RFC][PATCH] inotify 0.9
  2004-09-20 21:05       ` Robert Love
@ 2004-09-20 22:59         ` Bill Davidsen
  2004-09-21  0:02           ` Robert Love
  0 siblings, 1 reply; 21+ messages in thread
From: Bill Davidsen @ 2004-09-20 22:59 UTC (permalink / raw)
  To: Robert Love; +Cc: linux-kernel

On Mon, 20 Sep 2004, Robert Love wrote:

> On Mon, 2004-09-20 at 16:16 -0400, Bill Davidsen wrote:

> You can pin just a directory and retrieve all of the events therein.
> You do not need to pin every single inode on your machine.  This is the
> same as dnotify - except inotify also allows you to watch individual
> files.
> 
> > I'm not clear on what race you would get sending a notify to a user mode 
> > process that an inode had changed, but if you say there could be one I 
> > can't argue.
> 
> If you cannot track the lifecycle of the object being watched, you
> essentially cannot watch it.  To track the lifetime of an inode, you
> need to ensure that it remains in the icache.

What I proposed as a possible implementation was to have anything which
did a trackable operation on the inode send a notify to user space. And
that isn't the same as dnotify although it might address some of the same
uses. As a for instance when an open is done the open code sends a notify,
and until that time it's not obvious that the inode must be pinned. By
having a single user program accept the notify and decide what to do, the
kernel can do less of it. Yes, that could mean passing out a lot of
information which would be dropped by the user program. That's what I had
in mind when I asked if the process needed to be real time.


> Look, Bill: Conjecturing about a potential problem in a space you are
> unfamiliar with does nothing but obstruct Linux development and act as
> Stop Energy.  Constructive, well-informed opinions are money.
> Everything else is just liking the sound of your voice.

If an idea is so tenuous that one person noting that the memory overhead
of a feature is or could be very high and asking "could it be done thus"
provides Stop Energy then there may be a lack of conviction.

I personally don't mind being questioned on an idea, it points out flaws,
it lets me be confident that I have it right, or at least avoid putting a
lot of effort into something and then having someone say "oh here's a
better way." I don't go with "how dare you question me?"

I have virtually no experience with dnotify, but a lot with putting Linux
on old small systems to give to low income kids who can't buy "modern" 
machines, and if I see a feature which won't run on a small machine I want
to suggest that there might be a better way. Sorry, a lower resource cost
way. 

Developers don't typically have small slow machines, and don't think about
the old, embedded, or laptop uses unless someone mentions it. I'm sorry
you think I'm talking to hear myself talk, the point I'm making is
valid to me.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


* Re: [RFC][PATCH] inotify 0.9
  2004-09-20 22:59         ` Bill Davidsen
@ 2004-09-21  0:02           ` Robert Love
  0 siblings, 0 replies; 21+ messages in thread
From: Robert Love @ 2004-09-21  0:02 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Mon, 2004-09-20 at 18:59 -0400, Bill Davidsen wrote:

> I'm sorry you think I'm talking to hear myself talk, the
> point I'm making is valid to me.

Judgment suggests I should drop this, but the problem is that you never
made a valid or well-informed point.

You started off with "Did you work for Microsoft?" and you followed up
with questions and critiques demonstrating no understanding whatsoever
for the way that Linux dcache or inode management works and further that
you did not even read the patch.

My reply "well dnotify has this same issue" is not a rallying behind the
status quo (I mean, I want dnotify dead myself) but that no one
complains about the size issue with dnotify.  John and I want to address
the issues with dnotify.

If we can do something about space consumption, and if it turns out to
be an issue, I am all for it.  I do not yet see a way around it, and no
one has shown that normal use in the real world suffers from any issue.

Thanks,

	Robert Love



Thread overview: 21+ messages
2004-09-15 15:52 [RFC][PATCH] inotify 0.9 John McCutchan
2004-09-15 18:00 ` Robert Love
2004-09-16 15:07 ` Bill Davidsen
2004-09-16 16:27   ` Chris Friesen
2004-09-16 22:48     ` Bill Davidsen
2004-09-16 16:39   ` Robert Love
2004-09-20 20:16     ` Bill Davidsen
2004-09-20 21:05       ` Robert Love
2004-09-20 22:59         ` Bill Davidsen
2004-09-21  0:02           ` Robert Love
2004-09-16 16:46   ` Jan Kara
2004-09-16 22:34     ` Bill Davidsen
2004-09-16 22:57       ` David Lang
2004-09-16 23:22       ` Robert Love
2004-09-16 23:35         ` Alan Cox
2004-09-17  2:29           ` Robert Love
2004-09-17  3:08             ` Nicholas Miell
2004-09-17 14:39             ` Alan Cox
2004-09-17 15:48               ` Robert Love
2004-09-17 14:51                 ` Alan Cox
2004-09-17 15:55                   ` Robert Love
